* [PATCH 00/21] mm: Preemptibility -v6
From: Peter Zijlstra @ 2010-11-26 14:38 UTC
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

This patch-set makes part of the mm a lot more preemptible. It converts
i_mmap_lock and anon_vma->lock to mutexes and makes mmu_gather fully
preemptible.
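
( For illustration only, not part of any patch in this series: the core of
  the lock conversion amounts to swapping the spinlock in struct anon_vma
  (and likewise i_mmap_lock in struct address_space) for a sleeping lock,
  roughly:

        struct anon_vma {
                struct anon_vma *root;  /* Root of this anon_vma tree */
                struct mutex lock;      /* was: spinlock_t lock */
                atomic_t refcount;
                struct list_head head;  /* chain of anon_vma_chain entries */
        };

        static inline void anon_vma_lock(struct anon_vma *anon_vma)
        {
                mutex_lock(&anon_vma->root->lock);      /* may now sleep */
        }

        static inline void anon_vma_unlock(struct anon_vma *anon_vma)
        {
                mutex_unlock(&anon_vma->root->lock);
        }

  field and helper names here follow the current code; the actual patches
  may name things differently. )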

The main motivation was making mm_take_all_locks() preemptible, since it
appears people are nesting hundreds of spinlocks there.

The side-effect is that we can finally make mmu_gather preemptible,
something which lots of people have wanted to do for a long time.

It also gets us anon_vma refcounting, which seems to result in a nice
cleanup of the anon_vma lifetime rules wrt KSM and compaction.

This patch-set is build- and boot-tested on x86_64 (a previous version was
also tested on Dave's Niagara2 machines, and I suppose s390 was too when
Martin provided the conversion patch for his arch).

There are no known architectures left unconverted.

Yanmin ran the -v3 posting through the comprehensive Intel test farm
and didn't find any regressions.

( Not included in this posting are the 4 Sparc64 patches that implement
  gup_fast; those can be applied separately once this series gets
  anywhere. )

The full series (including the Sparc64 gup_fast bits) is also available in
-git form (against Linus' tree as of about an hour ago) from:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-mmu_preempt.git

Linus seems to want this to get some serious review and be pushed through
Andrew (well, anybody but me actually, but since Andrew is mm master...)

Ben, Thomas, Andrew, can I trick you guys into looking at this stuff?



* [PATCH 01/21] mm: Revert page_lock_anon_vma() lock annotation
From: Peter Zijlstra @ 2010-11-26 14:38 UTC
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell,
	Namhyung Kim

[-- Attachment #1: revert-rmap-annotate_lock_context_change_on_page_unlock_anon_vma.patch --]
[-- Type: text/plain, Size: 1731 bytes --]

It's beyond ugly and gets in the way.

Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h
+++ linux-2.6/include/linux/rmap.h
@@ -241,20 +241,7 @@ int try_to_munlock(struct page *);
 /*
  * Called by memory-failure.c to kill processes.
  */
-struct anon_vma *__page_lock_anon_vma(struct page *page);
-
-static inline struct anon_vma *page_lock_anon_vma(struct page *page)
-{
-	struct anon_vma *anon_vma;
-
-	__cond_lock(RCU, anon_vma = __page_lock_anon_vma(page));
-
-	/* (void) is needed to make gcc happy */
-	(void) __cond_lock(&anon_vma->root->lock, anon_vma);
-
-	return anon_vma;
-}
-
+struct anon_vma *page_lock_anon_vma(struct page *page);
 void page_unlock_anon_vma(struct anon_vma *anon_vma);
 int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma);
 
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -314,7 +314,7 @@ void __init anon_vma_init(void)
  * Getting a lock on a stable anon_vma from a page off the LRU is
  * tricky: page_lock_anon_vma rely on RCU to guard against the races.
  */
-struct anon_vma *__page_lock_anon_vma(struct page *page)
+struct anon_vma *page_lock_anon_vma(struct page *page)
 {
 	struct anon_vma *anon_vma, *root_anon_vma;
 	unsigned long anon_mapping;
@@ -348,8 +348,6 @@ struct anon_vma *__page_lock_anon_vma(st
 }
 
 void page_unlock_anon_vma(struct anon_vma *anon_vma)
-	__releases(&anon_vma->root->lock)
-	__releases(RCU)
 {
 	anon_vma_unlock(anon_vma);
 	rcu_read_unlock();



* [PATCH 02/21] powerpc: Use call_rcu_sched() for pagetables
From: Peter Zijlstra @ 2010-11-26 14:38 UTC
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell,
	Nick Piggin

[-- Attachment #1: powerpc-pgtable-call_rcu_sched.patch --]
[-- Type: text/plain, Size: 903 bytes --]

PowerPC relies on IRQ-disable to guard against RCU quiescent states; use
the appropriate RCU call version.
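
( Illustrative reader side, not from this patch; the walker below is a
  hypothetical stand-in for powerpc's lockless page-table walkers
  (gup_fast and friends), which run with IRQs disabled instead of inside
  rcu_read_lock():

        unsigned long flags;

        local_irq_save(flags);          /* behaves as rcu_read_lock_sched() */
        walk_page_tables(mm, addr);     /* hypothetical lockless walk */
        local_irq_restore(flags);

  With preemptible RCU configured, a call_rcu() grace period only waits for
  rcu_read_lock() sections, not for such IRQs-off sections; call_rcu_sched()
  waits for all preempt/IRQ-disabled regions, which is what freeing the
  page tables here relies on. )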

Cc: Nick Piggin <npiggin@suse.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 arch/powerpc/mm/pgtable.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/arch/powerpc/mm/pgtable.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/pgtable.c
+++ linux-2.6/arch/powerpc/mm/pgtable.c
@@ -92,7 +92,7 @@ static void pte_free_rcu_callback(struct
 
 static void pte_free_submit(struct pte_freelist_batch *batch)
 {
-	call_rcu(&batch->rcu, pte_free_rcu_callback);
+	call_rcu_sched(&batch->rcu, pte_free_rcu_callback);
 }
 
 void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)



* [PATCH 03/21] mm: Improve page_lock_anon_vma() comment
From: Peter Zijlstra @ 2010-11-26 14:38 UTC
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mm-page_lock_anon_vma-comment.patch --]
[-- Type: text/plain, Size: 1630 bytes --]

A slightly more verbose comment to go along with the trickery in
page_lock_anon_vma().
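
( Illustrative caller pattern, not part of the patch: because the anon_vma
  returned may no longer be the one the page is mapped into, a walker is
  expected to re-verify the mapping, along the lines of:

        struct anon_vma *anon_vma;
        struct anon_vma_chain *avc;

        anon_vma = page_lock_anon_vma(page);
        if (!anon_vma)
                return;

        list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
                struct vm_area_struct *vma = avc->vma;

                if (!page_mapped_in_vma(page, vma))    /* re-check the mapping */
                        continue;
                /* ... do the actual rmap work against this vma ... */
        }

        page_unlock_anon_vma(anon_vma);

  which is roughly what the existing anon rmap walkers already do. )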

Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1271158226.4807.1107.camel@twins>
---
 mm/rmap.c |   18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -311,8 +311,22 @@ void __init anon_vma_init(void)
 }
 
 /*
- * Getting a lock on a stable anon_vma from a page off the LRU is
- * tricky: page_lock_anon_vma rely on RCU to guard against the races.
+ * Getting a lock on a stable anon_vma from a page off the LRU is tricky!
+ *
+ * Since there is no serialization what so ever against page_remove_rmap()
+ * the best this function can do is return a locked anon_vma that might
+ * have been relevant to this page.
+ *
+ * The page might have been remapped to a different anon_vma or the anon_vma
+ * returned may already be freed (and even reused).
+ *
+ * All users of this function must be very careful when walking the anon_vma
+ * chain and verify that the page in question is indeed mapped in it
+ * [ something equivalent to page_mapped_in_vma() ].
+ *
+ * Since anon_vma's slab is DESTROY_BY_RCU and we know from page_remove_rmap()
+ * that the anon_vma pointer from page->mapping is valid if there is a
+ * mapcount, we can dereference the anon_vma after observing those.
  */
 struct anon_vma *page_lock_anon_vma(struct page *page)
 {



* [PATCH 04/21] mm: Rename drop_anon_vma to put_anon_vma
From: Peter Zijlstra @ 2010-11-26 14:38 UTC
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: put_anon_vma.patch --]
[-- Type: text/plain, Size: 3443 bytes --]

The normal code pattern used in the kernel is: get/put.
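
( The convention being followed, for illustration: a reference is taken with
  a get and released with a matching put, as with get_page()/put_page():

        get_anon_vma(anon_vma);         /* pin the anon_vma */
        /* ... use it ... */
        put_anon_vma(anon_vma);         /* drop the pin, may free it */
  )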

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/rmap.h |    4 ++--
 mm/ksm.c             |   10 +++++-----
 mm/migrate.c         |    2 +-
 mm/rmap.c            |    4 ++--
 4 files changed, 10 insertions(+), 10 deletions(-)

Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h
+++ linux-2.6/include/linux/rmap.h
@@ -87,7 +87,7 @@ static inline void get_anon_vma(struct a
 	atomic_inc(&anon_vma->external_refcount);
 }
 
-void drop_anon_vma(struct anon_vma *);
+void put_anon_vma(struct anon_vma *);
 #else
 static inline void anonvma_external_refcount_init(struct anon_vma *anon_vma)
 {
@@ -102,7 +102,7 @@ static inline void get_anon_vma(struct a
 {
 }
 
-static inline void drop_anon_vma(struct anon_vma *anon_vma)
+static inline void put_anon_vma(struct anon_vma *anon_vma)
 {
 }
 #endif /* CONFIG_KSM */
Index: linux-2.6/mm/ksm.c
===================================================================
--- linux-2.6.orig/mm/ksm.c
+++ linux-2.6/mm/ksm.c
@@ -307,11 +307,11 @@ static void hold_anon_vma(struct rmap_it
 	get_anon_vma(anon_vma);
 }
 
-static void ksm_drop_anon_vma(struct rmap_item *rmap_item)
+static void ksm_put_anon_vma(struct rmap_item *rmap_item)
 {
 	struct anon_vma *anon_vma = rmap_item->anon_vma;
 
-	drop_anon_vma(anon_vma);
+	put_anon_vma(anon_vma);
 }
 
 /*
@@ -396,7 +396,7 @@ static void break_cow(struct rmap_item *
 	 * It is not an accident that whenever we want to break COW
 	 * to undo, we also need to drop a reference to the anon_vma.
 	 */
-	ksm_drop_anon_vma(rmap_item);
+	ksm_put_anon_vma(rmap_item);
 
 	down_read(&mm->mmap_sem);
 	if (ksm_test_exit(mm))
@@ -451,7 +451,7 @@ static void remove_node_from_stable_tree
 			ksm_pages_sharing--;
 		else
 			ksm_pages_shared--;
-		ksm_drop_anon_vma(rmap_item);
+		ksm_put_anon_vma(rmap_item);
 		rmap_item->address &= PAGE_MASK;
 		cond_resched();
 	}
@@ -539,7 +539,7 @@ static void remove_rmap_item_from_tree(s
 		else
 			ksm_pages_shared--;
 
-		ksm_drop_anon_vma(rmap_item);
+		ksm_put_anon_vma(rmap_item);
 		rmap_item->address &= PAGE_MASK;
 
 	} else if (rmap_item->address & UNSTABLE_FLAG) {
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -683,7 +683,7 @@ rcu_unlock:
 
 	/* Drop an anon_vma reference if we took one */
 	if (anon_vma)
-		drop_anon_vma(anon_vma);
+		put_anon_vma(anon_vma);
 
 	if (rcu_locked)
 		rcu_read_unlock();
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -274,7 +274,7 @@ static void anon_vma_unlink(struct anon_
 	if (empty) {
 		/* We no longer need the root anon_vma */
 		if (anon_vma->root != anon_vma)
-			drop_anon_vma(anon_vma->root);
+			put_anon_vma(anon_vma->root);
 		anon_vma_free(anon_vma);
 	}
 }
@@ -1448,7 +1448,7 @@ int try_to_munlock(struct page *page)
  * we know we are the last user, nobody else can get a reference and we
  * can do the freeing without the lock.
  */
-void drop_anon_vma(struct anon_vma *anon_vma)
+void put_anon_vma(struct anon_vma *anon_vma)
 {
 	BUG_ON(atomic_read(&anon_vma->external_refcount) <= 0);
 	if (atomic_dec_and_lock(&anon_vma->external_refcount, &anon_vma->root->lock)) {



* [PATCH 05/21] mm: Move anon_vma ref out from under CONFIG_KSM
From: Peter Zijlstra @ 2010-11-26 14:38 UTC
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mm-anon_vma-ref.patch --]
[-- Type: text/plain, Size: 5216 bytes --]

We need an anon_vma refcount for preemptible anon_vma->lock.
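
( Sketch of why the refcount matters once the lock can sleep -- not from
  this patch, and the lock/lookup pattern is an assumption about later
  patches in the series:

        rcu_read_lock();
        anon_vma = page_anon_vma(page);
        if (!anon_vma || !atomic_inc_not_zero(&anon_vma->refcount)) {
                rcu_read_unlock();
                return NULL;
        }
        rcu_read_unlock();      /* cannot sleep inside the RCU section */

        anon_vma_lock(anon_vma);        /* may sleep once this is a mutex */
        /* ... */
        anon_vma_unlock(anon_vma);
        put_anon_vma(anon_vma);

  the reference is what keeps the anon_vma from being freed between
  dropping RCU and acquiring the (sleeping) lock. )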

Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/rmap.h |   40 ++++------------------------------------
 mm/migrate.c         |    4 ++--
 mm/rmap.c            |   14 ++++++--------
 3 files changed, 12 insertions(+), 46 deletions(-)

Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h
+++ linux-2.6/include/linux/rmap.h
@@ -27,18 +27,15 @@
 struct anon_vma {
 	struct anon_vma *root;	/* Root of this anon_vma tree */
 	spinlock_t lock;	/* Serialize access to vma list */
-#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
-
 	/*
-	 * The external_refcount is taken by either KSM or page migration
-	 * to take a reference to an anon_vma when there is no
+	 * The refcount is taken on an anon_vma when there is no
 	 * guarantee that the vma of page tables will exist for
 	 * the duration of the operation. A caller that takes
 	 * the reference is responsible for clearing up the
 	 * anon_vma if they are the last user on release
 	 */
-	atomic_t external_refcount;
-#endif
+	atomic_t refcount;
+
 	/*
 	 * NOTE: the LSB of the head.next is set by
 	 * mm_take_all_locks() _after_ taking the above lock. So the
@@ -71,41 +68,12 @@ struct anon_vma_chain {
 };
 
 #ifdef CONFIG_MMU
-#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
-static inline void anonvma_external_refcount_init(struct anon_vma *anon_vma)
-{
-	atomic_set(&anon_vma->external_refcount, 0);
-}
-
-static inline int anonvma_external_refcount(struct anon_vma *anon_vma)
-{
-	return atomic_read(&anon_vma->external_refcount);
-}
-
 static inline void get_anon_vma(struct anon_vma *anon_vma)
 {
-	atomic_inc(&anon_vma->external_refcount);
+	atomic_inc(&anon_vma->refcount);
 }
 
 void put_anon_vma(struct anon_vma *);
-#else
-static inline void anonvma_external_refcount_init(struct anon_vma *anon_vma)
-{
-}
-
-static inline int anonvma_external_refcount(struct anon_vma *anon_vma)
-{
-	return 0;
-}
-
-static inline void get_anon_vma(struct anon_vma *anon_vma)
-{
-}
-
-static inline void put_anon_vma(struct anon_vma *anon_vma)
-{
-}
-#endif /* CONFIG_KSM */
 
 static inline struct anon_vma *page_anon_vma(struct page *page)
 {
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -268,7 +268,7 @@ static void anon_vma_unlink(struct anon_
 	list_del(&anon_vma_chain->same_anon_vma);
 
 	/* We must garbage collect the anon_vma if it's empty */
-	empty = list_empty(&anon_vma->head) && !anonvma_external_refcount(anon_vma);
+	empty = list_empty(&anon_vma->head) && !atomic_read(&anon_vma->refcount);
 	anon_vma_unlock(anon_vma);
 
 	if (empty) {
@@ -299,7 +299,7 @@ static void anon_vma_ctor(void *data)
 	struct anon_vma *anon_vma = data;
 
 	spin_lock_init(&anon_vma->lock);
-	anonvma_external_refcount_init(anon_vma);
+	atomic_set(&anon_vma->refcount, 0);
 	INIT_LIST_HEAD(&anon_vma->head);
 }
 
@@ -1451,7 +1451,6 @@ int try_to_munlock(struct page *page)
 		return try_to_unmap_file(page, TTU_MUNLOCK);
 }
 
-#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
 /*
  * Drop an anon_vma refcount, freeing the anon_vma and anon_vma->root
  * if necessary.  Be careful to do all the tests under the lock.  Once
@@ -1460,8 +1459,8 @@ int try_to_munlock(struct page *page)
  */
 void put_anon_vma(struct anon_vma *anon_vma)
 {
-	BUG_ON(atomic_read(&anon_vma->external_refcount) <= 0);
-	if (atomic_dec_and_lock(&anon_vma->external_refcount, &anon_vma->root->lock)) {
+	BUG_ON(atomic_read(&anon_vma->refcount) <= 0);
+	if (atomic_dec_and_lock(&anon_vma->refcount, &anon_vma->root->lock)) {
 		struct anon_vma *root = anon_vma->root;
 		int empty = list_empty(&anon_vma->head);
 		int last_root_user = 0;
@@ -1472,8 +1471,8 @@ void put_anon_vma(struct anon_vma *anon_
 		 * the refcount on the root and check if we need to free it.
 		 */
 		if (empty && anon_vma != root) {
-			BUG_ON(atomic_read(&root->external_refcount) <= 0);
-			last_root_user = atomic_dec_and_test(&root->external_refcount);
+			BUG_ON(atomic_read(&root->refcount) <= 0);
+			last_root_user = atomic_dec_and_test(&root->refcount);
 			root_empty = list_empty(&root->head);
 		}
 		anon_vma_unlock(anon_vma);
@@ -1485,7 +1484,6 @@ void put_anon_vma(struct anon_vma *anon_
 		}
 	}
 }
-#endif
 
 #ifdef CONFIG_MIGRATION
 /*
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -833,7 +833,7 @@ static int unmap_and_move_huge_page(new_
 
 		if (page_mapped(hpage)) {
 			anon_vma = page_anon_vma(hpage);
-			atomic_inc(&anon_vma->external_refcount);
+			atomic_inc(&anon_vma->refcount);
 		}
 	}
 
@@ -845,7 +845,7 @@ static int unmap_and_move_huge_page(new_
 	if (rc)
 		remove_migration_ptes(hpage, hpage);
 
-	if (anon_vma && atomic_dec_and_lock(&anon_vma->external_refcount,
+	if (anon_vma && atomic_dec_and_lock(&anon_vma->refcount,
 					    &anon_vma->lock)) {
 		int empty = list_empty(&anon_vma->head);
 		spin_unlock(&anon_vma->lock);



^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 05/21] mm: Move anon_vma ref out from under CONFIG_KSM
@ 2010-11-26 14:38   ` Peter Zijlstra
  0 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mm-anon_vma-ref.patch --]
[-- Type: text/plain, Size: 5512 bytes --]

We need an anon_vma refcount for preemptible anon_vma->lock.

Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/rmap.h |   40 ++++------------------------------------
 mm/migrate.c         |    4 ++--
 mm/rmap.c            |   14 ++++++--------
 3 files changed, 12 insertions(+), 46 deletions(-)

Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h
+++ linux-2.6/include/linux/rmap.h
@@ -27,18 +27,15 @@
 struct anon_vma {
 	struct anon_vma *root;	/* Root of this anon_vma tree */
 	spinlock_t lock;	/* Serialize access to vma list */
-#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
-
 	/*
-	 * The external_refcount is taken by either KSM or page migration
-	 * to take a reference to an anon_vma when there is no
+	 * The refcount is taken on an anon_vma when there is no
 	 * guarantee that the vma of page tables will exist for
 	 * the duration of the operation. A caller that takes
 	 * the reference is responsible for clearing up the
 	 * anon_vma if they are the last user on release
 	 */
-	atomic_t external_refcount;
-#endif
+	atomic_t refcount;
+
 	/*
 	 * NOTE: the LSB of the head.next is set by
 	 * mm_take_all_locks() _after_ taking the above lock. So the
@@ -71,41 +68,12 @@ struct anon_vma_chain {
 };
 
 #ifdef CONFIG_MMU
-#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
-static inline void anonvma_external_refcount_init(struct anon_vma *anon_vma)
-{
-	atomic_set(&anon_vma->external_refcount, 0);
-}
-
-static inline int anonvma_external_refcount(struct anon_vma *anon_vma)
-{
-	return atomic_read(&anon_vma->external_refcount);
-}
-
 static inline void get_anon_vma(struct anon_vma *anon_vma)
 {
-	atomic_inc(&anon_vma->external_refcount);
+	atomic_inc(&anon_vma->refcount);
 }
 
 void put_anon_vma(struct anon_vma *);
-#else
-static inline void anonvma_external_refcount_init(struct anon_vma *anon_vma)
-{
-}
-
-static inline int anonvma_external_refcount(struct anon_vma *anon_vma)
-{
-	return 0;
-}
-
-static inline void get_anon_vma(struct anon_vma *anon_vma)
-{
-}
-
-static inline void put_anon_vma(struct anon_vma *anon_vma)
-{
-}
-#endif /* CONFIG_KSM */
 
 static inline struct anon_vma *page_anon_vma(struct page *page)
 {
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -268,7 +268,7 @@ static void anon_vma_unlink(struct anon_
 	list_del(&anon_vma_chain->same_anon_vma);
 
 	/* We must garbage collect the anon_vma if it's empty */
-	empty = list_empty(&anon_vma->head) && !anonvma_external_refcount(anon_vma);
+	empty = list_empty(&anon_vma->head) && !atomic_read(&anon_vma->refcount);
 	anon_vma_unlock(anon_vma);
 
 	if (empty) {
@@ -299,7 +299,7 @@ static void anon_vma_ctor(void *data)
 	struct anon_vma *anon_vma = data;
 
 	spin_lock_init(&anon_vma->lock);
-	anonvma_external_refcount_init(anon_vma);
+	atomic_set(&anon_vma->refcount, 0);
 	INIT_LIST_HEAD(&anon_vma->head);
 }
 
@@ -1451,7 +1451,6 @@ int try_to_munlock(struct page *page)
 		return try_to_unmap_file(page, TTU_MUNLOCK);
 }
 
-#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
 /*
  * Drop an anon_vma refcount, freeing the anon_vma and anon_vma->root
  * if necessary.  Be careful to do all the tests under the lock.  Once
@@ -1460,8 +1459,8 @@ int try_to_munlock(struct page *page)
  */
 void put_anon_vma(struct anon_vma *anon_vma)
 {
-	BUG_ON(atomic_read(&anon_vma->external_refcount) <= 0);
-	if (atomic_dec_and_lock(&anon_vma->external_refcount, &anon_vma->root->lock)) {
+	BUG_ON(atomic_read(&anon_vma->refcount) <= 0);
+	if (atomic_dec_and_lock(&anon_vma->refcount, &anon_vma->root->lock)) {
 		struct anon_vma *root = anon_vma->root;
 		int empty = list_empty(&anon_vma->head);
 		int last_root_user = 0;
@@ -1472,8 +1471,8 @@ void put_anon_vma(struct anon_vma *anon_
 		 * the refcount on the root and check if we need to free it.
 		 */
 		if (empty && anon_vma != root) {
-			BUG_ON(atomic_read(&root->external_refcount) <= 0);
-			last_root_user = atomic_dec_and_test(&root->external_refcount);
+			BUG_ON(atomic_read(&root->refcount) <= 0);
+			last_root_user = atomic_dec_and_test(&root->refcount);
 			root_empty = list_empty(&root->head);
 		}
 		anon_vma_unlock(anon_vma);
@@ -1485,7 +1484,6 @@ void put_anon_vma(struct anon_vma *anon_
 		}
 	}
 }
-#endif
 
 #ifdef CONFIG_MIGRATION
 /*
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -833,7 +833,7 @@ static int unmap_and_move_huge_page(new_
 
 		if (page_mapped(hpage)) {
 			anon_vma = page_anon_vma(hpage);
-			atomic_inc(&anon_vma->external_refcount);
+			atomic_inc(&anon_vma->refcount);
 		}
 	}
 
@@ -845,7 +845,7 @@ static int unmap_and_move_huge_page(new_
 	if (rc)
 		remove_migration_ptes(hpage, hpage);
 
-	if (anon_vma && atomic_dec_and_lock(&anon_vma->external_refcount,
+	if (anon_vma && atomic_dec_and_lock(&anon_vma->refcount,
 					    &anon_vma->lock)) {
 		int empty = list_empty(&anon_vma->head);
 		spin_unlock(&anon_vma->lock);


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 05/21] mm: Move anon_vma ref out from under CONFIG_KSM
@ 2010-11-26 14:38   ` Peter Zijlstra
  0 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mm-anon_vma-ref.patch --]
[-- Type: text/plain, Size: 5512 bytes --]

We need an anon_vma refcount for preemptible anon_vma->lock.

Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/rmap.h |   40 ++++------------------------------------
 mm/migrate.c         |    4 ++--
 mm/rmap.c            |   14 ++++++--------
 3 files changed, 12 insertions(+), 46 deletions(-)

Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h
+++ linux-2.6/include/linux/rmap.h
@@ -27,18 +27,15 @@
 struct anon_vma {
 	struct anon_vma *root;	/* Root of this anon_vma tree */
 	spinlock_t lock;	/* Serialize access to vma list */
-#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
-
 	/*
-	 * The external_refcount is taken by either KSM or page migration
-	 * to take a reference to an anon_vma when there is no
+	 * The refcount is taken on an anon_vma when there is no
 	 * guarantee that the vma of page tables will exist for
 	 * the duration of the operation. A caller that takes
 	 * the reference is responsible for clearing up the
 	 * anon_vma if they are the last user on release
 	 */
-	atomic_t external_refcount;
-#endif
+	atomic_t refcount;
+
 	/*
 	 * NOTE: the LSB of the head.next is set by
 	 * mm_take_all_locks() _after_ taking the above lock. So the
@@ -71,41 +68,12 @@ struct anon_vma_chain {
 };
 
 #ifdef CONFIG_MMU
-#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
-static inline void anonvma_external_refcount_init(struct anon_vma *anon_vma)
-{
-	atomic_set(&anon_vma->external_refcount, 0);
-}
-
-static inline int anonvma_external_refcount(struct anon_vma *anon_vma)
-{
-	return atomic_read(&anon_vma->external_refcount);
-}
-
 static inline void get_anon_vma(struct anon_vma *anon_vma)
 {
-	atomic_inc(&anon_vma->external_refcount);
+	atomic_inc(&anon_vma->refcount);
 }
 
 void put_anon_vma(struct anon_vma *);
-#else
-static inline void anonvma_external_refcount_init(struct anon_vma *anon_vma)
-{
-}
-
-static inline int anonvma_external_refcount(struct anon_vma *anon_vma)
-{
-	return 0;
-}
-
-static inline void get_anon_vma(struct anon_vma *anon_vma)
-{
-}
-
-static inline void put_anon_vma(struct anon_vma *anon_vma)
-{
-}
-#endif /* CONFIG_KSM */
 
 static inline struct anon_vma *page_anon_vma(struct page *page)
 {
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -268,7 +268,7 @@ static void anon_vma_unlink(struct anon_
 	list_del(&anon_vma_chain->same_anon_vma);
 
 	/* We must garbage collect the anon_vma if it's empty */
-	empty = list_empty(&anon_vma->head) && !anonvma_external_refcount(anon_vma);
+	empty = list_empty(&anon_vma->head) && !atomic_read(&anon_vma->refcount);
 	anon_vma_unlock(anon_vma);
 
 	if (empty) {
@@ -299,7 +299,7 @@ static void anon_vma_ctor(void *data)
 	struct anon_vma *anon_vma = data;
 
 	spin_lock_init(&anon_vma->lock);
-	anonvma_external_refcount_init(anon_vma);
+	atomic_set(&anon_vma->refcount, 0);
 	INIT_LIST_HEAD(&anon_vma->head);
 }
 
@@ -1451,7 +1451,6 @@ int try_to_munlock(struct page *page)
 		return try_to_unmap_file(page, TTU_MUNLOCK);
 }
 
-#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
 /*
  * Drop an anon_vma refcount, freeing the anon_vma and anon_vma->root
  * if necessary.  Be careful to do all the tests under the lock.  Once
@@ -1460,8 +1459,8 @@ int try_to_munlock(struct page *page)
  */
 void put_anon_vma(struct anon_vma *anon_vma)
 {
-	BUG_ON(atomic_read(&anon_vma->external_refcount) <= 0);
-	if (atomic_dec_and_lock(&anon_vma->external_refcount, &anon_vma->root->lock)) {
+	BUG_ON(atomic_read(&anon_vma->refcount) <= 0);
+	if (atomic_dec_and_lock(&anon_vma->refcount, &anon_vma->root->lock)) {
 		struct anon_vma *root = anon_vma->root;
 		int empty = list_empty(&anon_vma->head);
 		int last_root_user = 0;
@@ -1472,8 +1471,8 @@ void put_anon_vma(struct anon_vma *anon_
 		 * the refcount on the root and check if we need to free it.
 		 */
 		if (empty && anon_vma != root) {
-			BUG_ON(atomic_read(&root->external_refcount) <= 0);
-			last_root_user = atomic_dec_and_test(&root->external_refcount);
+			BUG_ON(atomic_read(&root->refcount) <= 0);
+			last_root_user = atomic_dec_and_test(&root->refcount);
 			root_empty = list_empty(&root->head);
 		}
 		anon_vma_unlock(anon_vma);
@@ -1485,7 +1484,6 @@ void put_anon_vma(struct anon_vma *anon_
 		}
 	}
 }
-#endif
 
 #ifdef CONFIG_MIGRATION
 /*
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -833,7 +833,7 @@ static int unmap_and_move_huge_page(new_
 
 		if (page_mapped(hpage)) {
 			anon_vma = page_anon_vma(hpage);
-			atomic_inc(&anon_vma->external_refcount);
+			atomic_inc(&anon_vma->refcount);
 		}
 	}
 
@@ -845,7 +845,7 @@ static int unmap_and_move_huge_page(new_
 	if (rc)
 		remove_migration_ptes(hpage, hpage);
 
-	if (anon_vma && atomic_dec_and_lock(&anon_vma->external_refcount,
+	if (anon_vma && atomic_dec_and_lock(&anon_vma->refcount,
 					    &anon_vma->lock)) {
 		int empty = list_empty(&anon_vma->head);
 		spin_unlock(&anon_vma->lock);



^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 06/21] mm: Simplify anon_vma refcounts
  2010-11-26 14:38 ` Peter Zijlstra
  (?)
@ 2010-11-26 14:38   ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mm-use-anon_vma-ref.patch --]
[-- Type: text/plain, Size: 6799 bytes --]

This patch changes the anon_vma refcount to be 0 when the object is
free. It does this by adding a single reference that stands for the
anon_vma being in use (i.e. the anon_vma->head list is not empty).

This allows a simpler release scheme that no longer has to check both
the refcount and the list, and it avoids taking a reference for each
entry on the list.
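
( For illustration only: a minimal user-space sketch of the lifetime
  rule adopted here, written with C11 atomics instead of the kernel's
  atomic_t; the names are made up, not kernel API. The object starts
  life with refcount 1 standing for "in use", and whoever drops the
  count to zero frees it. )

  #include <stdatomic.h>
  #include <stdlib.h>

  struct obj {
  	atomic_int refcount;
  	/* ... payload ... */
  };

  static struct obj *obj_alloc(void)
  {
  	struct obj *o = malloc(sizeof(*o));

  	if (o)
  		atomic_init(&o->refcount, 1);	/* 1 == alive and in use */
  	return o;
  }

  static void obj_get(struct obj *o)
  {
  	atomic_fetch_add(&o->refcount, 1);
  }

  static void obj_put(struct obj *o)
  {
  	/* whoever drops the last reference does the freeing */
  	if (atomic_fetch_sub(&o->refcount, 1) == 1)
  		free(o);
  }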

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/rmap.h |   11 +++++--
 mm/ksm.c             |    4 --
 mm/rmap.c            |   79 ++++++++++++++++++---------------------------------
 3 files changed, 38 insertions(+), 56 deletions(-)

Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h
+++ linux-2.6/include/linux/rmap.h
@@ -73,7 +73,13 @@ static inline void get_anon_vma(struct a
 	atomic_inc(&anon_vma->refcount);
 }
 
-void put_anon_vma(struct anon_vma *);
+void __put_anon_vma(struct anon_vma *anon_vma);
+
+static inline void put_anon_vma(struct anon_vma *anon_vma)
+{
+	if (atomic_dec_and_test(&anon_vma->refcount))
+		__put_anon_vma(anon_vma);
+}
 
 static inline struct anon_vma *page_anon_vma(struct page *page)
 {
@@ -116,7 +122,6 @@ void unlink_anon_vmas(struct vm_area_str
 int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
 int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
 void __anon_vma_link(struct vm_area_struct *);
-void anon_vma_free(struct anon_vma *);
 
 static inline void anon_vma_merge(struct vm_area_struct *vma,
 				  struct vm_area_struct *next)
@@ -125,6 +130,8 @@ static inline void anon_vma_merge(struct
 	unlink_anon_vmas(next);
 }
 
+struct anon_vma *page_get_anon_vma(struct page *page);
+
 /*
  * rmap interfaces called when adding or removing pte of page
  */
Index: linux-2.6/mm/ksm.c
===================================================================
--- linux-2.6.orig/mm/ksm.c
+++ linux-2.6/mm/ksm.c
@@ -309,9 +309,7 @@ static void hold_anon_vma(struct rmap_it
 
 static void ksm_put_anon_vma(struct rmap_item *rmap_item)
 {
-	struct anon_vma *anon_vma = rmap_item->anon_vma;
-
-	put_anon_vma(anon_vma);
+	put_anon_vma(rmap_item->anon_vma);
 }
 
 /*
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -67,11 +67,24 @@ static struct kmem_cache *anon_vma_chain
 
 static inline struct anon_vma *anon_vma_alloc(void)
 {
-	return kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+	struct anon_vma *anon_vma;
+
+	anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+	if (anon_vma) {
+		atomic_set(&anon_vma->refcount, 1);
+		/*
+		 * Initialise the anon_vma root to point to itself. If called from
+		 * fork, the root will be reset to the parents anon_vma.
+		 */
+		anon_vma->root = anon_vma;
+	}
+
+	return anon_vma;
 }
 
-void anon_vma_free(struct anon_vma *anon_vma)
+static inline void anon_vma_free(struct anon_vma *anon_vma)
 {
+	VM_BUG_ON(atomic_read(&anon_vma->refcount));
 	kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
@@ -133,11 +146,6 @@ int anon_vma_prepare(struct vm_area_stru
 			if (unlikely(!anon_vma))
 				goto out_enomem_free_avc;
 			allocated = anon_vma;
-			/*
-			 * This VMA had no anon_vma yet.  This anon_vma is
-			 * the root of any anon_vma tree that might form.
-			 */
-			anon_vma->root = anon_vma;
 		}
 
 		anon_vma_lock(anon_vma);
@@ -156,7 +164,7 @@ int anon_vma_prepare(struct vm_area_stru
 		anon_vma_unlock(anon_vma);
 
 		if (unlikely(allocated))
-			anon_vma_free(allocated);
+			put_anon_vma(allocated);
 		if (unlikely(avc))
 			anon_vma_chain_free(avc);
 	}
@@ -237,9 +245,9 @@ int anon_vma_fork(struct vm_area_struct 
 	 */
 	anon_vma->root = pvma->anon_vma->root;
 	/*
-	 * With KSM refcounts, an anon_vma can stay around longer than the
-	 * process it belongs to.  The root anon_vma needs to be pinned
-	 * until this anon_vma is freed, because the lock lives in the root.
+	 * With refcounts, an anon_vma can stay around longer than the
+	 * process it belongs to. The root anon_vma needs to be pinned until
+	 * this anon_vma is freed, because the lock lives in the root.
 	 */
 	get_anon_vma(anon_vma->root);
 	/* Mark this anon_vma as the one where our new (COWed) pages go. */
@@ -249,7 +257,7 @@ int anon_vma_fork(struct vm_area_struct 
 	return 0;
 
  out_error_free_anon_vma:
-	anon_vma_free(anon_vma);
+	put_anon_vma(anon_vma);
  out_error:
 	unlink_anon_vmas(vma);
 	return -ENOMEM;
@@ -268,15 +276,11 @@ static void anon_vma_unlink(struct anon_
 	list_del(&anon_vma_chain->same_anon_vma);
 
 	/* We must garbage collect the anon_vma if it's empty */
-	empty = list_empty(&anon_vma->head) && !atomic_read(&anon_vma->refcount);
+	empty = list_empty(&anon_vma->head);
 	anon_vma_unlock(anon_vma);
 
-	if (empty) {
-		/* We no longer need the root anon_vma */
-		if (anon_vma->root != anon_vma)
-			put_anon_vma(anon_vma->root);
-		anon_vma_free(anon_vma);
-	}
+	if (empty)
+		put_anon_vma(anon_vma);
 }
 
 void unlink_anon_vmas(struct vm_area_struct *vma)
@@ -1451,38 +1455,11 @@ int try_to_munlock(struct page *page)
 		return try_to_unmap_file(page, TTU_MUNLOCK);
 }
 
-/*
- * Drop an anon_vma refcount, freeing the anon_vma and anon_vma->root
- * if necessary.  Be careful to do all the tests under the lock.  Once
- * we know we are the last user, nobody else can get a reference and we
- * can do the freeing without the lock.
- */
-void put_anon_vma(struct anon_vma *anon_vma)
-{
-	BUG_ON(atomic_read(&anon_vma->refcount) <= 0);
-	if (atomic_dec_and_lock(&anon_vma->refcount, &anon_vma->root->lock)) {
-		struct anon_vma *root = anon_vma->root;
-		int empty = list_empty(&anon_vma->head);
-		int last_root_user = 0;
-		int root_empty = 0;
-
-		/*
-		 * The refcount on a non-root anon_vma got dropped.  Drop
-		 * the refcount on the root and check if we need to free it.
-		 */
-		if (empty && anon_vma != root) {
-			BUG_ON(atomic_read(&root->refcount) <= 0);
-			last_root_user = atomic_dec_and_test(&root->refcount);
-			root_empty = list_empty(&root->head);
-		}
-		anon_vma_unlock(anon_vma);
-
-		if (empty) {
-			anon_vma_free(anon_vma);
-			if (root_empty && last_root_user)
-				anon_vma_free(root);
-		}
-	}
+void __put_anon_vma(struct anon_vma *anon_vma)
+{
+	if (anon_vma->root != anon_vma)
+		put_anon_vma(anon_vma->root);
+	anon_vma_free(anon_vma);
 }
 
 #ifdef CONFIG_MIGRATION
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -850,7 +850,7 @@ static int unmap_and_move_huge_page(new_
 		int empty = list_empty(&anon_vma->head);
 		spin_unlock(&anon_vma->lock);
 		if (empty)
-			anon_vma_free(anon_vma);
+			put_anon_vma(anon_vma);
 	}
 
 	if (rcu_locked)



^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 06/21] mm: Simplify anon_vma refcounts
@ 2010-11-26 14:38   ` Peter Zijlstra
  0 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mm-use-anon_vma-ref.patch --]
[-- Type: text/plain, Size: 7095 bytes --]

This patch changes the anon_vma refcount to be 0 when the object is
free. It does this by adding a single reference that stands for the
anon_vma being in use (i.e. the anon_vma->head list is not empty).

This allows a simpler release scheme that no longer has to check both
the refcount and the list, and it avoids taking a reference for each
entry on the list.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/rmap.h |   11 +++++--
 mm/ksm.c             |    4 --
 mm/rmap.c            |   79 ++++++++++++++++++---------------------------------
 3 files changed, 38 insertions(+), 56 deletions(-)

Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h
+++ linux-2.6/include/linux/rmap.h
@@ -73,7 +73,13 @@ static inline void get_anon_vma(struct a
 	atomic_inc(&anon_vma->refcount);
 }
 
-void put_anon_vma(struct anon_vma *);
+void __put_anon_vma(struct anon_vma *anon_vma);
+
+static inline void put_anon_vma(struct anon_vma *anon_vma)
+{
+	if (atomic_dec_and_test(&anon_vma->refcount))
+		__put_anon_vma(anon_vma);
+}
 
 static inline struct anon_vma *page_anon_vma(struct page *page)
 {
@@ -116,7 +122,6 @@ void unlink_anon_vmas(struct vm_area_str
 int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
 int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
 void __anon_vma_link(struct vm_area_struct *);
-void anon_vma_free(struct anon_vma *);
 
 static inline void anon_vma_merge(struct vm_area_struct *vma,
 				  struct vm_area_struct *next)
@@ -125,6 +130,8 @@ static inline void anon_vma_merge(struct
 	unlink_anon_vmas(next);
 }
 
+struct anon_vma *page_get_anon_vma(struct page *page);
+
 /*
  * rmap interfaces called when adding or removing pte of page
  */
Index: linux-2.6/mm/ksm.c
===================================================================
--- linux-2.6.orig/mm/ksm.c
+++ linux-2.6/mm/ksm.c
@@ -309,9 +309,7 @@ static void hold_anon_vma(struct rmap_it
 
 static void ksm_put_anon_vma(struct rmap_item *rmap_item)
 {
-	struct anon_vma *anon_vma = rmap_item->anon_vma;
-
-	put_anon_vma(anon_vma);
+	put_anon_vma(rmap_item->anon_vma);
 }
 
 /*
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -67,11 +67,24 @@ static struct kmem_cache *anon_vma_chain
 
 static inline struct anon_vma *anon_vma_alloc(void)
 {
-	return kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+	struct anon_vma *anon_vma;
+
+	anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+	if (anon_vma) {
+		atomic_set(&anon_vma->refcount, 1);
+		/*
+		 * Initialise the anon_vma root to point to itself. If called from
+		 * fork, the root will be reset to the parents anon_vma.
+		 */
+		anon_vma->root = anon_vma;
+	}
+
+	return anon_vma;
 }
 
-void anon_vma_free(struct anon_vma *anon_vma)
+static inline void anon_vma_free(struct anon_vma *anon_vma)
 {
+	VM_BUG_ON(atomic_read(&anon_vma->refcount));
 	kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
@@ -133,11 +146,6 @@ int anon_vma_prepare(struct vm_area_stru
 			if (unlikely(!anon_vma))
 				goto out_enomem_free_avc;
 			allocated = anon_vma;
-			/*
-			 * This VMA had no anon_vma yet.  This anon_vma is
-			 * the root of any anon_vma tree that might form.
-			 */
-			anon_vma->root = anon_vma;
 		}
 
 		anon_vma_lock(anon_vma);
@@ -156,7 +164,7 @@ int anon_vma_prepare(struct vm_area_stru
 		anon_vma_unlock(anon_vma);
 
 		if (unlikely(allocated))
-			anon_vma_free(allocated);
+			put_anon_vma(allocated);
 		if (unlikely(avc))
 			anon_vma_chain_free(avc);
 	}
@@ -237,9 +245,9 @@ int anon_vma_fork(struct vm_area_struct 
 	 */
 	anon_vma->root = pvma->anon_vma->root;
 	/*
-	 * With KSM refcounts, an anon_vma can stay around longer than the
-	 * process it belongs to.  The root anon_vma needs to be pinned
-	 * until this anon_vma is freed, because the lock lives in the root.
+	 * With refcounts, an anon_vma can stay around longer than the
+	 * process it belongs to. The root anon_vma needs to be pinned until
+	 * this anon_vma is freed, because the lock lives in the root.
 	 */
 	get_anon_vma(anon_vma->root);
 	/* Mark this anon_vma as the one where our new (COWed) pages go. */
@@ -249,7 +257,7 @@ int anon_vma_fork(struct vm_area_struct 
 	return 0;
 
  out_error_free_anon_vma:
-	anon_vma_free(anon_vma);
+	put_anon_vma(anon_vma);
  out_error:
 	unlink_anon_vmas(vma);
 	return -ENOMEM;
@@ -268,15 +276,11 @@ static void anon_vma_unlink(struct anon_
 	list_del(&anon_vma_chain->same_anon_vma);
 
 	/* We must garbage collect the anon_vma if it's empty */
-	empty = list_empty(&anon_vma->head) && !atomic_read(&anon_vma->refcount);
+	empty = list_empty(&anon_vma->head);
 	anon_vma_unlock(anon_vma);
 
-	if (empty) {
-		/* We no longer need the root anon_vma */
-		if (anon_vma->root != anon_vma)
-			put_anon_vma(anon_vma->root);
-		anon_vma_free(anon_vma);
-	}
+	if (empty)
+		put_anon_vma(anon_vma);
 }
 
 void unlink_anon_vmas(struct vm_area_struct *vma)
@@ -1451,38 +1455,11 @@ int try_to_munlock(struct page *page)
 		return try_to_unmap_file(page, TTU_MUNLOCK);
 }
 
-/*
- * Drop an anon_vma refcount, freeing the anon_vma and anon_vma->root
- * if necessary.  Be careful to do all the tests under the lock.  Once
- * we know we are the last user, nobody else can get a reference and we
- * can do the freeing without the lock.
- */
-void put_anon_vma(struct anon_vma *anon_vma)
-{
-	BUG_ON(atomic_read(&anon_vma->refcount) <= 0);
-	if (atomic_dec_and_lock(&anon_vma->refcount, &anon_vma->root->lock)) {
-		struct anon_vma *root = anon_vma->root;
-		int empty = list_empty(&anon_vma->head);
-		int last_root_user = 0;
-		int root_empty = 0;
-
-		/*
-		 * The refcount on a non-root anon_vma got dropped.  Drop
-		 * the refcount on the root and check if we need to free it.
-		 */
-		if (empty && anon_vma != root) {
-			BUG_ON(atomic_read(&root->refcount) <= 0);
-			last_root_user = atomic_dec_and_test(&root->refcount);
-			root_empty = list_empty(&root->head);
-		}
-		anon_vma_unlock(anon_vma);
-
-		if (empty) {
-			anon_vma_free(anon_vma);
-			if (root_empty && last_root_user)
-				anon_vma_free(root);
-		}
-	}
+void __put_anon_vma(struct anon_vma *anon_vma)
+{
+	if (anon_vma->root != anon_vma)
+		put_anon_vma(anon_vma->root);
+	anon_vma_free(anon_vma);
 }
 
 #ifdef CONFIG_MIGRATION
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -850,7 +850,7 @@ static int unmap_and_move_huge_page(new_
 		int empty = list_empty(&anon_vma->head);
 		spin_unlock(&anon_vma->lock);
 		if (empty)
-			anon_vma_free(anon_vma);
+			put_anon_vma(anon_vma);
 	}
 
 	if (rcu_locked)



^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 06/21] mm: Simplify anon_vma refcounts
@ 2010-11-26 14:38   ` Peter Zijlstra
  0 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mm-use-anon_vma-ref.patch --]
[-- Type: text/plain, Size: 7095 bytes --]

This patch changes the anon_vma refcount to be 0 when the object is
free. It does this by adding a single reference that stands for the
anon_vma being in use (i.e. the anon_vma->head list is not empty).

This allows a simpler release scheme that no longer has to check both
the refcount and the list, and it avoids taking a reference for each
entry on the list.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/rmap.h |   11 +++++--
 mm/ksm.c             |    4 --
 mm/rmap.c            |   79 ++++++++++++++++++---------------------------------
 3 files changed, 38 insertions(+), 56 deletions(-)

Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h
+++ linux-2.6/include/linux/rmap.h
@@ -73,7 +73,13 @@ static inline void get_anon_vma(struct a
 	atomic_inc(&anon_vma->refcount);
 }
 
-void put_anon_vma(struct anon_vma *);
+void __put_anon_vma(struct anon_vma *anon_vma);
+
+static inline void put_anon_vma(struct anon_vma *anon_vma)
+{
+	if (atomic_dec_and_test(&anon_vma->refcount))
+		__put_anon_vma(anon_vma);
+}
 
 static inline struct anon_vma *page_anon_vma(struct page *page)
 {
@@ -116,7 +122,6 @@ void unlink_anon_vmas(struct vm_area_str
 int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
 int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
 void __anon_vma_link(struct vm_area_struct *);
-void anon_vma_free(struct anon_vma *);
 
 static inline void anon_vma_merge(struct vm_area_struct *vma,
 				  struct vm_area_struct *next)
@@ -125,6 +130,8 @@ static inline void anon_vma_merge(struct
 	unlink_anon_vmas(next);
 }
 
+struct anon_vma *page_get_anon_vma(struct page *page);
+
 /*
  * rmap interfaces called when adding or removing pte of page
  */
Index: linux-2.6/mm/ksm.c
===================================================================
--- linux-2.6.orig/mm/ksm.c
+++ linux-2.6/mm/ksm.c
@@ -309,9 +309,7 @@ static void hold_anon_vma(struct rmap_it
 
 static void ksm_put_anon_vma(struct rmap_item *rmap_item)
 {
-	struct anon_vma *anon_vma = rmap_item->anon_vma;
-
-	put_anon_vma(anon_vma);
+	put_anon_vma(rmap_item->anon_vma);
 }
 
 /*
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -67,11 +67,24 @@ static struct kmem_cache *anon_vma_chain
 
 static inline struct anon_vma *anon_vma_alloc(void)
 {
-	return kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+	struct anon_vma *anon_vma;
+
+	anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+	if (anon_vma) {
+		atomic_set(&anon_vma->refcount, 1);
+		/*
+		 * Initialise the anon_vma root to point to itself. If called from
+		 * fork, the root will be reset to the parents anon_vma.
+		 */
+		anon_vma->root = anon_vma;
+	}
+
+	return anon_vma;
 }
 
-void anon_vma_free(struct anon_vma *anon_vma)
+static inline void anon_vma_free(struct anon_vma *anon_vma)
 {
+	VM_BUG_ON(atomic_read(&anon_vma->refcount));
 	kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
@@ -133,11 +146,6 @@ int anon_vma_prepare(struct vm_area_stru
 			if (unlikely(!anon_vma))
 				goto out_enomem_free_avc;
 			allocated = anon_vma;
-			/*
-			 * This VMA had no anon_vma yet.  This anon_vma is
-			 * the root of any anon_vma tree that might form.
-			 */
-			anon_vma->root = anon_vma;
 		}
 
 		anon_vma_lock(anon_vma);
@@ -156,7 +164,7 @@ int anon_vma_prepare(struct vm_area_stru
 		anon_vma_unlock(anon_vma);
 
 		if (unlikely(allocated))
-			anon_vma_free(allocated);
+			put_anon_vma(allocated);
 		if (unlikely(avc))
 			anon_vma_chain_free(avc);
 	}
@@ -237,9 +245,9 @@ int anon_vma_fork(struct vm_area_struct 
 	 */
 	anon_vma->root = pvma->anon_vma->root;
 	/*
-	 * With KSM refcounts, an anon_vma can stay around longer than the
-	 * process it belongs to.  The root anon_vma needs to be pinned
-	 * until this anon_vma is freed, because the lock lives in the root.
+	 * With refcounts, an anon_vma can stay around longer than the
+	 * process it belongs to. The root anon_vma needs to be pinned until
+	 * this anon_vma is freed, because the lock lives in the root.
 	 */
 	get_anon_vma(anon_vma->root);
 	/* Mark this anon_vma as the one where our new (COWed) pages go. */
@@ -249,7 +257,7 @@ int anon_vma_fork(struct vm_area_struct 
 	return 0;
 
  out_error_free_anon_vma:
-	anon_vma_free(anon_vma);
+	put_anon_vma(anon_vma);
  out_error:
 	unlink_anon_vmas(vma);
 	return -ENOMEM;
@@ -268,15 +276,11 @@ static void anon_vma_unlink(struct anon_
 	list_del(&anon_vma_chain->same_anon_vma);
 
 	/* We must garbage collect the anon_vma if it's empty */
-	empty = list_empty(&anon_vma->head) && !atomic_read(&anon_vma->refcount);
+	empty = list_empty(&anon_vma->head);
 	anon_vma_unlock(anon_vma);
 
-	if (empty) {
-		/* We no longer need the root anon_vma */
-		if (anon_vma->root != anon_vma)
-			put_anon_vma(anon_vma->root);
-		anon_vma_free(anon_vma);
-	}
+	if (empty)
+		put_anon_vma(anon_vma);
 }
 
 void unlink_anon_vmas(struct vm_area_struct *vma)
@@ -1451,38 +1455,11 @@ int try_to_munlock(struct page *page)
 		return try_to_unmap_file(page, TTU_MUNLOCK);
 }
 
-/*
- * Drop an anon_vma refcount, freeing the anon_vma and anon_vma->root
- * if necessary.  Be careful to do all the tests under the lock.  Once
- * we know we are the last user, nobody else can get a reference and we
- * can do the freeing without the lock.
- */
-void put_anon_vma(struct anon_vma *anon_vma)
-{
-	BUG_ON(atomic_read(&anon_vma->refcount) <= 0);
-	if (atomic_dec_and_lock(&anon_vma->refcount, &anon_vma->root->lock)) {
-		struct anon_vma *root = anon_vma->root;
-		int empty = list_empty(&anon_vma->head);
-		int last_root_user = 0;
-		int root_empty = 0;
-
-		/*
-		 * The refcount on a non-root anon_vma got dropped.  Drop
-		 * the refcount on the root and check if we need to free it.
-		 */
-		if (empty && anon_vma != root) {
-			BUG_ON(atomic_read(&root->refcount) <= 0);
-			last_root_user = atomic_dec_and_test(&root->refcount);
-			root_empty = list_empty(&root->head);
-		}
-		anon_vma_unlock(anon_vma);
-
-		if (empty) {
-			anon_vma_free(anon_vma);
-			if (root_empty && last_root_user)
-				anon_vma_free(root);
-		}
-	}
+void __put_anon_vma(struct anon_vma *anon_vma)
+{
+	if (anon_vma->root != anon_vma)
+		put_anon_vma(anon_vma->root);
+	anon_vma_free(anon_vma);
 }
 
 #ifdef CONFIG_MIGRATION
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -850,7 +850,7 @@ static int unmap_and_move_huge_page(new_
 		int empty = list_empty(&anon_vma->head);
 		spin_unlock(&anon_vma->lock);
 		if (empty)
-			anon_vma_free(anon_vma);
+			put_anon_vma(anon_vma);
 	}
 
 	if (rcu_locked)



^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 07/21] mm: Use refcounts for page_lock_anon_vma()
  2010-11-26 14:38 ` Peter Zijlstra
  (?)
@ 2010-11-26 14:38   ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mm-ref-page_lock_anon_vma.patch --]
[-- Type: text/plain, Size: 2058 bytes --]

Convert page_lock_anon_vma() over to use refcounts. This is done in
preparation for the conversion of the anon_vma lock from a spinlock to
a mutex.
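
( For illustration only: the step page_get_anon_vma() hinges on is
  taking a reference only while the count is still non-zero, which the
  kernel does with atomic_inc_not_zero() under rcu_read_lock(). Below
  is a minimal user-space sketch of that primitive with C11 atomics;
  the function name is made up, not kernel API. )

  #include <stdatomic.h>
  #include <stdbool.h>

  /* Succeeds only if the count has not already dropped to zero. */
  static bool refcount_inc_not_zero(atomic_int *refcount)
  {
  	int old = atomic_load(refcount);

  	do {
  		if (old == 0)
  			return false;	/* too late: the object is being freed */
  	} while (!atomic_compare_exchange_weak(refcount, &old, old + 1));

  	return true;
  }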

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/rmap.c |   34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -332,9 +332,9 @@ void __init anon_vma_init(void)
  * that the anon_vma pointer from page->mapping is valid if there is a
  * mapcount, we can dereference the anon_vma after observing those.
  */
-struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *page_get_anon_vma(struct page *page)
 {
-	struct anon_vma *anon_vma, *root_anon_vma;
+	struct anon_vma *anon_vma = NULL;
 	unsigned long anon_mapping;
 
 	rcu_read_lock();
@@ -345,8 +345,10 @@ struct anon_vma *page_lock_anon_vma(stru
 		goto out;
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
-	root_anon_vma = ACCESS_ONCE(anon_vma->root);
-	spin_lock(&root_anon_vma->lock);
+	if (!atomic_inc_not_zero(&anon_vma->refcount)) {
+		anon_vma = NULL;
+		goto out;
+	}
 
 	/*
 	 * If this page is still mapped, then its anon_vma cannot have been
@@ -356,19 +358,31 @@ struct anon_vma *page_lock_anon_vma(stru
 	 * corrupt): with anon_vma_prepare() or anon_vma_fork() redirecting
 	 * anon_vma->root before page_unlock_anon_vma() is called to unlock.
 	 */
-	if (page_mapped(page))
-		return anon_vma;
-
-	spin_unlock(&root_anon_vma->lock);
+	if (!page_mapped(page)) {
+		put_anon_vma(anon_vma);
+		anon_vma = NULL;
+		goto out;
+	}
 out:
 	rcu_read_unlock();
-	return NULL;
+
+	return anon_vma;
+}
+
+struct anon_vma *page_lock_anon_vma(struct page *page)
+{
+	struct anon_vma *anon_vma = page_get_anon_vma(page);
+
+	if (anon_vma)
+		anon_vma_lock(anon_vma);
+
+	return anon_vma;
 }
 
 void page_unlock_anon_vma(struct anon_vma *anon_vma)
 {
 	anon_vma_unlock(anon_vma);
-	rcu_read_unlock();
+	put_anon_vma(anon_vma);
 }
 
 /*



^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 07/21] mm: Use refcounts for page_lock_anon_vma()
@ 2010-11-26 14:38   ` Peter Zijlstra
  0 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mm-ref-page_lock_anon_vma.patch --]
[-- Type: text/plain, Size: 2354 bytes --]

Convert page_lock_anon_vma() over to use refcounts. This is done in
preparation for the conversion of the anon_vma lock from a spinlock to
a mutex.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/rmap.c |   34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -332,9 +332,9 @@ void __init anon_vma_init(void)
  * that the anon_vma pointer from page->mapping is valid if there is a
  * mapcount, we can dereference the anon_vma after observing those.
  */
-struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *page_get_anon_vma(struct page *page)
 {
-	struct anon_vma *anon_vma, *root_anon_vma;
+	struct anon_vma *anon_vma = NULL;
 	unsigned long anon_mapping;
 
 	rcu_read_lock();
@@ -345,8 +345,10 @@ struct anon_vma *page_lock_anon_vma(stru
 		goto out;
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
-	root_anon_vma = ACCESS_ONCE(anon_vma->root);
-	spin_lock(&root_anon_vma->lock);
+	if (!atomic_inc_not_zero(&anon_vma->refcount)) {
+		anon_vma = NULL;
+		goto out;
+	}
 
 	/*
 	 * If this page is still mapped, then its anon_vma cannot have been
@@ -356,19 +358,31 @@ struct anon_vma *page_lock_anon_vma(stru
 	 * corrupt): with anon_vma_prepare() or anon_vma_fork() redirecting
 	 * anon_vma->root before page_unlock_anon_vma() is called to unlock.
 	 */
-	if (page_mapped(page))
-		return anon_vma;
-
-	spin_unlock(&root_anon_vma->lock);
+	if (!page_mapped(page)) {
+		put_anon_vma(anon_vma);
+		anon_vma = NULL;
+		goto out;
+	}
 out:
 	rcu_read_unlock();
-	return NULL;
+
+	return anon_vma;
+}
+
+struct anon_vma *page_lock_anon_vma(struct page *page)
+{
+	struct anon_vma *anon_vma = page_get_anon_vma(page);
+
+	if (anon_vma)
+		anon_vma_lock(anon_vma);
+
+	return anon_vma;
 }
 
 void page_unlock_anon_vma(struct anon_vma *anon_vma)
 {
 	anon_vma_unlock(anon_vma);
-	rcu_read_unlock();
+	put_anon_vma(anon_vma);
 }
 
 /*



^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 07/21] mm: Use refcounts for page_lock_anon_vma()
@ 2010-11-26 14:38   ` Peter Zijlstra
  0 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mm-ref-page_lock_anon_vma.patch --]
[-- Type: text/plain, Size: 2354 bytes --]

Convert page_lock_anon_vma() over to use refcounts. This is done in
preparation for the conversion of the anon_vma lock from a spinlock to
a mutex.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/rmap.c |   34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -332,9 +332,9 @@ void __init anon_vma_init(void)
  * that the anon_vma pointer from page->mapping is valid if there is a
  * mapcount, we can dereference the anon_vma after observing those.
  */
-struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *page_get_anon_vma(struct page *page)
 {
-	struct anon_vma *anon_vma, *root_anon_vma;
+	struct anon_vma *anon_vma = NULL;
 	unsigned long anon_mapping;
 
 	rcu_read_lock();
@@ -345,8 +345,10 @@ struct anon_vma *page_lock_anon_vma(stru
 		goto out;
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
-	root_anon_vma = ACCESS_ONCE(anon_vma->root);
-	spin_lock(&root_anon_vma->lock);
+	if (!atomic_inc_not_zero(&anon_vma->refcount)) {
+		anon_vma = NULL;
+		goto out;
+	}
 
 	/*
 	 * If this page is still mapped, then its anon_vma cannot have been
@@ -356,19 +358,31 @@ struct anon_vma *page_lock_anon_vma(stru
 	 * corrupt): with anon_vma_prepare() or anon_vma_fork() redirecting
 	 * anon_vma->root before page_unlock_anon_vma() is called to unlock.
 	 */
-	if (page_mapped(page))
-		return anon_vma;
-
-	spin_unlock(&root_anon_vma->lock);
+	if (!page_mapped(page)) {
+		put_anon_vma(anon_vma);
+		anon_vma = NULL;
+		goto out;
+	}
 out:
 	rcu_read_unlock();
-	return NULL;
+
+	return anon_vma;
+}
+
+struct anon_vma *page_lock_anon_vma(struct page *page)
+{
+	struct anon_vma *anon_vma = page_get_anon_vma(page);
+
+	if (anon_vma)
+		anon_vma_lock(anon_vma);
+
+	return anon_vma;
 }
 
 void page_unlock_anon_vma(struct anon_vma *anon_vma)
 {
 	anon_vma_unlock(anon_vma);
-	rcu_read_unlock();
+	put_anon_vma(anon_vma);
 }
 
 /*



^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 08/21] mm: Preemptible mmu_gather
  2010-11-26 14:38 ` Peter Zijlstra
  (?)
@ 2010-11-26 14:38   ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell,
	Martin Schwidefsky, Russell King, Paul Mundt, Jeff Dike,
	Tony Luck

[-- Attachment #1: mm-preempt-tlb-gather.patch --]
[-- Type: text/plain, Size: 10029 bytes --]

Make mmu_gather preemptible by using a small on-stack list and an
optional allocation to speed things up.

Preemptible mmu_gather is desired in general and usable once
i_mmap_lock becomes a mutex. Doing it before the mutex conversion
saves us from having to rework the code by moving the mmu_gather
bits inside the i_mmap_lock.
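
( For illustration only: the allocation pattern is "always have a small
  on-stack array, so gathering can never fail, and opportunistically
  upgrade to a page worth of pointers when the allocation succeeds".
  A minimal user-space sketch of that pattern follows; the names are
  made up, not the mmu_gather API. )

  #include <stdlib.h>

  #define LOCAL_NR	8
  #define BIG_SIZE	4096

  struct gather {
  	void		**pages;	/* points at local[] or a bigger buffer */
  	unsigned int	nr;
  	unsigned int	max;
  	void		*local[LOCAL_NR];
  };

  static void gather_init(struct gather *g)
  {
  	void **big;

  	g->pages = g->local;		/* fallback that cannot fail */
  	g->nr = 0;
  	g->max = LOCAL_NR;

  	big = malloc(BIG_SIZE);		/* optional speed-up */
  	if (big) {
  		g->pages = big;
  		g->max = BIG_SIZE / sizeof(void *);
  	}
  }

  static void gather_finish(struct gather *g)
  {
  	if (g->pages != g->local)
  		free(g->pages);
  }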

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/exec.c                 |   10 ++++-----
 include/asm-generic/tlb.h |   51 +++++++++++++++++++++++++++++-----------------
 include/linux/mm.h        |    2 -
 mm/memory.c               |   27 +++++-------------------
 mm/mmap.c                 |   16 +++++++-------
 5 files changed, 53 insertions(+), 53 deletions(-)

Index: linux-2.6/fs/exec.c
===================================================================
--- linux-2.6.orig/fs/exec.c
+++ linux-2.6/fs/exec.c
@@ -504,7 +504,7 @@ static int shift_arg_pages(struct vm_are
 	unsigned long length = old_end - old_start;
 	unsigned long new_start = old_start - shift;
 	unsigned long new_end = old_end - shift;
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 
 	BUG_ON(new_start > new_end);
 
@@ -530,12 +530,12 @@ static int shift_arg_pages(struct vm_are
 		return -ENOMEM;
 
 	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
+	tlb_gather_mmu(&tlb, mm, 0);
 	if (new_end > old_start) {
 		/*
 		 * when the old and new regions overlap clear from new_end.
 		 */
-		free_pgd_range(tlb, new_end, old_end, new_end,
+		free_pgd_range(&tlb, new_end, old_end, new_end,
 			vma->vm_next ? vma->vm_next->vm_start : 0);
 	} else {
 		/*
@@ -544,10 +544,10 @@ static int shift_arg_pages(struct vm_are
 		 * have constraints on va-space that make this illegal (IA64) -
 		 * for the others its just a little faster.
 		 */
-		free_pgd_range(tlb, old_start, old_end, new_end,
+		free_pgd_range(&tlb, old_start, old_end, new_end,
 			vma->vm_next ? vma->vm_next->vm_start : 0);
 	}
-	tlb_finish_mmu(tlb, new_end, old_end);
+	tlb_finish_mmu(&tlb, new_end, old_end);
 
 	/*
 	 * Shrink the vma to just the new range.  Always succeeds.
Index: linux-2.6/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.orig/include/asm-generic/tlb.h
+++ linux-2.6/include/asm-generic/tlb.h
@@ -22,14 +22,8 @@
  * and page free order so much..
  */
 #ifdef CONFIG_SMP
-  #ifdef ARCH_FREE_PTR_NR
-    #define FREE_PTR_NR   ARCH_FREE_PTR_NR
-  #else
-    #define FREE_PTE_NR	506
-  #endif
   #define tlb_fast_mode(tlb) ((tlb)->nr == ~0U)
 #else
-  #define FREE_PTE_NR	1
   #define tlb_fast_mode(tlb) 1
 #endif
 
@@ -39,30 +33,48 @@
 struct mmu_gather {
 	struct mm_struct	*mm;
 	unsigned int		nr;	/* set to ~0U means fast mode */
+	unsigned int		max;	/* nr < max */
 	unsigned int		need_flush;/* Really unmapped some ptes? */
 	unsigned int		fullmm; /* non-zero means full mm flush */
-	struct page *		pages[FREE_PTE_NR];
+#ifdef HAVE_ARCH_MMU_GATHER
+	struct arch_mmu_gather	arch;
+#endif
+	struct page		**pages;
+	struct page		*local[8];
 };
 
-/* Users of the generic TLB shootdown code must declare this storage space. */
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
+static inline void __tlb_alloc_pages(struct mmu_gather *tlb)
+{
+	unsigned long addr = __get_free_pages(GFP_ATOMIC, 0);
+
+	if (addr) {
+		tlb->pages = (void *)addr;
+		tlb->max = PAGE_SIZE / sizeof(struct page *);
+	}
+}
 
 /* tlb_gather_mmu
  *	Return a pointer to an initialized struct mmu_gather.
  */
-static inline struct mmu_gather *
-tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
+static inline void
+tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
 {
-	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
-
 	tlb->mm = mm;
 
-	/* Use fast mode if only one CPU is online */
-	tlb->nr = num_online_cpus() > 1 ? 0U : ~0U;
+	tlb->max = ARRAY_SIZE(tlb->local);
+	tlb->pages = tlb->local;
+
+	if (num_online_cpus() > 1) {
+		tlb->nr = 0;
+		__tlb_alloc_pages(tlb);
+	} else /* Use fast mode if only one CPU is online */
+		tlb->nr = ~0U;
 
 	tlb->fullmm = full_mm_flush;
 
-	return tlb;
+#ifdef HAVE_ARCH_MMU_GATHER
+	tlb->arch = ARCH_MMU_GATHER_INIT;
+#endif
 }
 
 static inline void
@@ -75,6 +87,8 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
 	if (!tlb_fast_mode(tlb)) {
 		free_pages_and_swap_cache(tlb->pages, tlb->nr);
 		tlb->nr = 0;
+		if (tlb->pages == tlb->local)
+			__tlb_alloc_pages(tlb);
 	}
 }
 
@@ -90,7 +104,8 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
 
-	put_cpu_var(mmu_gathers);
+	if (tlb->pages != tlb->local)
+		free_pages((unsigned long)tlb->pages, 0);
 }
 
 /* tlb_remove_page
@@ -106,7 +121,7 @@ static inline void tlb_remove_page(struc
 		return;
 	}
 	tlb->pages[tlb->nr++] = page;
-	if (tlb->nr >= FREE_PTE_NR)
+	if (tlb->nr >= tlb->max)
 		tlb_flush_mmu(tlb, 0, 0);
 }
 
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -769,7 +769,7 @@ int zap_vma_ptes(struct vm_area_struct *
 		unsigned long size);
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size, struct zap_details *);
-unsigned long unmap_vmas(struct mmu_gather **tlb,
+unsigned long unmap_vmas(struct mmu_gather *tlb,
 		struct vm_area_struct *start_vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1093,17 +1093,14 @@ static unsigned long unmap_page_range(st
  * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
  * drops the lock and schedules.
  */
-unsigned long unmap_vmas(struct mmu_gather **tlbp,
+unsigned long unmap_vmas(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *details)
 {
 	long zap_work = ZAP_BLOCK_SIZE;
-	unsigned long tlb_start = 0;	/* For tlb_finish_mmu */
-	int tlb_start_valid = 0;
 	unsigned long start = start_addr;
 	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
-	int fullmm = (*tlbp)->fullmm;
 	struct mm_struct *mm = vma->vm_mm;
 
 	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
@@ -1124,11 +1121,6 @@ unsigned long unmap_vmas(struct mmu_gath
 			untrack_pfn_vma(vma, 0, 0);
 
 		while (start != end) {
-			if (!tlb_start_valid) {
-				tlb_start = start;
-				tlb_start_valid = 1;
-			}
-
 			if (unlikely(is_vm_hugetlb_page(vma))) {
 				/*
 				 * It is undesirable to test vma->vm_file as it
@@ -1149,7 +1141,7 @@ unsigned long unmap_vmas(struct mmu_gath
 
 				start = end;
 			} else
-				start = unmap_page_range(*tlbp, vma,
+				start = unmap_page_range(tlb, vma,
 						start, end, &zap_work, details);
 
 			if (zap_work > 0) {
@@ -1157,19 +1149,13 @@ unsigned long unmap_vmas(struct mmu_gath
 				break;
 			}
 
-			tlb_finish_mmu(*tlbp, tlb_start, start);
-
 			if (need_resched() ||
 				(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
-				if (i_mmap_lock) {
-					*tlbp = NULL;
+				if (i_mmap_lock)
 					goto out;
-				}
 				cond_resched();
 			}
 
-			*tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
-			tlb_start_valid = 0;
 			zap_work = ZAP_BLOCK_SIZE;
 		}
 	}
@@ -1189,16 +1175,15 @@ unsigned long zap_page_range(struct vm_a
 		unsigned long size, struct zap_details *details)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 	unsigned long end = address + size;
 	unsigned long nr_accounted = 0;
 
 	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
+	tlb_gather_mmu(&tlb, mm, 0);
 	update_hiwater_rss(mm);
 	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
-	if (tlb)
-		tlb_finish_mmu(tlb, address, end);
+	tlb_finish_mmu(&tlb, address, end);
 	return end;
 }
 
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -1896,17 +1896,17 @@ static void unmap_region(struct mm_struc
 		unsigned long start, unsigned long end)
 {
 	struct vm_area_struct *next = prev? prev->vm_next: mm->mmap;
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 	unsigned long nr_accounted = 0;
 
 	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
+	tlb_gather_mmu(&tlb, mm, 0);
 	update_hiwater_rss(mm);
 	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
-	free_pgtables(tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
+	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 				 next? next->vm_start: 0);
-	tlb_finish_mmu(tlb, start, end);
+	tlb_finish_mmu(&tlb, start, end);
 }
 
 /*
@@ -2247,7 +2247,7 @@ EXPORT_SYMBOL(do_brk);
 /* Release all mmaps. */
 void exit_mmap(struct mm_struct *mm)
 {
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
 	unsigned long end;
@@ -2272,14 +2272,14 @@ void exit_mmap(struct mm_struct *mm)
 
 	lru_add_drain();
 	flush_cache_mm(mm);
-	tlb = tlb_gather_mmu(mm, 1);
+	tlb_gather_mmu(&tlb, mm, 1);
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
 
-	free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
-	tlb_finish_mmu(tlb, 0, end);
+	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
+	tlb_finish_mmu(&tlb, 0, end);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,



^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 08/21] mm: Preemptible mmu_gather
@ 2010-11-26 14:38   ` Peter Zijlstra
  0 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell,
	Martin Schwidefsky, Russell King, Paul Mundt, Jeff Dike,
	Tony Luck

[-- Attachment #1: mm-preempt-tlb-gather.patch --]
[-- Type: text/plain, Size: 10325 bytes --]

Make mmu_gather preemptible by using a small on-stack list and an
optional allocation to speed things up.

Preemptible mmu_gather is desired in general and usable once
i_mmap_lock becomes a mutex. Doing it before the mutex conversion
saves us from having to rework the code by moving the mmu_gather
bits inside the i_mmap_lock.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/exec.c                 |   10 ++++-----
 include/asm-generic/tlb.h |   51 +++++++++++++++++++++++++++++-----------------
 include/linux/mm.h        |    2 -
 mm/memory.c               |   27 +++++-------------------
 mm/mmap.c                 |   16 +++++++-------
 5 files changed, 53 insertions(+), 53 deletions(-)

Index: linux-2.6/fs/exec.c
===================================================================
--- linux-2.6.orig/fs/exec.c
+++ linux-2.6/fs/exec.c
@@ -504,7 +504,7 @@ static int shift_arg_pages(struct vm_are
 	unsigned long length = old_end - old_start;
 	unsigned long new_start = old_start - shift;
 	unsigned long new_end = old_end - shift;
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 
 	BUG_ON(new_start > new_end);
 
@@ -530,12 +530,12 @@ static int shift_arg_pages(struct vm_are
 		return -ENOMEM;
 
 	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
+	tlb_gather_mmu(&tlb, mm, 0);
 	if (new_end > old_start) {
 		/*
 		 * when the old and new regions overlap clear from new_end.
 		 */
-		free_pgd_range(tlb, new_end, old_end, new_end,
+		free_pgd_range(&tlb, new_end, old_end, new_end,
 			vma->vm_next ? vma->vm_next->vm_start : 0);
 	} else {
 		/*
@@ -544,10 +544,10 @@ static int shift_arg_pages(struct vm_are
 		 * have constraints on va-space that make this illegal (IA64) -
 		 * for the others its just a little faster.
 		 */
-		free_pgd_range(tlb, old_start, old_end, new_end,
+		free_pgd_range(&tlb, old_start, old_end, new_end,
 			vma->vm_next ? vma->vm_next->vm_start : 0);
 	}
-	tlb_finish_mmu(tlb, new_end, old_end);
+	tlb_finish_mmu(&tlb, new_end, old_end);
 
 	/*
 	 * Shrink the vma to just the new range.  Always succeeds.
Index: linux-2.6/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.orig/include/asm-generic/tlb.h
+++ linux-2.6/include/asm-generic/tlb.h
@@ -22,14 +22,8 @@
  * and page free order so much..
  */
 #ifdef CONFIG_SMP
-  #ifdef ARCH_FREE_PTR_NR
-    #define FREE_PTR_NR   ARCH_FREE_PTR_NR
-  #else
-    #define FREE_PTE_NR	506
-  #endif
   #define tlb_fast_mode(tlb) ((tlb)->nr == ~0U)
 #else
-  #define FREE_PTE_NR	1
   #define tlb_fast_mode(tlb) 1
 #endif
 
@@ -39,30 +33,48 @@
 struct mmu_gather {
 	struct mm_struct	*mm;
 	unsigned int		nr;	/* set to ~0U means fast mode */
+	unsigned int		max;	/* nr < max */
 	unsigned int		need_flush;/* Really unmapped some ptes? */
 	unsigned int		fullmm; /* non-zero means full mm flush */
-	struct page *		pages[FREE_PTE_NR];
+#ifdef HAVE_ARCH_MMU_GATHER
+	struct arch_mmu_gather	arch;
+#endif
+	struct page		**pages;
+	struct page		*local[8];
 };
 
-/* Users of the generic TLB shootdown code must declare this storage space. */
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
+static inline void __tlb_alloc_pages(struct mmu_gather *tlb)
+{
+	unsigned long addr = __get_free_pages(GFP_ATOMIC, 0);
+
+	if (addr) {
+		tlb->pages = (void *)addr;
+		tlb->max = PAGE_SIZE / sizeof(struct page *);
+	}
+}
 
 /* tlb_gather_mmu
  *	Return a pointer to an initialized struct mmu_gather.
  */
-static inline struct mmu_gather *
-tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
+static inline void
+tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
 {
-	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
-
 	tlb->mm = mm;
 
-	/* Use fast mode if only one CPU is online */
-	tlb->nr = num_online_cpus() > 1 ? 0U : ~0U;
+	tlb->max = ARRAY_SIZE(tlb->local);
+	tlb->pages = tlb->local;
+
+	if (num_online_cpus() > 1) {
+		tlb->nr = 0;
+		__tlb_alloc_pages(tlb);
+	} else /* Use fast mode if only one CPU is online */
+		tlb->nr = ~0U;
 
 	tlb->fullmm = full_mm_flush;
 
-	return tlb;
+#ifdef HAVE_ARCH_MMU_GATHER
+	tlb->arch = ARCH_MMU_GATHER_INIT;
+#endif
 }
 
 static inline void
@@ -75,6 +87,8 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
 	if (!tlb_fast_mode(tlb)) {
 		free_pages_and_swap_cache(tlb->pages, tlb->nr);
 		tlb->nr = 0;
+		if (tlb->pages == tlb->local)
+			__tlb_alloc_pages(tlb);
 	}
 }
 
@@ -90,7 +104,8 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
 
-	put_cpu_var(mmu_gathers);
+	if (tlb->pages != tlb->local)
+		free_pages((unsigned long)tlb->pages, 0);
 }
 
 /* tlb_remove_page
@@ -106,7 +121,7 @@ static inline void tlb_remove_page(struc
 		return;
 	}
 	tlb->pages[tlb->nr++] = page;
-	if (tlb->nr >= FREE_PTE_NR)
+	if (tlb->nr >= tlb->max)
 		tlb_flush_mmu(tlb, 0, 0);
 }
 
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -769,7 +769,7 @@ int zap_vma_ptes(struct vm_area_struct *
 		unsigned long size);
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size, struct zap_details *);
-unsigned long unmap_vmas(struct mmu_gather **tlb,
+unsigned long unmap_vmas(struct mmu_gather *tlb,
 		struct vm_area_struct *start_vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1093,17 +1093,14 @@ static unsigned long unmap_page_range(st
  * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
  * drops the lock and schedules.
  */
-unsigned long unmap_vmas(struct mmu_gather **tlbp,
+unsigned long unmap_vmas(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *details)
 {
 	long zap_work = ZAP_BLOCK_SIZE;
-	unsigned long tlb_start = 0;	/* For tlb_finish_mmu */
-	int tlb_start_valid = 0;
 	unsigned long start = start_addr;
 	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
-	int fullmm = (*tlbp)->fullmm;
 	struct mm_struct *mm = vma->vm_mm;
 
 	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
@@ -1124,11 +1121,6 @@ unsigned long unmap_vmas(struct mmu_gath
 			untrack_pfn_vma(vma, 0, 0);
 
 		while (start != end) {
-			if (!tlb_start_valid) {
-				tlb_start = start;
-				tlb_start_valid = 1;
-			}
-
 			if (unlikely(is_vm_hugetlb_page(vma))) {
 				/*
 				 * It is undesirable to test vma->vm_file as it
@@ -1149,7 +1141,7 @@ unsigned long unmap_vmas(struct mmu_gath
 
 				start = end;
 			} else
-				start = unmap_page_range(*tlbp, vma,
+				start = unmap_page_range(tlb, vma,
 						start, end, &zap_work, details);
 
 			if (zap_work > 0) {
@@ -1157,19 +1149,13 @@ unsigned long unmap_vmas(struct mmu_gath
 				break;
 			}
 
-			tlb_finish_mmu(*tlbp, tlb_start, start);
-
 			if (need_resched() ||
 				(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
-				if (i_mmap_lock) {
-					*tlbp = NULL;
+				if (i_mmap_lock)
 					goto out;
-				}
 				cond_resched();
 			}
 
-			*tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
-			tlb_start_valid = 0;
 			zap_work = ZAP_BLOCK_SIZE;
 		}
 	}
@@ -1189,16 +1175,15 @@ unsigned long zap_page_range(struct vm_a
 		unsigned long size, struct zap_details *details)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 	unsigned long end = address + size;
 	unsigned long nr_accounted = 0;
 
 	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
+	tlb_gather_mmu(&tlb, mm, 0);
 	update_hiwater_rss(mm);
 	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
-	if (tlb)
-		tlb_finish_mmu(tlb, address, end);
+	tlb_finish_mmu(&tlb, address, end);
 	return end;
 }
 
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -1896,17 +1896,17 @@ static void unmap_region(struct mm_struc
 		unsigned long start, unsigned long end)
 {
 	struct vm_area_struct *next = prev? prev->vm_next: mm->mmap;
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 	unsigned long nr_accounted = 0;
 
 	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
+	tlb_gather_mmu(&tlb, mm, 0);
 	update_hiwater_rss(mm);
 	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
-	free_pgtables(tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
+	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 				 next? next->vm_start: 0);
-	tlb_finish_mmu(tlb, start, end);
+	tlb_finish_mmu(&tlb, start, end);
 }
 
 /*
@@ -2247,7 +2247,7 @@ EXPORT_SYMBOL(do_brk);
 /* Release all mmaps. */
 void exit_mmap(struct mm_struct *mm)
 {
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
 	unsigned long end;
@@ -2272,14 +2272,14 @@ void exit_mmap(struct mm_struct *mm)
 
 	lru_add_drain();
 	flush_cache_mm(mm);
-	tlb = tlb_gather_mmu(mm, 1);
+	tlb_gather_mmu(&tlb, mm, 1);
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
 
-	free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
-	tlb_finish_mmu(tlb, 0, end);
+	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
+	tlb_finish_mmu(&tlb, 0, end);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,



^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 08/21] mm: Preemptible mmu_gather
@ 2010-11-26 14:38   ` Peter Zijlstra
  0 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell,
	Martin Schwidefsky, Russell King, Paul Mundt, Jeff Dike,
	Tony Luck

[-- Attachment #1: mm-preempt-tlb-gather.patch --]
[-- Type: text/plain, Size: 10325 bytes --]

Make mmu_gather preemptible by using a small on-stack list and an
optional allocation to speed things up.

Preemptible mmu_gather is desired in general and usable once
i_mmap_lock becomes a mutex. Doing it before the mutex conversion
saves us from having to rework the code by moving the mmu_gather
bits inside the i_mmap_lock.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/exec.c                 |   10 ++++-----
 include/asm-generic/tlb.h |   51 +++++++++++++++++++++++++++++-----------------
 include/linux/mm.h        |    2 -
 mm/memory.c               |   27 +++++-------------------
 mm/mmap.c                 |   16 +++++++-------
 5 files changed, 53 insertions(+), 53 deletions(-)

Index: linux-2.6/fs/exec.c
===================================================================
--- linux-2.6.orig/fs/exec.c
+++ linux-2.6/fs/exec.c
@@ -504,7 +504,7 @@ static int shift_arg_pages(struct vm_are
 	unsigned long length = old_end - old_start;
 	unsigned long new_start = old_start - shift;
 	unsigned long new_end = old_end - shift;
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 
 	BUG_ON(new_start > new_end);
 
@@ -530,12 +530,12 @@ static int shift_arg_pages(struct vm_are
 		return -ENOMEM;
 
 	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
+	tlb_gather_mmu(&tlb, mm, 0);
 	if (new_end > old_start) {
 		/*
 		 * when the old and new regions overlap clear from new_end.
 		 */
-		free_pgd_range(tlb, new_end, old_end, new_end,
+		free_pgd_range(&tlb, new_end, old_end, new_end,
 			vma->vm_next ? vma->vm_next->vm_start : 0);
 	} else {
 		/*
@@ -544,10 +544,10 @@ static int shift_arg_pages(struct vm_are
 		 * have constraints on va-space that make this illegal (IA64) -
 		 * for the others its just a little faster.
 		 */
-		free_pgd_range(tlb, old_start, old_end, new_end,
+		free_pgd_range(&tlb, old_start, old_end, new_end,
 			vma->vm_next ? vma->vm_next->vm_start : 0);
 	}
-	tlb_finish_mmu(tlb, new_end, old_end);
+	tlb_finish_mmu(&tlb, new_end, old_end);
 
 	/*
 	 * Shrink the vma to just the new range.  Always succeeds.
Index: linux-2.6/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.orig/include/asm-generic/tlb.h
+++ linux-2.6/include/asm-generic/tlb.h
@@ -22,14 +22,8 @@
  * and page free order so much..
  */
 #ifdef CONFIG_SMP
-  #ifdef ARCH_FREE_PTR_NR
-    #define FREE_PTR_NR   ARCH_FREE_PTR_NR
-  #else
-    #define FREE_PTE_NR	506
-  #endif
   #define tlb_fast_mode(tlb) ((tlb)->nr == ~0U)
 #else
-  #define FREE_PTE_NR	1
   #define tlb_fast_mode(tlb) 1
 #endif
 
@@ -39,30 +33,48 @@
 struct mmu_gather {
 	struct mm_struct	*mm;
 	unsigned int		nr;	/* set to ~0U means fast mode */
+	unsigned int		max;	/* nr < max */
 	unsigned int		need_flush;/* Really unmapped some ptes? */
 	unsigned int		fullmm; /* non-zero means full mm flush */
-	struct page *		pages[FREE_PTE_NR];
+#ifdef HAVE_ARCH_MMU_GATHER
+	struct arch_mmu_gather	arch;
+#endif
+	struct page		**pages;
+	struct page		*local[8];
 };
 
-/* Users of the generic TLB shootdown code must declare this storage space. */
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
+static inline void __tlb_alloc_pages(struct mmu_gather *tlb)
+{
+	unsigned long addr = __get_free_pages(GFP_ATOMIC, 0);
+
+	if (addr) {
+		tlb->pages = (void *)addr;
+		tlb->max = PAGE_SIZE / sizeof(struct page *);
+	}
+}
 
 /* tlb_gather_mmu
  *	Return a pointer to an initialized struct mmu_gather.
  */
-static inline struct mmu_gather *
-tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
+static inline void
+tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
 {
-	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
-
 	tlb->mm = mm;
 
-	/* Use fast mode if only one CPU is online */
-	tlb->nr = num_online_cpus() > 1 ? 0U : ~0U;
+	tlb->max = ARRAY_SIZE(tlb->local);
+	tlb->pages = tlb->local;
+
+	if (num_online_cpus() > 1) {
+		tlb->nr = 0;
+		__tlb_alloc_pages(tlb);
+	} else /* Use fast mode if only one CPU is online */
+		tlb->nr = ~0U;
 
 	tlb->fullmm = full_mm_flush;
 
-	return tlb;
+#ifdef HAVE_ARCH_MMU_GATHER
+	tlb->arch = ARCH_MMU_GATHER_INIT;
+#endif
 }
 
 static inline void
@@ -75,6 +87,8 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
 	if (!tlb_fast_mode(tlb)) {
 		free_pages_and_swap_cache(tlb->pages, tlb->nr);
 		tlb->nr = 0;
+		if (tlb->pages == tlb->local)
+			__tlb_alloc_pages(tlb);
 	}
 }
 
@@ -90,7 +104,8 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
 
-	put_cpu_var(mmu_gathers);
+	if (tlb->pages != tlb->local)
+		free_pages((unsigned long)tlb->pages, 0);
 }
 
 /* tlb_remove_page
@@ -106,7 +121,7 @@ static inline void tlb_remove_page(struc
 		return;
 	}
 	tlb->pages[tlb->nr++] = page;
-	if (tlb->nr >= FREE_PTE_NR)
+	if (tlb->nr >= tlb->max)
 		tlb_flush_mmu(tlb, 0, 0);
 }
 
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -769,7 +769,7 @@ int zap_vma_ptes(struct vm_area_struct *
 		unsigned long size);
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size, struct zap_details *);
-unsigned long unmap_vmas(struct mmu_gather **tlb,
+unsigned long unmap_vmas(struct mmu_gather *tlb,
 		struct vm_area_struct *start_vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1093,17 +1093,14 @@ static unsigned long unmap_page_range(st
  * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
  * drops the lock and schedules.
  */
-unsigned long unmap_vmas(struct mmu_gather **tlbp,
+unsigned long unmap_vmas(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *details)
 {
 	long zap_work = ZAP_BLOCK_SIZE;
-	unsigned long tlb_start = 0;	/* For tlb_finish_mmu */
-	int tlb_start_valid = 0;
 	unsigned long start = start_addr;
 	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
-	int fullmm = (*tlbp)->fullmm;
 	struct mm_struct *mm = vma->vm_mm;
 
 	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
@@ -1124,11 +1121,6 @@ unsigned long unmap_vmas(struct mmu_gath
 			untrack_pfn_vma(vma, 0, 0);
 
 		while (start != end) {
-			if (!tlb_start_valid) {
-				tlb_start = start;
-				tlb_start_valid = 1;
-			}
-
 			if (unlikely(is_vm_hugetlb_page(vma))) {
 				/*
 				 * It is undesirable to test vma->vm_file as it
@@ -1149,7 +1141,7 @@ unsigned long unmap_vmas(struct mmu_gath
 
 				start = end;
 			} else
-				start = unmap_page_range(*tlbp, vma,
+				start = unmap_page_range(tlb, vma,
 						start, end, &zap_work, details);
 
 			if (zap_work > 0) {
@@ -1157,19 +1149,13 @@ unsigned long unmap_vmas(struct mmu_gath
 				break;
 			}
 
-			tlb_finish_mmu(*tlbp, tlb_start, start);
-
 			if (need_resched() ||
 				(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
-				if (i_mmap_lock) {
-					*tlbp = NULL;
+				if (i_mmap_lock)
 					goto out;
-				}
 				cond_resched();
 			}
 
-			*tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
-			tlb_start_valid = 0;
 			zap_work = ZAP_BLOCK_SIZE;
 		}
 	}
@@ -1189,16 +1175,15 @@ unsigned long zap_page_range(struct vm_a
 		unsigned long size, struct zap_details *details)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 	unsigned long end = address + size;
 	unsigned long nr_accounted = 0;
 
 	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
+	tlb_gather_mmu(&tlb, mm, 0);
 	update_hiwater_rss(mm);
 	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
-	if (tlb)
-		tlb_finish_mmu(tlb, address, end);
+	tlb_finish_mmu(&tlb, address, end);
 	return end;
 }
 
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -1896,17 +1896,17 @@ static void unmap_region(struct mm_struc
 		unsigned long start, unsigned long end)
 {
 	struct vm_area_struct *next = prev? prev->vm_next: mm->mmap;
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 	unsigned long nr_accounted = 0;
 
 	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
+	tlb_gather_mmu(&tlb, mm, 0);
 	update_hiwater_rss(mm);
 	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
-	free_pgtables(tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
+	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 				 next? next->vm_start: 0);
-	tlb_finish_mmu(tlb, start, end);
+	tlb_finish_mmu(&tlb, start, end);
 }
 
 /*
@@ -2247,7 +2247,7 @@ EXPORT_SYMBOL(do_brk);
 /* Release all mmaps. */
 void exit_mmap(struct mm_struct *mm)
 {
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
 	unsigned long end;
@@ -2272,14 +2272,14 @@ void exit_mmap(struct mm_struct *mm)
 
 	lru_add_drain();
 	flush_cache_mm(mm);
-	tlb = tlb_gather_mmu(mm, 1);
+	tlb_gather_mmu(&tlb, mm, 1);
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
 
-	free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
-	tlb_finish_mmu(tlb, 0, end);
+	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
+	tlb_finish_mmu(&tlb, 0, end);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,



^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 09/21] powerpc: Preemptible mmu_gather
  2010-11-26 14:38 ` Peter Zijlstra
@ 2010-11-26 14:38   ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mm-preempt-tlb-gather-power.patch --]
[-- Type: text/plain, Size: 8705 bytes --]

Fix up powerpc for the new mmu_gather code.

PPC has an extra batching queue to RCU-free the actual pagetable
allocations; use the arch extensions for that for now.

For the ppc64_tlb_batch, which tracks the vaddrs to unhash from the
hardware hash-table, keep using per-cpu arrays but flush them on
context switch, and use a TLF bit to track the lazy_mmu state.
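
In condensed form, the context-switch handling looks like this (a sketch
extracted from the __switch_to() hunk below, PPC64 only; surrounding
declarations omitted):

	struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);

	if (batch->active) {			/* switching away inside a lazy-MMU section */
		current_thread_info()->local_flags |= _TLF_LAZY_MMU;
		if (batch->index)
			__flush_tlb_pending(batch);	/* drain pending hash invalidations */
		batch->active = 0;
	}

	last = _switch(old_thread, new_thread);

	if (current_thread_info()->local_flags & _TLF_LAZY_MMU) {
		current_thread_info()->local_flags &= ~_TLF_LAZY_MMU;
		__get_cpu_var(ppc64_tlb_batch).active = 1;	/* re-enable batching when scheduled back in */
	}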

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/powerpc/include/asm/pgalloc.h     |    4 ++--
 arch/powerpc/include/asm/thread_info.h |    2 ++
 arch/powerpc/include/asm/tlb.h         |   10 ++++++++++
 arch/powerpc/kernel/process.c          |   21 ++++++++++++++++++++-
 arch/powerpc/mm/pgtable.c              |   14 ++++----------
 arch/powerpc/mm/tlb_hash32.c           |    2 +-
 arch/powerpc/mm/tlb_hash64.c           |   12 +++++++-----
 arch/powerpc/mm/tlb_nohash.c           |    2 +-
 8 files changed, 47 insertions(+), 20 deletions(-)

Index: linux-2.6/arch/powerpc/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/tlb.h
+++ linux-2.6/arch/powerpc/include/asm/tlb.h
@@ -28,6 +28,16 @@
 #define tlb_start_vma(tlb, vma)	do { } while (0)
 #define tlb_end_vma(tlb, vma)	do { } while (0)
 
+#define HAVE_ARCH_MMU_GATHER 1
+
+struct pte_freelist_batch;
+
+struct arch_mmu_gather {
+	struct pte_freelist_batch *batch;
+};
+
+#define ARCH_MMU_GATHER_INIT (struct arch_mmu_gather){ .batch = NULL, }
+
 extern void tlb_flush(struct mmu_gather *tlb);
 
 /* Get the generic bits... */
Index: linux-2.6/arch/powerpc/kernel/process.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/process.c
+++ linux-2.6/arch/powerpc/kernel/process.c
@@ -393,6 +393,9 @@ struct task_struct *__switch_to(struct t
 	struct thread_struct *new_thread, *old_thread;
 	unsigned long flags;
 	struct task_struct *last;
+#ifdef CONFIG_PPC64
+	struct ppc64_tlb_batch *batch;
+#endif
 
 #ifdef CONFIG_SMP
 	/* avoid complexity of lazy save/restore of fpu
@@ -511,7 +514,15 @@ struct task_struct *__switch_to(struct t
 		old_thread->accum_tb += (current_tb - start_tb);
 		new_thread->start_tb = current_tb;
 	}
-#endif
+
+	batch = &__get_cpu_var(ppc64_tlb_batch);
+	if (batch->active) {
+		current_thread_info()->local_flags |= _TLF_LAZY_MMU;
+		if (batch->index)
+			__flush_tlb_pending(batch);
+		batch->active = 0;
+	}
+#endif /* CONFIG_PPC64 */
 
 	local_irq_save(flags);
 
@@ -527,6 +538,14 @@ struct task_struct *__switch_to(struct t
 	hard_irq_disable();
 	last = _switch(old_thread, new_thread);
 
+#ifdef CONFIG_PPC64
+	if (current_thread_info()->local_flags & _TLF_LAZY_MMU) {
+		current_thread_info()->local_flags &= ~_TLF_LAZY_MMU;
+		batch = &__get_cpu_var(ppc64_tlb_batch);
+		batch->active = 1;
+	}
+#endif /* CONFIG_PPC64 */
+
 	local_irq_restore(flags);
 
 	return last;
Index: linux-2.6/arch/powerpc/mm/pgtable.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/pgtable.c
+++ linux-2.6/arch/powerpc/mm/pgtable.c
@@ -33,8 +33,6 @@
 
 #include "mmu_decl.h"
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 #ifdef CONFIG_SMP
 
 /*
@@ -43,7 +41,6 @@ DEFINE_PER_CPU(struct mmu_gather, mmu_ga
  * freeing a page table page that is being walked without locks
  */
 
-static DEFINE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur);
 static unsigned long pte_freelist_forced_free;
 
 struct pte_freelist_batch
@@ -97,12 +94,10 @@ static void pte_free_submit(struct pte_f
 
 void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
-	/* This is safe since tlb_gather_mmu has disabled preemption */
-	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
+	struct pte_freelist_batch **batchp = &tlb->arch.batch;
 	unsigned long pgf;
 
-	if (atomic_read(&tlb->mm->mm_users) < 2 ||
-	    cpumask_equal(mm_cpumask(tlb->mm), cpumask_of(smp_processor_id()))){
+	if (atomic_read(&tlb->mm->mm_users) < 2) {
 		pgtable_free(table, shift);
 		return;
 	}
@@ -124,10 +119,9 @@ void pgtable_free_tlb(struct mmu_gather 
 	}
 }
 
-void pte_free_finish(void)
+void pte_free_finish(struct mmu_gather *tlb)
 {
-	/* This is safe since tlb_gather_mmu has disabled preemption */
-	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
+	struct pte_freelist_batch **batchp = &tlb->arch.batch;
 
 	if (*batchp == NULL)
 		return;
Index: linux-2.6/arch/powerpc/mm/tlb_hash64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_hash64.c
+++ linux-2.6/arch/powerpc/mm/tlb_hash64.c
@@ -38,13 +38,11 @@ DEFINE_PER_CPU(struct ppc64_tlb_batch, p
  * neesd to be flushed. This function will either perform the flush
  * immediately or will batch it up if the current CPU has an active
  * batch on it.
- *
- * Must be called from within some kind of spinlock/non-preempt region...
  */
 void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
 		     pte_t *ptep, unsigned long pte, int huge)
 {
-	struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);
+	struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);
 	unsigned long vsid, vaddr;
 	unsigned int psize;
 	int ssize;
@@ -99,6 +97,7 @@ void hpte_need_flush(struct mm_struct *m
 	 */
 	if (!batch->active) {
 		flush_hash_page(vaddr, rpte, psize, ssize, 0);
+		put_cpu_var(ppc64_tlb_batch);
 		return;
 	}
 
@@ -127,6 +126,7 @@ void hpte_need_flush(struct mm_struct *m
 	batch->index = ++i;
 	if (i >= PPC64_TLB_BATCH_NR)
 		__flush_tlb_pending(batch);
+	put_cpu_var(ppc64_tlb_batch);
 }
 
 /*
@@ -155,7 +155,7 @@ void __flush_tlb_pending(struct ppc64_tl
 
 void tlb_flush(struct mmu_gather *tlb)
 {
-	struct ppc64_tlb_batch *tlbbatch = &__get_cpu_var(ppc64_tlb_batch);
+	struct ppc64_tlb_batch *tlbbatch = &get_cpu_var(ppc64_tlb_batch);
 
 	/* If there's a TLB batch pending, then we must flush it because the
 	 * pages are going to be freed and we really don't want to have a CPU
@@ -164,8 +164,10 @@ void tlb_flush(struct mmu_gather *tlb)
 	if (tlbbatch->index)
 		__flush_tlb_pending(tlbbatch);
 
+	put_cpu_var(ppc64_tlb_batch);
+
 	/* Push out batch of freed page tables */
-	pte_free_finish();
+	pte_free_finish(tlb);
 }
 
 /**
Index: linux-2.6/arch/powerpc/include/asm/thread_info.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/thread_info.h
+++ linux-2.6/arch/powerpc/include/asm/thread_info.h
@@ -139,10 +139,12 @@ static inline struct thread_info *curren
 #define TLF_NAPPING		0	/* idle thread enabled NAP mode */
 #define TLF_SLEEPING		1	/* suspend code enabled SLEEP mode */
 #define TLF_RESTORE_SIGMASK	2	/* Restore signal mask in do_signal */
+#define TLF_LAZY_MMU		3	/* tlb_batch is active */
 
 #define _TLF_NAPPING		(1 << TLF_NAPPING)
 #define _TLF_SLEEPING		(1 << TLF_SLEEPING)
 #define _TLF_RESTORE_SIGMASK	(1 << TLF_RESTORE_SIGMASK)
+#define _TLF_LAZY_MMU		(1 << TLF_LAZY_MMU)
 
 #ifndef __ASSEMBLY__
 #define HAVE_SET_RESTORE_SIGMASK	1
Index: linux-2.6/arch/powerpc/include/asm/pgalloc.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/pgalloc.h
+++ linux-2.6/arch/powerpc/include/asm/pgalloc.h
@@ -32,13 +32,13 @@ static inline void pte_free(struct mm_st
 
 #ifdef CONFIG_SMP
 extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift);
-extern void pte_free_finish(void);
+extern void pte_free_finish(struct mmu_gather *tlb);
 #else /* CONFIG_SMP */
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
 	pgtable_free(table, shift);
 }
-static inline void pte_free_finish(void) { }
+static inline void pte_free_finish(struct mmu_gather *tlb) { }
 #endif /* !CONFIG_SMP */
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
Index: linux-2.6/arch/powerpc/mm/tlb_hash32.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_hash32.c
+++ linux-2.6/arch/powerpc/mm/tlb_hash32.c
@@ -73,7 +73,7 @@ void tlb_flush(struct mmu_gather *tlb)
 	}
 
 	/* Push out batch of freed page tables */
-	pte_free_finish();
+	pte_free_finish(tlb);
 }
 
 /*
Index: linux-2.6/arch/powerpc/mm/tlb_nohash.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_nohash.c
+++ linux-2.6/arch/powerpc/mm/tlb_nohash.c
@@ -301,7 +301,7 @@ void tlb_flush(struct mmu_gather *tlb)
 	flush_tlb_mm(tlb->mm);
 
 	/* Push out batch of freed page tables */
-	pte_free_finish();
+	pte_free_finish(tlb);
 }
 
 /*



^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 10/21] sparc: Preemptible mmu_gather
  2010-11-26 14:38 ` Peter Zijlstra
@ 2010-11-26 14:38   ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mm-preempt-tlb-gather-sparc.patch --]
[-- Type: text/plain, Size: 8846 bytes --]

Rework the sparc mmu_gather usage to conform to the new world order :-)

Sparc mmu_gather does two things:
 - tracks vaddrs to unhash
 - tracks pages to free

Split these two things like powerpc has done: keep the vaddrs in
per-cpu data structures and flush them on context switch.

The remaining bits can then use the generic mmu_gather.
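
Condensed, the split looks roughly like this (a sketch of the declarations
the diff below introduces, not a complete header; TLB_BATCH_NR is defined
alongside struct tlb_batch in tlbflush_64.h):

	/* per-cpu batch of vaddrs, still flushed through the TSB/TLB code */
	struct tlb_batch {
		struct mm_struct *mm;
		unsigned long tlb_nr;
		unsigned long vaddrs[TLB_BATCH_NR];
	};
	static DEFINE_PER_CPU(struct tlb_batch, tlb_batch);

	/* page freeing is left to the generic gather; only the flush hook is sparc-specific */
	#define tlb_flush(tlb)	flush_tlb_pending()
	#include <asm-generic/tlb.h>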

Cc: David Miller <davem@davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/sparc/include/asm/pgalloc_64.h  |    3 +
 arch/sparc/include/asm/tlb_64.h      |   91 ++---------------------------------
 arch/sparc/include/asm/tlbflush_64.h |   12 +++-
 arch/sparc/mm/tlb.c                  |   42 +++++++++-------
 arch/sparc/mm/tsb.c                  |   15 +++--
 5 files changed, 51 insertions(+), 112 deletions(-)

Index: linux-2.6/arch/sparc/include/asm/pgalloc_64.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/pgalloc_64.h
+++ linux-2.6/arch/sparc/include/asm/pgalloc_64.h
@@ -78,4 +78,7 @@ static inline void check_pgt_cache(void)
 	quicklist_trim(0, NULL, 25, 16);
 }
 
+#define __pte_free_tlb(tlb, pte, addr)	pte_free((tlb)->mm, pte)
+#define __pmd_free_tlb(tlb, pmd, addr)	pmd_free((tlb)->mm, pmd)
+
 #endif /* _SPARC64_PGALLOC_H */
Index: linux-2.6/arch/sparc/include/asm/tlb_64.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/tlb_64.h
+++ linux-2.6/arch/sparc/include/asm/tlb_64.h
@@ -7,66 +7,11 @@
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
 
-#define TLB_BATCH_NR	192
-
-/*
- * For UP we don't need to worry about TLB flush
- * and page free order so much..
- */
-#ifdef CONFIG_SMP
-  #define FREE_PTE_NR	506
-  #define tlb_fast_mode(bp) ((bp)->pages_nr == ~0U)
-#else
-  #define FREE_PTE_NR	1
-  #define tlb_fast_mode(bp) 1
-#endif
-
-struct mmu_gather {
-	struct mm_struct *mm;
-	unsigned int pages_nr;
-	unsigned int need_flush;
-	unsigned int fullmm;
-	unsigned int tlb_nr;
-	unsigned long vaddrs[TLB_BATCH_NR];
-	struct page *pages[FREE_PTE_NR];
-};
-
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 #ifdef CONFIG_SMP
 extern void smp_flush_tlb_pending(struct mm_struct *,
 				  unsigned long, unsigned long *);
 #endif
 
-extern void __flush_tlb_pending(unsigned long, unsigned long, unsigned long *);
-extern void flush_tlb_pending(void);
-
-static inline struct mmu_gather *tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
-{
-	struct mmu_gather *mp = &get_cpu_var(mmu_gathers);
-
-	BUG_ON(mp->tlb_nr);
-
-	mp->mm = mm;
-	mp->pages_nr = num_online_cpus() > 1 ? 0U : ~0U;
-	mp->fullmm = full_mm_flush;
-
-	return mp;
-}
-
-
-static inline void tlb_flush_mmu(struct mmu_gather *mp)
-{
-	if (!mp->fullmm)
-		flush_tlb_pending();
-	if (mp->need_flush) {
-		free_pages_and_swap_cache(mp->pages, mp->pages_nr);
-		mp->pages_nr = 0;
-		mp->need_flush = 0;
-	}
-
-}
-
 #ifdef CONFIG_SMP
 extern void smp_flush_tlb_mm(struct mm_struct *mm);
 #define do_flush_tlb_mm(mm) smp_flush_tlb_mm(mm)
@@ -74,38 +19,14 @@ extern void smp_flush_tlb_mm(struct mm_s
 #define do_flush_tlb_mm(mm) __flush_tlb_mm(CTX_HWBITS(mm->context), SECONDARY_CONTEXT)
 #endif
 
-static inline void tlb_finish_mmu(struct mmu_gather *mp, unsigned long start, unsigned long end)
-{
-	tlb_flush_mmu(mp);
-
-	if (mp->fullmm)
-		mp->fullmm = 0;
-
-	/* keep the page table cache within bounds */
-	check_pgt_cache();
-
-	put_cpu_var(mmu_gathers);
-}
-
-static inline void tlb_remove_page(struct mmu_gather *mp, struct page *page)
-{
-	if (tlb_fast_mode(mp)) {
-		free_page_and_swap_cache(page);
-		return;
-	}
-	mp->need_flush = 1;
-	mp->pages[mp->pages_nr++] = page;
-	if (mp->pages_nr >= FREE_PTE_NR)
-		tlb_flush_mmu(mp);
-}
-
-#define tlb_remove_tlb_entry(mp,ptep,addr) do { } while (0)
-#define pte_free_tlb(mp, ptepage, addr) pte_free((mp)->mm, ptepage)
-#define pmd_free_tlb(mp, pmdp, addr) pmd_free((mp)->mm, pmdp)
-#define pud_free_tlb(tlb,pudp, addr) __pud_free_tlb(tlb,pudp,addr)
+extern void __flush_tlb_pending(unsigned long, unsigned long, unsigned long *);
+extern void flush_tlb_pending(void);
 
-#define tlb_migrate_finish(mm)	do { } while (0)
 #define tlb_start_vma(tlb, vma) do { } while (0)
 #define tlb_end_vma(tlb, vma)	do { } while (0)
+#define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
+#define tlb_flush(tlb)	flush_tlb_pending()
+
+#include <asm-generic/tlb.h>
 
 #endif /* _SPARC64_TLB_H */
Index: linux-2.6/arch/sparc/include/asm/tlbflush_64.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/tlbflush_64.h
+++ linux-2.6/arch/sparc/include/asm/tlbflush_64.h
@@ -5,9 +5,17 @@
 #include <asm/mmu_context.h>
 
 /* TSB flush operations. */
-struct mmu_gather;
+
+#define TLB_BATCH_NR	192
+
+struct tlb_batch {
+	struct mm_struct *mm;
+	unsigned long tlb_nr;
+	unsigned long vaddrs[TLB_BATCH_NR];
+};
+
 extern void flush_tsb_kernel_range(unsigned long start, unsigned long end);
-extern void flush_tsb_user(struct mmu_gather *mp);
+extern void flush_tsb_user(struct tlb_batch *tb);
 
 /* TLB flush operations. */
 
Index: linux-2.6/arch/sparc/mm/tlb.c
===================================================================
--- linux-2.6.orig/arch/sparc/mm/tlb.c
+++ linux-2.6/arch/sparc/mm/tlb.c
@@ -19,33 +19,33 @@
 
 /* Heavily inspired by the ppc64 code.  */
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
+static DEFINE_PER_CPU(struct tlb_batch, tlb_batch);
 
 void flush_tlb_pending(void)
 {
-	struct mmu_gather *mp = &get_cpu_var(mmu_gathers);
+	struct tlb_batch *tb = &get_cpu_var(tlb_batch);
 
-	if (mp->tlb_nr) {
-		flush_tsb_user(mp);
+	if (tb->tlb_nr) {
+		flush_tsb_user(tb);
 
-		if (CTX_VALID(mp->mm->context)) {
+		if (CTX_VALID(tb->mm->context)) {
 #ifdef CONFIG_SMP
-			smp_flush_tlb_pending(mp->mm, mp->tlb_nr,
-					      &mp->vaddrs[0]);
+			smp_flush_tlb_pending(tb->mm, tb->tlb_nr,
+					      &tb->vaddrs[0]);
 #else
-			__flush_tlb_pending(CTX_HWBITS(mp->mm->context),
-					    mp->tlb_nr, &mp->vaddrs[0]);
+			__flush_tlb_pending(CTX_HWBITS(tb->mm->context),
+					    tb->tlb_nr, &tb->vaddrs[0]);
 #endif
 		}
-		mp->tlb_nr = 0;
+		tb->tlb_nr = 0;
 	}
 
-	put_cpu_var(mmu_gathers);
+	put_cpu_var(tlb_batch);
 }
 
 void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr, pte_t *ptep, pte_t orig)
 {
-	struct mmu_gather *mp = &__get_cpu_var(mmu_gathers);
+	struct tlb_batch *tb = &get_cpu_var(tlb_batch);
 	unsigned long nr;
 
 	vaddr &= PAGE_MASK;
@@ -77,21 +77,27 @@ void tlb_batch_add(struct mm_struct *mm,
 
 no_cache_flush:
 
-	if (mp->fullmm)
+	/*
+	if (tb->fullmm) {
+		put_cpu_var(tlb_batch);
 		return;
+	}
+	*/
 
-	nr = mp->tlb_nr;
+	nr = tb->tlb_nr;
 
-	if (unlikely(nr != 0 && mm != mp->mm)) {
+	if (unlikely(nr != 0 && mm != tb->mm)) {
 		flush_tlb_pending();
 		nr = 0;
 	}
 
 	if (nr == 0)
-		mp->mm = mm;
+		tb->mm = mm;
 
-	mp->vaddrs[nr] = vaddr;
-	mp->tlb_nr = ++nr;
+	tb->vaddrs[nr] = vaddr;
+	tb->tlb_nr = ++nr;
 	if (nr >= TLB_BATCH_NR)
 		flush_tlb_pending();
+
+	put_cpu_var(tlb_batch);
 }
Index: linux-2.6/arch/sparc/mm/tsb.c
===================================================================
--- linux-2.6.orig/arch/sparc/mm/tsb.c
+++ linux-2.6/arch/sparc/mm/tsb.c
@@ -47,12 +47,13 @@ void flush_tsb_kernel_range(unsigned lon
 	}
 }
 
-static void __flush_tsb_one(struct mmu_gather *mp, unsigned long hash_shift, unsigned long tsb, unsigned long nentries)
+static void __flush_tsb_one(struct tlb_batch *tb, unsigned long hash_shift,
+			    unsigned long tsb, unsigned long nentries)
 {
 	unsigned long i;
 
-	for (i = 0; i < mp->tlb_nr; i++) {
-		unsigned long v = mp->vaddrs[i];
+	for (i = 0; i < tb->tlb_nr; i++) {
+		unsigned long v = tb->vaddrs[i];
 		unsigned long tag, ent, hash;
 
 		v &= ~0x1UL;
@@ -65,9 +66,9 @@ static void __flush_tsb_one(struct mmu_g
 	}
 }
 
-void flush_tsb_user(struct mmu_gather *mp)
+void flush_tsb_user(struct tlb_batch *tb)
 {
-	struct mm_struct *mm = mp->mm;
+	struct mm_struct *mm = tb->mm;
 	unsigned long nentries, base, flags;
 
 	spin_lock_irqsave(&mm->context.lock, flags);
@@ -76,7 +77,7 @@ void flush_tsb_user(struct mmu_gather *m
 	nentries = mm->context.tsb_block[MM_TSB_BASE].tsb_nentries;
 	if (tlb_type == cheetah_plus || tlb_type == hypervisor)
 		base = __pa(base);
-	__flush_tsb_one(mp, PAGE_SHIFT, base, nentries);
+	__flush_tsb_one(tb, PAGE_SHIFT, base, nentries);
 
 #ifdef CONFIG_HUGETLB_PAGE
 	if (mm->context.tsb_block[MM_TSB_HUGE].tsb) {
@@ -84,7 +85,7 @@ void flush_tsb_user(struct mmu_gather *m
 		nentries = mm->context.tsb_block[MM_TSB_HUGE].tsb_nentries;
 		if (tlb_type == cheetah_plus || tlb_type == hypervisor)
 			base = __pa(base);
-		__flush_tsb_one(mp, HPAGE_SHIFT, base, nentries);
+		__flush_tsb_one(tb, HPAGE_SHIFT, base, nentries);
 	}
 #endif
 	spin_unlock_irqrestore(&mm->context.lock, flags);



^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 10/21] sparc: Preemptible mmu_gather
@ 2010-11-26 14:38   ` Peter Zijlstra
  0 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mm-preempt-tlb-gather-sparc.patch --]
[-- Type: text/plain, Size: 9142 bytes --]

Rework the sparc mmu_gather usage to conform to the new world order :-)

Sparc mmu_gather does two things:
 - tracks vaddrs to unhash
 - tracks pages to free

Split these two things like powerpc has done and keep the vaddrs
in per-cpu data structures and flush them on context switch.

The remaining bits can then use the generic mmu_gather.

Cc: David Miller <davem@davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/sparc/include/asm/pgalloc_64.h  |    3 +
 arch/sparc/include/asm/tlb_64.h      |   91 ++---------------------------------
 arch/sparc/include/asm/tlbflush_64.h |   12 +++-
 arch/sparc/mm/tlb.c                  |   42 +++++++++-------
 arch/sparc/mm/tsb.c                  |   15 +++--
 5 files changed, 51 insertions(+), 112 deletions(-)

Index: linux-2.6/arch/sparc/include/asm/pgalloc_64.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/pgalloc_64.h
+++ linux-2.6/arch/sparc/include/asm/pgalloc_64.h
@@ -78,4 +78,7 @@ static inline void check_pgt_cache(void)
 	quicklist_trim(0, NULL, 25, 16);
 }
 
+#define __pte_free_tlb(tlb, pte, addr)	pte_free((tlb)->mm, pte)
+#define __pmd_free_tlb(tlb, pmd, addr)	pmd_free((tlb)->mm, pmd)
+
 #endif /* _SPARC64_PGALLOC_H */
Index: linux-2.6/arch/sparc/include/asm/tlb_64.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/tlb_64.h
+++ linux-2.6/arch/sparc/include/asm/tlb_64.h
@@ -7,66 +7,11 @@
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
 
-#define TLB_BATCH_NR	192
-
-/*
- * For UP we don't need to worry about TLB flush
- * and page free order so much..
- */
-#ifdef CONFIG_SMP
-  #define FREE_PTE_NR	506
-  #define tlb_fast_mode(bp) ((bp)->pages_nr == ~0U)
-#else
-  #define FREE_PTE_NR	1
-  #define tlb_fast_mode(bp) 1
-#endif
-
-struct mmu_gather {
-	struct mm_struct *mm;
-	unsigned int pages_nr;
-	unsigned int need_flush;
-	unsigned int fullmm;
-	unsigned int tlb_nr;
-	unsigned long vaddrs[TLB_BATCH_NR];
-	struct page *pages[FREE_PTE_NR];
-};
-
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 #ifdef CONFIG_SMP
 extern void smp_flush_tlb_pending(struct mm_struct *,
 				  unsigned long, unsigned long *);
 #endif
 
-extern void __flush_tlb_pending(unsigned long, unsigned long, unsigned long *);
-extern void flush_tlb_pending(void);
-
-static inline struct mmu_gather *tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
-{
-	struct mmu_gather *mp = &get_cpu_var(mmu_gathers);
-
-	BUG_ON(mp->tlb_nr);
-
-	mp->mm = mm;
-	mp->pages_nr = num_online_cpus() > 1 ? 0U : ~0U;
-	mp->fullmm = full_mm_flush;
-
-	return mp;
-}
-
-
-static inline void tlb_flush_mmu(struct mmu_gather *mp)
-{
-	if (!mp->fullmm)
-		flush_tlb_pending();
-	if (mp->need_flush) {
-		free_pages_and_swap_cache(mp->pages, mp->pages_nr);
-		mp->pages_nr = 0;
-		mp->need_flush = 0;
-	}
-
-}
-
 #ifdef CONFIG_SMP
 extern void smp_flush_tlb_mm(struct mm_struct *mm);
 #define do_flush_tlb_mm(mm) smp_flush_tlb_mm(mm)
@@ -74,38 +19,14 @@ extern void smp_flush_tlb_mm(struct mm_s
 #define do_flush_tlb_mm(mm) __flush_tlb_mm(CTX_HWBITS(mm->context), SECONDARY_CONTEXT)
 #endif
 
-static inline void tlb_finish_mmu(struct mmu_gather *mp, unsigned long start, unsigned long end)
-{
-	tlb_flush_mmu(mp);
-
-	if (mp->fullmm)
-		mp->fullmm = 0;
-
-	/* keep the page table cache within bounds */
-	check_pgt_cache();
-
-	put_cpu_var(mmu_gathers);
-}
-
-static inline void tlb_remove_page(struct mmu_gather *mp, struct page *page)
-{
-	if (tlb_fast_mode(mp)) {
-		free_page_and_swap_cache(page);
-		return;
-	}
-	mp->need_flush = 1;
-	mp->pages[mp->pages_nr++] = page;
-	if (mp->pages_nr >= FREE_PTE_NR)
-		tlb_flush_mmu(mp);
-}
-
-#define tlb_remove_tlb_entry(mp,ptep,addr) do { } while (0)
-#define pte_free_tlb(mp, ptepage, addr) pte_free((mp)->mm, ptepage)
-#define pmd_free_tlb(mp, pmdp, addr) pmd_free((mp)->mm, pmdp)
-#define pud_free_tlb(tlb,pudp, addr) __pud_free_tlb(tlb,pudp,addr)
+extern void __flush_tlb_pending(unsigned long, unsigned long, unsigned long *);
+extern void flush_tlb_pending(void);
 
-#define tlb_migrate_finish(mm)	do { } while (0)
 #define tlb_start_vma(tlb, vma) do { } while (0)
 #define tlb_end_vma(tlb, vma)	do { } while (0)
+#define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
+#define tlb_flush(tlb)	flush_tlb_pending()
+
+#include <asm-generic/tlb.h>
 
 #endif /* _SPARC64_TLB_H */
Index: linux-2.6/arch/sparc/include/asm/tlbflush_64.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/tlbflush_64.h
+++ linux-2.6/arch/sparc/include/asm/tlbflush_64.h
@@ -5,9 +5,17 @@
 #include <asm/mmu_context.h>
 
 /* TSB flush operations. */
-struct mmu_gather;
+
+#define TLB_BATCH_NR	192
+
+struct tlb_batch {
+	struct mm_struct *mm;
+	unsigned long tlb_nr;
+	unsigned long vaddrs[TLB_BATCH_NR];
+};
+
 extern void flush_tsb_kernel_range(unsigned long start, unsigned long end);
-extern void flush_tsb_user(struct mmu_gather *mp);
+extern void flush_tsb_user(struct tlb_batch *tb);
 
 /* TLB flush operations. */
 
Index: linux-2.6/arch/sparc/mm/tlb.c
===================================================================
--- linux-2.6.orig/arch/sparc/mm/tlb.c
+++ linux-2.6/arch/sparc/mm/tlb.c
@@ -19,33 +19,33 @@
 
 /* Heavily inspired by the ppc64 code.  */
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
+static DEFINE_PER_CPU(struct tlb_batch, tlb_batch);
 
 void flush_tlb_pending(void)
 {
-	struct mmu_gather *mp = &get_cpu_var(mmu_gathers);
+	struct tlb_batch *tb = &get_cpu_var(tlb_batch);
 
-	if (mp->tlb_nr) {
-		flush_tsb_user(mp);
+	if (tb->tlb_nr) {
+		flush_tsb_user(tb);
 
-		if (CTX_VALID(mp->mm->context)) {
+		if (CTX_VALID(tb->mm->context)) {
 #ifdef CONFIG_SMP
-			smp_flush_tlb_pending(mp->mm, mp->tlb_nr,
-					      &mp->vaddrs[0]);
+			smp_flush_tlb_pending(tb->mm, tb->tlb_nr,
+					      &tb->vaddrs[0]);
 #else
-			__flush_tlb_pending(CTX_HWBITS(mp->mm->context),
-					    mp->tlb_nr, &mp->vaddrs[0]);
+			__flush_tlb_pending(CTX_HWBITS(tb->mm->context),
+					    tb->tlb_nr, &tb->vaddrs[0]);
 #endif
 		}
-		mp->tlb_nr = 0;
+		tb->tlb_nr = 0;
 	}
 
-	put_cpu_var(mmu_gathers);
+	put_cpu_var(tlb_batch);
 }
 
 void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr, pte_t *ptep, pte_t orig)
 {
-	struct mmu_gather *mp = &__get_cpu_var(mmu_gathers);
+	struct tlb_batch *tb = &get_cpu_var(tlb_batch);
 	unsigned long nr;
 
 	vaddr &= PAGE_MASK;
@@ -77,21 +77,27 @@ void tlb_batch_add(struct mm_struct *mm,
 
 no_cache_flush:
 
-	if (mp->fullmm)
+	/*
+	if (tb->fullmm) {
+		put_cpu_var(tlb_batch);
 		return;
+	}
+	*/
 
-	nr = mp->tlb_nr;
+	nr = tb->tlb_nr;
 
-	if (unlikely(nr != 0 && mm != mp->mm)) {
+	if (unlikely(nr != 0 && mm != tb->mm)) {
 		flush_tlb_pending();
 		nr = 0;
 	}
 
 	if (nr == 0)
-		mp->mm = mm;
+		tb->mm = mm;
 
-	mp->vaddrs[nr] = vaddr;
-	mp->tlb_nr = ++nr;
+	tb->vaddrs[nr] = vaddr;
+	tb->tlb_nr = ++nr;
 	if (nr >= TLB_BATCH_NR)
 		flush_tlb_pending();
+
+	put_cpu_var(tlb_batch);
 }
Index: linux-2.6/arch/sparc/mm/tsb.c
===================================================================
--- linux-2.6.orig/arch/sparc/mm/tsb.c
+++ linux-2.6/arch/sparc/mm/tsb.c
@@ -47,12 +47,13 @@ void flush_tsb_kernel_range(unsigned lon
 	}
 }
 
-static void __flush_tsb_one(struct mmu_gather *mp, unsigned long hash_shift, unsigned long tsb, unsigned long nentries)
+static void __flush_tsb_one(struct tlb_batch *tb, unsigned long hash_shift,
+			    unsigned long tsb, unsigned long nentries)
 {
 	unsigned long i;
 
-	for (i = 0; i < mp->tlb_nr; i++) {
-		unsigned long v = mp->vaddrs[i];
+	for (i = 0; i < tb->tlb_nr; i++) {
+		unsigned long v = tb->vaddrs[i];
 		unsigned long tag, ent, hash;
 
 		v &= ~0x1UL;
@@ -65,9 +66,9 @@ static void __flush_tsb_one(struct mmu_g
 	}
 }
 
-void flush_tsb_user(struct mmu_gather *mp)
+void flush_tsb_user(struct tlb_batch *tb)
 {
-	struct mm_struct *mm = mp->mm;
+	struct mm_struct *mm = tb->mm;
 	unsigned long nentries, base, flags;
 
 	spin_lock_irqsave(&mm->context.lock, flags);
@@ -76,7 +77,7 @@ void flush_tsb_user(struct mmu_gather *m
 	nentries = mm->context.tsb_block[MM_TSB_BASE].tsb_nentries;
 	if (tlb_type == cheetah_plus || tlb_type == hypervisor)
 		base = __pa(base);
-	__flush_tsb_one(mp, PAGE_SHIFT, base, nentries);
+	__flush_tsb_one(tb, PAGE_SHIFT, base, nentries);
 
 #ifdef CONFIG_HUGETLB_PAGE
 	if (mm->context.tsb_block[MM_TSB_HUGE].tsb) {
@@ -84,7 +85,7 @@ void flush_tsb_user(struct mmu_gather *m
 		nentries = mm->context.tsb_block[MM_TSB_HUGE].tsb_nentries;
 		if (tlb_type == cheetah_plus || tlb_type == hypervisor)
 			base = __pa(base);
-		__flush_tsb_one(mp, HPAGE_SHIFT, base, nentries);
+		__flush_tsb_one(tb, HPAGE_SHIFT, base, nentries);
 	}
 #endif
 	spin_unlock_irqrestore(&mm->context.lock, flags);



^ permalink raw reply	[flat|nested] 157+ messages in thread


* [PATCH 11/21] s390: preemptible mmu_gather
  2010-11-26 14:38 ` Peter Zijlstra
@ 2010-11-26 14:38   ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell,
	Martin Schwidefsky

[-- Attachment #1: martin-mm-preempt-tlb-gather-s390.patch --]
[-- Type: text/plain, Size: 2570 bytes --]

From: Martin Schwidefsky <schwidefsky@de.ibm.com>

Adapt the stand-alone s390 mmu_gather implementation to the new
preemptible mmu_gather interface.
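
The heart of the conversion is that the gather is now caller-provided and
carries its own batch storage: a small embedded local[] array that is used
as-is whenever the opportunistic page allocation fails.  A minimal
stand-alone C model of that pattern (a sketch only -- the names, the
malloc() and the batch size of 512 stand in for the kernel's GFP_ATOMIC
page allocation):

#include <stdlib.h>

#define LOCAL_NR 8

struct gather {
	void **array;		/* points at local[] or at a larger batch */
	unsigned int max;	/* capacity of array[]                    */
	unsigned int nr;	/* entries currently queued               */
	void *local[LOCAL_NR];	/* always-available fallback storage      */
};

static void gather_init(struct gather *g)
{
	void **big = malloc(512 * sizeof(void *));

	/* The embedded array guarantees that init itself cannot fail. */
	g->array = g->local;
	g->max = LOCAL_NR;
	g->nr = 0;

	/* Upgrade to the larger batch only when memory allows it. */
	if (big) {
		g->array = big;
		g->max = 512;
	}
}

static void gather_finish(struct gather *g)
{
	/* ... flush the queued entries here ... */
	if (g->array != g->local)
		free(g->array);
}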

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/s390/include/asm/tlb.h |   43 +++++++++++++++++++++++++------------------
 1 file changed, 25 insertions(+), 18 deletions(-)

Index: linux-2.6/arch/s390/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/s390/include/asm/tlb.h
+++ linux-2.6/arch/s390/include/asm/tlb.h
@@ -28,44 +28,50 @@
 #include <asm/smp.h>
 #include <asm/tlbflush.h>
 
-#ifndef CONFIG_SMP
-#define TLB_NR_PTRS	1
-#else
-#define TLB_NR_PTRS	508
-#endif
-
 struct mmu_gather {
 	struct mm_struct *mm;
 	unsigned int fullmm;
 	unsigned int nr_ptes;
 	unsigned int nr_pxds;
-	void *array[TLB_NR_PTRS];
+	unsigned int max;
+	void **array;
+	void *local[8];
 };
 
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
-
-static inline struct mmu_gather *tlb_gather_mmu(struct mm_struct *mm,
-						unsigned int full_mm_flush)
+static inline void __tlb_alloc_pages(struct mmu_gather *tlb)
 {
-	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
+	unsigned long addr = __get_free_pages(GFP_ATOMIC, 0);
+
+	if (addr) {
+		tlb->array = (void *) addr;
+		tlb->max = PAGE_SIZE / sizeof(void *);
+	}
+}
 
+static inline void tlb_gather_mmu(struct mmu_gather *tlb,
+				  struct mm_struct *mm,
+				  unsigned int full_mm_flush)
+{
 	tlb->mm = mm;
+	tlb->max = ARRAY_SIZE(tlb->local);
+	tlb->array = tlb->local;
 	tlb->fullmm = full_mm_flush;
-	tlb->nr_ptes = 0;
-	tlb->nr_pxds = TLB_NR_PTRS;
 	if (tlb->fullmm)
 		__tlb_flush_mm(mm);
-	return tlb;
+	else
+		__tlb_alloc_pages(tlb);
+	tlb->nr_ptes = 0;
+	tlb->nr_pxds = tlb->max;
 }
 
 static inline void tlb_flush_mmu(struct mmu_gather *tlb,
 				 unsigned long start, unsigned long end)
 {
-	if (!tlb->fullmm && (tlb->nr_ptes > 0 || tlb->nr_pxds < TLB_NR_PTRS))
+	if (!tlb->fullmm && (tlb->nr_ptes > 0 || tlb->nr_pxds < tlb->max))
 		__tlb_flush_mm(tlb->mm);
 	while (tlb->nr_ptes > 0)
 		page_table_free_rcu(tlb->mm, tlb->array[--tlb->nr_ptes]);
-	while (tlb->nr_pxds < TLB_NR_PTRS)
+	while (tlb->nr_pxds < tlb->max)
 		crst_table_free_rcu(tlb->mm, tlb->array[tlb->nr_pxds++]);
 }
 
@@ -79,7 +85,8 @@ static inline void tlb_finish_mmu(struct
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
 
-	put_cpu_var(mmu_gathers);
+	if (tlb->array != tlb->local)
+		free_pages((unsigned long) tlb->array, 0);
 }
 
 /*



^ permalink raw reply	[flat|nested] 157+ messages in thread


* [PATCH 12/21] arm: Preemptible mmu_gather
  2010-11-26 14:38 ` Peter Zijlstra
@ 2010-11-26 14:38   ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell,
	Russell King

[-- Attachment #1: mm-preempt-tlb-gather-arm.patch --]
[-- Type: text/plain, Size: 1111 bytes --]

Fix up the arm mmu_gather code to conform to the new API.
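
For reference, the caller-visible shape of the new API (an illustrative
sketch; example_unmap() is hypothetical and only the tlb_* calls are the
real interface): the mmu_gather now lives in caller-provided storage
instead of a per-cpu variable, which is what allows the gather section to
be preempted.

static void example_unmap(struct mm_struct *mm,
			  unsigned long start, unsigned long end)
{
	struct mmu_gather tlb;		/* caller-owned; was &get_cpu_var(mmu_gathers) */

	tlb_gather_mmu(&tlb, mm, 0);	/* the old API returned a per-cpu pointer */

	/* ... walk the page tables, calling tlb_remove_page(&tlb, page) ... */

	tlb_finish_mmu(&tlb, start, end);
}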

Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/arm/include/asm/tlb.h |   12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

Index: linux-2.6/arch/arm/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/arm/include/asm/tlb.h
+++ linux-2.6/arch/arm/include/asm/tlb.h
@@ -40,17 +40,11 @@ struct mmu_gather {
 	unsigned long		range_end;
 };
 
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
-
-static inline struct mmu_gather *
-tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
+static inline void
+tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
 {
-	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
-
 	tlb->mm = mm;
 	tlb->fullmm = full_mm_flush;
-
-	return tlb;
 }
 
 static inline void
@@ -61,8 +55,6 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
 
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
-
-	put_cpu_var(mmu_gathers);
 }
 
 /*



^ permalink raw reply	[flat|nested] 157+ messages in thread


* [PATCH 13/21] sh: Preemptible mmu_gather
  2010-11-26 14:38 ` Peter Zijlstra
@ 2010-11-26 14:38   ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell,
	Paul Mundt

[-- Attachment #1: mm-preempt-tlb-gather-sh.patch --]
[-- Type: text/plain, Size: 1303 bytes --]

Fix up the sh mmu_gather code to conform to the new API.

Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/sh/include/asm/tlb.h |   12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

Index: linux-2.6/arch/sh/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/sh/include/asm/tlb.h
+++ linux-2.6/arch/sh/include/asm/tlb.h
@@ -23,8 +23,6 @@ struct mmu_gather {
 	unsigned long		start, end;
 };
 
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 static inline void init_tlb_gather(struct mmu_gather *tlb)
 {
 	tlb->start = TASK_SIZE;
@@ -36,17 +34,13 @@ static inline void init_tlb_gather(struc
 	}
 }
 
-static inline struct mmu_gather *
-tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
+static inline void
+tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
 {
-	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
-
 	tlb->mm = mm;
 	tlb->fullmm = full_mm_flush;
 
 	init_tlb_gather(tlb);
-
-	return tlb;
 }
 
 static inline void
@@ -57,8 +51,6 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
 
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
-
-	put_cpu_var(mmu_gathers);
 }
 
 static inline void



^ permalink raw reply	[flat|nested] 157+ messages in thread


* [PATCH 14/21] um: Preemptible mmu_gather
  2010-11-26 14:38 ` Peter Zijlstra
@ 2010-11-26 14:38   ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell,
	Jeff Dike

[-- Attachment #1: mm-preempt-tlb-gather-um.patch --]
[-- Type: text/plain, Size: 1576 bytes --]

Fix up the um mmu_gather code to conform to the new API.

Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/um/include/asm/tlb.h |   16 ++--------------
 1 file changed, 2 insertions(+), 14 deletions(-)

Index: linux-2.6/arch/um/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/um/include/asm/tlb.h
+++ linux-2.6/arch/um/include/asm/tlb.h
@@ -22,9 +22,6 @@ struct mmu_gather {
 	unsigned int		fullmm; /* non-zero means full mm flush */
 };
 
-/* Users of the generic TLB shootdown code must declare this storage space. */
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
 					  unsigned long address)
 {
@@ -47,20 +44,13 @@ static inline void init_tlb_gather(struc
 	}
 }
 
-/* tlb_gather_mmu
- *	Return a pointer to an initialized struct mmu_gather.
- */
-static inline struct mmu_gather *
-tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
+static inline void
+tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
 {
-	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
-
 	tlb->mm = mm;
 	tlb->fullmm = full_mm_flush;
 
 	init_tlb_gather(tlb);
-
-	return tlb;
 }
 
 extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
@@ -87,8 +77,6 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
 
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
-
-	put_cpu_var(mmu_gathers);
 }
 
 /* tlb_remove_page



^ permalink raw reply	[flat|nested] 157+ messages in thread


* [PATCH 15/21] ia64: Preemptible mmu_gather
  2010-11-26 14:38 ` Peter Zijlstra
@ 2010-11-26 14:38   ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell,
	Tony Luck

[-- Attachment #1: mm-preempt-tlb-gather-ia64.patch --]
[-- Type: text/plain, Size: 3081 bytes --]

Fix up the ia64 mmu_gather code to conform to the new API.
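
The batching policy is worth spelling out (condensed from the
tlb_remove_page() hunk below): the gather starts out on the embedded
local[8] array and only upgrades to a full page when the first page is
queued; if that GFP_ATOMIC allocation fails nothing breaks, the gather
simply flushes after every eight pages.

	if (!tlb->nr && tlb->pages == tlb->local)
		__tlb_alloc_pages(tlb);		/* may leave pages == local */

	tlb->pages[tlb->nr++] = page;
	if (tlb->nr >= tlb->max)		/* 8, or PAGE_SIZE/sizeof(void *) */
		ia64_tlb_flush_mmu(tlb, tlb->start_addr, tlb->end_addr);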

Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/ia64/include/asm/tlb.h |   39 ++++++++++++++++++++++++---------------
 1 file changed, 24 insertions(+), 15 deletions(-)

Index: linux-2.6/arch/ia64/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/ia64/include/asm/tlb.h
+++ linux-2.6/arch/ia64/include/asm/tlb.h
@@ -47,21 +47,21 @@
 #include <asm/machvec.h>
 
 #ifdef CONFIG_SMP
-# define FREE_PTE_NR		2048
 # define tlb_fast_mode(tlb)	((tlb)->nr == ~0U)
 #else
-# define FREE_PTE_NR		0
 # define tlb_fast_mode(tlb)	(1)
 #endif
 
 struct mmu_gather {
 	struct mm_struct	*mm;
 	unsigned int		nr;		/* == ~0U => fast mode */
+	unsigned int		max;
 	unsigned char		fullmm;		/* non-zero means full mm flush */
 	unsigned char		need_flush;	/* really unmapped some PTEs? */
 	unsigned long		start_addr;
 	unsigned long		end_addr;
-	struct page 		*pages[FREE_PTE_NR];
+	struct page		**pages;
+	struct page		*local[8];
 };
 
 struct ia64_tr_entry {
@@ -90,9 +90,6 @@ extern struct ia64_tr_entry *ia64_idtrs[
 #define RR_RID_MASK	0x00000000ffffff00L
 #define RR_TO_RID(val) 	((val >> 8) & 0xffffff)
 
-/* Users of the generic TLB shootdown code must declare this storage space. */
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 /*
  * Flush the TLB for address range START to END and, if not in fast mode, release the
  * freed pages that where gathered up to this point.
@@ -147,15 +144,23 @@ ia64_tlb_flush_mmu (struct mmu_gather *t
 	}
 }
 
-/*
- * Return a pointer to an initialized struct mmu_gather.
- */
-static inline struct mmu_gather *
-tlb_gather_mmu (struct mm_struct *mm, unsigned int full_mm_flush)
+static inline void __tlb_alloc_pages(struct mmu_gather *tlb)
 {
-	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
+	unsigned long addr = __get_free_pages(GFP_ATOMIC, 0);
+
+	if (addr) {
+		tlb->pages = (void *)addr;
+		tlb->max = PAGE_SIZE / sizeof(void *);
+	}
+}
 
+
+static inline void
+tlb_gather_mmu (struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
+{
 	tlb->mm = mm;
+	tlb->max = ARRAY_SIZE(tlb->local);
+	tlb->pages = tlb->local;
 	/*
 	 * Use fast mode if only 1 CPU is online.
 	 *
@@ -172,7 +177,6 @@ tlb_gather_mmu (struct mm_struct *mm, un
 	tlb->nr = (num_online_cpus() == 1) ? ~0U : 0;
 	tlb->fullmm = full_mm_flush;
 	tlb->start_addr = ~0UL;
-	return tlb;
 }
 
 /*
@@ -191,7 +195,8 @@ tlb_finish_mmu (struct mmu_gather *tlb, 
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
 
-	put_cpu_var(mmu_gathers);
+	if (tlb->pages != tlb->local)
+		free_pages((unsigned long)tlb->pages, 0);
 }
 
 /*
@@ -208,8 +213,12 @@ tlb_remove_page (struct mmu_gather *tlb,
 		free_page_and_swap_cache(page);
 		return;
 	}
+
+	if (!tlb->nr && tlb->pages == tlb->local)
+		__tlb_alloc_pages(tlb);
+
 	tlb->pages[tlb->nr++] = page;
-	if (tlb->nr >= FREE_PTE_NR)
+	if (tlb->nr >= tlb->max)
 		ia64_tlb_flush_mmu(tlb, tlb->start_addr, tlb->end_addr);
 }
 



^ permalink raw reply	[flat|nested] 157+ messages in thread


* [PATCH 16/21] mm, powerpc: Move the RCU page-table freeing into generic code
  2010-11-26 14:38 ` Peter Zijlstra
@ 2010-11-26 14:38   ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mm-preempt-tlb-gather-rcu.patch --]
[-- Type: text/plain, Size: 12582 bytes --]

In case other architectures require RCU-freed page tables to implement
gup_fast() and software-filled hashes and similar things, provide the
means to do so by moving the logic into generic code.
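
As a sketch of what an opting-in architecture provides, modelled on the
powerpc hunks below ("myarch" and myarch_free_table() are hypothetical;
the generic side supplies tlb_remove_table() and calls back into
__tlb_remove_table() only once it is safe to actually free the page):

/* arch/myarch/Kconfig (hypothetical): select HAVE_RCU_TABLE_FREE if SMP */

struct mmu_gather;
extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
extern void myarch_free_table(void *table);	/* hypothetical arch free */

/*
 * Called from the arch's __p??_free_tlb() hooks instead of freeing the
 * page-table page directly: the table is merely queued on the gather.
 */
static inline void myarch_pgtable_free_tlb(struct mmu_gather *tlb, void *table)
{
	tlb_remove_table(tlb, table);
}

/*
 * Called by the generic code after the TLB flush and, where IPIs cannot
 * be relied upon, after a sched-RCU grace period has passed.
 */
void __tlb_remove_table(void *table)
{
	myarch_free_table(table);
}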

Requested-by: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/Kconfig                       |    3 +
 arch/powerpc/Kconfig               |    1 
 arch/powerpc/include/asm/pgalloc.h |   21 ++++++-
 arch/powerpc/include/asm/tlb.h     |   10 ---
 arch/powerpc/mm/pgtable.c          |   98 -------------------------------------
 arch/powerpc/mm/tlb_hash32.c       |    3 -
 arch/powerpc/mm/tlb_hash64.c       |    3 -
 arch/powerpc/mm/tlb_nohash.c       |    3 -
 include/asm-generic/tlb.h          |   57 +++++++++++++++++++--
 mm/memory.c                        |   77 +++++++++++++++++++++++++++++
 10 files changed, 151 insertions(+), 125 deletions(-)

Index: linux-2.6/arch/powerpc/include/asm/pgalloc.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/pgalloc.h
+++ linux-2.6/arch/powerpc/include/asm/pgalloc.h
@@ -31,14 +31,29 @@ static inline void pte_free(struct mm_st
 #endif
 
 #ifdef CONFIG_SMP
-extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift);
-extern void pte_free_finish(struct mmu_gather *tlb);
+struct mmu_gather;
+extern void tlb_remove_table(struct mmu_gather *, void *);
+
+static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift)
+{
+	unsigned long pgf = (unsigned long)table;
+	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+	pgf |= shift;
+	tlb_remove_table(tlb, (void *)pgf);
+}
+
+static inline void __tlb_remove_table(void *_table)
+{
+	void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
+	unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
+
+	pgtable_free(table, shift);
+}
 #else /* CONFIG_SMP */
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
 	pgtable_free(table, shift);
 }
-static inline void pte_free_finish(struct mmu_gather *tlb) { }
 #endif /* !CONFIG_SMP */
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
Index: linux-2.6/arch/powerpc/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/tlb.h
+++ linux-2.6/arch/powerpc/include/asm/tlb.h
@@ -28,16 +28,6 @@
 #define tlb_start_vma(tlb, vma)	do { } while (0)
 #define tlb_end_vma(tlb, vma)	do { } while (0)
 
-#define HAVE_ARCH_MMU_GATHER 1
-
-struct pte_freelist_batch;
-
-struct arch_mmu_gather {
-	struct pte_freelist_batch *batch;
-};
-
-#define ARCH_MMU_GATHER_INIT (struct arch_mmu_gather){ .batch = NULL, }
-
 extern void tlb_flush(struct mmu_gather *tlb);
 
 /* Get the generic bits... */
Index: linux-2.6/arch/powerpc/mm/pgtable.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/pgtable.c
+++ linux-2.6/arch/powerpc/mm/pgtable.c
@@ -33,104 +33,6 @@
 
 #include "mmu_decl.h"
 
-#ifdef CONFIG_SMP
-
-/*
- * Handle batching of page table freeing on SMP. Page tables are
- * queued up and send to be freed later by RCU in order to avoid
- * freeing a page table page that is being walked without locks
- */
-
-static unsigned long pte_freelist_forced_free;
-
-struct pte_freelist_batch
-{
-	struct rcu_head	rcu;
-	unsigned int	index;
-	unsigned long	tables[0];
-};
-
-#define PTE_FREELIST_SIZE \
-	((PAGE_SIZE - sizeof(struct pte_freelist_batch)) \
-	  / sizeof(unsigned long))
-
-static void pte_free_smp_sync(void *arg)
-{
-	/* Do nothing, just ensure we sync with all CPUs */
-}
-
-/* This is only called when we are critically out of memory
- * (and fail to get a page in pte_free_tlb).
- */
-static void pgtable_free_now(void *table, unsigned shift)
-{
-	pte_freelist_forced_free++;
-
-	smp_call_function(pte_free_smp_sync, NULL, 1);
-
-	pgtable_free(table, shift);
-}
-
-static void pte_free_rcu_callback(struct rcu_head *head)
-{
-	struct pte_freelist_batch *batch =
-		container_of(head, struct pte_freelist_batch, rcu);
-	unsigned int i;
-
-	for (i = 0; i < batch->index; i++) {
-		void *table = (void *)(batch->tables[i] & ~MAX_PGTABLE_INDEX_SIZE);
-		unsigned shift = batch->tables[i] & MAX_PGTABLE_INDEX_SIZE;
-
-		pgtable_free(table, shift);
-	}
-
-	free_page((unsigned long)batch);
-}
-
-static void pte_free_submit(struct pte_freelist_batch *batch)
-{
-	call_rcu_sched(&batch->rcu, pte_free_rcu_callback);
-}
-
-void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
-{
-	struct pte_freelist_batch **batchp = &tlb->arch.batch;
-	unsigned long pgf;
-
-	if (atomic_read(&tlb->mm->mm_users) < 2) {
-		pgtable_free(table, shift);
-		return;
-	}
-
-	if (*batchp == NULL) {
-		*batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC);
-		if (*batchp == NULL) {
-			pgtable_free_now(table, shift);
-			return;
-		}
-		(*batchp)->index = 0;
-	}
-	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
-	pgf = (unsigned long)table | shift;
-	(*batchp)->tables[(*batchp)->index++] = pgf;
-	if ((*batchp)->index == PTE_FREELIST_SIZE) {
-		pte_free_submit(*batchp);
-		*batchp = NULL;
-	}
-}
-
-void pte_free_finish(struct mmu_gather *tlb)
-{
-	struct pte_freelist_batch **batchp = &tlb->arch.batch;
-
-	if (*batchp == NULL)
-		return;
-	pte_free_submit(*batchp);
-	*batchp = NULL;
-}
-
-#endif /* CONFIG_SMP */
-
 static inline int is_exec_fault(void)
 {
 	return current->thread.regs && TRAP(current->thread.regs) == 0x400;
Index: linux-2.6/arch/powerpc/mm/tlb_hash32.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_hash32.c
+++ linux-2.6/arch/powerpc/mm/tlb_hash32.c
@@ -71,9 +71,6 @@ void tlb_flush(struct mmu_gather *tlb)
 		 */
 		_tlbia();
 	}
-
-	/* Push out batch of freed page tables */
-	pte_free_finish(tlb);
 }
 
 /*
Index: linux-2.6/arch/powerpc/mm/tlb_hash64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_hash64.c
+++ linux-2.6/arch/powerpc/mm/tlb_hash64.c
@@ -165,9 +165,6 @@ void tlb_flush(struct mmu_gather *tlb)
 		__flush_tlb_pending(tlbbatch);
 
 	put_cpu_var(ppc64_tlb_batch);
-
-	/* Push out batch of freed page tables */
-	pte_free_finish(tlb);
 }
 
 /**
Index: linux-2.6/arch/powerpc/mm/tlb_nohash.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_nohash.c
+++ linux-2.6/arch/powerpc/mm/tlb_nohash.c
@@ -299,9 +299,6 @@ EXPORT_SYMBOL(flush_tlb_range);
 void tlb_flush(struct mmu_gather *tlb)
 {
 	flush_tlb_mm(tlb->mm);
-
-	/* Push out batch of freed page tables */
-	pte_free_finish(tlb);
 }
 
 /*
Index: linux-2.6/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.orig/include/asm-generic/tlb.h
+++ linux-2.6/include/asm-generic/tlb.h
@@ -27,6 +27,49 @@
   #define tlb_fast_mode(tlb) 1
 #endif
 
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+/*
+ * Semi RCU freeing of the page directories.
+ *
+ * This is needed by some architectures to implement software pagetable walkers.
+ *
+ * gup_fast() and other software pagetable walkers do a lockless page-table
+ * walk and therefore need some synchronization with the freeing of the page
+ * directories. The chosen means to accomplish that is by disabling IRQs over
+ * the walk.
+ *
+ * Architectures that use IPIs to flush TLBs will then automagically DTRT,
+ * since we unlink the page, flush TLBs, free the page. Since the disabling of
+ * IRQs delays the completion of the TLB flush, we can never observe an already
+ * freed page.
+ *
+ * Architectures that do not have this (PPC) need to delay the freeing by some
+ * other means; this is that means.
+ *
+ * What we do is batch the freed directory pages (tables) and RCU free them.
+ * We use the sched RCU variant, as that guarantees that IRQ/preempt disabling
+ * holds off grace periods.
+ *
+ * However, in order to batch these pages we need to allocate storage; this
+ * allocation is deep inside the MM code and can thus easily fail on memory
+ * pressure. To guarantee progress we fall back to single table freeing, see
+ * the implementation of tlb_remove_table_one().
+ *
+ */
+struct mmu_table_batch {
+	struct rcu_head		rcu;
+	unsigned int		nr;
+	void			*tables[0];
+};
+
+#define MAX_TABLE_BATCH		\
+	((PAGE_SIZE - sizeof(struct mmu_table_batch)) / sizeof(void *))
+
+extern void tlb_table_flush(struct mmu_gather *tlb);
+extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
+
+#endif
+
 /* struct mmu_gather is an opaque type used by the mm code for passing around
  * any data needed by arch specific code for tlb_remove_page.
  */
@@ -36,11 +79,12 @@ struct mmu_gather {
 	unsigned int		max;	/* nr < max */
 	unsigned int		need_flush;/* Really unmapped some ptes? */
 	unsigned int		fullmm; /* non-zero means full mm flush */
-#ifdef HAVE_ARCH_MMU_GATHER
-	struct arch_mmu_gather	arch;
-#endif
 	struct page		**pages;
 	struct page		*local[8];
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	struct mmu_table_batch	*batch;
+#endif
 };
 
 static inline void __tlb_alloc_pages(struct mmu_gather *tlb)
@@ -72,8 +116,8 @@ tlb_gather_mmu(struct mmu_gather *tlb, s
 
 	tlb->fullmm = full_mm_flush;
 
-#ifdef HAVE_ARCH_MMU_GATHER
-	tlb->arch = ARCH_MMU_GATHER_INIT;
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	tlb->batch = NULL;
 #endif
 }
 
@@ -84,6 +128,9 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
 		return;
 	tlb->need_flush = 0;
 	tlb_flush(tlb);
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	tlb_table_flush(tlb);
+#endif
 	if (!tlb_fast_mode(tlb)) {
 		free_pages_and_swap_cache(tlb->pages, tlb->nr);
 		tlb->nr = 0;
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -193,6 +193,83 @@ static void check_sync_rss_stat(struct t
 
 #endif
 
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+
+/*
+ * See the comment near struct mmu_table_batch.
+ */
+
+static void tlb_remove_table_smp_sync(void *arg)
+{
+	/* Simply deliver the interrupt */
+}
+
+static void tlb_remove_table_one(void *table)
+{
+	/*
+	 * This isn't an RCU grace period and hence the page-tables cannot be
+	 * assumed to be actually RCU-freed.
+	 *
+	 * It is however sufficient for software page-table walkers that rely on
+	 * IRQ disabling. See the comment near struct mmu_table_batch.
+	 */
+	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
+	__tlb_remove_table(table);
+}
+
+static void tlb_remove_table_rcu(struct rcu_head *head)
+{
+	struct mmu_table_batch *batch;
+	int i;
+
+	batch = container_of(head, struct mmu_table_batch, rcu);
+
+	for (i = 0; i < batch->nr; i++)
+		__tlb_remove_table(batch->tables[i]);
+
+	free_page((unsigned long)batch);
+}
+
+void tlb_table_flush(struct mmu_gather *tlb)
+{
+	struct mmu_table_batch **batch = &tlb->batch;
+
+	if (*batch) {
+		call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
+		*batch = NULL;
+	}
+}
+
+void tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+	struct mmu_table_batch **batch = &tlb->batch;
+
+	tlb->need_flush = 1;
+
+	/*
+	 * When there are fewer than two users of this mm, there cannot be a
+	 * concurrent page-table walk.
+	 */
+	if (atomic_read(&tlb->mm->mm_users) < 2) {
+		__tlb_remove_table(table);
+		return;
+	}
+
+	if (*batch == NULL) {
+		*batch = (struct mmu_table_batch *)__get_free_page(GFP_ATOMIC);
+		if (*batch == NULL) {
+			tlb_remove_table_one(table);
+			return;
+		}
+		(*batch)->nr = 0;
+	}
+	(*batch)->tables[(*batch)->nr++] = table;
+	if ((*batch)->nr == MAX_TABLE_BATCH)
+		tlb_table_flush(tlb);
+}
+
+#endif
+
 /*
  * If a p?d_bad entry is found while walking page tables, report
  * the error, before resetting entry to p?d_none.  Usually (but
Index: linux-2.6/arch/Kconfig
===================================================================
--- linux-2.6.orig/arch/Kconfig
+++ linux-2.6/arch/Kconfig
@@ -175,4 +175,7 @@ config HAVE_PERF_EVENTS_NMI
 config HAVE_ARCH_JUMP_LABEL
 	bool
 
+config HAVE_RCU_TABLE_FREE
+	bool
+
 source "kernel/gcov/Kconfig"
Index: linux-2.6/arch/powerpc/Kconfig
===================================================================
--- linux-2.6.orig/arch/powerpc/Kconfig
+++ linux-2.6/arch/powerpc/Kconfig
@@ -140,6 +140,7 @@ config PPC
 	select HAVE_PERF_EVENTS
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_HW_BREAKPOINT if PERF_EVENTS && PPC_BOOK3S_64
+	select HAVE_RCU_TABLE_FREE if PPC64
 
 config EARLY_PRINTK
 	bool
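
For illustration, a minimal sketch of the kind of IRQ-disabled walker the
comment added to asm-generic/tlb.h above is talking about. This is not part
of the patch; walk_one_pte() and its use of the generic four-level helpers
are hypothetical. The point is only that everything between local_irq_save()
and local_irq_restore() is safe against the table pages being freed, because
both the RCU-sched grace period and the fallback IPI in tlb_remove_table_one()
are held off while IRQs are disabled.

#include <linux/mm.h>
#include <linux/irqflags.h>
#include <asm/pgtable.h>

static int walk_one_pte(struct mm_struct *mm, unsigned long addr, pte_t *ptent)
{
	unsigned long flags;
	pgd_t *pgd;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;
	int ret = 0;

	local_irq_save(flags);		/* holds off the page-table free */

	pgd = pgd_offset(mm, addr);
	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
		goto out;
	pud = pud_offset(pgd, addr);
	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
		goto out;
	pmd = pmd_offset(pud, addr);
	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
		goto out;

	pte = pte_offset_map(pmd, addr);
	*ptent = *pte;			/* snapshot while the tables are pinned */
	pte_unmap(pte);
	ret = 1;
out:
	local_irq_restore(flags);
	return ret;
}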



^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 16/21] mm, powerpc: Move the RCU page-table freeing into generic code
@ 2010-11-26 14:38   ` Peter Zijlstra
  0 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mm-preempt-tlb-gather-rcu.patch --]
[-- Type: text/plain, Size: 12878 bytes --]

In case other architectures require RCU-freed page-tables to implement
gup_fast(), software-filled hashes, and similar things, provide the
means to do so by moving the logic into generic code.

Requested-by: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/Kconfig                       |    3 +
 arch/powerpc/Kconfig               |    1 
 arch/powerpc/include/asm/pgalloc.h |   21 ++++++-
 arch/powerpc/include/asm/tlb.h     |   10 ---
 arch/powerpc/mm/pgtable.c          |   98 -------------------------------------
 arch/powerpc/mm/tlb_hash32.c       |    3 -
 arch/powerpc/mm/tlb_hash64.c       |    3 -
 arch/powerpc/mm/tlb_nohash.c       |    3 -
 include/asm-generic/tlb.h          |   57 +++++++++++++++++++--
 mm/memory.c                        |   77 +++++++++++++++++++++++++++++
 10 files changed, 151 insertions(+), 125 deletions(-)

Index: linux-2.6/arch/powerpc/include/asm/pgalloc.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/pgalloc.h
+++ linux-2.6/arch/powerpc/include/asm/pgalloc.h
@@ -31,14 +31,29 @@ static inline void pte_free(struct mm_st
 #endif
 
 #ifdef CONFIG_SMP
-extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift);
-extern void pte_free_finish(struct mmu_gather *tlb);
+struct mmu_gather;
+extern void tlb_remove_table(struct mmu_gather *, void *);
+
+static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift)
+{
+	unsigned long pgf = (unsigned long)table;
+	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+	pgf |= shift;
+	tlb_remove_table(tlb, (void *)pgf);
+}
+
+static inline void __tlb_remove_table(void *_table)
+{
+	void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
+	unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
+
+	pgtable_free(table, shift);
+}
 #else /* CONFIG_SMP */
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
 	pgtable_free(table, shift);
 }
-static inline void pte_free_finish(struct mmu_gather *tlb) { }
 #endif /* !CONFIG_SMP */
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
Index: linux-2.6/arch/powerpc/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/tlb.h
+++ linux-2.6/arch/powerpc/include/asm/tlb.h
@@ -28,16 +28,6 @@
 #define tlb_start_vma(tlb, vma)	do { } while (0)
 #define tlb_end_vma(tlb, vma)	do { } while (0)
 
-#define HAVE_ARCH_MMU_GATHER 1
-
-struct pte_freelist_batch;
-
-struct arch_mmu_gather {
-	struct pte_freelist_batch *batch;
-};
-
-#define ARCH_MMU_GATHER_INIT (struct arch_mmu_gather){ .batch = NULL, }
-
 extern void tlb_flush(struct mmu_gather *tlb);
 
 /* Get the generic bits... */
Index: linux-2.6/arch/powerpc/mm/pgtable.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/pgtable.c
+++ linux-2.6/arch/powerpc/mm/pgtable.c
@@ -33,104 +33,6 @@
 
 #include "mmu_decl.h"
 
-#ifdef CONFIG_SMP
-
-/*
- * Handle batching of page table freeing on SMP. Page tables are
- * queued up and send to be freed later by RCU in order to avoid
- * freeing a page table page that is being walked without locks
- */
-
-static unsigned long pte_freelist_forced_free;
-
-struct pte_freelist_batch
-{
-	struct rcu_head	rcu;
-	unsigned int	index;
-	unsigned long	tables[0];
-};
-
-#define PTE_FREELIST_SIZE \
-	((PAGE_SIZE - sizeof(struct pte_freelist_batch)) \
-	  / sizeof(unsigned long))
-
-static void pte_free_smp_sync(void *arg)
-{
-	/* Do nothing, just ensure we sync with all CPUs */
-}
-
-/* This is only called when we are critically out of memory
- * (and fail to get a page in pte_free_tlb).
- */
-static void pgtable_free_now(void *table, unsigned shift)
-{
-	pte_freelist_forced_free++;
-
-	smp_call_function(pte_free_smp_sync, NULL, 1);
-
-	pgtable_free(table, shift);
-}
-
-static void pte_free_rcu_callback(struct rcu_head *head)
-{
-	struct pte_freelist_batch *batch =
-		container_of(head, struct pte_freelist_batch, rcu);
-	unsigned int i;
-
-	for (i = 0; i < batch->index; i++) {
-		void *table = (void *)(batch->tables[i] & ~MAX_PGTABLE_INDEX_SIZE);
-		unsigned shift = batch->tables[i] & MAX_PGTABLE_INDEX_SIZE;
-
-		pgtable_free(table, shift);
-	}
-
-	free_page((unsigned long)batch);
-}
-
-static void pte_free_submit(struct pte_freelist_batch *batch)
-{
-	call_rcu_sched(&batch->rcu, pte_free_rcu_callback);
-}
-
-void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
-{
-	struct pte_freelist_batch **batchp = &tlb->arch.batch;
-	unsigned long pgf;
-
-	if (atomic_read(&tlb->mm->mm_users) < 2) {
-		pgtable_free(table, shift);
-		return;
-	}
-
-	if (*batchp == NULL) {
-		*batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC);
-		if (*batchp == NULL) {
-			pgtable_free_now(table, shift);
-			return;
-		}
-		(*batchp)->index = 0;
-	}
-	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
-	pgf = (unsigned long)table | shift;
-	(*batchp)->tables[(*batchp)->index++] = pgf;
-	if ((*batchp)->index == PTE_FREELIST_SIZE) {
-		pte_free_submit(*batchp);
-		*batchp = NULL;
-	}
-}
-
-void pte_free_finish(struct mmu_gather *tlb)
-{
-	struct pte_freelist_batch **batchp = &tlb->arch.batch;
-
-	if (*batchp == NULL)
-		return;
-	pte_free_submit(*batchp);
-	*batchp = NULL;
-}
-
-#endif /* CONFIG_SMP */
-
 static inline int is_exec_fault(void)
 {
 	return current->thread.regs && TRAP(current->thread.regs) == 0x400;
Index: linux-2.6/arch/powerpc/mm/tlb_hash32.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_hash32.c
+++ linux-2.6/arch/powerpc/mm/tlb_hash32.c
@@ -71,9 +71,6 @@ void tlb_flush(struct mmu_gather *tlb)
 		 */
 		_tlbia();
 	}
-
-	/* Push out batch of freed page tables */
-	pte_free_finish(tlb);
 }
 
 /*
Index: linux-2.6/arch/powerpc/mm/tlb_hash64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_hash64.c
+++ linux-2.6/arch/powerpc/mm/tlb_hash64.c
@@ -165,9 +165,6 @@ void tlb_flush(struct mmu_gather *tlb)
 		__flush_tlb_pending(tlbbatch);
 
 	put_cpu_var(ppc64_tlb_batch);
-
-	/* Push out batch of freed page tables */
-	pte_free_finish(tlb);
 }
 
 /**
Index: linux-2.6/arch/powerpc/mm/tlb_nohash.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_nohash.c
+++ linux-2.6/arch/powerpc/mm/tlb_nohash.c
@@ -299,9 +299,6 @@ EXPORT_SYMBOL(flush_tlb_range);
 void tlb_flush(struct mmu_gather *tlb)
 {
 	flush_tlb_mm(tlb->mm);
-
-	/* Push out batch of freed page tables */
-	pte_free_finish(tlb);
 }
 
 /*
Index: linux-2.6/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.orig/include/asm-generic/tlb.h
+++ linux-2.6/include/asm-generic/tlb.h
@@ -27,6 +27,49 @@
   #define tlb_fast_mode(tlb) 1
 #endif
 
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+/*
+ * Semi RCU freeing of the page directories.
+ *
+ * This is needed by some architectures to implement software pagetable walkers.
+ *
+ * gup_fast() and other software pagetable walkers do a lockless page-table
+ * walk and therefore need some synchronization with the freeing of the page
+ * directories. The chosen means to accomplish that is by disabling IRQs over
+ * the walk.
+ *
+ * Architectures that use IPIs to flush TLBs will then automagically DTRT,
+ * since we unlink the page, flush TLBs, free the page. Since the disabling of
+ * IRQs delays the completion of the TLB flush, we can never observe an already
+ * freed page.
+ *
+ * Architectures that do not have this (PPC) need to delay the freeing by some
+ * other means; this is that means.
+ *
+ * What we do is batch the freed directory pages (tables) and RCU free them.
+ * We use the sched RCU variant, as that guarantees that IRQ/preempt disabling
+ * holds off grace periods.
+ *
+ * However, in order to batch these pages we need to allocate storage; this
+ * allocation is deep inside the MM code and can thus easily fail on memory
+ * pressure. To guarantee progress we fall back to single table freeing, see
+ * the implementation of tlb_remove_table_one().
+ *
+ */
+struct mmu_table_batch {
+	struct rcu_head		rcu;
+	unsigned int		nr;
+	void			*tables[0];
+};
+
+#define MAX_TABLE_BATCH		\
+	((PAGE_SIZE - sizeof(struct mmu_table_batch)) / sizeof(void *))
+
+extern void tlb_table_flush(struct mmu_gather *tlb);
+extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
+
+#endif
+
 /* struct mmu_gather is an opaque type used by the mm code for passing around
  * any data needed by arch specific code for tlb_remove_page.
  */
@@ -36,11 +79,12 @@ struct mmu_gather {
 	unsigned int		max;	/* nr < max */
 	unsigned int		need_flush;/* Really unmapped some ptes? */
 	unsigned int		fullmm; /* non-zero means full mm flush */
-#ifdef HAVE_ARCH_MMU_GATHER
-	struct arch_mmu_gather	arch;
-#endif
 	struct page		**pages;
 	struct page		*local[8];
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	struct mmu_table_batch	*batch;
+#endif
 };
 
 static inline void __tlb_alloc_pages(struct mmu_gather *tlb)
@@ -72,8 +116,8 @@ tlb_gather_mmu(struct mmu_gather *tlb, s
 
 	tlb->fullmm = full_mm_flush;
 
-#ifdef HAVE_ARCH_MMU_GATHER
-	tlb->arch = ARCH_MMU_GATHER_INIT;
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	tlb->batch = NULL;
 #endif
 }
 
@@ -84,6 +128,9 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
 		return;
 	tlb->need_flush = 0;
 	tlb_flush(tlb);
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	tlb_table_flush(tlb);
+#endif
 	if (!tlb_fast_mode(tlb)) {
 		free_pages_and_swap_cache(tlb->pages, tlb->nr);
 		tlb->nr = 0;
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -193,6 +193,83 @@ static void check_sync_rss_stat(struct t
 
 #endif
 
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+
+/*
+ * See the comment near struct mmu_table_batch.
+ */
+
+static void tlb_remove_table_smp_sync(void *arg)
+{
+	/* Simply deliver the interrupt */
+}
+
+static void tlb_remove_table_one(void *table)
+{
+	/*
+	 * This isn't an RCU grace period and hence the page-tables cannot be
+	 * assumed to be actually RCU-freed.
+	 *
+	 * It is however sufficient for software page-table walkers that rely on
+	 * IRQ disabling. See the comment near struct mmu_table_batch.
+	 */
+	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
+	__tlb_remove_table(table);
+}
+
+static void tlb_remove_table_rcu(struct rcu_head *head)
+{
+	struct mmu_table_batch *batch;
+	int i;
+
+	batch = container_of(head, struct mmu_table_batch, rcu);
+
+	for (i = 0; i < batch->nr; i++)
+		__tlb_remove_table(batch->tables[i]);
+
+	free_page((unsigned long)batch);
+}
+
+void tlb_table_flush(struct mmu_gather *tlb)
+{
+	struct mmu_table_batch **batch = &tlb->batch;
+
+	if (*batch) {
+		call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
+		*batch = NULL;
+	}
+}
+
+void tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+	struct mmu_table_batch **batch = &tlb->batch;
+
+	tlb->need_flush = 1;
+
+	/*
+	 * When there are fewer than two users of this mm, there cannot be a
+	 * concurrent page-table walk.
+	 */
+	if (atomic_read(&tlb->mm->mm_users) < 2) {
+		__tlb_remove_table(table);
+		return;
+	}
+
+	if (*batch == NULL) {
+		*batch = (struct mmu_table_batch *)__get_free_page(GFP_ATOMIC);
+		if (*batch == NULL) {
+			tlb_remove_table_one(table);
+			return;
+		}
+		(*batch)->nr = 0;
+	}
+	(*batch)->tables[(*batch)->nr++] = table;
+	if ((*batch)->nr == MAX_TABLE_BATCH)
+		tlb_table_flush(tlb);
+}
+
+#endif
+
 /*
  * If a p?d_bad entry is found while walking page tables, report
  * the error, before resetting entry to p?d_none.  Usually (but
Index: linux-2.6/arch/Kconfig
===================================================================
--- linux-2.6.orig/arch/Kconfig
+++ linux-2.6/arch/Kconfig
@@ -175,4 +175,7 @@ config HAVE_PERF_EVENTS_NMI
 config HAVE_ARCH_JUMP_LABEL
 	bool
 
+config HAVE_RCU_TABLE_FREE
+	bool
+
 source "kernel/gcov/Kconfig"
Index: linux-2.6/arch/powerpc/Kconfig
===================================================================
--- linux-2.6.orig/arch/powerpc/Kconfig
+++ linux-2.6/arch/powerpc/Kconfig
@@ -140,6 +140,7 @@ config PPC
 	select HAVE_PERF_EVENTS
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_HW_BREAKPOINT if PERF_EVENTS && PPC_BOOK3S_64
+	select HAVE_RCU_TABLE_FREE if PPC64
 
 config EARLY_PRINTK
 	bool
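
For reference, this is all another architecture (e.g. the Sparc64 gup_fast
work mentioned in the cover letter) has to supply to opt in: select
HAVE_RCU_TABLE_FREE, hand its page-table pages to tlb_remove_table(), and
provide __tlb_remove_table() for the generic RCU callback to invoke. The
glue below is a hypothetical sketch, not code from this series; only the two
function names are the actual contract, and the simple free_page() body
assumes the table cookie is just the table's kernel virtual address.

/* Hypothetical arch glue, illustrative only. */
static inline void __tlb_remove_table(void *table)
{
	/*
	 * Called from tlb_remove_table_rcu() (or the IPI fallback in
	 * tlb_remove_table_one()) once no lockless walker can still see
	 * the table; this is where the real free happens.
	 */
	free_page((unsigned long)table);
}

static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
				  unsigned long address)
{
	pgtable_page_dtor(ptepage);
	/* Queue the table instead of freeing it immediately. */
	tlb_remove_table(tlb, page_address(ptepage));
}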



^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 16/21] mm, powerpc: Move the RCU page-table freeing into generic code
@ 2010-11-26 14:38   ` Peter Zijlstra
  0 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mm-preempt-tlb-gather-rcu.patch --]
[-- Type: text/plain, Size: 12878 bytes --]

In case other architectures require RCU-freed page-tables to implement
gup_fast(), software-filled hashes, and similar things, provide the
means to do so by moving the logic into generic code.

Requested-by: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/Kconfig                       |    3 +
 arch/powerpc/Kconfig               |    1 
 arch/powerpc/include/asm/pgalloc.h |   21 ++++++-
 arch/powerpc/include/asm/tlb.h     |   10 ---
 arch/powerpc/mm/pgtable.c          |   98 -------------------------------------
 arch/powerpc/mm/tlb_hash32.c       |    3 -
 arch/powerpc/mm/tlb_hash64.c       |    3 -
 arch/powerpc/mm/tlb_nohash.c       |    3 -
 include/asm-generic/tlb.h          |   57 +++++++++++++++++++--
 mm/memory.c                        |   77 +++++++++++++++++++++++++++++
 10 files changed, 151 insertions(+), 125 deletions(-)

Index: linux-2.6/arch/powerpc/include/asm/pgalloc.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/pgalloc.h
+++ linux-2.6/arch/powerpc/include/asm/pgalloc.h
@@ -31,14 +31,29 @@ static inline void pte_free(struct mm_st
 #endif
 
 #ifdef CONFIG_SMP
-extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift);
-extern void pte_free_finish(struct mmu_gather *tlb);
+struct mmu_gather;
+extern void tlb_remove_table(struct mmu_gather *, void *);
+
+static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift)
+{
+	unsigned long pgf = (unsigned long)table;
+	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+	pgf |= shift;
+	tlb_remove_table(tlb, (void *)pgf);
+}
+
+static inline void __tlb_remove_table(void *_table)
+{
+	void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
+	unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
+
+	pgtable_free(table, shift);
+}
 #else /* CONFIG_SMP */
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
 	pgtable_free(table, shift);
 }
-static inline void pte_free_finish(struct mmu_gather *tlb) { }
 #endif /* !CONFIG_SMP */
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
Index: linux-2.6/arch/powerpc/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/tlb.h
+++ linux-2.6/arch/powerpc/include/asm/tlb.h
@@ -28,16 +28,6 @@
 #define tlb_start_vma(tlb, vma)	do { } while (0)
 #define tlb_end_vma(tlb, vma)	do { } while (0)
 
-#define HAVE_ARCH_MMU_GATHER 1
-
-struct pte_freelist_batch;
-
-struct arch_mmu_gather {
-	struct pte_freelist_batch *batch;
-};
-
-#define ARCH_MMU_GATHER_INIT (struct arch_mmu_gather){ .batch = NULL, }
-
 extern void tlb_flush(struct mmu_gather *tlb);
 
 /* Get the generic bits... */
Index: linux-2.6/arch/powerpc/mm/pgtable.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/pgtable.c
+++ linux-2.6/arch/powerpc/mm/pgtable.c
@@ -33,104 +33,6 @@
 
 #include "mmu_decl.h"
 
-#ifdef CONFIG_SMP
-
-/*
- * Handle batching of page table freeing on SMP. Page tables are
- * queued up and send to be freed later by RCU in order to avoid
- * freeing a page table page that is being walked without locks
- */
-
-static unsigned long pte_freelist_forced_free;
-
-struct pte_freelist_batch
-{
-	struct rcu_head	rcu;
-	unsigned int	index;
-	unsigned long	tables[0];
-};
-
-#define PTE_FREELIST_SIZE \
-	((PAGE_SIZE - sizeof(struct pte_freelist_batch)) \
-	  / sizeof(unsigned long))
-
-static void pte_free_smp_sync(void *arg)
-{
-	/* Do nothing, just ensure we sync with all CPUs */
-}
-
-/* This is only called when we are critically out of memory
- * (and fail to get a page in pte_free_tlb).
- */
-static void pgtable_free_now(void *table, unsigned shift)
-{
-	pte_freelist_forced_free++;
-
-	smp_call_function(pte_free_smp_sync, NULL, 1);
-
-	pgtable_free(table, shift);
-}
-
-static void pte_free_rcu_callback(struct rcu_head *head)
-{
-	struct pte_freelist_batch *batch =
-		container_of(head, struct pte_freelist_batch, rcu);
-	unsigned int i;
-
-	for (i = 0; i < batch->index; i++) {
-		void *table = (void *)(batch->tables[i] & ~MAX_PGTABLE_INDEX_SIZE);
-		unsigned shift = batch->tables[i] & MAX_PGTABLE_INDEX_SIZE;
-
-		pgtable_free(table, shift);
-	}
-
-	free_page((unsigned long)batch);
-}
-
-static void pte_free_submit(struct pte_freelist_batch *batch)
-{
-	call_rcu_sched(&batch->rcu, pte_free_rcu_callback);
-}
-
-void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
-{
-	struct pte_freelist_batch **batchp = &tlb->arch.batch;
-	unsigned long pgf;
-
-	if (atomic_read(&tlb->mm->mm_users) < 2) {
-		pgtable_free(table, shift);
-		return;
-	}
-
-	if (*batchp == NULL) {
-		*batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC);
-		if (*batchp == NULL) {
-			pgtable_free_now(table, shift);
-			return;
-		}
-		(*batchp)->index = 0;
-	}
-	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
-	pgf = (unsigned long)table | shift;
-	(*batchp)->tables[(*batchp)->index++] = pgf;
-	if ((*batchp)->index == PTE_FREELIST_SIZE) {
-		pte_free_submit(*batchp);
-		*batchp = NULL;
-	}
-}
-
-void pte_free_finish(struct mmu_gather *tlb)
-{
-	struct pte_freelist_batch **batchp = &tlb->arch.batch;
-
-	if (*batchp == NULL)
-		return;
-	pte_free_submit(*batchp);
-	*batchp = NULL;
-}
-
-#endif /* CONFIG_SMP */
-
 static inline int is_exec_fault(void)
 {
 	return current->thread.regs && TRAP(current->thread.regs) == 0x400;
Index: linux-2.6/arch/powerpc/mm/tlb_hash32.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_hash32.c
+++ linux-2.6/arch/powerpc/mm/tlb_hash32.c
@@ -71,9 +71,6 @@ void tlb_flush(struct mmu_gather *tlb)
 		 */
 		_tlbia();
 	}
-
-	/* Push out batch of freed page tables */
-	pte_free_finish(tlb);
 }
 
 /*
Index: linux-2.6/arch/powerpc/mm/tlb_hash64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_hash64.c
+++ linux-2.6/arch/powerpc/mm/tlb_hash64.c
@@ -165,9 +165,6 @@ void tlb_flush(struct mmu_gather *tlb)
 		__flush_tlb_pending(tlbbatch);
 
 	put_cpu_var(ppc64_tlb_batch);
-
-	/* Push out batch of freed page tables */
-	pte_free_finish(tlb);
 }
 
 /**
Index: linux-2.6/arch/powerpc/mm/tlb_nohash.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_nohash.c
+++ linux-2.6/arch/powerpc/mm/tlb_nohash.c
@@ -299,9 +299,6 @@ EXPORT_SYMBOL(flush_tlb_range);
 void tlb_flush(struct mmu_gather *tlb)
 {
 	flush_tlb_mm(tlb->mm);
-
-	/* Push out batch of freed page tables */
-	pte_free_finish(tlb);
 }
 
 /*
Index: linux-2.6/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.orig/include/asm-generic/tlb.h
+++ linux-2.6/include/asm-generic/tlb.h
@@ -27,6 +27,49 @@
   #define tlb_fast_mode(tlb) 1
 #endif
 
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+/*
+ * Semi RCU freeing of the page directories.
+ *
+ * This is needed by some architectures to implement software pagetable walkers.
+ *
+ * gup_fast() and other software pagetable walkers do a lockless page-table
+ * walk and therefore need some synchronization with the freeing of the page
+ * directories. The chosen means to accomplish that is by disabling IRQs over
+ * the walk.
+ *
+ * Architectures that use IPIs to flush TLBs will then automagically DTRT,
+ * since we unlink the page, flush TLBs, free the page. Since the disabling of
+ * IRQs delays the completion of the TLB flush, we can never observe an already
+ * freed page.
+ *
+ * Architectures that do not have this (PPC) need to delay the freeing by some
+ * other means; this is that means.
+ *
+ * What we do is batch the freed directory pages (tables) and RCU free them.
+ * We use the sched RCU variant, as that guarantees that IRQ/preempt disabling
+ * holds off grace periods.
+ *
+ * However, in order to batch these pages we need to allocate storage; this
+ * allocation is deep inside the MM code and can thus easily fail on memory
+ * pressure. To guarantee progress we fall back to single table freeing, see
+ * the implementation of tlb_remove_table_one().
+ *
+ */
+struct mmu_table_batch {
+	struct rcu_head		rcu;
+	unsigned int		nr;
+	void			*tables[0];
+};
+
+#define MAX_TABLE_BATCH		\
+	((PAGE_SIZE - sizeof(struct mmu_table_batch)) / sizeof(void *))
+
+extern void tlb_table_flush(struct mmu_gather *tlb);
+extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
+
+#endif
+
 /* struct mmu_gather is an opaque type used by the mm code for passing around
  * any data needed by arch specific code for tlb_remove_page.
  */
@@ -36,11 +79,12 @@ struct mmu_gather {
 	unsigned int		max;	/* nr < max */
 	unsigned int		need_flush;/* Really unmapped some ptes? */
 	unsigned int		fullmm; /* non-zero means full mm flush */
-#ifdef HAVE_ARCH_MMU_GATHER
-	struct arch_mmu_gather	arch;
-#endif
 	struct page		**pages;
 	struct page		*local[8];
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	struct mmu_table_batch	*batch;
+#endif
 };
 
 static inline void __tlb_alloc_pages(struct mmu_gather *tlb)
@@ -72,8 +116,8 @@ tlb_gather_mmu(struct mmu_gather *tlb, s
 
 	tlb->fullmm = full_mm_flush;
 
-#ifdef HAVE_ARCH_MMU_GATHER
-	tlb->arch = ARCH_MMU_GATHER_INIT;
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	tlb->batch = NULL;
 #endif
 }
 
@@ -84,6 +128,9 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
 		return;
 	tlb->need_flush = 0;
 	tlb_flush(tlb);
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	tlb_table_flush(tlb);
+#endif
 	if (!tlb_fast_mode(tlb)) {
 		free_pages_and_swap_cache(tlb->pages, tlb->nr);
 		tlb->nr = 0;
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -193,6 +193,83 @@ static void check_sync_rss_stat(struct t
 
 #endif
 
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+
+/*
+ * See the comment near struct mmu_table_batch.
+ */
+
+static void tlb_remove_table_smp_sync(void *arg)
+{
+	/* Simply deliver the interrupt */
+}
+
+static void tlb_remove_table_one(void *table)
+{
+	/*
+	 * This isn't an RCU grace period and hence the page-tables cannot be
+	 * assumed to be actually RCU-freed.
+	 *
+	 * It is however sufficient for software page-table walkers that rely on
+	 * IRQ disabling. See the comment near struct mmu_table_batch.
+	 */
+	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
+	__tlb_remove_table(table);
+}
+
+static void tlb_remove_table_rcu(struct rcu_head *head)
+{
+	struct mmu_table_batch *batch;
+	int i;
+
+	batch = container_of(head, struct mmu_table_batch, rcu);
+
+	for (i = 0; i < batch->nr; i++)
+		__tlb_remove_table(batch->tables[i]);
+
+	free_page((unsigned long)batch);
+}
+
+void tlb_table_flush(struct mmu_gather *tlb)
+{
+	struct mmu_table_batch **batch = &tlb->batch;
+
+	if (*batch) {
+		call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
+		*batch = NULL;
+	}
+}
+
+void tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+	struct mmu_table_batch **batch = &tlb->batch;
+
+	tlb->need_flush = 1;
+
+	/*
+	 * When there are fewer than two users of this mm, there cannot be a
+	 * concurrent page-table walk.
+	 */
+	if (atomic_read(&tlb->mm->mm_users) < 2) {
+		__tlb_remove_table(table);
+		return;
+	}
+
+	if (*batch == NULL) {
+		*batch = (struct mmu_table_batch *)__get_free_page(GFP_ATOMIC);
+		if (*batch == NULL) {
+			tlb_remove_table_one(table);
+			return;
+		}
+		(*batch)->nr = 0;
+	}
+	(*batch)->tables[(*batch)->nr++] = table;
+	if ((*batch)->nr == MAX_TABLE_BATCH)
+		tlb_table_flush(tlb);
+}
+
+#endif
+
 /*
  * If a p?d_bad entry is found while walking page tables, report
  * the error, before resetting entry to p?d_none.  Usually (but
Index: linux-2.6/arch/Kconfig
===================================================================
--- linux-2.6.orig/arch/Kconfig
+++ linux-2.6/arch/Kconfig
@@ -175,4 +175,7 @@ config HAVE_PERF_EVENTS_NMI
 config HAVE_ARCH_JUMP_LABEL
 	bool
 
+config HAVE_RCU_TABLE_FREE
+	bool
+
 source "kernel/gcov/Kconfig"
Index: linux-2.6/arch/powerpc/Kconfig
===================================================================
--- linux-2.6.orig/arch/powerpc/Kconfig
+++ linux-2.6/arch/powerpc/Kconfig
@@ -140,6 +140,7 @@ config PPC
 	select HAVE_PERF_EVENTS
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_HW_BREAKPOINT if PERF_EVENTS && PPC_BOOK3S_64
+	select HAVE_RCU_TABLE_FREE if PPC64
 
 config EARLY_PRINTK
 	bool



^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 17/21] lockdep, mutex: Provide mutex_lock_nest_lock
  2010-11-26 14:38 ` Peter Zijlstra
  (?)
@ 2010-11-26 14:39   ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:39 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mutex_lock_nest_lock.patch --]
[-- Type: text/plain, Size: 5442 bytes --]

Provide the mutex_lock_nest_lock() annotation.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/lockdep.h |    3 +++
 include/linux/mutex.h   |    9 +++++++++
 kernel/mutex.c          |   25 +++++++++++++++++--------
 3 files changed, 29 insertions(+), 8 deletions(-)

Index: linux-2.6/include/linux/lockdep.h
===================================================================
--- linux-2.6.orig/include/linux/lockdep.h
+++ linux-2.6/include/linux/lockdep.h
@@ -492,12 +492,15 @@ static inline void print_irqtrace_events
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 # ifdef CONFIG_PROVE_LOCKING
 #  define mutex_acquire(l, s, t, i)		lock_acquire(l, s, t, 0, 2, NULL, i)
+#  define mutex_acquire_nest(l, s, t, n, i)	lock_acquire(l, s, t, 0, 2, n, i)
 # else
 #  define mutex_acquire(l, s, t, i)		lock_acquire(l, s, t, 0, 1, NULL, i)
+#  define mutex_acquire_nest(l, s, t, n, i)	lock_acquire(l, s, t, 0, 1, n, i)
 # endif
 # define mutex_release(l, n, i)			lock_release(l, n, i)
 #else
 # define mutex_acquire(l, s, t, i)		do { } while (0)
+# define mutex_acquire_nest(l, s, t, n, i)	do { } while (0)
 # define mutex_release(l, n, i)			do { } while (0)
 #endif
 
Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -124,6 +124,7 @@ static inline int mutex_is_locked(struct
  */
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 extern void mutex_lock_nested(struct mutex *lock, unsigned int subclass);
+extern void _mutex_lock_nest_lock(struct mutex *lock, struct lockdep_map *nest_lock);
 extern int __must_check mutex_lock_interruptible_nested(struct mutex *lock,
 					unsigned int subclass);
 extern int __must_check mutex_lock_killable_nested(struct mutex *lock,
@@ -132,6 +133,13 @@ extern int __must_check mutex_lock_killa
 #define mutex_lock(lock) mutex_lock_nested(lock, 0)
 #define mutex_lock_interruptible(lock) mutex_lock_interruptible_nested(lock, 0)
 #define mutex_lock_killable(lock) mutex_lock_killable_nested(lock, 0)
+
+#define mutex_lock_nest_lock(lock, nest_lock)				\
+do {									\
+	typecheck(struct lockdep_map *, &(nest_lock)->dep_map);		\
+	_mutex_lock_nest_lock(lock, &(nest_lock)->dep_map);		\
+} while (0)
+
 #else
 extern void mutex_lock(struct mutex *lock);
 extern int __must_check mutex_lock_interruptible(struct mutex *lock);
@@ -140,6 +148,7 @@ extern int __must_check mutex_lock_killa
 # define mutex_lock_nested(lock, subclass) mutex_lock(lock)
 # define mutex_lock_interruptible_nested(lock, subclass) mutex_lock_interruptible(lock)
 # define mutex_lock_killable_nested(lock, subclass) mutex_lock_killable(lock)
+# define mutex_lock_nest_lock(lock, nest_lock) mutex_lock(lock)
 #endif
 
 /*
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -140,14 +140,14 @@ EXPORT_SYMBOL(mutex_unlock);
  */
 static inline int __sched
 __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
-	       	unsigned long ip)
+		    struct lockdep_map *nest_lock, unsigned long ip)
 {
 	struct task_struct *task = current;
 	struct mutex_waiter waiter;
 	unsigned long flags;
 
 	preempt_disable();
-	mutex_acquire(&lock->dep_map, subclass, 0, ip);
+	mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
 
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
 	/*
@@ -285,16 +285,25 @@ void __sched
 mutex_lock_nested(struct mutex *lock, unsigned int subclass)
 {
 	might_sleep();
-	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, subclass, _RET_IP_);
+	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, subclass, NULL, _RET_IP_);
 }
 
 EXPORT_SYMBOL_GPL(mutex_lock_nested);
 
+void __sched
+_mutex_lock_nest_lock(struct mutex *lock, struct lockdep_map *nest)
+{
+	might_sleep();
+	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0, nest, _RET_IP_);
+}
+
+EXPORT_SYMBOL_GPL(_mutex_lock_nest_lock);
+
 int __sched
 mutex_lock_killable_nested(struct mutex *lock, unsigned int subclass)
 {
 	might_sleep();
-	return __mutex_lock_common(lock, TASK_KILLABLE, subclass, _RET_IP_);
+	return __mutex_lock_common(lock, TASK_KILLABLE, subclass, NULL, _RET_IP_);
 }
 EXPORT_SYMBOL_GPL(mutex_lock_killable_nested);
 
@@ -303,7 +312,7 @@ mutex_lock_interruptible_nested(struct m
 {
 	might_sleep();
 	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE,
-				   subclass, _RET_IP_);
+				   subclass, NULL, _RET_IP_);
 }
 
 EXPORT_SYMBOL_GPL(mutex_lock_interruptible_nested);
@@ -409,7 +418,7 @@ __mutex_lock_slowpath(atomic_t *lock_cou
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 
-	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0, _RET_IP_);
+	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0, NULL, _RET_IP_);
 }
 
 static noinline int __sched
@@ -417,7 +426,7 @@ __mutex_lock_killable_slowpath(atomic_t 
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 
-	return __mutex_lock_common(lock, TASK_KILLABLE, 0, _RET_IP_);
+	return __mutex_lock_common(lock, TASK_KILLABLE, 0, NULL, _RET_IP_);
 }
 
 static noinline int __sched
@@ -425,7 +434,7 @@ __mutex_lock_interruptible_slowpath(atom
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 
-	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, 0, _RET_IP_);
+	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, 0, NULL, _RET_IP_);
 }
 #endif
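
The kind of user this is aimed at is code like mm_take_all_locks(), which
takes a large number of locks of the same class while already holding one
outer lock; with the annotation, lockdep validates the outer lock once
instead of tracking every inner acquisition separately. A usage sketch with
made-up data structures follows; only mutex_lock_nest_lock() itself is from
this patch.

#include <linux/list.h>
#include <linux/mutex.h>

struct object {
	struct list_head	node;
	struct mutex		lock;
};

/*
 * Take every object's mutex while holding "outer".  With the nest_lock
 * annotation lockdep treats the inner acquisitions as nested under
 * "outer" instead of as an ever-growing pile of held locks of one class.
 * The matching unlock loop is omitted.
 */
static void lock_all_objects(struct mutex *outer, struct list_head *objects)
{
	struct object *obj;

	mutex_lock(outer);
	list_for_each_entry(obj, objects, node)
		mutex_lock_nest_lock(&obj->lock, outer);
}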
 



^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 17/21] lockdep, mutex: Provide mutex_lock_nest_lock
@ 2010-11-26 14:39   ` Peter Zijlstra
  0 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:39 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mutex_lock_nest_lock.patch --]
[-- Type: text/plain, Size: 5738 bytes --]

Provide the mutex_lock_nest_lock() annotation.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/lockdep.h |    3 +++
 include/linux/mutex.h   |    9 +++++++++
 kernel/mutex.c          |   25 +++++++++++++++++--------
 3 files changed, 29 insertions(+), 8 deletions(-)

Index: linux-2.6/include/linux/lockdep.h
===================================================================
--- linux-2.6.orig/include/linux/lockdep.h
+++ linux-2.6/include/linux/lockdep.h
@@ -492,12 +492,15 @@ static inline void print_irqtrace_events
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 # ifdef CONFIG_PROVE_LOCKING
 #  define mutex_acquire(l, s, t, i)		lock_acquire(l, s, t, 0, 2, NULL, i)
+#  define mutex_acquire_nest(l, s, t, n, i)	lock_acquire(l, s, t, 0, 2, n, i)
 # else
 #  define mutex_acquire(l, s, t, i)		lock_acquire(l, s, t, 0, 1, NULL, i)
+#  define mutex_acquire_nest(l, s, t, n, i)	lock_acquire(l, s, t, 0, 1, n, i)
 # endif
 # define mutex_release(l, n, i)			lock_release(l, n, i)
 #else
 # define mutex_acquire(l, s, t, i)		do { } while (0)
+# define mutex_acquire_nest(l, s, t, n, i)	do { } while (0)
 # define mutex_release(l, n, i)			do { } while (0)
 #endif
 
Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -124,6 +124,7 @@ static inline int mutex_is_locked(struct
  */
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 extern void mutex_lock_nested(struct mutex *lock, unsigned int subclass);
+extern void _mutex_lock_nest_lock(struct mutex *lock, struct lockdep_map *nest_lock);
 extern int __must_check mutex_lock_interruptible_nested(struct mutex *lock,
 					unsigned int subclass);
 extern int __must_check mutex_lock_killable_nested(struct mutex *lock,
@@ -132,6 +133,13 @@ extern int __must_check mutex_lock_killa
 #define mutex_lock(lock) mutex_lock_nested(lock, 0)
 #define mutex_lock_interruptible(lock) mutex_lock_interruptible_nested(lock, 0)
 #define mutex_lock_killable(lock) mutex_lock_killable_nested(lock, 0)
+
+#define mutex_lock_nest_lock(lock, nest_lock)				\
+do {									\
+	typecheck(struct lockdep_map *, &(nest_lock)->dep_map);		\
+	_mutex_lock_nest_lock(lock, &(nest_lock)->dep_map);		\
+} while (0)
+
 #else
 extern void mutex_lock(struct mutex *lock);
 extern int __must_check mutex_lock_interruptible(struct mutex *lock);
@@ -140,6 +148,7 @@ extern int __must_check mutex_lock_killa
 # define mutex_lock_nested(lock, subclass) mutex_lock(lock)
 # define mutex_lock_interruptible_nested(lock, subclass) mutex_lock_interruptible(lock)
 # define mutex_lock_killable_nested(lock, subclass) mutex_lock_killable(lock)
+# define mutex_lock_nest_lock(lock, nest_lock) mutex_lock(lock)
 #endif
 
 /*
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -140,14 +140,14 @@ EXPORT_SYMBOL(mutex_unlock);
  */
 static inline int __sched
 __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
-	       	unsigned long ip)
+		    struct lockdep_map *nest_lock, unsigned long ip)
 {
 	struct task_struct *task = current;
 	struct mutex_waiter waiter;
 	unsigned long flags;
 
 	preempt_disable();
-	mutex_acquire(&lock->dep_map, subclass, 0, ip);
+	mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
 
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
 	/*
@@ -285,16 +285,25 @@ void __sched
 mutex_lock_nested(struct mutex *lock, unsigned int subclass)
 {
 	might_sleep();
-	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, subclass, _RET_IP_);
+	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, subclass, NULL, _RET_IP_);
 }
 
 EXPORT_SYMBOL_GPL(mutex_lock_nested);
 
+void __sched
+_mutex_lock_nest_lock(struct mutex *lock, struct lockdep_map *nest)
+{
+	might_sleep();
+	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0, nest, _RET_IP_);
+}
+
+EXPORT_SYMBOL_GPL(_mutex_lock_nest_lock);
+
 int __sched
 mutex_lock_killable_nested(struct mutex *lock, unsigned int subclass)
 {
 	might_sleep();
-	return __mutex_lock_common(lock, TASK_KILLABLE, subclass, _RET_IP_);
+	return __mutex_lock_common(lock, TASK_KILLABLE, subclass, NULL, _RET_IP_);
 }
 EXPORT_SYMBOL_GPL(mutex_lock_killable_nested);
 
@@ -303,7 +312,7 @@ mutex_lock_interruptible_nested(struct m
 {
 	might_sleep();
 	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE,
-				   subclass, _RET_IP_);
+				   subclass, NULL, _RET_IP_);
 }
 
 EXPORT_SYMBOL_GPL(mutex_lock_interruptible_nested);
@@ -409,7 +418,7 @@ __mutex_lock_slowpath(atomic_t *lock_cou
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 
-	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0, _RET_IP_);
+	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0, NULL, _RET_IP_);
 }
 
 static noinline int __sched
@@ -417,7 +426,7 @@ __mutex_lock_killable_slowpath(atomic_t 
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 
-	return __mutex_lock_common(lock, TASK_KILLABLE, 0, _RET_IP_);
+	return __mutex_lock_common(lock, TASK_KILLABLE, 0, NULL, _RET_IP_);
 }
 
 static noinline int __sched
@@ -425,7 +434,7 @@ __mutex_lock_interruptible_slowpath(atom
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 
-	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, 0, _RET_IP_);
+	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, 0, NULL, _RET_IP_);
 }
 #endif
 



^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 17/21] lockdep, mutex: Provide mutex_lock_nest_lock
@ 2010-11-26 14:39   ` Peter Zijlstra
  0 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:39 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mutex_lock_nest_lock.patch --]
[-- Type: text/plain, Size: 5738 bytes --]

Provide the mutex_lock_nest_lock() annotation.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/lockdep.h |    3 +++
 include/linux/mutex.h   |    9 +++++++++
 kernel/mutex.c          |   25 +++++++++++++++++--------
 3 files changed, 29 insertions(+), 8 deletions(-)

Index: linux-2.6/include/linux/lockdep.h
===================================================================
--- linux-2.6.orig/include/linux/lockdep.h
+++ linux-2.6/include/linux/lockdep.h
@@ -492,12 +492,15 @@ static inline void print_irqtrace_events
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 # ifdef CONFIG_PROVE_LOCKING
 #  define mutex_acquire(l, s, t, i)		lock_acquire(l, s, t, 0, 2, NULL, i)
+#  define mutex_acquire_nest(l, s, t, n, i)	lock_acquire(l, s, t, 0, 2, n, i)
 # else
 #  define mutex_acquire(l, s, t, i)		lock_acquire(l, s, t, 0, 1, NULL, i)
+#  define mutex_acquire_nest(l, s, t, n, i)	lock_acquire(l, s, t, 0, 1, n, i)
 # endif
 # define mutex_release(l, n, i)			lock_release(l, n, i)
 #else
 # define mutex_acquire(l, s, t, i)		do { } while (0)
+# define mutex_acquire_nest(l, s, t, n, i)	do { } while (0)
 # define mutex_release(l, n, i)			do { } while (0)
 #endif
 
Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -124,6 +124,7 @@ static inline int mutex_is_locked(struct
  */
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 extern void mutex_lock_nested(struct mutex *lock, unsigned int subclass);
+extern void _mutex_lock_nest_lock(struct mutex *lock, struct lockdep_map *nest_lock);
 extern int __must_check mutex_lock_interruptible_nested(struct mutex *lock,
 					unsigned int subclass);
 extern int __must_check mutex_lock_killable_nested(struct mutex *lock,
@@ -132,6 +133,13 @@ extern int __must_check mutex_lock_killa
 #define mutex_lock(lock) mutex_lock_nested(lock, 0)
 #define mutex_lock_interruptible(lock) mutex_lock_interruptible_nested(lock, 0)
 #define mutex_lock_killable(lock) mutex_lock_killable_nested(lock, 0)
+
+#define mutex_lock_nest_lock(lock, nest_lock)				\
+do {									\
+	typecheck(struct lockdep_map *, &(nest_lock)->dep_map);		\
+	_mutex_lock_nest_lock(lock, &(nest_lock)->dep_map);		\
+} while (0)
+
 #else
 extern void mutex_lock(struct mutex *lock);
 extern int __must_check mutex_lock_interruptible(struct mutex *lock);
@@ -140,6 +148,7 @@ extern int __must_check mutex_lock_killa
 # define mutex_lock_nested(lock, subclass) mutex_lock(lock)
 # define mutex_lock_interruptible_nested(lock, subclass) mutex_lock_interruptible(lock)
 # define mutex_lock_killable_nested(lock, subclass) mutex_lock_killable(lock)
+# define mutex_lock_nest_lock(lock, nest_lock) mutex_lock(lock)
 #endif
 
 /*
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -140,14 +140,14 @@ EXPORT_SYMBOL(mutex_unlock);
  */
 static inline int __sched
 __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
-	       	unsigned long ip)
+		    struct lockdep_map *nest_lock, unsigned long ip)
 {
 	struct task_struct *task = current;
 	struct mutex_waiter waiter;
 	unsigned long flags;
 
 	preempt_disable();
-	mutex_acquire(&lock->dep_map, subclass, 0, ip);
+	mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
 
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
 	/*
@@ -285,16 +285,25 @@ void __sched
 mutex_lock_nested(struct mutex *lock, unsigned int subclass)
 {
 	might_sleep();
-	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, subclass, _RET_IP_);
+	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, subclass, NULL, _RET_IP_);
 }
 
 EXPORT_SYMBOL_GPL(mutex_lock_nested);
 
+void __sched
+_mutex_lock_nest_lock(struct mutex *lock, struct lockdep_map *nest)
+{
+	might_sleep();
+	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0, nest, _RET_IP_);
+}
+
+EXPORT_SYMBOL_GPL(_mutex_lock_nest_lock);
+
 int __sched
 mutex_lock_killable_nested(struct mutex *lock, unsigned int subclass)
 {
 	might_sleep();
-	return __mutex_lock_common(lock, TASK_KILLABLE, subclass, _RET_IP_);
+	return __mutex_lock_common(lock, TASK_KILLABLE, subclass, NULL, _RET_IP_);
 }
 EXPORT_SYMBOL_GPL(mutex_lock_killable_nested);
 
@@ -303,7 +312,7 @@ mutex_lock_interruptible_nested(struct m
 {
 	might_sleep();
 	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE,
-				   subclass, _RET_IP_);
+				   subclass, NULL, _RET_IP_);
 }
 
 EXPORT_SYMBOL_GPL(mutex_lock_interruptible_nested);
@@ -409,7 +418,7 @@ __mutex_lock_slowpath(atomic_t *lock_cou
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 
-	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0, _RET_IP_);
+	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0, NULL, _RET_IP_);
 }
 
 static noinline int __sched
@@ -417,7 +426,7 @@ __mutex_lock_killable_slowpath(atomic_t 
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 
-	return __mutex_lock_common(lock, TASK_KILLABLE, 0, _RET_IP_);
+	return __mutex_lock_common(lock, TASK_KILLABLE, 0, NULL, _RET_IP_);
 }
 
 static noinline int __sched
@@ -425,7 +434,7 @@ __mutex_lock_interruptible_slowpath(atom
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 
-	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, 0, _RET_IP_);
+	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, 0, NULL, _RET_IP_);
 }
 #endif
 



^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 18/21] mutex: Provide mutex_is_contended
  2010-11-26 14:38 ` Peter Zijlstra
  (?)
@ 2010-11-26 14:39   ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:39 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mutex-is-contended.patch --]
[-- Type: text/plain, Size: 677 bytes --]

Usable for lock-breaks and such.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mutex.h |    5 +++++
 1 file changed, 5 insertions(+)

Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -118,6 +118,11 @@ static inline int mutex_is_locked(struct
 	return atomic_read(&lock->count) != 1;
 }
 
+static inline int mutex_is_contended(struct mutex *lock)
+{
+	return atomic_read(&lock->count) < 0;
+}
+
 /*
  * See kernel/mutex.c for detailed documentation of these APIs.
  * Also see Documentation/mutex-design.txt.
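
"Lock-breaks" here means voluntarily dropping a mutex in the middle of a long
operation as soon as somebody else is queued on it. A sketch of the intended
pattern follows; the work-item type and process_item() are made up, only
mutex_is_contended() is from this patch.

#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/sched.h>

struct work_item {
	struct list_head	node;
};

static void process_item(struct work_item *w)
{
	/* made-up per-item work */
}

/* Walk a long list, yielding the mutex whenever another task wants it. */
static void process_all(struct mutex *lock, struct list_head *items)
{
	struct work_item *w;

restart:
	mutex_lock(lock);
	list_for_each_entry(w, items, node) {
		process_item(w);

		if (mutex_is_contended(lock) || need_resched()) {
			mutex_unlock(lock);
			cond_resched();
			goto restart;	/* the list may have changed meanwhile */
		}
	}
	mutex_unlock(lock);
}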



^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 18/21] mutex: Provide mutex_is_contended
@ 2010-11-26 14:39   ` Peter Zijlstra
  0 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:39 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mutex-is-contended.patch --]
[-- Type: text/plain, Size: 973 bytes --]

Usable for lock-breaks and such.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mutex.h |    5 +++++
 1 file changed, 5 insertions(+)

Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -118,6 +118,11 @@ static inline int mutex_is_locked(struct
 	return atomic_read(&lock->count) != 1;
 }
 
+static inline int mutex_is_contended(struct mutex *lock)
+{
+	return atomic_read(&lock->count) < 0;
+}
+
 /*
  * See kernel/mutex.c for detailed documentation of these APIs.
  * Also see Documentation/mutex-design.txt.



^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 18/21] mutex: Provide mutex_is_contended
@ 2010-11-26 14:39   ` Peter Zijlstra
  0 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:39 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mutex-is-contended.patch --]
[-- Type: text/plain, Size: 973 bytes --]

Usable for lock-breaks and such.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mutex.h |    5 +++++
 1 file changed, 5 insertions(+)

Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -118,6 +118,11 @@ static inline int mutex_is_locked(struct
 	return atomic_read(&lock->count) != 1;
 }
 
+static inline int mutex_is_contended(struct mutex *lock)
+{
+	return atomic_read(&lock->count) < 0;
+}
+
 /*
  * See kernel/mutex.c for detailed documentation of these APIs.
  * Also see Documentation/mutex-design.txt.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 19/21] mm: Convert i_mmap_lock and anon_vma->lock to mutexes
@ 2010-11-26 14:39   ` Peter Zijlstra
  0 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:39 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mm-mutex.patch --]
[-- Type: text/plain, Size: 20757 bytes --]

Straightforward conversion of i_mmap_lock and anon_vma->lock to mutexes.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/mm/hugetlbpage.c |    4 ++--
 fs/gfs2/main.c            |    2 +-
 fs/hugetlbfs/inode.c      |    4 ++--
 fs/inode.c                |    2 +-
 fs/nilfs2/page.c          |    2 +-
 include/linux/fs.h        |    2 +-
 include/linux/mm.h        |    2 +-
 include/linux/rmap.h      |   12 ++++++------
 kernel/fork.c             |    4 ++--
 mm/filemap_xip.c          |    4 ++--
 mm/fremap.c               |    4 ++--
 mm/hugetlb.c              |   12 ++++++------
 mm/memory-failure.c       |    4 ++--
 mm/memory.c               |   14 +++++++-------
 mm/migrate.c              |    4 ++--
 mm/mmap.c                 |   18 +++++++++---------
 mm/mremap.c               |    4 ++--
 mm/rmap.c                 |   22 +++++++++++-----------
 18 files changed, 60 insertions(+), 60 deletions(-)
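
( Besides the mechanical spin_lock()/spin_unlock() to
  mutex_lock()/mutex_unlock() substitutions, the conversion swaps
  spin_needbreak() for mutex_is_contended(), turns cond_resched_lock()
  into a bare cond_resched() (sleeping while holding a mutex is allowed,
  so the lock no longer needs to be dropped just to schedule), replaces
  spin_lock_nest_lock() with mutex_lock_nest_lock(), and moves the
  anon_vma refcount drop in mm/migrate.c to atomic_dec_and_mutex_lock().
  Roughly, that helper behaves as below; a from-memory sketch, not a
  verbatim copy of the kernel implementation:

	/*
	 * Drop a reference; return 1 with the mutex held only when the
	 * count actually reached zero, so the caller may tear things down.
	 */
	int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock)
	{
		if (atomic_add_unless(cnt, -1, 1))
			return 0;	/* count was not 1, cannot hit 0 here */
		mutex_lock(lock);	/* we may be about to drop the last ref */
		if (!atomic_dec_and_test(cnt)) {
			mutex_unlock(lock);	/* did not reach zero after all */
			return 0;
		}
		return 1;		/* count is now 0 and the lock is held */
	}
)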

Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c
+++ linux-2.6/arch/x86/mm/hugetlbpage.c
@@ -72,7 +72,7 @@ static void huge_pmd_share(struct mm_str
 	if (!vma_shareable(vma, addr))
 		return;
 
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -97,7 +97,7 @@ static void huge_pmd_share(struct mm_str
 		put_page(virt_to_page(spte));
 	spin_unlock(&mm->page_table_lock);
 out:
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 }
 
 /*
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -413,10 +413,10 @@ static int hugetlb_vmtruncate(struct ino
 	pgoff = offset >> PAGE_SHIFT;
 
 	i_size_write(inode, offset);
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	if (!prio_tree_empty(&mapping->i_mmap))
 		hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	truncate_hugepages(inode, offset);
 	return 0;
 }
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -295,7 +295,7 @@ void inode_init_once(struct inode *inode
 	INIT_LIST_HEAD(&inode->i_lru);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	spin_lock_init(&inode->i_data.tree_lock);
-	spin_lock_init(&inode->i_data.i_mmap_lock);
+	mutex_init(&inode->i_data.i_mmap_lock);
 	INIT_LIST_HEAD(&inode->i_data.private_list);
 	spin_lock_init(&inode->i_data.private_lock);
 	INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -635,7 +635,7 @@ struct address_space {
 	unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
 	struct prio_tree_root	i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
-	spinlock_t		i_mmap_lock;	/* protect tree, count, list */
+	struct mutex		i_mmap_lock;	/* protect tree, count, list */
 	unsigned int		truncate_count;	/* Cover race condition with truncate */
 	unsigned long		nrpages;	/* number of total pages */
 	pgoff_t			writeback_index;/* writeback starts here */
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -768,7 +768,7 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
-	spinlock_t *i_mmap_lock;		/* For unmap_mapping_range: */
+	struct mutex *i_mmap_lock;		/* For unmap_mapping_range: */
 	unsigned long truncate_count;		/* Compare vm_truncate_count */
 };
 
Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h
+++ linux-2.6/include/linux/rmap.h
@@ -7,7 +7,7 @@
 #include <linux/list.h>
 #include <linux/slab.h>
 #include <linux/mm.h>
-#include <linux/spinlock.h>
+#include <linux/mutex.h>
 #include <linux/memcontrol.h>
 
 /*
@@ -26,7 +26,7 @@
  */
 struct anon_vma {
 	struct anon_vma *root;	/* Root of this anon_vma tree */
-	spinlock_t lock;	/* Serialize access to vma list */
+	struct mutex lock;	/* Serialize access to vma list */
 	/*
 	 * The refcount is taken on an anon_vma when there is no
 	 * guarantee that the vma of page tables will exist for
@@ -93,24 +93,24 @@ static inline void vma_lock_anon_vma(str
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_lock(&anon_vma->root->lock);
+		mutex_lock(&anon_vma->root->lock);
 }
 
 static inline void vma_unlock_anon_vma(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_unlock(&anon_vma->root->lock);
+		mutex_unlock(&anon_vma->root->lock);
 }
 
 static inline void anon_vma_lock(struct anon_vma *anon_vma)
 {
-	spin_lock(&anon_vma->root->lock);
+	mutex_lock(&anon_vma->root->lock);
 }
 
 static inline void anon_vma_unlock(struct anon_vma *anon_vma)
 {
-	spin_unlock(&anon_vma->root->lock);
+	mutex_unlock(&anon_vma->root->lock);
 }
 
 /*
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -370,7 +370,7 @@ static int dup_mmap(struct mm_struct *mm
 			get_file(file);
 			if (tmp->vm_flags & VM_DENYWRITE)
 				atomic_dec(&inode->i_writecount);
-			spin_lock(&mapping->i_mmap_lock);
+			mutex_lock(&mapping->i_mmap_lock);
 			if (tmp->vm_flags & VM_SHARED)
 				mapping->i_mmap_writable++;
 			tmp->vm_truncate_count = mpnt->vm_truncate_count;
@@ -378,7 +378,7 @@ static int dup_mmap(struct mm_struct *mm
 			/* insert tmp into the share list, just after mpnt */
 			vma_prio_tree_add(tmp, mpnt);
 			flush_dcache_mmap_unlock(mapping);
-			spin_unlock(&mapping->i_mmap_lock);
+			mutex_unlock(&mapping->i_mmap_lock);
 		}
 
 		/*
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -183,7 +183,7 @@ __xip_unmap (struct address_space * mapp
 		return;
 
 retry:
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		mm = vma->vm_mm;
 		address = vma->vm_start +
@@ -201,7 +201,7 @@ __xip_unmap (struct address_space * mapp
 			page_cache_release(page);
 		}
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 
 	if (locked) {
 		mutex_unlock(&xip_sparse_mutex);
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c
+++ linux-2.6/mm/fremap.c
@@ -211,13 +211,13 @@ SYSCALL_DEFINE5(remap_file_pages, unsign
 			}
 			goto out;
 		}
-		spin_lock(&mapping->i_mmap_lock);
+		mutex_lock(&mapping->i_mmap_lock);
 		flush_dcache_mmap_lock(mapping);
 		vma->vm_flags |= VM_NONLINEAR;
 		vma_prio_tree_remove(vma, &mapping->i_mmap);
 		vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
 		flush_dcache_mmap_unlock(mapping);
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 	}
 
 	if (vma->vm_flags & VM_LOCKED) {
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -2316,9 +2316,9 @@ void __unmap_hugepage_range(struct vm_ar
 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 			  unsigned long end, struct page *ref_page)
 {
-	spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+	mutex_lock(&vma->vm_file->f_mapping->i_mmap_lock);
 	__unmap_hugepage_range(vma, start, end, ref_page);
-	spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
 }
 
 /*
@@ -2350,7 +2350,7 @@ static int unmap_ref_private(struct mm_s
 	 * this mapping should be shared between all the VMAs,
 	 * __unmap_hugepage_range() is called as the lock is already held
 	 */
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(iter_vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		/* Do not unmap the current VMA */
 		if (iter_vma == vma)
@@ -2368,7 +2368,7 @@ static int unmap_ref_private(struct mm_s
 				address, address + huge_page_size(h),
 				page);
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 
 	return 1;
 }
@@ -2850,7 +2850,7 @@ void hugetlb_change_protection(struct vm
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+	mutex_lock(&vma->vm_file->f_mapping->i_mmap_lock);
 	spin_lock(&mm->page_table_lock);
 	for (; address < end; address += huge_page_size(h)) {
 		ptep = huge_pte_offset(mm, address);
@@ -2865,7 +2865,7 @@ void hugetlb_change_protection(struct vm
 		}
 	}
 	spin_unlock(&mm->page_table_lock);
-	spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
 
 	flush_tlb_range(vma, start, end);
 }
Index: linux-2.6/mm/memory-failure.c
===================================================================
--- linux-2.6.orig/mm/memory-failure.c
+++ linux-2.6/mm/memory-failure.c
@@ -428,7 +428,7 @@ static void collect_procs_file(struct pa
 	 */
 
 	read_lock(&tasklist_lock);
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	for_each_process(tsk) {
 		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 
@@ -448,7 +448,7 @@ static void collect_procs_file(struct pa
 				add_to_kill(tsk, page, vma, to_kill, tkc);
 		}
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	read_unlock(&tasklist_lock);
 }
 
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1177,7 +1177,7 @@ unsigned long unmap_vmas(struct mmu_gath
 {
 	long zap_work = ZAP_BLOCK_SIZE;
 	unsigned long start = start_addr;
-	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
+	struct mutex *i_mmap_lock = details ? details->i_mmap_lock : NULL;
 	struct mm_struct *mm = vma->vm_mm;
 
 	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
@@ -1227,7 +1227,7 @@ unsigned long unmap_vmas(struct mmu_gath
 			}
 
 			if (need_resched() ||
-				(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
+				(i_mmap_lock && mutex_is_contended(i_mmap_lock))) {
 				if (i_mmap_lock)
 					goto out;
 				cond_resched();
@@ -2522,7 +2522,7 @@ static int unmap_mapping_range_vma(struc
 
 	restart_addr = zap_page_range(vma, start_addr,
 					end_addr - start_addr, details);
-	need_break = need_resched() || spin_needbreak(details->i_mmap_lock);
+	need_break = need_resched() || mutex_is_contended(details->i_mmap_lock);
 
 	if (restart_addr >= end_addr) {
 		/* We have now completed this vma: mark it so */
@@ -2536,9 +2536,9 @@ static int unmap_mapping_range_vma(struc
 			goto again;
 	}
 
-	spin_unlock(details->i_mmap_lock);
+	mutex_unlock(details->i_mmap_lock);
 	cond_resched();
-	spin_lock(details->i_mmap_lock);
+	mutex_lock(details->i_mmap_lock);
 	return -EINTR;
 }
 
@@ -2634,7 +2634,7 @@ void unmap_mapping_range(struct address_
 		details.last_index = ULONG_MAX;
 	details.i_mmap_lock = &mapping->i_mmap_lock;
 
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 
 	/* Protect against endless unmapping loops */
 	mapping->truncate_count++;
@@ -2649,7 +2649,7 @@ void unmap_mapping_range(struct address_
 		unmap_mapping_range_tree(&mapping->i_mmap, &details);
 	if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
 		unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -217,9 +217,9 @@ void unlink_file_vma(struct vm_area_stru
 
 	if (file) {
 		struct address_space *mapping = file->f_mapping;
-		spin_lock(&mapping->i_mmap_lock);
+		mutex_lock(&mapping->i_mmap_lock);
 		__remove_shared_vm_struct(vma, file, mapping);
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 	}
 }
 
@@ -456,7 +456,7 @@ static void vma_link(struct mm_struct *m
 		mapping = vma->vm_file->f_mapping;
 
 	if (mapping) {
-		spin_lock(&mapping->i_mmap_lock);
+		mutex_lock(&mapping->i_mmap_lock);
 		vma->vm_truncate_count = mapping->truncate_count;
 	}
 
@@ -464,7 +464,7 @@ static void vma_link(struct mm_struct *m
 	__vma_link_file(vma);
 
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 
 	mm->map_count++;
 	validate_mm(mm);
@@ -567,7 +567,7 @@ again:			remove_next = 1 + (end > next->
 		mapping = file->f_mapping;
 		if (!(vma->vm_flags & VM_NONLINEAR))
 			root = &mapping->i_mmap;
-		spin_lock(&mapping->i_mmap_lock);
+		mutex_lock(&mapping->i_mmap_lock);
 		if (importer &&
 		    vma->vm_truncate_count != next->vm_truncate_count) {
 			/*
@@ -641,7 +641,7 @@ again:			remove_next = 1 + (end > next->
 	if (anon_vma)
 		anon_vma_unlock(anon_vma);
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 
 	if (remove_next) {
 		if (file) {
@@ -2500,7 +2500,7 @@ static void vm_lock_anon_vma(struct mm_s
 		 * The LSB of head.next can't change from under us
 		 * because we hold the mm_all_locks_mutex.
 		 */
-		spin_lock_nest_lock(&anon_vma->root->lock, &mm->mmap_sem);
+		mutex_lock_nest_lock(&anon_vma->root->lock, &mm->mmap_sem);
 		/*
 		 * We can safely modify head.next after taking the
 		 * anon_vma->root->lock. If some other vma in this mm shares
@@ -2530,7 +2530,7 @@ static void vm_lock_mapping(struct mm_st
 		 */
 		if (test_and_set_bit(AS_MM_ALL_LOCKS, &mapping->flags))
 			BUG();
-		spin_lock_nest_lock(&mapping->i_mmap_lock, &mm->mmap_sem);
+		mutex_lock_nest_lock(&mapping->i_mmap_lock, &mm->mmap_sem);
 	}
 }
 
@@ -2629,7 +2629,7 @@ static void vm_unlock_mapping(struct add
 		 * AS_MM_ALL_LOCKS can't change to 0 from under us
 		 * because we hold the mm_all_locks_mutex.
 		 */
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 		if (!test_and_clear_bit(AS_MM_ALL_LOCKS,
 					&mapping->flags))
 			BUG();
Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c
+++ linux-2.6/mm/mremap.c
@@ -90,7 +90,7 @@ static void move_ptes(struct vm_area_str
 		 * and we propagate stale pages into the dst afterward.
 		 */
 		mapping = vma->vm_file->f_mapping;
-		spin_lock(&mapping->i_mmap_lock);
+		mutex_lock(&mapping->i_mmap_lock);
 		if (new_vma->vm_truncate_count &&
 		    new_vma->vm_truncate_count != vma->vm_truncate_count)
 			new_vma->vm_truncate_count = 0;
@@ -122,7 +122,7 @@ static void move_ptes(struct vm_area_str
 	pte_unmap(new_pte - 1);
 	pte_unmap_unlock(old_pte - 1, old_ptl);
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
 }
 
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -302,7 +302,7 @@ static void anon_vma_ctor(void *data)
 {
 	struct anon_vma *anon_vma = data;
 
-	spin_lock_init(&anon_vma->lock);
+	mutex_init(&anon_vma->lock);
 	atomic_set(&anon_vma->refcount, 0);
 	INIT_LIST_HEAD(&anon_vma->head);
 }
@@ -641,7 +641,7 @@ static int page_referenced_file(struct p
 	 */
 	BUG_ON(!PageLocked(page));
 
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 
 	/*
 	 * i_mmap_lock does not stabilize mapcount at all, but mapcount
@@ -666,7 +666,7 @@ static int page_referenced_file(struct p
 			break;
 	}
 
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	return referenced;
 }
 
@@ -753,7 +753,7 @@ static int page_mkclean_file(struct addr
 
 	BUG_ON(PageAnon(page));
 
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		if (vma->vm_flags & VM_SHARED) {
 			unsigned long address = vma_address(page, vma);
@@ -762,7 +762,7 @@ static int page_mkclean_file(struct addr
 			ret += page_mkclean_one(page, vma, address);
 		}
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	return ret;
 }
 
@@ -1327,7 +1327,7 @@ static int try_to_unmap_file(struct page
 	unsigned long max_nl_size = 0;
 	unsigned int mapcount;
 
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		unsigned long address = vma_address(page, vma);
 		if (address == -EFAULT)
@@ -1373,7 +1373,7 @@ static int try_to_unmap_file(struct page
 	mapcount = page_mapcount(page);
 	if (!mapcount)
 		goto out;
-	cond_resched_lock(&mapping->i_mmap_lock);
+	cond_resched();
 
 	max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK;
 	if (max_nl_cursor == 0)
@@ -1395,7 +1395,7 @@ static int try_to_unmap_file(struct page
 			}
 			vma->vm_private_data = (void *) max_nl_cursor;
 		}
-		cond_resched_lock(&mapping->i_mmap_lock);
+		cond_resched();
 		max_nl_cursor += CLUSTER_SIZE;
 	} while (max_nl_cursor <= max_nl_size);
 
@@ -1407,7 +1407,7 @@ static int try_to_unmap_file(struct page
 	list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
 		vma->vm_private_data = NULL;
 out:
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	return ret;
 }
 
@@ -1522,7 +1522,7 @@ static int rmap_walk_file(struct page *p
 
 	if (!mapping)
 		return ret;
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		unsigned long address = vma_address(page, vma);
 		if (address == -EFAULT)
@@ -1536,7 +1536,7 @@ static int rmap_walk_file(struct page *p
 	 * never contain migration ptes.  Decide what to do about this
 	 * limitation to linear when we need rmap_walk() on nonlinear.
 	 */
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	return ret;
 }
 
Index: linux-2.6/fs/gfs2/main.c
===================================================================
--- linux-2.6.orig/fs/gfs2/main.c
+++ linux-2.6/fs/gfs2/main.c
@@ -62,7 +62,7 @@ static void gfs2_init_gl_aspace_once(voi
 	memset(mapping, 0, sizeof(*mapping));
 	INIT_RADIX_TREE(&mapping->page_tree, GFP_ATOMIC);
 	spin_lock_init(&mapping->tree_lock);
-	spin_lock_init(&mapping->i_mmap_lock);
+	mutex_init(&mapping->i_mmap_lock);
 	INIT_LIST_HEAD(&mapping->private_list);
 	spin_lock_init(&mapping->private_lock);
 	INIT_RAW_PRIO_TREE_ROOT(&mapping->i_mmap);
Index: linux-2.6/fs/nilfs2/page.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/page.c
+++ linux-2.6/fs/nilfs2/page.c
@@ -500,7 +500,7 @@ void nilfs_mapping_init_once(struct addr
 	INIT_LIST_HEAD(&mapping->private_list);
 	spin_lock_init(&mapping->private_lock);
 
-	spin_lock_init(&mapping->i_mmap_lock);
+	mutex_init(&mapping->i_mmap_lock);
 	INIT_RAW_PRIO_TREE_ROOT(&mapping->i_mmap);
 	INIT_LIST_HEAD(&mapping->i_mmap_nonlinear);
 }
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -845,10 +845,10 @@ static int unmap_and_move_huge_page(new_
 	if (rc)
 		remove_migration_ptes(hpage, hpage);
 
-	if (anon_vma && atomic_dec_and_lock(&anon_vma->refcount,
+	if (anon_vma && atomic_dec_and_mutex_lock(&anon_vma->refcount,
 					    &anon_vma->lock)) {
 		int empty = list_empty(&anon_vma->head);
-		spin_unlock(&anon_vma->lock);
+		mutex_unlock(&anon_vma->lock);
 		if (empty)
 			put_anon_vma(anon_vma);
 	}


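The conversion in the hunks above is mostly mechanical -- spin_lock/spin_unlock
on i_mmap_lock become mutex_lock/mutex_unlock -- but the lock-break sites are
worth a second look: cond_resched_lock() only exists for spinlocks, so the
rmap.c sites become a plain cond_resched() (sleeping while holding a mutex is
fine), while the memory.c sites keep an explicit drop/resched/retake, now keyed
off mutex_is_contended() instead of spin_needbreak(). A minimal sketch of that
break pattern, assuming the mutex_is_contended() helper added by patch 18/21;
illustrative only, not a hunk from this patch:

#include <linux/fs.h>
#include <linux/mutex.h>
#include <linux/sched.h>

/*
 * Illustrative helper: break the (now sleeping) i_mmap_lock when we
 * should reschedule or when another task is queued on the mutex.
 */
static void i_mmap_lock_break(struct address_space *mapping)
{
	if (need_resched() || mutex_is_contended(&mapping->i_mmap_lock)) {
		mutex_unlock(&mapping->i_mmap_lock);
		cond_resched();
		mutex_lock(&mapping->i_mmap_lock);
	}
}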

^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 20/21] mm: Extended batches for generic mmu_gather
  2010-11-26 14:38 ` Peter Zijlstra
@ 2010-11-26 14:39   ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:39 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mm-tlb_gather-more-batch.patch --]
[-- Type: text/plain, Size: 5425 bytes --]

Instead of using a single batch (the small on-stack array, or an allocated
page), try to extend the batch every time it runs out, and only flush once
either the extension fails or we're done.

Requested-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/asm-generic/tlb.h |  122 ++++++++++++++++++++++++++++++----------------
 1 file changed, 82 insertions(+), 40 deletions(-)

Index: linux-2.6/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.orig/include/asm-generic/tlb.h
+++ linux-2.6/include/asm-generic/tlb.h
@@ -17,16 +17,6 @@
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 
-/*
- * For UP we don't need to worry about TLB flush
- * and page free order so much..
- */
-#ifdef CONFIG_SMP
-  #define tlb_fast_mode(tlb) ((tlb)->nr == ~0U)
-#else
-  #define tlb_fast_mode(tlb) 1
-#endif
-
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 /*
  * Semi RCU freeing of the page directories.
@@ -70,31 +60,66 @@ extern void tlb_remove_table(struct mmu_
 
 #endif
 
+struct mmu_gather_batch {
+	struct mmu_gather_batch	*next;
+	unsigned int		nr;
+	unsigned int		max;
+	struct page		*pages[0];
+};
+
+#define MAX_GATHER_BATCH	\
+	((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *))
+
 /* struct mmu_gather is an opaque type used by the mm code for passing around
  * any data needed by arch specific code for tlb_remove_page.
  */
 struct mmu_gather {
 	struct mm_struct	*mm;
-	unsigned int		nr;	/* set to ~0U means fast mode */
-	unsigned int		max;	/* nr < max */
-	unsigned int		need_flush;/* Really unmapped some ptes? */
-	unsigned int		fullmm; /* non-zero means full mm flush */
-	struct page		**pages;
-	struct page		*local[8];
+	unsigned int		need_flush : 1,	/* Did free PTEs */
+				fast_mode  : 1; /* No batching   */
+	unsigned int		fullmm;		/* Flush full mm */
+
+	struct mmu_gather_batch *active;
+	struct mmu_gather_batch	local;
+	struct page		*__pages[8];
 
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 	struct mmu_table_batch	*batch;
 #endif
 };
 
-static inline void __tlb_alloc_pages(struct mmu_gather *tlb)
+/*
+ * For UP we don't need to worry about TLB flush
+ * and page free order so much..
+ */
+#ifdef CONFIG_SMP
+  #define tlb_fast_mode(tlb) (tlb->fast_mode)
+#else
+  #define tlb_fast_mode(tlb) 1
+#endif
+
+static inline int tlb_next_batch(struct mmu_gather *tlb)
 {
-	unsigned long addr = __get_free_pages(GFP_ATOMIC, 0);
+	struct mmu_gather_batch *batch;
 
-	if (addr) {
-		tlb->pages = (void *)addr;
-		tlb->max = PAGE_SIZE / sizeof(struct page *);
+	batch = tlb->active;
+	if (batch->next) {
+		tlb->active = batch->next;
+		return 1;
 	}
+
+	batch = (void *)__get_free_pages(GFP_ATOMIC, 0);
+	if (!batch)
+		return 0;
+
+	batch->next = NULL;
+	batch->nr   = 0;
+	batch->max  = MAX_GATHER_BATCH;
+
+	tlb->active->next = batch;
+	tlb->active = batch;
+
+	return 1;
 }
 
 /* tlb_gather_mmu
@@ -105,17 +130,16 @@ tlb_gather_mmu(struct mmu_gather *tlb, s
 {
 	tlb->mm = mm;
 
-	tlb->max = ARRAY_SIZE(tlb->local);
-	tlb->pages = tlb->local;
-
-	if (num_online_cpus() > 1) {
-		tlb->nr = 0;
-		__tlb_alloc_pages(tlb);
-	} else /* Use fast mode if only one CPU is online */
-		tlb->nr = ~0U;
-
+	tlb->need_flush = 0;
+	if (num_online_cpus() == 1)
+		tlb->fast_mode = 1;
 	tlb->fullmm = full_mm_flush;
 
+	tlb->local.next = NULL;
+	tlb->local.nr   = 0;
+	tlb->local.max  = ARRAY_SIZE(tlb->__pages);
+	tlb->active     = &tlb->local;
+
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 	tlb->batch = NULL;
 #endif
@@ -124,6 +148,8 @@ tlb_gather_mmu(struct mmu_gather *tlb, s
 static inline void
 tlb_flush_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 {
+	struct mmu_gather_batch *batch;
+
 	if (!tlb->need_flush)
 		return;
 	tlb->need_flush = 0;
@@ -131,12 +157,14 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 	tlb_table_flush(tlb);
 #endif
-	if (!tlb_fast_mode(tlb)) {
-		free_pages_and_swap_cache(tlb->pages, tlb->nr);
-		tlb->nr = 0;
-		if (tlb->pages == tlb->local)
-			__tlb_alloc_pages(tlb);
+	if (tlb_fast_mode(tlb))
+		return;
+
+	for (batch = &tlb->local; batch; batch = batch->next) {
+		free_pages_and_swap_cache(batch->pages, batch->nr);
+		batch->nr = 0;
 	}
+	tlb->active = &tlb->local;
 }
 
 /* tlb_finish_mmu
@@ -146,13 +174,18 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
 static inline void
 tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 {
+	struct mmu_gather_batch *batch, *next;
+
 	tlb_flush_mmu(tlb, start, end);
 
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
 
-	if (tlb->pages != tlb->local)
-		free_pages((unsigned long)tlb->pages, 0);
+	for (batch = tlb->local.next; batch; batch = next) {
+		next = batch->next;
+		free_pages((unsigned long)batch, 0);
+	}
+	tlb->local.next = NULL;
 }
 
 /* tlb_remove_page
@@ -162,14 +195,23 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
  */
 static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 {
+	struct mmu_gather_batch *batch;
+
 	tlb->need_flush = 1;
+
 	if (tlb_fast_mode(tlb)) {
 		free_page_and_swap_cache(page);
 		return;
 	}
-	tlb->pages[tlb->nr++] = page;
-	if (tlb->nr >= tlb->max)
-		tlb_flush_mmu(tlb, 0, 0);
+
+	batch = tlb->active;
+	if (batch->nr == batch->max) {
+		if (!tlb_next_batch(tlb))
+			tlb_flush_mmu(tlb, 0, 0);
+		batch = tlb->active;
+	}
+
+	batch->pages[batch->nr++] = page;
 }
 
 /**


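For readers skimming rather than applying the hunks, here is a stand-alone
user-space sketch of the control flow being introduced: items accumulate in
fixed-size batches chained through a next pointer, the chain is extended on
overflow, and an early flush only happens if that extension (an opportunistic
allocation) fails. The names and sizes are illustrative, not the kernel's:

#include <stdio.h>
#include <stdlib.h>

#define BATCH_MAX 4			/* tiny on purpose, to force chaining */

struct batch {
	struct batch	*next;
	unsigned int	nr;
	void		*items[BATCH_MAX];
};

struct gather {
	struct batch	local;		/* always available, like tlb->local */
	struct batch	*active;
};

static void gather_init(struct gather *g)
{
	g->local.next = NULL;
	g->local.nr = 0;
	g->active = &g->local;
}

/* "flush": consume every queued item and reset the batches for reuse */
static void gather_flush(struct gather *g)
{
	struct batch *b;
	unsigned int i;

	for (b = &g->local; b; b = b->next) {
		for (i = 0; i < b->nr; i++)
			printf("freeing %p\n", b->items[i]);
		b->nr = 0;
	}
	g->active = &g->local;
}

/* extend the chain: reuse a batch from an earlier round, else allocate */
static int gather_next_batch(struct gather *g)
{
	struct batch *b;

	if (g->active->next) {
		g->active = g->active->next;
		return 1;
	}

	b = calloc(1, sizeof(*b));	/* stand-in for __get_free_pages() */
	if (!b)
		return 0;

	g->active->next = b;
	g->active = b;
	return 1;
}

/* queue an item; flush early only if the chain could not be extended */
static void gather_add(struct gather *g, void *item)
{
	if (g->active->nr == BATCH_MAX) {
		if (!gather_next_batch(g))
			gather_flush(g);
	}
	g->active->items[g->active->nr++] = item;
}

int main(void)
{
	struct gather g;
	struct batch *b, *next;
	int dummy[10], i;

	gather_init(&g);
	for (i = 0; i < 10; i++)
		gather_add(&g, &dummy[i]);
	gather_flush(&g);		/* final flush, like tlb_finish_mmu() */

	/* teardown, also like tlb_finish_mmu(): free the extension batches */
	for (b = g.local.next; b; b = next) {
		next = b->next;
		free(b);
	}
	return 0;
}

Keeping previously allocated batches on the chain across a flush (the
batch->next check above, mirroring tlb_next_batch()) is what keeps the common
zap loop allocation-free after its first pass.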

^ permalink raw reply	[flat|nested] 157+ messages in thread

* [PATCH 21/21] mm: Optimize page_lock_anon_vma() fast-path
  2010-11-26 14:38 ` Peter Zijlstra
@ 2010-11-26 14:39   ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-26 14:39 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Stephen Rothwell

[-- Attachment #1: mm-opt-page_lock_anon_vma.patch --]
[-- Type: text/plain, Size: 2818 bytes --]

Optimize the page_lock_anon_vma() fast path to be one LOCKed op,
instead of two.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/rmap.c |   71 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 67 insertions(+), 4 deletions(-)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -357,20 +357,75 @@ out:
 	return anon_vma;
 }
 
+/*
+ * Similar to page_get_anon_vma() except it locks the anon_vma.
+ *
+ * Its a little more complex as it tries to keep the fast path to a single
+ * atomic op -- the trylock. If we fail the trylock, we fall back to getting a
+ * reference like with page_get_anon_vma() and then block on the mutex.
+ */
 struct anon_vma *page_lock_anon_vma(struct page *page)
 {
-	struct anon_vma *anon_vma = page_get_anon_vma(page);
+	struct anon_vma *anon_vma = NULL;
+	unsigned long anon_mapping;
+
+	rcu_read_lock();
+	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
+	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
+		goto out;
+	if (!page_mapped(page))
+		goto out;
+
+	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
+	if (mutex_trylock(&anon_vma->root->lock)) {
+		/*
+		 * If we observe a !0 refcount, then holding the lock ensures
+		 * the anon_vma will not go away, see __put_anon_vma().
+		 */
+		if (!atomic_read(&anon_vma->refcount)) {
+			anon_vma_unlock(anon_vma);
+			anon_vma = NULL;
+		}
+		goto out;
+	}
 
-	if (anon_vma)
-		anon_vma_lock(anon_vma);
+	/* trylock failed, we got to sleep */
+	if (!atomic_inc_not_zero(&anon_vma->refcount)) {
+		anon_vma = NULL;
+		goto out;
+	}
+
+	if (!page_mapped(page)) {
+		put_anon_vma(anon_vma);
+		anon_vma = NULL;
+		goto out;
+	}
+
+	/* we pinned the anon_vma, its safe to sleep */
+	rcu_read_unlock();
+	anon_vma_lock(anon_vma);
+
+	if (atomic_dec_and_test(&anon_vma->refcount)) {
+		/*
+		 * Oops, we held the last refcount, release the lock
+		 * and bail -- can't simply use put_anon_vma() because
+		 * we'll deadlock on the anon_vma_lock() recursion.
+		 */
+		anon_vma_unlock(anon_vma);
+		__put_anon_vma(anon_vma);
+		anon_vma = NULL;
+	}
 
 	return anon_vma;
+
+out:
+	rcu_read_unlock();
+	return anon_vma;
 }
 
 void page_unlock_anon_vma(struct anon_vma *anon_vma)
 {
 	anon_vma_unlock(anon_vma);
-	put_anon_vma(anon_vma);
 }
 
 /*
@@ -1462,6 +1517,14 @@ int try_to_munlock(struct page *page)
 
 void __put_anon_vma(struct anon_vma *anon_vma)
 {
+	/*
+	 * Synchronize against page_lock_anon_vma() such that
+	 * we can safely hold the lock without the anon_vma getting
+	 * freed.
+	 */
+	anon_vma_lock(anon_vma);
+	anon_vma_unlock(anon_vma);
+
 	if (anon_vma->root != anon_vma)
 		put_anon_vma(anon_vma->root);
 	anon_vma_free(anon_vma);


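To see why the refcount can be skipped on the fast path at all, it can help
to strip the pattern out of the mm code. Below is a user-space sketch of the
same scheme, with a pthread mutex and a C11 atomic refcount standing in for
the anon_vma lock and refcount; it is illustrative only, and it simply assumes
of its caller the existence guarantee the kernel gets from rcu_read_lock()
while the fast path touches the object:

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct obj {
	pthread_mutex_t	lock;
	atomic_int	refcount;	/* 0 means: being torn down */
};

void obj_free(struct obj *o)
{
	/* serialize with a fast-path locker that saw refcount != 0 */
	pthread_mutex_lock(&o->lock);
	pthread_mutex_unlock(&o->lock);
	free(o);
}

void obj_put(struct obj *o)
{
	if (atomic_fetch_sub(&o->refcount, 1) == 1)
		obj_free(o);
}

/* returns true with o->lock held; false if the object is going away */
bool obj_lock(struct obj *o)
{
	int old;

	if (pthread_mutex_trylock(&o->lock) == 0) {
		/* fast path: one locked op; the held lock pins the object */
		if (atomic_load(&o->refcount) == 0) {
			pthread_mutex_unlock(&o->lock);
			return false;
		}
		return true;
	}

	/* contended: pin with a reference so we can sleep on the lock */
	old = atomic_load(&o->refcount);
	do {
		if (old == 0)
			return false;
	} while (!atomic_compare_exchange_weak(&o->refcount, &old, old + 1));

	pthread_mutex_lock(&o->lock);

	if (atomic_fetch_sub(&o->refcount, 1) == 1) {
		/* we held the last reference: drop the lock, then free */
		pthread_mutex_unlock(&o->lock);
		obj_free(o);
		return false;
	}
	return true;
}

void obj_unlock(struct obj *o)
{
	pthread_mutex_unlock(&o->lock);
}

The lock/unlock pair in obj_free() plays the role of the
anon_vma_lock()/anon_vma_unlock() handshake added to __put_anon_vma() above:
any fast-path locker that observed a non-zero refcount is guaranteed to have
released the lock again before the memory is actually freed.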

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 02/21] powerpc: Use call_rcu_sched() for pagetables
  2010-11-26 14:38   ` Peter Zijlstra
@ 2010-11-27 10:33     ` Nick Piggin
  -1 siblings, 0 replies; 157+ messages in thread
From: Nick Piggin @ 2010-11-27 10:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang,
	Stephen Rothwell, Nick Piggin

Can this go through powerpc tree as a bugfix?

On Fri, Nov 26, 2010 at 03:38:45PM +0100, Peter Zijlstra wrote:
> PowerPC relies on IRQ-disable to guard against RCU quiescent states,
> use the appropriate RCU call version.
> 
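For readers without the patch in front of them: the powerpc software
page-table walkers protect themselves only by disabling interrupts. That
holds off an RCU-sched grace period, but with CONFIG_TREE_PREEMPT_RCU it does
not hold off a plain call_rcu() grace period, so a batched page-table page
could be freed while a walker is still using it. A sketch of the shape of the
fix, reconstructed here for illustration -- the real code lives in
arch/powerpc/mm/pgtable.c and differs in detail:

#include <linux/kernel.h>
#include <linux/gfp.h>
#include <linux/rcupdate.h>

/* illustrative stand-in for the powerpc freelist batch */
struct pte_freelist_batch {
	struct rcu_head	rcu;
	unsigned int	index;
	unsigned long	tables[0];
};

static void pte_free_rcu_callback(struct rcu_head *head)
{
	struct pte_freelist_batch *batch =
		container_of(head, struct pte_freelist_batch, rcu);
	unsigned int i;

	for (i = 0; i < batch->index; i++)
		free_page(batch->tables[i]);	/* stand-in for pgtable_free() */

	free_page((unsigned long)batch);
}

static void pte_free_submit(struct pte_freelist_batch *batch)
{
	/*
	 * The walkers only disable IRQs, which is an RCU-sched read-side
	 * critical section but not a preemptible-RCU one, so the free
	 * must wait for an RCU-sched grace period.
	 */
	call_rcu_sched(&batch->rcu, pte_free_rcu_callback);
}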

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 02/21] powerpc: Use call_rcu_sched() for pagetables
  2010-11-27 10:33     ` Nick Piggin
@ 2010-11-27 21:55       ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 157+ messages in thread
From: Benjamin Herrenschmidt @ 2010-11-27 21:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrea Arcangeli, Avi Kivity, Thomas Gleixner,
	Rik van Riel, Ingo Molnar, akpm, Linus Torvalds, linux-kernel,
	linux-arch, linux-mm, David Miller, Hugh Dickins, Mel Gorman,
	Paul McKenney, Yanmin Zhang, Stephen Rothwell, Nick Piggin

On Sat, 2010-11-27 at 21:33 +1100, Nick Piggin wrote:
> Can this go through powerpc tree as a bugfix?
> 
> On Fri, Nov 26, 2010 at 03:38:45PM +0100, Peter Zijlstra wrote:
> > PowerPC relies on IRQ-disable to guard against RCU quiescent states,
> > use the appropriate RCU call version.
> > 

I'm happy to pick that up tomorrow.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 03/21] mm: Improve page_lock_anon_vma() comment
  2010-11-26 14:38   ` Peter Zijlstra
@ 2010-11-29  2:14     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 157+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-29  2:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang,
	Stephen Rothwell

On Fri, 26 Nov 2010 15:38:46 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> A slightly more verbose comment to go along with the trickery in
> page_lock_anon_vma().
> 
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Acked-by: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> LKML-Reference: <1271158226.4807.1107.camel@twins>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 04/21] mm: Rename drop_anon_vma to put_anon_vma
  2010-11-26 14:38   ` Peter Zijlstra
@ 2010-11-29  2:16     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 157+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-29  2:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang,
	Stephen Rothwell

On Fri, 26 Nov 2010 15:38:47 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> The normal code pattern used in the kernel is: get/put.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 05/21] mm: Move anon_vma ref out from under CONFIG_KSM
  2010-11-26 14:38   ` Peter Zijlstra
@ 2010-11-29  2:19     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 157+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-29  2:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang,
	Stephen Rothwell

On Fri, 26 Nov 2010 15:38:48 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> We need an anon_vma refcount for preemptible anon_vma->lock.
> 
> Acked-by: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 06/21] mm: Simplify anon_vma refcounts
  2010-11-26 14:38   ` Peter Zijlstra
@ 2010-11-29  2:30     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 157+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-29  2:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang,
	Stephen Rothwell

On Fri, 26 Nov 2010 15:38:49 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> This patch changes the anon_vma refcount to be 0 when the object is
> free. It does this by adding 1 ref to being in use in the anon_vma
> structure (iow. the anon_vma->head list is not empty).
> 
> This allows a simpler release scheme without having to check both the
> refcount and the list as well as avoids taking a ref for each entry
> on the list.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 07/21] mm: Use refcounts for page_lock_anon_vma()
  2010-11-26 14:38   ` Peter Zijlstra
@ 2010-11-29  2:35     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 157+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-29  2:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang,
	Stephen Rothwell

On Fri, 26 Nov 2010 15:38:50 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Convert page_lock_anon_vma() over to use refcounts. This is
> done for ease of conversion of anon_vma from spinlock to mutex.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Does this patch affect only page_referenced() and try_to_unmap() ?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 08/21] mm: Preemptible mmu_gather
  2010-11-26 14:38   ` Peter Zijlstra
@ 2010-11-29  2:53     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 157+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-29  2:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang,
	Stephen Rothwell, Martin Schwidefsky, Russell King, Paul Mundt,
	Jeff Dike, Tony Luck

On Fri, 26 Nov 2010 15:38:51 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Make mmu_gather preemptible by using a small on-stack list and use
> an optional allocation to speed things up.
> 
> Preemptible mmu_gather is desired in general and usable once
> i_mmap_lock becomes a mutex. Doing it before the mutex conversion
> saves us from having to rework the code by moving the mmu_gather
> bits inside the i_mmap_lock.
> 
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: David Miller <davem@davemloft.net>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Russell King <rmk@arm.linux.org.uk>
> Cc: Paul Mundt <lethal@linux-sh.org>
> Cc: Jeff Dike <jdike@addtoit.com>
> Cc: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Interesting. Hmm, how about using the first freed page as tlb->pages
rather than calling alloc_page()? No benefit?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 18/21] mutex: Provide mutex_is_contended
  2010-11-26 14:39   ` Peter Zijlstra
@ 2010-11-29  2:58     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 157+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-29  2:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang,
	Stephen Rothwell

On Fri, 26 Nov 2010 15:39:01 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Usable for lock-breaks and such.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Could you update the API list in Documentation/mutex-design.txt?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 19/21] mm: Convert i_mmap_lock and anon_vma->lock to mutexes
  2010-11-26 14:39   ` Peter Zijlstra
@ 2010-11-29  3:05     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 157+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-29  3:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang,
	Stephen Rothwell

On Fri, 26 Nov 2010 15:39:02 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Straight fwd conversion of i_mmap_lock and anon_vma->lock to mutexes.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

No performance impact?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 20/21] mm: Extended batches for generic mmu_gather
  2010-11-26 14:39   ` Peter Zijlstra
@ 2010-11-29  3:11     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 157+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-29  3:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang,
	Stephen Rothwell

On Fri, 26 Nov 2010 15:39:03 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Instead of using a single batch (the small on-stack, or an allocated
> page), try and extend the batch every time it runs out and only flush
> once either the extension fails or we're done.
> 
> Requested-by: Nick Piggin <npiggin@suse.de>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
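
To illustrate the "extend rather than flush" idea quoted above, a rough
sketch with hypothetical names; it assumes the gather structure keeps an
"active" pointer to the batch currently being filled:

	struct mmu_batch_sketch {
		struct mmu_batch_sketch	*next;
		unsigned int		nr;
		unsigned int		max;
		struct page		*pages[0];
	};

	struct mmu_gather_ext_sketch {
		struct mmu_batch_sketch	*active;	/* batch currently being filled */
		/* ... the rest of the mmu_gather fields ... */
	};

	/* Called when the active batch is full: chain another one if possible. */
	static bool tlb_extend_batch_sketch(struct mmu_gather_ext_sketch *tlb)
	{
		struct mmu_batch_sketch *batch;

		batch = (void *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
		if (!batch)
			return false;	/* extension failed: caller flushes now */

		batch->next = NULL;
		batch->nr   = 0;
		batch->max  = (PAGE_SIZE - sizeof(*batch)) / sizeof(struct page *);

		tlb->active->next = batch;	/* keep gathering, flush only once */
		tlb->active       = batch;
		return true;
	}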



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 21/21] mm: Optimize page_lock_anon_vma() fast-path
  2010-11-26 14:39   ` Peter Zijlstra
@ 2010-11-29  3:22     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 157+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-29  3:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang,
	Stephen Rothwell

On Fri, 26 Nov 2010 15:39:04 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Optimize the page_lock_anon_vma() fast path to be one LOCKed op,
> instead of two.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

The reference count seems to make things easier :)



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
  2010-11-26 14:38 ` Peter Zijlstra
@ 2010-11-29  9:00   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 157+ messages in thread
From: Benjamin Herrenschmidt @ 2010-11-29  9:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Paul McKenney, Yanmin Zhang, Stephen Rothwell


> Linus seems to want this to get some serious review and be pushed through
> Andrew (well, anybody but me actually, but since Andrew is mm master...)
> 
> Ben, Thomas, Andrew, can I trick you guys into looking at this stuff?

It's on my hot todo list this week :-)

I'd like to take out the rcu fix for ppc page table freeing tho and send
it to Linus now if you're ok with that.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
  2010-11-29  9:00   ` Benjamin Herrenschmidt
@ 2010-11-29 11:41     ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-29 11:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Paul McKenney, Yanmin Zhang, Stephen Rothwell

On Mon, 2010-11-29 at 20:00 +1100, Benjamin Herrenschmidt wrote:
> > Linus seems to want this to get some serious review and be pushed through
> > Andrew (well, anybody but me actually, but since Andrew is mm master...)
> > 
> > Ben, Thomas, Andrew, can I trick you guys into looking at this stuff?
> 
> It's on my hot todo list this week :-)
> 
> I'd like to take out the rcu fix for ppc page table freeing tho and send
> it to Linus now if you're ok with that.

Sure, and thanks!

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 07/21] mm: Use refcounts for page_lock_anon_vma()
  2010-11-29  2:35     ` KAMEZAWA Hiroyuki
@ 2010-11-29 20:41       ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-29 20:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang,
	Stephen Rothwell

On Mon, 2010-11-29 at 11:35 +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 26 Nov 2010 15:38:50 +0100
> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > Convert page_lock_anon_vma() over to use refcounts. This is
> > done ahead of the conversion of anon_vma from spinlock to mutex.
> > 
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Does this patch affect only page_referenced() and try_to_unmap() ?

git grep page_lock_anon_vma reveals that
mm/memory-failure.c:collect_procs_anon() is also affected.
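
The refcount-based pattern those callers end up using looks roughly like
this. A simplified sketch only: the real code takes rcu_read_lock() around
the page->mapping dereference and has to recheck things after sleeping,
and put_anon_vma() here stands in for whatever drops the reference in the
series.

	struct anon_vma *page_lock_anon_vma_sketch(struct page *page)
	{
		struct anon_vma *anon_vma = page_anon_vma(page);

		if (!anon_vma)
			return NULL;

		/* Pin it so it cannot go away while we sleep on the lock. */
		if (!atomic_inc_not_zero(&anon_vma->refcount))
			return NULL;

		mutex_lock(&anon_vma->lock);	/* sleepable once the lock is a mutex */
		return anon_vma;
	}

	void page_unlock_anon_vma_sketch(struct anon_vma *anon_vma)
	{
		mutex_unlock(&anon_vma->lock);
		put_anon_vma(anon_vma);		/* drop the pin, may free anon_vma */
	}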

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 08/21] mm: Preemptible mmu_gather
  2010-11-29  2:53     ` KAMEZAWA Hiroyuki
@ 2010-11-29 20:47       ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-29 20:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang,
	Stephen Rothwell, Martin Schwidefsky, Russell King, Paul Mundt,
	Jeff Dike, Tony Luck

On Mon, 2010-11-29 at 11:53 +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 26 Nov 2010 15:38:51 +0100
> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > Make mmu_gather preemptible by using a small on-stack list and an
> > optional allocation to speed things up.
> > 
> > Preemptible mmu_gather is desired in general and usable once
> > i_mmap_lock becomes a mutex. Doing it before the mutex conversion
> > saves us from having to rework the code by moving the mmu_gather
> > bits inside the i_mmap_lock.
> > 
> > Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > Cc: David Miller <davem@davemloft.net>
> > Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> > Cc: Russell King <rmk@arm.linux.org.uk>
> > Cc: Paul Mundt <lethal@linux-sh.org>
> > Cc: Jeff Dike <jdike@addtoit.com>
> > Cc: Tony Luck <tony.luck@intel.com>
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Interesting. Hmm, how about using the 1st freed pages as tlb->pages
> rather than calling alloc_page()? No benefits?

We could try that, but we need to be careful there: you need to wait
until after the TLB invalidate, and possibly an RCU grace period for
the page-directory pages (sparc/powerpc).

So doing an optimistic allocation before the TLB invalidate seems saner.

Also, see patch 20.
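
Spelled out, the ordering constraint is roughly the following, a sketch
that follows the shape of the generic tlb_flush_mmu() quoted elsewhere in
this series:

	static void tlb_flush_mmu_sketch(struct mmu_gather *tlb)
	{
		if (!tlb->need_flush)
			return;
		tlb->need_flush = 0;

		tlb_flush(tlb);			/* 1) invalidate the TLBs */
	#ifdef CONFIG_HAVE_RCU_TABLE_FREE
		tlb_table_flush(tlb);		/* 2) page-table pages get RCU-deferred freeing */
	#endif
		/*
		 * 3) Only now may the gathered pages be freed or reused:
		 *    before the invalidate, other CPUs can still have TLB
		 *    entries (or a lockless walk) pointing into them, so a
		 *    just-unmapped page cannot double as tlb->pages storage
		 *    any earlier than this.
		 */
		free_pages_and_swap_cache(tlb->pages, tlb->nr);
		tlb->nr = 0;
	}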

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 18/21] mutex: Provide mutex_is_contended
  2010-11-29  2:58     ` KAMEZAWA Hiroyuki
@ 2010-11-29 20:49       ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-29 20:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang,
	Stephen Rothwell

On Mon, 2010-11-29 at 11:58 +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 26 Nov 2010 15:39:01 +0100
> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > Usable for lock-breaks and such.
> > 
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Could you update the API list in Documentation/mutex-design.txt?

Not sure what to update there; one could add a single line at the end,
but I'm not sure documentation really is the answer here.
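
As an illustration of the "lock-breaks and such" use case, a hypothetical
caller (not code from the series; i_mmap_mutex, i_mmap_list and the list
member name are made up here):

	/* Walk a long list under a mutex, backing off when others want it. */
	mutex_lock(&mapping->i_mmap_mutex);
	list_for_each_entry(vma, &mapping->i_mmap_list, shared_link) {
		do_one_vma(vma);

		if (mutex_is_contended(&mapping->i_mmap_mutex) || need_resched()) {
			mutex_unlock(&mapping->i_mmap_mutex);
			cond_resched();
			mutex_lock(&mapping->i_mmap_mutex);
			/* a real walker must revalidate its position here */
		}
	}
	mutex_unlock(&mapping->i_mmap_mutex);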

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 19/21] mm: Convert i_mmap_lock and anon_vma->lock to mutexes
  2010-11-29  3:05     ` KAMEZAWA Hiroyuki
@ 2010-11-29 20:50       ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-29 20:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang,
	Stephen Rothwell

On Mon, 2010-11-29 at 12:05 +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 26 Nov 2010 15:39:02 +0100
> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > Straight fwd conversion of i_mmap_lock and anon_vma->lock to mutexes.
> > 
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> No performance impact?

I couldn't find any; if there was any effect, it was smaller than the
measurement error and per-boot deviation.

Yanmin also ran this through the Intel test/performance farm and didn't
find anything significant.
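
For reference, the conversion itself is mechanical; an illustrative
before/after in diff style (not lines from the actual patch):

	-	spinlock_t		i_mmap_lock;	/* in struct address_space */
	+	struct mutex		i_mmap_lock;

	-	spin_lock(&mapping->i_mmap_lock);
	+	mutex_lock(&mapping->i_mmap_lock);	/* may now sleep */

	-	spin_lock(&anon_vma->lock);
	+	mutex_lock(&anon_vma->lock);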

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 01/21] mm: Revert page_lock_anon_vma() lock annotation
  2010-11-26 14:38   ` Peter Zijlstra
@ 2010-11-30  1:19     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 157+ messages in thread
From: KOSAKI Motohiro @ 2010-11-30  1:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Andrea Arcangeli, Avi Kivity, Thomas Gleixner,
	Rik van Riel, Ingo Molnar, akpm, Linus Torvalds, linux-kernel,
	linux-arch, linux-mm, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Paul McKenney,
	Yanmin Zhang, Stephen Rothwell, Namhyung Kim

> It's beyond ugly and gets in the way.
> 
> Cc: Namhyung Kim <namhyung@gmail.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

+1

/me run away.





^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 07/21] mm: Use refcounts for page_lock_anon_vma()
  2010-11-29  2:35     ` KAMEZAWA Hiroyuki
@ 2010-11-30  1:21       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 157+ messages in thread
From: KOSAKI Motohiro @ 2010-11-30  1:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, Peter Zijlstra, Andrea Arcangeli, Avi Kivity,
	Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm, Linus Torvalds,
	linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Paul McKenney, Yanmin Zhang, Stephen Rothwell

> On Fri, 26 Nov 2010 15:38:50 +0100
> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > Convert page_lock_anon_vma() over to use refcounts. This is
> > done ahead of the conversion of anon_vma from spinlock to mutex.
> > 
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Does this patch affect only page_referenced() and try_to_unmap() ?

I like this conversion too.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>





^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 19/21] mm: Convert i_mmap_lock and anon_vma->lock to mutexes
  2010-11-26 14:39   ` Peter Zijlstra
@ 2010-11-30  1:28     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 157+ messages in thread
From: KOSAKI Motohiro @ 2010-11-30  1:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Andrea Arcangeli, Avi Kivity, Thomas Gleixner,
	Rik van Riel, Ingo Molnar, akpm, Linus Torvalds, linux-kernel,
	linux-arch, linux-mm, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Paul McKenney,
	Yanmin Zhang, Stephen Rothwell

> Straight fwd conversion of i_mmap_lock and anon_vma->lock to mutexes.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

+1.

I don't think this conversion hurts performance.




^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 16/21] mm, powerpc: Move the RCU page-table freeing into generic code
  2010-11-26 14:38   ` Peter Zijlstra
@ 2010-11-30  3:05     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 157+ messages in thread
From: Benjamin Herrenschmidt @ 2010-11-30  3:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Paul McKenney, Yanmin Zhang, Stephen Rothwell

On Fri, 2010-11-26 at 15:38 +0100, Peter Zijlstra wrote:
> plain text document attachment (mm-preempt-tlb-gather-rcu.patch)
> In case other architectures require RCU freed page-tables to implement
> gup_fast() and software filled hashes and similar things, provide the
> means to do so by moving the logic into generic code.

A little bit of build breakage on our side comes from this patch,
which seems to mostly be fixed by changing:

+	select HAVE_RCU_TABLE_FREE if PPC64

to:

+	select HAVE_RCU_TABLE_FREE if SMP

There's some additional breakage for the 64-bit BookE case, but
that's a different patch (and a different email) :-) I'll do some
runtime testing now.

Cheers,
Ben.


> Requested-by: David Miller <davem@davemloft.net>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  arch/Kconfig                       |    3 +
>  arch/powerpc/Kconfig               |    1 
>  arch/powerpc/include/asm/pgalloc.h |   21 ++++++-
>  arch/powerpc/include/asm/tlb.h     |   10 ---
>  arch/powerpc/mm/pgtable.c          |   98 -------------------------------------
>  arch/powerpc/mm/tlb_hash32.c       |    3 -
>  arch/powerpc/mm/tlb_hash64.c       |    3 -
>  arch/powerpc/mm/tlb_nohash.c       |    3 -
>  include/asm-generic/tlb.h          |   57 +++++++++++++++++++--
>  mm/memory.c                        |   77 +++++++++++++++++++++++++++++
>  10 files changed, 151 insertions(+), 125 deletions(-)
> 
> Index: linux-2.6/arch/powerpc/include/asm/pgalloc.h
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/include/asm/pgalloc.h
> +++ linux-2.6/arch/powerpc/include/asm/pgalloc.h
> @@ -31,14 +31,29 @@ static inline void pte_free(struct mm_st
>  #endif
>  
>  #ifdef CONFIG_SMP
> -extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift);
> -extern void pte_free_finish(struct mmu_gather *tlb);
> +struct mmu_gather;
> +extern void tlb_remove_table(struct mmu_gather *, void *);
> +
> +static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift)
> +{
> +	unsigned long pgf = (unsigned long)table;
> +	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
> +	pgf |= shift;
> +	tlb_remove_table(tlb, (void *)pgf);
> +}
> +
> +static inline void __tlb_remove_table(void *_table)
> +{
> +	void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
> +	unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
> +
> +	pgtable_free(table, shift);
> +}
>  #else /* CONFIG_SMP */
>  static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
>  {
>  	pgtable_free(table, shift);
>  }
> -static inline void pte_free_finish(struct mmu_gather *tlb) { }
>  #endif /* !CONFIG_SMP */
>  
>  static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
> Index: linux-2.6/arch/powerpc/include/asm/tlb.h
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/include/asm/tlb.h
> +++ linux-2.6/arch/powerpc/include/asm/tlb.h
> @@ -28,16 +28,6 @@
>  #define tlb_start_vma(tlb, vma)	do { } while (0)
>  #define tlb_end_vma(tlb, vma)	do { } while (0)
>  
> -#define HAVE_ARCH_MMU_GATHER 1
> -
> -struct pte_freelist_batch;
> -
> -struct arch_mmu_gather {
> -	struct pte_freelist_batch *batch;
> -};
> -
> -#define ARCH_MMU_GATHER_INIT (struct arch_mmu_gather){ .batch = NULL, }
> -
>  extern void tlb_flush(struct mmu_gather *tlb);
>  
>  /* Get the generic bits... */
> Index: linux-2.6/arch/powerpc/mm/pgtable.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/pgtable.c
> +++ linux-2.6/arch/powerpc/mm/pgtable.c
> @@ -33,104 +33,6 @@
>  
>  #include "mmu_decl.h"
>  
> -#ifdef CONFIG_SMP
> -
> -/*
> - * Handle batching of page table freeing on SMP. Page tables are
> - * queued up and send to be freed later by RCU in order to avoid
> - * freeing a page table page that is being walked without locks
> - */
> -
> -static unsigned long pte_freelist_forced_free;
> -
> -struct pte_freelist_batch
> -{
> -	struct rcu_head	rcu;
> -	unsigned int	index;
> -	unsigned long	tables[0];
> -};
> -
> -#define PTE_FREELIST_SIZE \
> -	((PAGE_SIZE - sizeof(struct pte_freelist_batch)) \
> -	  / sizeof(unsigned long))
> -
> -static void pte_free_smp_sync(void *arg)
> -{
> -	/* Do nothing, just ensure we sync with all CPUs */
> -}
> -
> -/* This is only called when we are critically out of memory
> - * (and fail to get a page in pte_free_tlb).
> - */
> -static void pgtable_free_now(void *table, unsigned shift)
> -{
> -	pte_freelist_forced_free++;
> -
> -	smp_call_function(pte_free_smp_sync, NULL, 1);
> -
> -	pgtable_free(table, shift);
> -}
> -
> -static void pte_free_rcu_callback(struct rcu_head *head)
> -{
> -	struct pte_freelist_batch *batch =
> -		container_of(head, struct pte_freelist_batch, rcu);
> -	unsigned int i;
> -
> -	for (i = 0; i < batch->index; i++) {
> -		void *table = (void *)(batch->tables[i] & ~MAX_PGTABLE_INDEX_SIZE);
> -		unsigned shift = batch->tables[i] & MAX_PGTABLE_INDEX_SIZE;
> -
> -		pgtable_free(table, shift);
> -	}
> -
> -	free_page((unsigned long)batch);
> -}
> -
> -static void pte_free_submit(struct pte_freelist_batch *batch)
> -{
> -	call_rcu_sched(&batch->rcu, pte_free_rcu_callback);
> -}
> -
> -void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
> -{
> -	struct pte_freelist_batch **batchp = &tlb->arch.batch;
> -	unsigned long pgf;
> -
> -	if (atomic_read(&tlb->mm->mm_users) < 2) {
> -		pgtable_free(table, shift);
> -		return;
> -	}
> -
> -	if (*batchp == NULL) {
> -		*batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC);
> -		if (*batchp == NULL) {
> -			pgtable_free_now(table, shift);
> -			return;
> -		}
> -		(*batchp)->index = 0;
> -	}
> -	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
> -	pgf = (unsigned long)table | shift;
> -	(*batchp)->tables[(*batchp)->index++] = pgf;
> -	if ((*batchp)->index == PTE_FREELIST_SIZE) {
> -		pte_free_submit(*batchp);
> -		*batchp = NULL;
> -	}
> -}
> -
> -void pte_free_finish(struct mmu_gather *tlb)
> -{
> -	struct pte_freelist_batch **batchp = &tlb->arch.batch;
> -
> -	if (*batchp == NULL)
> -		return;
> -	pte_free_submit(*batchp);
> -	*batchp = NULL;
> -}
> -
> -#endif /* CONFIG_SMP */
> -
>  static inline int is_exec_fault(void)
>  {
>  	return current->thread.regs && TRAP(current->thread.regs) == 0x400;
> Index: linux-2.6/arch/powerpc/mm/tlb_hash32.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/tlb_hash32.c
> +++ linux-2.6/arch/powerpc/mm/tlb_hash32.c
> @@ -71,9 +71,6 @@ void tlb_flush(struct mmu_gather *tlb)
>  		 */
>  		_tlbia();
>  	}
> -
> -	/* Push out batch of freed page tables */
> -	pte_free_finish(tlb);
>  }
>  
>  /*
> Index: linux-2.6/arch/powerpc/mm/tlb_hash64.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/tlb_hash64.c
> +++ linux-2.6/arch/powerpc/mm/tlb_hash64.c
> @@ -165,9 +165,6 @@ void tlb_flush(struct mmu_gather *tlb)
>  		__flush_tlb_pending(tlbbatch);
>  
>  	put_cpu_var(ppc64_tlb_batch);
> -
> -	/* Push out batch of freed page tables */
> -	pte_free_finish(tlb);
>  }
>  
>  /**
> Index: linux-2.6/arch/powerpc/mm/tlb_nohash.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/tlb_nohash.c
> +++ linux-2.6/arch/powerpc/mm/tlb_nohash.c
> @@ -299,9 +299,6 @@ EXPORT_SYMBOL(flush_tlb_range);
>  void tlb_flush(struct mmu_gather *tlb)
>  {
>  	flush_tlb_mm(tlb->mm);
> -
> -	/* Push out batch of freed page tables */
> -	pte_free_finish(tlb);
>  }
>  
>  /*
> Index: linux-2.6/include/asm-generic/tlb.h
> ===================================================================
> --- linux-2.6.orig/include/asm-generic/tlb.h
> +++ linux-2.6/include/asm-generic/tlb.h
> @@ -27,6 +27,49 @@
>    #define tlb_fast_mode(tlb) 1
>  #endif
>  
> +#ifdef CONFIG_HAVE_RCU_TABLE_FREE
> +/*
> + * Semi RCU freeing of the page directories.
> + *
> + * This is needed by some architectures to implement software pagetable walkers.
> + *
> + * gup_fast() and other software pagetable walkers do a lockless page-table
> + * walk and therefore needs some synchronization with the freeing of the page
> + * directories. The chosen means to accomplish that is by disabling IRQs over
> + * the walk.
> + *
> + * Architectures that use IPIs to flush TLBs will then automagically DTRT,
> + * since we unlink the page, flush TLBs, free the page. Since the disabling of
> + * IRQs delays the copmletion of the TLB flush we can never observe an already
> + * freed page.
> + *
> + * Architectures that do not have this (PPC) need to delay the freeing by some
> + * other means, this is that means.
> + *
> + * What we do is batch the freed directory pages (tables) and RCU free them.
> + * We use the sched RCU variant, as that guarantees that IRQ/preempt disabling
> + * holds off grace periods.
> + *
> + * However, in order to batch these pages we need to allocate storage, this
> + * allocation is deep inside the MM code and can thus easily fail on memory
> + * pressure. To guarantee progress we fall back to single table freeing, see
> + * the implementation of tlb_remove_table_one().
> + *
> + */
> +struct mmu_table_batch {
> +	struct rcu_head		rcu;
> +	unsigned int		nr;
> +	void			*tables[0];
> +};
> +
> +#define MAX_TABLE_BATCH		\
> +	((PAGE_SIZE - sizeof(struct mmu_table_batch)) / sizeof(void *))
> +
> +extern void tlb_table_flush(struct mmu_gather *tlb);
> +extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
> +
> +#endif
> +
>  /* struct mmu_gather is an opaque type used by the mm code for passing around
>   * any data needed by arch specific code for tlb_remove_page.
>   */
> @@ -36,11 +79,12 @@ struct mmu_gather {
>  	unsigned int		max;	/* nr < max */
>  	unsigned int		need_flush;/* Really unmapped some ptes? */
>  	unsigned int		fullmm; /* non-zero means full mm flush */
> -#ifdef HAVE_ARCH_MMU_GATHER
> -	struct arch_mmu_gather	arch;
> -#endif
>  	struct page		**pages;
>  	struct page		*local[8];
> +
> +#ifdef CONFIG_HAVE_RCU_TABLE_FREE
> +	struct mmu_table_batch	*batch;
> +#endif
>  };
>  
>  static inline void __tlb_alloc_pages(struct mmu_gather *tlb)
> @@ -72,8 +116,8 @@ tlb_gather_mmu(struct mmu_gather *tlb, s
>  
>  	tlb->fullmm = full_mm_flush;
>  
> -#ifdef HAVE_ARCH_MMU_GATHER
> -	tlb->arch = ARCH_MMU_GATHER_INIT;
> +#ifdef CONFIG_HAVE_RCU_TABLE_FREE
> +	tlb->batch = NULL;
>  #endif
>  }
>  
> @@ -84,6 +128,9 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
>  		return;
>  	tlb->need_flush = 0;
>  	tlb_flush(tlb);
> +#ifdef CONFIG_HAVE_RCU_TABLE_FREE
> +	tlb_table_flush(tlb);
> +#endif
>  	if (!tlb_fast_mode(tlb)) {
>  		free_pages_and_swap_cache(tlb->pages, tlb->nr);
>  		tlb->nr = 0;
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -193,6 +193,83 @@ static void check_sync_rss_stat(struct t
>  
>  #endif
>  
> +#ifdef CONFIG_HAVE_RCU_TABLE_FREE
> +
> +/*
> + * See the comment near struct mmu_table_batch.
> + */
> +
> +static void tlb_remove_table_smp_sync(void *arg)
> +{
> +	/* Simply deliver the interrupt */
> +}
> +
> +static void tlb_remove_table_one(void *table)
> +{
> +	/*
> +	 * This isn't an RCU grace period and hence the page-tables cannot be
> +	 * assumed to be actually RCU-freed.
> +	 *
> +	 * It is however sufficient for software page-table walkers that rely on
> +	 * IRQ disabling. See the comment near struct mmu_table_batch.
> +	 */
> +	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
> +	__tlb_remove_table(table);
> +}
> +
> +static void tlb_remove_table_rcu(struct rcu_head *head)
> +{
> +	struct mmu_table_batch *batch;
> +	int i;
> +
> +	batch = container_of(head, struct mmu_table_batch, rcu);
> +
> +	for (i = 0; i < batch->nr; i++)
> +		__tlb_remove_table(batch->tables[i]);
> +
> +	free_page((unsigned long)batch);
> +}
> +
> +void tlb_table_flush(struct mmu_gather *tlb)
> +{
> +	struct mmu_table_batch **batch = &tlb->batch;
> +
> +	if (*batch) {
> +		call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
> +		*batch = NULL;
> +	}
> +}
> +
> +void tlb_remove_table(struct mmu_gather *tlb, void *table)
> +{
> +	struct mmu_table_batch **batch = &tlb->batch;
> +
> +	tlb->need_flush = 1;
> +
> +	/*
> +	 * When there's less then two users of this mm there cannot be a
> +	 * concurrent page-table walk.
> +	 */
> +	if (atomic_read(&tlb->mm->mm_users) < 2) {
> +		__tlb_remove_table(table);
> +		return;
> +	}
> +
> +	if (*batch == NULL) {
> +		*batch = (struct mmu_table_batch *)__get_free_page(GFP_ATOMIC);
> +		if (*batch == NULL) {
> +			tlb_remove_table_one(table);
> +			return;
> +		}
> +		(*batch)->nr = 0;
> +	}
> +	(*batch)->tables[(*batch)->nr++] = table;
> +	if ((*batch)->nr == MAX_TABLE_BATCH)
> +		tlb_table_flush(tlb);
> +}
> +
> +#endif
> +
>  /*
>   * If a p?d_bad entry is found while walking page tables, report
>   * the error, before resetting entry to p?d_none.  Usually (but
> Index: linux-2.6/arch/Kconfig
> ===================================================================
> --- linux-2.6.orig/arch/Kconfig
> +++ linux-2.6/arch/Kconfig
> @@ -175,4 +175,7 @@ config HAVE_PERF_EVENTS_NMI
>  config HAVE_ARCH_JUMP_LABEL
>  	bool
>  
> +config HAVE_RCU_TABLE_FREE
> +	bool
> +
>  source "kernel/gcov/Kconfig"
> Index: linux-2.6/arch/powerpc/Kconfig
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/Kconfig
> +++ linux-2.6/arch/powerpc/Kconfig
> @@ -140,6 +140,7 @@ config PPC
>  	select HAVE_PERF_EVENTS
>  	select HAVE_REGS_AND_STACK_ACCESS_API
>  	select HAVE_HW_BREAKPOINT if PERF_EVENTS && PPC_BOOK3S_64
> +	select HAVE_RCU_TABLE_FREE if PPC64
>  
>  config EARLY_PRINTK
>  	bool
> 
> 



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 16/21] mm, powerpc: Move the RCU page-table freeing into generic code
@ 2010-11-30  3:05     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 157+ messages in thread
From: Benjamin Herrenschmidt @ 2010-11-30  3:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Paul McKenney, Yanmin Zhang, Stephen Rothwell

On Fri, 2010-11-26 at 15:38 +0100, Peter Zijlstra wrote:
> plain text document attachment (mm-preempt-tlb-gather-rcu.patch)
> In case other architectures require RCU freed page-tables to implement
> gup_fast() and software filled hashes and similar things, provide the
> means to do so by moving the logic into generic code.

A little bit of build breakage on our side comes from this patch,
which seems to mostly be fixed by changing:

+	select HAVE_RCU_TABLE_FREE if PPC64

to:

+	select HAVE_RCU_TABLE_FREE if SMP

There's some additional breakage for the 64-bit BookE case, but
that's a different patch (and a different email) :-) I'll do some
runtime testing now.

Cheers,
Ben.


> Requested-by: David Miller <davem@davemloft.net>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  arch/Kconfig                       |    3 +
>  arch/powerpc/Kconfig               |    1 
>  arch/powerpc/include/asm/pgalloc.h |   21 ++++++-
>  arch/powerpc/include/asm/tlb.h     |   10 ---
>  arch/powerpc/mm/pgtable.c          |   98 -------------------------------------
>  arch/powerpc/mm/tlb_hash32.c       |    3 -
>  arch/powerpc/mm/tlb_hash64.c       |    3 -
>  arch/powerpc/mm/tlb_nohash.c       |    3 -
>  include/asm-generic/tlb.h          |   57 +++++++++++++++++++--
>  mm/memory.c                        |   77 +++++++++++++++++++++++++++++
>  10 files changed, 151 insertions(+), 125 deletions(-)
> 
> Index: linux-2.6/arch/powerpc/include/asm/pgalloc.h
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/include/asm/pgalloc.h
> +++ linux-2.6/arch/powerpc/include/asm/pgalloc.h
> @@ -31,14 +31,29 @@ static inline void pte_free(struct mm_st
>  #endif
>  
>  #ifdef CONFIG_SMP
> -extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift);
> -extern void pte_free_finish(struct mmu_gather *tlb);
> +struct mmu_gather;
> +extern void tlb_remove_table(struct mmu_gather *, void *);
> +
> +static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift)
> +{
> +	unsigned long pgf = (unsigned long)table;
> +	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
> +	pgf |= shift;
> +	tlb_remove_table(tlb, (void *)pgf);
> +}
> +
> +static inline void __tlb_remove_table(void *_table)
> +{
> +	void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
> +	unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
> +
> +	pgtable_free(table, shift);
> +}
>  #else /* CONFIG_SMP */
>  static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
>  {
>  	pgtable_free(table, shift);
>  }
> -static inline void pte_free_finish(struct mmu_gather *tlb) { }
>  #endif /* !CONFIG_SMP */
>  
>  static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
> Index: linux-2.6/arch/powerpc/include/asm/tlb.h
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/include/asm/tlb.h
> +++ linux-2.6/arch/powerpc/include/asm/tlb.h
> @@ -28,16 +28,6 @@
>  #define tlb_start_vma(tlb, vma)	do { } while (0)
>  #define tlb_end_vma(tlb, vma)	do { } while (0)
>  
> -#define HAVE_ARCH_MMU_GATHER 1
> -
> -struct pte_freelist_batch;
> -
> -struct arch_mmu_gather {
> -	struct pte_freelist_batch *batch;
> -};
> -
> -#define ARCH_MMU_GATHER_INIT (struct arch_mmu_gather){ .batch = NULL, }
> -
>  extern void tlb_flush(struct mmu_gather *tlb);
>  
>  /* Get the generic bits... */
> Index: linux-2.6/arch/powerpc/mm/pgtable.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/pgtable.c
> +++ linux-2.6/arch/powerpc/mm/pgtable.c
> @@ -33,104 +33,6 @@
>  
>  #include "mmu_decl.h"
>  
> -#ifdef CONFIG_SMP
> -
> -/*
> - * Handle batching of page table freeing on SMP. Page tables are
> - * queued up and send to be freed later by RCU in order to avoid
> - * freeing a page table page that is being walked without locks
> - */
> -
> -static unsigned long pte_freelist_forced_free;
> -
> -struct pte_freelist_batch
> -{
> -	struct rcu_head	rcu;
> -	unsigned int	index;
> -	unsigned long	tables[0];
> -};
> -
> -#define PTE_FREELIST_SIZE \
> -	((PAGE_SIZE - sizeof(struct pte_freelist_batch)) \
> -	  / sizeof(unsigned long))
> -
> -static void pte_free_smp_sync(void *arg)
> -{
> -	/* Do nothing, just ensure we sync with all CPUs */
> -}
> -
> -/* This is only called when we are critically out of memory
> - * (and fail to get a page in pte_free_tlb).
> - */
> -static void pgtable_free_now(void *table, unsigned shift)
> -{
> -	pte_freelist_forced_free++;
> -
> -	smp_call_function(pte_free_smp_sync, NULL, 1);
> -
> -	pgtable_free(table, shift);
> -}
> -
> -static void pte_free_rcu_callback(struct rcu_head *head)
> -{
> -	struct pte_freelist_batch *batch =
> -		container_of(head, struct pte_freelist_batch, rcu);
> -	unsigned int i;
> -
> -	for (i = 0; i < batch->index; i++) {
> -		void *table = (void *)(batch->tables[i] & ~MAX_PGTABLE_INDEX_SIZE);
> -		unsigned shift = batch->tables[i] & MAX_PGTABLE_INDEX_SIZE;
> -
> -		pgtable_free(table, shift);
> -	}
> -
> -	free_page((unsigned long)batch);
> -}
> -
> -static void pte_free_submit(struct pte_freelist_batch *batch)
> -{
> -	call_rcu_sched(&batch->rcu, pte_free_rcu_callback);
> -}
> -
> -void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
> -{
> -	struct pte_freelist_batch **batchp = &tlb->arch.batch;
> -	unsigned long pgf;
> -
> -	if (atomic_read(&tlb->mm->mm_users) < 2) {
> -		pgtable_free(table, shift);
> -		return;
> -	}
> -
> -	if (*batchp == NULL) {
> -		*batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC);
> -		if (*batchp == NULL) {
> -			pgtable_free_now(table, shift);
> -			return;
> -		}
> -		(*batchp)->index = 0;
> -	}
> -	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
> -	pgf = (unsigned long)table | shift;
> -	(*batchp)->tables[(*batchp)->index++] = pgf;
> -	if ((*batchp)->index == PTE_FREELIST_SIZE) {
> -		pte_free_submit(*batchp);
> -		*batchp = NULL;
> -	}
> -}
> -
> -void pte_free_finish(struct mmu_gather *tlb)
> -{
> -	struct pte_freelist_batch **batchp = &tlb->arch.batch;
> -
> -	if (*batchp == NULL)
> -		return;
> -	pte_free_submit(*batchp);
> -	*batchp = NULL;
> -}
> -
> -#endif /* CONFIG_SMP */
> -
>  static inline int is_exec_fault(void)
>  {
>  	return current->thread.regs && TRAP(current->thread.regs) == 0x400;
> Index: linux-2.6/arch/powerpc/mm/tlb_hash32.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/tlb_hash32.c
> +++ linux-2.6/arch/powerpc/mm/tlb_hash32.c
> @@ -71,9 +71,6 @@ void tlb_flush(struct mmu_gather *tlb)
>  		 */
>  		_tlbia();
>  	}
> -
> -	/* Push out batch of freed page tables */
> -	pte_free_finish(tlb);
>  }
>  
>  /*
> Index: linux-2.6/arch/powerpc/mm/tlb_hash64.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/tlb_hash64.c
> +++ linux-2.6/arch/powerpc/mm/tlb_hash64.c
> @@ -165,9 +165,6 @@ void tlb_flush(struct mmu_gather *tlb)
>  		__flush_tlb_pending(tlbbatch);
>  
>  	put_cpu_var(ppc64_tlb_batch);
> -
> -	/* Push out batch of freed page tables */
> -	pte_free_finish(tlb);
>  }
>  
>  /**
> Index: linux-2.6/arch/powerpc/mm/tlb_nohash.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/tlb_nohash.c
> +++ linux-2.6/arch/powerpc/mm/tlb_nohash.c
> @@ -299,9 +299,6 @@ EXPORT_SYMBOL(flush_tlb_range);
>  void tlb_flush(struct mmu_gather *tlb)
>  {
>  	flush_tlb_mm(tlb->mm);
> -
> -	/* Push out batch of freed page tables */
> -	pte_free_finish(tlb);
>  }
>  
>  /*
> Index: linux-2.6/include/asm-generic/tlb.h
> ===================================================================
> --- linux-2.6.orig/include/asm-generic/tlb.h
> +++ linux-2.6/include/asm-generic/tlb.h
> @@ -27,6 +27,49 @@
>    #define tlb_fast_mode(tlb) 1
>  #endif
>  
> +#ifdef CONFIG_HAVE_RCU_TABLE_FREE
> +/*
> + * Semi RCU freeing of the page directories.
> + *
> + * This is needed by some architectures to implement software pagetable walkers.
> + *
> + * gup_fast() and other software pagetable walkers do a lockless page-table
> + * walk and therefore need some synchronization with the freeing of the page
> + * directories. The chosen means to accomplish that is by disabling IRQs over
> + * the walk.
> + *
> + * Architectures that use IPIs to flush TLBs will then automagically DTRT,
> + * since we unlink the page, flush TLBs, then free the page. Since disabling
> + * IRQs delays the completion of the TLB flush, we can never observe an
> + * already freed page.
> + *
> + * Architectures that do not have this (PPC) need to delay the freeing by some
> + * other means; this is that means.
> + *
> + * What we do is batch the freed directory pages (tables) and RCU free them.
> + * We use the sched RCU variant, as that guarantees that IRQ/preempt disabling
> + * holds off grace periods.
> + *
> + * However, in order to batch these pages we need to allocate storage; this
> + * allocation is deep inside the MM code and can thus easily fail on memory
> + * pressure. To guarantee progress we fall back to single table freeing, see
> + * the implementation of tlb_remove_table_one().
> + *
> + */
> +struct mmu_table_batch {
> +	struct rcu_head		rcu;
> +	unsigned int		nr;
> +	void			*tables[0];
> +};
> +
> +#define MAX_TABLE_BATCH		\
> +	((PAGE_SIZE - sizeof(struct mmu_table_batch)) / sizeof(void *))
> +
> +extern void tlb_table_flush(struct mmu_gather *tlb);
> +extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
> +
> +#endif
> +
>  /* struct mmu_gather is an opaque type used by the mm code for passing around
>   * any data needed by arch specific code for tlb_remove_page.
>   */
> @@ -36,11 +79,12 @@ struct mmu_gather {
>  	unsigned int		max;	/* nr < max */
>  	unsigned int		need_flush;/* Really unmapped some ptes? */
>  	unsigned int		fullmm; /* non-zero means full mm flush */
> -#ifdef HAVE_ARCH_MMU_GATHER
> -	struct arch_mmu_gather	arch;
> -#endif
>  	struct page		**pages;
>  	struct page		*local[8];
> +
> +#ifdef CONFIG_HAVE_RCU_TABLE_FREE
> +	struct mmu_table_batch	*batch;
> +#endif
>  };
>  
>  static inline void __tlb_alloc_pages(struct mmu_gather *tlb)
> @@ -72,8 +116,8 @@ tlb_gather_mmu(struct mmu_gather *tlb, s
>  
>  	tlb->fullmm = full_mm_flush;
>  
> -#ifdef HAVE_ARCH_MMU_GATHER
> -	tlb->arch = ARCH_MMU_GATHER_INIT;
> +#ifdef CONFIG_HAVE_RCU_TABLE_FREE
> +	tlb->batch = NULL;
>  #endif
>  }
>  
> @@ -84,6 +128,9 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
>  		return;
>  	tlb->need_flush = 0;
>  	tlb_flush(tlb);
> +#ifdef CONFIG_HAVE_RCU_TABLE_FREE
> +	tlb_table_flush(tlb);
> +#endif
>  	if (!tlb_fast_mode(tlb)) {
>  		free_pages_and_swap_cache(tlb->pages, tlb->nr);
>  		tlb->nr = 0;
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -193,6 +193,83 @@ static void check_sync_rss_stat(struct t
>  
>  #endif
>  
> +#ifdef CONFIG_HAVE_RCU_TABLE_FREE
> +
> +/*
> + * See the comment near struct mmu_table_batch.
> + */
> +
> +static void tlb_remove_table_smp_sync(void *arg)
> +{
> +	/* Simply deliver the interrupt */
> +}
> +
> +static void tlb_remove_table_one(void *table)
> +{
> +	/*
> +	 * This isn't an RCU grace period and hence the page-tables cannot be
> +	 * assumed to be actually RCU-freed.
> +	 *
> +	 * It is however sufficient for software page-table walkers that rely on
> +	 * IRQ disabling. See the comment near struct mmu_table_batch.
> +	 */
> +	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
> +	__tlb_remove_table(table);
> +}
> +
> +static void tlb_remove_table_rcu(struct rcu_head *head)
> +{
> +	struct mmu_table_batch *batch;
> +	int i;
> +
> +	batch = container_of(head, struct mmu_table_batch, rcu);
> +
> +	for (i = 0; i < batch->nr; i++)
> +		__tlb_remove_table(batch->tables[i]);
> +
> +	free_page((unsigned long)batch);
> +}
> +
> +void tlb_table_flush(struct mmu_gather *tlb)
> +{
> +	struct mmu_table_batch **batch = &tlb->batch;
> +
> +	if (*batch) {
> +		call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
> +		*batch = NULL;
> +	}
> +}
> +
> +void tlb_remove_table(struct mmu_gather *tlb, void *table)
> +{
> +	struct mmu_table_batch **batch = &tlb->batch;
> +
> +	tlb->need_flush = 1;
> +
> +	/*
> +	 * When there are fewer than two users of this mm there cannot be a
> +	 * concurrent page-table walk.
> +	 */
> +	if (atomic_read(&tlb->mm->mm_users) < 2) {
> +		__tlb_remove_table(table);
> +		return;
> +	}
> +
> +	if (*batch == NULL) {
> +		*batch = (struct mmu_table_batch *)__get_free_page(GFP_ATOMIC);
> +		if (*batch == NULL) {
> +			tlb_remove_table_one(table);
> +			return;
> +		}
> +		(*batch)->nr = 0;
> +	}
> +	(*batch)->tables[(*batch)->nr++] = table;
> +	if ((*batch)->nr == MAX_TABLE_BATCH)
> +		tlb_table_flush(tlb);
> +}
> +
> +#endif
> +
>  /*
>   * If a p?d_bad entry is found while walking page tables, report
>   * the error, before resetting entry to p?d_none.  Usually (but
> Index: linux-2.6/arch/Kconfig
> ===================================================================
> --- linux-2.6.orig/arch/Kconfig
> +++ linux-2.6/arch/Kconfig
> @@ -175,4 +175,7 @@ config HAVE_PERF_EVENTS_NMI
>  config HAVE_ARCH_JUMP_LABEL
>  	bool
>  
> +config HAVE_RCU_TABLE_FREE
> +	bool
> +
>  source "kernel/gcov/Kconfig"
> Index: linux-2.6/arch/powerpc/Kconfig
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/Kconfig
> +++ linux-2.6/arch/powerpc/Kconfig
> @@ -140,6 +140,7 @@ config PPC
>  	select HAVE_PERF_EVENTS
>  	select HAVE_REGS_AND_STACK_ACCESS_API
>  	select HAVE_HW_BREAKPOINT if PERF_EVENTS && PPC_BOOK3S_64
> +	select HAVE_RCU_TABLE_FREE if PPC64
>  
>  config EARLY_PRINTK
>  	bool
> 
> 
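
For illustration, here is a rough sketch of the walker side that the new
tlb.h comment above describes -- this is not code from the series, and the
function is invented for the example.  Disabling IRQs holds off both the
smp_call_function() IPI used by tlb_remove_table_one() and the RCU-sched
grace period behind tlb_table_flush(), so the page-table page reached
through 'pmd' cannot be freed while it is being walked:

/*
 * Sketch only: a lockless lookup in the spirit of gup_fast().
 */
static int lookup_pte_lockless(pmd_t *pmd, unsigned long addr, pte_t *ptep)
{
	pte_t *pte;
	int present;

	local_irq_disable();		/* pins the page-table pages */
	pte = pte_offset_map(pmd, addr);
	*ptep = *pte;
	present = pte_present(*ptep);
	pte_unmap(pte);
	local_irq_enable();		/* freeing may now complete */

	return present;
}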

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 09/21] powerpc: Preemptible mmu_gather
  2010-11-26 14:38   ` Peter Zijlstra
@ 2010-11-30  3:12     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 157+ messages in thread
From: Benjamin Herrenschmidt @ 2010-11-30  3:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Paul McKenney, Yanmin Zhang, Stephen Rothwell

On Fri, 2010-11-26 at 15:38 +0100, Peter Zijlstra wrote:
> plain text document attachment (mm-preempt-tlb-gather-power.patch)
> Fix up powerpc to the new mmu_gather stuff.
> 
> PPC has an extra batching queue to RCU-free the actual pagetable
> allocations; use the ARCH extensions for that for now.
> 
> For the ppc64_tlb_batch, which tracks the vaddrs to unhash from the
> hardware hash-table, keep using per-cpu arrays but flush on context
> switch and use a TLF bit to track the lazy_mmu state.

This breaks the embedded 64-bit build. Replace CONFIG_PPC64 with
CONFIG_PPC_BOOK3S_64 so that the ppc64_tlb_batch accesses in process.c
only hit server (Book3S) parts, since the batch doesn't exist for BOOK3E.

This patch to fold into yours fixes it for me:

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 253475b..56ab103 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -393,7 +393,7 @@ struct task_struct *__switch_to(struct task_struct *prev,
 	struct thread_struct *new_thread, *old_thread;
 	unsigned long flags;
 	struct task_struct *last;
-#ifdef CONFIG_PPC64
+#ifdef CONFIG_PPC_BOOK3S_64
 	struct ppc64_tlb_batch *batch;
 #endif
 
@@ -514,7 +514,9 @@ struct task_struct *__switch_to(struct task_struct *prev,
 		old_thread->accum_tb += (current_tb - start_tb);
 		new_thread->start_tb = current_tb;
 	}
+#endif /* CONFIG_PPC64 */
 
+#ifdef CONFIG_PPC_BOOK3S_64
 	batch = &__get_cpu_var(ppc64_tlb_batch);
 	if (batch->active) {
 		current_thread_info()->local_flags |= _TLF_LAZY_MMU;
@@ -522,7 +524,7 @@ struct task_struct *__switch_to(struct task_struct *prev,
 			__flush_tlb_pending(batch);
 		batch->active = 0;
 	}
-#endif /* CONFIG_PPC64 */
+#endif /* CONFIG_PPC_BOOK3S_64 */
 
 	local_irq_save(flags);
 
@@ -537,13 +539,13 @@ struct task_struct *__switch_to(struct task_struct *prev,
 	hard_irq_disable();
 	last = _switch(old_thread, new_thread);
 
-#ifdef CONFIG_PPC64
+#ifdef CONFIG_PPC_BOOK3S_64
 	if (current_thread_info()->local_flags & _TLF_LAZY_MMU) {
 		current_thread_info()->local_flags &= ~_TLF_LAZY_MMU;
 		batch = &__get_cpu_var(ppc64_tlb_batch);
 		batch->active = 1;
 	}
-#endif /* CONFIG_PPC64 */
+#endif /* CONFIG_PPC_BOOK3S_64 */
 
 	local_irq_restore(flags);
 
Cheers,
Ben.

> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  arch/powerpc/include/asm/pgalloc.h     |    4 ++--
>  arch/powerpc/include/asm/thread_info.h |    2 ++
>  arch/powerpc/include/asm/tlb.h         |   10 ++++++++++
>  arch/powerpc/kernel/process.c          |   21 ++++++++++++++++++++-
>  arch/powerpc/mm/pgtable.c              |   14 ++++----------
>  arch/powerpc/mm/tlb_hash32.c           |    2 +-
>  arch/powerpc/mm/tlb_hash64.c           |   12 +++++++-----
>  arch/powerpc/mm/tlb_nohash.c           |    2 +-
>  8 files changed, 47 insertions(+), 20 deletions(-)
> 
> Index: linux-2.6/arch/powerpc/include/asm/tlb.h
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/include/asm/tlb.h
> +++ linux-2.6/arch/powerpc/include/asm/tlb.h
> @@ -28,6 +28,16 @@
>  #define tlb_start_vma(tlb, vma)	do { } while (0)
>  #define tlb_end_vma(tlb, vma)	do { } while (0)
>  
> +#define HAVE_ARCH_MMU_GATHER 1
> +
> +struct pte_freelist_batch;
> +
> +struct arch_mmu_gather {
> +	struct pte_freelist_batch *batch;
> +};
> +
> +#define ARCH_MMU_GATHER_INIT (struct arch_mmu_gather){ .batch = NULL, }
> +
>  extern void tlb_flush(struct mmu_gather *tlb);
>  
>  /* Get the generic bits... */
> Index: linux-2.6/arch/powerpc/kernel/process.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/kernel/process.c
> +++ linux-2.6/arch/powerpc/kernel/process.c
> @@ -393,6 +393,9 @@ struct task_struct *__switch_to(struct t
>  	struct thread_struct *new_thread, *old_thread;
>  	unsigned long flags;
>  	struct task_struct *last;
> +#ifdef CONFIG_PPC64
> +	struct ppc64_tlb_batch *batch;
> +#endif
>  
>  #ifdef CONFIG_SMP
>  	/* avoid complexity of lazy save/restore of fpu
> @@ -511,7 +514,15 @@ struct task_struct *__switch_to(struct t
>  		old_thread->accum_tb += (current_tb - start_tb);
>  		new_thread->start_tb = current_tb;
>  	}
> -#endif
> +
> +	batch = &__get_cpu_var(ppc64_tlb_batch);
> +	if (batch->active) {
> +		current_thread_info()->local_flags |= _TLF_LAZY_MMU;
> +		if (batch->index)
> +			__flush_tlb_pending(batch);
> +		batch->active = 0;
> +	}
> +#endif /* CONFIG_PPC64 */
>  
>  	local_irq_save(flags);
>  
> @@ -527,6 +538,14 @@ struct task_struct *__switch_to(struct t
>  	hard_irq_disable();
>  	last = _switch(old_thread, new_thread);
>  
> +#ifdef CONFIG_PPC64
> +	if (current_thread_info()->local_flags & _TLF_LAZY_MMU) {
> +		current_thread_info()->local_flags &= ~_TLF_LAZY_MMU;
> +		batch = &__get_cpu_var(ppc64_tlb_batch);
> +		batch->active = 1;
> +	}
> +#endif /* CONFIG_PPC64 */
> +
>  	local_irq_restore(flags);
>  
>  	return last;
> Index: linux-2.6/arch/powerpc/mm/pgtable.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/pgtable.c
> +++ linux-2.6/arch/powerpc/mm/pgtable.c
> @@ -33,8 +33,6 @@
>  
>  #include "mmu_decl.h"
>  
> -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
> -
>  #ifdef CONFIG_SMP
>  
>  /*
> @@ -43,7 +41,6 @@ DEFINE_PER_CPU(struct mmu_gather, mmu_ga
>   * freeing a page table page that is being walked without locks
>   */
>  
> -static DEFINE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur);
>  static unsigned long pte_freelist_forced_free;
>  
>  struct pte_freelist_batch
> @@ -97,12 +94,10 @@ static void pte_free_submit(struct pte_f
>  
>  void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
>  {
> -	/* This is safe since tlb_gather_mmu has disabled preemption */
> -	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
> +	struct pte_freelist_batch **batchp = &tlb->arch.batch;
>  	unsigned long pgf;
>  
> -	if (atomic_read(&tlb->mm->mm_users) < 2 ||
> -	    cpumask_equal(mm_cpumask(tlb->mm), cpumask_of(smp_processor_id()))){
> +	if (atomic_read(&tlb->mm->mm_users) < 2) {
>  		pgtable_free(table, shift);
>  		return;
>  	}
> @@ -124,10 +119,9 @@ void pgtable_free_tlb(struct mmu_gather 
>  	}
>  }
>  
> -void pte_free_finish(void)
> +void pte_free_finish(struct mmu_gather *tlb)
>  {
> -	/* This is safe since tlb_gather_mmu has disabled preemption */
> -	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
> +	struct pte_freelist_batch **batchp = &tlb->arch.batch;
>  
>  	if (*batchp == NULL)
>  		return;
> Index: linux-2.6/arch/powerpc/mm/tlb_hash64.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/tlb_hash64.c
> +++ linux-2.6/arch/powerpc/mm/tlb_hash64.c
> @@ -38,13 +38,11 @@ DEFINE_PER_CPU(struct ppc64_tlb_batch, p
>   * neesd to be flushed. This function will either perform the flush
>   * immediately or will batch it up if the current CPU has an active
>   * batch on it.
> - *
> - * Must be called from within some kind of spinlock/non-preempt region...
>   */
>  void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
>  		     pte_t *ptep, unsigned long pte, int huge)
>  {
> -	struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);
> +	struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);
>  	unsigned long vsid, vaddr;
>  	unsigned int psize;
>  	int ssize;
> @@ -99,6 +97,7 @@ void hpte_need_flush(struct mm_struct *m
>  	 */
>  	if (!batch->active) {
>  		flush_hash_page(vaddr, rpte, psize, ssize, 0);
> +		put_cpu_var(ppc64_tlb_batch);
>  		return;
>  	}
>  
> @@ -127,6 +126,7 @@ void hpte_need_flush(struct mm_struct *m
>  	batch->index = ++i;
>  	if (i >= PPC64_TLB_BATCH_NR)
>  		__flush_tlb_pending(batch);
> +	put_cpu_var(ppc64_tlb_batch);
>  }
>  
>  /*
> @@ -155,7 +155,7 @@ void __flush_tlb_pending(struct ppc64_tl
>  
>  void tlb_flush(struct mmu_gather *tlb)
>  {
> -	struct ppc64_tlb_batch *tlbbatch = &__get_cpu_var(ppc64_tlb_batch);
> +	struct ppc64_tlb_batch *tlbbatch = &get_cpu_var(ppc64_tlb_batch);
>  
>  	/* If there's a TLB batch pending, then we must flush it because the
>  	 * pages are going to be freed and we really don't want to have a CPU
> @@ -164,8 +164,10 @@ void tlb_flush(struct mmu_gather *tlb)
>  	if (tlbbatch->index)
>  		__flush_tlb_pending(tlbbatch);
>  
> +	put_cpu_var(ppc64_tlb_batch);
> +
>  	/* Push out batch of freed page tables */
> -	pte_free_finish();
> +	pte_free_finish(tlb);
>  }
>  
>  /**
> Index: linux-2.6/arch/powerpc/include/asm/thread_info.h
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/include/asm/thread_info.h
> +++ linux-2.6/arch/powerpc/include/asm/thread_info.h
> @@ -139,10 +139,12 @@ static inline struct thread_info *curren
>  #define TLF_NAPPING		0	/* idle thread enabled NAP mode */
>  #define TLF_SLEEPING		1	/* suspend code enabled SLEEP mode */
>  #define TLF_RESTORE_SIGMASK	2	/* Restore signal mask in do_signal */
> +#define TLF_LAZY_MMU		3	/* tlb_batch is active */
>  
>  #define _TLF_NAPPING		(1 << TLF_NAPPING)
>  #define _TLF_SLEEPING		(1 << TLF_SLEEPING)
>  #define _TLF_RESTORE_SIGMASK	(1 << TLF_RESTORE_SIGMASK)
> +#define _TLF_LAZY_MMU		(1 << TLF_LAZY_MMU)
>  
>  #ifndef __ASSEMBLY__
>  #define HAVE_SET_RESTORE_SIGMASK	1
> Index: linux-2.6/arch/powerpc/include/asm/pgalloc.h
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/include/asm/pgalloc.h
> +++ linux-2.6/arch/powerpc/include/asm/pgalloc.h
> @@ -32,13 +32,13 @@ static inline void pte_free(struct mm_st
>  
>  #ifdef CONFIG_SMP
>  extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift);
> -extern void pte_free_finish(void);
> +extern void pte_free_finish(struct mmu_gather *tlb);
>  #else /* CONFIG_SMP */
>  static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
>  {
>  	pgtable_free(table, shift);
>  }
> -static inline void pte_free_finish(void) { }
> +static inline void pte_free_finish(struct mmu_gather *tlb) { }
>  #endif /* !CONFIG_SMP */
>  
>  static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
> Index: linux-2.6/arch/powerpc/mm/tlb_hash32.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/tlb_hash32.c
> +++ linux-2.6/arch/powerpc/mm/tlb_hash32.c
> @@ -73,7 +73,7 @@ void tlb_flush(struct mmu_gather *tlb)
>  	}
>  
>  	/* Push out batch of freed page tables */
> -	pte_free_finish();
> +	pte_free_finish(tlb);
>  }
>  
>  /*
> Index: linux-2.6/arch/powerpc/mm/tlb_nohash.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/tlb_nohash.c
> +++ linux-2.6/arch/powerpc/mm/tlb_nohash.c
> @@ -301,7 +301,7 @@ void tlb_flush(struct mmu_gather *tlb)
>  	flush_tlb_mm(tlb->mm);
>  
>  	/* Push out batch of freed page tables */
> -	pte_free_finish();
> +	pte_free_finish(tlb);
>  }
>  
>  /*
> 
> 

^ permalink raw reply related	[flat|nested] 157+ messages in thread

* Re: [PATCH 09/21] powerpc: Preemptible mmu_gather
  2010-11-30  3:12     ` Benjamin Herrenschmidt
@ 2010-11-30  3:35       ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 157+ messages in thread
From: Benjamin Herrenschmidt @ 2010-11-30  3:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Paul McKenney, Yanmin Zhang, Stephen Rothwell


> This breaks the embedded 64-bit build. Replace CONFIG_PPC64 with
> CONFIG_PPC_BOOK3S_64 so that the ppc64_tlb_batch accesses in process.c
> only hit server (Book3S) parts, since the batch doesn't exist for BOOK3E.
> 
> This patch to fold into yours fixes it for me:

With this fix and the other Kconfig fix, it seems to boot fine, with no
obvious breakage doing simple things on a G5 and a POWER6 machine.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 09/21] powerpc: Preemptible mmu_gather
  2010-11-30  3:35       ` Benjamin Herrenschmidt
@ 2010-11-30 19:25         ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2010-11-30 19:25 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Paul McKenney, Yanmin Zhang, Stephen Rothwell

On Tue, 2010-11-30 at 14:35 +1100, Benjamin Herrenschmidt wrote:
> > This breaks the embedded 64-bit build. Replace CONFIG_PPC64 with
> > CONFIG_PPC_BOOK3S_64 so that the ppc64_tlb_batch accesses in process.c
> > only hit server (Book3S) parts, since the batch doesn't exist for BOOK3E.
> > 
> > This patch to fold into yours fixes it for me:
> 
> With this fix and the other Kconfig fix, it seems to boot fine, with no
> obvious breakage doing simple things on a G5 and a POWER6 machine.

Awesome, thanks for your fixes; I've folded them into the respective
patches.



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
  2010-11-26 14:38 ` Peter Zijlstra
@ 2011-01-18  7:12   ` Hugh Dickins
  -1 siblings, 0 replies; 157+ messages in thread
From: Hugh Dickins @ 2011-01-18  7:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Benjamin Herrenschmidt, David Miller, Nick Piggin,
	Martin Schwidefsky, linux-kernel, linux-arch, linux-mm

On Fri, 26 Nov 2010, Peter Zijlstra wrote:
> This patch-set makes part of the mm a lot more preemptible. It converts
> i_mmap_lock and anon_vma->lock to mutexes and makes mmu_gather fully
> preemptible.
> 
> The main motivation was making mm_take_all_locks() preemptible, since it
> appears people are nesting hundreds of spinlocks there.
> 
> The side-effects are that can finally make mmu_gather preemptible,
> something which lots of people have wanted to do for a long time.
> 
> It also gets us anon_vma refcounting, which seems to result in a nice
> cleanup of the anon_vma lifetime rules wrt KSM and compaction.
> 
> This patch-set is build and boot-tested on x86_64 (a previous version was
> also tested on Dave's Niagra2 machines, and I suppose s390 was too when
> Martin provided the conversion patch for his arch).
> 
> There are no known architectures left unconverted.
> 
> Yanmin ran the -v3 posting through the comprehensive Intel test farm
> and didn't find any regressions.
> 
> ( Not included in this posting are the 4 Sparc64 patches that implement
>   gup_fast, those can be applied separately after this series gets
>   anywhere. )
> 
> The full series (including the Sparc64 gup_fast bits) also available in -git
> form from (against Linus' tree as of about an hour ago):
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-mmu_preempt.git

Hi Peter,

I understand you're intending to update your preemptible mmu_gather
patchset against 2.6.38-rc1, so I've spent a while looking through
(and running) your last posted version (plus its two fixes from BenH).

I've had no problems in running it, but I can't tell if it's quicker or
slower than the unpatched kernel.  The only argument against the patchset,
really, would be performance: and though there are no bad reports on
it as yet, I do wonder how we proceed if a genuine workload shows up
which is adversely affected.  Oh well, silly to worry about the
hypothetical, I suppose.

There are a few minor cleanups I'd hoped for (things like removing the
start and end args from tlb_finish_mmu), but you're quite right to
have stayed on course and not strayed down that path.

However, there's one more-than-cleanup that I think you will need to add:
the ZAP_BLOCK_SIZE zap_work stuff is still there, but I think it needs
to be removed now, with the need_resched() and other checks moved down
from unmap_vmas() to inside the pagetable spinlock in zap_pte_range().

That's because you're now accumulating more work than ever in the
mmu_gather's buffer, and the more so with the 20/21 extended list: this
amounts to a backlog of work which will *usually* be done at
tlb_finish_mmu, but when memory is low (no extra buffers) it may need
flushing earlier - as things stand, while holding the pagetable
spinlock, thus introducing a large unpreemptible latency under those
conditions.

I believe that along with the need_resched() check moved inside
zap_pte_range(), you need to check if the mmu_gather buffer is full,
and if so drop the pagetable spinlock while you flush it.  Hmm, but if
it's extensible, then it wasn't full: I've not worked out how this
would actually fit together.
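
A rough sketch of how that check might fit inside zap_pte_range() -- not
code from the series; tlb_batch_full() is an invented helper, and the
body of the loop is elided down to the locking that matters here:

static unsigned long zap_ptes_sketch(struct mmu_gather *tlb, pmd_t *pmd,
				     unsigned long addr, unsigned long end)
{
	struct mm_struct *mm = tlb->mm;
	spinlock_t *ptl;
	pte_t *pte;

	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	do {
		pte_t ptent = *pte;

		if (pte_none(ptent))
			continue;

		/* ... clear the pte, account rss, tlb_remove_page() ... */

		if (need_resched() || tlb_batch_full(tlb)) {
			/*
			 * Drop the pagetable lock before flushing the
			 * backlog, so the flush and the page freeing
			 * happen preemptibly.
			 */
			pte_unmap_unlock(pte, ptl);
			tlb_flush_mmu(tlb, addr, end);
			cond_resched();
			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
		}
	} while (pte++, addr += PAGE_SIZE, addr != end);
	pte_unmap_unlock(pte - 1, ptl);

	return addr;
}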

(I also believe that when memory is low, we *ought* to be freeing up
the pages sooner: perhaps all the GFP_ATOMICs should be GFP_NOWAITs.)

I found the patch ordering a bit odd: I'm going to comment on the
patches in what seems a more natural ordering to me.  If Andrew folds
your 00 comments into 01 as he usually does, then I'd rather see them
on the main preemptible mmu_gather patch than on reverting some anon_vma
annotations!  And with anon_vma->lock already nested inside i_mmap_lock,
I think the anon_vma mods are secondary, and can just follow after.

08/21 mm-preemptible_mmu_gather.patch
      Acked-by: Hugh Dickins <hughd@google.com>
      But I'd prefer __tlb_alloc_pages() be named __tlb_alloc_page(),
      and think it should pass __GFP_NOWARN with its GFP_ATOMIC (same
      remark would apply in several other patches too).
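
      For concreteness, with both suggestions applied the helper would
      look something like this sketch (the body is assumed to be the
      bare __get_free_pages() call from the generic patch, not quoted
      from 08/21 itself):

      static inline void __tlb_alloc_page(struct mmu_gather *tlb)
      {
              unsigned long addr = __get_free_pages(GFP_ATOMIC | __GFP_NOWARN, 0);

              if (addr) {
                      tlb->pages = (void *)addr;
                      tlb->max = PAGE_SIZE / sizeof(struct page *);
              }
      }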

09/21 powerpc-preemptible_mmu_gather.patch
      I'll leave Acking to Ben, but it looked okay so far as I could tell.
      I worry how much (unpreemptible) work happens in __flush_tlb_pending
      in __switch_to, whether PPC64_TLB_BATCH_NR 192 ought to be smaller
      now (I wonder where 192 came from originally); move the _TLF_LAZY_MMU
      block below _switch() to after the local_irq_restore(flags)?
      The mods to hpte_need_flush() look like what we need in 2.6.37-stable
      to keep CONFIG_DEBUG_PREEMPT vfree() quiet, perhaps should be separated
      out - but perhaps they're inappropriate and Ben has another fix in mind.

10/21 sparc-preemptible_mmu_gather.patch
      Similarly, looked okay so far as I could tell, and this one was
      already doing flush_tlb_pending in switch_to; more of the 192
      (not from you, of course).  tlb_batch_add() has some commented-out
      (tb->fullmm) code that you probably meant to come back to.
      mm/init_32.c still has DEFINE_PER_CPU(struct mmu_gather, mmu_gathers).

11/21 s390-preemptible_mmu_gather.patch
      I'd prefer __tlb_alloc_page(), with __GFP_NOWARN as suggested above.
      mm/pgtable.c still has DEFINE_PER_CPU(struct mmu_gather, mmu_gathers).

12/21 arm-preemptible_mmu_gather.patch
13/21 sh-preemptible_mmu_gather.patch
14/21 um-preemptible_mmu_gather.patch
15/21 ia64-preemptible_mmu_gather.patch
      All straightforward, but DEFINE_PER_CPU(struct mmu_gather, mmu_gathers)
      still to be removed from these and other arches.

16/21 mm_powerpc-move_the_rcu_page-table_freeing_into.patch
      Seems good, prefer Ben and Dave to Ack.  "copmletion" -> "completion".

17/21 lockdep_mutex-provide_mutex_lock_nest_lock.patch
      Okay by me.

18/21 mutex-provide_mutex_is_contended.patch
      I suppose so, though if we use it in the truncate path, then we are
      stuck with the vm_truncate_count stuff I'd rather hoped would go away;
      but I guess you're right, that if we did spin_needbreak/need_lockbreak
      before, then we ought to do this now - though I suspect I only added
      it because I had to insert a resched-point anyway, and it seemed a good
      idea at the time to check lockbreak too since that had just been added.

19/21 mm-convert_i_mmap_lock_and_anon_vma-_lock_to_mutexes.patch
      I suggest doing just the i_mmap_lock->mutex conversion at this point.
      Acked-by: Hugh Dickins <hughd@google.com>
      except that in the past we have renamed a lock when we've done this
      kind of conversion, so I'd expect i_mmap_mutex throughout now.
      Or am I just out of date?  I don't feel very strongly about it.

20/21 mm-extended_batches_for_generic_mmu_gather.patch
      Acked-by: Hugh Dickins <hughd@google.com>
      though it struck me as overdesign at first: I guess Nick wanted it
      because he had an implementation that used the pagetables themselves,
      hence an assured supply of these buffers.  tlb_finish_mmu(), and
      perhaps others, looking rather too big for inline by this stage.

01/21 mm-revert_page_lock_anon_vma_lock_annotation.patch
      Acked-by: Hugh Dickins <hughd@google.com>

02/21 powerpc-use_call_rcu_sched_for_pagetables.patch
      Already went into 2.6.37

03/21 mm-improve_page_lock_anon_vma_comment.patch
      Acked-by: Hugh Dickins <hughd@google.com>

04/21 mm-rename_drop_anon_vma_to_put_anon_vma.patch
      Acked-by: Hugh Dickins <hughd@google.com>
      but (if you don't mind: leave it to me if you prefer) in mm/ksm.c
      please just remove wrappers hold_anon_vma() and ksm_put_anon_vma():
      they had a point when they originated the refcount but no point now.
      Note there are now two places to update in mm/migrate.c in 38-rc1.

05/21 mm-move_anon_vma_ref_out_from_under_config_ksm.patch
      Acked-by: Hugh Dickins <hughd@google.com>
      but you shouldn't need to touch mm/migrate.c again here with 38-rc1.
      Didn't you end up double-decrementing refcount in the huge_page case?

06/21 mm-simplify_anon_vma_refcounts.patch
      Acked-by: Hugh Dickins <hughd@google.com>
      except page_get_anon_vma() is being declared in rmap.h a patch early,
      and you shouldn't need to touch mm/ksm.c again here with 38-rc1.
      Did wonder if __put_anon_vma() is right to put anon_vma->root *before*
      freeing anon_vma, but suppose your not_zero strictness makes it safe.

07/21 mm-use_refcounts_for_page_lock_anon_vma.patch
      Acked-by: Hugh Dickins <hughd@google.com>
      but here I'm expecting you to use your page_get_anon_vma() in
      mm/migrate.c too, to replace my 38-rc1 lock/get/unlock sequences.
      Second page_mapped() test in page_get_anon_vma(): remove "goto out;"
      from that block, it's already reached "out".  In patch description,
      didn't understand "for each of convertion": "for sake of conversion"?
      This brings us to a nice point, ready for the lock->mutex conversion:
      the only defect being the doubled atomics in page_(un)lock_anon_vma.

19/21 mm-convert_i_mmap_lock_and_anon_vma-_lock_to_mutexes.patch
      I suggest doing the anon_vma lock->mutex conversion separately here.
      Acked-by: Hugh Dickins <hughd@google.com>
      except that in the past we have renamed a lock when we've done this
      kind of conversion, so I'd expect anon_vma->mutex throughout now.
      Or am I just out of date?  I don't feel very strongly about it.
      
21/21 mm-optimize_page_lock_anon_vma_fast-path.patch
      I certainly see the call for this patch, I want to eliminate those
      doubled atomics too.  This appears correct to me, and I've not dreamt
      up an alternative; but I do dislike it, and I suspect you don't like
      it much either.  I'm ambivalent about it, would love a better patch.

sparc64-Kill_page_table_quicklists.patch
sparc64-Use_RCU_page_table_freeing.patch
sparc64-Add_support_for__PAGE_SPECIA.patch
sparc64-Implement_get_user_pages_fast.patch
      I did not spend very long looking at these, none of my business really!
      but did notice one thing I didn't like, that pte_special() is declared
      unsigned long in the third, whereas int in every other architecture. I
      think it should follow the ia64-style there, use != 0 to return an int.

A few checkpatch warnings, many of which I don't particularly agree with -
though I do get annoyed by comments going over 80-cols without any need!

Hugh

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
@ 2011-01-18  7:12   ` Hugh Dickins
  0 siblings, 0 replies; 157+ messages in thread
From: Hugh Dickins @ 2011-01-18  7:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Benjamin Herrenschmidt, David Miller, Nick Piggin,
	Martin Schwidefsky, linux-kernel, linux-arch, linux-mm

On Fri, 26 Nov 2010, Peter Zijlstra wrote:
> This patch-set makes part of the mm a lot more preemptible. It converts
> i_mmap_lock and anon_vma->lock to mutexes and makes mmu_gather fully
> preemptible.
> 
> The main motivation was making mm_take_all_locks() preemptible, since it
> appears people are nesting hundreds of spinlocks there.
> 
> The side-effects are that can finally make mmu_gather preemptible,
> something which lots of people have wanted to do for a long time.
> 
> It also gets us anon_vma refcounting, which seems to result in a nice
> cleanup of the anon_vma lifetime rules wrt KSM and compaction.
> 
> This patch-set is build and boot-tested on x86_64 (a previous version was
> also tested on Dave's Niagra2 machines, and I suppose s390 was too when
> Martin provided the conversion patch for his arch).
> 
> There are no known architectures left unconverted.
> 
> Yanmin ran the -v3 posting through the comprehensive Intel test farm
> and didn't find any regressions.
> 
> ( Not included in this posting are the 4 Sparc64 patches that implement
>   gup_fast, those can be applied separately after this series gets
>   anywhere. )
> 
> The full series (including the Sparc64 gup_fast bits) also available in -git
> form from (against Linus' tree as of about an hour ago):
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-mmu_preempt.git

Hi Peter,

I understand you're intending to update your preemptible mmu_gather
patchset against 2.6.38-rc1, so I've spent a while looking through
(and running) your last posted version (plus its two fixes from BenH).

I've had no problems in running it, I can't tell if it's quicker or
slower than the unpatched.  The only argument against the patchset,
really, would be performance: and though there are no bad reports on
it as yet, I do wonder how we proceed if a genuine workload shows up
which is adversely affected.  Oh well, silly to worry about the
hypothetical I suppose.

There's a few minor cleanups I'd hoped for (things like removing the
start and end args from tlb_finish_mmu), but you're quite right to
have stayed on course and not strayed down that path.

However, there's one more-than-cleanup that I think you will need to add:
the ZAP_BLOCK_SIZE zap_work stuff is still there, but I think it needs
to be removed now, with the need_resched() and other checks moved down
from unmap_vmas() to inside the pagetable spinlock in zap_pte_range().

Because you're now accumulating more work than ever in the mmu_gather's
buffer, and the more so with the 20/21 extended list: but this amounts
to a backlog of work which will *usually* be done at the tlb_finish_mmu,
but when memory is low (no extra buffers) may need flushing earlier -
as things stand, while holding the pagetable spinlock, so introducing
a large unpreemptible latency under those conditions.

I believe that along with the need_resched() check moved inside
zap_pte_range(), you need to check if the mmu_gather buffer is full,
and if so drop pagetable spinlock while you flush it.  Hmm, but if
it's extensible, then it wasn't full: I've not worked out how this
would actually fit together.

(I also believe that when memory is low, we *ought* to be freeing up
the pages sooner: perhaps all the GFP_ATOMICs should be GFP_NOWAITs.)

I found patch ordering a bit odd: I'm going to comment on them in
what seems a more natural ordering to me: if Andrew folds your 00
comments into 01 as he usually does, then I'd rather see them on the
main preemptible mmu_gather patch, than on reverting some anon_vma
annotations!  And with anon_vma->lock already nested inside i_mmap_lock,
I think the anon_vma mods are secondary, and can just follow after.

08/21 mm-preemptible_mmu_gather.patch
      Acked-by: Hugh Dickins <hughd@google.com>
      But I'd prefer __tlb_alloc_pages() be named __tlb_alloc_page(),
      and think it should pass __GFP_NOWARN with its GFP_ATOMIC (same
      remark would apply in several other patches too).

09/21 powerpc-preemptible_mmu_gather.patch
      I'll leave Acking to Ben, but it looked okay so far as I could tell.
      I worry how much (unpreemptible) work happens in __flush_tlb_pending
      in __switch_to, whether PPC64_TLB_BATCH_NR 192 ought to be smaller
      now (I wonder where 192 came from originally); move the _TLF_LAZY_MMU
      block below _switch() to after the local_irq_restore(flags)?
      The mods to hpte_need_flush() look like what we need in 2.6.37-stable
      to keep CONFIG_DEBUG_PREEMPT vfree() quiet, perhaps should be separated
      out - but perhaps they're inappropriate and Ben has another fix in mind.

10/21 sparc-preemptible_mmu_gather.patch
      Similarly, looked okay so far as I could tell, and this one was
      already doing flush_tlb_pending in switch_to; more of the 192
      (not from you, of course).  tlb_batch_add() has some commented-out
      (tb->fullmm) code that you probably meant to come back to.
      mm/init_32.c still has DEFINE_PER_CPU(struct mmu_gather, mmu_gathers).

11/21 s390-preemptible_mmu_gather.patch
      I'd prefer __tlb_alloc_page(), with __GFP_NOWARN as suggested above.
      mm/pgtable.c still has DEFINE_PER_CPU(struct mmu_gather, mmu_gathers).

12/21 arm-preemptible_mmu_gather.patch
13/21 sh-preemptible_mmu_gather.patch
14/21 um-preemptible_mmu_gather.patch
15/21 ia64-preemptible_mmu_gather.patch
      All straightforward, but DEFINE_PER_CPU(struct mmu_gather, mmu_gathers)
      still to be removed from these and other arches.

16/21 mm_powerpc-move_the_rcu_page-table_freeing_into.patch
      Seems good, prefer Ben and Dave to Ack.  "copmletion" -> "completion".

17/21 lockdep_mutex-provide_mutex_lock_nest_lock.patch
      Okay by me.

18/21 mutex-provide_mutex_is_contended.patch
      I suppose so, though if we use it in the truncate path, then we are
      stuck with the vm_truncate_count stuff I'd rather hoped would go away;
      but I guess you're right, that if we did spin_needbreak/need_lockbreak
      before, then we ought to do this now - though I suspect I only added
      it because I had to insert a resched-point anyway, and it seemed a good
      idea at the time to check lockbreak too since that had just been added.

19/21 mm-convert_i_mmap_lock_and_anon_vma-_lock_to_mutexes.patch
      I suggest doing just the i_mmap_lock->mutex conversion at this point.
      Acked-by: Hugh Dickins <hughd@google.com>
      except that in the past we have renamed a lock when we've done this
      kind of conversion, so I'd expect i_mmap_mutex throughout now.
      Or am I just out of date?  I don't feel very strongly about it.

20/21 mm-extended_batches_for_generic_mmu_gather.patch
      Acked-by: Hugh Dickins <hughd@google.com>
      though it struck me as overdesign at first: I guess Nick wanted it
      because he had an implementation that used the pagetables themselves,
      hence an assured supply of these buffers.  tlb_finish_mmu(), and
      perhaps others, looking rather too big for inline by this stage.

01/21 mm-revert_page_lock_anon_vma_lock_annotation.patch
      Acked-by: Hugh Dickins <hughd@google.com>

02/21 powerpc-use_call_rcu_sched_for_pagetables.patch
      Already went into 2.6.37

03/21 mm-improve_page_lock_anon_vma_comment.patch
      Acked-by: Hugh Dickins <hughd@google.com>

04/21 mm-rename_drop_anon_vma_to_put_anon_vma.patch
      Acked-by: Hugh Dickins <hughd@google.com>
      but (if you don't mind: leave it to me if you prefer) in mm/ksm.c
      please just remove wrappers hold_anon_vma() and ksm_put_anon_vma():
      they had a point when they originated the refcount but no point now.
      Note there are now two places to update in mm/migrate.c in 38-rc1.

05/21 mm-move_anon_vma_ref_out_from_under_config_ksm.patch
      Acked-by: Hugh Dickins <hughd@google.com>
      but you shouldn't need to touch mm/migrate.c again here with 38-rc1.
      Didn't you end up double-decrementing refcount in the huge_page case?

06/21 mm-simplify_anon_vma_refcounts.patch
      Acked-by: Hugh Dickins <hughd@google.com>
      except page_get_anon_vma() is being declared in rmap.h a patch early,
      and you shouldn't need to touch mm/ksm.c again here with 38-rc1.
      Did wonder if __put_anon_vma() is right to put anon_vma->root *before*
      freeing anon_vma, but suppose your not_zero strictness makes it safe.

07/21 mm-use_refcounts_for_page_lock_anon_vma.patch
      Acked-by: Hugh Dickins <hughd@google.com>
      but here I'm expecting you to use your page_get_anon_vma() in
      mm/migrate.c too, to replace my 38-rc1 lock/get/unlock sequences.
      Second page_mapped() test in page_get_anon_vma(): remove "goto out;"
      from that block, it's already reached "out".  In patch description,
      didn't understand "for each of convertion": "for sake of conversion"?
      This brings us to a nice point, ready for the lock->mutex conversion:
      the only defect being the doubled atomics in page_(un)lock_anon_vma.

19/21 mm-convert_i_mmap_lock_and_anon_vma-_lock_to_mutexes.patch
      I suggest doing the anon_vma lock->mutex conversion separately here.
      Acked-by: Hugh Dickins <hughd@google.com>
      except that in the past we have renamed a lock when we've done this
      kind of conversion, so I'd expect anon_vma->mutex throughout now.
      Or am I just out of date?  I don't feel very strongly about it.
      
21/21 mm-optimize_page_lock_anon_vma_fast-path.patch
      I certainly see the call for this patch, I want to eliminate those
      doubled atomics too.  This appears correct to me, and I've not dreamt
      up an alternative; but I do dislike it, and I suspect you don't like
      it much either.  I'm ambivalent about it, would love a better patch.

sparc64-Kill_page_table_quicklists.patch
sparc64-Use_RCU_page_table_freeing.patch
sparc64-Add_support_for__PAGE_SPECIA.patch
sparc64-Implement_get_user_pages_fast.patch
      I did not spend very long looking at these, none of my business really!
      but did notice one thing I didn't like, that pte_special() is declared
      unsigned long in the third, whereas int in every other architecture. I
      think it should follow the ia64-style there, use != 0 to return an int.
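
[ For illustration only -- a minimal sketch of the "ia64-style" shape being
  suggested, not the actual sparc64 patch; _PAGE_SPECIAL is assumed to be
  the relevant pte bit: ]

        static inline int pte_special(pte_t pte)
        {
                /* return a plain int, not the raw unsigned long bit value */
                return (pte_val(pte) & _PAGE_SPECIAL) != 0;
        }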

A few checkpatch warnings, many of which I don't particularly agree with -
though I do get annoyed by comments going over 80-cols without any need!

Hugh

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
  2011-01-18  7:12   ` Hugh Dickins
@ 2011-01-18 10:30     ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2011-01-18 10:30 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Benjamin Herrenschmidt, David Miller, Nick Piggin,
	Martin Schwidefsky, linux-kernel, linux-arch, linux-mm

On Mon, 2011-01-17 at 23:12 -0800, Hugh Dickins wrote:
> I understand you're intending to update your preemptible mmu_gather
> patchset against 2.6.38-rc1, so I've spent a while looking through
> (and running) your last posted version (plus its two fixes from BenH).
> 
> I've had no problems in running it, I can't tell if it's quicker or
> slower than the unpatched.  The only argument against the patchset,
> really, would be performance: and though there are no bad reports on
> it as yet, I do wonder how we proceed if a genuine workload shows up
> which is adversely affected.  Oh well, silly to worry about the
> hypothetical I suppose.
> 
Wow, _HUGE_ review, thanks! I'll slowly make my way through it when I do
the rebase against .38-rc1; most suggestions look very good indeed.

And as to performance, there's the no-regression report from Yanmin
running it through the Intel test farm. However, Nick also had a
particular workload he wanted to test; Nick, do you think you've got a
few spare minutes to dig that workload up and give it a run?

But I guess you're right, there's always a chance someone hits
something, and then there's nothing for it but to deal with it when
that comes..

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
  2011-01-18  7:12   ` Hugh Dickins
@ 2011-01-18 10:44     ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2011-01-18 10:44 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Benjamin Herrenschmidt, David Miller, Nick Piggin,
	Martin Schwidefsky, linux-kernel, linux-arch, linux-mm

On Mon, 2011-01-17 at 23:12 -0800, Hugh Dickins wrote:
> 18/21 mutex-provide_mutex_is_contended.patch
>       I suppose so, though if we use it in the truncate path, then we are
>       stuck with the vm_truncate_count stuff I'd rather hoped would go away;
>       but I guess you're right, that if we did spin_needbreak/need_lockbreak
>       before, then we ought to do this now - though I suspect I only added
>       it because I had to insert a resched-point anyway, and it seemed a good
>       idea at the time to check lockbreak too since that had just been added. 

Like the other missed cleanups now possible, these are things we can
most definitely look at once the dust settles a bit. I just didn't want
to rewrite the world in one go.
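
[ For illustration only -- one plausible shape for the helper discussed
  above, mirroring spin_is_contended(); this is not the actual 18/21 patch.
  Like spin_needbreak(), it is only a racy hint to back off and resched: ]

        static inline int mutex_is_contended(struct mutex *lock)
        {
                /* racy, but good enough as a "someone is waiting" hint */
                return !list_empty(&lock->wait_list);
        }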

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
  2011-01-18  7:12   ` Hugh Dickins
@ 2011-01-18 10:50     ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2011-01-18 10:50 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Benjamin Herrenschmidt, David Miller, Nick Piggin,
	Martin Schwidefsky, linux-kernel, linux-arch, linux-mm

On Mon, 2011-01-17 at 23:12 -0800, Hugh Dickins wrote:
> 21/21 mm-optimize_page_lock_anon_vma_fast-path.patch
>       I certainly see the call for this patch, I want to eliminate those
>       doubled atomics too.  This appears correct to me, and I've not dreamt
>       up an alternative; but I do dislike it, and I suspect you don't like
>       it much either.  I'm ambivalent about it, would love a better patch. 

Spot on, maybe one of our keen readers will have a bright idea.. :-)



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
  2011-01-18  7:12   ` Hugh Dickins
@ 2011-01-19 17:10     ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2011-01-19 17:10 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Benjamin Herrenschmidt, David Miller, Nick Piggin,
	Martin Schwidefsky, linux-kernel, linux-arch, linux-mm,
	Andrea Arcangeli

On Mon, 2011-01-17 at 23:12 -0800, Hugh Dickins wrote:

> However, there's one more-than-cleanup that I think you will need to add:
> the ZAP_BLOCK_SIZE zap_work stuff is still there, but I think it needs
> to be removed now, with the need_resched() and other checks moved down
> from unmap_vmas() to inside the pagetable spinlock in zap_pte_range().
> 
> Because you're now accumulating more work than ever in the mmu_gather's
> buffer, and the more so with the 20/21 extended list: but this amounts
> to a backlog of work which will *usually* be done at the tlb_finish_mmu,
> but when memory is low (no extra buffers) may need flushing earlier -
> as things stand, while holding the pagetable spinlock, so introducing
> a large unpreemptible latency under those conditions.
> 
> I believe that along with the need_resched() check moved inside
> zap_pte_range(), you need to check if the mmu_gather buffer is full,
> and if so drop pagetable spinlock while you flush it.  Hmm, but if
> it's extensible, then it wasn't full: I've not worked out how this
> would actually fit together.

Very good point!! I'll work on this, and I'll probably do a few of those
cleanups previously left undone too; I'm seriously doubting the
usefulness of the whole restart_addr muck now that it's preemptible.

> (I also believe that when memory is low, we *ought* to be freeing up
> the pages sooner: perhaps all the GFP_ATOMICs should be GFP_NOWAITs.)

Agreed, I've moved everything to: GFP_NOWAIT | __GFP_NOWARN. 
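
[ Sketch only, to show what the flag switch amounts to at the batch
  allocation site; the field names follow the generic mmu_gather and are
  assumed, this is not the actual patch.  Failure is harmless: the gather
  just keeps its small static array and flushes earlier. ]

        static inline void __tlb_alloc_page(struct mmu_gather *tlb)
        {
                unsigned long addr = __get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);

                if (addr) {
                        tlb->pages = (void *)addr;
                        tlb->max = PAGE_SIZE / sizeof(struct page *);
                }
        }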

> I found patch ordering a bit odd: I'm going to comment on them in
> what seems a more natural ordering to me: if Andrew folds your 00
> comments into 01 as he usually does, then I'd rather see them on the
> main preemptible mmu_gather patch, than on reverting some anon_vma
> annotations! 

Shouldn't we simply ask for better changelogs instead of Andrew doing
that? That said, I do like your order better, so I did indeed reorder as
you suggest.

>  And with anon_vma->lock already nested inside i_mmap_lock,
> I think the anon_vma mods are secondary, and can just follow after.
> 
> 08/21 mm-preemptible_mmu_gather.patch
>       Acked-by: Hugh Dickins <hughd@google.com>
>       But I'd prefer __tlb_alloc_pages() be named __tlb_alloc_page(),
>       and think it should pass __GFP_NOWARN with its GFP_ATOMIC (same
>       remark would apply in several other patches too).

Did the rename, and like mentioned, switched to GFP_NOWAIT |
__GFP_NOWARN for everything.

> 09/21 powerpc-preemptible_mmu_gather.patch
>       I'll leave Acking to Ben, but it looked okay so far as I could tell.
>       I worry how much (unpreemptible) work happens in __flush_tlb_pending
>       in __switch_to, whether PPC64_TLB_BATCH_NR 192 ought to be smaller
>       now (I wonder where 192 came from originally); move the _TLF_LAZY_MMU
>       block below _switch() to after the local_irq_restore(flags)?
>       The mods to hpte_need_flush() look like what we need in 2.6.37-stable
>       to keep CONFIG_DEBUG_PREEMPT vfree() quiet, perhaps should be separated
>       out - but perhaps they're inappropriate and Ben has another fix in mind.

I'll await Ben's answer on this, but yeah, he might consider tuning the
192.

> 10/21 sparc-preemptible_mmu_gather.patch
>       Similarly, looked okay so far as I could tell, and this one was
>       already doing flush_tlb_pending in switch_to; more of the 192
>       (not from you, of course).  tlb_batch_add() has some commented-out
>       (tb->fullmm) code that you probably meant to come back to.
>       mm/init_32.c still has DEFINE_PER_CPU(struct mmu_gather, mmu_gathers).

Ah, right.. XXX

> 11/21 s390-preemptible_mmu_gather.patch
>       I'd prefer __tlb_alloc_page(), with __GFP_NOWARN as suggested above.
>       mm/pgtable.c still has DEFINE_PER_CPU(struct mmu_gather, mmu_gathers).

Martin, while doing the below DEFINE_PER_CPU removal I saw you had a
bunch of RCU table removal thingies in arch/s390/mm/pgtable.c; could
s390 use the generic bits like sparc and powerpc do (see patch 16)?

> 12/21 arm-preemptible_mmu_gather.patch
> 13/21 sh-preemptible_mmu_gather.patch
> 14/21 um-preemptible_mmu_gather.patch
> 15/21 ia64-preemptible_mmu_gather.patch
>       All straightforward, but DEFINE_PER_CPU(struct mmu_gather, mmu_gathers)
>       still to be removed from these and other arches.

I've added a patch removing all the DEFINE_PER_CPU(struct mmu_gather,
mmu_gathers) thingies.

One thing I was wondering about, should I fold all these patches into
one big patch to improve bisectability? Because after the first patch
all !generic-tlb archs won't compile anymore due to the mm/* changes.

> 16/21 mm_powerpc-move_the_rcu_page-table_freeing_into.patch
>       Seems good, prefer Ben and Dave to Ack.  "copmletion" -> "completion".

Fixed the typo, thanks!

> 18/21 mutex-provide_mutex_is_contended.patch
>       I suppose so, though if we use it in the truncate path, then we are
>       stuck with the vm_truncate_count stuff I'd rather hoped would go away;
>       but I guess you're right, that if we did spin_needbreak/need_lockbreak
>       before, then we ought to do this now - though I suspect I only added
>       it because I had to insert a resched-point anyway, and it seemed a good
>       idea at the time to check lockbreak too since that had just been added.

Since it's now preemptible we might consider simply removing that. I
just wanted to keep the changes to a minimum for now.

> 19/21 mm-convert_i_mmap_lock_and_anon_vma-_lock_to_mutexes.patch
>       I suggest doing just the i_mmap_lock->mutex conversion at this point.
>       Acked-by: Hugh Dickins <hughd@google.com>
>       except that in the past we have renamed a lock when we've done this
>       kind of conversion, so I'd expect i_mmap_mutex throughout now.
>       Or am I just out of date?  I don't feel very strongly about it.

Done, split the conversion and did the s/_lock/_mutex/ thing.

> 20/21 mm-extended_batches_for_generic_mmu_gather.patch
>       Acked-by: Hugh Dickins <hughd@google.com>
>       though it struck me as overdesign at first: I guess Nick wanted it
>       because he had an implementation that used the pagetables themselves,
>       hence an assured supply of these buffers.  tlb_finish_mmu(), and
>       perhaps others, looking rather too big for inline by this stage.

Yeah, they are a bit beefy, maybe I should move some of that into
mm/memory.c.
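
[ For reference, a sketch of the extended-batch idea being acked here --
  structure and names are assumed rather than copied from the 20/21 patch:
  page-sized batches chained off the mmu_gather, so tlb_finish_mmu() can
  walk and free the whole list once the TLB flush is done. ]

        struct mmu_gather_batch {
                struct mmu_gather_batch *next;  /* chained batches */
                unsigned int            nr;     /* pages queued in this batch */
                unsigned int            max;    /* capacity of pages[] */
                struct page             *pages[0];
        };

        #define MAX_GATHER_BATCH \
                ((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *))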

> 04/21 mm-rename_drop_anon_vma_to_put_anon_vma.patch
>       Acked-by: Hugh Dickins <hughd@google.com>
>       but (if you don't mind: leave it to me if you prefer) in mm/ksm.c
>       please just remove wrappers hold_anon_vma() and ksm_put_anon_vma():
>       they had a point when they originated the refcount but no point now.
>       Note there are now two places to update in mm/migrate.c in 38-rc1.

Done, zapped those wrappers.

> 05/21 mm-move_anon_vma_ref_out_from_under_config_ksm.patch
>       Acked-by: Hugh Dickins <hughd@google.com>
>       but you shouldn't need to touch mm/migrate.c again here with 38-rc1.
>       Didn't you end up double-decrementing refcount in the huge_page case?

I'm afraid I need a little help here, what huge_page case? I tried
applying this comment to both patches 5 and 6, but failed to find a
huge_page case..

> 06/21 mm-simplify_anon_vma_refcounts.patch
>       Acked-by: Hugh Dickins <hughd@google.com>
>       except page_get_anon_vma() is being declared in rmap.h a patch early,
>       and you shouldn't need to touch mm/ksm.c again here with 38-rc1.
>       Did wonder if __put_anon_vma() is right to put anon_vma->root *before*
>       freeing anon_vma, but suppose your not_zero strictness makes it safe.

It seemed like the natural order to do things, release the reference we
hold on ->root right before we free ourselves.

The race you're worried about is the page_lock_anon_vma() where we
access ->root? Afaict that's ok because we check page_mapped() and
decrementing that should be done _before_ the last put_anon_vma(),
otherwise that function is already racy.

> 07/21 mm-use_refcounts_for_page_lock_anon_vma.patch
>       Acked-by: Hugh Dickins <hughd@google.com>
>       but here I'm expecting you to use your page_get_anon_vma() in
>       mm/migrate.c too, to replace my 38-rc1 lock/get/unlock sequences.

done

>       Second page_mapped() test in page_get_anon_vma(): remove "goto out;"
>       from that block, it's already reached "out".  

Paranoia on my side, done.
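
[ For reference, a rough sketch of the shape we're discussing -- not the
  actual patch, details may differ: the anon_vma is picked out of
  page->mapping under RCU, pinned with atomic_inc_not_zero(), and the
  second page_mapped() test simply falls through to "out": ]

        struct anon_vma *page_get_anon_vma(struct page *page)
        {
                struct anon_vma *anon_vma = NULL;
                unsigned long mapping;

                rcu_read_lock();
                mapping = (unsigned long)ACCESS_ONCE(page->mapping);
                if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
                        goto out;
                if (!page_mapped(page))
                        goto out;

                anon_vma = (struct anon_vma *)(mapping - PAGE_MAPPING_ANON);
                if (!atomic_inc_not_zero(&anon_vma->refcount)) {
                        anon_vma = NULL;
                        goto out;
                }

                /*
                 * Re-check now that we hold a reference: if the page was
                 * unmapped meanwhile, drop the pin again -- falling
                 * through to "out", no extra goto needed.
                 */
                if (!page_mapped(page)) {
                        put_anon_vma(anon_vma);
                        anon_vma = NULL;
                }
        out:
                rcu_read_unlock();
                return anon_vma;
        }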

>       In patch description,
>       didn't understand "for each of convertion": "for sake of conversion"?

Uhm.. yeah my brain must have slipped there or something, not quite sure
what I meant, let me make that:

 "This is done to prepare for the conversion of anon_vma from spinlock
to mutex."

>       This brings us to a nice point, ready for the lock->mutex conversion:
>       the only defect being the doubled atomics in page_(un)lock_anon_vma.

Yeah, that double atomic is sadness, you've seen what I've come up with
to avoid that..
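
[ Just to spell out where the doubled atomics come from (schematic, not
  the actual code): the refcount get/put brackets the root-lock
  acquire/release, so both the lock and the unlock side pay two atomic
  operations. ]

        struct anon_vma *page_lock_anon_vma(struct page *page)
        {
                struct anon_vma *anon_vma = page_get_anon_vma(page);    /* atomic 1 */

                if (anon_vma)
                        anon_vma_lock(anon_vma);                        /* atomic 2 */
                return anon_vma;
        }

        void page_unlock_anon_vma(struct anon_vma *anon_vma)
        {
                anon_vma_unlock(anon_vma);                              /* atomic 1 */
                put_anon_vma(anon_vma);                                 /* atomic 2 */
        }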

> 19/21 mm-convert_i_mmap_lock_and_anon_vma-_lock_to_mutexes.patch
>       I suggest doing the anon_vma lock->mutex conversion separately here.
>       Acked-by: Hugh Dickins <hughd@google.com>
>       except that in the past we have renamed a lock when we've done this
>       kind of conversion, so I'd expect anon_vma->mutex throughout now.
>       Or am I just out of date?  I don't feel very strongly about it.

Done.. however:

Index: linux-2.6/include/linux/huge_mm.h
===================================================================
--- linux-2.6.orig/include/linux/huge_mm.h
+++ linux-2.6/include/linux/huge_mm.h
@@ -91,12 +91,8 @@ extern void __split_huge_page_pmd(struct
 #define wait_split_huge_page(__anon_vma, __pmd)				\
 	do {								\
 		pmd_t *____pmd = (__pmd);				\
-		spin_unlock_wait(&(__anon_vma)->root->lock);		\
-		/*							\
-		 * spin_unlock_wait() is just a loop in C and so the	\
-		 * CPU can reorder anything around it.			\
-		 */							\
-		smp_mb();						\
+		anon_vma_lock(__anon_vma);				\
+		anon_vma_unlock(__anon_vma);				\
 		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
 		       pmd_trans_huge(*____pmd));			\
 	} while (0)

Andrea, is that smp_mb() simply to avoid us doing anything before the
lock is free? Why isn't there an mb() before to ensure nothing leaks
past it from the other end?

> 21/21 mm-optimize_page_lock_anon_vma_fast-path.patch
>       I certainly see the call for this patch, I want to eliminate those
>       doubled atomics too.  This appears correct to me, and I've not dreamt
>       up an alternative; but I do dislike it, and I suspect you don't like
>       it much either.  I'm ambivalent about it, would love a better patch.

Like said, I fully agree with that sentiment, just haven't been able to
come up with anything saner :/ Although I can optimize the
__put_anon_vma() path a bit by doing something like:

  if (mutex_is_locked()) { anon_vma_lock(); anon_vma_unlock(); }

But I bet that wants a barrier someplace and my head hurts.. 

> sparc64-Kill_page_table_quicklists.patch
> sparc64-Use_RCU_page_table_freeing.patch
> sparc64-Add_support_for__PAGE_SPECIA.patch
> sparc64-Implement_get_user_pages_fast.patch
>       I did not spend very long looking at these, none of my business really!
>       but did notice one thing I didn't like, that pte_special() is declared
>       unsigned long in the third, whereas int in every other architecture. I
>       think it should follow the ia64-style there, use != 0 to return an int.

Dave, do you want me to make those changes, or will you once the rest of
this stuff makes it upstream?

> A few checkpatch warnings, many of which I don't particularly agree with -
> though I do get annoyed by comments going over 80-cols without any need!

Agreed, although I didn't spot any comments crossing the 80-column
boundary.



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
  2011-01-19 17:10     ` Peter Zijlstra
@ 2011-01-20 19:57       ` Hugh Dickins
  -1 siblings, 0 replies; 157+ messages in thread
From: Hugh Dickins @ 2011-01-20 19:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Benjamin Herrenschmidt, David Miller, Nick Piggin,
	Martin Schwidefsky, linux-kernel, linux-arch, linux-mm,
	Andrea Arcangeli

We seem to agree nicely on most points :) just a few extracts:

On Wed, 19 Jan 2011, Peter Zijlstra wrote:
> On Mon, 2011-01-17 at 23:12 -0800, Hugh Dickins wrote:
> 
> One thing I was wondering about, should I fold all these patches into
> one big patch to improve bisectability? Because after the first patch
> all !generic-tlb archs won't compile anymore due to the mm/* changes.

Ah.  Personally I like how you've split it up: certainly the larger
powerpc and sparc patches are easier to grasp in their smaller pieces,
and each arch needs to collect approval from each arch maintainer.

But that doesn't preclude lumping them all together for the final commit,
for ease of bisection.  I think Andrew and Ben are the ones to decide
on that: Ben's bisectability got burnt by THP just now, so he'll have
strong feelings and good reasons.

> > 18/21 mutex-provide_mutex_is_contended.patch
> >       I suppose so, though if we use it in the truncate path, then we are
> >       stuck with the vm_truncate_count stuff I'd rather hoped would go away;
> >       but I guess you're right, that if we did spin_needbreak/need_lockbreak
> >       before, then we ought to do this now - though I suspect I only added
> >       it because I had to insert a resched-point anyway, and it seemed a good
> >       idea at the time to check lockbreak too since that had just been added.
> 
> Since it's now preemptible we might consider simply removing that. I
> just wanted to keep the changes to a minimum for now.

Removing that along with the restart_addr stuff, yes.  I keep wavering.
Yes, it would be nice to get rid of all that, particularly now we find
it's had holes in it all these years.  The only significant loser, I
think, would be page reclaim (when concurrent with truncation): it could
spin for a long time waiting for the i_mmap_mutex it expects would soon
be dropped?

> > 05/21 mm-move_anon_vma_ref_out_from_under_config_ksm.patch
> >       Acked-by: Hugh Dickins <hughd@google.com>
> >       but you shouldn't need to touch mm/migrate.c again here with 38-rc1.
> >       Didn't you end up double-decrementing refcount in the huge_page case?
> 
> I'm afraid I need a little help here, what huge_page case? I tried
> applying this comment to both patches 5 and 6, but failed to find a
> huge_page case..

The 05/21 I was looking at had a hunk modifying mm/migrate.c, replacing
anon_vma->external_refcount by anon_vma->refcount in two places in
unmap_and_move_huge_page().  But once you've rebased to 38-rc1, it's
a different story: unmap_and_move_huge_page() now uses the same anon_vma
functions as unmap_and_move(), neither referring to refcount directly.

After your patchset, unmap_and_move_huge_page() ended up doing:

	if (anon_vma && atomic_dec_and_mutex_lock(&anon_vma->refcount,
					    &anon_vma->lock)) {
		int empty = list_empty(&anon_vma->head);
		mutex_unlock(&anon_vma->lock);
		if (empty)
			put_anon_vma(anon_vma);
	}

where put_anon_vma() does:

	if (atomic_dec_and_test(&anon_vma->refcount))
		__put_anon_vma(anon_vma);

But that should just be history since 38-rc1 simplified it.

> > 06/21 mm-simplify_anon_vma_refcounts.patch
> >       Acked-by: Hugh Dickins <hughd@google.com>
> >       except page_get_anon_vma() is being declared in rmap.h a patch early,
> >       and you shouldn't need to touch mm/ksm.c again here with 38-rc1.
> >       Did wonder if __put_anon_vma() is right to put anon_vma->root *before*
> >       freeing anon_vma, but suppose your not_zero strictness makes it safe.
> 
> It seemed like the natural order to do things, release the reference we
> hold on ->root right before we free ourselves.
> 
> The race you're worried about is the page_lock_anon_vma() where we
> access ->root? Afaict that's ok because we check page_mapped() and
> decrementing that should be done _before_ the last put_anon_vma(),
> otherwise that function is already racy.

I didn't get as far as worrying about any particular race: it was just
that I thought we used to be careful that root be the last to be freed,
for fear of references coming via those referring to it, even though
they are about to be freed.

Looking at 38-rc1 I see anon_vma_unlink() does
	if (empty) {
		/* We no longer need the root anon_vma */
		if (anon_vma->root != anon_vma)
			drop_anon_vma(anon_vma->root);
		anon_vma_free(anon_vma);
	}
but drop_anon_vma() does
		if (empty) {
			anon_vma_free(anon_vma);
			if (root_empty && last_root_user)
				anon_vma_free(root);
		}

So long as the safeguards are right elsewhere, it shouldn't matter.

> 
> > 19/21 mm-convert_i_mmap_lock_and_anon_vma-_lock_to_mutexes.patch
> >       I suggest doing the anon_vma lock->mutex conversion separately here.
> >       Acked-by: Hugh Dickins <hughd@google.com>
> >       except that in the past we have renamed a lock when we've done this
> >       kind of conversion, so I'd expect anon_vma->mutex throughout now.
> >       Or am I just out of date?  I don't feel very strongly about it.
> 
> Done.. however:
> 
> Index: linux-2.6/include/linux/huge_mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/huge_mm.h
> +++ linux-2.6/include/linux/huge_mm.h
> @@ -91,12 +91,8 @@ extern void __split_huge_page_pmd(struct
>  #define wait_split_huge_page(__anon_vma, __pmd)				\
>  	do {								\
>  		pmd_t *____pmd = (__pmd);				\
> -		spin_unlock_wait(&(__anon_vma)->root->lock);		\
> -		/*							\
> -		 * spin_unlock_wait() is just a loop in C and so the	\
> -		 * CPU can reorder anything around it.			\
> -		 */							\
> -		smp_mb();						\
> +		anon_vma_lock(__anon_vma);				\
> +		anon_vma_unlock(__anon_vma);				\
>  		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
>  		       pmd_trans_huge(*____pmd));			\
>  	} while (0)
> 
> Andrea, is that smp_mb() simply to avoid us doing anything before the
> lock is free? Why isn't there an mb() before to ensure nothing leaks
> past it from the other end?

I'll leave that in for Andrea to answer, I've not given it any thought.

> 
> > 21/21 mm-optimize_page_lock_anon_vma_fast-path.patch
> >       I certainly see the call for this patch, I want to eliminate those
> >       doubled atomics too.  This appears correct to me, and I've not dreamt
> >       up an alternative; but I do dislike it, and I suspect you don't like
> >       it much either.  I'm ambivalent about it, would love a better patch.
> 
> Like said, I fully agree with that sentiment, just haven't been able to
> come up with anything saner :/ Although I can optimize the
> __put_anon_vma() path a bit by doing something like:
> 
>   if (mutex_is_locked()) { anon_vma_lock(); anon_vma_unlock(); }
> 
> But I bet that wants a barrier someplace and my head hurts.. 

Without daring to hurt my head very much, yes, I'd say those kind
of "optimizations" have a habit of turning out to be racily wrong.

But you put your finger on it: if you hadn't had to add that lock-
unlock pair into __put_anon_vma(), I wouldn't have minded the
contortions added to page_lock_anon_vma().

> > A few checkpatch warnings, many of which I don't particularly agree with -
> > though I do get annoyed by comments going over 80-cols without any need!
> 
> Agreed, although I didn't spot any comments crossing the 80-column
> boundary.

06/21 mm-simplify_anon_vma_refcounts.patch
+		 * Initialise the anon_vma root to point to itself. If called from

Hugh

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
  2011-01-20 19:57       ` Hugh Dickins
@ 2011-01-21  7:36         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 157+ messages in thread
From: Benjamin Herrenschmidt @ 2011-01-21  7:36 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Zijlstra, Andrew Morton, David Miller, Nick Piggin,
	Martin Schwidefsky, linux-kernel, linux-arch, linux-mm,
	Andrea Arcangeli

On Thu, 2011-01-20 at 11:57 -0800, Hugh Dickins wrote:
> 
> But that doesn't preclude lumping them all together for the final
> commit,
> for ease of bisection.  I think Andrew and Ben are the ones to decide
> on that: Ben's bisectability got burnt by THP just now, so he'll have
> strong feelings and good reasons.

I do hate bisectability breakage indeed ... :-)

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
  2011-01-20 19:57       ` Hugh Dickins
@ 2011-01-21 15:33         ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2011-01-21 15:33 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Benjamin Herrenschmidt, David Miller, Nick Piggin,
	Martin Schwidefsky, linux-kernel, linux-arch, linux-mm,
	Andrea Arcangeli, Oleg Nesterov, Paul E. McKenney

On Thu, 2011-01-20 at 11:57 -0800, Hugh Dickins wrote:
> > > 21/21 mm-optimize_page_lock_anon_vma_fast-path.patch
> > >       I certainly see the call for this patch, I want to eliminate those
> > >       doubled atomics too.  This appears correct to me, and I've not dreamt
> > >       up an alternative; but I do dislike it, and I suspect you don't like
> > >       it much either.  I'm ambivalent about it, would love a better patch.
> > 
> > Like said, I fully agree with that sentiment, just haven't been able to
> > come up with anything saner :/ Although I can optimize the
> > __put_anon_vma() path a bit by doing something like:
> > 
> >   if (mutex_is_locked()) { anon_vma_lock(); anon_vma_unlock(); }
> > 
> > But I bet that wants a barrier someplace and my head hurts.. 
> 
> Without daring to hurt my head very much, yes, I'd say those kind
> of "optimizations" have a habit of turning out to be racily wrong.
> 
> But you put your finger on it: if you hadn't had to add that lock-
> unlock pair into __put_anon_vma(), I wouldn't have minded the
> contortions added to page_lock_anon_vma(). 

I think there's just about enough implied barriers there that the
'simple' code just works ;-)

But given that I'm trying to think with snot for brains thanks to some
cold, I don't trust myself at all to have gotten this right.

[ for Oleg and Paul: https://lkml.org/lkml/2010/11/26/213 contains the
full patch this is against ]

---
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -1559,9 +1559,20 @@ void __put_anon_vma(struct anon_vma *ano
 	 * Synchronize against page_lock_anon_vma() such that
 	 * we can safely hold the lock without the anon_vma getting
 	 * freed.
+	 *
+	 * Relies on the full mb implied by the atomic_dec_and_test() from
+	 * put_anon_vma() against the full mb implied by mutex_trylock() from
+	 * page_lock_anon_vma(). This orders:
+	 *
+	 * page_lock_anon_vma()		VS	put_anon_vma()
+	 *   mutex_trylock()			  atomic_dec_and_test()
+	 *   smp_mb()				  smp_mb()
+	 *   atomic_read()			  mutex_is_locked()
 	 */
-	anon_vma_lock(anon_vma);
-	anon_vma_unlock(anon_vma);
+	if (mutex_is_locked(&anon_vma->root->mutex)) {
+		anon_vma_lock(anon_vma);
+		anon_vma_unlock(anon_vma);
+	}
 
 	if (anon_vma->root != anon_vma)
 		put_anon_vma(anon_vma->root);


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
  2011-01-19 17:10     ` Peter Zijlstra
@ 2011-01-21 17:44       ` Andrea Arcangeli
  -1 siblings, 0 replies; 157+ messages in thread
From: Andrea Arcangeli @ 2011-01-21 17:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Andrew Morton, Benjamin Herrenschmidt,
	David Miller, Nick Piggin, Martin Schwidefsky, linux-kernel,
	linux-arch, linux-mm

On Wed, Jan 19, 2011 at 06:10:39PM +0100, Peter Zijlstra wrote:
> > 19/21 mm-convert_i_mmap_lock_and_anon_vma-_lock_to_mutexes.patch
> >       I suggest doing the anon_vma lock->mutex conversion separately here.
> >       Acked-by: Hugh Dickins <hughd@google.com>
> >       except that in the past we have renamed a lock when we've done this
> >       kind of conversion, so I'd expect anon_vma->mutex throughout now.
> >       Or am I just out of date?  I don't feel very strongly about it.
> 
> Done.. however:
> 
> Index: linux-2.6/include/linux/huge_mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/huge_mm.h
> +++ linux-2.6/include/linux/huge_mm.h
> @@ -91,12 +91,8 @@ extern void __split_huge_page_pmd(struct
>  #define wait_split_huge_page(__anon_vma, __pmd)				\
>  	do {								\
>  		pmd_t *____pmd = (__pmd);				\
> -		spin_unlock_wait(&(__anon_vma)->root->lock);		\
> -		/*							\
> -		 * spin_unlock_wait() is just a loop in C and so the	\
> -		 * CPU can reorder anything around it.			\
> -		 */							\
> -		smp_mb();						\
> +		anon_vma_lock(__anon_vma);				\
> +		anon_vma_unlock(__anon_vma);				\
>  		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
>  		       pmd_trans_huge(*____pmd));			\
>  	} while (0)
> 
> Andrea, is that smp_mb() simply to avoid us doing anything before the
> lock is free? Why isn't there an mb() before to ensure nothing leaks
> past it from the other end?

The idea would be that by the time we read the pmd set to
pmd_trans_splitting with the page_table_lock held, we're guaranteed
we'll see the anon_vma locked (if it's still locked). So it's ok if
spin_unlock_wait happens before the pmd_trans_splitting check is read
inside the CPU (all that matters is that it doesn't happen before the
spin_lock(&page_table_lock), which it can't, by the acquire semantics
of the spinlock).

So in short we know we start with the anon_vma locked, and we just
wait as long as needed.

So we only need to protect the stuff after spin_unlock_wait().
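
[ A schematic of the waiter side being described, assuming the usual THP
  lookup pattern -- not actual code from the series: ]

        /*
         * The pmd is sampled under mm->page_table_lock; its acquire keeps
         * the read from moving up before the lock.  By the time we can see
         * pmd_trans_splitting() set, the splitter is guaranteed to hold
         * (or to have held) the anon_vma lock, so waiting on that lock is
         * all the serialization we need.
         */
        spin_lock(&mm->page_table_lock);
        if (unlikely(pmd_trans_splitting(*pmd))) {
                spin_unlock(&mm->page_table_lock);
                wait_split_huge_page(vma->anon_vma, pmd);
        } else {
                spin_unlock(&mm->page_table_lock);
                /* stable: either a settled huge pmd or already split */
        }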

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
  2011-01-21 15:33         ` Peter Zijlstra
@ 2011-01-22 21:06           ` Paul E. McKenney
  -1 siblings, 0 replies; 157+ messages in thread
From: Paul E. McKenney @ 2011-01-22 21:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Andrew Morton, Benjamin Herrenschmidt,
	David Miller, Nick Piggin, Martin Schwidefsky, linux-kernel,
	linux-arch, linux-mm, Andrea Arcangeli, Oleg Nesterov

On Fri, Jan 21, 2011 at 04:33:54PM +0100, Peter Zijlstra wrote:
> On Thu, 2011-01-20 at 11:57 -0800, Hugh Dickins wrote:
> > > > 21/21 mm-optimize_page_lock_anon_vma_fast-path.patch
> > > >       I certainly see the call for this patch, I want to eliminate those
> > > >       doubled atomics too.  This appears correct to me, and I've not dreamt
> > > >       up an alternative; but I do dislike it, and I suspect you don't like
> > > >       it much either.  I'm ambivalent about it, would love a better patch.
> > > 
> > > Like said, I fully agree with that sentiment, just haven't been able to
> > > come up with anything saner :/ Although I can optimize the
> > > __put_anon_vma() path a bit by doing something like:
> > > 
> > >   if (mutex_is_locked()) { anon_vma_lock(); anon_vma_unlock(); }
> > > 
> > > But I bet that wants a barrier someplace and my head hurts.. 
> > 
> > Without daring to hurt my head very much, yes, I'd say those kind
> > of "optimizations" have a habit of turning out to be racily wrong.
> > 
> > But you put your finger on it: if you hadn't had to add that lock-
> > unlock pair into __put_anon_vma(), I wouldn't have minded the
> > contortions added to page_lock_anon_vma(). 
> 
> I think there's just about enough implied barriers there that the
> 'simple' code just works ;-)
> 
> But given that I'm trying to think with snot for brains thanks to some
> cold, I don't trust myself at all to have gotten this right.
> 
> [ for Oleg and Paul: https://lkml.org/lkml/2010/11/26/213 contains the
> full patch this is against ]
> 
> ---
> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c
> @@ -1559,9 +1559,20 @@ void __put_anon_vma(struct anon_vma *ano
>  	 * Synchronize against page_lock_anon_vma() such that
>  	 * we can safely hold the lock without the anon_vma getting
>  	 * freed.
> +	 *
> +	 * Relies on the full mb implied by the atomic_dec_and_test() from
> +	 * put_anon_vma() against the full mb implied by mutex_trylock() from
> +	 * page_lock_anon_vma(). This orders:
> +	 *
> +	 * page_lock_anon_vma()		VS	put_anon_vma()
> +	 *   mutex_trylock()			  atomic_dec_and_test()
> +	 *   smp_mb()				  smp_mb()
> +	 *   atomic_read()			  mutex_is_locked()
>  	 */
> -	anon_vma_lock(anon_vma);
> -	anon_vma_unlock(anon_vma);
> +	if (mutex_is_locked(&anon_vma->root->mutex)) {
> +		anon_vma_lock(anon_vma);
> +		anon_vma_unlock(anon_vma);
> +	}
>  
>  	if (anon_vma->root != anon_vma)
>  		put_anon_vma(anon_vma->root);
> 

OK, so the anon_vma slab cache is SLAB_DESTROY_BY_RCU.  Presumably
all callers of page_lock_anon_vma() check the identity of the page
that got locked, since it might be recycled at any time.  But when
I look at 2.6.37, I only see checks for NULL.  So I am assuming
that this code is supposed to prevent such recycling.

I am not sure that I am seeing a consistent snapshot of all of the
relevant code, in particular, I am guessing that the ->lock and ->mutex
are the result of changes rather than there really being both a spinlock
and a mutex in anon_vma.  Mainline currently has a lock, FWIW.  But from
what I do see, I am concerned about the following sequence of events:

o	CPU 0 starts executing page_lock_anon_vma() as shown at
	https://lkml.org/lkml/2010/11/26/213, fetches the pointer
	to anon_vma->root->lock, but does not yet invoke
	mutex_trylock().

o	CPU 1 executes __put_anon_vma() above on the same VMA
	that CPU 0 is attempting to use.  It sees that the
	anon_vma->root->mutex (presumably AKA ->lock) is not held,
	so it calls anon_vma_free().

o	CPU 2 reallocates the anon_vma freed by CPU 1, so that it
	now has a non-zero reference count.

o	CPU 0 continues execution, incorrectly acquiring a reference
	to the now-recycled anon_vma.

Or am I misunderstanding what this code is trying to do?
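
For concreteness, here is a minimal userspace model of the pattern I am
worried about -- pthreads plus C11 atomics, emphatically not the kernel
code, and with all names invented for illustration:

/* Userspace model of the check-then-free pattern (NOT the kernel code). */
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

struct obj {
	pthread_mutex_t lock;
	atomic_int      refcount;
};

/* Model of the page_lock_anon_vma() fast path: trylock, then re-check
 * the refcount under the lock.  Returns the object locked, or NULL. */
static struct obj *lock_side(struct obj *o)
{
	if (pthread_mutex_trylock(&o->lock) != 0)
		return NULL;				/* slow path elided */
	if (atomic_load(&o->refcount) == 0) {
		pthread_mutex_unlock(&o->lock);		/* object already dying */
		return NULL;
	}
	return o;
}

/* Model of __put_anon_vma() with the proposed optimization: only bounce
 * through the lock if somebody appears to hold it, then free. */
static void put_side(struct obj *o)
{
	if (atomic_fetch_sub(&o->refcount, 1) != 1)
		return;
	/* "if (mutex_is_locked()) { lock(); unlock(); }" modeled via trylock */
	if (pthread_mutex_trylock(&o->lock) != 0) {
		pthread_mutex_lock(&o->lock);		/* wait for the holder */
		pthread_mutex_unlock(&o->lock);
	} else {
		pthread_mutex_unlock(&o->lock);
	}
	/*
	 * The window: a lock_side() caller that has fetched 'o' but not yet
	 * reached its trylock is invisible to the check above, so we free
	 * underneath it and it then locks recycled memory (CPU 0 vs CPU 1/2
	 * in the scenario above).
	 */
	pthread_mutex_destroy(&o->lock);
	free(o);
}

int main(void)
{
	struct obj *o = malloc(sizeof(*o));

	pthread_mutex_init(&o->lock, NULL);
	atomic_init(&o->refcount, 1);

	/* single-threaded smoke test; the race needs lock_side() and
	 * put_side() running concurrently on the same object */
	if (lock_side(o))
		pthread_mutex_unlock(&o->lock);
	put_side(o);
	return 0;
}

The model only compiles and runs single-threaded; its point is just to
show where the unsynchronized window between "fetch pointer" and
"trylock" sits.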

							Thanx, Paul

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
@ 2011-01-22 21:06           ` Paul E. McKenney
  0 siblings, 0 replies; 157+ messages in thread
From: Paul E. McKenney @ 2011-01-22 21:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Andrew Morton, Benjamin Herrenschmidt,
	David Miller, Nick Piggin, Martin Schwidefsky, linux-kernel,
	linux-arch, linux-mm, Andrea Arcangeli, Oleg Nesterov

On Fri, Jan 21, 2011 at 04:33:54PM +0100, Peter Zijlstra wrote:
> On Thu, 2011-01-20 at 11:57 -0800, Hugh Dickins wrote:
> > > > 21/21 mm-optimize_page_lock_anon_vma_fast-path.patch
> > > >       I certainly see the call for this patch, I want to eliminate those
> > > >       doubled atomics too.  This appears correct to me, and I've not dreamt
> > > >       up an alternative; but I do dislike it, and I suspect you don't like
> > > >       it much either.  I'm ambivalent about it, would love a better patch.
> > > 
> > > Like said, I fully agree with that sentiment, just haven't been able to
> > > come up with anything saner :/ Although I can optimize the
> > > __put_anon_vma() path a bit by doing something like:
> > > 
> > >   if (mutex_is_locked()) { anon_vma_lock(); anon_vma_unlock(); }
> > > 
> > > But I bet that wants a barrier someplace and my head hurts.. 
> > 
> > Without daring to hurt my head very much, yes, I'd say those kind
> > of "optimizations" have a habit of turning out to be racily wrong.
> > 
> > But you put your finger on it: if you hadn't had to add that lock-
> > unlock pair into __put_anon_vma(), I wouldn't have minded the
> > contortions added to page_lock_anon_vma(). 
> 
> I think there's just about enough implied barriers there that the
> 'simple' code just works ;-)
> 
> But given that I'm trying to think with snot for brains thanks to some
> cold, I don't trust myself at all to have gotten this right.
> 
> [ for Oleg and Paul: https://lkml.org/lkml/2010/11/26/213 contains the
> full patch this is against ]
> 
> ---
> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c
> @@ -1559,9 +1559,20 @@ void __put_anon_vma(struct anon_vma *ano
>  	 * Synchronize against page_lock_anon_vma() such that
>  	 * we can safely hold the lock without the anon_vma getting
>  	 * freed.
> +	 *
> +	 * Relies on the full mb implied by the atomic_dec_and_test() from
> +	 * put_anon_vma() against the full mb implied by mutex_trylock() from
> +	 * page_lock_anon_vma(). This orders:
> +	 *
> +	 * page_lock_anon_vma()		VS	put_anon_vma()
> +	 *   mutex_trylock()			  atomic_dec_and_test()
> +	 *   smp_mb()				  smp_mb()
> +	 *   atomic_read()			  mutex_is_locked()
>  	 */
> -	anon_vma_lock(anon_vma);
> -	anon_vma_unlock(anon_vma);
> +	if (mutex_is_locked(&anon_vma->root->mutex)) {
> +		anon_vma_lock(anon_vma);
> +		anon_vma_unlock(anon_vma);
> +	}
>  
>  	if (anon_vma->root != anon_vma)
>  		put_anon_vma(anon_vma->root);
> 

OK, so the anon_vma slab cache is SLAB_DESTROY_BY_RCU.  Presumably
all callers of page_lock_anon_vma() check the identity of the page
that got locked, since it might be recycled at any time.  But when
I look at 2.6.37, I only see checks for NULL.  So I am assuming
that this code is supposed to prevent such recycling.

I am not sure that I am seeing a consistent snapshot of all of the
relevant code, in particular, I am guessing that the ->lock and ->mutex
are the result of changes rather than there really being both a spinlock
and a mutex in anon_vma.  Mainline currently has a lock, FWIW.  But from
what I do see, I am concerned about the following sequence of events:

o	CPU 0 starts executing page_lock_anon_vma() as shown at
	https://lkml.org/lkml/2010/11/26/213, fetches the pointer
	to anon_vma->root->lock, but does not yet invoke
	mutex_trylock().

o	CPU 1 executes __put_anon_vma() above on the same VMA
	that CPU 0 is attempting to use.  It sees that the
	anon_vma->root->mutex (presumably AKA ->lock) is not held,
	so it calls anon_vma_free().

o	CPU 2 reallocates the anon_vma freed by CPU 1, so that it
	now has a non-zero reference count.

o	CPU 0 continues execution, incorrectly acquiring a reference
	to the now-recycled anon_vma.

Or am I misunderstanding what this code is trying to do?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
  2011-01-22 21:06           ` Paul E. McKenney
@ 2011-01-23 11:03             ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2011-01-23 11:03 UTC (permalink / raw)
  To: paulmck
  Cc: Hugh Dickins, Andrew Morton, Benjamin Herrenschmidt,
	David Miller, Nick Piggin, Martin Schwidefsky, linux-kernel,
	linux-arch, linux-mm, Andrea Arcangeli, Oleg Nesterov

On Sat, 2011-01-22 at 13:06 -0800, Paul E. McKenney wrote:

> OK, so the anon_vma slab cache is SLAB_DESTROY_BY_RCU.  Presumably
> all callers of page_lock_anon_vma() check the identity of the page
> that got locked, since it might be recycled at any time.  But when
> I look at 2.6.37, I only see checks for NULL.  So I am assuming
> that this code is supposed to prevent such recycling.
> 
> I am not sure that I am seeing a consistent snapshot of all of the
> relevant code, in particular, I am guessing that the ->lock and ->mutex
> are the result of changes rather than there really being both a spinlock
> and a mutex in anon_vma. 

Correct, my earlier spinlock -> mutex conversion left it being called
->lock, but Hugh (rightly) pointed out that I should rename it too, so
in the new (as yet unposted) version it's called ->mutex.

>  Mainline currently has a lock, FWIW.  But from
> what I do see, I am concerned about the following sequence of events:
> 
> o	CPU 0 starts executing page_lock_anon_vma() as shown at
> 	https://lkml.org/lkml/2010/11/26/213, fetches the pointer
> 	to anon_vma->root->lock, but does not yet invoke
> 	mutex_trylock().
> 
> o	CPU 1 executes __put_anon_vma() above on the same VMA
> 	that CPU 0 is attempting to use.  It sees that the
> 	anon_vma->root->mutex (presumably AKA ->lock) is not held,
> 	so it calls anon_vma_free().
> 
> o	CPU 2 reallocates the anon_vma freed by CPU 1, so that it
> 	now has a non-zero reference count.
> 
> o	CPU 0 continues execution, incorrectly acquiring a reference
> 	to the now-recycled anon_vma.
> 
> Or am I misunderstanding what this code is trying to do?

No, that is quite right and possible; it's one of the many subtle issues
surrounding the existing page_lock_anon_vma(). We can indeed return a
locked anon_vma that is not in fact related to the page we asked it for.
All calling code SHOULD, and afaict does, deal with that, mostly by
calling things like vma_address(vma, page) for every vma obtained from
the anon_vma, to verify the page is indeed (or not) part of the vma.

The race we guard against with all the fancy stuff is the page itself
getting unmapped and us returning an anon_vma for an unmapped page.

And of course, returning a locked but free'd anon_vma, that too isn't
allowed ;-)
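
For reference, the caller-side check has roughly this shape (a sketch
modeled on page_referenced_anon() as I remember it around 2.6.37; field
and helper names are quoted from memory, so treat them as illustrative
rather than authoritative, and this obviously isn't buildable on its own):

/* Sketch only: every vma hanging off the anon_vma is re-checked against
 * the page before it is used, so a stale anon_vma merely yields no work. */
static int walk_anon_vma(struct page *page, struct anon_vma *anon_vma)
{
	struct anon_vma_chain *avc;
	int mapped = 0;

	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
		struct vm_area_struct *vma = avc->vma;
		unsigned long address = vma_address(page, vma);

		if (address == -EFAULT)
			continue;	/* page is not mapped by this vma */

		/* ... per-vma work (try_to_unmap_one() etc.) goes here ... */
		mapped++;
	}

	return mapped;
}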



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
@ 2011-01-23 11:03             ` Peter Zijlstra
  0 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2011-01-23 11:03 UTC (permalink / raw)
  To: paulmck
  Cc: Hugh Dickins, Andrew Morton, Benjamin Herrenschmidt,
	David Miller, Nick Piggin, Martin Schwidefsky, linux-kernel,
	linux-arch, linux-mm, Andrea Arcangeli, Oleg Nesterov

On Sat, 2011-01-22 at 13:06 -0800, Paul E. McKenney wrote:

> OK, so the anon_vma slab cache is SLAB_DESTROY_BY_RCU.  Presumably
> all callers of page_lock_anon_vma() check the identity of the page
> that got locked, since it might be recycled at any time.  But when
> I look at 2.6.37, I only see checks for NULL.  So I am assuming
> that this code is supposed to prevent such recycling.
> 
> I am not sure that I am seeing a consistent snapshot of all of the
> relevant code, in particular, I am guessing that the ->lock and ->mutex
> are the result of changes rather than there really being both a spinlock
> and a mutex in anon_vma. 

Correct, my earlier spinlock -> mutex conversion left it being called
->lock, but Hugh (rightly) pointed out that I should rename it too, so
in the new (as yet unposted) version it's called ->mutex.

>  Mainline currently has a lock, FWIW.  But from
> what I do see, I am concerned about the following sequence of events:
> 
> o	CPU 0 starts executing page_lock_anon_vma() as shown at
> 	https://lkml.org/lkml/2010/11/26/213, fetches the pointer
> 	to anon_vma->root->lock, but does not yet invoke
> 	mutex_trylock().
> 
> o	CPU 1 executes __put_anon_vma() above on the same VMA
> 	that CPU 0 is attempting to use.  It sees that the
> 	anon_vma->root->mutex (presumably AKA ->lock) is not held,
> 	so it calls anon_vma_free().
> 
> o	CPU 2 reallocates the anon_vma freed by CPU 1, so that it
> 	now has a non-zero reference count.
> 
> o	CPU 0 continues execution, incorrectly acquiring a reference
> 	to the now-recycled anon_vma.
> 
> Or am I misunderstanding what this code is trying to do?

No, that is quite right and possible; it's one of the many subtle issues
surrounding the existing page_lock_anon_vma(). We can indeed return a
locked anon_vma that is not in fact related to the page we asked it for.
All calling code SHOULD, and afaict does, deal with that, mostly by
calling things like vma_address(vma, page) for every vma obtained from
the anon_vma, to verify the page is indeed (or not) part of the vma.

The race we guard against with all the fancy stuff is the page itself
getting unmapped and us returning an anon_vma for an unmapped page.

And of course, returning a locked but free'd anon_vma, that too isn't
allowed ;-)



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
  2011-01-21 15:33         ` Peter Zijlstra
@ 2011-01-24 12:21           ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2011-01-24 12:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Benjamin Herrenschmidt, David Miller, Nick Piggin,
	Martin Schwidefsky, linux-kernel, linux-arch, linux-mm,
	Andrea Arcangeli, Oleg Nesterov, Paul E. McKenney

On Fri, 2011-01-21 at 16:33 +0100, Peter Zijlstra wrote:

> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c
> @@ -1559,9 +1559,20 @@ void __put_anon_vma(struct anon_vma *ano
>  	 * Synchronize against page_lock_anon_vma() such that
>  	 * we can safely hold the lock without the anon_vma getting
>  	 * freed.
> +	 *
> +	 * Relies on the full mb implied by the atomic_dec_and_test() from
> +	 * put_anon_vma() against the full mb implied by mutex_trylock() from
> +	 * page_lock_anon_vma(). This orders:
> +	 *
> +	 * page_lock_anon_vma()		VS	put_anon_vma()
> +	 *   mutex_trylock()			  atomic_dec_and_test()
> +	 *   smp_mb()				  smp_mb()
> +	 *   atomic_read()			  mutex_is_locked()

Bah! I thought all mutex_trylock() implementations used an atomic op
with a return value (which implies a full mb), but it looks like (at
least*) PPC doesn't and only provides a LOCK barrier.


* possibly ARM and SH don't either, but I can't read either arch's asm
well enough to tell.
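
What we are relying on is the store-buffering pattern; in C11 terms it
looks like this (userspace sketch, names invented, not kernel code):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int refcount_dropped;	/* ~ "atomic_dec_and_test() happened" */
static atomic_int lock_taken;		/* ~ "mutex_trylock() took the lock"  */
static int put_saw_lock, lock_saw_put;

static void *put_thread(void *arg)
{
	(void)arg;
	atomic_store_explicit(&refcount_dropped, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* the full mb we rely on */
	put_saw_lock = atomic_load_explicit(&lock_taken, memory_order_relaxed);
	return NULL;
}

static void *lock_thread(void *arg)
{
	(void)arg;
	atomic_store_explicit(&lock_taken, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* what mutex_trylock() must supply */
	lock_saw_put = atomic_load_explicit(&refcount_dropped, memory_order_relaxed);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, put_thread, NULL);
	pthread_create(&b, NULL, lock_thread, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	printf("put saw lock=%d, lock saw put=%d\n", put_saw_lock, lock_saw_put);
	return 0;
}

With both fences at seq_cst the 0/0 outcome (neither side sees the
other's store) is forbidden, so at least one side notices the conflict;
weaken either fence to acquire -- which is all a LOCK-barrier-only
trylock gives -- and 0/0 becomes a legal outcome, which is exactly the
hole above.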


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
@ 2011-01-24 12:21           ` Peter Zijlstra
  0 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2011-01-24 12:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Benjamin Herrenschmidt, David Miller, Nick Piggin,
	Martin Schwidefsky, linux-kernel, linux-arch, linux-mm,
	Andrea Arcangeli, Oleg Nesterov, Paul E. McKenney

On Fri, 2011-01-21 at 16:33 +0100, Peter Zijlstra wrote:

> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c
> @@ -1559,9 +1559,20 @@ void __put_anon_vma(struct anon_vma *ano
>  	 * Synchronize against page_lock_anon_vma() such that
>  	 * we can safely hold the lock without the anon_vma getting
>  	 * freed.
> +	 *
> +	 * Relies on the full mb implied by the atomic_dec_and_test() from
> +	 * put_anon_vma() against the full mb implied by mutex_trylock() from
> +	 * page_lock_anon_vma(). This orders:
> +	 *
> +	 * page_lock_anon_vma()		VS	put_anon_vma()
> +	 *   mutex_trylock()			  atomic_dec_and_test()
> +	 *   smp_mb()				  smp_mb()
> +	 *   atomic_read()			  mutex_is_locked()

Bah! I thought all mutex_trylock() implementations used an atomic op
with a return value (which implies a full mb), but it looks like (at
least*) PPC doesn't and only provides a LOCK barrier.


* possibly ARM and SH don't either, but I can't read either arch's asm
well enough to tell.


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
  2011-01-20 19:57       ` Hugh Dickins
@ 2011-01-24 12:45         ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2011-01-24 12:45 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Benjamin Herrenschmidt, David Miller, Nick Piggin,
	Martin Schwidefsky, linux-kernel, linux-arch, linux-mm,
	Andrea Arcangeli

On Thu, 2011-01-20 at 11:57 -0800, Hugh Dickins wrote:
> > Since its now preemptable we might consider simply removing that. I
> > simply wanted to keep the changes to a minimum for now.
> 
> Removing that along with the restart_addr stuff, yes.  I keep wavering.
> Yes it would be nice to get rid of all that, particularly now we find
> it's had holes in all these years.  The only significant loser, I think,
> would be page reclaim (when concurrent with truncation): could spin for a
> long time waiting for the i_mmap_mutex it expects would soon be dropped? 

Right, so unmap vs reclaim is only nasty because we'd hold up reclaim,
keeping it from doing useful work. Then again, keeping it from
reclaiming stuff we're throwing away anyway is good, but holding it up
from doing other work less so.

/me applies axe to see what he gets.

XXX: still needs !generic-tlb arch updates, and probably should cure the
tlb_flush_mmu(.start=0, .end=0) thing.

Tempting diffstat, but like you said, not quite sure if it's fair wrt
reclaim..
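
The mmu_gather side of the patch below is really just a batching idiom;
a userspace model of its shape (not kernel code, names invented, batch
size picked arbitrarily):

#include <stdbool.h>
#include <stdio.h>

#define BATCH_MAX 8

struct gather {
	int   nr;
	void *pages[BATCH_MAX];
};

/* ~__tlb_remove_page(): queue a page, tell the caller when a flush is due */
static bool gather_add(struct gather *g, void *page)
{
	g->pages[g->nr++] = page;
	return g->nr == BATCH_MAX;
}

/* ~tlb_flush_mmu(): would flush TLBs and free the queued pages */
static void gather_flush(struct gather *g)
{
	printf("flushing %d pages\n", g->nr);
	g->nr = 0;
}

/* ~zap_pte_range(): fill the batch with the "pte lock" held, and once it
 * fills up drop the lock (here: fall out of the loop), flush, and resume */
static void zap_range(struct gather *g, void **pages, int n)
{
	int i = 0;
	bool need_flush;
again:
	need_flush = false;
	while (i < n) {
		need_flush = gather_add(g, pages[i++]);
		if (need_flush)
			break;
	}
	if (need_flush) {
		gather_flush(g);
		if (i < n)
			goto again;
	}
}

int main(void)
{
	struct gather g = { 0 };
	void *pages[20] = { 0 };

	zap_range(&g, pages, 20);
	gather_flush(&g);	/* final flush, as tlb_finish_mmu() would do */
	return 0;
}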

---
 include/asm-generic/tlb.h |   18 ++--
 include/linux/fs.h        |    1 
 include/linux/mm.h        |    2 
 include/linux/mm_types.h  |    1 
 kernel/fork.c             |    1 
 mm/memory.c               |  203 ++++++++--------------------------------------
 mm/mmap.c                 |   13 --
 mm/mremap.c               |    3 
 8 files changed, 50 insertions(+), 192 deletions(-)

Index: linux-2.6/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.orig/include/asm-generic/tlb.h
+++ linux-2.6/include/asm-generic/tlb.h
@@ -193,7 +193,7 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
  *	handling the additional races in SMP caused by other CPUs caching valid
  *	mappings in their TLBs.
  */
-static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 {
 	struct mmu_gather_batch *batch;
 
@@ -201,17 +201,25 @@ static inline void tlb_remove_page(struc
 
 	if (tlb_fast_mode(tlb)) {
 		free_page_and_swap_cache(page);
-		return;
+		return 0;
 	}
 
 	batch = tlb->active;
+	batch->pages[batch->nr++] = page;
 	if (batch->nr == batch->max) {
 		if (!tlb_next_batch(tlb))
-			tlb_flush_mmu(tlb, 0, 0);
-		batch = tlb->active;
+			return 1;
 	}
 
-	batch->pages[batch->nr++] = page;
+	return 0;
+}
+
+static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+{
+	WARN_ON_ONCE(in_atomic());
+
+	if (__tlb_remove_page(tlb, page))
+		tlb_flush_mmu(tlb, 0, 0);
 }
 
 /**
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -876,8 +876,6 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
-	struct mutex *i_mmap_mutex;		/* For unmap_mapping_range: */
-	unsigned long truncate_count;		/* Compare vm_truncate_count */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -986,26 +986,24 @@ int copy_page_range(struct mm_struct *ds
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
-				long *zap_work, struct zap_details *details)
+				struct zap_details *details)
 {
 	struct mm_struct *mm = tlb->mm;
-	pte_t *pte;
-	spinlock_t *ptl;
 	int rss[NR_MM_COUNTERS];
+	int need_flush = 0;
+	spinlock_t *ptl;
+	pte_t *pte;
 
 	init_rss_vec(rss);
-
+again:
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
 	do {
 		pte_t ptent = *pte;
 		if (pte_none(ptent)) {
-			(*zap_work)--;
 			continue;
 		}
 
-		(*zap_work) -= PAGE_SIZE;
-
 		if (pte_present(ptent)) {
 			struct page *page;
 
@@ -1051,7 +1049,7 @@ static unsigned long zap_pte_range(struc
 			page_remove_rmap(page);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
-			tlb_remove_page(tlb, page);
+			need_flush = __tlb_remove_page(tlb, page);
 			continue;
 		}
 		/*
@@ -1072,19 +1070,26 @@ static unsigned long zap_pte_range(struc
 				print_bad_pte(vma, addr, ptent, NULL);
 		}
 		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
-	} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));
+	} while (pte++, addr += PAGE_SIZE, (addr != end && !need_flush));
 
 	add_mm_rss_vec(mm, rss);
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 
+	if (need_flush) {
+		need_flush = 0;
+		tlb_flush_mmu(tlb, 0, 0);
+		if (addr != end)
+			goto again;
+	}
+
 	return addr;
 }
 
 static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pud_t *pud,
 				unsigned long addr, unsigned long end,
-				long *zap_work, struct zap_details *details)
+				struct zap_details *details)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -1096,19 +1101,15 @@ static inline unsigned long zap_pmd_rang
 			if (next-addr != HPAGE_PMD_SIZE) {
 				VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
 				split_huge_page_pmd(vma->vm_mm, pmd);
-			} else if (zap_huge_pmd(tlb, vma, pmd)) {
-				(*zap_work)--;
+			} else if (zap_huge_pmd(tlb, vma, pmd))
 				continue;
-			}
 			/* fall through */
 		}
-		if (pmd_none_or_clear_bad(pmd)) {
-			(*zap_work)--;
+		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		}
-		next = zap_pte_range(tlb, vma, pmd, addr, next,
-						zap_work, details);
-	} while (pmd++, addr = next, (addr != end && *zap_work > 0));
+		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
+		cond_resched();
+	} while (pmd++, addr = next, addr != end);
 
 	return addr;
 }
@@ -1116,7 +1117,7 @@ static inline unsigned long zap_pmd_rang
 static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pgd_t *pgd,
 				unsigned long addr, unsigned long end,
-				long *zap_work, struct zap_details *details)
+				struct zap_details *details)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -1124,13 +1125,10 @@ static inline unsigned long zap_pud_rang
 	pud = pud_offset(pgd, addr);
 	do {
 		next = pud_addr_end(addr, end);
-		if (pud_none_or_clear_bad(pud)) {
-			(*zap_work)--;
+		if (pud_none_or_clear_bad(pud))
 			continue;
-		}
-		next = zap_pmd_range(tlb, vma, pud, addr, next,
-						zap_work, details);
-	} while (pud++, addr = next, (addr != end && *zap_work > 0));
+		next = zap_pmd_range(tlb, vma, pud, addr, next, details);
+	} while (pud++, addr = next, addr != end);
 
 	return addr;
 }
@@ -1138,7 +1136,7 @@ static inline unsigned long zap_pud_rang
 static unsigned long unmap_page_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma,
 				unsigned long addr, unsigned long end,
-				long *zap_work, struct zap_details *details)
+				struct zap_details *details)
 {
 	pgd_t *pgd;
 	unsigned long next;
@@ -1152,13 +1150,10 @@ static unsigned long unmap_page_range(st
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
-		if (pgd_none_or_clear_bad(pgd)) {
-			(*zap_work)--;
+		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		}
-		next = zap_pud_range(tlb, vma, pgd, addr, next,
-						zap_work, details);
-	} while (pgd++, addr = next, (addr != end && *zap_work > 0));
+		next = zap_pud_range(tlb, vma, pgd, addr, next, details);
+	} while (pgd++, addr = next, addr != end);
 	tlb_end_vma(tlb, vma);
 	mem_cgroup_uncharge_end();
 
@@ -1203,9 +1198,7 @@ unsigned long unmap_vmas(struct mmu_gath
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *details)
 {
-	long zap_work = ZAP_BLOCK_SIZE;
 	unsigned long start = start_addr;
-	struct mutex *i_mmap_mutex = details ? details->i_mmap_mutex : NULL;
 	struct mm_struct *mm = vma->vm_mm;
 
 	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
@@ -1238,33 +1231,16 @@ unsigned long unmap_vmas(struct mmu_gath
 				 * Since no pte has actually been setup, it is
 				 * safe to do nothing in this case.
 				 */
-				if (vma->vm_file) {
+				if (vma->vm_file)
 					unmap_hugepage_range(vma, start, end, NULL);
-					zap_work -= (end - start) /
-					pages_per_huge_page(hstate_vma(vma));
-				}
 
 				start = end;
 			} else
 				start = unmap_page_range(tlb, vma,
-						start, end, &zap_work, details);
-
-			if (zap_work > 0) {
-				BUG_ON(start != end);
-				break;
-			}
-
-			if (need_resched() ||
-				(i_mmap_mutex && mutex_is_contended(i_mmap_mutex))) {
-				if (i_mmap_mutex)
-					goto out;
-				cond_resched();
-			}
-
-			zap_work = ZAP_BLOCK_SIZE;
+						start, end, details);
 		}
 	}
-out:
+
 	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
 	return start;	/* which is now the end (or restart) address */
 }
@@ -2528,96 +2504,11 @@ static int do_wp_page(struct mm_struct *
 	return ret;
 }
 
-/*
- * Helper functions for unmap_mapping_range().
- *
- * __ Notes on dropping i_mmap_mutex to reduce latency while unmapping __
- *
- * We have to restart searching the prio_tree whenever we drop the lock,
- * since the iterator is only valid while the lock is held, and anyway
- * a later vma might be split and reinserted earlier while lock dropped.
- *
- * The list of nonlinear vmas could be handled more efficiently, using
- * a placeholder, but handle it in the same way until a need is shown.
- * It is important to search the prio_tree before nonlinear list: a vma
- * may become nonlinear and be shifted from prio_tree to nonlinear list
- * while the lock is dropped; but never shifted from list to prio_tree.
- *
- * In order to make forward progress despite restarting the search,
- * vm_truncate_count is used to mark a vma as now dealt with, so we can
- * quickly skip it next time around.  Since the prio_tree search only
- * shows us those vmas affected by unmapping the range in question, we
- * can't efficiently keep all vmas in step with mapping->truncate_count:
- * so instead reset them all whenever it wraps back to 0 (then go to 1).
- * mapping->truncate_count and vma->vm_truncate_count are protected by
- * i_mmap_mutex.
- *
- * In order to make forward progress despite repeatedly restarting some
- * large vma, note the restart_addr from unmap_vmas when it breaks out:
- * and restart from that address when we reach that vma again.  It might
- * have been split or merged, shrunk or extended, but never shifted: so
- * restart_addr remains valid so long as it remains in the vma's range.
- * unmap_mapping_range forces truncate_count to leap over page-aligned
- * values so we can save vma's restart_addr in its truncate_count field.
- */
-#define is_restart_addr(truncate_count) (!((truncate_count) & ~PAGE_MASK))
-
-static void reset_vma_truncate_counts(struct address_space *mapping)
-{
-	struct vm_area_struct *vma;
-	struct prio_tree_iter iter;
-
-	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX)
-		vma->vm_truncate_count = 0;
-	list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
-		vma->vm_truncate_count = 0;
-}
-
-static int unmap_mapping_range_vma(struct vm_area_struct *vma,
+static void unmap_mapping_range_vma(struct vm_area_struct *vma,
 		unsigned long start_addr, unsigned long end_addr,
 		struct zap_details *details)
 {
-	unsigned long restart_addr;
-	int need_break;
-
-	/*
-	 * files that support invalidating or truncating portions of the
-	 * file from under mmaped areas must have their ->fault function
-	 * return a locked page (and set VM_FAULT_LOCKED in the return).
-	 * This provides synchronisation against concurrent unmapping here.
-	 */
-
-again:
-	restart_addr = vma->vm_truncate_count;
-	if (is_restart_addr(restart_addr) && start_addr < restart_addr) {
-		start_addr = restart_addr;
-		if (start_addr >= end_addr) {
-			/* Top of vma has been split off since last time */
-			vma->vm_truncate_count = details->truncate_count;
-			return 0;
-		}
-	}
-
-	restart_addr = zap_page_range(vma, start_addr,
-					end_addr - start_addr, details);
-	need_break = need_resched() || mutex_is_contended(details->i_mmap_mutex);
-
-	if (restart_addr >= end_addr) {
-		/* We have now completed this vma: mark it so */
-		vma->vm_truncate_count = details->truncate_count;
-		if (!need_break)
-			return 0;
-	} else {
-		/* Note restart_addr in vma's truncate_count field */
-		vma->vm_truncate_count = restart_addr;
-		if (!need_break)
-			goto again;
-	}
-
-	mutex_unlock(details->i_mmap_mutex);
-	cond_resched();
-	mutex_lock(details->i_mmap_mutex);
-	return -EINTR;
+	zap_page_range(vma, start_addr, end_addr - start_addr, details);
 }
 
 static inline void unmap_mapping_range_tree(struct prio_tree_root *root,
@@ -2627,12 +2518,8 @@ static inline void unmap_mapping_range_t
 	struct prio_tree_iter iter;
 	pgoff_t vba, vea, zba, zea;
 
-restart:
 	vma_prio_tree_foreach(vma, &iter, root,
 			details->first_index, details->last_index) {
-		/* Skip quickly over those we have already dealt with */
-		if (vma->vm_truncate_count == details->truncate_count)
-			continue;
 
 		vba = vma->vm_pgoff;
 		vea = vba + ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) - 1;
@@ -2644,11 +2531,10 @@ static inline void unmap_mapping_range_t
 		if (zea > vea)
 			zea = vea;
 
-		if (unmap_mapping_range_vma(vma,
+		unmap_mapping_range_vma(vma,
 			((zba - vba) << PAGE_SHIFT) + vma->vm_start,
 			((zea - vba + 1) << PAGE_SHIFT) + vma->vm_start,
-				details) < 0)
-			goto restart;
+				details);
 	}
 }
 
@@ -2663,15 +2549,9 @@ static inline void unmap_mapping_range_l
 	 * across *all* the pages in each nonlinear VMA, not just the pages
 	 * whose virtual address lies outside the file truncation point.
 	 */
-restart:
 	list_for_each_entry(vma, head, shared.vm_set.list) {
-		/* Skip quickly over those we have already dealt with */
-		if (vma->vm_truncate_count == details->truncate_count)
-			continue;
 		details->nonlinear_vma = vma;
-		if (unmap_mapping_range_vma(vma, vma->vm_start,
-					vma->vm_end, details) < 0)
-			goto restart;
+		unmap_mapping_range_vma(vma, vma->vm_start, vma->vm_end, details);
 	}
 }
 
@@ -2710,19 +2590,8 @@ void unmap_mapping_range(struct address_
 	details.last_index = hba + hlen - 1;
 	if (details.last_index < details.first_index)
 		details.last_index = ULONG_MAX;
-	details.i_mmap_mutex = &mapping->i_mmap_mutex;
 
 	mutex_lock(&mapping->i_mmap_mutex);
-
-	/* Protect against endless unmapping loops */
-	mapping->truncate_count++;
-	if (unlikely(is_restart_addr(mapping->truncate_count))) {
-		if (mapping->truncate_count == 0)
-			reset_vma_truncate_counts(mapping);
-		mapping->truncate_count++;
-	}
-	details.truncate_count = mapping->truncate_count;
-
 	if (unlikely(!prio_tree_empty(&mapping->i_mmap)))
 		unmap_mapping_range_tree(&mapping->i_mmap, &details);
 	if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -640,7 +640,6 @@ struct address_space {
 	struct prio_tree_root	i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
 	struct mutex		i_mmap_mutex;	/* protect tree, count, list */
-	unsigned int		truncate_count;	/* Cover race condition with truncate */
 	unsigned long		nrpages;	/* number of total pages */
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h
+++ linux-2.6/include/linux/mm_types.h
@@ -175,7 +175,6 @@ struct vm_area_struct {
 					   units, *not* PAGE_CACHE_SIZE */
 	struct file * vm_file;		/* File we map to (can be NULL). */
 	void * vm_private_data;		/* was vm_pte (shared mem) */
-	unsigned long vm_truncate_count;/* truncate_count or restart_addr */
 
 #ifndef CONFIG_MMU
 	struct vm_region *vm_region;	/* NOMMU mapping region */
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -379,7 +379,6 @@ static int dup_mmap(struct mm_struct *mm
 			mutex_lock(&mapping->i_mmap_mutex);
 			if (tmp->vm_flags & VM_SHARED)
 				mapping->i_mmap_writable++;
-			tmp->vm_truncate_count = mpnt->vm_truncate_count;
 			flush_dcache_mmap_lock(mapping);
 			/* insert tmp into the share list, just after mpnt */
 			vma_prio_tree_add(tmp, mpnt);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -464,10 +464,8 @@ static void vma_link(struct mm_struct *m
 	if (vma->vm_file)
 		mapping = vma->vm_file->f_mapping;
 
-	if (mapping) {
+	if (mapping)
 		mutex_lock(&mapping->i_mmap_mutex);
-		vma->vm_truncate_count = mapping->truncate_count;
-	}
 
 	__vma_link(mm, vma, prev, rb_link, rb_parent);
 	__vma_link_file(vma);
@@ -577,16 +575,7 @@ again:			remove_next = 1 + (end > next->
 		if (!(vma->vm_flags & VM_NONLINEAR))
 			root = &mapping->i_mmap;
 		mutex_lock(&mapping->i_mmap_mutex);
-		if (importer &&
-		    vma->vm_truncate_count != next->vm_truncate_count) {
-			/*
-			 * unmap_mapping_range might be in progress:
-			 * ensure that the expanding vma is rescanned.
-			 */
-			importer->vm_truncate_count = 0;
-		}
 		if (insert) {
-			insert->vm_truncate_count = vma->vm_truncate_count;
 			/*
 			 * Put into prio_tree now, so instantiated pages
 			 * are visible to arm/parisc __flush_dcache_page
Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c
+++ linux-2.6/mm/mremap.c
@@ -94,9 +94,6 @@ static void move_ptes(struct vm_area_str
 		 */
 		mapping = vma->vm_file->f_mapping;
 		mutex_lock(&mapping->i_mmap_mutex);
-		if (new_vma->vm_truncate_count &&
-		    new_vma->vm_truncate_count != vma->vm_truncate_count)
-			new_vma->vm_truncate_count = 0;
 	}
 
 	/*


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
@ 2011-01-24 12:45         ` Peter Zijlstra
  0 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2011-01-24 12:45 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Benjamin Herrenschmidt, David Miller, Nick Piggin,
	Martin Schwidefsky, linux-kernel, linux-arch, linux-mm,
	Andrea Arcangeli

On Thu, 2011-01-20 at 11:57 -0800, Hugh Dickins wrote:
> > Since its now preemptable we might consider simply removing that. I
> > simply wanted to keep the changes to a minimum for now.
> 
> Removing that along with the restart_addr stuff, yes.  I keep wavering.
> Yes it would be nice to get rid of all that, particularly now we find
> it's had holes in all these years.  The only significant loser, I think,
> would be page reclaim (when concurrent with truncation): could spin for a
> long time waiting for the i_mmap_mutex it expects would soon be dropped? 

Right, so unmap vs reclaim is only nasty because we'd hold up reclaim,
keeping it from doing useful work. Then again, keeping it from
reclaiming stuff we're throwing away anyway is good, but holding it up
from doing other work less so.

/me applies axe to see what he gets.

XXX: still needs !generic-tlb arch updates, and probably should cure the
tlb_flush_mmu(.start=0, .end=0) thing.

Tempting diffstat, but like you said, not quite sure if it's fair wrt
reclaim..

---
 include/asm-generic/tlb.h |   18 ++--
 include/linux/fs.h        |    1 
 include/linux/mm.h        |    2 
 include/linux/mm_types.h  |    1 
 kernel/fork.c             |    1 
 mm/memory.c               |  203 ++++++++--------------------------------------
 mm/mmap.c                 |   13 --
 mm/mremap.c               |    3 
 8 files changed, 50 insertions(+), 192 deletions(-)

Index: linux-2.6/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.orig/include/asm-generic/tlb.h
+++ linux-2.6/include/asm-generic/tlb.h
@@ -193,7 +193,7 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
  *	handling the additional races in SMP caused by other CPUs caching valid
  *	mappings in their TLBs.
  */
-static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 {
 	struct mmu_gather_batch *batch;
 
@@ -201,17 +201,25 @@ static inline void tlb_remove_page(struc
 
 	if (tlb_fast_mode(tlb)) {
 		free_page_and_swap_cache(page);
-		return;
+		return 0;
 	}
 
 	batch = tlb->active;
+	batch->pages[batch->nr++] = page;
 	if (batch->nr == batch->max) {
 		if (!tlb_next_batch(tlb))
-			tlb_flush_mmu(tlb, 0, 0);
-		batch = tlb->active;
+			return 1;
 	}
 
-	batch->pages[batch->nr++] = page;
+	return 0;
+}
+
+static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+{
+	WARN_ON_ONCE(in_atomic());
+
+	if (__tlb_remove_page(tlb, page))
+		tlb_flush_mmu(tlb, 0, 0);
 }
 
 /**
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -876,8 +876,6 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
-	struct mutex *i_mmap_mutex;		/* For unmap_mapping_range: */
-	unsigned long truncate_count;		/* Compare vm_truncate_count */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -986,26 +986,24 @@ int copy_page_range(struct mm_struct *ds
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
-				long *zap_work, struct zap_details *details)
+				struct zap_details *details)
 {
 	struct mm_struct *mm = tlb->mm;
-	pte_t *pte;
-	spinlock_t *ptl;
 	int rss[NR_MM_COUNTERS];
+	int need_flush = 0;
+	spinlock_t *ptl;
+	pte_t *pte;
 
 	init_rss_vec(rss);
-
+again:
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
 	do {
 		pte_t ptent = *pte;
 		if (pte_none(ptent)) {
-			(*zap_work)--;
 			continue;
 		}
 
-		(*zap_work) -= PAGE_SIZE;
-
 		if (pte_present(ptent)) {
 			struct page *page;
 
@@ -1051,7 +1049,7 @@ static unsigned long zap_pte_range(struc
 			page_remove_rmap(page);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
-			tlb_remove_page(tlb, page);
+			need_flush = __tlb_remove_page(tlb, page);
 			continue;
 		}
 		/*
@@ -1072,19 +1070,26 @@ static unsigned long zap_pte_range(struc
 				print_bad_pte(vma, addr, ptent, NULL);
 		}
 		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
-	} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));
+	} while (pte++, addr += PAGE_SIZE, (addr != end && !need_flush));
 
 	add_mm_rss_vec(mm, rss);
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 
+	if (need_flush) {
+		need_flush = 0;
+		tlb_flush_mmu(tlb, 0, 0);
+		if (addr != end)
+			goto again;
+	}
+
 	return addr;
 }
 
 static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pud_t *pud,
 				unsigned long addr, unsigned long end,
-				long *zap_work, struct zap_details *details)
+				struct zap_details *details)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -1096,19 +1101,15 @@ static inline unsigned long zap_pmd_rang
 			if (next-addr != HPAGE_PMD_SIZE) {
 				VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
 				split_huge_page_pmd(vma->vm_mm, pmd);
-			} else if (zap_huge_pmd(tlb, vma, pmd)) {
-				(*zap_work)--;
+			} else if (zap_huge_pmd(tlb, vma, pmd))
 				continue;
-			}
 			/* fall through */
 		}
-		if (pmd_none_or_clear_bad(pmd)) {
-			(*zap_work)--;
+		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		}
-		next = zap_pte_range(tlb, vma, pmd, addr, next,
-						zap_work, details);
-	} while (pmd++, addr = next, (addr != end && *zap_work > 0));
+		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
+		cond_resched();
+	} while (pmd++, addr = next, addr != end);
 
 	return addr;
 }
@@ -1116,7 +1117,7 @@ static inline unsigned long zap_pmd_rang
 static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pgd_t *pgd,
 				unsigned long addr, unsigned long end,
-				long *zap_work, struct zap_details *details)
+				struct zap_details *details)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -1124,13 +1125,10 @@ static inline unsigned long zap_pud_rang
 	pud = pud_offset(pgd, addr);
 	do {
 		next = pud_addr_end(addr, end);
-		if (pud_none_or_clear_bad(pud)) {
-			(*zap_work)--;
+		if (pud_none_or_clear_bad(pud))
 			continue;
-		}
-		next = zap_pmd_range(tlb, vma, pud, addr, next,
-						zap_work, details);
-	} while (pud++, addr = next, (addr != end && *zap_work > 0));
+		next = zap_pmd_range(tlb, vma, pud, addr, next, details);
+	} while (pud++, addr = next, addr != end);
 
 	return addr;
 }
@@ -1138,7 +1136,7 @@ static inline unsigned long zap_pud_rang
 static unsigned long unmap_page_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma,
 				unsigned long addr, unsigned long end,
-				long *zap_work, struct zap_details *details)
+				struct zap_details *details)
 {
 	pgd_t *pgd;
 	unsigned long next;
@@ -1152,13 +1150,10 @@ static unsigned long unmap_page_range(st
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
-		if (pgd_none_or_clear_bad(pgd)) {
-			(*zap_work)--;
+		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		}
-		next = zap_pud_range(tlb, vma, pgd, addr, next,
-						zap_work, details);
-	} while (pgd++, addr = next, (addr != end && *zap_work > 0));
+		next = zap_pud_range(tlb, vma, pgd, addr, next, details);
+	} while (pgd++, addr = next, addr != end);
 	tlb_end_vma(tlb, vma);
 	mem_cgroup_uncharge_end();
 
@@ -1203,9 +1198,7 @@ unsigned long unmap_vmas(struct mmu_gath
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *details)
 {
-	long zap_work = ZAP_BLOCK_SIZE;
 	unsigned long start = start_addr;
-	struct mutex *i_mmap_mutex = details ? details->i_mmap_mutex : NULL;
 	struct mm_struct *mm = vma->vm_mm;
 
 	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
@@ -1238,33 +1231,16 @@ unsigned long unmap_vmas(struct mmu_gath
 				 * Since no pte has actually been setup, it is
 				 * safe to do nothing in this case.
 				 */
-				if (vma->vm_file) {
+				if (vma->vm_file)
 					unmap_hugepage_range(vma, start, end, NULL);
-					zap_work -= (end - start) /
-					pages_per_huge_page(hstate_vma(vma));
-				}
 
 				start = end;
 			} else
 				start = unmap_page_range(tlb, vma,
-						start, end, &zap_work, details);
-
-			if (zap_work > 0) {
-				BUG_ON(start != end);
-				break;
-			}
-
-			if (need_resched() ||
-				(i_mmap_mutex && mutex_is_contended(i_mmap_mutex))) {
-				if (i_mmap_mutex)
-					goto out;
-				cond_resched();
-			}
-
-			zap_work = ZAP_BLOCK_SIZE;
+						start, end, details);
 		}
 	}
-out:
+
 	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
 	return start;	/* which is now the end (or restart) address */
 }
@@ -2528,96 +2504,11 @@ static int do_wp_page(struct mm_struct *
 	return ret;
 }
 
-/*
- * Helper functions for unmap_mapping_range().
- *
- * __ Notes on dropping i_mmap_mutex to reduce latency while unmapping __
- *
- * We have to restart searching the prio_tree whenever we drop the lock,
- * since the iterator is only valid while the lock is held, and anyway
- * a later vma might be split and reinserted earlier while lock dropped.
- *
- * The list of nonlinear vmas could be handled more efficiently, using
- * a placeholder, but handle it in the same way until a need is shown.
- * It is important to search the prio_tree before nonlinear list: a vma
- * may become nonlinear and be shifted from prio_tree to nonlinear list
- * while the lock is dropped; but never shifted from list to prio_tree.
- *
- * In order to make forward progress despite restarting the search,
- * vm_truncate_count is used to mark a vma as now dealt with, so we can
- * quickly skip it next time around.  Since the prio_tree search only
- * shows us those vmas affected by unmapping the range in question, we
- * can't efficiently keep all vmas in step with mapping->truncate_count:
- * so instead reset them all whenever it wraps back to 0 (then go to 1).
- * mapping->truncate_count and vma->vm_truncate_count are protected by
- * i_mmap_mutex.
- *
- * In order to make forward progress despite repeatedly restarting some
- * large vma, note the restart_addr from unmap_vmas when it breaks out:
- * and restart from that address when we reach that vma again.  It might
- * have been split or merged, shrunk or extended, but never shifted: so
- * restart_addr remains valid so long as it remains in the vma's range.
- * unmap_mapping_range forces truncate_count to leap over page-aligned
- * values so we can save vma's restart_addr in its truncate_count field.
- */
-#define is_restart_addr(truncate_count) (!((truncate_count) & ~PAGE_MASK))
-
-static void reset_vma_truncate_counts(struct address_space *mapping)
-{
-	struct vm_area_struct *vma;
-	struct prio_tree_iter iter;
-
-	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX)
-		vma->vm_truncate_count = 0;
-	list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
-		vma->vm_truncate_count = 0;
-}
-
-static int unmap_mapping_range_vma(struct vm_area_struct *vma,
+static void unmap_mapping_range_vma(struct vm_area_struct *vma,
 		unsigned long start_addr, unsigned long end_addr,
 		struct zap_details *details)
 {
-	unsigned long restart_addr;
-	int need_break;
-
-	/*
-	 * files that support invalidating or truncating portions of the
-	 * file from under mmaped areas must have their ->fault function
-	 * return a locked page (and set VM_FAULT_LOCKED in the return).
-	 * This provides synchronisation against concurrent unmapping here.
-	 */
-
-again:
-	restart_addr = vma->vm_truncate_count;
-	if (is_restart_addr(restart_addr) && start_addr < restart_addr) {
-		start_addr = restart_addr;
-		if (start_addr >= end_addr) {
-			/* Top of vma has been split off since last time */
-			vma->vm_truncate_count = details->truncate_count;
-			return 0;
-		}
-	}
-
-	restart_addr = zap_page_range(vma, start_addr,
-					end_addr - start_addr, details);
-	need_break = need_resched() || mutex_is_contended(details->i_mmap_mutex);
-
-	if (restart_addr >= end_addr) {
-		/* We have now completed this vma: mark it so */
-		vma->vm_truncate_count = details->truncate_count;
-		if (!need_break)
-			return 0;
-	} else {
-		/* Note restart_addr in vma's truncate_count field */
-		vma->vm_truncate_count = restart_addr;
-		if (!need_break)
-			goto again;
-	}
-
-	mutex_unlock(details->i_mmap_mutex);
-	cond_resched();
-	mutex_lock(details->i_mmap_mutex);
-	return -EINTR;
+	zap_page_range(vma, start_addr, end_addr - start_addr, details);
 }
 
 static inline void unmap_mapping_range_tree(struct prio_tree_root *root,
@@ -2627,12 +2518,8 @@ static inline void unmap_mapping_range_t
 	struct prio_tree_iter iter;
 	pgoff_t vba, vea, zba, zea;
 
-restart:
 	vma_prio_tree_foreach(vma, &iter, root,
 			details->first_index, details->last_index) {
-		/* Skip quickly over those we have already dealt with */
-		if (vma->vm_truncate_count == details->truncate_count)
-			continue;
 
 		vba = vma->vm_pgoff;
 		vea = vba + ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) - 1;
@@ -2644,11 +2531,10 @@ static inline void unmap_mapping_range_t
 		if (zea > vea)
 			zea = vea;
 
-		if (unmap_mapping_range_vma(vma,
+		unmap_mapping_range_vma(vma,
 			((zba - vba) << PAGE_SHIFT) + vma->vm_start,
 			((zea - vba + 1) << PAGE_SHIFT) + vma->vm_start,
-				details) < 0)
-			goto restart;
+				details);
 	}
 }
 
@@ -2663,15 +2549,9 @@ static inline void unmap_mapping_range_l
 	 * across *all* the pages in each nonlinear VMA, not just the pages
 	 * whose virtual address lies outside the file truncation point.
 	 */
-restart:
 	list_for_each_entry(vma, head, shared.vm_set.list) {
-		/* Skip quickly over those we have already dealt with */
-		if (vma->vm_truncate_count == details->truncate_count)
-			continue;
 		details->nonlinear_vma = vma;
-		if (unmap_mapping_range_vma(vma, vma->vm_start,
-					vma->vm_end, details) < 0)
-			goto restart;
+		unmap_mapping_range_vma(vma, vma->vm_start, vma->vm_end, details);
 	}
 }
 
@@ -2710,19 +2590,8 @@ void unmap_mapping_range(struct address_
 	details.last_index = hba + hlen - 1;
 	if (details.last_index < details.first_index)
 		details.last_index = ULONG_MAX;
-	details.i_mmap_mutex = &mapping->i_mmap_mutex;
 
 	mutex_lock(&mapping->i_mmap_mutex);
-
-	/* Protect against endless unmapping loops */
-	mapping->truncate_count++;
-	if (unlikely(is_restart_addr(mapping->truncate_count))) {
-		if (mapping->truncate_count == 0)
-			reset_vma_truncate_counts(mapping);
-		mapping->truncate_count++;
-	}
-	details.truncate_count = mapping->truncate_count;
-
 	if (unlikely(!prio_tree_empty(&mapping->i_mmap)))
 		unmap_mapping_range_tree(&mapping->i_mmap, &details);
 	if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -640,7 +640,6 @@ struct address_space {
 	struct prio_tree_root	i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
 	struct mutex		i_mmap_mutex;	/* protect tree, count, list */
-	unsigned int		truncate_count;	/* Cover race condition with truncate */
 	unsigned long		nrpages;	/* number of total pages */
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h
+++ linux-2.6/include/linux/mm_types.h
@@ -175,7 +175,6 @@ struct vm_area_struct {
 					   units, *not* PAGE_CACHE_SIZE */
 	struct file * vm_file;		/* File we map to (can be NULL). */
 	void * vm_private_data;		/* was vm_pte (shared mem) */
-	unsigned long vm_truncate_count;/* truncate_count or restart_addr */
 
 #ifndef CONFIG_MMU
 	struct vm_region *vm_region;	/* NOMMU mapping region */
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -379,7 +379,6 @@ static int dup_mmap(struct mm_struct *mm
 			mutex_lock(&mapping->i_mmap_mutex);
 			if (tmp->vm_flags & VM_SHARED)
 				mapping->i_mmap_writable++;
-			tmp->vm_truncate_count = mpnt->vm_truncate_count;
 			flush_dcache_mmap_lock(mapping);
 			/* insert tmp into the share list, just after mpnt */
 			vma_prio_tree_add(tmp, mpnt);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -464,10 +464,8 @@ static void vma_link(struct mm_struct *m
 	if (vma->vm_file)
 		mapping = vma->vm_file->f_mapping;
 
-	if (mapping) {
+	if (mapping)
 		mutex_lock(&mapping->i_mmap_mutex);
-		vma->vm_truncate_count = mapping->truncate_count;
-	}
 
 	__vma_link(mm, vma, prev, rb_link, rb_parent);
 	__vma_link_file(vma);
@@ -577,16 +575,7 @@ again:			remove_next = 1 + (end > next->
 		if (!(vma->vm_flags & VM_NONLINEAR))
 			root = &mapping->i_mmap;
 		mutex_lock(&mapping->i_mmap_mutex);
-		if (importer &&
-		    vma->vm_truncate_count != next->vm_truncate_count) {
-			/*
-			 * unmap_mapping_range might be in progress:
-			 * ensure that the expanding vma is rescanned.
-			 */
-			importer->vm_truncate_count = 0;
-		}
 		if (insert) {
-			insert->vm_truncate_count = vma->vm_truncate_count;
 			/*
 			 * Put into prio_tree now, so instantiated pages
 			 * are visible to arm/parisc __flush_dcache_page
Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c
+++ linux-2.6/mm/mremap.c
@@ -94,9 +94,6 @@ static void move_ptes(struct vm_area_str
 		 */
 		mapping = vma->vm_file->f_mapping;
 		mutex_lock(&mapping->i_mmap_mutex);
-		if (new_vma->vm_truncate_count &&
-		    new_vma->vm_truncate_count != vma->vm_truncate_count)
-			new_vma->vm_truncate_count = 0;
 	}
 
 	/*


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
  2011-01-24 12:45         ` Peter Zijlstra
@ 2011-01-24 14:24           ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2011-01-24 14:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Benjamin Herrenschmidt, David Miller, Nick Piggin,
	Martin Schwidefsky, linux-kernel, linux-arch, linux-mm,
	Andrea Arcangeli

On Mon, 2011-01-24 at 13:45 +0100, Peter Zijlstra wrote:
> The only significant loser, I think,
> > would be page reclaim (when concurrent with truncation): could spin for a
> > long time waiting for the i_mmap_mutex it expects would soon be dropped? 

Well, it won't spin (much) but will mostly go to sleep if it really takes
very long; but then, could it really take much longer than, say, a
lock_page() when reclaim hits a page under IO?



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
@ 2011-01-24 14:24           ` Peter Zijlstra
  0 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2011-01-24 14:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Benjamin Herrenschmidt, David Miller, Nick Piggin,
	Martin Schwidefsky, linux-kernel, linux-arch, linux-mm,
	Andrea Arcangeli

On Mon, 2011-01-24 at 13:45 +0100, Peter Zijlstra wrote:
> The only significant loser, I think,
> > would be page reclaim (when concurrent with truncation): could spin for a
> > long time waiting for the i_mmap_mutex it expects would soon be dropped? 

Well, it won't spin (much) but will mostly go to sleep if it really takes
very long; but then, could it really take much longer than, say, a
lock_page() when reclaim hits a page under IO?



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
  2011-01-24 12:21           ` Peter Zijlstra
@ 2011-01-24 14:34             ` Oleg Nesterov
  -1 siblings, 0 replies; 157+ messages in thread
From: Oleg Nesterov @ 2011-01-24 14:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Andrew Morton, Benjamin Herrenschmidt,
	David Miller, Nick Piggin, Martin Schwidefsky, linux-kernel,
	linux-arch, linux-mm, Andrea Arcangeli, Paul E. McKenney

On 01/24, Peter Zijlstra wrote:
>
> On Fri, 2011-01-21 at 16:33 +0100, Peter Zijlstra wrote:
>
> > Index: linux-2.6/mm/rmap.c
> > ===================================================================
> > --- linux-2.6.orig/mm/rmap.c
> > +++ linux-2.6/mm/rmap.c
> > @@ -1559,9 +1559,20 @@ void __put_anon_vma(struct anon_vma *ano
> >  	 * Synchronize against page_lock_anon_vma() such that
> >  	 * we can safely hold the lock without the anon_vma getting
> >  	 * freed.
> > +	 *
> > +	 * Relies on the full mb implied by the atomic_dec_and_test() from
> > +	 * put_anon_vma() against the full mb implied by mutex_trylock() from
> > +	 * page_lock_anon_vma(). This orders:
> > +	 *
> > +	 * page_lock_anon_vma()		VS	put_anon_vma()
> > +	 *   mutex_trylock()			  atomic_dec_and_test()
> > +	 *   smp_mb()				  smp_mb()
> > +	 *   atomic_read()			  mutex_is_locked()
>
> Bah!, I thought all mutex_trylock() implementations used an atomic op
> with return value (which implies a mb), but it looks like (at least*)
> PPC doesn't and only provides a LOCK barrier.

But, mutex_trylock() must imply the one-way barrier, otherwise it
is buggy, no?

IOW, page_lock_anon_vma() does:

	if (mutex_trylock(anon_vma->root->lock)) {
		...
		atomic_read(&anon_vma->refcount);
		...
	}

If this atomic_read() can leak out of the critical section, then
I think mutex_trylock() should be fixed. Or I misunderstood the
problem completely...

BTW, from https://lkml.org/lkml/2010/11/26/213

	+ * Similar to page_get_anon_vma() except it locks the anon_vma.
	...
	-	struct anon_vma *anon_vma = page_get_anon_vma(page);

looks like page_get_anon_vma() becomes unused.

Oleg.


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
  2011-01-24 14:34             ` Oleg Nesterov
@ 2011-01-24 15:00               ` Peter Zijlstra
  -1 siblings, 0 replies; 157+ messages in thread
From: Peter Zijlstra @ 2011-01-24 15:00 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Hugh Dickins, Andrew Morton, Benjamin Herrenschmidt,
	David Miller, Nick Piggin, Martin Schwidefsky, linux-kernel,
	linux-arch, linux-mm, Andrea Arcangeli, Paul E. McKenney

On Mon, 2011-01-24 at 15:34 +0100, Oleg Nesterov wrote:
> On 01/24, Peter Zijlstra wrote:
> >
> > On Fri, 2011-01-21 at 16:33 +0100, Peter Zijlstra wrote:
> >
> > > Index: linux-2.6/mm/rmap.c
> > > ===================================================================
> > > --- linux-2.6.orig/mm/rmap.c
> > > +++ linux-2.6/mm/rmap.c
> > > @@ -1559,9 +1559,20 @@ void __put_anon_vma(struct anon_vma *ano
> > >  	 * Synchronize against page_lock_anon_vma() such that
> > >  	 * we can safely hold the lock without the anon_vma getting
> > >  	 * freed.
> > > +	 *
> > > +	 * Relies on the full mb implied by the atomic_dec_and_test() from
> > > +	 * put_anon_vma() against the full mb implied by mutex_trylock() from
> > > +	 * page_lock_anon_vma(). This orders:
> > > +	 *
> > > +	 * page_lock_anon_vma()		VS	put_anon_vma()
> > > +	 *   mutex_trylock()			  atomic_dec_and_test()
> > > +	 *   smp_mb()				  smp_mb()
> > > +	 *   atomic_read()			  mutex_is_locked()
> >
> > Bah!, I thought all mutex_trylock() implementations used an atomic op
> > with return value (which implies a mb), but it looks like (at least*)
> > PPC doesn't and only provides a LOCK barrier.
> 
> But, mutex_trylock() must imply the one-way barrier, otherwise it
> is buggy, no?

> If this atomic_read() can leak out of the critical section, then
> I think mutex_trylock() should be fixed. Or I misunderstood the
> problem completely...

It implies the LOCK barrier, the one-way permeable thing, not a full mb.

But I'm not sure the LOCK is sufficient to make the above scenario work.
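
To make the ordering at stake concrete, here is a minimal userspace C11 sketch
of the two sides, with every atomic made seq_cst so both sides carry the "full
mb" the rmap comment documents. All names (lock_word, refcount, lock_side,
put_side) are illustrative stand-ins, not the kernel code; the open question
above is whether mutex_trylock()'s LOCK-only ordering can stand in for the
locking side's full barrier.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Illustrative stand-ins: lock_word models the owner word of
 * anon_vma->root->lock, refcount models anon_vma->refcount. */
static atomic_int lock_word;        /* 0 = unlocked, 1 = locked */
static atomic_int refcount = 1;     /* one reference still outstanding */

/* page_lock_anon_vma() side: trylock, then check the refcount. */
static void *lock_side(void *arg)
{
	int expected = 0;

	(void)arg;
	if (atomic_compare_exchange_strong(&lock_word, &expected, 1)) {
		/*
		 * seq_cst gives store->load ordering here: this refcount
		 * load cannot be satisfied before the lock-word store is
		 * visible to the other side.  An acquire-only trylock does
		 * not, by itself, give that ordering -- the concern raised
		 * in this thread.
		 */
		if (atomic_load(&refcount) == 0)
			puts("lock side: refcount hit zero, back off");
		atomic_store(&lock_word, 0);
	}
	return NULL;
}

/* put_anon_vma() side: drop the last reference, then peek at the lock. */
static void *put_side(void *arg)
{
	(void)arg;
	if (atomic_fetch_sub(&refcount, 1) == 1) {
		/* full mb implied by atomic_dec_and_test() in the kernel */
		if (atomic_load(&lock_word) == 0)
			puts("put side: lock free, safe to free the object");
	}
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, lock_side, NULL);
	pthread_create(&b, NULL, put_side, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}

With everything seq_cst, the harmful interleaving -- the put side concluding
the lock is free while the lock side is still inside the critical section
trusting a non-zero refcount -- cannot occur.  Weaken the trylock to
acquire-only and that guarantee may no longer hold, which is the concern
raised above.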

> BTW, from https://lkml.org/lkml/2010/11/26/213
> 
> 	+ * Similar to page_get_anon_vma() except it locks the anon_vma.
> 	...
> 	-	struct anon_vma *anon_vma = page_get_anon_vma(page);
> 
> looks like, page_get_anon_vma() becomes unused.

In my current tree there are two uses left, in mm/migrate.c.


git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-mmu_preempt.git

HEAD: 7e23542fae47ed3b811583cb925837e3aa7a3f0f

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
  2011-01-24 15:00               ` Peter Zijlstra
@ 2011-01-24 15:33                 ` Oleg Nesterov
  -1 siblings, 0 replies; 157+ messages in thread
From: Oleg Nesterov @ 2011-01-24 15:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Andrew Morton, Benjamin Herrenschmidt,
	David Miller, Nick Piggin, Martin Schwidefsky, linux-kernel,
	linux-arch, linux-mm, Andrea Arcangeli, Paul E. McKenney

On 01/24, Peter Zijlstra wrote:
>
> On Mon, 2011-01-24 at 15:34 +0100, Oleg Nesterov wrote:
> > On 01/24, Peter Zijlstra wrote:
> > >
> > > On Fri, 2011-01-21 at 16:33 +0100, Peter Zijlstra wrote:
> > >
> > > > Index: linux-2.6/mm/rmap.c
> > > > ===================================================================
> > > > --- linux-2.6.orig/mm/rmap.c
> > > > +++ linux-2.6/mm/rmap.c
> > > > @@ -1559,9 +1559,20 @@ void __put_anon_vma(struct anon_vma *ano
> > > >  	 * Synchronize against page_lock_anon_vma() such that
> > > >  	 * we can safely hold the lock without the anon_vma getting
> > > >  	 * freed.
> > > > +	 *
> > > > +	 * Relies on the full mb implied by the atomic_dec_and_test() from
> > > > +	 * put_anon_vma() against the full mb implied by mutex_trylock() from
> > > > +	 * page_lock_anon_vma(). This orders:
> > > > +	 *
> > > > +	 * page_lock_anon_vma()		VS	put_anon_vma()
> > > > +	 *   mutex_trylock()			  atomic_dec_and_test()
> > > > +	 *   smp_mb()				  smp_mb()
> > > > +	 *   atomic_read()			  mutex_is_locked()
> > >
> > > Bah!, I thought all mutex_trylock() implementations used an atomic op
> > > with return value (which implies a mb), but it looks like (at least*)
> > > PPC doesn't and only provides a LOCK barrier.
> >
> > But, mutex_trylock() must imply the one-way barrier, otherwise it
> > is buggy, no?
>
> > If this atomic_read() can leak out of the critical section, then
> > I think mutex_trylock() should be fixed. Or I misunderstood the
> > problem completely...
>
> It implies the LOCK barrier, the one way permeable thing, not a full mb.
>
> But I'm not sure the LOCK is sufficient to make the above scenario work.

OK, I can't say I am sure either.



Well, thinking about it more, I can't see why we can trust mutex_trylock()
at all. Suppose that page_lock_anon_vma() races with put_anon_vma()
and anon_vma->root != anon_vma. In this case page_lock_anon_vma() can
take anon_vma->root->lock, but if this anon_vma was already freed then it
can be reused, and anon_vma_alloc() can change ->root and set ->refcount == 1
before we check atomic_read(refcount) != 0.

IOW,
	page_lock_anon_vma:
							put_anon_vma() drops the
							last reference

	if (mutex_trylock(&anon_vma->root->lock)) {
							anon_vma_alloc() picks this
							memory, changes ->root and
							sets ->refcount == 1

		if (atomic_read(&anon_vma->refcount))
			return anon_vma;
	}

Most probably I missed something; I have forgotten everything I knew about
this code...

Oleg.


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
  2011-01-19 17:10     ` Peter Zijlstra
@ 2011-01-31 10:02       ` Martin Schwidefsky
  -1 siblings, 0 replies; 157+ messages in thread
From: Martin Schwidefsky @ 2011-01-31 10:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Andrew Morton, Benjamin Herrenschmidt,
	David Miller, Nick Piggin, linux-kernel, linux-arch, linux-mm,
	Andrea Arcangeli

On Wed, 19 Jan 2011 18:10:39 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Mon, 2011-01-17 at 23:12 -0800, Hugh Dickins wrote:
> 
> > 11/21 s390-preemptible_mmu_gather.patch
> >       I'd prefer __tlb_alloc_page(), with __GFP_NOWARN as suggested above.
> >       mm/pgtable.c still has DEFINE_PER_CPU(struct mmu_gather, mmu_gathers).
> 
> Martin, while doing the below DEFINE_PER_CPU removal I saw you had a
> bunch of RCU table removal thingies in arch/s390/mm/pgtable.c, could
> s390 use the generic bits like sparc and powerpc (see patch 16)?

Maybe, I will have a look at it.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
  2011-01-19 17:10     ` Peter Zijlstra
@ 2011-02-15 14:00       ` Martin Schwidefsky
  -1 siblings, 0 replies; 157+ messages in thread
From: Martin Schwidefsky @ 2011-02-15 14:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Andrew Morton, Benjamin Herrenschmidt,
	David Miller, Nick Piggin, linux-kernel, linux-arch, linux-mm,
	Andrea Arcangeli

On Wed, 19 Jan 2011 18:10:39 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Martin, while doing the below DEFINE_PER_CPU removal I saw you had a
> bunch of RCU table removal thingies in arch/s390/mm/pgtable.c, could
> s390 use the generic bits like sparc and powerpc (see patch 16)?

That should do it. 229 deletions vs. 74 insertions, not bad. And the
tlb flushing code actually got simpler. Even better :-)
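
One trick worth spelling out before the diff: pte_free_tlb() below queues the
pointer with its low bit set so that __tlb_remove_table() can later tell a
page table from a crst table. A standalone illustration of that low-bit
tagging idea (plain userspace C with hypothetical names, not the kernel code;
the patch itself uses +1/-1 pointer arithmetic to the same effect):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Tag bit 0 of an (at least 2-byte aligned) pointer to record which of
 * two free paths the object should take once it is safe to free it. */
static void *tag_page_table(void *p)
{
	assert(((uintptr_t)p & 1) == 0);	/* pointer must be aligned */
	return (void *)((uintptr_t)p | 1);
}

static void free_tagged(void *tagged)
{
	if ((uintptr_t)tagged & 1) {
		void *table = (void *)((uintptr_t)tagged & ~(uintptr_t)1);
		printf("page table at %p -> page_table_free()-style path\n",
		       table);
		free(table);
	} else {
		printf("crst table at %p -> crst_table_free()-style path\n",
		       tagged);
		free(tagged);
	}
}

int main(void)
{
	void *pgtable = malloc(64);	/* stands in for a page table */
	void *crst = malloc(64);	/* stands in for a crst table */

	free_tagged(tag_page_table(pgtable));
	free_tagged(crst);
	return 0;
}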

--
Subject: [PATCH] s390: use generic RCU page-table freeing

From: Martin Schwidefsky <schwidefsky@de.ibm.com>

Now that we have a generic implementation for RCU-based page-table
freeing, use it for s390 as well. It saves a couple of lines.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/Kconfig               |    1 
 arch/s390/include/asm/pgalloc.h |   19 +--
 arch/s390/include/asm/tlb.h     |   91 ++++++++----------
 arch/s390/mm/pgtable.c          |  192 +++++-----------------------------------
 4 files changed, 74 insertions(+), 229 deletions(-)

--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -87,6 +87,7 @@ config S390
 	select HAVE_KERNEL_LZO
 	select HAVE_GET_USER_PAGES_FAST
 	select HAVE_ARCH_MUTEX_CPU_RELAX
+	select HAVE_RCU_TABLE_FREE
 	select ARCH_INLINE_SPIN_TRYLOCK
 	select ARCH_INLINE_SPIN_TRYLOCK_BH
 	select ARCH_INLINE_SPIN_LOCK
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -20,12 +20,11 @@
 #define check_pgt_cache()	do {} while (0)
 
 unsigned long *crst_table_alloc(struct mm_struct *, int);
-void crst_table_free(struct mm_struct *, unsigned long *);
-void crst_table_free_rcu(struct mm_struct *, unsigned long *);
+void crst_table_free(unsigned long *);
 
 unsigned long *page_table_alloc(struct mm_struct *);
-void page_table_free(struct mm_struct *, unsigned long *);
-void page_table_free_rcu(struct mm_struct *, unsigned long *);
+void page_table_free(unsigned long *);
+
 void disable_noexec(struct mm_struct *, struct task_struct *);
 
 static inline void clear_table(unsigned long *s, unsigned long val, size_t n)
@@ -95,7 +94,7 @@ static inline pud_t *pud_alloc_one(struc
 		crst_table_init(table, _REGION3_ENTRY_EMPTY);
 	return (pud_t *) table;
 }
-#define pud_free(mm, pud) crst_table_free(mm, (unsigned long *) pud)
+#define pud_free(mm, pud) crst_table_free((unsigned long *) pud)
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long vmaddr)
 {
@@ -104,7 +103,7 @@ static inline pmd_t *pmd_alloc_one(struc
 		crst_table_init(table, _SEGMENT_ENTRY_EMPTY);
 	return (pmd_t *) table;
 }
-#define pmd_free(mm, pmd) crst_table_free(mm, (unsigned long *) pmd)
+#define pmd_free(mm, pmd) crst_table_free((unsigned long *) pmd)
 
 static inline void pgd_populate_kernel(struct mm_struct *mm,
 				       pgd_t *pgd, pud_t *pud)
@@ -148,7 +147,7 @@ static inline pgd_t *pgd_alloc(struct mm
 	return (pgd_t *)
 		crst_table_alloc(mm, user_mode == SECONDARY_SPACE_MODE);
 }
-#define pgd_free(mm, pgd) crst_table_free(mm, (unsigned long *) pgd)
+#define pgd_free(mm, pgd) crst_table_free((unsigned long *) pgd)
 
 static inline void pmd_populate_kernel(struct mm_struct *mm,
 				       pmd_t *pmd, pte_t *pte)
@@ -175,9 +174,7 @@ static inline void pmd_populate(struct m
 #define pte_alloc_one_kernel(mm, vmaddr) ((pte_t *) page_table_alloc(mm))
 #define pte_alloc_one(mm, vmaddr) ((pte_t *) page_table_alloc(mm))
 
-#define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
-#define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
-
-extern void rcu_table_freelist_finish(void);
+#define pte_free_kernel(mm, pte) page_table_free((unsigned long *) pte)
+#define pte_free(mm, pte) page_table_free((unsigned long *) pte)
 
 #endif /* _S390_PGALLOC_H */
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -29,50 +29,42 @@
 #include <asm/smp.h>
 #include <asm/tlbflush.h>
 
+struct mmu_table_batch {
+	struct rcu_head rcu;
+	unsigned int nr;
+	void *tables[0];
+};
+
+#define MAX_TABLE_BATCH \
+	((PAGE_SIZE - sizeof(struct mmu_table_batch)) / sizeof(void *))
+
+void tlb_table_flush(struct mmu_gather *);
+void tlb_remove_table(struct mmu_gather *, void *);
+
 struct mmu_gather {
 	struct mm_struct *mm;
 	unsigned int fullmm;
-	unsigned int nr_ptes;
-	unsigned int nr_pxds;
-	unsigned int max;
-	void **array;
-	void *local[8];
+	struct mmu_table_batch *batch;
+	/* need_flush is used only for page tables */
+	unsigned int need_flush : 1;
 };
 
-static inline void __tlb_alloc_page(struct mmu_gather *tlb)
-{
-	unsigned long addr = __get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
-
-	if (addr) {
-		tlb->array = (void *) addr;
-		tlb->max = PAGE_SIZE / sizeof(void *);
-	}
-}
-
 static inline void tlb_gather_mmu(struct mmu_gather *tlb,
 				  struct mm_struct *mm,
 				  unsigned int full_mm_flush)
 {
 	tlb->mm = mm;
-	tlb->max = ARRAY_SIZE(tlb->local);
-	tlb->array = tlb->local;
 	tlb->fullmm = full_mm_flush;
 	if (tlb->fullmm)
 		__tlb_flush_mm(mm);
-	else
-		__tlb_alloc_page(tlb);
-	tlb->nr_ptes = 0;
-	tlb->nr_pxds = tlb->max;
+	tlb->batch = NULL;
+	tlb->need_flush = 0;
 }
 
 static inline void tlb_flush_mmu(struct mmu_gather *tlb)
 {
-	if (!tlb->fullmm && (tlb->nr_ptes > 0 || tlb->nr_pxds < tlb->max))
-		__tlb_flush_mm(tlb->mm);
-	while (tlb->nr_ptes > 0)
-		page_table_free_rcu(tlb->mm, tlb->array[--tlb->nr_ptes]);
-	while (tlb->nr_pxds < tlb->max)
-		crst_table_free_rcu(tlb->mm, tlb->array[tlb->nr_pxds++]);
+	if (tlb->need_flush)
+		tlb_table_flush(tlb);
 }
 
 static inline void tlb_finish_mmu(struct mmu_gather *tlb,
@@ -80,13 +72,8 @@ static inline void tlb_finish_mmu(struct
 {
 	tlb_flush_mmu(tlb);
 
-	rcu_table_freelist_finish();
-
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
-
-	if (tlb->array != tlb->local)
-		free_pages((unsigned long) tlb->array, 0);
 }
 
 /*
@@ -113,12 +100,11 @@ static inline void tlb_remove_page(struc
 static inline void pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
 				unsigned long address)
 {
-	if (!tlb->fullmm) {
-		tlb->array[tlb->nr_ptes++] = pte;
-		if (tlb->nr_ptes >= tlb->nr_pxds)
-			tlb_flush_mmu(tlb);
-	} else
-		page_table_free(tlb->mm, (unsigned long *) pte);
+	if (!tlb->fullmm)
+		/* Use LSB to distinguish crst table vs. page table */
+		tlb_remove_table(tlb, (void *) pte + 1);
+	else
+		page_table_free((unsigned long *) pte);
 }
 
 /*
@@ -134,12 +120,10 @@ static inline void pmd_free_tlb(struct m
 #ifdef __s390x__
 	if (tlb->mm->context.asce_limit <= (1UL << 31))
 		return;
-	if (!tlb->fullmm) {
-		tlb->array[--tlb->nr_pxds] = pmd;
-		if (tlb->nr_ptes >= tlb->nr_pxds)
-			tlb_flush_mmu(tlb);
-	} else
-		crst_table_free(tlb->mm, (unsigned long *) pmd);
+	if (!tlb->fullmm)
+		tlb_remove_table(tlb, pmd);
+	else
+		crst_table_free((unsigned long *) pmd);
 #endif
 }
 
@@ -156,15 +140,22 @@ static inline void pud_free_tlb(struct m
 #ifdef __s390x__
 	if (tlb->mm->context.asce_limit <= (1UL << 42))
 		return;
-	if (!tlb->fullmm) {
-		tlb->array[--tlb->nr_pxds] = pud;
-		if (tlb->nr_ptes >= tlb->nr_pxds)
-			tlb_flush_mmu(tlb);
-	} else
-		crst_table_free(tlb->mm, (unsigned long *) pud);
+	if (!tlb->fullmm)
+		tlb_remove_table(tlb, pud);
+	else
+		crst_table_free((unsigned long *) pud);
 #endif
 }
 
+static inline void __tlb_remove_table(void *table)
+{
+	/* Use LSB to distinguish crst table vs. page table */
+	if ((unsigned long) table & 1)
+		page_table_free(table - 1);
+	else
+		crst_table_free(table);
+}
+
 #define tlb_start_vma(tlb, vma)			do { } while (0)
 #define tlb_end_vma(tlb, vma)			do { } while (0)
 #define tlb_remove_tlb_entry(tlb, ptep, addr)	do { } while (0)
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -24,92 +24,17 @@
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
 
-struct rcu_table_freelist {
-	struct rcu_head rcu;
-	struct mm_struct *mm;
-	unsigned int pgt_index;
-	unsigned int crst_index;
-	unsigned long *table[0];
-};
-
-#define RCU_FREELIST_SIZE \
-	((PAGE_SIZE - sizeof(struct rcu_table_freelist)) \
-	  / sizeof(unsigned long))
-
-static DEFINE_PER_CPU(struct rcu_table_freelist *, rcu_table_freelist);
-
-static void __page_table_free(struct mm_struct *mm, unsigned long *table);
-static void __crst_table_free(struct mm_struct *mm, unsigned long *table);
-
-static struct rcu_table_freelist *rcu_table_freelist_get(struct mm_struct *mm)
-{
-	struct rcu_table_freelist **batchp = &__get_cpu_var(rcu_table_freelist);
-	struct rcu_table_freelist *batch = *batchp;
-
-	if (batch)
-		return batch;
-	batch = (struct rcu_table_freelist *) __get_free_page(GFP_ATOMIC);
-	if (batch) {
-		batch->mm = mm;
-		batch->pgt_index = 0;
-		batch->crst_index = RCU_FREELIST_SIZE;
-		*batchp = batch;
-	}
-	return batch;
-}
-
-static void rcu_table_freelist_callback(struct rcu_head *head)
-{
-	struct rcu_table_freelist *batch =
-		container_of(head, struct rcu_table_freelist, rcu);
-
-	while (batch->pgt_index > 0)
-		__page_table_free(batch->mm, batch->table[--batch->pgt_index]);
-	while (batch->crst_index < RCU_FREELIST_SIZE)
-		__crst_table_free(batch->mm, batch->table[batch->crst_index++]);
-	free_page((unsigned long) batch);
-}
-
-void rcu_table_freelist_finish(void)
-{
-	struct rcu_table_freelist *batch = __get_cpu_var(rcu_table_freelist);
-
-	if (!batch)
-		return;
-	call_rcu(&batch->rcu, rcu_table_freelist_callback);
-	__get_cpu_var(rcu_table_freelist) = NULL;
-}
-
-static void smp_sync(void *arg)
-{
-}
 
 #ifndef CONFIG_64BIT
 #define ALLOC_ORDER	1
 #define TABLES_PER_PAGE	4
 #define FRAG_MASK	15UL
 #define SECOND_HALVES	10UL
-
-void clear_table_pgstes(unsigned long *table)
-{
-	clear_table(table, _PAGE_TYPE_EMPTY, PAGE_SIZE/4);
-	memset(table + 256, 0, PAGE_SIZE/4);
-	clear_table(table + 512, _PAGE_TYPE_EMPTY, PAGE_SIZE/4);
-	memset(table + 768, 0, PAGE_SIZE/4);
-}
-
 #else
 #define ALLOC_ORDER	2
 #define TABLES_PER_PAGE	2
 #define FRAG_MASK	3UL
 #define SECOND_HALVES	2UL
-
-void clear_table_pgstes(unsigned long *table)
-{
-	clear_table(table, _PAGE_TYPE_EMPTY, PAGE_SIZE/2);
-	memset(table + 256, 0, PAGE_SIZE/2);
-}
-
 #endif
 
 unsigned long VMALLOC_START = VMALLOC_END - VMALLOC_SIZE;
@@ -138,6 +63,7 @@ unsigned long *crst_table_alloc(struct m
 			return NULL;
 		}
 		page->index = page_to_phys(shadow);
+		page->private = (unsigned long) mm;
 	}
 	spin_lock_bh(&mm->context.list_lock);
 	list_add(&page->lru, &mm->context.crst_list);
@@ -145,47 +71,19 @@ unsigned long *crst_table_alloc(struct m
 	return (unsigned long *) page_to_phys(page);
 }
 
-static void __crst_table_free(struct mm_struct *mm, unsigned long *table)
-{
-	unsigned long *shadow = get_shadow_table(table);
-
-	if (shadow)
-		free_pages((unsigned long) shadow, ALLOC_ORDER);
-	free_pages((unsigned long) table, ALLOC_ORDER);
-}
-
-void crst_table_free(struct mm_struct *mm, unsigned long *table)
-{
-	struct page *page = virt_to_page(table);
-
-	spin_lock_bh(&mm->context.list_lock);
-	list_del(&page->lru);
-	spin_unlock_bh(&mm->context.list_lock);
-	__crst_table_free(mm, table);
-}
-
-void crst_table_free_rcu(struct mm_struct *mm, unsigned long *table)
+void crst_table_free(unsigned long *table)
 {
-	struct rcu_table_freelist *batch;
 	struct page *page = virt_to_page(table);
+	struct mm_struct *mm = (struct mm_struct *) page->private;
+	unsigned long *shadow = get_shadow_table(table);
 
 	spin_lock_bh(&mm->context.list_lock);
 	list_del(&page->lru);
+	page->private = 0;
 	spin_unlock_bh(&mm->context.list_lock);
-	if (atomic_read(&mm->mm_users) < 2 &&
-	    cpumask_equal(mm_cpumask(mm), cpumask_of(smp_processor_id()))) {
-		__crst_table_free(mm, table);
-		return;
-	}
-	batch = rcu_table_freelist_get(mm);
-	if (!batch) {
-		smp_call_function(smp_sync, NULL, 1);
-		__crst_table_free(mm, table);
-		return;
-	}
-	batch->table[--batch->crst_index] = table;
-	if (batch->pgt_index >= batch->crst_index)
-		rcu_table_freelist_finish();
+	if (shadow)
+		free_pages((unsigned long) shadow, ALLOC_ORDER);
+	free_pages((unsigned long) table, ALLOC_ORDER);
 }
 
 #ifdef CONFIG_64BIT
@@ -223,7 +121,7 @@ repeat:
 	}
 	spin_unlock_bh(&mm->page_table_lock);
 	if (table)
-		crst_table_free(mm, table);
+		crst_table_free(table);
 	if (mm->context.asce_limit < limit)
 		goto repeat;
 	update_mm(mm, current);
@@ -257,7 +155,7 @@ void crst_table_downgrade(struct mm_stru
 		}
 		mm->pgd = (pgd_t *) (pgd_val(*pgd) & _REGION_ENTRY_ORIGIN);
 		mm->task_size = mm->context.asce_limit;
-		crst_table_free(mm, (unsigned long *) pgd);
+		crst_table_free((unsigned long *) pgd);
 	}
 	update_mm(mm, current);
 }
@@ -288,11 +186,7 @@ unsigned long *page_table_alloc(struct m
 			return NULL;
 		pgtable_page_ctor(page);
 		page->flags &= ~FRAG_MASK;
-		table = (unsigned long *) page_to_phys(page);
-		if (mm->context.has_pgste)
-			clear_table_pgstes(table);
-		else
-			clear_table(table, _PAGE_TYPE_EMPTY, PAGE_SIZE);
+		page->private = (unsigned long) mm;
 		spin_lock_bh(&mm->context.list_lock);
 		list_add(&page->lru, &mm->context.pgtable_list);
 	}
@@ -305,42 +199,34 @@ unsigned long *page_table_alloc(struct m
 	if ((page->flags & FRAG_MASK) == ((1UL << TABLES_PER_PAGE) - 1))
 		list_move_tail(&page->lru, &mm->context.pgtable_list);
 	spin_unlock_bh(&mm->context.list_lock);
+	clear_table(table, _PAGE_TYPE_EMPTY, PTRS_PER_PTE * sizeof(long));
+	if (mm->context.noexec)
+		clear_table(table + 256, _PAGE_TYPE_EMPTY,
+			    PTRS_PER_PTE * sizeof(long));
+	else if (mm->context.has_pgste)
+		clear_table(table + 256, 0, PTRS_PER_PTE * sizeof(long));
 	return table;
 }
 
-static void __page_table_free(struct mm_struct *mm, unsigned long *table)
+void page_table_free(unsigned long *table)
 {
-	struct page *page;
-	unsigned long bits;
-
-	bits = ((unsigned long) table) & 15;
-	table = (unsigned long *)(((unsigned long) table) ^ bits);
-	page = pfn_to_page(__pa(table) >> PAGE_SHIFT);
-	page->flags ^= bits;
-	if (!(page->flags & FRAG_MASK)) {
-		pgtable_page_dtor(page);
-		__free_page(page);
-	}
-}
-
-void page_table_free(struct mm_struct *mm, unsigned long *table)
-{
-	struct page *page;
+	struct page *page = virt_to_page(table);
+	struct mm_struct *mm = (struct mm_struct *) page->private;
 	unsigned long bits;
 
 	bits = (mm->context.noexec || mm->context.has_pgste) ? 3UL : 1UL;
 	bits <<= (__pa(table) & (PAGE_SIZE - 1)) / 256 / sizeof(unsigned long);
-	page = pfn_to_page(__pa(table) >> PAGE_SHIFT);
 	spin_lock_bh(&mm->context.list_lock);
 	page->flags ^= bits;
 	if (page->flags & FRAG_MASK) {
 		/* Page now has some free pgtable fragments. */
-		if (!list_empty(&page->lru))
-			list_move(&page->lru, &mm->context.pgtable_list);
+		list_move(&page->lru, &mm->context.pgtable_list);
 		page = NULL;
-	} else
+	} else {
 		/* All fragments of the 4K page have been freed. */
 		list_del(&page->lru);
+		page->private = 0;
+	}
 	spin_unlock_bh(&mm->context.list_lock);
 	if (page) {
 		pgtable_page_dtor(page);
@@ -348,36 +234,6 @@ void page_table_free(struct mm_struct *m
 	}
 }
 
-void page_table_free_rcu(struct mm_struct *mm, unsigned long *table)
-{
-	struct rcu_table_freelist *batch;
-	struct page *page;
-	unsigned long bits;
-
-	if (atomic_read(&mm->mm_users) < 2 &&
-	    cpumask_equal(mm_cpumask(mm), cpumask_of(smp_processor_id()))) {
-		page_table_free(mm, table);
-		return;
-	}
-	batch = rcu_table_freelist_get(mm);
-	if (!batch) {
-		smp_call_function(smp_sync, NULL, 1);
-		page_table_free(mm, table);
-		return;
-	}
-	bits = (mm->context.noexec || mm->context.has_pgste) ? 3UL : 1UL;
-	bits <<= (__pa(table) & (PAGE_SIZE - 1)) / 256 / sizeof(unsigned long);
-	page = pfn_to_page(__pa(table) >> PAGE_SHIFT);
-	spin_lock_bh(&mm->context.list_lock);
-	/* Delayed freeing with rcu prevents reuse of pgtable fragments */
-	list_del_init(&page->lru);
-	spin_unlock_bh(&mm->context.list_lock);
-	table = (unsigned long *)(((unsigned long) table) | bits);
-	batch->table[batch->pgt_index++] = table;
-	if (batch->pgt_index >= batch->crst_index)
-		rcu_table_freelist_finish();
-}
-
 void disable_noexec(struct mm_struct *mm, struct task_struct *tsk)
 {
 	struct page *page;


-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH 00/21] mm: Preemptibility -v6
  2011-02-15 14:00       ` Martin Schwidefsky
@ 2011-02-15 15:39         ` Martin Schwidefsky
  -1 siblings, 0 replies; 157+ messages in thread
From: Martin Schwidefsky @ 2011-02-15 15:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Andrew Morton, Benjamin Herrenschmidt,
	David Miller, Nick Piggin, linux-kernel, linux-arch, linux-mm,
	Andrea Arcangeli

On Tue, 15 Feb 2011 15:00:17 +0100
Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:

> On Wed, 19 Jan 2011 18:10:39 +0100
> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > Martin, while doing the below DEFINE_PER_CPU removal I saw you had a
> > bunch of RCU table removal thingies in arch/s390/mm/pgtable.c, could
> > s390 use the generic bits like sparc and powerpc (see patch 16)?
> 
> That should do it. 229 deletions vs. 74 insertions, not bad. And the
> tlb flushing code actually got simpler. Even better :-)

Darn, I forgot "quilt refresh". The last patch I sent was the old,
broken one. This one is better...

--
Subject: [PATCH] s390: use generic RCU page-table freeing

From: Martin Schwidefsky <schwidefsky@de.ibm.com>

Now that we have a generic implementation for RCU-based page-table
freeing, use it for s390 as well. It saves a couple of lines.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/Kconfig               |    1 
 arch/s390/include/asm/pgalloc.h |   19 +--
 arch/s390/include/asm/tlb.h     |   91 ++++++++----------
 arch/s390/mm/pgtable.c          |  192 +++++-----------------------------------
 4 files changed, 74 insertions(+), 229 deletions(-)

--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -87,6 +87,7 @@ config S390
 	select HAVE_KERNEL_LZO
 	select HAVE_GET_USER_PAGES_FAST
 	select HAVE_ARCH_MUTEX_CPU_RELAX
+	select HAVE_RCU_TABLE_FREE
 	select ARCH_INLINE_SPIN_TRYLOCK
 	select ARCH_INLINE_SPIN_TRYLOCK_BH
 	select ARCH_INLINE_SPIN_LOCK
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -20,12 +20,11 @@
 #define check_pgt_cache()	do {} while (0)
 
 unsigned long *crst_table_alloc(struct mm_struct *, int);
-void crst_table_free(struct mm_struct *, unsigned long *);
-void crst_table_free_rcu(struct mm_struct *, unsigned long *);
+void crst_table_free(unsigned long *);
 
 unsigned long *page_table_alloc(struct mm_struct *);
-void page_table_free(struct mm_struct *, unsigned long *);
-void page_table_free_rcu(struct mm_struct *, unsigned long *);
+void page_table_free(unsigned long *);
+
 void disable_noexec(struct mm_struct *, struct task_struct *);
 
 static inline void clear_table(unsigned long *s, unsigned long val, size_t n)
@@ -95,7 +94,7 @@ static inline pud_t *pud_alloc_one(struc
 		crst_table_init(table, _REGION3_ENTRY_EMPTY);
 	return (pud_t *) table;
 }
-#define pud_free(mm, pud) crst_table_free(mm, (unsigned long *) pud)
+#define pud_free(mm, pud) crst_table_free((unsigned long *) pud)
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long vmaddr)
 {
@@ -104,7 +103,7 @@ static inline pmd_t *pmd_alloc_one(struc
 		crst_table_init(table, _SEGMENT_ENTRY_EMPTY);
 	return (pmd_t *) table;
 }
-#define pmd_free(mm, pmd) crst_table_free(mm, (unsigned long *) pmd)
+#define pmd_free(mm, pmd) crst_table_free((unsigned long *) pmd)
 
 static inline void pgd_populate_kernel(struct mm_struct *mm,
 				       pgd_t *pgd, pud_t *pud)
@@ -148,7 +147,7 @@ static inline pgd_t *pgd_alloc(struct mm
 	return (pgd_t *)
 		crst_table_alloc(mm, user_mode == SECONDARY_SPACE_MODE);
 }
-#define pgd_free(mm, pgd) crst_table_free(mm, (unsigned long *) pgd)
+#define pgd_free(mm, pgd) crst_table_free((unsigned long *) pgd)
 
 static inline void pmd_populate_kernel(struct mm_struct *mm,
 				       pmd_t *pmd, pte_t *pte)
@@ -175,9 +174,7 @@ static inline void pmd_populate(struct m
 #define pte_alloc_one_kernel(mm, vmaddr) ((pte_t *) page_table_alloc(mm))
 #define pte_alloc_one(mm, vmaddr) ((pte_t *) page_table_alloc(mm))
 
-#define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
-#define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
-
-extern void rcu_table_freelist_finish(void);
+#define pte_free_kernel(mm, pte) page_table_free((unsigned long *) pte)
+#define pte_free(mm, pte) page_table_free((unsigned long *) pte)
 
 #endif /* _S390_PGALLOC_H */
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -29,50 +29,42 @@
 #include <asm/smp.h>
 #include <asm/tlbflush.h>
 
+struct mmu_table_batch {
+	struct rcu_head rcu;
+	unsigned int nr;
+	void *tables[0];
+};
+
+#define MAX_TABLE_BATCH \
+	((PAGE_SIZE - sizeof(struct mmu_table_batch)) / sizeof(void *))
+
+void tlb_table_flush(struct mmu_gather *);
+void tlb_remove_table(struct mmu_gather *, void *);
+
 struct mmu_gather {
 	struct mm_struct *mm;
 	unsigned int fullmm;
-	unsigned int nr_ptes;
-	unsigned int nr_pxds;
-	unsigned int max;
-	void **array;
-	void *local[8];
+	struct mmu_table_batch *batch;
+	/* need_flush is used only for page tables */
+	unsigned int need_flush : 1;
 };
 
-static inline void __tlb_alloc_page(struct mmu_gather *tlb)
-{
-	unsigned long addr = __get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
-
-	if (addr) {
-		tlb->array = (void *) addr;
-		tlb->max = PAGE_SIZE / sizeof(void *);
-	}
-}
-
 static inline void tlb_gather_mmu(struct mmu_gather *tlb,
 				  struct mm_struct *mm,
 				  unsigned int full_mm_flush)
 {
 	tlb->mm = mm;
-	tlb->max = ARRAY_SIZE(tlb->local);
-	tlb->array = tlb->local;
 	tlb->fullmm = full_mm_flush;
 	if (tlb->fullmm)
 		__tlb_flush_mm(mm);
-	else
-		__tlb_alloc_page(tlb);
-	tlb->nr_ptes = 0;
-	tlb->nr_pxds = tlb->max;
+	tlb->batch = NULL;
+	tlb->need_flush = 0;
 }
 
 static inline void tlb_flush_mmu(struct mmu_gather *tlb)
 {
-	if (!tlb->fullmm && (tlb->nr_ptes > 0 || tlb->nr_pxds < tlb->max))
-		__tlb_flush_mm(tlb->mm);
-	while (tlb->nr_ptes > 0)
-		page_table_free_rcu(tlb->mm, tlb->array[--tlb->nr_ptes]);
-	while (tlb->nr_pxds < tlb->max)
-		crst_table_free_rcu(tlb->mm, tlb->array[tlb->nr_pxds++]);
+	if (tlb->need_flush)
+		tlb_table_flush(tlb);
 }
 
 static inline void tlb_finish_mmu(struct mmu_gather *tlb,
@@ -80,13 +72,8 @@ static inline void tlb_finish_mmu(struct
 {
 	tlb_flush_mmu(tlb);
 
-	rcu_table_freelist_finish();
-
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
-
-	if (tlb->array != tlb->local)
-		free_pages((unsigned long) tlb->array, 0);
 }
 
 /*
@@ -113,12 +100,11 @@ static inline void tlb_remove_page(struc
 static inline void pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
 				unsigned long address)
 {
-	if (!tlb->fullmm) {
-		tlb->array[tlb->nr_ptes++] = pte;
-		if (tlb->nr_ptes >= tlb->nr_pxds)
-			tlb_flush_mmu(tlb);
-	} else
-		page_table_free(tlb->mm, (unsigned long *) pte);
+	if (!tlb->fullmm)
+		/* Use LSB to distinguish crst table vs. page table */
+		tlb_remove_table(tlb, (void *) pte + 1);
+	else
+		page_table_free((unsigned long *) pte);
 }
 
 /*
@@ -134,12 +120,10 @@ static inline void pmd_free_tlb(struct m
 #ifdef __s390x__
 	if (tlb->mm->context.asce_limit <= (1UL << 31))
 		return;
-	if (!tlb->fullmm) {
-		tlb->array[--tlb->nr_pxds] = pmd;
-		if (tlb->nr_ptes >= tlb->nr_pxds)
-			tlb_flush_mmu(tlb);
-	} else
-		crst_table_free(tlb->mm, (unsigned long *) pmd);
+	if (!tlb->fullmm)
+		tlb_remove_table(tlb, pmd);
+	else
+		crst_table_free((unsigned long *) pmd);
 #endif
 }
 
@@ -156,15 +140,22 @@ static inline void pud_free_tlb(struct m
 #ifdef __s390x__
 	if (tlb->mm->context.asce_limit <= (1UL << 42))
 		return;
-	if (!tlb->fullmm) {
-		tlb->array[--tlb->nr_pxds] = pud;
-		if (tlb->nr_ptes >= tlb->nr_pxds)
-			tlb_flush_mmu(tlb);
-	} else
-		crst_table_free(tlb->mm, (unsigned long *) pud);
+	if (!tlb->fullmm)
+		tlb_remove_table(tlb, pud);
+	else
+		crst_table_free((unsigned long *) pud);
 #endif
 }
 
+static inline void __tlb_remove_table(void *table)
+{
+	/* Use LSB to distinguish crst table vs. page table */
+	if ((unsigned long) table & 1)
+		page_table_free(table - 1);
+	else
+		crst_table_free(table);
+}
+
 #define tlb_start_vma(tlb, vma)			do { } while (0)
 #define tlb_end_vma(tlb, vma)			do { } while (0)
 #define tlb_remove_tlb_entry(tlb, ptep, addr)	do { } while (0)
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -24,92 +24,17 @@
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
 
-struct rcu_table_freelist {
-	struct rcu_head rcu;
-	struct mm_struct *mm;
-	unsigned int pgt_index;
-	unsigned int crst_index;
-	unsigned long *table[0];
-};
-
-#define RCU_FREELIST_SIZE \
-	((PAGE_SIZE - sizeof(struct rcu_table_freelist)) \
-	  / sizeof(unsigned long))
-
-static DEFINE_PER_CPU(struct rcu_table_freelist *, rcu_table_freelist);
-
-static void __page_table_free(struct mm_struct *mm, unsigned long *table);
-static void __crst_table_free(struct mm_struct *mm, unsigned long *table);
-
-static struct rcu_table_freelist *rcu_table_freelist_get(struct mm_struct *mm)
-{
-	struct rcu_table_freelist **batchp = &__get_cpu_var(rcu_table_freelist);
-	struct rcu_table_freelist *batch = *batchp;
-
-	if (batch)
-		return batch;
-	batch = (struct rcu_table_freelist *) __get_free_page(GFP_ATOMIC);
-	if (batch) {
-		batch->mm = mm;
-		batch->pgt_index = 0;
-		batch->crst_index = RCU_FREELIST_SIZE;
-		*batchp = batch;
-	}
-	return batch;
-}
-
-static void rcu_table_freelist_callback(struct rcu_head *head)
-{
-	struct rcu_table_freelist *batch =
-		container_of(head, struct rcu_table_freelist, rcu);
-
-	while (batch->pgt_index > 0)
-		__page_table_free(batch->mm, batch->table[--batch->pgt_index]);
-	while (batch->crst_index < RCU_FREELIST_SIZE)
-		__crst_table_free(batch->mm, batch->table[batch->crst_index++]);
-	free_page((unsigned long) batch);
-}
-
-void rcu_table_freelist_finish(void)
-{
-	struct rcu_table_freelist *batch = __get_cpu_var(rcu_table_freelist);
-
-	if (!batch)
-		return;
-	call_rcu(&batch->rcu, rcu_table_freelist_callback);
-	__get_cpu_var(rcu_table_freelist) = NULL;
-}
-
-static void smp_sync(void *arg)
-{
-}
 
 #ifndef CONFIG_64BIT
 #define ALLOC_ORDER	1
 #define TABLES_PER_PAGE	4
 #define FRAG_MASK	15UL
 #define SECOND_HALVES	10UL
-
-void clear_table_pgstes(unsigned long *table)
-{
-	clear_table(table, _PAGE_TYPE_EMPTY, PAGE_SIZE/4);
-	memset(table + 256, 0, PAGE_SIZE/4);
-	clear_table(table + 512, _PAGE_TYPE_EMPTY, PAGE_SIZE/4);
-	memset(table + 768, 0, PAGE_SIZE/4);
-}
-
 #else
 #define ALLOC_ORDER	2
 #define TABLES_PER_PAGE	2
 #define FRAG_MASK	3UL
 #define SECOND_HALVES	2UL
-
-void clear_table_pgstes(unsigned long *table)
-{
-	clear_table(table, _PAGE_TYPE_EMPTY, PAGE_SIZE/2);
-	memset(table + 256, 0, PAGE_SIZE/2);
-}
-
 #endif
 
 unsigned long VMALLOC_START = VMALLOC_END - VMALLOC_SIZE;
@@ -138,6 +63,7 @@ unsigned long *crst_table_alloc(struct m
 			return NULL;
 		}
 		page->index = page_to_phys(shadow);
+		page->private = (unsigned long) mm;
 	}
 	spin_lock_bh(&mm->context.list_lock);
 	list_add(&page->lru, &mm->context.crst_list);
@@ -145,47 +71,19 @@ unsigned long *crst_table_alloc(struct m
 	return (unsigned long *) page_to_phys(page);
 }
 
-static void __crst_table_free(struct mm_struct *mm, unsigned long *table)
-{
-	unsigned long *shadow = get_shadow_table(table);
-
-	if (shadow)
-		free_pages((unsigned long) shadow, ALLOC_ORDER);
-	free_pages((unsigned long) table, ALLOC_ORDER);
-}
-
-void crst_table_free(struct mm_struct *mm, unsigned long *table)
-{
-	struct page *page = virt_to_page(table);
-
-	spin_lock_bh(&mm->context.list_lock);
-	list_del(&page->lru);
-	spin_unlock_bh(&mm->context.list_lock);
-	__crst_table_free(mm, table);
-}
-
-void crst_table_free_rcu(struct mm_struct *mm, unsigned long *table)
+void crst_table_free(unsigned long *table)
 {
-	struct rcu_table_freelist *batch;
 	struct page *page = virt_to_page(table);
+	struct mm_struct *mm = (struct mm_struct *) page->private;
+	unsigned long *shadow = get_shadow_table(table);
 
 	spin_lock_bh(&mm->context.list_lock);
 	list_del(&page->lru);
+	page->private = 0;
 	spin_unlock_bh(&mm->context.list_lock);
-	if (atomic_read(&mm->mm_users) < 2 &&
-	    cpumask_equal(mm_cpumask(mm), cpumask_of(smp_processor_id()))) {
-		__crst_table_free(mm, table);
-		return;
-	}
-	batch = rcu_table_freelist_get(mm);
-	if (!batch) {
-		smp_call_function(smp_sync, NULL, 1);
-		__crst_table_free(mm, table);
-		return;
-	}
-	batch->table[--batch->crst_index] = table;
-	if (batch->pgt_index >= batch->crst_index)
-		rcu_table_freelist_finish();
+	if (shadow)
+		free_pages((unsigned long) shadow, ALLOC_ORDER);
+	free_pages((unsigned long) table, ALLOC_ORDER);
 }
 
 #ifdef CONFIG_64BIT
@@ -223,7 +121,7 @@ repeat:
 	}
 	spin_unlock_bh(&mm->page_table_lock);
 	if (table)
-		crst_table_free(mm, table);
+		crst_table_free(table);
 	if (mm->context.asce_limit < limit)
 		goto repeat;
 	update_mm(mm, current);
@@ -257,7 +155,7 @@ void crst_table_downgrade(struct mm_stru
 		}
 		mm->pgd = (pgd_t *) (pgd_val(*pgd) & _REGION_ENTRY_ORIGIN);
 		mm->task_size = mm->context.asce_limit;
-		crst_table_free(mm, (unsigned long *) pgd);
+		crst_table_free((unsigned long *) pgd);
 	}
 	update_mm(mm, current);
 }
@@ -288,11 +186,7 @@ unsigned long *page_table_alloc(struct m
 			return NULL;
 		pgtable_page_ctor(page);
 		page->flags &= ~FRAG_MASK;
-		table = (unsigned long *) page_to_phys(page);
-		if (mm->context.has_pgste)
-			clear_table_pgstes(table);
-		else
-			clear_table(table, _PAGE_TYPE_EMPTY, PAGE_SIZE);
+		page->private = (unsigned long) mm;
 		spin_lock_bh(&mm->context.list_lock);
 		list_add(&page->lru, &mm->context.pgtable_list);
 	}
@@ -305,42 +199,34 @@ unsigned long *page_table_alloc(struct m
 	if ((page->flags & FRAG_MASK) == ((1UL << TABLES_PER_PAGE) - 1))
 		list_move_tail(&page->lru, &mm->context.pgtable_list);
 	spin_unlock_bh(&mm->context.list_lock);
+	clear_table(table, _PAGE_TYPE_EMPTY, PTRS_PER_PTE * sizeof(long));
+	if (mm->context.noexec)
+		clear_table(table + 256, _PAGE_TYPE_EMPTY,
+			    PTRS_PER_PTE * sizeof(long));
+	else if (mm->context.has_pgste)
+		clear_table(table + 256, 0, PTRS_PER_PTE * sizeof(long));
 	return table;
 }
 
-static void __page_table_free(struct mm_struct *mm, unsigned long *table)
+void page_table_free(unsigned long *table)
 {
-	struct page *page;
-	unsigned long bits;
-
-	bits = ((unsigned long) table) & 15;
-	table = (unsigned long *)(((unsigned long) table) ^ bits);
-	page = pfn_to_page(__pa(table) >> PAGE_SHIFT);
-	page->flags ^= bits;
-	if (!(page->flags & FRAG_MASK)) {
-		pgtable_page_dtor(page);
-		__free_page(page);
-	}
-}
-
-void page_table_free(struct mm_struct *mm, unsigned long *table)
-{
-	struct page *page;
+	struct page *page = virt_to_page(table);
+	struct mm_struct *mm = (struct mm_struct *) page->private;
 	unsigned long bits;
 
 	bits = (mm->context.noexec || mm->context.has_pgste) ? 3UL : 1UL;
 	bits <<= (__pa(table) & (PAGE_SIZE - 1)) / 256 / sizeof(unsigned long);
-	page = pfn_to_page(__pa(table) >> PAGE_SHIFT);
 	spin_lock_bh(&mm->context.list_lock);
 	page->flags ^= bits;
 	if (page->flags & FRAG_MASK) {
 		/* Page now has some free pgtable fragments. */
-		if (!list_empty(&page->lru))
-			list_move(&page->lru, &mm->context.pgtable_list);
+		list_move(&page->lru, &mm->context.pgtable_list);
 		page = NULL;
-	} else
+	} else {
 		/* All fragments of the 4K page have been freed. */
 		list_del(&page->lru);
+		page->private = 0;
+	}
 	spin_unlock_bh(&mm->context.list_lock);
 	if (page) {
 		pgtable_page_dtor(page);
@@ -348,36 +234,6 @@ void page_table_free(struct mm_struct *m
 	}
 }
 
-void page_table_free_rcu(struct mm_struct *mm, unsigned long *table)
-{
-	struct rcu_table_freelist *batch;
-	struct page *page;
-	unsigned long bits;
-
-	if (atomic_read(&mm->mm_users) < 2 &&
-	    cpumask_equal(mm_cpumask(mm), cpumask_of(smp_processor_id()))) {
-		page_table_free(mm, table);
-		return;
-	}
-	batch = rcu_table_freelist_get(mm);
-	if (!batch) {
-		smp_call_function(smp_sync, NULL, 1);
-		page_table_free(mm, table);
-		return;
-	}
-	bits = (mm->context.noexec || mm->context.has_pgste) ? 3UL : 1UL;
-	bits <<= (__pa(table) & (PAGE_SIZE - 1)) / 256 / sizeof(unsigned long);
-	page = pfn_to_page(__pa(table) >> PAGE_SHIFT);
-	spin_lock_bh(&mm->context.list_lock);
-	/* Delayed freeing with rcu prevents reuse of pgtable fragments */
-	list_del_init(&page->lru);
-	spin_unlock_bh(&mm->context.list_lock);
-	table = (unsigned long *)(((unsigned long) table) | bits);
-	batch->table[batch->pgt_index++] = table;
-	if (batch->pgt_index >= batch->crst_index)
-		rcu_table_freelist_finish();
-}
-
 void disable_noexec(struct mm_struct *mm, struct task_struct *tsk)
 {
 	struct page *page;

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



* Re: [PATCH 00/21] mm: Preemptibility -v6
@ 2011-02-15 15:39         ` Martin Schwidefsky
  0 siblings, 0 replies; 157+ messages in thread
From: Martin Schwidefsky @ 2011-02-15 15:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Andrew Morton, Benjamin Herrenschmidt,
	David Miller, Nick Piggin, linux-kernel, linux-arch, linux-mm,
	Andrea Arcangeli

On Tue, 15 Feb 2011 15:00:17 +0100
Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:

> On Wed, 19 Jan 2011 18:10:39 +0100
> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > Martin, while doing the below DEFINE_PER_CPU removal I saw you had a
> > bunch of RCU table removal thingies in arch/s390/mm/pgtable.c, could
> > s390 use the generic bits like sparc and powerpc (see patch 16)?
> 
> That should do it. 229 deletions vs. 74 insertions, not bad. And the
> tlb flushing code actually got simpler. Even better :-)

Darn, forgot "quilt refresh". The last patch I sent is the old, broken
one. This one is better.
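
For reference, the generic helpers this hooks into (tlb_remove_table() and
tlb_table_flush(), introduced by patch 16 of Peter's series) boil down to
roughly the following. This is a condensed sketch written against the
declarations in the s390 tlb.h below, not the exact generic code; the
single-user fast path and the config glue are left out, and the static
helper names are just for this sketch:

static void tlb_remove_table_smp_sync(void *arg)
{
	/* The IPI itself is the synchronization; nothing to do here. */
}

static void tlb_remove_table_one(void *table)
{
	/*
	 * No batch page available: broadcast an IPI so that every CPU
	 * has left any IRQ-disabled page-table walk (gup_fast style)
	 * it might be in, then free the table directly.
	 */
	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
	__tlb_remove_table(table);
}

static void tlb_remove_table_rcu(struct rcu_head *head)
{
	struct mmu_table_batch *batch;
	unsigned int i;

	batch = container_of(head, struct mmu_table_batch, rcu);
	/* A grace period has passed, no walker can still see these tables. */
	for (i = 0; i < batch->nr; i++)
		__tlb_remove_table(batch->tables[i]);
	free_page((unsigned long) batch);
}

void tlb_table_flush(struct mmu_gather *tlb)
{
	struct mmu_table_batch **batch = &tlb->batch;

	if (*batch) {
		call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
		*batch = NULL;
	}
}

void tlb_remove_table(struct mmu_gather *tlb, void *table)
{
	struct mmu_table_batch **batch = &tlb->batch;

	tlb->need_flush = 1;
	if (*batch == NULL) {
		*batch = (struct mmu_table_batch *)
			__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
		if (*batch == NULL) {
			tlb_remove_table_one(table);
			return;
		}
		(*batch)->nr = 0;
	}
	(*batch)->tables[(*batch)->nr++] = table;
	if ((*batch)->nr == MAX_TABLE_BATCH)
		tlb_table_flush(tlb);
}

The point of the RCU grace period (or the IPI round trip in the fallback) is
that lockless walkers like gup_fast() traverse page tables with interrupts
disabled, so by the time the tables are handed back to the allocator nobody
can still be looking at them.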

--
Subject: [PATCH] s390: use generic RCU page-table freeing

From: Martin Schwidefsky <schwidefsky@de.ibm.com>

Now that we have a generic implementation for RCU-based page table
freeing, use it for s390 as well. It saves a good chunk of code: 229
deletions against 74 insertions.
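
The one subtle bit is that CRST tables and pte tables now go through the
same __tlb_remove_table() callback, so pte_free_tlb() tags the pointer by
setting its low bit ("(void *) pte + 1") and __tlb_remove_table() strips it
again. Purely as an illustration of that convention (these helpers are not
part of the patch):

/* Tables are well aligned, so bit 0 is free to carry the type. */
static inline void *pte_table_tag(void *table)
{
	return (void *)((unsigned long) table | 1UL);
}

static inline int table_is_pte(void *table)
{
	return ((unsigned long) table & 1UL) != 0;
}

static inline void *pte_table_untag(void *table)
{
	return (void *)((unsigned long) table & ~1UL);
}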

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/Kconfig               |    1 
 arch/s390/include/asm/pgalloc.h |   19 +--
 arch/s390/include/asm/tlb.h     |   91 ++++++++----------
 arch/s390/mm/pgtable.c          |  192 +++++-----------------------------------
 4 files changed, 74 insertions(+), 229 deletions(-)

--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -87,6 +87,7 @@ config S390
 	select HAVE_KERNEL_LZO
 	select HAVE_GET_USER_PAGES_FAST
 	select HAVE_ARCH_MUTEX_CPU_RELAX
+	select HAVE_RCU_TABLE_FREE
 	select ARCH_INLINE_SPIN_TRYLOCK
 	select ARCH_INLINE_SPIN_TRYLOCK_BH
 	select ARCH_INLINE_SPIN_LOCK
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -20,12 +20,11 @@
 #define check_pgt_cache()	do {} while (0)
 
 unsigned long *crst_table_alloc(struct mm_struct *, int);
-void crst_table_free(struct mm_struct *, unsigned long *);
-void crst_table_free_rcu(struct mm_struct *, unsigned long *);
+void crst_table_free(unsigned long *);
 
 unsigned long *page_table_alloc(struct mm_struct *);
-void page_table_free(struct mm_struct *, unsigned long *);
-void page_table_free_rcu(struct mm_struct *, unsigned long *);
+void page_table_free(unsigned long *);
+
 void disable_noexec(struct mm_struct *, struct task_struct *);
 
 static inline void clear_table(unsigned long *s, unsigned long val, size_t n)
@@ -95,7 +94,7 @@ static inline pud_t *pud_alloc_one(struc
 		crst_table_init(table, _REGION3_ENTRY_EMPTY);
 	return (pud_t *) table;
 }
-#define pud_free(mm, pud) crst_table_free(mm, (unsigned long *) pud)
+#define pud_free(mm, pud) crst_table_free((unsigned long *) pud)
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long vmaddr)
 {
@@ -104,7 +103,7 @@ static inline pmd_t *pmd_alloc_one(struc
 		crst_table_init(table, _SEGMENT_ENTRY_EMPTY);
 	return (pmd_t *) table;
 }
-#define pmd_free(mm, pmd) crst_table_free(mm, (unsigned long *) pmd)
+#define pmd_free(mm, pmd) crst_table_free((unsigned long *) pmd)
 
 static inline void pgd_populate_kernel(struct mm_struct *mm,
 				       pgd_t *pgd, pud_t *pud)
@@ -148,7 +147,7 @@ static inline pgd_t *pgd_alloc(struct mm
 	return (pgd_t *)
 		crst_table_alloc(mm, user_mode == SECONDARY_SPACE_MODE);
 }
-#define pgd_free(mm, pgd) crst_table_free(mm, (unsigned long *) pgd)
+#define pgd_free(mm, pgd) crst_table_free((unsigned long *) pgd)
 
 static inline void pmd_populate_kernel(struct mm_struct *mm,
 				       pmd_t *pmd, pte_t *pte)
@@ -175,9 +174,7 @@ static inline void pmd_populate(struct m
 #define pte_alloc_one_kernel(mm, vmaddr) ((pte_t *) page_table_alloc(mm))
 #define pte_alloc_one(mm, vmaddr) ((pte_t *) page_table_alloc(mm))
 
-#define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
-#define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
-
-extern void rcu_table_freelist_finish(void);
+#define pte_free_kernel(mm, pte) page_table_free((unsigned long *) pte)
+#define pte_free(mm, pte) page_table_free((unsigned long *) pte)
 
 #endif /* _S390_PGALLOC_H */
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -29,50 +29,42 @@
 #include <asm/smp.h>
 #include <asm/tlbflush.h>
 
+struct mmu_table_batch {
+	struct rcu_head rcu;
+	unsigned int nr;
+	void *tables[0];
+};
+
+#define MAX_TABLE_BATCH \
+	((PAGE_SIZE - sizeof(struct mmu_table_batch)) / sizeof(void *))
+
+void tlb_table_flush(struct mmu_gather *);
+void tlb_remove_table(struct mmu_gather *, void *);
+
 struct mmu_gather {
 	struct mm_struct *mm;
 	unsigned int fullmm;
-	unsigned int nr_ptes;
-	unsigned int nr_pxds;
-	unsigned int max;
-	void **array;
-	void *local[8];
+	struct mmu_table_batch *batch;
+	/* need_flush is used only for page tables */
+	unsigned int need_flush : 1;
 };
 
-static inline void __tlb_alloc_page(struct mmu_gather *tlb)
-{
-	unsigned long addr = __get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
-
-	if (addr) {
-		tlb->array = (void *) addr;
-		tlb->max = PAGE_SIZE / sizeof(void *);
-	}
-}
-
 static inline void tlb_gather_mmu(struct mmu_gather *tlb,
 				  struct mm_struct *mm,
 				  unsigned int full_mm_flush)
 {
 	tlb->mm = mm;
-	tlb->max = ARRAY_SIZE(tlb->local);
-	tlb->array = tlb->local;
 	tlb->fullmm = full_mm_flush;
 	if (tlb->fullmm)
 		__tlb_flush_mm(mm);
-	else
-		__tlb_alloc_page(tlb);
-	tlb->nr_ptes = 0;
-	tlb->nr_pxds = tlb->max;
+	tlb->batch = NULL;
+	tlb->need_flush = 0;
 }
 
 static inline void tlb_flush_mmu(struct mmu_gather *tlb)
 {
-	if (!tlb->fullmm && (tlb->nr_ptes > 0 || tlb->nr_pxds < tlb->max))
-		__tlb_flush_mm(tlb->mm);
-	while (tlb->nr_ptes > 0)
-		page_table_free_rcu(tlb->mm, tlb->array[--tlb->nr_ptes]);
-	while (tlb->nr_pxds < tlb->max)
-		crst_table_free_rcu(tlb->mm, tlb->array[tlb->nr_pxds++]);
+	if (tlb->need_flush)
+		tlb_table_flush(tlb);
 }
 
 static inline void tlb_finish_mmu(struct mmu_gather *tlb,
@@ -80,13 +72,8 @@ static inline void tlb_finish_mmu(struct
 {
 	tlb_flush_mmu(tlb);
 
-	rcu_table_freelist_finish();
-
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
-
-	if (tlb->array != tlb->local)
-		free_pages((unsigned long) tlb->array, 0);
 }
 
 /*
@@ -113,12 +100,11 @@ static inline void tlb_remove_page(struc
 static inline void pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
 				unsigned long address)
 {
-	if (!tlb->fullmm) {
-		tlb->array[tlb->nr_ptes++] = pte;
-		if (tlb->nr_ptes >= tlb->nr_pxds)
-			tlb_flush_mmu(tlb);
-	} else
-		page_table_free(tlb->mm, (unsigned long *) pte);
+	if (!tlb->fullmm)
+		/* Use LSB to distinguish crst table vs. page table */
+		tlb_remove_table(tlb, (void *) pte + 1);
+	else
+		page_table_free((unsigned long *) pte);
 }
 
 /*
@@ -134,12 +120,10 @@ static inline void pmd_free_tlb(struct m
 #ifdef __s390x__
 	if (tlb->mm->context.asce_limit <= (1UL << 31))
 		return;
-	if (!tlb->fullmm) {
-		tlb->array[--tlb->nr_pxds] = pmd;
-		if (tlb->nr_ptes >= tlb->nr_pxds)
-			tlb_flush_mmu(tlb);
-	} else
-		crst_table_free(tlb->mm, (unsigned long *) pmd);
+	if (!tlb->fullmm)
+		tlb_remove_table(tlb, pmd);
+	else
+		crst_table_free((unsigned long *) pmd);
 #endif
 }
 
@@ -156,15 +140,22 @@ static inline void pud_free_tlb(struct m
 #ifdef __s390x__
 	if (tlb->mm->context.asce_limit <= (1UL << 42))
 		return;
-	if (!tlb->fullmm) {
-		tlb->array[--tlb->nr_pxds] = pud;
-		if (tlb->nr_ptes >= tlb->nr_pxds)
-			tlb_flush_mmu(tlb);
-	} else
-		crst_table_free(tlb->mm, (unsigned long *) pud);
+	if (!tlb->fullmm)
+		tlb_remove_table(tlb, pud);
+	else
+		crst_table_free((unsigned long *) pud);
 #endif
 }
 
+static inline void __tlb_remove_table(void *table)
+{
+	/* Use LSB to distinguish crst table vs. page table */
+	if ((unsigned long) table & 1)
+		page_table_free(table - 1);
+	else
+		crst_table_free(table);
+}
+
 #define tlb_start_vma(tlb, vma)			do { } while (0)
 #define tlb_end_vma(tlb, vma)			do { } while (0)
 #define tlb_remove_tlb_entry(tlb, ptep, addr)	do { } while (0)
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -24,92 +24,17 @@
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
 
-struct rcu_table_freelist {
-	struct rcu_head rcu;
-	struct mm_struct *mm;
-	unsigned int pgt_index;
-	unsigned int crst_index;
-	unsigned long *table[0];
-};
-
-#define RCU_FREELIST_SIZE \
-	((PAGE_SIZE - sizeof(struct rcu_table_freelist)) \
-	  / sizeof(unsigned long))
-
-static DEFINE_PER_CPU(struct rcu_table_freelist *, rcu_table_freelist);
-
-static void __page_table_free(struct mm_struct *mm, unsigned long *table);
-static void __crst_table_free(struct mm_struct *mm, unsigned long *table);
-
-static struct rcu_table_freelist *rcu_table_freelist_get(struct mm_struct *mm)
-{
-	struct rcu_table_freelist **batchp = &__get_cpu_var(rcu_table_freelist);
-	struct rcu_table_freelist *batch = *batchp;
-
-	if (batch)
-		return batch;
-	batch = (struct rcu_table_freelist *) __get_free_page(GFP_ATOMIC);
-	if (batch) {
-		batch->mm = mm;
-		batch->pgt_index = 0;
-		batch->crst_index = RCU_FREELIST_SIZE;
-		*batchp = batch;
-	}
-	return batch;
-}
-
-static void rcu_table_freelist_callback(struct rcu_head *head)
-{
-	struct rcu_table_freelist *batch =
-		container_of(head, struct rcu_table_freelist, rcu);
-
-	while (batch->pgt_index > 0)
-		__page_table_free(batch->mm, batch->table[--batch->pgt_index]);
-	while (batch->crst_index < RCU_FREELIST_SIZE)
-		__crst_table_free(batch->mm, batch->table[batch->crst_index++]);
-	free_page((unsigned long) batch);
-}
-
-void rcu_table_freelist_finish(void)
-{
-	struct rcu_table_freelist *batch = __get_cpu_var(rcu_table_freelist);
-
-	if (!batch)
-		return;
-	call_rcu(&batch->rcu, rcu_table_freelist_callback);
-	__get_cpu_var(rcu_table_freelist) = NULL;
-}
-
-static void smp_sync(void *arg)
-{
-}
 
 #ifndef CONFIG_64BIT
 #define ALLOC_ORDER	1
 #define TABLES_PER_PAGE	4
 #define FRAG_MASK	15UL
 #define SECOND_HALVES	10UL
-
-void clear_table_pgstes(unsigned long *table)
-{
-	clear_table(table, _PAGE_TYPE_EMPTY, PAGE_SIZE/4);
-	memset(table + 256, 0, PAGE_SIZE/4);
-	clear_table(table + 512, _PAGE_TYPE_EMPTY, PAGE_SIZE/4);
-	memset(table + 768, 0, PAGE_SIZE/4);
-}
-
 #else
 #define ALLOC_ORDER	2
 #define TABLES_PER_PAGE	2
 #define FRAG_MASK	3UL
 #define SECOND_HALVES	2UL
-
-void clear_table_pgstes(unsigned long *table)
-{
-	clear_table(table, _PAGE_TYPE_EMPTY, PAGE_SIZE/2);
-	memset(table + 256, 0, PAGE_SIZE/2);
-}
-
 #endif
 
 unsigned long VMALLOC_START = VMALLOC_END - VMALLOC_SIZE;
@@ -138,6 +63,7 @@ unsigned long *crst_table_alloc(struct m
 			return NULL;
 		}
 		page->index = page_to_phys(shadow);
+		page->private = (unsigned long) mm;
 	}
 	spin_lock_bh(&mm->context.list_lock);
 	list_add(&page->lru, &mm->context.crst_list);
@@ -145,47 +71,19 @@ unsigned long *crst_table_alloc(struct m
 	return (unsigned long *) page_to_phys(page);
 }
 
-static void __crst_table_free(struct mm_struct *mm, unsigned long *table)
-{
-	unsigned long *shadow = get_shadow_table(table);
-
-	if (shadow)
-		free_pages((unsigned long) shadow, ALLOC_ORDER);
-	free_pages((unsigned long) table, ALLOC_ORDER);
-}
-
-void crst_table_free(struct mm_struct *mm, unsigned long *table)
-{
-	struct page *page = virt_to_page(table);
-
-	spin_lock_bh(&mm->context.list_lock);
-	list_del(&page->lru);
-	spin_unlock_bh(&mm->context.list_lock);
-	__crst_table_free(mm, table);
-}
-
-void crst_table_free_rcu(struct mm_struct *mm, unsigned long *table)
+void crst_table_free(unsigned long *table)
 {
-	struct rcu_table_freelist *batch;
 	struct page *page = virt_to_page(table);
+	struct mm_struct *mm = (struct mm_struct *) page->private;
+	unsigned long *shadow = get_shadow_table(table);
 
 	spin_lock_bh(&mm->context.list_lock);
 	list_del(&page->lru);
+	page->private = 0;
 	spin_unlock_bh(&mm->context.list_lock);
-	if (atomic_read(&mm->mm_users) < 2 &&
-	    cpumask_equal(mm_cpumask(mm), cpumask_of(smp_processor_id()))) {
-		__crst_table_free(mm, table);
-		return;
-	}
-	batch = rcu_table_freelist_get(mm);
-	if (!batch) {
-		smp_call_function(smp_sync, NULL, 1);
-		__crst_table_free(mm, table);
-		return;
-	}
-	batch->table[--batch->crst_index] = table;
-	if (batch->pgt_index >= batch->crst_index)
-		rcu_table_freelist_finish();
+	if (shadow)
+		free_pages((unsigned long) shadow, ALLOC_ORDER);
+	free_pages((unsigned long) table, ALLOC_ORDER);
 }
 
 #ifdef CONFIG_64BIT
@@ -223,7 +121,7 @@ repeat:
 	}
 	spin_unlock_bh(&mm->page_table_lock);
 	if (table)
-		crst_table_free(mm, table);
+		crst_table_free(table);
 	if (mm->context.asce_limit < limit)
 		goto repeat;
 	update_mm(mm, current);
@@ -257,7 +155,7 @@ void crst_table_downgrade(struct mm_stru
 		}
 		mm->pgd = (pgd_t *) (pgd_val(*pgd) & _REGION_ENTRY_ORIGIN);
 		mm->task_size = mm->context.asce_limit;
-		crst_table_free(mm, (unsigned long *) pgd);
+		crst_table_free((unsigned long *) pgd);
 	}
 	update_mm(mm, current);
 }
@@ -288,11 +186,7 @@ unsigned long *page_table_alloc(struct m
 			return NULL;
 		pgtable_page_ctor(page);
 		page->flags &= ~FRAG_MASK;
-		table = (unsigned long *) page_to_phys(page);
-		if (mm->context.has_pgste)
-			clear_table_pgstes(table);
-		else
-			clear_table(table, _PAGE_TYPE_EMPTY, PAGE_SIZE);
+		page->private = (unsigned long) mm;
 		spin_lock_bh(&mm->context.list_lock);
 		list_add(&page->lru, &mm->context.pgtable_list);
 	}
@@ -305,42 +199,34 @@ unsigned long *page_table_alloc(struct m
 	if ((page->flags & FRAG_MASK) == ((1UL << TABLES_PER_PAGE) - 1))
 		list_move_tail(&page->lru, &mm->context.pgtable_list);
 	spin_unlock_bh(&mm->context.list_lock);
+	clear_table(table, _PAGE_TYPE_EMPTY, PTRS_PER_PTE * sizeof(long));
+	if (mm->context.noexec)
+		clear_table(table + 256, _PAGE_TYPE_EMPTY,
+			    PTRS_PER_PTE * sizeof(long));
+	else if (mm->context.has_pgste)
+		clear_table(table + 256, 0, PTRS_PER_PTE * sizeof(long));
 	return table;
 }
 
-static void __page_table_free(struct mm_struct *mm, unsigned long *table)
+void page_table_free(unsigned long *table)
 {
-	struct page *page;
-	unsigned long bits;
-
-	bits = ((unsigned long) table) & 15;
-	table = (unsigned long *)(((unsigned long) table) ^ bits);
-	page = pfn_to_page(__pa(table) >> PAGE_SHIFT);
-	page->flags ^= bits;
-	if (!(page->flags & FRAG_MASK)) {
-		pgtable_page_dtor(page);
-		__free_page(page);
-	}
-}
-
-void page_table_free(struct mm_struct *mm, unsigned long *table)
-{
-	struct page *page;
+	struct page *page = virt_to_page(table);
+	struct mm_struct *mm = (struct mm_struct *) page->private;
 	unsigned long bits;
 
 	bits = (mm->context.noexec || mm->context.has_pgste) ? 3UL : 1UL;
 	bits <<= (__pa(table) & (PAGE_SIZE - 1)) / 256 / sizeof(unsigned long);
-	page = pfn_to_page(__pa(table) >> PAGE_SHIFT);
 	spin_lock_bh(&mm->context.list_lock);
 	page->flags ^= bits;
 	if (page->flags & FRAG_MASK) {
 		/* Page now has some free pgtable fragments. */
-		if (!list_empty(&page->lru))
-			list_move(&page->lru, &mm->context.pgtable_list);
+		list_move(&page->lru, &mm->context.pgtable_list);
 		page = NULL;
-	} else
+	} else {
 		/* All fragments of the 4K page have been freed. */
 		list_del(&page->lru);
+		page->private = 0;
+	}
 	spin_unlock_bh(&mm->context.list_lock);
 	if (page) {
 		pgtable_page_dtor(page);
@@ -348,36 +234,6 @@ void page_table_free(struct mm_struct *m
 	}
 }
 
-void page_table_free_rcu(struct mm_struct *mm, unsigned long *table)
-{
-	struct rcu_table_freelist *batch;
-	struct page *page;
-	unsigned long bits;
-
-	if (atomic_read(&mm->mm_users) < 2 &&
-	    cpumask_equal(mm_cpumask(mm), cpumask_of(smp_processor_id()))) {
-		page_table_free(mm, table);
-		return;
-	}
-	batch = rcu_table_freelist_get(mm);
-	if (!batch) {
-		smp_call_function(smp_sync, NULL, 1);
-		page_table_free(mm, table);
-		return;
-	}
-	bits = (mm->context.noexec || mm->context.has_pgste) ? 3UL : 1UL;
-	bits <<= (__pa(table) & (PAGE_SIZE - 1)) / 256 / sizeof(unsigned long);
-	page = pfn_to_page(__pa(table) >> PAGE_SHIFT);
-	spin_lock_bh(&mm->context.list_lock);
-	/* Delayed freeing with rcu prevents reuse of pgtable fragments */
-	list_del_init(&page->lru);
-	spin_unlock_bh(&mm->context.list_lock);
-	table = (unsigned long *)(((unsigned long) table) | bits);
-	batch->table[batch->pgt_index++] = table;
-	if (batch->pgt_index >= batch->crst_index)
-		rcu_table_freelist_finish();
-}
-
 void disable_noexec(struct mm_struct *mm, struct task_struct *tsk)
 {
 	struct page *page;

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



end of thread, other threads:[~2011-02-15 15:39 UTC | newest]

Thread overview: 157+ messages
2010-11-26 14:38 [PATCH 00/21] mm: Preemptibility -v6 Peter Zijlstra
2010-11-26 14:38 ` [PATCH 01/21] mm: Revert page_lock_anon_vma() lock annotation Peter Zijlstra
2010-11-30  1:19   ` KOSAKI Motohiro
2010-11-26 14:38 ` [PATCH 02/21] powerpc: Use call_rcu_sched() for pagetables Peter Zijlstra
2010-11-27 10:33   ` Nick Piggin
2010-11-27 21:55     ` Benjamin Herrenschmidt
2010-11-26 14:38 ` [PATCH 03/21] mm: Improve page_lock_anon_vma() comment Peter Zijlstra
2010-11-29  2:14   ` KAMEZAWA Hiroyuki
2010-11-26 14:38 ` [PATCH 04/21] mm: Rename drop_anon_vma to put_anon_vma Peter Zijlstra
2010-11-29  2:16   ` KAMEZAWA Hiroyuki
2010-11-26 14:38 ` [PATCH 05/21] mm: Move anon_vma ref out from under CONFIG_KSM Peter Zijlstra
2010-11-29  2:19   ` KAMEZAWA Hiroyuki
2010-11-26 14:38 ` [PATCH 06/21] mm: Simplify anon_vma refcounts Peter Zijlstra
2010-11-29  2:30   ` KAMEZAWA Hiroyuki
2010-11-26 14:38 ` [PATCH 07/21] mm: Use refcounts for page_lock_anon_vma() Peter Zijlstra
2010-11-29  2:35   ` KAMEZAWA Hiroyuki
2010-11-29 20:41     ` Peter Zijlstra
2010-11-30  1:21     ` KOSAKI Motohiro
2010-11-26 14:38 ` [PATCH 08/21] mm: Preemptible mmu_gather Peter Zijlstra
2010-11-29  2:53   ` KAMEZAWA Hiroyuki
2010-11-29 20:47     ` Peter Zijlstra
2010-11-26 14:38 ` [PATCH 09/21] powerpc: " Peter Zijlstra
2010-11-30  3:12   ` Benjamin Herrenschmidt
2010-11-30  3:35     ` Benjamin Herrenschmidt
2010-11-30 19:25       ` Peter Zijlstra
2010-11-26 14:38 ` [PATCH 10/21] sparc: " Peter Zijlstra
2010-11-26 14:38 ` [PATCH 11/21] s390: preemptible mmu_gather Peter Zijlstra
2010-11-26 14:38 ` [PATCH 12/21] arm: Preemptible mmu_gather Peter Zijlstra
2010-11-26 14:38 ` [PATCH 13/21] sh: " Peter Zijlstra
2010-11-26 14:38 ` [PATCH 14/21] um: " Peter Zijlstra
2010-11-26 14:38 ` [PATCH 15/21] ia64: " Peter Zijlstra
2010-11-26 14:38 ` [PATCH 16/21] mm, powerpc: Move the RCU page-table freeing into generic code Peter Zijlstra
2010-11-30  3:05   ` Benjamin Herrenschmidt
2010-11-26 14:39 ` [PATCH 17/21] lockdep, mutex: Provide mutex_lock_nest_lock Peter Zijlstra
2010-11-26 14:39 ` [PATCH 18/21] mutex: Provide mutex_is_contended Peter Zijlstra
2010-11-29  2:58   ` KAMEZAWA Hiroyuki
2010-11-29 20:49     ` Peter Zijlstra
2010-11-26 14:39 ` [PATCH 19/21] mm: Convert i_mmap_lock and anon_vma->lock to mutexes Peter Zijlstra
2010-11-29  3:05   ` KAMEZAWA Hiroyuki
2010-11-29 20:50     ` Peter Zijlstra
2010-11-30  1:28   ` KOSAKI Motohiro
2010-11-26 14:39 ` [PATCH 20/21] mm: Extended batches for generic mmu_gather Peter Zijlstra
2010-11-29  3:11   ` KAMEZAWA Hiroyuki
2010-11-26 14:39 ` [PATCH 21/21] mm: Optimize page_lock_anon_vma() fast-path Peter Zijlstra
2010-11-29  3:22   ` KAMEZAWA Hiroyuki
2010-11-29  9:00 ` [PATCH 00/21] mm: Preemptibility -v6 Benjamin Herrenschmidt
2010-11-29 11:41   ` Peter Zijlstra
2011-01-18  7:12 ` Hugh Dickins
2011-01-18 10:30   ` Peter Zijlstra
2011-01-18 10:44   ` Peter Zijlstra
2011-01-18 10:50   ` Peter Zijlstra
2011-01-19 17:10   ` Peter Zijlstra
2011-01-20 19:57     ` Hugh Dickins
2011-01-21  7:36       ` Benjamin Herrenschmidt
2011-01-21 15:33       ` Peter Zijlstra
2011-01-22 21:06         ` Paul E. McKenney
2011-01-23 11:03           ` Peter Zijlstra
2011-01-24 12:21         ` Peter Zijlstra
2011-01-24 14:34           ` Oleg Nesterov
2011-01-24 15:00             ` Peter Zijlstra
2011-01-24 15:33               ` Oleg Nesterov
2011-01-24 12:45       ` Peter Zijlstra
2011-01-24 14:24         ` Peter Zijlstra
2011-01-21 17:44     ` Andrea Arcangeli
2011-01-31 10:02     ` Martin Schwidefsky
2011-02-15 14:00     ` Martin Schwidefsky
2011-02-15 15:39       ` Martin Schwidefsky
