* [PATCH 00/13] mm: preemptibility -v2
@ 2010-04-08 19:17 ` Peter Zijlstra
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-08 19:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Peter Zijlstra

Hi,

This (still incomplete) patch-set makes part of the mm a lot more preemptible.
It converts i_mmap_lock and anon_vma->lock to mutexes.  On the way there it
also makes mmu_gather preemptible.
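
Roughly, the direction is the following type change (sketch only, not an
actual hunk from this series; unrelated fields elided):

	struct address_space {
		/* ... */
		struct mutex	i_mmap_lock;	/* was: spinlock_t */
	};

	struct anon_vma {
		struct mutex	lock;		/* was: spinlock_t */
		/* ... */
	};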

The main motivation was making mm_take_all_locks() preemptible, since it
appears people are nesting hundreds of spinlocks there.

A side-effect is that we can finally make mmu_gather preemptible, something
lots of people have wanted to do for a long time.

It also gets us anon_vma refcounting, which seems to be wanted by both KSM
and Mel's compaction work.

This patch-set seems to build and boot on my x86_64 machines and even builds a
kernel. I've also attempted powerpc and sparc, which I've compile-tested with
their respective defconfigs. Remaining are (afaict the rest uses the generic
TLB bits):

 - s390
 - ia64
 - arm
 - superh
 - um

From those, s390 and ia64 look 'interesting'; arm and superh seem very similar
and should be relatively easy (-rt has a patchlet for arm iirc).

What kind of performance tests would people have me run on this to satisfy
their need for numbers? I've done a kernel build on x86_64, and if anything it
was slightly faster with these patches, but well within the noise levels, so
it might be heat noise I'm looking at ;-)




* [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-08 19:17 ` Peter Zijlstra
@ 2010-04-08 19:17   ` Peter Zijlstra
  -1 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-08 19:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Peter Zijlstra

[-- Attachment #1: powerpc-gup_fast-rcu.patch --]
[-- Type: text/plain, Size: 1469 bytes --]

The powerpc page table freeing relies on the fact that IRQs hold off
an RCU grace period. This is currently true for all existing RCU
implementations, but it is not an assumption Paul wants to support.

Therefore, also take the RCU read lock along with disabling IRQs, to
ensure the RCU grace period does at least cover these lookups.
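
The resulting nesting is simply (a sketch of the idea only; the actual
hunk follows below):

	rcu_read_lock();	/* explicitly enter an RCU read-side section */
	local_irq_disable();	/* what the lockless walk relied on so far */

	/* ... lockless page table walk, taking refs on pages ... */

	local_irq_enable();
	rcu_read_unlock();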

Requested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 arch/powerpc/mm/gup.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/arch/powerpc/mm/gup.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/gup.c
+++ linux-2.6/arch/powerpc/mm/gup.c
@@ -142,6 +142,7 @@ int get_user_pages_fast(unsigned long st
 	 * So long as we atomically load page table pointers versus teardown,
 	 * we can follow the address down to the the page and take a ref on it.
 	 */
+	rcu_read_lock();
 	local_irq_disable();
 
 	pgdp = pgd_offset(mm, addr);
@@ -162,6 +163,7 @@ int get_user_pages_fast(unsigned long st
 	} while (pgdp++, addr = next, addr != end);
 
 	local_irq_enable();
+	rcu_read_unlock();
 
 	VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
 	return nr;
@@ -171,6 +173,7 @@ int get_user_pages_fast(unsigned long st
 
 slow:
 		local_irq_enable();
+		rcu_read_unlock();
 slow_irqon:
 		pr_devel("  slow path ! nr = %d\n", nr);
 



* [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma()
  2010-04-08 19:17 ` Peter Zijlstra
@ 2010-04-08 19:17   ` Peter Zijlstra
  -1 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-08 19:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Peter Zijlstra

[-- Attachment #1: mm-page_lock_anon_vma.patch --]
[-- Type: text/plain, Size: 1779 bytes --]

There is nothing preventing the anon_vma from being detached while we
are spinning to acquire the lock. Most (all?) current users end up
calling something like vma_address(page, vma) on it, which has a
fairly good chance of weeding out wonky vmas.

However, suppose the anon_vma got freed and re-used while we were
waiting to acquire the lock, and the new anon_vma happens to fit the
page->index (because that is the only thing vma_address() uses to
determine whether the page fits in a particular vma); we could then
end up traversing faulty anon_vma chains.

Close this hole for good by re-validating that page->mapping still
holds the very same anon_vma pointer after we acquire the lock; if not,
be utterly paranoid and retry the whole operation (which will very
likely bail, because it's unlikely the page got attached to a different
anon_vma in the meantime).
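
The pattern is the usual check, lock, re-check (a sketch only; the real
hunk is below):

	rcu_read_lock();
again:
	anon_vma = page_rmapping(page);		/* speculative load */
	spin_lock(&anon_vma->lock);
	if (page_rmapping(page) != anon_vma) {	/* changed under us, retry */
		spin_unlock(&anon_vma->lock);
		goto again;
	}
	/* anon_vma is stable while we hold the lock */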

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/rmap.c |    7 +++++++
 1 file changed, 7 insertions(+)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -294,6 +294,7 @@ struct anon_vma *page_lock_anon_vma(stru
 	unsigned long anon_mapping;
 
 	rcu_read_lock();
+again:
 	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
 	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
 		goto out;
@@ -302,6 +303,12 @@ struct anon_vma *page_lock_anon_vma(stru
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
 	spin_lock(&anon_vma->lock);
+
+	if (page_rmapping(page) != anon_vma) {
+		spin_unlock(&anon_vma->lock);
+		goto again;
+	}
+
 	return anon_vma;
 out:
 	rcu_read_unlock();



* [PATCH 03/13] x86: Remove last traces of quicklist usage
  2010-04-08 19:17 ` Peter Zijlstra
@ 2010-04-08 19:17   ` Peter Zijlstra
  -1 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-08 19:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Peter Zijlstra

[-- Attachment #1: x86-remove-quicklist-include.patch --]
[-- Type: text/plain, Size: 609 bytes --]

We still have a stray quicklist header included even though we axed
quicklist usage quite a while back.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/mm/pgtable_32.c |    1 -
 1 file changed, 1 deletion(-)

Index: linux-2.6/arch/x86/mm/pgtable_32.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/pgtable_32.c
+++ linux-2.6/arch/x86/mm/pgtable_32.c
@@ -9,7 +9,6 @@
 #include <linux/pagemap.h>
 #include <linux/spinlock.h>
 #include <linux/module.h>
-#include <linux/quicklist.h>
 
 #include <asm/system.h>
 #include <asm/pgtable.h>



* [PATCH 04/13] mm: Move anon_vma ref out from under CONFIG_KSM
  2010-04-08 19:17 ` Peter Zijlstra
@ 2010-04-08 19:17   ` Peter Zijlstra
  -1 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-08 19:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Peter Zijlstra

[-- Attachment #1: mm-anon_vma-ref.patch --]
[-- Type: text/plain, Size: 2632 bytes --]

We need an anon_vma refcount for preemptible anon_vma->lock as well
as memory compaction, so move it out into generic code.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h
+++ linux-2.6/include/linux/rmap.h
@@ -26,9 +26,7 @@
  */
 struct anon_vma {
 	spinlock_t lock;	/* Serialize access to vma list */
-#ifdef CONFIG_KSM
-	atomic_t ksm_refcount;
-#endif
+	atomic_t ref;
 	/*
 	 * NOTE: the LSB of the head.next is set by
 	 * mm_take_all_locks() _after_ taking the above lock. So the
@@ -61,26 +59,6 @@ struct anon_vma_chain {
 };
 
 #ifdef CONFIG_MMU
-#ifdef CONFIG_KSM
-static inline void ksm_refcount_init(struct anon_vma *anon_vma)
-{
-	atomic_set(&anon_vma->ksm_refcount, 0);
-}
-
-static inline int ksm_refcount(struct anon_vma *anon_vma)
-{
-	return atomic_read(&anon_vma->ksm_refcount);
-}
-#else
-static inline void ksm_refcount_init(struct anon_vma *anon_vma)
-{
-}
-
-static inline int ksm_refcount(struct anon_vma *anon_vma)
-{
-	return 0;
-}
-#endif /* CONFIG_KSM */
 
 static inline struct anon_vma *page_anon_vma(struct page *page)
 {
Index: linux-2.6/mm/ksm.c
===================================================================
--- linux-2.6.orig/mm/ksm.c
+++ linux-2.6/mm/ksm.c
@@ -318,14 +318,14 @@ static void hold_anon_vma(struct rmap_it
 			  struct anon_vma *anon_vma)
 {
 	rmap_item->anon_vma = anon_vma;
-	atomic_inc(&anon_vma->ksm_refcount);
+	atomic_inc(&anon_vma->ref);
 }
 
 static void drop_anon_vma(struct rmap_item *rmap_item)
 {
 	struct anon_vma *anon_vma = rmap_item->anon_vma;
 
-	if (atomic_dec_and_lock(&anon_vma->ksm_refcount, &anon_vma->lock)) {
+	if (atomic_dec_and_lock(&anon_vma->ref, &anon_vma->lock)) {
 		int empty = list_empty(&anon_vma->head);
 		spin_unlock(&anon_vma->lock);
 		if (empty)
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -248,7 +248,7 @@ static void anon_vma_unlink(struct anon_
 	list_del(&anon_vma_chain->same_anon_vma);
 
 	/* We must garbage collect the anon_vma if it's empty */
-	empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma);
+	empty = list_empty(&anon_vma->head) && !atomic_read(&anon_vma->ref);
 	spin_unlock(&anon_vma->lock);
 
 	if (empty)
@@ -272,7 +272,7 @@ static void anon_vma_ctor(void *data)
 	struct anon_vma *anon_vma = data;
 
 	spin_lock_init(&anon_vma->lock);
-	ksm_refcount_init(anon_vma);
+	atomic_set(&anon_vma->ref, 0);
 	INIT_LIST_HEAD(&anon_vma->head);
 }
 



* [PATCH 05/13] mm: Make use of the anon_vma ref count
  2010-04-08 19:17 ` Peter Zijlstra
@ 2010-04-08 19:17   ` Peter Zijlstra
  -1 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-08 19:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Peter Zijlstra

[-- Attachment #1: mm-use-anon_vma-ref.patch --]
[-- Type: text/plain, Size: 6267 bytes --]

This patch changes the anon_vma refcount to be 0 when the object is
free. It does this by counting a single reference for the anon_vma
being in use (iow. the anon_vma->head list being non-empty).

This allows a simpler release scheme, without having to check both the
refcount and the list, and avoids taking a ref for each entry on the
list.

We then use this new refcount in the migration code to avoid a long
RCU read-side section, and convert page_lock_anon_vma() over to use
refcounts.

The latter is done in preparation for the conversion of the anon_vma
lock from a spinlock to a mutex.
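
In short, the lifetime rules become (sketch; see the hunks below for the
real thing):

	anon_vma_alloc():	ref = 1;	/* the 'in use' reference */
	anon_vma_get(page):	atomic_inc_not_zero(&anon_vma->ref);
	anon_vma_put():		if (atomic_dec_and_test(&anon_vma->ref))
					anon_vma_free(anon_vma);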

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/rmap.h |    7 +++++++
 mm/ksm.c             |    9 +--------
 mm/migrate.c         |   17 ++++++-----------
 mm/rmap.c            |   45 ++++++++++++++++++++++++++++++---------------
 4 files changed, 44 insertions(+), 34 deletions(-)

Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h
+++ linux-2.6/include/linux/rmap.h
@@ -100,6 +100,13 @@ static inline void anon_vma_merge(struct
 	unlink_anon_vmas(next);
 }
 
+struct anon_vma *anon_vma_get(struct page *page);
+static inline void anon_vma_put(struct anon_vma *anon_vma)
+{
+	if (atomic_dec_and_test(&anon_vma->ref))
+		anon_vma_free(anon_vma);
+}
+
 /*
  * rmap interfaces called when adding or removing pte of page
  */
Index: linux-2.6/mm/ksm.c
===================================================================
--- linux-2.6.orig/mm/ksm.c
+++ linux-2.6/mm/ksm.c
@@ -323,14 +323,7 @@ static void hold_anon_vma(struct rmap_it
 
 static void drop_anon_vma(struct rmap_item *rmap_item)
 {
-	struct anon_vma *anon_vma = rmap_item->anon_vma;
-
-	if (atomic_dec_and_lock(&anon_vma->ref, &anon_vma->lock)) {
-		int empty = list_empty(&anon_vma->head);
-		spin_unlock(&anon_vma->lock);
-		if (empty)
-			anon_vma_free(anon_vma);
-	}
+	anon_vma_put(rmap_item->anon_vma);
 }
 
 /*
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -545,7 +545,7 @@ static int unmap_and_move(new_page_t get
 	int rc = 0;
 	int *result = NULL;
 	struct page *newpage = get_new_page(page, private, &result);
-	int rcu_locked = 0;
+	struct anon_vma *anon_vma = NULL;
 	int charge = 0;
 	struct mem_cgroup *mem = NULL;
 
@@ -601,10 +601,8 @@ static int unmap_and_move(new_page_t get
 	 * File Caches may use write_page() or lock_page() in migration, then,
 	 * just care Anon page here.
 	 */
-	if (PageAnon(page)) {
-		rcu_read_lock();
-		rcu_locked = 1;
-	}
+	if (PageAnon(page))
+		anon_vma = anon_vma_get(page);
 
 	/*
 	 * Corner case handling:
@@ -622,10 +620,7 @@ static int unmap_and_move(new_page_t get
 		if (!PageAnon(page) && page_has_private(page)) {
 			/*
 			 * Go direct to try_to_free_buffers() here because
-			 * a) that's what try_to_release_page() would do anyway
-			 * b) we may be under rcu_read_lock() here, so we can't
-			 *    use GFP_KERNEL which is what try_to_release_page()
-			 *    needs to be effective.
+			 * that's what try_to_release_page() would do anyway
 			 */
 			try_to_free_buffers(page);
 			goto rcu_unlock;
@@ -643,8 +638,8 @@ skip_unmap:
 	if (rc)
 		remove_migration_ptes(page, page);
 rcu_unlock:
-	if (rcu_locked)
-		rcu_read_unlock();
+	if (anon_vma)
+		anon_vma_put(anon_vma);
 uncharge:
 	if (!charge)
 		mem_cgroup_end_migration(mem, page, newpage);
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -66,11 +66,18 @@ static struct kmem_cache *anon_vma_chain
 
 static inline struct anon_vma *anon_vma_alloc(void)
 {
-	return kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+	struct anon_vma *anon_vma;
+
+	anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+	if (anon_vma)
+		atomic_set(&anon_vma->ref, 1);
+
+	return anon_vma;
 }
 
 void anon_vma_free(struct anon_vma *anon_vma)
 {
+	VM_BUG_ON(atomic_read(&anon_vma->ref));
 	kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
@@ -149,7 +156,7 @@ int anon_vma_prepare(struct vm_area_stru
 
 		spin_unlock(&anon_vma->lock);
 		if (unlikely(allocated)) {
-			anon_vma_free(allocated);
+			anon_vma_put(allocated);
 			anon_vma_chain_free(avc);
 		}
 	}
@@ -230,7 +237,7 @@ int anon_vma_fork(struct vm_area_struct 
 	return 0;
 
  out_error_free_anon_vma:
-	anon_vma_free(anon_vma);
+	anon_vma_put(anon_vma);
  out_error:
 	unlink_anon_vmas(vma);
 	return -ENOMEM;
@@ -247,13 +254,11 @@ static void anon_vma_unlink(struct anon_
 
 	spin_lock(&anon_vma->lock);
 	list_del(&anon_vma_chain->same_anon_vma);
-
-	/* We must garbage collect the anon_vma if it's empty */
-	empty = list_empty(&anon_vma->head) && !atomic_read(&anon_vma->ref);
+	empty = list_empty(&anon_vma->head);
 	spin_unlock(&anon_vma->lock);
 
 	if (empty)
-		anon_vma_free(anon_vma);
+		anon_vma_put(anon_vma);
 }
 
 void unlink_anon_vmas(struct vm_area_struct *vma)
@@ -286,11 +291,11 @@ void __init anon_vma_init(void)
 
 /*
  * Getting a lock on a stable anon_vma from a page off the LRU is
- * tricky: page_lock_anon_vma rely on RCU to guard against the races.
+ * tricky: page_lock_anon_vma relies on RCU to guard against the races.
  */
-struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *anon_vma_get(struct page *page)
 {
-	struct anon_vma *anon_vma;
+	struct anon_vma *anon_vma = NULL;
 	unsigned long anon_mapping;
 
 	rcu_read_lock();
@@ -302,23 +307,33 @@ again:
 		goto out;
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
-	spin_lock(&anon_vma->lock);
+	if (!atomic_inc_not_zero(&anon_vma->ref))
+		anon_vma = NULL;
 
 	if (page_rmapping(page) != anon_vma) {
-		spin_unlock(&anon_vma->lock);
+		anon_vma_put(anon_vma);
 		goto again;
 	}
 
-	return anon_vma;
 out:
 	rcu_read_unlock();
-	return NULL;
+	return anon_vma;
+}
+
+struct anon_vma *page_lock_anon_vma(struct page *page)
+{
+	struct anon_vma *anon_vma = anon_vma_get(page);
+
+	if (anon_vma)
+		spin_lock(&anon_vma->lock);
+
+	return anon_vma;
 }
 
 void page_unlock_anon_vma(struct anon_vma *anon_vma)
 {
 	spin_unlock(&anon_vma->lock);
-	rcu_read_unlock();
+	anon_vma_put(anon_vma);
 }
 
 /*



* [PATCH 06/13] mm: Preemptible mmu_gather
  2010-04-08 19:17 ` Peter Zijlstra
@ 2010-04-08 19:17   ` Peter Zijlstra
  -1 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-08 19:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Peter Zijlstra

[-- Attachment #1: mm-preempt-tlb-gather.patch --]
[-- Type: text/plain, Size: 9828 bytes --]

Make mmu_gather preemptible by using a small on-stack list, and use an
optional allocation to speed things up.

This breaks at least PPC, for which there is a patch in -rt that still
needs porting.

Preemptible mmu_gather is desired in general and usable once
i_mmap_lock becomes a mutex. Doing it before the mutex conversion
saves us from having to rework the code by moving the mmu_gather
bits inside the i_mmap_lock.
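
For callers the API change looks like this (sketch; the real conversions
are in the hunks below):

	struct mmu_gather tlb;		/* on-stack, was a per-cpu pointer */

	tlb_gather_mmu(&tlb, mm, 0);
	/* ... unmap_vmas(&tlb, ...), free_pgtables(&tlb, ...) ... */
	tlb_finish_mmu(&tlb, start, end);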

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/exec.c                 |   10 ++++-----
 include/asm-generic/tlb.h |   51 +++++++++++++++++++++++++++++-----------------
 include/linux/mm.h        |    2 -
 mm/memory.c               |   27 +++++-------------------
 mm/mmap.c                 |   16 +++++++-------
 5 files changed, 53 insertions(+), 53 deletions(-)

Index: linux-2.6/fs/exec.c
===================================================================
--- linux-2.6.orig/fs/exec.c
+++ linux-2.6/fs/exec.c
@@ -503,7 +503,7 @@ static int shift_arg_pages(struct vm_are
 	unsigned long length = old_end - old_start;
 	unsigned long new_start = old_start - shift;
 	unsigned long new_end = old_end - shift;
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 
 	BUG_ON(new_start > new_end);
 
@@ -529,12 +529,12 @@ static int shift_arg_pages(struct vm_are
 		return -ENOMEM;
 
 	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
+	tlb_gather_mmu(&tlb, mm, 0);
 	if (new_end > old_start) {
 		/*
 		 * when the old and new regions overlap clear from new_end.
 		 */
-		free_pgd_range(tlb, new_end, old_end, new_end,
+		free_pgd_range(&tlb, new_end, old_end, new_end,
 			vma->vm_next ? vma->vm_next->vm_start : 0);
 	} else {
 		/*
@@ -543,10 +543,10 @@ static int shift_arg_pages(struct vm_are
 		 * have constraints on va-space that make this illegal (IA64) -
 		 * for the others its just a little faster.
 		 */
-		free_pgd_range(tlb, old_start, old_end, new_end,
+		free_pgd_range(&tlb, old_start, old_end, new_end,
 			vma->vm_next ? vma->vm_next->vm_start : 0);
 	}
-	tlb_finish_mmu(tlb, new_end, old_end);
+	tlb_finish_mmu(&tlb, new_end, old_end);
 
 	/*
 	 * Shrink the vma to just the new range.  Always succeeds.
Index: linux-2.6/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.orig/include/asm-generic/tlb.h
+++ linux-2.6/include/asm-generic/tlb.h
@@ -22,14 +22,8 @@
  * and page free order so much..
  */
 #ifdef CONFIG_SMP
-  #ifdef ARCH_FREE_PTR_NR
-    #define FREE_PTR_NR   ARCH_FREE_PTR_NR
-  #else
-    #define FREE_PTE_NR	506
-  #endif
   #define tlb_fast_mode(tlb) ((tlb)->nr == ~0U)
 #else
-  #define FREE_PTE_NR	1
   #define tlb_fast_mode(tlb) 1
 #endif
 
@@ -39,30 +33,48 @@
 struct mmu_gather {
 	struct mm_struct	*mm;
 	unsigned int		nr;	/* set to ~0U means fast mode */
+	unsigned int		max;	/* nr < max */
 	unsigned int		need_flush;/* Really unmapped some ptes? */
 	unsigned int		fullmm; /* non-zero means full mm flush */
-	struct page *		pages[FREE_PTE_NR];
+#ifdef HAVE_ARCH_MMU_GATHER
+	struct arch_mmu_gather	arch;
+#endif
+	struct page		**pages;
+	struct page		*local[8];
 };
 
-/* Users of the generic TLB shootdown code must declare this storage space. */
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
+static inline void __tlb_alloc_pages(struct mmu_gather *tlb)
+{
+	unsigned long addr = __get_free_pages(GFP_ATOMIC, 0);
+
+	if (addr) {
+		tlb->pages = (void *)addr;
+		tlb->max = PAGE_SIZE / sizeof(struct page *);
+	}
+}
 
 /* tlb_gather_mmu
  *	Return a pointer to an initialized struct mmu_gather.
  */
-static inline struct mmu_gather *
-tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
+static inline void
+tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
 {
-	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
-
 	tlb->mm = mm;
 
-	/* Use fast mode if only one CPU is online */
-	tlb->nr = num_online_cpus() > 1 ? 0U : ~0U;
+	tlb->max = ARRAY_SIZE(tlb->local);
+	tlb->pages = tlb->local;
+
+	if (num_online_cpus() > 1) {
+		tlb->nr = 0;
+		__tlb_alloc_pages(tlb);
+	} else /* Use fast mode if only one CPU is online */
+		tlb->nr = ~0U;
 
 	tlb->fullmm = full_mm_flush;
 
-	return tlb;
+#ifdef HAVE_ARCH_MMU_GATHER
+	tlb->arch = ARCH_MMU_GATHER_INIT;
+#endif
 }
 
 static inline void
@@ -75,6 +87,8 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
 	if (!tlb_fast_mode(tlb)) {
 		free_pages_and_swap_cache(tlb->pages, tlb->nr);
 		tlb->nr = 0;
+		if (tlb->pages == tlb->local)
+			__tlb_alloc_pages(tlb);
 	}
 }
 
@@ -90,7 +104,8 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
 
-	put_cpu_var(mmu_gathers);
+	if (tlb->pages != tlb->local)
+		free_pages((unsigned long)tlb->pages, 0);
 }
 
 /* tlb_remove_page
@@ -106,7 +121,7 @@ static inline void tlb_remove_page(struc
 		return;
 	}
 	tlb->pages[tlb->nr++] = page;
-	if (tlb->nr >= FREE_PTE_NR)
+	if (tlb->nr >= tlb->max)
 		tlb_flush_mmu(tlb, 0, 0);
 }
 
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -760,7 +760,7 @@ int zap_vma_ptes(struct vm_area_struct *
 		unsigned long size);
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size, struct zap_details *);
-unsigned long unmap_vmas(struct mmu_gather **tlb,
+unsigned long unmap_vmas(struct mmu_gather *tlb,
 		struct vm_area_struct *start_vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1095,17 +1095,14 @@ static unsigned long unmap_page_range(st
  * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
  * drops the lock and schedules.
  */
-unsigned long unmap_vmas(struct mmu_gather **tlbp,
+unsigned long unmap_vmas(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *details)
 {
 	long zap_work = ZAP_BLOCK_SIZE;
-	unsigned long tlb_start = 0;	/* For tlb_finish_mmu */
-	int tlb_start_valid = 0;
 	unsigned long start = start_addr;
 	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
-	int fullmm = (*tlbp)->fullmm;
 	struct mm_struct *mm = vma->vm_mm;
 
 	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
@@ -1126,11 +1123,6 @@ unsigned long unmap_vmas(struct mmu_gath
 			untrack_pfn_vma(vma, 0, 0);
 
 		while (start != end) {
-			if (!tlb_start_valid) {
-				tlb_start = start;
-				tlb_start_valid = 1;
-			}
-
 			if (unlikely(is_vm_hugetlb_page(vma))) {
 				/*
 				 * It is undesirable to test vma->vm_file as it
@@ -1151,7 +1143,7 @@ unsigned long unmap_vmas(struct mmu_gath
 
 				start = end;
 			} else
-				start = unmap_page_range(*tlbp, vma,
+				start = unmap_page_range(tlb, vma,
 						start, end, &zap_work, details);
 
 			if (zap_work > 0) {
@@ -1159,19 +1151,13 @@ unsigned long unmap_vmas(struct mmu_gath
 				break;
 			}
 
-			tlb_finish_mmu(*tlbp, tlb_start, start);
-
 			if (need_resched() ||
 				(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
-				if (i_mmap_lock) {
-					*tlbp = NULL;
+				if (i_mmap_lock)
 					goto out;
-				}
 				cond_resched();
 			}
 
-			*tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
-			tlb_start_valid = 0;
 			zap_work = ZAP_BLOCK_SIZE;
 		}
 	}
@@ -1191,16 +1177,15 @@ unsigned long zap_page_range(struct vm_a
 		unsigned long size, struct zap_details *details)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 	unsigned long end = address + size;
 	unsigned long nr_accounted = 0;
 
 	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
+	tlb_gather_mmu(&tlb, mm, 0);
 	update_hiwater_rss(mm);
 	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
-	if (tlb)
-		tlb_finish_mmu(tlb, address, end);
+	tlb_finish_mmu(&tlb, address, end);
 	return end;
 }
 
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -1844,17 +1844,17 @@ static void unmap_region(struct mm_struc
 		unsigned long start, unsigned long end)
 {
 	struct vm_area_struct *next = prev? prev->vm_next: mm->mmap;
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 	unsigned long nr_accounted = 0;
 
 	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
+	tlb_gather_mmu(&tlb, mm, 0);
 	update_hiwater_rss(mm);
 	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
-	free_pgtables(tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
+	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 				 next? next->vm_start: 0);
-	tlb_finish_mmu(tlb, start, end);
+	tlb_finish_mmu(&tlb, start, end);
 }
 
 /*
@@ -2190,7 +2190,7 @@ EXPORT_SYMBOL(do_brk);
 /* Release all mmaps. */
 void exit_mmap(struct mm_struct *mm)
 {
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
 	unsigned long end;
@@ -2215,14 +2215,14 @@ void exit_mmap(struct mm_struct *mm)
 
 	lru_add_drain();
 	flush_cache_mm(mm);
-	tlb = tlb_gather_mmu(mm, 1);
+	tlb_gather_mmu(&tlb, mm, 1);
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
 
-	free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
-	tlb_finish_mmu(tlb, 0, end);
+	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
+	tlb_finish_mmu(&tlb, 0, end);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,



^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 06/13] mm: Preemptible mmu_gather
@ 2010-04-08 19:17   ` Peter Zijlstra
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-08 19:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Peter Zijlstra

[-- Attachment #1: mm-preempt-tlb-gather.patch --]
[-- Type: text/plain, Size: 9826 bytes --]

Make mmu_gather preemptible by using a small on stack list and use
an option allocation to speed things up.

This breaks at least PPC for which there is a patch in -rt which
still needs porting.

Preemptible mmu_gather is desired in general and usable once
i_mmap_lock becomes a mutex. Doing it before the mutex conversion
saves us from having to rework the code by moving the mmu_gather
bits inside the i_mmap_lock.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/exec.c                 |   10 ++++-----
 include/asm-generic/tlb.h |   51 +++++++++++++++++++++++++++++-----------------
 include/linux/mm.h        |    2 -
 mm/memory.c               |   27 +++++-------------------
 mm/mmap.c                 |   16 +++++++-------
 5 files changed, 53 insertions(+), 53 deletions(-)

Index: linux-2.6/fs/exec.c
===================================================================
--- linux-2.6.orig/fs/exec.c
+++ linux-2.6/fs/exec.c
@@ -503,7 +503,7 @@ static int shift_arg_pages(struct vm_are
 	unsigned long length = old_end - old_start;
 	unsigned long new_start = old_start - shift;
 	unsigned long new_end = old_end - shift;
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 
 	BUG_ON(new_start > new_end);
 
@@ -529,12 +529,12 @@ static int shift_arg_pages(struct vm_are
 		return -ENOMEM;
 
 	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
+	tlb_gather_mmu(&tlb, mm, 0);
 	if (new_end > old_start) {
 		/*
 		 * when the old and new regions overlap clear from new_end.
 		 */
-		free_pgd_range(tlb, new_end, old_end, new_end,
+		free_pgd_range(&tlb, new_end, old_end, new_end,
 			vma->vm_next ? vma->vm_next->vm_start : 0);
 	} else {
 		/*
@@ -543,10 +543,10 @@ static int shift_arg_pages(struct vm_are
 		 * have constraints on va-space that make this illegal (IA64) -
 		 * for the others its just a little faster.
 		 */
-		free_pgd_range(tlb, old_start, old_end, new_end,
+		free_pgd_range(&tlb, old_start, old_end, new_end,
 			vma->vm_next ? vma->vm_next->vm_start : 0);
 	}
-	tlb_finish_mmu(tlb, new_end, old_end);
+	tlb_finish_mmu(&tlb, new_end, old_end);
 
 	/*
 	 * Shrink the vma to just the new range.  Always succeeds.
Index: linux-2.6/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.orig/include/asm-generic/tlb.h
+++ linux-2.6/include/asm-generic/tlb.h
@@ -22,14 +22,8 @@
  * and page free order so much..
  */
 #ifdef CONFIG_SMP
-  #ifdef ARCH_FREE_PTR_NR
-    #define FREE_PTR_NR   ARCH_FREE_PTR_NR
-  #else
-    #define FREE_PTE_NR	506
-  #endif
   #define tlb_fast_mode(tlb) ((tlb)->nr == ~0U)
 #else
-  #define FREE_PTE_NR	1
   #define tlb_fast_mode(tlb) 1
 #endif
 
@@ -39,30 +33,48 @@
 struct mmu_gather {
 	struct mm_struct	*mm;
 	unsigned int		nr;	/* set to ~0U means fast mode */
+	unsigned int		max;	/* nr < max */
 	unsigned int		need_flush;/* Really unmapped some ptes? */
 	unsigned int		fullmm; /* non-zero means full mm flush */
-	struct page *		pages[FREE_PTE_NR];
+#ifdef HAVE_ARCH_MMU_GATHER
+	struct arch_mmu_gather	arch;
+#endif
+	struct page		**pages;
+	struct page		*local[8];
 };
 
-/* Users of the generic TLB shootdown code must declare this storage space. */
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
+static inline void __tlb_alloc_pages(struct mmu_gather *tlb)
+{
+	unsigned long addr = __get_free_pages(GFP_ATOMIC, 0);
+
+	if (addr) {
+		tlb->pages = (void *)addr;
+		tlb->max = PAGE_SIZE / sizeof(struct page *);
+	}
+}
 
 /* tlb_gather_mmu
  *	Return a pointer to an initialized struct mmu_gather.
  */
-static inline struct mmu_gather *
-tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
+static inline void
+tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
 {
-	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
-
 	tlb->mm = mm;
 
-	/* Use fast mode if only one CPU is online */
-	tlb->nr = num_online_cpus() > 1 ? 0U : ~0U;
+	tlb->max = ARRAY_SIZE(tlb->local);
+	tlb->pages = tlb->local;
+
+	if (num_online_cpus() > 1) {
+		tlb->nr = 0;
+		__tlb_alloc_pages(tlb);
+	} else /* Use fast mode if only one CPU is online */
+		tlb->nr = ~0U;
 
 	tlb->fullmm = full_mm_flush;
 
-	return tlb;
+#ifdef HAVE_ARCH_MMU_GATHER
+	tlb->arch = ARCH_MMU_GATHER_INIT;
+#endif
 }
 
 static inline void
@@ -75,6 +87,8 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
 	if (!tlb_fast_mode(tlb)) {
 		free_pages_and_swap_cache(tlb->pages, tlb->nr);
 		tlb->nr = 0;
+		if (tlb->pages == tlb->local)
+			__tlb_alloc_pages(tlb);
 	}
 }
 
@@ -90,7 +104,8 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
 
-	put_cpu_var(mmu_gathers);
+	if (tlb->pages != tlb->local)
+		free_pages((unsigned long)tlb->pages, 0);
 }
 
 /* tlb_remove_page
@@ -106,7 +121,7 @@ static inline void tlb_remove_page(struc
 		return;
 	}
 	tlb->pages[tlb->nr++] = page;
-	if (tlb->nr >= FREE_PTE_NR)
+	if (tlb->nr >= tlb->max)
 		tlb_flush_mmu(tlb, 0, 0);
 }
 
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -760,7 +760,7 @@ int zap_vma_ptes(struct vm_area_struct *
 		unsigned long size);
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size, struct zap_details *);
-unsigned long unmap_vmas(struct mmu_gather **tlb,
+unsigned long unmap_vmas(struct mmu_gather *tlb,
 		struct vm_area_struct *start_vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1095,17 +1095,14 @@ static unsigned long unmap_page_range(st
  * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
  * drops the lock and schedules.
  */
-unsigned long unmap_vmas(struct mmu_gather **tlbp,
+unsigned long unmap_vmas(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *details)
 {
 	long zap_work = ZAP_BLOCK_SIZE;
-	unsigned long tlb_start = 0;	/* For tlb_finish_mmu */
-	int tlb_start_valid = 0;
 	unsigned long start = start_addr;
 	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
-	int fullmm = (*tlbp)->fullmm;
 	struct mm_struct *mm = vma->vm_mm;
 
 	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
@@ -1126,11 +1123,6 @@ unsigned long unmap_vmas(struct mmu_gath
 			untrack_pfn_vma(vma, 0, 0);
 
 		while (start != end) {
-			if (!tlb_start_valid) {
-				tlb_start = start;
-				tlb_start_valid = 1;
-			}
-
 			if (unlikely(is_vm_hugetlb_page(vma))) {
 				/*
 				 * It is undesirable to test vma->vm_file as it
@@ -1151,7 +1143,7 @@ unsigned long unmap_vmas(struct mmu_gath
 
 				start = end;
 			} else
-				start = unmap_page_range(*tlbp, vma,
+				start = unmap_page_range(tlb, vma,
 						start, end, &zap_work, details);
 
 			if (zap_work > 0) {
@@ -1159,19 +1151,13 @@ unsigned long unmap_vmas(struct mmu_gath
 				break;
 			}
 
-			tlb_finish_mmu(*tlbp, tlb_start, start);
-
 			if (need_resched() ||
 				(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
-				if (i_mmap_lock) {
-					*tlbp = NULL;
+				if (i_mmap_lock)
 					goto out;
-				}
 				cond_resched();
 			}
 
-			*tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
-			tlb_start_valid = 0;
 			zap_work = ZAP_BLOCK_SIZE;
 		}
 	}
@@ -1191,16 +1177,15 @@ unsigned long zap_page_range(struct vm_a
 		unsigned long size, struct zap_details *details)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 	unsigned long end = address + size;
 	unsigned long nr_accounted = 0;
 
 	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
+	tlb_gather_mmu(&tlb, mm, 0);
 	update_hiwater_rss(mm);
 	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
-	if (tlb)
-		tlb_finish_mmu(tlb, address, end);
+	tlb_finish_mmu(&tlb, address, end);
 	return end;
 }
 
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -1844,17 +1844,17 @@ static void unmap_region(struct mm_struc
 		unsigned long start, unsigned long end)
 {
 	struct vm_area_struct *next = prev? prev->vm_next: mm->mmap;
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 	unsigned long nr_accounted = 0;
 
 	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
+	tlb_gather_mmu(&tlb, mm, 0);
 	update_hiwater_rss(mm);
 	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
-	free_pgtables(tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
+	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 				 next? next->vm_start: 0);
-	tlb_finish_mmu(tlb, start, end);
+	tlb_finish_mmu(&tlb, start, end);
 }
 
 /*
@@ -2190,7 +2190,7 @@ EXPORT_SYMBOL(do_brk);
 /* Release all mmaps. */
 void exit_mmap(struct mm_struct *mm)
 {
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
 	unsigned long end;
@@ -2215,14 +2215,14 @@ void exit_mmap(struct mm_struct *mm)
 
 	lru_add_drain();
 	flush_cache_mm(mm);
-	tlb = tlb_gather_mmu(mm, 1);
+	tlb_gather_mmu(&tlb, mm, 1);
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
 
-	free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
-	tlb_finish_mmu(tlb, 0, end);
+	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
+	tlb_finish_mmu(&tlb, 0, end);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,

^ permalink raw reply	[flat|nested] 113+ messages in thread
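The caller-side pattern the hunks above convert to can be modelled in isolation as
follows; this is a stand-alone sketch under simplifying assumptions (zap_range()
and the printf() in tlb_finish_mmu() are made-up stand-ins, not the kernel
implementation):

#include <stdio.h>

struct mmu_gather {
	void *mm;
	int fullmm;
	int nr;			/* pages queued for freeing */
};

static void tlb_gather_mmu(struct mmu_gather *tlb, void *mm, int fullmm)
{
	tlb->mm = mm;
	tlb->fullmm = fullmm;
	tlb->nr = 0;
}

static void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start,
			   unsigned long end)
{
	printf("flush [%#lx, %#lx), %d pending pages\n", start, end, tlb->nr);
}

static void zap_range(void *mm, unsigned long start, unsigned long end)
{
	struct mmu_gather tlb;		/* was: struct mmu_gather *tlb; */

	tlb_gather_mmu(&tlb, mm, 0);	/* was: tlb = tlb_gather_mmu(mm, 0); */
	/* ... unmap_vmas(&tlb, ...) would fill the gather here ... */
	tlb_finish_mmu(&tlb, start, end);
}

int main(void)
{
	int mm;

	zap_range(&mm, 0x1000, 0x5000);
	return 0;
}

The only point of the sketch is that the gather state now lives on the caller's
stack rather than in a per-CPU variable, so nothing pins the task to a CPU
between tlb_gather_mmu() and tlb_finish_mmu().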

* [PATCH 07/13] powerpc: Preemptible mmu_gather
  2010-04-08 19:17 ` Peter Zijlstra
@ 2010-04-08 19:17   ` Peter Zijlstra
  -1 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-08 19:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Peter Zijlstra

[-- Attachment #1: mm-preempt-tlb-gather-power.patch --]
[-- Type: text/plain, Size: 10787 bytes --]

Fix up powerpc for the new mmu_gather stuff.

PPC has an extra batching queue to RCU-free the actual pagetable
allocations; use the ARCH extensions for that for now.

For the ppc64_tlb_batch, which tracks the vaddrs to unhash from the
hardware hash-table, keep using per-cpu arrays but flush on context
switch and use a TIF bit to track the lazy_mmu state.
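
A rough stand-alone model of that last part (illustrative C only, not the
powerpc code: cpu_batch, struct task and its lazy_mmu field merely stand in
for the per-cpu ppc64_tlb_batch and the TIF_LAZY_MMU bit):

#include <stdio.h>

#define BATCH_NR 4

struct tlb_batch {
	int active;
	int index;
	unsigned long vaddrs[BATCH_NR];
};

struct task {
	int lazy_mmu;			/* stands in for TIF_LAZY_MMU */
};

static struct tlb_batch cpu_batch;	/* stands in for the per-cpu batch */

static void flush_pending(struct tlb_batch *b)
{
	if (b->index)
		printf("flushing %d pending vaddrs\n", b->index);
	b->index = 0;
}

static void batch_add(unsigned long vaddr)
{
	if (!cpu_batch.active) {	/* not batching: flush immediately */
		printf("immediate flush of %#lx\n", vaddr);
		return;
	}
	cpu_batch.vaddrs[cpu_batch.index++] = vaddr;
	if (cpu_batch.index >= BATCH_NR)
		flush_pending(&cpu_batch);
}

/* What __switch_to() does in the patch: drain and remember, then re-arm. */
static void context_switch(struct task *prev, struct task *next)
{
	if (cpu_batch.active) {
		prev->lazy_mmu = 1;
		flush_pending(&cpu_batch);
		cpu_batch.active = 0;
	}
	if (next->lazy_mmu) {
		next->lazy_mmu = 0;
		cpu_batch.active = 1;
	}
}

int main(void)
{
	struct task a = { 0 }, b = { 0 };

	cpu_batch.active = 1;		/* task A entered lazy MMU mode */
	batch_add(0x1000);
	batch_add(0x2000);
	context_switch(&a, &b);		/* A preempted: drain, set TIF bit */
	batch_add(0x3000);		/* B is not batching */
	context_switch(&b, &a);		/* A resumes: batching re-armed */
	batch_add(0x4000);
	flush_pending(&cpu_batch);
	return 0;
}

The pending entries have to be drained at the switch because the next task may
queue vaddrs for a different mm on the same CPU; the flag only records that
batching should be re-enabled when the preempted task runs again.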

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 arch/powerpc/include/asm/pgalloc.h     |    4 +--
 arch/powerpc/include/asm/thread_info.h |    2 +
 arch/powerpc/include/asm/tlb.h         |   10 +++++++++
 arch/powerpc/include/asm/tlbflush.h    |   16 ++++++++++-----
 arch/powerpc/kernel/process.c          |   18 +++++++++++++++++
 arch/powerpc/mm/pgtable.c              |   34 +++++++++++++++++++++++----------
 arch/powerpc/mm/tlb_hash32.c           |    2 -
 arch/powerpc/mm/tlb_hash64.c           |   12 ++++++-----
 arch/powerpc/mm/tlb_nohash.c           |    2 -
 9 files changed, 76 insertions(+), 24 deletions(-)

Index: linux-2.6/arch/powerpc/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/tlb.h
+++ linux-2.6/arch/powerpc/include/asm/tlb.h
@@ -28,6 +28,16 @@
 #define tlb_start_vma(tlb, vma)	do { } while (0)
 #define tlb_end_vma(tlb, vma)	do { } while (0)
 
+#define HAVE_ARCH_MMU_GATHER 1
+
+struct pte_freelist_batch;
+
+struct arch_mmu_gather {
+	struct pte_freelist_batch *batch;
+};
+
+#define ARCH_MMU_GATHER_INIT (struct arch_mmu_gather){ .batch = NULL, }
+
 extern void tlb_flush(struct mmu_gather *tlb);
 
 /* Get the generic bits... */
Index: linux-2.6/arch/powerpc/include/asm/tlbflush.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/tlbflush.h
+++ linux-2.6/arch/powerpc/include/asm/tlbflush.h
@@ -108,18 +108,24 @@ extern void hpte_need_flush(struct mm_st
 
 static inline void arch_enter_lazy_mmu_mode(void)
 {
-	struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);
+	struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);
 
 	batch->active = 1;
+
+	put_cpu_var(ppc64_tlb_batch);
 }
 
 static inline void arch_leave_lazy_mmu_mode(void)
 {
-	struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);
+	struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);
+
+	if (batch->active) {
+		if (batch->index)
+			__flush_tlb_pending(batch);
+		batch->active = 0;
+	}
 
-	if (batch->index)
-		__flush_tlb_pending(batch);
-	batch->active = 0;
+	put_cpu_var(ppc64_tlb_batch);
 }
 
 #define arch_flush_lazy_mmu_mode()      do {} while (0)
Index: linux-2.6/arch/powerpc/kernel/process.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/process.c
+++ linux-2.6/arch/powerpc/kernel/process.c
@@ -389,6 +389,9 @@ struct task_struct *__switch_to(struct t
 	struct thread_struct *new_thread, *old_thread;
 	unsigned long flags;
 	struct task_struct *last;
+#ifdef CONFIG_PPC64
+	struct ppc64_tlb_batch *batch;
+#endif
 
 #ifdef CONFIG_SMP
 	/* avoid complexity of lazy save/restore of fpu
@@ -479,6 +482,14 @@ struct task_struct *__switch_to(struct t
 		old_thread->accum_tb += (current_tb - start_tb);
 		new_thread->start_tb = current_tb;
 	}
+
+	batch = &__get_cpu_var(ppc64_tlb_batch);
+	if (batch->active) {
+		set_ti_thread_flag(task_thread_info(prev), TIF_LAZY_MMU);
+		if (batch->index)
+			__flush_tlb_pending(batch);
+		batch->active = 0;
+	}
 #endif
 
 	local_irq_save(flags);
@@ -495,6 +506,13 @@ struct task_struct *__switch_to(struct t
 	hard_irq_disable();
 	last = _switch(old_thread, new_thread);
 
+#ifdef CONFIG_PPC64
+	if (test_and_clear_ti_thread_flag(task_thread_info(new), TIF_LAZY_MMU)) {
+		batch = &__get_cpu_var(ppc64_tlb_batch);
+		batch->active = 1;
+	}
+#endif
+
 	local_irq_restore(flags);
 
 	return last;
Index: linux-2.6/arch/powerpc/mm/pgtable.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/pgtable.c
+++ linux-2.6/arch/powerpc/mm/pgtable.c
@@ -33,8 +33,6 @@
 
 #include "mmu_decl.h"
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 #ifdef CONFIG_SMP
 
 /*
@@ -43,7 +41,6 @@ DEFINE_PER_CPU(struct mmu_gather, mmu_ga
  * freeing a page table page that is being walked without locks
  */
 
-static DEFINE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur);
 static unsigned long pte_freelist_forced_free;
 
 struct pte_freelist_batch
@@ -98,12 +95,30 @@ static void pte_free_submit(struct pte_f
 
 void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
-	/* This is safe since tlb_gather_mmu has disabled preemption */
-	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
+	struct pte_freelist_batch **batchp = &tlb->arch.batch;
 	unsigned long pgf;
 
-	if (atomic_read(&tlb->mm->mm_users) < 2 ||
-	    cpumask_equal(mm_cpumask(tlb->mm), cpumask_of(smp_processor_id()))){
+	/*
+	 * A comment here about on why we have RCU freed page tables might be
+	 * interesting, also explaining why we don't need any sort of grace
+	 * period for mm_users == 1, and have some home brewn smp_call_func()
+	 * for single frees.
+	 *
+	 * The only lockless page table walker I know of is gup_fast() which
+	 * relies on irq_disable(). So my guess is that mm_users == 1 means
+	 * that there cannot be another thread and so precludes gup_fast()
+	 * concurrency.
+	 *
+	 * If there are, but we fail to batch, we need to IPI (all?) CPUs so as
+	 * to serialize against the IRQ disable. In case we do batch, the RCU
+	 * grace period is at least long enough to cover IRQ disabled sections
+	 * (XXX assumption, not strictly true).
+	 *
+	 * All this results in us doing our own free batching and not using
+	 * the generic mmu_gather batches (XXX fix that somehow?).
+	 */
+
+	if (atomic_read(&tlb->mm->mm_users) < 2) {
 		pgtable_free(table, shift);
 		return;
 	}
@@ -125,10 +140,9 @@ void pgtable_free_tlb(struct mmu_gather 
 	}
 }
 
-void pte_free_finish(void)
+void pte_free_finish(struct mmu_gather *tlb)
 {
-	/* This is safe since tlb_gather_mmu has disabled preemption */
-	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
+	struct pte_freelist_batch **batchp = &tlb->arch.batch;
 
 	if (*batchp == NULL)
 		return;
Index: linux-2.6/arch/powerpc/mm/tlb_hash64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_hash64.c
+++ linux-2.6/arch/powerpc/mm/tlb_hash64.c
@@ -38,13 +38,11 @@ DEFINE_PER_CPU(struct ppc64_tlb_batch, p
  * neesd to be flushed. This function will either perform the flush
  * immediately or will batch it up if the current CPU has an active
  * batch on it.
- *
- * Must be called from within some kind of spinlock/non-preempt region...
  */
 void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
 		     pte_t *ptep, unsigned long pte, int huge)
 {
-	struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);
+	struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);
 	unsigned long vsid, vaddr;
 	unsigned int psize;
 	int ssize;
@@ -99,6 +97,7 @@ void hpte_need_flush(struct mm_struct *m
 	 */
 	if (!batch->active) {
 		flush_hash_page(vaddr, rpte, psize, ssize, 0);
+		put_cpu_var(ppc64_tlb_batch);
 		return;
 	}
 
@@ -127,6 +126,7 @@ void hpte_need_flush(struct mm_struct *m
 	batch->index = ++i;
 	if (i >= PPC64_TLB_BATCH_NR)
 		__flush_tlb_pending(batch);
+	put_cpu_var(ppc64_tlb_batch);
 }
 
 /*
@@ -155,7 +155,7 @@ void __flush_tlb_pending(struct ppc64_tl
 
 void tlb_flush(struct mmu_gather *tlb)
 {
-	struct ppc64_tlb_batch *tlbbatch = &__get_cpu_var(ppc64_tlb_batch);
+	struct ppc64_tlb_batch *tlbbatch = &get_cpu_var(ppc64_tlb_batch);
 
 	/* If there's a TLB batch pending, then we must flush it because the
 	 * pages are going to be freed and we really don't want to have a CPU
@@ -164,8 +164,10 @@ void tlb_flush(struct mmu_gather *tlb)
 	if (tlbbatch->index)
 		__flush_tlb_pending(tlbbatch);
 
+	put_cpu_var(ppc64_tlb_batch);
+
 	/* Push out batch of freed page tables */
-	pte_free_finish();
+	pte_free_finish(tlb);
 }
 
 /**
Index: linux-2.6/arch/powerpc/include/asm/thread_info.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/thread_info.h
+++ linux-2.6/arch/powerpc/include/asm/thread_info.h
@@ -111,6 +111,7 @@ static inline struct thread_info *curren
 #define TIF_NOTIFY_RESUME	13	/* callback before returning to user */
 #define TIF_FREEZE		14	/* Freezing for suspend */
 #define TIF_RUNLATCH		15	/* Is the runlatch enabled? */
+#define TIF_LAZY_MMU		16	/* tlb_batch is active */
 
 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
@@ -128,6 +129,7 @@ static inline struct thread_info *curren
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
 #define _TIF_FREEZE		(1<<TIF_FREEZE)
 #define _TIF_RUNLATCH		(1<<TIF_RUNLATCH)
+#define _TIF_LAZY_MMU		(1<<TIF_LAZY_MMU)
 #define _TIF_SYSCALL_T_OR_A	(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP)
 
 #define _TIF_USER_WORK_MASK	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
Index: linux-2.6/arch/powerpc/include/asm/pgalloc.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/pgalloc.h
+++ linux-2.6/arch/powerpc/include/asm/pgalloc.h
@@ -32,13 +32,13 @@ static inline void pte_free(struct mm_st
 
 #ifdef CONFIG_SMP
 extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift);
-extern void pte_free_finish(void);
+extern void pte_free_finish(struct mmu_gather *tlb);
 #else /* CONFIG_SMP */
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
 	pgtable_free(table, shift);
 }
-static inline void pte_free_finish(void) { }
+static inline void pte_free_finish(struct mmu_gather *tlb) { }
 #endif /* !CONFIG_SMP */
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
Index: linux-2.6/arch/powerpc/mm/tlb_hash32.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_hash32.c
+++ linux-2.6/arch/powerpc/mm/tlb_hash32.c
@@ -73,7 +73,7 @@ void tlb_flush(struct mmu_gather *tlb)
 	}
 
 	/* Push out batch of freed page tables */
-	pte_free_finish();
+	pte_free_finish(tlb);
 }
 
 /*
Index: linux-2.6/arch/powerpc/mm/tlb_nohash.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_nohash.c
+++ linux-2.6/arch/powerpc/mm/tlb_nohash.c
@@ -298,7 +298,7 @@ void tlb_flush(struct mmu_gather *tlb)
 	flush_tlb_mm(tlb->mm);
 
 	/* Push out batch of freed page tables */
-	pte_free_finish();
+	pte_free_finish(tlb);
 }
 
 /*



^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 07/13] powerpc: Preemptible mmu_gather
@ 2010-04-08 19:17   ` Peter Zijlstra
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-08 19:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Peter Zijlstra

[-- Attachment #1: mm-preempt-tlb-gather-power.patch --]
[-- Type: text/plain, Size: 10785 bytes --]

Fix up powerpc for the new mmu_gather stuff.

PPC has an extra batching queue to RCU-free the actual pagetable
allocations; use the ARCH extensions for that for now.

For the ppc64_tlb_batch, which tracks the vaddrs to unhash from the
hardware hash-table, keep using per-cpu arrays but flush on context
switch and use a TIF bit to track the lazy_mmu state.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 arch/powerpc/include/asm/pgalloc.h     |    4 +--
 arch/powerpc/include/asm/thread_info.h |    2 +
 arch/powerpc/include/asm/tlb.h         |   10 +++++++++
 arch/powerpc/include/asm/tlbflush.h    |   16 ++++++++++-----
 arch/powerpc/kernel/process.c          |   18 +++++++++++++++++
 arch/powerpc/mm/pgtable.c              |   34 +++++++++++++++++++++++----------
 arch/powerpc/mm/tlb_hash32.c           |    2 -
 arch/powerpc/mm/tlb_hash64.c           |   12 ++++++-----
 arch/powerpc/mm/tlb_nohash.c           |    2 -
 9 files changed, 76 insertions(+), 24 deletions(-)

Index: linux-2.6/arch/powerpc/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/tlb.h
+++ linux-2.6/arch/powerpc/include/asm/tlb.h
@@ -28,6 +28,16 @@
 #define tlb_start_vma(tlb, vma)	do { } while (0)
 #define tlb_end_vma(tlb, vma)	do { } while (0)
 
+#define HAVE_ARCH_MMU_GATHER 1
+
+struct pte_freelist_batch;
+
+struct arch_mmu_gather {
+	struct pte_freelist_batch *batch;
+};
+
+#define ARCH_MMU_GATHER_INIT (struct arch_mmu_gather){ .batch = NULL, }
+
 extern void tlb_flush(struct mmu_gather *tlb);
 
 /* Get the generic bits... */
Index: linux-2.6/arch/powerpc/include/asm/tlbflush.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/tlbflush.h
+++ linux-2.6/arch/powerpc/include/asm/tlbflush.h
@@ -108,18 +108,24 @@ extern void hpte_need_flush(struct mm_st
 
 static inline void arch_enter_lazy_mmu_mode(void)
 {
-	struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);
+	struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);
 
 	batch->active = 1;
+
+	put_cpu_var(ppc64_tlb_batch);
 }
 
 static inline void arch_leave_lazy_mmu_mode(void)
 {
-	struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);
+	struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);
+
+	if (batch->active) {
+		if (batch->index)
+			__flush_tlb_pending(batch);
+		batch->active = 0;
+	}
 
-	if (batch->index)
-		__flush_tlb_pending(batch);
-	batch->active = 0;
+	put_cpu_var(ppc64_tlb_batch);
 }
 
 #define arch_flush_lazy_mmu_mode()      do {} while (0)
Index: linux-2.6/arch/powerpc/kernel/process.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/process.c
+++ linux-2.6/arch/powerpc/kernel/process.c
@@ -389,6 +389,9 @@ struct task_struct *__switch_to(struct t
 	struct thread_struct *new_thread, *old_thread;
 	unsigned long flags;
 	struct task_struct *last;
+#ifdef CONFIG_PPC64
+	struct ppc64_tlb_batch *batch;
+#endif
 
 #ifdef CONFIG_SMP
 	/* avoid complexity of lazy save/restore of fpu
@@ -479,6 +482,14 @@ struct task_struct *__switch_to(struct t
 		old_thread->accum_tb += (current_tb - start_tb);
 		new_thread->start_tb = current_tb;
 	}
+
+	batch = &__get_cpu_var(ppc64_tlb_batch);
+	if (batch->active) {
+		set_ti_thread_flag(task_thread_info(prev), TIF_LAZY_MMU);
+		if (batch->index)
+			__flush_tlb_pending(batch);
+		batch->active = 0;
+	}
 #endif
 
 	local_irq_save(flags);
@@ -495,6 +506,13 @@ struct task_struct *__switch_to(struct t
 	hard_irq_disable();
 	last = _switch(old_thread, new_thread);
 
+#ifdef CONFIG_PPC64
+	if (test_and_clear_ti_thread_flag(task_thread_info(new), TIF_LAZY_MMU)) {
+		batch = &__get_cpu_var(ppc64_tlb_batch);
+		batch->active = 1;
+	}
+#endif
+
 	local_irq_restore(flags);
 
 	return last;
Index: linux-2.6/arch/powerpc/mm/pgtable.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/pgtable.c
+++ linux-2.6/arch/powerpc/mm/pgtable.c
@@ -33,8 +33,6 @@
 
 #include "mmu_decl.h"
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 #ifdef CONFIG_SMP
 
 /*
@@ -43,7 +41,6 @@ DEFINE_PER_CPU(struct mmu_gather, mmu_ga
  * freeing a page table page that is being walked without locks
  */
 
-static DEFINE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur);
 static unsigned long pte_freelist_forced_free;
 
 struct pte_freelist_batch
@@ -98,12 +95,30 @@ static void pte_free_submit(struct pte_f
 
 void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
-	/* This is safe since tlb_gather_mmu has disabled preemption */
-	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
+	struct pte_freelist_batch **batchp = &tlb->arch.batch;
 	unsigned long pgf;
 
-	if (atomic_read(&tlb->mm->mm_users) < 2 ||
-	    cpumask_equal(mm_cpumask(tlb->mm), cpumask_of(smp_processor_id()))){
+	/*
+	 * A comment here about on why we have RCU freed page tables might be
+	 * interesting, also explaining why we don't need any sort of grace
+	 * period for mm_users == 1, and have some home brewn smp_call_func()
+	 * for single frees.
+	 *
+	 * The only lockless page table walker I know of is gup_fast() which
+	 * relies on irq_disable(). So my guess is that mm_users == 1 means
+	 * that there cannot be another thread and so precludes gup_fast()
+	 * concurrency.
+	 *
+	 * If there are, but we fail to batch, we need to IPI (all?) CPUs so as
+	 * to serialize against the IRQ disable. In case we do batch, the RCU
+	 * grace period is at least long enough to cover IRQ disabled sections
+	 * (XXX assumption, not strictly true).
+	 *
+	 * All this results in us doing our own free batching and not using
+	 * the generic mmu_gather batches (XXX fix that somehow?).
+	 */
+
+	if (atomic_read(&tlb->mm->mm_users) < 2) {
 		pgtable_free(table, shift);
 		return;
 	}
@@ -125,10 +140,9 @@ void pgtable_free_tlb(struct mmu_gather 
 	}
 }
 
-void pte_free_finish(void)
+void pte_free_finish(struct mmu_gather *tlb)
 {
-	/* This is safe since tlb_gather_mmu has disabled preemption */
-	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
+	struct pte_freelist_batch **batchp = &tlb->arch.batch;
 
 	if (*batchp == NULL)
 		return;
Index: linux-2.6/arch/powerpc/mm/tlb_hash64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_hash64.c
+++ linux-2.6/arch/powerpc/mm/tlb_hash64.c
@@ -38,13 +38,11 @@ DEFINE_PER_CPU(struct ppc64_tlb_batch, p
  * neesd to be flushed. This function will either perform the flush
  * immediately or will batch it up if the current CPU has an active
  * batch on it.
- *
- * Must be called from within some kind of spinlock/non-preempt region...
  */
 void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
 		     pte_t *ptep, unsigned long pte, int huge)
 {
-	struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);
+	struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);
 	unsigned long vsid, vaddr;
 	unsigned int psize;
 	int ssize;
@@ -99,6 +97,7 @@ void hpte_need_flush(struct mm_struct *m
 	 */
 	if (!batch->active) {
 		flush_hash_page(vaddr, rpte, psize, ssize, 0);
+		put_cpu_var(ppc64_tlb_batch);
 		return;
 	}
 
@@ -127,6 +126,7 @@ void hpte_need_flush(struct mm_struct *m
 	batch->index = ++i;
 	if (i >= PPC64_TLB_BATCH_NR)
 		__flush_tlb_pending(batch);
+	put_cpu_var(ppc64_tlb_batch);
 }
 
 /*
@@ -155,7 +155,7 @@ void __flush_tlb_pending(struct ppc64_tl
 
 void tlb_flush(struct mmu_gather *tlb)
 {
-	struct ppc64_tlb_batch *tlbbatch = &__get_cpu_var(ppc64_tlb_batch);
+	struct ppc64_tlb_batch *tlbbatch = &get_cpu_var(ppc64_tlb_batch);
 
 	/* If there's a TLB batch pending, then we must flush it because the
 	 * pages are going to be freed and we really don't want to have a CPU
@@ -164,8 +164,10 @@ void tlb_flush(struct mmu_gather *tlb)
 	if (tlbbatch->index)
 		__flush_tlb_pending(tlbbatch);
 
+	put_cpu_var(ppc64_tlb_batch);
+
 	/* Push out batch of freed page tables */
-	pte_free_finish();
+	pte_free_finish(tlb);
 }
 
 /**
Index: linux-2.6/arch/powerpc/include/asm/thread_info.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/thread_info.h
+++ linux-2.6/arch/powerpc/include/asm/thread_info.h
@@ -111,6 +111,7 @@ static inline struct thread_info *curren
 #define TIF_NOTIFY_RESUME	13	/* callback before returning to user */
 #define TIF_FREEZE		14	/* Freezing for suspend */
 #define TIF_RUNLATCH		15	/* Is the runlatch enabled? */
+#define TIF_LAZY_MMU		16	/* tlb_batch is active */
 
 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
@@ -128,6 +129,7 @@ static inline struct thread_info *curren
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
 #define _TIF_FREEZE		(1<<TIF_FREEZE)
 #define _TIF_RUNLATCH		(1<<TIF_RUNLATCH)
+#define _TIF_LAZY_MMU		(1<<TIF_LAZY_MMU)
 #define _TIF_SYSCALL_T_OR_A	(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP)
 
 #define _TIF_USER_WORK_MASK	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
Index: linux-2.6/arch/powerpc/include/asm/pgalloc.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/pgalloc.h
+++ linux-2.6/arch/powerpc/include/asm/pgalloc.h
@@ -32,13 +32,13 @@ static inline void pte_free(struct mm_st
 
 #ifdef CONFIG_SMP
 extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift);
-extern void pte_free_finish(void);
+extern void pte_free_finish(struct mmu_gather *tlb);
 #else /* CONFIG_SMP */
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
 	pgtable_free(table, shift);
 }
-static inline void pte_free_finish(void) { }
+static inline void pte_free_finish(struct mmu_gather *tlb) { }
 #endif /* !CONFIG_SMP */
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
Index: linux-2.6/arch/powerpc/mm/tlb_hash32.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_hash32.c
+++ linux-2.6/arch/powerpc/mm/tlb_hash32.c
@@ -73,7 +73,7 @@ void tlb_flush(struct mmu_gather *tlb)
 	}
 
 	/* Push out batch of freed page tables */
-	pte_free_finish();
+	pte_free_finish(tlb);
 }
 
 /*
Index: linux-2.6/arch/powerpc/mm/tlb_nohash.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_nohash.c
+++ linux-2.6/arch/powerpc/mm/tlb_nohash.c
@@ -298,7 +298,7 @@ void tlb_flush(struct mmu_gather *tlb)
 	flush_tlb_mm(tlb->mm);
 
 	/* Push out batch of freed page tables */
-	pte_free_finish();
+	pte_free_finish(tlb);
 }
 
 /*

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 08/13] sparc: Preemptible mmu_gather
  2010-04-08 19:17 ` Peter Zijlstra
@ 2010-04-08 19:17   ` Peter Zijlstra
  -1 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-08 19:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Peter Zijlstra

[-- Attachment #1: mm-preempt-tlb-gather-sparc.patch --]
[-- Type: text/plain, Size: 8846 bytes --]

Rework the sparc mmu_gather usage to conform to the new world order :-)

Sparc mmu_gather does two things:
 - tracks vaddrs to unhash
 - tracks pages to free

Split these two things, as powerpc has done: keep the vaddrs in
per-cpu data structures and flush them on context switch.

The remaining bits can then use the generic mmu_gather.
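
The split can be modelled stand-alone roughly as below (single-CPU toy code,
not the sparc implementation; tlb_batch_add(), flush_tlb_pending() and the
page-freeing printf() are simplified stand-ins):

#include <stdio.h>

#define TLB_BATCH_NR	4
#define PAGE_BATCH_NR	4

struct tlb_batch {			/* per-cpu: vaddrs to unhash */
	int nr;
	unsigned long vaddrs[TLB_BATCH_NR];
};

struct mmu_gather {			/* on-stack: pages to free */
	int nr;
	unsigned long pages[PAGE_BATCH_NR];
};

static struct tlb_batch cpu_tlb_batch;	/* one instance per CPU in reality */

static void flush_tlb_pending(void)
{
	struct tlb_batch *tb = &cpu_tlb_batch;

	if (tb->nr)
		printf("unhash %d vaddrs\n", tb->nr);
	tb->nr = 0;
}

static void tlb_batch_add(unsigned long vaddr)
{
	struct tlb_batch *tb = &cpu_tlb_batch;

	tb->vaddrs[tb->nr++] = vaddr;
	if (tb->nr >= TLB_BATCH_NR)
		flush_tlb_pending();
}

static void tlb_remove_page(struct mmu_gather *tlb, unsigned long page)
{
	tlb->pages[tlb->nr++] = page;
	if (tlb->nr >= PAGE_BATCH_NR) {
		flush_tlb_pending();			/* TLB/TSB first ... */
		printf("free %d pages\n", tlb->nr);	/* ... then the pages */
		tlb->nr = 0;
	}
}

int main(void)
{
	struct mmu_gather tlb = { 0 };
	unsigned long addr;

	for (addr = 0x1000; addr <= 0x5000; addr += 0x1000) {
		tlb_batch_add(addr);
		tlb_remove_page(&tlb, addr);
	}
	flush_tlb_pending();
	if (tlb.nr)
		printf("free %d pages\n", tlb.nr);
	return 0;
}

The vaddr batch is tied to the CPU and therefore has to be drained whenever the
task might migrate (context switch), while the page list lives in the gather
structure itself and can survive preemption.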

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: David Miller <davem@davemloft.net>
---
 arch/sparc/include/asm/pgalloc_64.h  |    3 +
 arch/sparc/include/asm/tlb_64.h      |   91 ++---------------------------------
 arch/sparc/include/asm/tlbflush_64.h |   12 +++-
 arch/sparc/mm/tlb.c                  |   42 +++++++++-------
 arch/sparc/mm/tsb.c                  |   15 +++--
 5 files changed, 51 insertions(+), 112 deletions(-)

Index: linux-2.6/arch/sparc/include/asm/pgalloc_64.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/pgalloc_64.h
+++ linux-2.6/arch/sparc/include/asm/pgalloc_64.h
@@ -78,4 +78,7 @@ static inline void check_pgt_cache(void)
 	quicklist_trim(0, NULL, 25, 16);
 }
 
+#define __pte_free_tlb(tlb, pte, addr)	pte_free((tlb)->mm, pte)
+#define __pmd_free_tlb(tlb, pmd, addr)	pmd_free((tlb)->mm, pmd)
+
 #endif /* _SPARC64_PGALLOC_H */
Index: linux-2.6/arch/sparc/include/asm/tlb_64.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/tlb_64.h
+++ linux-2.6/arch/sparc/include/asm/tlb_64.h
@@ -7,66 +7,11 @@
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
 
-#define TLB_BATCH_NR	192
-
-/*
- * For UP we don't need to worry about TLB flush
- * and page free order so much..
- */
-#ifdef CONFIG_SMP
-  #define FREE_PTE_NR	506
-  #define tlb_fast_mode(bp) ((bp)->pages_nr == ~0U)
-#else
-  #define FREE_PTE_NR	1
-  #define tlb_fast_mode(bp) 1
-#endif
-
-struct mmu_gather {
-	struct mm_struct *mm;
-	unsigned int pages_nr;
-	unsigned int need_flush;
-	unsigned int fullmm;
-	unsigned int tlb_nr;
-	unsigned long vaddrs[TLB_BATCH_NR];
-	struct page *pages[FREE_PTE_NR];
-};
-
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 #ifdef CONFIG_SMP
 extern void smp_flush_tlb_pending(struct mm_struct *,
 				  unsigned long, unsigned long *);
 #endif
 
-extern void __flush_tlb_pending(unsigned long, unsigned long, unsigned long *);
-extern void flush_tlb_pending(void);
-
-static inline struct mmu_gather *tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
-{
-	struct mmu_gather *mp = &get_cpu_var(mmu_gathers);
-
-	BUG_ON(mp->tlb_nr);
-
-	mp->mm = mm;
-	mp->pages_nr = num_online_cpus() > 1 ? 0U : ~0U;
-	mp->fullmm = full_mm_flush;
-
-	return mp;
-}
-
-
-static inline void tlb_flush_mmu(struct mmu_gather *mp)
-{
-	if (!mp->fullmm)
-		flush_tlb_pending();
-	if (mp->need_flush) {
-		free_pages_and_swap_cache(mp->pages, mp->pages_nr);
-		mp->pages_nr = 0;
-		mp->need_flush = 0;
-	}
-
-}
-
 #ifdef CONFIG_SMP
 extern void smp_flush_tlb_mm(struct mm_struct *mm);
 #define do_flush_tlb_mm(mm) smp_flush_tlb_mm(mm)
@@ -74,38 +19,14 @@ extern void smp_flush_tlb_mm(struct mm_s
 #define do_flush_tlb_mm(mm) __flush_tlb_mm(CTX_HWBITS(mm->context), SECONDARY_CONTEXT)
 #endif
 
-static inline void tlb_finish_mmu(struct mmu_gather *mp, unsigned long start, unsigned long end)
-{
-	tlb_flush_mmu(mp);
-
-	if (mp->fullmm)
-		mp->fullmm = 0;
-
-	/* keep the page table cache within bounds */
-	check_pgt_cache();
-
-	put_cpu_var(mmu_gathers);
-}
-
-static inline void tlb_remove_page(struct mmu_gather *mp, struct page *page)
-{
-	if (tlb_fast_mode(mp)) {
-		free_page_and_swap_cache(page);
-		return;
-	}
-	mp->need_flush = 1;
-	mp->pages[mp->pages_nr++] = page;
-	if (mp->pages_nr >= FREE_PTE_NR)
-		tlb_flush_mmu(mp);
-}
-
-#define tlb_remove_tlb_entry(mp,ptep,addr) do { } while (0)
-#define pte_free_tlb(mp, ptepage, addr) pte_free((mp)->mm, ptepage)
-#define pmd_free_tlb(mp, pmdp, addr) pmd_free((mp)->mm, pmdp)
-#define pud_free_tlb(tlb,pudp, addr) __pud_free_tlb(tlb,pudp,addr)
+extern void __flush_tlb_pending(unsigned long, unsigned long, unsigned long *);
+extern void flush_tlb_pending(void);
 
-#define tlb_migrate_finish(mm)	do { } while (0)
 #define tlb_start_vma(tlb, vma) do { } while (0)
 #define tlb_end_vma(tlb, vma)	do { } while (0)
+#define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
+#define tlb_flush(tlb)	flush_tlb_pending()
+
+#include <asm-generic/tlb.h>
 
 #endif /* _SPARC64_TLB_H */
Index: linux-2.6/arch/sparc/include/asm/tlbflush_64.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/tlbflush_64.h
+++ linux-2.6/arch/sparc/include/asm/tlbflush_64.h
@@ -5,9 +5,17 @@
 #include <asm/mmu_context.h>
 
 /* TSB flush operations. */
-struct mmu_gather;
+
+#define TLB_BATCH_NR	192
+
+struct tlb_batch {
+	struct mm_struct *mm;
+	unsigned long tlb_nr;
+	unsigned long vaddrs[TLB_BATCH_NR];
+};
+
 extern void flush_tsb_kernel_range(unsigned long start, unsigned long end);
-extern void flush_tsb_user(struct mmu_gather *mp);
+extern void flush_tsb_user(struct tlb_batch *tb);
 
 /* TLB flush operations. */
 
Index: linux-2.6/arch/sparc/mm/tlb.c
===================================================================
--- linux-2.6.orig/arch/sparc/mm/tlb.c
+++ linux-2.6/arch/sparc/mm/tlb.c
@@ -19,33 +19,33 @@
 
 /* Heavily inspired by the ppc64 code.  */
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
+static DEFINE_PER_CPU(struct tlb_batch, tlb_batch);
 
 void flush_tlb_pending(void)
 {
-	struct mmu_gather *mp = &get_cpu_var(mmu_gathers);
+	struct tlb_batch *tb = &get_cpu_var(tlb_batch);
 
-	if (mp->tlb_nr) {
-		flush_tsb_user(mp);
+	if (tb->tlb_nr) {
+		flush_tsb_user(tb);
 
-		if (CTX_VALID(mp->mm->context)) {
+		if (CTX_VALID(tb->mm->context)) {
 #ifdef CONFIG_SMP
-			smp_flush_tlb_pending(mp->mm, mp->tlb_nr,
-					      &mp->vaddrs[0]);
+			smp_flush_tlb_pending(tb->mm, tb->tlb_nr,
+					      &tb->vaddrs[0]);
 #else
-			__flush_tlb_pending(CTX_HWBITS(mp->mm->context),
-					    mp->tlb_nr, &mp->vaddrs[0]);
+			__flush_tlb_pending(CTX_HWBITS(tb->mm->context),
+					    tb->tlb_nr, &tb->vaddrs[0]);
 #endif
 		}
-		mp->tlb_nr = 0;
+		tb->tlb_nr = 0;
 	}
 
-	put_cpu_var(mmu_gathers);
+	put_cpu_var(tlb_batch);
 }
 
 void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr, pte_t *ptep, pte_t orig)
 {
-	struct mmu_gather *mp = &__get_cpu_var(mmu_gathers);
+	struct tlb_batch *tb = &get_cpu_var(tlb_batch);
 	unsigned long nr;
 
 	vaddr &= PAGE_MASK;
@@ -77,21 +77,27 @@ void tlb_batch_add(struct mm_struct *mm,
 
 no_cache_flush:
 
-	if (mp->fullmm)
+	/*
+	if (tb->fullmm) {
+		put_cpu_var(tlb_batch);
 		return;
+	}
+	*/
 
-	nr = mp->tlb_nr;
+	nr = tb->tlb_nr;
 
-	if (unlikely(nr != 0 && mm != mp->mm)) {
+	if (unlikely(nr != 0 && mm != tb->mm)) {
 		flush_tlb_pending();
 		nr = 0;
 	}
 
 	if (nr == 0)
-		mp->mm = mm;
+		tb->mm = mm;
 
-	mp->vaddrs[nr] = vaddr;
-	mp->tlb_nr = ++nr;
+	tb->vaddrs[nr] = vaddr;
+	tb->tlb_nr = ++nr;
 	if (nr >= TLB_BATCH_NR)
 		flush_tlb_pending();
+
+	put_cpu_var(tlb_batch);
 }
Index: linux-2.6/arch/sparc/mm/tsb.c
===================================================================
--- linux-2.6.orig/arch/sparc/mm/tsb.c
+++ linux-2.6/arch/sparc/mm/tsb.c
@@ -47,12 +47,13 @@ void flush_tsb_kernel_range(unsigned lon
 	}
 }
 
-static void __flush_tsb_one(struct mmu_gather *mp, unsigned long hash_shift, unsigned long tsb, unsigned long nentries)
+static void __flush_tsb_one(struct tlb_batch *tb, unsigned long hash_shift,
+			    unsigned long tsb, unsigned long nentries)
 {
 	unsigned long i;
 
-	for (i = 0; i < mp->tlb_nr; i++) {
-		unsigned long v = mp->vaddrs[i];
+	for (i = 0; i < tb->tlb_nr; i++) {
+		unsigned long v = tb->vaddrs[i];
 		unsigned long tag, ent, hash;
 
 		v &= ~0x1UL;
@@ -65,9 +66,9 @@ static void __flush_tsb_one(struct mmu_g
 	}
 }
 
-void flush_tsb_user(struct mmu_gather *mp)
+void flush_tsb_user(struct tlb_batch *tb)
 {
-	struct mm_struct *mm = mp->mm;
+	struct mm_struct *mm = tb->mm;
 	unsigned long nentries, base, flags;
 
 	spin_lock_irqsave(&mm->context.lock, flags);
@@ -76,7 +77,7 @@ void flush_tsb_user(struct mmu_gather *m
 	nentries = mm->context.tsb_block[MM_TSB_BASE].tsb_nentries;
 	if (tlb_type == cheetah_plus || tlb_type == hypervisor)
 		base = __pa(base);
-	__flush_tsb_one(mp, PAGE_SHIFT, base, nentries);
+	__flush_tsb_one(tb, PAGE_SHIFT, base, nentries);
 
 #ifdef CONFIG_HUGETLB_PAGE
 	if (mm->context.tsb_block[MM_TSB_HUGE].tsb) {
@@ -84,7 +85,7 @@ void flush_tsb_user(struct mmu_gather *m
 		nentries = mm->context.tsb_block[MM_TSB_HUGE].tsb_nentries;
 		if (tlb_type == cheetah_plus || tlb_type == hypervisor)
 			base = __pa(base);
-		__flush_tsb_one(mp, HPAGE_SHIFT, base, nentries);
+		__flush_tsb_one(tb, HPAGE_SHIFT, base, nentries);
 	}
 #endif
 	spin_unlock_irqrestore(&mm->context.lock, flags);



^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 08/13] sparc: Preemptible mmu_gather
@ 2010-04-08 19:17   ` Peter Zijlstra
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-08 19:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Peter Zijlstra

[-- Attachment #1: mm-preempt-tlb-gather-sparc.patch --]
[-- Type: text/plain, Size: 8844 bytes --]

Rework the sparc mmu_gather usage to conform to the new world order :-)

Sparc mmu_gather does two things:
 - tracks vaddrs to unhash
 - tracks pages to free

Split these two things, as powerpc has done: keep the vaddrs in
per-cpu data structures and flush them on context switch.

The remaining bits can then use the generic mmu_gather.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: David Miller <davem@davemloft.net>
---
 arch/sparc/include/asm/pgalloc_64.h  |    3 +
 arch/sparc/include/asm/tlb_64.h      |   91 ++---------------------------------
 arch/sparc/include/asm/tlbflush_64.h |   12 +++-
 arch/sparc/mm/tlb.c                  |   42 +++++++++-------
 arch/sparc/mm/tsb.c                  |   15 +++--
 5 files changed, 51 insertions(+), 112 deletions(-)

Index: linux-2.6/arch/sparc/include/asm/pgalloc_64.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/pgalloc_64.h
+++ linux-2.6/arch/sparc/include/asm/pgalloc_64.h
@@ -78,4 +78,7 @@ static inline void check_pgt_cache(void)
 	quicklist_trim(0, NULL, 25, 16);
 }
 
+#define __pte_free_tlb(tlb, pte, addr)	pte_free((tlb)->mm, pte)
+#define __pmd_free_tlb(tlb, pmd, addr)	pmd_free((tlb)->mm, pmd)
+
 #endif /* _SPARC64_PGALLOC_H */
Index: linux-2.6/arch/sparc/include/asm/tlb_64.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/tlb_64.h
+++ linux-2.6/arch/sparc/include/asm/tlb_64.h
@@ -7,66 +7,11 @@
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
 
-#define TLB_BATCH_NR	192
-
-/*
- * For UP we don't need to worry about TLB flush
- * and page free order so much..
- */
-#ifdef CONFIG_SMP
-  #define FREE_PTE_NR	506
-  #define tlb_fast_mode(bp) ((bp)->pages_nr == ~0U)
-#else
-  #define FREE_PTE_NR	1
-  #define tlb_fast_mode(bp) 1
-#endif
-
-struct mmu_gather {
-	struct mm_struct *mm;
-	unsigned int pages_nr;
-	unsigned int need_flush;
-	unsigned int fullmm;
-	unsigned int tlb_nr;
-	unsigned long vaddrs[TLB_BATCH_NR];
-	struct page *pages[FREE_PTE_NR];
-};
-
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 #ifdef CONFIG_SMP
 extern void smp_flush_tlb_pending(struct mm_struct *,
 				  unsigned long, unsigned long *);
 #endif
 
-extern void __flush_tlb_pending(unsigned long, unsigned long, unsigned long *);
-extern void flush_tlb_pending(void);
-
-static inline struct mmu_gather *tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
-{
-	struct mmu_gather *mp = &get_cpu_var(mmu_gathers);
-
-	BUG_ON(mp->tlb_nr);
-
-	mp->mm = mm;
-	mp->pages_nr = num_online_cpus() > 1 ? 0U : ~0U;
-	mp->fullmm = full_mm_flush;
-
-	return mp;
-}
-
-
-static inline void tlb_flush_mmu(struct mmu_gather *mp)
-{
-	if (!mp->fullmm)
-		flush_tlb_pending();
-	if (mp->need_flush) {
-		free_pages_and_swap_cache(mp->pages, mp->pages_nr);
-		mp->pages_nr = 0;
-		mp->need_flush = 0;
-	}
-
-}
-
 #ifdef CONFIG_SMP
 extern void smp_flush_tlb_mm(struct mm_struct *mm);
 #define do_flush_tlb_mm(mm) smp_flush_tlb_mm(mm)
@@ -74,38 +19,14 @@ extern void smp_flush_tlb_mm(struct mm_s
 #define do_flush_tlb_mm(mm) __flush_tlb_mm(CTX_HWBITS(mm->context), SECONDARY_CONTEXT)
 #endif
 
-static inline void tlb_finish_mmu(struct mmu_gather *mp, unsigned long start, unsigned long end)
-{
-	tlb_flush_mmu(mp);
-
-	if (mp->fullmm)
-		mp->fullmm = 0;
-
-	/* keep the page table cache within bounds */
-	check_pgt_cache();
-
-	put_cpu_var(mmu_gathers);
-}
-
-static inline void tlb_remove_page(struct mmu_gather *mp, struct page *page)
-{
-	if (tlb_fast_mode(mp)) {
-		free_page_and_swap_cache(page);
-		return;
-	}
-	mp->need_flush = 1;
-	mp->pages[mp->pages_nr++] = page;
-	if (mp->pages_nr >= FREE_PTE_NR)
-		tlb_flush_mmu(mp);
-}
-
-#define tlb_remove_tlb_entry(mp,ptep,addr) do { } while (0)
-#define pte_free_tlb(mp, ptepage, addr) pte_free((mp)->mm, ptepage)
-#define pmd_free_tlb(mp, pmdp, addr) pmd_free((mp)->mm, pmdp)
-#define pud_free_tlb(tlb,pudp, addr) __pud_free_tlb(tlb,pudp,addr)
+extern void __flush_tlb_pending(unsigned long, unsigned long, unsigned long *);
+extern void flush_tlb_pending(void);
 
-#define tlb_migrate_finish(mm)	do { } while (0)
 #define tlb_start_vma(tlb, vma) do { } while (0)
 #define tlb_end_vma(tlb, vma)	do { } while (0)
+#define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
+#define tlb_flush(tlb)	flush_tlb_pending()
+
+#include <asm-generic/tlb.h>
 
 #endif /* _SPARC64_TLB_H */
Index: linux-2.6/arch/sparc/include/asm/tlbflush_64.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/tlbflush_64.h
+++ linux-2.6/arch/sparc/include/asm/tlbflush_64.h
@@ -5,9 +5,17 @@
 #include <asm/mmu_context.h>
 
 /* TSB flush operations. */
-struct mmu_gather;
+
+#define TLB_BATCH_NR	192
+
+struct tlb_batch {
+	struct mm_struct *mm;
+	unsigned long tlb_nr;
+	unsigned long vaddrs[TLB_BATCH_NR];
+};
+
 extern void flush_tsb_kernel_range(unsigned long start, unsigned long end);
-extern void flush_tsb_user(struct mmu_gather *mp);
+extern void flush_tsb_user(struct tlb_batch *tb);
 
 /* TLB flush operations. */
 
Index: linux-2.6/arch/sparc/mm/tlb.c
===================================================================
--- linux-2.6.orig/arch/sparc/mm/tlb.c
+++ linux-2.6/arch/sparc/mm/tlb.c
@@ -19,33 +19,33 @@
 
 /* Heavily inspired by the ppc64 code.  */
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
+static DEFINE_PER_CPU(struct tlb_batch, tlb_batch);
 
 void flush_tlb_pending(void)
 {
-	struct mmu_gather *mp = &get_cpu_var(mmu_gathers);
+	struct tlb_batch *tb = &get_cpu_var(tlb_batch);
 
-	if (mp->tlb_nr) {
-		flush_tsb_user(mp);
+	if (tb->tlb_nr) {
+		flush_tsb_user(tb);
 
-		if (CTX_VALID(mp->mm->context)) {
+		if (CTX_VALID(tb->mm->context)) {
 #ifdef CONFIG_SMP
-			smp_flush_tlb_pending(mp->mm, mp->tlb_nr,
-					      &mp->vaddrs[0]);
+			smp_flush_tlb_pending(tb->mm, tb->tlb_nr,
+					      &tb->vaddrs[0]);
 #else
-			__flush_tlb_pending(CTX_HWBITS(mp->mm->context),
-					    mp->tlb_nr, &mp->vaddrs[0]);
+			__flush_tlb_pending(CTX_HWBITS(tb->mm->context),
+					    tb->tlb_nr, &tb->vaddrs[0]);
 #endif
 		}
-		mp->tlb_nr = 0;
+		tb->tlb_nr = 0;
 	}
 
-	put_cpu_var(mmu_gathers);
+	put_cpu_var(tlb_batch);
 }
 
 void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr, pte_t *ptep, pte_t orig)
 {
-	struct mmu_gather *mp = &__get_cpu_var(mmu_gathers);
+	struct tlb_batch *tb = &get_cpu_var(tlb_batch);
 	unsigned long nr;
 
 	vaddr &= PAGE_MASK;
@@ -77,21 +77,27 @@ void tlb_batch_add(struct mm_struct *mm,
 
 no_cache_flush:
 
-	if (mp->fullmm)
+	/*
+	if (tb->fullmm) {
+		put_cpu_var(tlb_batch);
 		return;
+	}
+	*/
 
-	nr = mp->tlb_nr;
+	nr = tb->tlb_nr;
 
-	if (unlikely(nr != 0 && mm != mp->mm)) {
+	if (unlikely(nr != 0 && mm != tb->mm)) {
 		flush_tlb_pending();
 		nr = 0;
 	}
 
 	if (nr == 0)
-		mp->mm = mm;
+		tb->mm = mm;
 
-	mp->vaddrs[nr] = vaddr;
-	mp->tlb_nr = ++nr;
+	tb->vaddrs[nr] = vaddr;
+	tb->tlb_nr = ++nr;
 	if (nr >= TLB_BATCH_NR)
 		flush_tlb_pending();
+
+	put_cpu_var(tlb_batch);
 }
Index: linux-2.6/arch/sparc/mm/tsb.c
===================================================================
--- linux-2.6.orig/arch/sparc/mm/tsb.c
+++ linux-2.6/arch/sparc/mm/tsb.c
@@ -47,12 +47,13 @@ void flush_tsb_kernel_range(unsigned lon
 	}
 }
 
-static void __flush_tsb_one(struct mmu_gather *mp, unsigned long hash_shift, unsigned long tsb, unsigned long nentries)
+static void __flush_tsb_one(struct tlb_batch *tb, unsigned long hash_shift,
+			    unsigned long tsb, unsigned long nentries)
 {
 	unsigned long i;
 
-	for (i = 0; i < mp->tlb_nr; i++) {
-		unsigned long v = mp->vaddrs[i];
+	for (i = 0; i < tb->tlb_nr; i++) {
+		unsigned long v = tb->vaddrs[i];
 		unsigned long tag, ent, hash;
 
 		v &= ~0x1UL;
@@ -65,9 +66,9 @@ static void __flush_tsb_one(struct mmu_g
 	}
 }
 
-void flush_tsb_user(struct mmu_gather *mp)
+void flush_tsb_user(struct tlb_batch *tb)
 {
-	struct mm_struct *mm = mp->mm;
+	struct mm_struct *mm = tb->mm;
 	unsigned long nentries, base, flags;
 
 	spin_lock_irqsave(&mm->context.lock, flags);
@@ -76,7 +77,7 @@ void flush_tsb_user(struct mmu_gather *m
 	nentries = mm->context.tsb_block[MM_TSB_BASE].tsb_nentries;
 	if (tlb_type == cheetah_plus || tlb_type == hypervisor)
 		base = __pa(base);
-	__flush_tsb_one(mp, PAGE_SHIFT, base, nentries);
+	__flush_tsb_one(tb, PAGE_SHIFT, base, nentries);
 
 #ifdef CONFIG_HUGETLB_PAGE
 	if (mm->context.tsb_block[MM_TSB_HUGE].tsb) {
@@ -84,7 +85,7 @@ void flush_tsb_user(struct mmu_gather *m
 		nentries = mm->context.tsb_block[MM_TSB_HUGE].tsb_nentries;
 		if (tlb_type == cheetah_plus || tlb_type == hypervisor)
 			base = __pa(base);
-		__flush_tsb_one(mp, HPAGE_SHIFT, base, nentries);
+		__flush_tsb_one(tb, HPAGE_SHIFT, base, nentries);
 	}
 #endif
 	spin_unlock_irqrestore(&mm->context.lock, flags);

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 09/13] mm, powerpc: Move the RCU page-table freeing into generic code
  2010-04-08 19:17 ` Peter Zijlstra
@ 2010-04-08 19:17   ` Peter Zijlstra
  -1 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-08 19:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Peter Zijlstra

[-- Attachment #1: mm-preempt-tlb-gather-rcu.patch --]
[-- Type: text/plain, Size: 12888 bytes --]

In case other architectures require RCU freed page-tables to implement
gup_fast(), provide the means to do so by moving the logic into generic
code.
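
The shape of the resulting generic helper can be sketched roughly as follows
(plain C; deferred_free() is a placeholder for call_rcu() and free_one_table()
for the IPI fallback, so none of these names are the real kernel API):

#include <stdio.h>
#include <stdlib.h>

#define MAX_TABLE_BATCH	4

struct table_batch {
	int nr;
	void *tables[MAX_TABLE_BATCH];
};

static void deferred_free(struct table_batch *batch)	/* think call_rcu() */
{
	int i;

	for (i = 0; i < batch->nr; i++)
		printf("deferred free of table %p\n", batch->tables[i]);
	free(batch);
}

static void free_one_table(void *table)	/* think IPI + immediate free */
{
	printf("synchronous fallback free of table %p\n", table);
}

static void remove_table(struct table_batch **batchp, void *table)
{
	if (*batchp == NULL) {
		*batchp = calloc(1, sizeof(**batchp));
		if (*batchp == NULL) {
			free_one_table(table);	/* no memory: free it now */
			return;
		}
	}
	(*batchp)->tables[(*batchp)->nr++] = table;
	if ((*batchp)->nr == MAX_TABLE_BATCH) {
		deferred_free(*batchp);
		*batchp = NULL;
	}
}

int main(void)
{
	struct table_batch *batch = NULL;
	int tables[6];
	int i;

	for (i = 0; i < 6; i++)
		remove_table(&batch, &tables[i]);
	if (batch)
		deferred_free(batch);	/* what tlb_table_flush() does */
	return 0;
}

In the patch the deferred path is call_rcu() on a struct mmu_table_batch hung
off the mmu_gather, and the fallback (tlb_remove_table_one()) IPIs the other
CPUs so that any gup_fast() still inside its IRQ-disabled walk has finished
before the single table is freed.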

Requested-by: David Miller <davem@davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 arch/powerpc/include/asm/pgalloc.h |   23 ++++++-
 arch/powerpc/include/asm/tlb.h     |   10 ---
 arch/powerpc/mm/pgtable.c          |  119 -------------------------------------
 arch/powerpc/mm/tlb_hash32.c       |    3 
 arch/powerpc/mm/tlb_hash64.c       |    3 
 arch/powerpc/mm/tlb_nohash.c       |    3 
 include/asm-generic/tlb.h          |   57 ++++++++++++++++-
 mm/memory.c                        |   81 +++++++++++++++++++++++++
 8 files changed, 153 insertions(+), 146 deletions(-)

Index: linux-2.6/arch/powerpc/include/asm/pgalloc.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/pgalloc.h
+++ linux-2.6/arch/powerpc/include/asm/pgalloc.h
@@ -31,14 +31,31 @@ static inline void pte_free(struct mm_st
 #endif
 
 #ifdef CONFIG_SMP
-extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift);
-extern void pte_free_finish(struct mmu_gather *tlb);
+#define HAVE_ARCH_RCU_TABLE_FREE
+
+struct mmu_gather;
+extern void tlb_remove_table(struct mmu_gather *, void *);
+
+static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift)
+{
+	unsigned long pgf = (unsigned long)table;
+	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+	pgf |= shift;
+	tlb_remove_table(tlb, (void *)pgf);
+}
+
+static inline void __tlb_remove_table(void *_table)
+{
+	void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
+	unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
+
+	pgtable_free(table, shift);
+}
 #else /* CONFIG_SMP */
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
 	pgtable_free(table, shift);
 }
-static inline void pte_free_finish(struct mmu_gather *tlb) { }
 #endif /* !CONFIG_SMP */
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
Index: linux-2.6/arch/powerpc/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/tlb.h
+++ linux-2.6/arch/powerpc/include/asm/tlb.h
@@ -28,16 +28,6 @@
 #define tlb_start_vma(tlb, vma)	do { } while (0)
 #define tlb_end_vma(tlb, vma)	do { } while (0)
 
-#define HAVE_ARCH_MMU_GATHER 1
-
-struct pte_freelist_batch;
-
-struct arch_mmu_gather {
-	struct pte_freelist_batch *batch;
-};
-
-#define ARCH_MMU_GATHER_INIT (struct arch_mmu_gather){ .batch = NULL, }
-
 extern void tlb_flush(struct mmu_gather *tlb);
 
 /* Get the generic bits... */
Index: linux-2.6/arch/powerpc/mm/pgtable.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/pgtable.c
+++ linux-2.6/arch/powerpc/mm/pgtable.c
@@ -33,125 +33,6 @@
 
 #include "mmu_decl.h"
 
-#ifdef CONFIG_SMP
-
-/*
- * Handle batching of page table freeing on SMP. Page tables are
- * queued up and send to be freed later by RCU in order to avoid
- * freeing a page table page that is being walked without locks
- */
-
-static unsigned long pte_freelist_forced_free;
-
-struct pte_freelist_batch
-{
-	struct rcu_head	rcu;
-	unsigned int	index;
-	unsigned long	tables[0];
-};
-
-#define PTE_FREELIST_SIZE \
-	((PAGE_SIZE - sizeof(struct pte_freelist_batch)) \
-	  / sizeof(unsigned long))
-
-static void pte_free_smp_sync(void *arg)
-{
-	/* Do nothing, just ensure we sync with all CPUs */
-}
-
-/* This is only called when we are critically out of memory
- * (and fail to get a page in pte_free_tlb).
- */
-static void pgtable_free_now(void *table, unsigned shift)
-{
-	pte_freelist_forced_free++;
-
-	smp_call_function(pte_free_smp_sync, NULL, 1);
-
-	pgtable_free(table, shift);
-}
-
-static void pte_free_rcu_callback(struct rcu_head *head)
-{
-	struct pte_freelist_batch *batch =
-		container_of(head, struct pte_freelist_batch, rcu);
-	unsigned int i;
-
-	for (i = 0; i < batch->index; i++) {
-		void *table = (void *)(batch->tables[i] & ~MAX_PGTABLE_INDEX_SIZE);
-		unsigned shift = batch->tables[i] & MAX_PGTABLE_INDEX_SIZE;
-
-		pgtable_free(table, shift);
-	}
-
-	free_page((unsigned long)batch);
-}
-
-static void pte_free_submit(struct pte_freelist_batch *batch)
-{
-	INIT_RCU_HEAD(&batch->rcu);
-	call_rcu(&batch->rcu, pte_free_rcu_callback);
-}
-
-void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
-{
-	struct pte_freelist_batch **batchp = &tlb->arch.batch;
-	unsigned long pgf;
-
-	/*
-	 * A comment here about on why we have RCU freed page tables might be
-	 * interesting, also explaining why we don't need any sort of grace
-	 * period for mm_users == 1, and have some home brewn smp_call_func()
-	 * for single frees.
-	 *
-	 * The only lockless page table walker I know of is gup_fast() which
-	 * relies on irq_disable(). So my guess is that mm_users == 1 means
-	 * that there cannot be another thread and so precludes gup_fast()
-	 * concurrency.
-	 *
-	 * If there are, but we fail to batch, we need to IPI (all?) CPUs so as
-	 * to serialize against the IRQ disable. In case we do batch, the RCU
-	 * grace period is at least long enough to cover IRQ disabled sections
-	 * (XXX assumption, not strictly true).
-	 *
-	 * All this results in us doing our own free batching and not using
-	 * the generic mmu_gather batches (XXX fix that somehow?).
-	 */
-
-	if (atomic_read(&tlb->mm->mm_users) < 2) {
-		pgtable_free(table, shift);
-		return;
-	}
-
-	if (*batchp == NULL) {
-		*batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC);
-		if (*batchp == NULL) {
-			pgtable_free_now(table, shift);
-			return;
-		}
-		(*batchp)->index = 0;
-	}
-	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
-	pgf = (unsigned long)table | shift;
-	(*batchp)->tables[(*batchp)->index++] = pgf;
-	if ((*batchp)->index == PTE_FREELIST_SIZE) {
-		pte_free_submit(*batchp);
-		*batchp = NULL;
-	}
-}
-
-void pte_free_finish(struct mmu_gather *tlb)
-{
-	struct pte_freelist_batch **batchp = &tlb->arch.batch;
-
-	if (*batchp == NULL)
-		return;
-	pte_free_submit(*batchp);
-	*batchp = NULL;
-}
-
-#endif /* CONFIG_SMP */
-
 static inline int is_exec_fault(void)
 {
 	return current->thread.regs && TRAP(current->thread.regs) == 0x400;
Index: linux-2.6/arch/powerpc/mm/tlb_hash32.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_hash32.c
+++ linux-2.6/arch/powerpc/mm/tlb_hash32.c
@@ -71,9 +71,6 @@ void tlb_flush(struct mmu_gather *tlb)
 		 */
 		_tlbia();
 	}
-
-	/* Push out batch of freed page tables */
-	pte_free_finish(tlb);
 }
 
 /*
Index: linux-2.6/arch/powerpc/mm/tlb_hash64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_hash64.c
+++ linux-2.6/arch/powerpc/mm/tlb_hash64.c
@@ -165,9 +165,6 @@ void tlb_flush(struct mmu_gather *tlb)
 		__flush_tlb_pending(tlbbatch);
 
 	put_cpu_var(ppc64_tlb_batch);
-
-	/* Push out batch of freed page tables */
-	pte_free_finish(tlb);
 }
 
 /**
Index: linux-2.6/arch/powerpc/mm/tlb_nohash.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_nohash.c
+++ linux-2.6/arch/powerpc/mm/tlb_nohash.c
@@ -296,9 +296,6 @@ EXPORT_SYMBOL(flush_tlb_range);
 void tlb_flush(struct mmu_gather *tlb)
 {
 	flush_tlb_mm(tlb->mm);
-
-	/* Push out batch of freed page tables */
-	pte_free_finish(tlb);
 }
 
 /*
Index: linux-2.6/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.orig/include/asm-generic/tlb.h
+++ linux-2.6/include/asm-generic/tlb.h
@@ -27,6 +27,49 @@
   #define tlb_fast_mode(tlb) 1
 #endif
 
+#ifdef HAVE_ARCH_RCU_TABLE_FREE
+/*
+ * Semi RCU freeing of the page directories.
+ *
+ * This is needed by some architectures to implement gup_fast().
+ *
+ * gup_fast() does a lockless page-table walk and therefore needs some
+ * synchronization with the freeing of the page directories. The chosen means
+ * to accomplish that is by disabling IRQs over the walk.
+ *
+ * Architectures that use IPIs to flush TLBs will then automagically DTRT,
+ * since we unlink the page, flush TLBs, free the page. Since the disabling of
+ * IRQs delays the copmletion of the TLB flush we can never observe an already
+ * freed page.
+ *
+ * Architectures that do not have this (PPC) need to delay the freeing by some
+ * other means, this is that means.
+ *
+ * What we do is batch the freed directory pages (tables) and RCU free them.
+ * The assumption that IRQ disabled also holds off RCU grace periods is valid
+ * for the current RCU implementations (if this ever were to change these
+ * gup_fast() implementation would need to use the RCU read lock).
+ *
+ * However, in order to batch these pages we need to allocate storage, this
+ * allocation is deep inside the MM code and can thus easily fail on memory
+ * pressure. To guarantee progress we fall back to single table freeing, see
+ * the implementation of tlb_remove_table_one().
+ *
+ */
+struct mmu_table_batch {
+	struct rcu_head		rcu;
+	unsigned int		nr;
+	void			*tables[0];
+};
+
+#define MAX_TABLE_BATCH		\
+	((PAGE_SIZE - sizeof(struct mmu_table_batch)) / sizeof(unsigned long))
+
+extern void tlb_table_flush(struct mmu_gather *tlb);
+extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
+
+#endif
+
 /* struct mmu_gather is an opaque type used by the mm code for passing around
  * any data needed by arch specific code for tlb_remove_page.
  */
@@ -36,11 +79,12 @@ struct mmu_gather {
 	unsigned int		max;	/* nr < max */
 	unsigned int		need_flush;/* Really unmapped some ptes? */
 	unsigned int		fullmm; /* non-zero means full mm flush */
-#ifdef HAVE_ARCH_MMU_GATHER
-	struct arch_mmu_gather	arch;
-#endif
 	struct page		**pages;
 	struct page		*local[8];
+
+#ifdef HAVE_ARCH_RCU_TABLE_FREE
+	struct mmu_table_batch	*batch;
+#endif
 };
 
 static inline void __tlb_alloc_pages(struct mmu_gather *tlb)
@@ -72,8 +116,8 @@ tlb_gather_mmu(struct mmu_gather *tlb, s
 
 	tlb->fullmm = full_mm_flush;
 
-#ifdef HAVE_ARCH_MMU_GATHER
-	tlb->arch = ARCH_MMU_GATHER_INIT;
+#ifdef HAVE_ARCH_RCU_TABLE_FREE
+	tlb->batch = NULL;
 #endif
 }
 
@@ -84,6 +128,9 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
 		return;
 	tlb->need_flush = 0;
 	tlb_flush(tlb);
+#ifdef HAVE_ARCH_RCU_TABLE_FREE
+	tlb_table_flush(tlb);
+#endif
 	if (!tlb_fast_mode(tlb)) {
 		free_pages_and_swap_cache(tlb->pages, tlb->nr);
 		tlb->nr = 0;
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -193,6 +193,87 @@ static void check_sync_rss_stat(struct t
 
 #endif
 
+#ifdef HAVE_ARCH_RCU_TABLE_FREE
+
+/*
+ * See the comment near struct mmu_table_batch.
+ */
+
+static void tlb_remove_table_smp_sync(void *arg)
+{
+	/* Simply deliver the interrupt */
+}
+
+static void tlb_remove_table_one(void *table)
+{
+	/*
+	 * This isn't an RCU grace period and hence the page-tables cannot be
+	 * assumed to be actually RCU-freed.
+	 *
+	 * It is however sufficient for gup_fast() implementations that rely on
+	 * IRQ disabling. See the comment near struct mmu_table_batch.
+	 *
+	 * [ Using synchronize_rcu() instead of this smp_call_function() would
+	 *   make it proper RCU, however that is also much slower ]
+	 */
+	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
+	__tlb_remove_table(table);
+}
+
+static void tlb_remove_table_rcu(struct rcu_head *head)
+{
+	struct mmu_table_batch *batch;
+	int i;
+
+	batch = container_of(head, struct mmu_table_batch, rcu);
+
+	for (i = 0; i < batch->nr; i++)
+		__tlb_remove_table(batch->tables[i]);
+
+	free_page((unsigned long)batch);
+}
+
+void tlb_table_flush(struct mmu_gather *tlb)
+{
+	struct mmu_table_batch **batch = &tlb->batch;
+
+	if (*batch) {
+		INIT_RCU_HEAD(&(*batch)->rcu);
+		call_rcu(&(*batch)->rcu, tlb_remove_table_rcu);
+		*batch = NULL;
+	}
+}
+
+void tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+	struct mmu_table_batch **batch = &tlb->batch;
+
+	tlb->need_flush = 1;
+
+	/*
+	 * When there's less that two users of this mm there cannot be a
+	 * concurrent gup_fast() lookup.
+	 */
+	if (atomic_read(&tlb->mm->mm_users) < 2) {
+		__tlb_remove_table(table);
+		return;
+	}
+
+	if (*batch == NULL) {
+		*batch = (struct mmu_table_batch *)__get_free_page(GFP_ATOMIC);
+		if (*batch == NULL) {
+			tlb_remove_table_one(table);
+			return;
+		}
+		(*batch)->nr = 0;
+	}
+	(*batch)->tables[(*batch)->nr++] = table;
+	if ((*batch)->nr == MAX_TABLE_BATCH)
+		tlb_table_flush(tlb);
+}
+
+#endif
+
 /*
  * If a p?d_bad entry is found while walking page tables, report
  * the error, before resetting entry to p?d_none.  Usually (but



^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 09/13] mm, powerpc: Move the RCU page-table freeing into generic code
@ 2010-04-08 19:17   ` Peter Zijlstra
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-08 19:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Peter Zijlstra

[-- Attachment #1: mm-preempt-tlb-gather-rcu.patch --]
[-- Type: text/plain, Size: 12886 bytes --]

In case other architectures require RCU freed page-tables to implement
gup_fast(), provide the means to do so by moving the logic into generic
code.

Requested-by: David Miller <davem@davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 arch/powerpc/include/asm/pgalloc.h |   23 ++++++-
 arch/powerpc/include/asm/tlb.h     |   10 ---
 arch/powerpc/mm/pgtable.c          |  119 -------------------------------------
 arch/powerpc/mm/tlb_hash32.c       |    3 
 arch/powerpc/mm/tlb_hash64.c       |    3 
 arch/powerpc/mm/tlb_nohash.c       |    3 
 include/asm-generic/tlb.h          |   57 ++++++++++++++++-
 mm/memory.c                        |   81 +++++++++++++++++++++++++
 8 files changed, 153 insertions(+), 146 deletions(-)

Index: linux-2.6/arch/powerpc/include/asm/pgalloc.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/pgalloc.h
+++ linux-2.6/arch/powerpc/include/asm/pgalloc.h
@@ -31,14 +31,31 @@ static inline void pte_free(struct mm_st
 #endif
 
 #ifdef CONFIG_SMP
-extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift);
-extern void pte_free_finish(struct mmu_gather *tlb);
+#define HAVE_ARCH_RCU_TABLE_FREE
+
+struct mmu_gather;
+extern void tlb_remove_table(struct mmu_gather *, void *);
+
+static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift)
+{
+	unsigned long pgf = (unsigned long)table;
+	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+	pgf |= shift;
+	tlb_remove_table(tlb, (void *)pgf);
+}
+
+static inline void __tlb_remove_table(void *_table)
+{
+	void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
+	unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
+
+	pgtable_free(table, shift);
+}
 #else /* CONFIG_SMP */
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
 	pgtable_free(table, shift);
 }
-static inline void pte_free_finish(struct mmu_gather *tlb) { }
 #endif /* !CONFIG_SMP */
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
Index: linux-2.6/arch/powerpc/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/tlb.h
+++ linux-2.6/arch/powerpc/include/asm/tlb.h
@@ -28,16 +28,6 @@
 #define tlb_start_vma(tlb, vma)	do { } while (0)
 #define tlb_end_vma(tlb, vma)	do { } while (0)
 
-#define HAVE_ARCH_MMU_GATHER 1
-
-struct pte_freelist_batch;
-
-struct arch_mmu_gather {
-	struct pte_freelist_batch *batch;
-};
-
-#define ARCH_MMU_GATHER_INIT (struct arch_mmu_gather){ .batch = NULL, }
-
 extern void tlb_flush(struct mmu_gather *tlb);
 
 /* Get the generic bits... */
Index: linux-2.6/arch/powerpc/mm/pgtable.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/pgtable.c
+++ linux-2.6/arch/powerpc/mm/pgtable.c
@@ -33,125 +33,6 @@
 
 #include "mmu_decl.h"
 
-#ifdef CONFIG_SMP
-
-/*
- * Handle batching of page table freeing on SMP. Page tables are
- * queued up and send to be freed later by RCU in order to avoid
- * freeing a page table page that is being walked without locks
- */
-
-static unsigned long pte_freelist_forced_free;
-
-struct pte_freelist_batch
-{
-	struct rcu_head	rcu;
-	unsigned int	index;
-	unsigned long	tables[0];
-};
-
-#define PTE_FREELIST_SIZE \
-	((PAGE_SIZE - sizeof(struct pte_freelist_batch)) \
-	  / sizeof(unsigned long))
-
-static void pte_free_smp_sync(void *arg)
-{
-	/* Do nothing, just ensure we sync with all CPUs */
-}
-
-/* This is only called when we are critically out of memory
- * (and fail to get a page in pte_free_tlb).
- */
-static void pgtable_free_now(void *table, unsigned shift)
-{
-	pte_freelist_forced_free++;
-
-	smp_call_function(pte_free_smp_sync, NULL, 1);
-
-	pgtable_free(table, shift);
-}
-
-static void pte_free_rcu_callback(struct rcu_head *head)
-{
-	struct pte_freelist_batch *batch =
-		container_of(head, struct pte_freelist_batch, rcu);
-	unsigned int i;
-
-	for (i = 0; i < batch->index; i++) {
-		void *table = (void *)(batch->tables[i] & ~MAX_PGTABLE_INDEX_SIZE);
-		unsigned shift = batch->tables[i] & MAX_PGTABLE_INDEX_SIZE;
-
-		pgtable_free(table, shift);
-	}
-
-	free_page((unsigned long)batch);
-}
-
-static void pte_free_submit(struct pte_freelist_batch *batch)
-{
-	INIT_RCU_HEAD(&batch->rcu);
-	call_rcu(&batch->rcu, pte_free_rcu_callback);
-}
-
-void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
-{
-	struct pte_freelist_batch **batchp = &tlb->arch.batch;
-	unsigned long pgf;
-
-	/*
-	 * A comment here about on why we have RCU freed page tables might be
-	 * interesting, also explaining why we don't need any sort of grace
-	 * period for mm_users == 1, and have some home brewn smp_call_func()
-	 * for single frees.
-	 *
-	 * The only lockless page table walker I know of is gup_fast() which
-	 * relies on irq_disable(). So my guess is that mm_users == 1 means
-	 * that there cannot be another thread and so precludes gup_fast()
-	 * concurrency.
-	 *
-	 * If there are, but we fail to batch, we need to IPI (all?) CPUs so as
-	 * to serialize against the IRQ disable. In case we do batch, the RCU
-	 * grace period is at least long enough to cover IRQ disabled sections
-	 * (XXX assumption, not strictly true).
-	 *
-	 * All this results in us doing our own free batching and not using
-	 * the generic mmu_gather batches (XXX fix that somehow?).
-	 */
-
-	if (atomic_read(&tlb->mm->mm_users) < 2) {
-		pgtable_free(table, shift);
-		return;
-	}
-
-	if (*batchp == NULL) {
-		*batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC);
-		if (*batchp == NULL) {
-			pgtable_free_now(table, shift);
-			return;
-		}
-		(*batchp)->index = 0;
-	}
-	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
-	pgf = (unsigned long)table | shift;
-	(*batchp)->tables[(*batchp)->index++] = pgf;
-	if ((*batchp)->index == PTE_FREELIST_SIZE) {
-		pte_free_submit(*batchp);
-		*batchp = NULL;
-	}
-}
-
-void pte_free_finish(struct mmu_gather *tlb)
-{
-	struct pte_freelist_batch **batchp = &tlb->arch.batch;
-
-	if (*batchp == NULL)
-		return;
-	pte_free_submit(*batchp);
-	*batchp = NULL;
-}
-
-#endif /* CONFIG_SMP */
-
 static inline int is_exec_fault(void)
 {
 	return current->thread.regs && TRAP(current->thread.regs) == 0x400;
Index: linux-2.6/arch/powerpc/mm/tlb_hash32.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_hash32.c
+++ linux-2.6/arch/powerpc/mm/tlb_hash32.c
@@ -71,9 +71,6 @@ void tlb_flush(struct mmu_gather *tlb)
 		 */
 		_tlbia();
 	}
-
-	/* Push out batch of freed page tables */
-	pte_free_finish(tlb);
 }
 
 /*
Index: linux-2.6/arch/powerpc/mm/tlb_hash64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_hash64.c
+++ linux-2.6/arch/powerpc/mm/tlb_hash64.c
@@ -165,9 +165,6 @@ void tlb_flush(struct mmu_gather *tlb)
 		__flush_tlb_pending(tlbbatch);
 
 	put_cpu_var(ppc64_tlb_batch);
-
-	/* Push out batch of freed page tables */
-	pte_free_finish(tlb);
 }
 
 /**
Index: linux-2.6/arch/powerpc/mm/tlb_nohash.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_nohash.c
+++ linux-2.6/arch/powerpc/mm/tlb_nohash.c
@@ -296,9 +296,6 @@ EXPORT_SYMBOL(flush_tlb_range);
 void tlb_flush(struct mmu_gather *tlb)
 {
 	flush_tlb_mm(tlb->mm);
-
-	/* Push out batch of freed page tables */
-	pte_free_finish(tlb);
 }
 
 /*
Index: linux-2.6/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.orig/include/asm-generic/tlb.h
+++ linux-2.6/include/asm-generic/tlb.h
@@ -27,6 +27,49 @@
   #define tlb_fast_mode(tlb) 1
 #endif
 
+#ifdef HAVE_ARCH_RCU_TABLE_FREE
+/*
+ * Semi RCU freeing of the page directories.
+ *
+ * This is needed by some architectures to implement gup_fast().
+ *
+ * gup_fast() does a lockless page-table walk and therefore needs some
+ * synchronization with the freeing of the page directories. The chosen means
+ * to accomplish that is by disabling IRQs over the walk.
+ *
+ * Architectures that use IPIs to flush TLBs will then automagically DTRT,
+ * since we unlink the page, flush TLBs, free the page. Since the disabling of
+ * IRQs delays the completion of the TLB flush we can never observe an already
+ * freed page.
+ *
+ * Architectures that do not have this (PPC) need to delay the freeing by some
+ * other means; this is that means.
+ *
+ * What we do is batch the freed directory pages (tables) and RCU free them.
+ * The assumption that IRQ disabled also holds off RCU grace periods is valid
+ * for the current RCU implementations (if this ever were to change, these
+ * gup_fast() implementations would need to use the RCU read lock).
+ *
+ * However, in order to batch these pages we need to allocate storage, this
+ * allocation is deep inside the MM code and can thus easily fail on memory
+ * pressure. To guarantee progress we fall back to single table freeing, see
+ * the implementation of tlb_remove_table_one().
+ *
+ */
+struct mmu_table_batch {
+	struct rcu_head		rcu;
+	unsigned int		nr;
+	void			*tables[0];
+};
+
+#define MAX_TABLE_BATCH		\
+	((PAGE_SIZE - sizeof(struct mmu_table_batch)) / sizeof(unsigned long))
+
+extern void tlb_table_flush(struct mmu_gather *tlb);
+extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
+
+#endif
+
 /* struct mmu_gather is an opaque type used by the mm code for passing around
  * any data needed by arch specific code for tlb_remove_page.
  */
@@ -36,11 +79,12 @@ struct mmu_gather {
 	unsigned int		max;	/* nr < max */
 	unsigned int		need_flush;/* Really unmapped some ptes? */
 	unsigned int		fullmm; /* non-zero means full mm flush */
-#ifdef HAVE_ARCH_MMU_GATHER
-	struct arch_mmu_gather	arch;
-#endif
 	struct page		**pages;
 	struct page		*local[8];
+
+#ifdef HAVE_ARCH_RCU_TABLE_FREE
+	struct mmu_table_batch	*batch;
+#endif
 };
 
 static inline void __tlb_alloc_pages(struct mmu_gather *tlb)
@@ -72,8 +116,8 @@ tlb_gather_mmu(struct mmu_gather *tlb, s
 
 	tlb->fullmm = full_mm_flush;
 
-#ifdef HAVE_ARCH_MMU_GATHER
-	tlb->arch = ARCH_MMU_GATHER_INIT;
+#ifdef HAVE_ARCH_RCU_TABLE_FREE
+	tlb->batch = NULL;
 #endif
 }
 
@@ -84,6 +128,9 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
 		return;
 	tlb->need_flush = 0;
 	tlb_flush(tlb);
+#ifdef HAVE_ARCH_RCU_TABLE_FREE
+	tlb_table_flush(tlb);
+#endif
 	if (!tlb_fast_mode(tlb)) {
 		free_pages_and_swap_cache(tlb->pages, tlb->nr);
 		tlb->nr = 0;
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -193,6 +193,87 @@ static void check_sync_rss_stat(struct t
 
 #endif
 
+#ifdef HAVE_ARCH_RCU_TABLE_FREE
+
+/*
+ * See the comment near struct mmu_table_batch.
+ */
+
+static void tlb_remove_table_smp_sync(void *arg)
+{
+	/* Simply deliver the interrupt */
+}
+
+static void tlb_remove_table_one(void *table)
+{
+	/*
+	 * This isn't an RCU grace period and hence the page-tables cannot be
+	 * assumed to be actually RCU-freed.
+	 *
+	 * It is however sufficient for gup_fast() implementations that rely on
+	 * IRQ disabling. See the comment near struct mmu_table_batch.
+	 *
+	 * [ Using synchronize_rcu() instead of this smp_call_function() would
+	 *   make it proper RCU, however that is also much slower ]
+	 */
+	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
+	__tlb_remove_table(table);
+}
+
+static void tlb_remove_table_rcu(struct rcu_head *head)
+{
+	struct mmu_table_batch *batch;
+	int i;
+
+	batch = container_of(head, struct mmu_table_batch, rcu);
+
+	for (i = 0; i < batch->nr; i++)
+		__tlb_remove_table(batch->tables[i]);
+
+	free_page((unsigned long)batch);
+}
+
+void tlb_table_flush(struct mmu_gather *tlb)
+{
+	struct mmu_table_batch **batch = &tlb->batch;
+
+	if (*batch) {
+		INIT_RCU_HEAD(&(*batch)->rcu);
+		call_rcu(&(*batch)->rcu, tlb_remove_table_rcu);
+		*batch = NULL;
+	}
+}
+
+void tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+	struct mmu_table_batch **batch = &tlb->batch;
+
+	tlb->need_flush = 1;
+
+	/*
+	 * When there's less than two users of this mm there cannot be a
+	 * concurrent gup_fast() lookup.
+	 */
+	if (atomic_read(&tlb->mm->mm_users) < 2) {
+		__tlb_remove_table(table);
+		return;
+	}
+
+	if (*batch == NULL) {
+		*batch = (struct mmu_table_batch *)__get_free_page(GFP_ATOMIC);
+		if (*batch == NULL) {
+			tlb_remove_table_one(table);
+			return;
+		}
+		(*batch)->nr = 0;
+	}
+	(*batch)->tables[(*batch)->nr++] = table;
+	if ((*batch)->nr == MAX_TABLE_BATCH)
+		tlb_table_flush(tlb);
+}
+
+#endif
+
 /*
  * If a p?d_bad entry is found while walking page tables, report
  * the error, before resetting entry to p?d_none.  Usually (but

^ permalink raw reply	[flat|nested] 113+ messages in thread
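
For illustration, a minimal sketch of what another architecture would have to
add in order to use the generic RCU table freeing above; the architecture
name, the header placement and the pgtable_page_dtor()/page_address() calls
are assumptions made for the example, not taken from this patch:

/* hypothetical arch/foo/include/asm/pgalloc.h fragment */
#define HAVE_ARCH_RCU_TABLE_FREE

struct mmu_gather;
extern void tlb_remove_table(struct mmu_gather *tlb, void *table);

/* Called back by the generic batch/RCU code to do the real free. */
static inline void __tlb_remove_table(void *table)
{
	free_page((unsigned long)table);
}

/* Arch hook run when a pte page is unlinked from the page tables. */
static inline void __pte_free_tlb(struct mmu_gather *tlb,
				  struct page *ptepage, unsigned long address)
{
	pgtable_page_dtor(ptepage);
	tlb_remove_table(tlb, page_address(ptepage));
}

The architecture only supplies __tlb_remove_table(); whether a table is freed
immediately (single-user mm), through the IPI fallback, or after an RCU grace
period is decided by the generic tlb_remove_table() in mm/memory.c.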

* [PATCH 10/13] lockdep, mutex: Provide mutex_lock_nest_lock
  2010-04-08 19:17 ` Peter Zijlstra
@ 2010-04-08 19:17   ` Peter Zijlstra
  -1 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-08 19:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Peter Zijlstra

[-- Attachment #1: mutex_lock_nest_lock.patch --]
[-- Type: text/plain, Size: 5442 bytes --]

Provide the mutex_lock_nest_lock() annotation, the mutex counterpart of
spin_lock_nest_lock().

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/lockdep.h |    3 +++
 include/linux/mutex.h   |    9 +++++++++
 kernel/mutex.c          |   25 +++++++++++++++++--------
 3 files changed, 29 insertions(+), 8 deletions(-)

Index: linux-2.6/include/linux/lockdep.h
===================================================================
--- linux-2.6.orig/include/linux/lockdep.h
+++ linux-2.6/include/linux/lockdep.h
@@ -484,12 +484,15 @@ static inline void print_irqtrace_events
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 # ifdef CONFIG_PROVE_LOCKING
 #  define mutex_acquire(l, s, t, i)		lock_acquire(l, s, t, 0, 2, NULL, i)
+#  define mutex_acquire_nest(l, s, t, n, i)	lock_acquire(l, s, t, 0, 2, n, i)
 # else
 #  define mutex_acquire(l, s, t, i)		lock_acquire(l, s, t, 0, 1, NULL, i)
+#  define mutex_acquire_nest(l, s, t, n, i)	lock_acquire(l, s, t, 0, 1, n, i)
 # endif
 # define mutex_release(l, n, i)			lock_release(l, n, i)
 #else
 # define mutex_acquire(l, s, t, i)		do { } while (0)
+# define mutex_acquire_nest(l, s, t, n, i)	do { } while (0)
 # define mutex_release(l, n, i)			do { } while (0)
 #endif
 
Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -124,6 +124,7 @@ static inline int mutex_is_locked(struct
  */
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 extern void mutex_lock_nested(struct mutex *lock, unsigned int subclass);
+extern void _mutex_lock_nest_lock(struct mutex *lock, struct lockdep_map *nest_lock);
 extern int __must_check mutex_lock_interruptible_nested(struct mutex *lock,
 					unsigned int subclass);
 extern int __must_check mutex_lock_killable_nested(struct mutex *lock,
@@ -132,6 +133,13 @@ extern int __must_check mutex_lock_killa
 #define mutex_lock(lock) mutex_lock_nested(lock, 0)
 #define mutex_lock_interruptible(lock) mutex_lock_interruptible_nested(lock, 0)
 #define mutex_lock_killable(lock) mutex_lock_killable_nested(lock, 0)
+
+#define mutex_lock_nest_lock(lock, nest_lock)				\
+do {									\
+	typecheck(struct lockdep_map *, &(nest_lock)->dep_map);		\
+	_mutex_lock_nest_lock(lock, &(nest_lock)->dep_map);		\
+} while (0)
+
 #else
 extern void mutex_lock(struct mutex *lock);
 extern int __must_check mutex_lock_interruptible(struct mutex *lock);
@@ -140,6 +148,7 @@ extern int __must_check mutex_lock_killa
 # define mutex_lock_nested(lock, subclass) mutex_lock(lock)
 # define mutex_lock_interruptible_nested(lock, subclass) mutex_lock_interruptible(lock)
 # define mutex_lock_killable_nested(lock, subclass) mutex_lock_killable(lock)
+# define mutex_lock_nest_lock(lock, nest_lock) mutex_lock(lock)
 #endif
 
 /*
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -140,14 +140,14 @@ EXPORT_SYMBOL(mutex_unlock);
  */
 static inline int __sched
 __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
-	       	unsigned long ip)
+		    struct lockdep_map *nest_lock, unsigned long ip)
 {
 	struct task_struct *task = current;
 	struct mutex_waiter waiter;
 	unsigned long flags;
 
 	preempt_disable();
-	mutex_acquire(&lock->dep_map, subclass, 0, ip);
+	mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
 
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
 	/*
@@ -278,16 +278,25 @@ void __sched
 mutex_lock_nested(struct mutex *lock, unsigned int subclass)
 {
 	might_sleep();
-	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, subclass, _RET_IP_);
+	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, subclass, NULL, _RET_IP_);
 }
 
 EXPORT_SYMBOL_GPL(mutex_lock_nested);
 
+void __sched
+_mutex_lock_nest_lock(struct mutex *lock, struct lockdep_map *nest)
+{
+	might_sleep();
+	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0, nest, _RET_IP_);
+}
+
+EXPORT_SYMBOL_GPL(_mutex_lock_nest_lock);
+
 int __sched
 mutex_lock_killable_nested(struct mutex *lock, unsigned int subclass)
 {
 	might_sleep();
-	return __mutex_lock_common(lock, TASK_KILLABLE, subclass, _RET_IP_);
+	return __mutex_lock_common(lock, TASK_KILLABLE, subclass, NULL, _RET_IP_);
 }
 EXPORT_SYMBOL_GPL(mutex_lock_killable_nested);
 
@@ -296,7 +305,7 @@ mutex_lock_interruptible_nested(struct m
 {
 	might_sleep();
 	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE,
-				   subclass, _RET_IP_);
+				   subclass, NULL, _RET_IP_);
 }
 
 EXPORT_SYMBOL_GPL(mutex_lock_interruptible_nested);
@@ -402,7 +411,7 @@ __mutex_lock_slowpath(atomic_t *lock_cou
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 
-	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0, _RET_IP_);
+	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0, NULL, _RET_IP_);
 }
 
 static noinline int __sched
@@ -410,7 +419,7 @@ __mutex_lock_killable_slowpath(atomic_t 
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 
-	return __mutex_lock_common(lock, TASK_KILLABLE, 0, _RET_IP_);
+	return __mutex_lock_common(lock, TASK_KILLABLE, 0, NULL, _RET_IP_);
 }
 
 static noinline int __sched
@@ -418,7 +427,7 @@ __mutex_lock_interruptible_slowpath(atom
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 
-	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, 0, _RET_IP_);
+	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, 0, NULL, _RET_IP_);
 }
 #endif
 



^ permalink raw reply	[flat|nested] 113+ messages in thread
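
A short sketch of the intended use, with made-up structures: lockdep normally
complains when a second mutex of the same class is taken, but
mutex_lock_nest_lock() tells it that all such acquisitions are serialized by
the given outer lock, which is the mm_take_all_locks() pattern converted
later in this series.

#include <linux/list.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(all_objs_lock);	/* outer lock serializing the walk */

struct obj {
	struct list_head node;
	struct mutex lock;		/* many instances, one lockdep class */
};

static void lock_all_objs(struct list_head *objs)
{
	struct obj *o;

	mutex_lock(&all_objs_lock);
	list_for_each_entry(o, objs, node) {
		/* the nesting is covered by all_objs_lock; tell lockdep so */
		mutex_lock_nest_lock(&o->lock, &all_objs_lock);
	}
}

Without CONFIG_DEBUG_LOCK_ALLOC the annotation falls back to a plain
mutex_lock(), as the header change above shows, so there is no extra cost in
production builds.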

* [PATCH 11/13] mutex: Provide mutex_is_contended
  2010-04-08 19:17 ` Peter Zijlstra
@ 2010-04-08 19:17   ` Peter Zijlstra
  -1 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-08 19:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Peter Zijlstra

[-- Attachment #1: mutex-is-contended.patch --]
[-- Type: text/plain, Size: 677 bytes --]

Usable for lock-breaks and such, as a mutex counterpart to spin_needbreak().

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mutex.h |    5 +++++
 1 file changed, 5 insertions(+)

Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -118,6 +118,11 @@ static inline int mutex_is_locked(struct
 	return atomic_read(&lock->count) != 1;
 }
 
+static inline int mutex_is_contended(struct mutex *lock)
+{
+	return atomic_read(&lock->count) < 0;
+}
+
 /*
  * See kernel/mutex.c for detailed documentation of these APIs.
  * Also see Documentation/mutex-design.txt.



^ permalink raw reply	[flat|nested] 113+ messages in thread
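
A sketch of the lock-break pattern this enables; the worklist structure and
handle_one() below are placeholders invented for the example, and the next
patch applies the same idea when it replaces spin_needbreak() on i_mmap_lock:

#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/sched.h>

struct worklist {
	struct mutex lock;
	struct list_head items;
};

static void handle_one(struct worklist *wl)
{
	/* placeholder: consume the first item */
	list_del_init(wl->items.next);
}

static void process_worklist(struct worklist *wl)
{
	mutex_lock(&wl->lock);
	while (!list_empty(&wl->items)) {
		handle_one(wl);
		/* drop the lock if we should resched or someone is waiting */
		if (need_resched() || mutex_is_contended(&wl->lock)) {
			mutex_unlock(&wl->lock);
			cond_resched();
			mutex_lock(&wl->lock);
		}
	}
	mutex_unlock(&wl->lock);
}

Note that mutex_is_contended() only reports that waiters are queued (the
count went negative); it is a hint for voluntarily yielding the lock, not a
correctness requirement.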

* [PATCH 12/13] mm: Convert i_mmap_lock and anon_vma->lock to mutexes
  2010-04-08 19:17 ` Peter Zijlstra
@ 2010-04-08 19:17   ` Peter Zijlstra
  -1 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-08 19:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Peter Zijlstra

[-- Attachment #1: mm-mutex.patch --]
[-- Type: text/plain, Size: 22758 bytes --]

Straight fwd conversion of i_mmap_lock and anon_vma->lock to mutexes.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/mm/hugetlbpage.c |    4 ++--
 fs/hugetlbfs/inode.c      |    4 ++--
 fs/inode.c                |    2 +-
 include/linux/fs.h        |    2 +-
 include/linux/mm.h        |    2 +-
 include/linux/rmap.h      |   10 +++++-----
 kernel/fork.c             |    4 ++--
 mm/filemap_xip.c          |    4 ++--
 mm/fremap.c               |    4 ++--
 mm/hugetlb.c              |   12 ++++++------
 mm/ksm.c                  |   16 ++++++++--------
 mm/memory-failure.c       |    4 ++--
 mm/memory.c               |   14 +++++++-------
 mm/mmap.c                 |   20 ++++++++++----------
 mm/mremap.c               |    4 ++--
 mm/rmap.c                 |   42 +++++++++++++++++++++---------------------
 16 files changed, 74 insertions(+), 74 deletions(-)

Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c
+++ linux-2.6/arch/x86/mm/hugetlbpage.c
@@ -73,7 +73,7 @@ static void huge_pmd_share(struct mm_str
 	if (!vma_shareable(vma, addr))
 		return;
 
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -98,7 +98,7 @@ static void huge_pmd_share(struct mm_str
 		put_page(virt_to_page(spte));
 	spin_unlock(&mm->page_table_lock);
 out:
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 }
 
 /*
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -429,10 +429,10 @@ static int hugetlb_vmtruncate(struct ino
 	pgoff = offset >> PAGE_SHIFT;
 
 	i_size_write(inode, offset);
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	if (!prio_tree_empty(&mapping->i_mmap))
 		hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	truncate_hugepages(inode, offset);
 	return 0;
 }
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -258,7 +258,7 @@ void inode_init_once(struct inode *inode
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	spin_lock_init(&inode->i_data.tree_lock);
-	spin_lock_init(&inode->i_data.i_mmap_lock);
+	mutex_init(&inode->i_data.i_mmap_lock);
 	INIT_LIST_HEAD(&inode->i_data.private_list);
 	spin_lock_init(&inode->i_data.private_lock);
 	INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -627,7 +627,7 @@ struct address_space {
 	unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
 	struct prio_tree_root	i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
-	spinlock_t		i_mmap_lock;	/* protect tree, count, list */
+	struct mutex		i_mmap_lock;	/* protect tree, count, list */
 	unsigned int		truncate_count;	/* Cover race condition with truncate */
 	unsigned long		nrpages;	/* number of total pages */
 	pgoff_t			writeback_index;/* writeback starts here */
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -749,7 +749,7 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
-	spinlock_t *i_mmap_lock;		/* For unmap_mapping_range: */
+	struct mutex *i_mmap_lock;		/* For unmap_mapping_range: */
 	unsigned long truncate_count;		/* Compare vm_truncate_count */
 };
 
Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h
+++ linux-2.6/include/linux/rmap.h
@@ -7,7 +7,7 @@
 #include <linux/list.h>
 #include <linux/slab.h>
 #include <linux/mm.h>
-#include <linux/spinlock.h>
+#include <linux/mutex.h>
 #include <linux/memcontrol.h>
 
 /*
@@ -25,8 +25,7 @@
  * pointing to this anon_vma once its vma list is empty.
  */
 struct anon_vma {
-	spinlock_t lock;	/* Serialize access to vma list */
-	atomic_t ref;
+	struct mutex lock;	/* Serialize access to vma list */
 	/*
 	 * NOTE: the LSB of the head.next is set by
 	 * mm_take_all_locks() _after_ taking the above lock. So the
@@ -36,6 +35,7 @@ struct anon_vma {
 	 * mm_take_all_locks() (mm_all_locks_mutex).
 	 */
 	struct list_head head;	/* Chain of private "related" vmas */
+	atomic_t ref;
 };
 
 /*
@@ -72,14 +72,14 @@ static inline void anon_vma_lock(struct 
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_lock(&anon_vma->lock);
+		mutex_lock(&anon_vma->lock);
 }
 
 static inline void anon_vma_unlock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_unlock(&anon_vma->lock);
+		mutex_unlock(&anon_vma->lock);
 }
 
 /*
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -355,7 +355,7 @@ static int dup_mmap(struct mm_struct *mm
 			get_file(file);
 			if (tmp->vm_flags & VM_DENYWRITE)
 				atomic_dec(&inode->i_writecount);
-			spin_lock(&mapping->i_mmap_lock);
+			mutex_lock(&mapping->i_mmap_lock);
 			if (tmp->vm_flags & VM_SHARED)
 				mapping->i_mmap_writable++;
 			tmp->vm_truncate_count = mpnt->vm_truncate_count;
@@ -363,7 +363,7 @@ static int dup_mmap(struct mm_struct *mm
 			/* insert tmp into the share list, just after mpnt */
 			vma_prio_tree_add(tmp, mpnt);
 			flush_dcache_mmap_unlock(mapping);
-			spin_unlock(&mapping->i_mmap_lock);
+			mutex_unlock(&mapping->i_mmap_lock);
 		}
 
 		/*
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -182,7 +182,7 @@ __xip_unmap (struct address_space * mapp
 		return;
 
 retry:
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		mm = vma->vm_mm;
 		address = vma->vm_start +
@@ -200,7 +200,7 @@ retry:
 			page_cache_release(page);
 		}
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 
 	if (locked) {
 		mutex_unlock(&xip_sparse_mutex);
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c
+++ linux-2.6/mm/fremap.c
@@ -208,13 +208,13 @@ SYSCALL_DEFINE5(remap_file_pages, unsign
 			}
 			goto out;
 		}
-		spin_lock(&mapping->i_mmap_lock);
+		mutex_lock(&mapping->i_mmap_lock);
 		flush_dcache_mmap_lock(mapping);
 		vma->vm_flags |= VM_NONLINEAR;
 		vma_prio_tree_remove(vma, &mapping->i_mmap);
 		vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
 		flush_dcache_mmap_unlock(mapping);
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 	}
 
 	if (vma->vm_flags & VM_LOCKED) {
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -2210,9 +2210,9 @@ void __unmap_hugepage_range(struct vm_ar
 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 			  unsigned long end, struct page *ref_page)
 {
-	spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+	mutex_lock(&vma->vm_file->f_mapping->i_mmap_lock);
 	__unmap_hugepage_range(vma, start, end, ref_page);
-	spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
 }
 
 /*
@@ -2244,7 +2244,7 @@ static int unmap_ref_private(struct mm_s
 	 * this mapping should be shared between all the VMAs,
 	 * __unmap_hugepage_range() is called as the lock is already held
 	 */
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(iter_vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		/* Do not unmap the current VMA */
 		if (iter_vma == vma)
@@ -2262,7 +2262,7 @@ static int unmap_ref_private(struct mm_s
 				address, address + huge_page_size(h),
 				page);
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 
 	return 1;
 }
@@ -2678,7 +2678,7 @@ void hugetlb_change_protection(struct vm
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+	mutex_lock(&vma->vm_file->f_mapping->i_mmap_lock);
 	spin_lock(&mm->page_table_lock);
 	for (; address < end; address += huge_page_size(h)) {
 		ptep = huge_pte_offset(mm, address);
@@ -2693,7 +2693,7 @@ void hugetlb_change_protection(struct vm
 		}
 	}
 	spin_unlock(&mm->page_table_lock);
-	spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
 
 	flush_tlb_range(vma, start, end);
 }
Index: linux-2.6/mm/ksm.c
===================================================================
--- linux-2.6.orig/mm/ksm.c
+++ linux-2.6/mm/ksm.c
@@ -1559,7 +1559,7 @@ again:
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 
-		spin_lock(&anon_vma->lock);
+		mutex_lock(&anon_vma->lock);
 		list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
 			vma = vmac->vma;
 			if (rmap_item->address < vma->vm_start ||
@@ -1582,7 +1582,7 @@ again:
 			if (!search_new_forks || !mapcount)
 				break;
 		}
-		spin_unlock(&anon_vma->lock);
+		mutex_unlock(&anon_vma->lock);
 		if (!mapcount)
 			goto out;
 	}
@@ -1612,7 +1612,7 @@ again:
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 
-		spin_lock(&anon_vma->lock);
+		mutex_lock(&anon_vma->lock);
 		list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
 			vma = vmac->vma;
 			if (rmap_item->address < vma->vm_start ||
@@ -1630,11 +1630,11 @@ again:
 			ret = try_to_unmap_one(page, vma,
 					rmap_item->address, flags);
 			if (ret != SWAP_AGAIN || !page_mapped(page)) {
-				spin_unlock(&anon_vma->lock);
+				mutex_unlock(&anon_vma->lock);
 				goto out;
 			}
 		}
-		spin_unlock(&anon_vma->lock);
+		mutex_unlock(&anon_vma->lock);
 	}
 	if (!search_new_forks++)
 		goto again;
@@ -1664,7 +1664,7 @@ again:
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 
-		spin_lock(&anon_vma->lock);
+		mutex_lock(&anon_vma->lock);
 		list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
 			vma = vmac->vma;
 			if (rmap_item->address < vma->vm_start ||
@@ -1681,11 +1681,11 @@ again:
 
 			ret = rmap_one(page, vma, rmap_item->address, arg);
 			if (ret != SWAP_AGAIN) {
-				spin_unlock(&anon_vma->lock);
+				mutex_unlock(&anon_vma->lock);
 				goto out;
 			}
 		}
-		spin_unlock(&anon_vma->lock);
+		mutex_unlock(&anon_vma->lock);
 	}
 	if (!search_new_forks++)
 		goto again;
Index: linux-2.6/mm/memory-failure.c
===================================================================
--- linux-2.6.orig/mm/memory-failure.c
+++ linux-2.6/mm/memory-failure.c
@@ -421,7 +421,7 @@ static void collect_procs_file(struct pa
 	 */
 
 	read_lock(&tasklist_lock);
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	for_each_process(tsk) {
 		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 
@@ -441,7 +441,7 @@ static void collect_procs_file(struct pa
 				add_to_kill(tsk, page, vma, to_kill, tkc);
 		}
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	read_unlock(&tasklist_lock);
 }
 
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1102,7 +1102,7 @@ unsigned long unmap_vmas(struct mmu_gath
 {
 	long zap_work = ZAP_BLOCK_SIZE;
 	unsigned long start = start_addr;
-	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
+	struct mutex *i_mmap_lock = details ? details->i_mmap_lock : NULL;
 	struct mm_struct *mm = vma->vm_mm;
 
 	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
@@ -1152,7 +1152,7 @@ unsigned long unmap_vmas(struct mmu_gath
 			}
 
 			if (need_resched() ||
-				(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
+				(i_mmap_lock && mutex_is_contended(i_mmap_lock))) {
 				if (i_mmap_lock)
 					goto out;
 				cond_resched();
@@ -2427,7 +2427,7 @@ again:
 
 	restart_addr = zap_page_range(vma, start_addr,
 					end_addr - start_addr, details);
-	need_break = need_resched() || spin_needbreak(details->i_mmap_lock);
+	need_break = need_resched() || mutex_is_contended(details->i_mmap_lock);
 
 	if (restart_addr >= end_addr) {
 		/* We have now completed this vma: mark it so */
@@ -2441,9 +2441,9 @@ again:
 			goto again;
 	}
 
-	spin_unlock(details->i_mmap_lock);
+	mutex_unlock(details->i_mmap_lock);
 	cond_resched();
-	spin_lock(details->i_mmap_lock);
+	mutex_lock(details->i_mmap_lock);
 	return -EINTR;
 }
 
@@ -2539,7 +2539,7 @@ void unmap_mapping_range(struct address_
 		details.last_index = ULONG_MAX;
 	details.i_mmap_lock = &mapping->i_mmap_lock;
 
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 
 	/* Protect against endless unmapping loops */
 	mapping->truncate_count++;
@@ -2554,7 +2554,7 @@ void unmap_mapping_range(struct address_
 		unmap_mapping_range_tree(&mapping->i_mmap, &details);
 	if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
 		unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -216,9 +216,9 @@ void unlink_file_vma(struct vm_area_stru
 
 	if (file) {
 		struct address_space *mapping = file->f_mapping;
-		spin_lock(&mapping->i_mmap_lock);
+		mutex_lock(&mapping->i_mmap_lock);
 		__remove_shared_vm_struct(vma, file, mapping);
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 	}
 }
 
@@ -449,7 +449,7 @@ static void vma_link(struct mm_struct *m
 		mapping = vma->vm_file->f_mapping;
 
 	if (mapping) {
-		spin_lock(&mapping->i_mmap_lock);
+		mutex_lock(&mapping->i_mmap_lock);
 		vma->vm_truncate_count = mapping->truncate_count;
 	}
 	anon_vma_lock(vma);
@@ -459,7 +459,7 @@ static void vma_link(struct mm_struct *m
 
 	anon_vma_unlock(vma);
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 
 	mm->map_count++;
 	validate_mm(mm);
@@ -565,7 +565,7 @@ again:			remove_next = 1 + (end > next->
 		mapping = file->f_mapping;
 		if (!(vma->vm_flags & VM_NONLINEAR))
 			root = &mapping->i_mmap;
-		spin_lock(&mapping->i_mmap_lock);
+		mutex_lock(&mapping->i_mmap_lock);
 		if (importer &&
 		    vma->vm_truncate_count != next->vm_truncate_count) {
 			/*
@@ -626,7 +626,7 @@ again:			remove_next = 1 + (end > next->
 	}
 
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 
 	if (remove_next) {
 		if (file) {
@@ -2440,7 +2440,7 @@ static void vm_lock_anon_vma(struct mm_s
 		 * The LSB of head.next can't change from under us
 		 * because we hold the mm_all_locks_mutex.
 		 */
-		spin_lock_nest_lock(&anon_vma->lock, &mm->mmap_sem);
+		mutex_lock_nest_lock(&anon_vma->lock, &mm->mmap_sem);
 		/*
 		 * We can safely modify head.next after taking the
 		 * anon_vma->lock. If some other vma in this mm shares
@@ -2470,7 +2470,7 @@ static void vm_lock_mapping(struct mm_st
 		 */
 		if (test_and_set_bit(AS_MM_ALL_LOCKS, &mapping->flags))
 			BUG();
-		spin_lock_nest_lock(&mapping->i_mmap_lock, &mm->mmap_sem);
+		mutex_lock_nest_lock(&mapping->i_mmap_lock, &mm->mmap_sem);
 	}
 }
 
@@ -2558,7 +2558,7 @@ static void vm_unlock_anon_vma(struct an
 		if (!__test_and_clear_bit(0, (unsigned long *)
 					  &anon_vma->head.next))
 			BUG();
-		spin_unlock(&anon_vma->lock);
+		mutex_unlock(&anon_vma->lock);
 	}
 }
 
@@ -2569,7 +2569,7 @@ static void vm_unlock_mapping(struct add
 		 * AS_MM_ALL_LOCKS can't change to 0 from under us
 		 * because we hold the mm_all_locks_mutex.
 		 */
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 		if (!test_and_clear_bit(AS_MM_ALL_LOCKS,
 					&mapping->flags))
 			BUG();
Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c
+++ linux-2.6/mm/mremap.c
@@ -91,7 +91,7 @@ static void move_ptes(struct vm_area_str
 		 * and we propagate stale pages into the dst afterward.
 		 */
 		mapping = vma->vm_file->f_mapping;
-		spin_lock(&mapping->i_mmap_lock);
+		mutex_lock(&mapping->i_mmap_lock);
 		if (new_vma->vm_truncate_count &&
 		    new_vma->vm_truncate_count != vma->vm_truncate_count)
 			new_vma->vm_truncate_count = 0;
@@ -123,7 +123,7 @@ static void move_ptes(struct vm_area_str
 	pte_unmap_nested(new_pte - 1);
 	pte_unmap_unlock(old_pte - 1, old_ptl);
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
 }
 
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -140,7 +140,7 @@ int anon_vma_prepare(struct vm_area_stru
 				goto out_enomem_free_avc;
 			allocated = anon_vma;
 		}
-		spin_lock(&anon_vma->lock);
+		mutex_lock(&anon_vma->lock);
 
 		/* page_table_lock to protect against threads */
 		spin_lock(&mm->page_table_lock);
@@ -154,7 +154,7 @@ int anon_vma_prepare(struct vm_area_stru
 		}
 		spin_unlock(&mm->page_table_lock);
 
-		spin_unlock(&anon_vma->lock);
+		mutex_unlock(&anon_vma->lock);
 		if (unlikely(allocated)) {
 			anon_vma_put(allocated);
 			anon_vma_chain_free(avc);
@@ -176,9 +176,9 @@ static void anon_vma_chain_link(struct v
 	avc->anon_vma = anon_vma;
 	list_add(&avc->same_vma, &vma->anon_vma_chain);
 
-	spin_lock(&anon_vma->lock);
+	mutex_lock(&anon_vma->lock);
 	list_add_tail(&avc->same_anon_vma, &anon_vma->head);
-	spin_unlock(&anon_vma->lock);
+	mutex_unlock(&anon_vma->lock);
 }
 
 /*
@@ -251,10 +251,10 @@ static void anon_vma_unlink(struct anon_
 	if (!anon_vma)
 		return;
 
-	spin_lock(&anon_vma->lock);
+	mutex_lock(&anon_vma->lock);
 	list_del(&anon_vma_chain->same_anon_vma);
 	empty = list_empty(&anon_vma->head);
-	spin_unlock(&anon_vma->lock);
+	mutex_unlock(&anon_vma->lock);
 
 	if (empty)
 		anon_vma_put(anon_vma);
@@ -276,7 +276,7 @@ static void anon_vma_ctor(void *data)
 {
 	struct anon_vma *anon_vma = data;
 
-	spin_lock_init(&anon_vma->lock);
+	mutex_init(&anon_vma->lock);
 	atomic_set(&anon_vma->ref, 0);
 	INIT_LIST_HEAD(&anon_vma->head);
 }
@@ -317,14 +317,14 @@ struct anon_vma *page_lock_anon_vma(stru
 	struct anon_vma *anon_vma = anon_vma_get(page);
 
 	if (anon_vma)
-		spin_lock(&anon_vma->lock);
+		mutex_lock(&anon_vma->lock);
 
 	return anon_vma;
 }
 
 void page_unlock_anon_vma(struct anon_vma *anon_vma)
 {
-	spin_unlock(&anon_vma->lock);
+	mutex_unlock(&anon_vma->lock);
 	anon_vma_put(anon_vma);
 }
 
@@ -569,7 +569,7 @@ static int page_referenced_file(struct p
 	 */
 	BUG_ON(!PageLocked(page));
 
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 
 	/*
 	 * i_mmap_lock does not stabilize mapcount at all, but mapcount
@@ -594,7 +594,7 @@ static int page_referenced_file(struct p
 			break;
 	}
 
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	return referenced;
 }
 
@@ -681,7 +681,7 @@ static int page_mkclean_file(struct addr
 
 	BUG_ON(PageAnon(page));
 
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		if (vma->vm_flags & VM_SHARED) {
 			unsigned long address = vma_address(page, vma);
@@ -690,7 +690,7 @@ static int page_mkclean_file(struct addr
 			ret += page_mkclean_one(page, vma, address);
 		}
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	return ret;
 }
 
@@ -1196,7 +1196,7 @@ static int try_to_unmap_file(struct page
 	unsigned long max_nl_size = 0;
 	unsigned int mapcount;
 
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		unsigned long address = vma_address(page, vma);
 		if (address == -EFAULT)
@@ -1242,7 +1242,7 @@ static int try_to_unmap_file(struct page
 	mapcount = page_mapcount(page);
 	if (!mapcount)
 		goto out;
-	cond_resched_lock(&mapping->i_mmap_lock);
+	cond_resched();
 
 	max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK;
 	if (max_nl_cursor == 0)
@@ -1264,7 +1264,7 @@ static int try_to_unmap_file(struct page
 			}
 			vma->vm_private_data = (void *) max_nl_cursor;
 		}
-		cond_resched_lock(&mapping->i_mmap_lock);
+		cond_resched();
 		max_nl_cursor += CLUSTER_SIZE;
 	} while (max_nl_cursor <= max_nl_size);
 
@@ -1276,7 +1276,7 @@ static int try_to_unmap_file(struct page
 	list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
 		vma->vm_private_data = NULL;
 out:
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	return ret;
 }
 
@@ -1361,7 +1361,7 @@ static int rmap_walk_anon(struct page *p
 	anon_vma = page_anon_vma(page);
 	if (!anon_vma)
 		return ret;
-	spin_lock(&anon_vma->lock);
+	mutex_lock(&anon_vma->lock);
 	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
 		struct vm_area_struct *vma = avc->vma;
 		unsigned long address = vma_address(page, vma);
@@ -1371,7 +1371,7 @@ static int rmap_walk_anon(struct page *p
 		if (ret != SWAP_AGAIN)
 			break;
 	}
-	spin_unlock(&anon_vma->lock);
+	mutex_unlock(&anon_vma->lock);
 	return ret;
 }
 
@@ -1386,7 +1386,7 @@ static int rmap_walk_file(struct page *p
 
 	if (!mapping)
 		return ret;
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		unsigned long address = vma_address(page, vma);
 		if (address == -EFAULT)
@@ -1400,7 +1400,7 @@ static int rmap_walk_file(struct page *p
 	 * never contain migration ptes.  Decide what to do about this
 	 * limitation to linear when we need rmap_walk() on nonlinear.
 	 */
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	return ret;
 }
 



^ permalink raw reply	[flat|nested] 113+ messages in thread
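
One consequence of the conversion worth spelling out: a spinlock may still be
taken inside the new mutexes (hugetlb_change_protection() above keeps taking
mm->page_table_lock under i_mmap_lock), but the reverse order, or taking
i_mmap_lock/anon_vma->lock with preemption or IRQs disabled, is no longer
legal since both are now sleeping locks. A minimal sketch of the allowed
nesting, using a hypothetical data structure:

#include <linux/mutex.h>
#include <linux/spinlock.h>

struct thing {
	struct mutex big_lock;		/* sleeping lock, like i_mmap_lock */
	spinlock_t small_lock;		/* non-sleeping inner lock */
	int counter;
};

static void update_thing(struct thing *t)
{
	mutex_lock(&t->big_lock);	/* may sleep: process context only */
	spin_lock(&t->small_lock);	/* fine: a spinlock nests inside a mutex */
	t->counter++;
	spin_unlock(&t->small_lock);
	mutex_unlock(&t->big_lock);	/* never the other way around */
}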

* [PATCH 12/13] mm: Convert i_mmap_lock and anon_vma->lock to mutexes
@ 2010-04-08 19:17   ` Peter Zijlstra
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-08 19:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Peter Zijlstra

[-- Attachment #1: mm-mutex.patch --]
[-- Type: text/plain, Size: 22756 bytes --]

Straight fwd conversion of i_mmap_lock and anon_vma->lock to mutexes.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/mm/hugetlbpage.c |    4 ++--
 fs/hugetlbfs/inode.c      |    4 ++--
 fs/inode.c                |    2 +-
 include/linux/fs.h        |    2 +-
 include/linux/mm.h        |    2 +-
 include/linux/rmap.h      |   10 +++++-----
 kernel/fork.c             |    4 ++--
 mm/filemap_xip.c          |    4 ++--
 mm/fremap.c               |    4 ++--
 mm/hugetlb.c              |   12 ++++++------
 mm/ksm.c                  |   16 ++++++++--------
 mm/memory-failure.c       |    4 ++--
 mm/memory.c               |   14 +++++++-------
 mm/mmap.c                 |   20 ++++++++++----------
 mm/mremap.c               |    4 ++--
 mm/rmap.c                 |   42 +++++++++++++++++++++---------------------
 16 files changed, 74 insertions(+), 74 deletions(-)

Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c
+++ linux-2.6/arch/x86/mm/hugetlbpage.c
@@ -73,7 +73,7 @@ static void huge_pmd_share(struct mm_str
 	if (!vma_shareable(vma, addr))
 		return;
 
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -98,7 +98,7 @@ static void huge_pmd_share(struct mm_str
 		put_page(virt_to_page(spte));
 	spin_unlock(&mm->page_table_lock);
 out:
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 }
 
 /*
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -429,10 +429,10 @@ static int hugetlb_vmtruncate(struct ino
 	pgoff = offset >> PAGE_SHIFT;
 
 	i_size_write(inode, offset);
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	if (!prio_tree_empty(&mapping->i_mmap))
 		hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	truncate_hugepages(inode, offset);
 	return 0;
 }
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -258,7 +258,7 @@ void inode_init_once(struct inode *inode
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	spin_lock_init(&inode->i_data.tree_lock);
-	spin_lock_init(&inode->i_data.i_mmap_lock);
+	mutex_init(&inode->i_data.i_mmap_lock);
 	INIT_LIST_HEAD(&inode->i_data.private_list);
 	spin_lock_init(&inode->i_data.private_lock);
 	INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -627,7 +627,7 @@ struct address_space {
 	unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
 	struct prio_tree_root	i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
-	spinlock_t		i_mmap_lock;	/* protect tree, count, list */
+	struct mutex		i_mmap_lock;	/* protect tree, count, list */
 	unsigned int		truncate_count;	/* Cover race condition with truncate */
 	unsigned long		nrpages;	/* number of total pages */
 	pgoff_t			writeback_index;/* writeback starts here */
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -749,7 +749,7 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
-	spinlock_t *i_mmap_lock;		/* For unmap_mapping_range: */
+	struct mutex *i_mmap_lock;		/* For unmap_mapping_range: */
 	unsigned long truncate_count;		/* Compare vm_truncate_count */
 };
 
Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h
+++ linux-2.6/include/linux/rmap.h
@@ -7,7 +7,7 @@
 #include <linux/list.h>
 #include <linux/slab.h>
 #include <linux/mm.h>
-#include <linux/spinlock.h>
+#include <linux/mutex.h>
 #include <linux/memcontrol.h>
 
 /*
@@ -25,8 +25,7 @@
  * pointing to this anon_vma once its vma list is empty.
  */
 struct anon_vma {
-	spinlock_t lock;	/* Serialize access to vma list */
-	atomic_t ref;
+	struct mutex lock;	/* Serialize access to vma list */
 	/*
 	 * NOTE: the LSB of the head.next is set by
 	 * mm_take_all_locks() _after_ taking the above lock. So the
@@ -36,6 +35,7 @@ struct anon_vma {
 	 * mm_take_all_locks() (mm_all_locks_mutex).
 	 */
 	struct list_head head;	/* Chain of private "related" vmas */
+	atomic_t ref;
 };
 
 /*
@@ -72,14 +72,14 @@ static inline void anon_vma_lock(struct 
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_lock(&anon_vma->lock);
+		mutex_lock(&anon_vma->lock);
 }
 
 static inline void anon_vma_unlock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_unlock(&anon_vma->lock);
+		mutex_unlock(&anon_vma->lock);
 }
 
 /*
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -355,7 +355,7 @@ static int dup_mmap(struct mm_struct *mm
 			get_file(file);
 			if (tmp->vm_flags & VM_DENYWRITE)
 				atomic_dec(&inode->i_writecount);
-			spin_lock(&mapping->i_mmap_lock);
+			mutex_lock(&mapping->i_mmap_lock);
 			if (tmp->vm_flags & VM_SHARED)
 				mapping->i_mmap_writable++;
 			tmp->vm_truncate_count = mpnt->vm_truncate_count;
@@ -363,7 +363,7 @@ static int dup_mmap(struct mm_struct *mm
 			/* insert tmp into the share list, just after mpnt */
 			vma_prio_tree_add(tmp, mpnt);
 			flush_dcache_mmap_unlock(mapping);
-			spin_unlock(&mapping->i_mmap_lock);
+			mutex_unlock(&mapping->i_mmap_lock);
 		}
 
 		/*
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -182,7 +182,7 @@ __xip_unmap (struct address_space * mapp
 		return;
 
 retry:
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		mm = vma->vm_mm;
 		address = vma->vm_start +
@@ -200,7 +200,7 @@ retry:
 			page_cache_release(page);
 		}
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 
 	if (locked) {
 		mutex_unlock(&xip_sparse_mutex);
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c
+++ linux-2.6/mm/fremap.c
@@ -208,13 +208,13 @@ SYSCALL_DEFINE5(remap_file_pages, unsign
 			}
 			goto out;
 		}
-		spin_lock(&mapping->i_mmap_lock);
+		mutex_lock(&mapping->i_mmap_lock);
 		flush_dcache_mmap_lock(mapping);
 		vma->vm_flags |= VM_NONLINEAR;
 		vma_prio_tree_remove(vma, &mapping->i_mmap);
 		vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
 		flush_dcache_mmap_unlock(mapping);
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 	}
 
 	if (vma->vm_flags & VM_LOCKED) {
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -2210,9 +2210,9 @@ void __unmap_hugepage_range(struct vm_ar
 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 			  unsigned long end, struct page *ref_page)
 {
-	spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+	mutex_lock(&vma->vm_file->f_mapping->i_mmap_lock);
 	__unmap_hugepage_range(vma, start, end, ref_page);
-	spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
 }
 
 /*
@@ -2244,7 +2244,7 @@ static int unmap_ref_private(struct mm_s
 	 * this mapping should be shared between all the VMAs,
 	 * __unmap_hugepage_range() is called as the lock is already held
 	 */
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(iter_vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		/* Do not unmap the current VMA */
 		if (iter_vma == vma)
@@ -2262,7 +2262,7 @@ static int unmap_ref_private(struct mm_s
 				address, address + huge_page_size(h),
 				page);
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 
 	return 1;
 }
@@ -2678,7 +2678,7 @@ void hugetlb_change_protection(struct vm
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+	mutex_lock(&vma->vm_file->f_mapping->i_mmap_lock);
 	spin_lock(&mm->page_table_lock);
 	for (; address < end; address += huge_page_size(h)) {
 		ptep = huge_pte_offset(mm, address);
@@ -2693,7 +2693,7 @@ void hugetlb_change_protection(struct vm
 		}
 	}
 	spin_unlock(&mm->page_table_lock);
-	spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
 
 	flush_tlb_range(vma, start, end);
 }
Index: linux-2.6/mm/ksm.c
===================================================================
--- linux-2.6.orig/mm/ksm.c
+++ linux-2.6/mm/ksm.c
@@ -1559,7 +1559,7 @@ again:
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 
-		spin_lock(&anon_vma->lock);
+		mutex_lock(&anon_vma->lock);
 		list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
 			vma = vmac->vma;
 			if (rmap_item->address < vma->vm_start ||
@@ -1582,7 +1582,7 @@ again:
 			if (!search_new_forks || !mapcount)
 				break;
 		}
-		spin_unlock(&anon_vma->lock);
+		mutex_unlock(&anon_vma->lock);
 		if (!mapcount)
 			goto out;
 	}
@@ -1612,7 +1612,7 @@ again:
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 
-		spin_lock(&anon_vma->lock);
+		mutex_lock(&anon_vma->lock);
 		list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
 			vma = vmac->vma;
 			if (rmap_item->address < vma->vm_start ||
@@ -1630,11 +1630,11 @@ again:
 			ret = try_to_unmap_one(page, vma,
 					rmap_item->address, flags);
 			if (ret != SWAP_AGAIN || !page_mapped(page)) {
-				spin_unlock(&anon_vma->lock);
+				mutex_unlock(&anon_vma->lock);
 				goto out;
 			}
 		}
-		spin_unlock(&anon_vma->lock);
+		mutex_unlock(&anon_vma->lock);
 	}
 	if (!search_new_forks++)
 		goto again;
@@ -1664,7 +1664,7 @@ again:
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 
-		spin_lock(&anon_vma->lock);
+		mutex_lock(&anon_vma->lock);
 		list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
 			vma = vmac->vma;
 			if (rmap_item->address < vma->vm_start ||
@@ -1681,11 +1681,11 @@ again:
 
 			ret = rmap_one(page, vma, rmap_item->address, arg);
 			if (ret != SWAP_AGAIN) {
-				spin_unlock(&anon_vma->lock);
+				mutex_unlock(&anon_vma->lock);
 				goto out;
 			}
 		}
-		spin_unlock(&anon_vma->lock);
+		mutex_unlock(&anon_vma->lock);
 	}
 	if (!search_new_forks++)
 		goto again;
Index: linux-2.6/mm/memory-failure.c
===================================================================
--- linux-2.6.orig/mm/memory-failure.c
+++ linux-2.6/mm/memory-failure.c
@@ -421,7 +421,7 @@ static void collect_procs_file(struct pa
 	 */
 
 	read_lock(&tasklist_lock);
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	for_each_process(tsk) {
 		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 
@@ -441,7 +441,7 @@ static void collect_procs_file(struct pa
 				add_to_kill(tsk, page, vma, to_kill, tkc);
 		}
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	read_unlock(&tasklist_lock);
 }
 
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1102,7 +1102,7 @@ unsigned long unmap_vmas(struct mmu_gath
 {
 	long zap_work = ZAP_BLOCK_SIZE;
 	unsigned long start = start_addr;
-	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
+	struct mutex *i_mmap_lock = details ? details->i_mmap_lock : NULL;
 	struct mm_struct *mm = vma->vm_mm;
 
 	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
@@ -1152,7 +1152,7 @@ unsigned long unmap_vmas(struct mmu_gath
 			}
 
 			if (need_resched() ||
-				(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
+				(i_mmap_lock && mutex_is_contended(i_mmap_lock))) {
 				if (i_mmap_lock)
 					goto out;
 				cond_resched();
@@ -2427,7 +2427,7 @@ again:
 
 	restart_addr = zap_page_range(vma, start_addr,
 					end_addr - start_addr, details);
-	need_break = need_resched() || spin_needbreak(details->i_mmap_lock);
+	need_break = need_resched() || mutex_is_contended(details->i_mmap_lock);
 
 	if (restart_addr >= end_addr) {
 		/* We have now completed this vma: mark it so */
@@ -2441,9 +2441,9 @@ again:
 			goto again;
 	}
 
-	spin_unlock(details->i_mmap_lock);
+	mutex_unlock(details->i_mmap_lock);
 	cond_resched();
-	spin_lock(details->i_mmap_lock);
+	mutex_lock(details->i_mmap_lock);
 	return -EINTR;
 }
 
@@ -2539,7 +2539,7 @@ void unmap_mapping_range(struct address_
 		details.last_index = ULONG_MAX;
 	details.i_mmap_lock = &mapping->i_mmap_lock;
 
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 
 	/* Protect against endless unmapping loops */
 	mapping->truncate_count++;
@@ -2554,7 +2554,7 @@ void unmap_mapping_range(struct address_
 		unmap_mapping_range_tree(&mapping->i_mmap, &details);
 	if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
 		unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -216,9 +216,9 @@ void unlink_file_vma(struct vm_area_stru
 
 	if (file) {
 		struct address_space *mapping = file->f_mapping;
-		spin_lock(&mapping->i_mmap_lock);
+		mutex_lock(&mapping->i_mmap_lock);
 		__remove_shared_vm_struct(vma, file, mapping);
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 	}
 }
 
@@ -449,7 +449,7 @@ static void vma_link(struct mm_struct *m
 		mapping = vma->vm_file->f_mapping;
 
 	if (mapping) {
-		spin_lock(&mapping->i_mmap_lock);
+		mutex_lock(&mapping->i_mmap_lock);
 		vma->vm_truncate_count = mapping->truncate_count;
 	}
 	anon_vma_lock(vma);
@@ -459,7 +459,7 @@ static void vma_link(struct mm_struct *m
 
 	anon_vma_unlock(vma);
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 
 	mm->map_count++;
 	validate_mm(mm);
@@ -565,7 +565,7 @@ again:			remove_next = 1 + (end > next->
 		mapping = file->f_mapping;
 		if (!(vma->vm_flags & VM_NONLINEAR))
 			root = &mapping->i_mmap;
-		spin_lock(&mapping->i_mmap_lock);
+		mutex_lock(&mapping->i_mmap_lock);
 		if (importer &&
 		    vma->vm_truncate_count != next->vm_truncate_count) {
 			/*
@@ -626,7 +626,7 @@ again:			remove_next = 1 + (end > next->
 	}
 
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 
 	if (remove_next) {
 		if (file) {
@@ -2440,7 +2440,7 @@ static void vm_lock_anon_vma(struct mm_s
 		 * The LSB of head.next can't change from under us
 		 * because we hold the mm_all_locks_mutex.
 		 */
-		spin_lock_nest_lock(&anon_vma->lock, &mm->mmap_sem);
+		mutex_lock_nest_lock(&anon_vma->lock, &mm->mmap_sem);
 		/*
 		 * We can safely modify head.next after taking the
 		 * anon_vma->lock. If some other vma in this mm shares
@@ -2470,7 +2470,7 @@ static void vm_lock_mapping(struct mm_st
 		 */
 		if (test_and_set_bit(AS_MM_ALL_LOCKS, &mapping->flags))
 			BUG();
-		spin_lock_nest_lock(&mapping->i_mmap_lock, &mm->mmap_sem);
+		mutex_lock_nest_lock(&mapping->i_mmap_lock, &mm->mmap_sem);
 	}
 }
 
@@ -2558,7 +2558,7 @@ static void vm_unlock_anon_vma(struct an
 		if (!__test_and_clear_bit(0, (unsigned long *)
 					  &anon_vma->head.next))
 			BUG();
-		spin_unlock(&anon_vma->lock);
+		mutex_unlock(&anon_vma->lock);
 	}
 }
 
@@ -2569,7 +2569,7 @@ static void vm_unlock_mapping(struct add
 		 * AS_MM_ALL_LOCKS can't change to 0 from under us
 		 * because we hold the mm_all_locks_mutex.
 		 */
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 		if (!test_and_clear_bit(AS_MM_ALL_LOCKS,
 					&mapping->flags))
 			BUG();
Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c
+++ linux-2.6/mm/mremap.c
@@ -91,7 +91,7 @@ static void move_ptes(struct vm_area_str
 		 * and we propagate stale pages into the dst afterward.
 		 */
 		mapping = vma->vm_file->f_mapping;
-		spin_lock(&mapping->i_mmap_lock);
+		mutex_lock(&mapping->i_mmap_lock);
 		if (new_vma->vm_truncate_count &&
 		    new_vma->vm_truncate_count != vma->vm_truncate_count)
 			new_vma->vm_truncate_count = 0;
@@ -123,7 +123,7 @@ static void move_ptes(struct vm_area_str
 	pte_unmap_nested(new_pte - 1);
 	pte_unmap_unlock(old_pte - 1, old_ptl);
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
 }
 
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -140,7 +140,7 @@ int anon_vma_prepare(struct vm_area_stru
 				goto out_enomem_free_avc;
 			allocated = anon_vma;
 		}
-		spin_lock(&anon_vma->lock);
+		mutex_lock(&anon_vma->lock);
 
 		/* page_table_lock to protect against threads */
 		spin_lock(&mm->page_table_lock);
@@ -154,7 +154,7 @@ int anon_vma_prepare(struct vm_area_stru
 		}
 		spin_unlock(&mm->page_table_lock);
 
-		spin_unlock(&anon_vma->lock);
+		mutex_unlock(&anon_vma->lock);
 		if (unlikely(allocated)) {
 			anon_vma_put(allocated);
 			anon_vma_chain_free(avc);
@@ -176,9 +176,9 @@ static void anon_vma_chain_link(struct v
 	avc->anon_vma = anon_vma;
 	list_add(&avc->same_vma, &vma->anon_vma_chain);
 
-	spin_lock(&anon_vma->lock);
+	mutex_lock(&anon_vma->lock);
 	list_add_tail(&avc->same_anon_vma, &anon_vma->head);
-	spin_unlock(&anon_vma->lock);
+	mutex_unlock(&anon_vma->lock);
 }
 
 /*
@@ -251,10 +251,10 @@ static void anon_vma_unlink(struct anon_
 	if (!anon_vma)
 		return;
 
-	spin_lock(&anon_vma->lock);
+	mutex_lock(&anon_vma->lock);
 	list_del(&anon_vma_chain->same_anon_vma);
 	empty = list_empty(&anon_vma->head);
-	spin_unlock(&anon_vma->lock);
+	mutex_unlock(&anon_vma->lock);
 
 	if (empty)
 		anon_vma_put(anon_vma);
@@ -276,7 +276,7 @@ static void anon_vma_ctor(void *data)
 {
 	struct anon_vma *anon_vma = data;
 
-	spin_lock_init(&anon_vma->lock);
+	mutex_init(&anon_vma->lock);
 	atomic_set(&anon_vma->ref, 0);
 	INIT_LIST_HEAD(&anon_vma->head);
 }
@@ -317,14 +317,14 @@ struct anon_vma *page_lock_anon_vma(stru
 	struct anon_vma *anon_vma = anon_vma_get(page);
 
 	if (anon_vma)
-		spin_lock(&anon_vma->lock);
+		mutex_lock(&anon_vma->lock);
 
 	return anon_vma;
 }
 
 void page_unlock_anon_vma(struct anon_vma *anon_vma)
 {
-	spin_unlock(&anon_vma->lock);
+	mutex_unlock(&anon_vma->lock);
 	anon_vma_put(anon_vma);
 }
 
@@ -569,7 +569,7 @@ static int page_referenced_file(struct p
 	 */
 	BUG_ON(!PageLocked(page));
 
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 
 	/*
 	 * i_mmap_lock does not stabilize mapcount at all, but mapcount
@@ -594,7 +594,7 @@ static int page_referenced_file(struct p
 			break;
 	}
 
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	return referenced;
 }
 
@@ -681,7 +681,7 @@ static int page_mkclean_file(struct addr
 
 	BUG_ON(PageAnon(page));
 
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		if (vma->vm_flags & VM_SHARED) {
 			unsigned long address = vma_address(page, vma);
@@ -690,7 +690,7 @@ static int page_mkclean_file(struct addr
 			ret += page_mkclean_one(page, vma, address);
 		}
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	return ret;
 }
 
@@ -1196,7 +1196,7 @@ static int try_to_unmap_file(struct page
 	unsigned long max_nl_size = 0;
 	unsigned int mapcount;
 
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		unsigned long address = vma_address(page, vma);
 		if (address == -EFAULT)
@@ -1242,7 +1242,7 @@ static int try_to_unmap_file(struct page
 	mapcount = page_mapcount(page);
 	if (!mapcount)
 		goto out;
-	cond_resched_lock(&mapping->i_mmap_lock);
+	cond_resched();
 
 	max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK;
 	if (max_nl_cursor == 0)
@@ -1264,7 +1264,7 @@ static int try_to_unmap_file(struct page
 			}
 			vma->vm_private_data = (void *) max_nl_cursor;
 		}
-		cond_resched_lock(&mapping->i_mmap_lock);
+		cond_resched();
 		max_nl_cursor += CLUSTER_SIZE;
 	} while (max_nl_cursor <= max_nl_size);
 
@@ -1276,7 +1276,7 @@ static int try_to_unmap_file(struct page
 	list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
 		vma->vm_private_data = NULL;
 out:
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	return ret;
 }
 
@@ -1361,7 +1361,7 @@ static int rmap_walk_anon(struct page *p
 	anon_vma = page_anon_vma(page);
 	if (!anon_vma)
 		return ret;
-	spin_lock(&anon_vma->lock);
+	mutex_lock(&anon_vma->lock);
 	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
 		struct vm_area_struct *vma = avc->vma;
 		unsigned long address = vma_address(page, vma);
@@ -1371,7 +1371,7 @@ static int rmap_walk_anon(struct page *p
 		if (ret != SWAP_AGAIN)
 			break;
 	}
-	spin_unlock(&anon_vma->lock);
+	mutex_unlock(&anon_vma->lock);
 	return ret;
 }
 
@@ -1386,7 +1386,7 @@ static int rmap_walk_file(struct page *p
 
 	if (!mapping)
 		return ret;
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		unsigned long address = vma_address(page, vma);
 		if (address == -EFAULT)
@@ -1400,7 +1400,7 @@ static int rmap_walk_file(struct page *p
 	 * never contain migration ptes.  Decide what to do about this
 	 * limitation to linear when we need rmap_walk() on nonlinear.
 	 */
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	return ret;
 }
 

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 13/13] mm: Optimize page_lock_anon_vma
  2010-04-08 19:17 ` Peter Zijlstra
@ 2010-04-08 19:17   ` Peter Zijlstra
  -1 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-08 19:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Peter Zijlstra

[-- Attachment #1: mm-opt-rmap.patch --]
[-- Type: text/plain, Size: 3275 bytes --]

Optimize page_lock_anon_vma() by removing the atomic ref count
ops from the fast path.

This complicates the code quite a bit, but might be worth it.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/rmap.c |   71 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 67 insertions(+), 4 deletions(-)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -78,6 +78,12 @@ static inline struct anon_vma *anon_vma_
 void anon_vma_free(struct anon_vma *anon_vma)
 {
 	VM_BUG_ON(atomic_read(&anon_vma->ref));
+	/*
+	 * Sync against the anon_vma->lock, so that we can hold the
+	 * lock without requiring a reference. See page_lock_anon_vma().
+	 */
+	mutex_lock(&anon_vma->lock);
+	mutex_unlock(&anon_vma->lock);
 	kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
@@ -291,7 +297,7 @@ void __init anon_vma_init(void)
 
 /*
  * Getting a lock on a stable anon_vma from a page off the LRU is
- * tricky: page_lock_anon_vma relies on RCU to guard against the races.
+ * tricky: anon_vma_get relies on RCU to guard against the races.
  */
 struct anon_vma *anon_vma_get(struct page *page)
 {
@@ -320,12 +326,70 @@ out:
 	return anon_vma;
 }
 
+/*
+ * Similar to anon_vma_get(), however it relies on the anon_vma->lock
+ * to pin the object. However since we cannot wait for the mutex
+ * acquisition inside the RCU read lock, we use the ref count
+ * in the slow path.
+ */
 struct anon_vma *page_lock_anon_vma(struct page *page)
 {
-	struct anon_vma *anon_vma = anon_vma_get(page);
+	struct anon_vma *anon_vma = NULL;
+	unsigned long anon_mapping;
+
+again:
+	rcu_read_lock();
+	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
+	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
+		goto unlock;
+	if (!page_mapped(page))
+		goto unlock;
+
+	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
+	if (!mutex_trylock(&anon_vma->lock)) {
+		/*
+		 * We failed to acquire the lock, take a ref so we can
+		 * drop the RCU read lock and sleep on it.
+		 */
+		if (!atomic_inc_not_zero(&anon_vma->ref)) {
+			/*
+			 * Failed to get a ref, we're dead, bail.
+			 */
+			anon_vma = NULL;
+			goto unlock;
+		}
+		rcu_read_unlock();
 
-	if (anon_vma)
 		mutex_lock(&anon_vma->lock);
+		/*
+		 * We got the lock, drop the temp. ref, if it was the last
+		 * one free it and bail.
+		 */
+		if (atomic_dec_and_test(&anon_vma->ref)) {
+			mutex_unlock(&anon_vma->lock);
+			anon_vma_free(anon_vma);
+			anon_vma = NULL;
+		}
+		goto out;
+	}
+	/*
+	 * Got the lock, check we're still alive. Seeing a ref
+	 * here guarantees the object will stay alive due to
+	 * anon_vma_free() syncing against the lock we now hold.
+	 */
+	smp_rmb(); /* Order against anon_vma_put() */
+	if (!atomic_read(&anon_vma->ref)) {
+		mutex_unlock(&anon_vma->lock);
+		anon_vma = NULL;
+	}
+
+unlock:
+	rcu_read_unlock();
+out:
+	if (anon_vma && page_rmapping(page) != anon_vma) {
+		mutex_unlock(&anon_vma->lock);
+		goto again;
+	}
 
 	return anon_vma;
 }
@@ -333,7 +397,6 @@ struct anon_vma *page_lock_anon_vma(stru
 void page_unlock_anon_vma(struct anon_vma *anon_vma)
 {
 	mutex_unlock(&anon_vma->lock);
-	anon_vma_put(anon_vma);
 }
 
 /*
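
For reference, a minimal sketch of the calling convention this patch leaves
behind: page_lock_anon_vma() now returns with only anon_vma->lock held (no
reference is taken on the fast path), and page_unlock_anon_vma() merely drops
the mutex.  The walker below is purely illustrative -- only the two helpers
and the anon_vma list layout come from the patch:

	#include <linux/rmap.h>
	#include <linux/mm.h>

	static int example_walk(struct page *page)
	{
		struct anon_vma *anon_vma;
		struct anon_vma_chain *avc;
		int ret = SWAP_AGAIN;

		/* Returns with anon_vma->lock held, or NULL; no ref is taken. */
		anon_vma = page_lock_anon_vma(page);
		if (!anon_vma)
			return ret;

		list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
			/* ... per-vma work, all of it under the mutex ... */
		}

		/* Drops the mutex only; there is no reference left to put. */
		page_unlock_anon_vma(anon_vma);
		return ret;
	}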



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 00/13] mm: preemptibility -v2
  2010-04-08 19:17 ` Peter Zijlstra
                   ` (14 preceding siblings ...)
  (?)
@ 2010-04-08 20:29 ` David Miller
  2010-04-08 20:35   ` Peter Zijlstra
  2010-04-09  1:00   ` David Miller
  -1 siblings, 2 replies; 113+ messages in thread
From: David Miller @ 2010-04-08 20:29 UTC (permalink / raw)
  To: a.p.zijlstra
  Cc: aarcange, avi, tglx, riel, mingo, akpm, torvalds, linux-kernel,
	linux-arch, benh, hugh.dickins, mel, npiggin

From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Thu, 08 Apr 2010 21:17:37 +0200

> This patch-set seems to build and boot on my x86_64 machines and even builds a
> kernel. I've also attempted powerpc and sparc, which I've compile tested with
> their respective defconfigs, remaining are (afaikt the rest uses the generic
> tlb bits):

Did a build and test boot of this on my 128-cpu Niagara2 box, seems to
work basically fine.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-08 19:17   ` Peter Zijlstra
  (?)
@ 2010-04-08 20:31   ` Rik van Riel
  -1 siblings, 0 replies; 113+ messages in thread
From: Rik van Riel @ 2010-04-08 20:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Ingo Molnar, akpm,
	Linus Torvalds, linux-kernel, linux-arch, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin

On 04/08/2010 03:17 PM, Peter Zijlstra wrote:
> The powerpc page table freeing relies on the fact that IRQs hold off
> an RCU grace period, this is currently true for all existing RCU
> implementations but is not an assumption Paul wants to support.
>
> Therefore, also take the RCU read lock along with disabling IRQs to
> ensure the RCU grace period does at least cover these lookups.
>
> Requested-by: Paul E. McKenney<paulmck@linux.vnet.ibm.com>
> Signed-off-by: Peter Zijlstra<a.p.zijlstra@chello.nl>
> Cc: Nick Piggin<npiggin@suse.de>
> Cc: Benjamin Herrenschmidt<benh@kernel.crashing.org>

Reviewed-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 00/13] mm: preemptibility -v2
  2010-04-08 20:29 ` [PATCH 00/13] mm: preemptibility -v2 David Miller
@ 2010-04-08 20:35   ` Peter Zijlstra
  2010-04-09  1:00   ` David Miller
  1 sibling, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-08 20:35 UTC (permalink / raw)
  To: David Miller
  Cc: aarcange, avi, tglx, riel, mingo, akpm, torvalds, linux-kernel,
	linux-arch, benh, hugh.dickins, mel, npiggin

On Thu, 2010-04-08 at 13:29 -0700, David Miller wrote:
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Thu, 08 Apr 2010 21:17:37 +0200
> 
> > This patch-set seems to build and boot on my x86_64 machines and even builds a
> > kernel. I've also attempted powerpc and sparc, which I've compile tested with
> > their respective defconfigs, remaining are (afaikt the rest uses the generic
> > tlb bits):
> 
> Did a build and test boot of this on my 128-cpu Niagara2 box, seems to
> work basically fine.

Wheee, thanks Dave!


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma()
  2010-04-08 19:17   ` Peter Zijlstra
  (?)
@ 2010-04-08 20:50   ` Rik van Riel
  -1 siblings, 0 replies; 113+ messages in thread
From: Rik van Riel @ 2010-04-08 20:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Ingo Molnar, akpm,
	Linus Torvalds, linux-kernel, linux-arch, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin

On 04/08/2010 03:17 PM, Peter Zijlstra wrote:
> There is nothing preventing the anon_vma from being detached while we
> are spinning to acquire the lock. Most (all?) current users end up
> calling something like vma_address(page, vma) on it, which has a
> fairly good chance of weeding out wonky vmas.
>
> However suppose the anon_vma got freed and re-used while we were
> waiting to acquire the lock, and the new anon_vma fits with the
> page->index (because that is the only thing vma_address() uses to
> determine if the page fits in a particular vma), we could end up
> traversing faulty anon_vma chains.
>
> Close this hole for good by re-validating that page->mapping still
> holds the very same anon_vma pointer after we acquire the lock, if not
> be utterly paranoid and retry the whole operation (which will very
> likely bail, because it's unlikely the page got attached to a different
> anon_vma in the meantime).
>
> Signed-off-by: Peter Zijlstra<a.p.zijlstra@chello.nl>
> Cc: Hugh Dickins<hugh.dickins@tiscali.co.uk>
> Cc: Linus Torvalds<torvalds@linux-foundation.org>

Reviewed-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 03/13] x86: Remove last traces of quicklist usage
  2010-04-08 19:17   ` Peter Zijlstra
  (?)
@ 2010-04-08 20:51   ` Rik van Riel
  -1 siblings, 0 replies; 113+ messages in thread
From: Rik van Riel @ 2010-04-08 20:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Ingo Molnar, akpm,
	Linus Torvalds, linux-kernel, linux-arch, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin

On 04/08/2010 03:17 PM, Peter Zijlstra wrote:
> We still have a stray quicklist header included even though we axed
> quicklist usage quite a while back.
>
> Signed-off-by: Peter Zijlstra<a.p.zijlstra@chello.nl>

Reviewed-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma()
  2010-04-08 19:17   ` Peter Zijlstra
  (?)
  (?)
@ 2010-04-08 21:20   ` Andrew Morton
  2010-04-08 21:54       ` Peter Zijlstra
  -1 siblings, 1 reply; 113+ messages in thread
From: Andrew Morton @ 2010-04-08 21:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin

On Thu, 08 Apr 2010 21:17:39 +0200
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> There is nothing preventing the anon_vma from being detached while we
> are spinning to acquire the lock.

Well.  The comment there clearly implies (or states) that RCU
protection is used to "guard against races".  If that's inaccurate
or incomplete, can we please get it fixed?


The whole function makes me a bit queasy.

- Fails to explain why it pulls all these party tricks to read
  page->mapping a single time.  What code path are we defending against
  here?

- Then checks page_mapped() without having any apparent defence
  against page_mapped() becoming untrue one nanosecond later.

- Checks page_mapped() inside the rcu_read_locked() section for
  inscrutable reasons.

> Most (all?) current users end up
> calling something like vma_address(page, vma) on it, which has a
> fairly good chance of weeding out wonky vmas.
> 
> However suppose the anon_vma got freed and re-used while we were
> waiting to acquire the lock, and the new anon_vma fits with the
> page->index (because that is the only thing vma_address() uses to
> determine if the page fits in a particular vma), we could end up
> traversing faulty anon_vma chains.
> 
> Close this hole for good by re-validating that page->mapping still
> holds the very same anon_vma pointer after we acquire the lock, if not
> be utterly paranoid and retry the whole operation (which will very
> likely bail, because it's unlikely the page got attached to a different
> anon_vma in the meantime).
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> ---
>  mm/rmap.c |    7 +++++++
>  1 file changed, 7 insertions(+)
> 
> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c
> @@ -294,6 +294,7 @@ struct anon_vma *page_lock_anon_vma(stru
>  	unsigned long anon_mapping;
>  
>  	rcu_read_lock();
> +again:
>  	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
>  	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
>  		goto out;
> @@ -302,6 +303,12 @@ struct anon_vma *page_lock_anon_vma(stru
>  
>  	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
>  	spin_lock(&anon_vma->lock);
> +
> +	if (page_rmapping(page) != anon_vma) {
> +		spin_unlock(&anon_vma->lock);
> +		goto again;
> +	}
> +
>  	return anon_vma;
>  out:
>  	rcu_read_unlock();
> 

A comment here explaining how this situation could come about would
be helpful.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma()
  2010-04-08 21:20   ` Andrew Morton
@ 2010-04-08 21:54       ` Peter Zijlstra
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-08 21:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin

On Thu, 2010-04-08 at 14:20 -0700, Andrew Morton wrote:
> On Thu, 08 Apr 2010 21:17:39 +0200
> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > There is nothing preventing the anon_vma from being detached while we
> > are spinning to acquire the lock.
> 
> Well.  The comment there clearly implies (or states) that RCU
> protection is used to "guard against races".  If that's inaccurate
> or incomplete, can we please get it fixed?

Good point, goes together with that last comment you made.

> The whole function makes be a bit queasy.
> 
> - Fails to explain why it pulls all these party tricks to read
>   page->mapping a single time.  What code path are we defending against
>   here?

From what I understand we race with tear-down: anon_vma_unlink() takes
anon_vma->lock, so holding the lock pins the anon_vma.

So what we do to acquire a stable anon_vma from a page * is, while
holding the RCU read lock, to very carefully read page->mapping, extract
the anon_vma and acquire the lock.

Now, the RCU usage is a tad tricky here: anon_vma uses
SLAB_DESTROY_BY_RCU, which means that the slab pages will be RCU-freed,
but not the individual objects allocated from them. This means that an
anon_vma can be re-used directly after it gets freed, but the storage
will remain valid for at least a grace period after the free.

So once we do have the lock we need to revalidate that we indeed got the
anon_vma we thought we got.

So it's:

  page->mapping = NULL;
  anon_vma_unlink();
    spin_lock()
    spin_unlock()
    kmem_cache_free(anon_vma);

 VS

  page_lock_anon_vma()'s trickery.
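
To make that rule concrete, a generic sketch of the lookup pattern being
described (illustrative only: struct obj, its key field and hash_find() are
made up; the shape mirrors the current page_lock_anon_vma() /
page_unlock_anon_vma() pair):

	struct obj {
		spinlock_t	lock;
		unsigned long	key;
	};

	/* obj_cachep is created with SLAB_DESTROY_BY_RCU */

	static struct obj *obj_lookup_and_lock(unsigned long key)
	{
		struct obj *obj;

		rcu_read_lock();
	again:
		obj = hash_find(key);	/* lockless lookup, can observe a freed object */
		if (!obj) {
			rcu_read_unlock();
			return NULL;
		}
		/*
		 * Under rcu_read_lock() the storage cannot be handed back to
		 * the page allocator, so taking the lock is safe even if the
		 * object was just freed -- but it may have been *reused*.
		 */
		spin_lock(&obj->lock);
		if (obj->key != key) {	/* reused for something else, retry */
			spin_unlock(&obj->lock);
			goto again;
		}
		/* Caller unlocks first, and only then does rcu_read_unlock(). */
		return obj;
	}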

> - Then checks page_mapped() without having any apparent defence
>   against page_mapped() becoming untrue one nanosecond later.
> 
> - Checks page_mapped() inside the rcu_read_locked() section for
>   inscrutable reasons.

Right, I think the page_mapped() stuff is just an early bail out.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 13/13] mm: Optimize page_lock_anon_vma
  2010-04-08 19:17   ` Peter Zijlstra
  (?)
@ 2010-04-08 22:18   ` Paul E. McKenney
  2010-04-09  8:35     ` Peter Zijlstra
  -1 siblings, 1 reply; 113+ messages in thread
From: Paul E. McKenney @ 2010-04-08 22:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin

On Thu, Apr 08, 2010 at 09:17:50PM +0200, Peter Zijlstra wrote:
> Optimize page_lock_anon_vma() by removing the atomic ref count
> ops from the fast path.
> 
> Rather complicates the code a lot, but might be worth it.

Some questions and a disclaimer below.

> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  mm/rmap.c |   71 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 67 insertions(+), 4 deletions(-)
> 
> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c
> @@ -78,6 +78,12 @@ static inline struct anon_vma *anon_vma_
>  void anon_vma_free(struct anon_vma *anon_vma)
>  {
>  	VM_BUG_ON(atomic_read(&anon_vma->ref));
> +	/*
> +	 * Sync against the anon_vma->lock, so that we can hold the
> +	 * lock without requiring a reference. See page_lock_anon_vma().
> +	 */
> +	mutex_lock(&anon_vma->lock);

On some systems, the CPU is permitted to pull references into the critical
section from either side.  So, do we also need an smp_mb() here?

> +	mutex_unlock(&anon_vma->lock);

So, a question...

Can the above mutex be contended?  If yes, what happens when the
competing mutex_lock() acquires the lock at this point?  Or, worse yet,
after the kmem_cache_free()?

If no, what do we accomplish by acquiring the lock?

If the above mutex can be contended, can we fix by substituting
synchronize_rcu_expedited()?  Which will soon require some scalability
attention if it gets used here, but what else is new?  ;-)
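
For clarity, the substitution being floated would be something like the
following (sketch only -- whether it actually covers the lock-without-ref
holders is exactly the question raised above):

	void anon_vma_free(struct anon_vma *anon_vma)
	{
		VM_BUG_ON(atomic_read(&anon_vma->ref));
		/*
		 * Wait out anybody who picked the anon_vma up under
		 * rcu_read_lock(), instead of cycling the mutex.
		 */
		synchronize_rcu_expedited();
		kmem_cache_free(anon_vma_cachep, anon_vma);
	}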

>  	kmem_cache_free(anon_vma_cachep, anon_vma);
>  }
> 
> @@ -291,7 +297,7 @@ void __init anon_vma_init(void)
> 
>  /*
>   * Getting a lock on a stable anon_vma from a page off the LRU is
> - * tricky: page_lock_anon_vma relies on RCU to guard against the races.
> + * tricky: anon_vma_get relies on RCU to guard against the races.
>   */
>  struct anon_vma *anon_vma_get(struct page *page)
>  {
> @@ -320,12 +326,70 @@ out:
>  	return anon_vma;
>  }
> 
> +/*
> + * Similar to anon_vma_get(), however it relies on the anon_vma->lock
> + * to pin the object. However since we cannot wait for the mutex
> + * acquisition inside the RCU read lock, we use the ref count
> + * in the slow path.
> + */
>  struct anon_vma *page_lock_anon_vma(struct page *page)
>  {
> -	struct anon_vma *anon_vma = anon_vma_get(page);
> +	struct anon_vma *anon_vma = NULL;
> +	unsigned long anon_mapping;
> +
> +again:
> +	rcu_read_lock();

This is interesting.  You have an RCU read-side critical section with
no rcu_dereference().

This strange state of affairs is actually legal (assuming that
anon_mapping is the RCU-protected structure) because all dereferences
of the anon_vma variable are atomic operations that guarantee ordering
(the mutex_trylock() and the atomic_inc_not_zero()).

The other dereferences (the atomic_read()s) are under the lock, so
are also OK assuming that the lock is held when initializing and
updating these fields, and even more OK due to the smp_rmb() below.

But see below.

> +	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> +	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> +		goto unlock;
> +	if (!page_mapped(page))
> +		goto unlock;
> +
> +	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> +	if (!mutex_trylock(&anon_vma->lock)) {
> +		/*
> +		 * We failed to acquire the lock, take a ref so we can
> +		 * drop the RCU read lock and sleep on it.
> +		 */
> +		if (!atomic_inc_not_zero(&anon_vma->ref)) {
> +			/*
> +			 * Failed to get a ref, we're dead, bail.
> +			 */
> +			anon_vma = NULL;
> +			goto unlock;
> +		}
> +		rcu_read_unlock();
> 
> -	if (anon_vma)
>  		mutex_lock(&anon_vma->lock);
> +		/*
> +		 * We got the lock, drop the temp. ref, if it was the last
> +		 * one free it and bail.
> +		 */
> +		if (atomic_dec_and_test(&anon_vma->ref)) {
> +			mutex_unlock(&anon_vma->lock);
> +			anon_vma_free(anon_vma);
> +			anon_vma = NULL;
> +		}
> +		goto out;
> +	}
> +	/*
> +	 * Got the lock, check we're still alive. Seeing a ref
> +	 * here guarantees the object will stay alive due to
> +	 * anon_vma_free() syncing against the lock we now hold.
> +	 */
> +	smp_rmb(); /* Order against anon_vma_put() */

This is ordering the fetch into anon_vma against the atomic_read() below?
If so, smp_read_barrier_depends() will cover it more cheaply.  Alternatively,
use rcu_dereference() when fetching into anon_vma.

Or am I misunderstanding the purpose of this barrier?
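
Concretely, the rcu_dereference() alternative would be roughly the following
against the quoted hunk (untested sketch; it only helps if the dependency
ordering is all the smp_rmb() is there for):

	anon_mapping = (unsigned long)rcu_dereference(page->mapping);
	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		goto unlock;
	if (!page_mapped(page))
		goto unlock;

	anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);
	/*
	 * The later atomic_read(&anon_vma->ref) is a dependent load, so the
	 * data dependency from rcu_dereference() would order it without an
	 * explicit smp_rmb().
	 */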

(Disclaimer: I have not yet found anon_vma_put(), so I am assuming that
anon_vma_free() plays the role of a grace period.)

> +	if (!atomic_read(&anon_vma->ref)) {
> +		mutex_unlock(&anon_vma->lock);
> +		anon_vma = NULL;
> +	}
> +
> +unlock:
> +	rcu_read_unlock();
> +out:
> +	if (anon_vma && page_rmapping(page) != anon_vma) {
> +		mutex_unlock(&anon_vma->lock);
> +		goto again;
> +	}
> 
>  	return anon_vma;
>  }
> @@ -333,7 +397,6 @@ struct anon_vma *page_lock_anon_vma(stru
>  void page_unlock_anon_vma(struct anon_vma *anon_vma)
>  {
>  	mutex_unlock(&anon_vma->lock);
> -	anon_vma_put(anon_vma);
>  }
> 
>  /*
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 00/13] mm: preemptibility -v2
  2010-04-08 20:29 ` [PATCH 00/13] mm: preemptibility -v2 David Miller
  2010-04-08 20:35   ` Peter Zijlstra
@ 2010-04-09  1:00   ` David Miller
  1 sibling, 0 replies; 113+ messages in thread
From: David Miller @ 2010-04-09  1:00 UTC (permalink / raw)
  To: a.p.zijlstra
  Cc: aarcange, avi, tglx, riel, mingo, akpm, torvalds, linux-kernel,
	linux-arch, benh, hugh.dickins, mel, npiggin

From: David Miller <davem@davemloft.net>
Date: Thu, 08 Apr 2010 13:29:35 -0700 (PDT)

> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Thu, 08 Apr 2010 21:17:37 +0200
> 
>> This patch-set seems to build and boot on my x86_64 machines and even builds a
>> kernel. I've also attempted powerpc and sparc, which I've compile tested with
>> their respective defconfigs, remaining are (afaikt the rest uses the generic
>> tlb bits):
> 
> Did a build and test boot of this on my 128-cpu Niagara2 box, seems to
> work basically fine.

Here comes a set of 4 patches which build on top of your
work by:

1) Killing quicklists on sparc64
2) Using the generic RCU page table liberation code on sparc64
3) Implement pte_special() et al. on sparc64
4) Implement get_user_pages_fast() on sparc64

Please add them to your patch set.  If you change the RCU generic code
enabler CPP define to be controlled via Kconfig (as we discussed on
IRC) it should be easy to propagate that change into patch #2 here.

Thanks!

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma()
  2010-04-08 19:17   ` Peter Zijlstra
                     ` (2 preceding siblings ...)
  (?)
@ 2010-04-09  2:19   ` Minchan Kim
  -1 siblings, 0 replies; 113+ messages in thread
From: Minchan Kim @ 2010-04-09  2:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin

Hi, Peter.

On Fri, Apr 9, 2010 at 4:17 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> There is nothing preventing the anon_vma from being detached while we
> are spinning to acquire the lock. Most (all?) current users end up
> calling something like vma_address(page, vma) on it, which has a
> fairly good chance of weeding out wonky vmas.
>
> However suppose the anon_vma got freed and re-used while we were
> waiting to acquire the lock, and the new anon_vma fits with the
> page->index (because that is the only thing vma_address() uses to
> determine if the page fits in a particular vma), we could end up
> traversing faulty anon_vma chains.

We have a second line of defense in page_check_address().
Before an anon_vma is detached, the ptes of the pages on that anon_vma
should already have been zeroed.
So can't page_check_address() close the race?
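
For reference, the second check being referred to has roughly this shape in
the per-vma handlers (a sketch of the existing pattern, not an exact quote):

	pte_t *pte;
	spinlock_t *ptl;

	pte = page_check_address(page, vma->vm_mm, address, &ptl, 0);
	if (!pte)
		return SWAP_AGAIN;	/* pte already gone, nothing to do */
	/* ... operate on the pte under ptl ... */
	pte_unmap_unlock(pte, ptl);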

Thanks for the good work on a good feature.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma()
  2010-04-08 21:54       ` Peter Zijlstra
  (?)
@ 2010-04-09  2:19       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 113+ messages in thread
From: KOSAKI Motohiro @ 2010-04-09  2:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Andrew Morton, Andrea Arcangeli, Avi Kivity,
	Thomas Gleixner, Rik van Riel, Ingo Molnar, Linus Torvalds,
	linux-kernel, linux-arch, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin

Hi

> > - Then checks page_mapped() without having any apparent defence
> >   against page_mapped() becoming untrue one nanosecond later.
> > 
> > - Checks page_mapped() inside the rcu_read_locked() section for
> >   inscrutable reasons.
> 
> Right, I think the page_mapped() stuff is just an early bail out.

FWIW, it's not only an early bail-out.
page_remove_rmap() doesn't do "page->mapping = NULL"; page_remove_rmap()'s
comment says:

        /*
         * It would be tidy to reset the PageAnon mapping here,
         * but that might overwrite a racing page_add_anon_rmap
         * which increments mapcount after us but sets mapping
         * before us: so leave the reset to free_hot_cold_page,
         * and remember that it's only reliable while mapped.
         * Leaving it set also helps swapoff to reinstate ptes
         * faster for those pages still in swapcache.
         */

So, if the following scenario happens, we can't dereference page->mapping
(iow, we can't call spin_lock(&anon_vma->lock)); it might point to an
invalid address.


   CPU0                                CPU1
===============================================================
page->_mapcount becomes -1
anon_vma_unlink()
-- grace period --
                                      page_lock_anon_vma
                                      page_mapped()
                                      spin_lock(&anon_vma->lock);



Of course, this statement doesn't mean I'm against your patch at all.
I like it.




^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-08 19:17   ` Peter Zijlstra
  (?)
  (?)
@ 2010-04-09  3:11   ` Nick Piggin
  -1 siblings, 0 replies; 113+ messages in thread
From: Nick Piggin @ 2010-04-09  3:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman

It's not wrong per se, but the entire powerpc memory management code
does the IRQ disabling for its pagetable RCU code. So I think it would be
better to do the whole thing in one go.

I don't think Paul will surprise-break powerpc :)

It's up to Ben really, though.


On Thu, Apr 08, 2010 at 09:17:38PM +0200, Peter Zijlstra wrote:
> The powerpc page table freeing relies on the fact that IRQs hold off
> an RCU grace period, this is currently true for all existing RCU
> implementations but is not an assumption Paul wants to support.
> 
> Therefore, also take the RCU read lock along with disabling IRQs to
> ensure the RCU grace period does at least cover these lookups.
> 
> Requested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Nick Piggin <npiggin@suse.de>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> ---
>  arch/powerpc/mm/gup.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> Index: linux-2.6/arch/powerpc/mm/gup.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/gup.c
> +++ linux-2.6/arch/powerpc/mm/gup.c
> @@ -142,6 +142,7 @@ int get_user_pages_fast(unsigned long st
>  	 * So long as we atomically load page table pointers versus teardown,
>  	 * we can follow the address down to the the page and take a ref on it.
>  	 */
> +	rcu_read_lock();
>  	local_irq_disable();
>  
>  	pgdp = pgd_offset(mm, addr);
> @@ -162,6 +163,7 @@ int get_user_pages_fast(unsigned long st
>  	} while (pgdp++, addr = next, addr != end);
>  
>  	local_irq_enable();
> +	rcu_read_unlock();
>  
>  	VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
>  	return nr;
> @@ -171,6 +173,7 @@ int get_user_pages_fast(unsigned long st
>  
>  slow:
>  		local_irq_enable();
> +		rcu_read_unlock();
>  slow_irqon:
>  		pr_devel("  slow path ! nr = %d\n", nr);
>  
> 

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma()
  2010-04-08 19:17   ` Peter Zijlstra
                     ` (3 preceding siblings ...)
  (?)
@ 2010-04-09  3:16   ` Nick Piggin
  2010-04-09  4:56     ` KAMEZAWA Hiroyuki
  -1 siblings, 1 reply; 113+ messages in thread
From: Nick Piggin @ 2010-04-09  3:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman

On Thu, Apr 08, 2010 at 09:17:39PM +0200, Peter Zijlstra wrote:
> There is nothing preventing the anon_vma from being detached while we
> are spinning to acquire the lock. Most (all?) current users end up
> calling something like vma_address(page, vma) on it, which has a
> fairly good chance of weeding out wonky vmas.
> 
> However suppose the anon_vma got freed and re-used while we were
> waiting to acquire the lock, and the new anon_vma fits with the
> page->index (because that is the only thing vma_address() uses to
> determine if the page fits in a particular vma), we could end up
> traversing faulty anon_vma chains.
> 
> Close this hole for good by re-validating that page->mapping still
> holds the very same anon_vma pointer after we acquire the lock, if not
> be utterly paranoid and retry the whole operation (which will very
> likely bail, because it's unlikely the page got attached to a different
> anon_vma in the meantime).

Hm, looks like a bugfix? How was this supposed to be safe?

 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> ---
>  mm/rmap.c |    7 +++++++
>  1 file changed, 7 insertions(+)
> 
> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c
> @@ -294,6 +294,7 @@ struct anon_vma *page_lock_anon_vma(stru
>  	unsigned long anon_mapping;
>  
>  	rcu_read_lock();
> +again:
>  	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
>  	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
>  		goto out;
> @@ -302,6 +303,12 @@ struct anon_vma *page_lock_anon_vma(stru
>  
>  	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
>  	spin_lock(&anon_vma->lock);
> +
> +	if (page_rmapping(page) != anon_vma) {

very unlikely()?

> +		spin_unlock(&anon_vma->lock);
> +		goto again;
> +	}
> +
>  	return anon_vma;
>  out:
>  	rcu_read_unlock();
> 

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 06/13] mm: Preemptible mmu_gather
  2010-04-08 19:17   ` Peter Zijlstra
  (?)
@ 2010-04-09  3:25   ` Nick Piggin
  2010-04-09  8:18     ` Peter Zijlstra
  2010-04-09 20:36     ` Peter Zijlstra
  -1 siblings, 2 replies; 113+ messages in thread
From: Nick Piggin @ 2010-04-09  3:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman

On Thu, Apr 08, 2010 at 09:17:43PM +0200, Peter Zijlstra wrote:
> @@ -39,30 +33,48 @@
>  struct mmu_gather {
>  	struct mm_struct	*mm;
>  	unsigned int		nr;	/* set to ~0U means fast mode */
> +	unsigned int		max;	/* nr < max */
>  	unsigned int		need_flush;/* Really unmapped some ptes? */
>  	unsigned int		fullmm; /* non-zero means full mm flush */
> -	struct page *		pages[FREE_PTE_NR];
> +#ifdef HAVE_ARCH_MMU_GATHER
> +	struct arch_mmu_gather	arch;
> +#endif
> +	struct page		**pages;
> +	struct page		*local[8];

Have you done some profiling on this? What I would like to see, if
it's not too much complexity, is to have a small set of pages to
handle common size frees, and then use them up first by default
before attempting to allocate more.

Also, it would be cool to be able to chain allocations to avoid
TLB flushes even on big frees (overridable by arch of course, in
case they're doing some non-preemptible work or you wish to break
up lock hold times). But that might be just getting over-engineered.

>  };
>  
> -/* Users of the generic TLB shootdown code must declare this storage space. */
> -DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
> +static inline void __tlb_alloc_pages(struct mmu_gather *tlb)
> +{
> +	unsigned long addr = __get_free_pages(GFP_ATOMIC, 0);

Slab allocations should be faster, so it's nice to use them in
performance critical code if you don't need the struct page.

Otherwise, looks ok to me.
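
Roughly what that start-small-and-extend scheme could look like, reusing the
fields of the quoted struct (tlb_gather_init()/tlb_grow() and their call sites
are made up for illustration):

	static inline void tlb_gather_init(struct mmu_gather *tlb)
	{
		/* Begin with the embedded array: no allocation on the fast path. */
		tlb->pages = tlb->local;
		tlb->max   = ARRAY_SIZE(tlb->local);
		tlb->nr    = 0;
	}

	/* Called once local[] fills up; assumes tlb->pages == tlb->local. */
	static inline void tlb_grow(struct mmu_gather *tlb)
	{
		struct page **pages = kmalloc(PAGE_SIZE, GFP_ATOMIC); /* slab alloc */

		if (!pages)
			return;		/* keep using local[], just flush more often */

		memcpy(pages, tlb->local, tlb->nr * sizeof(*pages));
		tlb->pages = pages;
		tlb->max   = PAGE_SIZE / sizeof(*pages);
	}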

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 09/13] mm, powerpc: Move the RCU page-table freeing into generic code
  2010-04-08 19:17   ` Peter Zijlstra
  (?)
@ 2010-04-09  3:35   ` Nick Piggin
  2010-04-09  8:08     ` Peter Zijlstra
  -1 siblings, 1 reply; 113+ messages in thread
From: Nick Piggin @ 2010-04-09  3:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman

On Thu, Apr 08, 2010 at 09:17:46PM +0200, Peter Zijlstra wrote:
> Index: linux-2.6/include/asm-generic/tlb.h
> ===================================================================
> --- linux-2.6.orig/include/asm-generic/tlb.h
> +++ linux-2.6/include/asm-generic/tlb.h
> @@ -27,6 +27,49 @@
>    #define tlb_fast_mode(tlb) 1
>  #endif
>  
> +#ifdef HAVE_ARCH_RCU_TABLE_FREE
> +/*
> + * Semi RCU freeing of the page directories.
> + *
> + * This is needed by some architectures to implement gup_fast().

Really? I see the comment in the powerpc code, but powerpc was already
using RCU before gup_fast(), and AFAIKS it is indeed using it so that
it can handle faults by getting the Linux pte with find_linux_pte()?

I would have thought this should be a well-used method for handling
software TLB faults.

But anyway this looks like a nice abstraction.
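
For reference, the pattern such a walker relies on looks roughly like
this (a simplified, hypothetical sketch in the spirit of gup_fast() /
find_linux_pte(), not actual code from either): interrupts are disabled
across the walk, and the semi-RCU freeing only releases page-table
pages after an IPI or grace period, so no level below can disappear
under the walker even without mmap_sem or the ptl.

==
static pte_t *lockless_walk(struct mm_struct *mm, unsigned long addr)
{
	pgd_t *pgd = pgd_offset(mm, addr);
	pud_t *pud;
	pmd_t *pmd;

	if (pgd_none(*pgd) || pgd_bad(*pgd))
		return NULL;
	pud = pud_offset(pgd, addr);
	if (pud_none(*pud) || pud_bad(*pud))
		return NULL;
	pmd = pmd_offset(pud, addr);
	if (pmd_none(*pmd) || pmd_bad(*pmd))
		return NULL;
	return pte_offset_kernel(pmd, addr);
}

	/* caller side */
	pte_t *pte;

	local_irq_disable();	/* holds off the freeing IPI / grace period */
	pte = lockless_walk(mm, addr);
	/* ... inspect the pte, e.g. for a software TLB refill ... */
	local_irq_enable();
==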


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 07/13] powerpc: Preemptible mmu_gather
  2010-04-08 19:17   ` Peter Zijlstra
  (?)
@ 2010-04-09  4:07   ` Nick Piggin
  2010-04-09  8:14     ` Peter Zijlstra
  2010-04-13  1:56     ` Benjamin Herrenschmidt
  -1 siblings, 2 replies; 113+ messages in thread
From: Nick Piggin @ 2010-04-09  4:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman

On Thu, Apr 08, 2010 at 09:17:44PM +0200, Peter Zijlstra wrote:
> Fix up powerpc to the new mmu_gather stuffs.
> 
> PPC has an extra batching queue to RCU free the actual pagetable
> allocations, use the ARCH extentions for that for now.
> 
> For the ppc64_tlb_batch, which tracks the vaddrs to unhash from the
> hardware hash-table, keep using per-cpu arrays but flush on context
> switch and use a TIF bit to track the laxy_mmu state.

Hm. Pity powerpc can't just use tlb flush gathering for this batching
(which is what it was designed for). Then it could avoid these tricks.
What's preventing this? Adding a tlb gather for the COW case in
copy_page_range?


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 00/13] mm: preemptibility -v2
  2010-04-08 19:17 ` Peter Zijlstra
                   ` (15 preceding siblings ...)
  (?)
@ 2010-04-09  4:14 ` Nick Piggin
  2010-04-09  8:35   ` Peter Zijlstra
  -1 siblings, 1 reply; 113+ messages in thread
From: Nick Piggin @ 2010-04-09  4:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman

On Thu, Apr 08, 2010 at 09:17:37PM +0200, Peter Zijlstra wrote:
> Hi,
> 
> This (still incomplete) patch-set makes part of the mm a lot more preemptible.
> It converts i_mmap_lock and anon_vma->lock to mutexes.  On the way there it
> also makes mmu_gather preemptible.
> 
> The main motivation was making mm_take_all_locks() preemptible, since it
> appears people are nesting hundreds of spinlocks there.
> 
> The side-effects are that we can finally make mmu_gather preemptible, something
> which lots of people have wanted to do for a long time.

What's the straight-line performance impact of all this? And how about
concurrency, I wonder. mutexes of course are double the atomics, and
you've added a refcount which is two more again for those paths using
it.

Page faults are very important. We unfortunately have some databases
doing a significant amount of mmap/munmap activity too. I'd like to
see microbenchmark numbers for each of those (both anon and file backed
for page faults).

kbuild does quite a few page faults, that would be an easy thing to
test. Not sure what reasonable kinds of cases exercise parallelism.
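
A minimal sketch of the kind of anon-fault microbenchmark meant here
(hypothetical, not an existing test): map an anonymous region, touch
every page to force a fault, unmap, repeat; a file-backed variant would
mmap a file instead.

==
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define SZ (128UL << 20)	/* 128 MiB per iteration */

int main(void)
{
	unsigned long pgsz = sysconf(_SC_PAGESIZE), faults = 0, off;
	struct timespec t0, t1;
	double secs;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < 16; i++) {
		char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		for (off = 0; off < SZ; off += pgsz, faults++)
			p[off] = 1;	/* one minor fault per page */
		munmap(p, SZ);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.0f faults/sec\n", faults / secs);
	return 0;
}
==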


> What kind of performance tests would people have me run on this to satisfy
> their need for numbers? I've done a kernel build on x86_64 and if anything that
> was slightly faster with these patches, but it was well within the noise
> levels so it might be heat noise I'm looking at ;-)

Is it because you're reducing the number of TLB flushes, or what
(kbuild isn't multi threaded so on x86 TLB flushes should be really
fast anyway).


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma()
  2010-04-09  3:16   ` Nick Piggin
@ 2010-04-09  4:56     ` KAMEZAWA Hiroyuki
  2010-04-09  6:34       ` KOSAKI Motohiro
  0 siblings, 1 reply; 113+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-09  4:56 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrea Arcangeli, Avi Kivity, Thomas Gleixner,
	Rik van Riel, Ingo Molnar, akpm, Linus Torvalds, linux-kernel,
	linux-arch, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman

On Fri, 9 Apr 2010 13:16:41 +1000
Nick Piggin <npiggin@suse.de> wrote:

> On Thu, Apr 08, 2010 at 09:17:39PM +0200, Peter Zijlstra wrote:
> > There is nothing preventing the anon_vma from being detached while we
> > are spinning to acquire the lock. Most (all?) current users end up
> > calling something like vma_address(page, vma) on it, which has a
> > fairly good chance of weeding out wonky vmas.
> > 
> > However suppose the anon_vma got freed and re-used while we were
> > waiting to acquire the lock, and the new anon_vma fits with the
> > page->index (because that is the only thing vma_address() uses to
> > determine if the page fits in a particular vma, we could end up
> > traversing faulty anon_vma chains.
> > 
> > Close this hole for good by re-validating that page->mapping still
> > holds the very same anon_vma pointer after we acquire the lock, if not
> > be utterly paranoid and retry the whole operation (which will very
> > likely bail, because it's unlikely the page got attached to a different
> > anon_vma in the meantime).
> 
> Hm, looks like a bugfix? How was this supposed to be safe?
> 
IIUC.

Before Rik's change to anon_vma, once page->mapping is set to anon_vma | 0x1,
it's not modified until the page is freed.
After the patch, do_wp_page() overwrites page->mapping when it reuses an
existing page.

==
static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
                unsigned long address, pte_t *page_table, pmd_t *pmd,
                spinlock_t *ptl, pte_t orig_pte)
{
....
        if (PageAnon(old_page) && !PageKsm(old_page)) {
                if (!trylock_page(old_page)) {
                        page_cache_get(old_page);
....
                reuse = reuse_swap_page(old_page);
                if (reuse)
                        /*
                         * The page is all ours.  Move it to our anon_vma so
                         * the rmap code will not search our parent or siblings.
                         * Protected against the rmap code by the page lock.
                         */
                        page_move_anon_rmap(old_page, vma, address); ----(*)
}
===		
(*) is new.

Then, this new check makes sense in the current kernel.



>  
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > ---
> >  mm/rmap.c |    7 +++++++
> >  1 file changed, 7 insertions(+)
> > 
> > Index: linux-2.6/mm/rmap.c
> > ===================================================================
> > --- linux-2.6.orig/mm/rmap.c
> > +++ linux-2.6/mm/rmap.c
> > @@ -294,6 +294,7 @@ struct anon_vma *page_lock_anon_vma(stru
> >  	unsigned long anon_mapping;
> >  
> >  	rcu_read_lock();
> > +again:
> >  	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> >  	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> >  		goto out;
> > @@ -302,6 +303,12 @@ struct anon_vma *page_lock_anon_vma(stru
> >  
> >  	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> >  	spin_lock(&anon_vma->lock);
> > +
> > +	if (page_rmapping(page) != anon_vma) {
> 
> very unlikely()?
> 
I think so.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma()
  2010-04-09  4:56     ` KAMEZAWA Hiroyuki
@ 2010-04-09  6:34       ` KOSAKI Motohiro
  2010-04-09  6:47         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 113+ messages in thread
From: KOSAKI Motohiro @ 2010-04-09  6:34 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, Nick Piggin, Peter Zijlstra, Andrea Arcangeli,
	Avi Kivity, Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm,
	Linus Torvalds, linux-kernel, linux-arch, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman

> On Fri, 9 Apr 2010 13:16:41 +1000
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > On Thu, Apr 08, 2010 at 09:17:39PM +0200, Peter Zijlstra wrote:
> > > There is nothing preventing the anon_vma from being detached while we
> > > are spinning to acquire the lock. Most (all?) current users end up
> > > calling something like vma_address(page, vma) on it, which has a
> > > fairly good chance of weeding out wonky vmas.
> > > 
> > > However suppose the anon_vma got freed and re-used while we were
> > > waiting to acquire the lock, and the new anon_vma fits with the
> > > page->index (because that is the only thing vma_address() uses to
> > > determine if the page fits in a particular vma, we could end up
> > > traversing faulty anon_vma chains.
> > > 
> > > Close this hole for good by re-validating that page->mapping still
> > > holds the very same anon_vma pointer after we acquire the lock, if not
> > > be utterly paranoid and retry the whole operation (which will very
> > > likely bail, because it's unlikely the page got attached to a different
> > > anon_vma in the meantime).
> > 
> > Hm, looks like a bugfix? How was this supposed to be safe?
> > 
> IIUC.
> 
> Before Rik's change to anon_vma, once page->mapping is set as anon_vma | 0x1,
> it's not modified until the page is freed.
> After the patch, do_wp_page() overwrite page->mapping when it reuse existing
> page.

Why?
IIUC, the page->mapping dereference in page_lock_anon_vma() has four scenarios.

1. the anon_vma is valid
	-> do page_referenced_one().
2. the anon_vma is invalid and freed to the buddy allocator
	-> bail out via page_mapped(), anon_vma is not touched
3. the anon_vma is kfreed, and not reused
	-> bail out via page_mapped()
4. the anon_vma is kfreed, but reused as another anon_vma
	-> bail out via page_check_address()

Now we have to consider a fifth scenario.

5. the anon_vma is exchanged for another anon_vma by do_wp_page().
	-> bail out via the same bail-out paths above.


I agree Peter's patch makes sense, but I don't think Rik's patch changes the
locking rules.


> 
> ==
> static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>                 unsigned long address, pte_t *page_table, pmd_t *pmd,
>                 spinlock_t *ptl, pte_t orig_pte)
> {
> ....
>         if (PageAnon(old_page) && !PageKsm(old_page)) {
>                 if (!trylock_page(old_page)) {
>                         page_cache_get(old_page);
> ....
>                 reuse = reuse_swap_page(old_page);
>                 if (reuse)
>                         /*
>                          * The page is all ours.  Move it to our anon_vma so
>                          * the rmap code will not search our parent or siblings.
>                          * Protected against the rmap code by the page lock.
>                          */
>                         page_move_anon_rmap(old_page, vma, address); ----(*)
> }
> ===		
> (*) is new.
> 
> Then, this new check makes sense in the current kernel.






^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma()
  2010-04-09  6:34       ` KOSAKI Motohiro
@ 2010-04-09  6:47         ` KAMEZAWA Hiroyuki
  2010-04-09  7:29           ` KOSAKI Motohiro
  0 siblings, 1 reply; 113+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-09  6:47 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Nick Piggin, Peter Zijlstra, Andrea Arcangeli, Avi Kivity,
	Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm, Linus Torvalds,
	linux-kernel, linux-arch, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman

On Fri,  9 Apr 2010 15:34:33 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > On Fri, 9 Apr 2010 13:16:41 +1000
> > Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > On Thu, Apr 08, 2010 at 09:17:39PM +0200, Peter Zijlstra wrote:
> > > > There is nothing preventing the anon_vma from being detached while we
> > > > are spinning to acquire the lock. Most (all?) current users end up
> > > > calling something like vma_address(page, vma) on it, which has a
> > > > fairly good chance of weeding out wonky vmas.
> > > > 
> > > > However suppose the anon_vma got freed and re-used while we were
> > > > waiting to acquire the lock, and the new anon_vma fits with the
> > > > page->index (because that is the only thing vma_address() uses to
> > > > determine if the page fits in a particular vma, we could end up
> > > > traversing faulty anon_vma chains.
> > > > 
> > > > Close this hole for good by re-validating that page->mapping still
> > > > holds the very same anon_vma pointer after we acquire the lock, if not
> > > > be utterly paranoid and retry the whole operation (which will very
> > > > likely bail, because it's unlikely the page got attached to a different
> > > > anon_vma in the meantime).
> > > 
> > > Hm, looks like a bugfix? How was this supposed to be safe?
> > > 
> > IIUC.
> > 
> > Before Rik's change to anon_vma, once page->mapping is set as anon_vma | 0x1,
> > it's not modified until the page is freed.
> > After the patch, do_wp_page() overwrite page->mapping when it reuse existing
> > page.
> 
> Why?
> IIUC. page->mapping dereference in page_lock_anon_vma() makes four story.
> 
> 1. the anon_vma is valid
> 	-> do page_referenced_one(). 
> 2. the anon_vma is invalid and freed to buddy
> 	-> bail out by page_mapped(), no touch anon_vma
> 3. the anon_vma is kfreed, and not reused
> 	-> bail out by page_mapped()
> 4. the anon_vma is kfreed, but reused as another anon_vma
> 	-> bail out by page_check_address()
> 
> Now we have to consider 5th story.
> 
> 5. the anon_vma is exchanged another anon_vma by do_wp_page.
> 	-> bail out by above bailing out stuff.
> 
> 
> I agree peter's patch makes sense. but I don't think Rik's patch change
> locking rule.
> 

Hmm, I think the following.

Assume a page is ANON and SwapCache, and it has only one reference.
Consider it's read-only mapped and causes do_wp_page().
page_mapcount(page) == 1 here.

    CPU0                          CPU1

1. do_wp_page()
2. .....
3. replace anon_vma.     anon_vma = lock_page_anon_vma()

So, lock_page_anon_vma() may take the lock on the wrong anon_vma here (mapcount=1).

4. modify pte to writable.        do something...

After taking the lock, on CPU1, the pte at the address estimated by
vma_address(vma, page) contains the pfn of the page and page_check_address()
will succeed.

I'm not sure how dangerous this is.
But it's possible that CPU1 cannot notice there was an anon_vma replacement,
and modifies the pte without holding the anon_vma's lock which the code
believes it holds.

Thanks,
-Kame





^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 05/13] mm: Make use of the anon_vma ref count
  2010-04-08 19:17   ` Peter Zijlstra
  (?)
@ 2010-04-09  7:04   ` Christian Ehrhardt
  2010-04-09  9:57     ` Peter Zijlstra
  -1 siblings, 1 reply; 113+ messages in thread
From: Christian Ehrhardt @ 2010-04-09  7:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin


Hi,

On Thu, Apr 08, 2010 at 09:17:42PM +0200, Peter Zijlstra wrote:
> @@ -302,23 +307,33 @@ again:
>  		goto out;
>  
>  	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> -	spin_lock(&anon_vma->lock);
> +	if (!atomic_inc_not_zero(&anon_vma->ref))
> +		anon_vma = NULL;
>  
>  	if (page_rmapping(page) != anon_vma) {
> -		spin_unlock(&anon_vma->lock);
> +		anon_vma_put(anon_vma);
>  		goto again;
>  	}

AFAICS anon_vma_put might be called with anon_vma == NULL here which
will oops on the ref count. Not sure if

     page_rmapping(page) == anon_vma == NULL 

is possible, too.
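
One way to avoid calling anon_vma_put() with a NULL anon_vma would be to
bail out as soon as the refcount cannot be taken (illustrative sketch only,
not the actual fix):

==
	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
	if (!atomic_inc_not_zero(&anon_vma->ref)) {
		anon_vma = NULL;
		goto out;	/* never fall through to anon_vma_put(NULL) */
	}

	if (page_rmapping(page) != anon_vma) {
		anon_vma_put(anon_vma);	/* drops only a ref we actually took */
		goto again;
	}
==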

    regards   Christian


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma()
  2010-04-09  6:47         ` KAMEZAWA Hiroyuki
@ 2010-04-09  7:29           ` KOSAKI Motohiro
  2010-04-09  7:57             ` KAMEZAWA Hiroyuki
                               ` (2 more replies)
  0 siblings, 3 replies; 113+ messages in thread
From: KOSAKI Motohiro @ 2010-04-09  7:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, Nick Piggin, Peter Zijlstra, Andrea Arcangeli,
	Avi Kivity, Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm,
	Linus Torvalds, linux-kernel, linux-arch, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman

> On Fri,  9 Apr 2010 15:34:33 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > > On Fri, 9 Apr 2010 13:16:41 +1000
> > > Nick Piggin <npiggin@suse.de> wrote:
> > > 
> > > > On Thu, Apr 08, 2010 at 09:17:39PM +0200, Peter Zijlstra wrote:
> > > > > There is nothing preventing the anon_vma from being detached while we
> > > > > are spinning to acquire the lock. Most (all?) current users end up
> > > > > calling something like vma_address(page, vma) on it, which has a
> > > > > fairly good chance of weeding out wonky vmas.
> > > > > 
> > > > > However suppose the anon_vma got freed and re-used while we were
> > > > > waiting to acquire the lock, and the new anon_vma fits with the
> > > > > page->index (because that is the only thing vma_address() uses to
> > > > > determine if the page fits in a particular vma, we could end up
> > > > > traversing faulty anon_vma chains.
> > > > > 
> > > > > Close this hole for good by re-validating that page->mapping still
> > > > > holds the very same anon_vma pointer after we acquire the lock, if not
> > > > > be utterly paranoid and retry the whole operation (which will very
> > > > > likely bail, because it's unlikely the page got attached to a different
> > > > > anon_vma in the meantime).
> > > > 
> > > > Hm, looks like a bugfix? How was this supposed to be safe?
> > > > 
> > > IIUC.
> > > 
> > > Before Rik's change to anon_vma, once page->mapping is set as anon_vma | 0x1,
> > > it's not modified until the page is freed.
> > > After the patch, do_wp_page() overwrite page->mapping when it reuse existing
> > > page.
> > 
> > Why?
> > IIUC. page->mapping dereference in page_lock_anon_vma() makes four story.
> > 
> > 1. the anon_vma is valid
> > 	-> do page_referenced_one(). 
> > 2. the anon_vma is invalid and freed to buddy
> > 	-> bail out by page_mapped(), no touch anon_vma
> > 3. the anon_vma is kfreed, and not reused
> > 	-> bail out by page_mapped()
> > 4. the anon_vma is kfreed, but reused as another anon_vma
> > 	-> bail out by page_check_address()
> > 
> > Now we have to consider 5th story.
> > 
> > 5. the anon_vma is exchanged another anon_vma by do_wp_page.
> > 	-> bail out by above bailing out stuff.
> > 
> > 
> > I agree peter's patch makes sense. but I don't think Rik's patch change
> > locking rule.
> > 
> 
> Hmm, I think following.
> 
> Assume a page is ANON and SwapCache, and it has only one reference.
> Consider it's read-only mapped and cause do_wp_page().
> page_mapcount(page) == 1 here.
> 
>     CPU0                          CPU1
> 
> 1. do_wp_page()
> 2. .....
> 3. replace anon_vma.     anon_vma = lock_page_anon_vma()
> 
> So, lock_page_anon_vma() may have lock on wrong anon_vma, here.(mapcount=1)
> 
> 4. modify pte to writable.        do something...
> 
> After lock, in CPU1, a pte of estimated address by vma_address(vma, page)
> containes pfn of the page and page_check_address() will success.
> 
> I'm not sure how this is dangerouns.
> But it's possible that CPU1 cannot notice there was anon_vma replacement.
> And modifies pte withoug holding anon vma's lock which the code believes
> it's holded.


Hehe, page_referenced() can already see an unstable VM_LOCKED value. So,
in the worst case we make a false-positive pageout, but it's not a disaster,
I think. Anyway, "use after free" doesn't happen with this brutal code.

However, I think you pointed out one good thing: before Rik's patch we didn't
have page->mapping reassignment, so we didn't need rcu_dereference().
But now it can happen, so I think rcu_dereference() is better.

Perhaps I'm missing something.


diff --git a/mm/rmap.c b/mm/rmap.c
index 8b088f0..b4a0b5b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -295,7 +295,7 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
 	unsigned long anon_mapping;
 
 	rcu_read_lock();
-	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
+	anon_mapping = (unsigned long) rcu_dereference(page->mapping);
 	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
 		goto out;
 	if (!page_mapped(page))





^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma()
  2010-04-09  7:29           ` KOSAKI Motohiro
@ 2010-04-09  7:57             ` KAMEZAWA Hiroyuki
  2010-04-09  8:03               ` KAMEZAWA Hiroyuki
  2010-04-09  8:01             ` Minchan Kim
  2010-04-09  8:44             ` [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma() Peter Zijlstra
  2 siblings, 1 reply; 113+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-09  7:57 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Nick Piggin, Peter Zijlstra, Andrea Arcangeli, Avi Kivity,
	Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm, Linus Torvalds,
	linux-kernel, linux-arch, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman

On Fri,  9 Apr 2010 16:29:59 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > On Fri,  9 Apr 2010 15:34:33 +0900 (JST)
> > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > 
> > > > On Fri, 9 Apr 2010 13:16:41 +1000
> > > > Nick Piggin <npiggin@suse.de> wrote:
> > > > 
> > > > > On Thu, Apr 08, 2010 at 09:17:39PM +0200, Peter Zijlstra wrote:
> > > > > > There is nothing preventing the anon_vma from being detached while we
> > > > > > are spinning to acquire the lock. Most (all?) current users end up
> > > > > > calling something like vma_address(page, vma) on it, which has a
> > > > > > fairly good chance of weeding out wonky vmas.
> > > > > > 
> > > > > > However suppose the anon_vma got freed and re-used while we were
> > > > > > waiting to acquire the lock, and the new anon_vma fits with the
> > > > > > page->index (because that is the only thing vma_address() uses to
> > > > > > determine if the page fits in a particular vma, we could end up
> > > > > > traversing faulty anon_vma chains.
> > > > > > 
> > > > > > Close this hole for good by re-validating that page->mapping still
> > > > > > holds the very same anon_vma pointer after we acquire the lock, if not
> > > > > > be utterly paranoid and retry the whole operation (which will very
> > > > > > likely bail, because it's unlikely the page got attached to a different
> > > > > > anon_vma in the meantime).
> > > > > 
> > > > > Hm, looks like a bugfix? How was this supposed to be safe?
> > > > > 
> > > > IIUC.
> > > > 
> > > > Before Rik's change to anon_vma, once page->mapping is set as anon_vma | 0x1,
> > > > it's not modified until the page is freed.
> > > > After the patch, do_wp_page() overwrite page->mapping when it reuse existing
> > > > page.
> > > 
> > > Why?
> > > IIUC. page->mapping dereference in page_lock_anon_vma() makes four story.
> > > 
> > > 1. the anon_vma is valid
> > > 	-> do page_referenced_one(). 
> > > 2. the anon_vma is invalid and freed to buddy
> > > 	-> bail out by page_mapped(), no touch anon_vma
> > > 3. the anon_vma is kfreed, and not reused
> > > 	-> bail out by page_mapped()
> > > 4. the anon_vma is kfreed, but reused as another anon_vma
> > > 	-> bail out by page_check_address()
> > > 
> > > Now we have to consider 5th story.
> > > 
> > > 5. the anon_vma is exchanged another anon_vma by do_wp_page.
> > > 	-> bail out by above bailing out stuff.
> > > 
> > > 
> > > I agree peter's patch makes sense. but I don't think Rik's patch change
> > > locking rule.
> > > 
> > 
> > Hmm, I think following.
> > 
> > Assume a page is ANON and SwapCache, and it has only one reference.
> > Consider it's read-only mapped and cause do_wp_page().
> > page_mapcount(page) == 1 here.
> > 
> >     CPU0                          CPU1
> > 
> > 1. do_wp_page()
> > 2. .....
> > 3. replace anon_vma.     anon_vma = lock_page_anon_vma()
> > 
> > So, lock_page_anon_vma() may have lock on wrong anon_vma, here.(mapcount=1)
> > 
> > 4. modify pte to writable.        do something...
> > 
> > After lock, in CPU1, a pte of estimated address by vma_address(vma, page)
> > containes pfn of the page and page_check_address() will success.
> > 
> > I'm not sure how this is dangerouns.
> > But it's possible that CPU1 cannot notice there was anon_vma replacement.
> > And modifies pte withoug holding anon vma's lock which the code believes
> > it's holded.
> 
> 
> Hehe, page_referenced() already can take unstable VM_LOCKED value. So,
> In worst case we make false positive pageout, but it's not disaster.
> I think. Anyway "use after free" don't happen by this blutal code.
> 
> However, I think you pointed one good thing. before Rik patch, we don't have
> page->mapping reassignment. then, we didn't need rcu_dereference().
> but now it can happen. so, I think rcu_dereference() is better.
> 
> Perhaps, I'm missing something.
> 

Hmm. I wonder whether we can check "did we lock a valid anon_vma or not" only
under pte_lock or lock_page().
==
	anon_vma = page_anon_vma();
	lock(anon_vma->lock);
	....
	page_check_address(page)
		....
		pte_lock();
		if (page_anon_vma(page) == anon_vma)
			# anon_vma replacement happens!
	unlock(anon_vma->lock);
==
So, rather than page_lock_anon_vma(), page_check_address() may have to check
for anon_vma replacement... But I cannot think of a dangerous case which can
cause a panic for now.
I may be missing something...


Thanks,
-Kame

> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 8b088f0..b4a0b5b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -295,7 +295,7 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
>  	unsigned long anon_mapping;
>  
>  	rcu_read_lock();
> -	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> +	anon_mapping = (unsigned long) rcu_dereference(page->mapping);
>  	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
>  		goto out;
>  	if (!page_mapped(page))
> 
> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma()
  2010-04-09  7:29           ` KOSAKI Motohiro
  2010-04-09  7:57             ` KAMEZAWA Hiroyuki
@ 2010-04-09  8:01             ` Minchan Kim
  2010-04-09  8:17               ` KOSAKI Motohiro
  2010-04-09  8:44             ` [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma() Peter Zijlstra
  2 siblings, 1 reply; 113+ messages in thread
From: Minchan Kim @ 2010-04-09  8:01 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Nick Piggin, Peter Zijlstra, Andrea Arcangeli,
	Avi Kivity, Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm,
	Linus Torvalds, linux-kernel, linux-arch, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman

Hi, Kosaki.

On Fri, Apr 9, 2010 at 4:29 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> Hmm, I think following.
>>
>> Assume a page is ANON and SwapCache, and it has only one reference.
>> Consider it's read-only mapped and cause do_wp_page().
>> page_mapcount(page) == 1 here.
>>
>>     CPU0                          CPU1
>>
>> 1. do_wp_page()
>> 2. .....
>> 3. replace anon_vma.     anon_vma = lock_page_anon_vma()
>>
>> So, lock_page_anon_vma() may have lock on wrong anon_vma, here.(mapcount=1)
>>
>> 4. modify pte to writable.        do something...
>>
>> After lock, in CPU1, a pte of estimated address by vma_address(vma, page)
>> containes pfn of the page and page_check_address() will success.
>>
>> I'm not sure how this is dangerouns.
>> But it's possible that CPU1 cannot notice there was anon_vma replacement.
>> And modifies pte withoug holding anon vma's lock which the code believes
>> it's holded.
>
>
> Hehe, page_referenced() already can take unstable VM_LOCKED value. So,
> In worst case we make false positive pageout, but it's not disaster.

OFF-TOPIC:

I think you pointed out a good thing, too. :)

You mean that although an application calls mlock() on a vma, a few pages in
the vma can be swapped out by a race between mlock and reclaim?

Although it's not a disaster, apparently it breaks the API.
Man page
" mlock() and munlock()
  mlock()  locks pages in the address range starting at addr and
continuing for len bytes. All pages that contain a part of the
specified address range are guaranteed to be resident in RAM when the
call returns  successfully;  the pages are guaranteed to stay in RAM
until later unlocked."

Do you have a plan to solve this problem?

And how about adding a simple comment about that race in page_referenced_one()?
Could you send a patch?




-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma()
  2010-04-09  7:57             ` KAMEZAWA Hiroyuki
@ 2010-04-09  8:03               ` KAMEZAWA Hiroyuki
  2010-04-09  8:24                 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 113+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-09  8:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Nick Piggin, Peter Zijlstra, Andrea Arcangeli,
	Avi Kivity, Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm,
	Linus Torvalds, linux-kernel, linux-arch, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman

On Fri, 9 Apr 2010 16:57:03 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Fri,  9 Apr 2010 16:29:59 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > > Hmm, I think following.
> > > 
> > > Assume a page is ANON and SwapCache, and it has only one reference.
> > > Consider it's read-only mapped and cause do_wp_page().
> > > page_mapcount(page) == 1 here.
> > > 
> > >     CPU0                          CPU1
> > > 
> > > 1. do_wp_page()
> > > 2. .....
> > > 3. replace anon_vma.     anon_vma = lock_page_anon_vma()
> > > 
> > > So, lock_page_anon_vma() may have lock on wrong anon_vma, here.(mapcount=1)
> > > 
> > > 4. modify pte to writable.        do something...
> > > 
> > > After lock, in CPU1, a pte of estimated address by vma_address(vma, page)
> > > containes pfn of the page and page_check_address() will success.
> > > 
> > > I'm not sure how this is dangerouns.
> > > But it's possible that CPU1 cannot notice there was anon_vma replacement.
> > > And modifies pte withoug holding anon vma's lock which the code believes
> > > it's holded.
> > 
> > 
> > Hehe, page_referenced() already can take unstable VM_LOCKED value. So,
> > In worst case we make false positive pageout, but it's not disaster.
> > I think. Anyway "use after free" don't happen by this blutal code.
> > 
> > However, I think you pointed one good thing. before Rik patch, we don't have
> > page->mapping reassignment. then, we didn't need rcu_dereference().
> > but now it can happen. so, I think rcu_dereference() is better.
> > 
> > Perhaps, I'm missing something.
> > 
> 
> Hmm. I wonder we can check "whether we lock valid anon_vma or not" only under
> pte_lock or lock_page().
> ==
> 	anon_vma = page_anon_vma();
> 	lock(anon_vma->lock);
> 	....
> 	page_check_address(page)
> 		....
> 		pte_lock();
> 		if (page_anon_vma(page) == anon_vma)
> 			# anon_vma replacement happens!
> 	unlock(anon_vma->lock);
> ==
> So, rather than page_lock_anon_vma(),  page_check_address() may have to check anon_vma
> replacement....But I cannot think of dangerous case which can cause panic for now.
> I may miss something...
> 
Ah...anon_vma replacemet occurs under lock_page() and pte_lock.
Almost all callers of page_lock_anon_vma() holds lock_page(). So, I think
this anon_vma replacement is not very serious.
Hmm...

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 09/13] mm, powerpc: Move the RCU page-table freeing into generic code
  2010-04-09  3:35   ` Nick Piggin
@ 2010-04-09  8:08     ` Peter Zijlstra
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-09  8:08 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman

On Fri, 2010-04-09 at 13:35 +1000, Nick Piggin wrote:
> On Thu, Apr 08, 2010 at 09:17:46PM +0200, Peter Zijlstra wrote:
> > Index: linux-2.6/include/asm-generic/tlb.h
> > ===================================================================
> > --- linux-2.6.orig/include/asm-generic/tlb.h
> > +++ linux-2.6/include/asm-generic/tlb.h
> > @@ -27,6 +27,49 @@
> >    #define tlb_fast_mode(tlb) 1
> >  #endif
> >  
> > +#ifdef HAVE_ARCH_RCU_TABLE_FREE
> > +/*
> > + * Semi RCU freeing of the page directories.
> > + *
> > + * This is needed by some architectures to implement gup_fast().
> 
> Really? I see the comment in the powerpc code, but powerpc was already
> using RCU before gup_fast(), and AFAIKS it is indeed using it so that
> it can handle faults by getting the linux pte with find_linux_pte ?

Ah, see, that is my ignorance of the powerpc MMU code; I've only been
staring at this mmu_gather piece long enough to (hopefully) make it
work.

If there is indeed more to it, then yes, that needs documenting.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 07/13] powerpc: Preemptible mmu_gather
  2010-04-09  4:07   ` Nick Piggin
@ 2010-04-09  8:14     ` Peter Zijlstra
  2010-04-09  8:46       ` Nick Piggin
  2010-04-13  2:06       ` Benjamin Herrenschmidt
  2010-04-13  1:56     ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-09  8:14 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman

On Fri, 2010-04-09 at 14:07 +1000, Nick Piggin wrote:
> On Thu, Apr 08, 2010 at 09:17:44PM +0200, Peter Zijlstra wrote:
> > Fix up powerpc to the new mmu_gather stuffs.
> > 
> > PPC has an extra batching queue to RCU free the actual pagetable
> > allocations, use the ARCH extentions for that for now.
> > 
> > For the ppc64_tlb_batch, which tracks the vaddrs to unhash from the
> > hardware hash-table, keep using per-cpu arrays but flush on context
> > switch and use a TIF bit to track the laxy_mmu state.
> 
> Hm. Pity powerpc can't just use tlb flush gathering for this batching,
> (which is what it was designed for). Then it could avoid these tricks.
> What's preventing this? Adding a tlb gather for COW case in
> copy_page_range?

I'm not quite sure about that, I didn't fully investigate it, I just
wanted to get something working for now.

One of the things is that both power and sparc need more than the struct
page we normally gather.

I did think of making the mmu_gather have something like

struct mmu_page {
  struct page *page;
#ifdef HAVE_ARCH_TLB_VADDR
  unsigned long vaddr;
#endif
};

struct mmu_gather {
  ...
  unsigned int nr;
  struct mmu_page *pages;
};


and doing that vaddr collection right along with it in the same batch.

I think that that would work, Ben, Dave?
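
Filling the combined batch would then look something like this (again just a
sketch; the helper and the flush placeholder are hypothetical):

==
static inline void tlb_remove_page_vaddr(struct mmu_gather *tlb,
					 struct page *page,
					 unsigned long vaddr)
{
	struct mmu_page *mp = &tlb->pages[tlb->nr++];

	mp->page = page;
#ifdef HAVE_ARCH_TLB_VADDR
	mp->vaddr = vaddr;	/* arch flush uses this to unhash/invalidate */
#endif
	if (tlb->nr == tlb->max) {
		/* flush the hardware TLB / hash entries, then free the batch */
	}
}
==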


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma()
  2010-04-09  8:01             ` Minchan Kim
@ 2010-04-09  8:17               ` KOSAKI Motohiro
  2010-04-09 14:41                 ` mlock and pageout race? Minchan Kim
  0 siblings, 1 reply; 113+ messages in thread
From: KOSAKI Motohiro @ 2010-04-09  8:17 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Nick Piggin, Peter Zijlstra,
	Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman

Hi Minchan,

> OFF-TOPIC:
> 
> I think you pointed out good thing, too. :)
> 
> You mean although application call mlock of any vma, few pages on the vma can
> be swapout by race between mlock and reclaim?
> 
> Although it's not disaster, apparently it breaks API.
> Man page
> " mlock() and munlock()
>   mlock()  locks pages in the address range starting at addr and
> continuing for len bytes. All pages that contain a part of the
> specified address range are guaranteed to be resident in RAM when the
> call returns  successfully;  the pages are guaranteed to stay in RAM
> until later unlocked."
> 
> Do you have a plan to solve such problem?
> 
> And how about adding simple comment about that race in page_referenced_one?
> Could you send the patch?

I'm surprised by this mail; you have been pushing many patches in this area.
I believed you knew all this stuff ;)

My answer is that it doesn't need a fix, because it's not a bug. The point is
that this one is a race issue, not a "pageout after mlock" issue.
If pageout and mlock occur at exactly the same time, a human can't observe
which event occurred first; it's not an API violation.

Thanks.



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 06/13] mm: Preemptible mmu_gather
  2010-04-09  3:25   ` Nick Piggin
@ 2010-04-09  8:18     ` Peter Zijlstra
  2010-04-09 20:36     ` Peter Zijlstra
  1 sibling, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-09  8:18 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman

On Fri, 2010-04-09 at 13:25 +1000, Nick Piggin wrote:
> On Thu, Apr 08, 2010 at 09:17:43PM +0200, Peter Zijlstra wrote:
> > @@ -39,30 +33,48 @@
> >  struct mmu_gather {
> >  	struct mm_struct	*mm;
> >  	unsigned int		nr;	/* set to ~0U means fast mode */
> > +	unsigned int		max;	/* nr < max */
> >  	unsigned int		need_flush;/* Really unmapped some ptes? */
> >  	unsigned int		fullmm; /* non-zero means full mm flush */
> > -	struct page *		pages[FREE_PTE_NR];
> > +#ifdef HAVE_ARCH_MMU_GATHER
> > +	struct arch_mmu_gather	arch;
> > +#endif
> > +	struct page		**pages;
> > +	struct page		*local[8];
> 
> Have you done some profiling on this? What I would like to see, if
> it's not too much complexity, is to have a small set of pages to
> handle common size frees, and then use them up first by default
> before attempting to allocate more.
> 
> Also, it would be cool to be able to chain allocations to avoid
> TLB flushes even on big frees (overridable by arch of course, in
> case they're doing some non-preeemptible work or you wish to break
> up lock hold times). But that might be just getting over engineered.

I did no profiling at all; back when I wrote this I was in a hurry to get
this working for -rt.

But yes, those things do look like something we want to look into, we
can easily add a head structure to these pages like we did for the RCU
batches.

But as it stands I think we can do those things as incrementals on top
of this, no?

What kind of workload would you recommend I use to profile this?


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma()
  2010-04-09  8:03               ` KAMEZAWA Hiroyuki
@ 2010-04-09  8:24                 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 113+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-09  8:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Nick Piggin, Peter Zijlstra, Andrea Arcangeli,
	Avi Kivity, Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm,
	Linus Torvalds, linux-kernel, linux-arch, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman

On Fri, 9 Apr 2010 17:03:49 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Fri, 9 Apr 2010 16:57:03 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Fri,  9 Apr 2010 16:29:59 +0900 (JST)
> > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > > > Hmm, I think following.
> > > > 
> > > > Assume a page is ANON and SwapCache, and it has only one reference.
> > > > Consider it's read-only mapped and cause do_wp_page().
> > > > page_mapcount(page) == 1 here.
> > > > 
> > > >     CPU0                          CPU1
> > > > 
> > > > 1. do_wp_page()
> > > > 2. .....
> > > > 3. replace anon_vma.     anon_vma = lock_page_anon_vma()
> > > > 
> > > > So, lock_page_anon_vma() may have lock on wrong anon_vma, here.(mapcount=1)
> > > > 
> > > > 4. modify pte to writable.        do something...
> > > > 
> > > > After lock, in CPU1, a pte of estimated address by vma_address(vma, page)
> > > > containes pfn of the page and page_check_address() will success.
> > > > 
> > > > I'm not sure how this is dangerouns.
> > > > But it's possible that CPU1 cannot notice there was anon_vma replacement.
> > > > And modifies pte withoug holding anon vma's lock which the code believes
> > > > it's holded.
> > > 
> > > 
> > > Hehe, page_referenced() already can take unstable VM_LOCKED value. So,
> > > In worst case we make false positive pageout, but it's not disaster.
> > > I think. Anyway "use after free" don't happen by this blutal code.
> > > 
> > > However, I think you pointed one good thing. before Rik patch, we don't have
> > > page->mapping reassignment. then, we didn't need rcu_dereference().
> > > but now it can happen. so, I think rcu_dereference() is better.
> > > 
> > > Perhaps, I'm missing something.
> > > 
> > 
> > Hmm. I wonder we can check "whether we lock valid anon_vma or not" only under
> > pte_lock or lock_page().
> > ==
> > 	anon_vma = page_anon_vma();
> > 	lock(anon_vma->lock);
> > 	....
> > 	page_check_address(page)
> > 		....
> > 		pte_lock();
> > 		if (page_anon_vma(page) == anon_vma)
> > 			# anon_vma replacement happens!
> > 	unlock(anon_vma->lock);
> > ==
> > So, rather than page_lock_anon_vma(),  page_check_address() may have to check anon_vma
> > replacement....But I cannot think of dangerous case which can cause panic for now.
> > I may miss something...
> > 
> Ah...anon_vma replacemet occurs under lock_page() and pte_lock.
> Almost all callers of page_lock_anon_vma() holds lock_page(). So, I think
> this anon_vma replacement is not very serious.
Sorry for short mails ;(


Note: vmscan.c::shrink_active_list()
	-> page_referenced()
doesn't take lock_page() and may see the wrong anon_vma due to replacement.

Don't we need lock_page() around it?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 13/13] mm: Optimize page_lock_anon_vma
  2010-04-08 22:18   ` Paul E. McKenney
@ 2010-04-09  8:35     ` Peter Zijlstra
  2010-04-09 19:22       ` Paul E. McKenney
  0 siblings, 1 reply; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-09  8:35 UTC (permalink / raw)
  To: paulmck
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin

On Thu, 2010-04-08 at 15:18 -0700, Paul E. McKenney wrote:
> On Thu, Apr 08, 2010 at 09:17:50PM +0200, Peter Zijlstra wrote:
> > Optimize page_lock_anon_vma() by removing the atomic ref count
> > ops from the fast path.
> > 
> > Rather complicates the code a lot, but might be worth it.
> 
> Some questions and a disclaimer below.
> 
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > ---
> >  mm/rmap.c |   71 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 67 insertions(+), 4 deletions(-)
> > 
> > Index: linux-2.6/mm/rmap.c
> > ===================================================================
> > --- linux-2.6.orig/mm/rmap.c
> > +++ linux-2.6/mm/rmap.c
> > @@ -78,6 +78,12 @@ static inline struct anon_vma *anon_vma_
> >  void anon_vma_free(struct anon_vma *anon_vma)
> >  {
> >  	VM_BUG_ON(atomic_read(&anon_vma->ref));
> > +	/*
> > +	 * Sync against the anon_vma->lock, so that we can hold the
> > +	 * lock without requiring a reference. See page_lock_anon_vma().
> > +	 */
> > +	mutex_lock(&anon_vma->lock);
> 
> On some systems, the CPU is permitted to pull references into the critical
> section from either side.  So, do we also need an smp_mb() here?
> 
> > +	mutex_unlock(&anon_vma->lock);
> 
> So, a question...
> 
> Can the above mutex be contended?  If yes, what happens when the
> competing mutex_lock() acquires the lock at this point?  Or, worse yet,
> after the kmem_cache_free()?
> 
> If no, what do we accomplish by acquiring the lock?

The thing we gain is that when the holder of the lock finds a !0
refcount, it knows the anon_vma can't go away, because any free will first
wait to acquire the lock.

> If the above mutex can be contended, can we fix by substituting
> synchronize_rcu_expedited()?  Which will soon require some scalability
> attention if it gets used here, but what else is new?  ;-)

No, synchronize_rcu_expedited() will not work here; there is no RCU read
side that covers the full usage of the anon_vma (there can't be, it
needs to sleep).

> >  	kmem_cache_free(anon_vma_cachep, anon_vma);
> >  }
> > 
> > @@ -291,7 +297,7 @@ void __init anon_vma_init(void)
> > 
> >  /*
> >   * Getting a lock on a stable anon_vma from a page off the LRU is
> > - * tricky: page_lock_anon_vma relies on RCU to guard against the races.
> > + * tricky: anon_vma_get relies on RCU to guard against the races.
> >   */
> >  struct anon_vma *anon_vma_get(struct page *page)
> >  {
> > @@ -320,12 +326,70 @@ out:
> >  	return anon_vma;
> >  }
> > 
> > +/*
> > + * Similar to anon_vma_get(), however it relies on the anon_vma->lock
> > + * to pin the object. However since we cannot wait for the mutex
> > + * acquisition inside the RCU read lock, we use the ref count
> > + * in the slow path.
> > + */
> >  struct anon_vma *page_lock_anon_vma(struct page *page)
> >  {
> > -	struct anon_vma *anon_vma = anon_vma_get(page);
> > +	struct anon_vma *anon_vma = NULL;
> > +	unsigned long anon_mapping;
> > +
> > +again:
> > +	rcu_read_lock();
> 
> This is interesting.  You have an RCU read-side critical section with
> no rcu_dereference().
> 
> This strange state of affairs is actually legal (assuming that
> anon_mapping is the RCU-protected structure) because all dereferences
> of the anon_vma variable are atomic operations that guarantee ordering
> (the mutex_trylock() and the atomic_inc_not_zero().
> 
> The other dereferences (the atomic_read()s) are under the lock, so
> are also OK assuming that the lock is held when initializing and
> updating these fields, and even more OK due to the smp_rmb() below.
> 
> But see below.


Right, so the only thing rcu_read_lock() does here is create the
guarantee that anon_vma is safe to dereference (it lives on a
SLAB_DESTROY_BY_RCU slab).

But yes, I suppose that page->mapping read that now uses ACCESS_ONCE()
would actually want to be an rcu_dereference(), since that provides both
the ACCESS_ONCE() and the read-dependency barrier that I think would be
needed.

> > +	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> > +	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> > +		goto unlock;
> > +	if (!page_mapped(page))
> > +		goto unlock;
> > +
> > +	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> > +	if (!mutex_trylock(&anon_vma->lock)) {
> > +		/*
> > +		 * We failed to acquire the lock, take a ref so we can
> > +		 * drop the RCU read lock and sleep on it.
> > +		 */
> > +		if (!atomic_inc_not_zero(&anon_vma->ref)) {
> > +			/*
> > +			 * Failed to get a ref, we're dead, bail.
> > +			 */
> > +			anon_vma = NULL;
> > +			goto unlock;
> > +		}
> > +		rcu_read_unlock();
> > 
> > -	if (anon_vma)
> >  		mutex_lock(&anon_vma->lock);
> > +		/*
> > +		 * We got the lock, drop the temp. ref, if it was the last
> > +		 * one free it and bail.
> > +		 */
> > +		if (atomic_dec_and_test(&anon_vma->ref)) {
> > +			mutex_unlock(&anon_vma->lock);
> > +			anon_vma_free(anon_vma);
> > +			anon_vma = NULL;
> > +		}
> > +		goto out;
> > +	}
> > +	/*
> > +	 * Got the lock, check we're still alive. Seeing a ref
> > +	 * here guarantees the object will stay alive due to
> > +	 * anon_vma_free() syncing against the lock we now hold.
> > +	 */
> > +	smp_rmb(); /* Order against anon_vma_put() */
> 
> This is ordering the fetch into anon_vma against the atomic_read() below?
> If so, smp_read_barrier_depends() will cover it more cheaply.  Alternatively,
> use rcu_dereference() when fetching into anon_vma.
> 
> Or am I misunderstanding the purpose of this barrier?

Yes, it is:

  atomic_dec_and_test(&anon_vma->ref) /* implies mb */

				smp_rmb();
				atomic_read(&anon_vma->ref);

> (Disclaimer: I have not yet found anon_vma_put(), so I am assuming that
> anon_vma_free() plays the role of a grace period.)

Yes, that lives in one of the other patches (does not exist in
mainline).
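
For reference, from the discussion above it is presumably just the obvious
counterpart to the ref taken earlier, something like the following (a guess
based on this thread, not the actual patch text):

==
static inline void anon_vma_put(struct anon_vma *anon_vma)
{
	if (atomic_dec_and_test(&anon_vma->ref))	/* implies mb */
		anon_vma_free(anon_vma);
}
==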



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 00/13] mm: preemptibility -v2
  2010-04-09  4:14 ` Nick Piggin
@ 2010-04-09  8:35   ` Peter Zijlstra
  2010-04-09  8:50     ` Nick Piggin
  0 siblings, 1 reply; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-09  8:35 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman

On Fri, 2010-04-09 at 14:14 +1000, Nick Piggin wrote:
> On Thu, Apr 08, 2010 at 09:17:37PM +0200, Peter Zijlstra wrote:
> > Hi,
> > 
> > This (still incomplete) patch-set makes part of the mm a lot more preemptible.
> > It converts i_mmap_lock and anon_vma->lock to mutexes.  On the way there it
> > also makes mmu_gather preemptible.
> > 
> > The main motivation was making mm_take_all_locks() preemptible, since it
> > appears people are nesting hundreds of spinlocks there.
> > 
> > The side-effects are that we can finally make mmu_gather preemptible, something
> > which lots of people have wanted to do for a long time.
> 
> What's the straight-line performance impact of all this? And how about
> concurrency, I wonder. mutexes of course are double the atomics, and
> you've added a refcount which is two more again for those paths using
> it.
> 
> Page faults are very important. We unfortunately have some databases
> doing a significant amount of mmap/munmap activity too. 

You think this would affect the mmap/munmap times in any significant
way? It seems to me those are relatively heavy ops to begin with.

> I'd like to
> see microbenchmark numbers for each of those (both anon and file backed
> for page faults).

OK, I'll dig out that fault test used in the whole mmap_sem/rwsem thread
a while back and modify it to also do file-backed faults.

> kbuild does quite a few pages faults, that would be an easy thing to
> test. Not sure what reasonable kinds of cases exercise parallelism.
> 
> 
> > What kind of performance tests would people have me run on this to satisfy
> > their need for numbers? I've done a kernel build on x86_64 and if anything that
> > was slightly faster with these patches, but it was well within the noise
> > levels so it might be heat noise I'm looking at ;-)
> 
> Is it because you're reducing the number of TLB flushes, or what
> (kbuild isn't multi threaded so on x86 TLB flushes should be really
> fast anyway).

I'll try and get some perf stat runs to get some insight into this. But
the numbers were:

 time make O=defconfig -j48 bzImage (5x, cache hot)

without:  avg: 39.2018s +- 0.3407
with:     avg: 38.9886s +- 0.1814




^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma()
  2010-04-09  7:29           ` KOSAKI Motohiro
  2010-04-09  7:57             ` KAMEZAWA Hiroyuki
  2010-04-09  8:01             ` Minchan Kim
@ 2010-04-09  8:44             ` Peter Zijlstra
  2010-05-24 19:32               ` Andrew Morton
  2 siblings, 1 reply; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-09  8:44 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Nick Piggin, Andrea Arcangeli, Avi Kivity,
	Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm, Linus Torvalds,
	linux-kernel, linux-arch, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman

On Fri, 2010-04-09 at 16:29 +0900, KOSAKI Motohiro wrote:
> 
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 8b088f0..b4a0b5b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -295,7 +295,7 @@ struct anon_vma *page_lock_anon_vma(struct page
> *page)
>         unsigned long anon_mapping;
>  
>         rcu_read_lock();
> -       anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> +       anon_mapping = (unsigned long) rcu_dereference(page->mapping);
>         if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
>                 goto out;
>         if (!page_mapped(page)) 

Yes, I think this is indeed required.

I'll do a new version of the patch that includes the comment updates
requested by Andrew.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 07/13] powerpc: Preemptible mmu_gather
  2010-04-09  8:14     ` Peter Zijlstra
@ 2010-04-09  8:46       ` Nick Piggin
  2010-04-09  9:22         ` Peter Zijlstra
  2010-04-13  2:06       ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 113+ messages in thread
From: Nick Piggin @ 2010-04-09  8:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman

On Fri, Apr 09, 2010 at 10:14:07AM +0200, Peter Zijlstra wrote:
> On Fri, 2010-04-09 at 14:07 +1000, Nick Piggin wrote:
> > On Thu, Apr 08, 2010 at 09:17:44PM +0200, Peter Zijlstra wrote:
> > > Fix up powerpc to the new mmu_gather stuffs.
> > > 
> > > PPC has an extra batching queue to RCU free the actual pagetable
> > > allocations, use the ARCH extentions for that for now.
> > > 
> > > For the ppc64_tlb_batch, which tracks the vaddrs to unhash from the
> > > hardware hash-table, keep using per-cpu arrays but flush on context
> > > switch and use a TIF bit to track the laxy_mmu state.
> > 
> > Hm. Pity powerpc can't just use tlb flush gathering for this batching,
> > (which is what it was designed for). Then it could avoid these tricks.
> > What's preventing this? Adding a tlb gather for COW case in
> > copy_page_range?
> 
> I'm not quite sure what about that, didn't fully investigate it, just
> wanted to get something working for now.

No, it's not your problem, just perhaps a good add-on to your
patchset. Thanks for thinking about it, though.

> 
> Of of the things is that both power and sparc need more than the struct
> page we normally gather.
> 
> I did think of making the mmu_gather have something like
> 
> struct mmu_page {
>   struct page *page;
> #ifdef HAVE_ARCH_TLB_VADDR
>   unsigned long vaddr;
> #endif
> };

Well, you could also have a per-arch struct for this, which each arch can
fill in with its own info (I think powerpc takes the pte as well).

> 
> struct mmu_gather {
>   ...
>   unsigned int nr;
>   struct mmu_page *pages;
> };
> 
> 
> and doing that vaddr collection right along with it in the same batch.
> 
> I think that that would work, Ben, Dave?

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 00/13] mm: preemptibility -v2
  2010-04-09  8:35   ` Peter Zijlstra
@ 2010-04-09  8:50     ` Nick Piggin
  2010-04-09  8:58       ` Peter Zijlstra
  0 siblings, 1 reply; 113+ messages in thread
From: Nick Piggin @ 2010-04-09  8:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman

On Fri, Apr 09, 2010 at 10:35:31AM +0200, Peter Zijlstra wrote:
> On Fri, 2010-04-09 at 14:14 +1000, Nick Piggin wrote:
> > On Thu, Apr 08, 2010 at 09:17:37PM +0200, Peter Zijlstra wrote:
> > > Hi,
> > > 
> > > This (still incomplete) patch-set makes part of the mm a lot more preemptible.
> > > It converts i_mmap_lock and anon_vma->lock to mutexes.  On the way there it
> > > also makes mmu_gather preemptible.
> > > 
> > > The main motivation was making mm_take_all_locks() preemptible, since it
> > > appears people are nesting hundreds of spinlocks there.
> > > 
> > > The side-effects are that we can finally make mmu_gather preemptible, something
> > > which lots of people have wanted to do for a long time.
> > 
> > What's the straight-line performance impact of all this? And how about
> > concurrency, I wonder. mutexes of course are double the atomics, and
> > you've added a refcount which is two more again for those paths using
> > it.
> > 
> > Page faults are very important. We unfortunately have some databases
> > doing a significant amount of mmap/munmap activity too. 
> 
> You think this would affect the mmap/munmap times in any significant
> way? It seems to me those are relatively heavy ops to begin with.

They're actually not _too_ heavy because they just set up and tear
down vmas. No flushing or faulting required (well, in order to make
any _use_ of them you need flushing and faulting of course).

I have some microbenchmarks like this and the page fault test I
could try.

 
> > I'd like to
> > see microbenchmark numbers for each of those (both anon and file backed
> > for page faults).
> 
> OK, I'll dig out that fault test used in the whole mmap_sem/rwsem thread
> a while back and modify it to also do file backed faults.

That'd be good. Anonymous as well of course, for non-databases :)

 
> > kbuild does quite a few pages faults, that would be an easy thing to
> > test. Not sure what reasonable kinds of cases exercise parallelism.
> > 
> > 
> > > What kind of performance tests would people have me run on this to satisfy
> > > their need for numbers? I've done a kernel build on x86_64 and if anything that
> > > was slightly faster with these patches, but it was well within the noise
> > > levels so it might be heat noise I'm looking at ;-)
> > 
> > Is it because you're reducing the number of TLB flushes, or what
> > (kbuild isn't multi threaded so on x86 TLB flushes should be really
> > fast anyway).
> 
> I'll try and get some perf stat runs to get some insight into this. But
> the numbers were:
> 
>  time make O=defconfig -j48 bzImage (5x, cache hot)
> 
> without:  avg: 39.2018s +- 0.3407
> with:     avg: 38.9886s +- 0.1814

Well, that's interesting. Nice if it is an improvement and not just some
anomaly. I'd start by looking at TLB flushes, maybe? For testing, it
would be nice to make the flush sizes equal so you get more of a
comparison of the straight-line code.

Other than this, I don't have a good suggestion of what to test. I mean,
how far can you go? :) Some threaded workloads would probably be a good
idea, though. Java?


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 00/13] mm: preemptibility -v2
  2010-04-08 19:17 ` Peter Zijlstra
                   ` (16 preceding siblings ...)
  (?)
@ 2010-04-09  8:58 ` Martin Schwidefsky
  2010-04-09  9:53   ` Peter Zijlstra
  -1 siblings, 1 reply; 113+ messages in thread
From: Martin Schwidefsky @ 2010-04-09  8:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin

Hi Peter,

On Thu, 08 Apr 2010 21:17:37 +0200
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> The side-effects are that we can finally make mmu_gather preemptible, something
> which lots of people have wanted to do for a long time.

Yes, that is a good thing. With the preemptible mmu_gather, s390 will
use fewer IDTEs (that is the instruction that flushes all TLBs for a
given address space) on a full flush. Good :-)

> This patch-set seems to build and boot on my x86_64 machines and even builds a
> kernel. I've also attempted powerpc and sparc, which I've compile tested with
> their respective defconfigs, remaining are (afaikt the rest uses the generic
> tlb bits):
> 
>  - s390
>  - ia64
>  - arm
>  - superh
>  - um
> 
> From those, s390 and ia64 look 'interesting', arm and superh seem very similar
> and should be relatively easy (-rt has a patchlet for arm iirc).

To get the 'interesting' TLB flushing on s390 working again you need
this patch:

--
[PATCH] s390: preemptible mmu_gather

From: Martin Schwidefsky <schwidefsky@de.ibm.com>

Adapt the stand-alone s390 mmu_gather implementation to the new
preemptible mmu_gather interface.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/include/asm/tlb.h |   43 +++++++++++++++++++++++++------------------
 1 file changed, 25 insertions(+), 18 deletions(-)

--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -28,45 +28,51 @@
 #include <asm/smp.h>
 #include <asm/tlbflush.h>
 
-#ifndef CONFIG_SMP
-#define TLB_NR_PTRS	1
-#else
-#define TLB_NR_PTRS	508
-#endif
-
 struct mmu_gather {
 	struct mm_struct *mm;
 	unsigned int fullmm;
 	unsigned int nr_ptes;
 	unsigned int nr_pxds;
-	void *array[TLB_NR_PTRS];
+	unsigned int max;
+	void **array;
+	void *local[8];
 };
 
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
-
-static inline struct mmu_gather *tlb_gather_mmu(struct mm_struct *mm,
-						unsigned int full_mm_flush)
+static inline void __tlb_alloc_pages(struct mmu_gather *tlb)
 {
-	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
+	unsigned long addr = __get_free_pages(GFP_ATOMIC, 0);
+
+	if (addr) {
+		tlb->array = (void *) addr;
+		tlb->max = PAGE_SIZE / sizeof(void *);
+	}
+}
 
+static inline void tlb_gather_mmu(struct mmu_gather *tlb,
+				  struct mm_struct *mm,
+				  unsigned int full_mm_flush)
+{
 	tlb->mm = mm;
+	tlb->max = ARRAY_SIZE(tlb->local);
+	tlb->array = tlb->local;
 	tlb->fullmm = full_mm_flush || (num_online_cpus() == 1) ||
 		(atomic_read(&mm->mm_users) <= 1 && mm == current->active_mm);
-	tlb->nr_ptes = 0;
-	tlb->nr_pxds = TLB_NR_PTRS;
 	if (tlb->fullmm)
 		__tlb_flush_mm(mm);
-	return tlb;
+	else
+		__tlb_alloc_pages(tlb);
+	tlb->nr_ptes = 0;
+	tlb->nr_pxds = tlb->max;
 }
 
 static inline void tlb_flush_mmu(struct mmu_gather *tlb,
 				 unsigned long start, unsigned long end)
 {
-	if (!tlb->fullmm && (tlb->nr_ptes > 0 || tlb->nr_pxds < TLB_NR_PTRS))
+	if (!tlb->fullmm && (tlb->nr_ptes > 0 || tlb->nr_pxds < tlb->max))
 		__tlb_flush_mm(tlb->mm);
 	while (tlb->nr_ptes > 0)
 		pte_free(tlb->mm, tlb->array[--tlb->nr_ptes]);
-	while (tlb->nr_pxds < TLB_NR_PTRS)
+	while (tlb->nr_pxds < tlb->max)
 		/* pgd_free frees the pointer as region or segment table */
 		pgd_free(tlb->mm, tlb->array[tlb->nr_pxds++]);
 }
@@ -79,7 +85,8 @@ static inline void tlb_finish_mmu(struct
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
 
-	put_cpu_var(mmu_gathers);
+	if (tlb->array != tlb->local)
+		free_pages((unsigned long) tlb->array, 0);
 }
 
 /*

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 00/13] mm: preemptibility -v2
  2010-04-09  8:50     ` Nick Piggin
@ 2010-04-09  8:58       ` Peter Zijlstra
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-09  8:58 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman

On Fri, 2010-04-09 at 18:50 +1000, Nick Piggin wrote:
> I have some microbenchmarks like this and the page fault test I
> could try.
> 
If you don't mind, please. If you send them my way I'll run them on my
machines too.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 00/13] mm: preemptibility -v2
  2010-04-08 19:17 ` Peter Zijlstra
                   ` (17 preceding siblings ...)
  (?)
@ 2010-04-09  9:03 ` David Howells
  2010-04-09  9:22   ` Peter Zijlstra
  -1 siblings, 1 reply; 113+ messages in thread
From: David Howells @ 2010-04-09  9:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: dhowells, Andrea Arcangeli, Avi Kivity, Thomas Gleixner,
	Rik van Riel, Ingo Molnar, akpm, Linus Torvalds, linux-kernel,
	linux-arch, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin


Have you tried compiling this for NOMMU?

David

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 07/13] powerpc: Preemptible mmu_gather
  2010-04-09  8:46       ` Nick Piggin
@ 2010-04-09  9:22         ` Peter Zijlstra
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-09  9:22 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman

On Fri, 2010-04-09 at 18:46 +1000, Nick Piggin wrote:
> > struct mmu_page {
> >   struct page *page;
> > #ifdef HAVE_ARCH_TLB_VADDR
> >   unsigned long vaddr;
> > #endif
> > };
> 
> Well, you could also have a per-arch struct for this, which each arch can
> fill in with its own info (I think powerpc takes the pte as well).

Ah, right you are.

Maybe something like:

#ifndef tlb_fill_page
struct mmu_page {
	struct page *page;
};

#define tlb_fill_page(tlb, pte, addr) do { } while (0)
#endif

That could be used from the tlb_remove_tlb_entry() implementation to
gather both the pte and the addr. The only complication seems to be that
tlb_remove_tlb_entry() and tlb_remove_page() aren't always matched.

Also, it looks like both power and sparc drive the pte/vaddr flush from
the pte_update()/set_pte_at()-like functions, which means it's not synced
to the mmu_gather at all.

So I'm not at all sure it's feasible to link these, but I'll keep it in
mind.
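
For reference, putting the two halves of that idea side by side (a sketch
only; the generic fallback repeats the snippet above, while the arch side
is a guess at what powerpc/sparc would want to carry, not code from the
posted patches):

/* Illustrative sketch, not part of the posted patches. */

/* Generic fallback, as in the snippet above: a batch entry is just the
 * page, and tlb_fill_page() is a no-op. */
#ifndef tlb_fill_page
struct mmu_page {
	struct page *page;
};
#define tlb_fill_page(tlb, pte, addr) do { } while (0)
#endif

/* An arch's asm/tlb.h could instead provide something like (guessing at
 * what powerpc/sparc would need):
 *
 *	struct mmu_page {
 *		struct page	*page;
 *		unsigned long	vaddr;	// for the hash/TLB flush
 *		pte_t		pte;	// powerpc also wants the pte contents
 *	};
 *	#define tlb_fill_page(tlb, pte, addr)	\
 *		do { ... record pte/addr in the current batch ... } while (0)
 */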


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 00/13] mm: preemptibility -v2
  2010-04-09  9:03 ` David Howells
@ 2010-04-09  9:22   ` Peter Zijlstra
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-09  9:22 UTC (permalink / raw)
  To: David Howells
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin

On Fri, 2010-04-09 at 10:03 +0100, David Howells wrote:
> Have you tried compiling this for NOMMU?

No, I will eventually, I'm sure there's something to fix.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 00/13] mm: preemptibility -v2
  2010-04-09  8:58 ` Martin Schwidefsky
@ 2010-04-09  9:53   ` Peter Zijlstra
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-09  9:53 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin

On Fri, 2010-04-09 at 10:58 +0200, Martin Schwidefsky wrote:
> [PATCH] s390: preemptible mmu_gather
> 
> From: Martin Schwidefsky <schwidefsky@de.ibm.com>
> 
> Adapt the stand-alone s390 mmu_gather implementation to the new
> preemptible mmu_gather interface.
> 
> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> 


Thanks a lot Martin!


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 05/13] mm: Make use of the anon_vma ref count
  2010-04-09  7:04   ` Christian Ehrhardt
@ 2010-04-09  9:57     ` Peter Zijlstra
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-09  9:57 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin

On Fri, 2010-04-09 at 09:04 +0200, Christian Ehrhardt wrote:
> Hi,
> 
> On Thu, Apr 08, 2010 at 09:17:42PM +0200, Peter Zijlstra wrote:
> > @@ -302,23 +307,33 @@ again:
> >  		goto out;
> >  
> >  	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> > -	spin_lock(&anon_vma->lock);
> > +	if (!atomic_inc_not_zero(&anon_vma->ref))
> > +		anon_vma = NULL;
> >  
> >  	if (page_rmapping(page) != anon_vma) {
> > -		spin_unlock(&anon_vma->lock);
> > +		anon_vma_put(anon_vma);
> >  		goto again;
> >  	}
> 
> AFAICS anon_vma_put might be called with anon_vma == NULL here which
> will oops on the ref count. Not sure if
> 
>      page_rmapping(page) == anon_vma == NULL 
> 
> is possible, too.

Gah, you're right, thanks!


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 04/13] mm: Move anon_vma ref out from under CONFIG_KSM
  2010-04-08 19:17   ` Peter Zijlstra
  (?)
@ 2010-04-09 12:35   ` Rik van Riel
  -1 siblings, 0 replies; 113+ messages in thread
From: Rik van Riel @ 2010-04-09 12:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Ingo Molnar, akpm,
	Linus Torvalds, linux-kernel, linux-arch, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin

On 04/08/2010 03:17 PM, Peter Zijlstra wrote:
> We need an anon_vma refcount for preemptible anon_vma->lock as well
> as memory compaction, so move it out into generic code.
>
> Signed-off-by: Peter Zijlstra<a.p.zijlstra@chello.nl>

Reviewed-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma()
  2010-04-08 19:17   ` Peter Zijlstra
                     ` (4 preceding siblings ...)
  (?)
@ 2010-04-09 12:57   ` Peter Zijlstra
  -1 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-09 12:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm,
	Linus Torvalds, linux-kernel, linux-arch, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin

On Thu, 2010-04-08 at 21:17 +0200, Peter Zijlstra wrote:

> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c
> @@ -294,6 +294,7 @@ struct anon_vma *page_lock_anon_vma(stru
>  	unsigned long anon_mapping;
>  
>  	rcu_read_lock();
> +again:
>  	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
>  	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
>  		goto out;
> @@ -302,6 +303,12 @@ struct anon_vma *page_lock_anon_vma(stru
>  
>  	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
>  	spin_lock(&anon_vma->lock);
> +
> +	if (page_rmapping(page) != anon_vma) {
> +		spin_unlock(&anon_vma->lock);
> +		goto again;
> +	}
> +
>  	return anon_vma;
>  out:
>  	rcu_read_unlock();



OK, so I'm not quite sure about this anymore... this locking is quite,
umh, interesting.

We have:

 - page_remove_rmap() - which decrements page->_mapcount but does not
clear page->mapping. Even though both it and page_add_anon_rmap()
require pte_lock, there is no guarantee these are the same locks, so
these can indeed race.

So there is indeed a possibility for page->mapping to change, but
holding anon_vma->lock won't make a difference.

 - anon_vma->lock - this does protect vma_anon_link/unlink-like things
and is held during rmap walks; it only protects the list.

 - SLAB_DESTROY_BY_RCU - only guarantees we can deref anon_vma, not that
we'll find the one we were looking for.

 - we generally unmap/zap before unlink/free - so this should not see
pages race with anon_vma freeing (except when someone has a page ref),
however on anon_vma merges we seem to simply unlink the next one,
without updating potential pages referencing it?

 - KSM seems to not do anything that swap doesn't also do (but I didn't
look too closely).

 - page migration seems terribly broken..


As it stands I don't see a reliable way to obtain an anon_vma;
vma_address() + page_check_address() checks are indeed enough, but not
everybody seems to do that.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* mlock and pageout race?
  2010-04-09  8:17               ` KOSAKI Motohiro
@ 2010-04-09 14:41                 ` Minchan Kim
  0 siblings, 0 replies; 113+ messages in thread
From: Minchan Kim @ 2010-04-09 14:41 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Nick Piggin, Peter Zijlstra, Andrea Arcangeli,
	Avi Kivity, Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm,
	Linus Torvalds, linux-kernel, linux-arch, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman

Hi, Kosaki. 

I don't want to add noise with an off-topic discussion,
so I opened a new thread.

On Fri, 2010-04-09 at 17:17 +0900, KOSAKI Motohiro wrote:
> Hi Minchan,
> 
> > OFF-TOPIC:
> > 
> > I think you pointed out good thing, too. :)
> > 
> > You mean that although an application calls mlock() on a vma, a few pages in
> > the vma can be swapped out due to a race between mlock and reclaim?
> > 
> > Although it's not a disaster, it apparently breaks the API.
> > Man page
> > " mlock() and munlock()
> >   mlock()  locks pages in the address range starting at addr and
> > continuing for len bytes. All pages that contain a part of the
> > specified address range are guaranteed to be resident in RAM when the
> > call returns  successfully;  the pages are guaranteed to stay in RAM
> > until later unlocked."
> > 
> > Do you have a plan to solve such problem?
> > 
> > And how about adding simple comment about that race in page_referenced_one?
> > Could you send the patch?
> 
> I'm surprised by this mail. You were pushing many patches in this area.
> I believed you knew all this stuff ;)

If I disappointed you, sorry for that.
Still, there are many things for me to study. :)

> 
> My answer is, it doesn't need to be fixed, because it's not a bug. The point
> is that this is a race issue, not a "pageout after mlock" issue.
> If pageout and mlock occur at exactly the same time, a human can't
> observe which event occurred first; it's not an API violation.


If it can happen, it's obviously an API violation, I think.

int main()
{
	mlock(any vma, CURRENT|FUTURE);
	system("cat /proc/self/smaps | grep "any vma");
	..
}
result : 

08884000-088a5000 rw-p 00000000 00:00 0          [any vma]
Size:                  4 kB
Rss:                   4 kB
...
Swap:                  4 kB
...

Apparently, the user expected that "if I call mlock, all pages of the vma
are in RAM". But the result makes him embarrassed :(

Side note:
Of course, mlock's semantics are rather different from smaps's Swap field.
mlock only makes sure the pages are in RAM after the mlock call succeeds;
it is not related to smaps's swap entries.
Actually, smaps's Swap field cannot be compared directly to mlock's semantics.
A page may not be on the swap device yet but in the swap cache, while all
PTEs of the page already hold swap entries (i.e., it is fully unmapped). In
such a case smaps's Swap field counts it as swapped, but by mlock's
semantics it is still in RAM, so that's okay.

I looked at the code in more detail.
Fortunately, the situation you described ("page_referenced() can already
see an unstable VM_LOCKED value, so in the worst case we make a false
positive pageout, but it's not a disaster") cannot happen, I think.

1) 
mlock_fixup				shrink_page_list

					lock_page
					try_to_unmap

vma->vm_flags = VM_LOCKED
pte_lock				
pte_present test
get_page
pte_unlock
					pte_lock
					VM_LOCKED test fail
					pte_unlock
					never pageout
So, no problem. 

2) 
mlock_fixup				shrink_page_list

					lock_page
					try_to_unmap
					pte_lock
					VM_LOCKED test pass
vma->vm_flags = VM_LOCKED		make pte to swap entry
pte_lock				pte_unlock
pte_present test fail
pte_unlock
					pageout
swapin by handle_mm_fault     

So, no problem. 

3)
mlock_fixup				shrink_page_list

					lock_page
					try_to_unmap
					pte_lock
					VM_LOCKED test pass
vma->vm_flags = VM_LOCKED		make pte to swap entry
pte_lock				pte_unlock
pte_present test fail
pte_unlock
cachehit in swapcache by handle_mm_fault 
					pageout
					is_page_cache_freeable fail
So, no problem, too. 

I can't think of the race situation you mentioned.
When does a 'false positive pageout' happen?
Could you elaborate on it?


-- 
Kind regards,
Minchan Kim



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 10/13] lockdep, mutex: Provide mutex_lock_nest_lock
  2010-04-08 19:17   ` Peter Zijlstra
  (?)
@ 2010-04-09 15:36   ` Rik van Riel
  -1 siblings, 0 replies; 113+ messages in thread
From: Rik van Riel @ 2010-04-09 15:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Ingo Molnar, akpm,
	Linus Torvalds, linux-kernel, linux-arch, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin

On 04/08/2010 03:17 PM, Peter Zijlstra wrote:
> Provide the mutex_lock_nest_lock() annotation.
>
> Signed-off-by: Peter Zijlstra<a.p.zijlstra@chello.nl>

Reviewed-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 11/13] mutex: Provide mutex_is_contended
  2010-04-08 19:17   ` Peter Zijlstra
  (?)
@ 2010-04-09 15:37   ` Rik van Riel
  -1 siblings, 0 replies; 113+ messages in thread
From: Rik van Riel @ 2010-04-09 15:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Ingo Molnar, akpm,
	Linus Torvalds, linux-kernel, linux-arch, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin

On 04/08/2010 03:17 PM, Peter Zijlstra wrote:
> Usable for lock-breaks and such.
>
> Signed-off-by: Peter Zijlstra<a.p.zijlstra@chello.nl>

Reviewed-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 13/13] mm: Optimize page_lock_anon_vma
  2010-04-09  8:35     ` Peter Zijlstra
@ 2010-04-09 19:22       ` Paul E. McKenney
  0 siblings, 0 replies; 113+ messages in thread
From: Paul E. McKenney @ 2010-04-09 19:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin

On Fri, Apr 09, 2010 at 10:35:29AM +0200, Peter Zijlstra wrote:
> On Thu, 2010-04-08 at 15:18 -0700, Paul E. McKenney wrote:
> > On Thu, Apr 08, 2010 at 09:17:50PM +0200, Peter Zijlstra wrote:
> > > Optimize page_lock_anon_vma() by removing the atomic ref count
> > > ops from the fast path.
> > > 
> > > Rather complicates the code a lot, but might be worth it.
> > 
> > Some questions and a disclaimer below.
> > 
> > > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > > ---
> > >  mm/rmap.c |   71 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
> > >  1 file changed, 67 insertions(+), 4 deletions(-)
> > > 
> > > Index: linux-2.6/mm/rmap.c
> > > ===================================================================
> > > --- linux-2.6.orig/mm/rmap.c
> > > +++ linux-2.6/mm/rmap.c
> > > @@ -78,6 +78,12 @@ static inline struct anon_vma *anon_vma_
> > >  void anon_vma_free(struct anon_vma *anon_vma)
> > >  {
> > >  	VM_BUG_ON(atomic_read(&anon_vma->ref));
> > > +	/*
> > > +	 * Sync against the anon_vma->lock, so that we can hold the
> > > +	 * lock without requiring a reference. See page_lock_anon_vma().
> > > +	 */
> > > +	mutex_lock(&anon_vma->lock);
> > 
> > On some systems, the CPU is permitted to pull references into the critical
> > section from either side.  So, do we also need an smp_mb() here?
> > 
> > > +	mutex_unlock(&anon_vma->lock);
> > 
> > So, a question...
> > 
> > Can the above mutex be contended?  If yes, what happens when the
> > competing mutex_lock() acquires the lock at this point?  Or, worse yet,
> > after the kmem_cache_free()?
> > 
> > If no, what do we accomplish by acquiring the lock?
> 
> The thing we gain is that when the holder of the lock finds a !0
> refcount it knows it can't go away because any free will first wait to
> acquire the lock.

OK.  Here is the sequence of events that I am concerned about:

1.	CPU 0 invokes page_lock_anon_vma() [13/13], and executes the
	assignment to anon_vma.  It has not yet attempted to
	acquire the anon_vma->lock mutex.

2.	CPU 1 invokes page_unlock_anon_vma() [13/13], which in turn
	calls anon_vma_put() [5/13], which atomically decrements
	->ref, finds it zero, invokes anon_vma_free() [13/13], which 
	finds ->ref still zero, so acquires ->lock and immediately
	releases it, and then calls kmem_cache_free().

3.	This kmem_cache does have SLAB_DESTROY_BY_RCU, so this
	anon_vma structure will remain an anon_vma for as long as
	CPU 0 remains in its RCU read-side critical section.

4.	CPU 2 allocates an anon_vma, and gets the one that
	CPU 1 just freed.  It initializes it and makes ->ref
	non-zero.

5.	CPU 0 continues executing page_lock_anon_vma(), and therefore
	invokes mutex_trylock() on a now-reused struct anon_vma.
	It finds ->ref nonzero, so increments it and continues using
	it, despite its having been reallocated, possibly to some
	other process.

Or am I missing a step?  (Extremely possible, as I am not as familiar
with this code as I might be.)

> > If the above mutex can be contended, can we fix by substituting
> > synchronize_rcu_expedited()?  Which will soon require some scalability
> > attention if it gets used here, but what else is new?  ;-)
> 
> No, synchronize_rcu_expedited() will not work here, there is no RCU read
> side that covers the full usage of the anon_vma (there can't be, it
> needs to sleep).

Got it, apologies for my confusion.

> > >  	kmem_cache_free(anon_vma_cachep, anon_vma);
> > >  }
> > > 
> > > @@ -291,7 +297,7 @@ void __init anon_vma_init(void)
> > > 
> > >  /*
> > >   * Getting a lock on a stable anon_vma from a page off the LRU is
> > > - * tricky: page_lock_anon_vma relies on RCU to guard against the races.
> > > + * tricky: anon_vma_get relies on RCU to guard against the races.
> > >   */
> > >  struct anon_vma *anon_vma_get(struct page *page)
> > >  {
> > > @@ -320,12 +326,70 @@ out:
> > >  	return anon_vma;
> > >  }
> > > 
> > > +/*
> > > + * Similar to anon_vma_get(), however it relies on the anon_vma->lock
> > > + * to pin the object. However since we cannot wait for the mutex
> > > + * acquisition inside the RCU read lock, we use the ref count
> > > + * in the slow path.
> > > + */
> > >  struct anon_vma *page_lock_anon_vma(struct page *page)
> > >  {
> > > -	struct anon_vma *anon_vma = anon_vma_get(page);
> > > +	struct anon_vma *anon_vma = NULL;
> > > +	unsigned long anon_mapping;
> > > +
> > > +again:
> > > +	rcu_read_lock();
> > 
> > This is interesting.  You have an RCU read-side critical section with
> > no rcu_dereference().
> > 
> > This strange state of affairs is actually legal (assuming that
> > anon_mapping is the RCU-protected structure) because all dereferences
> > of the anon_vma variable are atomic operations that guarantee ordering
> > (the mutex_trylock() and the atomic_inc_not_zero()).
> > 
> > The other dereferences (the atomic_read()s) are under the lock, so
> > are also OK assuming that the lock is held when initializing and
> > updating these fields, and even more OK due to the smp_rmb() below.
> > 
> > But see below.
> 
> Right, so the only thing rcu_read_lock() does here is create the
> guarantee that anon_vma is safe to dereference (it lives on a
> SLAB_DESTROY_BY_RCU slab).
> 
> But yes, I suppose that the page->mapping read that now uses ACCESS_ONCE()
> would actually want to be an rcu_dereference(), since that provides both
> the ACCESS_ONCE() and the read-depend barrier that I think would be
> needed.

Ah, I was getting the wrong access.  Now that I see it, yes, this is
tied to the access of page->mapping that is assigned to anon_mapping.

> > > +	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> > > +	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> > > +		goto unlock;
> > > +	if (!page_mapped(page))
> > > +		goto unlock;
> > > +
> > > +	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> > > +	if (!mutex_trylock(&anon_vma->lock)) {
> > > +		/*
> > > +		 * We failed to acquire the lock, take a ref so we can
> > > +		 * drop the RCU read lock and sleep on it.
> > > +		 */
> > > +		if (!atomic_inc_not_zero(&anon_vma->ref)) {
> > > +			/*
> > > +			 * Failed to get a ref, we're dead, bail.
> > > +			 */
> > > +			anon_vma = NULL;
> > > +			goto unlock;
> > > +		}
> > > +		rcu_read_unlock();
> > > 
> > > -	if (anon_vma)
> > >  		mutex_lock(&anon_vma->lock);
> > > +		/*
> > > +		 * We got the lock, drop the temp. ref, if it was the last
> > > +		 * one free it and bail.
> > > +		 */
> > > +		if (atomic_dec_and_test(&anon_vma->ref)) {
> > > +			mutex_unlock(&anon_vma->lock);
> > > +			anon_vma_free(anon_vma);
> > > +			anon_vma = NULL;
> > > +		}
> > > +		goto out;
> > > +	}
> > > +	/*
> > > +	 * Got the lock, check we're still alive. Seeing a ref
> > > +	 * here guarantees the object will stay alive due to
> > > +	 * anon_vma_free() syncing against the lock we now hold.
> > > +	 */
> > > +	smp_rmb(); /* Order against anon_vma_put() */
> > 
> > This is ordering the fetch into anon_vma against the atomic_read() below?
> > If so, smp_read_barrier_depends() will cover it more cheaply.  Alternatively,
> > use rcu_dereference() when fetching into anon_vma.
> > 
> > Or am I misunderstanding the purpose of this barrier?
> 
> Yes, it is:
> 
>   atomic_dec_and_test(&anon_vma->ref) /* implies mb */
> 
> 				smp_rmb();
> 				atomic_read(&anon_vma->ref);
> 
> > (Disclaimer: I have not yet found anon_vma_put(), so I am assuming that
> > anon_vma_free() plays the role of a grace period.)
> 
> Yes, that lives in one of the other patches (does not exist in
> mainline).

Thank you -- and yes, I should have thought to search the patch set.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 06/13] mm: Preemptible mmu_gather
  2010-04-09  3:25   ` Nick Piggin
  2010-04-09  8:18     ` Peter Zijlstra
@ 2010-04-09 20:36     ` Peter Zijlstra
  2010-04-19 19:16       ` Peter Zijlstra
  1 sibling, 1 reply; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-09 20:36 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman

On Fri, 2010-04-09 at 13:25 +1000, Nick Piggin wrote:
> Have you done some profiling on this? What I would like to see, if
> it's not too much complexity, is to have a small set of pages to
> handle common size frees, and then use them up first by default
> before attempting to allocate more.
> 
> Also, it would be cool to be able to chain allocations to avoid
> TLB flushes even on big frees (overridable by arch of course, in
> case they're doing some non-preeemptible work or you wish to break
> up lock hold times). But that might be just getting over engineered.
> 
Measuring ITLB_FLUSH on Intel nehalem using:

perf stat -a -e r01ae make O=defconfig-build/ -j48 bzImage

-linus      5825850 +- 2545  (100%)
 +patches   5891341 +- 6045  (101%)
 +below     5783991 +- 4725  ( 99%)

(No slab allocations yet)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/asm-generic/tlb.h |  122 ++++++++++++++++++++++++++++++----------------
 1 file changed, 82 insertions(+), 40 deletions(-)

Index: linux-2.6/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.orig/include/asm-generic/tlb.h
+++ linux-2.6/include/asm-generic/tlb.h
@@ -17,16 +17,6 @@
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 
-/*
- * For UP we don't need to worry about TLB flush
- * and page free order so much..
- */
-#ifdef CONFIG_SMP
-  #define tlb_fast_mode(tlb) ((tlb)->nr == ~0U)
-#else
-  #define tlb_fast_mode(tlb) 1
-#endif
-
 #ifdef HAVE_ARCH_RCU_TABLE_FREE
 /*
  * Semi RCU freeing of the page directories.
@@ -70,31 +60,66 @@ extern void tlb_remove_table(struct mmu_
 
 #endif
 
+struct mmu_gather_batch {
+	struct mmu_gather_batch	*next;
+	unsigned int		nr;
+	unsigned int		max;
+	struct page		*pages[0];
+};
+
+#define MAX_GATHER_BATCH	\
+	((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(unsigned long))
+
 /* struct mmu_gather is an opaque type used by the mm code for passing around
  * any data needed by arch specific code for tlb_remove_page.
  */
 struct mmu_gather {
 	struct mm_struct	*mm;
-	unsigned int		nr;	/* set to ~0U means fast mode */
-	unsigned int		max;	/* nr < max */
-	unsigned int		need_flush;/* Really unmapped some ptes? */
-	unsigned int		fullmm; /* non-zero means full mm flush */
-	struct page		**pages;
-	struct page		*local[8];
+	unsigned int		need_flush : 1,	/* Did free PTEs */
+				fast_mode  : 1; /* No batching   */
+	unsigned int		fullmm;		/* Flush full mm */
+
+	struct mmu_gather_batch *active;
+	struct mmu_gather_batch	local;
+	struct page		*__pages[8];
 
 #ifdef HAVE_ARCH_RCU_TABLE_FREE
 	struct mmu_table_batch	*batch;
 #endif
 };
 
-static inline void __tlb_alloc_pages(struct mmu_gather *tlb)
+/*
+ * For UP we don't need to worry about TLB flush
+ * and page free order so much..
+ */
+#ifdef CONFIG_SMP
+  #define tlb_fast_mode(tlb) (tlb->fast_mode)
+#else
+  #define tlb_fast_mode(tlb) 1
+#endif
+
+static inline int tlb_next_batch(struct mmu_gather *tlb)
 {
-	unsigned long addr = __get_free_pages(GFP_ATOMIC, 0);
+	struct mmu_gather_batch *batch;
 
-	if (addr) {
-		tlb->pages = (void *)addr;
-		tlb->max = PAGE_SIZE / sizeof(struct page *);
+	batch = tlb->active;
+	if (batch->next) {
+		tlb->active = batch->next;
+		return 1;
 	}
+
+	batch = (void *)__get_free_pages(GFP_ATOMIC, 0);
+	if (!batch)
+		return 0;
+
+	batch->next = NULL;
+	batch->nr   = 0;
+	batch->max  = MAX_GATHER_BATCH;
+
+	tlb->active->next = batch;
+	tlb->active = batch;
+
+	return 1;
 }
 
 /* tlb_gather_mmu
@@ -105,17 +130,16 @@ tlb_gather_mmu(struct mmu_gather *tlb, s
 {
 	tlb->mm = mm;
 
-	tlb->max = ARRAY_SIZE(tlb->local);
-	tlb->pages = tlb->local;
-
-	if (num_online_cpus() > 1) {
-		tlb->nr = 0;
-		__tlb_alloc_pages(tlb);
-	} else /* Use fast mode if only one CPU is online */
-		tlb->nr = ~0U;
-
+	tlb->need_flush = 0;
+	if (num_online_cpus() == 1)
+		tlb->fast_mode = 1;
 	tlb->fullmm = full_mm_flush;
 
+	tlb->local.next = NULL;
+	tlb->local.nr   = 0;
+	tlb->local.max  = ARRAY_SIZE(tlb->__pages);
+	tlb->active     = &tlb->local;
+
 #ifdef HAVE_ARCH_RCU_TABLE_FREE
 	tlb->batch = NULL;
 #endif
@@ -124,6 +148,8 @@ tlb_gather_mmu(struct mmu_gather *tlb, s
 static inline void
 tlb_flush_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 {
+	struct mmu_gather_batch *batch;
+
 	if (!tlb->need_flush)
 		return;
 	tlb->need_flush = 0;
@@ -131,12 +157,14 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
 #ifdef HAVE_ARCH_RCU_TABLE_FREE
 	tlb_table_flush(tlb);
 #endif
-	if (!tlb_fast_mode(tlb)) {
-		free_pages_and_swap_cache(tlb->pages, tlb->nr);
-		tlb->nr = 0;
-		if (tlb->pages == tlb->local)
-			__tlb_alloc_pages(tlb);
+	if (tlb_fast_mode(tlb))
+		return;
+
+	for (batch = &tlb->local; batch; batch = batch->next) {
+		free_pages_and_swap_cache(batch->pages, batch->nr);
+		batch->nr = 0;
 	}
+	tlb->active = &tlb->local;
 }
 
 /* tlb_finish_mmu
@@ -146,13 +174,18 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
 static inline void
 tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 {
+	struct mmu_gather_batch *batch, *next;
+
 	tlb_flush_mmu(tlb, start, end);
 
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
 
-	if (tlb->pages != tlb->local)
-		free_pages((unsigned long)tlb->pages, 0);
+	for (batch = tlb->local.next; batch; batch = next) {
+		next = batch->next;
+		free_pages((unsigned long)batch, 0);
+	}
+	tlb->local.next = NULL;
 }
 
 /* tlb_remove_page
@@ -162,14 +195,23 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
  */
 static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 {
+	struct mmu_gather_batch *batch;
+
 	tlb->need_flush = 1;
+
 	if (tlb_fast_mode(tlb)) {
 		free_page_and_swap_cache(page);
 		return;
 	}
-	tlb->pages[tlb->nr++] = page;
-	if (tlb->nr >= tlb->max)
-		tlb_flush_mmu(tlb, 0, 0);
+
+	batch = tlb->active;
+	if (batch->nr == batch->max) {
+		if (!tlb_next_batch(tlb))
+			tlb_flush_mmu(tlb, 0, 0);
+		batch = tlb->active;
+	}
+
+	batch->pages[batch->nr++] = page;
 }
 
 /**
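
For context, a caller of this reworked interface would look roughly like
the following (a sketch against the declarations above, not code from the
patch; example_next_page() is a made-up stand-in for the real page-table
walk):

/* Illustrative caller of the batched mmu_gather above. */
static void example_unmap_range(struct mm_struct *mm,
				unsigned long start, unsigned long end)
{
	struct mmu_gather tlb;		/* caller-provided, may live on the stack */
	struct page *page;

	tlb_gather_mmu(&tlb, mm, 0);	/* 0: not a full-mm teardown */

	/*
	 * Walk the page tables, clear PTEs and hand every freed page to
	 * the gather; tlb_remove_page() grows a new batch page or, if
	 * that allocation fails, flushes what has been collected so far.
	 */
	while ((page = example_next_page(mm, &start, end)) != NULL)
		tlb_remove_page(&tlb, page);

	tlb_finish_mmu(&tlb, start, end); /* final TLB flush + page frees */
}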



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-08 19:17   ` Peter Zijlstra
                     ` (2 preceding siblings ...)
  (?)
@ 2010-04-13  1:05   ` Benjamin Herrenschmidt
  2010-04-13  3:43     ` Paul E. McKenney
  -1 siblings, 1 reply; 113+ messages in thread
From: Benjamin Herrenschmidt @ 2010-04-13  1:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Paul E. McKenney

On Thu, 2010-04-08 at 21:17 +0200, Peter Zijlstra wrote:
> plain text document attachment (powerpc-gup_fast-rcu.patch)
> The powerpc page table freeing relies on the fact that IRQs hold off
> an RCU grace period, this is currently true for all existing RCU
> implementations but is not an assumption Paul wants to support.
> 
> Therefore, also take the RCU read lock along with disabling IRQs to
> ensure the RCU grace period does at least cover these lookups.

There's a few other places that need a similar fix then. The hash page
code for example. All the C cases should end up calling the
find_linux_pte() helper afaik, so we should be able to stick the lock in
there (and the hugetlbfs variant, find_linux_pte_or_hugepte()).

However, we also have cases of tight asm code walking the page tables,
such as the tlb miss handler on embedded processors. I don't see how I
could do that there. IE. I only have a handful of registers to play
with, no stack, etc...

So we might have to support the interrupt assumption, at least in some
form, with those guys...

Cheers,
Ben. 

> Requested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Nick Piggin <npiggin@suse.de>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> ---
>  arch/powerpc/mm/gup.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> Index: linux-2.6/arch/powerpc/mm/gup.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/gup.c
> +++ linux-2.6/arch/powerpc/mm/gup.c
> @@ -142,6 +142,7 @@ int get_user_pages_fast(unsigned long st
>  	 * So long as we atomically load page table pointers versus teardown,
>  	 * we can follow the address down to the the page and take a ref on it.
>  	 */
> +	rcu_read_lock();
>  	local_irq_disable();
>  
>  	pgdp = pgd_offset(mm, addr);
> @@ -162,6 +163,7 @@ int get_user_pages_fast(unsigned long st
>  	} while (pgdp++, addr = next, addr != end);
>  
>  	local_irq_enable();
> +	rcu_read_unlock();
>  
>  	VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
>  	return nr;
> @@ -171,6 +173,7 @@ int get_user_pages_fast(unsigned long st
>  
>  slow:
>  		local_irq_enable();
> +		rcu_read_unlock();
>  slow_irqon:
>  		pr_devel("  slow path ! nr = %d\n", nr);
>  
> 



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 07/13] powerpc: Preemptible mmu_gather
  2010-04-08 19:17   ` Peter Zijlstra
  (?)
  (?)
@ 2010-04-13  1:23   ` Benjamin Herrenschmidt
  2010-04-13 10:22     ` Peter Zijlstra
                       ` (2 more replies)
  -1 siblings, 3 replies; 113+ messages in thread
From: Benjamin Herrenschmidt @ 2010-04-13  1:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin

On Thu, 2010-04-08 at 21:17 +0200, Peter Zijlstra wrote:

 .../...

>  static inline void arch_leave_lazy_mmu_mode(void)
>  {
> -	struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);
> +	struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);
> +
> +	if (batch->active) {
> +		if (batch->index)
> +			__flush_tlb_pending(batch);
> +		batch->active = 0;
> +	}

Can index be > 0 if active == 0? I thought not, which means you don't
need to add a test on active, do you?

I'm also pondering whether we should just stick something in the
task/thread struct and avoid that whole per-cpu manipulation including
the stuff in process.c in fact.

Heh, maybe it's time to introduce thread vars ? :-)
 
> -	if (batch->index)
> -		__flush_tlb_pending(batch);
> -	batch->active = 0;
> +	put_cpu_var(ppc64_tlb_batch);
>  }
>  
>  #define arch_flush_lazy_mmu_mode()      do {} while (0)
> Index: linux-2.6/arch/powerpc/kernel/process.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/kernel/process.c
> +++ linux-2.6/arch/powerpc/kernel/process.c
> @@ -389,6 +389,9 @@ struct task_struct *__switch_to(struct t
>  	struct thread_struct *new_thread, *old_thread;
>  	unsigned long flags;
>  	struct task_struct *last;
> +#ifdef CONFIG_PPC64
> +	struct ppc64_tlb_batch *batch;
> +#endif

>  #ifdef CONFIG_SMP
>  	/* avoid complexity of lazy save/restore of fpu
> @@ -479,6 +482,14 @@ struct task_struct *__switch_to(struct t
>  		old_thread->accum_tb += (current_tb - start_tb);
>  		new_thread->start_tb = current_tb;
>  	}
> +
> +	batch = &__get_cpu_var(ppc64_tlb_batch);
> +	if (batch->active) {
> +		set_ti_thread_flag(task_thread_info(prev), TIF_LAZY_MMU);
> +		if (batch->index)
> +			__flush_tlb_pending(batch);
> +		batch->active = 0;
> +	}
>  #endif

Use ti->local_flags so you can do it non-atomically, which is a lot
cheaper.

>  	local_irq_save(flags);
> @@ -495,6 +506,13 @@ struct task_struct *__switch_to(struct t
>  	hard_irq_disable();
>  	last = _switch(old_thread, new_thread);
>  
> +#ifdef CONFIG_PPC64
> +	if (test_and_clear_ti_thread_flag(task_thread_info(new), TIF_LAZY_MMU)) {
> +		batch = &__get_cpu_var(ppc64_tlb_batch);
> +		batch->active = 1;
> +	}
> +#endif
> +
>  	local_irq_restore(flags);
>  
>  	return last;
> Index: linux-2.6/arch/powerpc/mm/pgtable.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/pgtable.c
> +++ linux-2.6/arch/powerpc/mm/pgtable.c
> @@ -33,8 +33,6 @@
>  
>  #include "mmu_decl.h"
>  
> -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
> -
>  #ifdef CONFIG_SMP
>  
>  /*
> @@ -43,7 +41,6 @@ DEFINE_PER_CPU(struct mmu_gather, mmu_ga
>   * freeing a page table page that is being walked without locks
>   */
>  
> -static DEFINE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur);
>  static unsigned long pte_freelist_forced_free;
>  
>  struct pte_freelist_batch
> @@ -98,12 +95,30 @@ static void pte_free_submit(struct pte_f
>  
>  void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
>  {
> -	/* This is safe since tlb_gather_mmu has disabled preemption */
> -	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
> +	struct pte_freelist_batch **batchp = &tlb->arch.batch;
>  	unsigned long pgf;
>  
> -	if (atomic_read(&tlb->mm->mm_users) < 2 ||
> -	    cpumask_equal(mm_cpumask(tlb->mm), cpumask_of(smp_processor_id()))){
> +	/*
> +	 * A comment here about on why we have RCU freed page tables might be
> +	 * interesting, also explaining why we don't need any sort of grace
> +	 * period for mm_users == 1, and have some home brewn smp_call_func()
> +	 * for single frees.

iirc, we are synchronizing with CPUs walking page tables in their hash
or TLB miss code, which is lockless. The mm_users test is a -little- bit
dubious indeed. It may have to be mm_users < 2 && mm ==
current->active_mm, ie, we know for sure nobody else is currently
walking those page tables ... 

Though even that is fishy nowadays. We -can- walk page tables on behalf of
another process. In fact, we do it in the Cell SPU code for faulting
page table entries as a result of SPEs taking faults, for example. So I'm
starting to suspect that this mm_users optimisation is bogus.

We -do- want to optimize out the case where there is no user left
though, i.e., the MM is dead: the typical exit case.

> +	 *
> +	 * The only lockless page table walker I know of is gup_fast() which
> +	 * relies on irq_disable(). So my guess is that mm_users == 1 means
> +	 * that there cannot be another thread and so precludes gup_fast()
> +	 * concurrency.

Which is fishy as I said above.

> +	 * If there are, but we fail to batch, we need to IPI (all?) CPUs so as
> +	 * to serialize against the IRQ disable. In case we do batch, the RCU
> +	 * grace period is at least long enough to cover IRQ disabled sections
> +	 * (XXX assumption, not strictly true).

Yeah well... I'm not that familiar with RCU anymore. Back when I wrote
that, iirc, I would more or less be safe against a CPU that doesn't
schedule, but things may well have changed.

We are trying to be safe against another CPU walking page tables in the
asm lockless hash miss or TLB miss code. Note that sparc64 has a similar
issue. This is highly optimized asm code that -cannot- call into things
like rcu_read_lock().

> +	 * All this results in us doing our own free batching and not using
> +	 * the generic mmu_gather batches (XXX fix that somehow?).
> +	 */

When this was written, generic batch only dealt with the actual pages,
not the page tables, and as you may have noticed, our page tables on
powerpc aren't struct page*, they come from dedicated slab caches and
are of variable sizes.

Cheers,
Ben.

> +	if (atomic_read(&tlb->mm->mm_users) < 2) {
>  		pgtable_free(table, shift);
>  		return;
>  	}
> @@ -125,10 +140,9 @@ void pgtable_free_tlb(struct mmu_gather 
>  	}
>  }
>  
> -void pte_free_finish(void)
> +void pte_free_finish(struct mmu_gather *tlb)
>  {
> -	/* This is safe since tlb_gather_mmu has disabled preemption */
> -	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
> +	struct pte_freelist_batch **batchp = &tlb->arch.batch;
>  
>  	if (*batchp == NULL)
>  		return;
> Index: linux-2.6/arch/powerpc/mm/tlb_hash64.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/tlb_hash64.c
> +++ linux-2.6/arch/powerpc/mm/tlb_hash64.c
> @@ -38,13 +38,11 @@ DEFINE_PER_CPU(struct ppc64_tlb_batch, p
>   * neesd to be flushed. This function will either perform the flush
>   * immediately or will batch it up if the current CPU has an active
>   * batch on it.
> - *
> - * Must be called from within some kind of spinlock/non-preempt region...
>   */
>  void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
>  		     pte_t *ptep, unsigned long pte, int huge)
>  {
> -	struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);
> +	struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);
>  	unsigned long vsid, vaddr;
>  	unsigned int psize;
>  	int ssize;
> @@ -99,6 +97,7 @@ void hpte_need_flush(struct mm_struct *m
>  	 */
>  	if (!batch->active) {
>  		flush_hash_page(vaddr, rpte, psize, ssize, 0);
> +		put_cpu_var(ppc64_tlb_batch);
>  		return;
>  	}
>  
> @@ -127,6 +126,7 @@ void hpte_need_flush(struct mm_struct *m
>  	batch->index = ++i;
>  	if (i >= PPC64_TLB_BATCH_NR)
>  		__flush_tlb_pending(batch);
> +	put_cpu_var(ppc64_tlb_batch);
>  }
>  
>  /*
> @@ -155,7 +155,7 @@ void __flush_tlb_pending(struct ppc64_tl
>  
>  void tlb_flush(struct mmu_gather *tlb)
>  {
> -	struct ppc64_tlb_batch *tlbbatch = &__get_cpu_var(ppc64_tlb_batch);
> +	struct ppc64_tlb_batch *tlbbatch = &get_cpu_var(ppc64_tlb_batch);
>  
>  	/* If there's a TLB batch pending, then we must flush it because the
>  	 * pages are going to be freed and we really don't want to have a CPU
> @@ -164,8 +164,10 @@ void tlb_flush(struct mmu_gather *tlb)
>  	if (tlbbatch->index)
>  		__flush_tlb_pending(tlbbatch);
>  
> +	put_cpu_var(ppc64_tlb_batch);
> +
>  	/* Push out batch of freed page tables */
> -	pte_free_finish();
> +	pte_free_finish(tlb);
>  }
>  
>  /**
> Index: linux-2.6/arch/powerpc/include/asm/thread_info.h
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/include/asm/thread_info.h
> +++ linux-2.6/arch/powerpc/include/asm/thread_info.h
> @@ -111,6 +111,7 @@ static inline struct thread_info *curren
>  #define TIF_NOTIFY_RESUME	13	/* callback before returning to user */
>  #define TIF_FREEZE		14	/* Freezing for suspend */
>  #define TIF_RUNLATCH		15	/* Is the runlatch enabled? */
> +#define TIF_LAZY_MMU		16	/* tlb_batch is active */
>  
>  /* as above, but as bit values */
>  #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
> @@ -128,6 +129,7 @@ static inline struct thread_info *curren
>  #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
>  #define _TIF_FREEZE		(1<<TIF_FREEZE)
>  #define _TIF_RUNLATCH		(1<<TIF_RUNLATCH)
> +#define _TIF_LAZY_MMU		(1<<TIF_LAZY_MMU)
>  #define _TIF_SYSCALL_T_OR_A	(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP)
>  
>  #define _TIF_USER_WORK_MASK	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
> Index: linux-2.6/arch/powerpc/include/asm/pgalloc.h
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/include/asm/pgalloc.h
> +++ linux-2.6/arch/powerpc/include/asm/pgalloc.h
> @@ -32,13 +32,13 @@ static inline void pte_free(struct mm_st
>  
>  #ifdef CONFIG_SMP
>  extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift);
> -extern void pte_free_finish(void);
> +extern void pte_free_finish(struct mmu_gather *tlb);
>  #else /* CONFIG_SMP */
>  static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
>  {
>  	pgtable_free(table, shift);
>  }
> -static inline void pte_free_finish(void) { }
> +static inline void pte_free_finish(struct mmu_gather *tlb) { }
>  #endif /* !CONFIG_SMP */
>  
>  static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
> Index: linux-2.6/arch/powerpc/mm/tlb_hash32.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/tlb_hash32.c
> +++ linux-2.6/arch/powerpc/mm/tlb_hash32.c
> @@ -73,7 +73,7 @@ void tlb_flush(struct mmu_gather *tlb)
>  	}
>  
>  	/* Push out batch of freed page tables */
> -	pte_free_finish();
> +	pte_free_finish(tlb);
>  }
>  
>  /*
> Index: linux-2.6/arch/powerpc/mm/tlb_nohash.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/tlb_nohash.c
> +++ linux-2.6/arch/powerpc/mm/tlb_nohash.c
> @@ -298,7 +298,7 @@ void tlb_flush(struct mmu_gather *tlb)
>  	flush_tlb_mm(tlb->mm);
>  
>  	/* Push out batch of freed page tables */
> -	pte_free_finish();
> +	pte_free_finish(tlb);
>  }
>  
>  /*
> 



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 07/13] powerpc: Preemptible mmu_gather
  2010-04-09  4:07   ` Nick Piggin
  2010-04-09  8:14     ` Peter Zijlstra
@ 2010-04-13  1:56     ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 113+ messages in thread
From: Benjamin Herrenschmidt @ 2010-04-13  1:56 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrea Arcangeli, Avi Kivity, Thomas Gleixner,
	Rik van Riel, Ingo Molnar, akpm, Linus Torvalds, linux-kernel,
	linux-arch, David Miller, Hugh Dickins, Mel Gorman

On Fri, 2010-04-09 at 14:07 +1000, Nick Piggin wrote:
> > PPC has an extra batching queue to RCU free the actual pagetable
> > allocations, use the ARCH extentions for that for now.
> > 
> > For the ppc64_tlb_batch, which tracks the vaddrs to unhash from the
> > hardware hash-table, keep using per-cpu arrays but flush on context
> > switch and use a TIF bit to track the laxy_mmu state.
> 
> Hm. Pity powerpc can't just use tlb flush gathering for this batching,
> (which is what it was designed for). Then it could avoid these tricks.
> What's preventing this? Adding a tlb gather for COW case in
> copy_page_range? 

We must flush before the pte_lock is released. If not, we end up with
this funny situation:

	- PTE is read-only, hash contains a translation for it
	- PTE gets cleared & added to the batch, hash not flushed yet
	- PTE lock released, maybe even VMA fully removed
	- Other CPU takes a write fault, puts in a new PTE
	- Hash ends up with duplicates of the vaddr -> arch violation

Now we could get out of that one, I suppose, if we had some kind of way
to force flush any batch pertaining to a given mm before a new valid PTE
can be written, but that doesn't sound like such a trivial thing to do.

Any better idea ?

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 07/13] powerpc: Preemptible mmu_gather
  2010-04-09  8:14     ` Peter Zijlstra
  2010-04-09  8:46       ` Nick Piggin
@ 2010-04-13  2:06       ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 113+ messages in thread
From: Benjamin Herrenschmidt @ 2010-04-13  2:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nick Piggin, Andrea Arcangeli, Avi Kivity, Thomas Gleixner,
	Rik van Riel, Ingo Molnar, akpm, Linus Torvalds, linux-kernel,
	linux-arch, David Miller, Hugh Dickins, Mel Gorman

On Fri, 2010-04-09 at 10:14 +0200, Peter Zijlstra wrote:
> 
> and doing that vaddr collection right along with it in the same batch.
> 
> I think that that would work, Ben, Dave?

Well, ours aren't struct pages.

IE. There are fundamentally 3 things that we are trying to batch here :-)

 1- The original mmu_gather: batching the freeing of the actual user
pages, so that the TLB flush can be delayed/gathered, plus there might
be some micro-improvement in passing the page list to the allocator for
freeing all at once. This is thus purely a batch of struct pages.

 2- The batching of the TLB flushes (or hash invalidates in the ppc
case) proper, which needs the addition of the vaddr for things like
sparc and powerpc since we don't just invalidate the whole bloody thing
unlike x86 :-) On powerpc, we actually need more, we need the actual PTE
content since it also contains tracking information relative to where
things have been put in the hash table.

 3- The batching of the freeing of the page table structure, which we
want to delay more than batch, ie, the goal here is to delay that
freeing using RCU until everybody has stopped walking them. This does
rely on RCU grace period being "interrupt safe", ie, there's no
rcu_read_lock() in the low level TLB or hash miss code, but that code
runs with interrupts off.

Now, 2. has a problem I described earlier, which is that we must not
have the possibility of introducing a duplicate in the hash table, thus
it must not be possible to put a new PTE in until the previous one has
been flushed or bad things would happen. This is why powerpc doesn't use
the mmu_gather the way it was originally intended to do both 1. and 2.
but really only for 1., while for 2. we use a small batch that only
exists between lazy_mmu_enter/exit, since those are always fully enclosed
by a pte lock section.

3. As you have noticed, relies on the irq stuff. Plus there seems to be a
dubious optimization here with mm_users. Might be worth sorting that
out. However, it's a very different goal from 1. and 2. in the sense
that batching proper is a minor issue; what we want is synchronization
with walkers, and batching is a way to lower the cost of that
synchronization (allocation of the RCU struct, etc.).

Cheers,
Ben.
 


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-13  1:05   ` Benjamin Herrenschmidt
@ 2010-04-13  3:43     ` Paul E. McKenney
  2010-04-14 13:51       ` Peter Zijlstra
  2010-04-16  6:51       ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 113+ messages in thread
From: Paul E. McKenney @ 2010-04-13  3:43 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Peter Zijlstra, Andrea Arcangeli, Avi Kivity, Thomas Gleixner,
	Rik van Riel, Ingo Molnar, akpm, Linus Torvalds, linux-kernel,
	linux-arch, David Miller, Hugh Dickins, Mel Gorman, Nick Piggin

On Tue, Apr 13, 2010 at 11:05:31AM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2010-04-08 at 21:17 +0200, Peter Zijlstra wrote:
> > plain text document attachment (powerpc-gup_fast-rcu.patch)
> > The powerpc page table freeing relies on the fact that IRQs hold off
> > an RCU grace period, this is currently true for all existing RCU
> > implementations but is not an assumption Paul wants to support.
> > 
> > Therefore, also take the RCU read lock along with disabling IRQs to
> > ensure the RCU grace period does at least cover these lookups.
> 
> There's a few other places that need a similar fix then. The hash page
> code for example. All the C cases should end up calling the
> find_linux_pte() helper afaik, so we should be able to stick the lock in
> there (and the hugetlbfs variant, find_linux_pte_or_hugepte()).
> 
> However, we also have cases of tight asm code walking the page tables,
> such as the tlb miss handler on embedded processors. I don't see how I
> could do that there. IE. I only have a handful of registers to play
> with, no stack, etc...
> 
> So we might have to support the interrupt assumption, at least in some
> form, with those guys...

One way to make the interrupt assumption official is to use
synchronize_sched() rather than synchronize_rcu().

							Thanx, Paul

> Cheers,
> Ben. 
> 
> > Requested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Cc: Nick Piggin <npiggin@suse.de>
> > Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > ---
> >  arch/powerpc/mm/gup.c |    3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > Index: linux-2.6/arch/powerpc/mm/gup.c
> > ===================================================================
> > --- linux-2.6.orig/arch/powerpc/mm/gup.c
> > +++ linux-2.6/arch/powerpc/mm/gup.c
> > @@ -142,6 +142,7 @@ int get_user_pages_fast(unsigned long st
> >  	 * So long as we atomically load page table pointers versus teardown,
> >  	 * we can follow the address down to the the page and take a ref on it.
> >  	 */
> > +	rcu_read_lock();
> >  	local_irq_disable();
> >  
> >  	pgdp = pgd_offset(mm, addr);
> > @@ -162,6 +163,7 @@ int get_user_pages_fast(unsigned long st
> >  	} while (pgdp++, addr = next, addr != end);
> >  
> >  	local_irq_enable();
> > +	rcu_read_unlock();
> >  
> >  	VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
> >  	return nr;
> > @@ -171,6 +173,7 @@ int get_user_pages_fast(unsigned long st
> >  
> >  slow:
> >  		local_irq_enable();
> > +		rcu_read_unlock();
> >  slow_irqon:
> >  		pr_devel("  slow path ! nr = %d\n", nr);
> >  
> > 
> 
> 

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 07/13] powerpc: Preemptible mmu_gather
  2010-04-13  1:23   ` Benjamin Herrenschmidt
@ 2010-04-13 10:22     ` Peter Zijlstra
  2010-04-14 13:34     ` Peter Zijlstra
  2010-04-14 13:51     ` Peter Zijlstra
  2 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-13 10:22 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin

On Tue, 2010-04-13 at 11:23 +1000, Benjamin Herrenschmidt wrote:
> >  static inline void arch_leave_lazy_mmu_mode(void)
> >  {
> > -     struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);
> > +     struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);
> > +
> > +     if (batch->active) {
> > +             if (batch->index)
> > +                     __flush_tlb_pending(batch);
> > +             batch->active = 0;
> > +     }
> 
> Can index be > 0 if active == 0 ? I though not, which means you don't
> need to add a test on active, do you ?

True I guess, but like this we avoid a write, doesn't really matter I
suspect.

> I'm also pondering whether we should just stick something in the
> task/thread struct and avoid that whole per-cpu manipulation including
> the stuff in process.c in fact.

Can do, I can add batching similar to the generic code to the
thread_info thingy.

> Heh, maybe it's time to introduce thread vars ? :-) 

Heh, that seems like a real good way to waste a lot of memory fast ;-)

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 07/13] powerpc: Preemptible mmu_gather
  2010-04-13  1:23   ` Benjamin Herrenschmidt
  2010-04-13 10:22     ` Peter Zijlstra
@ 2010-04-14 13:34     ` Peter Zijlstra
  2010-04-14 13:51     ` Peter Zijlstra
  2 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-14 13:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin

On Tue, 2010-04-13 at 11:23 +1000, Benjamin Herrenschmidt wrote:
> > +      * A comment here about on why we have RCU freed page tables might be
> > +      * interesting, also explaining why we don't need any sort of grace
> > +      * period for mm_users == 1, and have some home brewn smp_call_func()
> > +      * for single frees.
> 
> iirc, we are synchronizing with CPUs walking page tables in their hash
> or TLB miss code, which is lockless. The mm_users test is a -little- bit
> dubious indeed. It may have to be mm_users < 2 && mm ==
> current->active_mm, ie, we know for sure nobody else is currently
> walking those page tables ... 
> 
> Tho even that is fishy nowadays. We -can- walk page tables on behalf of
> another process. In fact, we do it in the Cell SPU code for faulting
> page table entries as a result of SPEs taking faults for example. So I'm
> starting to suspect that this mm_users optimisation is bogus.
> 
> We -do- want to optimize out the case where there is no user left
> though, ie, the MM is dead. IE. The typical exit case.

Can't you fix that by having the SPE code take a reference on these
mm_structs they're playing with?

Poking at one without a ref seems fishy anyway.
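
Roughly the usual pattern, as a sketch (the SPE side is schematic, not
the actual Cell code; it also assumes the mm_struct itself is pinned,
e.g. via mm_count, for the lifetime of the SPU context):

	if (!atomic_inc_not_zero(&mm->mm_users))
		return;			/* mm already dead, don't walk it */

	/* ... walk / fault in page tables on behalf of the SPE ... */

	mmput(mm);			/* drop the reference again */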


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 07/13] powerpc: Preemptible mmu_gather
  2010-04-13  1:23   ` Benjamin Herrenschmidt
  2010-04-13 10:22     ` Peter Zijlstra
  2010-04-14 13:34     ` Peter Zijlstra
@ 2010-04-14 13:51     ` Peter Zijlstra
  2 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-14 13:51 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Paul E. McKenney

On Tue, 2010-04-13 at 11:23 +1000, Benjamin Herrenschmidt wrote:
> > +      * If there are, but we fail to batch, we need to IPI (all?) CPUs so as
> > +      * to serialize against the IRQ disable. In case we do batch, the RCU
> > +      * grace period is at least long enough to cover IRQ disabled sections
> > +      * (XXX assumption, not strictly true).
> 
> Yeah well ... I'm not that familiar with RCU anymore. Back when I wrote
> that, iirc, I would more or less be safe against a CPU that doesn't
> schedule, but things may have well changed.
> 
> We are trying to be safe against another CPU walking page tables in the
> asm lockless hash miss or TLB miss code. Note that sparc64 has a similar
> issue. This is highly optimized asm code that -cannot- call into things
> like rcu_read_lock().

Right, so Paul has been working hard to remove certain implementation
artifacts from RCU, such as the preempt-disable == rcu_read_lock thing.

Now, even Preemptible RCU has IRQ-disabled == rcu_read_lock, simply
because the RCU grace period state machine is driven from an interrupt.

But there is no such requirement on RCU at all, so in the interest of
removing assumptions and making the code easier to validate, we're
trying to remove such things.





^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-13  3:43     ` Paul E. McKenney
@ 2010-04-14 13:51       ` Peter Zijlstra
  2010-04-15 14:28         ` Paul E. McKenney
  2010-04-16  6:51       ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-14 13:51 UTC (permalink / raw)
  To: paulmck
  Cc: Benjamin Herrenschmidt, Andrea Arcangeli, Avi Kivity,
	Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm, Linus Torvalds,
	linux-kernel, linux-arch, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin

On Mon, 2010-04-12 at 20:43 -0700, Paul E. McKenney wrote:
> > So we might have to support the interrupt assumption, at least in some
> > form, with those guys...
> 
> One way to make the interrupt assumption official is to use
> synchronize_sched() rather than synchronize_rcu().

Well, call_rcu_sched() then, because the current usage is to use
call_rcu() to free the page directories.

Paul, here is a call_rcu_sched() available in kernel/rcutree.c, but am I
right in reading that code that that would not be available for
preemptible RCU?



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-14 13:51       ` Peter Zijlstra
@ 2010-04-15 14:28         ` Paul E. McKenney
  2010-04-16  6:54           ` Benjamin Herrenschmidt
  2010-04-16 13:51           ` Peter Zijlstra
  0 siblings, 2 replies; 113+ messages in thread
From: Paul E. McKenney @ 2010-04-15 14:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Benjamin Herrenschmidt, Andrea Arcangeli, Avi Kivity,
	Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm, Linus Torvalds,
	linux-kernel, linux-arch, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin

On Wed, Apr 14, 2010 at 03:51:50PM +0200, Peter Zijlstra wrote:
> On Mon, 2010-04-12 at 20:43 -0700, Paul E. McKenney wrote:
> > > So we might have to support the interrupt assumption, at least in some
> > > form, with those guys...
> > 
> > One way to make the interrupt assumption official is to use
> > synchronize_sched() rather than synchronize_rcu().
> 
> Well, call_rcu_sched() then, because the current usage is to use
> call_rcu() to free the page directories.
> 
> Paul, here is a call_rcu_sched() available in kernel/rcutree.c, but am I
> right in reading that code that that would not be available for
> preemptible RCU?

Both call_rcu_sched() and call_rcu() are always there for you.  ;-)

o	If CONFIG_TREE_RCU (or CONFIG_TINY_RCU), they both have the same
	implementation.

o	If CONFIG_TREE_PREEMPT_RCU, call_rcu_sched() is preemptible and
	call_rcu() is not.

Of course, with call_rcu_sched(), the corresponding RCU read-side critical
sections are non-preemptible.  Therefore, in CONFIG_PREEMPT_RT, these
read-side critical sections must use raw spinlocks.

Can the code in question accommodate these restrictions?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-13  3:43     ` Paul E. McKenney
  2010-04-14 13:51       ` Peter Zijlstra
@ 2010-04-16  6:51       ` Benjamin Herrenschmidt
  2010-04-16  8:18         ` Nick Piggin
  1 sibling, 1 reply; 113+ messages in thread
From: Benjamin Herrenschmidt @ 2010-04-16  6:51 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Andrea Arcangeli, Avi Kivity, Thomas Gleixner,
	Rik van Riel, Ingo Molnar, akpm, Linus Torvalds, linux-kernel,
	linux-arch, David Miller, Hugh Dickins, Mel Gorman, Nick Piggin

On Mon, 2010-04-12 at 20:43 -0700, Paul E. McKenney wrote:
> > So we might have to support the interrupt assumption, at least in
> some
> > form, with those guys...
> 
> One way to make the interrupt assumption official is to use
> synchronize_sched() rather than synchronize_rcu().

Ok, so I'm a bit of a RCU newbie as you may know :-) Right now, we use
neither, we use call_rcu and we free the pages from the callback.

Ben.



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-15 14:28         ` Paul E. McKenney
@ 2010-04-16  6:54           ` Benjamin Herrenschmidt
  2010-04-16 13:43               ` Paul E. McKenney
  2010-04-16 13:51           ` Peter Zijlstra
  1 sibling, 1 reply; 113+ messages in thread
From: Benjamin Herrenschmidt @ 2010-04-16  6:54 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Andrea Arcangeli, Avi Kivity, Thomas Gleixner,
	Rik van Riel, Ingo Molnar, akpm, Linus Torvalds, linux-kernel,
	linux-arch, David Miller, Hugh Dickins, Mel Gorman, Nick Piggin

On Thu, 2010-04-15 at 07:28 -0700, Paul E. McKenney wrote:
> 
> Of course, with call_rcu_sched(), the corresponding RCU read-side
> critical
> sections are non-preemptible.  Therefore, in CONFIG_PREEMPT_RT, these
> read-side critical sections must use raw spinlocks.
> 
> Can the code in question accommodate these restrictions? 

What we protect against is always code that hard-disables IRQs (though
there seems to be a bug in the hugepages code there...). Would that
work?

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-16  6:51       ` Benjamin Herrenschmidt
@ 2010-04-16  8:18         ` Nick Piggin
  2010-04-16  8:29           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 113+ messages in thread
From: Nick Piggin @ 2010-04-16  8:18 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: paulmck, Peter Zijlstra, Andrea Arcangeli, Avi Kivity,
	Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm, Linus Torvalds,
	linux-kernel, linux-arch, David Miller, Hugh Dickins, Mel Gorman

On Fri, Apr 16, 2010 at 04:51:34PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2010-04-12 at 20:43 -0700, Paul E. McKenney wrote:
> > > So we might have to support the interrupt assumption, at least in
> > some
> > > form, with those guys...
> > 
> > One way to make the interrupt assumption official is to use
> > synchronize_sched() rather than synchronize_rcu().
> 
> Ok, so I'm a bit of a RCU newbie as you may know :-) Right now, we use
> neither, we use call_rcu and we free the pages from the callback.

BTW. you currently have an interesting page table freeing path where
you usually free by RCU, but (occasionally) free by IPI. This means
you need to disable both RCU and interrupts to walk page tables.

If you change it to always use RCU, then you wouldn't need to disable
interrupts. Whether this actually matters anywhere in your mm code, I
don't know (it's probably not terribly important for gup_fast). But
rcu disable is always preferable for latency and performance.
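
Schematically (not the real walker code), the difference for a lookup
path is:

	/* today: frees may come via RCU or via IPI, so walkers must do */
	local_irq_disable();
	/* ... walk the page tables ... */
	local_irq_enable();

	/* if frees were always RCU-deferred, this would be enough */
	rcu_read_lock();
	/* ... walk the page tables ... */
	rcu_read_unlock();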


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-16  8:18         ` Nick Piggin
@ 2010-04-16  8:29           ` Benjamin Herrenschmidt
  2010-04-16  9:22             ` Nick Piggin
  0 siblings, 1 reply; 113+ messages in thread
From: Benjamin Herrenschmidt @ 2010-04-16  8:29 UTC (permalink / raw)
  To: Nick Piggin
  Cc: paulmck, Peter Zijlstra, Andrea Arcangeli, Avi Kivity,
	Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm, Linus Torvalds,
	linux-kernel, linux-arch, David Miller, Hugh Dickins, Mel Gorman

On Fri, 2010-04-16 at 18:18 +1000, Nick Piggin wrote:
> On Fri, Apr 16, 2010 at 04:51:34PM +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2010-04-12 at 20:43 -0700, Paul E. McKenney wrote:
> > > > So we might have to support the interrupt assumption, at least in
> > > some
> > > > form, with those guys...
> > > 
> > > One way to make the interrupt assumption official is to use
> > > synchronize_sched() rather than synchronize_rcu().
> > 
> > Ok, so I'm a bit of a RCU newbie as you may know :-) Right now, we use
> > neither, we use call_rcu and we free the pages from the callback.
> 
> BTW. you currently have an interesting page table freeing path where
> you usually free by RCU, but (occasionally) free by IPI. This means
> you need to disable both RCU and interrupts to walk page tables.

Well, the point is we use interrupts to synchronize. The fact that RCU
used to do the job was an added benefit. I may need to switch to the
rcu_sched variants tho to keep that. The IPI case is a slow path for when
we are out of memory and cannot allocate our RCU batch page.

> If you change it to always use RCU, then you wouldn't need to disable
> interrupts. Whether this actually matters anywhere in your mm code, I
> don't know (it's probably not terribly important for gup_fast). But
> rcu disable is always preferable for latency and performance.

Well, the main case is the hash miss and that always runs with IRQs off.

Cheers,
Ben.





^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-16  8:29           ` Benjamin Herrenschmidt
@ 2010-04-16  9:22             ` Nick Piggin
  0 siblings, 0 replies; 113+ messages in thread
From: Nick Piggin @ 2010-04-16  9:22 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: paulmck, Peter Zijlstra, Andrea Arcangeli, Avi Kivity,
	Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm, Linus Torvalds,
	linux-kernel, linux-arch, David Miller, Hugh Dickins, Mel Gorman

On Fri, Apr 16, 2010 at 06:29:02PM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2010-04-16 at 18:18 +1000, Nick Piggin wrote:
> > On Fri, Apr 16, 2010 at 04:51:34PM +1000, Benjamin Herrenschmidt wrote:
> > > On Mon, 2010-04-12 at 20:43 -0700, Paul E. McKenney wrote:
> > > > > So we might have to support the interrupt assumption, at least in
> > > > some
> > > > > form, with those guys...
> > > > 
> > > > One way to make the interrupt assumption official is to use
> > > > synchronize_sched() rather than synchronize_rcu().
> > > 
> > > Ok, so I'm a bit of a RCU newbie as you may know :-) Right now, we use
> > > neither, we use call_rcu and we free the pages from the callback.
> > 
> > BTW. you currently have an interesting page table freeing path where
> > you usually free by RCU, but (occasionally) free by IPI. This means
> > you need to disable both RCU and interrupts to walk page tables.
> 
> Well, the point is we use interrupts to synchronize. The fact that RCU
> used to do the job was an added benefit. I may need to switch to the
> rcu_sched variants tho to keep that. The IPI case is a slow path for when
> we are out of memory and cannot allocate our RCU batch page.

It is the slowpath but it forces all lookup paths to do irq disable
too.

 
> > If you change it to always use RCU, then you wouldn't need to disable
> > interrupts. Whether this actually matters anywhere in your mm code, I
> > don't know (it's probably not terribly important for gup_fast). But
> > rcu disable is always preferable for latency and performance.
> 
> Well, the main case is the hash miss and that always runs with IRQs off.

Probably not a big deal then.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-16  6:54           ` Benjamin Herrenschmidt
@ 2010-04-16 13:43               ` Paul E. McKenney
  0 siblings, 0 replies; 113+ messages in thread
From: Paul E. McKenney @ 2010-04-16 13:43 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Peter Zijlstra, Andrea Arcangeli, Avi Kivity, Thomas Gleixner,
	Rik van Riel, Ingo Molnar, akpm, Linus Torvalds, linux-kernel,
	linux-arch, David Miller, Hugh Dickins, Mel Gorman, Nick Piggin

On Fri, Apr 16, 2010 at 04:54:51PM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2010-04-15 at 07:28 -0700, Paul E. McKenney wrote:
> > 
> > Of course, with call_rcu_sched(), the corresponding RCU read-side
> > critical
> > sections are non-preemptible.  Therefore, in CONFIG_PREEMPT_RT, these
> > read-side critical sections must use raw spinlocks.
> > 
> > Can the code in question accommodate these restrictions? 
> 
> What we protect against is always code that hard-disables IRQs (though
> there seems to be a bug in the hugepages code there...). Would that
> work?

From the perspective of call_rcu_sched() and synchronize_sched(),
the following things mark RCU-sched read-side critical sections:

1.	rcu_read_lock_sched() and rcu_read_unlock_sched().

2.	preempt_disable() and preempt_enable(), along with anything
	else that disables preemption.

3.	local_bh_disable() and local_bh_enable(), along with anything
	else that disables bottom halves.

4.	local_irq_disable() and local_irq_enable(), along with anything
	else that disables hardirqs.

5.	Handlers for NMIs.

So I believe that in this case call_rcu_sched() is your friend.  ;-)
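
Concretely, for the gup_fast() patch above, the existing sequence would
already be a read-side critical section once the page table free side
uses call_rcu_sched() (sketch, the walk is schematic):

	local_irq_disable();	/* implies an RCU-sched read-side section */

	/* ... walk pgd/pud/pmd/pte and take references on the pages ... */

	local_irq_enable();

so the rcu_read_lock()/rcu_read_unlock() pair added by the patch would
then presumably be redundant for the page table lifetime (though
harmless).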

							Thanx, Paul

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-15 14:28         ` Paul E. McKenney
  2010-04-16  6:54           ` Benjamin Herrenschmidt
@ 2010-04-16 13:51           ` Peter Zijlstra
  2010-04-16 14:17             ` Paul E. McKenney
  1 sibling, 1 reply; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-16 13:51 UTC (permalink / raw)
  To: paulmck
  Cc: Benjamin Herrenschmidt, Andrea Arcangeli, Avi Kivity,
	Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm, Linus Torvalds,
	linux-kernel, linux-arch, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin

On Thu, 2010-04-15 at 07:28 -0700, Paul E. McKenney wrote:
> On Wed, Apr 14, 2010 at 03:51:50PM +0200, Peter Zijlstra wrote:
> > On Mon, 2010-04-12 at 20:43 -0700, Paul E. McKenney wrote:
> > > > So we might have to support the interrupt assumption, at least in some
> > > > form, with those guys...
> > > 
> > > One way to make the interrupt assumption official is to use
> > > synchronize_sched() rather than synchronize_rcu().
> > 
> > Well, call_rcu_sched() then, because the current usage is to use
> > call_rcu() to free the page directories.
> > 
> > Paul, here is a call_rcu_sched() available in kernel/rcutree.c, but am I
> > right in reading that code that that would not be available for
> > preemptible RCU?
> 
> Both call_rcu_sched() and call_rcu() are always there for you.  ;-)
> 
> o	If CONFIG_TREE_RCU (or CONFIG_TINY_RCU), they both have the same
> 	implementation.
> 
> o	If CONFIG_TREE_PREEMPT_RCU, call_rcu_sched() is preemptible and
> 	call_rcu() is not.

(The reverse I suspect?)

> Of course, with call_rcu_sched(), the corresponding RCU read-side critical
> sections are non-preemptible.  Therefore, in CONFIG_PREEMPT_RT, these
> read-side critical sections must use raw spinlocks.

OK, so if we fully remove CONFIG_TREE_PREEMPT_RCU (defaulting to y),
rename all the {call_rcu, rcu_read_lock, rcu_read_unlock,
synchronize_rcu} functions to {*}_preempt and then add a new
CONFIG_PREEMPT_RCU that simply maps {*} to either {*}_sched or
{*}_preempt, we've basically got what I've been asking for for a while,
no?
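
IOW, something along these lines (purely illustrative; the {*}_preempt
names are the proposed renames and don't exist today, the {*}_sched ones
do):

#ifdef CONFIG_PREEMPT_RCU
#define rcu_read_lock()		rcu_read_lock_preempt()
#define rcu_read_unlock()	rcu_read_unlock_preempt()
#define call_rcu(head, func)	call_rcu_preempt(head, func)
#define synchronize_rcu()	synchronize_rcu_preempt()
#else
#define rcu_read_lock()		rcu_read_lock_sched()
#define rcu_read_unlock()	rcu_read_unlock_sched()
#define call_rcu(head, func)	call_rcu_sched(head, func)
#define synchronize_rcu()	synchronize_sched()
#endif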

> Can the code in question accommodate these restrictions?

Yes, that should do just fine I think.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-16 13:51           ` Peter Zijlstra
@ 2010-04-16 14:17             ` Paul E. McKenney
  2010-04-16 14:23               ` Peter Zijlstra
  0 siblings, 1 reply; 113+ messages in thread
From: Paul E. McKenney @ 2010-04-16 14:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Benjamin Herrenschmidt, Andrea Arcangeli, Avi Kivity,
	Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm, Linus Torvalds,
	linux-kernel, linux-arch, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin

On Fri, Apr 16, 2010 at 03:51:21PM +0200, Peter Zijlstra wrote:
> On Thu, 2010-04-15 at 07:28 -0700, Paul E. McKenney wrote:
> > On Wed, Apr 14, 2010 at 03:51:50PM +0200, Peter Zijlstra wrote:
> > > On Mon, 2010-04-12 at 20:43 -0700, Paul E. McKenney wrote:
> > > > > So we might have to support the interrupt assumption, at least in some
> > > > > form, with those guys...
> > > > 
> > > > One way to make the interrupt assumption official is to use
> > > > synchronize_sched() rather than synchronize_rcu().
> > > 
> > > Well, call_rcu_sched() then, because the current usage is to use
> > > call_rcu() to free the page directories.
> > > 
> > > Paul, here is a call_rcu_sched() available in kernel/rcutree.c, but am I
> > > right in reading that code that that would not be available for
> > > preemptible RCU?
> > 
> > Both call_rcu_sched() and call_rcu() are always there for you.  ;-)
> > 
> > o	If CONFIG_TREE_RCU (or CONFIG_TINY_RCU), they both have the same
> > 	implementation.
> > 
> > o	If CONFIG_TREE_PREEMPT_RCU, call_rcu_sched() is preemptible and
> > 	call_rcu() is not.
> 
> (The reverse I suspect?)

Indeed:  If CONFIG_TREE_PREEMPT_RCU, call_rcu() is preemptible and
call_rcu_sched() is not.

> > Of course, with call_rcu_sched(), the corresponding RCU read-side critical
> > sections are non-preemptible.  Therefore, in CONFIG_PREEMPT_RT, these
> > read-side critical sections must use raw spinlocks.
> 
> OK, so if we fully remove CONFIG_TREE_PREEMPT_RCU (defaulting to y),
> rename all the {call_rcu, rcu_read_lock, rcu_read_unlock,
> synchronize_rcu} functions to {*}_preempt and then add a new
> CONFIG_PREEMPT_RCU that simply maps {*} to either {*}_sched or
> {*}_preempt, we've basically got what I've been asking for for a while,
> no?

What would rcu_read_lock_preempt() do in a !CONFIG_PREEMPT kernel?

> > Can the code in question accommodate these restrictions?
> 
> Yes, that should do just fine I think.

Cool!!!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-16 14:17             ` Paul E. McKenney
@ 2010-04-16 14:23               ` Peter Zijlstra
  2010-04-16 14:32                 ` Paul E. McKenney
  0 siblings, 1 reply; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-16 14:23 UTC (permalink / raw)
  To: paulmck
  Cc: Benjamin Herrenschmidt, Andrea Arcangeli, Avi Kivity,
	Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm, Linus Torvalds,
	linux-kernel, linux-arch, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin

On Fri, 2010-04-16 at 07:17 -0700, Paul E. McKenney wrote:

> > > Of course, with call_rcu_sched(), the corresponding RCU read-side critical
> > > sections are non-preemptible.  Therefore, in CONFIG_PREEMPT_RT, these
> > > read-side critical sections must use raw spinlocks.
> > 
> > OK, so if we fully remove CONFIG_TREE_PREEMPT_RCU (defaulting to y),
> > rename all the {call_rcu, rcu_read_lock, rcu_read_unlock,
> > synchronize_rcu} functions to {*}_preempt and then add a new
> > CONFIG_PREEMPT_RCU that simply maps {*} to either {*}_sched or
> > {*}_preempt, we've basically got what I've been asking for for a while,
> > no?
> 
> What would rcu_read_lock_preempt() do in a !CONFIG_PREEMPT kernel?

Same as for a preempt one, since you'd have to be able to schedule()
while holding it to be able to do things like mutex_lock().

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-16 14:23               ` Peter Zijlstra
@ 2010-04-16 14:32                 ` Paul E. McKenney
  2010-04-16 14:56                   ` Peter Zijlstra
  0 siblings, 1 reply; 113+ messages in thread
From: Paul E. McKenney @ 2010-04-16 14:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Benjamin Herrenschmidt, Andrea Arcangeli, Avi Kivity,
	Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm, Linus Torvalds,
	linux-kernel, linux-arch, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin

On Fri, Apr 16, 2010 at 04:23:39PM +0200, Peter Zijlstra wrote:
> On Fri, 2010-04-16 at 07:17 -0700, Paul E. McKenney wrote:
> 
> > > > Of course, with call_rcu_sched(), the corresponding RCU read-side critical
> > > > sections are non-preemptible.  Therefore, in CONFIG_PREEMPT_RT, these
> > > > read-side critical sections must use raw spinlocks.
> > > 
> > > OK, so if we fully remove CONFIG_TREE_PREEMPT_RCU (defaulting to y),
> > > rename all the {call_rcu, rcu_read_lock, rcu_read_unlock,
> > > synchronize_rcu} functions to {*}_preempt and then add a new
> > > CONFIG_PREEMPT_RCU that simply maps {*} to either {*}_sched or
> > > {*}_preempt, we've basically got what I've been asking for for a while,
> > > no?
> > 
> > What would rcu_read_lock_preempt() do in a !CONFIG_PREEMPT kernel?
> 
> Same as for a preempt one, since you'd have to be able to schedule()
> while holding it to be able to do things like mutex_lock().

So what you really want is something like rcu_read_lock_sleep() rather
than rcu_read_lock_preempt(), right?  The point is that you want to do
more than merely preempt, given that it is legal to do general blocking
while holding a mutex, correct?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-16 14:32                 ` Paul E. McKenney
@ 2010-04-16 14:56                   ` Peter Zijlstra
  2010-04-16 15:09                     ` Paul E. McKenney
  0 siblings, 1 reply; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-16 14:56 UTC (permalink / raw)
  To: paulmck
  Cc: Benjamin Herrenschmidt, Andrea Arcangeli, Avi Kivity,
	Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm, Linus Torvalds,
	linux-kernel, linux-arch, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin

On Fri, 2010-04-16 at 07:32 -0700, Paul E. McKenney wrote:
> On Fri, Apr 16, 2010 at 04:23:39PM +0200, Peter Zijlstra wrote:
> > On Fri, 2010-04-16 at 07:17 -0700, Paul E. McKenney wrote:
> > 
> > > > > Of course, with call_rcu_sched(), the corresponding RCU read-side critical
> > > > > sections are non-preemptible.  Therefore, in CONFIG_PREEMPT_RT, these
> > > > > read-side critical sections must use raw spinlocks.
> > > > 
> > > > OK, so if we fully remove CONFIG_TREE_PREEMPT_RCU (defaulting to y),
> > > > rename all the {call_rcu, rcu_read_lock, rcu_read_unlock,
> > > > synchronize_rcu} functions to {*}_preempt and then add a new
> > > > CONFIG_PREEMPT_RCU that simply maps {*} to either {*}_sched or
> > > > {*}_preempt, we've basically got what I've been asking for for a while,
> > > > no?
> > > 
> > > What would rcu_read_lock_preempt() do in a !CONFIG_PREEMPT kernel?
> > 
> > Same as for a preempt one, since you'd have to be able to schedule()
> > while holding it to be able to do things like mutex_lock().
> 
> So what you really want is something like rcu_read_lock_sleep() rather
> than rcu_read_lock_preempt(), right?  The point is that you want to do
> more than merely preempt, given that it is legal to do general blocking
> while holding a mutex, correct?

Right, but CONFIG_TREE_PREEMPT_RCU=y ends up being that. We could change
the name to _sleep, but we've been calling it preemptible-rcu for a long
while now.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-16 14:56                   ` Peter Zijlstra
@ 2010-04-16 15:09                     ` Paul E. McKenney
  2010-04-16 15:14                       ` Peter Zijlstra
  0 siblings, 1 reply; 113+ messages in thread
From: Paul E. McKenney @ 2010-04-16 15:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Benjamin Herrenschmidt, Andrea Arcangeli, Avi Kivity,
	Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm, Linus Torvalds,
	linux-kernel, linux-arch, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin

On Fri, Apr 16, 2010 at 04:56:50PM +0200, Peter Zijlstra wrote:
> On Fri, 2010-04-16 at 07:32 -0700, Paul E. McKenney wrote:
> > On Fri, Apr 16, 2010 at 04:23:39PM +0200, Peter Zijlstra wrote:
> > > On Fri, 2010-04-16 at 07:17 -0700, Paul E. McKenney wrote:
> > > 
> > > > > > Of course, with call_rcu_sched(), the corresponding RCU read-side critical
> > > > > > sections are non-preemptible.  Therefore, in CONFIG_PREEMPT_RT, these
> > > > > > read-side critical sections must use raw spinlocks.
> > > > > 
> > > > > OK, so if we fully remove CONFIG_TREE_PREEMPT_RCU (defaulting to y),
> > > > > rename all the {call_rcu, rcu_read_lock, rcu_read_unlock,
> > > > > synchronize_rcu} functions to {*}_preempt and then add a new
> > > > > CONFIG_PREEMPT_RCU that simply maps {*} to either {*}_sched or
> > > > > {*}_preempt, we've basically got what I've been asking for for a while,
> > > > > no?
> > > > 
> > > > What would rcu_read_lock_preempt() do in a !CONFIG_PREEMPT kernel?
> > > 
> > > Same as for a preempt one, since you'd have to be able to schedule()
> > > while holding it to be able to do things like mutex_lock().
> > 
> > So what you really want is something like rcu_read_lock_sleep() rather
> > than rcu_read_lock_preempt(), right?  The point is that you want to do
> > more than merely preempt, given that it is legal to do general blocking
> > while holding a mutex, correct?
> 
> Right, but CONFIG_TREE_PREEMPT_RCU=y ends up being that. We could change
> the name to _sleep, but we've been calling it preemptible-rcu for a long
> while now.

It is actually not permitted to do general blocking in a preemptible RCU
read-side critical section.  Otherwise, someone is going to block waiting
for a network packet that never comes, thus OOMing the system.
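
That is, the problem case is something like this (names made up):

	rcu_read_lock();
	/* ... use some RCU-protected data ... */
	wait_event(wq, packet_arrived);	/* may never return: the grace
					 * period, and all memory queued
					 * behind call_rcu(), is stuck */
	rcu_read_unlock();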

							Thanx, Paul

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-16 15:09                     ` Paul E. McKenney
@ 2010-04-16 15:14                       ` Peter Zijlstra
  2010-04-16 16:45                         ` Paul E. McKenney
  0 siblings, 1 reply; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-16 15:14 UTC (permalink / raw)
  To: paulmck
  Cc: Benjamin Herrenschmidt, Andrea Arcangeli, Avi Kivity,
	Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm, Linus Torvalds,
	linux-kernel, linux-arch, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin

On Fri, 2010-04-16 at 08:09 -0700, Paul E. McKenney wrote:
> On Fri, Apr 16, 2010 at 04:56:50PM +0200, Peter Zijlstra wrote:
> > On Fri, 2010-04-16 at 07:32 -0700, Paul E. McKenney wrote:
> > > On Fri, Apr 16, 2010 at 04:23:39PM +0200, Peter Zijlstra wrote:
> > > > On Fri, 2010-04-16 at 07:17 -0700, Paul E. McKenney wrote:
> > > > 
> > > > > > > Of course, with call_rcu_sched(), the corresponding RCU read-side critical
> > > > > > > sections are non-preemptible.  Therefore, in CONFIG_PREEMPT_RT, these
> > > > > > > read-side critical sections must use raw spinlocks.
> > > > > > 
> > > > > > OK, so if we fully remove CONFIG_TREE_PREEMPT_RCU (defaulting to y),
> > > > > > rename all the {call_rcu, rcu_read_lock, rcu_read_unlock,
> > > > > > synchronize_rcu} functions to {*}_preempt and then add a new
> > > > > > CONFIG_PREEMPT_RCU that simply maps {*} to either {*}_sched or
> > > > > > {*}_preempt, we've basically got what I've been asking for for a while,
> > > > > > no?
> > > > > 
> > > > > What would rcu_read_lock_preempt() do in a !CONFIG_PREEMPT kernel?
> > > > 
> > > > Same as for a preempt one, since you'd have to be able to schedule()
> > > > while holding it to be able to do things like mutex_lock().
> > > 
> > > So what you really want is something like rcu_read_lock_sleep() rather
> > > than rcu_read_lock_preempt(), right?  The point is that you want to do
> > > more than merely preempt, given that it is legal to do general blocking
> > > while holding a mutex, correct?
> > 
> > Right, but CONFIG_TREE_PREEMPT_RCU=y ends up being that. We could change
> > the name to _sleep, but we've been calling it preemptible-rcu for a long
> > while now.
> 
> It is actually not permitted to do general blocking in a preemptible RCU
> read-side critical section.  Otherwise, someone is going to block waiting
> for a network packet that never comes, thus OOMing the system.

Sure, something that guarantees progress seems like a sensible
restriction for any lock, and in particular RCU :-)



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-16 15:14                       ` Peter Zijlstra
@ 2010-04-16 16:45                         ` Paul E. McKenney
  2010-04-16 19:37                           ` Peter Zijlstra
  2010-04-18  3:06                           ` James Bottomley
  0 siblings, 2 replies; 113+ messages in thread
From: Paul E. McKenney @ 2010-04-16 16:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Benjamin Herrenschmidt, Andrea Arcangeli, Avi Kivity,
	Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm, Linus Torvalds,
	linux-kernel, linux-arch, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin

On Fri, Apr 16, 2010 at 05:14:15PM +0200, Peter Zijlstra wrote:
> On Fri, 2010-04-16 at 08:09 -0700, Paul E. McKenney wrote:
> > On Fri, Apr 16, 2010 at 04:56:50PM +0200, Peter Zijlstra wrote:
> > > On Fri, 2010-04-16 at 07:32 -0700, Paul E. McKenney wrote:
> > > > On Fri, Apr 16, 2010 at 04:23:39PM +0200, Peter Zijlstra wrote:
> > > > > On Fri, 2010-04-16 at 07:17 -0700, Paul E. McKenney wrote:
> > > > > 
> > > > > > > > Of course, with call_rcu_sched(), the corresponding RCU read-side critical
> > > > > > > > sections are non-preemptible.  Therefore, in CONFIG_PREEMPT_RT, these
> > > > > > > > read-side critical sections must use raw spinlocks.
> > > > > > > 
> > > > > > > OK, so if we fully remove CONFIG_TREE_PREEMPT_RCU (defaulting to y),
> > > > > > > rename all the {call_rcu, rcu_read_lock, rcu_read_unlock,
> > > > > > > synchronize_rcu} functions to {*}_preempt and then add a new
> > > > > > > CONFIG_PREEMPT_RCU that simply maps {*} to either {*}_sched or
> > > > > > > {*}_preempt, we've basically got what I've been asking for for a while,
> > > > > > > no?
> > > > > > 
> > > > > > What would rcu_read_lock_preempt() do in a !CONFIG_PREEMPT kernel?
> > > > > 
> > > > > Same as for a preempt one, since you'd have to be able to schedule()
> > > > > while holding it to be able to do things like mutex_lock().
> > > > 
> > > > So what you really want is something like rcu_read_lock_sleep() rather
> > > > than rcu_read_lock_preempt(), right?  The point is that you want to do
> > > > more than merely preempt, given that it is legal to do general blocking
> > > > while holding a mutex, correct?
> > > 
> > > Right, but CONFIG_TREE_PREEMPT_RCU=y ends up being that. We could change
> > > the name to _sleep, but we've been calling it preemptible-rcu for a long
> > > while now.
> > 
> > It is actually not permitted to do general blocking in a preemptible RCU
> > read-side critical section.  Otherwise, someone is going to block waiting
> > for a network packet that never comes, thus OOMing the system.
> 
> Sure, something that guarantees progress seems like a sensible
> restriction for any lock, and in particular RCU :-)

Excellent point -- much of the issue really does center around
forward-progress guarantees.  In fact the Linux kernel has a number of
locking primitives that require different degrees of forward-progress
guarantee from the code in their respective critical sections:

o	spin_lock_irqsave(): Critical sections must guarantee forward
	progress against everything except NMI handlers.

o	raw_spin_lock(): Critical sections must guarantee forward
	progress against everything except IRQ (including softirq)
	and NMI handlers.

o	spin_lock(): Critical sections must guarantee forward
	progress against everything except IRQ (again including softirq)
	and NMI handlers and (given CONFIG_PREEMPT_RT) higher-priority
	realtime tasks.

o	mutex_lock(): Critical sections need not guarantee
	forward progress, as general blocking is permitted.

The other issue is the scope of the lock.  The Linux kernel has
the following:

o	BKL: global scope.

o	Everything else: scope defined by the use of the underlying
	lock variable.

One of the many reasons that we are trying to get rid of BKL is because
it combines global scope with relatively weak forward-progress guarantees.

So here is how the various RCU flavors stack up:

o	rcu_read_lock_bh(): critical sections must guarantee forward
	progress against everything except NMI handlers and IRQ handlers,
	but not against softirq handlers.  Global in scope, so that
	violating the forward-progress guarantee risks OOMing the system.

o	rcu_read_lock_sched(): critical sections must guarantee
	forward progress against everything except NMI and IRQ handlers,
	including softirq handlers.  Global in scope, so that violating
	the forward-progress guarantee risks OOMing the system.

o	rcu_read_lock(): critical sections must guarantee forward
	progress against everything except NMI handlers, IRQ handlers,
	softirq handlers, and (in CONFIG_PREEMPT_RT) higher-priority
	realtime tasks.  Global in scope, so that violating the
	forward-progress guarantee risks OOMing the system.

o	srcu_read_lock(): critical sections need not guarantee forward
	progress, as general blocking is permitted.  Scope is controlled
	by the use of the underlying srcu_struct structure.

As you say, one can block in rcu_read_lock() critical sections, but
the only blocking that is really safe is blocking that is subject to
priority inheritance.  This prohibits mutexes, because although the
mutexes themselves are subject to priority inheritance, the mutexes'
critical sections might well not be.

So the easy response is "just use SRCU."  Of course, SRCU has some
disadvantages at the moment:

o	The return value from srcu_read_lock() must be passed to
	srcu_read_unlock().  I believe that I can fix this.

o	There is no call_srcu().  I believe that I can fix this.

o	SRCU uses a flat per-CPU counter scheme that is not particularly
	scalable.  I believe that I can fix this.

o	SRCU's current implementation makes it almost impossible to
	implement priority boosting.  I believe that I can fix this.

o	SRCU requires explicit initialization of the underlying
	srcu_struct.  Unfortunately, I don't see a reasonable way
	around this.  Not yet, anyway.

So, is there anything else that you don't like about SRCU?
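
(For reference, current SRCU usage looks roughly like this, illustrating
the explicit init and the index that must be carried from
srcu_read_lock() to srcu_read_unlock(); "my_srcu" and "idx" are of
course made up:)

	static struct srcu_struct my_srcu;
	int idx;

	init_srcu_struct(&my_srcu);	/* explicit runtime init */

	idx = srcu_read_lock(&my_srcu);
	/* ... read-side critical section, may block ... */
	srcu_read_unlock(&my_srcu, idx);

	synchronize_srcu(&my_srcu);	/* no call_srcu() yet */
	cleanup_srcu_struct(&my_srcu);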

							Thanx, Paul

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-16 16:45                         ` Paul E. McKenney
@ 2010-04-16 19:37                           ` Peter Zijlstra
  2010-04-16 20:28                             ` Paul E. McKenney
  2010-04-18  3:06                           ` James Bottomley
  1 sibling, 1 reply; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-16 19:37 UTC (permalink / raw)
  To: paulmck
  Cc: Benjamin Herrenschmidt, Andrea Arcangeli, Avi Kivity,
	Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm, Linus Torvalds,
	linux-kernel, linux-arch, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin

On Fri, 2010-04-16 at 09:45 -0700, Paul E. McKenney wrote:
> o       mutex_lock(): Critical sections need not guarantee
>         forward progress, as general blocking is permitted.
> 
Right, I would argue that they should guarantee fwd progress, but due to
being able to schedule while holding them, it's harder to enforce.

Anything that is waiting for uncertainty should do so without any locks
held and simply re-acquire them once such an event does occur.
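
I.e. roughly this shape (sketch; mutex, waitqueue and condition names
made up):

	mutex_lock(&my_mutex);
	while (!cond) {
		mutex_unlock(&my_mutex);
		wait_event(wq, cond);	/* wait with no locks held */
		mutex_lock(&my_mutex);
	}
	/* ... do the work now that cond holds ... */
	mutex_unlock(&my_mutex);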

> So the easy response is "just use SRCU."  Of course, SRCU has some
> disadvantages at the moment:
> 
> o       The return value from srcu_read_lock() must be passed to
>         srcu_read_unlock().  I believe that I can fix this.
> 
> o       There is no call_srcu().  I believe that I can fix this.
> 
> o       SRCU uses a flat per-CPU counter scheme that is not particularly
>         scalable.  I believe that I can fix this.
> 
> o       SRCU's current implementation makes it almost impossible to
>         implement priority boosting.  I believe that I can fix this.
> 
> o       SRCU requires explicit initialization of the underlying
>         srcu_struct.  Unfortunately, I don't see a reasonable way
>         around this.  Not yet, anyway.
> 
> So, is there anything else that you don't like about SRCU?

No, I quite like SRCU when implemented as preemptible tree RCU, and I
don't at all mind that last point, all dynamic things need some sort of
init. All locks certainly have.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-16 19:37                           ` Peter Zijlstra
@ 2010-04-16 20:28                             ` Paul E. McKenney
  0 siblings, 0 replies; 113+ messages in thread
From: Paul E. McKenney @ 2010-04-16 20:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Benjamin Herrenschmidt, Andrea Arcangeli, Avi Kivity,
	Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm, Linus Torvalds,
	linux-kernel, linux-arch, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin

On Fri, Apr 16, 2010 at 09:37:02PM +0200, Peter Zijlstra wrote:
> On Fri, 2010-04-16 at 09:45 -0700, Paul E. McKenney wrote:
> > o       mutex_lock(): Critical sections need not guarantee
> >         forward progress, as general blocking is permitted.
> > 
> Right, I would argue that they should guarantee fwd progress, but due to
> being able to schedule while holding them, it's harder to enforce.
> 
> Anything that is waiting for uncertainty should do so without any locks
> held and simply re-acquire them once such an event does occur.

Agreed.  But holding a small-scope mutex for (say) 60 seconds would not be
a problem (at 120 seconds, you might start seeing softlockup messages).
In contrast, holding off an RCU grace period for 60 seconds might well
OOM the machine, especially a small embedded system with limited memory.

> > So the easy response is "just use SRCU."  Of course, SRCU has some
> > disadvantages at the moment:
> > 
> > o       The return value from srcu_read_lock() must be passed to
> >         srcu_read_unlock().  I believe that I can fix this.
> > 
> > o       There is no call_srcu().  I believe that I can fix this.
> > 
> > o       SRCU uses a flat per-CPU counter scheme that is not particularly
> >         scalable.  I believe that I can fix this.
> > 
> > o       SRCU's current implementation makes it almost impossible to
> >         implement priority boosting.  I believe that I can fix this.
> > 
> > o       SRCU requires explicit initialization of the underlying
> >         srcu_struct.  Unfortunately, I don't see a reasonable way
> >         around this.  Not yet, anyway.
> > 
> > So, is there anything else that you don't like about SRCU?
> 
> No, I quite like SRCU when implemented as preemptible tree RCU, and I
> don't at all mind that last point, all dynamic things need some sort of
> init. All locks certainly have.

Very good!!!  I should clarify, though -- by "explicit initialization",
I mean that there needs to be a run-time call to init_srcu_struct().
Unless there is some clever way to initialize an array of pointers to
per-CPU structures at compile time.  And, conversely, a way to initialize
pointers in a per-CPU structure to point to possibly-different rcu_node
structures.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-16 13:43               ` Paul E. McKenney
  (?)
@ 2010-04-16 23:25               ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 113+ messages in thread
From: Benjamin Herrenschmidt @ 2010-04-16 23:25 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Andrea Arcangeli, Avi Kivity, Thomas Gleixner,
	Rik van Riel, Ingo Molnar, akpm, Linus Torvalds, linux-kernel,
	linux-arch, David Miller, Hugh Dickins, Mel Gorman, Nick Piggin

On Fri, 2010-04-16 at 06:43 -0700, Paul E. McKenney wrote:
> So I believe that in this case call_rcu_sched() is your friend.  ;-)

Looks like it :-)

I'll cook up a patch changing my current call_rcu() to call_rcu_sched().
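
Presumably something of this shape (sketch only; the struct and function
names are approximate, not a tested patch):

-	call_rcu(&batch->rcu, pte_free_rcu_callback);
+	call_rcu_sched(&batch->rcu, pte_free_rcu_callback);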

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-16 16:45                         ` Paul E. McKenney
  2010-04-16 19:37                           ` Peter Zijlstra
@ 2010-04-18  3:06                           ` James Bottomley
  2010-04-18 13:55                             ` Paul E. McKenney
  1 sibling, 1 reply; 113+ messages in thread
From: James Bottomley @ 2010-04-18  3:06 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Benjamin Herrenschmidt, Andrea Arcangeli,
	Avi Kivity, Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm,
	Linus Torvalds, linux-kernel, linux-arch, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin

On Fri, 2010-04-16 at 09:45 -0700, Paul E. McKenney wrote:
> o	mutex_lock(): Critical sections need not guarantee
> 	forward progress, as general blocking is permitted.

This isn't quite right.  mutex critical sections must guarantee eventual
forward progress against the class of other potential acquirers of the
mutex otherwise the system will become either deadlocked or livelocked.

James



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-18  3:06                           ` James Bottomley
@ 2010-04-18 13:55                             ` Paul E. McKenney
  2010-04-18 18:55                               ` James Bottomley
  0 siblings, 1 reply; 113+ messages in thread
From: Paul E. McKenney @ 2010-04-18 13:55 UTC (permalink / raw)
  To: James Bottomley
  Cc: Peter Zijlstra, Benjamin Herrenschmidt, Andrea Arcangeli,
	Avi Kivity, Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm,
	Linus Torvalds, linux-kernel, linux-arch, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin

On Sat, Apr 17, 2010 at 10:06:36PM -0500, James Bottomley wrote:
> On Fri, 2010-04-16 at 09:45 -0700, Paul E. McKenney wrote:
> > o	mutex_lock(): Critical sections need not guarantee
> > 	forward progress, as general blocking is permitted.
> 
> This isn't quite right.  mutex critical sections must guarantee eventual
> forward progress against the class of other potential acquirers of the
> mutex otherwise the system will become either deadlocked or livelocked.

If I understand you correctly, you are saying that it is OK for a given
critical section for a given mutex to fail to make forward progress if
nothing else happens to acquire that mutex during that time.  I would
agree, at least I would if you were to further add that the soft-lockup
checks permit an additional 120 seconds of failure to make forward progress
even if something -is- attempting to acquire that mutex.

By my standards, 120 seconds is a reasonable approximation to infinity,
hence my statement above.

So, would you agree with the following as a more precise statement?

o	mutex_lock(): Critical sections need not guarantee
	forward progress unless some other task is waiting
	on the mutex in question, in which case critical sections
	should complete in 120 seconds.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation
  2010-04-18 13:55                             ` Paul E. McKenney
@ 2010-04-18 18:55                               ` James Bottomley
  0 siblings, 0 replies; 113+ messages in thread
From: James Bottomley @ 2010-04-18 18:55 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Benjamin Herrenschmidt, Andrea Arcangeli,
	Avi Kivity, Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm,
	Linus Torvalds, linux-kernel, linux-arch, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin

On Sun, 2010-04-18 at 06:55 -0700, Paul E. McKenney wrote:
> On Sat, Apr 17, 2010 at 10:06:36PM -0500, James Bottomley wrote:
> > On Fri, 2010-04-16 at 09:45 -0700, Paul E. McKenney wrote:
> > > o	mutex_lock(): Critical sections need not guarantee
> > > 	forward progress, as general blocking is permitted.
> > 
> > This isn't quite right.  mutex critical sections must guarantee eventual
> > forward progress against the class of other potential acquirers of the
> > mutex otherwise the system will become either deadlocked or livelocked.
> 
> If I understand you correctly, you are saying that it is OK for a given
> critical section for a given mutex to fail to make forward progress if
> nothing else happens to acquire that mutex during that time.  I would
> agree, at least I would if you were to further add that the soft-lockup
> checks permit an additional 120 seconds of failure to make forward progress
> even if something -is- attempting to acquire that mutex.

Yes ... I was thinking of two specific cases: one is wrong programming
of lock acquisition where the system deadlocks; the other is doing silly
things like taking a mutex around an event loop instead of inside it so
incoming events prevent forward progress and the system livelocks, but
there are many other ways of producing deadlocks and livelocks.  I just
couldn't think of a concise way of saying all of that but I didn't want
a statement about mutexes to give the impression that anything goes.

I've got to say that I also dislike seeing any form of sleep within a
critical section because it's just asking for a nasty entangled deadlock
which can be very hard to sort out.  So I also didn't like the statement
"general blocking is permitted"

> By my standards, 120 seconds is a reasonable approximation to infinity,
> hence my statement above.
> 
> So, would you agree with the following as a more precise statement?
> 
> o	mutex_lock(): Critical sections need not guarantee
> 	forward progress unless some other task is waiting
> 	on the mutex in question, in which case critical sections
> 	should complete in 120 seconds.

Sounds fair.

James



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 06/13] mm: Preemptible mmu_gather
  2010-04-09 20:36     ` Peter Zijlstra
@ 2010-04-19 19:16       ` Peter Zijlstra
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-04-19 19:16 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman

On Fri, 2010-04-09 at 22:36 +0200, Peter Zijlstra wrote:
> On Fri, 2010-04-09 at 13:25 +1000, Nick Piggin wrote:
> > Have you done some profiling on this? What I would like to see, if
> > it's not too much complexity, is to have a small set of pages to
> > handle common size frees, and then use them up first by default
> > before attempting to allocate more.
> > 
> > Also, it would be cool to be able to chain allocations to avoid
> > TLB flushes even on big frees (overridable by arch of course, in
> > case they're doing some non-preemptible work or you wish to break
> > up lock hold times). But that might be just getting over-engineered.

[ patch to do very long queues ]

One thing that comes from having a preemptible mmu_gather, especially
when we allow such very long gathers, is that we can potentially have a
very large number of pages stuck on these lists.

So we'd need to hook into reclaim somehow, to allow flushing those pages
when we're falling short on memory.
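
As a rough sketch of the idea -- a conceptual userspace model added here
for illustration, not the kernel's mmu_gather API -- pages get batched up
to a cap, and a reclaim-like path gets a hook to flush the batch early
when memory runs short:

/* build with: gcc -Wall gather.c */
#include <stdio.h>
#include <stdlib.h>

#define GATHER_BATCH 512	/* cap on pages queued before a forced flush */

struct gather {
	void *pages[GATHER_BATCH];
	unsigned int nr;
};

/* Stand-in for the TLB flush plus page free a real gather would do. */
static void gather_flush(struct gather *g)
{
	for (unsigned int i = 0; i < g->nr; i++)
		free(g->pages[i]);
	g->nr = 0;
}

/* Queue a page; flush when the batch fills so the number of pages stuck
 * on the list stays bounded. */
static void gather_page(struct gather *g, void *page)
{
	g->pages[g->nr++] = page;
	if (g->nr == GATHER_BATCH)
		gather_flush(g);
}

/* The hook a reclaim path could call when falling short on memory:
 * give the queued pages back right away instead of waiting for the
 * unmap to finish. */
static void gather_reclaim_pressure(struct gather *g)
{
	gather_flush(g);
}

int main(void)
{
	struct gather g = { .nr = 0 };

	for (int i = 0; i < 10000; i++)
		gather_page(&g, malloc(4096));
	gather_reclaim_pressure(&g);	/* e.g. invoked on memory pressure */
	printf("%u pages still queued after flush\n", g.nr);
	return 0;
}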


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma()
  2010-04-09  8:44             ` [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma() Peter Zijlstra
@ 2010-05-24 19:32               ` Andrew Morton
  2010-05-25  9:01                 ` Peter Zijlstra
  0 siblings, 1 reply; 113+ messages in thread
From: Andrew Morton @ 2010-05-24 19:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, Nick Piggin,
	Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman

On Fri, 09 Apr 2010 10:44:53 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, 2010-04-09 at 16:29 +0900, KOSAKI Motohiro wrote:
> > 
> > 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 8b088f0..b4a0b5b 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -295,7 +295,7 @@ struct anon_vma *page_lock_anon_vma(struct page
> > *page)
> >         unsigned long anon_mapping;
> >  
> >         rcu_read_lock();
> > -       anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> > +       anon_mapping = (unsigned long) rcu_dereference(page->mapping);
> >         if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> >                 goto out;
> >         if (!page_mapped(page)) 
> 
> Yes, I think this is indeed required.
> 
> I'll do a new version of the patch that includes the comment updates
> requested by Andrew.

Either this didn't happen or I lost the patch.

I parked mm-revalidate-anon_vma-in-page_lock_anon_vma.patch for now. 
Hopefully everything still works without it..
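
For context on the hunk quoted above, a minimal userspace analogue of the
rcu_dereference() pattern -- written against liburcu purely for
illustration (an assumption on my part; the in-kernel primitives differ
in detail), with made-up structure names:

/* build with: gcc -Wall rcu_sketch.c -lurcu -lpthread */
#include <urcu.h>		/* userspace RCU, default (memb) flavour */
#include <stdio.h>
#include <stdlib.h>

struct mapping {
	int flags;
};

static struct mapping *global_mapping;

/* Reader: the RCU-protected pointer is loaded with rcu_dereference()
 * inside an rcu_read_lock() section, as in the quoted page->mapping
 * access, rather than with a bare ACCESS_ONCE()-style load. */
static void reader(void)
{
	struct mapping *m;

	rcu_read_lock();
	m = rcu_dereference(global_mapping);
	if (m)
		printf("flags = %d\n", m->flags);
	rcu_read_unlock();
}

/* Updater: publish with rcu_assign_pointer() and wait for readers to
 * finish before freeing the old object. */
static void updater(int flags)
{
	struct mapping *fresh = malloc(sizeof(*fresh));
	struct mapping *old = global_mapping;

	fresh->flags = flags;
	rcu_assign_pointer(global_mapping, fresh);
	synchronize_rcu();
	free(old);
}

int main(void)
{
	rcu_register_thread();
	updater(1);
	reader();
	rcu_unregister_thread();
	return 0;
}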


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma()
  2010-05-24 19:32               ` Andrew Morton
@ 2010-05-25  9:01                 ` Peter Zijlstra
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2010-05-25  9:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, Nick Piggin,
	Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, Linus Torvalds, linux-kernel, linux-arch,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman

On Mon, 2010-05-24 at 12:32 -0700, Andrew Morton wrote:
> On Fri, 09 Apr 2010 10:44:53 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Fri, 2010-04-09 at 16:29 +0900, KOSAKI Motohiro wrote:
> > > 
> > > 
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index 8b088f0..b4a0b5b 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -295,7 +295,7 @@ struct anon_vma *page_lock_anon_vma(struct page
> > > *page)
> > >         unsigned long anon_mapping;
> > >  
> > >         rcu_read_lock();
> > > -       anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> > > +       anon_mapping = (unsigned long) rcu_dereference(page->mapping);
> > >         if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> > >                 goto out;
> > >         if (!page_mapped(page)) 
> > 
> > Yes, I think this is indeed required.
> > 
> > I'll do a new version of the patch that includes the comment updates
> > requested by Andrew.
> 
> Either this didn't happen or I lost the patch.

It didn't happen, I got distracted.

> I parked mm-revalidate-anon_vma-in-page_lock_anon_vma.patch for now. 
> Hopefully everything still works without it..

Yes, I think it actually does, due to a number of really non-obvious
things ;-)

I'll try and get back to my "make mmu_gather preemptible" work shortly
and write comments there.

^ permalink raw reply	[flat|nested] 113+ messages in thread

end of thread, other threads:[~2010-05-25  9:02 UTC | newest]

Thread overview: 113+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-08 19:17 [PATCH 00/13] mm: preemptibility -v2 Peter Zijlstra
2010-04-08 19:17 ` Peter Zijlstra
2010-04-08 19:17 ` Peter Zijlstra
2010-04-08 19:17 ` [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-08 20:31   ` Rik van Riel
2010-04-09  3:11   ` Nick Piggin
2010-04-13  1:05   ` Benjamin Herrenschmidt
2010-04-13  3:43     ` Paul E. McKenney
2010-04-14 13:51       ` Peter Zijlstra
2010-04-15 14:28         ` Paul E. McKenney
2010-04-16  6:54           ` Benjamin Herrenschmidt
2010-04-16 13:43             ` Paul E. McKenney
2010-04-16 13:43               ` Paul E. McKenney
2010-04-16 23:25               ` Benjamin Herrenschmidt
2010-04-16 13:51           ` Peter Zijlstra
2010-04-16 14:17             ` Paul E. McKenney
2010-04-16 14:23               ` Peter Zijlstra
2010-04-16 14:32                 ` Paul E. McKenney
2010-04-16 14:56                   ` Peter Zijlstra
2010-04-16 15:09                     ` Paul E. McKenney
2010-04-16 15:14                       ` Peter Zijlstra
2010-04-16 16:45                         ` Paul E. McKenney
2010-04-16 19:37                           ` Peter Zijlstra
2010-04-16 20:28                             ` Paul E. McKenney
2010-04-18  3:06                           ` James Bottomley
2010-04-18 13:55                             ` Paul E. McKenney
2010-04-18 18:55                               ` James Bottomley
2010-04-16  6:51       ` Benjamin Herrenschmidt
2010-04-16  8:18         ` Nick Piggin
2010-04-16  8:29           ` Benjamin Herrenschmidt
2010-04-16  9:22             ` Nick Piggin
2010-04-08 19:17 ` [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma() Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-08 20:50   ` Rik van Riel
2010-04-08 21:20   ` Andrew Morton
2010-04-08 21:54     ` Peter Zijlstra
2010-04-08 21:54       ` Peter Zijlstra
2010-04-09  2:19       ` KOSAKI Motohiro
2010-04-09  2:19   ` Minchan Kim
2010-04-09  3:16   ` Nick Piggin
2010-04-09  4:56     ` KAMEZAWA Hiroyuki
2010-04-09  6:34       ` KOSAKI Motohiro
2010-04-09  6:47         ` KAMEZAWA Hiroyuki
2010-04-09  7:29           ` KOSAKI Motohiro
2010-04-09  7:57             ` KAMEZAWA Hiroyuki
2010-04-09  8:03               ` KAMEZAWA Hiroyuki
2010-04-09  8:24                 ` KAMEZAWA Hiroyuki
2010-04-09  8:01             ` Minchan Kim
2010-04-09  8:17               ` KOSAKI Motohiro
2010-04-09 14:41                 ` mlock and pageout race? Minchan Kim
2010-04-09  8:44             ` [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma() Peter Zijlstra
2010-05-24 19:32               ` Andrew Morton
2010-05-25  9:01                 ` Peter Zijlstra
2010-04-09 12:57   ` Peter Zijlstra
2010-04-08 19:17 ` [PATCH 03/13] x86: Remove last traces of quicklist usage Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-08 20:51   ` Rik van Riel
2010-04-08 19:17 ` [PATCH 04/13] mm: Move anon_vma ref out from under CONFIG_KSM Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-09 12:35   ` Rik van Riel
2010-04-08 19:17 ` [PATCH 05/13] mm: Make use of the anon_vma ref count Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-09  7:04   ` Christian Ehrhardt
2010-04-09  9:57     ` Peter Zijlstra
2010-04-08 19:17 ` [PATCH 06/13] mm: Preemptible mmu_gather Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-09  3:25   ` Nick Piggin
2010-04-09  8:18     ` Peter Zijlstra
2010-04-09 20:36     ` Peter Zijlstra
2010-04-19 19:16       ` Peter Zijlstra
2010-04-08 19:17 ` [PATCH 07/13] powerpc: " Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-09  4:07   ` Nick Piggin
2010-04-09  8:14     ` Peter Zijlstra
2010-04-09  8:46       ` Nick Piggin
2010-04-09  9:22         ` Peter Zijlstra
2010-04-13  2:06       ` Benjamin Herrenschmidt
2010-04-13  1:56     ` Benjamin Herrenschmidt
2010-04-13  1:23   ` Benjamin Herrenschmidt
2010-04-13 10:22     ` Peter Zijlstra
2010-04-14 13:34     ` Peter Zijlstra
2010-04-14 13:51     ` Peter Zijlstra
2010-04-08 19:17 ` [PATCH 08/13] sparc: " Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-08 19:17 ` [PATCH 09/13] mm, powerpc: Move the RCU page-table freeing into generic code Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-09  3:35   ` Nick Piggin
2010-04-09  8:08     ` Peter Zijlstra
2010-04-08 19:17 ` [PATCH 10/13] lockdep, mutex: Provide mutex_lock_nest_lock Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-09 15:36   ` Rik van Riel
2010-04-08 19:17 ` [PATCH 11/13] mutex: Provide mutex_is_contended Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-09 15:37   ` Rik van Riel
2010-04-08 19:17 ` [PATCH 12/13] mm: Convert i_mmap_lock and anon_vma->lock to mutexes Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-08 19:17 ` [PATCH 13/13] mm: Optimize page_lock_anon_vma Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-08 22:18   ` Paul E. McKenney
2010-04-09  8:35     ` Peter Zijlstra
2010-04-09 19:22       ` Paul E. McKenney
2010-04-08 20:29 ` [PATCH 00/13] mm: preemptibility -v2 David Miller
2010-04-08 20:35   ` Peter Zijlstra
2010-04-09  1:00   ` David Miller
2010-04-09  4:14 ` Nick Piggin
2010-04-09  8:35   ` Peter Zijlstra
2010-04-09  8:50     ` Nick Piggin
2010-04-09  8:58       ` Peter Zijlstra
2010-04-09  8:58 ` Martin Schwidefsky
2010-04-09  9:53   ` Peter Zijlstra
2010-04-09  9:03 ` David Howells
2010-04-09  9:22   ` Peter Zijlstra
