All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] increase PREEMPT_BITS to 12 to avoid overflow when starting KVM
@ 2010-03-30 17:36 Rik van Riel
  2010-03-30 17:56 ` Peter Zijlstra
  2010-04-01  9:40 ` [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks() Avi Kivity
  0 siblings, 2 replies; 43+ messages in thread
From: Rik van Riel @ 2010-03-30 17:36 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, avi, aarcange, akpm, Kent Overstreet

Increase the PREEMPT_BITS to 12, to deal with a larger number
of locks that can be taken in mm_take_all_locks or other places
where many instances of the same type of lock can be taken.

The overflow of PREEMPT_BITS should be harmless, since it simply
increments the counter into the SOFTIRQ_BITS, and the counter
will be decremented again later.

However, the overflow does lead to backtraces with CONFIG_PREEMPT_DEBUG
enabled.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Kent Overstreet <kent.overstreet@gmail.com>

---
Kent, does this patch fix the issue you saw?

Peter, I know you do not like this approach. However, I could not
think of a way around mm_take_all_locks. We need to take those locks
and want to track that fact for lock debugging...

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index d5b3876..e74108f 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -14,8 +14,8 @@
  * We put the hardirq and softirq counter into the preemption
  * counter. The bitmask has the following meaning:
  *
- * - bits 0-7 are the preemption count (max preemption depth: 256)
- * - bits 8-15 are the softirq count (max # of softirqs: 256)
+ * - bits 0-12 are the preemption count (max preemption depth: 4096)
+ * - bits 13-19 are the softirq count (max # of softirqs: 256)
  *
  * The hardirq count can in theory reach the same as NR_IRQS.
  * In reality, the number of nested IRQS is limited to the stack
@@ -24,16 +24,16 @@
  * hardirq nesting. An arch may choose to give less than 10 bits.
  * m68k expects it to be 8.
  *
- * - bits 16-25 are the hardirq count (max # of nested hardirqs: 1024)
- * - bit 26 is the NMI_MASK
- * - bit 28 is the PREEMPT_ACTIVE flag
+ * - bits 26-29 are the hardirq count (max # of nested hardirqs: 1024)
+ * - bit 30 is the NMI_MASK
+ * - bit 31 is the PREEMPT_ACTIVE flag
  *
- * PREEMPT_MASK: 0x000000ff
- * SOFTIRQ_MASK: 0x0000ff00
- * HARDIRQ_MASK: 0x03ff0000
- *     NMI_MASK: 0x04000000
+ * PREEMPT_MASK: 0x00000fff
+ * SOFTIRQ_MASK: 0x000ff000
+ * HARDIRQ_MASK: 0x3ff00000
+ *     NMI_MASK: 0x40000000
  */
-#define PREEMPT_BITS	8
+#define PREEMPT_BITS	12
 #define SOFTIRQ_BITS	8
 #define NMI_BITS	1
 


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] increase PREEMPT_BITS to 12 to avoid overflow when starting KVM
  2010-03-30 17:36 [PATCH] increase PREEMPT_BITS to 12 to avoid overflow when starting KVM Rik van Riel
@ 2010-03-30 17:56 ` Peter Zijlstra
  2010-03-30 18:05   ` Rik van Riel
  2010-04-01  9:40 ` [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks() Avi Kivity
  1 sibling, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2010-03-30 17:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, avi, aarcange, akpm, Kent Overstreet, Ingo Molnar, tglx

On Tue, 2010-03-30 at 13:36 -0400, Rik van Riel wrote:
> Increase the PREEMPT_BITS to 12, to deal with a larger number
> of locks that can be taken in mm_take_all_locks or other places
> where many instances of the same type of lock can be taken.
> 
> The overflow of PREEMPT_BITS should be harmless, since it simply
> increments the counter into the SOFTIRQ_BITS, and the counter
> will be decremented again later.
> 
> However, the overflow does lead to backtraces with CONFIG_PREEMPT_DEBUG
> enabled.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Reported-by: Kent Overstreet <kent.overstreet@gmail.com>
> 
> ---
> Kent, does this patch fix the issue you saw?
> 
> Peter, I know you do not like this approach. However, I could not
> think of a way around mm_take_all_locks. We need to take those locks
> and want to track that fact for lock debugging...

Does this even boot? It tramples all over PREEMPT_ACTIVE for x86.

Also, you'll need to convince mingo and tglx too.. taking that many
spinlocks is utter suckage..

> diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> index d5b3876..e74108f 100644
> --- a/include/linux/hardirq.h
> +++ b/include/linux/hardirq.h
> @@ -14,8 +14,8 @@
>   * We put the hardirq and softirq counter into the preemption
>   * counter. The bitmask has the following meaning:
>   *
> - * - bits 0-7 are the preemption count (max preemption depth: 256)
> - * - bits 8-15 are the softirq count (max # of softirqs: 256)
> + * - bits 0-12 are the preemption count (max preemption depth: 4096)
> + * - bits 13-19 are the softirq count (max # of softirqs: 256)
>   *
>   * The hardirq count can in theory reach the same as NR_IRQS.
>   * In reality, the number of nested IRQS is limited to the stack
> @@ -24,16 +24,16 @@
>   * hardirq nesting. An arch may choose to give less than 10 bits.
>   * m68k expects it to be 8.
>   *
> - * - bits 16-25 are the hardirq count (max # of nested hardirqs: 1024)
> - * - bit 26 is the NMI_MASK
> - * - bit 28 is the PREEMPT_ACTIVE flag
> + * - bits 26-29 are the hardirq count (max # of nested hardirqs: 1024)
> + * - bit 30 is the NMI_MASK
> + * - bit 31 is the PREEMPT_ACTIVE flag
>   *
> - * PREEMPT_MASK: 0x000000ff
> - * SOFTIRQ_MASK: 0x0000ff00
> - * HARDIRQ_MASK: 0x03ff0000
> - *     NMI_MASK: 0x04000000
> + * PREEMPT_MASK: 0x00000fff
> + * SOFTIRQ_MASK: 0x000ff000
> + * HARDIRQ_MASK: 0x3ff00000
> + *     NMI_MASK: 0x40000000
>   */
> -#define PREEMPT_BITS	8
> +#define PREEMPT_BITS	12
>  #define SOFTIRQ_BITS	8
>  #define NMI_BITS	1
>  
> 




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] increase PREEMPT_BITS to 12 to avoid overflow when starting KVM
  2010-03-30 17:56 ` Peter Zijlstra
@ 2010-03-30 18:05   ` Rik van Riel
  2010-03-30 18:34     ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Rik van Riel @ 2010-03-30 18:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, avi, aarcange, akpm, Kent Overstreet, Ingo Molnar, tglx

On 03/30/2010 01:56 PM, Peter Zijlstra wrote:
> On Tue, 2010-03-30 at 13:36 -0400, Rik van Riel wrote:
>> Increase the PREEMPT_BITS to 12, to deal with a larger number
>> of locks that can be taken in mm_take_all_locks or other places
>> where many instances of the same type of lock can be taken.
>>
>> The overflow of PREEMPT_BITS should be harmless, since it simply
>> increments the counter into the SOFTIRQ_BITS, and the counter
>> will be decremented again later.
>>
>> However, the overflow does lead to backtraces with CONFIG_PREEMPT_DEBUG
>> enabled.
>>
>> Signed-off-by: Rik van Riel<riel@redhat.com>
>> Reported-by: Kent Overstreet<kent.overstreet@gmail.com>
>>
>> ---
>> Kent, does this patch fix the issue you saw?
>>
>> Peter, I know you do not like this approach. However, I could not
>> think of a way around mm_take_all_locks. We need to take those locks
>> and want to track that fact for lock debugging...
>
> Does this even boot? It tramples all over PREEMPT_ACTIVE for x86.

Awww nuts.  I thought PREEMPT_ACTIVE used the same scheme of
adding shifts, but now I see it is hard coded on x86. Doh!

I'm guessing the best thing to do would be to remove the
PREEMPT_ACTIVE #define on x86 and use the one from hardirq.h?

> Also, you'll need to convince mingo and tglx too.. taking that many
> spinlocks is utter suckage..

No argument there.  I can't think of an alternative to mm_take_all_locks
though.  Andrea?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] increase PREEMPT_BITS to 12 to avoid overflow when starting KVM
  2010-03-30 18:05   ` Rik van Riel
@ 2010-03-30 18:34     ` Peter Zijlstra
  0 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2010-03-30 18:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, avi, aarcange, akpm, Kent Overstreet, Ingo Molnar, tglx

On Tue, 2010-03-30 at 14:05 -0400, Rik van Riel wrote:
> > Also, you'll need to convince mingo and tglx too.. taking that many
> > spinlocks is utter suckage..
> 
> No argument there.  I can't think of an alternative to mm_take_all_locks
> though.  Andrea? 

Well there's altneratives enough, back when this got introduces quite a
few got mentioned too, but they all make the !mm_take_all_locks() lock
side slower so Andrea didn't like them.

I think one of them got mentioned in yesterday's thread as well.

Another alternative is making all those spinlocks mutexes, that will get
rid of that massive !preempt section too.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [COUNTERPATCH] mm: avoid overflowing preempt_count() in  mmu_take_all_locks()
  2010-03-30 17:36 [PATCH] increase PREEMPT_BITS to 12 to avoid overflow when starting KVM Rik van Riel
  2010-03-30 17:56 ` Peter Zijlstra
@ 2010-04-01  9:40 ` Avi Kivity
  2010-04-01 10:31   ` Peter Zijlstra
  1 sibling, 1 reply; 43+ messages in thread
From: Avi Kivity @ 2010-04-01  9:40 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel; +Cc: peterz, aarcange, akpm, Kent Overstreet

mmu_take_all_locks() takes a spinlock for each vma, which means we increase
the preempt count by the number of vmas in an address space.  Since the user
controls the number of vmas, they can cause preempt_count to overflow.

Fix by making mmu_take_all_locks() only disable preemption once by making
the spinlocks preempt-neutral.

Signed-off-by: Avi Kivity <avi@redhat.com>
---
 mm/mmap.c |   37 +++++++++++++++++++++++++++++++++++++
 1 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..7305510 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2442,6 +2442,12 @@ static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma)
 		 */
 		spin_lock_nest_lock(&anon_vma->lock, &mm->mmap_sem);
 		/*
+		 * To avoid O(nr_vmas) preempt_disable()s, we keep each
+		 * vma preemp_count neutral and instead disable preempt
+		 * once per mmu_take_all_lock().
+		 */
+		preempt_enable();
+		/*
 		 * We can safely modify head.next after taking the
 		 * anon_vma->lock. If some other vma in this mm shares
 		 * the same anon_vma we won't take it again.
@@ -2471,6 +2477,12 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
 		if (test_and_set_bit(AS_MM_ALL_LOCKS, &mapping->flags))
 			BUG();
 		spin_lock_nest_lock(&mapping->i_mmap_lock, &mm->mmap_sem);
+		/*
+		 * To avoid O(nr_vmas) preempt_disable()s, we keep each
+		 * vma preemp_count neutral and instead disable preempt
+		 * once per mmu_take_all_lock().
+		 */
+		preempt_enable();
 	}
 }
 
@@ -2516,6 +2528,13 @@ int mm_take_all_locks(struct mm_struct *mm)
 
 	mutex_lock(&mm_all_locks_mutex);
 
+	/*
+	 * To avoid O(nr_vmas) preempt_disable()s, we keep each
+	 * vma preemp_count neutral and instead disable preempt
+	 * once per mmu_take_all_lock().
+	 */
+	preempt_disable();
+
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		if (signal_pending(current))
 			goto out_unlock;
@@ -2558,6 +2577,12 @@ static void vm_unlock_anon_vma(struct anon_vma *anon_vma)
 		if (!__test_and_clear_bit(0, (unsigned long *)
 					  &anon_vma->head.next))
 			BUG();
+		/*
+		 * To avoid O(nr_vmas) preempt_disable()s, we keep each
+		 * vma preemp_count neutral and instead disable preempt
+		 * once per mmu_take_all_lock().
+		 */
+		preempt_disable();
 		spin_unlock(&anon_vma->lock);
 	}
 }
@@ -2569,6 +2594,12 @@ static void vm_unlock_mapping(struct address_space *mapping)
 		 * AS_MM_ALL_LOCKS can't change to 0 from under us
 		 * because we hold the mm_all_locks_mutex.
 		 */
+		/*
+		 * To avoid O(nr_vmas) preempt_disable()s, we keep each
+		 * vma preemp_count neutral and instead disable preempt
+		 * once per mmu_take_all_lock().
+		 */
+		preempt_disable();
 		spin_unlock(&mapping->i_mmap_lock);
 		if (!test_and_clear_bit(AS_MM_ALL_LOCKS,
 					&mapping->flags))
@@ -2596,6 +2627,12 @@ void mm_drop_all_locks(struct mm_struct *mm)
 			vm_unlock_mapping(vma->vm_file->f_mapping);
 	}
 
+	/*
+	 * To avoid O(nr_vmas) preempt_disable()s, we keep each
+	 * vma preemp_count neutral and instead disable preempt
+	 * once per mmu_take_all_lock().
+	 */
+	preempt_enable();
 	mutex_unlock(&mm_all_locks_mutex);
 }
 
-- 
1.7.0.2


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in  mmu_take_all_locks()
  2010-04-01  9:40 ` [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks() Avi Kivity
@ 2010-04-01 10:31   ` Peter Zijlstra
  2010-04-01 11:04     ` Thomas Gleixner
  2010-04-01 11:09     ` Avi Kivity
  0 siblings, 2 replies; 43+ messages in thread
From: Peter Zijlstra @ 2010-04-01 10:31 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Rik van Riel, linux-kernel, aarcange, akpm, Kent Overstreet,
	Ingo Molnar, tglx

I'm sure you dropped Ingo and Thomas by accident.

On Thu, 2010-04-01 at 12:40 +0300, Avi Kivity wrote:
> mmu_take_all_locks() takes a spinlock for each vma, which means we increase
> the preempt count by the number of vmas in an address space.  Since the user
> controls the number of vmas, they can cause preempt_count to overflow.
> 
> Fix by making mmu_take_all_locks() only disable preemption once by making
> the spinlocks preempt-neutral.

Right, so while this will get rid of the warning it doesn't make the
code any nicer, its still a massive !preempt latency spot.

> Signed-off-by: Avi Kivity <avi@redhat.com>
> ---
>  mm/mmap.c |   37 +++++++++++++++++++++++++++++++++++++
>  1 files changed, 37 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 75557c6..7305510 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2442,6 +2442,12 @@ static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma)
>  		 */
>  		spin_lock_nest_lock(&anon_vma->lock, &mm->mmap_sem);
>  		/*
> +		 * To avoid O(nr_vmas) preempt_disable()s, we keep each
> +		 * vma preemp_count neutral and instead disable preempt
> +		 * once per mmu_take_all_lock().
> +		 */
> +		preempt_enable();
> +		/*
>  		 * We can safely modify head.next after taking the
>  		 * anon_vma->lock. If some other vma in this mm shares
>  		 * the same anon_vma we won't take it again.
> @@ -2471,6 +2477,12 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
>  		if (test_and_set_bit(AS_MM_ALL_LOCKS, &mapping->flags))
>  			BUG();
>  		spin_lock_nest_lock(&mapping->i_mmap_lock, &mm->mmap_sem);
> +		/*
> +		 * To avoid O(nr_vmas) preempt_disable()s, we keep each
> +		 * vma preemp_count neutral and instead disable preempt
> +		 * once per mmu_take_all_lock().
> +		 */
> +		preempt_enable();
>  	}
>  }
>  
> @@ -2516,6 +2528,13 @@ int mm_take_all_locks(struct mm_struct *mm)
>  
>  	mutex_lock(&mm_all_locks_mutex);
>  
> +	/*
> +	 * To avoid O(nr_vmas) preempt_disable()s, we keep each
> +	 * vma preemp_count neutral and instead disable preempt
> +	 * once per mmu_take_all_lock().
> +	 */
> +	preempt_disable();
> +
>  	for (vma = mm->mmap; vma; vma = vma->vm_next) {
>  		if (signal_pending(current))
>  			goto out_unlock;
> @@ -2558,6 +2577,12 @@ static void vm_unlock_anon_vma(struct anon_vma *anon_vma)
>  		if (!__test_and_clear_bit(0, (unsigned long *)
>  					  &anon_vma->head.next))
>  			BUG();
> +		/*
> +		 * To avoid O(nr_vmas) preempt_disable()s, we keep each
> +		 * vma preemp_count neutral and instead disable preempt
> +		 * once per mmu_take_all_lock().
> +		 */
> +		preempt_disable();
>  		spin_unlock(&anon_vma->lock);
>  	}
>  }
> @@ -2569,6 +2594,12 @@ static void vm_unlock_mapping(struct address_space *mapping)
>  		 * AS_MM_ALL_LOCKS can't change to 0 from under us
>  		 * because we hold the mm_all_locks_mutex.
>  		 */
> +		/*
> +		 * To avoid O(nr_vmas) preempt_disable()s, we keep each
> +		 * vma preemp_count neutral and instead disable preempt
> +		 * once per mmu_take_all_lock().
> +		 */
> +		preempt_disable();
>  		spin_unlock(&mapping->i_mmap_lock);
>  		if (!test_and_clear_bit(AS_MM_ALL_LOCKS,
>  					&mapping->flags))
> @@ -2596,6 +2627,12 @@ void mm_drop_all_locks(struct mm_struct *mm)
>  			vm_unlock_mapping(vma->vm_file->f_mapping);
>  	}
>  
> +	/*
> +	 * To avoid O(nr_vmas) preempt_disable()s, we keep each
> +	 * vma preemp_count neutral and instead disable preempt
> +	 * once per mmu_take_all_lock().
> +	 */
> +	preempt_enable();
>  	mutex_unlock(&mm_all_locks_mutex);
>  }
>  




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 10:31   ` Peter Zijlstra
@ 2010-04-01 11:04     ` Thomas Gleixner
  2010-04-01 11:13       ` Avi Kivity
                         ` (2 more replies)
  2010-04-01 11:09     ` Avi Kivity
  1 sibling, 3 replies; 43+ messages in thread
From: Thomas Gleixner @ 2010-04-01 11:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Rik van Riel, linux-kernel, aarcange, akpm,
	Kent Overstreet, Ingo Molnar

On Thu, 1 Apr 2010, Peter Zijlstra wrote:

> I'm sure you dropped Ingo and Thomas by accident.
> 
> On Thu, 2010-04-01 at 12:40 +0300, Avi Kivity wrote:
> > mmu_take_all_locks() takes a spinlock for each vma, which means we increase
> > the preempt count by the number of vmas in an address space.  Since the user
> > controls the number of vmas, they can cause preempt_count to overflow.
> > 
> > Fix by making mmu_take_all_locks() only disable preemption once by making
> > the spinlocks preempt-neutral.
> 
> Right, so while this will get rid of the warning it doesn't make the
> code any nicer, its still a massive !preempt latency spot.

I'm not sure whether this is a real well done April 1st joke or if there
is someone trying to secure the "bad taste patch of the month" price.

Anyway, I don't see a reason why we can't convert those locks to
mutexes and get rid of the whole preempt disabled region.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in  mmu_take_all_locks()
  2010-04-01 10:31   ` Peter Zijlstra
  2010-04-01 11:04     ` Thomas Gleixner
@ 2010-04-01 11:09     ` Avi Kivity
  1 sibling, 0 replies; 43+ messages in thread
From: Avi Kivity @ 2010-04-01 11:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, linux-kernel, aarcange, akpm, Kent Overstreet,
	Ingo Molnar, tglx

On 04/01/2010 01:31 PM, Peter Zijlstra wrote:
> I'm sure you dropped Ingo and Thomas by accident.
>
> On Thu, 2010-04-01 at 12:40 +0300, Avi Kivity wrote:
>    
>> mmu_take_all_locks() takes a spinlock for each vma, which means we increase
>> the preempt count by the number of vmas in an address space.  Since the user
>> controls the number of vmas, they can cause preempt_count to overflow.
>>
>> Fix by making mmu_take_all_locks() only disable preemption once by making
>> the spinlocks preempt-neutral.
>>      
> Right, so while this will get rid of the warning it doesn't make the
> code any nicer, its still a massive !preempt latency spot.
>    

True.  But this is a band-aid we can apply now while the correct fix is 
being worked out.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 11:04     ` Thomas Gleixner
@ 2010-04-01 11:13       ` Avi Kivity
  2010-04-01 11:16         ` Peter Zijlstra
  2010-04-01 11:17         ` Avi Kivity
  2010-04-01 14:16       ` Rik van Riel
  2010-04-01 15:32       ` Andrea Arcangeli
  2 siblings, 2 replies; 43+ messages in thread
From: Avi Kivity @ 2010-04-01 11:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Rik van Riel, linux-kernel, aarcange, akpm,
	Kent Overstreet, Ingo Molnar

On 04/01/2010 02:04 PM, Thomas Gleixner wrote:
>>
>>> mmu_take_all_locks() takes a spinlock for each vma, which means we increase
>>> the preempt count by the number of vmas in an address space.  Since the user
>>> controls the number of vmas, they can cause preempt_count to overflow.
>>>
>>> Fix by making mmu_take_all_locks() only disable preemption once by making
>>> the spinlocks preempt-neutral.
>>>        
>> Right, so while this will get rid of the warning it doesn't make the
>> code any nicer, its still a massive !preempt latency spot.
>>      
> I'm not sure whether this is a real well done April 1st joke or if there
> is someone trying to secure the "bad taste patch of the month" price.
>    

I don't think a spin_lock_nested_while_preempt_disabled() is worthwhile 
for this.

> Anyway, I don't see a reason why we can't convert those locks to
> mutexes and get rid of the whole preempt disabled region.
>    

If someone is willing to audit all code paths to make sure these locks 
are always taken in schedulable context I agree that's a better fix.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 11:13       ` Avi Kivity
@ 2010-04-01 11:16         ` Peter Zijlstra
  2010-04-01 11:19           ` Avi Kivity
  2010-04-01 15:36           ` Andrea Arcangeli
  2010-04-01 11:17         ` Avi Kivity
  1 sibling, 2 replies; 43+ messages in thread
From: Peter Zijlstra @ 2010-04-01 11:16 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Thomas Gleixner, Rik van Riel, linux-kernel, aarcange, akpm,
	Kent Overstreet, Ingo Molnar

On Thu, 2010-04-01 at 14:13 +0300, Avi Kivity wrote:

> If someone is willing to audit all code paths to make sure these locks 
> are always taken in schedulable context I agree that's a better fix.

They had better be, they're not irq-safe. Also that's what lockdep is
for.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 11:13       ` Avi Kivity
  2010-04-01 11:16         ` Peter Zijlstra
@ 2010-04-01 11:17         ` Avi Kivity
  2010-04-01 11:27           ` Peter Zijlstra
  1 sibling, 1 reply; 43+ messages in thread
From: Avi Kivity @ 2010-04-01 11:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Rik van Riel, linux-kernel, aarcange, akpm,
	Kent Overstreet, Ingo Molnar

On 04/01/2010 02:13 PM, Avi Kivity wrote:
>
>> Anyway, I don't see a reason why we can't convert those locks to
>> mutexes and get rid of the whole preempt disabled region.
>
> If someone is willing to audit all code paths to make sure these locks 
> are always taken in schedulable context I agree that's a better fix.
>

 From mm/rmap.c:

> /*
>  * Lock ordering in mm:
>  *
>  * inode->i_mutex    (while writing or truncating, not reading or 
> faulting)
>  *   inode->i_alloc_sem (vmtruncate_range)
>  *   mm->mmap_sem
>  *     page->flags PG_locked (lock_page)
>  *       mapping->i_mmap_lock
>  *         anon_vma->lock
...
> *
>  * (code doesn't rely on that order so it could be switched around)
>  * ->tasklist_lock
>  *   anon_vma->lock      (memory_failure, collect_procs_anon)
>  *     pte map lock
>  */

i_mmap_lock is a spinlock, and tasklist_lock is a rwlock, so some 
changes will be needed.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 11:16         ` Peter Zijlstra
@ 2010-04-01 11:19           ` Avi Kivity
  2010-04-01 15:36           ` Andrea Arcangeli
  1 sibling, 0 replies; 43+ messages in thread
From: Avi Kivity @ 2010-04-01 11:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Rik van Riel, linux-kernel, aarcange, akpm,
	Kent Overstreet, Ingo Molnar

On 04/01/2010 02:16 PM, Peter Zijlstra wrote:
> On Thu, 2010-04-01 at 14:13 +0300, Avi Kivity wrote:
>
>    
>> If someone is willing to audit all code paths to make sure these locks
>> are always taken in schedulable context I agree that's a better fix.
>>      
> They had better be, they're not irq-safe. Also that's what lockdep is
> for.
>    

I don't understand.  There are non-schedulable contexts (i.e. with a 
spinlock held) that are not irq contexts.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 11:17         ` Avi Kivity
@ 2010-04-01 11:27           ` Peter Zijlstra
  2010-04-01 11:43             ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2010-04-01 11:27 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Thomas Gleixner, Rik van Riel, linux-kernel, aarcange, akpm,
	Kent Overstreet, Ingo Molnar

On Thu, 2010-04-01 at 14:17 +0300, Avi Kivity wrote:
> On 04/01/2010 02:13 PM, Avi Kivity wrote:
> >
> >> Anyway, I don't see a reason why we can't convert those locks to
> >> mutexes and get rid of the whole preempt disabled region.
> >
> > If someone is willing to audit all code paths to make sure these locks 
> > are always taken in schedulable context I agree that's a better fix.
> >
> 
>  From mm/rmap.c:
> 
> > /*
> >  * Lock ordering in mm:
> >  *
> >  * inode->i_mutex    (while writing or truncating, not reading or 
> > faulting)
> >  *   inode->i_alloc_sem (vmtruncate_range)
> >  *   mm->mmap_sem
> >  *     page->flags PG_locked (lock_page)
> >  *       mapping->i_mmap_lock
> >  *         anon_vma->lock
> ...
> > *
> >  * (code doesn't rely on that order so it could be switched around)
> >  * ->tasklist_lock
> >  *   anon_vma->lock      (memory_failure, collect_procs_anon)
> >  *     pte map lock
> >  */
> 
> i_mmap_lock is a spinlock, and tasklist_lock is a rwlock, so some 
> changes will be needed.

i_mmap_lock will need to change just as well, mm_take_all_locks() uses
both anon_vma->lock and mapping->i_mmap_lock.

I've almost got a patch done that converts those two, still need to look
where that tasklist_lock muck happens.




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 11:27           ` Peter Zijlstra
@ 2010-04-01 11:43             ` Peter Zijlstra
  2010-04-01 11:47               ` Avi Kivity
  2010-04-01 15:42               ` Andrea Arcangeli
  0 siblings, 2 replies; 43+ messages in thread
From: Peter Zijlstra @ 2010-04-01 11:43 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Thomas Gleixner, Rik van Riel, linux-kernel, aarcange, akpm,
	Kent Overstreet, Ingo Molnar

On Thu, 2010-04-01 at 13:27 +0200, Peter Zijlstra wrote:
> 
> I've almost got a patch done that converts those two, still need to look
> where that tasklist_lock muck happens.

OK, so the below builds and boots, only need to track down that
tasklist_lock nesting, but I got to run an errand first.

---
 arch/x86/mm/hugetlbpage.c |    4 ++--
 fs/hugetlbfs/inode.c      |    4 ++--
 fs/inode.c                |    2 +-
 include/linux/fs.h        |    2 +-
 include/linux/lockdep.h   |    3 +++
 include/linux/mm.h        |    2 +-
 include/linux/mutex.h     |    8 ++++++++
 include/linux/rmap.h      |    8 ++++----
 kernel/fork.c             |    4 ++--
 kernel/mutex.c            |   25 +++++++++++++++++--------
 mm/filemap_xip.c          |    4 ++--
 mm/fremap.c               |    4 ++--
 mm/hugetlb.c              |   12 ++++++------
 mm/ksm.c                  |   18 +++++++++---------
 mm/memory-failure.c       |    4 ++--
 mm/memory.c               |   21 ++++++---------------
 mm/mmap.c                 |   20 ++++++++++----------
 mm/mremap.c               |    4 ++--
 mm/rmap.c                 |   40 +++++++++++++++++++---------------------
 19 files changed, 99 insertions(+), 90 deletions(-)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index f46c340..5e5ac7d 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -73,7 +73,7 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	if (!vma_shareable(vma, addr))
 		return;
 
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -98,7 +98,7 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 		put_page(virt_to_page(spte));
 	spin_unlock(&mm->page_table_lock);
 out:
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 }
 
 /*
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a0bbd3d..141065d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -429,10 +429,10 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	pgoff = offset >> PAGE_SHIFT;
 
 	i_size_write(inode, offset);
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	if (!prio_tree_empty(&mapping->i_mmap))
 		hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	truncate_hugepages(inode, offset);
 	return 0;
 }
diff --git a/fs/inode.c b/fs/inode.c
index 407bf39..6d1ea52 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -258,7 +258,7 @@ void inode_init_once(struct inode *inode)
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	spin_lock_init(&inode->i_data.tree_lock);
-	spin_lock_init(&inode->i_data.i_mmap_lock);
+	mutex_init(&inode->i_data.i_mmap_lock);
 	INIT_LIST_HEAD(&inode->i_data.private_list);
 	spin_lock_init(&inode->i_data.private_lock);
 	INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 10b8ded..6aa624b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -627,7 +627,7 @@ struct address_space {
 	unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
 	struct prio_tree_root	i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
-	spinlock_t		i_mmap_lock;	/* protect tree, count, list */
+	struct mutex		i_mmap_lock;	/* protect tree, count, list */
 	unsigned int		truncate_count;	/* Cover race condition with truncate */
 	unsigned long		nrpages;	/* number of total pages */
 	pgoff_t			writeback_index;/* writeback starts here */
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index a03977a..4bb7620 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -484,12 +484,15 @@ static inline void print_irqtrace_events(struct task_struct *curr)
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 # ifdef CONFIG_PROVE_LOCKING
 #  define mutex_acquire(l, s, t, i)		lock_acquire(l, s, t, 0, 2, NULL, i)
+#  define mutex_acquire_nest(l, s, t, n, i)	lock_acquire(l, s, t, 0, 2, n, i)
 # else
 #  define mutex_acquire(l, s, t, i)		lock_acquire(l, s, t, 0, 1, NULL, i)
+#  define mutex_acquire_nest(l, s, t, n, i)	lock_acquire(l, s, t, 0, 1, n, i)
 # endif
 # define mutex_release(l, n, i)			lock_release(l, n, i)
 #else
 # define mutex_acquire(l, s, t, i)		do { } while (0)
+# define mutex_acquire_nest(l, s, t, n, i)	do { } while (0)
 # define mutex_release(l, n, i)			do { } while (0)
 #endif
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c8442b6..ed7cfd2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -749,7 +749,7 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
-	spinlock_t *i_mmap_lock;		/* For unmap_mapping_range: */
+	struct mutex *i_mmap_lock;		/* For unmap_mapping_range: */
 	unsigned long truncate_count;		/* Compare vm_truncate_count */
 };
 
diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 878cab4..2bd2847 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -132,6 +132,13 @@ extern int __must_check mutex_lock_killable_nested(struct mutex *lock,
 #define mutex_lock(lock) mutex_lock_nested(lock, 0)
 #define mutex_lock_interruptible(lock) mutex_lock_interruptible_nested(lock, 0)
 #define mutex_lock_killable(lock) mutex_lock_killable_nested(lock, 0)
+
+#define mutex_lock_nest_lock(lock, nest_lock)				\
+do {									\
+	typecheck(struct lockdep_map *, &(nest_lock)->dep_map);		\
+	_mutex_lock_nest_lock(lock, &(nest_lock)->dep_map);		\
+} while (0)
+
 #else
 extern void mutex_lock(struct mutex *lock);
 extern int __must_check mutex_lock_interruptible(struct mutex *lock);
@@ -140,6 +147,7 @@ extern int __must_check mutex_lock_killable(struct mutex *lock);
 # define mutex_lock_nested(lock, subclass) mutex_lock(lock)
 # define mutex_lock_interruptible_nested(lock, subclass) mutex_lock_interruptible(lock)
 # define mutex_lock_killable_nested(lock, subclass) mutex_lock_killable(lock)
+# define mutex_lock_nest_lock(lock, nest_lock) mutex_lock(lock)
 #endif
 
 /*
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..a3c2657 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -7,7 +7,7 @@
 #include <linux/list.h>
 #include <linux/slab.h>
 #include <linux/mm.h>
-#include <linux/spinlock.h>
+#include <linux/mutex.h>
 #include <linux/memcontrol.h>
 
 /*
@@ -25,7 +25,7 @@
  * pointing to this anon_vma once its vma list is empty.
  */
 struct anon_vma {
-	spinlock_t lock;	/* Serialize access to vma list */
+	struct mutex lock;	/* Serialize access to vma list */
 #ifdef CONFIG_KSM
 	atomic_t ksm_refcount;
 #endif
@@ -94,14 +94,14 @@ static inline void anon_vma_lock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_lock(&anon_vma->lock);
+		mutex_lock(&anon_vma->lock);
 }
 
 static inline void anon_vma_unlock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_unlock(&anon_vma->lock);
+		mutex_unlock(&anon_vma->lock);
 }
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index d67f1db..a3a688e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -355,7 +355,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 			get_file(file);
 			if (tmp->vm_flags & VM_DENYWRITE)
 				atomic_dec(&inode->i_writecount);
-			spin_lock(&mapping->i_mmap_lock);
+			mutex_lock(&mapping->i_mmap_lock);
 			if (tmp->vm_flags & VM_SHARED)
 				mapping->i_mmap_writable++;
 			tmp->vm_truncate_count = mpnt->vm_truncate_count;
@@ -363,7 +363,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 			/* insert tmp into the share list, just after mpnt */
 			vma_prio_tree_add(tmp, mpnt);
 			flush_dcache_mmap_unlock(mapping);
-			spin_unlock(&mapping->i_mmap_lock);
+			mutex_unlock(&mapping->i_mmap_lock);
 		}
 
 		/*
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 632f04c..e3a0f26 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -140,14 +140,14 @@ EXPORT_SYMBOL(mutex_unlock);
  */
 static inline int __sched
 __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
-	       	unsigned long ip)
+		    struct lockdep_map *nest_lock, unsigned long ip)
 {
 	struct task_struct *task = current;
 	struct mutex_waiter waiter;
 	unsigned long flags;
 
 	preempt_disable();
-	mutex_acquire(&lock->dep_map, subclass, 0, ip);
+	mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
 
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
 	/*
@@ -278,7 +278,16 @@ void __sched
 mutex_lock_nested(struct mutex *lock, unsigned int subclass)
 {
 	might_sleep();
-	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, subclass, _RET_IP_);
+	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, subclass, NULL, _RET_IP_);
+}
+
+EXPORT_SYMBOL_GPL(mutex_lock_nested);
+
+void __sched
+_mutex_lock_nest_lock(struct mutex *lock, struct lockdep_map *nest)
+{
+	might_sleep();
+	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, subclass, nest, _RET_IP_);
 }
 
 EXPORT_SYMBOL_GPL(mutex_lock_nested);
@@ -287,7 +296,7 @@ int __sched
 mutex_lock_killable_nested(struct mutex *lock, unsigned int subclass)
 {
 	might_sleep();
-	return __mutex_lock_common(lock, TASK_KILLABLE, subclass, _RET_IP_);
+	return __mutex_lock_common(lock, TASK_KILLABLE, subclass, NULL, _RET_IP_);
 }
 EXPORT_SYMBOL_GPL(mutex_lock_killable_nested);
 
@@ -296,7 +305,7 @@ mutex_lock_interruptible_nested(struct mutex *lock, unsigned int subclass)
 {
 	might_sleep();
 	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE,
-				   subclass, _RET_IP_);
+				   subclass, NULL, _RET_IP_);
 }
 
 EXPORT_SYMBOL_GPL(mutex_lock_interruptible_nested);
@@ -402,7 +411,7 @@ __mutex_lock_slowpath(atomic_t *lock_count)
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 
-	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0, _RET_IP_);
+	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0, NULL, _RET_IP_);
 }
 
 static noinline int __sched
@@ -410,7 +419,7 @@ __mutex_lock_killable_slowpath(atomic_t *lock_count)
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 
-	return __mutex_lock_common(lock, TASK_KILLABLE, 0, _RET_IP_);
+	return __mutex_lock_common(lock, TASK_KILLABLE, 0, NULL, _RET_IP_);
 }
 
 static noinline int __sched
@@ -418,7 +427,7 @@ __mutex_lock_interruptible_slowpath(atomic_t *lock_count)
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 
-	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, 0, _RET_IP_);
+	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, 0, NULL, _RET_IP_);
 }
 #endif
 
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 78b94f0..61157dc 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -182,7 +182,7 @@ __xip_unmap (struct address_space * mapping,
 		return;
 
 retry:
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		mm = vma->vm_mm;
 		address = vma->vm_start +
@@ -200,7 +200,7 @@ retry:
 			page_cache_release(page);
 		}
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 
 	if (locked) {
 		mutex_unlock(&xip_sparse_mutex);
diff --git a/mm/fremap.c b/mm/fremap.c
index 46f5dac..2c0528e 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -208,13 +208,13 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 			}
 			goto out;
 		}
-		spin_lock(&mapping->i_mmap_lock);
+		mutex_lock(&mapping->i_mmap_lock);
 		flush_dcache_mmap_lock(mapping);
 		vma->vm_flags |= VM_NONLINEAR;
 		vma_prio_tree_remove(vma, &mapping->i_mmap);
 		vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
 		flush_dcache_mmap_unlock(mapping);
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 	}
 
 	if (vma->vm_flags & VM_LOCKED) {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3a5aeb3..3807dd5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2210,9 +2210,9 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 			  unsigned long end, struct page *ref_page)
 {
-	spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+	mutex_lock(&vma->vm_file->f_mapping->i_mmap_lock);
 	__unmap_hugepage_range(vma, start, end, ref_page);
-	spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
 }
 
 /*
@@ -2244,7 +2244,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * this mapping should be shared between all the VMAs,
 	 * __unmap_hugepage_range() is called as the lock is already held
 	 */
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(iter_vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		/* Do not unmap the current VMA */
 		if (iter_vma == vma)
@@ -2262,7 +2262,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 				address, address + huge_page_size(h),
 				page);
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 
 	return 1;
 }
@@ -2678,7 +2678,7 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+	mutex_lock(&vma->vm_file->f_mapping->i_mmap_lock);
 	spin_lock(&mm->page_table_lock);
 	for (; address < end; address += huge_page_size(h)) {
 		ptep = huge_pte_offset(mm, address);
@@ -2693,7 +2693,7 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
 		}
 	}
 	spin_unlock(&mm->page_table_lock);
-	spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
 
 	flush_tlb_range(vma, start, end);
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index 8cdfc2a..27861d8 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -327,7 +327,7 @@ static void drop_anon_vma(struct rmap_item *rmap_item)
 
 	if (atomic_dec_and_lock(&anon_vma->ksm_refcount, &anon_vma->lock)) {
 		int empty = list_empty(&anon_vma->head);
-		spin_unlock(&anon_vma->lock);
+		mutex_unlock(&anon_vma->lock);
 		if (empty)
 			anon_vma_free(anon_vma);
 	}
@@ -1566,7 +1566,7 @@ again:
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 
-		spin_lock(&anon_vma->lock);
+		mutex_lock(&anon_vma->lock);
 		list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
 			vma = vmac->vma;
 			if (rmap_item->address < vma->vm_start ||
@@ -1589,7 +1589,7 @@ again:
 			if (!search_new_forks || !mapcount)
 				break;
 		}
-		spin_unlock(&anon_vma->lock);
+		mutex_unlock(&anon_vma->lock);
 		if (!mapcount)
 			goto out;
 	}
@@ -1619,7 +1619,7 @@ again:
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 
-		spin_lock(&anon_vma->lock);
+		mutex_lock(&anon_vma->lock);
 		list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
 			vma = vmac->vma;
 			if (rmap_item->address < vma->vm_start ||
@@ -1637,11 +1637,11 @@ again:
 			ret = try_to_unmap_one(page, vma,
 					rmap_item->address, flags);
 			if (ret != SWAP_AGAIN || !page_mapped(page)) {
-				spin_unlock(&anon_vma->lock);
+				mutex_unlock(&anon_vma->lock);
 				goto out;
 			}
 		}
-		spin_unlock(&anon_vma->lock);
+		mutex_unlock(&anon_vma->lock);
 	}
 	if (!search_new_forks++)
 		goto again;
@@ -1671,7 +1671,7 @@ again:
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 
-		spin_lock(&anon_vma->lock);
+		mutex_lock(&anon_vma->lock);
 		list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
 			vma = vmac->vma;
 			if (rmap_item->address < vma->vm_start ||
@@ -1688,11 +1688,11 @@ again:
 
 			ret = rmap_one(page, vma, rmap_item->address, arg);
 			if (ret != SWAP_AGAIN) {
-				spin_unlock(&anon_vma->lock);
+				mutex_unlock(&anon_vma->lock);
 				goto out;
 			}
 		}
-		spin_unlock(&anon_vma->lock);
+		mutex_unlock(&anon_vma->lock);
 	}
 	if (!search_new_forks++)
 		goto again;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d1f3351..ebbfb33 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -421,7 +421,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 	 */
 
 	read_lock(&tasklist_lock);
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	for_each_process(tsk) {
 		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 
@@ -441,7 +441,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 				add_to_kill(tsk, page, vma, to_kill, tkc);
 		}
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	read_unlock(&tasklist_lock);
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index bc9ba5a..9e386b6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1104,7 +1104,6 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp,
 	unsigned long tlb_start = 0;	/* For tlb_finish_mmu */
 	int tlb_start_valid = 0;
 	unsigned long start = start_addr;
-	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
 	int fullmm = (*tlbp)->fullmm;
 	struct mm_struct *mm = vma->vm_mm;
 
@@ -1161,21 +1160,13 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp,
 
 			tlb_finish_mmu(*tlbp, tlb_start, start);
 
-			if (need_resched() ||
-				(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
-				if (i_mmap_lock) {
-					*tlbp = NULL;
-					goto out;
-				}
-				cond_resched();
-			}
+			cond_resched();
 
 			*tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
 			tlb_start_valid = 0;
 			zap_work = ZAP_BLOCK_SIZE;
 		}
 	}
-out:
 	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
 	return start;	/* which is now the end (or restart) address */
 }
@@ -2442,7 +2433,7 @@ again:
 
 	restart_addr = zap_page_range(vma, start_addr,
 					end_addr - start_addr, details);
-	need_break = need_resched() || spin_needbreak(details->i_mmap_lock);
+	need_break = need_resched();
 
 	if (restart_addr >= end_addr) {
 		/* We have now completed this vma: mark it so */
@@ -2456,9 +2447,9 @@ again:
 			goto again;
 	}
 
-	spin_unlock(details->i_mmap_lock);
+	mutex_unlock(details->i_mmap_lock);
 	cond_resched();
-	spin_lock(details->i_mmap_lock);
+	mutex_lock(details->i_mmap_lock);
 	return -EINTR;
 }
 
@@ -2554,7 +2545,7 @@ void unmap_mapping_range(struct address_space *mapping,
 		details.last_index = ULONG_MAX;
 	details.i_mmap_lock = &mapping->i_mmap_lock;
 
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 
 	/* Protect against endless unmapping loops */
 	mapping->truncate_count++;
@@ -2569,7 +2560,7 @@ void unmap_mapping_range(struct address_space *mapping,
 		unmap_mapping_range_tree(&mapping->i_mmap, &details);
 	if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
 		unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 75557c6..2602682 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -216,9 +216,9 @@ void unlink_file_vma(struct vm_area_struct *vma)
 
 	if (file) {
 		struct address_space *mapping = file->f_mapping;
-		spin_lock(&mapping->i_mmap_lock);
+		mutex_lock(&mapping->i_mmap_lock);
 		__remove_shared_vm_struct(vma, file, mapping);
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 	}
 }
 
@@ -449,7 +449,7 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
 		mapping = vma->vm_file->f_mapping;
 
 	if (mapping) {
-		spin_lock(&mapping->i_mmap_lock);
+		mutex_lock(&mapping->i_mmap_lock);
 		vma->vm_truncate_count = mapping->truncate_count;
 	}
 	anon_vma_lock(vma);
@@ -459,7 +459,7 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	anon_vma_unlock(vma);
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 
 	mm->map_count++;
 	validate_mm(mm);
@@ -565,7 +565,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 		mapping = file->f_mapping;
 		if (!(vma->vm_flags & VM_NONLINEAR))
 			root = &mapping->i_mmap;
-		spin_lock(&mapping->i_mmap_lock);
+		mutex_lock(&mapping->i_mmap_lock);
 		if (importer &&
 		    vma->vm_truncate_count != next->vm_truncate_count) {
 			/*
@@ -626,7 +626,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 	}
 
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 
 	if (remove_next) {
 		if (file) {
@@ -2440,7 +2440,7 @@ static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma)
 		 * The LSB of head.next can't change from under us
 		 * because we hold the mm_all_locks_mutex.
 		 */
-		spin_lock_nest_lock(&anon_vma->lock, &mm->mmap_sem);
+		mutex_lock_nest_lock(&anon_vma->lock, &mm->mmap_sem);
 		/*
 		 * We can safely modify head.next after taking the
 		 * anon_vma->lock. If some other vma in this mm shares
@@ -2470,7 +2470,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
 		 */
 		if (test_and_set_bit(AS_MM_ALL_LOCKS, &mapping->flags))
 			BUG();
-		spin_lock_nest_lock(&mapping->i_mmap_lock, &mm->mmap_sem);
+		mutex_lock_nest_lock(&mapping->i_mmap_lock, &mm->mmap_sem);
 	}
 }
 
@@ -2558,7 +2558,7 @@ static void vm_unlock_anon_vma(struct anon_vma *anon_vma)
 		if (!__test_and_clear_bit(0, (unsigned long *)
 					  &anon_vma->head.next))
 			BUG();
-		spin_unlock(&anon_vma->lock);
+		mutex_unlock(&anon_vma->lock);
 	}
 }
 
@@ -2569,7 +2569,7 @@ static void vm_unlock_mapping(struct address_space *mapping)
 		 * AS_MM_ALL_LOCKS can't change to 0 from under us
 		 * because we hold the mm_all_locks_mutex.
 		 */
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 		if (!test_and_clear_bit(AS_MM_ALL_LOCKS,
 					&mapping->flags))
 			BUG();
diff --git a/mm/mremap.c b/mm/mremap.c
index e9c75ef..e47933a 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -91,7 +91,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		 * and we propagate stale pages into the dst afterward.
 		 */
 		mapping = vma->vm_file->f_mapping;
-		spin_lock(&mapping->i_mmap_lock);
+		mutex_lock(&mapping->i_mmap_lock);
 		if (new_vma->vm_truncate_count &&
 		    new_vma->vm_truncate_count != vma->vm_truncate_count)
 			new_vma->vm_truncate_count = 0;
@@ -123,7 +123,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 	pte_unmap_nested(new_pte - 1);
 	pte_unmap_unlock(old_pte - 1, old_ptl);
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		mutex_unlock(&mapping->i_mmap_lock);
 	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
index fcd593c..00c646a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -133,7 +133,7 @@ int anon_vma_prepare(struct vm_area_struct *vma)
 				goto out_enomem_free_avc;
 			allocated = anon_vma;
 		}
-		spin_lock(&anon_vma->lock);
+		mutex_lock(&anon_vma->lock);
 
 		/* page_table_lock to protect against threads */
 		spin_lock(&mm->page_table_lock);
@@ -147,7 +147,7 @@ int anon_vma_prepare(struct vm_area_struct *vma)
 		}
 		spin_unlock(&mm->page_table_lock);
 
-		spin_unlock(&anon_vma->lock);
+		mutex_unlock(&anon_vma->lock);
 		if (unlikely(allocated)) {
 			anon_vma_free(allocated);
 			anon_vma_chain_free(avc);
@@ -169,9 +169,9 @@ static void anon_vma_chain_link(struct vm_area_struct *vma,
 	avc->anon_vma = anon_vma;
 	list_add(&avc->same_vma, &vma->anon_vma_chain);
 
-	spin_lock(&anon_vma->lock);
+	mutex_lock(&anon_vma->lock);
 	list_add_tail(&avc->same_anon_vma, &anon_vma->head);
-	spin_unlock(&anon_vma->lock);
+	mutex_unlock(&anon_vma->lock);
 }
 
 /*
@@ -244,12 +244,12 @@ static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
 	if (!anon_vma)
 		return;
 
-	spin_lock(&anon_vma->lock);
+	mutex_lock(&anon_vma->lock);
 	list_del(&anon_vma_chain->same_anon_vma);
 
 	/* We must garbage collect the anon_vma if it's empty */
 	empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma);
-	spin_unlock(&anon_vma->lock);
+	mutex_unlock(&anon_vma->lock);
 
 	if (empty)
 		anon_vma_free(anon_vma);
@@ -271,7 +271,7 @@ static void anon_vma_ctor(void *data)
 {
 	struct anon_vma *anon_vma = data;
 
-	spin_lock_init(&anon_vma->lock);
+	mutex_init(&anon_vma->lock);
 	ksm_refcount_init(anon_vma);
 	INIT_LIST_HEAD(&anon_vma->head);
 }
@@ -300,7 +300,7 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
 		goto out;
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
-	spin_lock(&anon_vma->lock);
+	mutex_lock(&anon_vma->lock);
 	return anon_vma;
 out:
 	rcu_read_unlock();
@@ -309,7 +309,7 @@ out:
 
 void page_unlock_anon_vma(struct anon_vma *anon_vma)
 {
-	spin_unlock(&anon_vma->lock);
+	mutex_unlock(&anon_vma->lock);
 	rcu_read_unlock();
 }
 
@@ -554,7 +554,7 @@ static int page_referenced_file(struct page *page,
 	 */
 	BUG_ON(!PageLocked(page));
 
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 
 	/*
 	 * i_mmap_lock does not stabilize mapcount at all, but mapcount
@@ -579,7 +579,7 @@ static int page_referenced_file(struct page *page,
 			break;
 	}
 
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	return referenced;
 }
 
@@ -666,7 +666,7 @@ static int page_mkclean_file(struct address_space *mapping, struct page *page)
 
 	BUG_ON(PageAnon(page));
 
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		if (vma->vm_flags & VM_SHARED) {
 			unsigned long address = vma_address(page, vma);
@@ -675,7 +675,7 @@ static int page_mkclean_file(struct address_space *mapping, struct page *page)
 			ret += page_mkclean_one(page, vma, address);
 		}
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	return ret;
 }
 
@@ -1181,7 +1181,7 @@ static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
 	unsigned long max_nl_size = 0;
 	unsigned int mapcount;
 
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		unsigned long address = vma_address(page, vma);
 		if (address == -EFAULT)
@@ -1227,7 +1227,6 @@ static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
 	mapcount = page_mapcount(page);
 	if (!mapcount)
 		goto out;
-	cond_resched_lock(&mapping->i_mmap_lock);
 
 	max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK;
 	if (max_nl_cursor == 0)
@@ -1249,7 +1248,6 @@ static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
 			}
 			vma->vm_private_data = (void *) max_nl_cursor;
 		}
-		cond_resched_lock(&mapping->i_mmap_lock);
 		max_nl_cursor += CLUSTER_SIZE;
 	} while (max_nl_cursor <= max_nl_size);
 
@@ -1261,7 +1259,7 @@ static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
 	list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
 		vma->vm_private_data = NULL;
 out:
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	return ret;
 }
 
@@ -1346,7 +1344,7 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
 	anon_vma = page_anon_vma(page);
 	if (!anon_vma)
 		return ret;
-	spin_lock(&anon_vma->lock);
+	mutex_lock(&anon_vma->lock);
 	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
 		struct vm_area_struct *vma = avc->vma;
 		unsigned long address = vma_address(page, vma);
@@ -1356,7 +1354,7 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
 		if (ret != SWAP_AGAIN)
 			break;
 	}
-	spin_unlock(&anon_vma->lock);
+	mutex_unlock(&anon_vma->lock);
 	return ret;
 }
 
@@ -1371,7 +1369,7 @@ static int rmap_walk_file(struct page *page, int (*rmap_one)(struct page *,
 
 	if (!mapping)
 		return ret;
-	spin_lock(&mapping->i_mmap_lock);
+	mutex_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		unsigned long address = vma_address(page, vma);
 		if (address == -EFAULT)
@@ -1385,7 +1383,7 @@ static int rmap_walk_file(struct page *page, int (*rmap_one)(struct page *,
 	 * never contain migration ptes.  Decide what to do about this
 	 * limitation to linear when we need rmap_walk() on nonlinear.
 	 */
-	spin_unlock(&mapping->i_mmap_lock);
+	mutex_unlock(&mapping->i_mmap_lock);
 	return ret;
 }
 



^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 11:43             ` Peter Zijlstra
@ 2010-04-01 11:47               ` Avi Kivity
  2010-04-01 15:42               ` Andrea Arcangeli
  1 sibling, 0 replies; 43+ messages in thread
From: Avi Kivity @ 2010-04-01 11:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Rik van Riel, linux-kernel, aarcange, akpm,
	Kent Overstreet, Ingo Molnar

On 04/01/2010 02:43 PM, Peter Zijlstra wrote:
> On Thu, 2010-04-01 at 13:27 +0200, Peter Zijlstra wrote:
>    
>> I've almost got a patch done that converts those two, still need to look
>> where that tasklist_lock muck happens.
>>      
> OK, so the below builds and boots, only need to track down that
> tasklist_lock nesting, but I got to run an errand first.
>
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -327,7 +327,7 @@ static void drop_anon_vma(struct rmap_item *rmap_item)
>
>   	if (atomic_dec_and_lock(&anon_vma->ksm_refcount,&anon_vma->lock)) {
>   		int empty = list_empty(&anon_vma->head);
> -		spin_unlock(&anon_vma->lock);
> +		mutex_unlock(&anon_vma->lock);
>   		if (empty)
>   			anon_vma_free(anon_vma);
>   	}
>    

You need to convert the atomic_dec_and_lock() as well.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 11:04     ` Thomas Gleixner
  2010-04-01 11:13       ` Avi Kivity
@ 2010-04-01 14:16       ` Rik van Riel
  2010-04-01 15:32       ` Andrea Arcangeli
  2 siblings, 0 replies; 43+ messages in thread
From: Rik van Riel @ 2010-04-01 14:16 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Avi Kivity, linux-kernel, aarcange, akpm,
	Kent Overstreet, Ingo Molnar

On 04/01/2010 07:04 AM, Thomas Gleixner wrote:
> On Thu, 1 Apr 2010, Peter Zijlstra wrote:
>> I'm sure you dropped Ingo and Thomas by accident.
>>
>> On Thu, 2010-04-01 at 12:40 +0300, Avi Kivity wrote:
>>> mmu_take_all_locks() takes a spinlock for each vma, which means we increase
>>> the preempt count by the number of vmas in an address space.  Since the user
>>> controls the number of vmas, they can cause preempt_count to overflow.
>>>
>>> Fix by making mmu_take_all_locks() only disable preemption once by making
>>> the spinlocks preempt-neutral.
>>
>> Right, so while this will get rid of the warning it doesn't make the
>> code any nicer, its still a massive !preempt latency spot.
>
> I'm not sure whether this is a real well done April 1st joke or if there
> is someone trying to secure the "bad taste patch of the month" price.
>
> Anyway, I don't see a reason why we can't convert those locks to
> mutexes and get rid of the whole preempt disabled region.

That would involve converting most of the locks in mm/
to mutexes, since these two locks get nested under
all kinds of other spinlocks...


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 11:04     ` Thomas Gleixner
  2010-04-01 11:13       ` Avi Kivity
  2010-04-01 14:16       ` Rik van Riel
@ 2010-04-01 15:32       ` Andrea Arcangeli
  2010-04-01 15:37         ` Avi Kivity
  2 siblings, 1 reply; 43+ messages in thread
From: Andrea Arcangeli @ 2010-04-01 15:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Avi Kivity, Rik van Riel, linux-kernel, akpm,
	Kent Overstreet, Ingo Molnar

On Thu, Apr 01, 2010 at 01:04:04PM +0200, Thomas Gleixner wrote:
> On Thu, 1 Apr 2010, Peter Zijlstra wrote:
> 
> > I'm sure you dropped Ingo and Thomas by accident.
> > 
> > On Thu, 2010-04-01 at 12:40 +0300, Avi Kivity wrote:
> > > mmu_take_all_locks() takes a spinlock for each vma, which means we increase
> > > the preempt count by the number of vmas in an address space.  Since the user
> > > controls the number of vmas, they can cause preempt_count to overflow.
> > > 
> > > Fix by making mmu_take_all_locks() only disable preemption once by making
> > > the spinlocks preempt-neutral.
> > 
> > Right, so while this will get rid of the warning it doesn't make the
> > code any nicer, its still a massive !preempt latency spot.
> 
> I'm not sure whether this is a real well done April 1st joke or if there
> is someone trying to secure the "bad taste patch of the month" price.
> 
> Anyway, I don't see a reason why we can't convert those locks to
> mutexes and get rid of the whole preempt disabled region.

Converting those locks to mutexes will also allow to cleanly handle
XPMEM schedule-in-mmu-notifier-handler requirement the right way.

For now getting rid of the warning is enough though. Changing the
locking would be possible but it'd slowdown the whole kernel all the
time even if nobody would ever load the kvm or gru kernel modules.

Let's be practical, this isn't even a syscall, this is only called by
device driver ioctl and it's about losing 1msec or so in latency, to
keep the whole kernel as fast as if mmu notifier didn't exist. I don't
think we should have 1 single wide lock to take in
mmu_notifier_register and then slowdown the kernel when nobody uses
mmu notifier at all. Losing 1msec when a VM starts isn't a big deal
really. If this wasn't the case it wouldn't have been merged in the
first place I think. Besides with -rt these locks aren't going to hurt
latency AFIK.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 11:16         ` Peter Zijlstra
  2010-04-01 11:19           ` Avi Kivity
@ 2010-04-01 15:36           ` Andrea Arcangeli
  2010-04-01 15:39             ` Avi Kivity
  1 sibling, 1 reply; 43+ messages in thread
From: Andrea Arcangeli @ 2010-04-01 15:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Thomas Gleixner, Rik van Riel, linux-kernel, akpm,
	Kent Overstreet, Ingo Molnar

On Thu, Apr 01, 2010 at 01:16:32PM +0200, Peter Zijlstra wrote:
> On Thu, 2010-04-01 at 14:13 +0300, Avi Kivity wrote:
> 
> > If someone is willing to audit all code paths to make sure these locks 
> > are always taken in schedulable context I agree that's a better fix.
> 
> They had better be, they're not irq-safe. Also that's what lockdep is
> for.

In my original patchset I included patches from Christoph to convert
those locks to mutexes, there was apparently no problem at all with
that. But frankly I think the only problem here is the warning. The
only compliant we ever had here is from developers, no users at
all. If this was a practical problem I think we should have heard
something by now with so many KVM users out there (and gru too).

The only single reason I'd go for mutexes would be to accommodate
XPMEM requirements once and for all, no other reason.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 15:32       ` Andrea Arcangeli
@ 2010-04-01 15:37         ` Avi Kivity
  0 siblings, 0 replies; 43+ messages in thread
From: Avi Kivity @ 2010-04-01 15:37 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Thomas Gleixner, Peter Zijlstra, Rik van Riel, linux-kernel,
	akpm, Kent Overstreet, Ingo Molnar

On 04/01/2010 06:32 PM, Andrea Arcangeli wrote:
>
>> I'm not sure whether this is a real well done April 1st joke or if there
>> is someone trying to secure the "bad taste patch of the month" price.
>>
>> Anyway, I don't see a reason why we can't convert those locks to
>> mutexes and get rid of the whole preempt disabled region.
>>      
> Converting those locks to mutexes will also allow to cleanly handle
> XPMEM schedule-in-mmu-notifier-handler requirement the right way.
>    

It would also allow kvm not to take a spinlock over potentially long 
operations (iterating over rmaps) as it does now.

> For now getting rid of the warning is enough though. Changing the
> locking would be possible but it'd slowdown the whole kernel all the
> time even if nobody would ever load the kvm or gru kernel modules.
>
> Let's be practical, this isn't even a syscall, this is only called by
> device driver ioctl and it's about losing 1msec or so in latency, to
> keep the whole kernel as fast as if mmu notifier didn't exist. I don't
> think we should have 1 single wide lock to take in
> mmu_notifier_register and then slowdown the kernel when nobody uses
> mmu notifier at all. Losing 1msec when a VM starts isn't a big deal
> really. If this wasn't the case it wouldn't have been merged in the
> first place I think. Besides with -rt these locks aren't going to hurt
> latency AFIK.
>    

Well, with my patch applied they sure will.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 15:36           ` Andrea Arcangeli
@ 2010-04-01 15:39             ` Avi Kivity
  2010-04-01 15:54               ` Andrea Arcangeli
  0 siblings, 1 reply; 43+ messages in thread
From: Avi Kivity @ 2010-04-01 15:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Peter Zijlstra, Thomas Gleixner, Rik van Riel, linux-kernel,
	akpm, Kent Overstreet, Ingo Molnar

On 04/01/2010 06:36 PM, Andrea Arcangeli wrote:
> On Thu, Apr 01, 2010 at 01:16:32PM +0200, Peter Zijlstra wrote:
>    
>> On Thu, 2010-04-01 at 14:13 +0300, Avi Kivity wrote:
>>
>>      
>>> If someone is willing to audit all code paths to make sure these locks
>>> are always taken in schedulable context I agree that's a better fix.
>>>        
>> They had better be, they're not irq-safe. Also that's what lockdep is
>> for.
>>      
> In my original patchset I included patches from Christoph to convert
> those locks to mutexes, there was apparently no problem at all with
> that. But frankly I think the only problem here is the warning. The
> only compliant we ever had here is from developers, no users at
> all. If this was a practical problem I think we should have heard
> something by now with so many KVM users out there (and gru too).
>
> The only single reason I'd go for mutexes would be to accommodate
> XPMEM requirements once and for all, no other reason.
>    

There is also a minor benefit for kvm.  Reduced latency over large mmu 
operations; code simplification (we now have some 
copy_from_user_inatomic() that could be simplified).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 11:43             ` Peter Zijlstra
  2010-04-01 11:47               ` Avi Kivity
@ 2010-04-01 15:42               ` Andrea Arcangeli
  2010-04-01 15:50                 ` Peter Zijlstra
                                   ` (2 more replies)
  1 sibling, 3 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2010-04-01 15:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Thomas Gleixner, Rik van Riel, linux-kernel, akpm,
	Kent Overstreet, Ingo Molnar

On Thu, Apr 01, 2010 at 01:43:14PM +0200, Peter Zijlstra wrote:
> On Thu, 2010-04-01 at 13:27 +0200, Peter Zijlstra wrote:
> > 
> > I've almost got a patch done that converts those two, still need to look
> > where that tasklist_lock muck happens.
> 
> OK, so the below builds and boots, only need to track down that
> tasklist_lock nesting, but I got to run an errand first.

You should have a look at my old patchset where Christoph already
implemented this (and not for decreasing latency but to allow
scheduling in mmu notifier handlers, only needed by XPMEM):

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc7/mmu-notifier-v18/

The ugliest part of it (that I think you missed below) is the breakage
of the RCU locking in the anon-vma which requires adding refcounting
to it. That was the worst part of the conversion as far as I can tell.

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc7/mmu-notifier-v18/anon-vma

I personally prefer read-write locks that Christoph used for both of
them, but I'm not against mutex either. Still the refcounting problem
should be the same as it's introduced by allowing the critical
sections under anon_vma->lock to schedule (no matter if it's mutex or
read-write sem).

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 15:42               ` Andrea Arcangeli
@ 2010-04-01 15:50                 ` Peter Zijlstra
  2010-04-01 15:56                   ` Peter Zijlstra
  2010-04-01 16:00                   ` Andrea Arcangeli
  2010-04-01 15:51                 ` Avi Kivity
  2010-04-01 16:12                 ` Peter Zijlstra
  2 siblings, 2 replies; 43+ messages in thread
From: Peter Zijlstra @ 2010-04-01 15:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Thomas Gleixner, Rik van Riel, linux-kernel, akpm,
	Kent Overstreet, Ingo Molnar

On Thu, 2010-04-01 at 17:42 +0200, Andrea Arcangeli wrote:
> On Thu, Apr 01, 2010 at 01:43:14PM +0200, Peter Zijlstra wrote:
> > On Thu, 2010-04-01 at 13:27 +0200, Peter Zijlstra wrote:
> > > 
> > > I've almost got a patch done that converts those two, still need to look
> > > where that tasklist_lock muck happens.
> > 
> > OK, so the below builds and boots, only need to track down that
> > tasklist_lock nesting, but I got to run an errand first.
> 
> You should have a look at my old patchset where Christoph already
> implemented this (and not for decreasing latency but to allow
> scheduling in mmu notifier handlers, only needed by XPMEM):
> 
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc7/mmu-notifier-v18/
> 
> The ugliest part of it (that I think you missed below) is the breakage
> of the RCU locking in the anon-vma which requires adding refcounting
> to it. That was the worst part of the conversion as far as I can tell.
> 
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc7/mmu-notifier-v18/anon-vma
> 
> I personally prefer read-write locks that Christoph used for both of
> them, but I'm not against mutex either. Still the refcounting problem
> should be the same as it's introduced by allowing the critical
> sections under anon_vma->lock to schedule (no matter if it's mutex or
> read-write sem).

Right, so the problem with the rwsem is that, esp for very short hold
times, they introduce more pain than they're worth. Also the rwsem
doesn't do adaptive spinning nor allows for lock stealing, resulting in
a much much heavier sync. object than the mutex is.

You also seem to move the tlb_gather stuff around, we have patches in
-rt that make tlb_gather preemptible, once i_mmap_lock is preemptible we
can do in mainline too.




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 15:42               ` Andrea Arcangeli
  2010-04-01 15:50                 ` Peter Zijlstra
@ 2010-04-01 15:51                 ` Avi Kivity
  2010-04-01 15:56                   ` Peter Zijlstra
  2010-04-01 16:02                   ` Andrea Arcangeli
  2010-04-01 16:12                 ` Peter Zijlstra
  2 siblings, 2 replies; 43+ messages in thread
From: Avi Kivity @ 2010-04-01 15:51 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Peter Zijlstra, Thomas Gleixner, Rik van Riel, linux-kernel,
	akpm, Kent Overstreet, Ingo Molnar

On 04/01/2010 06:42 PM, Andrea Arcangeli wrote:
> On Thu, Apr 01, 2010 at 01:43:14PM +0200, Peter Zijlstra wrote:
>    
>> On Thu, 2010-04-01 at 13:27 +0200, Peter Zijlstra wrote:
>>      
>>> I've almost got a patch done that converts those two, still need to look
>>> where that tasklist_lock muck happens.
>>>        
>> OK, so the below builds and boots, only need to track down that
>> tasklist_lock nesting, but I got to run an errand first.
>>      
> You should have a look at my old patchset where Christoph already
> implemented this (and not for decreasing latency but to allow
> scheduling in mmu notifier handlers, only needed by XPMEM):
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc7/mmu-notifier-v18/
>
> The ugliest part of it (that I think you missed below) is the breakage
> of the RCU locking in the anon-vma which requires adding refcounting
> to it. That was the worst part of the conversion as far as I can tell.
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc7/mmu-notifier-v18/anon-vma
>    

Can we use srcu now instead?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 15:39             ` Avi Kivity
@ 2010-04-01 15:54               ` Andrea Arcangeli
  2010-04-01 16:02                 ` Avi Kivity
  0 siblings, 1 reply; 43+ messages in thread
From: Andrea Arcangeli @ 2010-04-01 15:54 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Peter Zijlstra, Thomas Gleixner, Rik van Riel, linux-kernel,
	akpm, Kent Overstreet, Ingo Molnar

On Thu, Apr 01, 2010 at 06:39:48PM +0300, Avi Kivity wrote:
> On 04/01/2010 06:36 PM, Andrea Arcangeli wrote:
> > On Thu, Apr 01, 2010 at 01:16:32PM +0200, Peter Zijlstra wrote:
> >    
> >> On Thu, 2010-04-01 at 14:13 +0300, Avi Kivity wrote:
> >>
> >>      
> >>> If someone is willing to audit all code paths to make sure these locks
> >>> are always taken in schedulable context I agree that's a better fix.
> >>>        
> >> They had better be, they're not irq-safe. Also that's what lockdep is
> >> for.
> >>      
> > In my original patchset I included patches from Christoph to convert
> > those locks to mutexes, there was apparently no problem at all with
> > that. But frankly I think the only problem here is the warning. The
> > only compliant we ever had here is from developers, no users at
> > all. If this was a practical problem I think we should have heard
> > something by now with so many KVM users out there (and gru too).
> >
> > The only single reason I'd go for mutexes would be to accommodate
> > XPMEM requirements once and for all, no other reason.
> >    
> 
> There is also a minor benefit for kvm.  Reduced latency over large mmu 
> operations; code simplification (we now have some 
> copy_from_user_inatomic() that could be simplified).

Where exactly is KVM taking these locks? KVM should only call into
GUP, and GUP itself won't iterate over rmaps either. GUP just walks
the host pagetables and trigger page faults if the pages aren't
mapped. I don't see how you're going to remove
copy_from_user_inatomic() given we don't have vmas and other metadata
to take those locks. Maybe we can stop calling GUP but even if we take
the anon_vma mutex/semaphore I think it won't still prevent munmap to
drop the anon pages from under us (even if it'd stop the VM to unmap
them through rmap). To freeze the mapping one would need to take
mmap_sem in write mode in addition to the anon_vma mutex/sem which is
unlikely a win compared to just gup+copy_from_user_inatomic. So I
don't see immediate benefits for KVM but maybe I'm missing something
of course!

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 15:50                 ` Peter Zijlstra
@ 2010-04-01 15:56                   ` Peter Zijlstra
  2010-04-01 16:07                     ` Andrea Arcangeli
  2010-04-01 16:00                   ` Andrea Arcangeli
  1 sibling, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2010-04-01 15:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Thomas Gleixner, Rik van Riel, linux-kernel, akpm,
	Kent Overstreet, Ingo Molnar

On Thu, 2010-04-01 at 17:50 +0200, Peter Zijlstra wrote:
> On Thu, 2010-04-01 at 17:42 +0200, Andrea Arcangeli wrote:
> > On Thu, Apr 01, 2010 at 01:43:14PM +0200, Peter Zijlstra wrote:
> > > On Thu, 2010-04-01 at 13:27 +0200, Peter Zijlstra wrote:
> > > > 
> > > > I've almost got a patch done that converts those two, still need to look
> > > > where that tasklist_lock muck happens.
> > > 
> > > OK, so the below builds and boots, only need to track down that
> > > tasklist_lock nesting, but I got to run an errand first.
> > 
> > You should have a look at my old patchset where Christoph already
> > implemented this (and not for decreasing latency but to allow
> > scheduling in mmu notifier handlers, only needed by XPMEM):
> > 
> > http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc7/mmu-notifier-v18/
> > 
> > The ugliest part of it (that I think you missed below) is the breakage
> > of the RCU locking in the anon-vma which requires adding refcounting
> > to it. That was the worst part of the conversion as far as I can tell.
> > 
> > http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc7/mmu-notifier-v18/anon-vma
> > 
> > I personally prefer read-write locks that Christoph used for both of
> > them, but I'm not against mutex either. Still the refcounting problem
> > should be the same as it's introduced by allowing the critical
> > sections under anon_vma->lock to schedule (no matter if it's mutex or
> > read-write sem).
> 
> Right, so the problem with the rwsem is that, esp for very short hold
> times, they introduce more pain than they're worth. Also the rwsem
> doesn't do adaptive spinning nor allows for lock stealing, resulting in
> a much much heavier sync. object than the mutex is.
> 
> You also seem to move the tlb_gather stuff around, we have patches in
> -rt that make tlb_gather preemptible, once i_mmap_lock is preemptible we
> can do in mainline too.

Another thing is mm->nr_ptes, that doens't appear to be properly
serialized, __pte_alloc() does ++ under mm->page_table_lock, but
free_pte_range() does -- which afaict isn't always with page_table_lock
held, it does however always seem to have mmap_sem for writing.

However __pte_alloc() callers do not in fact hold mmap_sem for writing.




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 15:51                 ` Avi Kivity
@ 2010-04-01 15:56                   ` Peter Zijlstra
  2010-04-01 16:06                     ` Avi Kivity
                                       ` (2 more replies)
  2010-04-01 16:02                   ` Andrea Arcangeli
  1 sibling, 3 replies; 43+ messages in thread
From: Peter Zijlstra @ 2010-04-01 15:56 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Andrea Arcangeli, Thomas Gleixner, Rik van Riel, linux-kernel,
	akpm, Kent Overstreet, Ingo Molnar, Paul E. McKenney

On Thu, 2010-04-01 at 18:51 +0300, Avi Kivity wrote:
> On 04/01/2010 06:42 PM, Andrea Arcangeli wrote:
> > On Thu, Apr 01, 2010 at 01:43:14PM +0200, Peter Zijlstra wrote:
> >    
> >> On Thu, 2010-04-01 at 13:27 +0200, Peter Zijlstra wrote:
> >>      
> >>> I've almost got a patch done that converts those two, still need to look
> >>> where that tasklist_lock muck happens.
> >>>        
> >> OK, so the below builds and boots, only need to track down that
> >> tasklist_lock nesting, but I got to run an errand first.
> >>      
> > You should have a look at my old patchset where Christoph already
> > implemented this (and not for decreasing latency but to allow
> > scheduling in mmu notifier handlers, only needed by XPMEM):
> >
> > http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc7/mmu-notifier-v18/
> >
> > The ugliest part of it (that I think you missed below) is the breakage
> > of the RCU locking in the anon-vma which requires adding refcounting
> > to it. That was the worst part of the conversion as far as I can tell.
> >
> > http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc7/mmu-notifier-v18/anon-vma
> >    
> 
> Can we use srcu now instead?

I would much rather we make call_rcu_preempt() available at all times.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 15:50                 ` Peter Zijlstra
  2010-04-01 15:56                   ` Peter Zijlstra
@ 2010-04-01 16:00                   ` Andrea Arcangeli
  1 sibling, 0 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2010-04-01 16:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Thomas Gleixner, Rik van Riel, linux-kernel, akpm,
	Kent Overstreet, Ingo Molnar

On Thu, Apr 01, 2010 at 05:50:02PM +0200, Peter Zijlstra wrote:
> You also seem to move the tlb_gather stuff around, we have patches in

Well that patchset is years old, there are much more recent patches to
move tlb_gather around so that it will happen after the mmu_notifier
invalidate calls. That is only relevant to allow mmu notifier handlers
to schedule.

> -rt that make tlb_gather preemptible, once i_mmap_lock is preemptible we
> can do in mainline too.

Ok. However just moving it inside the mmu notifier range calls won't
slowdown anything or reduce its effectiveness, either ways will be
fine with XPMEM. This is only XPMEM material and tlb_gather is the
trivial part of it. The hard part is to make those locks schedule
capable, and I'm sure XPMEM developers will greatly appreciate such a
change. I thought initially it had to be conditional to XPMEM=y but
maintaining two different locks is a bit of a pain (especially for
distros that wants to ship a single kernel) but as this effort
started, it'd provide some minor latency improvement in that 1msec
when a new virtual machine starts or when a new taks registers itself
to GRU or XPMEM device drivers. I'm personally fine to make these
either mutexes or rwsem unconditionally as you prefer.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 15:51                 ` Avi Kivity
  2010-04-01 15:56                   ` Peter Zijlstra
@ 2010-04-01 16:02                   ` Andrea Arcangeli
  1 sibling, 0 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2010-04-01 16:02 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Peter Zijlstra, Thomas Gleixner, Rik van Riel, linux-kernel,
	akpm, Kent Overstreet, Ingo Molnar

On Thu, Apr 01, 2010 at 06:51:17PM +0300, Avi Kivity wrote:
> Can we use srcu now instead?

We can always switch to srcu. Switching to srcu is not a noop for all
mmu notifier invalidates only after these locks can schedule. At that
point, so with srcu + mutex in the rmap locks, all the mmu notifier
invalidates can schedule, allowing XPMEM to be synchronous its
invalidates, and making it safe.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 15:54               ` Andrea Arcangeli
@ 2010-04-01 16:02                 ` Avi Kivity
  2010-04-01 16:12                   ` Andrea Arcangeli
  0 siblings, 1 reply; 43+ messages in thread
From: Avi Kivity @ 2010-04-01 16:02 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Peter Zijlstra, Thomas Gleixner, Rik van Riel, linux-kernel,
	akpm, Kent Overstreet, Ingo Molnar

On 04/01/2010 06:54 PM, Andrea Arcangeli wrote:
>
>>> The only single reason I'd go for mutexes would be to accommodate
>>> XPMEM requirements once and for all, no other reason.
>>>
>>>        
>> There is also a minor benefit for kvm.  Reduced latency over large mmu
>> operations; code simplification (we now have some
>> copy_from_user_inatomic() that could be simplified).
>>      
> Where exactly is KVM taking these locks?

Not these locks, but if we go all the way and make mmu notifiers 
sleepable, we can convert mmu_lock to a mutex.

> KVM should only call into
> GUP, and GUP itself won't iterate over rmaps either. GUP just walks
> the host pagetables and trigger page faults if the pages aren't
> mapped.

We'll probably deadlock then, gup -> change_pte notifier -> mmu_lock.  
But we can probably work around it.

> I don't see how you're going to remove
> copy_from_user_inatomic() given we don't have vmas and other metadata
> to take those locks. Maybe we can stop calling GUP but even if we take
> the anon_vma mutex/semaphore I think it won't still prevent munmap to
> drop the anon pages from under us (even if it'd stop the VM to unmap
> them through rmap). To freeze the mapping one would need to take
> mmap_sem in write mode in addition to the anon_vma mutex/sem which is
> unlikely a win compared to just gup+copy_from_user_inatomic. So I
> don't see immediate benefits for KVM but maybe I'm missing something
> of course!
>    

I meant replace c_f_u_inatomic() by c_f_u() (that's why the benefit is 
minor - we only simplify the failure path).  Sorry for being unclear.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 15:56                   ` Peter Zijlstra
@ 2010-04-01 16:06                     ` Avi Kivity
  2010-04-01 16:15                       ` Paul E. McKenney
  2010-04-01 16:08                     ` Andrea Arcangeli
  2010-04-01 16:14                     ` Paul E. McKenney
  2 siblings, 1 reply; 43+ messages in thread
From: Avi Kivity @ 2010-04-01 16:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Thomas Gleixner, Rik van Riel, linux-kernel,
	akpm, Kent Overstreet, Ingo Molnar, Paul E. McKenney

On 04/01/2010 06:56 PM, Peter Zijlstra wrote:
> On Thu, 2010-04-01 at 18:51 +0300, Avi Kivity wrote:
>    
>> On 04/01/2010 06:42 PM, Andrea Arcangeli wrote:
>>      
>>> On Thu, Apr 01, 2010 at 01:43:14PM +0200, Peter Zijlstra wrote:
>>>
>>>        
>>>> On Thu, 2010-04-01 at 13:27 +0200, Peter Zijlstra wrote:
>>>>
>>>>          
>>>>> I've almost got a patch done that converts those two, still need to look
>>>>> where that tasklist_lock muck happens.
>>>>>
>>>>>            
>>>> OK, so the below builds and boots, only need to track down that
>>>> tasklist_lock nesting, but I got to run an errand first.
>>>>
>>>>          
>>> You should have a look at my old patchset where Christoph already
>>> implemented this (and not for decreasing latency but to allow
>>> scheduling in mmu notifier handlers, only needed by XPMEM):
>>>
>>> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc7/mmu-notifier-v18/
>>>
>>> The ugliest part of it (that I think you missed below) is the breakage
>>> of the RCU locking in the anon-vma which requires adding refcounting
>>> to it. That was the worst part of the conversion as far as I can tell.
>>>
>>> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc7/mmu-notifier-v18/anon-vma
>>>
>>>        
>> Can we use srcu now instead?
>>      
> I would much rather we make call_rcu_preempt() available at all times.
>    

I don't understand.  I thought the problem was that the locks were taken 
inside an rcu critical section; switching to srcu would fix that.  But 
how is call_rcu_preempt() related?  Grepping a bit, what is 
call_rcu_preempt()?  my tree doesn't have it.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 15:56                   ` Peter Zijlstra
@ 2010-04-01 16:07                     ` Andrea Arcangeli
  2010-04-01 16:32                       ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Andrea Arcangeli @ 2010-04-01 16:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Thomas Gleixner, Rik van Riel, linux-kernel, akpm,
	Kent Overstreet, Ingo Molnar

On Thu, Apr 01, 2010 at 05:56:02PM +0200, Peter Zijlstra wrote:
> Another thing is mm->nr_ptes, that doens't appear to be properly
> serialized, __pte_alloc() does ++ under mm->page_table_lock, but
> free_pte_range() does -- which afaict isn't always with page_table_lock
> held, it does however always seem to have mmap_sem for writing.

Not saying this is necessarily safe, but how can be that relevant with
spinlock->mutex/rwsem conversion? Only thing that breaks with that
conversion would be RCU (the very anon_vma rcu breaks because it
rcu_read_lock disabling preempt and then takes the anon_vma->lock,
that falls apart because taking the anon_vma->lock will imply a
schedule), but nr_ptes is a write operation so it can't be protected
by RCU.

> However __pte_alloc() callers do not in fact hold mmap_sem for writing.

As long as the mmap_sem readers always also take the page_table_lock
we're safe.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 15:56                   ` Peter Zijlstra
  2010-04-01 16:06                     ` Avi Kivity
@ 2010-04-01 16:08                     ` Andrea Arcangeli
  2010-04-01 16:14                     ` Paul E. McKenney
  2 siblings, 0 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2010-04-01 16:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Thomas Gleixner, Rik van Riel, linux-kernel, akpm,
	Kent Overstreet, Ingo Molnar, Paul E. McKenney

On Thu, Apr 01, 2010 at 05:56:46PM +0200, Peter Zijlstra wrote:
> I would much rather we make call_rcu_preempt() available at all times.

srcu is needed only for XPMEM to make the mmu notifier handlers
sleepable. Ignore it for now, it can be done later. The locks you're
changing are always taken _before_ the mmu notifier_range_start and
always after the mmu_notifier_range_end, so srcu can be done later...

It's orthogonal issue, but the moment these locks are sleepable it
simply worth to switch mmu notifiers to srcu to accommodate XPMEM.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 15:42               ` Andrea Arcangeli
  2010-04-01 15:50                 ` Peter Zijlstra
  2010-04-01 15:51                 ` Avi Kivity
@ 2010-04-01 16:12                 ` Peter Zijlstra
  2010-04-01 16:18                   ` Andrea Arcangeli
  2 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2010-04-01 16:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Thomas Gleixner, Rik van Riel, linux-kernel, akpm,
	Kent Overstreet, Ingo Molnar

On Thu, 2010-04-01 at 17:42 +0200, Andrea Arcangeli wrote:
> The ugliest part of it (that I think you missed below) is the breakage
> of the RCU locking in the anon-vma which requires adding refcounting
> to it. That was the worst part of the conversion as far as I can tell.
> 
One thing we can do there is to mutex_trylock() if we get the lock, see
if we've got the right object, if the trylock fails we can do the
refcount thing and sleep. That would allow the fast-path to remain a
single atomic.

The only thing is growing that anon_vma struct, but KSM seems to already
do that for the most common .configs.




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 16:02                 ` Avi Kivity
@ 2010-04-01 16:12                   ` Andrea Arcangeli
  0 siblings, 0 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2010-04-01 16:12 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Peter Zijlstra, Thomas Gleixner, Rik van Riel, linux-kernel,
	akpm, Kent Overstreet, Ingo Molnar

On Thu, Apr 01, 2010 at 07:02:51PM +0300, Avi Kivity wrote:
> Not these locks, but if we go all the way and make mmu notifiers 
> sleepable, we can convert mmu_lock to a mutex.

Ah yes, sure! I didn't get the objective was to convert the kvm
mmu_lock to mutex too, I thought you were talking about the linux VM
locks, and of course with rmap walks you meant the kvm rmaps, not the
linux rmaps. It's all clear now sorry.

I guess I was biased because I see mmu_lock like the page_table_lock
and in linux we never had mutex protecting pagetable access, because
pagetable mangling is so quick and never blocks. But there are the
rmap walks too protected by the same mmu_lock in kvm (not the case for
the page_table_lock that only protects pagetable access), so I agree
it may be worth converting it to mutex, agreed.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 15:56                   ` Peter Zijlstra
  2010-04-01 16:06                     ` Avi Kivity
  2010-04-01 16:08                     ` Andrea Arcangeli
@ 2010-04-01 16:14                     ` Paul E. McKenney
  2 siblings, 0 replies; 43+ messages in thread
From: Paul E. McKenney @ 2010-04-01 16:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Andrea Arcangeli, Thomas Gleixner, Rik van Riel,
	linux-kernel, akpm, Kent Overstreet, Ingo Molnar

On Thu, Apr 01, 2010 at 05:56:46PM +0200, Peter Zijlstra wrote:
> On Thu, 2010-04-01 at 18:51 +0300, Avi Kivity wrote:
> > On 04/01/2010 06:42 PM, Andrea Arcangeli wrote:
> > > On Thu, Apr 01, 2010 at 01:43:14PM +0200, Peter Zijlstra wrote:
> > >    
> > >> On Thu, 2010-04-01 at 13:27 +0200, Peter Zijlstra wrote:
> > >>      
> > >>> I've almost got a patch done that converts those two, still need to look
> > >>> where that tasklist_lock muck happens.
> > >>>        
> > >> OK, so the below builds and boots, only need to track down that
> > >> tasklist_lock nesting, but I got to run an errand first.
> > >>      
> > > You should have a look at my old patchset where Christoph already
> > > implemented this (and not for decreasing latency but to allow
> > > scheduling in mmu notifier handlers, only needed by XPMEM):
> > >
> > > http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc7/mmu-notifier-v18/
> > >
> > > The ugliest part of it (that I think you missed below) is the breakage
> > > of the RCU locking in the anon-vma which requires adding refcounting
> > > to it. That was the worst part of the conversion as far as I can tell.
> > >
> > > http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc7/mmu-notifier-v18/anon-vma
> > >    
> > 
> > Can we use srcu now instead?
> 
> I would much rather we make call_rcu_preempt() available at all times.

Even in !CONFIG_PREEMPT kernels?  Or am I missing your point?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 16:06                     ` Avi Kivity
@ 2010-04-01 16:15                       ` Paul E. McKenney
  2010-04-01 16:36                         ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Paul E. McKenney @ 2010-04-01 16:15 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Peter Zijlstra, Andrea Arcangeli, Thomas Gleixner, Rik van Riel,
	linux-kernel, akpm, Kent Overstreet, Ingo Molnar

On Thu, Apr 01, 2010 at 07:06:36PM +0300, Avi Kivity wrote:
> On 04/01/2010 06:56 PM, Peter Zijlstra wrote:
> >On Thu, 2010-04-01 at 18:51 +0300, Avi Kivity wrote:
> >>On 04/01/2010 06:42 PM, Andrea Arcangeli wrote:
> >>>On Thu, Apr 01, 2010 at 01:43:14PM +0200, Peter Zijlstra wrote:
> >>>
> >>>>On Thu, 2010-04-01 at 13:27 +0200, Peter Zijlstra wrote:
> >>>>
> >>>>>I've almost got a patch done that converts those two, still need to look
> >>>>>where that tasklist_lock muck happens.
> >>>>>
> >>>>OK, so the below builds and boots, only need to track down that
> >>>>tasklist_lock nesting, but I got to run an errand first.
> >>>>
> >>>You should have a look at my old patchset where Christoph already
> >>>implemented this (and not for decreasing latency but to allow
> >>>scheduling in mmu notifier handlers, only needed by XPMEM):
> >>>
> >>>http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc7/mmu-notifier-v18/
> >>>
> >>>The ugliest part of it (that I think you missed below) is the breakage
> >>>of the RCU locking in the anon-vma which requires adding refcounting
> >>>to it. That was the worst part of the conversion as far as I can tell.
> >>>
> >>>http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc7/mmu-notifier-v18/anon-vma
> >>>
> >>Can we use srcu now instead?
> >I would much rather we make call_rcu_preempt() available at all times.
> 
> I don't understand.  I thought the problem was that the locks were
> taken inside an rcu critical section; switching to srcu would fix
> that.  But how is call_rcu_preempt() related?  Grepping a bit, what
> is call_rcu_preempt()?  my tree doesn't have it.

I believe that Peter is referring to the RCU implementation you get
with CONFIG_TREE_PREEMPT_RCU, which currently depends on CONFIG_PREEMPT.
The other implementation is CONFIG_TREE_RCU, which is usually called
"classic RCU".

							Thanx, Paul

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 16:12                 ` Peter Zijlstra
@ 2010-04-01 16:18                   ` Andrea Arcangeli
  2010-04-01 16:45                     ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Andrea Arcangeli @ 2010-04-01 16:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Thomas Gleixner, Rik van Riel, linux-kernel, akpm,
	Kent Overstreet, Ingo Molnar

On Thu, Apr 01, 2010 at 06:12:34PM +0200, Peter Zijlstra wrote:
> One thing we can do there is to mutex_trylock() if we get the lock, see
> if we've got the right object, if the trylock fails we can do the
> refcount thing and sleep. That would allow the fast-path to remain a
> single atomic.

But then how do you know which anon_vma_unlink has to decrease the
refcount and which not? That info should need to be stored in the
kernel stack, can't be stored in the vma. I guess it's feasible but
passing that info around sounds more tricky than the trylock itself
(adding params to those functions with int &refcount).

> The only thing is growing that anon_vma struct, but KSM seems to already
> do that for the most common .configs.

Ok, from my initial review of memory compaction I see that it already
adds its own recount and unifies it too with the ksm_refcount. So it's
worth stop calling it ksm_refcount and to remove the #ifdef (which
memory compaction didn't remove but added an defined(CONFIG_KSM) ||
defined(CONFIG_MIGRATION)).

If you do this change we can have your change applied first, and then
Mel can adapt memory compaction to it sharing the same unconditional
compiled-in refcount.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 16:07                     ` Andrea Arcangeli
@ 2010-04-01 16:32                       ` Peter Zijlstra
  0 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2010-04-01 16:32 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Thomas Gleixner, Rik van Riel, linux-kernel, akpm,
	Kent Overstreet, Ingo Molnar

On Thu, 2010-04-01 at 18:07 +0200, Andrea Arcangeli wrote:
> On Thu, Apr 01, 2010 at 05:56:02PM +0200, Peter Zijlstra wrote:
> > Another thing is mm->nr_ptes, that doens't appear to be properly
> > serialized, __pte_alloc() does ++ under mm->page_table_lock, but
> > free_pte_range() does -- which afaict isn't always with page_table_lock
> > held, it does however always seem to have mmap_sem for writing.
> 
> Not saying this is necessarily safe, but how can be that relevant with
> spinlock->mutex/rwsem conversion? 

Not directly, but I keep running into that BUG_ON() at the end up
exit_mmap() with my conversion patch, and I though that maybe I widened
the race window.

But I guess I simply messed something up.

> Only thing that breaks with that
> conversion would be RCU (the very anon_vma rcu breaks because it
> rcu_read_lock disabling preempt and then takes the anon_vma->lock,
> that falls apart because taking the anon_vma->lock will imply a
> schedule), but nr_ptes is a write operation so it can't be protected
> by RCU.
> 
> > However __pte_alloc() callers do not in fact hold mmap_sem for writing.
> 
> As long as the mmap_sem readers always also take the page_table_lock
> we're safe.

Ah, I see so its: down_read(mmap_sem) + page_table_lock that's exclusive
against down_write(mmap_sem), nifty, should be a comment somewhere.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 16:15                       ` Paul E. McKenney
@ 2010-04-01 16:36                         ` Peter Zijlstra
  2010-04-01 17:02                           ` Paul E. McKenney
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2010-04-01 16:36 UTC (permalink / raw)
  To: paulmck
  Cc: Avi Kivity, Andrea Arcangeli, Thomas Gleixner, Rik van Riel,
	linux-kernel, akpm, Kent Overstreet, Ingo Molnar

On Thu, 2010-04-01 at 09:15 -0700, Paul E. McKenney wrote:
> > I don't understand.  I thought the problem was that the locks were
> > taken inside an rcu critical section; switching to srcu would fix
> > that.  But how is call_rcu_preempt() related?  Grepping a bit, what
> > is call_rcu_preempt()?  my tree doesn't have it.
> 
> I believe that Peter is referring to the RCU implementation you get
> with CONFIG_TREE_PREEMPT_RCU, which currently depends on CONFIG_PREEMPT.
> The other implementation is CONFIG_TREE_RCU, which is usually called
> "classic RCU". 

Right, so I've been nudging Paul a while to make it so that we always
have preemptible rcu available and that only the default interface
switches between sched/classic and preempt.

Currently we already have:

call_rcu_sched()
call_rcu_bh()
call_rcu()	(depends on CONFIG_PREEMPT_RCU)

I'm saying it would be nice to also have:

call_rcu_preempt()




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 16:18                   ` Andrea Arcangeli
@ 2010-04-01 16:45                     ` Peter Zijlstra
  2010-04-01 16:49                       ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2010-04-01 16:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Thomas Gleixner, Rik van Riel, linux-kernel, akpm,
	Kent Overstreet, Ingo Molnar

On Thu, 2010-04-01 at 18:18 +0200, Andrea Arcangeli wrote:
> On Thu, Apr 01, 2010 at 06:12:34PM +0200, Peter Zijlstra wrote:
> > One thing we can do there is to mutex_trylock() if we get the lock, see
> > if we've got the right object, if the trylock fails we can do the
> > refcount thing and sleep. That would allow the fast-path to remain a
> > single atomic.
> 
> But then how do you know which anon_vma_unlink has to decrease the
> refcount and which not? That info should need to be stored in the
> kernel stack, can't be stored in the vma. I guess it's feasible but
> passing that info around sounds more tricky than the trylock itself
> (adding params to those functions with int &refcount). 

I was thinking of something like:

struct anon_vma *page_lock_anon_vma(struct page *page)
{
	struct anon_vma *anon_vma = NULL;
	unsigned long anon_mapping;

	rcu_read_lock();
	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
        if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)              
                goto out;                                                          
        if (!page_mapped(page))                                                    
                goto out;                                                          

        anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);         
        if (!mutex_trylock(&anon_vma->lock)) {
		if (atomic_inc_unless_zero(&anon_vma->ref)) {
			rcu_read_unlock();
			mutex_lock(&anon_vma->lock);
			atomic_dec(&anon_vma->ref); /* ensure the lock pins it */
		} else
			anon_vma = NULL;
	}
	rcu_read_unlock();
	
        return anon_vma;
}

void page_unlock_anon_vma(struct anon_vma *anon_vma)
{
	mutex_unlock(&anon_vma->lock);	
}

Then anybody reaching ref==0 would only need to sync against the lock
before freeing.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 16:45                     ` Peter Zijlstra
@ 2010-04-01 16:49                       ` Peter Zijlstra
  2010-04-01 17:04                         ` Andrea Arcangeli
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2010-04-01 16:49 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Thomas Gleixner, Rik van Riel, linux-kernel, akpm,
	Kent Overstreet, Ingo Molnar

On Thu, 2010-04-01 at 18:45 +0200, Peter Zijlstra wrote:
> On Thu, 2010-04-01 at 18:18 +0200, Andrea Arcangeli wrote:
> > On Thu, Apr 01, 2010 at 06:12:34PM +0200, Peter Zijlstra wrote:
> > > One thing we can do there is to mutex_trylock() if we get the lock, see
> > > if we've got the right object, if the trylock fails we can do the
> > > refcount thing and sleep. That would allow the fast-path to remain a
> > > single atomic.
> > 
> > But then how do you know which anon_vma_unlink has to decrease the
> > refcount and which not? That info should need to be stored in the
> > kernel stack, can't be stored in the vma. I guess it's feasible but
> > passing that info around sounds more tricky than the trylock itself
> > (adding params to those functions with int &refcount). 
> 
> I was thinking of something like:
> 
> struct anon_vma *page_lock_anon_vma(struct page *page)
> {
> 	struct anon_vma *anon_vma = NULL;
> 	unsigned long anon_mapping;
> 
> 	rcu_read_lock();
> 	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
>         if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)              
>                 goto out;                                                          
>         if (!page_mapped(page))                                                    
>                 goto out;                                                          
> 
>         anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);         
>         if (!mutex_trylock(&anon_vma->lock)) {
> 		if (atomic_inc_unless_zero(&anon_vma->ref)) {
> 			rcu_read_unlock();
> 			mutex_lock(&anon_vma->lock);
> 			atomic_dec(&anon_vma->ref); /* ensure the lock pins it */
> 		} else
> 			anon_vma = NULL;
> 	}
> 	rcu_read_unlock();
> 	
>         return anon_vma;
> }
> 
> void page_unlock_anon_vma(struct anon_vma *anon_vma)
> {
> 	mutex_unlock(&anon_vma->lock);	
> }
> 
> Then anybody reaching ref==0 would only need to sync against the lock
> before freeing.

Ah, there is a race where the dec after lock makes it 0, we could catch
that by making it -1 and free in unlock_anon_vma().


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 16:36                         ` Peter Zijlstra
@ 2010-04-01 17:02                           ` Paul E. McKenney
  0 siblings, 0 replies; 43+ messages in thread
From: Paul E. McKenney @ 2010-04-01 17:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Andrea Arcangeli, Thomas Gleixner, Rik van Riel,
	linux-kernel, akpm, Kent Overstreet, Ingo Molnar

On Thu, Apr 01, 2010 at 06:36:14PM +0200, Peter Zijlstra wrote:
> On Thu, 2010-04-01 at 09:15 -0700, Paul E. McKenney wrote:
> > > I don't understand.  I thought the problem was that the locks were
> > > taken inside an rcu critical section; switching to srcu would fix
> > > that.  But how is call_rcu_preempt() related?  Grepping a bit, what
> > > is call_rcu_preempt()?  my tree doesn't have it.
> > 
> > I believe that Peter is referring to the RCU implementation you get
> > with CONFIG_TREE_PREEMPT_RCU, which currently depends on CONFIG_PREEMPT.
> > The other implementation is CONFIG_TREE_RCU, which is usually called
> > "classic RCU". 
> 
> Right, so I've been nudging Paul a while to make it so that we always
> have preemptible rcu available and that only the default interface
> switches between sched/classic and preempt.
> 
> Currently we already have:
> 
> call_rcu_sched()
> call_rcu_bh()
> call_rcu()	(depends on CONFIG_PREEMPT_RCU)
> 
> I'm saying it would be nice to also have:
> 
> call_rcu_preempt()

And, given the !CONFIG_PREEMPT issue, along with the issue of
sleeping forever in RCU read-side critical sections, my counteroffer
has been to integrate SRCU into the treercu (and of course the
tinyrcu) implementations, thus getting roughly the same performance
as CONFIG_TREE_RCU.

Delivering on this counteroffer has proven to be another kettle of fish,
although I am making some progress.  It will be several months, best case.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks()
  2010-04-01 16:49                       ` Peter Zijlstra
@ 2010-04-01 17:04                         ` Andrea Arcangeli
  0 siblings, 0 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2010-04-01 17:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Thomas Gleixner, Rik van Riel, linux-kernel, akpm,
	Kent Overstreet, Ingo Molnar

On Thu, Apr 01, 2010 at 06:49:52PM +0200, Peter Zijlstra wrote:
> On Thu, 2010-04-01 at 18:45 +0200, Peter Zijlstra wrote:
> > On Thu, 2010-04-01 at 18:18 +0200, Andrea Arcangeli wrote:
> > > On Thu, Apr 01, 2010 at 06:12:34PM +0200, Peter Zijlstra wrote:
> > > > One thing we can do there is to mutex_trylock() if we get the lock, see
> > > > if we've got the right object, if the trylock fails we can do the
> > > > refcount thing and sleep. That would allow the fast-path to remain a
> > > > single atomic.
> > > 
> > > But then how do you know which anon_vma_unlink has to decrease the
> > > refcount and which not? That info should need to be stored in the
> > > kernel stack, can't be stored in the vma. I guess it's feasible but
> > > passing that info around sounds more tricky than the trylock itself
> > > (adding params to those functions with int &refcount). 
> > 
> > I was thinking of something like:
> > 
> > struct anon_vma *page_lock_anon_vma(struct page *page)
> > {
> > 	struct anon_vma *anon_vma = NULL;
> > 	unsigned long anon_mapping;
> > 
> > 	rcu_read_lock();
> > 	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> >         if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)              
> >                 goto out;                                                          
> >         if (!page_mapped(page))                                                    
> >                 goto out;                                                          
> > 
> >         anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);         
> >         if (!mutex_trylock(&anon_vma->lock)) {
> > 		if (atomic_inc_unless_zero(&anon_vma->ref)) {
> > 			rcu_read_unlock();
> > 			mutex_lock(&anon_vma->lock);
> > 			atomic_dec(&anon_vma->ref); /* ensure the lock pins it */
> > 		} else
> > 			anon_vma = NULL;
> > 	}
> > 	rcu_read_unlock();
> > 	
> >         return anon_vma;
> > }
> > 
> > void page_unlock_anon_vma(struct anon_vma *anon_vma)
> > {
> > 	mutex_unlock(&anon_vma->lock);	
> > }
> > 
> > Then anybody reaching ref==0 would only need to sync against the lock
> > before freeing.
> 
> Ah, there is a race where the dec after lock makes it 0, we could catch
> that by making it -1 and free in unlock_anon_vma().

You'd simply need to atomic_dec_and_test instead of atomic_dec above,
then you free it there above and return NULL.

The bug is to ever call atomic_dec instead of always
atomic_dec_and_test all over the place. I doubt you could fix it with
a -1.

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2010-04-01 17:11 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-03-30 17:36 [PATCH] increase PREEMPT_BITS to 12 to avoid overflow when starting KVM Rik van Riel
2010-03-30 17:56 ` Peter Zijlstra
2010-03-30 18:05   ` Rik van Riel
2010-03-30 18:34     ` Peter Zijlstra
2010-04-01  9:40 ` [COUNTERPATCH] mm: avoid overflowing preempt_count() in mmu_take_all_locks() Avi Kivity
2010-04-01 10:31   ` Peter Zijlstra
2010-04-01 11:04     ` Thomas Gleixner
2010-04-01 11:13       ` Avi Kivity
2010-04-01 11:16         ` Peter Zijlstra
2010-04-01 11:19           ` Avi Kivity
2010-04-01 15:36           ` Andrea Arcangeli
2010-04-01 15:39             ` Avi Kivity
2010-04-01 15:54               ` Andrea Arcangeli
2010-04-01 16:02                 ` Avi Kivity
2010-04-01 16:12                   ` Andrea Arcangeli
2010-04-01 11:17         ` Avi Kivity
2010-04-01 11:27           ` Peter Zijlstra
2010-04-01 11:43             ` Peter Zijlstra
2010-04-01 11:47               ` Avi Kivity
2010-04-01 15:42               ` Andrea Arcangeli
2010-04-01 15:50                 ` Peter Zijlstra
2010-04-01 15:56                   ` Peter Zijlstra
2010-04-01 16:07                     ` Andrea Arcangeli
2010-04-01 16:32                       ` Peter Zijlstra
2010-04-01 16:00                   ` Andrea Arcangeli
2010-04-01 15:51                 ` Avi Kivity
2010-04-01 15:56                   ` Peter Zijlstra
2010-04-01 16:06                     ` Avi Kivity
2010-04-01 16:15                       ` Paul E. McKenney
2010-04-01 16:36                         ` Peter Zijlstra
2010-04-01 17:02                           ` Paul E. McKenney
2010-04-01 16:08                     ` Andrea Arcangeli
2010-04-01 16:14                     ` Paul E. McKenney
2010-04-01 16:02                   ` Andrea Arcangeli
2010-04-01 16:12                 ` Peter Zijlstra
2010-04-01 16:18                   ` Andrea Arcangeli
2010-04-01 16:45                     ` Peter Zijlstra
2010-04-01 16:49                       ` Peter Zijlstra
2010-04-01 17:04                         ` Andrea Arcangeli
2010-04-01 14:16       ` Rik van Riel
2010-04-01 15:32       ` Andrea Arcangeli
2010-04-01 15:37         ` Avi Kivity
2010-04-01 11:09     ` Avi Kivity

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.