* [v1 5/6] sparc64: new context wrap
@ 2017-05-31 15:25 Pavel Tatashin
  2017-06-05  2:01 ` David Miller
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Pavel Tatashin @ 2017-05-31 15:25 UTC (permalink / raw)
  To: sparclinux

The current wrap implementation has a race: it is called outside of the
ctx_alloc_lock, and it does not wait for all CPUs to complete the wrap.
This means that one thread can get a new context ID under the new version
while another thread is still running with the same context ID from before
the wrap. The problem is especially severe on CPUs with shared TLBs, such
as sun4v. I used the following test to reproduce the problem very quickly:
- start over 8K processes (must be more than the number of context IDs)
- write and read values at a memory location in every process.

Memory corruption starts happening almost immediately: what we read back
does not equal what we wrote.
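
For illustration, here is a minimal userspace sketch of such a reproducer
(the process count, the mmap'ed page, and the failure check are
illustrative assumptions, not the exact test that was used):

	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/wait.h>

	#define NPROC	8200	/* assumed: more than the number of context IDs */

	int main(void)
	{
		int i;

		for (i = 0; i < NPROC; i++) {
			if (fork() == 0) {
				/* private page; each child stores its own pid */
				volatile long *p = mmap(NULL, 4096,
						PROT_READ | PROT_WRITE,
						MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
				long me = getpid();

				if (p == MAP_FAILED)
					_exit(1);
				for (;;) {
					*p = me;	/* write */
					if (*p != me) {	/* read back and verify */
						fprintf(stderr, "corruption in pid %ld\n", me);
						abort();
					}
				}
			}
		}
		while (wait(NULL) > 0)
			;
		return 0;
	}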

Several approaches were explored before settling on this one:

Approach 1:
Move smp_new_mmu_context_version() inside ctx_alloc_lock, and wait for
every process to complete the wrap.  (Note: every CPU must wait before
leaving smp_new_mmu_context_version_client() until every other CPU has
arrived.)

This approach ends up in deadlock: some threads own locks which other
threads are waiting for, and those waiters never receive the softint until
the lock-owning threads exit smp_new_mmu_context_version_client(). Since
we do not allow that exit until every CPU has arrived, deadlock happens.

Approach 2:
Handle the wrap directly in the mondo interrupt. Use etrap/rtrap to enter
C code, and issue new versions to every CPU.
This approach adds runtime overhead: in switch_mm() we must add checks to
make sure that the version has not changed due to a wrap while we were
loading the new secondary context. (This could be protected by PSTATE_IE,
but that degrades performance, as on M7 and older CPUs each such access
takes 50 cycles.) Also, we still need a global per-cpu array of mm's to
know where new contexts must be loaded, otherwise we could switch context
to a thread that is going away (if we received the mondo between
switch_mm() and switch_to()). Finally, there are issues with window
registers in rtrap() when context IDs are changed during CPU mondo time.

The approach in this patch is the simplest and has almost no impact on
runtime.  We use the per-cpu array that records which mm last loaded a
secondary context on each CPU, and bump those mm's versions to the new
generation without changing their context IDs. If a new process comes in
to get a context ID, it will go through get_new_mmu_context() because of
the version mismatch, but running processes do not need to be interrupted,
and the wrap is quicker because we do not need to xcall and wait for every
CPU to receive and complete the wrap.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
---
 arch/sparc/mm/init_64.c |   81 +++++++++++++++++++++++++++++++---------------
 1 files changed, 54 insertions(+), 27 deletions(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 5b4f030..bfb7672 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -712,6 +712,53 @@ void __flush_dcache_range(unsigned long start, unsigned long end)
 DECLARE_BITMAP(mmu_context_bmap, MAX_CTX_NR);
 DEFINE_PER_CPU(struct mm_struct *, per_cpu_secondary_mm) = {0};
 
+static void mmu_context_wrap(void)
+{
+	unsigned long old_ver = tlb_context_cache & CTX_VERSION_MASK;
+	unsigned long new_ver, new_ctx, old_ctx;
+	struct mm_struct *mm;
+	int cpu;
+
+	bitmap_zero(mmu_context_bmap, 1 << CTX_NR_BITS);
+
+	/* Reserve kernel context */
+	set_bit(0, mmu_context_bmap);
+
+	new_ver = (tlb_context_cache & CTX_VERSION_MASK) + CTX_FIRST_VERSION;
+	if (unlikely(new_ver == 0))
+		new_ver = CTX_FIRST_VERSION;
+	tlb_context_cache = new_ver;
+
+	/*
+	 * Make sure that any new mm that are added into per_cpu_secondary_mm,
+	 * are going to go through get_new_mmu_context() path.
+	 */
+	mb();
+
+	/*
+	 * Updated versions to current on those CPUs that had valid secondary
+	 * contexts
+	 */
+	for_each_online_cpu(cpu) {
+		/*
+		 * If a new mm is stored after we took this mm from the array,
+		 * it will go into get_new_mmu_context() path, because we
+		 * already bumped the version in tlb_context_cache.
+		 */
+		mm = per_cpu(per_cpu_secondary_mm, cpu);
+
+		if (unlikely(!mm || mm == &init_mm))
+			continue;
+
+		old_ctx = mm->context.sparc64_ctx_val;
+		if (likely((old_ctx & CTX_VERSION_MASK) == old_ver)) {
+			new_ctx = (old_ctx & ~CTX_VERSION_MASK) | new_ver;
+			set_bit(new_ctx & CTX_NR_MASK, mmu_context_bmap);
+			mm->context.sparc64_ctx_val = new_ctx;
+		}
+	}
+}
+
 /* Caller does TLB context flushing on local CPU if necessary.
  * The caller also ensures that CTX_VALID(mm->context) is false.
  *
@@ -726,50 +773,30 @@ void get_new_mmu_context(struct mm_struct *mm)
 {
 	unsigned long ctx, new_ctx;
 	unsigned long orig_pgsz_bits;
-	int new_version;
 
 	spin_lock(&ctx_alloc_lock);
+retry:
+	/* wrap might have happened, test again if our context became valid */
+	if (unlikely(CTX_VALID(mm->context)))
+		goto out;
 	orig_pgsz_bits = (mm->context.sparc64_ctx_val & CTX_PGSZ_MASK);
 	ctx = (tlb_context_cache + 1) & CTX_NR_MASK;
 	new_ctx = find_next_zero_bit(mmu_context_bmap, 1 << CTX_NR_BITS, ctx);
-	new_version = 0;
 	if (new_ctx >= (1 << CTX_NR_BITS)) {
 		new_ctx = find_next_zero_bit(mmu_context_bmap, ctx, 1);
 		if (new_ctx >= ctx) {
-			int i;
-			new_ctx = (tlb_context_cache & CTX_VERSION_MASK) +
-				CTX_FIRST_VERSION + 1;
-			if (new_ctx == 1)
-				new_ctx = CTX_FIRST_VERSION + 1;
-
-			/* Don't call memset, for 16 entries that's just
-			 * plain silly...
-			 */
-			mmu_context_bmap[0] = 3;
-			mmu_context_bmap[1] = 0;
-			mmu_context_bmap[2] = 0;
-			mmu_context_bmap[3] = 0;
-			for (i = 4; i < CTX_BMAP_SLOTS; i += 4) {
-				mmu_context_bmap[i + 0] = 0;
-				mmu_context_bmap[i + 1] = 0;
-				mmu_context_bmap[i + 2] = 0;
-				mmu_context_bmap[i + 3] = 0;
-			}
-			new_version = 1;
-			goto out;
+			mmu_context_wrap();
+			goto retry;
 		}
 	}
 	if (mm->context.sparc64_ctx_val)
 		cpumask_clear(mm_cpumask(mm));
 	mmu_context_bmap[new_ctx>>6] |= (1UL << (new_ctx & 63));
 	new_ctx |= (tlb_context_cache & CTX_VERSION_MASK);
-out:
 	tlb_context_cache = new_ctx;
 	mm->context.sparc64_ctx_val = new_ctx | orig_pgsz_bits;
+out:
 	spin_unlock(&ctx_alloc_lock);
-
-	if (unlikely(new_version))
-		smp_new_mmu_context_version();
 }
 
 static int numa_enabled = 1;
-- 
1.7.1



* Re: [v1 5/6] sparc64: new context wrap
  2017-05-31 15:25 [v1 5/6] sparc64: new context wrap Pavel Tatashin
@ 2017-06-05  2:01 ` David Miller
  2017-06-05  3:50 ` Pasha Tatashin
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: David Miller @ 2017-06-05  2:01 UTC (permalink / raw)
  To: sparclinux

From: Pavel Tatashin <pasha.tatashin@oracle.com>
Date: Wed, 31 May 2017 11:25:24 -0400

> +	for_each_online_cpu(cpu) {
> +		/*
> +		 * If a new mm is stored after we took this mm from the array,
> +		 * it will go into get_new_mmu_context() path, because we
> +		 * already bumped the version in tlb_context_cache.
> +		 */
> +		mm = per_cpu(per_cpu_secondary_mm, cpu);
> +
> +		if (unlikely(!mm || mm == &init_mm))
> +			continue;
> +
> +		old_ctx = mm->context.sparc64_ctx_val;
> +		if (likely((old_ctx & CTX_VERSION_MASK) == old_ver)) {
> +			new_ctx = (old_ctx & ~CTX_VERSION_MASK) | new_ver;
> +			set_bit(new_ctx & CTX_NR_MASK, mmu_context_bmap);
> +			mm->context.sparc64_ctx_val = new_ctx;

I wonder if there is a potential use after free here.

What synchronizes the per-cpu mm pointers with free_mm()?

For example, what stops another cpu from exiting a thread
and dropping the mm between when you do the per_cpu() read
of the 'mm' pointer and the tests and sets you do a few
lines later?


* Re: [v1 5/6] sparc64: new context wrap
  2017-05-31 15:25 [v1 5/6] sparc64: new context wrap Pavel Tatashin
  2017-06-05  2:01 ` David Miller
@ 2017-06-05  3:50 ` Pasha Tatashin
  2017-06-05 19:03 ` David Miller
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Pasha Tatashin @ 2017-06-05  3:50 UTC (permalink / raw)
  To: sparclinux

On 2017-06-04 22:01, David Miller wrote:
> From: Pavel Tatashin <pasha.tatashin@oracle.com>
> Date: Wed, 31 May 2017 11:25:24 -0400
> 
>> +	for_each_online_cpu(cpu) {
>> +		/*
>> +		 * If a new mm is stored after we took this mm from the array,
>> +		 * it will go into get_new_mmu_context() path, because we
>> +		 * already bumped the version in tlb_context_cache.
>> +		 */
>> +		mm = per_cpu(per_cpu_secondary_mm, cpu);
>> +
>> +		if (unlikely(!mm || mm == &init_mm))
>> +			continue;
>> +
>> +		old_ctx = mm->context.sparc64_ctx_val;
>> +		if (likely((old_ctx & CTX_VERSION_MASK) == old_ver)) {
>> +			new_ctx = (old_ctx & ~CTX_VERSION_MASK) | new_ver;
>> +			set_bit(new_ctx & CTX_NR_MASK, mmu_context_bmap);
>> +			mm->context.sparc64_ctx_val = new_ctx;
> 
> I wonder if there is a potential use after free here.
> 
> What synchronizes the per-cpu mm pointers with free_mm()?
> 
> For example, what stops another cpu from exiting a thread
> and dropping the mm between when you do the per_cpu() read
> of the 'mm' pointer and the tests and sets you do a few
> lines later?

Hi Dave,

The ctx_alloc_lock taken in destroy_context() synchronizes the wrap with
free_mm(). By the time destroy_context() is called, we know that the last
user of the mm is freeing it, and we take the same lock that is held
during the wrap.

I had the following asserts in destroy_context() for testing purposes:

	for_each_cpu(cpu, mm_cpumask(mm)) {
		mmp = per_cpu_ptr(&per_cpu_secondary_mm, cpu);
		if (*mmp == mm) {
			cmpxchg(mmp, mm, NULL);
			panic("BUG: mm[%p] in per_cpu_secondary_mm CPU[%d, %d]",
			      mm, smp_processor_id(), cpu);
		}
	}

And never hit it.
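
For reference, a simplified sketch of the ordering this relies on: the
context ID is released in destroy_context() under the same ctx_alloc_lock
that mmu_context_wrap() runs under (details elided, not the exact kernel
code):

	void destroy_context(struct mm_struct *mm)
	{
		unsigned long flags;

		/* ... TSB teardown elided ... */

		spin_lock_irqsave(&ctx_alloc_lock, flags);
		if (CTX_VALID(mm->context)) {
			unsigned long nr = mm->context.sparc64_ctx_val & CTX_NR_MASK;

			/* free the context ID while no wrap can run */
			mmu_context_bmap[nr >> 6] &= ~(1UL << (nr & 63));
		}
		spin_unlock_irqrestore(&ctx_alloc_lock, flags);
	}

The idea being that while mmu_context_wrap() holds ctx_alloc_lock, no mm
can get past destroy_context(), and free_mm() only runs after
destroy_context(), so nothing the wrap reads can be freed underneath it.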

Pasha


* Re: [v1 5/6] sparc64: new context wrap
  2017-05-31 15:25 [v1 5/6] sparc64: new context wrap Pavel Tatashin
  2017-06-05  2:01 ` David Miller
  2017-06-05  3:50 ` Pasha Tatashin
@ 2017-06-05 19:03 ` David Miller
  2017-06-05 20:10 ` Pasha Tatashin
  2017-06-05 20:24 ` David Miller
  4 siblings, 0 replies; 6+ messages in thread
From: David Miller @ 2017-06-05 19:03 UTC (permalink / raw)
  To: sparclinux

From: Pasha Tatashin <pasha.tatashin@oracle.com>
Date: Sun, 4 Jun 2017 23:50:55 -0400

> On 2017-06-04 22:01, David Miller wrote:
>> From: Pavel Tatashin <pasha.tatashin@oracle.com>
>> Date: Wed, 31 May 2017 11:25:24 -0400
>> 
>>> +	for_each_online_cpu(cpu) {
>>> +		/*
>>> +		 * If a new mm is stored after we took this mm from the array,
>>> +		 * it will go into get_new_mmu_context() path, because we
>>> +		 * already bumped the version in tlb_context_cache.
>>> +		 */
>>> +		mm = per_cpu(per_cpu_secondary_mm, cpu);
>>> +
>>> +		if (unlikely(!mm || mm == &init_mm))
>>> +			continue;
>>> +
>>> +		old_ctx = mm->context.sparc64_ctx_val;
>>> +		if (likely((old_ctx & CTX_VERSION_MASK) == old_ver)) {
>>> +			new_ctx = (old_ctx & ~CTX_VERSION_MASK) | new_ver;
>>> +			set_bit(new_ctx & CTX_NR_MASK, mmu_context_bmap);
>>> +			mm->context.sparc64_ctx_val = new_ctx;
>> I wonder if there is a potential use after free here.
>> What synchronizes the per-cpu mm pointers with free_mm()?
>> For example, what stops another cpu from exiting a thread
>> and dropping the mm between when you do the per_cpu() read
>> of the 'mm' pointer and the tests and sets you do a few
>> lines later?
> 
> Hi Dave,
> 
> ctx_alloc_lock in destroy_context() synchronizes wrap with
> free_mm(). By the time destroy_context() is called, we know that the
> last user of mm is freeing it, and we take the same lock that is owned
> during wrap.
> 
> I had the following asserts in destroy_context() for testing purposes:
> 
> 	for_each_cpu(cpu, mm_cpumask(mm)) {
> 		mmp = per_cpu_ptr(&per_cpu_secondary_mm, cpu);
> 		if (*mmp == mm) {
> 			cmpxchg(mmp, mm, NULL);
> 			panic("BUG: mm[%p] in per_cpu_secondary_mm CPU[%d, %d]",
> 			      mm, smp_processor_id(), cpu);
> 		}
> 	}
> 
> And never hit it.

Ok, thanks for the details.

I'm not too happy about scanning NCPUS 'mm' pointers on every context
wrap; that can be up to 4096 entries, so the cost is not exactly trivial.

However, I don't have an alternative suggestion at this time, and such
quick wraps only happen in extreme situations, so your changes
definitely improve the situation.

So I'll apply this series and queue it up for -stable, thank you.


* Re: [v1 5/6] sparc64: new context wrap
  2017-05-31 15:25 [v1 5/6] sparc64: new context wrap Pavel Tatashin
                   ` (2 preceding siblings ...)
  2017-06-05 19:03 ` David Miller
@ 2017-06-05 20:10 ` Pasha Tatashin
  2017-06-05 20:24 ` David Miller
  4 siblings, 0 replies; 6+ messages in thread
From: Pasha Tatashin @ 2017-06-05 20:10 UTC (permalink / raw)
  To: sparclinux

> 
> Ok, thanks for the details.
> 
> I'm not too happy about scanning NCPUS 'mm' pointers every context
> wrap, that can be up to 4096 so the cost is not exactly trivial.

Hi Dave,

Thank you for accepting the changes.

Going through an NCPU-proportional loop, whether for xcalls or for 'mm'
pointers, is a fundamental limitation when there is only one global
context counter per system. Removing that limitation would be a separate
project: we would need multiple context counters per system, up to one
per core. Bob Picco is working on adding this support. In that case the
loop size is bounded by:
NCPU / (# context domains)

Pasha


* Re: [v1 5/6] sparc64: new context wrap
  2017-05-31 15:25 [v1 5/6] sparc64: new context wrap Pavel Tatashin
                   ` (3 preceding siblings ...)
  2017-06-05 20:10 ` Pasha Tatashin
@ 2017-06-05 20:24 ` David Miller
  4 siblings, 0 replies; 6+ messages in thread
From: David Miller @ 2017-06-05 20:24 UTC (permalink / raw)
  To: sparclinux

From: Pasha Tatashin <pasha.tatashin@oracle.com>
Date: Mon, 5 Jun 2017 16:10:04 -0400

> Going through NCPU proportional loop for xcall purpose or 'mm'
> pointers is a fundamental limitation when there is only one global
> context counter per system. To solve this limitation would be a
> different project: we will need multiple context counters per system:
> up-to one per core. Bob Picco is working on adding this support. In
> that case the loop sizes are going to be bound by:
> NCPU / (# context domains)

I see.  I think MIPS uses per-cpu context IDs; perhaps you can look at
how they handle this.

Thanks.


