From: "Alex Bennée" <alex.bennee@linaro.org>
To: Sergey Fedorov <serge.fdrv@gmail.com>
Cc: Alvise Rigo <a.rigo@virtualopensystems.com>,
	mttcg@listserver.greensocs.com, qemu-devel@nongnu.org,
	jani.kokkonen@huawei.com, claudio.fontana@huawei.com,
	tech@virtualopensystems.com, fred.konrad@greensocs.com,
	pbonzini@redhat.com, rth@twiddle.net, cota@braap.org,
	peter.maydell@linaro.org
Subject: Re: [Qemu-devel] [RFC 02/10] softmmu_llsc_template.h: Move to multi-threading
Date: Tue, 14 Jun 2016 09:37:47 +0100	[thread overview]
Message-ID: <87bn34go84.fsf@linaro.org> (raw)
In-Reply-To: <87ziqtf2ac.fsf@linaro.org>


Alex Bennée <alex.bennee@linaro.org> writes:

> Sergey Fedorov <serge.fdrv@gmail.com> writes:
>
>> On 26/05/16 19:35, Alvise Rigo wrote:
>>> Using tcg_exclusive_{lock,unlock}(), make the emulation of
>>> LoadLink/StoreConditional thread safe.
>>>
>>> During an LL access, this lock protects the load access itself, the
>>> update of the exclusive history and the update of the VCPU's protected
>>> range.  In a SC access, the lock protects the store access itself, the
>>> possible reset of other VCPUs' protected range and the reset of the
>>> exclusive context of calling VCPU.
>>>
>>> The lock is also taken when a normal store happens to access an
>>> exclusive page to reset other VCPUs' protected range in case of
>>> collision.
>>
>> I think the key problem here is that the load in LL helper can race with
>> a concurrent regular fast-path store. It's probably easier to annotate
>> the source here:
>>
>>      1  WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>      2                                  TCGMemOpIdx oi, uintptr_t retaddr)
>>      3  {
>>      4      WORD_TYPE ret;
>>      5      int index;
>>      6      CPUState *this_cpu = ENV_GET_CPU(env);
>>      7      CPUClass *cc = CPU_GET_CLASS(this_cpu);
>>      8      hwaddr hw_addr;
>>      9      unsigned mmu_idx = get_mmuidx(oi);
>>
>>     10      index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>
>>     11      tcg_exclusive_lock();
>>
>>     12      /* Use the proper load helper from cpu_ldst.h */
>>     13      ret = helper_ld(env, addr, oi, retaddr);
>>
>>     14      /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
>>     15       * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>>     16      hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
>>     17      if (likely(!(env->tlb_table[mmu_idx][index].addr_read & TLB_MMIO))) {
>>     18          /* If all the vCPUs have the EXCL bit set for this page there is
>>     19           * no need to request any flush. */
>>     20          if (cpu_physical_memory_set_excl(hw_addr)) {
>>     21              CPUState *cpu;
>>
>>     22              excl_history_put_addr(hw_addr);
>>     23              CPU_FOREACH(cpu) {
>>     24                  if (this_cpu != cpu) {
>>     25                      tlb_flush_other(this_cpu, cpu, 1);
>>     26                  }
>>     27              }
>>     28          }
>>     29          /* For this vCPU, just update the TLB entry, no need to flush. */
>>     30          env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
>>     31      } else {
>>     32          /* Set a pending exclusive access in the MemoryRegion */
>>     33          MemoryRegion *mr = iotlb_to_region(this_cpu,
>>     34                                             env->iotlb[mmu_idx][index].addr,
>>     35                                             env->iotlb[mmu_idx][index].attrs);
>>     36          mr->pending_excl_access = true;
>>     37      }
>>
>>     38      cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
>>
>>     39      tcg_exclusive_unlock();
>>
>>     40      /* From now on we are in LL/SC context */
>>     41      this_cpu->ll_sc_context = true;
>>
>>     42      return ret;
>>     43  }
>>
>>
>> The exclusive lock at line 11 doesn't help if a concurrent fast-path
>> store to this address occurs after we have finished the load at line 13
>> but before the TLB is flushed as a result of line 25. If we reorder the
>> load to happen after the TLB flush request, we still must be sure that
>> the flush is complete before we can do the load safely.
>
> I think this can be fixed using async_safe_run_on_cpu and tweaking the
> ldlink helper.
>
>   * Change the helper_ldlink call
>     - pass it the offset of cpu->reg[n] so it can store the result of the load
>     - maybe pass it the next PC (unless there is some other way to know it)
>
>   vCPU runs until the ldlink instruction occurs and jumps to the helper
>
>   * Once in the helper_ldlink
>     - queue up an async helper function with info of offset
>     - cpu_loop_exit_restore(with next PC)
>
>   The vCPU that issued the ldlink exits immediately and waits until all
>   vCPUs are out of generated code.
>
>   * Once in helper_ldlink async helper
>     - Everything at this point is quiescent, no vCPU activity
>     - Flush all TLBs/set flags
>     - Do the load from memory, store directly into cpu->reg[n]
>
> The key thing is that once we are committed to the load in the async
> helper nothing else can get in the way. Any stores before we are in the
> helper happen as normal; once we exit the async helper all potentially
> conflicting stores will take the slow path.
>
> There is a little messing about in knowing the next PC, which is simple
> in the ARM case but gets a bit more complicated for architectures that
> have branch delay slots. I haven't looked into this nit yet.

Thinking on it further, the messing about with offsets and PCs could be
solved just by having a simple flag:

 - First run, no flag set, queue safe work, restart loop at PC
 - Second run, flag set, do load, clear flag

As long as the flag is per-vCPU I think it's safe from races.

>
>>
>>>
>>> Moreover, adapt target-arm to also cope with the new multi-threaded
>>> execution.
>>>
>>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>>> ---
>>>  softmmu_llsc_template.h | 11 +++++++++--
>>>  softmmu_template.h      |  6 ++++++
>>>  target-arm/op_helper.c  |  6 ++++++
>>>  3 files changed, 21 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
>>> index 2c4a494..d3810c0 100644
>>> --- a/softmmu_llsc_template.h
>>> +++ b/softmmu_llsc_template.h
>>> @@ -62,11 +62,13 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>>      hwaddr hw_addr;
>>>      unsigned mmu_idx = get_mmuidx(oi);
>>>
>>> +    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>> +
>>> +    tcg_exclusive_lock();
>>> +
>>>      /* Use the proper load helper from cpu_ldst.h */
>>>      ret = helper_ld(env, addr, oi, retaddr);
>>>
>>> -    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>> -
>>>      /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
>>>       * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>>>      hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
>>> @@ -95,6 +97,8 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>>
>>>      cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
>>>
>>> +    tcg_exclusive_unlock();
>>> +
>>>      /* From now on we are in LL/SC context */
>>>      this_cpu->ll_sc_context = true;
>>>
>>> @@ -114,6 +118,8 @@ WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
>>>           * access as one made by the store conditional wrapper. If the store
>>>           * conditional does not succeed, the value will be set to 0.*/
>>>          cpu->excl_succeeded = true;
>>> +
>>> +        tcg_exclusive_lock();
>>>          helper_st(env, addr, val, oi, retaddr);
>>>
>>>          if (cpu->excl_succeeded) {
>>> @@ -123,6 +129,7 @@ WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
>>>
>>>      /* Unset LL/SC context */
>>>      cc->cpu_reset_excl_context(cpu);
>>> +    tcg_exclusive_unlock();
>>>
>>>      return ret;
>>>  }
>>> diff --git a/softmmu_template.h b/softmmu_template.h
>>> index 76fe37e..9363a7b 100644
>>> --- a/softmmu_template.h
>>> +++ b/softmmu_template.h
>>> @@ -537,11 +537,16 @@ static inline void smmu_helper(do_excl_store)(CPUArchState *env,
>>>          }
>>>      }
>>>
>>> +    /* Take the lock in case we are not coming from a SC */
>>> +    tcg_exclusive_lock();
>>> +
>>>      smmu_helper(do_ram_store)(env, little_endian, val, addr, oi,
>>>                                get_mmuidx(oi), index, retaddr);
>>>
>>>      reset_other_cpus_colliding_ll_addr(hw_addr, DATA_SIZE);
>>>
>>> +    tcg_exclusive_unlock();
>>> +
>>>      return;
>>>  }
>>>
>>> @@ -572,6 +577,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>      /* Handle an IO access or exclusive access.  */
>>>      if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
>>>          if (tlb_addr & TLB_EXCL) {
>>> +
>>>              smmu_helper(do_excl_store)(env, true, val, addr, oi, index,
>>>                                         retaddr);
>>>              return;
>>> diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
>>> index e22afc5..19ea52d 100644
>>> --- a/target-arm/op_helper.c
>>> +++ b/target-arm/op_helper.c
>>> @@ -35,7 +35,9 @@ static void raise_exception(CPUARMState *env, uint32_t excp,
>>>      cs->exception_index = excp;
>>>      env->exception.syndrome = syndrome;
>>>      env->exception.target_el = target_el;
>>> +    tcg_exclusive_lock();
>>>      cc->cpu_reset_excl_context(cs);
>>> +    tcg_exclusive_unlock();
>>>      cpu_loop_exit(cs);
>>>  }
>>>
>>> @@ -58,7 +60,9 @@ void HELPER(atomic_clear)(CPUARMState *env)
>>>      CPUState *cs = ENV_GET_CPU(env);
>>>      CPUClass *cc = CPU_GET_CLASS(cs);
>>>
>>> +    tcg_exclusive_lock();
>>>      cc->cpu_reset_excl_context(cs);
>>> +    tcg_exclusive_unlock();
>>>  }
>>>
>>>  uint32_t HELPER(neon_tbl)(CPUARMState *env, uint32_t ireg, uint32_t def,
>>> @@ -874,7 +878,9 @@ void HELPER(exception_return)(CPUARMState *env)
>>>
>>>      aarch64_save_sp(env, cur_el);
>>>
>>> +    tcg_exclusive_lock();
>>>      cc->cpu_reset_excl_context(cs);
>>> +    tcg_exclusive_unlock();
>>>
>>>      /* We must squash the PSTATE.SS bit to zero unless both of the
>>>       * following hold:


--
Alex Bennée
