From: Alex Bennée
To: alvise rigo
Cc: mttcg@listserver.greensocs.com, Peter Maydell, Mark Burton, QEMU Developers, Alexander Graf, guillaume.delbergue@greensocs.com, Paolo Bonzini, KONRAD Frédéric
Date: Mon, 15 Jun 2015 15:25:02 +0100
Message-ID: <87zj41f3r5.fsf@linaro.org>
References: <878uborigh.fsf@linaro.org>
Subject: Re: [Qemu-devel] RFC Multi-threaded TCG design document

alvise rigo writes:

> Hi Alex,
>
> Let me just add one comment.
>
>> Memory Barriers
>> ---------------
>>
>> Barriers (sometimes known as fences) provide a mechanism for software
>> to enforce a particular ordering of memory operations from the point
>> of view of external observers (e.g. another processor core). They can
>> apply to all memory operations, or to just loads or just stores.
>>
>> The Linux kernel has an excellent write-up on the various forms of
>> memory barrier and the guarantees they can provide [1].
>>
>> Barriers are often wrapped around synchronisation primitives to
>> provide explicit memory ordering semantics.
>> However they can be used by themselves to provide safe lockless
>> access by ensuring, for example, that a signal flag will always be
>> set after a payload.
>>
>> DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
>>
>> This would enforce a strong load/store ordering so all loads/stores
>> complete at the memory barrier. On single-core non-SMP strongly
>> ordered backends this could become a NOP.
>
> I believe the main problem here is not just about translating guest
> barriers to host barriers, but also about adding barriers in the TCG
> generated code where they are needed, i.e. when, in the guest code,
> the synchronisation/memory barriers don't wrap atomic instructions.

Not all atomic instructions imply memory barriers. AIUI on ARMv8 you
only get explicit memory barriers if you use the
load-acquire/store-release variants of load/store exclusive.

> To give a concrete example, let's suppose a case where we emulate an
> x86 guest on ARM (on ARMv8 the situation should not be so
> complicated). At some point TCG will be asked to translate a Linux
> spin_lock(), which eventually uses arch_spin_lock(). Simplifying a
> bit, what happens is along the lines of:
>
> - barrier() // meaning a compiler barrier
> - atomic update of the (spin)lock value
> - barrier()
>
> The architecture-dependent part is of course the "atomic update of
> the spinlock" implementation, which, on ARM, relies on ldrex/strex
> instructions and eventually issues a full hardware memory barrier
> (dmb). On x86, on the other hand, only the cmpxchg instruction is
> used (coupled with a compiler memory clobber), but no hardware full
> memory barrier is required because of the stronger memory model. I'm
> pretty sure that the TCG code generated from spin_lock() will not be
> the same as the one present in an ARM kernel binary compiled with the
> latest GCC, but still, that full memory barrier is likely to be
> required in the TCG generated code as well.
>
> Now the question could be: looking at the bare flow of x86 asm
> instructions used to implement spin_lock(), how can we deduce that a
> dmb instruction has to be added after the atomic instructions? Should
> we pair every guest atomic instruction with a dmb?

I don't think so. We should follow the guest processor's semantics,
which AIUI for x86 means that cmpxchg does enforce memory ordering
across cores when prefixed with LOCK. At that point we can prefix the
cmpxchg TCG ops with our new tcg_dmb barrier. Without the LOCK prefix
we still guarantee an atomic update, but without any explicit
synchronisation between the cores. In practice Linux at least uses
LOCK-prefixed cmpxchg instructions in its synchronisation code.

x86 code will still emit sfence/mfence/lfence instructions to ensure
external devices see memory accesses in the right order. These should
certainly cause memory barrier TCG ops to be emitted.

> Regards,
> alvise

--
Alex Bennée
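P.S. For concreteness, the rough shape I have in mind for the i386
frontend, as pseudocode. Neither tcg_gen_mb() nor TCG_MO_ALL exists
today; they are placeholder names for the proposed barrier op, and the
surrounding helpers are made up for illustration:

```
/* pseudocode sketch, not actual QEMU source */
if (prefixes & PREFIX_LOCK) {
    tcg_gen_mb(TCG_MO_ALL);    /* order prior accesses before the RMW */
}
gen_atomic_cmpxchg(...);       /* the guest's atomic update */
if (prefixes & PREFIX_LOCK) {
    tcg_gen_mb(TCG_MO_ALL);    /* ...and the RMW before later accesses */
}

/* mfence would map directly onto the new op: */
tcg_gen_mb(TCG_MO_ALL);        /* lfence/sfence could later be refined
                                  to load-only/store-only variants */
```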