Re: [Qemu-devel] [RFC 01/10] target-arm: protect cpu_exclusive_*.

From: Mark Burton <mark.burton@greensocs.com>
To: Peter Maydell <peter.maydell@linaro.org>
Cc: mttcg@listserver.greensocs.com,
	"J. Kiszka" <jan.kiszka@siemens.com>,
	"Alexander Graf" <agraf@suse.de>,
	"QEMU Developers" <qemu-devel@nongnu.org>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"KONRAD Frédéric" <fred.konrad@greensocs.com>
Subject: Re: [Qemu-devel] [RFC 01/10] target-arm: protect cpu_exclusive_*.
Date: Tue, 3 Mar 2015 16:29:13 +0100	[thread overview]
Message-ID: <711BBCAD-9514-4523-A456-AD4F397C41CD@greensocs.com> (raw)
In-Reply-To: <CAFEAcA8Lg_iLhd2Vs8Q1KtvWUrqRyHGjmU92RemqNmC2MDEf0w@mail.gmail.com>

Hi Peter, thanks for the feedback

So - what we have discovered - using the slow path as discussed has a significant performance hit (no surprise), likewise the user-mode exit mechanism. However we are investigating that in parallel...

However, for now, we have left the TCG doing the majority of the work. 
(This was supposed to be the ‘quick an easy’ approach - we have already discussed better approaches to atomic instructions at length - but they are even more invasive!)

we have tried several schemes, and not found anything totally satisfactory:  we have a gremlin. There seems to be a race condition somewhere effectively means that the guest ‘ticket’ mechanism inside the guest kernel goes off by one, and therefore the guest kernel stalls as it never gets the ticket it’s looking for…. (it’s off by one in the ‘one too few’ sense, both CPU’s end up looking for tickets that are higher than the current finished ticket)...

The mechanism to hand out the tickets is of course a LDREX/STREX, leading us to believe that is the cause of our problems, however I am increasingly sceptical.

Our scheme is actually quite simple now:
We keep the same overall scheme as exits upstream today - we store the address and value.
We provide a lock which is used for LDREX, STREX, CLREX, and in the raise_exception code.
For LDREX we check that no other CPU is tagging the address. Because of the lock, we can be sure no STREX is executing so we are always safe to ‘steal’ a tag from another CPU (which is what the ARM ARM defines).
STREX and CLREX are similar to upstream...
We are also careful to invalidate the address tag if we take an exception.

The drawback of this scheme is of course that we are not totally protected against non exclusive writes, which could potentially fall between our reading a value and checking it against our saved value. But I am not convinced that is our problem (I have checked to make sure we dont get non exclusive writes).

This scheme would seem to be elegant and simple - but it suffers from the one small drawback of still having the issue as above…

Cheers

Mark.

ps. on our bug - we believe somehow the STREX is being marked as failed, but actually succeeds to write something.  There are only 3 ways the strex can fail:
1/ the address doesn't match - in which case no ST will be attempted
2/ the value doesn't match - which means - no ST attempted
3/ the store starts, but causes a TLB fill/exception…

The 3rd has 2 possibilities - the TLB is filled, and the store goes ahead totally normally - there should be no ‘fail’ - or an exception is generated in which case we will long jump away and never return. 

Am I missing something?

>> 
>> 

> On 2 Mar 2015, at 13:27, Peter Maydell <peter.maydell@linaro.org> wrote:
> 
> On 27 February 2015 at 16:54, Mark Burton <mark.burton@greensocs.com> wrote:
>> 
>>> On 26 Feb 2015, at 23:56, Peter Maydell <peter.maydell@linaro.org> wrote:
>>> cpu_physical_memory_rw would bypass the TLB and so be much slower.
>>> Make sure you use the functions which go via the TLB if you do
>>> this in a helper (and remember that they will longjmp out on a
>>> tlb miss!)
>> 
>> At this point speed isn’t our main concern - it’s simplicity of implementation - we want it to work, then we can worry about a better implementation (which likely should not go this path at all - as discussed above).
>> Given that - isn’t it reasonable to pass through cpu_physical_memory_rw - and hence not have to worry about the long jump ? Or am I missing something?
> 
> If you use cpu_physical_memory_rw you need to do the
> virt-to-phys translation by hand (preferably via the TLB).
> That might be something you needed to do anyway if we want
> to have architecturally correct monitors that work on
> physaddrs rather than vaddrs, but if not then the two
> step process is a bit awkward.
> 
>>> Pretty sure we've already discussed how the current ldrex/strex
>>> implementation is not architecturally correct. I think this is
>>> another of those areas.
>> 
>> We have indeed discussed this - but this is a surprise.
> 
> You're right that I didn't specifically realise this exact
> part of our incorrectness earlier.
> 
> -- PMM

	 +44 (0)20 7100 3485 x 210
 +33 (0)5 33 52 01 77x 210

	+33 (0)603762104
	mark.burton