* KVM and variable-endianness guest CPUs
From: Peter Maydell @ 2014-01-17 17:53 UTC
  To: Alexander Graf
  Cc: Thomas Falcon, QEMU Developers, qemu-ppc, kvm-devel,
	Christoffer Dall, kvmarm

[This seemed like a good jumping-off point for this question.]

On 16 January 2014 17:51, Alexander Graf <agraf@suse.de> wrote:
> On 16.01.2014 at 18:41, Peter Maydell <peter.maydell@linaro.org> wrote:
>> Also see my remarks on the previous patch series suggesting
>> that we should look at this in a more holistic way than
>> just randomly fixing small bits of things. A good place
>> to start would be "what should the semantics of stl_p()
>> be for a QEMU where the CPU is currently operating with
>> a reversed endianness to the TARGET_WORDS_BIGENDIAN
>> setting?".
>
> That'd open a giant can of worms that I'd rather not open.

Yeah, but you kind of have to open that can, because stl_p()
is used in the code path for KVM MMIO accesses to devices.

Specifically, the KVM API says "here's a uint8_t[] byte
array and a length", and the current QEMU code treats that
as "this is a byte array written as if the guest CPU
(a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
I/O access to this buffer rather than to the device".
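
For reference, the relevant part of struct kvm_run in
<linux/kvm.h> is:

		/* KVM_EXIT_MMIO */
		struct {
			__u64 phys_addr;
			__u8  data[8];
			__u32 len;
			__u8  is_write;
		} mmio;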

The KVM API docs don't actually specify the endianness
semantics of the byte array, but I think that that really
needs to be nailed down. I can think of a couple of options:
 * always LE
 * always BE
   [these first two are non-starters because they would
   break either x86 or PPC existing code]
 * always the endianness the guest is at the time
 * always some arbitrary endianness based purely on the
   endianness the KVM implementation used historically
 * always the endianness of the host QEMU binary
 * something else?

Any preferences? Current QEMU code basically assumes
"always the endianness of TARGET_WORDS_BIGENDIAN",
which is pretty random.
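
For reference: ldl_p()/stl_p() resolve to fixed-endian accessors
at build time, roughly like this sketch (a paraphrase of QEMU's
cpu-all.h, not a verbatim quote), which is why the buffer's byte
order ends up tied to TARGET_WORDS_BIGENDIAN:

    #ifdef TARGET_WORDS_BIGENDIAN
    #define ldl_p(p)     ldl_be_p(p)    /* byteswaps on an LE host */
    #define stl_p(p, v)  stl_be_p(p, v)
    #else
    #define ldl_p(p)     ldl_le_p(p)    /* byteswaps on a BE host */
    #define stl_p(p, v)  stl_le_p(p, v)
    #endif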

thanks
-- PMM

* Re: KVM and variable-endianness guest CPUs
From: Peter Maydell @ 2014-01-17 18:52 UTC
  To: Alexander Graf
  Cc: Thomas Falcon, QEMU Developers, qemu-ppc, kvm-devel,
	Christoffer Dall, kvmarm

On 17 January 2014 17:53, Peter Maydell <peter.maydell@linaro.org> wrote:
> Specifically, the KVM API says "here's a uint8_t[] byte
> array and a length", and the current QEMU code treats that
> as "this is a byte array written as if the guest CPU
> (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
> I/O access to this buffer rather than to the device".
>
> The KVM API docs don't actually specify the endianness
> semantics of the byte array, but I think that that really
> needs to be nailed down. I can think of a couple of options:
>  * always LE
>  * always BE
>    [these first two are non-starters because they would
>    break either x86 or PPC existing code]
>  * always the endianness the guest is at the time
>  * always some arbitrary endianness based purely on the
>    endianness the KVM implementation used historically
>  * always the endianness of the host QEMU binary
>  * something else?
>
> Any preferences? Current QEMU code basically assumes
> "always the endianness of TARGET_WORDS_BIGENDIAN",
> which is pretty random.

Having thought a little more about this, my opinion is:

 * we should specify that the byte order of the mmio.data
   array is host kernel endianness (ie same endianness
   as the QEMU process itself) [this is what it actually
   is, I think, for all the cases that work today]
 * we should fix the code path in QEMU for handling
   mmio.data which currently has the implicit assumption
   that when using KVM TARGET_WORDS_BIGENDIAN is the same
   as the QEMU host process endianness (because it's using
   load/store functions which swap if TARGET_WORDS_BIGENDIAN
   is different from HOST_WORDS_BIGENDIAN)
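
As a sketch of the first point: "host kernel endianness" means a
consumer can decode mmio.data with a plain byte copy (helper name
illustrative, not current QEMU code):

    #include <stdint.h>
    #include <string.h>

    /* Interpret the first 4 bytes of mmio.data as a host-endian
     * value: memcpy preserves the buffer's byte order, unlike
     * ldl_p()/stl_p(), which swap whenever TARGET_WORDS_BIGENDIAN
     * differs from HOST_WORDS_BIGENDIAN. */
    static uint32_t mmio_ldl_host(const uint8_t *data)
    {
        uint32_t val;

        memcpy(&val, data, sizeof(val));
        return val;
    }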

thanks
-- PMM

* Re: KVM and variable-endianness guest CPUs
From: Christoffer Dall @ 2014-01-18  4:24 UTC
  To: Peter Maydell
  Cc: Thomas Falcon, kvm-devel, Alexander Graf, QEMU Developers,
	qemu-ppc, kvmarm

On Fri, Jan 17, 2014 at 06:52:57PM +0000, Peter Maydell wrote:
> On 17 January 2014 17:53, Peter Maydell <peter.maydell@linaro.org> wrote:
> > Specifically, the KVM API says "here's a uint8_t[] byte
> > array and a length", and the current QEMU code treats that
> > as "this is a byte array written as if the guest CPU
> > (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
> > I/O access to this buffer rather than to the device".
> >
> > The KVM API docs don't actually specify the endianness
> > semantics of the byte array, but I think that that really
> > needs to be nailed down. I can think of a couple of options:
> >  * always LE
> >  * always BE
> >    [these first two are non-starters because they would
> >    break either x86 or PPC existing code]
> >  * always the endianness the guest is at the time
> >  * always some arbitrary endianness based purely on the
> >    endianness the KVM implementation used historically
> >  * always the endianness of the host QEMU binary
> >  * something else?
> >
> > Any preferences? Current QEMU code basically assumes
> > "always the endianness of TARGET_WORDS_BIGENDIAN",
> > which is pretty random.
> 
> Having thought a little more about this, my opinion is:
> 
>  * we should specify that the byte order of the mmio.data
>    array is host kernel endianness (ie same endianness
>    as the QEMU process itself) [this is what it actually
>    is, I think, for all the cases that work today]

I completely agree. Given that it's too late to settle on always
LE or always BE, I think the natural choice is something that allows
a user to cast the byte array to an appropriate pointer type and
dereference it.
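
Concretely, something like this sketch (the casts are fine here
because mmio.data follows a __u64 in struct kvm_run and is
therefore 8-byte aligned; the helper name is illustrative):

    #include <stdint.h>

    static uint64_t mmio_value(const uint8_t *data, uint32_t len)
    {
        switch (len) {
        case 1: return *(const uint8_t  *)data;
        case 2: return *(const uint16_t *)data;
        case 4: return *(const uint32_t *)data;
        case 8: return *(const uint64_t *)data;
        }
        return 0;
    }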

And I think we need to amend the KVM API docs to specify this.

-- 
Christoffer

* Re: KVM and variable-endianness guest CPUs
From: Alexander Graf @ 2014-01-18  7:32 UTC
  To: Christoffer Dall
  Cc: Peter Maydell, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm



> On 18.01.2014 at 05:24, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> 
>> On Fri, Jan 17, 2014 at 06:52:57PM +0000, Peter Maydell wrote:
>>> On 17 January 2014 17:53, Peter Maydell <peter.maydell@linaro.org> wrote:
>>> Specifically, the KVM API says "here's a uint8_t[] byte
>>> array and a length", and the current QEMU code treats that
>>> as "this is a byte array written as if the guest CPU
>>> (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
>>> I/O access to this buffer rather than to the device".
>>> 
>>> The KVM API docs don't actually specify the endianness
>>> semantics of the byte array, but I think that that really
>>> needs to be nailed down. I can think of a couple of options:
>>> * always LE
>>> * always BE
>>>   [these first two are non-starters because they would
>>>   break either x86 or PPC existing code]
>>> * always the endianness the guest is at the time
>>> * always some arbitrary endianness based purely on the
>>>   endianness the KVM implementation used historically
>>> * always the endianness of the host QEMU binary
>>> * something else?
>>> 
>>> Any preferences? Current QEMU code basically assumes
>>> "always the endianness of TARGET_WORDS_BIGENDIAN",
>>> which is pretty random.
>> 
>> Having thought a little more about this, my opinion is:
>> 
>> * we should specify that the byte order of the mmio.data
>>   array is host kernel endianness (ie same endianness
>>   as the QEMU process itself) [this is what it actually
>>   is, I think, for all the cases that work today]
> 
> I completely agree, given that it's too late to be set on always LE/BE,
> I think the natural choice is something that allows a user to cast the
> byte array to an appropriate pointer type and dereference it.
> 
> And I think we need to amend the KVM API docs to specify this.

I don't see the problem. For ppc we always do mmio emulation as if the cpu was big endian. We've had an is_bigendian variable for that since the very first versions.


Alex

> 
> -- 
> Christoffer

* Re: KVM and variable-endianness guest CPUs
From: Peter Maydell @ 2014-01-18 10:15 UTC
  To: Alexander Graf
  Cc: Christoffer Dall, Thomas Falcon, QEMU Developers, qemu-ppc,
	kvm-devel, kvmarm

On 18 January 2014 07:32, Alexander Graf <agraf@suse.de> wrote:
>> On 18.01.2014 at 05:24, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>>> On Fri, Jan 17, 2014 at 06:52:57PM +0000, Peter Maydell wrote:
>>> Having thought a little more about this, my opinion is:
>>>
>>> * we should specify that the byte order of the mmio.data
>>>   array is host kernel endianness (ie same endianness
>>>   as the QEMU process itself) [this is what it actually
>>>   is, I think, for all the cases that work today]
>>
>> I completely agree, given that it's too late to be set on always LE/BE,
>> I think the natural choice is something that allows a user to cast the
>> byte array to an appropriate pointer type and dereference it.
>>
>> And I think we need to amend the KVM API docs to specify this.
>
> I don't see the problem.

The problem is that (a) the docs aren't clear about the semantics,
and (b) people have picked whatever behaviour suited them to
implement, without documenting what it was.

> For ppc we always do mmio emulation
> as if the cpu was big endian.

Even if the guest, the host kernel and QEMU in userspace are
all little endian?

Also "mmio emulation as if the CPU was big endian"
doesn't make sense -- MMIO emulation doesn't depend
on CPU endianness.

> We've had an is_bigendian variable
> for that since the very first versions.

Where? In the kernel? In QEMU? What does it control?

thanks
-- PMM

* Re: KVM and variable-endianness guest CPUs
From: Alexander Graf @ 2014-01-20 14:20 UTC
  To: Peter Maydell
  Cc: Christoffer Dall, Thomas Falcon, QEMU Developers, qemu-ppc,
	kvm-devel, kvmarm


On 18.01.2014, at 11:15, Peter Maydell <peter.maydell@linaro.org> wrote:

> On 18 January 2014 07:32, Alexander Graf <agraf@suse.de> wrote:
>>> On 18.01.2014 at 05:24, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>>>> On Fri, Jan 17, 2014 at 06:52:57PM +0000, Peter Maydell wrote:
>>>> Having thought a little more about this, my opinion is:
>>>> 
>>>> * we should specify that the byte order of the mmio.data
>>>>  array is host kernel endianness (ie same endianness
>>>>  as the QEMU process itself) [this is what it actually
>>>>  is, I think, for all the cases that work today]
>>> 
>>> I completely agree, given that it's too late to be set on always LE/BE,
>>> I think the natural choice is something that allows a user to cast the
>>> byte array to an appropriate pointer type and dereference it.
>>> 
>>> And I think we need to amend the KVM API docs to specify this.
>> 
>> I don't see the problem.
> 
> The problem is that (a) the docs aren't clear about the semantics,
> and (b) people have picked whatever behaviour suited them to
> implement, without documenting what it was.

I think I see the problem now. You're thinking about LE hosts, not LE guests.

I think the only really sensible options would be to

  a) Always use a statically defined target endianness (big for ppc)
  b) Always use host endianness

Currently QEMU apparently implements a), but that can easily be changed. Today we don't have kvm support for ppc64le hosts yet.

I personally prefer b). It's the natural thing to do for a host interface to be in host endianness and it's exactly what we expose for LE-on-BE systems with ppc already.

> 
>> For ppc we always do mmio emulation
>> as if the cpu was big endian.
> 
> Even if the guest, the host kernel and QEMU in userspace are
> all little endian?
> 
> Also "mmio emulation as if the CPU was big endian"
> doesn't make sense -- MMIO emulation doesn't depend
> on CPU endianness.
> 
>> We've had an is_bigendian variable
>> for that since the very first versions.
> 
> Where? In the kernel? In QEMU? What does it control?

In KVM. Check out https://github.com/agraf/linux-2.6/commit/1c00e7c21e39e20be7b03b111d5ab90ce938f108
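
Paraphrasing what that code does (from memory, not verbatim):
mmio.data is filled in big-endian order, and when a load completes
the value is byteswapped into the guest register if the access was
not a big-endian one:

    #include <stdint.h>

    static uint64_t complete_mmio_load(uint64_t gpr, unsigned len,
                                       int is_bigendian)
    {
        if (!is_bigendian) {
            switch (len) {
            case 8: gpr = __builtin_bswap64(gpr); break;
            case 4: gpr = __builtin_bswap32(gpr); break;
            case 2: gpr = __builtin_bswap16(gpr); break;
            }
        }
        return gpr;
    }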


Alex


* Re: KVM and variable-endianness guest CPUs
From: Alexander Graf @ 2014-01-20 14:22 UTC
  To: Peter Maydell
  Cc: Thomas Falcon, QEMU Developers, qemu-ppc, kvm-devel,
	Christoffer Dall, kvmarm


On 17.01.2014, at 19:52, Peter Maydell <peter.maydell@linaro.org> wrote:

> On 17 January 2014 17:53, Peter Maydell <peter.maydell@linaro.org> wrote:
>> Specifically, the KVM API says "here's a uint8_t[] byte
>> array and a length", and the current QEMU code treats that
>> as "this is a byte array written as if the guest CPU
>> (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
>> I/O access to this buffer rather than to the device".
>> 
>> The KVM API docs don't actually specify the endianness
>> semantics of the byte array, but I think that that really
>> needs to be nailed down. I can think of a couple of options:
>> * always LE
>> * always BE
>>   [these first two are non-starters because they would
>>   break either x86 or PPC existing code]
>> * always the endianness the guest is at the time
>> * always some arbitrary endianness based purely on the
>>   endianness the KVM implementation used historically
>> * always the endianness of the host QEMU binary
>> * something else?
>> 
>> Any preferences? Current QEMU code basically assumes
>> "always the endianness of TARGET_WORDS_BIGENDIAN",
>> which is pretty random.
> 
> Having thought a little more about this, my opinion is:
> 
> * we should specify that the byte order of the mmio.data
>   array is host kernel endianness (ie same endianness
>   as the QEMU process itself) [this is what it actually
>   is, I think, for all the cases that work today]
> * we should fix the code path in QEMU for handling
>   mmio.data which currently has the implicit assumption
>   that when using KVM TARGET_WORDS_BIGENDIAN is the same
>   as the QEMU host process endianness (because it's using
>   load/store functions which swap if TARGET_WORDS_BIGENDIAN
>   is different from HOST_WORDS_BIGENDIAN)

Yes, I fully agree :).


Alex


* Re: KVM and variable-endianness guest CPUs
From: Peter Maydell @ 2014-01-20 14:31 UTC
  To: Alexander Graf
  Cc: Christoffer Dall, Thomas Falcon, QEMU Developers, qemu-ppc,
	kvm-devel, kvmarm

On 20 January 2014 14:20, Alexander Graf <agraf@suse.de> wrote:
> I think I see the problem now. You're thinking about LE hosts, not LE guests.
>
> I think the only really sensible options would be to
>
>   a) Always use a statically defined target endianness (big for ppc)
>   b) Always use host endianness

> Currently QEMU apparently implements a), but that can
> easily be changed. Today we don't have kvm support for
> ppc64le hosts yet.

Yes; I would ideally like us to be able to get rid of that
statically defined target endianness eventually, so if we
have the leeway to define the kernel<->userspace ABI in a
way that doesn't care about the current guest CPU endianness
(ie we haven't actually yet claimed support for
reverse-endianness guests in a way that locks us into an
unhelpful definition of the ABI) we should take it while
we still can.

Then the current QEMU restrictions boil down to "you can
only use QEMU for KVM on a host kernel with the same
endianness as QEMU's legacy TARGET_WORDS_BIGENDIAN
setting for that CPU" (but such a QEMU can deal with
guests whatever they do with the endianness control bits).

> I personally prefer b). It's the natural thing to do for
> a host interface to be in host endianness and it's exactly
> what we expose for LE-on-BE systems with ppc already.

Yes. Strictly speaking by "host endianness" here I guess
we mean "the endianness of the kernel-to-userspace ABI",
since it is at least in theory possible to have an LE
kernel which runs BE userspace processes.

thanks
-- PMM

* Re: KVM and variable-endianness guest CPUs
From: Christoffer Dall @ 2014-01-20 19:19 UTC
  To: Alexander Graf
  Cc: Peter Maydell, Thomas Falcon, QEMU Developers, qemu-ppc,
	kvm-devel, kvmarm

On Mon, Jan 20, 2014 at 03:22:11PM +0100, Alexander Graf wrote:
> 
> On 17.01.2014, at 19:52, Peter Maydell <peter.maydell@linaro.org> wrote:
> 
> > On 17 January 2014 17:53, Peter Maydell <peter.maydell@linaro.org> wrote:
> >> Specifically, the KVM API says "here's a uint8_t[] byte
> >> array and a length", and the current QEMU code treats that
> >> as "this is a byte array written as if the guest CPU
> >> (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
> >> I/O access to this buffer rather than to the device".
> >> 
> >> The KVM API docs don't actually specify the endianness
> >> semantics of the byte array, but I think that that really
> >> needs to be nailed down. I can think of a couple of options:
> >> * always LE
> >> * always BE
> >>   [these first two are non-starters because they would
> >>   break either x86 or PPC existing code]
> >> * always the endianness the guest is at the time
> >> * always some arbitrary endianness based purely on the
> >>   endianness the KVM implementation used historically
> >> * always the endianness of the host QEMU binary
> >> * something else?
> >> 
> >> Any preferences? Current QEMU code basically assumes
> >> "always the endianness of TARGET_WORDS_BIGENDIAN",
> >> which is pretty random.
> > 
> > Having thought a little more about this, my opinion is:
> > 
> > * we should specify that the byte order of the mmio.data
> >   array is host kernel endianness (ie same endianness
> >   as the QEMU process itself) [this is what it actually
> >   is, I think, for all the cases that work today]
> > * we should fix the code path in QEMU for handling
> >   mmio.data which currently has the implicit assumption
> >   that when using KVM TARGET_WORDS_BIGENDIAN is the same
> >   as the QEMU host process endianness (because it's using
> >   load/store functions which swap if TARGET_WORDS_BIGENDIAN
> >   is different from HOST_WORDS_BIGENDIAN)
> 
> Yes, I fully agree :).
> 
Great, I'll prepare a patch for the KVM API documentation.

-Christoffer

* Re: KVM and variable-endianness guest CPUs
From: Victor Kamensky @ 2014-01-22  5:39 UTC
  To: Christoffer Dall
  Cc: Alexander Graf, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm

Hi Guys,

Christoffer and I had a bit of a heated chat :) on this
subject last night. Christoffer, I really appreciate
your time! We did not really reach agreement
during the chat, and Christoffer asked me to follow
up on this thread.
Here it goes. Sorry, it is a very long email.

I don't believe we can assign any endianity to the
mmio.data[] byte array. I believe mmio.data[] and
mmio.len act just like memcpy, and that is all. As
memcpy does not imply any endianity of the underlying
data, mmio.data[] should not either.

Here is my definition:

mmio.data[] is an array of bytes that contains memory
bytes in such a form that, for the read case, if those
bytes were placed in guest memory and the guest executed
the same read access instruction against that memory,
the result would be the same as a real h/w device
memory access. The rest of the KVM host and the hypervisor
part of the code should take care of the mmio.data[]
memory so that it is delivered to vcpu registers and
restored by the hypervisor part in such a way that the guest
CPU register value is the same as it would be for a real,
non-emulated h/w read access (that is the emulation part).
The same goes for write access: if the guest writes into
memory and those bytes are just copied to the emulated
h/w register, it has the same effect as a real
mapped h/w register write.

In shorter form, i.e. for a len=4 access: the endianity of the
integer at the &mmio.data[0] address should match the endianity
of the emulated h/w device behind the phys_addr address,
regardless of the endianity of the emulator, KVM host,
hypervisor, and guest.

Examples that illustrate my definition
--------------------------------------

1) LE guest (E bit is off in ARM speak) reads integer
(4 bytes) from mapped h/w LE device register -
mmio.data[3] contains MSB, mmio.data[0] contains LSB.

2) BE guest (E bit is on in ARM speak) reads integer
from mapped h/w LE device register - mmio.data[3]
contains MSB, mmio.data[0] contains LSB. Note that
if &mmio.data[0] memory would be placed in guest
address space and instruction restarted with new
address, then it would meet BE guest expectations
- the guest knows that it reads LE h/w so it will byteswap
register before processing it further. This is BE guest ARM
case (regardless of what KVM host endianity is).

3) BE guest reads integer from mapped h/w BE device
register - mmio.data[0] contains MSB, mmio.data[3]
contains LSB. Note that if &mmio.data[0] memory would
be placed in guest address space and instruction
restarted with new address, then it would meet BE
guest expectation - the guest knows that it reads
BE h/w so it will proceed further without any other
work. I guess, it is BE ppc case.
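
A sketch of what this means on the emulator side, assuming 'val'
holds the device register value in emulator-native order (helper
name illustrative):

    #include <stdint.h>
    #include <string.h>

    /* Fill mmio.data with the device's bytes in the device's own
     * order, regardless of emulator endianity: swap only when the
     * device and host byte orders disagree, then copy the bytes
     * out verbatim. */
    static void fill_mmio_data(uint8_t data[8], uint32_t val,
                               int dev_is_be)
    {
        int host_is_be = (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__);

        if (dev_is_be != host_is_be) {
            val = __builtin_bswap32(val);
        }
        memcpy(data, &val, sizeof(val));
    }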


Arguments in favor of memcpy semantics of mmio.data[]
------------------------------------------------------

x) What are the possible values of 'len'? Previous discussions
imply that it is always a power of 2. Why is that? Maybe
there will be a CPU that needs to do a 5-byte mmio
access, or 6 bytes. How do you assign endianity to
such a case? A 'len' of 5 or 6, or anything else, works fine with
memcpy semantics. I admit it is a hypothetical case, but
IMHO it tests how clean the ABI definition is.

x) A byte array does not have endianity because it
does not have any structure. If one wanted to
imply structure, why is mmio not defined in such a way
that the structure is reflected in the mmio definition?
Something like:


                /* KVM_EXIT_MMIO */
                struct {
                          __u64 phys_addr;
                          union {
                               __u8 byte;
                               __u16 hword;
                               __u32 word;
                               __u64 dword;
                          }  data;
                          __u32 len;
                          __u8  is_write;
                } mmio;

where len really serves as the union discriminator and
the only allowed len values are 1, 2, 4, 8.
In this case, I agree, the endianity of the integer types
should be defined. I believe the use of a byte array strongly
implies that the original intent was to have the semantics of
a byte stream copy, just like memcpy does.

x) Note there is nothing wrong with a user/kernel ABI that
uses just a byte stream as a parameter. There are already
precedents, like the 'read' and 'write' system calls :).

x) Consider the case when KVM works with emulated memory-mapped
h/w devices where some devices operate in LE mode and others
operate in BE mode. Which mode it is, is defined by the semantics
of the real h/w device, and should be emulated by the emulator and
KVM given all the other context. As far as the mmio.data[] array is
concerned, if the same integer value is read from the registers of
these two devices, the mmio.data[] memory should contain the
integer in opposite endianity for the two cases, i.e. the MSB is
data[0] in one case and the MSB is data[3] in the other. It cannot
be the same, because apart from the emulator and the guest kernel,
everything else, like the KVM host and hypervisor, has no clue what
the endianity of the device actually is - it should treat
mmio.data[] the same way in both cases. But the resulting guest
target CPU register would need to contain the normal integer value
in one case and a byteswapped one in the other, because the guest
kernel would use it directly in one case and byteswap it in the
other. Byte stream semantics allows exactly that. I don't see how
it could work if you fix the mmio.data[] endianity in such a way
that it contains the integer in the same format for both BE and LE
emulated device types.

If by this point you agree that the mmio.data[] user-land/kernel
ABI semantics should be just memcpy, stop reading :). If not,
you may want to take a look at the appendix below, where I
describe in great detail the endianity of data at different
points along the mmio processing code path of the existing ARM LE
KVM and the proposed ARM BE KVM. Note the appendix is very long and
very detailed, sorry about that, but I feel that earlier, more
digested explanations failed, and that has driven me to write out
all the details as I see them. If I am wrong, I hope this makes it
easier for folks to point at the places in the detailed explanation
where my logic goes bad. Also, I am not sure whether this
mail thread is a good place to discuss all the details described
in the appendix. Christoffer, please advise whether I should take
that one back to [1]. But I hope this bigger picture may help to
see the mmio.data[] semantics issue in context.

More inline, and the appendix is at the end.

On 20 January 2014 11:19, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> On Mon, Jan 20, 2014 at 03:22:11PM +0100, Alexander Graf wrote:
>>
>> On 17.01.2014, at 19:52, Peter Maydell <peter.maydell@linaro.org> wrote:
>>
>> > On 17 January 2014 17:53, Peter Maydell <peter.maydell@linaro.org> wrote:
>> >> Specifically, the KVM API says "here's a uint8_t[] byte
>> >> array and a length", and the current QEMU code treats that
>> >> as "this is a byte array written as if the guest CPU
>> >> (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
>> >> I/O access to this buffer rather than to the device".
>> >>
>> >> The KVM API docs don't actually specify the endianness
>> >> semantics of the byte array, but I think that that really
>> >> needs to be nailed down. I can think of a couple of options:
>> >> * always LE
>> >> * always BE
>> >>   [these first two are non-starters because they would
>> >>   break either x86 or PPC existing code]
>> >> * always the endianness the guest is at the time
>> >> * always some arbitrary endianness based purely on the
>> >>   endianness the KVM implementation used historically
>> >> * always the endianness of the host QEMU binary
>> >> * something else?
>> >>
>> >> Any preferences? Current QEMU code basically assumes
>> >> "always the endianness of TARGET_WORDS_BIGENDIAN",
>> >> which is pretty random.
>> >
>> > Having thought a little more about this, my opinion is:
>> >
>> > * we should specify that the byte order of the mmio.data
>> >   array is host kernel endianness (ie same endianness
>> >   as the QEMU process itself) [this is what it actually
>> >   is, I think, for all the cases that work today]

In the above, please consider two types of mapped emulated
h/w devices: BE and LE. They cannot have mmio.data in the
same endianity. Currently, in all observable cases, LE ARM
and BE PPC device endianity matches the kernel/qemu
endianity, but it would break when BE ARM is introduced,
or LE PPC, or when one starts emulating BE devices on LE
ARM.

>> > * we should fix the code path in QEMU for handling
>> >   mmio.data which currently has the implicit assumption
>> >   that when using KVM TARGET_WORDS_BIGENDIAN is the same
>> >   as the QEMU host process endianness (because it's using
>> >   load/store functions which swap if TARGET_WORDS_BIGENDIAN
>> >   is different from HOST_WORDS_BIGENDIAN)

I do not follow the above. Maybe I am missing the bigger context.
Which CPU is under discussion above? On an ARM V7 system,
when an LE device is accessed as an integer, the &mmio.data[0]
address would contain an integer in LE format, i.e. mmio.data[0]
is the LSB.

Here is a gdb session of LE qemu running on a V7 LE kernel with a
TC1 LE guest. The guest kernel accesses the sys_cfgstat register,
which is one of the arm_sysctl registers, at offset 0xa8. Note that
arm_sysctl is a memory-mapped LE device.
Please check the run->mmio structure after the read
(cpu_physical_memory_rw) completes: it is a 4-byte integer in
LE format, mmio.data[0] is the LSB and is equal to 1
(the s->sys_cfgstat value):

(gdb) bt
#0  arm_sysctl_read (opaque=0x95a600, offset=168, size=4) at
/home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
#1  0x0023b9b4 in memory_region_read_accessor (mr=0x95b8e0,
addr=<optimized out>, value=0xb5c0dc18, size=4, shift=0,
mask=4294967295)
    at /home/root/20131219/qemu-be/memory.c:407
#2  0x0023aba4 in access_with_adjusted_size (addr=4294967295,
value=0xb5c0dc18, value@entry=0xb5c0dc10, size=size@entry=4,
access_size_min=1,
    access_size_max=2357596, access=access@entry=0x23b96c
<memory_region_read_accessor>, mr=mr@entry=0x95b8e0) at
/home/root/20131219/qemu-be/memory.c:477
#3  0x0023f95c in memory_region_dispatch_read1 (size=4, addr=168,
mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:944
#4  memory_region_dispatch_read (size=4, pval=0xb5c0dc68, addr=168,
mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:966
#5  io_mem_read (mr=mr@entry=0x95b8e0, addr=<optimized out>,
pval=pval@entry=0xb5c0dc68, size=size@entry=4) at
/home/root/20131219/qemu-be/memory.c:1743
#6  0x001abd38 in address_space_rw (as=as@entry=0x8102d8
<address_space_memory>, addr=469827752, buf=buf@entry=0xb6fd6028 "",
len=4, is_write=false,
    is_write@entry=true) at /home/root/20131219/qemu-be/exec.c:2025
#7  0x001abf90 in cpu_physical_memory_rw (addr=<optimized out>,
buf=buf@entry=0xb6fd6028 "", len=<optimized out>, is_write=0)
    at /home/root/20131219/qemu-be/exec.c:2070
#8  0x00239e00 in kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
/home/root/20131219/qemu-be/kvm-all.c:1701
#9  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
/home/root/20131219/qemu-be/cpus.c:874
#10 0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
#11 0xb69f5070 in ?? () at
../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
#12 0xb69f5070 in ?? () at
../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) p /x s->sys_cfgstat
$25 = 0x1
(gdb) finish
Run till exit from #0  arm_sysctl_read (opaque=0x95a600, offset=168,
size=4) at /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
memory_region_read_accessor (mr=0x95b8e0, addr=<optimized out>,
value=0xb5c0dc18, size=4, shift=0, mask=4294967295) at
/home/root/20131219/qemu-be/memory.c:408
408        trace_memory_region_ops_read(mr, addr, tmp, size);
Value returned is $26 = 1
(gdb) enable 2
(gdb) cont
Continuing.

Breakpoint 2, kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
/home/root/20131219/qemu-be/kvm-all.c:1660
1660            kvm_arch_pre_run(cpu, run);
(gdb) bt
#0  kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
/home/root/20131219/qemu-be/kvm-all.c:1660
#1  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
/home/root/20131219/qemu-be/cpus.c:874
#2  0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
#3  0xb69f5070 in ?? () at
../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
#4  0xb69f5070 in ?? () at
../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) p /x run->mmio
$27 = {phys_addr = 0x1c0100a8, data = {0x1, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0}, len = 0x4, is_write = 0x0}

Also, please look at the adjust_endianness function and the
struct MemoryRegion 'endianness' field. IMHO in qemu it
works quite nicely already. MemoryRegion 'read' and 'write'
callbacks return/get data in native format; the adjust_endianness
function checks whether the emulated device endianness matches
the emulator endianness, and if they differ it byteswaps
according to the size. As in the above example, the arm_sysctl_ops
memory region should be marked as DEVICE_LITTLE_ENDIAN; when it
returns the s->sys_cfgstat value, LE qemu sees that the endianity
matches and does not byteswap the result, so the integer at
the &mmio.data[0] address is in LE form. When qemu runs
in BE mode on a BE kernel, it sees that the endianity
mismatches and byteswaps the native (BE) s->sys_cfgstat value,
so mmio.data contains the integer in LE format again.

Note that in the currently committed code the arm_sysctl_ops
endianity is DEVICE_NATIVE_ENDIAN, which is wrong - the real
vexpress arm_sysctl device always gives/receives data in LE format
regardless of the current CPSR E bit value, so it cannot be marked
as NATIVE. LE and BE kernels always read it as an LE device; a BE
kernel follows up with a byteswap. It was OK while we only ran qemu
in LE, but it should be fixed to LITTLE_ENDIAN for BE qemu to work
correctly ... and actually, changing that device and a few other
ARM-specific devices' endianity to LITTLE_ENDIAN was the only
change needed in qemu to make BE KVM work.
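
The fix itself is a one-liner per device; a sketch against
hw/misc/arm_sysctl.c (the read/write callbacks are the existing
ones in the qemu tree):

    static const MemoryRegionOps arm_sysctl_ops = {
        .read = arm_sysctl_read,
        .write = arm_sysctl_write,
        .endianness = DEVICE_LITTLE_ENDIAN, /* was DEVICE_NATIVE_ENDIAN */
    };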

>>
>> Yes, I fully agree :).
>>
> Great, I'll prepare a patch for the KVM API documentation.
>
> -Christoffer
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm

Thanks,
Victor

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-January/thread.html#223186


    Appendix
    Data path endianity in ARM KVM mmio
    ===================================

This writeup considers several scenarios and tracks the endianity
of data as it travels from the emulator to the guest CPU register,
in the case of ARM KVM. It starts with the currently committed code
for the LE KVM host case and then discusses the proposed BE KVM
host arrangement.

To restrict the discussion, the writeup considers the code path of
an integer (4-byte) read from h/w mapped emulated device memory,
and the endianity at the essential places involved in such a
code path.

For all cases where endianity is defined, it is assumed that the
values under consideration are in memory (as opposed to in a
register, which does not have endianity). I.e. even if a function
variable could actually be allocated in a CPU register, the writeup
will refer to it as if it were in memory, just to keep the
discussion clean, except for the final guest CPU register.

Let's consider the following places along the data path from
the emulator to the guest CPU register:

1) emulator code that holds the integer value to be read; assume
it is a global 'int emulated_hw_device_val' variable.
Normally in the emulator it is held in native endian format, i.e.
the emulator's CPSR E bit is the same as the kernel's CPSR E bit.
Just for discussion's sake, assume that this h/w device register
holds 5 as its value.

2) the KVM_EXIT_MMIO part of the 'struct kvm_run' structure, i.e.
the mmio.data byte array. A byte array does not have endianity,
but for this discussion we track the endianity of the integer
at the &mmio.data[0] address

3) the 'data' variable of type 'unsigned long' in the
kvm_handle_mmio_return function, before the vcpu_data_host_to_guest
call. The KVM host mmio_read_buf function is used to fill this
variable from the mmio.data buffer. mmio_read_buf actually
acts as a memcpy from the mmio.data buffer address,
just taking the access size into account.

4) the same 'data' variable as above, but after the
vcpu_data_host_to_guest function call, just before it is copied
to the vcpu_reg target register location. Note the
vcpu_data_host_to_guest function may byteswap the value of 'data'
depending on the current KVM host endianity and the value of the
guest CPSR E bit.

5) the guest CPU spilled-register array, at the location of the
target register, i.e. the integer at the
vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt) address

6) finally, the guest CPU register filled from vcpu_reg just before
the guest resumes execution of the trapped emulated instruction.
Note this is done by the hypervisor part of the code, and the
hypervisor EE bit is the same as the KVM host CPSR E bit.

Note again: the KVM host, the emulator, and the hypervisor part of
the code (the guest CPU register save and restore code) always run
in the same endianity. The endianity of the accessed emulated
devices and the endianity of the guest vary independently of the
KVM host endianity.
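
As a reference for places 3) and 4), here is a userspace paraphrase
of what mmio_read_buf plus vcpu_data_host_to_guest do for a 4-byte
access (illustrative, not the kernel code verbatim):

    #include <endian.h>
    #include <stdint.h>
    #include <string.h>

    /* Place 3: mmio_read_buf behaves like a plain memcpy. */
    static uint32_t read_mmio_buf(const uint8_t *mmio_data)
    {
        uint32_t data;

        memcpy(&data, mmio_data, sizeof(data));
        return data;
    }

    /* Place 4: byteswap iff the guest access had CPSR E set;
     * htobe32() is a no-op on a BE host, a swap on an LE host. */
    static uint32_t data_host_to_guest(uint32_t data, int guest_cpsr_e)
    {
        return guest_cpsr_e ? htobe32(data) : data;
    }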

The sections below consider all permutations of all possible cases;
it may be quite boring to read. I've created a summary table at
the end; you can jump to the table after reading a few cases.
But if you have objections and you see things happening differently,
please comment inline on the use case steps.

LE KVM host
===========

Use case 1
----------

Emulated h/w device gives data in LE form; emulator and KVM
host endianness is LE (host CPSR E bit is off); guest is compiled
in LE mode; and the guest does the access with CPSR E bit off

1) 'emulated_hw_device_val' emulator variable is LE
2) &mmio.data[0] holds the integer in LE format, matching device
endianness
3) 'data' is LE
4) 'data' is LE (since guest CPSR E bit is off, no byteswap)
5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
6) final guest target CPU register contains 5 (0x00000005)

guest resumes execution ... let's say after an 'ldr r1, [r0]'
instruction, where r0 holds the address of the device; the guest
knows that it reads LE mapped h/w, so no additional processing is
needed

Use case 2
----------

Emulated h/w device gives data in LE form; emulator and KVM
host endianness is LE (host CPSR E bit is off); guest is compiled
in BE mode; and the guest does the access with CPSR E bit on

1) 'emulated_hw_device_val' emulator variable is LE
2) &mmio.data[0] holds the integer in LE format, matching device
endianness
3) 'data' is LE
4) 'data' is BE (since guest CPSR E bit is on, vcpu_data_host_to_guest
will byteswap: cpu_to_be)
5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
6) final guest target CPU register contains 0x05000000

guest resumes execution after 'ldr r1, [r0]'; the guest kernel
knows that it runs in BE mode (E bit on) and that it reads
LE device memory, so it needs to byteswap r1 before further
processing: it does 'rev r1, r1' and proceeds with the result
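
A byte-level walk through use case 2, with the assumed device
value 5:

    mmio.data[]                 = {0x05, 0x00, 0x00, 0x00}  LE, matches device
    'data' value after step 3)  = 0x00000005  bytes read as LE on the LE host
    'data' value after step 4)  = 0x05000000  cpu_to_be swap; spilled to
                                              vcpu_reg as {0x00,0x00,0x00,0x05}
    guest r1 after trapped ldr  = 0x05000000  LE hypervisor loads those bytes
    guest r1 after 'rev r1, r1' = 0x00000005  guest recovers the device value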

Use case 3
----------

Emulated h/w device gives data in BE form; emulator and KVM
host endianness is LE (host CPSR E bit is off); guest is compiled
in LE mode; and the guest does the access with CPSR E bit off

1) 'emulated_hw_device_val' emulator variable is LE
2) &mmio.data[0] holds the integer in BE format; the emulator
byteswaps it because it knows that the device endianness is opposite
to native, and the buffer should match device endianness
3) 'data' is BE
4) 'data' is BE (since guest CPSR E bit is off, no byteswap)
5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
6) final guest target CPU register contains 0x05000000

guest resumes execution after 'ldr r1, [r0]'; the guest kernel
knows that it runs in LE mode (E bit off) and that it reads
BE device memory, so it needs to byteswap r1 before further
processing: it does 'rev r1, r1' and proceeds with the result

Use case 4
----------

Emulated h/w device gives data in BE form; emulator and KVM
host endianness is LE (host CPSR E bit is off); guest is compiled
in BE mode; and the guest does the access with CPSR E bit on

1) 'emulated_hw_device_val' emulator variable is LE
2) &mmio.data[0] holds the integer in BE format; the emulator
byteswaps it because it knows that the device endianness is opposite
to native, and the buffer should match device endianness
3) 'data' is BE
4) 'data' is LE (since guest CPSR E bit is on, vcpu_data_host_to_guest
will byteswap: cpu_to_be)
5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
6) final guest target CPU register contains 5 (0x00000005)

guest resumes execution after 'ldr r1, [r0]'; the guest kernel
knows that it runs in BE mode and that it reads BE device
memory, so it does not need to do anything before further
processing.


The above use cases are exactly what we have now after Marc's
commit to support BE guests on an LE KVM host. The further use
cases describe how it would work with the BE KVM patches I proposed.
It is understood that this is subject to further discussion.


BE KVM host
===========

Use case 5
----------

Emulated h/w device gives data in LE form; emulator and KVM
host endianness is BE (host CPSR E bit is on); guest is compiled
in BE mode; and the guest does the access with CPSR E bit on

1) 'emulated_hw_device_val' emulator variable is BE
2) &mmio.data[0] holds the integer in LE format; the emulator
byteswaps it because it knows that the device endianness is opposite
to native; matches device endianness
3) 'data' is LE
4) 'data' is LE (since guest CPSR E bit is on, the BE KVM host kernel
does *not* byteswap: cpu_to_be has no effect in a BE host kernel)
5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
6) final guest target CPU register contains 0x05000000, because the
hypervisor runs in BE mode, so the load of an LE integer yields a
byteswapped value in the register

guest resumes execution after 'ldr r1, [r0]'; the guest kernel
knows that it runs in BE mode and that it reads LE device
memory, so it needs to byteswap r1 before further processing:
it does 'rev r1, r1' and proceeds with the result
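
The same byte-level walk for use case 5 (the BE host mirror of
use case 2), again with the assumed device value 5:

    mmio.data[]                 = {0x05, 0x00, 0x00, 0x00}  LE; the emulator
                                              already byteswapped its BE value
    'data' value after step 3)  = 0x05000000  bytes read as BE on the BE host
    'data' value after step 4)  = 0x05000000  cpu_to_be is a no-op here;
                                              spilled as {0x05,0x00,0x00,0x00}
    guest r1 after trapped ldr  = 0x05000000  BE hypervisor loads those bytes
    guest r1 after 'rev r1, r1' = 0x00000005  guest recovers the device value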

Use case 6
----------

Emulated h/w device gives data in LE form; emulator and KVM
host endianness is BE (host CPSR E bit is on); guest is compiled
in LE mode; and the guest does the access with CPSR E bit off

1) 'emulated_hw_device_val' emulator variable is BE
2) &mmio.data[0] holds the integer in LE format; the emulator
byteswaps it because it knows that the device endianness is opposite
to native; matches device endianness
3) 'data' is LE
4) 'data' is BE (since guest CPSR E bit is off, the BE KVM host kernel
does byteswap: cpu_to_le)
5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
6) final guest target CPU register contains 5 (0x00000005), because
the hypervisor runs in BE mode, so the load of a BE integer is OK

guest resumes execution after 'ldr r1, [r0]'; the guest kernel
knows that it runs in LE mode and that it reads LE device
memory, so it does not need to do anything else; it just proceeds

Use case 7
----------

Emulated h/w device gives data in BE form; emulator and KVM
host endianness is BE (host CPSR E bit is on); guest is compiled
in BE mode; and the guest does the access with CPSR E bit on

1) 'emulated_hw_device_val' emulator variable is BE
2) &mmio.data[0] holds the integer in BE format, matching device
endianness
3) 'data' is BE
4) 'data' is BE (since guest CPSR E bit is on, the BE KVM host kernel
does *not* byteswap: cpu_to_be has no effect in a BE host kernel)
5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
6) final guest target CPU register contains 5 (0x00000005), because
the hypervisor runs in BE mode, so the load of a BE integer is OK

guest resumes execution after 'ldr r1, [r0]'; the guest kernel
knows that it runs in BE mode and that it reads BE device
memory, so it does not need to do anything else; it just proceeds

Use case 8
----------

Emulated h/w device gives data in BE form; emulator and KVM
host endianness is BE (host CPSR E bit is on); guest is compiled
in LE mode; and the guest does the access with CPSR E bit off

1) 'emulated_hw_device_val' emulator variable is BE
2) &mmio.data[0] holds the integer in BE format, matching device
endianness
3) 'data' is BE
4) 'data' is LE (since guest CPSR E bit is off, the BE KVM host kernel
does byteswap: cpu_to_le)
5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
6) final guest target CPU register contains 0x05000000, because the
hypervisor runs in BE mode, so the load of an LE integer yields a
byteswapped value in the register

guest resumes execution after 'ldr r1, [r0]'; the guest kernel
knows that it runs in LE mode and that it reads BE device
memory, so it needs to byteswap r1 before further processing:
it does 'rev r1, r1' and proceeds with the result

Note that with a BE kernel we actually have an initial portion of
assembler code that is executed with the CPSR E bit off and reads
LE h/w - i.e it falls into use case 1.

Summary Table (please use fixed font to see it correctly)
=========================================================

--------------------------------------------------------------
| Use Case # | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   |
--------------------------------------------------------------
| KVM Host,  | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
| Emulator,  |     |     |     |     |     |     |     |     |
| Hypervisor |     |     |     |     |     |     |     |     |
| Endianness |     |     |     |     |     |     |     |     |
--------------------------------------------------------------
| Device     | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
| Endianness |     |     |     |     |     |     |     |     |
--------------------------------------------------------------
| Guest      | LE  | BE  | LE  | BE  | BE  | LE  | BE  | LE  |
| Access     |     |     |     |     |     |     |     |     |
| Endianness |     |     |     |     |     |     |     |     |
--------------------------------------------------------------
| Step 1)    | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
--------------------------------------------------------------
| Step 2)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
--------------------------------------------------------------
| Step 3)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
--------------------------------------------------------------
| Step 4)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
--------------------------------------------------------------
| Step 5)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
--------------------------------------------------------------
| Final Reg  | no  | yes | yes | no  | yes | no  | no  | yes |
| value      |     |     |     |     |     |     |     |     |
| byteswapped|     |     |     |     |     |     |     |     |
--------------------------------------------------------------
| Guest      | no  | yes | yes | no  | yes | no  | no  | yes |
| Follows    |     |     |     |     |     |     |     |     |
| with rev   |     |     |     |     |     |     |     |     |
--------------------------------------------------------------

A few observations
==================

x) Note the above table is symmetric w.r.t. the BE<->LE swap:
       1<-->7
       2<-->8
       3<-->5
       4<-->6

x) The &mmio.data[0] address always holds the integer in the same
format as the emulated device's endianness (see the sketch after
these observations)

x) During step 4), where the vcpu_data_host_to_guest function
is used, if the guest E bit value differs but everything else
is the same, opposite results are produced (1&2, 3&4, 5&6,
7&8)
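
As a sketch of what the &mmio.data[0] observation means for the
userspace side (standalone C; the helper name is made up, and
be32toh/le32toh come from glibc's <endian.h>):

    #include <endian.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* decode a 4-byte read result from mmio.data[] under the memcpy
     * semantics argued for above: the interpretation depends only on
     * the emulated device's endianness, never on host or guest mode */
    static uint32_t mmio_to_device_value(const uint8_t *data, bool device_is_be)
    {
        uint32_t raw;
        memcpy(&raw, data, sizeof(raw));
        return device_is_be ? be32toh(raw) : le32toh(raw);
    }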

If you reached the end :), again, thank you very much for
reading it!

- Victor

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: KVM and variable-endianness guest CPUs
  2014-01-22  5:39         ` [Qemu-devel] " Victor Kamensky
@ 2014-01-22  6:31           ` Anup Patel
  -1 siblings, 0 replies; 102+ messages in thread
From: Anup Patel @ 2014-01-22  6:31 UTC (permalink / raw)
  To: Victor Kamensky
  Cc: Christoffer Dall, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm

On Wed, Jan 22, 2014 at 11:09 AM, Victor Kamensky
<victor.kamensky@linaro.org> wrote:
> Hi Guys,
>
> Christoffer and I had a bit heated chat :) on this
> subject last night. Christoffer, really appreciate
> your time! We did not really reach agreement
> during the chat and Christoffer asked me to follow
> up on this thread.
> Here it goes. Sorry, it is very long email.
>
> I don't believe we can assign any endianity to
> mmio.data[] byte array. I believe mmio.data[] and
> mmio.len acts just memcpy and that is all. As
> memcpy does not imply any endianity of underlying
> data mmio.data[] should not either.
>
> Here is my definition:
>
> mmio.data[] is array of bytes that contains memory
> bytes in such form, for read case, that if those
> bytes are placed in guest memory and guest executes
> the same read access instruction with address to this
> memory, result would be the same as real h/w device
> memory access. Rest of KVM host and hypervisor
> part of code should really take care of mmio.data[]
> memory so it will be delivered to vcpu registers and
> restored by hypervisor part in such way that guest CPU
> register value is the same as it would be for real
> non-emulated h/w read access (that is emulation part).
> The same goes for write access, if guest writes into
> memory and those bytes are just copied to emulated
> h/w register it would have the same effect as real
> mapped h/w register write.
>
> In shorter form, i.e for len=4 access: endianity of integer
> at &mmio.data[0] address should match endianity
> of emulated h/w device behind phys_addr address,
> regardless what is endianity of emulator, KVM host,
> hypervisor, and guest
>
> Examples that illustrate my definition
> --------------------------------------
>
> 1) LE guest (E bit is off in ARM speak) reads integer
> (4 bytes) from mapped h/w LE device register -
> mmio.data[3] contains MSB, mmio.data[0] contains LSB.
>
> 2) BE guest (E bit is on in ARM speak) reads integer
> from mapped h/w LE device register - mmio.data[3]
> contains MSB, mmio.data[0] contains LSB. Note that
> if &mmio.data[0] memory would be placed in guest
> address space and instruction restarted with new
> address, then it would meet BE guest expectations
> - the guest knows that it reads LE h/w so it will byteswap
> register before processing it further. This is BE guest ARM
> case (regardless of what KVM host endianity is).
>
> 3) BE guest reads integer from mapped h/w BE device
> register - mmio.data[0] contains MSB, mmio.data[3]
> contains LSB. Note that if &mmio.data[0] memory would
> be placed in guest address space and instruction
> restarted with new address, then it would meet BE
> guest expectation - the guest knows that it reads
> BE h/w so it will proceed further without any other
> work. I guess, it is BE ppc case.
>
>
> Arguments in favor of memcpy semantics of mmio.data[]
> ------------------------------------------------------
>
> x) What are possible values of 'len'? Previous discussions
> imply that is always powers of 2. Why is that? Maybe
> there will be CPU that would need to do 5 bytes mmio
> access, or 6 bytes. How do you assign endianity to
> such case? 'len' 5 or 6, or any works fine with
> memcpy semantics. I admit it is hypothetical case, but
> IMHO it tests how clean ABI definition is.
>
> x) Byte array does not have endianity because it
> does not have any structure. If one would want to
> imply structure why mmio is not defined in such way
> so structure reflected in mmio definition?
> Something like:
>
>
>                 /* KVM_EXIT_MMIO */
>                 struct {
>                           __u64 phys_addr;
>                           union {
>                                __u8 byte;
>                                __u16 hword;
>                                __u32 word;
>                                __u64 dword;
>                           }  data;
>                           __u32 len;
>                           __u8  is_write;
>                 } mmio;
>
> where len is really serves as union discriminator and
> only allowed len values are 1, 2, 4, 8.
> In this case, I agree, endianity of integer types
> should be defined. I believe, use of byte array strongly
> implies that original intent was to have semantics of
> byte stream copy, just like memcpy does.
>
> x) Note there is nothing wrong with user kernel ABI to
> use just bytes stream as parameter. There is already
> precedents like 'read' and 'write' system calls :).
>
> x) Consider case when KVM works with emulated memory mapped
> h/w devices where some devices operate in LE mode and others
> operate in BE mode. It is defined by semantics of real h/w
> device which is it, and should be emulated by emulator and KVM
> given all other context. As far as mmio.data[] array concerned, if the
> same integer value is read from these devices registers, mmio.data[]
> memory should contain integer in opposite endianity for these
> two cases, i.e MSB is data[0] in one case and MSB is
> data[3] is in another case. It cannot be the same, because
> except emulator and guest kernel, all other, like KVM host
> and hypervisor, have no clue what endianity of device
> actually is - it should treat mmio.data[] in the same way.
> But resulting guest target CPU register would need to contain
> normal integer value in one case and byteswapped in another,
> because guest kernel would use it directly in one case and
> byteswap it in another. Byte stream semantics allows to do
> that. I don't see how it could happen if you fixate mmio.data[]
> endianity in such way that it would contain integer in
> the same format for BE and LE emulated device types.
>
> If by this point you agree, that mmio.data[] user-land/kernel
> ABI semantics should be just memcpy, stop reading :). If not,
> you may would like to take a look at below appendix where I
> described in great details endianity of data at different
> points along mmio processing code path of existing ARM LE KVM,
> and proposed ARM BE KVM. Note appendix, is very long and very
> detailed, sorry about that, but I feel that earlier more
> digested explanations failed, so it driven me to write out
> all details how I see them. If I am wrong, I hope it would be
> easier for folks to point in detailed explanation places
> where my logic goes bad. Also, I am not sure whether this
> mail thread is good place to discuss all details described
> in the appendix. Christoffer, please advise whether I should take
> that one back on [1]. But I hope this bigger picture may help to
> see the mmio.data[] semantics issue in context.
>
> More inline and appendix is at the end.
>
> On 20 January 2014 11:19, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>> On Mon, Jan 20, 2014 at 03:22:11PM +0100, Alexander Graf wrote:
>>>
>>> On 17.01.2014, at 19:52, Peter Maydell <peter.maydell@linaro.org> wrote:
>>>
>>> > On 17 January 2014 17:53, Peter Maydell <peter.maydell@linaro.org> wrote:
>>> >> Specifically, the KVM API says "here's a uint8_t[] byte
>>> >> array and a length", and the current QEMU code treats that
>>> >> as "this is a byte array written as if the guest CPU
>>> >> (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
>>> >> I/O access to this buffer rather than to the device".
>>> >>
>>> >> The KVM API docs don't actually specify the endianness
>>> >> semantics of the byte array, but I think that that really
>>> >> needs to be nailed down. I can think of a couple of options:
>>> >> * always LE
>>> >> * always BE
>>> >>   [these first two are non-starters because they would
>>> >>   break either x86 or PPC existing code]
>>> >> * always the endianness the guest is at the time
>>> >> * always some arbitrary endianness based purely on the
>>> >>   endianness the KVM implementation used historically
>>> >> * always the endianness of the host QEMU binary
>>> >> * something else?
>>> >>
>>> >> Any preferences? Current QEMU code basically assumes
>>> >> "always the endianness of TARGET_WORDS_BIGENDIAN",
>>> >> which is pretty random.
>>> >
>>> > Having thought a little more about this, my opinion is:
>>> >
>>> > * we should specify that the byte order of the mmio.data
>>> >   array is host kernel endianness (ie same endianness
>>> >   as the QEMU process itself) [this is what it actually
>>> >   is, I think, for all the cases that work today]
>
> In above please consider two types of mapped emulated
> h/w devices: BE and LE they cannot have mmio.data in the
> same endianity. Currently in all observable cases LE ARM
> and BE PPC devices endianity matches kernel/qemu
> endianity but it would break when BE ARM is introduced
> or LE PPC or one would start emulating BE devices on LE
> ARM.
>
>>> > * we should fix the code path in QEMU for handling
>>> >   mmio.data which currently has the implicit assumption
>>> >   that when using KVM TARGET_WORDS_BIGENDIAN is the same
>>> >   as the QEMU host process endianness (because it's using
>>> >   load/store functions which swap if TARGET_WORDS_BIGENDIAN
>>> >   is different from HOST_WORDS_BIGENDIAN)
>
> I do not follow above. Maybe I am missing bigger context.
> What is CPU under discussion in above? On ARM V7 system
> when LE device is accessed as integer &mmio.data[0] address
> would contain integer is in LE format, ie mmio.data[0] is LSB.
>
> Here is gdb session of LE qemu running on V7 LE kernel and
> TC1 LE guest. Guest kernel accesses sys_cfgstat register which is
> arm_sysctl registers with offset of 0xa8. Note.arm_sysct is memory
> mapped LE device.
> Please check run->mmio structure after read
> (cpu_physical_memory_rw) completes it is in 4 bytes integer in
> LE format mmio.data[0] is LSB and is equal to 1
> (s->syscfgstat value):
>
> (gdb) bt
> #0  arm_sysctl_read (opaque=0x95a600, offset=168, size=4) at
> /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
> #1  0x0023b9b4 in memory_region_read_accessor (mr=0x95b8e0,
> addr=<optimized out>, value=0xb5c0dc18, size=4, shift=0,
> mask=4294967295)
>     at /home/root/20131219/qemu-be/memory.c:407
> #2  0x0023aba4 in access_with_adjusted_size (addr=4294967295,
> value=0xb5c0dc18, value@entry=0xb5c0dc10, size=size@entry=4,
> access_size_min=1,
>     access_size_max=2357596, access=access@entry=0x23b96c
> <memory_region_read_accessor>, mr=mr@entry=0x95b8e0) at
> /home/root/20131219/qemu-be/memory.c:477
> #3  0x0023f95c in memory_region_dispatch_read1 (size=4, addr=168,
> mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:944
> #4  memory_region_dispatch_read (size=4, pval=0xb5c0dc68, addr=168,
> mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:966
> #5  io_mem_read (mr=mr@entry=0x95b8e0, addr=<optimized out>,
> pval=pval@entry=0xb5c0dc68, size=size@entry=4) at
> /home/root/20131219/qemu-be/memory.c:1743
> #6  0x001abd38 in address_space_rw (as=as@entry=0x8102d8
> <address_space_memory>, addr=469827752, buf=buf@entry=0xb6fd6028 "",
> len=4, is_write=false,
>     is_write@entry=true) at /home/root/20131219/qemu-be/exec.c:2025
> #7  0x001abf90 in cpu_physical_memory_rw (addr=<optimized out>,
> buf=buf@entry=0xb6fd6028 "", len=<optimized out>, is_write=0)
>     at /home/root/20131219/qemu-be/exec.c:2070
> #8  0x00239e00 in kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
> /home/root/20131219/qemu-be/kvm-all.c:1701
> #9  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
> /home/root/20131219/qemu-be/cpus.c:874
> #10 0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
> #11 0xb69f5070 in ?? () at
> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
> #12 0xb69f5070 in ?? () at
> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> (gdb) p /x s->sys_cfgstat
> $25 = 0x1
> (gdb) finish
> Run till exit from #0  arm_sysctl_read (opaque=0x95a600, offset=168,
> size=4) at /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
> memory_region_read_accessor (mr=0x95b8e0, addr=<optimized out>,
> value=0xb5c0dc18, size=4, shift=0, mask=4294967295) at
> /home/root/20131219/qemu-be/memory.c:408
> 408        trace_memory_region_ops_read(mr, addr, tmp, size);
> Value returned is $26 = 1
> (gdb) enable 2
> (gdb) cont
> Continuing.
>
> Breakpoint 2, kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
> /home/root/20131219/qemu-be/kvm-all.c:1660
> 1660            kvm_arch_pre_run(cpu, run);
> (gdb) bt
> #0  kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
> /home/root/20131219/qemu-be/kvm-all.c:1660
> #1  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
> /home/root/20131219/qemu-be/cpus.c:874
> #2  0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
> #3  0xb69f5070 in ?? () at
> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
> #4  0xb69f5070 in ?? () at
> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> (gdb) p /x run->mmio
> $27 = {phys_addr = 0x1c0100a8, data = {0x1, 0x0, 0x0, 0x0, 0x0, 0x0,
> 0x0, 0x0}, len = 0x4, is_write = 0x0}
>
> Also please look at adjust_endianness function and
> struct MemoryRegion 'endianness' field. IMHO in qemu it
> works quite nicely already. MemoryRegion 'read' and 'write'
> callbacks return/get data in native format adjust_endianness
> function checks whether emulated device endianness matches
> emulator endianness and if it is different it does byteswap
> according to size. As in above example arm_sysctl_ops memory
> region should be marked as DEVICE_LITTLE_ENDIAN when it
> returns s->sys_cfgstat value LE qemu sees that endianity
> matches and it does not byteswap of result, so integer at
> &mmio.data[0] address is in LE form. When qemu would
> run in BE mode on BE kernel, it would see that endianity
> mismatches and it will byteswap s->sys_cfgstat native value
> (BE), so mmio.data would contain integer in LE format again.
>
> Note in currently committed code arm_sysctl_ops endianity
> is DEVICE_NATIVE_ENDIAN, which is wrong - real vexpress
> arm_sysctl device always gives/receives data in LE format regardless
> of current CPSR E bit value, so it cannot be marked as NATIVE.
> LE and BE kernels always read it as LE device; BE kernel follows
> with byteswap. It was OK while we just run qemu in LE, but it
> should be fixed to be LITTLE_ENDIAN for BE qemu work correctly
> ... and actually that device and few other ARM specific devices
> endianity change to LITTLE_ENDIAN was the only change in qemu
> to make BE KVM to work.
>
>>>
>>> Yes, I fully agree :).
>>>
>> Great, I'll prepare a patch for the KVM API documentation.
>>
>> -Christoffer
>> _______________________________________________
>> kvmarm mailing list
>> kvmarm@lists.cs.columbia.edu
>> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
>
> Thanks,
> Victor
>
> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-January/thread.html#223186
>
>
>     Appendix
>     Data path endianity in ARM KVM mmio
>     ===================================
>
> This writeup considers several scenarios and tracks endianity
> of data how it travels from emulator to guest CPU register, in
> case of ARM KVM. It starts with currently committed code for LE
> KVM host case and further discusses proposed BE KVM host
> arrangement.
>
> Just to restrict discussion writeup considers code path of
> integer (4 bytes) read from h/w mapped emulated device memory.
> Writeup considers endianity of essential places involved in such
> code path.
>
> For all cases when endianity is defined, it is assumed that
> values under consideration are in memory (opposite to be in
> register that does not have endianity). I.e even if function
> variable could be actually allocated in CPU register writeup
> will reference to it as it is in memory, just to keep
> discussion clean, except for final guest CPU register.
>
> Let's consider the following places along data path from
> emulator to guest CPU register:
>
> 1) emulator code that holds integer value to be read, assume
> it would be global 'int emulated_hw_device_val' variable.
> Normally in emulator it is held in native endian format - i.e
> it is CPSR E bit is the same as kernel CPSR E bit. Just for
> discussion sake assume that this h/w device registers
> holds 5 as its value.
>
> 2) KVM_EXIT_MMIO part of 'struct kvm_run' structure, i.e
> mmio.data byte array. Byte array does not have endianity,
> but for this discussion it would track endianity of integer
> at &mmio.data[0] address
>
> 3) 'data' variable type of 'unsigned long' in
> kvm_handle_mmio_return function before vcpu_data_host_to_guest
> call. KVM host mmio_read_buf function is used to fill this
> variable from mmio.data buffer. mmio_read_buf actually
> acts as memcpy from mmio.data buffer address,
> just taking access size in account.
>
> 4) the same 'data' variable as above, but after
> vcpu_data_host_to_guest function call, just before it is copied
> to vcpu_reg target register location. Note
> vcpu_data_host_to_guest function may byteswap value of 'data'
> depending on current KVM host endianity and value of
> guest CPSR E bit.
>
> 5) guest CPU spilled register array, location of target register
> i.e integer at vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt) address
>
> 6) finally guest CPU register filled from vcpu_reg just before
> guest resume execution of trapped emulated instruction. Note
> it is done by hypervisor part of code and hypervisor EE bit is
> the same as KVM host CPSR E bit.
>
> Note again, KVM host, emulator, and hypervisor part of code (guest
> CPU registers save and restore code) always run in the same
> endianity. Endianity of accessed emulated devices and endianity
> of guest varies independently of KVM host endianity.
>
> Below sections consider all permutations of all possible cases,
> it maybe quite boring to read. I've created summary table at
> the end, you can jump to the table, after reading few cases.
> But if you have objections and you see things happen differently
> please comment inline of the use cases steps.
>
> LE KVM host
> ===========
>
> Use case 1
> ----------
>
> Emulated h/w device gives data in LE form; emulator and KVM
> host endianity is LE (host CPSR E bit is off); guest compiled
> in LE mode; and guest does access with CPSR E bit off
>
> 1) 'emulated_hw_device_val' emulator variable is LE
> 2) &mmio.data[0] holds integer in LE format, matches device
> endianity
> 3) 'data' is LE
> 4) 'data' is LE (since guest CPSR E bit is off no byteswap)
> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
> 6) final guest target CPU register contains 5 (0x00000005)
>
> guest resumes execution ... Let's say after 'ldr r1, [r0]'
> instruction, where r0 holds address of devices, it knows
> that it reads LE mapped h/w so no addition processing is
> needed
>
> Use case 2
> ----------
>
> Emulated h/w device gives data in LE form; emulator and KVM
> host endianness is LE (host CPSR E bit is off); guest compiled
> in BE mode; and guest does the access with CPSR E bit on
>
> 1) 'emulated_hw_device_val' emulator variable is LE
> 2) &mmio.data[0] holds the integer in LE format, matching device
> endianness
> 3) 'data' is LE
> 4) 'data' is BE (since guest CPSR E bit is on, vcpu_data_host_to_guest
> will do a byteswap: cpu_to_be)
> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
> 6) final guest target CPU register contains 0x05000000
>
> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
> knows that it runs in BE mode (E bit on) and that it reads
> LE device memory, so it needs to byteswap r1 before further
> processing: it does 'rev r1, r1' and proceeds with the result
>
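> (As an aside, this 'rev' step is exactly what a guest kernel's I/O
> accessor for an LE device would hide. A one-line C model of it,
> with an illustrative name, not actual kernel API:
>
>     #include <stdint.h>
>
>     /* 'raw' is the value as loaded by 'ldr'; swap iff the CPU
>      * currently runs BE - the 'rev r1, r1' step */
>     static inline uint32_t le_dev_read_fixup(uint32_t raw,
>                                              int cpu_is_be)
>     {
>             return cpu_is_be ? __builtin_bswap32(raw) : raw;
>     }
>
> where __builtin_bswap32 plays the role of 'rev r1, r1'.)
>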
> Use case 3
> ----------
>
> Emulated h/w device gives data in BE form; emulator and KVM
> host endianness is LE (host CPSR E bit is off); guest compiled
> in LE mode; and guest does the access with CPSR E bit off
>
> 1) 'emulated_hw_device_val' emulator variable is LE
> 2) &mmio.data[0] holds the integer in BE format; the emulator
> byteswaps it because it knows that the device endianness is
> opposite to native, and the buffer should match device endianness
> 3) 'data' is BE
> 4) 'data' is BE (since guest CPSR E bit is off, no byteswap)
> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
> 6) final guest target CPU register contains 0x05000000
>
> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
> knows that it runs in LE mode (E bit off) and that it reads
> BE device memory, so it needs to byteswap r1 before further
> processing: it does 'rev r1, r1' and proceeds with the result
>
> Use case 4
> ----------
>
> Emulated h/w device gives data in BE form; emulator and KVM
> host endianness is LE (host CPSR E bit is off); guest compiled
> in BE mode; and guest does the access with CPSR E bit on
>
> 1) 'emulated_hw_device_val' emulator variable is LE
> 2) &mmio.data[0] holds the integer in BE format; the emulator
> byteswaps it because it knows that the device endianness is
> opposite to native, and the buffer should match device endianness
> 3) 'data' is BE
> 4) 'data' is LE (since guest CPSR E bit is on, vcpu_data_host_to_guest
> will do a byteswap: cpu_to_be)
> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
> 6) final guest target CPU register contains 5 (0x00000005)
>
> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
> knows that it runs in BE mode and that it reads BE device
> memory, so it does not need to do anything before further
> processing.
>
>
> The above use cases are exactly what we have now after Marc's
> commit to support BE guests on an LE KVM host. The further use
> cases describe how it would work with the BE KVM patches I proposed.
> It is understood that this is subject to further discussion.
>
>
> BE KVM host
> ===========
>
> Use case 5
> ----------
>
> Emulated h/w device gives data in LE form; emulator and KVM
> host endianness is BE (host CPSR E bit is on); guest compiled
> in BE mode; and guest does the access with CPSR E bit on
>
> 1) 'emulated_hw_device_val' emulator variable is BE
> 2) &mmio.data[0] holds the integer in LE format; the emulator
> byteswaps it because it knows that the device endianness is
> opposite to native, and the buffer matches device endianness
> 3) 'data' is LE
> 4) 'data' is LE (since guest CPSR E bit is on, the BE KVM host kernel
> does *not* byteswap: cpu_to_be has no effect in a BE host kernel)
> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
> 6) final guest target CPU register contains 0x05000000, because the
> hypervisor runs in BE mode, so the load of an LE integer leaves a
> byteswapped value in the register
>
> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
> knows that it runs in BE mode and that it reads LE device
> memory, so it needs to byteswap r1 before further processing: it
> does 'rev r1, r1' and proceeds with the result
>
> Use case 6
> ----------
>
> Emulated h/w device gives data in LE form; emulator and KVM
> host endianness is BE (host CPSR E bit is on); guest compiled
> in LE mode; and guest does the access with CPSR E bit off
>
> 1) 'emulated_hw_device_val' emulator variable is BE
> 2) &mmio.data[0] holds the integer in LE format; the emulator
> byteswaps it because it knows that the device endianness is
> opposite to native, and the buffer matches device endianness
> 3) 'data' is LE
> 4) 'data' is BE (since guest CPSR E bit is off, the BE KVM host kernel
> does a byteswap: cpu_to_le)
> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
> 6) final guest target CPU register contains 5 (0x00000005), because
> the hypervisor runs in BE mode, so the load of a BE integer is OK
>
> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
> knows that it runs in LE mode and that it reads LE device
> memory, so it does not need to do anything else; it just proceeds
>
> Use case 7
> ----------
>
> Emulated h/w device gives data in BE form; emulator and KVM
> host endianness is BE (host CPSR E bit is on); guest compiled
> in BE mode; and guest does the access with CPSR E bit on
>
> 1) 'emulated_hw_device_val' emulator variable is BE
> 2) &mmio.data[0] holds the integer in BE format, matching device
> endianness
> 3) 'data' is BE
> 4) 'data' is BE (since guest CPSR E bit is on, the BE KVM host kernel
> does *not* byteswap: cpu_to_be has no effect in a BE host kernel)
> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
> 6) final guest target CPU register contains 5 (0x00000005), because
> the hypervisor runs in BE mode, so the load of a BE integer is OK
>
> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
> knows that it runs in BE mode and that it reads BE device
> memory, so it does not need to do anything else; it just proceeds
>
> Use case 8
> ----------
>
> Emulated h/w device gives data in BE form; emulator and KVM
> host endianness is BE (host CPSR E bit is on); guest compiled
> in LE mode; and guest does the access with CPSR E bit off
>
> 1) 'emulated_hw_device_val' emulator variable is BE
> 2) &mmio.data[0] holds the integer in BE format, matching device
> endianness
> 3) 'data' is BE
> 4) 'data' is LE (since guest CPSR E bit is off, the BE KVM host kernel
> does a byteswap: cpu_to_le)
> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
> 6) final guest target CPU register contains 0x05000000, because the
> hypervisor runs in BE mode, so the load of an LE integer leaves a
> byteswapped value in the register
>
> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
> knows that it runs in LE mode and that it reads BE device
> memory, so it needs to byteswap r1 before further processing: it
> does 'rev r1, r1' and proceeds with the result
>
> Note that with a BE kernel we actually have some initial portion
> of assembler code that executes with the CPSR E bit off and reads
> LE h/w - i.e. it falls into use case 1.
>
> Summary Table (please use a fixed font to see it correctly)
> ===========================================================
>
> --------------------------------------------------------------
> | Use Case # | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   |
> --------------------------------------------------------------
> | KVM Host,  | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
> | Emulator,  |     |     |     |     |     |     |     |     |
> | Hypervisor |     |     |     |     |     |     |     |     |
> | Endianness |     |     |     |     |     |     |     |     |
> --------------------------------------------------------------
> | Device     | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
> | Endianness |     |     |     |     |     |     |     |     |
> --------------------------------------------------------------
> | Guest      | LE  | BE  | LE  | BE  | BE  | LE  | BE  | LE  |
> | Access     |     |     |     |     |     |     |     |     |
> | Endianness |     |     |     |     |     |     |     |     |
> --------------------------------------------------------------
> | Step 1)    | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
> --------------------------------------------------------------
> | Step 2)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
> --------------------------------------------------------------
> | Step 3)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
> --------------------------------------------------------------
> | Step 4)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
> --------------------------------------------------------------
> | Step 5)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
> --------------------------------------------------------------
> | Final Reg  | no  | yes | yes | no  | yes | no  | no  | yes |
> | value      |     |     |     |     |     |     |     |     |
> | byteswapped|     |     |     |     |     |     |     |     |
> --------------------------------------------------------------
> | Guest      | no  | yes | yes | no  | yes | no  | no  | yes |
> | Follows    |     |     |     |     |     |     |     |     |
> | with rev   |     |     |     |     |     |     |     |     |
> --------------------------------------------------------------
>
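> For readers who prefer to check the table mechanically, here is a
> small self-contained C model (my own illustration, not kernel code)
> that replays steps 2) and 4) for all eight cases, representing each
> in-memory value as "what the host CPU would read from it":
>
>     #include <stdio.h>
>     #include <stdint.h>
>
>     static uint32_t bswap32(uint32_t v)
>     {
>             return (v >> 24) | ((v >> 8) & 0xff00) |
>                    ((v << 8) & 0xff0000) | (v << 24);
>     }
>
>     int main(void)
>     {
>             const uint32_t val = 5; /* device register value */
>             int host, dev, guest;   /* 1 = BE, 0 = LE */
>
>             for (host = 0; host <= 1; host++)
>             for (dev = 0; dev <= 1; dev++)
>             for (guest = 0; guest <= 1; guest++) {
>                     /* step 2): mmio.data[] is in device byte order */
>                     uint32_t data = (dev == host) ? val : bswap32(val);
>                     /* step 4): swap iff guest access endianness
>                      * differs from host endianness */
>                     if (guest != host)
>                             data = bswap32(data);
>                     /* step 6): host-endian hypervisor load; the
>                      * register is byteswapped iff data != val */
>                     printf("host=%s dev=%s guest=%s reg=0x%08x rev=%s\n",
>                            host ? "BE" : "LE", dev ? "BE" : "LE",
>                            guest ? "BE" : "LE", data,
>                            dev != guest ? "yes" : "no");
>             }
>             return 0;
>     }
>
> The printed "reg" and "rev" columns reproduce the last two rows of
> the table above, including the BE<->LE symmetry noted below.
>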
> A few observations
> ==================
>
> x) Note the above table is symmetric w.r.t. a BE<->LE swap:
>        1<-->7
>        2<-->8
>        3<-->5
>        4<-->6
>
> x) the &mmio.data[0] address always holds the integer in the same
> format as the emulated device endianness
>
> x) during step 4), where the vcpu_data_host_to_guest function
> is used, if the guest E bit value differs but everything else
> is the same, opposite results are produced (1&2, 3&4, 5&6,
> 7&8)
>
> If you made it to this end :), again, thank you very much for
> reading it!
>
> - Victor

Hi Victor,

First of all I really appreciate the thorough description with
all the use-cases.

Below is a summary of what I understood from your
analysis:

1. Any MMIO device marked as NATIVE ENDIAN in the user
space tool (QEMU or KVMTOOL) is bad for a cross-endian
Guest. To support cross-endian Guests we need to have
all MMIO devices with a fixed ENDIANNESS.

2. We don't need to do any endianness conversions in KVM
for MMIO writes that are being forwarded to user space. It is
the job of user space (QEMU or KVMTOOL) to interpret the
endianness of MMIO write data based on device endianness.

3. The MMIO read operation is the one which will need
explicit handling in KVM, because the target VCPU register
of an MMIO read operation should be loaded with the MMIO data
(returned from user space) based upon the current VCPU
endianness (i.e. the VCPU CPSR.E bit).

4. In-kernel emulated devices (such as VGIC) will not
require any explicit endianness conversion of MMIO data for
MMIO write operations (same as point 2).

5. In-kernel emulated devices (such as VGIC) will have to do
explicit endianness conversion of MMIO data for MMIO read
operations based on device endianness (same as point 3); see
the sketch below.
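
As a rough sketch of points 3 and 5 (illustrative C only; the
function names are made up, not the kernel API): an in-kernel LE
device model produces a native-endian value, and the MMIO glue
stores it in device (LE) byte order, so the generic CPSR.E-based
register load can stay device-agnostic:

    #include <stdint.h>
    #include <string.h>
    #include <endian.h>

    /* device model returns its register value in native order;
     * 0x5 here is just an arbitrary example value */
    static uint32_t vgic_dist_read_reg(void) { return 0x5; }

    /* store it LE, matching the device endianness, so the rest
     * of the MMIO read path treats it like any other device */
    static void kernel_mmio_read(uint8_t *buf, unsigned int len)
    {
            uint32_t le = htole32(vgic_dist_read_reg());
            memcpy(buf, &le, len < 4 ? len : 4);
    }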

I hope the above summary of my understanding matches your
description. If so, then I am in support of your description.

I think your description (and the above 5 points) takes care of
all use cases of cross-endianness without changing the current
MMIO ABI.

Regards,
Anup

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-ppc] KVM and variable-endianness guest CPUs
  2014-01-22  6:31           ` [Qemu-devel] " Anup Patel
@ 2014-01-22  6:41             ` Alexander Graf
  -1 siblings, 0 replies; 102+ messages in thread
From: Alexander Graf @ 2014-01-22  6:41 UTC (permalink / raw)
  To: Anup Patel
  Cc: Victor Kamensky, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm, Christoffer Dall



> Am 22.01.2014 um 07:31 schrieb Anup Patel <anup@brainfault.org>:
> 
> On Wed, Jan 22, 2014 at 11:09 AM, Victor Kamensky
> <victor.kamensky@linaro.org> wrote:
>> Hi Guys,
>> 
>> Christoffer and I had a bit heated chat :) on this
>> subject last night. Christoffer, really appreciate
>> your time! We did not really reach agreement
>> during the chat and Christoffer asked me to follow
>> up on this thread.
>> Here it goes. Sorry, it is very long email.
>> 
>> I don't believe we can assign any endianity to
>> mmio.data[] byte array. I believe mmio.data[] and
>> mmio.len acts just memcpy and that is all. As
>> memcpy does not imply any endianity of underlying
>> data mmio.data[] should not either.
>> 
>> Here is my definition:
>> 
>> mmio.data[] is array of bytes that contains memory
>> bytes in such form, for read case, that if those
>> bytes are placed in guest memory and guest executes
>> the same read access instruction with address to this
>> memory, result would be the same as real h/w device
>> memory access. Rest of KVM host and hypervisor
>> part of code should really take care of mmio.data[]
>> memory so it will be delivered to vcpu registers and
>> restored by hypervisor part in such way that guest CPU
>> register value is the same as it would be for real
>> non-emulated h/w read access (that is emulation part).
>> The same goes for write access, if guest writes into
>> memory and those bytes are just copied to emulated
>> h/w register it would have the same effect as real
>> mapped h/w register write.
>> 
>> In shorter form, i.e for len=4 access: endianity of integer
>> at &mmio.data[0] address should match endianity
>> of emulated h/w device behind phys_addr address,
>> regardless what is endianity of emulator, KVM host,
>> hypervisor, and guest
>> 
>> Examples that illustrate my definition
>> --------------------------------------
>> 
>> 1) LE guest (E bit is off in ARM speak) reads integer
>> (4 bytes) from mapped h/w LE device register -
>> mmio.data[3] contains MSB, mmio.data[0] contains LSB.
>> 
>> 2) BE guest (E bit is on in ARM speak) reads integer
>> from mapped h/w LE device register - mmio.data[3]
>> contains MSB, mmio.data[0] contains LSB. Note that
>> if &mmio.data[0] memory would be placed in guest
>> address space and instruction restarted with new
>> address, then it would meet BE guest expectations
>> - the guest knows that it reads LE h/w so it will byteswap
>> register before processing it further. This is BE guest ARM
>> case (regardless of what KVM host endianity is).
>> 
>> 3) BE guest reads integer from mapped h/w BE device
>> register - mmio.data[0] contains MSB, mmio.data[3]
>> contains LSB. Note that if &mmio.data[0] memory would
>> be placed in guest address space and instruction
>> restarted with new address, then it would meet BE
>> guest expectation - the guest knows that it reads
>> BE h/w so it will proceed further without any other
>> work. I guess, it is BE ppc case.
>> 
>> 
>> Arguments in favor of memcpy semantics of mmio.data[]
>> ------------------------------------------------------
>> 
>> x) What are possible values of 'len'? Previous discussions
>> imply that is always powers of 2. Why is that? Maybe
>> there will be CPU that would need to do 5 bytes mmio
>> access, or 6 bytes. How do you assign endianity to
>> such case? 'len' 5 or 6, or any works fine with
>> memcpy semantics. I admit it is hypothetical case, but
>> IMHO it tests how clean ABI definition is.
>> 
>> x) Byte array does not have endianity because it
>> does not have any structure. If one would want to
>> imply structure why mmio is not defined in such way
>> so structure reflected in mmio definition?
>> Something like:
>> 
>> 
>>                /* KVM_EXIT_MMIO */
>>                struct {
>>                          __u64 phys_addr;
>>                          union {
>>                               __u8 byte;
>>                               __u16 hword;
>>                               __u32 word;
>>                               __u64 dword;
>>                          }  data;
>>                          __u32 len;
>>                          __u8  is_write;
>>                } mmio;
>> 
>> where len is really serves as union discriminator and
>> only allowed len values are 1, 2, 4, 8.
>> In this case, I agree, endianity of integer types
>> should be defined. I believe, use of byte array strongly
>> implies that original intent was to have semantics of
>> byte stream copy, just like memcpy does.
>> 
>> x) Note there is nothing wrong with user kernel ABI to
>> use just bytes stream as parameter. There is already
>> precedents like 'read' and 'write' system calls :).
>> 
>> x) Consider case when KVM works with emulated memory mapped
>> h/w devices where some devices operate in LE mode and others
>> operate in BE mode. It is defined by semantics of real h/w
>> device which is it, and should be emulated by emulator and KVM
>> given all other context. As far as mmio.data[] array concerned, if the
>> same integer value is read from these devices registers, mmio.data[]
>> memory should contain integer in opposite endianity for these
>> two cases, i.e MSB is data[0] in one case and MSB is
>> data[3] is in another case. It cannot be the same, because
>> except emulator and guest kernel, all other, like KVM host
>> and hypervisor, have no clue what endianity of device
>> actually is - it should treat mmio.data[] in the same way.
>> But resulting guest target CPU register would need to contain
>> normal integer value in one case and byteswapped in another,
>> because guest kernel would use it directly in one case and
>> byteswap it in another. Byte stream semantics allows to do
>> that. I don't see how it could happen if you fixate mmio.data[]
>> endianity in such way that it would contain integer in
>> the same format for BE and LE emulated device types.
>> 
>> If by this point you agree, that mmio.data[] user-land/kernel
>> ABI semantics should be just memcpy, stop reading :). If not,
>> you may would like to take a look at below appendix where I
>> described in great details endianity of data at different
>> points along mmio processing code path of existing ARM LE KVM,
>> and proposed ARM BE KVM. Note appendix, is very long and very
>> detailed, sorry about that, but I feel that earlier more
>> digested explanations failed, so it driven me to write out
>> all details how I see them. If I am wrong, I hope it would be
>> easier for folks to point in detailed explanation places
>> where my logic goes bad. Also, I am not sure whether this
>> mail thread is good place to discuss all details described
>> in the appendix. Christoffer, please advise whether I should take
>> that one back on [1]. But I hope this bigger picture may help to
>> see the mmio.data[] semantics issue in context.
>> 
>> More inline and appendix is at the end.
>> 
>>> On 20 January 2014 11:19, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>>>> On Mon, Jan 20, 2014 at 03:22:11PM +0100, Alexander Graf wrote:
>>>> 
>>>>> On 17.01.2014, at 19:52, Peter Maydell <peter.maydell@linaro.org> wrote:
>>>>> 
>>>>>> On 17 January 2014 17:53, Peter Maydell <peter.maydell@linaro.org> wrote:
>>>>>> Specifically, the KVM API says "here's a uint8_t[] byte
>>>>>> array and a length", and the current QEMU code treats that
>>>>>> as "this is a byte array written as if the guest CPU
>>>>>> (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
>>>>>> I/O access to this buffer rather than to the device".
>>>>>> 
>>>>>> The KVM API docs don't actually specify the endianness
>>>>>> semantics of the byte array, but I think that that really
>>>>>> needs to be nailed down. I can think of a couple of options:
>>>>>> * always LE
>>>>>> * always BE
>>>>>>  [these first two are non-starters because they would
>>>>>>  break either x86 or PPC existing code]
>>>>>> * always the endianness the guest is at the time
>>>>>> * always some arbitrary endianness based purely on the
>>>>>>  endianness the KVM implementation used historically
>>>>>> * always the endianness of the host QEMU binary
>>>>>> * something else?
>>>>>> 
>>>>>> Any preferences? Current QEMU code basically assumes
>>>>>> "always the endianness of TARGET_WORDS_BIGENDIAN",
>>>>>> which is pretty random.
>>>>> 
>>>>> Having thought a little more about this, my opinion is:
>>>>> 
>>>>> * we should specify that the byte order of the mmio.data
>>>>>  array is host kernel endianness (ie same endianness
>>>>>  as the QEMU process itself) [this is what it actually
>>>>>  is, I think, for all the cases that work today]
>> 
>> In above please consider two types of mapped emulated
>> h/w devices: BE and LE they cannot have mmio.data in the
>> same endianity. Currently in all observable cases LE ARM
>> and BE PPC devices endianity matches kernel/qemu
>> endianity but it would break when BE ARM is introduced
>> or LE PPC or one would start emulating BE devices on LE
>> ARM.
>> 
>>>>> * we should fix the code path in QEMU for handling
>>>>>  mmio.data which currently has the implicit assumption
>>>>>  that when using KVM TARGET_WORDS_BIGENDIAN is the same
>>>>>  as the QEMU host process endianness (because it's using
>>>>>  load/store functions which swap if TARGET_WORDS_BIGENDIAN
>>>>>  is different from HOST_WORDS_BIGENDIAN)
>> 
>> I do not follow above. Maybe I am missing bigger context.
>> What is CPU under discussion in above? On ARM V7 system
>> when LE device is accessed as integer &mmio.data[0] address
>> would contain integer is in LE format, ie mmio.data[0] is LSB.
>> 
>> Here is gdb session of LE qemu running on V7 LE kernel and
>> TC1 LE guest. Guest kernel accesses sys_cfgstat register which is
>> arm_sysctl registers with offset of 0xa8. Note.arm_sysct is memory
>> mapped LE device.
>> Please check run->mmio structure after read
>> (cpu_physical_memory_rw) completes it is in 4 bytes integer in
>> LE format mmio.data[0] is LSB and is equal to 1
>> (s->syscfgstat value):
>> 
>> (gdb) bt
>> #0  arm_sysctl_read (opaque=0x95a600, offset=168, size=4) at
>> /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
>> #1  0x0023b9b4 in memory_region_read_accessor (mr=0x95b8e0,
>> addr=<optimized out>, value=0xb5c0dc18, size=4, shift=0,
>> mask=4294967295)
>>    at /home/root/20131219/qemu-be/memory.c:407
>> #2  0x0023aba4 in access_with_adjusted_size (addr=4294967295,
>> value=0xb5c0dc18, value@entry=0xb5c0dc10, size=size@entry=4,
>> access_size_min=1,
>>    access_size_max=2357596, access=access@entry=0x23b96c
>> <memory_region_read_accessor>, mr=mr@entry=0x95b8e0) at
>> /home/root/20131219/qemu-be/memory.c:477
>> #3  0x0023f95c in memory_region_dispatch_read1 (size=4, addr=168,
>> mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:944
>> #4  memory_region_dispatch_read (size=4, pval=0xb5c0dc68, addr=168,
>> mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:966
>> #5  io_mem_read (mr=mr@entry=0x95b8e0, addr=<optimized out>,
>> pval=pval@entry=0xb5c0dc68, size=size@entry=4) at
>> /home/root/20131219/qemu-be/memory.c:1743
>> #6  0x001abd38 in address_space_rw (as=as@entry=0x8102d8
>> <address_space_memory>, addr=469827752, buf=buf@entry=0xb6fd6028 "",
>> len=4, is_write=false,
>>    is_write@entry=true) at /home/root/20131219/qemu-be/exec.c:2025
>> #7  0x001abf90 in cpu_physical_memory_rw (addr=<optimized out>,
>> buf=buf@entry=0xb6fd6028 "", len=<optimized out>, is_write=0)
>>    at /home/root/20131219/qemu-be/exec.c:2070
>> #8  0x00239e00 in kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>> /home/root/20131219/qemu-be/kvm-all.c:1701
>> #9  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
>> /home/root/20131219/qemu-be/cpus.c:874
>> #10 0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
>> #11 0xb69f5070 in ?? () at
>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>> #12 0xb69f5070 in ?? () at
>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
>> (gdb) p /x s->sys_cfgstat
>> $25 = 0x1
>> (gdb) finish
>> Run till exit from #0  arm_sysctl_read (opaque=0x95a600, offset=168,
>> size=4) at /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
>> memory_region_read_accessor (mr=0x95b8e0, addr=<optimized out>,
>> value=0xb5c0dc18, size=4, shift=0, mask=4294967295) at
>> /home/root/20131219/qemu-be/memory.c:408
>> 408        trace_memory_region_ops_read(mr, addr, tmp, size);
>> Value returned is $26 = 1
>> (gdb) enable 2
>> (gdb) cont
>> Continuing.
>> 
>> Breakpoint 2, kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>> /home/root/20131219/qemu-be/kvm-all.c:1660
>> 1660            kvm_arch_pre_run(cpu, run);
>> (gdb) bt
>> #0  kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>> /home/root/20131219/qemu-be/kvm-all.c:1660
>> #1  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
>> /home/root/20131219/qemu-be/cpus.c:874
>> #2  0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
>> #3  0xb69f5070 in ?? () at
>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>> #4  0xb69f5070 in ?? () at
>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
>> (gdb) p /x run->mmio
>> $27 = {phys_addr = 0x1c0100a8, data = {0x1, 0x0, 0x0, 0x0, 0x0, 0x0,
>> 0x0, 0x0}, len = 0x4, is_write = 0x0}
>> 
>> Also please look at the adjust_endianness function and the
>> struct MemoryRegion 'endianness' field. IMHO in qemu this
>> already works quite nicely. The MemoryRegion 'read' and 'write'
>> callbacks return/take data in native format; the adjust_endianness
>> function checks whether the emulated device endianness matches
>> the emulator endianness and, if it differs, byteswaps
>> according to the access size. As in the above example, the
>> arm_sysctl_ops memory region should be marked as
>> DEVICE_LITTLE_ENDIAN; when it returns the s->sys_cfgstat value,
>> LE qemu sees that the endianness matches and does not byteswap
>> the result, so the integer at the &mmio.data[0] address is in LE
>> form. When qemu runs in BE mode on a BE kernel, it sees that the
>> endianness mismatches and byteswaps the native (BE)
>> s->sys_cfgstat value, so mmio.data contains the integer in LE
>> format again.
>>
>> Note that in the currently committed code the arm_sysctl_ops
>> endianness is DEVICE_NATIVE_ENDIAN, which is wrong - the real vexpress
>> arm_sysctl device always gives/receives data in LE format regardless
>> of the current CPSR E bit value, so it cannot be marked as NATIVE.
>> LE and BE kernels always read it as an LE device; the BE kernel follows
>> up with a byteswap. That was OK while we only ran qemu in LE, but it
>> should be fixed to LITTLE_ENDIAN for BE qemu to work correctly
>> ... and actually changing that device and a few other ARM-specific
>> devices to LITTLE_ENDIAN endianness was the only qemu change needed
>> to make BE KVM work.
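>>
>> For illustration, a minimal sketch of what marking the region LE
>> looks like, following the usual QEMU MemoryRegionOps pattern (the
>> read/write callbacks are the existing arm_sysctl ones; the rest of
>> the declaration is elided):
>>
>>     static const MemoryRegionOps arm_sysctl_ops = {
>>         .read = arm_sysctl_read,   /* returns value in native format */
>>         .write = arm_sysctl_write,
>>         /* the real device is LE regardless of CPU state, so fix it: */
>>         .endianness = DEVICE_LITTLE_ENDIAN,
>>     };
>>
>> With that, adjust_endianness byteswaps only when the emulator itself
>> runs BE, which is exactly the behavior described above.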
>> 
>>>> 
>>>> Yes, I fully agree :).
>>> Great, I'll prepare a patch for the KVM API documentation.
>>> 
>>> -Christoffer
>> 
>> Thanks,
>> Victor
>> 
>> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-January/thread.html#223186
>> 
>> 
>>    Appendix
>>    Data path endianness in ARM KVM mmio
>>    =====================================
>> 
>> This writeup considers several scenarios and tracks the endianness
>> of data as it travels from the emulator to the guest CPU register,
>> in the case of ARM KVM. It starts with the currently committed code
>> for the LE KVM host case and then discusses the proposed BE KVM
>> host arrangement.
>>
>> To keep the discussion bounded, the writeup considers the code path
>> of an integer (4-byte) read from h/w-mapped emulated device memory,
>> and the endianness at the essential places involved in such a
>> code path.
>> 
>> For all cases where endianness is defined, it is assumed that
>> the values under consideration are in memory (as opposed to in a
>> register, which does not have endianness). I.e. even if a function
>> variable could actually be allocated in a CPU register, the writeup
>> refers to it as if it were in memory, just to keep the
>> discussion clean - except for the final guest CPU register.
>> 
>> Let's consider the following places along data path from
>> emulator to guest CPU register:
>> 
>> 1) the emulator code that holds the integer value to be read; assume
>> it is a global 'int emulated_hw_device_val' variable.
>> Normally the emulator holds it in native-endian format - i.e.
>> the emulator's CPSR E bit is the same as the kernel's CPSR E bit.
>> For discussion's sake assume that this h/w device register
>> holds 5 as its value.
>> 
>> 2) the KVM_EXIT_MMIO part of the 'struct kvm_run' structure, i.e.
>> the mmio.data byte array. A byte array does not have endianness,
>> but for this discussion we track the endianness of the integer
>> at the &mmio.data[0] address
>> 
>> 3) the 'data' variable of type 'unsigned long' in the
>> kvm_handle_mmio_return function, before the vcpu_data_host_to_guest
>> call. The KVM host mmio_read_buf function is used to fill this
>> variable from the mmio.data buffer; mmio_read_buf effectively
>> acts as a memcpy from the mmio.data buffer address, just taking the
>> access size into account (a simplified sketch of this helper and
>> the next one appears a little further below).
>> 
>> 4) the same 'data' variable as above, but after the
>> vcpu_data_host_to_guest function call, just before it is copied
>> to the vcpu_reg target register location. Note the
>> vcpu_data_host_to_guest function may byteswap the value of 'data',
>> depending on the current KVM host endianness and the value of the
>> guest CPSR E bit.
>> 
>> 5) the guest CPU spilled-register array, at the location of the target
>> register, i.e. the integer at the vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt) address
>> 
>> 6) finally, the guest CPU register filled from vcpu_reg just before
>> the guest resumes execution of the trapped emulated instruction. Note
>> this is done by the hypervisor part of the code, and the hypervisor
>> EE bit is the same as the KVM host CPSR E bit.
>> 
>> Note again: the KVM host, the emulator, and the hypervisor part of
>> the code (the guest CPU register save and restore code) always run
>> in the same endianness. The endianness of the accessed emulated
>> devices and the endianness of the guest vary independently of the
>> KVM host endianness.
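>>
>> To make steps 3) and 4) concrete, here is a simplified sketch of
>> the two helpers as I read them - not the verbatim kernel code
>> (u16/u32 as in <linux/types.h>); kvm_vcpu_is_be() is the guest
>> CPSR E bit check:
>>
>>     static unsigned long mmio_read_buf(char *buf, unsigned int len)
>>     {
>>         unsigned long data = 0;
>>         union { u16 hword; u32 word; } tmp;
>>
>>         /* step 3): plain byte copy, no endianness applied */
>>         switch (len) {
>>         case 1:
>>             data = buf[0];
>>             break;
>>         case 2:
>>             memcpy(&tmp.hword, buf, len);
>>             data = tmp.hword;
>>             break;
>>         case 4:
>>             memcpy(&tmp.word, buf, len);
>>             data = tmp.word;
>>             break;
>>         }
>>         return data;
>>     }
>>
>>     static unsigned long vcpu_data_host_to_guest(struct kvm_vcpu *vcpu,
>>                                                  unsigned long data,
>>                                                  unsigned int len)
>>     {
>>         /* step 4): cpu_to_be*() swaps on an LE host and is a no-op
>>          * on a BE host; my BE-host patches add the symmetric
>>          * cpu_to_le*() path for guests with the E bit off */
>>         if (kvm_vcpu_is_be(vcpu)) {
>>             switch (len) {
>>             case 1:  return data & 0xff;
>>             case 2:  return cpu_to_be16(data & 0xffff);
>>             default: return cpu_to_be32(data & 0xffffffff);
>>             }
>>         }
>>         return data;    /* LE guest access: leave untouched */
>>     }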
>> 
>> The sections below consider all permutations of the possible cases;
>> it may be quite boring to read. I've created a summary table at
>> the end, so you can jump to the table after reading a few cases.
>> But if you have objections and you see things happening differently,
>> please comment inline on the use-case steps.
>> 
>> LE KVM host
>> ===========
>> 
>> Use case 1
>> ----------
>> 
>> Emulated h/w device gives data in LE form; emulator and KVM
>> host endianness is LE (host CPSR E bit is off); guest compiled
>> in LE mode; and guest does access with CPSR E bit off
>> 
>> 1) 'emulated_hw_device_val' emulator variable is LE
>> 2) &mmio.data[0] holds the integer in LE format, matching the device
>> endianness
>> 3) 'data' is LE
>> 4) 'data' is LE (since guest CPSR E bit is off no byteswap)
>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>> 6) final guest target CPU register contains 5 (0x00000005)
>> 
>> The guest resumes execution ... let's say after an 'ldr r1, [r0]'
>> instruction, where r0 holds the address of the device. The guest
>> knows that it reads LE-mapped h/w, so no additional processing is
>> needed.
>> 
>> Use case 2
>> ----------
>> 
>> Emulated h/w device gives data in LE form; emulator and KVM
>> host endianness is LE (host CPSR E bit is off); guest compiled
>> in BE mode; and guest does access with CPSR E bit on
>> 
>> 1) 'emulated_hw_device_val' emulator variable is LE
>> 2) &mmio.data[0] holds the integer in LE format; matches the device
>> endianness
>> 3) 'data' is LE
>> 4) 'data' is BE (since the guest CPSR E bit is on, vcpu_data_host_to_guest
>> will byteswap: cpu_to_be)
>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>> 6) final guest target CPU register contains 0x05000000
>> 
>> The guest resumes execution after 'ldr r1, [r0]'; the guest kernel
>> knows that it runs in BE mode (E bit on) and that it reads
>> LE device memory, so it needs to byteswap r1 before further
>> processing: it does 'rev r1, r1' and proceeds with the result.
>> 
>> Use case 3
>> ----------
>> 
>> Emulated h/w device gives data in BE form; emulator and KVM
>> host endianness is LE (host CPSR E bit is off); guest compiled
>> in LE mode; and guest does access with CPSR E bit off
>> 
>> 1) 'emulated_hw_device_val' emulator variable is LE
>> 2) &mmio.data[0] holds the integer in BE format; the emulator byteswaps
>> it because it knows that the device endianness is opposite to native,
>> and it should match the device endianness
>> 3) 'data' is BE
>> 4) 'data' is BE (since guest CPSR E bit is off no byteswap)
>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>> 6) final guest target CPU register contains 0x05000000
>> 
>> The guest resumes execution after 'ldr r1, [r0]'; the guest kernel
>> knows that it runs in LE mode (E bit off) and that it
>> reads BE device memory, so it needs to byteswap r1 before further
>> processing: it does 'rev r1, r1' and proceeds with the result.
>> 
>> Use case 4
>> ----------
>> 
>> Emulated h/w device gives data in BE form; emulator and KVM
>> host endianness is LE (host CPSR E bit is off); guest compiled
>> in BE mode; and guest does access with CPSR E bit on
>> 
>> 1) 'emulated_hw_device_val' emulator variable is LE
>> 2) &mmio.data[0] holds the integer in BE format; the emulator byteswaps
>> it because it knows that the device endianness is opposite to native,
>> and it should match the device endianness
>> 3) 'data' is BE
>> 4) 'data' is LE (since the guest CPSR E bit is on, vcpu_data_host_to_guest
>> will byteswap: cpu_to_be)
>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>> 6) final guest target CPU register contains 5 (0x00000005)
>> 
>> The guest resumes execution after 'ldr r1, [r0]'; the guest kernel
>> knows that it runs in BE mode and that it reads BE device
>> memory, so it does not need to do anything before further
>> processing.
>> 
>> 
>> The above use cases are exactly what we have now after Marc's
>> commit to support BE guests on an LE KVM host. The further use
>> cases describe how it would work with the BE KVM patches I proposed;
>> it is understood that this is subject to further discussion.
>> 
>> 
>> BE KVM host
>> ===========
>> 
>> Use case 5
>> ----------
>> 
>> Emulated h/w device gives data in LE form; emulator and KVM
>> host endianness is BE (host CPSR E bit is on); guest compiled
>> in BE mode; and guest does access with CPSR E bit on
>> 
>> 1) 'emulated_hw_device_val' emulator variable is BE
>> 2) &mmio.data[0] holds the integer in LE format; the emulator byteswaps
>> it because it knows that the device endianness is opposite to native;
>> matches the device endianness
>> 3) 'data' is LE
>> 4) 'data' is LE (since the guest CPSR E bit is on, the BE KVM host kernel
>> does *not* byteswap: cpu_to_be has no effect in a BE host kernel)
>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>> 6) the final guest target CPU register contains 0x05000000, because the
>> hypervisor runs in BE mode, so the load of an LE integer yields a
>> byteswapped value in the register
>> 
>> The guest resumes execution after 'ldr r1, [r0]'; the guest kernel
>> knows that it runs in BE mode and that it reads LE device
>> memory, so it needs to byteswap r1 before further processing: it
>> does 'rev r1, r1' and proceeds with the result.
>> 
>> Use case 6
>> ----------
>> 
>> Emulated h/w device gives data in LE form; emulator and KVM
>> host endianness is BE (host CPSR E bit is on); guest compiled
>> in LE mode; and guest does access with CPSR E bit off
>> 
>> 1) 'emulated_hw_device_val' emulator variable is BE
>> 2) &mmio.data[0] holds the integer in LE format; the emulator byteswaps
>> it because it knows that the device endianness is opposite to native;
>> matches the device endianness
>> 3) 'data' is LE
>> 4) 'data' is BE (since the guest CPSR E bit is off, the BE KVM host kernel
>> does byteswap: cpu_to_le)
>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>> 6) the final guest target CPU register contains 5 (0x00000005), because the
>> hypervisor runs in BE mode, so the load of a BE integer is correct as-is
>> 
>> The guest resumes execution after 'ldr r1, [r0]'; the guest kernel
>> knows that it runs in LE mode and that it reads LE device
>> memory, so it does not need to do anything else; it just proceeds.
>> 
>> Use case 7
>> ----------
>> 
>> Emulated h/w device gives data in BE form; emulator and KVM
>> host endianness is BE (host CPSR E bit is on); guest compiled
>> in BE mode; and guest does access with CPSR E bit on
>> 
>> 1) 'emulated_hw_device_val' emulator variable is BE
>> 2) &mmio.data[0] holds the integer in BE format; matches the device
>> endianness
>> 3) 'data' is BE
>> 4) 'data' is BE (since the guest CPSR E bit is on, the BE KVM host kernel
>> does *not* byteswap: cpu_to_be has no effect in a BE host kernel)
>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>> 6) the final guest target CPU register contains 5 (0x00000005), because the
>> hypervisor runs in BE mode, so the load of a BE integer is correct as-is
>> 
>> The guest resumes execution after 'ldr r1, [r0]'; the guest kernel
>> knows that it runs in BE mode and that it reads BE device
>> memory, so it does not need to do anything else; it just proceeds.
>> 
>> Use case 8
>> ----------
>> 
>> Emulated h/w device gives data in BE form; emulator and KVM
>> host endianness is BE (host CPSR E bit is on); guest compiled
>> in LE mode; and guest does access with CPSR E bit off
>> 
>> 1) 'emulated_hw_device_val' emulator variable is BE
>> 2) &mmio.data[0] holds the integer in BE format; matches the device
>> endianness
>> 3) 'data' is BE
>> 4) 'data' is LE (since the guest CPSR E bit is off, the BE KVM host kernel
>> does byteswap: cpu_to_le)
>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>> 6) the final guest target CPU register contains 0x05000000, because the
>> hypervisor runs in BE mode, so the load of an LE integer yields a
>> byteswapped value in the register
>> 
>> The guest resumes execution after 'ldr r1, [r0]'; the guest kernel
>> knows that it runs in LE mode and that it reads BE device
>> memory, so it needs to byteswap r1 before further processing: it
>> does 'rev r1, r1' and proceeds with the result.
>> 
>> Note that with a BE kernel we actually have an initial portion
>> of assembler code that executes with the CPSR E bit off and reads
>> LE h/w - i.e. it falls into use case 1.
>> 
>> Summary Table (please use a fixed font to see it correctly)
>> ============================================================
>> 
>> --------------------------------------------------------------
>> | Use Case # | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   |
>> --------------------------------------------------------------
>> | KVM Host,  | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
>> | Emulator,  |     |     |     |     |     |     |     |     |
>> | Hypervisor |     |     |     |     |     |     |     |     |
>> | Endianness |     |     |     |     |     |     |     |     |
>> --------------------------------------------------------------
>> | Device     | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>> | Endianness |     |     |     |     |     |     |     |     |
>> --------------------------------------------------------------
>> | Guest      | LE  | BE  | LE  | BE  | BE  | LE  | BE  | LE  |
>> | Access     |     |     |     |     |     |     |     |     |
>> | Endianness |     |     |     |     |     |     |     |     |
>> --------------------------------------------------------------
>> | Step 1)    | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
>> --------------------------------------------------------------
>> | Step 2)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>> --------------------------------------------------------------
>> | Step 3)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>> --------------------------------------------------------------
>> | Step 4)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
>> --------------------------------------------------------------
>> | Step 5)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
>> --------------------------------------------------------------
>> | Final Reg  | no  | yes | yes | no  | yes | no  | no  | yes |
>> | value      |     |     |     |     |     |     |     |     |
>> | byteswapped|     |     |     |     |     |     |     |     |
>> --------------------------------------------------------------
>> | Guest      | no  | yes | yes | no  | yes | no  | no  | yes |
>> | Follows    |     |     |     |     |     |     |     |     |
>> | with rev   |     |     |     |     |     |     |     |     |
>> --------------------------------------------------------------
>> 
>> A few observations
>> ==================
>> 
>> x) Note the above table is symmetric w.r.t. the BE<->LE change:
>>       1<-->7
>>       2<-->8
>>       3<-->5
>>       4<-->6
>> 
>> x) the &mmio.data[0] address always holds the integer in the same
>> format as the emulated device endianness
>> 
>> x) During step 4), where the vcpu_data_host_to_guest function
>> is used, if the guest E bit value differs but everything else
>> is the same, opposite results are produced (1&2, 3&4, 5&6,
>> 7&8)
>> 
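>> Since the end-to-end behavior is mechanical, below is a small
>> self-contained userspace model (toy code of mine, not from the
>> kernel or qemu trees) that walks steps 2)-6) for all eight
>> combinations and reproduces the table. The 'guest != host' test in
>> step 4) models the committed cpu_to_be path together with my
>> proposed symmetric BE-host cpu_to_le path:
>>
>>     #include <stdio.h>
>>     #include <stdint.h>
>>
>>     enum endian { LE, BE };
>>
>>     static uint32_t bswap32(uint32_t v)
>>     {
>>         return (v >> 24) | ((v >> 8) & 0x0000ff00) |
>>                ((v << 8) & 0x00ff0000) | (v << 24);
>>     }
>>
>>     /* read 4 bytes of memory the way a CPU of endianness 'e' would */
>>     static uint32_t load32(const uint8_t b[4], int e)
>>     {
>>         return e == LE ?
>>             (uint32_t)b[0] | (uint32_t)b[1] << 8 |
>>                 (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24 :
>>             (uint32_t)b[0] << 24 | (uint32_t)b[1] << 16 |
>>                 (uint32_t)b[2] << 8 | (uint32_t)b[3];
>>     }
>>
>>     /* write value 'v' to memory the way a CPU of endianness 'e' would */
>>     static void store32(uint8_t b[4], uint32_t v, int e)
>>     {
>>         int i;
>>         for (i = 0; i < 4; i++)
>>             b[i] = v >> (e == LE ? 8 * i : 8 * (3 - i));
>>     }
>>
>>     int main(void)
>>     {
>>         static const char *nm[2] = { "LE", "BE" };
>>         int host, dev, guest;
>>
>>         for (host = LE; host <= BE; host++)
>>         for (dev = LE; dev <= BE; dev++)
>>         for (guest = LE; guest <= BE; guest++) {
>>             uint8_t mmio_data[4], spill[4];
>>             uint32_t data, reg;
>>
>>             /* step 2): emulator stores 5 into mmio.data[] in device order */
>>             store32(mmio_data, 5, dev);
>>
>>             /* step 3): mmio_read_buf - a byte copy; the host then sees
>>              * those bytes through its own endianness */
>>             data = load32(mmio_data, host);
>>
>>             /* step 4): swap iff guest access endianness differs from host */
>>             if (guest != host)
>>                 data = bswap32(data);
>>
>>             /* steps 5)+6): spill in host order; the hypervisor reloads
>>              * in host order into the guest register */
>>             store32(spill, data, host);
>>             reg = load32(spill, host);
>>
>>             printf("host=%s dev=%s guest=%s: reg=0x%08x%s\n",
>>                    nm[host], nm[dev], nm[guest], reg,
>>                    guest != dev ? "  (guest follows with rev)" : "");
>>         }
>>         return 0;
>>     }
>>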
>> If you have reached the end :), again, thank you very much for
>> reading!
>> 
>> - Victor
> 
> Hi Victor,
> 
> First of all I really appreciate the thorough description with
> all the use-cases.
> 
> Below is a summary of what I understood from your
> analysis:
> 
> 1. Any MMIO device marked as NATIVE ENDIAN in user

"Native endian" really is just a shortcut for "target endian" which is LE for ARM and BE for PPC. There shouldn't be a qemu-system-armeb or qemu-system-ppc64le.

QEMU emulates everything that comes after the CPU, so imagine the ioctl struct as a bus package. Your bus doesn't care what endianness the CPU is in - it just gets data from the CPU.

A bus write on the CPU however honors the endianness setting of the CPU. So when we convert from a value in register to a value on the bus we need to take this endian configuration into account.

That's exactly what we are talking about here. KVM should do the CPU-configured register->bus endianness mapping, while QEMU does the bus->device endianness mapping.
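
As a rough sketch of that split for a 32-bit MMIO write - the helper names here are illustrative, not actual code from either tree:

    /* KVM side: register -> bus bytes, honoring the guest's CPSR.E.
     * cpu_to_be32()/cpu_to_le32() resolve against the host kernel's
     * endianness, so this swaps only when guest and host differ. */
    u32 bus = vcpu_is_be(vcpu) ? cpu_to_be32(reg) : cpu_to_le32(reg);
    memcpy(run->mmio.data, &bus, 4);

    /* QEMU side: bus bytes -> device value, honoring only the
     * MemoryRegion's declared endianness (adjust_endianness);
     * the guest CPU state never enters into it. */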


Alex

> space tool (QEMU or KVMTOOL) is bad for a cross-endian
> Guest. To support cross-endian Guests we need to have
> all MMIO devices with fixed ENDIANNESS.
> 
> 2. We don't need to do any endianness conversions in KVM
> for MMIO writes that are being forwarded to user space. It is
> the job of user space (QEMU or KVMTOOL) to interpret the
> endianness of MMIO write data based on device endianness.
> 
> 3. The MMIO read operation is the one which will need
> explicit handling in KVM because the target VCPU register
> of MMIO read operation should be loaded with MMIO data
> (returned from user space) based upon current VCPU
> endianness (i.e. VCPU CPSR.E bit).
> 
> 4. In-kernel emulated devices (such as the VGIC) will not
> require any explicit endianness conversion of MMIO data for
> MMIO write operations (same as point 2).
> 
> 5. In-kernel emulated devices (such as the VGIC) will have to do
> explicit endianness conversion of MMIO data for MMIO read
> operations based on device endianness (same as point 3).
> 
> I hope the above summary of my understanding matches your
> description. If so, then I am in support of it.
> 
> I think your description (and the above 5 points) takes care of
> all use cases of cross-endianness without changing the current
> MMIO ABI.
> 
> Regards,
> Anup
> 

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-ppc] KVM and variable-endianness guest CPUs
  2014-01-22  6:41             ` [Qemu-devel] " Alexander Graf
@ 2014-01-22  7:26               ` Victor Kamensky
  -1 siblings, 0 replies; 102+ messages in thread
From: Victor Kamensky @ 2014-01-22  7:26 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Anup Patel, Thomas Falcon, kvm-devel, QEMU Developers, qemu-ppc,
	kvmarm, Christoffer Dall

On 21 January 2014 22:41, Alexander Graf <agraf@suse.de> wrote:
>
>
>> On 22.01.2014 at 07:31, Anup Patel <anup@brainfault.org> wrote:
>>
>> On Wed, Jan 22, 2014 at 11:09 AM, Victor Kamensky
>> <victor.kamensky@linaro.org> wrote:
>>> Hi Guys,
>>>
>>> Christoffer and I had a bit of a heated chat :) on this
>>> subject last night. Christoffer, really appreciate
>>> your time! We did not really reach agreement
>>> during the chat, and Christoffer asked me to follow
>>> up on this thread.
>>> Here it goes. Sorry, it is a very long email.
>>>
>>> I don't believe we can assign any endianness to the
>>> mmio.data[] byte array. I believe mmio.data[] and
>>> mmio.len act just like memcpy arguments, and that is all. As
>>> memcpy does not imply any endianness of the underlying
>>> data, mmio.data[] should not either.
>>>
>>> Here is my definition:
>>>
>>> mmio.data[] is an array of bytes that contains memory
>>> bytes in such a form, for the read case, that if those
>>> bytes were placed in guest memory and the guest executed
>>> the same read access instruction with the address of this
>>> memory, the result would be the same as a real h/w device
>>> memory access. The rest of the KVM host and the hypervisor
>>> part of the code should take care of the mmio.data[]
>>> memory so that it is delivered to the vcpu registers and
>>> restored by the hypervisor part in such a way that the guest CPU
>>> register value is the same as it would be for a real,
>>> non-emulated h/w read access (that is the emulation part).
>>> The same goes for write access: if the guest writes into
>>> memory and those bytes are just copied to the emulated
>>> h/w register, it has the same effect as a real
>>> mapped h/w register write.
>>>
>>> In shorter form, i.e. for a len=4 access: the endianness of the
>>> integer at the &mmio.data[0] address should match the endianness
>>> of the emulated h/w device behind the phys_addr address,
>>> regardless of the endianness of the emulator, KVM host,
>>> hypervisor, and guest
>>>
>>> Examples that illustrate my definition
>>> --------------------------------------
>>>
>>> 1) An LE guest (E bit is off in ARM speak) reads an integer
>>> (4 bytes) from a mapped h/w LE device register -
>>> mmio.data[3] contains the MSB, mmio.data[0] contains the LSB.
>>>
>>> 2) A BE guest (E bit is on in ARM speak) reads an integer
>>> from a mapped h/w LE device register - mmio.data[3]
>>> contains the MSB, mmio.data[0] contains the LSB. Note that
>>> if the &mmio.data[0] memory were placed in the guest
>>> address space and the instruction restarted with the new
>>> address, it would meet the BE guest's expectations
>>> - the guest knows that it reads LE h/w, so it will byteswap the
>>> register before processing it further. This is the BE guest ARM
>>> case (regardless of what the KVM host endianness is).
>>>
>>> 3) A BE guest reads an integer from a mapped h/w BE device
>>> register - mmio.data[0] contains the MSB, mmio.data[3]
>>> contains the LSB. Note that if the &mmio.data[0] memory were
>>> placed in the guest address space and the instruction
>>> restarted with the new address, it would meet the BE
>>> guest's expectation - the guest knows that it reads
>>> BE h/w, so it will proceed further without any other
>>> work. I guess this is the BE ppc case.
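>>>
>>> To put the layouts side by side (an illustration of the definition
>>> above, using a made-up register value of 0x11223344):
>>>
>>>     LE device, len=4: data[0]=0x44 data[1]=0x33 data[2]=0x22 data[3]=0x11
>>>     BE device, len=4: data[0]=0x11 data[1]=0x22 data[2]=0x33 data[3]=0x44
>>>
>>> Note the guest's CPSR E bit never changes this layout - only the
>>> device endianness does.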
>>>
>>>
>>> Arguments in favor of memcpy semantics of mmio.data[]
>>> ------------------------------------------------------
>>>
>>> x) What are the possible values of 'len'? Previous discussions
>>> imply it is always a power of 2. Why is that? Maybe
>>> there will be a CPU that needs to do a 5-byte mmio
>>> access, or 6 bytes. How do you assign endianness to
>>> such a case? A 'len' of 5 or 6, or anything else, works fine with
>>> memcpy semantics. I admit it is a hypothetical case, but
>>> IMHO it tests how clean the ABI definition is.
>>>
>>> x) A byte array does not have endianness because it
>>> does not have any structure. If one wanted to
>>> imply structure, why is mmio not defined in such a way
>>> that the structure is reflected in the mmio definition?
>>> Something like:
>>>
>>>
>>>                /* KVM_EXIT_MMIO */
>>>                struct {
>>>                          __u64 phys_addr;
>>>                          union {
>>>                               __u8 byte;
>>>                               __u16 hword;
>>>                               __u32 word;
>>>                               __u64 dword;
>>>                          }  data;
>>>                          __u32 len;
>>>                          __u8  is_write;
>>>                } mmio;
>>>
>>> where len really serves as the union discriminator and the
>>> only allowed len values are 1, 2, 4, 8.
>>> In this case, I agree, the endianness of the integer types
>>> should be defined. I believe the use of a byte array strongly
>>> implies that the original intent was to have the semantics of a
>>> byte stream copy, just like memcpy.
>>>
>>> x) Note there is nothing wrong with a user/kernel ABI
>>> using just a byte stream as a parameter. There are already
>>> precedents, like the 'read' and 'write' system calls :).
>>>
>>> x) Consider the case when KVM works with emulated memory-mapped
>>> h/w devices where some devices operate in LE mode and others
>>> operate in BE mode. The semantics of the real h/w device define
>>> which is which, and it should be emulated by the emulator and KVM
>>> given all other context. As far as the mmio.data[] array is
>>> concerned, if the same integer value is read from these devices'
>>> registers, the mmio.data[] memory should contain the integer in
>>> opposite endianness for these two cases, i.e. the MSB is data[0]
>>> in one case and data[3] in the other. It cannot be the same,
>>> because everything except the emulator and the guest kernel -
>>> like the KVM host and the hypervisor - has no clue what the
>>> endianness of the device actually is, and must treat mmio.data[]
>>> in the same way. But the resulting guest target CPU register
>>> needs to contain the normal integer value in one case and a
>>> byteswapped one in the other, because the guest kernel uses it
>>> directly in one case and byteswaps it in the other. Byte stream
>>> semantics allow that. I don't see how it could happen if you
>>> fixate the mmio.data[] endianness in such a way that it contains
>>> the integer in the same format for both BE and LE emulated device types.
>>>
>>> If by this point you agree that the mmio.data[] user-land/kernel
>>> ABI semantics should be just memcpy, stop reading :). If not,
>>> you may want to take a look at the appendix below, where I
>>> describe in great detail the endianness of the data at different
>>> points along the mmio processing code path of the existing ARM LE
>>> KVM and the proposed ARM BE KVM. The appendix is very long and
>>> very detailed, sorry about that, but I feel that earlier, more
>>> digested explanations failed, which drove me to write out
>>> all the details as I see them. If I am wrong, I hope the detailed
>>> explanation makes it easier for folks to point at the places
>>> where my logic goes bad. Also, I am not sure whether this
>>> mail thread is a good place to discuss all the details described
>>> in the appendix. Christoffer, please advise whether I should take
>>> that back to [1]. But I hope this bigger picture may help to
>>> see the mmio.data[] semantics issue in context.
>>> More inline below; the appendix is at the end.
>>>
>>>> On 20 January 2014 11:19, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>>>>> On Mon, Jan 20, 2014 at 03:22:11PM +0100, Alexander Graf wrote:
>>>>>
>>>>>> On 17.01.2014, at 19:52, Peter Maydell <peter.maydell@linaro.org> wrote:
>>>>>>
>>>>>>> On 17 January 2014 17:53, Peter Maydell <peter.maydell@linaro.org> wrote:
>>>>>>> Specifically, the KVM API says "here's a uint8_t[] byte
>>>>>>> array and a length", and the current QEMU code treats that
>>>>>>> as "this is a byte array written as if the guest CPU
>>>>>>> (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
>>>>>>> I/O access to this buffer rather than to the device".
>>>>>>>
>>>>>>> The KVM API docs don't actually specify the endianness
>>>>>>> semantics of the byte array, but I think that that really
>>>>>>> needs to be nailed down. I can think of a couple of options:
>>>>>>> * always LE
>>>>>>> * always BE
>>>>>>>  [these first two are non-starters because they would
>>>>>>>  break either x86 or PPC existing code]
>>>>>>> * always the endianness the guest is at the time
>>>>>>> * always some arbitrary endianness based purely on the
>>>>>>>  endianness the KVM implementation used historically
>>>>>>> * always the endianness of the host QEMU binary
>>>>>>> * something else?
>>>>>>>
>>>>>>> Any preferences? Current QEMU code basically assumes
>>>>>>> "always the endianness of TARGET_WORDS_BIGENDIAN",
>>>>>>> which is pretty random.
>>>>>>
>>>>>> Having thought a little more about this, my opinion is:
>>>>>>
>>>>>> * we should specify that the byte order of the mmio.data
>>>>>>  array is host kernel endianness (ie same endianness
>>>>>>  as the QEMU process itself) [this is what it actually
>>>>>>  is, I think, for all the cases that work today]
>>>
>>> In the above, please consider two types of mapped emulated
>>> h/w devices, BE and LE: they cannot have mmio.data in the
>>> same endianness. Currently, in all observable cases, the
>>> endianness of LE ARM and BE PPC devices matches the
>>> kernel/qemu endianness, but that would break when BE ARM or
>>> LE PPC is introduced, or when one starts emulating BE
>>> devices on LE ARM.
>>>
>>>>>> * we should fix the code path in QEMU for handling
>>>>>>  mmio.data which currently has the implicit assumption
>>>>>>  that when using KVM TARGET_WORDS_BIGENDIAN is the same
>>>>>>  as the QEMU host process endianness (because it's using
>>>>>>  load/store functions which swap if TARGET_WORDS_BIGENDIAN
>>>>>>  is different from HOST_WORDS_BIGENDIAN)
>>>
>>> I do not follow the above. Maybe I am missing the bigger
>>> context. Which CPU is under discussion above? On an ARM V7
>>> system, when an LE device is accessed as an integer, the
>>> &mmio.data[0] address contains the integer in LE format,
>>> i.e. mmio.data[0] is the LSB.
>>>
>>> Here is a gdb session of LE qemu running on a V7 LE kernel
>>> with a TC1 LE guest. The guest kernel accesses the
>>> sys_cfgstat register, which is an arm_sysctl register at
>>> offset 0xa8. Note arm_sysctl is a memory-mapped LE device.
>>> Please check the run->mmio structure after the read
>>> (cpu_physical_memory_rw) completes: it is a 4-byte integer
>>> in LE format, mmio.data[0] is the LSB and is equal to 1
>>> (the s->sys_cfgstat value):
>>>
>>> (gdb) bt
>>> #0  arm_sysctl_read (opaque=0x95a600, offset=168, size=4) at
>>> /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
>>> #1  0x0023b9b4 in memory_region_read_accessor (mr=0x95b8e0,
>>> addr=<optimized out>, value=0xb5c0dc18, size=4, shift=0,
>>> mask=4294967295)
>>>    at /home/root/20131219/qemu-be/memory.c:407
>>> #2  0x0023aba4 in access_with_adjusted_size (addr=4294967295,
>>> value=0xb5c0dc18, value@entry=0xb5c0dc10, size=size@entry=4,
>>> access_size_min=1,
>>>    access_size_max=2357596, access=access@entry=0x23b96c
>>> <memory_region_read_accessor>, mr=mr@entry=0x95b8e0) at
>>> /home/root/20131219/qemu-be/memory.c:477
>>> #3  0x0023f95c in memory_region_dispatch_read1 (size=4, addr=168,
>>> mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:944
>>> #4  memory_region_dispatch_read (size=4, pval=0xb5c0dc68, addr=168,
>>> mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:966
>>> #5  io_mem_read (mr=mr@entry=0x95b8e0, addr=<optimized out>,
>>> pval=pval@entry=0xb5c0dc68, size=size@entry=4) at
>>> /home/root/20131219/qemu-be/memory.c:1743
>>> #6  0x001abd38 in address_space_rw (as=as@entry=0x8102d8
>>> <address_space_memory>, addr=469827752, buf=buf@entry=0xb6fd6028 "",
>>> len=4, is_write=false,
>>>    is_write@entry=true) at /home/root/20131219/qemu-be/exec.c:2025
>>> #7  0x001abf90 in cpu_physical_memory_rw (addr=<optimized out>,
>>> buf=buf@entry=0xb6fd6028 "", len=<optimized out>, is_write=0)
>>>    at /home/root/20131219/qemu-be/exec.c:2070
>>> #8  0x00239e00 in kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1701
>>> #9  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
>>> /home/root/20131219/qemu-be/cpus.c:874
>>> #10 0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
>>> #11 0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> #12 0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
>>> (gdb) p /x s->sys_cfgstat
>>> $25 = 0x1
>>> (gdb) finish
>>> Run till exit from #0  arm_sysctl_read (opaque=0x95a600, offset=168,
>>> size=4) at /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
>>> memory_region_read_accessor (mr=0x95b8e0, addr=<optimized out>,
>>> value=0xb5c0dc18, size=4, shift=0, mask=4294967295) at
>>> /home/root/20131219/qemu-be/memory.c:408
>>> 408        trace_memory_region_ops_read(mr, addr, tmp, size);
>>> Value returned is $26 = 1
>>> (gdb) enable 2
>>> (gdb) cont
>>> Continuing.
>>>
>>> Breakpoint 2, kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1660
>>> 1660            kvm_arch_pre_run(cpu, run);
>>> (gdb) bt
>>> #0  kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1660
>>> #1  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
>>> /home/root/20131219/qemu-be/cpus.c:874
>>> #2  0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
>>> #3  0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> #4  0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
>>> (gdb) p /x run->mmio
>>> $27 = {phys_addr = 0x1c0100a8, data = {0x1, 0x0, 0x0, 0x0, 0x0, 0x0,
>>> 0x0, 0x0}, len = 0x4, is_write = 0x0}
>>>
>>> Also please look at the adjust_endianness function and the
>>> struct MemoryRegion 'endianness' field. IMHO it already
>>> works quite nicely in qemu. MemoryRegion 'read' and 'write'
>>> callbacks return/get data in native format; the
>>> adjust_endianness function checks whether the emulated
>>> device endianness matches the emulator endianness and, if
>>> it differs, does a byteswap according to size. As in the
>>> above example, the arm_sysctl_ops memory region should be
>>> marked as DEVICE_LITTLE_ENDIAN; when it returns the
>>> s->sys_cfgstat value, LE qemu sees that the endianness
>>> matches and does not byteswap the result, so the integer at
>>> the &mmio.data[0] address is in LE form. When qemu runs in
>>> BE mode on a BE kernel, it sees that the endianness
>>> mismatches and byteswaps the native (BE) s->sys_cfgstat
>>> value, so mmio.data contains the integer in LE format again.
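>>>
>>> The adjust_endianness logic is roughly of the following
>>> shape (a paraphrase for illustration, not the exact qemu
>>> source; memory_region_wrong_endianness is the qemu
>>> predicate that compares device and host byte order):
>>>
>>>                if (memory_region_wrong_endianness(mr)) {
>>>                        switch (size) {
>>>                        case 2: *data = bswap16(*data); break;
>>>                        case 4: *data = bswap32(*data); break;
>>>                        case 8: *data = bswap64(*data); break;
>>>                        }
>>>                }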
>>>
>>> Note that in the currently committed code the
>>> arm_sysctl_ops endianness is DEVICE_NATIVE_ENDIAN, which is
>>> wrong - the real vexpress arm_sysctl device always
>>> gives/receives data in LE format regardless of the current
>>> CPSR E bit value, so it cannot be marked as NATIVE. LE and
>>> BE kernels always read it as an LE device; the BE kernel
>>> follows up with a byteswap. That was OK while we only ran
>>> qemu in LE, but it should be fixed to LITTLE_ENDIAN for BE
>>> qemu to work correctly ... and in fact changing the
>>> endianness of that device and a few other ARM-specific
>>> devices to LITTLE_ENDIAN was the only change needed in qemu
>>> to make BE KVM work.
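>>>
>>> The fix amounts to one line in the MemoryRegionOps
>>> declaration, along these lines (a sketch following the
>>> usual qemu convention for such ops structs):
>>>
>>>                static const MemoryRegionOps arm_sysctl_ops = {
>>>                        .read = arm_sysctl_read,
>>>                        .write = arm_sysctl_write,
>>>                        .endianness = DEVICE_LITTLE_ENDIAN,
>>>                };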
>>>
>>>>>
>>>>> Yes, I fully agree :).
>>>> Great, I'll prepare a patch for the KVM API documentation.
>>>>
>>>> -Christoffer
>>>> _______________________________________________
>>>> kvmarm mailing list
>>>> kvmarm@lists.cs.columbia.edu
>>>> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
>>>
>>> Thanks,
>>> Victor
>>>
>>> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-January/thread.html#223186
>>>
>>>
>>>    Appendix
>>>    Data path endianness in ARM KVM mmio
>>>    ====================================
>>>
>>> This writeup considers several scenarios and tracks the
>>> endianness of data as it travels from the emulator to the
>>> guest CPU register, in the case of ARM KVM. It starts with
>>> the currently committed code for the LE KVM host case and
>>> then discusses the proposed BE KVM host arrangement.
>>>
>>> To restrict the discussion, the writeup considers the code
>>> path of an integer (4 bytes) read from h/w mapped emulated
>>> device memory, and the endianness at the essential places
>>> involved in such a code path.
>>>
>>> For all cases when endianness is defined, it is assumed
>>> that the values under consideration are in memory (as
>>> opposed to being in a register, which does not have
>>> endianness). I.e. even if a function variable could
>>> actually be allocated in a CPU register, the writeup will
>>> refer to it as if it were in memory, just to keep the
>>> discussion clean, except for the final guest CPU register.
>>>
>>> Let's consider the following places along data path from
>>> emulator to guest CPU register:
>>>
>>> 1) emulator code that holds the integer value to be read;
>>> assume it is a global 'int emulated_hw_device_val'
>>> variable. Normally the emulator holds it in native endian
>>> format - i.e. the emulator's CPSR E bit is the same as the
>>> kernel's CPSR E bit. Just for discussion's sake assume that
>>> this h/w device register holds 5 as its value.
>>>
>>> 2) the KVM_EXIT_MMIO part of 'struct kvm_run', i.e. the
>>> mmio.data byte array. A byte array does not have
>>> endianness, but for this discussion we track the endianness
>>> of the integer at the &mmio.data[0] address
>>>
>>> 3) the 'data' variable of type 'unsigned long' in the
>>> kvm_handle_mmio_return function before the
>>> vcpu_data_host_to_guest call. The KVM host mmio_read_buf
>>> function is used to fill this variable from the mmio.data
>>> buffer. mmio_read_buf effectively acts as a memcpy from the
>>> mmio.data buffer address, just taking the access size into
>>> account (see the condensed sketch after item 6)
>>>
>>> 4) the same 'data' variable as above, but after the
>>> vcpu_data_host_to_guest function call, just before it is
>>> copied to the vcpu_reg target register location. Note
>>> vcpu_data_host_to_guest may byteswap the value of 'data'
>>> depending on the current KVM host endianness and the value
>>> of the guest CPSR E bit (again, see the sketch below)
>>>
>>> 5) the guest CPU spilled register array, at the location of
>>> the target register, i.e. the integer at the
>>> vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt) address
>>>
>>> 6) finally, the guest CPU register filled from vcpu_reg
>>> just before the guest resumes execution of the trapped
>>> emulated instruction. Note this is done by the hypervisor
>>> part of the code, and the hypervisor EE bit is the same as
>>> the KVM host CPSR E bit.
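>>>
>>> For steps 3) and 4), here is a condensed sketch of the two
>>> helpers (simplified to the 4-byte case, and only an
>>> approximation of the real ARM KVM code, to show where the
>>> conditional byteswap lives):
>>>
>>>                /* step 3: plain byte copy, no endianness
>>>                 * interpretation (what mmio_read_buf does) */
>>>                unsigned long data = 0;
>>>                memcpy(&data, run->mmio.data, 4);
>>>
>>>                /* step 4: swap only when the guest E bit
>>>                 * disagrees with the host byte order;
>>>                 * cpu_to_be32 is a nop on a BE host and a
>>>                 * swap on an LE host */
>>>                if (kvm_vcpu_is_be(vcpu))
>>>                        data = cpu_to_be32(data);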
>>>
>>> Note again, the KVM host, the emulator, and the hypervisor
>>> part of the code (the guest CPU register save and restore
>>> code) always run in the same endianness. The endianness of
>>> the accessed emulated devices and the endianness of the
>>> guest vary independently of the KVM host endianness.
>>>
>>> The sections below consider all permutations of the
>>> possible cases; it may be quite boring to read. I've
>>> created a summary table at the end; you can jump to the
>>> table after reading a few cases. But if you have objections
>>> and you see things happening differently, please comment
>>> inline on the use case steps.
>>>
>>> LE KVM host
>>> ===========
>>>
>>> Use case 1
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianness is LE (host CPSR E bit is off); guest is
>>> compiled in LE mode; and the guest does the access with the
>>> CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds the integer in LE format, matching
>>> the device endianness
>>> 3) 'data' is LE
>>> 4) 'data' is LE (since the guest CPSR E bit is off, no
>>> byteswap)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 5 (0x00000005)
>>>
>>> guest resumes execution ... let's say after an
>>> 'ldr r1, [r0]' instruction, where r0 holds the address of
>>> the device; the guest knows that it reads LE mapped h/w, so
>>> no additional processing is needed
>>>
>>> Use case 2
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianness is LE (host CPSR E bit is off); guest is
>>> compiled in BE mode; and the guest does the access with the
>>> CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds the integer in LE format, matching
>>> the device endianness
>>> 3) 'data' is LE
>>> 4) 'data' is BE (since the guest CPSR E bit is on,
>>> vcpu_data_host_to_guest does the byteswap: cpu_to_be)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 0x05000000
>>>
>>> guest resumes execution after 'ldr r1, [r0]'; the guest
>>> kernel knows that it runs in BE mode (E bit on) and that it
>>> reads LE device memory, so it needs to byteswap r1 before
>>> further processing: it does 'rev r1, r1' and proceeds with
>>> the result
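>>>
>>> (In practice the guest rarely writes the rev explicitly: a
>>> Linux BE guest driver would normally use readl(), which
>>> performs the LE-to-CPU conversion internally, roughly
>>>
>>>                u32 v = le32_to_cpu(__raw_readl(addr));
>>>
>>> so the byteswap in this use case is hidden inside the
>>> standard accessor.)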
>>>
>>> Use case 3
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianness is LE (host CPSR E bit is off); guest is
>>> compiled in LE mode; and the guest does the access with the
>>> CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds the integer in BE format; the
>>> emulator byteswaps it because it knows that the device
>>> endianness is opposite to native, and it should match the
>>> device endianness
>>> 3) 'data' is BE
>>> 4) 'data' is BE (since the guest CPSR E bit is off, no
>>> byteswap)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 0x05000000
>>>
>>> guest resumes execution after 'ldr r1, [r0]'; the guest
>>> kernel knows that it runs in LE mode (E bit off) and that
>>> it reads BE device memory, so it needs to byteswap r1
>>> before further processing: it does 'rev r1, r1' and
>>> proceeds with the result
>>>
>>> Use case 4
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianness is LE (host CPSR E bit is off); guest is
>>> compiled in BE mode; and the guest does the access with the
>>> CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds the integer in BE format; the
>>> emulator byteswaps it because it knows that the device
>>> endianness is opposite to native, and it should match the
>>> device endianness
>>> 3) 'data' is BE
>>> 4) 'data' is LE (since the guest CPSR E bit is on,
>>> vcpu_data_host_to_guest does the byteswap: cpu_to_be)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 5 (0x00000005)
>>>
>>> guest resumes execution after 'ldr r1, [r0]'; the guest
>>> kernel knows that it runs in BE mode and that it reads BE
>>> device memory, so it does not need to do anything before
>>> further processing.
>>>
>>>
>>> The above use cases are exactly what we have now after
>>> Marc's commit to support a BE guest on an LE KVM host. The
>>> further use cases describe how it would work with the BE
>>> KVM patches I proposed. It is understood that this is
>>> subject to further discussion.
>>>
>>>
>>> BE KVM host
>>> ===========
>>>
>>> Use case 5
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianness is BE (host CPSR E bit is on); guest is
>>> compiled in BE mode; and the guest does the access with the
>>> CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds the integer in LE format; the
>>> emulator byteswaps it because it knows that the device
>>> endianness is opposite to native; it matches the device
>>> endianness
>>> 3) 'data' is LE
>>> 4) 'data' is LE (since the guest CPSR E bit is on, the BE
>>> KVM host kernel does *not* byteswap: cpu_to_be has no
>>> effect in a BE host kernel)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 0x05000000
>>> because the hypervisor runs in BE mode, so the load of an
>>> LE integer yields a byteswapped value in the register
>>>
>>> guest resumes execution after 'ldr r1, [r0]'; the guest
>>> kernel knows that it runs in BE mode and that it reads LE
>>> device memory, so it needs to byteswap r1 before further
>>> processing: it does 'rev r1, r1' and proceeds with the
>>> result
>>>
>>> Use case 6
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianness is BE (host CPSR E bit is on); guest is
>>> compiled in LE mode; and the guest does the access with the
>>> CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds the integer in LE format; the
>>> emulator byteswaps it because it knows that the device
>>> endianness is opposite to native; it matches the device
>>> endianness
>>> 3) 'data' is LE
>>> 4) 'data' is BE (since the guest CPSR E bit is off, the BE
>>> KVM host kernel does the byteswap: cpu_to_le)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 5 (0x00000005)
>>> because the hypervisor runs in BE mode, so the load of a BE
>>> integer is OK
>>>
>>> guest resumes execution after 'ldr r1, [r0]'; the guest
>>> kernel knows that it runs in LE mode and that it reads LE
>>> device memory, so it does not need to do anything else; it
>>> just proceeds
>>>
>>> Use case 7
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianness is BE (host CPSR E bit is on); guest is
>>> compiled in BE mode; and the guest does the access with the
>>> CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds the integer in BE format, matching
>>> the device endianness
>>> 3) 'data' is BE
>>> 4) 'data' is BE (since the guest CPSR E bit is on, the BE
>>> KVM host kernel does *not* byteswap: cpu_to_be has no
>>> effect in a BE host kernel)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 5 (0x00000005)
>>> because the hypervisor runs in BE mode, so the load of a BE
>>> integer is OK
>>>
>>> guest resumes execution after 'ldr r1, [r0]'; the guest
>>> kernel knows that it runs in BE mode and that it reads BE
>>> device memory, so it does not need to do anything else; it
>>> just proceeds
>>>
>>> Use case 8
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianness is BE (host CPSR E bit is on); guest is
>>> compiled in LE mode; and the guest does the access with the
>>> CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds the integer in BE format, matching
>>> the device endianness
>>> 3) 'data' is BE
>>> 4) 'data' is LE (since the guest CPSR E bit is off, the BE
>>> KVM host kernel does the byteswap: cpu_to_le)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 0x05000000
>>> because the hypervisor runs in BE mode, so the load of an
>>> LE integer yields a byteswapped value in the register
>>>
>>> guest resumes execution after 'ldr r1, [r0]'; the guest
>>> kernel knows that it runs in LE mode and that it reads BE
>>> device memory, so it needs to byteswap r1 before further
>>> processing: it does 'rev r1, r1' and proceeds with the
>>> result
>>>
>>> Note that with a BE kernel we actually have some initial
>>> portion of assembler code that is executed with the CPSR E
>>> bit off and reads LE h/w - i.e. it falls into use case 1.
>>>
>>> Summary Table (please use fixed font to see it correctly)
>>> ========================================
>>>
>>> --------------------------------------------------------------
>>> | Use Case # | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   |
>>> --------------------------------------------------------------
>>> | KVM Host,  | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
>>> | Emulator,  |     |     |     |     |     |     |     |     |
>>> | Hypervisor |     |     |     |     |     |     |     |     |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Device     | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Guest      | LE  | BE  | LE  | BE  | BE  | LE  | BE  | LE  |
>>> | Access     |     |     |     |     |     |     |     |     |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Step 1)    | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 2)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 3)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 4)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
>>> --------------------------------------------------------------
>>> | Step 5)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
>>> --------------------------------------------------------------
>>> | Final Reg  | no  | yes | yes | no  | yes | no  | no  | yes |
>>> | value      |     |     |     |     |     |     |     |     |
>>> | byteswapped|     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Guest      | no  | yes | yes | no  | yes | no  | no  | yes |
>>> | Follows    |     |     |     |     |     |     |     |     |
>>> | with rev   |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>>
>>> A few observations
>>> ==================
>>>
>>> x) Note the above table is symmetric w.r.t. the BE<->LE
>>> change:
>>>       1<-->7
>>>       2<-->8
>>>       3<-->5
>>>       4<-->6
>>>
>>> x) the &mmio.data[0] address always holds the integer in
>>> the same format as the emulated device endianness
>>>
>>> x) During step 4), when the vcpu_data_host_to_guest
>>> function is used, if the guest E bit value differs but
>>> everything else is the same, opposite results are produced
>>> (1&2, 3&4, 5&6, 7&8)
>>>
>>> If you reached the end :), again, thank you very much for
>>> reading it!
>>>
>>> - Victor
>>> _______________________________________________
>>> kvmarm mailing list
>>> kvmarm@lists.cs.columbia.edu
>>> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
>>
>> Hi Victor,
>>
>> First of all I really appreciate the thorough description with
>> all the use-cases.
>>
>> Below would be a summary of what I understood from your
>> analysis:
>>
>> 1. Any MMIO device marked as NATIVE ENDIAN in user
>
> "Native endian" really is just a shortcut for "target endian"
> which is LE for ARM and BE for PPC. There shouldn't be
> a qemu-system-armeb or qemu-system-ppc64le.

I disagree. A fully functional ARM BE system is what we've
been working on for the last few months. 'We' is the Linaro
Networking Group, its Endian subteam, and some other folks
in ARM and across the community. Why we are doing that is a
bit beyond this discussion.

ARM BE patches for both V7 and V8 are already in the
mainline kernel. But the ARM BE KVM host is broken now. It
is a known deficiency that I am trying to fix. Please look
at [1]. Patches for V7 BE KVM were proposed and are
currently under active discussion. Currently I am working
on the ARM V8 BE KVM changes.

So "native endian" on ARM is the value of the CPSR register
E bit. If it is off, native endian is LE; if it is on, it
is BE.

Once and if we agree on the ARM BE KVM host changes, the
next step would be patches in qemu, one of which introduces
qemu-system-armeb. Please see [2].

> QEMU emulates everything that comes after the CPU, so
> imagine the ioctl struct as a bus package. Your bus
> doesn't care what endianness the CPU is in - it just
> gets data from the CPU.

I am not sure that I follow the above. Suppose I have

mov r1, #1
str r1, [r0]

where r0 is a device address. Now, depending on the CPSR E
bit value, the device address will receive 1 as an integer
either in LE order or in BE order. That is how an ARM v7
CPU works, regardless of whether it is emulated or not.

So if the E bit is off (LE case), after str is executed
 byte at r0 address will get 1
 byte at r0 + 1 address will get 0
 byte at r0 + 2 address will get 0
 byte at r0 + 3 address will get 0

If the E bit is on (BE case), after str is executed
 byte at r0 address will get 0
 byte at r0 + 1 address will get 0
 byte at r0 + 2 address will get 0
 byte at r0 + 3 address will get 1

my point is that mmio.data[] just carries the bytes for
phys_addr: mmio.data[0] is the value for the byte at
phys_addr, mmio.data[1] the value for the byte at
phys_addr + 1, and so on.
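
To pin that byte-stream view down, the two outcomes above
can be written as a small check (hypothetical flag name,
just illustrating the semantics, not real KVM code):

 /* after 'str r1, [r0]' with r1 == 1 traps to KVM */
 if (guest_E_bit_was_off) {
         assert(run->mmio.data[0] == 1);  /* LSB first */
         assert(run->mmio.data[3] == 0);
 } else {
         assert(run->mmio.data[0] == 0);
         assert(run->mmio.data[3] == 1);  /* MSB first */
 }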

> A bus write on the CPU however honors the endianness
> setting of the CPU. So when we convert from a value in
> register to a value on the bus we need to take this endian
> configuration into account.

for a read it is the same: mmio.data[0] just carries the
memory for the emulated phys_addr. It is the same as for
the write case.

But if one wants to look at the endianness of the integer
at the &mmio.data[0] address, its endianness is really
defined by the endianness of the emulated memory-mapped
h/w device.

Not sure, maybe I am missing your point.

Also please consider that the endianness of device memory
can be BE or LE; it does not depend on the "native
endianness" and can exist in any combination, and it works
because in all the proper places an explicit byteswap is
executed by the code that works with device memory of the
opposite endianness. Admittedly for ARM the dominant case
now is LE devices, but nothing prevents us from attaching
memory-mapped devices that work in BE mode. For example my
parent company, Cisco, whose Linaro assignee I am, has a
lot of fabric chips that operate in BE, and once attached
to the system they would be treated properly - read in BE
mode without a byteswap, and read with a byteswap in LE
mode. Note the last point is an oversimplified picture.

Thanks,
Victor

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2013-December/220973.html

[2] https://git.linaro.org/people/victor.kamensky/qemu-be.git/shortlog/refs/heads/armv7be

> That's exactly what we are talking about here. KVM
> should do the cpu configured register->bus endian
> mapping while QEMU does the bus->device endian map.
>
> Alex
>
>> space tool (QEMU or KVMTOOL) is bad for a cross-endian
>> Guest. For supporting a cross-endian Guest we need to have
>> all MMIO devices with fixed ENDIANNESS.
>>
>> 2. We don't need to do any endianness conversions in KVM
>> for MMIO writes that are being forwarded to user space. It is
>> the job of user space (QEMU or KVMTOOL) to interpret the
>> endianness of MMIO write data based on device endianness.
>>
>> 3. The MMIO read operation is the one which will need
>> explicit handling in KVM because the target VCPU register
>> of MMIO read operation should be loaded with MMIO data
>> (returned from user space) based upon current VCPU
>> endianness (i.e. VCPU CPSR.E bit).
>>
>> 4. In-kernel emulated devices (such as VGIC) will not
>> require any explicit endianness conversion of MMIO data for
>> MMIO write operations (same as point 2).
>>
>> 5. In-kernel emulated devices (such as VGIC) will have to do
>> explicit endianness conversion of MMIO data for MMIO read
>> operations based on device endianness (same as point 3).
>>
>> I hope the above summary of my understanding is as per your
>> description. If so then I am in support of your description.
>>
>> I think your description (and the above 5 points) takes care
>> of all the use cases of cross-endianness without changing
>> the current MMIO ABI.
>>
>> Regards,
>> Anup
>>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
@ 2014-01-22  7:26               ` Victor Kamensky
  0 siblings, 0 replies; 102+ messages in thread
From: Victor Kamensky @ 2014-01-22  7:26 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Thomas Falcon, kvm-devel, Anup Patel, QEMU Developers, qemu-ppc,
	kvmarm, Christoffer Dall

On 21 January 2014 22:41, Alexander Graf <agraf@suse.de> wrote:
>
>
>> Am 22.01.2014 um 07:31 schrieb Anup Patel <anup@brainfault.org>:
>>
>> On Wed, Jan 22, 2014 at 11:09 AM, Victor Kamensky
>> <victor.kamensky@linaro.org> wrote:
>>> Hi Guys,
>>>
>>> Christoffer and I had a bit heated chat :) on this
>>> subject last night. Christoffer, really appreciate
>>> your time! We did not really reach agreement
>>> during the chat and Christoffer asked me to follow
>>> up on this thread.
>>> Here it goes. Sorry, it is very long email.
>>>
>>> I don't believe we can assign any endianity to
>>> mmio.data[] byte array. I believe mmio.data[] and
>>> mmio.len acts just memcpy and that is all. As
>>> memcpy does not imply any endianity of underlying
>>> data mmio.data[] should not either.
>>>
>>> Here is my definition:
>>>
>>> mmio.data[] is array of bytes that contains memory
>>> bytes in such form, for read case, that if those
>>> bytes are placed in guest memory and guest executes
>>> the same read access instruction with address to this
>>> memory, result would be the same as real h/w device
>>> memory access. Rest of KVM host and hypervisor
>>> part of code should really take care of mmio.data[]
>>> memory so it will be delivered to vcpu registers and
>>> restored by hypervisor part in such way that guest CPU
>>> register value is the same as it would be for real
>>> non-emulated h/w read access (that is emulation part).
>>> The same goes for write access, if guest writes into
>>> memory and those bytes are just copied to emulated
>>> h/w register it would have the same effect as real
>>> mapped h/w register write.
>>>
>>> In shorter form, i.e for len=4 access: endianity of integer
>>> at &mmio.data[0] address should match endianity
>>> of emulated h/w device behind phys_addr address,
>>> regardless what is endianity of emulator, KVM host,
>>> hypervisor, and guest
>>>
>>> Examples that illustrate my definition
>>> --------------------------------------
>>>
>>> 1) LE guest (E bit is off in ARM speak) reads integer
>>> (4 bytes) from mapped h/w LE device register -
>>> mmio.data[3] contains MSB, mmio.data[0] contains LSB.
>>>
>>> 2) BE guest (E bit is on in ARM speak) reads integer
>>> from mapped h/w LE device register - mmio.data[3]
>>> contains MSB, mmio.data[0] contains LSB. Note that
>>> if &mmio.data[0] memory would be placed in guest
>>> address space and instruction restarted with new
>>> address, then it would meet BE guest expectations
>>> - the guest knows that it reads LE h/w so it will byteswap
>>> register before processing it further. This is BE guest ARM
>>> case (regardless of what KVM host endianity is).
>>>
>>> 3) BE guest reads integer from mapped h/w BE device
>>> register - mmio.data[0] contains MSB, mmio.data[3]
>>> contains LSB. Note that if &mmio.data[0] memory would
>>> be placed in guest address space and instruction
>>> restarted with new address, then it would meet BE
>>> guest expectation - the guest knows that it reads
>>> BE h/w so it will proceed further without any other
>>> work. I guess, it is BE ppc case.
>>>
>>>
>>> Arguments in favor of memcpy semantics of mmio.data[]
>>> ------------------------------------------------------
>>>
>>> x) What are possible values of 'len'? Previous discussions
>>> imply that is always powers of 2. Why is that? Maybe
>>> there will be CPU that would need to do 5 bytes mmio
>>> access, or 6 bytes. How do you assign endianity to
>>> such case? 'len' 5 or 6, or any works fine with
>>> memcpy semantics. I admit it is hypothetical case, but
>>> IMHO it tests how clean ABI definition is.
>>>
>>> x) Byte array does not have endianity because it
>>> does not have any structure. If one would want to
>>> imply structure why mmio is not defined in such way
>>> so structure reflected in mmio definition?
>>> Something like:
>>>
>>>
>>>                /* KVM_EXIT_MMIO */
>>>                struct {
>>>                          __u64 phys_addr;
>>>                          union {
>>>                               __u8 byte;
>>>                               __u16 hword;
>>>                               __u32 word;
>>>                               __u64 dword;
>>>                          }  data;
>>>                          __u32 len;
>>>                          __u8  is_write;
>>>                } mmio;
>>>
>>> where len is really serves as union discriminator and
>>> only allowed len values are 1, 2, 4, 8.
>>> In this case, I agree, endianity of integer types
>>> should be defined. I believe, use of byte array strongly
>>> implies that original intent was to have semantics of
>>> byte stream copy, just like memcpy does.
>>>
>>> x) Note there is nothing wrong with user kernel ABI to
>>> use just bytes stream as parameter. There is already
>>> precedents like 'read' and 'write' system calls :).
>>>
>>> x) Consider case when KVM works with emulated memory mapped
>>> h/w devices where some devices operate in LE mode and others
>>> operate in BE mode. It is defined by semantics of real h/w
>>> device which is it, and should be emulated by emulator and KVM
>>> given all other context. As far as mmio.data[] array concerned, if the
>>> same integer value is read from these devices registers, mmio.data[]
>>> memory should contain integer in opposite endianity for these
>>> two cases, i.e MSB is data[0] in one case and MSB is
>>> data[3] is in another case. It cannot be the same, because
>>> except emulator and guest kernel, all other, like KVM host
>>> and hypervisor, have no clue what endianity of device
>>> actually is - it should treat mmio.data[] in the same way.
>>> But resulting guest target CPU register would need to contain
>>> normal integer value in one case and byteswapped in another,
>>> because guest kernel would use it directly in one case and
>>> byteswap it in another. Byte stream semantics allows to do
>>> that. I don't see how it could happen if you fixate mmio.data[]
>>> endianity in such way that it would contain integer in
>>> the same format for BE and LE emulated device types.
>>>
>>> If by this point you agree, that mmio.data[] user-land/kernel
>>> ABI semantics should be just memcpy, stop reading :). If not,
>>> you may would like to take a look at below appendix where I
>>> described in great details endianity of data at different
>>> points along mmio processing code path of existing ARM LE KVM,
>>> and proposed ARM BE KVM. Note appendix, is very long and very
>>> detailed, sorry about that, but I feel that earlier more
>>> digested explanations failed, so it driven me to write out
>>> all details how I see them. If I am wrong, I hope it would be
>>> easier for folks to point in detailed explanation places
>>> where my logic goes bad. Also, I am not sure whether this
>>> mail thread is good place to discuss all details described
>>> in the appendix. Christoffer, please advise whether I should take
>>> that one back on [1]. But I hope this bigger picture may help to
>>> see the mmio.data[] semantics issue in context.
>>>
>>> More inline and appendix is at the end.
>>>
>>>> On 20 January 2014 11:19, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>>>>> On Mon, Jan 20, 2014 at 03:22:11PM +0100, Alexander Graf wrote:
>>>>>
>>>>>> On 17.01.2014, at 19:52, Peter Maydell <peter.maydell@linaro.org> wrote:
>>>>>>
>>>>>>> On 17 January 2014 17:53, Peter Maydell <peter.maydell@linaro.org> wrote:
>>>>>>> Specifically, the KVM API says "here's a uint8_t[] byte
>>>>>>> array and a length", and the current QEMU code treats that
>>>>>>> as "this is a byte array written as if the guest CPU
>>>>>>> (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
>>>>>>> I/O access to this buffer rather than to the device".
>>>>>>>
>>>>>>> The KVM API docs don't actually specify the endianness
>>>>>>> semantics of the byte array, but I think that that really
>>>>>>> needs to be nailed down. I can think of a couple of options:
>>>>>>> * always LE
>>>>>>> * always BE
>>>>>>>  [these first two are non-starters because they would
>>>>>>>  break either x86 or PPC existing code]
>>>>>>> * always the endianness the guest is at the time
>>>>>>> * always some arbitrary endianness based purely on the
>>>>>>>  endianness the KVM implementation used historically
>>>>>>> * always the endianness of the host QEMU binary
>>>>>>> * something else?
>>>>>>>
>>>>>>> Any preferences? Current QEMU code basically assumes
>>>>>>> "always the endianness of TARGET_WORDS_BIGENDIAN",
>>>>>>> which is pretty random.
>>>>>>
>>>>>> Having thought a little more about this, my opinion is:
>>>>>>
>>>>>> * we should specify that the byte order of the mmio.data
>>>>>>  array is host kernel endianness (ie same endianness
>>>>>>  as the QEMU process itself) [this is what it actually
>>>>>>  is, I think, for all the cases that work today]
>>>
>>> In above please consider two types of mapped emulated
>>> h/w devices: BE and LE they cannot have mmio.data in the
>>> same endianity. Currently in all observable cases LE ARM
>>> and BE PPC devices endianity matches kernel/qemu
>>> endianity but it would break when BE ARM is introduced
>>> or LE PPC or one would start emulating BE devices on LE
>>> ARM.
>>>
>>>>>> * we should fix the code path in QEMU for handling
>>>>>>  mmio.data which currently has the implicit assumption
>>>>>>  that when using KVM TARGET_WORDS_BIGENDIAN is the same
>>>>>>  as the QEMU host process endianness (because it's using
>>>>>>  load/store functions which swap if TARGET_WORDS_BIGENDIAN
>>>>>>  is different from HOST_WORDS_BIGENDIAN)
>>>
>>> I do not follow above. Maybe I am missing bigger context.
>>> What is CPU under discussion in above? On ARM V7 system
>>> when LE device is accessed as integer &mmio.data[0] address
>>> would contain integer is in LE format, ie mmio.data[0] is LSB.
>>>
>>> Here is gdb session of LE qemu running on V7 LE kernel and
>>> TC1 LE guest. Guest kernel accesses sys_cfgstat register which is
>>> arm_sysctl registers with offset of 0xa8. Note.arm_sysct is memory
>>> mapped LE device.
>>> Please check run->mmio structure after read
>>> (cpu_physical_memory_rw) completes it is in 4 bytes integer in
>>> LE format mmio.data[0] is LSB and is equal to 1
>>> (s->syscfgstat value):
>>>
>>> (gdb) bt
>>> #0  arm_sysctl_read (opaque=0x95a600, offset=168, size=4) at
>>> /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
>>> #1  0x0023b9b4 in memory_region_read_accessor (mr=0x95b8e0,
>>> addr=<optimized out>, value=0xb5c0dc18, size=4, shift=0,
>>> mask=4294967295)
>>>    at /home/root/20131219/qemu-be/memory.c:407
>>> #2  0x0023aba4 in access_with_adjusted_size (addr=4294967295,
>>> value=0xb5c0dc18, value@entry=0xb5c0dc10, size=size@entry=4,
>>> access_size_min=1,
>>>    access_size_max=2357596, access=access@entry=0x23b96c
>>> <memory_region_read_accessor>, mr=mr@entry=0x95b8e0) at
>>> /home/root/20131219/qemu-be/memory.c:477
>>> #3  0x0023f95c in memory_region_dispatch_read1 (size=4, addr=168,
>>> mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:944
>>> #4  memory_region_dispatch_read (size=4, pval=0xb5c0dc68, addr=168,
>>> mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:966
>>> #5  io_mem_read (mr=mr@entry=0x95b8e0, addr=<optimized out>,
>>> pval=pval@entry=0xb5c0dc68, size=size@entry=4) at
>>> /home/root/20131219/qemu-be/memory.c:1743
>>> #6  0x001abd38 in address_space_rw (as=as@entry=0x8102d8
>>> <address_space_memory>, addr=469827752, buf=buf@entry=0xb6fd6028 "",
>>> len=4, is_write=false,
>>>    is_write@entry=true) at /home/root/20131219/qemu-be/exec.c:2025
>>> #7  0x001abf90 in cpu_physical_memory_rw (addr=<optimized out>,
>>> buf=buf@entry=0xb6fd6028 "", len=<optimized out>, is_write=0)
>>>    at /home/root/20131219/qemu-be/exec.c:2070
>>> #8  0x00239e00 in kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1701
>>> #9  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
>>> /home/root/20131219/qemu-be/cpus.c:874
>>> #10 0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
>>> #11 0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> #12 0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
>>> (gdb) p /x s->sys_cfgstat
>>> $25 = 0x1
>>> (gdb) finish
>>> Run till exit from #0  arm_sysctl_read (opaque=0x95a600, offset=168,
>>> size=4) at /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
>>> memory_region_read_accessor (mr=0x95b8e0, addr=<optimized out>,
>>> value=0xb5c0dc18, size=4, shift=0, mask=4294967295) at
>>> /home/root/20131219/qemu-be/memory.c:408
>>> 408        trace_memory_region_ops_read(mr, addr, tmp, size);
>>> Value returned is $26 = 1
>>> (gdb) enable 2
>>> (gdb) cont
>>> Continuing.
>>>
>>> Breakpoint 2, kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1660
>>> 1660            kvm_arch_pre_run(cpu, run);
>>> (gdb) bt
>>> #0  kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1660
>>> #1  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
>>> /home/root/20131219/qemu-be/cpus.c:874
>>> #2  0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
>>> #3  0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> #4  0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
>>> (gdb) p /x run->mmio
>>> $27 = {phys_addr = 0x1c0100a8, data = {0x1, 0x0, 0x0, 0x0, 0x0, 0x0,
>>> 0x0, 0x0}, len = 0x4, is_write = 0x0}
>>>
>>> Also please look at adjust_endianness function and
>>> struct MemoryRegion 'endianness' field. IMHO in qemu it
>>> works quite nicely already. MemoryRegion 'read' and 'write'
>>> callbacks return/get data in native format adjust_endianness
>>> function checks whether emulated device endianness matches
>>> emulator endianness and if it is different it does byteswap
>>> according to size. As in above example arm_sysctl_ops memory
>>> region should be marked as DEVICE_LITTLE_ENDIAN when it
>>> returns s->sys_cfgstat value LE qemu sees that endianity
>>> matches and it does not byteswap of result, so integer at
>>> &mmio.data[0] address is in LE form. When qemu would
>>> run in BE mode on BE kernel, it would see that endianity
>>> mismatches and it will byteswap s->sys_cfgstat native value
>>> (BE), so mmio.data would contain integer in LE format again.
>>>
>>> Note in currently committed code arm_sysctl_ops endianity
>>> is DEVICE_NATIVE_ENDIAN, which is wrong - real vexpress
>>> arm_sysctl device always gives/receives data in LE format regardless
>>> of current CPSR E bit value, so it cannot be marked as NATIVE.
>>> LE and BE kernels always read it as LE device; BE kernel follows
>>> with byteswap. It was OK while we just run qemu in LE, but it
>>> should be fixed to be LITTLE_ENDIAN for BE qemu work correctly
>>> ... and actually that device and few other ARM specific devices
>>> endianity change to LITTLE_ENDIAN was the only change in qemu
>>> to make BE KVM to work.
>>>
>>>>>
>>>>> Yes, I fully agree :).
>>>> Great, I'll prepare a patch for the KVM API documentation.
>>>>
>>>> -Christoffer
>>>> _______________________________________________
>>>> kvmarm mailing list
>>>> kvmarm@lists.cs.columbia.edu
>>>> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
>>>
>>> Thanks,
>>> Victor
>>>
>>> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-January/thread.html#223186
>>>
>>>
>>>    Appendix
>>>    Data path endianity in ARM KVM mmio
>>>    ===================================
>>>
>>> This writeup considers several scenarios and tracks endianity
>>> of data how it travels from emulator to guest CPU register, in
>>> case of ARM KVM. It starts with currently committed code for LE
>>> KVM host case and further discusses proposed BE KVM host
>>> arrangement.
>>>
>>> Just to restrict discussion writeup considers code path of
>>> integer (4 bytes) read from h/w mapped emulated device memory.
>>> Writeup considers endianity of essential places involved in such
>>> code path.
>>>
>>> For all cases when endianity is defined, it is assumed that
>>> values under consideration are in memory (opposite to be in
>>> register that does not have endianity). I.e even if function
>>> variable could be actually allocated in CPU register writeup
>>> will reference to it as it is in memory, just to keep
>>> discussion clean, except for final guest CPU register.
>>>
>>> Let's consider the following places along data path from
>>> emulator to guest CPU register:
>>>
>>> 1) emulator code that holds integer value to be read, assume
>>> it would be global 'int emulated_hw_device_val' variable.
>>> Normally in emulator it is held in native endian format - i.e
>>> it is CPSR E bit is the same as kernel CPSR E bit. Just for
>>> discussion sake assume that this h/w device registers
>>> holds 5 as its value.
>>>
>>> 2) KVM_EXIT_MMIO part of 'struct kvm_run' structure, i.e
>>> mmio.data byte array. Byte array does not have endianity,
>>> but for this discussion it would track endianity of integer
>>> at &mmio.data[0] address
>>>
>>> 3) 'data' variable type of 'unsigned long' in
>>> kvm_handle_mmio_return function before vcpu_data_host_to_guest
>>> call. KVM host mmio_read_buf function is used to fill this
>>> variable from mmio.data buffer. mmio_read_buf actually
>>> acts as memcpy from mmio.data buffer address,
>>> just taking access size in account.
>>>
>>> 4) the same 'data' variable as above, but after
>>> vcpu_data_host_to_guest function call, just before it is copied
>>> to vcpu_reg target register location. Note
>>> vcpu_data_host_to_guest function may byteswap value of 'data'
>>> depending on current KVM host endianity and value of
>>> guest CPSR E bit.
>>>
>>> 5) guest CPU spilled register array, location of target register
>>> i.e integer at vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt) address
>>>
>>> 6) finally guest CPU register filled from vcpu_reg just before
>>> guest resume execution of trapped emulated instruction. Note
>>> it is done by hypervisor part of code and hypervisor EE bit is
>>> the same as KVM host CPSR E bit.
>>>
>>> Note again, KVM host, emulator, and hypervisor part of code (guest
>>> CPU registers save and restore code) always run in the same
>>> endianity. Endianity of accessed emulated devices and endianity
>>> of guest varies independently of KVM host endianity.
>>>
>>> Below sections consider all permutations of all possible cases,
>>> it maybe quite boring to read. I've created summary table at
>>> the end, you can jump to the table, after reading few cases.
>>> But if you have objections and you see things happen differently
>>> please comment inline of the use cases steps.
>>>
>>> LE KVM host
>>> ===========
>>>
>>> Use case 1
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in LE format, matches device
>>> endianity
>>> 3) 'data' is LE
>>> 4) 'data' is LE (since guest CPSR E bit is off no byteswap)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 5 (0x00000005)
>>>
>>> guest resumes execution ... Let's say after 'ldr r1, [r0]'
>>> instruction, where r0 holds address of devices, it knows
>>> that it reads LE mapped h/w so no addition processing is
>>> needed
>>>
>>> Use case 2
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in LE format; matches device
>>> endianity
>>> 3) 'data' is LE
>>> 4) 'data' is BE (since guest CPSR E bit is on, vcpu_data_host_to_guest
>>> will do byteswap: cpu_to_be)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 0x05000000
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in BE mode (E bit on), it knows that it reads
>>> LE device memory, it needs to byteswap r1 before further
>>> processing so it does 'rev r1, r1' and proceed with result
>>>
>>> Use case 3
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in BE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native,
>>> and it should match device endianity
>>> 3) 'data' is BE
>>> 4) 'data' is BE (since guest CPSR E bit is off no byteswap)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 0x05000000
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in LE mode (E bit off), it knows that it
>>> reads BE device memory, it need to byteswap r1 before further
>>> processing so it does 'rev r1, r1' and proceeds with result
>>>
>>> Use case 4
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in BE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native,
>>> and should match device endianity
>>> 3) 'data' is BE
>>> 4) 'data' is LE (since guest CPSR E bit is on, vcpu_data_host_to_guest
>>> will do byteswap: cpu_to_be)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 5 (0x00000005)
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in BE mode, it knows that it reads BE device
>>> memory, so it does not need to do anything before further
>>> processing.
>>>
>>>
>>> Above uses cases that is exactly what we have now after Marc's
>>> commit to support BE guest on LE KVM host. Further use
>>> cases describe how it would work with BE KVM patches I proposed.
>>> It is understood that it is subject of further discussion.
>>>
>>>
>>> BE KVM host
>>> ===========
>>>
>>> Use case 5
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in LE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native;
>>> matches device endianity
>>> 3) 'data' is LE
>>> 4) 'data' is LE (since guest CPSR E bit is on, BE KVM host kernel
>>> does *not* do byteswap: cpu_to_be no effect in BE host kernel)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 0x05000000 because
>>> hypervisor runs in BE mode, so load of LE integer will be
>>> byteswapped value in register
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in BE mode, it knows that it reads LE device
>>> memory, it need to byteswap r1 before further processing so it
>>> does 'rev r1, r1' and proceeds with result
>>>
>>> Use case 6
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in LE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native;
>>> matches device endianity
>>> 3) 'data' is LE
>>> 4) 'data' is BE (since guest CPSR E bit is off, BE KVM host kernel
>>> does byteswap: cpu_to_le)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 5 (0x00000005) because
>>> hypervisor runs in BE mode, so load of BE integer will be OK
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in LE mode, it knows that it reads LE device
>>> memory, so it does not need to do anything else it just proceeds
>>>
>>> Use case 7
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in BE format; matches device
>>> endianity
>>> 3) 'data' is BE
>>> 4) 'data' is BE (since guest CPSR E bit is on, BE KVM host kernel
>>> does *not* do byteswap: cpu_to_be no effect in BE host kernel)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 5 (0x00000005) because
>>> hypervisor runs in BE mode, so load of BE integer will be OK
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in BE mode, it knows that it reads BE device
>>> memory, so it does not need to do anything else it just proceeds
>>>
>>> Use case 8
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in BE format; matches device
>>> endianity
>>> 3) 'data' is BE
>>> 4) 'data' is LE (since guest CPSR E bit is off, BE KVM host kernel
>>> does byteswap: cpu_to_le)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 0x05000000 because
>>> hypervisor runs in BE mode, so load of LE integer will be
>>> byteswapped value in register
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in LE mode, it knows that it reads BE device
>>> memory, it need to byteswap r1 before further processing so it
>>> does 'rev r1, r1' and proceeds with result
>>>
>>> Note that with BE kernel we actually have some initial portion
>>> of assembler code that is executed with CPSR bit off and it reads
>>> LE h/w - i.e it falls into use case 1.
>>>
>>> Summary Table (please use fixed font to see it correctly)
>>> ========================================
>>>
>>> --------------------------------------------------------------
>>> | Use Case # | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   |
>>> --------------------------------------------------------------
>>> | KVM Host,  | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
>>> | Emulator,  |     |     |     |     |     |     |     |     |
>>> | Hypervisor |     |     |     |     |     |     |     |     |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Device     | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Guest      | LE  | BE  | LE  | BE  | BE  | LE  | BE  | LE  |
>>> | Access     |     |     |     |     |     |     |     |     |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Step 1)    | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 2)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 3)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 4)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
>>> --------------------------------------------------------------
>>> | Step 5)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
>>> --------------------------------------------------------------
>>> | Final Reg  | no  | yes | yes | no  | yes | no  | no  | yes |
>>> | value      |     |     |     |     |     |     |     |     |
>>> | byteswapped|     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Guest      | no  | yes | yes | no  | yes | no  | no  | yes |
>>> | Follows    |     |     |     |     |     |     |     |     |
>>> | with rev   |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>>
>>> A few observations
>>> ==================
>>>
>>> x) Note above table is symmetric wrt to BE<->LE change:
>>>       1<-->7
>>>       2<-->8
>>>       3<-->5
>>>       4<-->6
>>>
>>> x) &mmio.data[0] address always holds integer in the same
>>> format as emulated device endianity
>>>
>>> x) During step 4), when the vcpu_data_host_to_guest function
>>> is used, if the guest E bit value is different but everything else
>>> is the same, opposite results are produced (1&2, 3&4, 5&6,
>>> 7&8)
>>>
>>> If you reached this end :), again, thank you very much for
>>> reading it!
>>>
>>> - Victor
>>> _______________________________________________
>>> kvmarm mailing list
>>> kvmarm@lists.cs.columbia.edu
>>> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
>>
>> Hi Victor,
>>
>> First of all I really appreciate the thorough description with
>> all the use-cases.
>>
>> Below would be a summary of what I understood from your
>> analysis:
>>
>> 1. Any MMIO device marked as NATIVE ENDIAN in user
>
> "Native endian" really is just a shortcut for "target endian"
> which is LE for ARM and BE for PPC. There shouldn't be
> a qemu-system-armeb or qemu-system-ppc64le.

I disagree. A fully functional ARM BE system is what we've
been working on for the last few months. 'We' is the Linaro
Networking Group's Endian subteam and some other guys
in ARM and across the community. Why we are doing that is a bit
beyond this discussion.

ARM BE patches for both V7 and V8 are already in the mainline
kernel. But the ARM BE KVM host is broken now. It is a known
deficiency that I am trying to fix. Please look at [1]. Patches
for V7 BE KVM were proposed and are currently under active
discussion. Currently I am working on the ARM V8 BE KVM changes.

So "native endian" in ARM is value of CPSR register E bit.
If it is off native endian is LE, if it is on it is BE.

If and when we agree on the ARM BE KVM host changes, the
next step would be patches in qemu, one of which introduces
qemu-system-armeb. Please see [2].

> QEMU emulates everything that comes after the CPU, so
> imagine the ioctl struct as a bus package. Your bus
> doesn't care what endianness the CPU is in - it just
> gets data from the CPU.

I am not sure that I follow the above. Suppose I have

mov r1, #1
str r1, [r0]

where r0 is the device address. Now, depending on the CPSR
E bit value, the device address will receive the integer 1 either
in LE byte order or in BE byte order. That is how an ARM v7 CPU
works, regardless of whether it is emulated or not.

So if E bit is off (LE case) after str is executed
 byte at r0 address will get 1
 byte at r0 + 1 address will get 0
 byte at r0 + 2 address will get 0
 byte at r0 + 3 address will get 0

If E bit is on (BE case) after str is executed
 byte at r0 address will get 0
 byte at r0 + 1 address will get 0
 byte at r0 + 2 address will get 0
 byte at r0 + 3 address will get 1

My point is that mmio.data[] just carries bytes for phys_addr:
mmio.data[0] would be the value of the byte at phys_addr,
mmio.data[1] would be the value of the byte at phys_addr + 1, and
so on.
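
To make that concrete, here is a minimal sketch of the byte-copy
semantics - illustration only, not actual KVM code; the struct is
trimmed to the KVM_EXIT_MMIO fields and the helper name is made up:

  #include <stdint.h>
  #include <string.h>

  /* Trimmed shape of the KVM_EXIT_MMIO payload. */
  struct mmio_exit {
          uint64_t phys_addr;
          uint8_t  data[8];
          uint32_t len;
          uint8_t  is_write;
  };

  /* Fill a write exit under byte-copy semantics. 'bus_bytes' are the
   * bytes the CPU put on the bus, already ordered by the guest
   * CPSR E bit; for 'mov r1, #1; str r1, [r0]' that is
   *   E bit off: {0x01, 0x00, 0x00, 0x00}
   *   E bit on:  {0x00, 0x00, 0x00, 0x01}
   * No endianity is assigned here - it is a plain byte copy. */
  static void fill_mmio_write(struct mmio_exit *mmio, uint64_t phys_addr,
                              const uint8_t *bus_bytes, uint32_t len)
  {
          mmio->phys_addr = phys_addr;
          mmio->len = len;
          mmio->is_write = 1;
          memcpy(mmio->data, bus_bytes, len);
  }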

> A bus write on the CPU however honors the endianness
> setting of the CPU. So when we convert from a value in
> register to a value on the bus we need to take this endian
> configuration into account.

For a read it is the same: mmio.data[0] just carries the memory
byte for the emulated phys_addr. It is the same as the write case.

But if one wanted to look at the endianity of the integer
at the &mmio.data[0] address, its endianity would really be
defined by the endianity of the emulated memory-mapped h/w
device.

Not sure, maybe I am missing your point.

Also please consider that the endianity of device memory could be BE or
LE, and it does not depend on "native endianity"; it could
exist in any combination and it would work, because in
all the proper places an explicit byteswap would be executed
by the code that works with device memory of the opposite
endianity. Admittedly for ARM the dominant case now is
LE devices, but nothing prevents us from attaching memory-
mapped devices that work in BE mode. For example
my parent company, Cisco, whose Linaro assignee I am,
has a lot of fabric chips that operate in BE, and once attached
to the system they would be treated properly - read in BE
mode without a byteswap and read with a byteswap in LE mode.
Note the last point is an oversimplified picture.
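
Just to illustrate that with the usual Linux driver pattern (a
sketch, assuming a 32-bit register; ioread32be() is the standard
kernel accessor for BE device registers):

  #include <linux/io.h>
  #include <linux/types.h>

  /* Read a 32-bit register of a big-endian memory-mapped device.
   * ioread32be() returns the value in CPU byte order: on an LE
   * kernel it byteswaps, on a BE kernel it is a plain load - the
   * "byteswap in all proper places" mentioned above. */
  static u32 fabric_reg_read(void __iomem *base, unsigned long offset)
  {
          return ioread32be(base + offset);
  }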

Thanks,
Victor

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2013-December/220973.html

[2] https://git.linaro.org/people/victor.kamensky/qemu-be.git/shortlog/refs/heads/armv7be

> That's exactly what we are talking about here. KVM
> should do the cpu configured register->bus endian
> mapping while QEMU does the bus->device endian map.
>
> Alex
>
>> space tool (QEMU or KVMTOOL) is bad for a cross-endian
>> Guest. For supporting cross-endian Guests we need to have
>> all MMIO devices with fixed ENDIANNESS.
>>
>> 2. We don't need to do any endianness conversions in KVM
>> for MMIO writes that are being forwarded to user space. It is
>> the job of user space (QEMU or KVMTOOL) to interpret the
>> endianness of MMIO write data based on device endianness.
>>
>> 3. The MMIO read operation is the one which will need
>> explicit handling in KVM because the target VCPU register
>> of MMIO read operation should be loaded with MMIO data
>> (returned from user space) based upon current VCPU
>> endianness (i.e. VCPU CPSR.E bit).
>>
>> 4. In-kernel emulated devices (such as VGIC) will not
>> require any explicit endianness conversion of MMIO data for
>> MMIO write operations (same as point 2).
>>
>> 5. In-kernel emulated devices (such as VGIC) will have to do
>> explicit endianness conversion of MMIO data for MMIO read
>> operations based on device endianness (same as point 3).
>>
>> I hope the above summary of my understanding is as per your
>> description. If so, then I am in support of your description.
>>
>> I think your description (and above 5 points) takes care of
>> all use cases of cross-endianness without changing current
>> MMIO ABI.
>>
>> Regards,
>> Anup
>>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-ppc] KVM and variable-endianness guest CPUs
  2014-01-22  6:41             ` [Qemu-devel] " Alexander Graf
@ 2014-01-22  8:57               ` Anup Patel
  -1 siblings, 0 replies; 102+ messages in thread
From: Anup Patel @ 2014-01-22  8:57 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Victor Kamensky, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm, Christoffer Dall

Hi Alex,

On Wed, Jan 22, 2014 at 12:11 PM, Alexander Graf <agraf@suse.de> wrote:
>
>
>> Am 22.01.2014 um 07:31 schrieb Anup Patel <anup@brainfault.org>:
>>
>> On Wed, Jan 22, 2014 at 11:09 AM, Victor Kamensky
>> <victor.kamensky@linaro.org> wrote:
>>> Hi Guys,
>>>
>>> Christoffer and I had a bit of a heated chat :) on this
>>> subject last night. Christoffer, really appreciate
>>> your time! We did not really reach agreement
>>> during the chat and Christoffer asked me to follow
>>> up on this thread.
>>> Here it goes. Sorry, it is very long email.
>>>
>>> I don't believe we can assign any endianity to
>>> mmio.data[] byte array. I believe mmio.data[] and
>>> mmio.len acts just memcpy and that is all. As
>>> memcpy does not imply any endianity of underlying
>>> data mmio.data[] should not either.
>>>
>>> Here is my definition:
>>>
>>> mmio.data[] is array of bytes that contains memory
>>> bytes in such form, for read case, that if those
>>> bytes are placed in guest memory and guest executes
>>> the same read access instruction with address to this
>>> memory, result would be the same as real h/w device
>>> memory access. Rest of KVM host and hypervisor
>>> part of code should really take care of mmio.data[]
>>> memory so it will be delivered to vcpu registers and
>>> restored by hypervisor part in such way that guest CPU
>>> register value is the same as it would be for real
>>> non-emulated h/w read access (that is emulation part).
>>> The same goes for write access, if guest writes into
>>> memory and those bytes are just copied to emulated
>>> h/w register it would have the same effect as real
>>> mapped h/w register write.
>>>
>>> In shorter form, i.e for len=4 access: endianity of integer
>>> at &mmio.data[0] address should match endianity
>>> of emulated h/w device behind phys_addr address,
>>> regardless what is endianity of emulator, KVM host,
>>> hypervisor, and guest
>>>
>>> Examples that illustrate my definition
>>> --------------------------------------
>>>
>>> 1) LE guest (E bit is off in ARM speak) reads integer
>>> (4 bytes) from mapped h/w LE device register -
>>> mmio.data[3] contains MSB, mmio.data[0] contains LSB.
>>>
>>> 2) BE guest (E bit is on in ARM speak) reads integer
>>> from mapped h/w LE device register - mmio.data[3]
>>> contains MSB, mmio.data[0] contains LSB. Note that
>>> if &mmio.data[0] memory would be placed in guest
>>> address space and instruction restarted with new
>>> address, then it would meet BE guest expectations
>>> - the guest knows that it reads LE h/w so it will byteswap
>>> register before processing it further. This is BE guest ARM
>>> case (regardless of what KVM host endianity is).
>>>
>>> 3) BE guest reads integer from mapped h/w BE device
>>> register - mmio.data[0] contains MSB, mmio.data[3]
>>> contains LSB. Note that if &mmio.data[0] memory would
>>> be placed in guest address space and instruction
>>> restarted with new address, then it would meet BE
>>> guest expectation - the guest knows that it reads
>>> BE h/w so it will proceed further without any other
>>> work. I guess, it is BE ppc case.
>>>
>>>
>>> Arguments in favor of memcpy semantics of mmio.data[]
>>> ------------------------------------------------------
>>>
>>> x) What are the possible values of 'len'? Previous discussions
>>> imply that it is always a power of 2. Why is that? Maybe
>>> there will be a CPU that would need to do a 5-byte mmio
>>> access, or 6 bytes. How do you assign endianity to
>>> such a case? A 'len' of 5 or 6, or anything, works fine with
>>> memcpy semantics. I admit it is a hypothetical case, but
>>> IMHO it tests how clean the ABI definition is.
>>>
>>> x) A byte array does not have endianity because it
>>> does not have any structure. If one wanted to
>>> imply structure, why is mmio not defined in such a way
>>> that the structure is reflected in the mmio definition?
>>> Something like:
>>>
>>>
>>>                /* KVM_EXIT_MMIO */
>>>                struct {
>>>                          __u64 phys_addr;
>>>                          union {
>>>                               __u8 byte;
>>>                               __u16 hword;
>>>                               __u32 word;
>>>                               __u64 dword;
>>>                          }  data;
>>>                          __u32 len;
>>>                          __u8  is_write;
>>>                } mmio;
>>>
>>> where len really serves as a union discriminator and
>>> the only allowed len values are 1, 2, 4, 8.
>>> In this case, I agree, the endianity of the integer types
>>> should be defined. I believe the use of a byte array strongly
>>> implies that the original intent was to have the semantics of
>>> a byte stream copy, just like memcpy does.
>>>
>>> x) Note there is nothing wrong with a user/kernel ABI that
>>> uses just a byte stream as a parameter. There are already
>>> precedents like the 'read' and 'write' system calls :).
>>>
>>> x) Consider case when KVM works with emulated memory mapped
>>> h/w devices where some devices operate in LE mode and others
>>> operate in BE mode. It is defined by semantics of real h/w
>>> device which is it, and should be emulated by emulator and KVM
>>> given all other context. As far as the mmio.data[] array is concerned, if the
>>> same integer value is read from these devices' registers, mmio.data[]
>>> memory should contain the integer in opposite endianity for the
>>> two cases, i.e. the MSB is data[0] in one case and the MSB is
>>> data[3] in the other case. It cannot be the same, because,
>>> except for the emulator and guest kernel, all the others, like the KVM host
>>> and hypervisor, have no clue what the endianity of the device
>>> actually is - they should treat mmio.data[] in the same way.
>>> But resulting guest target CPU register would need to contain
>>> normal integer value in one case and byteswapped in another,
>>> because guest kernel would use it directly in one case and
>>> byteswap it in another. Byte stream semantics allows to do
>>> that. I don't see how it could happen if you fixate mmio.data[]
>>> endianity in such way that it would contain integer in
>>> the same format for BE and LE emulated device types.
>>>
>>> If by this point you agree that the mmio.data[] user-land/kernel
>>> ABI semantics should be just memcpy, stop reading :). If not,
>>> you might like to take a look at the appendix below, where I
>>> describe in great detail the endianity of data at different
>>> points along the mmio processing code path of the existing ARM LE KVM,
>>> and the proposed ARM BE KVM. Note the appendix is very long and very
>>> detailed, sorry about that, but I feel that earlier, more
>>> digested explanations failed, so it drove me to write out
>>> all the details as I see them. If I am wrong, I hope it would be
>>> easier for folks to point at places in the detailed explanation
>>> where my logic goes bad. Also, I am not sure whether this
>>> mail thread is a good place to discuss all the details described
>>> in the appendix. Christoffer, please advise whether I should take
>>> that one back to [1]. But I hope this bigger picture may help to
>>> see the mmio.data[] semantics issue in context.
>>>
>>> More inline and appendix is at the end.
>>>
>>>> On 20 January 2014 11:19, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>>>>> On Mon, Jan 20, 2014 at 03:22:11PM +0100, Alexander Graf wrote:
>>>>>
>>>>>> On 17.01.2014, at 19:52, Peter Maydell <peter.maydell@linaro.org> wrote:
>>>>>>
>>>>>>> On 17 January 2014 17:53, Peter Maydell <peter.maydell@linaro.org> wrote:
>>>>>>> Specifically, the KVM API says "here's a uint8_t[] byte
>>>>>>> array and a length", and the current QEMU code treats that
>>>>>>> as "this is a byte array written as if the guest CPU
>>>>>>> (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
>>>>>>> I/O access to this buffer rather than to the device".
>>>>>>>
>>>>>>> The KVM API docs don't actually specify the endianness
>>>>>>> semantics of the byte array, but I think that that really
>>>>>>> needs to be nailed down. I can think of a couple of options:
>>>>>>> * always LE
>>>>>>> * always BE
>>>>>>>  [these first two are non-starters because they would
>>>>>>>  break either x86 or PPC existing code]
>>>>>>> * always the endianness the guest is at the time
>>>>>>> * always some arbitrary endianness based purely on the
>>>>>>>  endianness the KVM implementation used historically
>>>>>>> * always the endianness of the host QEMU binary
>>>>>>> * something else?
>>>>>>>
>>>>>>> Any preferences? Current QEMU code basically assumes
>>>>>>> "always the endianness of TARGET_WORDS_BIGENDIAN",
>>>>>>> which is pretty random.
>>>>>>
>>>>>> Having thought a little more about this, my opinion is:
>>>>>>
>>>>>> * we should specify that the byte order of the mmio.data
>>>>>>  array is host kernel endianness (ie same endianness
>>>>>>  as the QEMU process itself) [this is what it actually
>>>>>>  is, I think, for all the cases that work today]
>>>
>>> In the above please consider two types of mapped emulated
>>> h/w devices: BE and LE; they cannot have mmio.data in the
>>> same endianity. Currently in all observable cases LE ARM
>>> and BE PPC device endianity matches kernel/qemu
>>> endianity, but it would break when BE ARM is introduced,
>>> or LE PPC, or one would start emulating BE devices on LE
>>> ARM.
>>>
>>>>>> * we should fix the code path in QEMU for handling
>>>>>>  mmio.data which currently has the implicit assumption
>>>>>>  that when using KVM TARGET_WORDS_BIGENDIAN is the same
>>>>>>  as the QEMU host process endianness (because it's using
>>>>>>  load/store functions which swap if TARGET_WORDS_BIGENDIAN
>>>>>>  is different from HOST_WORDS_BIGENDIAN)
>>>
>>> I do not follow the above. Maybe I am missing the bigger context.
>>> What is the CPU under discussion above? On an ARM V7 system,
>>> when an LE device is accessed as an integer, the &mmio.data[0] address
>>> would contain an integer in LE format, i.e. mmio.data[0] is the LSB.
>>>
>>> Here is a gdb session of LE qemu running on a V7 LE kernel and
>>> a TC1 LE guest. The guest kernel accesses the sys_cfgstat register, which is
>>> an arm_sysctl register at offset 0xa8. Note: arm_sysctl is a memory-
>>> mapped LE device.
>>> Please check the run->mmio structure after the read
>>> (cpu_physical_memory_rw) completes: it is a 4-byte integer in
>>> LE format, mmio.data[0] is the LSB and is equal to 1
>>> (the s->sys_cfgstat value):
>>>
>>> (gdb) bt
>>> #0  arm_sysctl_read (opaque=0x95a600, offset=168, size=4) at
>>> /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
>>> #1  0x0023b9b4 in memory_region_read_accessor (mr=0x95b8e0,
>>> addr=<optimized out>, value=0xb5c0dc18, size=4, shift=0,
>>> mask=4294967295)
>>>    at /home/root/20131219/qemu-be/memory.c:407
>>> #2  0x0023aba4 in access_with_adjusted_size (addr=4294967295,
>>> value=0xb5c0dc18, value@entry=0xb5c0dc10, size=size@entry=4,
>>> access_size_min=1,
>>>    access_size_max=2357596, access=access@entry=0x23b96c
>>> <memory_region_read_accessor>, mr=mr@entry=0x95b8e0) at
>>> /home/root/20131219/qemu-be/memory.c:477
>>> #3  0x0023f95c in memory_region_dispatch_read1 (size=4, addr=168,
>>> mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:944
>>> #4  memory_region_dispatch_read (size=4, pval=0xb5c0dc68, addr=168,
>>> mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:966
>>> #5  io_mem_read (mr=mr@entry=0x95b8e0, addr=<optimized out>,
>>> pval=pval@entry=0xb5c0dc68, size=size@entry=4) at
>>> /home/root/20131219/qemu-be/memory.c:1743
>>> #6  0x001abd38 in address_space_rw (as=as@entry=0x8102d8
>>> <address_space_memory>, addr=469827752, buf=buf@entry=0xb6fd6028 "",
>>> len=4, is_write=false,
>>>    is_write@entry=true) at /home/root/20131219/qemu-be/exec.c:2025
>>> #7  0x001abf90 in cpu_physical_memory_rw (addr=<optimized out>,
>>> buf=buf@entry=0xb6fd6028 "", len=<optimized out>, is_write=0)
>>>    at /home/root/20131219/qemu-be/exec.c:2070
>>> #8  0x00239e00 in kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1701
>>> #9  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
>>> /home/root/20131219/qemu-be/cpus.c:874
>>> #10 0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
>>> #11 0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> #12 0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
>>> (gdb) p /x s->sys_cfgstat
>>> $25 = 0x1
>>> (gdb) finish
>>> Run till exit from #0  arm_sysctl_read (opaque=0x95a600, offset=168,
>>> size=4) at /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
>>> memory_region_read_accessor (mr=0x95b8e0, addr=<optimized out>,
>>> value=0xb5c0dc18, size=4, shift=0, mask=4294967295) at
>>> /home/root/20131219/qemu-be/memory.c:408
>>> 408        trace_memory_region_ops_read(mr, addr, tmp, size);
>>> Value returned is $26 = 1
>>> (gdb) enable 2
>>> (gdb) cont
>>> Continuing.
>>>
>>> Breakpoint 2, kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1660
>>> 1660            kvm_arch_pre_run(cpu, run);
>>> (gdb) bt
>>> #0  kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1660
>>> #1  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
>>> /home/root/20131219/qemu-be/cpus.c:874
>>> #2  0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
>>> #3  0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> #4  0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
>>> (gdb) p /x run->mmio
>>> $27 = {phys_addr = 0x1c0100a8, data = {0x1, 0x0, 0x0, 0x0, 0x0, 0x0,
>>> 0x0, 0x0}, len = 0x4, is_write = 0x0}
>>>
>>> Also please look at the adjust_endianness function and
>>> struct MemoryRegion's 'endianness' field. IMHO in qemu it
>>> works quite nicely already. MemoryRegion 'read' and 'write'
>>> callbacks return/get data in native format; the adjust_endianness
>>> function checks whether the emulated device endianness matches
>>> the emulator endianness and, if it is different, does a byteswap
>>> according to size. As in the above example, the arm_sysctl_ops memory
>>> region should be marked as DEVICE_LITTLE_ENDIAN; when it
>>> returns the s->sys_cfgstat value, LE qemu sees that the endianity
>>> matches and does not byteswap the result, so the integer at
>>> the &mmio.data[0] address is in LE form. When qemu would
>>> run in BE mode on a BE kernel, it would see that the endianity
>>> mismatches and would byteswap the s->sys_cfgstat native value
>>> (BE), so mmio.data would contain the integer in LE format again.
>>>
>>> Note that in the currently committed code the arm_sysctl_ops endianity
>>> is DEVICE_NATIVE_ENDIAN, which is wrong - the real vexpress
>>> arm_sysctl device always gives/receives data in LE format regardless
>>> of the current CPSR E bit value, so it cannot be marked as NATIVE.
>>> LE and BE kernels always read it as an LE device; the BE kernel follows
>>> with a byteswap. It was OK while we only ran qemu in LE, but it
>>> should be fixed to LITTLE_ENDIAN for BE qemu to work correctly
>>> ... and actually changing the endianity of that device and a few other
>>> ARM-specific devices to LITTLE_ENDIAN was the only change in qemu
>>> needed to make BE KVM work.
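
(For reference, the shape of the change being described - a
sketch, not a verbatim patch; MemoryRegionOps and
DEVICE_LITTLE_ENDIAN are existing qemu definitions, and only the
.endianness line differs from the committed code:

  static const MemoryRegionOps arm_sysctl_ops = {
          .read = arm_sysctl_read,
          .write = arm_sysctl_write,
          /* LE by nature, so not DEVICE_NATIVE_ENDIAN */
          .endianness = DEVICE_LITTLE_ENDIAN,
  };
)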
>>>
>>>>>
>>>>> Yes, I fully agree :).
>>>> Great, I'll prepare a patch for the KVM API documentation.
>>>>
>>>> -Christoffer
>>>> _______________________________________________
>>>> kvmarm mailing list
>>>> kvmarm@lists.cs.columbia.edu
>>>> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
>>>
>>> Thanks,
>>> Victor
>>>
>>> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-January/thread.html#223186
>>>
>>>
>>>    Appendix
>>>    Data path endianity in ARM KVM mmio
>>>    ===================================
>>>
>>> This writeup considers several scenarios and tracks the endianity
>>> of data as it travels from the emulator to the guest CPU register, in
>>> the case of ARM KVM. It starts with the currently committed code for the LE
>>> KVM host case and then discusses the proposed BE KVM host
>>> arrangement.
>>>
>>> Just to restrict the discussion, the writeup considers the code path of
>>> an integer (4 bytes) read from h/w mapped emulated device memory.
>>> The writeup considers the endianity of the essential places involved in such
>>> a code path.
>>>
>>> For all cases when endianity is defined, it is assumed that
>>> the values under consideration are in memory (as opposed to being in
>>> a register, which does not have endianity). I.e. even if a function
>>> variable could actually be allocated in a CPU register, the writeup
>>> will refer to it as if it is in memory, just to keep the
>>> discussion clean, except for the final guest CPU register.
>>>
>>> Let's consider the following places along data path from
>>> emulator to guest CPU register:
>>>
>>> 1) emulator code that holds the integer value to be read; assume
>>> it would be a global 'int emulated_hw_device_val' variable.
>>> Normally in the emulator it is held in native endian format - i.e.
>>> its CPSR E bit is the same as the kernel CPSR E bit. Just for
>>> discussion's sake assume that this h/w device register
>>> holds 5 as its value.
>>>
>>> 2) the KVM_EXIT_MMIO part of the 'struct kvm_run' structure, i.e.
>>> the mmio.data byte array. A byte array does not have endianity,
>>> but for this discussion we track the endianity of the integer
>>> at the &mmio.data[0] address
>>>
>>> 3) the 'data' variable, of type 'unsigned long', in the
>>> kvm_handle_mmio_return function before the vcpu_data_host_to_guest
>>> call. The KVM host mmio_read_buf function is used to fill this
>>> variable from the mmio.data buffer. mmio_read_buf actually
>>> acts as a memcpy from the mmio.data buffer address,
>>> just taking the access size into account.
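
(A rough shape of that helper as described - a sketch, not the
verbatim kernel code:

  #include <stdint.h>
  #include <string.h>

  /* Size-aware memcpy out of the byte array; no byteswap here. */
  static unsigned long mmio_read_buf(const uint8_t *buf, unsigned int len)
  {
          uint16_t h; uint32_t w; uint64_t d;

          switch (len) {
          case 1: return buf[0];
          case 2: memcpy(&h, buf, 2); return h;
          case 4: memcpy(&w, buf, 4); return w;
          case 8: memcpy(&d, buf, 8); return d;
          }
          return 0;
  }
)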
>>>
>>> 4) the same 'data' variable as above, but after
>>> vcpu_data_host_to_guest function call, just before it is copied
>>> to vcpu_reg target register location. Note
>>> vcpu_data_host_to_guest function may byteswap value of 'data'
>>> depending on current KVM host endianity and value of
>>> guest CPSR E bit.
>>>
>>> 5) guest CPU spilled register array, location of target register
>>> i.e integer at vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt) address
>>>
>>> 6) finally, the guest CPU register filled from vcpu_reg just before
>>> the guest resumes execution of the trapped emulated instruction. Note
>>> it is done by the hypervisor part of the code, and the hypervisor EE bit is
>>> the same as the KVM host CPSR E bit.
>>>
>>> Note again, the KVM host, emulator, and hypervisor part of the code (guest
>>> CPU registers save and restore code) always run in the same
>>> endianity. The endianity of the accessed emulated devices and the endianity
>>> of the guest vary independently of the KVM host endianity.
>>>
>>> The sections below consider all permutations of the possible cases;
>>> it may be quite boring to read. I've created a summary table at
>>> the end; you can jump to the table after reading a few cases.
>>> But if you have objections and you see things happening differently,
>>> please comment inline on the use case steps.
>>>
>>> LE KVM host
>>> ===========
>>>
>>> Use case 1
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in LE format, matches device
>>> endianity
>>> 3) 'data' is LE
>>> 4) 'data' is LE (since guest CPSR E bit is off no byteswap)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 5 (0x00000005)
>>>
>>> guest resumes execution ... Let's say after the 'ldr r1, [r0]'
>>> instruction, where r0 holds the address of the device; it knows
>>> that it reads LE-mapped h/w so no additional processing is
>>> needed
>>>
>>> Use case 2
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in LE format; matches device
>>> endianity
>>> 3) 'data' is LE
>>> 4) 'data' is BE (since guest CPSR E bit is on, vcpu_data_host_to_guest
>>> will do byteswap: cpu_to_be)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 0x05000000
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in BE mode (E bit on), it knows that it reads
>>> LE device memory, it needs to byteswap r1 before further
>>> processing so it does 'rev r1, r1' and proceeds with the result
>>>
>>> Use case 3
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in BE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native,
>>> and it should match device endianity
>>> 3) 'data' is BE
>>> 4) 'data' is BE (since guest CPSR E bit is off no byteswap)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 0x05000000
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in LE mode (E bit off), it knows that it
>>> reads BE device memory, it needs to byteswap r1 before further
>>> processing so it does 'rev r1, r1' and proceeds with result
>>>
>>> Use case 4
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in BE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native,
>>> and should match device endianity
>>> 3) 'data' is BE
>>> 4) 'data' is LE (since guest CPSR E bit is on, vcpu_data_host_to_guest
>>> will do byteswap: cpu_to_be)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 5 (0x00000005)
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in BE mode, it knows that it reads BE device
>>> memory, so it does not need to do anything before further
>>> processing.
>>>
>>>
>>> The above use cases are exactly what we have now after Marc's
>>> commit to support BE guests on an LE KVM host. Further use
>>> cases describe how it would work with the BE KVM patches I proposed.
>>> It is understood that this is subject to further discussion.
>>>
>>>
>>> BE KVM host
>>> ===========
>>>
>>> Use case 5
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in LE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native;
>>> matches device endianity
>>> 3) 'data' is LE
>>> 4) 'data' is LE (since guest CPSR E bit is on, BE KVM host kernel
>>> does *not* do byteswap: cpu_to_be no effect in BE host kernel)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 0x05000000 because
>>> hypervisor runs in BE mode, so load of LE integer will be
>>> byteswapped value in register
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in BE mode, it knows that it reads LE device
>>> memory, it needs to byteswap r1 before further processing, so it
>>> does 'rev r1, r1' and proceeds with result
>>>
>>> Use case 6
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in LE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native;
>>> matches device endianity
>>> 3) 'data' is LE
>>> 4) 'data' is BE (since guest CPSR E bit is off, BE KVM host kernel
>>> does byteswap: cpu_to_le)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 5 (0x00000005) because
>>> hypervisor runs in BE mode, so load of BE integer will be OK
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in LE mode, it knows that it reads LE device
>>> memory, so it does not need to do anything else; it just proceeds
>>>
>>> Use case 7
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in BE format; matches device
>>> endianity
>>> 3) 'data' is BE
>>> 4) 'data' is BE (since guest CPSR E bit is on, BE KVM host kernel
>>> does *not* do byteswap: cpu_to_be no effect in BE host kernel)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 5 (0x00000005) because
>>> hypervisor runs in BE mode, so load of BE integer will be OK
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in BE mode, it knows that it reads BE device
>>> memory, so it does not need to do anything else; it just proceeds
>>>
>>> Use case 8
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in BE format; matches device
>>> endianity
>>> 3) 'data' is BE
>>> 4) 'data' is LE (since guest CPSR E bit is off, BE KVM host kernel
>>> does byteswap: cpu_to_le)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 0x05000000 because
>>> hypervisor runs in BE mode, so load of LE integer will be
>>> byteswapped value in register
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in LE mode, it knows that it reads BE device
>>> memory, it needs to byteswap r1 before further processing, so it
>>> does 'rev r1, r1' and proceeds with result
>>>
>>> Note that with a BE kernel we actually have some initial portion
>>> of assembler code that is executed with the CPSR E bit off and reads
>>> LE h/w - i.e. it falls into use case 1.
>>>
>>> Summary Table (please use fixed font to see it correctly)
>>> ========================================
>>>
>>> --------------------------------------------------------------
>>> | Use Case # | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   |
>>> --------------------------------------------------------------
>>> | KVM Host,  | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
>>> | Emulator,  |     |     |     |     |     |     |     |     |
>>> | Hypervisor |     |     |     |     |     |     |     |     |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Device     | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Guest      | LE  | BE  | LE  | BE  | BE  | LE  | BE  | LE  |
>>> | Access     |     |     |     |     |     |     |     |     |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Step 1)    | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 2)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 3)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 4)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
>>> --------------------------------------------------------------
>>> | Step 5)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
>>> --------------------------------------------------------------
>>> | Final Reg  | no  | yes | yes | no  | yes | no  | no  | yes |
>>> | value      |     |     |     |     |     |     |     |     |
>>> | byteswapped|     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Guest      | no  | yes | yes | no  | yes | no  | no  | yes |
>>> | Follows    |     |     |     |     |     |     |     |     |
>>> | with rev   |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>>
>>> A few observations
>>> ==================
>>>
>>> x) Note above table is symmetric wrt to BE<->LE change:
>>>       1<-->7
>>>       2<-->8
>>>       3<-->5
>>>       4<-->6
>>>
>>> x) &mmio.data[0] address always holds integer in the same
>>> format as emulated device endianity
>>>
>>> x) During step 4), when the vcpu_data_host_to_guest function
>>> is used, if the guest E bit value is different but everything else
>>> is the same, opposite results are produced (1&2, 3&4, 5&6,
>>> 7&8)
>>>
>>> If you reached this end :), again, thank you very much for
>>> reading it!
>>>
>>> - Victor
>>> _______________________________________________
>>> kvmarm mailing list
>>> kvmarm@lists.cs.columbia.edu
>>> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
>>
>> Hi Victor,
>>
>> First of all I really appreciate the thorough description with
>> all the use-cases.
>>
>> Below would be a summary of what I understood from your
>> analysis:
>>
>> 1. Any MMIO device marked as NATIVE ENDIAN in user
>
> "Native endian" really is just a shortcut for "target endian" which is LE for ARM and BE for PPC. There shouldn't be a qemu-system-armeb or qemu-system-ppc64le.
>
> QEMU emulates everything that comes after the CPU, so imagine the ioctl struct as a bus package. Your bus doesn't care what endianness the CPU is in - it just gets data from the CPU.
>
> A bus write on the CPU however honors the endianness setting of the CPU. So when we convert from a value in register to a value on the bus we need to take this endian configuration into account.
>
> That's exactly what we are talking about here. KVM should do the cpu configured register->bus endian mapping while QEMU does the bus->device endian map.

Thanks for the info on QEMU side handling of MMIO data.

I was not aware that we would only have "target endian = LE"
for ARM/ARM64 in QEMU. I think Marc Z had mentioned a similar
thing about MMIO in our previous discussions on his patches.
(Please refer to http://www.spinics.net/lists/arm-kernel/msg283313.html)

This clearly means MMIO data passed to user space (QEMU) has
to be of host endianness so that QEMU can take care of the bus->device
endian map.

The current vcpu_data_guest_to_host() and vcpu_data_host_to_guest()
do not perform endianness conversion of MMIO data to LE when
we are running an LE guest on a BE host, so we do need Victor's patch
fixing vcpu_data_guest_to_host() and vcpu_data_host_to_guest().
(Already reported a long time back by me,
http://www.spinics.net/lists/arm-kernel/msg283308.html)
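
To illustrate what the fixed helper needs to do - a rough sketch
of the idea only, with names as in Victor's description, a 32-bit
access assumed, and kvm_vcpu_is_be() standing for "guest CPSR E
bit is set":

  /* Swap relative to the *guest* access endianness, so it is a
   * no-op when guest and host endianness already agree:
   *  - BE guest: cpu_to_be32() swaps on an LE host only
   *  - LE guest: cpu_to_le32() swaps on a BE host only */
  static unsigned long vcpu_data_host_to_guest(struct kvm_vcpu *vcpu,
                                               unsigned long data,
                                               unsigned int len)
  {
          if (len == 4)
                  return kvm_vcpu_is_be(vcpu) ? cpu_to_be32(data)
                                              : cpu_to_le32(data);
          return data;    /* len 1 needs no swap; len 2 is analogous */
  }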

Regards,
Anup

>
>
> Alex
>
>> space tool (QEMU or KVMTOOL) is bad for a cross-endian
>> Guest. For supporting cross-endian Guests we need to have
>> all MMIO devices with fixed ENDIANNESS.
>>
>> 2. We don't need to do any endianness conversions in KVM
>> for MMIO writes that are being forwarded to user space. It is
>> the job of user space (QEMU or KVMTOOL) to interpret the
>> endianness of MMIO write data based on device endianness.
>>
>> 3. The MMIO read operation is the one which will need
>> explicit handling in KVM because the target VCPU register
>> of MMIO read operation should be loaded with MMIO data
>> (returned from user space) based upon current VCPU
>> endianness (i.e. VCPU CPSR.E bit).
>>
>> 4. In-kernel emulated devices (such as VGIC) will not
>> require any explicit endianness conversion of MMIO data for
>> MMIO write operations (same as point 2).
>>
>> 5. In-kernel emulated devices (such as VGIC) will have to do
>> explicit endianness conversion of MMIO data for MMIO read
>> operations based on device endianness (same as point 3).
>>
>> I hope the above summary of my understanding is as per your
>> description. If so, then I am in support of your description.
>>
>> I think your description (and above 5 points) takes care of
>> all use cases of cross-endianness without changing current
>> MMIO ABI.
>>
>> Regards,
>> Anup
>>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
@ 2014-01-22  8:57               ` Anup Patel
  0 siblings, 0 replies; 102+ messages in thread
From: Anup Patel @ 2014-01-22  8:57 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Thomas Falcon, kvm-devel, Victor Kamensky, QEMU Developers,
	qemu-ppc, kvmarm, Christoffer Dall

Hi Alex,

On Wed, Jan 22, 2014 at 12:11 PM, Alexander Graf <agraf@suse.de> wrote:
>
>
>> Am 22.01.2014 um 07:31 schrieb Anup Patel <anup@brainfault.org>:
>>
>> On Wed, Jan 22, 2014 at 11:09 AM, Victor Kamensky
>> <victor.kamensky@linaro.org> wrote:
>>> Hi Guys,
>>>
>>> Christoffer and I had a bit of a heated chat :) on this
>>> subject last night. Christoffer, really appreciate
>>> your time! We did not really reach agreement
>>> during the chat and Christoffer asked me to follow
>>> up on this thread.
>>> Here it goes. Sorry, it is very long email.
>>>
>>> I don't believe we can assign any endianity to the
>>> mmio.data[] byte array. I believe mmio.data[] and
>>> mmio.len act just like memcpy, and that is all. As
>>> memcpy does not imply any endianity of the underlying
>>> data, mmio.data[] should not either.
>>>
>>> Here is my definition:
>>>
>>> mmio.data[] is array of bytes that contains memory
>>> bytes in such form, for read case, that if those
>>> bytes are placed in guest memory and guest executes
>>> the same read access instruction with address to this
>>> memory, result would be the same as real h/w device
>>> memory access. Rest of KVM host and hypervisor
>>> part of code should really take care of mmio.data[]
>>> memory so it will be delivered to vcpu registers and
>>> restored by hypervisor part in such way that guest CPU
>>> register value is the same as it would be for real
>>> non-emulated h/w read access (that is emulation part).
>>> The same goes for write access, if guest writes into
>>> memory and those bytes are just copied to emulated
>>> h/w register it would have the same effect as real
>>> mapped h/w register write.
>>>
>>> In shorter form, i.e for len=4 access: endianity of integer
>>> at &mmio.data[0] address should match endianity
>>> of emulated h/w device behind phys_addr address,
>>> regardless what is endianity of emulator, KVM host,
>>> hypervisor, and guest
>>>
>>> Examples that illustrate my definition
>>> --------------------------------------
>>>
>>> 1) LE guest (E bit is off in ARM speak) reads integer
>>> (4 bytes) from mapped h/w LE device register -
>>> mmio.data[3] contains MSB, mmio.data[0] contains LSB.
>>>
>>> 2) BE guest (E bit is on in ARM speak) reads integer
>>> from mapped h/w LE device register - mmio.data[3]
>>> contains MSB, mmio.data[0] contains LSB. Note that
>>> if &mmio.data[0] memory would be placed in guest
>>> address space and instruction restarted with new
>>> address, then it would meet BE guest expectations
>>> - the guest knows that it reads LE h/w so it will byteswap
>>> register before processing it further. This is BE guest ARM
>>> case (regardless of what KVM host endianity is).
>>>
>>> 3) BE guest reads integer from mapped h/w BE device
>>> register - mmio.data[0] contains MSB, mmio.data[3]
>>> contains LSB. Note that if &mmio.data[0] memory would
>>> be placed in guest address space and instruction
>>> restarted with new address, then it would meet BE
>>> guest expectation - the guest knows that it reads
>>> BE h/w so it will proceed further without any other
>>> work. I guess, it is BE ppc case.
>>>
>>>
>>> Arguments in favor of memcpy semantics of mmio.data[]
>>> ------------------------------------------------------
>>>
>>> x) What are the possible values of 'len'? Previous discussions
>>> imply that it is always a power of 2. Why is that? Maybe
>>> there will be a CPU that would need to do a 5-byte mmio
>>> access, or 6 bytes. How do you assign endianity to
>>> such a case? A 'len' of 5 or 6, or anything, works fine with
>>> memcpy semantics. I admit it is a hypothetical case, but
>>> IMHO it tests how clean the ABI definition is.
>>>
>>> x) A byte array does not have endianity because it
>>> does not have any structure. If one wanted to
>>> imply structure, why is mmio not defined in such a way
>>> that the structure is reflected in the mmio definition?
>>> Something like:
>>>
>>>
>>>                /* KVM_EXIT_MMIO */
>>>                struct {
>>>                          __u64 phys_addr;
>>>                          union {
>>>                               __u8 byte;
>>>                               __u16 hword;
>>>                               __u32 word;
>>>                               __u64 dword;
>>>                          }  data;
>>>                          __u32 len;
>>>                          __u8  is_write;
>>>                } mmio;
>>>
>>> where len really serves as a union discriminator and
>>> the only allowed len values are 1, 2, 4, 8.
>>> In this case, I agree, the endianity of the integer types
>>> should be defined. I believe the use of a byte array strongly
>>> implies that the original intent was to have the semantics of
>>> a byte stream copy, just like memcpy does.
>>>
>>> x) Note there is nothing wrong with a user/kernel ABI that
>>> uses just a byte stream as a parameter. There are already
>>> precedents like the 'read' and 'write' system calls :).
>>>
>>> x) Consider case when KVM works with emulated memory mapped
>>> h/w devices where some devices operate in LE mode and others
>>> operate in BE mode. It is defined by semantics of real h/w
>>> device which is it, and should be emulated by emulator and KVM
>>> given all other context. As far as the mmio.data[] array is concerned, if the
>>> same integer value is read from these devices' registers, mmio.data[]
>>> memory should contain the integer in opposite endianity for the
>>> two cases, i.e. the MSB is data[0] in one case and the MSB is
>>> data[3] in the other case. It cannot be the same, because,
>>> except for the emulator and guest kernel, all the others, like the KVM host
>>> and hypervisor, have no clue what the endianity of the device
>>> actually is - they should treat mmio.data[] in the same way.
>>> But resulting guest target CPU register would need to contain
>>> normal integer value in one case and byteswapped in another,
>>> because guest kernel would use it directly in one case and
>>> byteswap it in another. Byte stream semantics allows to do
>>> that. I don't see how it could happen if you fixate mmio.data[]
>>> endianity in such way that it would contain integer in
>>> the same format for BE and LE emulated device types.
>>>
>>> If by this point you agree that the mmio.data[] user-land/kernel
>>> ABI semantics should be just memcpy, stop reading :). If not,
>>> you might like to take a look at the appendix below, where I
>>> describe in great detail the endianity of data at different
>>> points along the mmio processing code path of the existing ARM LE KVM,
>>> and the proposed ARM BE KVM. Note the appendix is very long and very
>>> detailed, sorry about that, but I feel that earlier, more
>>> digested explanations failed, so it drove me to write out
>>> all the details as I see them. If I am wrong, I hope it would be
>>> easier for folks to point at places in the detailed explanation
>>> where my logic goes bad. Also, I am not sure whether this
>>> mail thread is a good place to discuss all the details described
>>> in the appendix. Christoffer, please advise whether I should take
>>> that one back to [1]. But I hope this bigger picture may help to
>>> see the mmio.data[] semantics issue in context.
>>>
>>> More inline and appendix is at the end.
>>>
>>>> On 20 January 2014 11:19, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>>>>> On Mon, Jan 20, 2014 at 03:22:11PM +0100, Alexander Graf wrote:
>>>>>
>>>>>> On 17.01.2014, at 19:52, Peter Maydell <peter.maydell@linaro.org> wrote:
>>>>>>
>>>>>>> On 17 January 2014 17:53, Peter Maydell <peter.maydell@linaro.org> wrote:
>>>>>>> Specifically, the KVM API says "here's a uint8_t[] byte
>>>>>>> array and a length", and the current QEMU code treats that
>>>>>>> as "this is a byte array written as if the guest CPU
>>>>>>> (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
>>>>>>> I/O access to this buffer rather than to the device".
>>>>>>>
>>>>>>> The KVM API docs don't actually specify the endianness
>>>>>>> semantics of the byte array, but I think that that really
>>>>>>> needs to be nailed down. I can think of a couple of options:
>>>>>>> * always LE
>>>>>>> * always BE
>>>>>>>  [these first two are non-starters because they would
>>>>>>>  break either x86 or PPC existing code]
>>>>>>> * always the endianness the guest is at the time
>>>>>>> * always some arbitrary endianness based purely on the
>>>>>>>  endianness the KVM implementation used historically
>>>>>>> * always the endianness of the host QEMU binary
>>>>>>> * something else?
>>>>>>>
>>>>>>> Any preferences? Current QEMU code basically assumes
>>>>>>> "always the endianness of TARGET_WORDS_BIGENDIAN",
>>>>>>> which is pretty random.
>>>>>>
>>>>>> Having thought a little more about this, my opinion is:
>>>>>>
>>>>>> * we should specify that the byte order of the mmio.data
>>>>>>  array is host kernel endianness (ie same endianness
>>>>>>  as the QEMU process itself) [this is what it actually
>>>>>>  is, I think, for all the cases that work today]
>>>
>>> In the above please consider two types of mapped emulated
>>> h/w devices: BE and LE; they cannot have mmio.data in the
>>> same endianity. Currently in all observable cases LE ARM
>>> and BE PPC device endianity matches kernel/qemu
>>> endianity, but it would break when BE ARM is introduced,
>>> or LE PPC, or one would start emulating BE devices on LE
>>> ARM.
>>>
>>>>>> * we should fix the code path in QEMU for handling
>>>>>>  mmio.data which currently has the implicit assumption
>>>>>>  that when using KVM TARGET_WORDS_BIGENDIAN is the same
>>>>>>  as the QEMU host process endianness (because it's using
>>>>>>  load/store functions which swap if TARGET_WORDS_BIGENDIAN
>>>>>>  is different from HOST_WORDS_BIGENDIAN)
>>>
>>> I do not follow the above. Maybe I am missing the bigger context.
>>> What is the CPU under discussion above? On an ARM V7 system,
>>> when an LE device is accessed as an integer, the &mmio.data[0] address
>>> would contain an integer in LE format, i.e. mmio.data[0] is the LSB.
>>>
>>> Here is a gdb session of LE qemu running on a V7 LE kernel and
>>> a TC1 LE guest. The guest kernel accesses the sys_cfgstat register, which is
>>> an arm_sysctl register at offset 0xa8. Note: arm_sysctl is a memory-
>>> mapped LE device.
>>> Please check the run->mmio structure after the read
>>> (cpu_physical_memory_rw) completes: it is a 4-byte integer in
>>> LE format, mmio.data[0] is the LSB and is equal to 1
>>> (the s->sys_cfgstat value):
>>>
>>> (gdb) bt
>>> #0  arm_sysctl_read (opaque=0x95a600, offset=168, size=4) at
>>> /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
>>> #1  0x0023b9b4 in memory_region_read_accessor (mr=0x95b8e0,
>>> addr=<optimized out>, value=0xb5c0dc18, size=4, shift=0,
>>> mask=4294967295)
>>>    at /home/root/20131219/qemu-be/memory.c:407
>>> #2  0x0023aba4 in access_with_adjusted_size (addr=4294967295,
>>> value=0xb5c0dc18, value@entry=0xb5c0dc10, size=size@entry=4,
>>> access_size_min=1,
>>>    access_size_max=2357596, access=access@entry=0x23b96c
>>> <memory_region_read_accessor>, mr=mr@entry=0x95b8e0) at
>>> /home/root/20131219/qemu-be/memory.c:477
>>> #3  0x0023f95c in memory_region_dispatch_read1 (size=4, addr=168,
>>> mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:944
>>> #4  memory_region_dispatch_read (size=4, pval=0xb5c0dc68, addr=168,
>>> mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:966
>>> #5  io_mem_read (mr=mr@entry=0x95b8e0, addr=<optimized out>,
>>> pval=pval@entry=0xb5c0dc68, size=size@entry=4) at
>>> /home/root/20131219/qemu-be/memory.c:1743
>>> #6  0x001abd38 in address_space_rw (as=as@entry=0x8102d8
>>> <address_space_memory>, addr=469827752, buf=buf@entry=0xb6fd6028 "",
>>> len=4, is_write=false,
>>>    is_write@entry=true) at /home/root/20131219/qemu-be/exec.c:2025
>>> #7  0x001abf90 in cpu_physical_memory_rw (addr=<optimized out>,
>>> buf=buf@entry=0xb6fd6028 "", len=<optimized out>, is_write=0)
>>>    at /home/root/20131219/qemu-be/exec.c:2070
>>> #8  0x00239e00 in kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1701
>>> #9  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
>>> /home/root/20131219/qemu-be/cpus.c:874
>>> #10 0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
>>> #11 0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> #12 0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
>>> (gdb) p /x s->sys_cfgstat
>>> $25 = 0x1
>>> (gdb) finish
>>> Run till exit from #0  arm_sysctl_read (opaque=0x95a600, offset=168,
>>> size=4) at /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
>>> memory_region_read_accessor (mr=0x95b8e0, addr=<optimized out>,
>>> value=0xb5c0dc18, size=4, shift=0, mask=4294967295) at
>>> /home/root/20131219/qemu-be/memory.c:408
>>> 408        trace_memory_region_ops_read(mr, addr, tmp, size);
>>> Value returned is $26 = 1
>>> (gdb) enable 2
>>> (gdb) cont
>>> Continuing.
>>>
>>> Breakpoint 2, kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1660
>>> 1660            kvm_arch_pre_run(cpu, run);
>>> (gdb) bt
>>> #0  kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1660
>>> #1  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
>>> /home/root/20131219/qemu-be/cpus.c:874
>>> #2  0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
>>> #3  0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> #4  0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
>>> (gdb) p /x run->mmio
>>> $27 = {phys_addr = 0x1c0100a8, data = {0x1, 0x0, 0x0, 0x0, 0x0, 0x0,
>>> 0x0, 0x0}, len = 0x4, is_write = 0x0}
>>>
>>> Also please look at the adjust_endianness function and
>>> struct MemoryRegion's 'endianness' field. IMHO in qemu it
>>> works quite nicely already. MemoryRegion 'read' and 'write'
>>> callbacks return/get data in native format; the adjust_endianness
>>> function checks whether the emulated device endianness matches
>>> the emulator endianness and, if it is different, does a byteswap
>>> according to size. As in the above example, the arm_sysctl_ops memory
>>> region should be marked as DEVICE_LITTLE_ENDIAN; when it
>>> returns the s->sys_cfgstat value, LE qemu sees that the endianity
>>> matches and does not byteswap the result, so the integer at
>>> the &mmio.data[0] address is in LE form. When qemu would
>>> run in BE mode on a BE kernel, it would see that the endianity
>>> mismatches and would byteswap the s->sys_cfgstat native value
>>> (BE), so mmio.data would contain the integer in LE format again.
>>>
>>> Note that in the currently committed code the arm_sysctl_ops endianity
>>> is DEVICE_NATIVE_ENDIAN, which is wrong - the real vexpress
>>> arm_sysctl device always gives/receives data in LE format regardless
>>> of the current CPSR E bit value, so it cannot be marked as NATIVE.
>>> LE and BE kernels always read it as an LE device; the BE kernel follows
>>> with a byteswap. It was OK while we only ran qemu in LE, but it
>>> should be fixed to LITTLE_ENDIAN for BE qemu to work correctly
>>> ... and actually changing the endianity of that device and a few other
>>> ARM-specific devices to LITTLE_ENDIAN was the only change in qemu
>>> needed to make BE KVM work.
>>>
>>>>>
>>>>> Yes, I fully agree :).
>>>> Great, I'll prepare a patch for the KVM API documentation.
>>>>
>>>> -Christoffer
>>>> _______________________________________________
>>>> kvmarm mailing list
>>>> kvmarm@lists.cs.columbia.edu
>>>> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
>>>
>>> Thanks,
>>> Victor
>>>
>>> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-January/thread.html#223186
>>>
>>>
>>>    Appendix
>>>    Data path endianity in ARM KVM mmio
>>>    ===================================
>>>
>>> This writeup considers several scenarios and tracks the endianity
>>> of data as it travels from the emulator to the guest CPU register, in
>>> the case of ARM KVM. It starts with the currently committed code for the LE
>>> KVM host case and then discusses the proposed BE KVM host
>>> arrangement.
>>>
>>> Just to restrict the discussion, the writeup considers the code path of
>>> an integer (4 bytes) read from h/w mapped emulated device memory.
>>> The writeup considers the endianity of the essential places involved in such
>>> a code path.
>>>
>>> For all cases when endianity is defined, it is assumed that
>>> the values under consideration are in memory (as opposed to being in
>>> a register, which does not have endianity). I.e. even if a function
>>> variable could actually be allocated in a CPU register, the writeup
>>> will refer to it as if it is in memory, just to keep the
>>> discussion clean, except for the final guest CPU register.
>>>
>>> Let's consider the following places along the data path from
>>> the emulator to the guest CPU register:
>>>
>>> 1) emulator code that holds the integer value to be read; assume
>>> it is a global 'int emulated_hw_device_val' variable. Normally
>>> the emulator holds it in native endian format - i.e. its CPSR E
>>> bit is the same as the kernel CPSR E bit. For discussion's sake
>>> assume that this h/w device register holds 5 as its value.
>>>
>>> 2) the KVM_EXIT_MMIO part of the 'struct kvm_run' structure, i.e.
>>> the mmio.data byte array. A byte array does not have endianity,
>>> but for this discussion we track the endianity of the integer
>>> at the &mmio.data[0] address
>>>
>>> 3) the 'data' variable of type 'unsigned long' in the
>>> kvm_handle_mmio_return function, before the
>>> vcpu_data_host_to_guest call. The KVM host mmio_read_buf
>>> function is used to fill this variable from the mmio.data
>>> buffer. mmio_read_buf effectively acts as a memcpy from the
>>> mmio.data buffer address, just taking the access size into
>>> account.
>>>
>>> 4) the same 'data' variable as above, but after the
>>> vcpu_data_host_to_guest function call, just before it is copied
>>> to the vcpu_reg target register location. Note that
>>> vcpu_data_host_to_guest may byteswap the value of 'data'
>>> depending on the current KVM host endianity and the value of
>>> the guest CPSR E bit (see the sketch after this list).
>>>
>>> 5) the guest CPU spilled register array, at the location of the
>>> target register, i.e. the integer at the
>>> vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt) address
>>>
>>> 6) finally, the guest CPU register filled from vcpu_reg just
>>> before the guest resumes execution of the trapped emulated
>>> instruction. Note this is done by the hypervisor part of the
>>> code, and the hypervisor EE bit is the same as the KVM host
>>> CPSR E bit.
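>>>
>>> A simplified sketch of the step 3) / step 4) machinery for
>>> the 4-byte case, as the use cases below assume it (this is
>>> not the verbatim kernel code):
>>>
>>>     /* step 3): mmio.data[] -> 'data', a pure memcpy */
>>>     u32 tmp;
>>>     memcpy(&tmp, run->mmio.data, 4);
>>>     data = tmp;
>>>
>>>     /* step 4): honor the guest CPSR E bit */
>>>     if (kvm_vcpu_is_be(vcpu))      /* guest E bit on */
>>>         data = cpu_to_be32(data);  /* no-op on a BE host */
>>>     else                           /* guest E bit off */
>>>         data = cpu_to_le32(data);  /* no-op on an LE host */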
>>>
>>> Note again: the KVM host, emulator, and hypervisor part of the
>>> code (the guest CPU register save and restore code) always run
>>> in the same endianity. The endianity of accessed emulated
>>> devices and the endianity of the guest vary independently of
>>> the KVM host endianity.
>>>
>>> The sections below consider all permutations of the possible
>>> cases; it may be quite boring to read. There is a summary table
>>> at the end - you can jump to the table after reading a few
>>> cases. But if you have objections and you see things happening
>>> differently, please comment inline on the use case steps.
>>>
>>> LE KVM host
>>> ===========
>>>
>>> Use case 1
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in LE format, matches device
>>> endianity
>>> 3) 'data' is LE
>>> 4) 'data' is LE (since guest CPSR E bit is off no byteswap)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 5 (0x00000005)
>>>
>>> guest resumes execution ... Let's say after the 'ldr r1, [r0]'
>>> instruction, where r0 holds the address of the device; the
>>> guest knows that it reads LE mapped h/w, so no additional
>>> processing is needed
>>>
>>> Use case 2
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in LE format; matches device
>>> endianity
>>> 3) 'data' is LE
>>> 4) 'data' is BE (since guest CPSR E bit is on, vcpu_data_host_to_guest
>>> will do byteswap: cpu_to_be)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 0x05000000
>>>
>>> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
>>> knows that it runs in BE mode (E bit on) and that it reads
>>> LE device memory, so it needs to byteswap r1 before further
>>> processing: it does 'rev r1, r1' and proceeds with the result
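>>>
>>> In C terms this is just the BE kernel's readl() path; a
>>> sketch, assuming the usual Linux I/O accessor layering:
>>>
>>>     u32 v = __raw_readl(addr); /* the 'ldr r1, [r0]' */
>>>     v = le32_to_cpu(v);        /* the 'rev r1, r1' on a
>>>                                   BE kernel */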
>>>
>>> Use case 3
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in BE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native,
>>> and it should match device endianity
>>> 3) 'data' is BE
>>> 4) 'data' is BE (since guest CPSR E bit is off no byteswap)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 0x05000000
>>>
>>> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
>>> knows that it runs in LE mode (E bit off) and that it reads
>>> BE device memory, so it needs to byteswap r1 before further
>>> processing: it does 'rev r1, r1' and proceeds with the result
>>>
>>> Use case 4
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in BE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native,
>>> and should match device endianity
>>> 3) 'data' is BE
>>> 4) 'data' is LE (since guest CPSR E bit is on, vcpu_data_host_to_guest
>>> will do byteswap: cpu_to_be)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 5 (0x00000005)
>>>
>>> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
>>> knows that it runs in BE mode and that it reads BE device
>>> memory, so it does not need to do anything before further
>>> processing.
>>>
>>>
>>> The above use cases are exactly what we have now, after Marc's
>>> commit to support BE guests on an LE KVM host. The further use
>>> cases describe how it would work with the BE KVM patches I
>>> proposed. It is understood that this is subject to further
>>> discussion.
>>>
>>>
>>> BE KVM host
>>> ===========
>>>
>>> Use case 5
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in LE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native;
>>> matches device endianity
>>> 3) 'data' is LE
>>> 4) 'data' is LE (since guest CPSR E bit is on, the BE KVM host
>>> kernel does *not* byteswap: cpu_to_be has no effect in a BE
>>> host kernel)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 0x05000000 because
>>> the hypervisor runs in BE mode, so the load of an LE integer
>>> puts a byteswapped value in the register
>>>
>>> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
>>> knows that it runs in BE mode and that it reads LE device
>>> memory, so it needs to byteswap r1 before further processing:
>>> it does 'rev r1, r1' and proceeds with the result
>>>
>>> Use case 6
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in LE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native;
>>> matches device endianity
>>> 3) 'data' is LE
>>> 4) 'data' is BE (since guest CPSR E bit is off, the BE KVM host
>>> kernel does byteswap: cpu_to_le)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 5 (0x00000005)
>>> because the hypervisor runs in BE mode, so the load of a BE
>>> integer is fine
>>>
>>> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
>>> knows that it runs in LE mode and that it reads LE device
>>> memory, so it does not need to do anything else; it just
>>> proceeds
>>>
>>> Use case 7
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in BE format; matches device
>>> endianity
>>> 3) 'data' is BE
>>> 4) 'data' is BE (since guest CPSR E bit is on, the BE KVM host
>>> kernel does *not* byteswap: cpu_to_be has no effect in a BE
>>> host kernel)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 5 (0x00000005)
>>> because the hypervisor runs in BE mode, so the load of a BE
>>> integer is fine
>>>
>>> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
>>> knows that it runs in BE mode and that it reads BE device
>>> memory, so it does not need to do anything else; it just
>>> proceeds
>>>
>>> Use case 8
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in BE format; matches device
>>> endianity
>>> 3) 'data' is BE
>>> 4) 'data' is LE (since guest CPSR E bit is off, the BE KVM host
>>> kernel does byteswap: cpu_to_le)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 0x05000000 because
>>> the hypervisor runs in BE mode, so the load of an LE integer
>>> puts a byteswapped value in the register
>>>
>>> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
>>> knows that it runs in LE mode and that it reads BE device
>>> memory, so it needs to byteswap r1 before further processing:
>>> it does 'rev r1, r1' and proceeds with the result
>>>
>>> Note that with a BE kernel we actually have some initial
>>> portion of assembler code that is executed with the CPSR E bit
>>> off and reads LE h/w - i.e. it falls into use case 1.
>>>
>>> Summary Table (please use fixed font to see it correctly)
>>> =========================================================
>>>
>>> --------------------------------------------------------------
>>> | Use Case # | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   |
>>> --------------------------------------------------------------
>>> | KVM Host,  | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
>>> | Emulator,  |     |     |     |     |     |     |     |     |
>>> | Hypervisor |     |     |     |     |     |     |     |     |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Device     | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Guest      | LE  | BE  | LE  | BE  | BE  | LE  | BE  | LE  |
>>> | Access     |     |     |     |     |     |     |     |     |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Step 1)    | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 2)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 3)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 4)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
>>> --------------------------------------------------------------
>>> | Step 5)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
>>> --------------------------------------------------------------
>>> | Final Reg  | no  | yes | yes | no  | yes | no  | no  | yes |
>>> | value      |     |     |     |     |     |     |     |     |
>>> | byteswapped|     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Guest      | no  | yes | yes | no  | yes | no  | no  | yes |
>>> | Follows    |     |     |     |     |     |     |     |     |
>>> | with rev   |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>>
>>> A few observations
>>> ==================
>>>
>>> x) Note the above table is symmetric wrt the BE<->LE change:
>>>       1<-->7
>>>       2<-->8
>>>       3<-->5
>>>       4<-->6
>>>
>>> x) the &mmio.data[0] address always holds the integer in the
>>> same format as the emulated device endianity (see the sketch
>>> after this list)
>>>
>>> x) during step 4), where the vcpu_data_host_to_guest function
>>> is used, if the guest E bit value differs but everything else
>>> is the same, opposite results are produced (1&2, 3&4, 5&6,
>>> 7&8)
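>>>
>>> As code, the &mmio.data[0] observation above just means the
>>> emulator side nets out to this for a 4-byte read from an LE
>>> device (a sketch of the effect, not the actual qemu code path):
>>>
>>>     uint32_t val = emulated_hw_device_val; /* native endian */
>>>     val = cpu_to_le32(val);    /* match device endianity */
>>>     memcpy(run->mmio.data, &val, 4);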
>>>
>>> If you reached the end :), again, thank you very much for
>>> reading it!
>>>
>>> - Victor
>>
>> Hi Victor,
>>
>> First of all I really appreciate the thorough description with
>> all the use-cases.
>>
>> Below would be a summary of what I understood from your
>> analysis:
>>
>> 1. Any MMIO device marked as NATIVE ENDIAN in user
>
> "Native endian" really is just a shortcut for "target endian" which is LE for ARM and BE for PPC. There shouldn't be a qemu-system-armeb or qemu-system-ppc64le.
>
> QEMU emulates everything that comes after the CPU, so imagine the ioctl struct as a bus package. Your bus doesn't care what endianness the CPU is in - it just gets data from the CPU.
>
> A bus write on the CPU however honors the endianness setting of the CPU. So when we convert from a value in register to a value on the bus we need to take this endian configuration into account.
>
> That's exactly what we are talking about here. KVM should do the cpu configured register->bus endian mapping while QEMU does the bus->device endian map.

Thanks for the info on QEMU side handling of MMIO data.

I was not aware that we would only have "target endian = LE"
for ARM/ARM64 in QEMU. I think Marc Z had mentioned a similar
thing about MMIO in our previous discussions on his patches.
(Please refer, http://www.spinics.net/lists/arm-kernel/msg283313.html)

This clearly means MMIO data passed to user space (QEMU) has
to be of host endianness so that QEMU can take care of the
bus->device endian map.

The current vcpu_data_guest_to_host() and vcpu_data_host_to_guest()
do not perform endianness conversion of MMIO data to LE when
we are running an LE guest on a BE host, so we do need Victor's
patch fixing vcpu_data_guest_to_host() and vcpu_data_host_to_guest().
(Already reported a long time back by me,
http://www.spinics.net/lists/arm-kernel/msg283308.html)
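
The missing piece is roughly an else branch for the LE guest
case; an illustrative 32-bit-only sketch (not Victor's actual
patch):

    if (kvm_vcpu_is_be(vcpu))
            data = cpu_to_be32(data); /* existing BE guest path */
    else
            data = cpu_to_le32(data); /* missing LE guest on BE
                                         host path */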

Regards,
Anup

>
>
> Alex
>
>> space tool (QEMU or KVMTOOL) is bad for a cross-endian
>> Guest. For supporting cross-endian Guests we need to have
>> all MMIO devices with fixed ENDIANNESS.
>>
>> 2. We don't need to do any endianness conversions in KVM
>> for MMIO writes that are being forwarded to user space. It is
>> the job of user space (QEMU or KVMTOOL) to interpret the
>> endianness of MMIO write data based on device endianness.
>>
>> 3. The MMIO read operation is the one which will need
>> explicit handling in KVM because the target VCPU register
>> of MMIO read operation should be loaded with MMIO data
>> (returned from user space) based upon current VCPU
>> endianness (i.e. VCPU CPSR.E bit).
>>
>> 4. In-kernel emulated devices (such as VGIC) will not
>> require any explicit endianness conversion of MMIO data for
>> MMIO write operations (same as point 2).
>>
>> 5. In-kernel emulated devices (such as VGIC) will have to do
>> explicit endianness conversion of MMIO data for MMIO read
>> operations based on device endianness (same as point 3).
>>
>> I hope the above summary of my understanding matches your
>> description. If so then I am in support of your description.
>>
>> I think your description (and the above 5 points) takes care
>> of all use cases of cross-endianness without changing the
>> current MMIO ABI.
>>
>> Regards,
>> Anup
>>

* Re: KVM and variable-endianness guest CPUs
  2014-01-22  5:39         ` [Qemu-devel] " Victor Kamensky
@ 2014-01-22 10:22           ` Peter Maydell
  -1 siblings, 0 replies; 102+ messages in thread
From: Peter Maydell @ 2014-01-22 10:22 UTC (permalink / raw)
  To: Victor Kamensky
  Cc: Christoffer Dall, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm

On 22 January 2014 05:39, Victor Kamensky <victor.kamensky@linaro.org> wrote:
> Hi Guys,
>
> Christoffer and I had a bit of a heated chat :) on this
> subject last night. Christoffer, really appreciate
> your time! We did not really reach agreement
> during the chat, and Christoffer asked me to follow
> up on this thread.
> Here it goes. Sorry, it is a very long email.
>
> I don't believe we can assign any endianity to
> the mmio.data[] byte array. I believe mmio.data[] and
> mmio.len act just as memcpy and that is all. As
> memcpy does not imply any endianity of the underlying
> data, mmio.data[] should not either.

This email is about five times too long to be actually
useful, but the major issue here is that the data being
transferred is not just a bag of bytes. The data[]
array plus the size field are being (mis)used to indicate
that the memory transaction is one of:
 * an 8 bit access
 * a 16 bit access of some uint16_t value
 * a 32 bit access of some uint32_t value
 * a 64 bit access of some uint64_t value

exactly as a CPU hardware bus would do. It's
because the API is defined in this awkward way with
a uint8_t[] array that we need to specify how both
sides should go from the actual properties of the
memory transaction (value and size) to filling in the
array.
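
In other words, the thing actually being transferred is morally
something like this (an illustrative struct only, not a proposed
ABI change):

    struct mmio_transaction {
        uint64_t phys_addr;
        uint64_t value;   /* the uintN_t the CPU put on the bus */
        uint32_t size;    /* 1, 2, 4 or 8 */
        bool     is_write;
    };

and the uint8_t data[8] array is just an encoding of 'value'
whose byte order has to be pinned down.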

Furthermore, device endianness is entirely irrelevant
for deciding the properties of mmio.data[], because the
thing we're modelling here is essentially the CPU->bus
interface. In real hardware, the properties of individual
devices on the bus are irrelevant to how the CPU's
interface to the bus behaves, and similarly here the
properties of emulated devices don't affect how KVM's
interface to QEMU userspace needs to work.

MemoryRegion's 'endianness' field, incidentally, is
a dreadful mess that we should get rid of. It is attempting
to model the property that some buses/bridges have of
doing byte-lane-swaps on data that passes through as
a property of the device itself. It would be better if we
modelled it properly, with container regions having possible
byte-swapping and devices just being devices.

thanks
-- PMM

* Re: [Qemu-ppc] KVM and variable-endianness guest CPUs
  2014-01-22  7:26               ` [Qemu-devel] " Victor Kamensky
@ 2014-01-22 10:52                 ` Alexander Graf
  -1 siblings, 0 replies; 102+ messages in thread
From: Alexander Graf @ 2014-01-22 10:52 UTC (permalink / raw)
  To: Victor Kamensky
  Cc: Anup Patel, Thomas Falcon, kvm-devel, QEMU Developers, qemu-ppc,
	kvmarm, Christoffer Dall


On 22.01.2014, at 08:26, Victor Kamensky <victor.kamensky@linaro.org> wrote:

> On 21 January 2014 22:41, Alexander Graf <agraf@suse.de> wrote:
>> 
>> 
>> "Native endian" really is just a shortcut for "target endian"
>> which is LE for ARM and BE for PPC. There shouldn't be
>> a qemu-system-armeb or qemu-system-ppc64le.
> 
> I disagree. Fully functional ARM BE system is what we've
> been working on for last few months. 'We' is Linaro
> Networking Group, Endian subteam and some other guys
> in ARM and across community. Why we do that is a bit
> beyond of this discussion.
> 
> ARM BE patches for both V7 and V8 are already in mainline
> kernel. But ARM BE KVM host is broken now. It is known
> deficiency that I am trying to fix. Please look at [1]. Patches
> for V7 BE KVM were proposed and currently under active
> discussion. Currently I work on ARM V8 BE KVM changes.
> 
> So "native endian" in ARM is value of CPSR register E bit.
> If it is off native endian is LE, if it is on it is BE.
> 
> Once and if we agree on ARM BE KVM host changes, the
> next step would be patches in qemu one of which introduces
> qemu-system-armeb. Please see [2].

I think we're facing an ideology conflict here. Yes, there should be a qemu-system-arm that is BE capable. There should also be a qemu-system-ppc64 that is LE capable. But there is no point in changing the "default endianness" for the virtual CPUs that we plug in there. Both CPUs are perfectly capable of running in LE or BE mode, the question is just what we declare the "default".

Think about the PPC bootstrap. We start off with a BE firmware, then boot into the Linux kernel which calls a hypercall to set the LE bit on every interrupt. But there's no reason this little endian kernel couldn't theoretically have big endian user space running with access to emulated device registers.

As Peter already pointed out, the actual breakage behind this is that we have a "default endianness" at all. But that's a very difficult thing to resolve and I don't think it should be our primary goal. Just live with the fact that we declare ARM little endian in QEMU and swap things accordingly - then everyone's happy.

This really only ever becomes a problem if you have devices that have awareness of the CPU's endian mode. The only one on PPC that I'm aware of that falls into this category is virtio, and there are patches pending to solve that. I don't know if there are any QEMU emulated devices outside of virtio with this issue on ARM, but you'll have to make the emulation code for those look at the CPU state then.

> 
>> QEMU emulates everything that comes after the CPU, so
>> imagine the ioctl struct as a bus package. Your bus
>> doesn't care what endianness the CPU is in - it just
>> gets data from the CPU.
> 
> I am not sure that I follow above. Suppose I have
> 
> mov r1, #1
> str r1, [r0]
> 
> where r0 is device address. Now depending on CPSR
> E bit value device address will receive 1 as integer either
> in LE order or in BE order. That is how ARM v7 CPU
> works, regardless whether it is emulated or not.
> 
> So if E bit is off (LE case) after str is executed
> byte at r0 address will get 1
> byte at r0 + 1 address will get 0
> byte at r0 + 2 address will get 0
> byte at r0 + 3 address will get 0
> 
> If E bit is on (BE case) after str is executed
> byte at r0 address will get 0
> byte at r0 + 1 address will get 0
> byte at r0 + 2 address will get 0
> byte at r0 + 3 address will get 1
> 
> my point that mmio.data[] just carries bytes for phys_addr
> mmio.data[0] would be value for byte at phys_addr,
> mmio.data[1] would be value for byte at phys_addr + 1, and
> so on.

What we get is an instruction that traps because it wants to "write r1 (which has value=1) into address x". So at that point we get the register value.

Then we need to take a look at the E bit to see whether the write was supposed to be in non-host endianness because we need to emulate exactly the LE/BE difference you're indicating above. The way we implement this on PPC is that we simply byte swap the register value when guest_endian != host_endian.
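
As a sketch of that rule (hypothetical helper names, not the
actual PPC code):

    /* for a 32-bit access */
    if (vcpu_is_bigendian(vcpu) != host_is_bigendian())
            val = swab32(val);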

With this in place, QEMU can just memcpy() the value into a local register and feed it into its emulation code which expects a "register value as if the CPU was running in native endianness" as parameter - with "native" meaning "little endian" for qemu-system-arm. Device emulation code doesn't know what to do with a byte array.

Take a look at QEMU's MMIO handler:

        case KVM_EXIT_MMIO:
            DPRINTF("handle_mmio\n");
            cpu_physical_memory_rw(run->mmio.phys_addr,
                                   run->mmio.data,
                                   run->mmio.len,
                                   run->mmio.is_write);
            ret = 0;
            break;

which translates to

                switch (l) {
                case 8:
                    /* 64 bit write access */
                    val = ldq_p(buf);
                    error |= io_mem_write(mr, addr1, val, 8);
                    break;
                case 4:
                    /* 32 bit write access */
                    val = ldl_p(buf);
                    error |= io_mem_write(mr, addr1, val, 4);
                    break;
                case 2:
                    /* 16 bit write access */
                    val = lduw_p(buf);
                    error |= io_mem_write(mr, addr1, val, 2);
                    break;
                case 1:
                    /* 8 bit write access */
                    val = ldub_p(buf);
                    error |= io_mem_write(mr, addr1, val, 1);
                    break;
                default:
                    abort();
                }

which calls the ldx_p primitives

#if defined(TARGET_WORDS_BIGENDIAN)
#define lduw_p(p) lduw_be_p(p)
#define ldsw_p(p) ldsw_be_p(p)
#define ldl_p(p) ldl_be_p(p)
#define ldq_p(p) ldq_be_p(p)
#define ldfl_p(p) ldfl_be_p(p)
#define ldfq_p(p) ldfq_be_p(p)
#define stw_p(p, v) stw_be_p(p, v)
#define stl_p(p, v) stl_be_p(p, v)
#define stq_p(p, v) stq_be_p(p, v)
#define stfl_p(p, v) stfl_be_p(p, v)
#define stfq_p(p, v) stfq_be_p(p, v)
#else
#define lduw_p(p) lduw_le_p(p)
#define ldsw_p(p) ldsw_le_p(p)
#define ldl_p(p) ldl_le_p(p)
#define ldq_p(p) ldq_le_p(p)
#define ldfl_p(p) ldfl_le_p(p)
#define ldfq_p(p) ldfq_le_p(p)
#define stw_p(p, v) stw_le_p(p, v)
#define stl_p(p, v) stl_le_p(p, v)
#define stq_p(p, v) stq_le_p(p, v)
#define stfl_p(p, v) stfl_le_p(p, v)
#define stfq_p(p, v) stfq_le_p(p, v)
#endif

and then passes the result as "originating register access" to the device emulation part of QEMU.


Maybe it becomes clearer if you understand the code flow that TCG is going through. With TCG, whenever a write traps into MMIO we go through these functions

void
glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
                                         DATA_TYPE val, int mmu_idx)
{
    helper_te_st_name(env, addr, val, mmu_idx, GETRA());
}

#ifdef TARGET_WORDS_BIGENDIAN
# define TGT_BE(X)  (X)
# define TGT_LE(X)  BSWAP(X)
#else
# define TGT_BE(X)  BSWAP(X)
# define TGT_LE(X)  (X)
#endif

void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
                       int mmu_idx, uintptr_t retaddr)
{
[...]
    /* Handle an IO access.  */
    if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
        hwaddr ioaddr;
        if ((addr & (DATA_SIZE - 1)) != 0) {
            goto do_unaligned_access;
        }
        ioaddr = env->iotlb[mmu_idx][index];

        /* ??? Note that the io helpers always read data in the target
           byte ordering.  We should push the LE/BE request down into io.  */
        val = TGT_LE(val);
        glue(io_write, SUFFIX)(env, ioaddr, val, addr, retaddr);
        return;
    }
    [...]
}

static inline void glue(io_write, SUFFIX)(CPUArchState *env,
                                          hwaddr physaddr,
                                          DATA_TYPE val,
                                          target_ulong addr,
                                          uintptr_t retaddr)
{
    MemoryRegion *mr = iotlb_to_region(physaddr);

    physaddr = (physaddr & TARGET_PAGE_MASK) + addr;
    if (mr != &io_mem_rom && mr != &io_mem_notdirty && !can_do_io(env)) {
        cpu_io_recompile(env, retaddr);
    }

    env->mem_io_vaddr = addr;
    env->mem_io_pc = retaddr;
    io_mem_write(mr, physaddr, val, 1 << SHIFT);
}

which at the end of the chain means that if you're running the same endianness on guest and host, you get the original register value as the function parameter. If you run different endiannesses, you get a swapped value as the function parameter.

So at the end of all of this, if you're running qemu-system-arm (TCG) on a BE host, the request will come in as a register value and stay that way until it reaches the IO callback function. Unless you define a specific endianness for your device, in which case the callback may swizzle it again. But if your device defines DEVICE_LITTLE_ENDIAN or DEVICE_NATIVE_ENDIAN, it won't swizzle it.

What happens when you switch your guest to BE mode (or LE for PPC)? Very simple. The TCG frontend swizzles every memory read and write before it hits TCG's memory operations.

If you're running qemu-system-arm (KVM) on a BE host, the request will come into kvm-all.c, get read with swapped endianness (ldq_p) and then get passed that way into the IO callback function. That's where the bug lies. It should behave the same way as TCG, so it needs to know the value the register originally had. So instead of doing an ldq_p() it should go through a different path that does memcpy().
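
Roughly like this (a sketch of the suggested kvm-all.c change,
not a tested patch):

    uint64_t val = 0;
    /* keep the raw register bytes instead of doing a
       target-endian ldq_p() load */
    memcpy(&val, run->mmio.data, run->mmio.len);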

But that doesn't fix the other-endian issue yet, right? Every value now would come in as the register value.

Well, unless you do the same thing TCG does inside the kernel. So the kernel would swap the reads and writes before it accesses the ioctl struct that connects kvm with QEMU. Then all abstraction layers work just fine again and we don't need any qemu-system-armeb.


Alex


* Re: KVM and variable-endianness guest CPUs
  2014-01-22 10:22           ` [Qemu-devel] " Peter Maydell
@ 2014-01-22 17:19             ` Victor Kamensky
  -1 siblings, 0 replies; 102+ messages in thread
From: Victor Kamensky @ 2014-01-22 17:19 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Christoffer Dall, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm

Hi Peter,

On 22 January 2014 02:22, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 22 January 2014 05:39, Victor Kamensky <victor.kamensky@linaro.org> wrote:
>> Hi Guys,
>>
>> Christoffer and I had a bit of a heated chat :) on this
>> subject last night. Christoffer, really appreciate
>> your time! We did not really reach agreement
>> during the chat, and Christoffer asked me to follow
>> up on this thread.
>> Here it goes. Sorry, it is a very long email.
>>
>> I don't believe we can assign any endianity to
>> the mmio.data[] byte array. I believe mmio.data[] and
>> mmio.len act just as memcpy and that is all. As
>> memcpy does not imply any endianity of the underlying
>> data, mmio.data[] should not either.
>
> This email is about five times too long to be actually
> useful,

Sorry, about that you may be right.
My responses below are much shorter :)

> but the major issue here is that the data being
> transferred is not just a bag of bytes. The data[]
> array plus the size field are being (mis)used to indicate
> that the memory transaction is one of:
>  * an 8 bit access
>  * a 16 bit access of some uint16_t value
>  * a 32 bit access of some uint32_t value
>  * a 64 bit access of some uint64_t value
>
> exactly as a CPU hardware bus would do. It's
> because the API is defined in this awkward way with
> a uint8_t[] array that we need to specify how both
> sides should go from the actual properties of the
> memory transaction (value and size) to filling in the
> array.

While responding to Alex last night I found, I think, the
easiest and shortest way to think about mmio.data[].

Just for discussion reference here it is again
                struct {
                        __u64 phys_addr;
                        __u8  data[8];
                        __u32 len;
                        __u8  is_write;
                } mmio;
I believe that in all cases it should be interpreted
in the following sense
   byte data[0] goes into byte at phys_addr + 0
   byte data[1] goes into byte at phys_addr + 1
   byte data[2] goes into byte at phys_addr + 2
   and so on up to len size

Basically, if it were on a real bus: take the byte value that
corresponds to address phys_addr + 0 and place it into data[0],
take the byte value that corresponds to phys_addr + 1 and place
it into data[1], etc.
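
In code form my interpretation is nothing more than this
(a sketch; guest_phys_write_byte is a made-up helper):

    int i;
    for (i = 0; i < run->mmio.len; i++)
            guest_phys_write_byte(run->mmio.phys_addr + i,
                                  run->mmio.data[i]);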

I believe this is true for the current ARM LE case and the
PPC BE case. I am asking you to keep it this way for all
other cases. My ARM BE V7 KVM patches still use it in the
same sense.

What is wrong with it?

Note that nowhere in the above description have I talked about
the endianity of anything: device, access (E bit), KVM host,
guest, hypervisor. All these endianities are irrelevant to the
mmio interface.

> Furthermore, device endianness is entirely irrelevant
> for deciding the properties of mmio.data[], because the
> thing we're modelling here is essentially the CPU->bus
> interface. In real hardware, the properties of individual
> devices on the bus are irrelevant to how the CPU's
> interface to the bus behaves, and similarly here the
> properties of emulated devices don't affect how KVM's
> interface to QEMU userspace needs to work.

As far as the mmio interface is concerned, I claim that any
endianity is irrelevant here. I am utterly lost as to the
endianity of what it is you care about. Consider
the following ARM code snippets:

setend le
mov r1, #0x04030201
str r1, [r0]

and

setend be
mov r1, #0x01020304
str r1, [r0]

when the above snippets are executed the memory bus sees
absolutely the same thing; can you tell, by looking at this
memory transaction, what endianity it is? And the endianity of
what? I can't.
The only thing you can tell by looking at this bus memory
transaction is that the 0x01 byte value goes at the r0 address,
the 0x02 byte value goes at the r0 + 1 address, etc.

Thanks,
Victor

> MemoryRegion's 'endianness' field, incidentally, is
> a dreadful mess that we should get rid of. It is attempting
> to model the property that some buses/bridges have of
> doing byte-lane-swaps on data that passes through as
> a property of the device itself. It would be better if we
> modelled it properly, with container regions having possible
> byte-swapping and devices just being devices.
>
> thanks
> -- PMM

* Re: KVM and variable-endianness guest CPUs
  2014-01-22 17:19             ` [Qemu-devel] " Victor Kamensky
@ 2014-01-22 17:29               ` Peter Maydell
  -1 siblings, 0 replies; 102+ messages in thread
From: Peter Maydell @ 2014-01-22 17:29 UTC (permalink / raw)
  To: Victor Kamensky
  Cc: Christoffer Dall, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm

On 22 January 2014 17:19, Victor Kamensky <victor.kamensky@linaro.org> wrote:
> On 22 January 2014 02:22, Peter Maydell <peter.maydell@linaro.org> wrote:
>> but the major issue here is that the data being
>> transferred is not just a bag of bytes. The data[]
>> array plus the size field are being (mis)used to indicate
>> that the memory transaction is one of:
>>  * an 8 bit access
>>  * a 16 bit access of some uint16_t value
>>  * a 32 bit access of some uint32_t value
>>  * a 64 bit access of some uint64_t value
>>
>> exactly as a CPU hardware bus would do. It's
>> because the API is defined in this awkward way with
>> a uint8_t[] array that we need to specify how both
>> sides should go from the actual properties of the
>> memory transaction (value and size) to filling in the
>> array.
>
> While responding to Alex last night I found, I think, the
> easiest and shortest way to think about mmio.data[].
>
> Just for discussion reference here it is again
>                 struct {
>                         __u64 phys_addr;
>                         __u8  data[8];
>                         __u32 len;
>                         __u8  is_write;
>                 } mmio;
> I believe that in all cases it should be interpreted
> in the following sense
>    byte data[0] goes into byte at phys_addr + 0
>    byte data[1] goes into byte at phys_addr + 1
>    byte data[2] goes into byte at phys_addr + 2
>    and so on up to len size
>
> Basically, if it were on a real bus: take the byte value that
> corresponds to address phys_addr + 0 and place it into data[0],
> take the byte value that corresponds to phys_addr + 1 and place
> it into data[1], etc.

This just isn't how real buses work. There is no
"address + 1, address + 2". There is a single address
for the memory transaction and a set of data on
data lines and some separate size information.
How the device at the far end of the bus chooses
to respond to 32 bit accesses to address X versus
8 bit accesses to addresses X through X+3 is entirely
its own business and unrelated to the CPU. (It would
be perfectly possible to have a device which when
you read from address X as 32 bits returned 0x12345678,
when you read from address X as 16 bits returned
0x9abc, returned 0x42 for an 8 bit read from X+1,
and so on. Having byte reads from X..X+3 return
values corresponding to parts of the 32 bit access
is purely a convention.)

> Note that nowhere in the above description have I talked about
> the endianity of anything: device, access (E bit), KVM host,
> guest, hypervisor. All these endianities are irrelevant to the
> mmio interface.

As soon as you try to think of the mmio.data as a set
of bytes then you have to specify some endianness to
the data, so that both sides (kernel and userspace)
know how to reconstruct the actual data value from the
array of bytes.

>> Furthermore, device endianness is entirely irrelevant
>> for deciding the properties of mmio.data[], because the
>> thing we're modelling here is essentially the CPU->bus
>> interface. In real hardware, the properties of individual
>> devices on the bus are irrelevant to how the CPU's
>> interface to the bus behaves, and similarly here the
>> properties of emulated devices don't affect how KVM's
>> interface to QEMU userspace needs to work.
>
> As far as the mmio interface is concerned, I claim that any
> endianity is irrelevant here. I am utterly lost as to the
> endianity of what it is you care about.

I care about knowing which end of mmio.data is the
least significant byte, obviously.

> Consider
> the following ARM code snippets:
>
> setend le
> mov r1, #0x04030201
> str r1, [r0]
>
> and
>
> setend be
> mov r1, #0x01020304
> str r1, [r0]
>
> when the above snippets are executed the memory bus sees
> absolutely the same thing; can you tell, by looking at this
> memory transaction, what endianity it is? And the endianity of
> what? I can't.

That is correct. That is because the value sent out on
the bus from the CPU is always the same: it says
"32 bit transaction, value 0x04030201, address $whatever".

> The only thing you can tell by looking at this bus
> memory transaction is that the 0x01 byte value goes at the r0
> address, the 0x02 byte value goes at the r0 + 1 address,
> etc.

No, this part is absolutely wrong, see above.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: KVM and variable-endianness guest CPUs
  2014-01-22 17:29               ` [Qemu-devel] " Peter Maydell
@ 2014-01-22 19:29                 ` Victor Kamensky
  -1 siblings, 0 replies; 102+ messages in thread
From: Victor Kamensky @ 2014-01-22 19:29 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Christoffer Dall, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm

On 22 January 2014 09:29, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 22 January 2014 17:19, Victor Kamensky <victor.kamensky@linaro.org> wrote:
>> On 22 January 2014 02:22, Peter Maydell <peter.maydell@linaro.org> wrote:
>>> but the major issue here is that the data being
>>> transferred is not just a bag of bytes. The data[]
>>> array plus the size field are being (mis)used to indicate
>>> that the memory transaction is one of:
>>>  * an 8 bit access
>>>  * a 16 bit access of some uint16_t value
>>>  * a 32 bit access of some uint32_t value
>>>  * a 64 bit access of some uint64_t value
>>>
>>> exactly as a CPU hardware bus would do. It's
>>> because the API is defined in this awkward way with
>>> a uint8_t[] array that we need to specify how both
>>> sides should go from the actual properties of the
>>> memory transaction (value and size) to filling in the
>>> array.
>>
>> While responding to Alex last night, I found what I think
>> is the easiest and shortest way to think about mmio.data[].
>>
>> Just for reference in this discussion, here it is again:
>>                 struct {
>>                         __u64 phys_addr;
>>                         __u8  data[8];
>>                         __u32 len;
>>                         __u8  is_write;
>>                 } mmio;
>> I believe that in all cases it should be interpreted
>> in the following sense:
>>    byte data[0] goes into byte at phys_addr + 0
>>    byte data[1] goes into byte at phys_addr + 1
>>    byte data[2] goes into byte at phys_addr + 2
>>    and so on, up to len bytes
>>
>> Basically, if it were on a real bus: take the byte value
>> that corresponds to address phys_addr + 0 and place
>> it into data[0], take the byte value that corresponds to
>> phys_addr + 1 and place it into data[1], etc.
>
> This just isn't how real buses work. There is no
> "address + 1, address + 2". There is a single address
> for the memory transaction and a set of data on
> data lines and some separate size information.

Yes, and those data lines are just binary signal lines,
not numbers. If one wants to describe the information on
the data lines as a number, one needs to assign integer
bit numbers to the lines, and that is an entirely
arbitrary process.
If one person assigns those bits to lines
one way and another chooses the reverse way, they will
talk about completely different numbers for the same
signals on the bus. Such an enumeration of the data lines
has no bearing on how the bus actually works. And I don't
even see why it should be described as a single integer
at all; for example, one can describe the information on
the data lines as a set of 4 byte values, and there is
nothing wrong with such a description.

> How the device at the far end of the bus chooses
> to respond to 32 bit accesses to address X versus
> 8 bit accesses to addresses X through X+3 is entirely
> its own business and unrelated to the CPU. (It would
> be perfectly possible to have a device which when
> you read from address X as 32 bits returned 0x12345678,
> when you read from address X as 16 bits returned
> 0x9abc, returned 0x42 for an 8 bit read from X+1,
> and so on. Having byte reads from X..X+3 return
> values corresponding to parts of the 32 bit access
> is purely a convention.)

I don't follow the above. One read from device
address X as 32 bits may return 0x12345678,
and another read from the same address X as 32 bits
may return 0xabcdef12, so what? Maybe a real example
would help.

>> Note that nowhere in my description above have I talked
>> about the endianness of anything: device, access (E bit),
>> KVM host, guest, hypervisor. All these endiannesses
>> are irrelevant to the mmio interface.
>
> As soon as you try to think of the mmio.data as a set
> of bytes then you have to specify some endianness to
> the data, so that both sides (kernel and userspace)
> know how to reconstruct the actual data value from the
> array of bytes.

What actual value? In what sense? You need to bring
the semantics of this h/w address into the discussion to
really tell that. The driver that reads/writes is aware of
the semantics of those addresses. For example, a device
may give a channel1 byte value at phys_addr and a
channel2 byte value at phys_addr + 1, so a 16-bit integer
read from phys_addr will bring two channel values into
the register, not one 16-bit integer.
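
[A sketch in C of that kind of device; chan1/chan2 and the
register layout are hypothetical, and the pointer cast is only
for illustration.]

#include <stdint.h>

struct chan_regs {
    volatile uint8_t chan1;   /* byte register at phys_addr + 0 */
    volatile uint8_t chan2;   /* byte register at phys_addr + 1 */
};

/* One 16-bit load fetches both channel bytes in a single
 * transaction; which byte lands in the low half of the CPU
 * register depends on the CPU's byte order, not on the device. */
static uint16_t read_both_channels(const volatile struct chan_regs *r)
{
    return *(const volatile uint16_t *)r;
}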

>>> Furthermore, device endianness is entirely irrelevant
>>> for deciding the properties of mmio.data[], because the
>>> thing we're modelling here is essentially the CPU->bus
>>> interface. In real hardware, the properties of individual
>>> devices on the bus are irrelevant to how the CPU's
>>> interface to the bus behaves, and similarly here the
>>> properties of emulated devices don't affect how KVM's
>>> interface to QEMU userspace needs to work.
>>
>> As far as the mmio interface is concerned, I claim that
>> endianness is irrelevant here. I am utterly lost: the
>> endianness of what do you care about?
>
> I care about knowing which end of mmio.data is the
> least significant byte, obviously.

The LSB of what? Memory semantics has no
notion of an LSB. That comes in only when one starts
interpreting memory contents. memcpy does not
have any LSB involved, just bytes.

>> Consider
>> the following ARM code snippets:
>>
>> setend le
>> mov r1, #0x04030201
>> str r1, [r0]
>>
>> and
>>
>> setend be
>> mov r1, #0x01020304
>> str r1, [r0]
>>
>> when the above snippets are executed, the memory bus
>> sees absolutely the same thing. Can you tell by
>> looking at this memory transaction what endianness
>> it has? And the endianness of what? I can't.
>
> That is correct. That is because the value sent out on
> the bus from the CPU is always the same: it says
> "32 bit transaction, value 0x04030201, address $whatever".

The above is just *your* choice of how to describe the
signals on the actual bus lines. I can describe the same
transaction as "put the set of 4 bytes {0x01, 0x02, 0x03,
0x04} at address $whatever". It does not change what the
line values would be during this memory transaction.

BTW, could you please propose how you would represent such
a "32 bit transaction, value 0x04030201, address $whatever"
on an ARM LE CPU in mmio.data?

If it would be {0x01, 0x02, 0x03, 0x04}, that is fine with
me. That is the current ARM LE case, when the above
snippets are executed by the guest.

Would we agree that the same arrangement would be
true for all other cases on ARM, regardless of all the other
endiannesses of qemu, KVM host, guest, hypervisor, etc.?
If we agree on that, I think we are talking about the
same thing, just in different terms. My memcpy
semantics of mmio.data[] matches those values just
fine.

>> The only thing you can tell by looking at this bus
>> memory transaction is that 0x01 byte value goes at
>> r0 address, 0x02 byte value goes at r0 + 1 address,
>> etc.
>
> No, this part is absolutely wrong, see above.

I don't see why you are so attached to describing the
data part of a memory transaction as just one of the int
types. If we are talking about hypothetical
cases, imagine a bus that allows a transaction with
a size of 6 bytes. How do you describe such data in
your ints-speak? What endianness can you assign to a
sequence of 6 bytes? Note that describing
such a transaction as a set of 6 byte values at address
$whatever makes perfect sense.

Thanks,
Victor

> thanks
> -- PMM

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: KVM and variable-endianness guest CPUs
  2014-01-22 19:29                 ` [Qemu-devel] " Victor Kamensky
@ 2014-01-22 20:02                   ` Peter Maydell
  -1 siblings, 0 replies; 102+ messages in thread
From: Peter Maydell @ 2014-01-22 20:02 UTC (permalink / raw)
  To: Victor Kamensky
  Cc: Christoffer Dall, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm

On 22 January 2014 19:29, Victor Kamensky <victor.kamensky@linaro.org> wrote:
> On 22 January 2014 09:29, Peter Maydell <peter.maydell@linaro.org> wrote:
>> This just isn't how real buses work. There is no
>> "address + 1, address + 2". There is a single address
>> for the memory transaction and a set of data on
>> data lines and some separate size information.
>
> Yes, and those data lines are just binary signal lines,
> not numbers. If one wants to describe the information on
> the data lines as a number, one needs to assign integer
> bit numbers to the lines, and that is an entirely
> arbitrary process.

It is part of the definition of the bus which signal pin is
D0 and which is D31...

> If one person assigns those bits to lines
> one way and another chooses the reverse way, they will
> talk about completely different numbers for the same
> signals on the bus. Such an enumeration of the data lines
> has no bearing on how the bus actually works. And I don't
> even see why it should be described as a single integer
> at all; for example, one can describe the information on
> the data lines as a set of 4 byte values, and there is
> nothing wrong with such a description.

It is not how the hardware works. If you describe it as
a set of 4 bytes, then you need to also say how you are
mapping from those 4 bytes to the actual 32 bit data
transaction the hardware is doing. Which is the question
we're trying to answer in this thread.

I've snipped a huge chunk of my initial reply to this email,
because it all boiled down to "sorry, you're just not correct
about how the hardware works" and it doesn't seem
necessary to repeat it three times. Devices really do see
"this is a transaction with this value and this size". They
do not in any way see a 32 bit word write as "this is a collection
of byte writes". Therefore:

 1) thinking about a 32 bit word write in terms of a byte array
    is confusing
 2) since the KVM API is unfortunately stuck with this byte
    array, we must define the semantics of what it actually
    contains, so that the kernel and QEMU can go between
    "the value being read/written in the transaction" and
    "the contents of the byte array" (a sketch follows).

>> As soon as you try to think of the mmio.data as a set
>> of bytes then you have to specify some endianness to
>> the data, so that both sides (kernel and userspace)
>> know how to reconstruct the actual data value from the
>> array of bytes.
>
> What actual value? In what sense? You need to bring
> the semantics of this h/w address into the discussion to
> really tell that.

I've just spent the last two emails doing exactly that.
The actual value, as in "this CPU just did a memory
transaction of a 32 bit data value".

>> BTW, could you please propose how you would represent such
>> a "32 bit transaction, value 0x04030201, address $whatever"
>> on an ARM LE CPU in mmio.data?

That is exactly the problem we're discussing in this thread.
Indeed, I proposed an answer to it, which is that the mmio.data
array should be in host kernel byte order, in which case it
would be (for an LE host kernel) 0x01 in mmio.data[0] and so
on up.

>> If it would be {0x01, 0x02, 0x03, 0x04}, that is fine with
>> me. That is the current ARM LE case, when the above
>> snippets are executed by the guest.
>>
>> Would we agree that the same arrangement would be
>> true for all other cases on ARM, regardless of all the other
>> endiannesses of qemu, KVM host, guest, hypervisor, etc.?

No; under my proposal, for a big-endian host kernel (and
thus big-endian QEMU) the order would be
mmio.data[0] = 0x04, etc. (This wouldn't change based
on the guest kernel endianness or whether it happened
to have set the E bit temporarily.)

Defining that mmio.data[] is always little-endian would
be a valid definition of an API if we were doing it from
scratch. It has the unfortunate property that it would
completely break the existing PPC BE setups, which
don't define it that way, so it is a non-starter.

Defining it as being always guest-order would mean that
userspace had to continually look at the guest CPU
endianness bit, which is annoying and awkward.

Defining it as always host-endian order is the most
reasonable option available. It also happens to work
for the current QEMU code, which is nice.
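
[Under that definition, the userspace side stays trivial; the
sketch below is an illustration, not the actual QEMU code path.]

#include <stdint.h>
#include <string.h>
#include <linux/kvm.h>

/* For a 32-bit MMIO write exit, a plain host-order load recovers
 * the transaction value on both LE and BE hosts. */
static uint32_t mmio_value32(const struct kvm_run *run)
{
    uint32_t val;
    memcpy(&val, run->mmio.data, sizeof(val));
    return val;   /* e.g. 0x04030201 for the earlier ARM snippets */
}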

thanks
-- PMM

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: KVM and variable-endianness guest CPUs
  2014-01-22 20:02                   ` [Qemu-devel] " Peter Maydell
@ 2014-01-22 22:47                     ` Victor Kamensky
  -1 siblings, 0 replies; 102+ messages in thread
From: Victor Kamensky @ 2014-01-22 22:47 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Christoffer Dall, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm

On 22 January 2014 12:02, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 22 January 2014 19:29, Victor Kamensky <victor.kamensky@linaro.org> wrote:
>> On 22 January 2014 09:29, Peter Maydell <peter.maydell@linaro.org> wrote:
>>> This just isn't how real buses work. There is no
>>> "address + 1, address + 2". There is a single address
>>> for the memory transaction and a set of data on
>>> data lines and some separate size information.
>>
>> Yes, and those data lines are just binary signal lines,
>> not numbers. If one wants to describe the information on
>> the data lines as a number, one needs to assign integer
>> bit numbers to the lines, and that is an entirely
>> arbitrary process.
>
> It is part of the definition of the bus which signal pin is
> D0 and which is D31...
>
>> If one person assigns those bits to lines
>> one way and another chooses the reverse way, they will
>> talk about completely different numbers for the same
>> signals on the bus. Such an enumeration of the data lines
>> has no bearing on how the bus actually works. And I don't
>> even see why it should be described as a single integer
>> at all; for example, one can describe the information on
>> the data lines as a set of 4 byte values, and there is
>> nothing wrong with such a description.
>
> It is not how the hardware works. If you describe it as
> a set of 4 bytes, then you need to also say how you are
> mapping from those 4 bytes to the actual 32 bit data
> transaction the hardware is doing. Which is the question
> we're trying to answer in this thread.
>
> I've snipped a huge chunk of my initial reply to this email,
> because it all boiled down to "sorry, you're just not correct
> about how the hardware works" and it doesn't seem
> necessary to repeat it three times. Devices really do see
> "this is a transaction with this value and this size". They
> do not in any way see a 32 bit word write as "this is a collection
> of byte writes". Therefore:
>
>  1) thinking about a 32 bit word write in terms of a byte array
>     is confusing
>  2) since the KVM API is unfortunately stuck with this byte
>     array, we must define the semantics of what it actually
>     contains, so that the kernel and QEMU can go between
>     "the value being read/written in the transaction" and
>     "the contents of the byte array".
>
>>> As soon as you try to think of the mmio.data as a set
>>> of bytes then you have to specify some endianness to
>>> the data, so that both sides (kernel and userspace)
>>> know how to reconstruct the actual data value from the
>>> array of bytes.
>>
>> What actual value? In what sense? You need to bring
>> the semantics of this h/w address into the discussion to
>> really tell that.
>
> I've just spent the last two emails doing exactly that.
> The actual value, as in "this CPU just did a memory
> transaction of a 32 bit data value".

You deleted my example, but I need it again:
Consider the following ARM code snippets:

setend le
mov r1, #0x04030201
str r1, [r0]

and

setend be
mov r1, #0x01020304
str r1, [r0]

Just for the LE host case, basically you are saying that if the
guest issues a 4-byte store instruction from a CPU core register
and the CPSR E bit is off, mmio.data[0] would contain the LSB of
the integer from that CPU core register. I don't understand your
bus endianness thing, but I do understand the LSB of an integer
in a core CPU register. Do we agree that in the second case of
the above example, when BE access is on (the E bit is on), it is
exactly the same memory transaction, but data[0] = 0x1 is the MSB
of the integer in the CPU core register (still the same LE host
case)?

>> BTW, could you please propose how you would represent such
>> a "32 bit transaction, value 0x04030201, address $whatever"
>> on an ARM LE CPU in mmio.data?
>
> That is exactly the problem we're discussing in this thread.
> Indeed, I proposed an answer to it, which is that the mmio.data
> array should be in host kernel byte order, in which case it
> would be (for an LE host kernel) 0x01 in mmio.data[0] and so
> on up.
>
>> If it would be {0x01, 0x02, 0x03, 0x04}, that is fine with
>> me. That is the current ARM LE case, when the above
>> snippets are executed by the guest.
>>
>> Would we agree that the same arrangement would be
>> true for all other cases on ARM, regardless of all the other
>> endiannesses of qemu, KVM host, guest, hypervisor, etc.?
>
> No; under my proposal, for a big-endian host kernel (and
> thus big-endian QEMU) the order would be
> mmio.data[0] = 0x04, etc. (This wouldn't change based
> on the guest kernel endianness or whether it happened
> to have set the E bit temporarily.)

Consider the big endian (setend be) example above,
but now running on a BE KVM host. 0x4 is the LSB of the
CPU core register in this case.

> Defining that mmio.data[] is always little-endian would
> be a valid definition of an API if we were doing it from
> scratch. It has the unfortunate property that it would
> completely break the existing PPC BE setups, which
> don't define it that way, so it is a non-starter.

I believe, but I need to check, that the PPC BE setup actually
acts like the second case in the above example. If we have a PPC
BE guest executing the following instructions:

lis     r1,0x102
ori     r1,r1,0x304
stw     r1,0(r0)

then after the first two instructions r1 would contain 0x01020304.
IMHO it corresponds exactly to my second ARM case above -
a BE guest running under a BE ARM KVM host. I believe
that mmio.data[] in the PPC BE case would be {0x1, 0x2, 0x3, 0x4}.
That is the same as what I argue should be the case for
both ARM and PPC, for any choice of endianness of KVM host
and emulator.
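
[A quick C check of that claim, runnable on a big-endian machine;
nothing here is KVM-specific.]

#include <stdint.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    uint32_t v = 0x01020304;
    uint8_t data[4];

    memcpy(data, &v, sizeof(v));   /* host-order store */
    /* On a BE host this prints 01 02 03 04, i.e. the
     * "host kernel endianness" definition and the byte-copy view
     * coincide whenever guest and host byte order match. */
    printf("%02x %02x %02x %02x\n",
           data[0], data[1], data[2], data[3]);
    return 0;
}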

But according to you, data[0] must be 0x4 in the BE host case.
If you are right, and in the PPC BE case the above ppc instruction
sequence executed in BE mode, running on a BE host, gives
mmio.data[] = {0x04, 0x03, 0x02, 0x01} - i.e. the integer at the
&mmio.data[0] address has LE format from the original CPU core
register integer's point of view - then my byte-copy interpretation
of mmio.data[] indeed falls to pieces. Such a PPC case breaks it.

Could someone on these mailing lists quickly check the PPC case
with a guest instruction sequence similar to the above? I could try
to do it myself, but I think it may take me a couple of days to set
this up .. need to dust off my p5020 board :) ...

Thanks,
Victor

> Defining it as being always guest-order would mean that
> userspace had to continually look at the guest CPU
> endianness bit, which is annoying and awkward.
>
> Defining it as always host-endian order is the most
> reasonable option available. It also happens to work
> for the current QEMU code, which is nice.
>
> thanks
> -- PMM

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: KVM and variable-endianness guest CPUs
  2014-01-22 22:47                     ` [Qemu-devel] " Victor Kamensky
@ 2014-01-22 23:18                       ` Peter Maydell
  -1 siblings, 0 replies; 102+ messages in thread
From: Peter Maydell @ 2014-01-22 23:18 UTC (permalink / raw)
  To: Victor Kamensky
  Cc: Christoffer Dall, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm

On 22 January 2014 22:47, Victor Kamensky <victor.kamensky@linaro.org> wrote:
> You deleted my example, but I need it again:
> Consider the following ARM code snippets:
>
> setend le
> mov r1, #0x04030201
> str r1, [r0]
>
> and
>
> setend be
> mov r1, #0x01020304
> str r1, [r0]
>
> Just for the LE host case, basically you are saying that if the
> guest issues a 4-byte store instruction from a CPU core register
> and the CPSR E bit is off, mmio.data[0] would contain the LSB of
> the integer from that CPU core register. I don't understand your
> bus endianness thing, but I do understand the LSB of an integer
> in a core CPU register. Do we agree that in the second case of
> the above example, when BE access is on (the E bit is on), it is
> exactly the same memory transaction, but data[0] = 0x1 is the MSB
> of the integer in the CPU core register (still the same LE host
> case)?

Yes, this is true both if we define mmio.data[] as "always
little endian" and if we define it as "host kernel endianness",
since you've specified an LE host here.

(The kernel has to byte swap if CPSR.E is set, because
it has to emulate the byte-lane-swap the CPU hardware
does internally before register data goes out to the bus.)

> Consider the big endian (setend be) example above,
> but now running on a BE KVM host. 0x4 is the LSB of the
> CPU core register in this case.

Yes. In this case if we are using the "mmio.data is host
kernel endianness" definition then mmio.data[0] should be
0x01 (the MSB of the 32 bit data value). (Notice that the
BE host kernel can actually just behave exactly like the LE
one: byteswap 32 bit value from guest register if guest
CPSR.E is set, then do a 32-bit store of the 32 bit word
into mmio.data[].)
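
[A sketch in C of the recipe in that parenthesis; fill_mmio_data
and its arguments are illustrative, and swab32 is written out here
rather than taken from <linux/swab.h>.]

#include <stdint.h>
#include <string.h>

static uint32_t swab32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Emulate the CPU's byte-lane swap for CPSR.E, then store the
 * value into mmio.data[] in host order; the same code works on
 * LE and BE host kernels. */
static void fill_mmio_data(uint8_t data[8], uint32_t reg_val, int cpsr_e)
{
    uint32_t val = cpsr_e ? swab32(reg_val) : reg_val;
    memcpy(data, &val, sizeof(val));
}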

>> Defining that mmio.data[] is always little-endian would
>> be a valid definition of an API if we were doing it from
>> scratch. It has the unfortunate property that it would
>> completely break the existing PPC BE setups, which
>> don't define it that way, so it is a non-starter.
>
> I believe, but I need to check, that the PPC BE setup actually
> acts like the second case in the above example. If we have a PPC
> BE guest executing the following instructions:
>
> lis     r1,0x102
> ori     r1,r1,0x304
> stw     r1,0(r0)
>
> then after the first two instructions r1 would contain 0x01020304.
> IMHO it corresponds exactly to my second ARM case above -
> a BE guest running under a BE ARM KVM host. I believe
> that mmio.data[] in the PPC BE case would be {0x1, 0x2, 0x3, 0x4}.

Yes, assuming a BE PPC host kernel (which is the usual
arrangement).

> But according to you, data[0] must be 0x4 in the BE host case.

Er, no. The data here is 0x01020304, so for a BE host
data[0] is the big end, ie 0x1. It would only be 0x4 if
mmio.data[] were LE always (or if you were running
your BE PPC guest on an LE PPC host, which I don't
think is supported currently).

thanks
-- PMM

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] KVM and variable-endianness guest CPUs
@ 2014-01-23  0:22                         ` Victor Kamensky
  0 siblings, 0 replies; 102+ messages in thread
From: Victor Kamensky @ 2014-01-23  0:22 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Thomas Falcon, kvm-devel, QEMU Developers, qemu-ppc, kvmarm,
	Christoffer Dall

Peter, could I please ask you a favor? Could you please
stop deleting pieces of your and my previous responses
when you reply?
Please just reply inline. Sometimes I would like to
reference my or your previous statement, but I cannot
find it in your response email. It is very bizarre. Sorry,
it will make your response emails bigger, but I am very
confused otherwise.

On 22 January 2014 15:18, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 22 January 2014 22:47, Victor Kamensky <victor.kamensky@linaro.org> wrote:
>> You deleted my example, but I need it again:
>> Consider the following ARM code snippets:
>>
>> setend le
>> mov r1, #0x04030201
>> str r1, [r0]
>>
>> and
>>
>> setend be
>> mov r1, #0x01020304
>> str r1, [r0]
>>
>> Just for the LE host case, basically you are saying that if the
>> guest issues a 4-byte store instruction from a CPU core register
>> and the CPSR E bit is off, mmio.data[0] would contain the LSB of
>> the integer from that CPU core register. I don't understand your
>> bus endianness thing, but I do understand the LSB of an integer
>> in a core CPU register. Do we agree that in the second case of
>> the above example, when BE access is on (the E bit is on), it is
>> exactly the same memory transaction, but data[0] = 0x1 is the MSB
>> of the integer in the CPU core register (still the same LE host
>> case)?
>
> Yes, this is true both if we define mmio.data[] as "always
> little endian" and if we define it as "host kernel endianness",
> since you've specified an LE host here.

OK, we are in agreement here. mmio.data[] = { 0x1, 0x2, 0x3, 0x4}
for both types of guest access and host is LE  And as
far as bus concerned it is absolutely the same transaction.

> (The kernel has to byte swap if CPSR.E is set, because
> it has to emulate the byte-lane-swap the CPU hardware
> does internally before register data goes out to the bus.)
>
>> Consider above big endian case (setend be) example,
>> but now running in BE KVM host. 0x4 is LSB of CPU
>> core register in this case.
>
> Yes. In this case if we are using the "mmio.data is host
> kernel endianness" definition then mmio.data[0] should be
> 0x01 (the MSB of the 32 bit data value).

If mmio.data[0] is 0x1, mmio.data[] = {0x1, 0x2, 0x3, 0x4},
and now KVM host and emulator running in BE mode.
But that contradicts to what you said before. In previous
email (please see [1]). Here is what your and I just said in few
paragraphs before my "Consider above big endian .." above
paragraph:

----- start quote ------------------------------------------

>> BTW could you please propose how will you see such
>> "32 bit transaction, value 0x04030201, address $whatever".
>> on ARM LE CPU in mmio.data?
>
> That is exactly the problem we're discussing in this thread.
> Indeed, I proposed an answer to it, which is that the mmio.data
> array should be in host kernel byte order, in which case it
> would be (for an LE host kernel) 0x01 in mmio.data[0] and so
> on up.
>
>> If it would be {0x01, 0x02, 0x03, 0x4} it is fine with
>> me. That is current case ARM LE case when above
>> snippets would be executed by guest.
>>
>> Would we  agree that the same arrangement would be
>> true for all other cases on ARM regardless of all other
>> endianities of qemu, KVM host, guest, hypervisor, etc?
>
> No; under my proposal, for a big-endian host kernel (and
> thus big-endian QEMU) the order would be
> mmio.data[0] = 0x04, etc. (This wouldn't change based
> on the guest kernel endianness or whether it happened
> to have set the E bit temporarily.)

----- end quote ------------------------------------------

So in one case for the same memory transaction (ARM
setend be snippet) executed
under BE ARM host KVM you said that "mmio.data[0]
should be 0x01 (the MSB of the 32 bit data value)" and
before you said "No; under my proposal, for a big-endian
host kernel (and thus big-endian QEMU) the order would be
mmio.data[0] = 0x04, etc". So which is mmio.data[0]?

I argue that for all three code snippets in this email (two for
ARM and one for PPC) mmio.data[] = {0x1, 0x2, 0x3, 04},
and that does not depend whether it is LE ARM KVM host,
BE ARM KVM host, or BE PPC KVM.

> (Notice that the
> BE host kernel can actually just behave exactly like the LE
> one: byteswap 32 bit value from guest register if guest
> CPSR.E is set, then do a 32-bit store of the 32 bit word
> into mmio.data[].)
>
>>> Defining that mmio.data[] is always little-endian would
>>> be a valid definition of an API if we were doing it from
>>> scratch. It has the unfortunate property that it would
>>> completely break the existing PPC BE setups, which
>>> don't define it that way, so it is a non-starter.
>>
>> I believe, but I need to check, that PPC BE setup actually
>> acts as the second case in above example  If we have PPC
>> BE guest executing the following instructions:
>>
>> lis     r1,0x102
>> ori     r1,r1,0x304
>> stw    r1,0(r0)
>>
>> after first two instructions r1 would contain 0x01020304.
>> IMHO It exactly corresponds to above my ARM second case -
>> BE guest when it runs under ARM BE KVM host. I believe
>> that mmio.data[] in PPC BE case would be {0x1, 0x2, 0x3, 0x4}.
>
> Yes, assuming a BE PPC host kernel (which is the usual
> arrangement).

OK, that confirms my understanding how PPC mmio
should work.

>> But according to you data[0] must be 0x4 in BE host case
>
> Er, no. The data here is 0x01020304, so for a BE host
> data[0] is the big end, ie 0x1. It would only be 0x4 if
> mmio.data[] were LE always (or if you were running
> your BE PPC guest on an LE PPC host, which I don't
> think is supported currently).

So do you agree that for all three code snippets cited in this
email, we always will have mmio.data[] = {0x1, 0x2,
0x3, 0x4}, for ARM LE qemu/host, for ARM BE qemu/host
and for ppc code snippet in PPC BE qemu/host.
I believe it should be this way, because from emulator (i.e
qemu) code point of view running on ARM BE qemu/host
or PPC BE and emulating the same h/w device, code
should not make difference whether it is ARM or PPC.

Thanks,
Victor

[1] https://lists.cs.columbia.edu/pipermail/kvmarm/2014-January/008904.html

> thanks
> -- PMM

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-ppc] KVM and variable-endianness guest CPUs
  2014-01-22 10:52                 ` [Qemu-devel] " Alexander Graf
@ 2014-01-23  4:25                   ` Victor Kamensky
  -1 siblings, 0 replies; 102+ messages in thread
From: Victor Kamensky @ 2014-01-23  4:25 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Anup Patel, Thomas Falcon, kvm-devel, QEMU Developers, qemu-ppc,
	kvmarm, Christoffer Dall

Hi Alex,

Sorry for the delayed reply, I was focusing on the discussion
with Peter. Hope you and other folks may get something
out of it :).

Please see responses inline

On 22 January 2014 02:52, Alexander Graf <agraf@suse.de> wrote:
>
> On 22.01.2014, at 08:26, Victor Kamensky <victor.kamensky@linaro.org> wrote:
>
>> On 21 January 2014 22:41, Alexander Graf <agraf@suse.de> wrote:
>>>
>>>
>>> "Native endian" really is just a shortcut for "target endian"
>>> which is LE for ARM and BE for PPC. There shouldn't be
>>> a qemu-system-armeb or qemu-system-ppc64le.
>>
>> I disagree. Fully functional ARM BE system is what we've
>> been working on for last few months. 'We' is Linaro
>> Networking Group, Endian subteam and some other guys
>> in ARM and across community. Why we do that is a bit
>> beyond of this discussion.
>>
>> ARM BE patches for both V7 and V8 are already in mainline
>> kernel. But ARM BE KVM host is broken now. It is known
>> deficiency that I am trying to fix. Please look at [1]. Patches
>> for V7 BE KVM were proposed and currently under active
>> discussion. Currently I work on ARM V8 BE KVM changes.
>>
>> So "native endian" in ARM is value of CPSR register E bit.
>> If it is off native endian is LE, if it is on it is BE.
>>
>> Once and if we agree on ARM BE KVM host changes, the
>> next step would be patches in qemu one of which introduces
>> qemu-system-armeb. Please see [2].
>
> I think we're facing an ideology conflict here. Yes, there
> should be a qemu-system-arm that is BE capable.

Maybe it is not an ideology conflict but rather a terminology
clarity issue :). I am not sure what you mean by "qemu-system-arm
that is BE capable". In the qemu build system there is just the
target name 'arm', which is an ARM V7 cpu in LE mode, and the
'armeb' target, which is an ARM V7 cpu in BE mode. That is true
for a lot of open source packages. You could check the patch [1]
that introduces the armeb target into qemu. A build for the
arm target produces a qemu-system-arm executable that is
marked 'ELF 32-bit LSB executable' and can run on traditional
LE ARM Linux. A build for the armeb target produces a
qemu-system-armeb executable that is marked 'ELF 32-bit
MSB executable' and can run on BE ARM Linux. armeb is
nothing special here, just a build option for qemu that should
run on BE ARM Linux.

Both qemu-system-arm and qemu-system-armeb should
be BE/LE capable, i.e. either of them, along with KVM, could
run either an LE or a BE guest. MarcZ demonstrated that this
is possible. I've tested both LE and BE guests with
qemu-system-arm running on traditional LE ARM Linux,
effectively repeating Marc's setup but with qemu.
And with my patches I did test both BE and LE guests with
qemu-system-armeb running on BE ARM Linux.

> There
> should also be a qemu-system-ppc64 that is LE capable.
> But there is no point in changing the "default endiannes"
> for the virtual CPUs that we plug in there. Both CPUs are
> perfectly capable of running in LE or BE mode, the
> question is just what we declare the "default".

I am not sure, what you mean by "default"? Is it initial
setting of CPSR E bit and 'cp15 c1, c0, 0' EE bit? Yes,
the way it is currently implemented by committed
qemu-system-arm, and proposed qemu-system-armeb
patches, they are both off. I.e even qemu-system-armeb
starts running vcpu in LE mode, exactly by very similar
reason as desribed in your next paragraph
qemu-system-armeb has tiny bootloader that starts
in LE mode, jumps to kernel kernel switches cpu to
run in BE mode 'setend be' and EE bit is set just
before mmu is enabled.

> Think about the PPC bootstrap. We start off with a
> BE firmware, then boot into the Linux kernel which
> calls a hypercall to set the LE bit on every interrupt.

We have a very similar situation with BE ARM Linux.
When we run ARM BE Linux we start with a bootloader
which is LE, and then the CPU issues 'setend be' as
soon as it starts executing kernel code; all secondary
CPUs issue 'setend be' when they come out of the reset
pen or bootmonitor sleep.

> But there's no reason this little endian kernel
> couldn't theoretically have big endian user space running
> with access to emulated device registers.

I don't want to go there, it is very very messy ...

------ Just a side note: ------
Interestingly, half a year before I joined Linaro, a colleague
and I at Cisco implemented a kernel patch that allowed BE
user-space processes to run as a sort of separate personality
on top of an LE ARM kernel ... treated as a kind of multi-abi
system. Effectively we had to do byteswaps on all non-trivial
system calls and ioctls inside the kernel. We converted
around 30 system calls and around 10 ioctls. Our target process
was using just those and it was working, but the patch was
very intrusive and unnatural. I think there was some public
version of my presentation circulated in Linaro that
explained all this mess. I don't want to seriously consider it.

The only robust mixed mode, as MarcZ demonstrated,
can be done only on VM boundaries. I.e. an LE host can
run a BE guest fine, and a BE host can run an LE guest fine.
Everything else would be a huge mess. If we want to
weigh the pros and cons of different mixed modes we need
to start a separate thread.
------ End of side note ------------

> As Peter already pointed out, the actual breakage behind
> this is that we have a "default endianness" at all. But that's
> a very difficult thing to resolve and I don't think should be
> our primary goal. Just live with the fact that we declare
> ARM little endian in QEMU and swap things
> accordingly - then everyone's happy.

I disagree with Peter's point of view, as you saw from our
long thread :). I strongly believe that the current mmio.data[]
describes the data on the bus perfectly well as an array of
bytes: data[0] goes to phys_addr, data[1] goes to phys_addr + 1,
etc.

Please check "Differences between BE-32 and BE-8 buses"
section in [2]. In modern ARM CPU memory bus is byte invariant (BE-8).
As data lines bytes view concerns, it is the same between LE and
BE-8 that is why IMHO array of bytes view is very good choice.
PPC and MIPS CPUs memory buses are also byte invariant, they
always been that way. I don't think we care about BE-32. So
for all practical purposes, mmio structure is BE-8 bus emulation,
where data signals could be defined by array of bytes. If one
would try to define it as set of other bigger integers
one need to have endianness attribute associated with it. If
such attribute implied by default just through CPU type in order to
work with existing cases it should be different for different CPU
types, which means qemu running in the same endianity but
on different CPU types should acts differently if it emulates
the same device and that is bad IMHO. So I don't see any
value from departing from bytes array view of data on the bus.
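
A small C sketch of the contrast, assuming mmio.data[] =
{0x01, 0x02, 0x03, 0x04} (illustrative only):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const uint8_t data[4] = { 0x01, 0x02, 0x03, 0x04 };
    uint32_t v;

    /* Byte view: data[i] is the byte at phys_addr + i on any
     * host, no endianness attribute needed. */

    /* Integer view: the recovered value depends on the host. */
    memcpy(&v, data, sizeof v);
    /* v == 0x04030201 on an LE host but 0x01020304 on a BE
     * host, so an integer-typed field would need an explicit
     * endianness attribute to mean the same thing everywhere. */
    printf("0x%08x\n", (unsigned)v);
    return 0;
}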

> This really only ever becomes a problem if you have devices
> that have awareness of the CPUs endian mode. The only one
> on PPC that I'm aware of that falls into this category is virtio
> and there are patches pending to solve that. I don't know if there
> are any QEMU emulated devices outside of virtio with this
> issue on ARM, but you'll have to make the emulation code
> for those look at the CPU state then.

Agreed on native-endianness devices; I don't think we really
should have them on ARM, and I will take your word for it on
PPC. In any case such native-endian devices would be very bad
for the mixed-endianness case. Agreed that the virtio issues
must be addressed; when I tested mixed modes I had to bring
in the virtio patches.

Thanks,
Victor

[1] https://git.linaro.org/people/victor.kamensky/qemu-be.git/commitdiff/9cc68f682d7c25c6749f0137269de0164d666356?hp=bdc07868d30d3362a4ba0215044a185ff7a80bf4

[2] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0290g/ch06s05s01.html

>>
>>> QEMU emulates everything that comes after the CPU, so
>>> imagine the ioctl struct as a bus package. Your bus
>>> doesn't care what endianness the CPU is in - it just
>>> gets data from the CPU.
>>
>> I am not sure that I follow above. Suppose I have
>>
>> move r1, #1
>> str r1, [r0]
>>
>> where r0 is device address. Now depending on CPSR
>> E bit value device address will receive 1 as integer either
>> in LE order or in BE order. That is how ARM v7 CPU
>> works, regardless whether it is emulated or not.
>>
>> So if E bit is off (LE case) after str is executed
>> byte at r0 address will get 1
>> byte at r0 + 1 address will get 0
>> byte at r0 + 2 address will get 0
>> byte at r0 + 3 address will get 0
>>
>> If E bit is on (BE case) after str is executed
>> byte at r0 address will get 0
>> byte at r0 + 1 address will get 0
>> byte at r0 + 2 address will get 0
>> byte at r0 + 3 address will get 1
>>
>> my point that mmio.data[] just carries bytes for phys_addr
>> mmio.data[0] would be value for byte at phys_addr,
>> mmio.data[1] would be value for byte at phys_addr + 1, and
>> so on.
>
> What we get is an instruction that traps because it wants to "write r1 (which has value=1) into address x". So at that point we get the register value.
>
> Then we need to take a look at the E bit to see whether the write was supposed to be in non-host endianness because we need to emulate exactly the LE/BE difference you're indicating above. The way we implement this on PPC is that we simply byte swap the register value when guest_endian != host_endian.
>
> With this in place, QEMU can just memcpy() the value into a local register and feed it into its emulation code which expects a "register value as if the CPU was running in native endianness" as parameter - with "native" meaning "little endian" for qemu-system-arm. Device emulation code doesn't know what to do with a byte array.
>
> Take a look at QEMU's MMIO handler:
>
>         case KVM_EXIT_MMIO:
>             DPRINTF("handle_mmio\n");
>             cpu_physical_memory_rw(run->mmio.phys_addr,
>                                    run->mmio.data,
>                                    run->mmio.len,
>                                    run->mmio.is_write);
>             ret = 0;
>             break;
>
> which translates to
>
>                 switch (l) {
>                 case 8:
>                     /* 64 bit write access */
>                     val = ldq_p(buf);
>                     error |= io_mem_write(mr, addr1, val, 8);
>                     break;
>                 case 4:
>                     /* 32 bit write access */
>                     val = ldl_p(buf);
>                     error |= io_mem_write(mr, addr1, val, 4);
>                     break;
>                 case 2:
>                     /* 16 bit write access */
>                     val = lduw_p(buf);
>                     error |= io_mem_write(mr, addr1, val, 2);
>                     break;
>                 case 1:
>                     /* 8 bit write access */
>                     val = ldub_p(buf);
>                     error |= io_mem_write(mr, addr1, val, 1);
>                     break;
>                 default:
>                     abort();
>                 }
>
> which calls the ldx_p primitives
>
> #if defined(TARGET_WORDS_BIGENDIAN)
> #define lduw_p(p) lduw_be_p(p)
> #define ldsw_p(p) ldsw_be_p(p)
> #define ldl_p(p) ldl_be_p(p)
> #define ldq_p(p) ldq_be_p(p)
> #define ldfl_p(p) ldfl_be_p(p)
> #define ldfq_p(p) ldfq_be_p(p)
> #define stw_p(p, v) stw_be_p(p, v)
> #define stl_p(p, v) stl_be_p(p, v)
> #define stq_p(p, v) stq_be_p(p, v)
> #define stfl_p(p, v) stfl_be_p(p, v)
> #define stfq_p(p, v) stfq_be_p(p, v)
> #else
> #define lduw_p(p) lduw_le_p(p)
> #define ldsw_p(p) ldsw_le_p(p)
> #define ldl_p(p) ldl_le_p(p)
> #define ldq_p(p) ldq_le_p(p)
> #define ldfl_p(p) ldfl_le_p(p)
> #define ldfq_p(p) ldfq_le_p(p)
> #define stw_p(p, v) stw_le_p(p, v)
> #define stl_p(p, v) stl_le_p(p, v)
> #define stq_p(p, v) stq_le_p(p, v)
> #define stfl_p(p, v) stfl_le_p(p, v)
> #define stfq_p(p, v) stfq_le_p(p, v)
> #endif
>
> and then passes the result as "originating register access" to the device emulation part of QEMU.
>
>
> Maybe it becomes more clear if you understand the code flow that TCG is going through. With TCG whenever a write traps into MMIO we go through these functions
>
> void
> glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
>                                          DATA_TYPE val, int mmu_idx)
> {
>     helper_te_st_name(env, addr, val, mmu_idx, GETRA());
> }
>
> #ifdef TARGET_WORDS_BIGENDIAN
> # define TGT_BE(X)  (X)
> # define TGT_LE(X)  BSWAP(X)
> #else
> # define TGT_BE(X)  BSWAP(X)
> # define TGT_LE(X)  (X)
> #endif
>
> void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>                        int mmu_idx, uintptr_t retaddr)
> {
> [...]
>     /* Handle an IO access.  */
>     if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
>         hwaddr ioaddr;
>         if ((addr & (DATA_SIZE - 1)) != 0) {
>             goto do_unaligned_access;
>         }
>         ioaddr = env->iotlb[mmu_idx][index];
>
>         /* ??? Note that the io helpers always read data in the target
>            byte ordering.  We should push the LE/BE request down into io.  */
>         val = TGT_LE(val);
>         glue(io_write, SUFFIX)(env, ioaddr, val, addr, retaddr);
>         return;
>     }
>     [...]
> }
>
> static inline void glue(io_write, SUFFIX)(CPUArchState *env,
>                                           hwaddr physaddr,
>                                           DATA_TYPE val,
>                                           target_ulong addr,
>                                           uintptr_t retaddr)
> {
>     MemoryRegion *mr = iotlb_to_region(physaddr);
>
>     physaddr = (physaddr & TARGET_PAGE_MASK) + addr;
>     if (mr != &io_mem_rom && mr != &io_mem_notdirty && !can_do_io(env)) {
>         cpu_io_recompile(env, retaddr);
>     }
>
>     env->mem_io_vaddr = addr;
>     env->mem_io_pc = retaddr;
>     io_mem_write(mr, physaddr, val, 1 << SHIFT);
> }
>
> which at the end of the chain means if you're running an same endianness on guest and host, you get the original register value as function parameter. If you run different endianness you get a swapped value as function parameter.
>
> So at the end of all of this, if you're running qemu-system-arm (TCG) on a BE host the request into the io callback function will come in as register, then stay all the way it is until it reaches the IO callback function. Unless you define a specific endianness for your device in which case the callback may swizzle it again. But if your device defines DEVICE_LITTLE_ENDIAN or DEVICE_NATIVE_ENDIAN, it won't swizzle it.
>
> What happens when you switch your guest to BE mode (or LE for PPC)? Very simple. The TCG frontend swizzles every memory read and write before it hits TCG's memory operations.
>
> If you're running qemu-system-arm (KVM) on a BE host the request will come into kvm-all.c, get read with swapped endianness (ldq_p) and then passed into that way into the IO callback function. That's where the bug lies. It should behave the same way as TCG, so it needs to know the value the register originally had. So instead of doing an ldq_p() it should go through a different path that does memcpy().
>
> But that doesn't fix the other-endian issue yet, right? Every value now would come in as the register value.
>
> Well, unless you do the same thing TCG does inside the kernel. So the kernel would swap the reads and writes before it accesses the ioctl struct that connects kvm with QEMU. Then all abstraction layers work just fine again and we don't need any qemu-system-armeb.
>
>
> Alex
>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: KVM and variable-endianness guest CPUs
  2014-01-23  0:22                         ` [Qemu-devel] " Victor Kamensky
@ 2014-01-23 10:23                           ` Peter Maydell
  -1 siblings, 0 replies; 102+ messages in thread
From: Peter Maydell @ 2014-01-23 10:23 UTC (permalink / raw)
  To: Victor Kamensky
  Cc: Christoffer Dall, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm

On 23 January 2014 00:22, Victor Kamensky <victor.kamensky@linaro.org> wrote:
> Peter, could I please ask you a favor. Could you please
> stop deleting pieces of your and my previous responses
> when you reply.

No, sorry. It produces excessively long and totally unreadable
emails for everybody else if people don't trim for context.
This is standard mailing list practice.

>>> Consider above big endian case (setend be) example,
>>> but now running in BE KVM host. 0x4 is LSB of CPU
>>> core register in this case.
>>
>> Yes. In this case if we are using the "mmio.data is host
>> kernel endianness" definition then mmio.data[0] should be
>> 0x01 (the MSB of the 32 bit data value).
>
> If mmio.data[0] is 0x1, mmio.data[] = {0x1, 0x2, 0x3, 0x4},
> and now KVM host and emulator running in BE mode.
> But that contradicts to what you said before.

Sorry, I misread the example here (and assumed we were
writing the same word in both cases, when actually the BE
code example is writing a different value). mmio.data[0] should
be 0x4, because:
 * BE ARM guest, so KVM must byte-swap the register value
    (giving 0x04030201)
 * BE host, so it writes the uint32_t in host order (giving
   0x4 in mmio.data[0])
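
A minimal C sketch of those two steps (illustrative only, not
the actual KVM code; fill_mmio_data is a made-up helper):

#include <stdint.h>
#include <string.h>

/* reg is the 32-bit value as the guest sees it in its register;
 * cpsr_e says whether the guest had CPSR.E set. */
static void fill_mmio_data(uint8_t data[4], uint32_t reg, int cpsr_e)
{
    if (cpsr_e) {
        /* emulate the CPU's byte-lane swap for a BE guest access */
        reg = __builtin_bswap32(reg);
    }
    /* store the uint32_t in host order; on a BE host, for the
     * BE-guest value 0x01020304 this puts 0x04 in data[0] */
    memcpy(data, &reg, sizeof(reg));
}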

>>> I believe, but I need to check, that PPC BE setup actually
>>> acts as the second case in above example  If we have PPC
>>> BE guest executing the following instructions:
>>>
>>> lis     r1,0x102
>>> ori     r1,r1,0x304
>>> stw    r1,0(r0)
>>>
>>> after first two instructions r1 would contain 0x01020304.
>>> IMHO It exactly corresponds to above my ARM second case -
>>> BE guest when it runs under ARM BE KVM host. I believe
>>> that mmio.data[] in PPC BE case would be {0x1, 0x2, 0x3, 0x4}.
>>
>> Yes, assuming a BE PPC host kernel (which is the usual
>> arrangement).
>
> OK, that confirms my understanding how PPC mmio
> should work.
>
>>> But according to you data[0] must be 0x4 in BE host case
>>
>> Er, no. The data here is 0x01020304, so for a BE host
>> data[0] is the big end, ie 0x1. It would only be 0x4 if
>> mmio.data[] were LE always (or if you were running
>> your BE PPC guest on an LE PPC host, which I don't
>> think is supported currently).
>
> So do you agree that for all three code snippets cited in this
> email, we always will have mmio.data[] = {0x1, 0x2,
> 0x3, 0x4}, for ARM LE qemu/host, for ARM BE qemu/host
> and for ppc code snippet in PPC BE qemu/host.

No. Also your ARM and PPC examples are not usefully
comparable, because:

> setend le
> mov r1, #0x04030201
> str r1, [r0]

This is an LE guest writing 0x04030201, and that is the
value that will go out on the bus.

> and
>
> setend be
> mov r1, #0x01020304
> str r1, [r0]

This is a BE guest writing 0x01020304, as far as the
code running on the CPU is concerned; the value on the
bus will be byteswapped.

> lis     r1,0x102
> ori     r1,r1,0x304
> stw    r1,0(r0)

This is also a BE guest writing 0x01020304. I'm pretty
sure that the PPC approach is that for BE guests writing
a word that word goes out to the bus as is; for LE guests
(or if the page table is set up to say "this page is LE") the
CPU swaps it before putting it on the bus. In this regard
it is the opposite way round to ARM.

So the value you start with in the CPU register is not
the same in all three cases, and what the hardware
does is not the same either.
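
Side by side, as a small illustrative sketch of which 32-bit
value leaves the CPU for the bus in each case:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* ARM, CPSR.E clear: the register value goes out unswapped. */
    uint32_t arm_le = 0x04030201;
    /* ARM, CPSR.E set: the CPU byte-lane-swaps on the way out. */
    uint32_t arm_be = __builtin_bswap32(0x01020304);
    /* PPC BE: the stored word goes out to the bus as is. */
    uint32_t ppc_be = 0x01020304;

    printf("ARM LE: %08x  ARM BE: %08x  PPC BE: %08x\n",
           arm_le, arm_be, ppc_be);
    return 0;
}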

thanks
-- PMM

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-ppc] KVM and variable-endianness guest CPUs
  2014-01-23  4:25                   ` [Qemu-devel] " Victor Kamensky
@ 2014-01-23 10:32                     ` Alexander Graf
  -1 siblings, 0 replies; 102+ messages in thread
From: Alexander Graf @ 2014-01-23 10:32 UTC (permalink / raw)
  To: Victor Kamensky
  Cc: Thomas Falcon, kvm-devel, Anup Patel, QEMU Developers, qemu-ppc,
	kvmarm, Christoffer Dall


On 23.01.2014, at 05:25, Victor Kamensky <victor.kamensky@linaro.org> wrote:

> Hi Alex,
> 
> Sorry, for delayed reply, I was focusing on discussion
> with Peter. Hope you and other folks may get something
> out of it :).
> 
> Please see responses inline
> 
> On 22 January 2014 02:52, Alexander Graf <agraf@suse.de> wrote:
>> 
>> On 22.01.2014, at 08:26, Victor Kamensky <victor.kamensky@linaro.org> wrote:
>> 
>>> On 21 January 2014 22:41, Alexander Graf <agraf@suse.de> wrote:
>>>> 
>>>> 
>>>> "Native endian" really is just a shortcut for "target endian"
>>>> which is LE for ARM and BE for PPC. There shouldn't be
>>>> a qemu-system-armeb or qemu-system-ppc64le.
>>> 
>>> I disagree. Fully functional ARM BE system is what we've
>>> been working on for last few months. 'We' is Linaro
>>> Networking Group, Endian subteam and some other guys
>>> in ARM and across community. Why we do that is a bit
>>> beyond of this discussion.
>>> 
>>> ARM BE patches for both V7 and V8 are already in mainline
>>> kernel. But ARM BE KVM host is broken now. It is known
>>> deficiency that I am trying to fix. Please look at [1]. Patches
>>> for V7 BE KVM were proposed and currently under active
>>> discussion. Currently I work on ARM V8 BE KVM changes.
>>> 
>>> So "native endian" in ARM is value of CPSR register E bit.
>>> If it is off native endian is LE, if it is on it is BE.
>>> 
>>> Once and if we agree on ARM BE KVM host changes, the
>>> next step would be patches in qemu one of which introduces
>>> qemu-system-armeb. Please see [2].
>> 
>> I think we're facing an ideology conflict here. Yes, there
>> should be a qemu-system-arm that is BE capable.
> 
> Maybe it is not ideology conflict but rather terminology clarity
> issue :). I am not sure what do you mean by "qemu-system-arm
> that is BE capable". In qemu build system there is just target
> name 'arm', which is ARM V7 cpu in LE mode, and 'armeb'
> target which is ARM V7 cpu in BE mode. That is true for a lot
> of open source packages. You could check [1] patch that
> introduces armeb target into qemu. Build for
> arm target produces qemu-system-arm executable that is
> marked 'ELF 32-bit LSB executable' and it could run on LE
> traditional ARM Linux. Build for armeb target produces
> qemu-system-armeb executable that is marked 'ELF 32-bit
> MSB executable' that can run on BE ARM Linux. armeb is
> nothing special here, just build option for qemu that should run
> on BE ARM Linux.

But why should it be called armeb then? What actual difference does the model have compared to the qemu-system-arm model?

> 
> Both qemu-system-arm and qemu-system-armeb should
> be BE/LE capable. I.e either of them along with KVM could
> either run LE or BE guest. MarcZ demonstrated that this
> is possible. I've tested both LE and BE guests with
> qemu-system-arm running on traditional LE ARM Linux,
> effectively repeating Marc's setup but with qemu.
> And I did test with my patches both BE and LE guests with
> qemu-system-armeb running on BE ARM Linux.
> 
>> There
>> should also be a qemu-system-ppc64 that is LE capable.
>> But there is no point in changing the "default endiannes"
>> for the virtual CPUs that we plug in there. Both CPUs are
>> perfectly capable of running in LE or BE mode, the
>> question is just what we declare the "default".
> 
> I am not sure, what you mean by "default"? Is it initial
> setting of CPSR E bit and 'cp15 c1, c0, 0' EE bit? Yes,
> the way it is currently implemented by committed
> qemu-system-arm, and proposed qemu-system-armeb
> patches, they are both off. I.e even qemu-system-armeb
> starts running vcpu in LE mode, exactly by very similar
> reason as desribed in your next paragraph
> qemu-system-armeb has tiny bootloader that starts
> in LE mode, jumps to kernel kernel switches cpu to
> run in BE mode 'setend be' and EE bit is set just
> before mmu is enabled.

You're proving my point even more. If both targets are LE/BE capable and both targets start execution in LE mode, then why do we need a qemu-system-armeb at all? Just use qemu-system-arm.
> 

>> Think about the PPC bootstrap. We start off with a
>> BE firmware, then boot into the Linux kernel which
>> calls a hypercall to set the LE bit on every interrupt.
> 
> We have very similar situation with BE ARM Linux.
> When we run ARM BE Linux we start with bootloader
> which is LE and then CPU issues 'setend be' very
> soon as it starts executing kernel code, all secondary
> CPUs issue 'setend be' when they go out of reset pen
> or bootmonitor sleep.
> 
>> But there's no reason this little endian kernel
>> couldn't theoretically have big endian user space running
>> with access to emulated device registers.
> 
> I don't want to go there, it is very very messy ...
> 
> ------ Just a side note: ------
> Interestingly, half a year before I joined Linaro in Cisco I and
> my colleague implemented kernel patch that allowed to run
> BE user-space processes as sort of separate personality on
> top of LE ARM kernel ... treated kind of multi-abi system.
> Effectively we had to do byteswaps on all non-trivial
> system calls and ioctls in side of the kernel. We converted
> around 30 system calls and around 10 ioctls. Our target process
> was just using those and it works working, but patch was
> very intrusive and unnatural. I think in Linaro there was
> some public version of my presentation circulated that
> explained all this mess. I don't want seriously to consider it.
> 
> The only robust mixed mode, as MarcZ demonstrated,
> could be done only on VM boundaries. I.e LE host can
> run BE guest fine. And BE host can run LE guest fine.
> Everything else would be a huge mess. If we want to
> start pro and cons of different mixed modes we need to
> start separate thread.
> ------ End of side note ------------

Just because we don't do it on Linux doesn't mean some random guest can't do it. What if my RTOS of choice decides it wants to run half of its user space in little and the other half in big endian? What if my guest is actually an AMP system with an LE and a BE OS running side by side?

We shouldn't design virtualization just for the single use case we have in mind.

> 
>> As Peter already pointed out, the actual breakage behind
>> this is that we have a "default endianness" at all. But that's
>> a very difficult thing to resolve and I don't think should be
>> our primary goal. Just live with the fact that we declare
>> ARM little endian in QEMU and swap things
>> accordingly - then everyone's happy.
> 
> I disagree with Peter's point of view as you saw from our
> long thread :). I strongly believe that current mmio.data[]
> describes data on the bus perfectly fine with array of bytes.
> data[0] goes into phys_addr, data[1] goes into phys_addr + 1,
> etc.

mmio.data[] is really just a transport between KVM and QEMU (or kvmtool if you can't work on ARM instruction set simulators). There is no point in overengineering anything here. We should do what's the most natural fit for everything.

> Please check "Differences between BE-32 and BE-8 buses"
> section in [2]. In modern ARM CPU memory bus is byte invariant (BE-8).
> As data lines bytes view concerns, it is the same between LE and
> BE-8 that is why IMHO array of bytes view is very good choice.
> PPC and MIPS CPUs memory buses are also byte invariant, they
> always been that way. I don't think we care about BE-32. So
> for all practical purposes, mmio structure is BE-8 bus emulation,
> where data signals could be defined by array of bytes. If one
> would try to define it as set of other bigger integers
> one need to have endianness attribute associated with it. If
> such attribute implied by default just through CPU type in order to
> work with existing cases it should be different for different CPU
> types, which means qemu running in the same endianity but
> on different CPU types should acts differently if it emulates
> the same device and that is bad IMHO. So I don't see any
> value from departing from bytes array view of data on the bus.

The bus comes after QEMU is involved. From a semantic perspective, the KVM ioctl interface sits between the core and the bus. KVM implements the core and some bits of the bus (to emulate in-kernel devices) and QEMU implements the actual bus topology.

What happens on the way from core -> device is bus specific, so QEMU should take care of this. The way the QEMU internal bus representation gets MMIO data from the CPU is by an ( address, value, len ) tuple. So that's the interface we should design for. And that means we need to transfer full "values", not arrays of data.

The fact that we have a data array is really just because it was easy to write and access. In reality and for all intents and purposes this is a union of u8, u16, u32, u64 that we can't change to be a union anymore because we need to stay backwards compatible. And as with any normal kernel interface, you design it with the endianness of the kernel ABI.
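
As a sketch of that union reading (hypothetical types, not the actual uapi definition):

#include <stdint.h>
#include <string.h>

/* What mmio.data[] is "for all intents and purposes": a union
 * accessed in the endianness of the kernel ABI. */
union mmio_value {
    uint8_t  u8;
    uint16_t u16;
    uint32_t u32;
    uint64_t u64;
    uint8_t  bytes[8];
};

/* Recover the "register value" for a given access length, in
 * host (kernel ABI) endianness. */
static uint64_t mmio_to_value(const uint8_t *data, uint32_t len)
{
    union mmio_value v = { .u64 = 0 };

    memcpy(v.bytes, data, len);
    switch (len) {
    case 1: return v.u8;
    case 2: return v.u16;
    case 4: return v.u32;
    case 8: return v.u64;
    default: return 0;
    }
}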


Alex

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
@ 2014-01-23 10:32                     ` Alexander Graf
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Graf @ 2014-01-23 10:32 UTC (permalink / raw)
  To: Victor Kamensky
  Cc: Thomas Falcon, kvm-devel, Anup Patel, QEMU Developers, qemu-ppc,
	kvmarm, Christoffer Dall


On 23.01.2014, at 05:25, Victor Kamensky <victor.kamensky@linaro.org> wrote:

> Hi Alex,
> 
> Sorry, for delayed reply, I was focusing on discussion
> with Peter. Hope you and other folks may get something
> out of it :).
> 
> Please see responses inline
> 
> On 22 January 2014 02:52, Alexander Graf <agraf@suse.de> wrote:
>> 
>> On 22.01.2014, at 08:26, Victor Kamensky <victor.kamensky@linaro.org> wrote:
>> 
>>> On 21 January 2014 22:41, Alexander Graf <agraf@suse.de> wrote:
>>>> 
>>>> 
>>>> "Native endian" really is just a shortcut for "target endian"
>>>> which is LE for ARM and BE for PPC. There shouldn't be
>>>> a qemu-system-armeb or qemu-system-ppc64le.
>>> 
>>> I disagree. Fully functional ARM BE system is what we've
>>> been working on for last few months. 'We' is Linaro
>>> Networking Group, Endian subteam and some other guys
>>> in ARM and across community. Why we do that is a bit
>>> beyond of this discussion.
>>> 
>>> ARM BE patches for both V7 and V8 are already in mainline
>>> kernel. But ARM BE KVM host is broken now. It is known
>>> deficiency that I am trying to fix. Please look at [1]. Patches
>>> for V7 BE KVM were proposed and currently under active
>>> discussion. Currently I work on ARM V8 BE KVM changes.
>>> 
>>> So "native endian" in ARM is value of CPSR register E bit.
>>> If it is off native endian is LE, if it is on it is BE.
>>> 
>>> Once and if we agree on ARM BE KVM host changes, the
>>> next step would be patches in qemu one of which introduces
>>> qemu-system-armeb. Please see [2].
>> 
>> I think we're facing an ideology conflict here. Yes, there
>> should be a qemu-system-arm that is BE capable.
> 
> Maybe it is not ideology conflict but rather terminology clarity
> issue :). I am not sure what do you mean by "qemu-system-arm
> that is BE capable". In qemu build system there is just target
> name 'arm', which is ARM V7 cpu in LE mode, and 'armeb'
> target which is ARM V7 cpu in BE mode. That is true for a lot
> of open source packages. You could check [1] patch that
> introduces armeb target into qemu. Build for
> arm target produces qemu-system-arm executable that is
> marked 'ELF 32-bit LSB executable' and it could run on LE
> traditional ARM Linux. Build for armeb target produces
> qemu-system-armeb executable that is marked 'ELF 32-bit
> MSB executable' that can run on BE ARM Linux. armeb is
> nothing special here, just build option for qemu that should run
> on BE ARM Linux.

But why should it be called armbe then? What actual difference does the model have compared to the qemu-system-arm model?

> 
> Both qemu-system-arm and qemu-system-armeb should
> be BE/LE capable. I.e either of them along with KVM could
> either run LE or BE guest. MarcZ demonstrated that this
> is possible. I've tested both LE and BE guests with
> qemu-system-arm running on traditional LE ARM Linux,
> effectively repeating Marc's setup but with qemu.
> And I did test with my patches both BE and LE guests with
> qemu-system-armeb running on BE ARM Linux.
> 
>> There
>> should also be a qemu-system-ppc64 that is LE capable.
>> But there is no point in changing the "default endiannes"
>> for the virtual CPUs that we plug in there. Both CPUs are
>> perfectly capable of running in LE or BE mode, the
>> question is just what we declare the "default".
> 
> I am not sure, what you mean by "default"? Is it initial
> setting of CPSR E bit and 'cp15 c1, c0, 0' EE bit? Yes,
> the way it is currently implemented by committed
> qemu-system-arm, and proposed qemu-system-armeb
> patches, they are both off. I.e even qemu-system-armeb
> starts running vcpu in LE mode, exactly by very similar
> reason as desribed in your next paragraph
> qemu-system-armeb has tiny bootloader that starts
> in LE mode, jumps to kernel kernel switches cpu to
> run in BE mode 'setend be' and EE bit is set just
> before mmu is enabled.

You're proving my point even more. If both targets are LE/BE capable and both targets start execution in LE mode, then why do we need a qemu-system-armbe at all? Just use qemu-system-arm.
> 

>> Think about the PPC bootstrap. We start off with a
>> BE firmware, then boot into the Linux kernel which
>> calls a hypercall to set the LE bit on every interrupt.
> 
> We have very similar situation with BE ARM Linux.
> When we run ARM BE Linux we start with bootloader
> which is LE and then CPU issues 'setend be' very
> soon as it starts executing kernel code, all secondary
> CPUs issue 'setend be' when they go out of reset pen
> or bootmonitor sleep.
> 
>> But there's no reason this little endian kernel
>> couldn't theoretically have big endian user space running
>> with access to emulated device registers.
> 
> I don't want to go there, it is very very messy ...
> 
> ------ Just a side note: ------
> Interestingly, half a year before I joined Linaro in Cisco I and
> my colleague implemented kernel patch that allowed to run
> BE user-space processes as sort of separate personality on
> top of LE ARM kernel ... treated kind of multi-abi system.
> Effectively we had to do byteswaps on all non-trivial
> system calls and ioctls in side of the kernel. We converted
> around 30 system calls and around 10 ioctls. Our target process
> was just using those and it works working, but patch was
> very intrusive and unnatural. I think in Linaro there was
> some public version of my presentation circulated that
> explained all this mess. I don't want seriously to consider it.
> 
> The only robust mixed mode, as MarcZ demonstrated,
> could be done only on VM boundaries. I.e LE host can
> run BE guest fine. And BE host can run LE guest fine.
> Everything else would be a huge mess. If we want to
> start pro and cons of different mixed modes we need to
> start separate thread.
> ------ End of side note ------------

Just because we don't do it on Linux doesn't mean some random guest can't do it. What if my RTOS of choice decides it wants to run half of its user space in little and the other half in big endian? What if my guest is actually an AMP system with an LE and a BE OS running side by side?

We shouldn't design virtualization just for the single use case we have in mind.

> 
>> As Peter already pointed out, the actual breakage behind
>> this is that we have a "default endianness" at all. But that's
>> a very difficult thing to resolve and I don't think should be
>> our primary goal. Just live with the fact that we declare
>> ARM little endian in QEMU and swap things
>> accordingly - then everyone's happy.
> 
> I disagree with Peter's point of view as you saw from our
> long thread :). I strongly believe that current mmio.data[]
> describes data on the bus perfectly fine with array of bytes.
> data[0] goes into phys_addr, data[1] goes into phys_addr + 1,
> etc.

mmio.data[] is really just a transport between KVM and QEMU (or kvmtool if you can't work on ARM instruction set simulators). There is no point in overengineering anything here. We should do what's the most natural fit for everything.

> Please check "Differences between BE-32 and BE-8 buses"
> section in [2]. In modern ARM CPU memory bus is byte invariant (BE-8).
> As data lines bytes view concerns, it is the same between LE and
> BE-8 that is why IMHO array of bytes view is very good choice.
> PPC and MIPS CPUs memory buses are also byte invariant, they
> always been that way. I don't think we care about BE-32. So
> for all practical purposes, mmio structure is BE-8 bus emulation,
> where data signals could be defined by array of bytes. If one
> would try to define it as set of other bigger integers
> one need to have endianness attribute associated with it. If
> such attribute implied by default just through CPU type in order to
> work with existing cases it should be different for different CPU
> types, which means qemu running in the same endianity but
> on different CPU types should acts differently if it emulates
> the same device and that is bad IMHO. So I don't see any
> value from departing from bytes array view of data on the bus.

The bus comes after QEMU is involved. From a semantic perspective, the KVM ioctl interface sits between the core and the bus. KVM implements the core and some bits of the bus (to emulate in-kernel devices) and QEMU implements the actual bus topology.

What happens on the way from core -> device is bus specific, so QEMU should take care of this. The way the QEMU internal bus representation gets MMIO data from the CPU is by an ( address, value, len ) tuple. So that's the interface we should design for. And that means we need to transfer full "values", not arrays of data.

The fact that we have a data array is really just because it was easy to write and access. In reality and for all intents and purposes this is a union of u8, u16, u32, u64 that we can't change to be a union anymore because we need to stay backwards compatible. And as with any normal kernel interface you design that one with the endianness of the kernel ABI.


Alex

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-ppc] KVM and variable-endianness guest CPUs
  2014-01-23  4:25                   ` [Qemu-devel] " Victor Kamensky
@ 2014-01-23 10:56                     ` Greg Kurz
  -1 siblings, 0 replies; 102+ messages in thread
From: Greg Kurz @ 2014-01-23 10:56 UTC (permalink / raw)
  To: Victor Kamensky
  Cc: Alexander Graf, Thomas Falcon, kvm-devel, Anup Patel,
	QEMU Developers, qemu-ppc, kvmarm, Christoffer Dall

On Wed, 22 Jan 2014 20:25:05 -0800
Victor Kamensky <victor.kamensky@linaro.org> wrote:

> Hi Alex,
> 
> Sorry for the delayed reply; I was focusing on the discussion
> with Peter. I hope you and other folks may get something
> out of it :).
> 
> Please see responses inline
> 
> On 22 January 2014 02:52, Alexander Graf <agraf@suse.de> wrote:
> >
> > On 22.01.2014, at 08:26, Victor Kamensky <victor.kamensky@linaro.org>
> > wrote:
> >
> >> On 21 January 2014 22:41, Alexander Graf <agraf@suse.de> wrote:
> >>>
> >>>
> >>> "Native endian" really is just a shortcut for "target endian"
> >>> which is LE for ARM and BE for PPC. There shouldn't be
> >>> a qemu-system-armeb or qemu-system-ppc64le.
> >>
> >> I disagree. A fully functional ARM BE system is what we've
> >> been working on for the last few months. 'We' is the Linaro
> >> Networking Group, Endian subteam and some other guys
> >> in ARM and across the community. Why we are doing that is a
> >> bit beyond this discussion.
> >>
> >> ARM BE patches for both V7 and V8 are already in the mainline
> >> kernel. But the ARM BE KVM host is broken now. It is a known
> >> deficiency that I am trying to fix. Please look at [1]. Patches
> >> for V7 BE KVM were proposed and are currently under active
> >> discussion. Currently I am working on the ARM V8 BE KVM changes.
> >>
> >> So "native endian" in ARM is value of CPSR register E bit.
> >> If it is off native endian is LE, if it is on it is BE.
> >>
> >> Once and if we agree on the ARM BE KVM host changes, the
> >> next step would be patches in qemu, one of which introduces
> >> qemu-system-armeb. Please see [2].
> >
> > I think we're facing an ideology conflict here. Yes, there
> > should be a qemu-system-arm that is BE capable.
> 
> Maybe it is not an ideology conflict but rather a terminology
> clarity issue :). I am not sure what you mean by "qemu-system-arm
> that is BE capable". In the qemu build system there is just the
> target name 'arm', which is an ARM V7 cpu in LE mode, and the
> 'armeb' target, which is an ARM V7 cpu in BE mode. That is true
> for a lot of open source packages. You could check the [1] patch
> that introduces the armeb target into qemu. A build for the arm
> target produces a qemu-system-arm executable that is marked
> 'ELF 32-bit LSB executable' and can run on traditional LE ARM
> Linux. A build for the armeb target produces a qemu-system-armeb
> executable that is marked 'ELF 32-bit MSB executable' and can run
> on BE ARM Linux. armeb is nothing special here, just a build
> option for a qemu that should run on BE ARM Linux.
> 

Hmmm... it looks like there is some confusion about the qemu command naming.
The <target> suffix in qemu-system-<target> has nothing to do with the ELF
information of the command itself.

[greg@bahia ~]$ file `which qemu-system-arm`
/bin/qemu-system-arm: ELF 64-bit LSB shared object, x86-64, version 1
(SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32,
BuildID[sha1]=0xbcb974847daa8159c17ed74906cd5351387d4097, stripped

It is valid to create a new target if it is substantially different from
existing ones (ppc64 versus ppc for example). This is not the case with ARM,
since it is the very same CPU that can switch endianness with the 'setend'
instruction (which needs to be emulated anyway when running in TCG mode).

qemu-system-arm is THE command that should be able to emulate an ARM cpu,
whether the guest does 'setend le' or 'setend be'.
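
To illustrate (a sketch with hypothetical parameters, not actual KVM
code): the byteswap happens per access, exactly when guest and host
endianness disagree, so nothing in it depends on which qemu-system-*
binary sits at the other end of the ioctl:

#include <linux/swab.h>           /* swab32() */
#include <linux/types.h>

/* Sketch only: 'guest_is_be' stands in for the vcpu's current
 * CPSR.E (or SCTLR.EE) state, 'host_is_be' for the kernel's
 * build-time endianness. */
static u32 value_as_seen_by_emulator(u32 reg_val, bool guest_is_be,
                                     bool host_is_be)
{
    /* Swap exactly when guest and host disagree -- the rule the
     * quoted text below states for PPC. */
    return (guest_is_be != host_is_be) ? swab32(reg_val) : reg_val;
}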

> Both qemu-system-arm and qemu-system-armeb should
> be BE/LE capable. I.e. either of them, along with KVM, could
> run either an LE or a BE guest. MarcZ demonstrated that this
> is possible. I've tested both LE and BE guests with
> qemu-system-arm running on traditional LE ARM Linux,
> effectively repeating Marc's setup but with qemu.
> And with my patches I tested both BE and LE guests with
> qemu-system-armeb running on BE ARM Linux.
> 
> > There
> > should also be a qemu-system-ppc64 that is LE capable.
> > But there is no point in changing the "default endianness"
> > for the virtual CPUs that we plug in there. Both CPUs are
> > perfectly capable of running in LE or BE mode, the
> > question is just what we declare the "default".
> 
> I am not sure what you mean by "default". Is it the initial
> setting of the CPSR E bit and the 'cp15 c1, c0, 0' EE bit? Yes,
> the way it is currently implemented by the committed
> qemu-system-arm and the proposed qemu-system-armeb
> patches, they are both off. I.e. even qemu-system-armeb
> starts running the vcpu in LE mode, for a reason very similar
> to the one described in your next paragraph:
> qemu-system-armeb has a tiny bootloader that starts
> in LE mode and jumps to the kernel; the kernel switches the
> cpu to BE mode with 'setend be', and the EE bit is set just
> before the mmu is enabled.
> 
> > Think about the PPC bootstrap. We start off with a
> > BE firmware, then boot into the Linux kernel which
> > calls a hypercall to set the LE bit on every interrupt.
> 
> We have a very similar situation with BE ARM Linux.
> When we run ARM BE Linux we start with a bootloader
> which is LE, and then the CPU issues 'setend be' as
> soon as it starts executing kernel code; all secondary
> CPUs issue 'setend be' when they come out of the reset pen
> or bootmonitor sleep.
> 
> > But there's no reason this little endian kernel
> > couldn't theoretically have big endian user space running
> > with access to emulated device registers.
> 
> I don't want to go there, it is very very messy ...
> 
> ------ Just a side note: ------
> Interestingly, half a year before I joined Linaro in Cisco I and
> my colleague implemented kernel patch that allowed to run
> BE user-space processes as sort of separate personality on
> top of LE ARM kernel ... treated kind of multi-abi system.
> Effectively we had to do byteswaps on all non-trivial
> system calls and ioctls in side of the kernel. We converted
> around 30 system calls and around 10 ioctls. Our target process
> was just using those and it works working, but patch was
> very intrusive and unnatural. I think in Linaro there was
> some public version of my presentation circulated that
> explained all this mess. I don't want seriously to consider it.
> 
> The only robust mixed mode, as MarcZ demonstrated,
> could be done only on VM boundaries. I.e LE host can
> run BE guest fine. And BE host can run LE guest fine.
> Everything else would be a huge mess. If we want to
> start pro and cons of different mixed modes we need to
> start separate thread.
> ------ End of side note ------------
> 
> > As Peter already pointed out, the actual breakage behind
> > this is that we have a "default endianness" at all. But that's
> > a very difficult thing to resolve and I don't think it should be
> > our primary goal. Just live with the fact that we declare
> > ARM little endian in QEMU and swap things
> > accordingly - then everyone's happy.
> 
> I disagree with Peter's point of view, as you saw from our
> long thread :). I strongly believe that the current mmio.data[]
> describes the data on the bus perfectly well as an array of bytes.
> data[0] goes into phys_addr, data[1] goes into phys_addr + 1,
> etc.
> 
> Please check the "Differences between BE-32 and BE-8 buses"
> section in [2]. In modern ARM CPUs the memory bus is byte invariant (BE-8).
> As far as the bytes view of the data lines is concerned, it is the same
> between LE and BE-8; that is why IMHO the array-of-bytes view is a very
> good choice. PPC and MIPS CPU memory buses are also byte invariant; they
> have always been that way. I don't think we care about BE-32. So
> for all practical purposes, the mmio structure is a BE-8 bus emulation,
> where the data signals can be described by an array of bytes. If one
> tried to define it as a set of some bigger integer type instead,
> one would need an endianness attribute associated with it. If
> such an attribute were implied by default just through the CPU type, then
> in order to work with the existing cases it would have to differ between
> CPU types, which means qemu running with the same endianness but
> on different CPU types would have to act differently when emulating
> the same device, and that is bad IMHO. So I don't see any
> value in departing from the byte-array view of the data on the bus.
> 
> > This really only ever becomes a problem if you have devices
> > that have awareness of the CPU's endian mode. The only one
> > on PPC that I'm aware of that falls into this category is virtio
> > and there are patches pending to solve that. I don't know if there
> > are any QEMU emulated devices outside of virtio with this
> > issue on ARM, but you'll have to make the emulation code
> > for those look at the CPU state then.
> 
> Agreed on native-endianness devices; I don't think we should really
> have them on ARM, and I believe your assertion for PPC. In any case
> those native-endian devices will be very bad for the mixed-endianness case.
> Agreed that the virtio issues must be addressed; when I tested mixed
> modes I had to bring in the virtio patches.
> 
> Thanks,
> Victor
> 
> [1]
> https://git.linaro.org/people/victor.kamensky/qemu-be.git/commitdiff/9cc68f682d7c25c6749f0137269de0164d666356?hp=bdc07868d30d3362a4ba0215044a185ff7a80bf4
> 
> [2]
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0290g/ch06s05s01.html
> 
> >>
> >>> QEMU emulates everything that comes after the CPU, so
> >>> imagine the ioctl struct as a bus package. Your bus
> >>> doesn't care what endianness the CPU is in - it just
> >>> gets data from the CPU.
> >>
> >> I am not sure that I follow above. Suppose I have
> >>
> >> mov r1, #1
> >> str r1, [r0]
> >>
> >> where r0 is a device address. Now, depending on the CPSR
> >> E bit value, the device address will receive the integer 1 either
> >> in LE order or in BE order. That is how an ARM v7 CPU
> >> works, regardless of whether it is emulated or not.
> >>
> >> So if E bit is off (LE case) after str is executed
> >> byte at r0 address will get 1
> >> byte at r0 + 1 address will get 0
> >> byte at r0 + 2 address will get 0
> >> byte at r0 + 3 address will get 0
> >>
> >> If E bit is on (BE case) after str is executed
> >> byte at r0 address will get 0
> >> byte at r0 + 1 address will get 0
> >> byte at r0 + 2 address will get 0
> >> byte at r0 + 3 address will get 1
> >>
> >> my point is that mmio.data[] just carries the bytes for phys_addr:
> >> mmio.data[0] would be the value for the byte at phys_addr,
> >> mmio.data[1] would be the value for the byte at phys_addr + 1, and
> >> so on.
> >
> > What we get is an instruction that traps because it wants to "write r1
> > (which has value=1) into address x". So at that point we get the
> > register value.
> >
> > Then we need to take a look at the E bit to see whether the write was
> > supposed to be in non-host endianness because we need to emulate
> > exactly the LE/BE difference you're indicating above. The way we
> > implement this on PPC is that we simply byte swap the register value
> > when guest_endian != host_endian.
> >
> > With this in place, QEMU can just memcpy() the value into a local
> > register and feed it into its emulation code which expects a "register
> > value as if the CPU was running in native endianness" as parameter -
> > with "native" meaning "little endian" for qemu-system-arm. Device
> > emulation code doesn't know what to do with a byte array.
> >
> > Take a look at QEMU's MMIO handler:
> >
> >         case KVM_EXIT_MMIO:
> >             DPRINTF("handle_mmio\n");
> >             cpu_physical_memory_rw(run->mmio.phys_addr,
> >                                    run->mmio.data,
> >                                    run->mmio.len,
> >                                    run->mmio.is_write);
> >             ret = 0;
> >             break;
> >
> > which translates to
> >
> >                 switch (l) {
> >                 case 8:
> >                     /* 64 bit write access */
> >                     val = ldq_p(buf);
> >                     error |= io_mem_write(mr, addr1, val, 8);
> >                     break;
> >                 case 4:
> >                     /* 32 bit write access */
> >                     val = ldl_p(buf);
> >                     error |= io_mem_write(mr, addr1, val, 4);
> >                     break;
> >                 case 2:
> >                     /* 16 bit write access */
> >                     val = lduw_p(buf);
> >                     error |= io_mem_write(mr, addr1, val, 2);
> >                     break;
> >                 case 1:
> >                     /* 8 bit write access */
> >                     val = ldub_p(buf);
> >                     error |= io_mem_write(mr, addr1, val, 1);
> >                     break;
> >                 default:
> >                     abort();
> >                 }
> >
> > which calls the ldx_p primitives
> >
> > #if defined(TARGET_WORDS_BIGENDIAN)
> > #define lduw_p(p) lduw_be_p(p)
> > #define ldsw_p(p) ldsw_be_p(p)
> > #define ldl_p(p) ldl_be_p(p)
> > #define ldq_p(p) ldq_be_p(p)
> > #define ldfl_p(p) ldfl_be_p(p)
> > #define ldfq_p(p) ldfq_be_p(p)
> > #define stw_p(p, v) stw_be_p(p, v)
> > #define stl_p(p, v) stl_be_p(p, v)
> > #define stq_p(p, v) stq_be_p(p, v)
> > #define stfl_p(p, v) stfl_be_p(p, v)
> > #define stfq_p(p, v) stfq_be_p(p, v)
> > #else
> > #define lduw_p(p) lduw_le_p(p)
> > #define ldsw_p(p) ldsw_le_p(p)
> > #define ldl_p(p) ldl_le_p(p)
> > #define ldq_p(p) ldq_le_p(p)
> > #define ldfl_p(p) ldfl_le_p(p)
> > #define ldfq_p(p) ldfq_le_p(p)
> > #define stw_p(p, v) stw_le_p(p, v)
> > #define stl_p(p, v) stl_le_p(p, v)
> > #define stq_p(p, v) stq_le_p(p, v)
> > #define stfl_p(p, v) stfl_le_p(p, v)
> > #define stfq_p(p, v) stfq_le_p(p, v)
> > #endif
> >
> > and then passes the result as "originating register access" to the
> > device emulation part of QEMU.
> >
> >
> > Maybe it becomes more clear if you understand the code flow that TCG is
> > going through. With TCG whenever a write traps into MMIO we go through
> > these functions
> >
> > void
> > glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
> > target_ulong addr, DATA_TYPE val, int mmu_idx)
> > {
> >     helper_te_st_name(env, addr, val, mmu_idx, GETRA());
> > }
> >
> > #ifdef TARGET_WORDS_BIGENDIAN
> > # define TGT_BE(X)  (X)
> > # define TGT_LE(X)  BSWAP(X)
> > #else
> > # define TGT_BE(X)  BSWAP(X)
> > # define TGT_LE(X)  (X)
> > #endif
> >
> > void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE
> > val, int mmu_idx, uintptr_t retaddr)
> > {
> > [...]
> >     /* Handle an IO access.  */
> >     if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
> >         hwaddr ioaddr;
> >         if ((addr & (DATA_SIZE - 1)) != 0) {
> >             goto do_unaligned_access;
> >         }
> >         ioaddr = env->iotlb[mmu_idx][index];
> >
> >         /* ??? Note that the io helpers always read data in the target
> >            byte ordering.  We should push the LE/BE request down
> >            into io.  */
> >         val = TGT_LE(val);
> >         glue(io_write, SUFFIX)(env, ioaddr, val, addr, retaddr);
> >         return;
> >     }
> >     [...]
> > }
> >
> > static inline void glue(io_write, SUFFIX)(CPUArchState *env,
> >                                           hwaddr physaddr,
> >                                           DATA_TYPE val,
> >                                           target_ulong addr,
> >                                           uintptr_t retaddr)
> > {
> >     MemoryRegion *mr = iotlb_to_region(physaddr);
> >
> >     physaddr = (physaddr & TARGET_PAGE_MASK) + addr;
> >     if (mr != &io_mem_rom && mr != &io_mem_notdirty && !can_do_io(env)) {
> >         cpu_io_recompile(env, retaddr);
> >     }
> >
> >     env->mem_io_vaddr = addr;
> >     env->mem_io_pc = retaddr;
> >     io_mem_write(mr, physaddr, val, 1 << SHIFT);
> > }
> >
> > which at the end of the chain means that if you're running the same
> > endianness on guest and host, you get the original register value as
> > the function parameter. If you run different endianness you get a
> > swapped value as the function parameter.
> >
> > So at the end of all of this, if you're running qemu-system-arm (TCG)
> > on a BE host, the request into the io callback function will come in
> > as the register value and stay that way all the way until it reaches
> > the IO callback function. Unless you define a specific endianness for
> > your device, in which case the callback may swizzle it again. But if
> > your device defines DEVICE_LITTLE_ENDIAN or DEVICE_NATIVE_ENDIAN, it
> > won't swizzle it.
> >
> > What happens when you switch your guest to BE mode (or LE for PPC)?
> > Very simple. The TCG frontend swizzles every memory read and write
> > before it hits TCG's memory operations.
> >
> > If you're running qemu-system-arm (KVM) on a BE host the request will
> > come into kvm-all.c, get read with swapped endianness (ldq_p) and then
> > get passed that way into the IO callback function. That's where the
> > bug lies. It should behave the same way as TCG, so it needs to know the
> > value the register originally had. So instead of doing an ldq_p() it
> > should go through a different path that does memcpy().
> >
> > But that doesn't fix the other-endian issue yet, right? Every value now
> > would come in as the register value.
> >
> > Well, unless you do the same thing TCG does inside the kernel. So the
> > kernel would swap the reads and writes before it accesses the ioctl
> > struct that connects kvm with QEMU. Then all abstraction layers work
> > just fine again and we don't need any qemu-system-armeb.
> >
> >
> > Alex
> >
> 



-- 
Gregory Kurz                                     kurzgreg@fr.ibm.com
                                                 gkurz@linux.vnet.ibm.com
Software Engineer @ IBM/Meiosys                  http://www.ibm.com
Tel +33 (0)562 165 496

"Anarchy is about taking complete responsibility for yourself."
        Alan Moore.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: KVM and variable-endianness guest CPUs
  2014-01-23 10:23                           ` [Qemu-devel] " Peter Maydell
@ 2014-01-23 15:06                             ` Victor Kamensky
  -1 siblings, 0 replies; 102+ messages in thread
From: Victor Kamensky @ 2014-01-23 15:06 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Christoffer Dall, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm

On 23 January 2014 02:23, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 23 January 2014 00:22, Victor Kamensky <victor.kamensky@linaro.org> wrote:
>> Peter, could I please ask you a favor. Could you please
>> stop deleting pieces of your and my previous responses
>> when you reply.
>
> No, sorry. It produces excessively long and totally unreadable
> emails for everybody else if people don't trim for context.
> This is standard mailing list practice.

Usually it is OK, but with your choices you sometimes remove
my questions without answering them, even where I think they
are essential to the discussion. For example, I asked about 'len'
values that are not a power of 2:

In [1] I wrote

"I don't see why you so attached to desire to describe
data part of memory transaction as just one of int
types. If we are talking about bunch of hypothetical
cases imagine such bus that allow transaction with
size of 6 bytes. How do you describe such data in
your ints speak? What endianity you can assign to
sequence of 6 bytes? While note that description of
such transaction as set of 6 byte values at address
$whatever makes perfect sense."

But notice that in your next reply [2] you just dropped it.

A similar thing happens with this reply: you removed a
piece that I had to bring back. Please see below.

>>>> Consider above big endian case (setend be) example,
>>>> but now running in BE KVM host. 0x4 is LSB of CPU
>>>> core register in this case.
>>>
>>> Yes. In this case if we are using the "mmio.data is host
>>> kernel endianness" definition then mmio.data[0] should be
>>> 0x01 (the MSB of the 32 bit data value).
>>
>> If mmio.data[0] is 0x1, mmio.data[] = {0x1, 0x2, 0x3, 0x4},
>> and now the KVM host and emulator are running in BE mode.
>> But that contradicts what you said before.
>
> Sorry, I misread the example here (and assumed we were
> writing the same word in both cases, when actually the BE
> code example is writing a different value). mmio.data[0] should
> be 0x4, because:
>  * BE ARM guest, so KVM must byte-swap the register value
>     (giving 0x04030201)
>  * BE host, so it writes the uint32_t in host order (giving
>    0x4 in mmio.data[0])
>
>>>> I believe, but I need to check, that the PPC BE setup actually
>>>> acts as the second case in the above example. If we have a PPC
>>>> BE guest executing the following instructions:
>>>>
>>>> lis     r1,0x102
>>>> ori     r1,r1,0x304
>>>> stw    r1,0(r0)
>>>>
>>>> after the first two instructions r1 would contain 0x01020304.
>>>> IMHO it corresponds exactly to my second ARM case above -
>>>> a BE guest running under an ARM BE KVM host. I believe
>>>> that mmio.data[] in the PPC BE case would be {0x1, 0x2, 0x3, 0x4}.
>>>
>>> Yes, assuming a BE PPC host kernel (which is the usual
>>> arrangement).
>>
>> OK, that confirms my understanding how PPC mmio
>> should work.
>>
>>>> But according to you data[0] must be 0x4 in BE host case
>>>
>>> Er, no. The data here is 0x01020304, so for a BE host
>>> data[0] is the big end, ie 0x1. It would only be 0x4 if
>>> mmio.data[] were LE always (or if you were running
>>> your BE PPC guest on an LE PPC host, which I don't
>>> think is supported currently).
>>
>> So do you agree that for all three code snippets cited in this
>> email, we always will have mmio.data[] = {0x1, 0x2,
>> 0x3, 0x4}, for ARM LE qemu/host, for ARM BE qemu/host
>> and for ppc code snippet in PPC BE qemu/host.
>
> No. Also your ARM and PPC examples are not usefully
> comparable, because:
>
>> setend le
>> mov r1, #0x04030201
>> str r1, [r0]
>
> This is an LE guest writing 0x04030201, and that is the
> value that will go out on the bus.
>
>> and
>>
>> setend be
>> mov r1, #0x01020304
>> str r1, [r0]
>
> This is a BE guest writing 0x01020304; as far as the
> code running on the CPU is concerned, the value on the
> bus will be byteswapped.
>
>> lis     r1,0x102
>> ori     r1,r1,0x304
>> stw    r1,0(r0)
>
> This is also a BE guest writing 0x01020304. I'm pretty
> sure that the PPC approach is that for BE guests writing
> a word that word goes out to the bus as is; for LE guests
> (or if the page table is set up to say "this page is LE") the
> CPU swaps it before putting it on the bus. In this regard
> it is the opposite way round to ARM.
>
> So the value you start with in the CPU register is not
> the same in all three cases, and what the hardware
> does is not the same either.

So in which case does the h/w behave differently? I think we agreed
before that in the ARM cases it is the same memory
transaction; the h/w cannot do anything different. And the ARM
'setend be' case matches the PPC BE case: in both cases a
BE write happens to the same h/w address and the value is
0x01020304, so why would the h/w see it differently? It is
the same write.

I think you are missing that in all the cases discussed, BE-8
(byte-invariant) CPU memory buses are used. Here is what I wrote
in reply to Alex; it is worth copying here:

---- start quote from my response to Alex -----
I disagree with Peter's point of view, as you saw from our
long thread :). I strongly believe that the current mmio.data[]
describes the data on the bus perfectly well as an array of bytes.
data[0] goes into phys_addr, data[1] goes into phys_addr + 1,
etc.

Please check the "Differences between BE-32 and BE-8 buses"
section in [3]. In modern ARM CPUs the memory bus is byte invariant (BE-8).
As far as the bytes view of the data lines is concerned, it is the same
between LE and BE-8; that is why IMHO the array-of-bytes view is a very
good choice. PPC and MIPS CPU memory buses are also byte invariant; they
have always been that way. I don't think we care about BE-32. So
for all practical purposes, the mmio structure is a BE-8 bus emulation,
where the data signals can be described by an array of bytes. If one
tried to define it as a set of some bigger integer type instead,
one would need an endianness attribute associated with it. If
such an attribute were implied by default just through the CPU type, then
in order to work with the existing cases it would have to differ between
CPU types, which means qemu running with the same endianness but
on different CPU types would have to act differently when emulating
the same device, and that is bad IMHO. So I don't see any
value in departing from the byte-array view of the data on the bus.
---- end quote from my response to Alex -----
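
To make the byte-invariant point concrete, here is a small standalone
sketch (plain C of mine, not code from any of the projects) of what a
BE-8 bus sees for the three snippets we have been discussing:

#include <stdint.h>
#include <stdio.h>

/* A BE-8 (byte-invariant) 32-bit store: the CPU's current
 * endianness decides the byte order, the bus just carries the
 * byte lanes addr+0..addr+3. */
static void store32(uint8_t *bus, uint32_t reg, int cpu_is_be)
{
    for (int i = 0; i < 4; i++) {
        int shift = cpu_is_be ? (24 - 8 * i) : (8 * i);
        bus[i] = (reg >> shift) & 0xff;
    }
}

int main(void)
{
    uint8_t bus[4];

    store32(bus, 0x04030201, 0);  /* ARM 'setend le' snippet */
    printf("%02x %02x %02x %02x\n", bus[0], bus[1], bus[2], bus[3]);
    store32(bus, 0x01020304, 1);  /* ARM 'setend be' snippet */
    printf("%02x %02x %02x %02x\n", bus[0], bus[1], bus[2], bus[3]);
    store32(bus, 0x01020304, 1);  /* PPC 'stw' snippet */
    printf("%02x %02x %02x %02x\n", bus[0], bus[1], bus[2], bus[3]);
    return 0;                     /* every line prints: 01 02 03 04 */
}

The device sees 01 02 03 04 in every case, no matter which host
endianness the program itself runs on.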

We really need to agree that in all three cases, if the same
device is attached to the memory bus at the r0 address, the
device sees the same data write. That is really the key. If you
disagree, please stop reading and let's discuss that first. If we
have a disagreement on this, I will find examples
in the Linux kernel, with code snippets from the same driver
writing to the same h/w register, across the three
cases considered: ARM LE, ARM BE, PPC BE. Or if it won't be
the same driver, I will find logically very similar cases.
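
For instance -- a sketch rather than a specific driver: Linux
writel() is defined as a little-endian register write on ARM LE, ARM
BE and PPC BE alike, with each port swapping as required before the
store, so the very same driver line produces the same bytes at the
device in all three configurations:

#include <linux/io.h>

/* Sketch: 0x10 is a hypothetical register offset. Built into an
 * ARM LE, ARM BE or PPC BE kernel, this puts bytes 01 02 03 04 at
 * offsets 0x10..0x13 of the device in every case. */
static void example_reg_write(void __iomem *regs)
{
    writel(0x04030201, regs + 0x10);
}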

If you now agree that the h/w sees the same data in all three
cases, let's go back to the mmio.data[] content.

So, to summarize, as far as mmio.data[] is concerned,
for these three cases according to you:

For LE KVM host/qemu:
ARM 'setend le' case mmio.data[] = {0x1, 0x2, 0x3, 0x4}
ARM 'setend be' case mmio.data[] = {0x1, 0x2, 0x3, 0x4}

For BE KVM host/qemu:
ARM 'setend le' case mmio.data[] = {0x4, 0x3, 0x2, 0x1}
ARM 'setend be' case mmio.data[] = {0x4, 0x3, 0x2, 0x1}
PPC case mmio.data[] = {0x1, 0x2, 0x3, 0x4}
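
Spelled out as code, the two competing definitions differ only in how
KVM fills the array (a sketch with hypothetical helper names, not
actual kernel code):

#include <linux/string.h>         /* memcpy() */
#include <linux/types.h>

/* Byte-array (BE-8) definition: data[i] is the byte the device
 * sees at phys_addr + i; 'bus' holds those bytes, as in the
 * store32() sketch above. Same result on LE and BE hosts. */
static void fill_mmio_bytearray(u8 *data, const u8 *bus, int len)
{
    memcpy(data, bus, len);
}

/* Host-endianness definition: 'val' is the access value after any
 * guest<->host swap; storing it in host byte order makes data[0]
 * host dependent, which is how the ARM rows flip between
 * {0x1, 0x2, 0x3, 0x4} for an LE host and {0x4, 0x3, 0x2, 0x1}
 * for a BE host above. */
static void fill_mmio_hostendian(u8 *data, u32 val, int len)
{
    memcpy(data, &val, len);
}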

And here is what I asked with regard to this at the end of [4]:

> So do you agree that for all three code snippets cited in this
> email, we always will have mmio.data[] = {0x1, 0x2,
> 0x3, 0x4}, for ARM LE qemu/host, for ARM BE qemu/host
> and for ppc code snippet in PPC BE qemu/host.
> I believe it should be this way, because from emulator (i.e
> qemu) code point of view running on ARM BE qemu/host
> or PPC BE and emulating the same h/w device, code
> should not make difference whether it is ARM or PPC.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Just to highlight it again: as far as BE qemu is concerned, it sees
{0x4, 0x3, 0x2, 0x1} in the ARM case and {0x1, 0x2, 0x3, 0x4}
in the PPC case, yet it is the same h/w write of the same value,
so why should qemu process it differently depending on the CPU
type? I believe it should not, and the content of mmio.data[]
should be {0x1, 0x2, 0x3, 0x4}, which is an unambiguous BE-8
description of the bus data signals for all possible endianness
cases of KVM host/qemu.
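
On the qemu side that definition keeps the device-facing code
identical for every host and CPU type; here is a sketch (assuming
qemu's headers for struct kvm_run and the ld*_p helpers) for a device
that declares itself little-endian:

/* Sketch: with data[] defined as the bytes at phys_addr ..
 * phys_addr + len - 1, decoding depends only on the endianness the
 * device declares, never on the host or on the guest CPU type. */
static uint64_t mmio_to_le_value(struct kvm_run *run)
{
    switch (run->mmio.len) {
    case 1: return ldub_p(run->mmio.data);
    case 2: return lduw_le_p(run->mmio.data);
    case 4: return ldl_le_p(run->mmio.data);
    case 8: return ldq_le_p(run->mmio.data);
    default: abort();
    }
}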

Thanks,
Victor

[1] https://lists.cs.columbia.edu/pipermail/kvmarm/2014-January/008902.html

[2] https://lists.cs.columbia.edu/pipermail/kvmarm/2014-January/008903.html

[3] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0290g/ch06s05s01.html

[4] https://lists.cs.columbia.edu/pipermail/kvmarm/2014-January/008906.html

> thanks
> -- PMM

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] KVM and variable-endianness guest CPUs
@ 2014-01-23 15:06                             ` Victor Kamensky
  0 siblings, 0 replies; 102+ messages in thread
From: Victor Kamensky @ 2014-01-23 15:06 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Thomas Falcon, kvm-devel, QEMU Developers, qemu-ppc, kvmarm,
	Christoffer Dall

On 23 January 2014 02:23, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 23 January 2014 00:22, Victor Kamensky <victor.kamensky@linaro.org> wrote:
>> Peter, could I please ask you a favor. Could you please
>> stop deleting pieces of your and my previous responses
>> when you reply.
>
> No, sorry. It produces excessively long and totally unreadable
> emails for everybody else if people don't trim for context.
> This is standard mailing list practice.

Usually it is OK, but with your choices sometimes you remove
my questions without answering them, where I think that they
are essential for discussion. For example I asked about 'len'
with value that are not power of 2:

In [1] I wrote

"I don't see why you so attached to desire to describe
data part of memory transaction as just one of int
types. If we are talking about bunch of hypothetical
cases imagine such bus that allow transaction with
size of 6 bytes. How do you describe such data in
your ints speak? What endianity you can assign to
sequence of 6 bytes? While note that description of
such transaction as set of 6 byte values at address
$whatever makes perfect sense."

But notice that in your next reply [2] you just dropped it

Similar situation happens with this reply you removed
piece that I had to bring back. Please see below.

>>>> Consider above big endian case (setend be) example,
>>>> but now running in BE KVM host. 0x4 is LSB of CPU
>>>> core register in this case.
>>>
>>> Yes. In this case if we are using the "mmio.data is host
>>> kernel endianness" definition then mmio.data[0] should be
>>> 0x01 (the MSB of the 32 bit data value).
>>
>> If mmio.data[0] is 0x1, mmio.data[] = {0x1, 0x2, 0x3, 0x4},
>> and now KVM host and emulator running in BE mode.
>> But that contradicts to what you said before.
>
> Sorry, I misread the example here (and assumed we were
> writing the same word in both cases, when actually the BE
> code example is writing a different value). mmio.data[0] should
> be 0x4, because:
>  * BE ARM guest, so KVM must byte-swap the register value
>     (giving 0x04030201)
>  * BE host, so it writes the uint32_t in host order (giving
>    0x4 in mmio.data[0])
>
>>>> I believe, but I need to check, that PPC BE setup actually
>>>> acts as the second case in above example  If we have PPC
>>>> BE guest executing the following instructions:
>>>>
>>>> lis     r1,0x102
>>>> ori     r1,r1,0x304
>>>> stw    r1,0(r0)
>>>>
>>>> after first two instructions r1 would contain 0x01020304.
>>>> IMHO It exactly corresponds to above my ARM second case -
>>>> BE guest when it runs under ARM BE KVM host. I believe
>>>> that mmio.data[] in PPC BE case would be {0x1, 0x2, 0x3, 0x4}.
>>>
>>> Yes, assuming a BE PPC host kernel (which is the usual
>>> arrangement).
>>
>> OK, that confirms my understanding how PPC mmio
>> should work.
>>
>>>> But according to you data[0] must be 0x4 in BE host case
>>>
>>> Er, no. The data here is 0x01020304, so for a BE host
>>> data[0] is the big end, ie 0x1. It would only be 0x4 if
>>> mmio.data[] were LE always (or if you were running
>>> your BE PPC guest on an LE PPC host, which I don't
>>> think is supported currently).
>>
>> So do you agree that for all three code snippets cited in this
>> email, we always will have mmio.data[] = {0x1, 0x2,
>> 0x3, 0x4}, for ARM LE qemu/host, for ARM BE qemu/host
>> and for ppc code snippet in PPC BE qemu/host.
>
> No. Also your ARM and PPC examples are not usefully
> comparable, because:
>
>> setend le
>> mov r1, #0x04030201
>> str r1, [r0]
>
> This is an LE guest writing 0x04030201, and that is the
> value that will go out on the bus.
>
>> and
>>
>> setend be
>> mov r1, #0x01020304
>> str r1, [r0]
>
> This is a BE guest writing 0x01020304; as far as the
> code running on the CPU is concerned; the value on the
> bus will be byteswapped.
>
>> lis     r1,0x102
>> ori     r1,r1,0x304
>> stw    r1,0(r0)
>
> This is also a BE guest writing 0x01020304. I'm pretty
> sure that the PPC approach is that for BE guests writing
> a word that word goes out to the bus as is; for LE guests
> (or if the page table is set up to say "this page is LE") the
> CPU swaps it before putting it on the bus. In this regard
> it is the opposite way round to ARM.
>
> So the value you start with in the CPU register is not
> the same in all three cases, and what the hardware
> does is not the same either.

So in what cases h/w does differently? I think we agreed
before that in ARM cases it is the same memory
transaction h/w cannot do anything different. And ARM
'setend be' cases matches PPC BE case in both cases
BE write happens to the same h/w address and value is
0x01020304, why h/w would see it differently? It is
the same write.

I think you missing that in all discussed cases BE-8, byte
invariant CPU memory buses are used. Here is what I wrote
in reply to Alex, it is worth copying it here:

---- start quote from my response to Alex -----
I disagree with Peter's point of view, as you saw from our
long thread :). I strongly believe that the current mmio.data[]
describes the data on the bus perfectly well as an array of
bytes: data[0] goes to phys_addr, data[1] goes to phys_addr + 1,
etc.

Please check the "Differences between BE-32 and BE-8 buses"
section in [3]. In a modern ARM CPU the memory bus is byte
invariant (BE-8). As far as the byte view of the data lines is
concerned, it is the same for LE and BE-8, which is why IMHO the
array-of-bytes view is a very good choice. PPC and MIPS CPU
memory buses are also byte invariant; they have always been that
way. I don't think we care about BE-32. So for all practical
purposes, the mmio structure is a BE-8 bus emulation, where the
data signals can be defined by an array of bytes. If one tried
instead to define it as a set of larger integers, one would need
an endianness attribute associated with it. If such an attribute
were implied by default through the CPU type, in order to keep
the existing cases working it would have to differ between CPU
types, which means qemu running with the same endianness but on
different CPU types would have to act differently when emulating
the same device, and that is bad IMHO. So I don't see any value
in departing from the byte-array view of data on the bus.
---- end quote from my response to Alex -----
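To make the byte-array view concrete, here is a small C sketch
(my own illustration, not qemu code; the function name is mine)
of how an emulator consumes such a transaction without knowing
the CPU type at all:

#include <stdint.h>

/* Apply an MMIO write to emulated device storage, treating
 * mmio.data as the bytes that land at phys_addr, phys_addr + 1,
 * ... on a byte-invariant bus. */
static void mmio_write_to_device(uint8_t *dev_mem, uint64_t phys_addr,
                                 const uint8_t *data, uint32_t len)
{
    uint32_t i;

    for (i = 0; i < len; i++)
        dev_mem[phys_addr + i] = data[i];  /* data[i] -> phys_addr + i */
}

Note it works the same for a 4-byte access or a hypothetical
6-byte one, which is exactly why I like the byte-array
description.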

We really need to agree that in all three cases, if the same
device is attached to the memory bus at the r0 address, the
device sees the same data write. That is really the key. If you
disagree, please stop reading here and let's discuss that first.
If we disagree on this, I will find examples in the Linux kernel
with code snippets from the same driver writing to the same h/w
register across the three considered cases: ARM LE, ARM BE,
PPC BE. Or, if it can't be the same driver, I will find
logically very similar cases.

If you now agree that the h/w sees the same data in all three
cases, let's go back to the mmio.data[] content.

So to summarize, as far as mmio.data[] is concerned,
for these three cases according to you:

For LE KVM host/qemu:
ARM 'setend le' case mmio.data[] = {0x1, 0x2, 0x3, 0x4}
ARM 'setend be' case mmio.data[] = {0x1, 0x2, 0x3, 0x4}

For BE KVM host/qemu:
ARM 'setend le' case mmio.data[] = {0x4, 0x3, 0x2, 0x1}
ARM 'setend be' case mmio.data[] = {0x4, 0x3, 0x2, 0x1}
PPC case mmio.data[] = {0x1, 0x2, 0x3, 0x4}

And here is what I asked about this at the end of [4]:

> So do you agree that for all three code snippets cited in this
> email, we always will have mmio.data[] = {0x1, 0x2,
> 0x3, 0x4}, for ARM LE qemu/host, for ARM BE qemu/host
> and for ppc code snippet in PPC BE qemu/host.
> I believe it should be this way, because from emulator (i.e
> qemu) code point of view running on ARM BE qemu/host
> or PPC BE and emulating the same h/w device, code
> should not make difference whether it is ARM or PPC.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Just to highlight it again: as far as BE qemu is concerned, it
sees {0x4, 0x3, 0x2, 0x1} in the ARM case and {0x1, 0x2, 0x3,
0x4} in the PPC case, yet it is the same h/w write of the same
value, so why should qemu process it differently depending on
the CPU type? I believe it should not, and the content of
mmio.data[] should be {0x1, 0x2, 0x3, 0x4}, which is the
unambiguous BE-8 description of the bus data signals for all
possible endianness cases of the KVM host/qemu.
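To spell out where the two different BE-host arrays in that
summary come from, here is a small C sketch of Peter's rule as I
understand it (my own illustration, not kernel code; bswap32 is
the gcc builtin):

#include <stdint.h>
#include <string.h>

/* ARM byteswaps between its registers and the bus when the CPU
 * is in BE mode (the E-bit is set); PPC does not. */
static uint32_t arm_reg_to_bus(uint32_t reg, int e_bit_set)
{
    return e_bit_set ? __builtin_bswap32(reg) : reg;
}

/* Under the "host kernel byte order" rule, the bus value is then
 * stored into mmio.data as a host-order integer. */
static void bus_to_mmio_data(uint8_t data[4], uint32_t bus_val)
{
    memcpy(data, &bus_val, sizeof(bus_val));
}

So on a BE host, arm_reg_to_bus(0x01020304, 1) = 0x04030201, and
bus_to_mmio_data() lays that out as {0x4, 0x3, 0x2, 0x1}, while
the PPC bus value 0x01020304 becomes {0x1, 0x2, 0x3, 0x4}.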

Thanks,
Victor

[1] https://lists.cs.columbia.edu/pipermail/kvmarm/2014-January/008902.html

[2] https://lists.cs.columbia.edu/pipermail/kvmarm/2014-January/008903.html

[3] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0290g/ch06s05s01.html

[4] https://lists.cs.columbia.edu/pipermail/kvmarm/2014-January/008906.html

> thanks
> -- PMM

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: KVM and variable-endianness guest CPUs
  2014-01-23 15:06                             ` [Qemu-devel] " Victor Kamensky
@ 2014-01-23 15:33                               ` Peter Maydell
  -1 siblings, 0 replies; 102+ messages in thread
From: Peter Maydell @ 2014-01-23 15:33 UTC (permalink / raw)
  To: Victor Kamensky
  Cc: Christoffer Dall, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm

On 23 January 2014 15:06, Victor Kamensky <victor.kamensky@linaro.org> wrote:
> In [1] I wrote
>
> "I don't see why you so attached to desire to describe
> data part of memory transaction as just one of int
> types. If we are talking about bunch of hypothetical
> cases imagine such bus that allow transaction with
> size of 6 bytes. How do you describe such data in
> your ints speak? What endianity you can assign to
> sequence of 6 bytes? While note that description of
> such transaction as set of 6 byte values at address
> $whatever makes perfect sense."
>
> But notice that in your next reply [2] you just dropped it

Yes. This is because it was one of the places where
I would have just had to repeat "no, I'm afraid you're wrong
about how hardware works". I think in general it's going
to be better if I don't try to reply point by point to this
email; I think you should go back and reread the emails I've
sent. Key points:
 (1) hardware is not doing anything involving arrays
     of bytes
 (2) the API between kernel and userspace needs to define
     the semantics of mmio.data, ie how to map between
     "x byte wide transaction with value v" and the array,
     and that is primarily what this conversation is about
 (3) the only choice which is both (a) sensible and (b)
     not breaking existing usage is to say "the array is
     in host-kernel-byte-order"
 (4) PPC CPUs in BE mode and ARM CPUs in BE mode are not
     the same, because in the ARM case it is doing an
     internal-to-CPU byteswap, and in the PPC case it is not
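As a concrete sketch of the mapping in points (2) and (3)
(illustration only, not actual kernel code):

#include <stdint.h>
#include <string.h>

/* Map "x byte wide transaction with value v" to the mmio.data
 * array, with the array in host kernel byte order: store v in a
 * host-order integer of the right width and copy its bytes. */
static void value_to_mmio_data(uint8_t data[8], uint64_t v, unsigned x)
{
    switch (x) {
    case 1: { uint8_t  t = v; memcpy(data, &t, 1); break; }
    case 2: { uint16_t t = v; memcpy(data, &t, 2); break; }
    case 4: { uint32_t t = v; memcpy(data, &t, 4); break; }
    case 8: { uint64_t t = v; memcpy(data, &t, 8); break; }
    }
}

The reverse direction is the same memcpy the other way round.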

thanks
-- PMM

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: KVM and variable-endianness guest CPUs
  2014-01-23 15:33                               ` [Qemu-devel] " Peter Maydell
@ 2014-01-23 16:25                                 ` Victor Kamensky
  -1 siblings, 0 replies; 102+ messages in thread
From: Victor Kamensky @ 2014-01-23 16:25 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Christoffer Dall, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm

On 23 January 2014 07:33, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 23 January 2014 15:06, Victor Kamensky <victor.kamensky@linaro.org> wrote:
>> In [1] I wrote
>>
>> "I don't see why you so attached to desire to describe
>> data part of memory transaction as just one of int
>> types. If we are talking about bunch of hypothetical
>> cases imagine such bus that allow transaction with
>> size of 6 bytes. How do you describe such data in
>> your ints speak? What endianity you can assign to
>> sequence of 6 bytes? While note that description of
>> such transaction as set of 6 byte values at address
>> $whatever makes perfect sense."
>>
>> But notice that in your next reply [2] you just dropped it
>
> Yes. This is because it was one of the places where
> I would have just had to repeat "no, I'm afraid you're wrong
> about how hardware works". I think in general it's going
> to be better if I don't try to reply point by point to this
> email; I think you should go back and reread the emails I've
> sent. Key points:
>  (1) hardware is not doing anything involving arrays
>      of bytes

An array of bytes or integers is just a way to describe the data
lines on the bus. Did you look at this document?

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0290g/ch06s05s01.html

The A0, A1, ..., A7 byte values are the same for both the LE and
BE-8 cases (the first two columns in the table), and they
unambiguously describe the data bus signals.

>  (2) the API between kernel and userspace needs to define
>      the semantics of mmio.data, ie how to map between
>      "x byte wide transaction with value v" and the array,
>      and that is primarily what this conversation is about
>  (3) the only choice which is both (a) sensible and (b)
>      not breaking existing usage is to say "the array is
>      in host-kernel-byte-order"
>  (4) PPC CPUs in BE mode and ARM CPUs in BE mode are not
>      the same, because in the ARM case it is doing an
>      internal-to-CPU byteswap, and in the PPC case it is not

That is one of the key disconnects. I'll go find real examples
in the ARM LE, ARM BE, and PPC BE Linux kernels. Just for
everybody's sake, here is a summary of the disconnect:

If we have the same h/w connected to the memory bus in ARM
and PPC systems, and we have the following three pieces
of code where r0 holds the same register address of the
same device:

1. ARM LE word write of  0x04030201:
setend le
mov r1, #0x04030201
str r1, [r0]

2. ARM BE word write of 0x01020304:
setend be
mov r1, #0x01020304
str r1, [r0]

3. PPC BE word write of 0x01020304:
lis     r1,0x102
ori     r1,r1,0x304
stw    r1,0(r0)

I claim that the h/w will see the same data on the bus lines in
all three cases, and that the h/w will act the same in all three
cases. Peter says that in the ARM BE and PPC BE cases the h/w
would act differently.

If anyone else can offer an opinion on that while I am looking
for real examples, that would be great.
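For the record, here is a tiny self-contained C check (my own
illustration) of what I mean by the writes being identical on a
byte-invariant (BE-8) bus:

#include <stdint.h>
#include <stdio.h>

/* Lay a 32-bit register value out on the bus the way each CPU
 * mode does: LE mode stores the LSB first, BE mode (ARM with the
 * E-bit set, or PPC) stores the MSB first. */
static void cpu_store_le(uint8_t bus[4], uint32_t v)
{
    bus[0] = v & 0xff; bus[1] = (v >> 8) & 0xff;
    bus[2] = (v >> 16) & 0xff; bus[3] = (v >> 24) & 0xff;
}

static void cpu_store_be(uint8_t bus[4], uint32_t v)
{
    bus[0] = (v >> 24) & 0xff; bus[1] = (v >> 16) & 0xff;
    bus[2] = (v >> 8) & 0xff; bus[3] = v & 0xff;
}

int main(void)
{
    uint8_t bus[4];

    cpu_store_le(bus, 0x04030201);  /* ARM 'setend le' snippet */
    printf("%02x %02x %02x %02x\n", bus[0], bus[1], bus[2], bus[3]);

    cpu_store_be(bus, 0x01020304);  /* ARM 'setend be' and PPC snippets */
    printf("%02x %02x %02x %02x\n", bus[0], bus[1], bus[2], bus[3]);
    return 0;   /* both lines print: 01 02 03 04 */
}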

Thanks,
Victor

> thanks
> -- PMM

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: KVM and variable-endianness guest CPUs
  2014-01-23 16:25                                 ` [Qemu-devel] " Victor Kamensky
@ 2014-01-23 20:45                                   ` Christoffer Dall
  -1 siblings, 0 replies; 102+ messages in thread
From: Christoffer Dall @ 2014-01-23 20:45 UTC (permalink / raw)
  To: Victor Kamensky
  Cc: Peter Maydell, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm

On Thu, Jan 23, 2014 at 08:25:35AM -0800, Victor Kamensky wrote:
> On 23 January 2014 07:33, Peter Maydell <peter.maydell@linaro.org> wrote:
> > On 23 January 2014 15:06, Victor Kamensky <victor.kamensky@linaro.org> wrote:
> >> In [1] I wrote
> >>
> >> "I don't see why you so attached to desire to describe
> >> data part of memory transaction as just one of int
> >> types. If we are talking about bunch of hypothetical
> >> cases imagine such bus that allow transaction with
> >> size of 6 bytes. How do you describe such data in
> >> your ints speak? What endianity you can assign to
> >> sequence of 6 bytes? While note that description of
> >> such transaction as set of 6 byte values at address
> >> $whatever makes perfect sense."
> >>
> >> But notice that in your next reply [2] you just dropped it
> >
> > Yes. This is because it was one of the places where
> > I would have just had to repeat "no, I'm afraid you're wrong
> > about how hardware works". I think in general it's going
> > to be better if I don't try to reply point by point to this
> > email; I think you should go back and reread the emails I've
> > sent. Key points:
> >  (1) hardware is not doing anything involving arrays
> >      of bytes
> 
> An array of bytes or integers is just a way to describe the data
> lines on the bus. Did you look at this document?
> 
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0290g/ch06s05s01.html
> 
> The A0, A1, ..., A7 byte values are the same for both the LE and
> BE-8 cases (the first two columns in the table), and they
> unambiguously describe the data bus signals.
> 

The point is simple, and Peter has made it over and over:
Any consumer of a memory operation sees "value, len, address".

This is what KVM_EXIT_MMIO emulates.  So just by knowing the ABI
definition and having a pointer to the structure you need to be able to
tell me "value, len, address".

> >  (2) the API between kernel and userspace needs to define
> >      the semantics of mmio.data, ie how to map between
> >      "x byte wide transaction with value v" and the array,
> >      and that is primarily what this conversation is about
> >  (3) the only choice which is both (a) sensible and (b)
> >      not breaking existing usage is to say "the array is
> >      in host-kernel-byte-order"
> >  (4) PPC CPUs in BE mode and ARM CPUs in BE mode are not
> >      the same, because in the ARM case it is doing an
> >      internal-to-CPU byteswap, and in the PPC case it is not
> 
> That is one of the key disconnects. I'll go find real examples
> in the ARM LE, ARM BE, and PPC BE Linux kernels. Just for
> everybody's sake, here is a summary of the disconnect:
> 
> If we have the same h/w connected to the memory bus in ARM
> and PPC systems, and we have the following three pieces
> of code where r0 holds the same register address of the
> same device:
> 
> 1. ARM LE word write of  0x04030201:
> setend le
> mov r1, #0x04030201
> str r1, [r0]
> 
> 2. ARM BE word write of 0x01020304:
> setend be
> mov r1, #0x01020304
> str r1, [r0]
> 
> 3. PPC BE word write of 0x01020304:
> lis     r1,0x102
> ori     r1,r1,0x304
> stw    r1,0(r0)
> 
> I claim that the h/w will see the same data on the bus lines in
> all three cases, and that the h/w will act the same in all three
> cases. Peter says that in the ARM BE and PPC BE cases the h/w
> would act differently.
> 
> If anyone else can offer an opinion on that while I am looking
> for real examples, that would be great.
> 

I really don't think listing all these examples helps.  You need
to focus on the key points that Peter listed in his previous mail.

I tried to ask you this question in our chat:

vcpu_data_host_to_guest() is handling a read from an emulated device.
All the info you have is:
(1) len of memory access
(2) mmio.data pointer
(3) destination register
(4) host CPU endianness
(5) guest CPU endianness

Based on this information alone, you need to decide whether you do a
byteswap or not before loading the hardware register upon returning to
the guest.

You will find it impossible to answer, because you don't know the layout
of mmio.data, and that is the thing we are trying to solve.

If you cannot reply to this point in less than 50 lines, or if
you mention anything about devices being LE or BE or come up with
examples, I am probably not going to read your reply, sorry.

-Christoffer

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-ppc] KVM and variable-endianness guest CPUs
  2014-01-22  8:57               ` [Qemu-devel] " Anup Patel
@ 2014-01-23 23:28                 ` Christoffer Dall
  -1 siblings, 0 replies; 102+ messages in thread
From: Christoffer Dall @ 2014-01-23 23:28 UTC (permalink / raw)
  To: Anup Patel
  Cc: Alexander Graf, Victor Kamensky, Thomas Falcon, kvm-devel,
	QEMU Developers, qemu-ppc, kvmarm

On Wed, Jan 22, 2014 at 02:27:29PM +0530, Anup Patel wrote:

[...]

> 
> Thanks for the info on QEMU side handling of MMIO data.
> 
> I was not aware that we would be only have "target endian = LE"
> for ARM/ARM64 in QEMU. I think Marc Z had mentioned similar
> thing about MMIO this in our previous discussions on his patches.
> (Please refer, http://www.spinics.net/lists/arm-kernel/msg283313.html)
> 
> This clearly means MMIO data passed to user space (QEMU) has
> to of host endianness so that QEMU can take care of bust->device
> endian map.

Hmmm, I'm not sure what you mean exactly, but the fact remains that we
simply need to decide on a layout of mmio.data that (1) doesn't break
existing userspace and (2) is clearly defined for mixed-endianness mmio
use cases.

> 
> Current vcpu_data_guest_to_host() and vcpu_data_host_to_guest()
> does not perform endianness conversion of MMIO data to LE when
> we are running LE guest on BE host so we do need Victor's patch
> for fixing vcpu_data_guest_to_host() and vcpu_data_host_to_guest().
> (Already reported long time back by me,
> http://www.spinics.net/lists/arm-kernel/msg283308.html)
> 

The problem is that we cannot decide what the patch should look like
before the endianness of mmio.data is decided.

Alex, Peter, and I agree that it should be that of the host endianness
and represent what the architecture in question would put on the memory
bus.  In the case of ARM, it's the register value when the VCPU E-bit is
clear, and it's the byteswapped register value when the VCPU E-bit is
set.

Therefore, the patch needs to do an unconditional byteswap when the VCPU
E-bit is set, instead of the beXX_to_cpu and cpu_to_beXX.
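In other words, a rough sketch of the conversion for a 32-bit
read (not the final patch; the helper name is illustrative and
bswap32 is the gcc builtin):

#include <stdint.h>
#include <string.h>

/* mmio.data holds the bus-side value in host kernel byte order.
 * An ARM CPU with the E-bit set byteswaps between the bus and its
 * registers, so the conversion is an unconditional swap on E=1. */
static uint32_t mmio_data_to_guest_reg(const uint8_t *data, int e_bit_set)
{
    uint32_t bus_val;

    memcpy(&bus_val, data, sizeof(bus_val));      /* host-order load */
    return e_bit_set ? __builtin_bswap32(bus_val) /* BE guest */
                     : bus_val;                   /* LE guest */
}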

I'm sending out a patch to clarify the KVM API so we can move on.

-Christoffer

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: KVM and variable-endianness guest CPUs
  2014-01-23 16:25                                 ` [Qemu-devel] " Victor Kamensky
@ 2014-01-24  0:09                                   ` Victor Kamensky
  -1 siblings, 0 replies; 102+ messages in thread
From: Victor Kamensky @ 2014-01-24  0:09 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Christoffer Dall, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm

On 23 January 2014 08:25, Victor Kamensky <victor.kamensky@linaro.org> wrote:

[...]

> That is one of the key disconnects. I'll go find real examples
> in the ARM LE, ARM BE, and PPC BE Linux kernels. Just for
> everybody's sake, here is a summary of the disconnect:
>
> If we have the same h/w connected to the memory bus in ARM
> and PPC systems, and we have the following three pieces
> of code where r0 holds the same register address of the
> same device:
>
> 1. ARM LE word write of  0x04030201:
> setend le
> mov r1, #0x04030201
> str r1, [r0]
>
> 2. ARM BE word write of 0x01020304:
> setend be
> mov r1, #0x01020304
> str r1, [r0]
>
> 3. PPC BE word write of 0x01020304:
> lis     r1,0x102
> ori     r1,r1,0x304
> stw    r1,0(r0)
>
> I claim that the h/w will see the same data on the bus lines in
> all three cases, and that the h/w will act the same in all three
> cases. Peter says that in the ARM BE and PPC BE cases the h/w
> would act differently.
>
> If anyone else can offer an opinion on that while I am looking
> for real examples, that would be great.

Here is my example:

Let's look at the isp1760 USB host controller (effectively the
one used by TC2). The source code is in the
drivers/usb/host/isp1760-hcd.c file. The driver can be enabled
in the kernel by adding the CONFIG_USB_ISP1760_HCD=y config;
I enabled it in the ppc image build, and the arm TC2 config
already has it.

The isp1760 USB host controller registers are in LE format. The
driver uses the reg_write32 function to write the memory-mapped
controller registers, which in turn calls writel, an LE
device word write:

void reg_write32(void __iomem *base, u32 reg, u32 val)
{
    writel(val, base + reg);
}

In C terms, writel is an LE word write function. It is
effectively a memory barrier, a cpu_to_le32, and a write;
cpu_to_le32 does a byteswap only in the BE case.
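Schematically, writel amounts to this on a 32-bit machine (my
own simplification using gcc's predefined endianness macros, not
the kernel's exact definition):

#include <stdint.h>

static inline void my_writel(uint32_t val, volatile uint32_t *addr)
{
    __sync_synchronize();              /* memory barrier */
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    val = __builtin_bswap32(val);      /* cpu_to_le32: swap on BE only */
#endif
    *addr = val;                       /* plain store of the LE value */
}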

LE ARM
------

00002e04 <reg_write32>:
    2e04:       e92d4070        push    {r4, r5, r6, lr}
    2e08:       e1a04000        mov     r4, r0
    2e0c:       e1a05001        mov     r5, r1
    2e10:       e1a06002        mov     r6, r2
    2e14:       f57ff04e        dsb     st
    2e18:       e59f3018        ldr     r3, [pc, #24]   ; 2e38
<reg_write32+0x34>
    2e1c:       e5933018        ldr     r3, [r3, #24]
    2e20:       e3530000        cmp     r3, #0
    2e24:       0a000000        beq     2e2c <reg_write32+0x28>
    2e28:       e12fff33        blx     r3
    2e2c:       e0841005        add     r1, r4, r5
    2e30:       e5816000        str     r6, [r1] @ <-------
    2e34:       e8bd8070        pop     {r4, r5, r6, pc}
    2e38:       00000000        .word   0x00000000

Operates in LE: it just writes the value to the device register
memory location.

BE ARM
------

00000590 <reg_write32>:
     590:       e92d4070        push    {r4, r5, r6, lr}
     594:       e1a04000        mov     r4, r0
     598:       e1a05001        mov     r5, r1
     59c:       e1a06002        mov     r6, r2
     5a0:       f57ff04e        dsb     st
     5a4:       e59f301c        ldr     r3, [pc, #28]   ; 5c8 <reg_write32+0x38>
     5a8:       e5933018        ldr     r3, [r3, #24]
     5ac:       e3530000        cmp     r3, #0
     5b0:       0a000000        beq     5b8 <reg_write32+0x28>
     5b4:       e12fff33        blx     r3
     5b8:       e6bf2f36        rev     r2, r6   @ <-------
     5bc:       e0841005        add     r1, r4, r5
     5c0:       e5812000        str     r2, [r1] @ <-------
     5c4:       e8bd8070        pop     {r4, r5, r6, pc}
     5c8:       00000000        .word   0x00000000

Operates in BE: it byteswaps first, then writes the value to the
device memory location.

BE PPC
------

00003070 <reg_write32>:
    3070:       7c 08 02 a6     mflr    r0
    3074:       90 01 00 04     stw     r0,4(r1)
    3078:       48 00 00 01     bl      3078 <reg_write32+0x8>
    307c:       94 21 ff f0     stwu    r1,-16(r1)
    3080:       7c 08 02 a6     mflr    r0
    3084:       90 01 00 14     stw     r0,20(r1)
    3088:       7c 83 22 14     add     r4,r3,r4
    308c:       7c 00 04 ac     sync
    3090:       7c a0 25 2c     stwbrx  r5,0,r4 @ <-------
    3094:       80 01 00 14     lwz     r0,20(r1)
    3098:       38 21 00 10     addi    r1,r1,16
    309c:       7c 08 03 a6     mtlr    r0
    30a0:       4e 80 00 20     blr

stwbrx is the 'Store Word Byte-Reverse Indexed' instruction:
basically, byteswap the word and store it (r3 is the base, r4
is the reg offset, r5 is the value).

Operates in BE: it byteswaps and stores the value, but now it is
done in one instruction. It is exactly what the ARM BE
code does.

So if one calls reg_write32(0x1000, 0, 0x04030201), it works the
same way as the earlier cited sequences: the ARM LE case uses
0x04030201 and writes it out, while BE ARM and BE PPC get the
byteswapped value bswap(0x04030201) = 0x01020304 and write that
out.

The device sees the same thing in all three cases; that is how a
BE-8 memory bus works.

Thanks,
Victor

> Thanks,
> Victor
>
>> thanks
>> -- PMM

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: KVM and variable-endianness guest CPUs
  2014-01-23 20:45                                   ` [Qemu-devel] " Christoffer Dall
@ 2014-01-24  0:50                                     ` Victor Kamensky
  -1 siblings, 0 replies; 102+ messages in thread
From: Victor Kamensky @ 2014-01-24  0:50 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Peter Maydell, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm

On 23 January 2014 12:45, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> On Thu, Jan 23, 2014 at 08:25:35AM -0800, Victor Kamensky wrote:
>> On 23 January 2014 07:33, Peter Maydell <peter.maydell@linaro.org> wrote:
>> > On 23 January 2014 15:06, Victor Kamensky <victor.kamensky@linaro.org> wrote:
>> >> In [1] I wrote
>> >>
>> >> "I don't see why you so attached to desire to describe
>> >> data part of memory transaction as just one of int
>> >> types. If we are talking about bunch of hypothetical
>> >> cases imagine such bus that allow transaction with
>> >> size of 6 bytes. How do you describe such data in
>> >> your ints speak? What endianity you can assign to
>> >> sequence of 6 bytes? While note that description of
>> >> such transaction as set of 6 byte values at address
>> >> $whatever makes perfect sense."
>> >>
>> >> But notice that in your next reply [2] you just dropped it
>> >
>> > Yes. This is because it was one of the places where
>> > I would have just had to repeat "no, I'm afraid you're wrong
>> > about how hardware works". I think in general it's going
>> > to be better if I don't try to reply point by point to this
>> > email; I think you should go back and reread the emails I've
>> > sent. Key points:
>> >  (1) hardware is not doing anything involving arrays
>> >      of bytes
>>
>> An array of bytes or integers is just a way to describe the data
>> lines on the bus. Did you look at this document?
>>
>> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0290g/ch06s05s01.html
>>
>> The A0, A1, ..., A7 byte values are the same for both the LE and
>> BE-8 cases (the first two columns in the table), and they
>> unambiguously describe the data bus signals.
>>
>
> The point is simple, and Peter has made it over and over:
> Any consumer of a memory operation sees "value, len, address".

and "endianess" of operation.

here is memory operation

*(int *) (0x1000) = 0x01020304;

can you tell how memory will look like at 0x1000 address - you can't
in LE it will look one way in BE byteswapped.

> This is what KVM_EXIT_MMIO emulates.  So just by knowing the ABI
> definition and having a pointer to the structure you need to be able to
> tell me "value, len, address".
>
>> >  (2) the API between kernel and userspace needs to define
>> >      the semantics of mmio.data, ie how to map between
>> >      "x byte wide transaction with value v" and the array,
>> >      and that is primarily what this conversation is about
>> >  (3) the only choice which is both (a) sensible and (b)
>> >      not breaking existing usage is to say "the array is
>> >      in host-kernel-byte-order"
>> >  (4) PPC CPUs in BE mode and ARM CPUs in BE mode are not
>> >      the same, because in the ARM case it is doing an
>> >      internal-to-CPU byteswap, and in the PPC case it is not
>>
>> That is one of the key disconnects. I'll go find real examples
>> in the ARM LE, ARM BE, and PPC BE Linux kernels. Just for
>> everybody's sake, here is a summary of the disconnect:
>>
>> If we have the same h/w connected to the memory bus in ARM
>> and PPC systems, and we have the following three pieces
>> of code where r0 holds the same register address of the
>> same device:
>>
>> 1. ARM LE word write of  0x04030201:
>> setend le
>> mov r1, #0x04030201
>> str r1, [r0]
>>
>> 2. ARM BE word write of 0x01020304:
>> setend be
>> mov r1, #0x01020304
>> str r1, [r0]
>>
>> 3. PPC BE word write of 0x01020304:
>> lis     r1,0x102
>> ori     r1,r1,0x304
>> stw    r1,0(r0)
>>
>> I claim that the h/w will see the same data on the bus lines in
>> all three cases, and that the h/w will act the same in all three
>> cases. Peter says that in the ARM BE and PPC BE cases the h/w
>> would act differently.
>>
>> If anyone else can offer an opinion on that while I am looking
>> for real examples, that would be great.
>>
>
> I really don't think listing all these examples helps.

I think Peter is wrong in his understanding of how real BE PPC
kernel drivers work with h/w-mapped devices. Going from such a
misunderstanding to a suggestion of how the info should be held
in the emulated mmio case is quite strange.

> You need to focus
> on the key points that Peter listed in his previous mail.
>
> I tried to ask you this question in our chat:
>
> vcpu_data_host_to_guest() is handling a read from an emulated device.
> All the info you have is:
> (1) len of memory access
> (2) mmio.data pointer
> (3) destination register
> (4) host CPU endianness
> (5) guest CPU endianness
>
> Based on this information alone, you need to decide whether you do a
> byteswap or not before loading the hardware register upon returning to
> the guest.
>
> You will find it impossible to answer, because you don't know the layout
> of mmio.data, and that is the thing we are trying to solve.

Actually I am not arguing with the above. I agree that the
meaning of mmio.data should be better clarified.

I propose to clarify it as an array of bytes at the phys_addr
address on a BE-8, byte-invariant, memory bus. That
unambiguously describes the data bus signals in the case of a
BE-8 memory bus. Please look at

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0290g/ch06s05s01.html

at the first two columns, LE and BE-8. If one specifies all the
values for A0, A1, ... A7, that defines all the bus signals.
Note that this is the only endian-agnostic way to describe the
data bus signals. If one tried to describe them as half-word[s],
word[s], or a double word, one would need to say what the
endianness of those integers is (the other columns in the
document's table).

Peter claims that I don't understand how the h/w bus works. I
disagree with that. I gave a pointer to a document that
describes how a BE-8, byte-invariant memory bus works. I would
appreciate a pointer to the document, section, and page that
describes Peter's understanding of memory bus operation.

I pointed out that Peter's proposal would have the following
issue: a BE qemu will have to act differently depending on the
CPU type while emulating the same device. If Peter's proposal is
accepted, qemu code would have to do something like:

#ifdef WORDS_BIGENDIAN
#ifdef __PPC__
   do one thing
#else
   do another
#endif
#endif

The reason is that the same device write will look like this in
mmio:

ARM LE mmio.data[] = { 0x01, 0x02, 0x03, 0x04 }
ARM BE mmio.data[] = { 0x04, 0x03, 0x02, 0x01 }
PPC LE mmio.data[] = { 0x04, 0x03, 0x02, 0x01 }
PPC BE mmio.data[] = { 0x01, 0x02, 0x03, 0x04 }

For ARM BE and PPC BE the arrays are different even though both
are BE cases, so the code would need the '#if ARM' thing.

If you follow my proposal to clarify the meaning of mmio.data[],
the mmio.data[] array will look the same in all 4 cases,
compatible with current usage.

If Peter's proposal is adopted, the ARM BE and PPC LE cases
would be penalized with excessive back-and-forth byteswaps. That
is possible to avoid with my proposal.
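And under the byte-array reading, the device emulation side
needs no CPU-type conditionals at all. Here is an illustrative
sketch (mine, not qemu code) for a device with LE registers such
as the isp1760 from my earlier mail:

#include <stdint.h>

/* Reassemble the 32-bit register value an LE-register device
 * sees from the bus bytes; the same code on any host, for any
 * guest CPU type. */
static uint32_t dev_reg_from_bus(const uint8_t data[4])
{
    return (uint32_t)data[0]
         | (uint32_t)data[1] << 8
         | (uint32_t)data[2] << 16
         | (uint32_t)data[3] << 24;
}

For mmio.data[] = {0x1, 0x2, 0x3, 0x4} this returns 0x04030201,
the value the driver passed to reg_write32(), in all four cases.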

Thanks,
Victor

> If you cannot reply to this point in less than 50 lines or mention
> anything about devices being LE or BE or come with examples, I am
> probably not going to read your reply, sorry.
>
> -Christoffer

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] KVM and variable-endianness guest CPUs
@ 2014-01-24  0:50                                     ` Victor Kamensky
  0 siblings, 0 replies; 102+ messages in thread
From: Victor Kamensky @ 2014-01-24  0:50 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Peter Maydell, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm

On 23 January 2014 12:45, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> On Thu, Jan 23, 2014 at 08:25:35AM -0800, Victor Kamensky wrote:
>> On 23 January 2014 07:33, Peter Maydell <peter.maydell@linaro.org> wrote:
>> > On 23 January 2014 15:06, Victor Kamensky <victor.kamensky@linaro.org> wrote:
>> >> In [1] I wrote
>> >>
>> >> "I don't see why you so attached to desire to describe
>> >> data part of memory transaction as just one of int
>> >> types. If we are talking about bunch of hypothetical
>> >> cases imagine such bus that allow transaction with
>> >> size of 6 bytes. How do you describe such data in
>> >> your ints speak? What endianity you can assign to
>> >> sequence of 6 bytes? While note that description of
>> >> such transaction as set of 6 byte values at address
>> >> $whatever makes perfect sense."
>> >>
>> >> But notice that in your next reply [2] you just dropped it
>> >
>> > Yes. This is because it was one of the places where
>> > I would have just had to repeat "no, I'm afraid you're wrong
>> > about how hardware works". I think in general it's going
>> > to be better if I don't try to reply point by point to this
>> > email; I think you should go back and reread the emails I've
>> > sent. Key points:
>> >  (1) hardware is not doing anything involving arrays
>> >      of bytes
>>
>> Array of bytes or integers is just a way to describe data lines
>> on the bus. Did you look at this document?
>>
>> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0290g/ch06s05s01.html
>>
>> A0, A1, ,,, A7 byte values are the same for both LE and BE-8
>> case (first two columns in the table) and they unambiguously
>> describe data bus signals
>>
>
> The point is simple, and Peter has made it over and over:
> Any consumer of a memory operation sees "value, len, address".

and "endianess" of operation.

here is memory operation

*(int *) (0x1000) = 0x01020304;

can you tell how memory will look like at 0x1000 address - you can't
in LE it will look one way in BE byteswapped.

> This is what KVM_EXIT_MMIO emulates.  So just by knowing the ABI
> definition and having a pointer to the structure you need to be able to
> tell me "value, len, address".
>
>> >  (2) the API between kernel and userspace needs to define
>> >      the semantics of mmio.data, ie how to map between
>> >      "x byte wide transaction with value v" and the array,
>> >      and that is primarily what this conversation is about
>> >  (3) the only choice which is both (a) sensible and (b)
>> >      not breaking existing usage is to say "the array is
>> >      in host-kernel-byte-order"
>> >  (4) PPC CPUs in BE mode and ARM CPUs in BE mode are not
>> >      the same, because in the ARM case it is doing an
>> >      internal-to-CPU byteswap, and in the PPC case it is not
>>
>> That is one of the key disconnects. I'll go find real examples
>> in ARM LE, ARM BE, and PPC BE Linux kernel. Just for
>> everybody sake's here is summary of the disconnect:
>>
>> If we have the same h/w connected to memory bus in ARM
>> and PPC systems and we have the following three pieces
>> of code that work with r0 having same device same
>> register address:
>>
>> 1. ARM LE word write of  0x04030201:
>> setend le
>> mov r1, #0x04030201
>> str r1, [r0]
>>
>> 2. ARM BE word write of 0x01020304:
>> setend be
>> mov r1, #0x01020304
>> str r1, [r0]
>>
>> 3. PPC BE word write of 0x01020304:
>> lis     r1,0x102
>> ori     r1,r1,0x304
>> stw    r1,0(r0)
>>
>> I claim that h/w will see the same data on bus lines in all
>> three cases, and h/w would acts the same in all three
>> cases. Peter says that ARM BE and PPC BE case h/w
>> would act differently.
>>
>> If anyone else can offer opinion on that while I am looking
>> for real examples that would be great.
>>
>
> I really don't think listing all these examples help.

I think Peter is wrong in his understanding how real
BE PPC kernel drivers work with h/w mapped devices. Going
with such misunderstanding to suggest how it should hold
info in emulated mmio case is quite strange.

> You need to focus
> on the key points that Peter listed in his previous mail.
>
> I tried in our chat to ask you this questions:
>
> vcpu_data_host_to_guest() is handling a read from an emulated device.
> All the info you have is:
> (1) len of memory access
> (2) mmio.data pointer
> (3) destination register
> (4) host CPU endianness
> (5) guest CPU endianness
>
> Based on this information alone, you need to decide whether you do a
> byteswap or not before loading the hardware register upon returning to
> the guest.
>
> You will find it impossible to answer, because you don't know the layout
> of mmio.data, and that is the thing we are trying to solve.

Actually I am not arguing with above. I agree that
meaning of mmio.data should be better clarified.

I propose my clarification as array of bytes at
phys_addr address on BE-8,
byte invariant, memory bus. That unambiguously
describes data bus signals in case of BE-8 memory
bus. Please look at

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0290g/ch06s05s01.html

first two columns LE and BE-8, If one will specify all
values for A0, A1, ... A7 it will define all bus signals.
Note that is the only endian agnostic way to describe data
bus signals. If one would try to describe them in
half-word[s], word[s], double word one need to tell what
endianity of those integers (other columns in document
table).

Peter claims that "I don't understand how h/w bus works".
I disagree with that. I gave pointer on document that describes
how BE-8, byte invariant, memory bus works. I would
appreciate pointer to document, section and page that
describes Peter's memory bus operation understanding.

I pointed out that Peter's proposal would have the following issue:
a BE qemu will have to act differently depending on the CPU
type while emulating the same device. If Peter's proposal is
accepted, qemu code would have to do something like:

#ifdef WORD_BIGENDIAN
#ifdef __PPC__
   do one thing
#else
  do another
#endif
#endif

The reason is that the same device write in mmio
will look like this:

ARM LE mmio.data[] = { 0x01, 0x02, 0x03, 0x04 }
ARM BE mmio.data[] = { 0x04, 0x03, 0x02, 0x01 }
PPC LE mmio.data[] = { 0x04, 0x03, 0x02, 0x01 }
PPC BE mmio.data[] = { 0x01, 0x02, 0x03, 0x04 }

For ARM BE and PPC BE the arrays are different even
though both are BE cases, so the code would need an
'#if ARM' kind of thing.

If you follow my proposal for clarifying the meaning of
mmio.data[], the mmio.data[] array will look the same in
all 4 cases, compatible with current usage.

If Peter's proposal is adopted, the ARM BE and PPC LE cases
would be penalized with excessive back-and-forth byteswaps.
That is possible to avoid with my proposal.
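
To make the difference concrete, here is a rough C sketch
(untested; the function names are mine) of how a 32-bit guest
store could be packed into mmio.data[] under the two proposals:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* My proposal: data[] holds the bytes in increasing bus-address
 * order, i.e. exactly the bytes the guest emits on a byte-invariant
 * bus; only the endianness of the access itself matters. */
void pack_bus_order(uint8_t data[4], uint32_t val, bool guest_is_be)
{
    for (int i = 0; i < 4; i++) {
        int shift = guest_is_be ? 24 - 8 * i : 8 * i;
        data[i] = (val >> shift) & 0xff;
    }
}

/* Peter's proposal: data[] holds the value in host-kernel byte
 * order, so the same access produces different arrays on LE and
 * BE hosts. */
void pack_host_order(uint8_t data[4], uint32_t val)
{
    memcpy(data, &val, 4);
}

With pack_bus_order() the array depends only on the guest's access;
with pack_host_order() it also depends on the host, which is the
dependency the table above complains about.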

Thanks,
Victor

> If you cannot reply to this point in less than 50 lines or mention
> anything about devices being LE or BE or come with examples, I am
> probably not going to read your reply, sorry.
>
> -Christoffer

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: KVM and variable-endianness guest CPUs
  2014-01-24  0:50                                     ` [Qemu-devel] " Victor Kamensky
@ 2014-01-24  2:14                                       ` Christoffer Dall
  -1 siblings, 0 replies; 102+ messages in thread
From: Christoffer Dall @ 2014-01-24  2:14 UTC (permalink / raw)
  To: Victor Kamensky
  Cc: Peter Maydell, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm

On Thu, Jan 23, 2014 at 04:50:18PM -0800, Victor Kamensky wrote:
> On 23 January 2014 12:45, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> > On Thu, Jan 23, 2014 at 08:25:35AM -0800, Victor Kamensky wrote:
> >> On 23 January 2014 07:33, Peter Maydell <peter.maydell@linaro.org> wrote:
> >> > On 23 January 2014 15:06, Victor Kamensky <victor.kamensky@linaro.org> wrote:
> >> >> In [1] I wrote
> >> >>
> >> >> "I don't see why you so attached to desire to describe
> >> >> data part of memory transaction as just one of int
> >> >> types. If we are talking about bunch of hypothetical
> >> >> cases imagine such bus that allow transaction with
> >> >> size of 6 bytes. How do you describe such data in
> >> >> your ints speak? What endianity you can assign to
> >> >> sequence of 6 bytes? While note that description of
> >> >> such transaction as set of 6 byte values at address
> >> >> $whatever makes perfect sense."
> >> >>
> >> >> But notice that in your next reply [2] you just dropped it
> >> >
> >> > Yes. This is because it was one of the places where
> >> > I would have just had to repeat "no, I'm afraid you're wrong
> >> > about how hardware works". I think in general it's going
> >> > to be better if I don't try to reply point by point to this
> >> > email; I think you should go back and reread the emails I've
> >> > sent. Key points:
> >> >  (1) hardware is not doing anything involving arrays
> >> >      of bytes
> >>
> >> Array of bytes or integers is just a way to describe data lines
> >> on the bus. Did you look at this document?
> >>
> >> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0290g/ch06s05s01.html
> >>
> >> A0, A1, ,,, A7 byte values are the same for both LE and BE-8
> >> case (first two columns in the table) and they unambiguously
> >> describe data bus signals
> >>
> >
> > The point is simple, and Peter has made it over and over:
> > Any consumer of a memory operation sees "value, len, address".
> 
> and "endianess" of operation.

no, value is value, is value.  By a consumer I mean whatever sits at
the end of the memory bus.  There is no endianness.

> 
> here is memory operation
> 
> *(int *) (0x1000) = 0x01020304;

this is from the CPU's perspective and involves specifics of a
programming language and a compiler.  You cannot compare it to the above.

> 
> can you tell how memory will look like at 0x1000 address - you can't
> in LE it will look one way in BE byteswapped.
> 
> > This is what KVM_EXIT_MMIO emulates.  So just by knowing the ABI
> > definition and having a pointer to the structure you need to be able to
> > tell me "value, len, address".
> >
> >> >  (2) the API between kernel and userspace needs to define
> >> >      the semantics of mmio.data, ie how to map between
> >> >      "x byte wide transaction with value v" and the array,
> >> >      and that is primarily what this conversation is about
> >> >  (3) the only choice which is both (a) sensible and (b)
> >> >      not breaking existing usage is to say "the array is
> >> >      in host-kernel-byte-order"
> >> >  (4) PPC CPUs in BE mode and ARM CPUs in BE mode are not
> >> >      the same, because in the ARM case it is doing an
> >> >      internal-to-CPU byteswap, and in the PPC case it is not
> >>
> >> That is one of the key disconnects. I'll go find real examples
> >> in ARM LE, ARM BE, and PPC BE Linux kernel. Just for
> >> everybody sake's here is summary of the disconnect:
> >>
> >> If we have the same h/w connected to memory bus in ARM
> >> and PPC systems and we have the following three pieces
> >> of code that work with r0 having same device same
> >> register address:
> >>
> >> 1. ARM LE word write of  0x04030201:
> >> setend le
> >> mov r1, #0x04030201
> >> str r1, [r0]
> >>
> >> 2. ARM BE word write of 0x01020304:
> >> setend be
> >> mov r1, #0x01020304
> >> str r1, [r0]
> >>
> >> 3. PPC BE word write of 0x01020304:
> >> lis     r1,0x102
> >> ori     r1,r1,0x304
> >> stw    r1,0(r0)
> >>
> >> I claim that h/w will see the same data on bus lines in all
> >> three cases, and h/w would acts the same in all three
> >> cases. Peter says that ARM BE and PPC BE case h/w
> >> would act differently.
> >>
> >> If anyone else can offer opinion on that while I am looking
> >> for real examples that would be great.
> >>
> >
> > I really don't think listing all these examples help.
> 
> I think Peter is wrong in his understanding how real
> BE PPC kernel drivers work with h/w mapped devices. Going
> with such misunderstanding to suggest how it should hold
> info in emulated mmio case is quite strange.
> 
> > You need to focus
> > on the key points that Peter listed in his previous mail.
> >
> > I tried in our chat to ask you this questions:
> >
> > vcpu_data_host_to_guest() is handling a read from an emulated device.
> > All the info you have is:
> > (1) len of memory access
> > (2) mmio.data pointer
> > (3) destination register
> > (4) host CPU endianness
> > (5) guest CPU endianness
> >
> > Based on this information alone, you need to decide whether you do a
> > byteswap or not before loading the hardware register upon returning to
> > the guest.
> >
> > You will find it impossible to answer, because you don't know the layout
> > of mmio.data, and that is the thing we are trying to solve.
> 
> Actually I am not arguing with above. I agree that
> meaning of mmio.data should be better clarified.

Progress,  \o/

> 
> I propose my clarification as array of bytes at
> phys_addr address on BE-8,
> byte invariant, memory bus. 
> That unambiguously
> describes data bus signals in case of BE-8 memory
> bus. Please look at
> 
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0290g/ch06s05s01.html
> 
> first two columns LE and BE-8, If one will specify all
> values for A0, A1, ... A7 it will define all bus signals.
> Note that is the only endian agnostic way to describe data
> bus signals. If one would try to describe them in
> half-word[s], word[s], double word one need to tell what
> endianity of those integers (other columns in document
> table).

This is a reference to an arm1136 processor specification, which does
not support virtualization.  We are discussing generic kernel
interface documentation; I'm afraid it's not useful.

> 
> Peter claims that "I don't understand how h/w bus works".
> I disagree with that. I gave pointer on document that describes
> how BE-8, byte invariant, memory bus works. I would
> appreciate pointer to document, section and page that
> describes Peter's memory bus operation understanding.

I choose to trust that Peter understands very well how a h/w bus works.
I am not sure the documentation you request exists in public.

I, however, understand how KVM works, and I understand how the
kernel<->user ABI works, and you still haven't been able to express your
proposal in a concise, understandable, generic manner that works for a
KVM ABI specification, unfortunately.

> 
> I pointed that Peter's proposal would have the following issue:
> BE qemu will have to act differently depending on CPU
> type while emulating the same device. If Peter's proposal is
> accepted, qemu code would have to do something like:
> 
> #ifdef WORD_BIGENDIAN
> #ifdef __PPC__
>    do one thing
> #else
>   do another
> #endif
> #endif

No, your device has an operation: write32(u32 val)

That's it. Peter suggests having proper device containers in QEMU that
translate from bus-native endianness to the device's native endianness.
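
In QEMU's memory API that looks roughly like this (a sketch; the
device name and handler bodies are made up, but the .endianness
field is the real mechanism):

#include "exec/memory.h"

static uint64_t mydev_read(void *opaque, hwaddr addr, unsigned size)
{
    /* The memory core has already translated from bus order;
     * we just return the register value. */
    return 0;
}

static void mydev_write(void *opaque, hwaddr addr, uint64_t val,
                        unsigned size)
{
    /* 'val' arrives as a plain value; no byteswaps in the model. */
}

static const MemoryRegionOps mydev_ops = {
    .read  = mydev_read,
    .write = mydev_write,
    .endianness = DEVICE_LITTLE_ENDIAN,  /* the device's own order */
};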

> 
> there reason for that because the same device write in mmio
> will look like this:
> 
> ARM LE mmio.data[] = { 0x01, 0x02, 0x03, 0x04 }
> ARM BE mmio.data[] = { 0x04, 0x03, 0x02, 0x01 }
> PPC LE mmio.data[] = { 0x04, 0x03, 0x02, 0x01 }
> PPC BE mmio.data[] = { 0x01, 0x02, 0x03, 0x04 }
> 
> for ARM BE and PPC BE arrays are different even
> it is just BE case, so code would need to '#if ARM'
> thing

I don't understand this, sorry.

> 
> If you follow my proposal to clarify mmio.data[] meaning
> mmio.data[] array will look the same in all 4 cases,
> compatible with current usage.

I still don't know what your proposal is; "array will look the same in
all 4 cases" is not a definition that I can use to interpret the data
written.

The semantics you need to be able to describe are those of the memory
operation: for example, store a word, not store an array of bytes.
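
For example, with the host-kernel-byte-order definition from point
(3), userspace recovers a stored word with nothing more than this
(a sketch):

#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/kvm.h>

/* Recover a 32-bit MMIO store under the host-byte-order reading. */
static uint32_t mmio_word(const struct kvm_run *run)
{
    uint32_t val;

    assert(run->mmio.is_write && run->mmio.len == sizeof(val));
    memcpy(&val, run->mmio.data, sizeof(val)); /* host byte order */
    return val;
}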

> 
> If Peter's proposal is adopted ARM BE and PPC LE cases
> would be penalized with excessive back and forth
> byteswaps. That is possible to avoid with my proposal.
> 

I would take 50 byteswaps with a clear ABI any day over an obscure
standard that can avoid a single hardware-on-register instruction.  This
is about designing a clean software interface, not about building an
optimized integrated stack.

Unfortunately, this is going nowhere, so I think we need to stop this
thread.  As you can see I have sent a patch as a clarification to the
ABI; if it's merged we can move on to more important tasks.

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: KVM and variable-endianness guest CPUs
  2014-01-24  2:14                                       ` [Qemu-devel] " Christoffer Dall
@ 2014-01-24  4:11                                         ` Victor Kamensky
  -1 siblings, 0 replies; 102+ messages in thread
From: Victor Kamensky @ 2014-01-24  4:11 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Peter Maydell, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm

On 23 January 2014 18:14, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> On Thu, Jan 23, 2014 at 04:50:18PM -0800, Victor Kamensky wrote:
>> On 23 January 2014 12:45, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>> > On Thu, Jan 23, 2014 at 08:25:35AM -0800, Victor Kamensky wrote:
>> >> On 23 January 2014 07:33, Peter Maydell <peter.maydell@linaro.org> wrote:
>> >> > On 23 January 2014 15:06, Victor Kamensky <victor.kamensky@linaro.org> wrote:
>> >> >> In [1] I wrote
>> >> >>
>> >> >> "I don't see why you so attached to desire to describe
>> >> >> data part of memory transaction as just one of int
>> >> >> types. If we are talking about bunch of hypothetical
>> >> >> cases imagine such bus that allow transaction with
>> >> >> size of 6 bytes. How do you describe such data in
>> >> >> your ints speak? What endianity you can assign to
>> >> >> sequence of 6 bytes? While note that description of
>> >> >> such transaction as set of 6 byte values at address
>> >> >> $whatever makes perfect sense."
>> >> >>
>> >> >> But notice that in your next reply [2] you just dropped it
>> >> >
>> >> > Yes. This is because it was one of the places where
>> >> > I would have just had to repeat "no, I'm afraid you're wrong
>> >> > about how hardware works". I think in general it's going
>> >> > to be better if I don't try to reply point by point to this
>> >> > email; I think you should go back and reread the emails I've
>> >> > sent. Key points:
>> >> >  (1) hardware is not doing anything involving arrays
>> >> >      of bytes
>> >>
>> >> Array of bytes or integers is just a way to describe data lines
>> >> on the bus. Did you look at this document?
>> >>
>> >> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0290g/ch06s05s01.html
>> >>
>> >> A0, A1, ,,, A7 byte values are the same for both LE and BE-8
>> >> case (first two columns in the table) and they unambiguously
>> >> describe data bus signals
>> >>
>> >
>> > The point is simple, and Peter has made it over and over:
>> > Any consumer of a memory operation sees "value, len, address".
>>
>> and "endianess" of operation.
>
> no, value is value, is value.  By a consumer I mean whatever sits at
> the end of the memory bus.  There is no endianness.
>
>>
>> here is memory operation
>>
>> *(int *) (0x1000) = 0x01020304;
>
> this is from the CPU's perspective and involves specifics of a
> programming language and a compiler.  You cannot compare it to the above.

Compare it with a description of memory like this:
unsigned char mem[] = {0x4, 0x3, 0x2, 0x1};
That is the same memory content from *anyone's* perspective:
at address mem we have 0x4, at address mem+1 we have 0x3, etc.

>>
>> can you tell how memory will look like at 0x1000 address - you can't
>> in LE it will look one way in BE byteswapped.
>>
>> > This is what KVM_EXIT_MMIO emulates.  So just by knowing the ABI
>> > definition and having a pointer to the structure you need to be able to
>> > tell me "value, len, address".
>> >
>> >> >  (2) the API between kernel and userspace needs to define
>> >> >      the semantics of mmio.data, ie how to map between
>> >> >      "x byte wide transaction with value v" and the array,
>> >> >      and that is primarily what this conversation is about
>> >> >  (3) the only choice which is both (a) sensible and (b)
>> >> >      not breaking existing usage is to say "the array is
>> >> >      in host-kernel-byte-order"
>> >> >  (4) PPC CPUs in BE mode and ARM CPUs in BE mode are not
>> >> >      the same, because in the ARM case it is doing an
>> >> >      internal-to-CPU byteswap, and in the PPC case it is not
>> >>
>> >> That is one of the key disconnects. I'll go find real examples
>> >> in ARM LE, ARM BE, and PPC BE Linux kernel. Just for
>> >> everybody sake's here is summary of the disconnect:
>> >>
>> >> If we have the same h/w connected to memory bus in ARM
>> >> and PPC systems and we have the following three pieces
>> >> of code that work with r0 having same device same
>> >> register address:
>> >>
>> >> 1. ARM LE word write of  0x04030201:
>> >> setend le
>> >> mov r1, #0x04030201
>> >> str r1, [r0]
>> >>
>> >> 2. ARM BE word write of 0x01020304:
>> >> setend be
>> >> mov r1, #0x01020304
>> >> str r1, [r0]
>> >>
>> >> 3. PPC BE word write of 0x01020304:
>> >> lis     r1,0x102
>> >> ori     r1,r1,0x304
>> >> stw    r1,0(r0)
>> >>
>> >> I claim that h/w will see the same data on bus lines in all
>> >> three cases, and h/w would acts the same in all three
>> >> cases. Peter says that ARM BE and PPC BE case h/w
>> >> would act differently.
>> >>
>> >> If anyone else can offer opinion on that while I am looking
>> >> for real examples that would be great.
>> >>
>> >
>> > I really don't think listing all these examples help.
>>
>> I think Peter is wrong in his understanding how real
>> BE PPC kernel drivers work with h/w mapped devices. Going
>> with such misunderstanding to suggest how it should hold
>> info in emulated mmio case is quite strange.
>>
>> > You need to focus
>> > on the key points that Peter listed in his previous mail.
>> >
>> > I tried in our chat to ask you this questions:
>> >
>> > vcpu_data_host_to_guest() is handling a read from an emulated device.
>> > All the info you have is:
>> > (1) len of memory access
>> > (2) mmio.data pointer
>> > (3) destination register
>> > (4) host CPU endianness
>> > (5) guest CPU endianness
>> >
>> > Based on this information alone, you need to decide whether you do a
>> > byteswap or not before loading the hardware register upon returning to
>> > the guest.
>> >
>> > You will find it impossible to answer, because you don't know the layout
>> > of mmio.data, and that is the thing we are trying to solve.
>>
>> Actually I am not arguing with above. I agree that
>> meaning of mmio.data should be better clarified.
>
> Progress,  \o/
>
>>
>> I propose my clarification as array of bytes at
>> phys_addr address on BE-8,
>> byte invariant, memory bus.
>> That unambiguously
>> describes data bus signals in case of BE-8 memory
>> bus. Please look at
>>
>> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0290g/ch06s05s01.html
>>
>> first two columns LE and BE-8, If one will specify all
>> values for A0, A1, ... A7 it will define all bus signals.
>> Note that is the only endian agnostic way to describe data
>> bus signals. If one would try to describe them in
>> half-word[s], word[s], double word one need to tell what
>> endianity of those integers (other columns in document
>> table).
>
> This is a reference to an arm1136 processor specification, which does
> not support virtualization.  We are discussing a generic kernel
> interface documentation, I'm afraid it's not useful.

The reason the BE-8 vs BE-32 explanation is in this CPU spec
is that this CPU was transitional, moving from a BE-32 to a
BE-8 memory bus, and it has a bit that specifies whether the
CPU works with memory in BE-8 or BE-32 mode. Since then all
ARM CPUs work in BE-8; when they dropped the switch bit they
dropped this section. Check the CONFIG_CPU_ENDIAN_BE8 kernel
config. In fact all modern CPUs that support both endiannesses
work with the memory bus in BE-8 mode.

>>
>> Peter claims that "I don't understand how h/w bus works".
>> I disagree with that. I gave pointer on document that describes
>> how BE-8, byte invariant, memory bus works. I would
>> appreciate pointer to document, section and page that
>> describes Peter's memory bus operation understanding.
>
> I choose to trust that Peter understands very well how a h/w bus works.
> I am not sure the documentation you request exists in public.
>
> I, however, understand how KVM works, and I understand how the
> kernel<->user ABI works, and you still haven't been able to express your
> proposal in a concise, understandable, generic manner that works for a
> KVM ABI specification, unfortunately.
>
>>
>> I pointed that Peter's proposal would have the following issue:
>> BE qemu will have to act differently depending on CPU
>> type while emulating the same device. If Peter's proposal is
>> accepted, qemu code would have to do something like:
>>
>> #ifdef WORD_BIGENDIAN
>> #ifdef __PPC__
>>    do one thing
>> #else
>>   do another
>> #endif
>> #endif
>
> No, your device has an operation: write32(u32 val)
>
> That's it, Peter suggests having proper device containers in QEMU that
> translate from bus native endianness to the device native endianness.
>
>>
>> there reason for that because the same device write in mmio
>> will look like this:
>>
>> ARM LE mmio.data[] = { 0x01, 0x02, 0x03, 0x04 }
>> ARM BE mmio.data[] = { 0x04, 0x03, 0x02, 0x01 }
>> PPC LE mmio.data[] = { 0x04, 0x03, 0x02, 0x01 }
>> PPC BE mmio.data[] = { 0x01, 0x02, 0x03, 0x04 }
>>
>> for ARM BE and PPC BE arrays are different even
>> it is just BE case, so code would need to '#if ARM'
>> thing
>
> I don't understand this, sorry.

In the above example, suppose that in the ARM guest case we had a
guest little-endian write of 0x04030201 to a device address emulated
on an ARM LE host/qemu. In this case the content of the mmio.data[]
array looks like this:
  mmio.data[] = { 0x01, 0x02, 0x03, 0x04 }
That is the current case; nothing we can do about it. When the same
guest runs on an ARM BE KVM host/qemu, executes exactly the same
access as above, and we follow the clarification you just merged,
the mmio.data[] array will look like this:
  mmio.data[] = { 0x04, 0x03, 0x02, 0x01 }
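
In code terms (a sketch; this assumes the kernel packs the value in
host byte order, which is how I read the merged text):

#include <stdint.h>
#include <string.h>

static void pack_host_order(uint8_t data[4], uint32_t val)
{
    memcpy(data, &val, 4);
    /* val = 0x04030201 on an LE host: data[] = { 0x01, 0x02, 0x03, 0x04 }
     * val = 0x04030201 on a BE host:  data[] = { 0x04, 0x03, 0x02, 0x01 } */
}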

>>
>> If you follow my proposal to clarify mmio.data[] meaning
>> mmio.data[] array will look the same in all 4 cases,
>> compatible with current usage.
>
> I still don't know what your proposal is, "array will look the same in
> all 4 cases" is not a definition that I can use to interpret the data
> written.
>
> The semantics you need to be able to describe is that of the memory
> operation: for example, store a word, not store an array of bytes.
>
>>
>> If Peter's proposal is adopted ARM BE and PPC LE cases
>> would be penalized with excessive back and forth
>> byteswaps. That is possible to avoid with my proposal.
>>
>
> I would take 50 byteswaps with a clear ABI any day over an obscure
> standard that can avoid a single hardware-on-register instruction.  This
> is about designing a clean software interface, not about building an
> optimized integrated stack.
>
> Unfortunately, this is going nowhere, so I think we need to stop this
> thread.  As you can see I have sent a patch as a clarification to the
> ABI, if it's merged we can move on with more important tasks.

OK, that is fine. I still believe it is not the best choice,
but I agree that we need to move on. I will respin my
V7 KVM BE patches according to these new semantics, integrate
the comments that you (thanks!) and others gave me on the
mailing list, and post my series again when it is ready.

Thanks,
Victor

> Thanks,
> -Christoffer

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-ppc] KVM and variable-endianness guest CPUs
  2014-01-22 17:29               ` [Qemu-devel] " Peter Maydell
@ 2014-01-27 23:27                 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 102+ messages in thread
From: Benjamin Herrenschmidt @ 2014-01-27 23:27 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Victor Kamensky, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm, Christoffer Dall

On Wed, 2014-01-22 at 17:29 +0000, Peter Maydell wrote:
> 
> > Basically if it would be on real bus, get byte value
> > that corresponds to phys_addr + 0 address place
> > it into data[0], get byte value that corresponds to
> > phys_addr + 1 address place it into data[1], etc.
> 
> This just isn't how real buses work.

Actually it can be :-)

>  There is no
> "address + 1, address + 2". There is a single address
> for the memory transaction and a set of data on
> data lines and some separate size information.
> How the device at the far end of the bus chooses
> to respond to 32 bit accesses to address X versus
> 8 bit accesses to addresses X through X+3 is entirely
> its own business and unrelated to the CPU.

However, the bus has a definition of which byte lane is the lowest in
address order. Byte-order invariance is an important property of
all busses.

I think that trying to treat it as anything other than an
address-ordered series of bytes is going to turn into a complete and
inextricable mess.
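
To illustrate byte-address invariance (a sketch; mmio_base stands
for wherever the device happens to be mapped):

#include <stdint.h>

void demo(volatile void *mmio_base)
{
    volatile uint32_t *p32 = mmio_base;
    volatile uint8_t  *p8  = mmio_base;

    *p32 = 0x01020304;   /* one 32-bit write */

    /* p8[0] is whatever byte the CPU put at the lowest address:
     * 0x04 on an LE core, 0x01 on a BE core.  The addresses never
     * move; only the mapping of value to byte lanes changes. */
    uint8_t b0 = p8[0];
    (void)b0;
}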

>  (It would
> be perfectly possible to have a device which when
> you read from address X as 32 bits returned 0x12345678,
> when you read from address X as 16 bits returned
> 0x9abc, returned 0x42 for an 8 bit read from X+1,
> and so on. Having byte reads from X..X+3 return
> values corresponding to parts of the 32 bit access
> is purely a convention.)

Right, it's possible. It's also stupid, and not how most modern devices
and busses work. Besides, there is no reason why that can't be
implemented with Victor's proposal anyway.

Ben.




^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-ppc] KVM and variable-endianness guest CPUs
  2014-01-22 19:29                 ` [Qemu-devel] " Victor Kamensky
@ 2014-01-27 23:31                   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 102+ messages in thread
From: Benjamin Herrenschmidt @ 2014-01-27 23:31 UTC (permalink / raw)
  To: Victor Kamensky
  Cc: Peter Maydell, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm, Christoffer Dall

On Wed, 2014-01-22 at 11:29 -0800, Victor Kamensky wrote:
> I don't see why you so attached to desire to describe
> data part of memory transaction as just one of int
> types. If we are talking about bunch of hypothetical
> cases imagine such bus that allow transaction with
> size of 6 bytes. How do you describe such data in
> your ints speak? What endianity you can assign to
> sequence of 6 bytes? While note that description of
> such transaction as set of 6 byte values at address
> $whatever makes perfect sense.

Absolutely. For example, the "real" bus out of a POWER8 core is
something like 128 bytes wide, though I wouldn't be surprised if it
were serialized; I don't actually know the details, it's all inside
the chip. The interconnect between chips is a multi-lane elastic
interface whose width has nothing to do with the payload size. The
same goes for PCIe.

The only thing that can more or less sanely represent all of these
things is a series of bytes ordered by address, with attributes such
as the access size (or byte enables, if that makes more sense or we
want to emulate really funky stuff) and possibly other decoration
that some architectures might want to have (such as caching/combining
attributes etc., which *might* be useful under some circumstances).
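
As a sketch of the representation I mean (all field names and sizes
here are invented):

#include <stdint.h>

/* "A series of bytes ordered by address, with attributes": */
struct bus_txn {
    uint64_t addr;              /* bus address of data[0] */
    uint32_t len;               /* payload size in bytes */
    uint8_t  data[128];         /* bytes in increasing address order */
    uint8_t  byte_enable[16];   /* optional per-byte lane enables */
    uint32_t attrs;             /* caching/combining decoration */
};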

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-ppc] KVM and variable-endianness guest CPUs
  2014-01-22 20:02                   ` [Qemu-devel] " Peter Maydell
@ 2014-01-27 23:34                     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 102+ messages in thread
From: Benjamin Herrenschmidt @ 2014-01-27 23:34 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Victor Kamensky, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm, Christoffer Dall

On Wed, 2014-01-22 at 20:02 +0000, Peter Maydell wrote:
> 
> Defining it as being always guest-order would mean that
> userspace had to continually look at the guest CPU
> endianness bit, which is annoying and awkward.
> 
> Defining it as always host-endian order is the most
> reasonable option available. It also happens to work
> for the current QEMU code, which is nice.

No.

Having a byte array coming in that represents what the CPU does in its
current byte order means you do *NOT* need to query the endianness of
the guest CPU from userspace.
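
In other words, device emulation can treat mmio.data[] exactly like
a piece of guest memory. A sketch ('regs' and 'dev_base' are assumed
to describe the emulated device's register file and its
guest-physical base):

#include <stdint.h>
#include <string.h>
#include <linux/kvm.h>

static void mmio_store(uint8_t *regs, uint64_t dev_base,
                       const struct kvm_run *run)
{
    /* The bytes are already in guest memory-image order, so a plain
     * copy is correct; no guest-endianness query anywhere. */
    memcpy(regs + (run->mmio.phys_addr - dev_base),
           run->mmio.data, run->mmio.len);
}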

Ben.



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-ppc] KVM and variable-endianness guest CPUs
  2014-01-27 23:34                     ` [Qemu-devel] " Benjamin Herrenschmidt
@ 2014-01-27 23:49                       ` Peter Maydell
  -1 siblings, 0 replies; 102+ messages in thread
From: Peter Maydell @ 2014-01-27 23:49 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Victor Kamensky, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm, Christoffer Dall

On 27 January 2014 23:34, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Wed, 2014-01-22 at 20:02 +0000, Peter Maydell wrote:
>>
>> Defining it as being always guest-order would mean that
>> userspace had to continually look at the guest CPU
>> endianness bit, which is annoying and awkward.
>>
>> Defining it as always host-endian order is the most
>> reasonable option available. It also happens to work
>> for the current QEMU code, which is nice.
>
> No.
>
> Having a byte array coming in that represents what the CPU does in its
> current byte order means you do *NOT* need to query the endianness of
> the guest CPU from userspace.

Er, what? If we make the array be the guest's current order,
then by definition userspace has to look at the guest's
current endianness. I agree that would be bad. Either
of the two current proposals (host kernel order; guest
CPU's native/natural/default byte order) avoids it, though.

-- PMM

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-ppc] KVM and variable-endianness guest CPUs
  2014-01-23 15:33                               ` [Qemu-devel] " Peter Maydell
@ 2014-01-28  0:07                                 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 102+ messages in thread
From: Benjamin Herrenschmidt @ 2014-01-28  0:07 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Victor Kamensky, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm, Christoffer Dall

On Thu, 2014-01-23 at 15:33 +0000, Peter Maydell wrote:
>  (4) PPC CPUs in BE mode and ARM CPUs in BE mode are not
>      the same, because in the ARM case it is doing an
>      internal-to-CPU byteswap, and in the PPC case it is not

Aren't they both byte-order invariant ?

In that case they are the same.

Ben.



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-ppc] KVM and variable-endianness guest CPUs
  2014-01-28  0:07                                 ` [Qemu-devel] " Benjamin Herrenschmidt
@ 2014-01-28  0:07                                   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 102+ messages in thread
From: Benjamin Herrenschmidt @ 2014-01-28  0:07 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Victor Kamensky, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm, Christoffer Dall

On Tue, 2014-01-28 at 11:07 +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2014-01-23 at 15:33 +0000, Peter Maydell wrote:
> >  (4) PPC CPUs in BE mode and ARM CPUs in BE mode are not
> >      the same, because in the ARM case it is doing an
> >      internal-to-CPU byteswap, and in the PPC case it is not
> 
> Aren't they both byte-order invariant?

I meant byte-address...

> In that case they are the same.
> 
> Ben.
> 



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-ppc] KVM and variable-endianness guest CPUs
  2014-01-23 20:45                                   ` [Qemu-devel] " Christoffer Dall
@ 2014-01-28  0:15                                     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 102+ messages in thread
From: Benjamin Herrenschmidt @ 2014-01-28  0:15 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Victor Kamensky, Peter Maydell, Thomas Falcon, kvm-devel,
	QEMU Developers, qemu-ppc, kvmarm


> The point is simple, and Peter has made it over and over:
> Any consumer of a memory operation sees "value, len, address".
> 
> This is what KVM_EXIT_MMIO emulates.  So just by knowing the ABI
> definition and having a pointer to the structure you need to be able to
> tell me "value, len, address".

But that's useless, because it doesn't tell you the byte order
of the value, which is critical for emulation, unless you *defined* in
your ABI the byte order of the value and thus included an artificial
swap when the guest is in a different endian mode than the host.


My understanding is that ARM is byte-address invariant, as is powerpc,
so it makes a LOT more sense to carry a sequence of address-ordered
bytes instead, which will correspond to what the guest code thinks
it's writing, and have the device respond appropriately based on
the endianness of the bus it sits on or the device itself.
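
A tiny standalone illustration of that invariance (my sketch, not code
from the thread); compare it with Victor's assembly examples quoted
further down:

/* Hypothetical sketch: on a byte-address-invariant CPU, an LE store
 * of 0x04030201 and a BE store of 0x01020304 place the same bytes at
 * the same ascending addresses, so the device sees identical data. */
#include <stdint.h>
#include <stdio.h>

static void show(const char *what, uint32_t v, int big_endian)
{
    uint8_t data[4];
    for (int i = 0; i < 4; i++) {
        int shift = big_endian ? (24 - 8 * i) : (8 * i);
        data[i] = (v >> shift) & 0xff;   /* byte at address base+i */
    }
    printf("%s: %02x %02x %02x %02x\n", what,
           data[0], data[1], data[2], data[3]);
}

int main(void)
{
    show("LE write 0x04030201", 0x04030201, 0);   /* 01 02 03 04 */
    show("BE write 0x01020304", 0x01020304, 1);   /* 01 02 03 04 */
    return 0;
}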


> > >  (2) the API between kernel and userspace needs to define
> > >      the semantics of mmio.data, ie how to map between
> > >      "x byte wide transaction with value v" and the array,
> > >      and that is primarily what this conversation is about
> > >  (3) the only choice which is both (a) sensible and (b)
> > >      not breaking existing usage is to say "the array is
> > >      in host-kernel-byte-order"
> > >  (4) PPC CPUs in BE mode and ARM CPUs in BE mode are not
> > >      the same, because in the ARM case it is doing an
> > >      internal-to-CPU byteswap, and in the PPC case it is not

I very much doubt that there is a difference here; I'm not sure about
that business with the "internal byteswap".

The order in which the bytes of a word are presented on the bus changes
depending on the core endianness. If that's what you call a "byteswap"
then both ARM and PPC do it when they are in their "other" endian,
but that confusion comes from assuming that a data bus has an endianness
at all to begin with.

I was hoping that by 2014, such ideas were things of the past.

> > That is one of the key disconnects. I'll go find real examples
> > in ARM LE, ARM BE, and PPC BE Linux kernel. Just for
> > everybody's sake, here is a summary of the disconnect:
> > 
> > If we have the same h/w connected to memory bus in ARM
> > and PPC systems and we have the following three pieces
> > of code that work with r0 having same device same
> > register address:
> > 
> > 1. ARM LE word write of  0x04030201:
> > setend le
> > mov r1, #0x04030201
> > str r1, [r0]
> > 
> > 2. ARM BE word write of 0x01020304:
> > setend be
> > mov r1, #0x01020304
> > str r1, [r0]
> > 
> > 3. PPC BE word write of 0x01020304:
> > lis     r1,0x102
> > ori     r1,r1,0x304
> > stw    r1,0(r0)
> > 
> > I claim that h/w will see the same data on bus lines in all
> > three cases, and h/w would act the same in all three
> > cases. Peter says that in the ARM BE and PPC BE cases the h/w
> > would act differently.
> > 
> > If anyone else can offer opinion on that while I am looking
> > for real examples that would be great.
> > 
> 
> I really don't think listing all these examples helps.  You need to focus
> on the key points that Peter listed in his previous mail.
> 
> I tried in our chat to ask you this question:
> 
> vcpu_data_host_to_guest() is handling a read from an emulated device.
> All the info you have is:
> (1) len of memory access
> (2) mmio.data pointer
> (3) destination register
> (4) host CPU endianness
> (5) guest CPU endianness
> 
> Based on this information alone, you need to decide whether you do a
> byteswap or not before loading the hardware register upon returning to
> the guest.
> 
> You will find it impossible to answer, because you don't know the layout
> of mmio.data, and that is the thing we are trying to solve.
> 
> If you cannot reply to this point in less than 50 lines or mention
> anything about devices being LE or BE or come up with examples, I am
> probably not going to read your reply, sorry.
> 
> -Christoffer



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-ppc] KVM and variable-endianness guest CPUs
  2014-01-24  4:11                                         ` [Qemu-devel] " Victor Kamensky
@ 2014-01-28  0:32                                           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 102+ messages in thread
From: Benjamin Herrenschmidt @ 2014-01-28  0:32 UTC (permalink / raw)
  To: Victor Kamensky
  Cc: Christoffer Dall, Peter Maydell, Thomas Falcon, kvm-devel,
	QEMU Developers, qemu-ppc, kvmarm

On Thu, 2014-01-23 at 20:11 -0800, Victor Kamensky wrote:
> > I would take 50 byteswaps with a clear ABI any day over an obscure
> > standard that can avoid a single hardware-on-register instruction.
> > This is about designing a clean software interface, not about
> > building an optimized integrated stack.
> >
> > Unfortunately, this is going nowhere, so I think we need to stop
> > this thread.  As you can see I have sent a patch as a clarification
> > to the ABI, if it's merged we can move on with more important tasks.
> 
> OK, that is fine. I still believe it is not the best choice,
> but I agree that we need to move on. I will respin my
> V7 KVM BE patches according to this new semantics, I will
> integrate comments that you (thanks!) and others gave me
> over mailing list and post my series again when it is ready.

Right, the whole "host endian" is a horrible choice whichever way you
look at it, but I'm afraid it's unfixable since it's already ABI :-(

Ben.



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-ppc] KVM and variable-endianness guest CPUs
  2014-01-27 23:49                       ` [Qemu-devel] " Peter Maydell
@ 2014-01-28  0:36                         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 102+ messages in thread
From: Benjamin Herrenschmidt @ 2014-01-28  0:36 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Victor Kamensky, Thomas Falcon, kvm-devel, QEMU Developers,
	qemu-ppc, kvmarm, Christoffer Dall

On Mon, 2014-01-27 at 23:49 +0000, Peter Maydell wrote:
> 
> Er, what? If we make the array be guest's current order
> then by definition userspace has to look at the guest's
> current endianness. I agree that would be bad. Either
> of the two current proposals (host kernel order; guest
> CPU's native/natural/default-byte-order) avoid it, though.

No, this has nothing to do with the guest endianness, and
everything to do with the (hopefully) byte-address-invariant bus we
have on the processor.

Anyway, the existing crap is ABI so I suspect we have to stick with it,
just maybe document it better.

Ben.



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-ppc] KVM and variable-endianness guest CPUs
  2014-01-28  0:32                                           ` [Qemu-devel] " Benjamin Herrenschmidt
@ 2014-01-28  0:40                                             ` Christoffer Dall
  -1 siblings, 0 replies; 102+ messages in thread
From: Christoffer Dall @ 2014-01-28  0:40 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Victor Kamensky, Peter Maydell, Thomas Falcon, kvm-devel,
	QEMU Developers, qemu-ppc, kvmarm

On Tue, Jan 28, 2014 at 11:32:41AM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2014-01-23 at 20:11 -0800, Victor Kamensky wrote:
> > > I would take 50 byteswaps with a clear ABI any day over an obscure
> > > standard that can avoid a single hardware-on-register instruction.
> > > This is about designing a clean software interface, not about
> > > building an optimized integrated stack.
> > >
> > > Unfortunately, this is going nowhere, so I think we need to stop
> > > this thread.  As you can see I have sent a patch as a clarification
> > > to the ABI, if it's merged we can move on with more important tasks.
> > 
> > OK, that is fine. I still believe it is not the best choice,
> > but I agree that we need to move on. I will respin my
> > V7 KVM BE patches according to this new semantics, I will
> > integrate comments that you (thanks!) and others gave me
> > over mailing list and post my series again when it is ready.
> 
> Right, the whole "host endian" is a horrible choice whichever way you
> look at it, but I'm afraid it's unfixable since it's already ABI :-(
> 
Why is it a horrible choice?

I don't think it's actually ABI at this point; it's undefined.

The only thing fixed is PPC BE host and ARM LE host, and in both cases
we currently perform a byteswap in KVM if the guest is a different
endianness.

Honestly I don't care which way it's defined, as long as it's defined
somehow, and I have not yet seen anyone formulate how the ABI
specification should be worded, so people clearly understand what's
going on.

If you take a look at the v2 patch "KVM: Specify byte order for
KVM_EXIT_MMIO", that's where it ended up.

If you can formulate something with your experience in endianness that
makes this clear, it would be extremely helpful.

-Christoffer

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-ppc] KVM and variable-endianness guest CPUs
  2014-01-28  0:36                         ` [Qemu-devel] " Benjamin Herrenschmidt
@ 2014-01-28  0:44                           ` Christoffer Dall
  -1 siblings, 0 replies; 102+ messages in thread
From: Christoffer Dall @ 2014-01-28  0:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Peter Maydell, Victor Kamensky, Thomas Falcon, kvm-devel,
	QEMU Developers, qemu-ppc, kvmarm

On Tue, Jan 28, 2014 at 11:36:13AM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2014-01-27 at 23:49 +0000, Peter Maydell wrote:
> > 
> > Er, what? If we make the array be guest's current order
> > then by definition userspace has to look at the guest's
> > current endianness. I agree that would be bad. Either
> > of the two current proposals (host kernel order; guest
> > CPU's native/natural/default-byte-order) avoid it, though.
> 
> No, this has nothing to do with the guest endianness, and
> all to do about the (hopefully) byte-address invariant bus we have
> on the processor.
> 
> Anyway, the existing crap is ABI so I suspect we have to stick with it,
> just maybe document it better.
> 

I'm losing track of this discussion, Ben, can you explain a bit?  You
wrote:

  Having a byte array coming in that represents what the CPU does in its
  current byte order means you do *NOT* need to query the endianness of
  the guest CPU from userspace.

What does "a byte array that represents what the CPU does in its current
byte order" mean in this context.  Do you mean the VCPU or the physical
CPU when you say CPU.

I read your text as saying "just do a store of the register into the
data pointer and don't worry about endianness", but somebody, somewhere,
has to check the VCPU endianness setting.

I'm probably wrong, and you are probably the right person to clear this
up, but can you formulate exactly what you think the KVM ABI is and how
you would put it in Documentation/virtual/kvm/api.txt?

My point of view is that it is KVM that needs to do this, and it should
"emulate the CPU" by performing a byteswap in the case where the CPU
E-bit is set on ARM, but this is an ARM-centric way of looking at
things.
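
A minimal sketch of that rule (illustrative only; the function that
does this in the ARM KVM code is vcpu_data_host_to_guest(), whose real
signature and implementation may differ):

/* Hypothetical sketch of the ARM-centric rule described above: KVM
 * "emulates the CPU" by swapping iff the vCPU's E-bit is set.  The
 * names here are illustrative, not the kernel's. */
#include <stdint.h>
#include <stdbool.h>

/* 'data' arrives from the emulated device in host byte order;
 * this sketch assumes a little-endian host. */
static uint32_t data_host_to_guest(uint32_t data, bool vcpu_e_bit_set)
{
    /* A big-endian guest expects the bytes reversed before the value
     * is loaded into the destination register. */
    return vcpu_e_bit_set ? __builtin_bswap32(data) : data;
}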

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-ppc] KVM and variable-endianness guest CPUs
  2014-01-28  0:44                           ` [Qemu-devel] " Christoffer Dall
@ 2014-01-28  4:47                             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 102+ messages in thread
From: Benjamin Herrenschmidt @ 2014-01-28  4:47 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Peter Maydell, Victor Kamensky, Thomas Falcon, kvm-devel,
	QEMU Developers, qemu-ppc, kvmarm

On Mon, 2014-01-27 at 16:44 -0800, Christoffer Dall wrote:

> I'm losing track of this discussion, Ben, can you explain a bit?  You
> wrote:
> 
>   Having a byte array coming in that represents what the CPU does in its
>   current byte order means you do *NOT* need to query the endianness of
>   the guest CPU from userspace.
> 
> What does "a byte array that represents what the CPU does in its current
> byte order" mean in this context.  Do you mean the VCPU or the physical
> CPU when you say CPU.

It doesn't matter once it's a byte array in address order. Again this is
the *right* abstraction for the kernel ABI, because you do not care
about the endianness of either side, guest or host.

It makes no sense to treat a modern CPU data bus as having an MSB and an
LSB (even if they have it sometimes on the block diagram). Only when
*interpreting a value* on that bus, such as an *address* does the
endianness become of use.

Treat the bus instead as an ordered sequence of bytes in ascending
address order and most of the complexity goes away.

From there, for a given device, it all depends which bytes *that device*
chooses to consider as being the MSB vs. LSB. It's not even a bus thing,
though of course some busses suggest an endianness, and some, like PCI,
mandate it for configuration space.

But it remains a device-side choice.
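
A sketch of that device-side choice (illustrative, not from the
thread): two devices consume the same address-ordered data[] and
differ only in which byte they treat as the MSB:

/* Hypothetical sketch: the bus hands every device the same bytes in
 * ascending address order; endianness is purely the device's
 * interpretation of them. */
#include <stdint.h>
#include <stddef.h>

static uint32_t dev_le_value(const uint8_t *data, size_t len)
{
    uint32_t v = 0;
    for (size_t i = 0; i < len; i++)
        v |= (uint32_t)data[i] << (8 * i);   /* data[0] is the LSB */
    return v;
}

static uint32_t dev_be_value(const uint8_t *data, size_t len)
{
    uint32_t v = 0;
    for (size_t i = 0; i < len; i++)
        v = (v << 8) | data[i];              /* data[0] is the MSB */
    return v;
}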

> I read your text as saying "just do a store of the register into the
> data pointer and don't worry about endianness", but somebody, somewhere,
> has to check the VCPU endianness setting.
> 
> I'm probably wrong, and you are probably the right person to clear this
> up, but can you formulate exactly what you think the KVM ABI is and how
> you would put it in Documentation/virtual/kvm/api.txt?
> 
> My point of view is that it is KVM that needs to do this, and it should
> "emulate the CPU" by performing a byteswap in the case where the CPU
> E-bit is set on ARM, but this is an ARM-centric way of looking at
> things.

The ABI going to qemu should be (and inside qemu from TCG to the
emulation) that the CPU did an access of N bytes wide at address A
whose value is the byte array data[] in ascending address order.

Ben.



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: KVM and variable-endianness guest CPUs
  2014-01-22 10:22           ` [Qemu-devel] " Peter Maydell
@ 2014-01-28  9:04             ` Avi Kivity
  -1 siblings, 0 replies; 102+ messages in thread
From: Avi Kivity @ 2014-01-28  9:04 UTC (permalink / raw)
  To: Peter Maydell, Victor Kamensky
  Cc: Thomas Falcon, kvm-devel, QEMU Developers, qemu-ppc, kvmarm,
	Christoffer Dall

On 01/22/2014 12:22 PM, Peter Maydell wrote:
> On 22 January 2014 05:39, Victor Kamensky <victor.kamensky@linaro.org> wrote:
>> Hi Guys,
>>
>> Christoffer and I had a bit heated chat :) on this
>> subject last night. Christoffer, really appreciate
>> your time! We did not really reach agreement
>> during the chat and Christoffer asked me to follow
>> up on this thread.
>> Here it goes. Sorry, it is very long email.
>>
>> I don't believe we can assign any endianness to the
>> mmio.data[] byte array. I believe mmio.data[] and
>> mmio.len act just like memcpy and that is all. As
>> memcpy does not imply any endianness of the underlying
>> data, mmio.data[] should not either.
> This email is about five times too long to be actually
> useful, but the major issue here is that the data being
> transferred is not just a bag of bytes. The data[]
> array plus the size field are being (mis)used to indicate
> that the memory transaction is one of:
>   * an 8 bit access
>   * a 16 bit access of some uint16_t value
>   * a 32 bit access of some uint32_t value
>   * a 64 bit access of some uint64_t value
>
> exactly as a CPU hardware bus would do. It's
> because the API is defined in this awkward way with
> a uint8_t[] array that we need to specify how both
> sides should go from the actual properties of the
> memory transaction (value and size) to filling in the
> array.

That is not how x86 hardware works.  Back when there was a bus, there 
were no address lines A0-A2; instead we had 8 byte enables BE0-BE7.  A 
memory transaction placed the qword address on the address lines and 
asserted the byte enables for the appropriate byte, word, dword, or 
qword, shifted for the low order bits of the address.

If you generated an unaligned access, the transaction was split into 
two, so an 8-byte write might appear as a 5-byte write followed by a 
3-byte write.  In fact, the two halves of the transaction might go to 
different devices, or one might go to a device and another to memory.

PCI works the same way.
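
A toy model of that transaction format (my own encoding, not a real
bus interface), showing the unaligned split described above:

/* Hypothetical sketch: an x86-style bus write is a qword address plus
 * byte enables BE0-BE7; an unaligned 8-byte write at address 3 splits
 * into a 5-byte transaction followed by a 3-byte one. */
#include <stdint.h>
#include <stdio.h>

struct bus_txn {
    uint64_t qword_addr;    /* address lines; no A0-A2 */
    uint8_t  byte_enables;  /* bit i asserts byte lane i (BEi) */
};

static void emit_write(uint64_t addr, unsigned len)
{
    while (len) {
        unsigned lane = addr & 7;
        unsigned n = 8 - lane;           /* lanes left in this qword */
        if (n > len)
            n = len;
        struct bus_txn t = {
            .qword_addr   = addr & ~7ull,
            .byte_enables = (uint8_t)(((1u << n) - 1) << lane),
        };
        printf("qword %#llx, enables %#04x (%u bytes)\n",
               (unsigned long long)t.qword_addr, t.byte_enables, n);
        addr += n;
        len  -= n;
    }
}

int main(void)
{
    emit_write(3, 8);   /* 5-byte write, then 3-byte write */
    return 0;
}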



>
> Furthermore, device endianness is entirely irrelevant
> for deciding the properties of mmio.data[], because the
> thing we're modelling here is essentially the CPU->bus
> interface. In real hardware, the properties of individual
> devices on the bus are irrelevant to how the CPU's
> interface to the bus behaves, and similarly here the
> properties of emulated devices don't affect how KVM's
> interface to QEMU userspace needs to work.
>
> MemoryRegion's 'endianness' field, incidentally, is
> a dreadful mess that we should get rid of. It is attempting
> to model the property that some buses/bridges have of
> doing byte-lane-swaps on data that passes through as
> a property of the device itself. It would be better if we
> modelled it properly, with container regions having possible
> byte-swapping and devices just being devices.
>

No, that is not what it is modelling.

Suppose a little-endian CPU writes a dword 0x12345678 to address 0 of a
device, and reads back a byte from address 0.  What value do you read back?

Some (most) devices will return 0x78, others will return 0x12. Other
devices don't support mixed sizes at all, but many do.  PCI
configuration space is an example; it is common to read both Device ID
and Vendor ID with a single 32-bit transaction, but you can also read
them separately with two 16-bit transactions.  Because PCI is
little-endian, the Vendor ID at address 0 will be returned as the low
word of a 32-bit read on a little-endian processor.

If you remove device endianness from memory regions, you have to pass 
the data as arrays of bytes (like the KVM interface) and let the device 
assemble words from those bytes itself, taking into consideration its 
own endianness.  What MemoryRegion's endianness does is let the device 
declare its endianness to the API and let it do all the work.
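
For reference, a minimal sketch of such a declaration: the
MemoryRegionOps structure and its .endianness field are QEMU's real
API, but the foo_* device callbacks here are invented placeholders:

/* Sketch of a QEMU device declaring its endianness so the memory core
 * assembles values on its behalf.  Only compiles inside a QEMU tree;
 * the foo_* callbacks are hypothetical. */
#include "exec/memory.h"

static uint64_t foo_read(void *opaque, hwaddr addr, unsigned size)
{
    return 0;   /* device-specific register read would go here */
}

static void foo_write(void *opaque, hwaddr addr,
                      uint64_t val, unsigned size)
{
    /* device-specific register write would go here */
}

static const MemoryRegionOps foo_ops = {
    .read = foo_read,
    .write = foo_write,
    .endianness = DEVICE_LITTLE_ENDIAN,   /* e.g. a PCI-like device */
    .impl.min_access_size = 2,            /* allow 16-bit ID reads */
    .impl.max_access_size = 4,
};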

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-ppc] KVM and variable-endianness guest CPUs
  2014-01-27 23:27                 ` [Qemu-devel] " Benjamin Herrenschmidt
@ 2014-01-28  9:16                   ` Avi Kivity
  -1 siblings, 0 replies; 102+ messages in thread
From: Avi Kivity @ 2014-01-28  9:16 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Peter Maydell
  Cc: Thomas Falcon, kvm-devel, Victor Kamensky, QEMU Developers,
	qemu-ppc, kvmarm, Christoffer Dall

On 01/28/2014 01:27 AM, Benjamin Herrenschmidt wrote:
> On Wed, 2014-01-22 at 17:29 +0000, Peter Maydell wrote:
>>> Basically if it would be on real bus, get byte value
>>> that corresponds to phys_addr + 0 address place
>>> it into data[0], get byte value that corresponds to
>>> phys_addr + 1 address place it into data[1], etc.
>> This just isn't how real buses work.
> Actually it can be :-)
>
>>   There is no
>> "address + 1, address + 2". There is a single address
>> for the memory transaction and a set of data on
>> data lines and some separate size information.
>> How the device at the far end of the bus chooses
>> to respond to 32 bit accesses to address X versus
>> 8 bit accesses to addresses X through X+3 is entirely
>> its own business and unrelated to the CPU.
> However the bus has a definition of what byte lane is the lowest in
> address order. Byte order invariance is an important function of
> all busses.
>
> I think that trying to treat it any differently than an address
> ordered series of bytes is going to turn into a complete and
> inextricable mess.

I agree.

The two options are:

  (address, byte array, length)

and

  (address, value, word size, endianness)

the first is the KVM ABI, the second is how MemoryRegions work. Both are 
valid, but the first is more general (supports the 3-byte accesses 
sometimes generated on x86).
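
A sketch of the mapping between the two forms (hypothetical helper,
not from either codebase); note that x86's occasional 3-byte access is
simply len == 3 here:

/* Hypothetical sketch: turning the KVM-style (address, byte array,
 * length) form into the MemoryRegion-style value, for a device of
 * known endianness. */
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

static uint64_t bytes_to_value(const uint8_t *data, size_t len,
                               bool device_big_endian)
{
    uint64_t v = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned shift = device_big_endian
            ? 8 * (unsigned)(len - 1 - i)   /* data[0] is the MSB */
            : 8 * (unsigned)i;              /* data[0] is the LSB */
        v |= (uint64_t)data[i] << shift;
    }
    return v;
}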


>
>>   (It would
>> be perfectly possible to have a device which when
>> you read from address X as 32 bits returned 0x12345678,
>> when you read from address X as 16 bits returned
>> 0x9abc, returned 0x42 for an 8 bit read from X+1,
>> and so on. Having byte reads from X..X+3 return
>> values corresponding to parts of the 32 bit access
>> is purely a convention.)
> Right, it's possible. It's also stupid and not how most modern devices
> and busses work. Besides, there is no reason why that can't be
> implemented with Victor's proposal anyway.

Right.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-ppc] KVM and variable-endianness guest CPUs
  2014-01-28  4:47                             ` [Qemu-devel] " Benjamin Herrenschmidt
@ 2014-01-28 16:31                               ` Christoffer Dall
  -1 siblings, 0 replies; 102+ messages in thread
From: Christoffer Dall @ 2014-01-28 16:31 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Peter Maydell, Victor Kamensky, Thomas Falcon, kvm-devel,
	QEMU Developers, qemu-ppc, kvmarm

On Tue, Jan 28, 2014 at 03:47:32PM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2014-01-27 at 16:44 -0800, Christoffer Dall wrote:
> 
> > I'm losing track of this discussion, Ben, can you explain a bit?  You
> > wrote:
> > 
> >   Having a byte array coming in that represents what the CPU does in its
> >   current byte order means you do *NOT* need to query the endianness of
> >   the guest CPU from userspace.
> > 
> > What does "a byte array that represents what the CPU does in its current
> > byte order" mean in this context.  Do you mean the VCPU or the physical
> > CPU when you say CPU.
> 
> It doesn't matter once it's a byte array in address order. Again this is
> the *right* abstraction for the kernel ABI, because you do not care
> about the endianness of either side, guest or host.
> 
> It makes no sense to treat a modern CPU data bus as having an MSB and an
> LSB (even if they have it sometimes on the block diagram). Only when
> *interpreting a value* on that bus, such as an *address* does the
> endianness become of use.
> 
> Treat the bus instead as an ordered sequence of bytes in ascending
> address order and most of the complexity goes away.
> 
> From there, for a given device, it all depends which bytes *that device*
> chooses to consider as being the MSB vs. LSB. It's not even a bus thing,
> though of course some busses suggest an endianness, and some, like PCI,
> mandate it for configuration space.
> 
> But it remains a device-side choice.
> 
> > I read your text as saying "just do a store of the register into the
> > data pointer and don't worry about endianness", but somebody, somewhere,
> > has to check the VCPU endianness setting.
> > 
> > I'm probably wrong, and you are probably the right person to clear this
> > up, but can you formulate exactly what you think the KVM ABI is and how
> > you would put it in Documentation/virtual/kvm/api.txt?
> > 
> > My point of view is that it is KVM that needs to do this, and it should
> > "emulate the CPU" by performing a byteswap in the case where the CPU
> > E-bit is set on ARM, but this is an ARM-centric way of looking at
> > things.
> 
> The ABI going to qemu should be (and inside qemu from TCG to the
> emulation) that the CPU did an access of N bytes wide at address A
> whose value is the byte array data[] in ascending address order.
> 
OK, I've sent a v3 of the ABI clarification patch following the wording
from you and Scott.  I think we all agree on what the format should look
like at this point, and hopefully we can quickly agree on a text to
describe it.

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 102+ messages in thread

end of thread, other threads:[~2014-01-28 16:31 UTC | newest]

Thread overview: 102+ messages (download: mbox.gz / follow: Atom feed)
2014-01-17 17:53 KVM and variable-endianness guest CPUs Peter Maydell
2014-01-17 18:52 ` Peter Maydell
2014-01-18  4:24   ` Christoffer Dall
2014-01-18  7:32     ` Alexander Graf
2014-01-18 10:15       ` Peter Maydell
2014-01-20 14:20         ` Alexander Graf
2014-01-20 14:31           ` Peter Maydell
2014-01-20 14:22   ` Alexander Graf
2014-01-20 19:19     ` Christoffer Dall
2014-01-22  5:39       ` Victor Kamensky
2014-01-22  6:31         ` Anup Patel
2014-01-22  6:41           ` [Qemu-ppc] " Alexander Graf
2014-01-22  7:26             ` Victor Kamensky
2014-01-22 10:52               ` Alexander Graf
2014-01-23  4:25                 ` Victor Kamensky
2014-01-23 10:32                   ` Alexander Graf
2014-01-23 10:56                   ` Greg Kurz
2014-01-22  8:57             ` Anup Patel
2014-01-23 23:28               ` Christoffer Dall
2014-01-22 10:22         ` Peter Maydell
2014-01-22 17:19           ` Victor Kamensky
2014-01-22 17:29             ` Peter Maydell
2014-01-22 19:29               ` Victor Kamensky
2014-01-22 20:02                 ` Peter Maydell
2014-01-22 22:47                   ` Victor Kamensky
2014-01-22 23:18                     ` Peter Maydell
2014-01-23  0:22                       ` Victor Kamensky
2014-01-23 10:23                         ` Peter Maydell
2014-01-23 15:06                           ` Victor Kamensky
2014-01-23 15:33                             ` Peter Maydell
2014-01-23 16:25                               ` Victor Kamensky
2014-01-23 20:45                                 ` Christoffer Dall
2014-01-24  0:50                                   ` Victor Kamensky
2014-01-24  2:14                                     ` Christoffer Dall
2014-01-24  4:11                                       ` Victor Kamensky
2014-01-28  0:32                                         ` [Qemu-ppc] " Benjamin Herrenschmidt
2014-01-28  0:40                                           ` Christoffer Dall
2014-01-28  0:15                                   ` Benjamin Herrenschmidt
2014-01-24  0:09                                 ` Victor Kamensky
2014-01-28  0:07                               ` [Qemu-ppc] " Benjamin Herrenschmidt
2014-01-28  0:07                                 ` Benjamin Herrenschmidt
2014-01-27 23:34                   ` Benjamin Herrenschmidt
2014-01-27 23:49                     ` Peter Maydell
2014-01-28  0:36                       ` Benjamin Herrenschmidt
2014-01-28  0:44                         ` Christoffer Dall
2014-01-28  4:47                           ` Benjamin Herrenschmidt
2014-01-28 16:31                             ` Christoffer Dall
2014-01-27 23:31                 ` Benjamin Herrenschmidt
2014-01-27 23:27               ` Benjamin Herrenschmidt
2014-01-28  9:16                 ` Avi Kivity
2014-01-28  9:04           ` Avi Kivity
