* [Qemu-devel] branches are expensive
@ 2009-03-17 11:05 Steffen Liebergeld
  2009-03-17 11:18 ` Avi Kivity
  2009-03-17 13:30 ` [Qemu-devel] " Laurent Desnogues
  0 siblings, 2 replies; 14+ messages in thread
From: Steffen Liebergeld @ 2009-03-17 11:05 UTC (permalink / raw)
  To: qemu-devel

Hi,

while measuring the execution of an ARM guest, I found that branches are
extremely expensive in terms of executed host instructions. A single ARM
branch takes 148 to 152 host instructions. In my setup, host and guest both
use the ARM instruction set architecture.

My question is what makes branches so expensive? What code is run when
executing a branch? Other instructions are translated to 14 to 40
instructions.

Any help is appreciated.

Greetings, Steffen


* Re: [Qemu-devel] branches are expensive
  2009-03-17 11:05 [Qemu-devel] branches are expensive Steffen Liebergeld
@ 2009-03-17 11:18 ` Avi Kivity
  2009-03-17 11:32   ` [Qemu-devel] " Jan Kiszka
  2009-03-17 11:36   ` Steffen Liebergeld
  2009-03-17 13:30 ` [Qemu-devel] " Laurent Desnogues
  1 sibling, 2 replies; 14+ messages in thread
From: Avi Kivity @ 2009-03-17 11:18 UTC (permalink / raw)
  To: qemu-devel

Steffen Liebergeld wrote:
> Hi,
>
> while measuring the execution of an ARM guest, I encountered that branches are
> extremely expensive in terms of executed host instructions. A single ARM
> branch takes 148 to 152 host instructions. In my setup host and guest use the
> ARM instruction set architecture.
>
> My question is what makes branches so expensive? What code is run when
> executing a branch? Other instructions are translated to 14 to 40
> instructions.
>   

I'm no tcg guru, but if branches are not chained, you'd need an 
expensive lookup to find the next translation block.  If branches are 
chained they'll probably be much faster.

-- 
error compiling committee.c: too many arguments to function


* [Qemu-devel] Re: branches are expensive
  2009-03-17 11:18 ` Avi Kivity
@ 2009-03-17 11:32   ` Jan Kiszka
  2009-03-17 12:31     ` Steffen Liebergeld
  2009-03-17 11:36   ` Steffen Liebergeld
  1 sibling, 1 reply; 14+ messages in thread
From: Jan Kiszka @ 2009-03-17 11:32 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity wrote:
> Steffen Liebergeld wrote:
>> Hi,
>>
>> while measuring the execution of an ARM guest, I encountered that
>> branches are
>> extremely expensive in terms of executed host instructions. A single ARM
>> branch takes 148 to 152 host instructions. In my setup host and guest
>> use the
>> ARM instruction set architecture.
>>
>> My question is what makes branches so expensive? What code is run when
>> executing a branch? Other instructions are translated to 14 to 40
>> instructions.
>>   
> 
> I'm no tcg guru, but if branches are not chained, you'd need an
> expensive lookup to find the next translation block.  If branches are
> chained they'll probably be much faster.

That is probably the reason.

You can check the generated host code and compare it to the guest code
via -d in_asm,out_asm (or via the monitor: log in_asm,out_asm); logs
will be written to /tmp/qemu.log by default. The ratio of direct (chained)
jumps is given via "info jit".

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


* [Qemu-devel] Re: branches are expensive
  2009-03-17 11:18 ` Avi Kivity
  2009-03-17 11:32   ` [Qemu-devel] " Jan Kiszka
@ 2009-03-17 11:36   ` Steffen Liebergeld
  1 sibling, 0 replies; 14+ messages in thread
From: Steffen Liebergeld @ 2009-03-17 11:36 UTC (permalink / raw)
  To: qemu-devel

Hi Avi,
Avi Kivity <avi@redhat.com> wrote:
> Steffen Liebergeld wrote:
>> Hi,
>>
>> while measuring the execution of an ARM guest, I encountered that branches are
>> extremely expensive in terms of executed host instructions. A single ARM
>> branch takes 148 to 152 host instructions. In my setup host and guest use the
>> ARM instruction set architecture.
>>
>> My question is what makes branches so expensive? What code is run when
>> executing a branch? Other instructions are translated to 14 to 40
>> instructions.
>>   
>
> I'm no tcg guru, but if branches are not chained, you'd need an 
> expensive lookup to find the next translation block.  If branches are 
> chained they'll probably be much faster.

Sure. But how can I find out how many TBs are chained?

Greetings, Steffen


* [Qemu-devel] Re: branches are expensive
  2009-03-17 11:32   ` [Qemu-devel] " Jan Kiszka
@ 2009-03-17 12:31     ` Steffen Liebergeld
  2009-03-17 12:51       ` Paul Brook
  0 siblings, 1 reply; 14+ messages in thread
From: Steffen Liebergeld @ 2009-03-17 12:31 UTC (permalink / raw)
  To: qemu-devel

Hi,

Jan Kiszka <jan.kiszka@siemens.com> wrote:
> Avi Kivity wrote:
>> Steffen Liebergeld wrote:
>>> Hi,
>>>
>>> while measuring the execution of an ARM guest, I encountered that
>>> branches are
>>> extremely expensive in terms of executed host instructions. A single ARM
>>> branch takes 148 to 152 host instructions. In my setup host and guest
>>> use the
>>> ARM instruction set architecture.
>>>
>>> My question is what makes branches so expensive? What code is run when
>>> executing a branch? Other instructions are translated to 14 to 40
>>> instructions.
>>>   
>> 
>> I'm no tcg guru, but if branches are not chained, you'd need an
>> expensive lookup to find the next translation block.  If branches are
>> chained they'll probably be much faster.
>
> That is probably the reason.
>
> You can check to generated host code and compare it to the guest code
> via -d in_asm,out_asm (or via the monitor: log in_asm,out_asm), logs
> will be written /tmp/qemu.log by default. The ratio of direct (chained)
> jumps is given via "info jit".

The ratio is quite bad. Do you have any documentation on when Qemu does the
chaining and, more importantly, when it does not? For example, are unconditional
jumps always chained, or only in one direction (forward or backward)?

Many thanks.
Steffen 


* Re: [Qemu-devel] Re: branches are expensive
  2009-03-17 12:31     ` Steffen Liebergeld
@ 2009-03-17 12:51       ` Paul Brook
  2009-03-17 13:24         ` Avi Kivity
  2009-03-19 10:07         ` Steffen Liebergeld
  0 siblings, 2 replies; 14+ messages in thread
From: Paul Brook @ 2009-03-17 12:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: Steffen Liebergeld

> The ratio is quite bad. Do you have any documentation on when Qemu does the
> chaining and more important, when it does not. For example are
> unconditional jumps always chained, or only in one direction (forward or
> backward).

Direct jumps[1] within the same page are chained (including ). Indirect 
jumps[2] and direct jumps to a different page are not chained. Chaining jumps 
between pages would require breaking TB chains every time a TLB flush occurs.

Unchained jumps just use a two-stage lookup that caches frequently used entries.
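
The rules above can be sketched in C roughly as follows. This is only an
illustration, not QEMU's actual code: the names TB, can_chain, find_tb,
slow_lookup, add_tb and jmp_cache, the 1 KiB page size, the cache size and the
linear second-stage search are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BITS      10u                       /* assumed 1 KiB guest pages */
#define PAGE_MASK      (~((1u << PAGE_BITS) - 1u))
#define JMP_CACHE_SIZE 256u                      /* hypothetical cache size */
#define MAX_TBS        1024u

typedef struct TB {
    uint32_t pc;                                 /* guest address of the block */
} TB;

static TB tb_pool[MAX_TBS];                      /* stands in for the global TB table */
static size_t tb_count;
static TB *jmp_cache[JMP_CACHE_SIZE];            /* stage 1: small direct-mapped cache */

/* Register a translated block for guest address pc (no duplicate check). */
static TB *add_tb(uint32_t pc)
{
    if (tb_count == MAX_TBS) {
        return NULL;
    }
    tb_pool[tb_count].pc = pc;
    return &tb_pool[tb_count++];
}

/* A jump is chained only if it is direct and stays within one guest page. */
static bool can_chain(uint32_t branch_pc, uint32_t target_pc, bool is_direct)
{
    return is_direct && ((branch_pc & PAGE_MASK) == (target_pc & PAGE_MASK));
}

/* Stage 2: slow search over all known blocks (a real table would be hashed). */
static TB *slow_lookup(uint32_t pc)
{
    for (size_t i = 0; i < tb_count; i++) {
        if (tb_pool[i].pc == pc) {
            return &tb_pool[i];
        }
    }
    return NULL;
}

/* Two-stage lookup for jumps that cannot be chained. */
static TB *find_tb(uint32_t pc)
{
    size_t slot = (pc >> 2) % JMP_CACHE_SIZE;
    TB *tb = jmp_cache[slot];

    if (tb != NULL && tb->pc == pc) {
        return tb;                               /* fast hit in the small cache */
    }
    tb = slow_lookup(pc);
    if (tb != NULL) {
        jmp_cache[slot] = tb;                    /* remember it for next time */
    }
    return tb;
}
```

With 1 KiB pages, can_chain(0x13fc, 0x1400, true) is false because the target
lies on the next page, so such a jump would go through find_tb() every time.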

Paul

[1] b, b<cc> and bl
[2] bx, mov pc, ldr pc, pop {pc}


* Re: [Qemu-devel] Re: branches are expensive
  2009-03-17 12:51       ` Paul Brook
@ 2009-03-17 13:24         ` Avi Kivity
  2009-03-19 10:07         ` Steffen Liebergeld
  1 sibling, 0 replies; 14+ messages in thread
From: Avi Kivity @ 2009-03-17 13:24 UTC (permalink / raw)
  To: qemu-devel; +Cc: Steffen Liebergeld

Paul Brook wrote:
>> The ratio is quite bad. Do you have any documentation on when Qemu does the
>> chaining and more important, when it does not. For example are
>> unconditional jumps always chained, or only in one direction (forward or
>> backward).
>>     
>
> Direct jumps[1] within the same page are chained (including ). Indirect 
> jumps[2] and direct jumps to a different page are not chained. Chaining jumps 
> between pages would require breaking TB chains every time a TLB flush occurs.
>   

You could optimize interpage direct jumps as follows:

    if (tb->tlb_generation != global_tlb_generation)
        revalidate_interpage_branch();
    asm ("B target_address");

A tlb flush (or switching execution to a different cpu) increments 
global_tlb_generation; revalidate_interpage_branch() sets target_address 
to the slow path which does the tb lookup, and sets tlb_generation = 
global_tlb_generation.  Should compile an unconditional branch to 5 
instructions.
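
A small C model of this scheme, to make the bookkeeping concrete. It is a
sketch under assumptions, not generated code: the chained branch is modelled as
a pointer instead of a patched "B target_address", and follow_interpage_branch,
slow_path_lookup, tlb_flush and the `resolved` argument (standing in for the
result of the real tb lookup) are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct TB TB;
struct TB {
    uint32_t pc;               /* guest address of the block */
    uint32_t tlb_generation;   /* generation in which `target` was validated */
    TB *target;                /* chained successor; NULL means "use the slow path" */
};

static uint32_t global_tlb_generation;
static unsigned slow_lookups;  /* counts how often the slow path ran */

/* A TLB flush (or a switch to another CPU) invalidates all inter-page links. */
static void tlb_flush(void)
{
    global_tlb_generation++;
}

/* Slow path: look up the successor, then re-chain for the current generation. */
static TB *slow_path_lookup(TB *from, TB *resolved)
{
    slow_lookups++;
    from->target = resolved;
    from->tlb_generation = global_tlb_generation;
    return resolved;
}

/* What the end of a block with an inter-page direct branch would do. */
static TB *follow_interpage_branch(TB *tb, TB *resolved)
{
    if (tb->tlb_generation != global_tlb_generation) {
        tb->target = NULL;     /* revalidate: the stale link points back at the slow path */
    }
    if (tb->target != NULL) {
        return tb->target;     /* fast path: the link is still valid */
    }
    return slow_path_lookup(tb, resolved);
}
```

In this model the first execution and the first execution after each
tlb_flush() take the slow path; every other execution only pays for the
generation compare and the jump.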

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] branches are expensive
  2009-03-17 11:05 [Qemu-devel] branches are expensive Steffen Liebergeld
  2009-03-17 11:18 ` Avi Kivity
@ 2009-03-17 13:30 ` Laurent Desnogues
  1 sibling, 0 replies; 14+ messages in thread
From: Laurent Desnogues @ 2009-03-17 13:30 UTC (permalink / raw)
  To: qemu-devel

On Tue, Mar 17, 2009 at 12:05 PM, Steffen Liebergeld <usenet@gmx.eu> wrote:
>
> while measuring the execution of an ARM guest, I encountered that branches are
> extremely expensive in terms of executed host instructions. A single ARM
> branch takes 148 to 152 host instructions. In my setup host and guest use the
> ARM instruction set architecture.
>
> My question is what makes branches so expensive? What code is run when
> executing a branch? Other instructions are translated to 14 to 40
> instructions.

This raises a few questions:

1. are you talking about qemu system or qemu user (I guess the former)?
2. how did you measure executed host instructions?
3. what is your host processor?

In my experience, TBs for qemu user show about 7x code *size*
expansion on an x86_64 host (though svn qemu is probably higher,
as I have a few specific tunings).


Laurent


* [Qemu-devel] Re: branches are expensive
  2009-03-17 12:51       ` Paul Brook
  2009-03-17 13:24         ` Avi Kivity
@ 2009-03-19 10:07         ` Steffen Liebergeld
  2009-03-19 10:30           ` Laurent Desnogues
                             ` (2 more replies)
  1 sibling, 3 replies; 14+ messages in thread
From: Steffen Liebergeld @ 2009-03-19 10:07 UTC (permalink / raw)
  To: qemu-devel

Hi Paul,

Paul Brook <paul@codesourcery.com> wrote:
>> The ratio is quite bad. Do you have any documentation on when Qemu does the
>> chaining and more important, when it does not. For example are
>> unconditional jumps always chained, or only in one direction (forward or
>> backward).
>
> Direct jumps[1] within the same page are chained (including ). Indirect 
> jumps[2] and direct jumps to a different page are not chained. Chaining jumps 
> between pages would require breaking TB chains every time a TLB flush occurs.

I've tested Qemu 0.10.0. With i386-softmmu on an i386 host I get the
following numbers:
direct jump count 70%, 2 jumps 54%

For qemu-system-arm on an ARM host, the numbers look like this:
direct jump count 47%, 2 jumps 40%

For completeness, I tested qemu-system-arm on an i386 host as well:
direct jump count 44%, 2 jumps 37%

So it looks like chaining on ARM targets is not as effective as on i386
targets, regardless of the host (I used the same guest setup, compiled for
different architectures, in all tests). Do you have any idea why this is the
case?

> Unchained jumps just a two stage lookup to cache frequently used entries.

Many thanks for your explanations.

> Paul
>
> [1] b, b<cc> and bl
> [2] bx, mov pc, ldr pc, pop {pc}

Greetings, Steffen


* Re: [Qemu-devel] Re: branches are expensive
  2009-03-19 10:07         ` Steffen Liebergeld
@ 2009-03-19 10:30           ` Laurent Desnogues
  2009-03-19 10:39             ` Steffen Liebergeld
  2009-03-19 10:52           ` Avi Kivity
  2009-03-19 11:34           ` Paul Brook
  2 siblings, 1 reply; 14+ messages in thread
From: Laurent Desnogues @ 2009-03-19 10:30 UTC (permalink / raw)
  To: qemu-devel

On Thu, Mar 19, 2009 at 11:07 AM, Steffen Liebergeld <usenet@gmx.eu> wrote:
>
> I've tested Qemu 0.10.0 and with i386-softmmu on a i386 host I get the
> following numbers:
> direct jump count 70%, 2 jumps 54%
>
> For qemu-system-arm on an ARM host, the numbers look like this:
> direct jump count 47%, 2 jumps 40%
>
> For completeness I tested qemu-system-arm on a i386 host as well:
> direct jump count 44%, 2 jumps 37%
>
> So it looks like the chaining on ARM targets is not as effective as on i386
> targets (regardless of the guest, I used the same guest setup, compiled for
> different architectures, on all tests). Do you have any ideas why this is the
> case?

Different instruction sets, different compilers.  You'd better compare
guest code before drawing any conclusion.


Laurent


* [Qemu-devel] Re: branches are expensive
  2009-03-19 10:30           ` Laurent Desnogues
@ 2009-03-19 10:39             ` Steffen Liebergeld
  2009-03-19 11:06               ` Laurent Desnogues
  0 siblings, 1 reply; 14+ messages in thread
From: Steffen Liebergeld @ 2009-03-19 10:39 UTC (permalink / raw)
  To: qemu-devel

Laurent Desnogues <laurent.desnogues@gmail.com> wrote:
> On Thu, Mar 19, 2009 at 11:07 AM, Steffen Liebergeld <usenet@gmx.eu> wrote:
>>
>> I've tested Qemu 0.10.0 and with i386-softmmu on a i386 host I get the
>> following numbers:
>> direct jump count 70%, 2 jumps 54%
>>
>> For qemu-system-arm on an ARM host, the numbers look like this:
>> direct jump count 47%, 2 jumps 40%
>>
>> For completeness I tested qemu-system-arm on a i386 host as well:
>> direct jump count 44%, 2 jumps 37%
>>
>> So it looks like the chaining on ARM targets is not as effective as on i386
>> targets (regardless of the guest, I used the same guest setup, compiled for
>> different architectures, on all tests). Do you have any ideas why this is the
>> case?
>
> Different instruction sets, different compilers.  You'd better compare
> guest code before drawing any conclusion.

The qemu-system-arm on ARM and i386 were compiled by the same compiler.

Greetings, Steffen


* Re: [Qemu-devel] Re: branches are expensive
  2009-03-19 10:07         ` Steffen Liebergeld
  2009-03-19 10:30           ` Laurent Desnogues
@ 2009-03-19 10:52           ` Avi Kivity
  2009-03-19 11:34           ` Paul Brook
  2 siblings, 0 replies; 14+ messages in thread
From: Avi Kivity @ 2009-03-19 10:52 UTC (permalink / raw)
  To: qemu-devel

Steffen Liebergeld wrote:
> I've tested Qemu 0.10.0 and with i386-softmmu on a i386 host I get the
> following numbers:
> direct jump count 70%, 2 jumps 54%
>
> For qemu-system-arm on an ARM host, the numbers look like this:
> direct jump count 47%, 2 jumps 40%
>
> For completeness I tested qemu-system-arm on a i386 host as well:
> direct jump count 44%, 2 jumps 37%
>
> So it looks like the chaining on ARM targets is not as effective as on i386
> targets (regardless of the guest, I used the same guest setup, compiled for
> different architectures, on all tests). Do you have any ideas why this is the
> case?
>   

I'd guess that predicated instructions are heavily used on ARM for 
if/else sequences, so intra-page branches would be less frequent.


-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] Re: branches are expensive
  2009-03-19 10:39             ` Steffen Liebergeld
@ 2009-03-19 11:06               ` Laurent Desnogues
  0 siblings, 0 replies; 14+ messages in thread
From: Laurent Desnogues @ 2009-03-19 11:06 UTC (permalink / raw)
  To: qemu-devel

On Thu, Mar 19, 2009 at 11:39 AM, Steffen Liebergeld <usenet@gmx.eu> wrote:
> Laurent Desnogues <laurent.desnogues@gmail.com> wrote:
>> On Thu, Mar 19, 2009 at 11:07 AM, Steffen Liebergeld <usenet@gmx.eu> wrote:
>>>
>>> I've tested Qemu 0.10.0 and with i386-softmmu on a i386 host I get the
>>> following numbers:
>>> direct jump count 70%, 2 jumps 54%
>>>
>>> For qemu-system-arm on an ARM host, the numbers look like this:
>>> direct jump count 47%, 2 jumps 40%
>>>
>>> For completeness I tested qemu-system-arm on a i386 host as well:
>>> direct jump count 44%, 2 jumps 37%
>>>
>>> So it looks like the chaining on ARM targets is not as effective as on i386
>>> targets (regardless of the guest, I used the same guest setup, compiled for
>>> different architectures, on all tests). Do you have any ideas why this is the
>>> case?
>>
>> Different instruction sets, different compilers.  You'd better compare
>> guest code before drawing any conclusion.
>
> The qemu-system-arm on ARM and i386 were compiled by the same compiler.

I said "guest" not "host" :)

You need to compare ARM and i386 blocks *before* translation.


Laurent


* Re: [Qemu-devel] Re: branches are expensive
  2009-03-19 10:07         ` Steffen Liebergeld
  2009-03-19 10:30           ` Laurent Desnogues
  2009-03-19 10:52           ` Avi Kivity
@ 2009-03-19 11:34           ` Paul Brook
  2 siblings, 0 replies; 14+ messages in thread
From: Paul Brook @ 2009-03-19 11:34 UTC (permalink / raw)
  To: qemu-devel; +Cc: Steffen Liebergeld

> I've tested Qemu 0.10.0 and with i386-softmmu on a i386 host I get the
> following numbers:
> direct jump count 70%, 2 jumps 54%
>
> For completeness I tested qemu-system-arm on a i386 host as well:
> direct jump count 44%, 2 jumps 37%
>
> So it looks like the chaining on ARM targets is not as effective as on i386
> targets (regardless of the guest, I used the same guest setup, compiled for
> different architectures, on all tests). Do you have any ideas why this is
> the case?

A couple of likely reasons:

- ARM uses 1k pages, i386 uses 4k pages, so there's greater probability of 
spanning a page boundary.
- ARM has conditional execution, so code will tend to have fewer conditional 
branches. The number of function calls (in particular function returns, which 
are indirect branches) is likely to be about the same, so the proportion of 
direct jumps is lower.
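
To put a rough number on the page-size effect, here is a minimal C sketch. It
assumes forward branches of a fixed byte displacement whose start addresses are
spread evenly over a page at 4-byte granularity; crossing_starts and
crossing_percent are hypothetical helper names, and real code (in particular
variable-length i386 code) will not follow this distribution exactly.

```c
#include <stdint.h>

/* Count the start offsets within one page for which a forward branch of
 * `disp` bytes lands outside that page. */
static uint32_t crossing_starts(uint32_t page_size, uint32_t disp,
                                uint32_t insn_size)
{
    uint32_t crossings = 0;

    for (uint32_t off = 0; off < page_size; off += insn_size) {
        if (off + disp >= page_size) {
            crossings++;
        }
    }
    return crossings;
}

/* Percentage (rounded down) of such branches that cross a page boundary. */
static uint32_t crossing_percent(uint32_t page_size, uint32_t disp)
{
    const uint32_t insn_size = 4;   /* assumed fixed 4-byte granularity */
    uint32_t slots = page_size / insn_size;

    return 100u * crossing_starts(page_size, disp, insn_size) / slots;
}
```

Under these assumptions a 256-byte forward branch leaves its page in 25% of
the cases with 1k pages (crossing_percent(1024, 256)), but only in about 6%
of the cases with 4k pages (crossing_percent(4096, 256)).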

Paul


