* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
@ 2012-06-14 12:35 YeongKyoon Lee
  2012-06-15 19:20 ` Blue Swirl
  0 siblings, 1 reply; 30+ messages in thread
From: YeongKyoon Lee @ 2012-06-14 12:35 UTC (permalink / raw)
  To: 陳韋任 (Wei-Ren Chen), Laurent Desnogues; +Cc: qemu-devel

Hi,

I proposed the QEMU ld/st optimizations twice, late last year and early this year, but there were no replies, and the 0/3 patch mail from this year has even gone missing, so I cannot find it in the QEMU mail archives. :(
The 0/3 link (http://www.mail-archive.com/qemu-devel@nongnu.org/msg91294.html) is the old one from last year.

Anyway, the links to the other patches (1/3 to 3/3) are as follows.
http://lists.nongnu.org/archive/html/qemu-devel/2012-01/msg00026.html
http://lists.nongnu.org/archive/html/qemu-devel/2012-01/msg00027.html
http://lists.gnu.org/archive/html/qemu-devel/2012-01/msg00028.html

Because there were no replies on qemu-devel, I have applied the patches only to the Tizen emulator (Tizen is a new mobile platform from the Linux Foundation).
Meanwhile, the QEMU mainline has changed since those patches were posted, so I'll propose them again after adapting them to the latest sources.

Thanks.

------- Original Message -------
Sender : 陳韋任 (Wei-Ren Chen)<chenwj@iis.sinica.edu.tw>
Date : 2012-06-14 12:31 (GMT+09:00)
Title : Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?

> As a side note, it might be interesting to gather statistics about the hit
> rate of the QEMU TLB.  Another thing to consider is speeding up the
> fast path;  see YeongKyoon Lee RFC patch:
> 
> http://www.mail-archive.com/qemu-devel@nongnu.org/msg91294.html

  I only see PATCH 0/3; any idea where the others are?

Regards,
chenwj

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj

__________________________________
Principal Engineer 
VM Team 
Yeongkyoon Lee 

S-Core Co., Ltd.
D.L.: +82-31-696-7249
M.P.: +82-10-9965-1265
__________________________________


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-14 12:35 [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time? YeongKyoon Lee
@ 2012-06-15 19:20 ` Blue Swirl
  2012-06-18  6:57   ` 陳韋任 (Wei-Ren Chen)
  0 siblings, 1 reply; 30+ messages in thread
From: Blue Swirl @ 2012-06-15 19:20 UTC (permalink / raw)
  To: yeongkyoon.lee
  Cc: Laurent Desnogues, qemu-devel, 陳韋任 (Wei-Ren Chen)

On Thu, Jun 14, 2012 at 12:35 PM, YeongKyoon Lee
<yeongkyoon.lee@samsung.com> wrote:
> Hi,
>
> I proposed the QEMU ld/st optimizations twice, late last year and early this year, but there were no replies, and the 0/3 patch mail from this year has even gone missing, so I cannot find it in the QEMU mail archives. :(
> The 0/3 link (http://www.mail-archive.com/qemu-devel@nongnu.org/msg91294.html) is the old one from last year.
>
> Anyway, the links to the other patches (1/3 to 3/3) are as follows.
> http://lists.nongnu.org/archive/html/qemu-devel/2012-01/msg00026.html
> http://lists.nongnu.org/archive/html/qemu-devel/2012-01/msg00027.html
> http://lists.gnu.org/archive/html/qemu-devel/2012-01/msg00028.html
>
> Because there were no replies on qemu-devel, I have applied the patches only to the Tizen emulator (Tizen is a new mobile platform from the Linux Foundation).
> Meanwhile, the QEMU mainline has changed since those patches were posted, so I'll propose them again after adapting them to the latest sources.

The idea looks nice, but instead of different TLB functions selected
at configure time, the optimization should be enabled by default.

Maybe a 'call' instruction could be used to jump to the slow path,
that way the slow path could be shared.

>
> Thanks.
>
> ------- Original Message -------
> Sender : 陳韋任 (Wei-Ren Chen)<chenwj@iis.sinica.edu.tw>
> Date : 2012-06-14 12:31 (GMT+09:00)
> Title : Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
>
>> As a side note, it might be interesting to gather statistics about the hit
>> rate of the QEMU TLB.  Another thing to consider is speeding up the
>> fast path;  see YeongKyoon Lee RFC patch:
>>
>> http://www.mail-archive.com/qemu-devel@nongnu.org/msg91294.html
>
>  I only see PATCH 0/3; any idea where the others are?
>
> Regards,
> chenwj
>
> --
> Wei-Ren Chen (陳韋任)
> Computer Systems Lab, Institute of Information Science,
> Academia Sinica, Taiwan (R.O.C.)
> Tel:886-2-2788-3799 #1667
> Homepage: http://people.cs.nctu.edu.tw/~chenwj
>
> __________________________________
> Principal Engineer
> VM Team
> Yeongkyoon Lee
>
> S-Core Co., Ltd.
> D.L.: +82-31-696-7249
> M.P.: +82-10-9965-1265
> __________________________________


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-15 19:20 ` Blue Swirl
@ 2012-06-18  6:57   ` 陳韋任 (Wei-Ren Chen)
  2012-06-18 19:28     ` Blue Swirl
  0 siblings, 1 reply; 30+ messages in thread
From: 陳韋任 (Wei-Ren Chen) @ 2012-06-18  6:57 UTC (permalink / raw)
  To: Blue Swirl
  Cc: Laurent Desnogues, qemu-devel,
	陳韋任 (Wei-Ren Chen),
	yeongkyoon.lee

> The idea looks nice, but instead of different TLB functions selected
> at configure time, the optimization should be enabled by default.
> 
> Maybe a 'call' instruction could be used to jump to the slow path,
> that way the slow path could be shared.

  I don't understand what "maybe a 'call' instruction could be used to jump to
the slow path" means; could you elaborate on that? From YeongKyoon's cover
letter [1], the current flow is:

  (1) TLB check
  (2) If hit fall through, else jump to TLB miss case (5)
  (3) TLB hit case: Load value from host memory
  (4) Jump to next code (6)
  (5) TLB miss case: call MMU helper
  (6) ... (next code)

Do you mean we directly call the MMU helper in step 2?

Regards,
chenwj

[1] http://www.mail-archive.com/qemu-devel@nongnu.org/msg91294.html

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-18  6:57   ` 陳韋任 (Wei-Ren Chen)
@ 2012-06-18 19:28     ` Blue Swirl
  0 siblings, 0 replies; 30+ messages in thread
From: Blue Swirl @ 2012-06-18 19:28 UTC (permalink / raw)
  To: 陳韋任 (Wei-Ren Chen)
  Cc: Laurent Desnogues, qemu-devel, yeongkyoon.lee

On Mon, Jun 18, 2012 at 6:57 AM, 陳韋任 (Wei-Ren Chen)
<chenwj@iis.sinica.edu.tw> wrote:
>> The idea looks nice, but instead of different TLB functions selected
>> at configure time, the optimization should be enabled by default.
>>
>> Maybe a 'call' instruction could be used to jump to the slow path,
>> that way the slow path could be shared.
>
>  I don't understand what "maybe a 'call' instruction could be used to jump to
> the slow path" means; could you elaborate on that? From YeongKyoon's cover
> letter [1], the current flow is:
>
>  (1) TLB check
>  (2) If hit fall through, else jump to TLB miss case (5)
>  (3) TLB hit case: Load value from host memory
>  (4) Jump to next code (6)
>  (5) TLB miss case: call MMU helper
>  (6) ... (next code)

AFAICT, with the patches, the flow would become
(1) TLB check
(2) If hit fall through, else jump to TLB miss case (5)
(3) TLB hit case: Load value from host memory
(4) ... (next code)

Later generated:

(5) TLB miss case: call MMU helper
(6) Jump back to (4)
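
To make the shape of this concrete, a small self-contained C model of that flow
could look like the following (this is not QEMU's generated code or its real
TLB layout; every name and size below is invented for illustration):

#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS 12
#define PAGE_SIZE (1u << PAGE_BITS)
#define TLB_SIZE  256
#define RAM_SIZE  (1u << 20)

typedef struct {
    uint32_t tag;          /* guest virtual page number of the cached entry */
    uint8_t *host_page;    /* host address of that page; NULL while unused  */
} TlbEntrySketch;

static TlbEntrySketch tlb[TLB_SIZE];
static uint8_t guest_ram[RAM_SIZE];   /* stand-in for guest physical memory */

/* (5)+(6): out-of-line slow path -- the "MMU helper" refills the TLB entry */
static uint8_t slow_path_ld(uint32_t vaddr)
{
    uint32_t vpn = vaddr >> PAGE_BITS;
    TlbEntrySketch *e = &tlb[vpn % TLB_SIZE];
    e->tag = vpn;
    /* trivial "page table": guest physical address = vaddr modulo RAM size */
    e->host_page = &guest_ram[(vaddr % RAM_SIZE) & ~(PAGE_SIZE - 1u)];
    return e->host_page[vaddr & (PAGE_SIZE - 1u)];
}

/* (1)-(4): inline fast path -- TLB check, direct host load on a hit */
static uint8_t qemu_ld_sketch(uint32_t vaddr)
{
    TlbEntrySketch *e = &tlb[(vaddr >> PAGE_BITS) % TLB_SIZE];
    if (e->host_page && e->tag == vaddr >> PAGE_BITS) {
        return e->host_page[vaddr & (PAGE_SIZE - 1u)];  /* hit: one host load */
    }
    return slow_path_ld(vaddr);                         /* miss: helper call  */
}

int main(void)
{
    guest_ram[0x1234] = 42;
    printf("%d then %d\n", qemu_ld_sketch(0x1234), qemu_ld_sketch(0x1234));
    return 0;
}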

>
> Do you mean we directly call the MMU helper in step 2?

Not really. But the interface between the helper and the generated
code could be optimized, for example to get rid of the redundant TLB
check or pass around already calculated values.

>
> Regards,
> chenwj
>
> [1] http://www.mail-archive.com/qemu-devel@nongnu.org/msg91294.html
>
> --
> Wei-Ren Chen (陳韋任)
> Computer Systems Lab, Institute of Information Science,
> Academia Sinica, Taiwan (R.O.C.)
> Tel:886-2-2788-3799 #1667
> Homepage: http://people.cs.nctu.edu.tw/~chenwj


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-22  9:58 ` Xin Tong
@ 2012-06-24  6:13   ` Blue Swirl
  0 siblings, 0 replies; 30+ messages in thread
From: Blue Swirl @ 2012-06-24  6:13 UTC (permalink / raw)
  To: Xin Tong; +Cc: qemu-devel, 陳韋任 (Wei-Ren Chen)

On Fri, Jun 22, 2012 at 9:58 AM, Xin Tong <xerox.time.tech@gmail.com> wrote:
> It is a pity that QEMU does not outline the TLB lookup code. I do not
> know how much impact the inlined TLB code has due to icache misses...

With a test case the impact could be measured. Maybe it could be just
a program performing loads in a loop, executing under a user emulator.
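
A throwaway test program of that kind could be as simple as the following
(sizes and iteration counts are arbitrary; the only goal is a long stream of
dependent loads that can be timed under qemu-user and again inside a softmmu
guest):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    enum { N = 1 << 22, ITER = 100 };
    uint32_t *buf = malloc(N * sizeof *buf);
    uint32_t idx = 0, sum = 0;
    if (!buf) {
        return 1;
    }
    /* touch many pages so the emulated TLB is really exercised */
    for (uint32_t i = 0; i < N; i++) {
        buf[i] = (uint32_t)((i * 2654435761u) % N);
    }
    for (int it = 0; it < ITER; it++) {
        for (uint32_t i = 0; i < N; i++) {
            idx = buf[idx];      /* one dependent guest load (qemu_ld) per step */
            sum += idx;
        }
    }
    printf("%u\n", sum);         /* keep the loop from being optimized away */
    free(buf);
    return 0;
}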

> Another benefit one gets from outlined TLB code is that it is much
> easier to gather the amount of time spent in the TLB. One can just
> profile QEMU and count how many ticks are spent in the outlined TLB
> translation code.

There's also a possible benefit that the code generation buffer does
not fill as fast.

> In fact, I do not think outlining QEMU's inlined TLB lookup is too hard
> to implement. One can still keep most of the original inlined TLB code
> and use call/ret to get a TLB translation. Of course, one needs to
> come up with a new linkage.

If it can be shown with a test case and statistics that outlining does
not make things worse, we can switch.

>
> Xin
>
>
> On Wed, Jun 20, 2012 at 3:57 AM, 陳韋任 (Wei-Ren Chen)
> <chenwj@iis.sinica.edu.tw> wrote:
>>  CC'ed to the mailing list.
>>
>> --
>> Wei-Ren Chen (陳韋任)
>> Computer Systems Lab, Institute of Information Science,
>> Academia Sinica, Taiwan (R.O.C.)
>> Tel:886-2-2788-3799 #1667
>> Homepage: http://people.cs.nctu.edu.tw/~chenwj
>>
>>
>> ---------- Forwarded message ----------
>> From: Orit Wasserman <owasserm@redhat.com>
>> To: "\"陳韋任 (Wei-Ren Chen)\"" <chenwj@iis.sinica.edu.tw>
>> Cc:
>> Date: Tue, 19 Jun 2012 12:01:08 +0300
>> Subject: Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
>> On 06/19/2012 11:49 AM, 陳韋任 (Wei-Ren Chen) wrote:
>>>   Mind me CC this to ML? :)
>> sure I will read the threads to understand more.
>>
>> Orit
>>>
>>>> Well, it was a while back (2008-9); the company was acquired by IBM a year later:
>>>> http://www.linux-kvm.org/wiki/images/9/98/KvmForum2008%24kdf2008_2.pdf
>>>> I think Stefan Hajnoczi worked there ...
>>>> The company used the technology for cross-platform guest support but claimed to get a speedup too
>>>> (for ppc); I don't think the speedup was related to the MMU but more to the instruction stream.
>>>> I hope this is helpful.
>>>
>>>   Thanks.
>>>
>>>> Do you have performance result for the cost of the address translation ?
>>>> If I understand you are concentrating on ARM ?
>>>
>>>   The whole discussion thread is on [1], and you can get some feel about
>>> the cost of address translation here [2]. Yes, ARM is our target right now,
>>> but I think we are not limited to it.
>>>
>>> Regards,
>>> chenwj
>>>
>>> [1] http://www.mail-archive.com/qemu-devel@nongnu.org/msg116159.html
>>> [2] http://www.mail-archive.com/qemu-devel@nongnu.org/msg116404.html
>>>
>>
>


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-20  7:57 陳韋任 (Wei-Ren Chen)
@ 2012-06-22  9:58 ` Xin Tong
  2012-06-24  6:13   ` Blue Swirl
  0 siblings, 1 reply; 30+ messages in thread
From: Xin Tong @ 2012-06-22  9:58 UTC (permalink / raw)
  To: 陳韋任 (Wei-Ren Chen); +Cc: qemu-devel

It is a pity that QEMU does not outline the TLB lookup code. I do not
know how much impact the inlined TLB code has due to icache misses...

Another benefit one gets from outlined TLB code is that it is much
easier to gather the amount of time spent in the TLB. One can just
profile QEMU and count how many ticks are spent in the outlined TLB
translation code.

In fact, I do not think outlining QEMU's inlined TLB lookup is too hard
to implement. One can still keep most of the original inlined TLB code
and use call/ret to get a TLB translation. Of course, one needs to
come up with a new linkage.

Xin


On Wed, Jun 20, 2012 at 3:57 AM, 陳韋任 (Wei-Ren Chen)
<chenwj@iis.sinica.edu.tw> wrote:
>  CC'ed to the mailing list.
>
> --
> Wei-Ren Chen (陳韋任)
> Computer Systems Lab, Institute of Information Science,
> Academia Sinica, Taiwan (R.O.C.)
> Tel:886-2-2788-3799 #1667
> Homepage: http://people.cs.nctu.edu.tw/~chenwj
>
>
> ---------- Forwarded message ----------
> From: Orit Wasserman <owasserm@redhat.com>
> To: "\"陳韋任 (Wei-Ren Chen)\"" <chenwj@iis.sinica.edu.tw>
> Cc:
> Date: Tue, 19 Jun 2012 12:01:08 +0300
> Subject: Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
> On 06/19/2012 11:49 AM, 陳韋任 (Wei-Ren Chen) wrote:
>>   Mind me CC this to ML? :)
> sure I will read the threads to understand more.
>
> Orit
>>
>>> Well, it was a while back (2008-9); the company was acquired by IBM a year later:
>>> http://www.linux-kvm.org/wiki/images/9/98/KvmForum2008%24kdf2008_2.pdf
>>> I think Stefan Hajnoczi worked there ...
>>> The company used the technology for cross-platform guest support but claimed to get a speedup too
>>> (for ppc); I don't think the speedup was related to the MMU but more to the instruction stream.
>>> I hope this is helpful.
>>
>>   Thanks.
>>
>>> Do you have performance result for the cost of the address translation ?
>>> If I understand you are concentrating on ARM ?
>>
>>   The whole discussion thread is on [1], and you can get some feel about
>> the cost of address translation here [2]. Yes, ARM is our target right now,
>> but I think we are not limited to it.
>>
>> Regards,
>> chenwj
>>
>> [1] http://www.mail-archive.com/qemu-devel@nongnu.org/msg116159.html
>> [2] http://www.mail-archive.com/qemu-devel@nongnu.org/msg116404.html
>>
>


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
@ 2012-06-20  7:57 陳韋任 (Wei-Ren Chen)
  2012-06-22  9:58 ` Xin Tong
  0 siblings, 1 reply; 30+ messages in thread
From: 陳韋任 (Wei-Ren Chen) @ 2012-06-20  7:57 UTC (permalink / raw)
  To: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 229 bytes --]

  CC'ed to the mailing list.

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj

[-- Attachment #2: Type: message/rfc822, Size: 4685 bytes --]

From: Orit Wasserman <owasserm@redhat.com>
To: "\"陳韋任 (Wei-Ren Chen)\"" <chenwj@iis.sinica.edu.tw>
Subject: Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
Date: Tue, 19 Jun 2012 12:01:08 +0300
Message-ID: <4FE03FD4.4070508@redhat.com>

On 06/19/2012 11:49 AM, 陳韋任 (Wei-Ren Chen) wrote:
>   Mind if I CC this to the ML? :)
Sure, I will read the threads to understand more.

Orit
> 
>> Well, it was a while back (2008-9); the company was acquired by IBM a year later:
>> http://www.linux-kvm.org/wiki/images/9/98/KvmForum2008%24kdf2008_2.pdf
>> I think Stefan Hajnoczi worked there ...
>> The company used the technology for cross-platform guest support but claimed to get a speedup too
>> (for ppc); I don't think the speedup was related to the MMU but more to the instruction stream.
>> I hope this is helpful.
> 
>   Thanks.
>  
>> Do you have performance result for the cost of the address translation ?
>> If I understand you are concentrating on ARM ?
> 
>   The whole discussion thread is on [1], and you can get some feel about
> the cost of address translation here [2]. Yes, ARM is our target right now,
> but I think we are not limited to it.
> 
> Regards,
> chenwj
> 
> [1] http://www.mail-archive.com/qemu-devel@nongnu.org/msg116159.html
> [2] http://www.mail-archive.com/qemu-devel@nongnu.org/msg116404.html
> 


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-19  9:47           ` Michael.Kang
@ 2012-06-19 10:51             ` Lluís Vilanova
  0 siblings, 0 replies; 30+ messages in thread
From: Lluís Vilanova @ 2012-06-19 10:51 UTC (permalink / raw)
  To: Michael.Kang
  Cc: Blue Swirl, laurent.desnogues, qemu-devel,
	陳韋任 (Wei-Ren Chen),
	stefanha

Michael Kang writes:

> On Tue, Jun 19, 2012 at 4:26 AM, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
[...]
>> I could understand having multiple 32bit regions in QEMU's virtual space (no
>> need for KVM), one per guest page table, and then simply adding an offset to
>> every memory access to redirect it to the appropriate 32-bit region (1 region
>> per guest page table).
>> 
>> This could translate a single guest ld/st into a host ld+add+ld/st (the first
>> load is to get the "region" offset for the currently executing guest context).
>> 
>> With this, you can use 'mprotect' in QEMU to enforce the guest's page
>> permissions (as long as the host supports it), and 'mmap' to share the host
>> physical memory between the different 32-bit regions whenever the guest page
>> tables share guest physical memory (again, as long as the host supports it).
>> 
>> But I suppose having a guest with as many or more bits than the host is the
>> common case, which hinders its applicability.

> I once had similar thoughts. First, we would only simulate a 32-bit
> guest on a 64-bit host in that case.
> Second, I did some experiments, and I could not mmap more than about
> 8 GB of address space on a 64-bit Linux OS. Maybe there are some limits
> in the host's Linux kernel.

You can see your resource limits with "ulimit -a", but without more info I
cannot tell what's actually going on.


Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-18 20:26         ` Lluís Vilanova
  2012-06-19  7:52           ` 陳韋任 (Wei-Ren Chen)
@ 2012-06-19  9:47           ` Michael.Kang
  2012-06-19 10:51             ` Lluís Vilanova
  1 sibling, 1 reply; 30+ messages in thread
From: Michael.Kang @ 2012-06-19  9:47 UTC (permalink / raw)
  To: Lluís Vilanova
  Cc: Blue Swirl, laurent.desnogues, qemu-devel,
	陳韋任 (Wei-Ren Chen),
	stefanha

On Tue, Jun 19, 2012 at 4:26 AM, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
> Blue Swirl writes:
>
>> On Mon, Jun 18, 2012 at 8:28 AM, 陳韋任 (Wei-Ren Chen)
>> <chenwj@iis.sinica.edu.tw> wrote:
>>>>   The reason why we want to do the measuring is we want to use KVM (sounds crazy
>>>> idea) MMU virtualization to speedup the guest -> host memory address translation.
>>>> I talked to some people on LinuxCon Japan, included Paolo, about this idea. The
>>>> feedback I got is we can only use shadow page table rather than EPT/NPT to do
>>>> the address translation (if possible!) since different ISA (ARM and x86, for
>>>> example) have different page table format. Besides, QEMU has to use ioctl to ask
>>>> KVM to get the translation result, but it's an overkill as the ARM page table
>>>> is quite simple, which can be done in user mode very fast.
>>>
>>>  Anyone would like to give a comment on this? ;)
>>>
>>>  From the talk with Laurent on #qemu, he said the way he thought of is
>>> translating GVA -> GPA manually (through software), then try to insert
>>> GPA -> HPA into EPT, that's the only way HW can help.
>
>> For some 32 bit guests on some 64 bit hosts, maybe KVM could indeed
>> help. Just map the whole 4G guest virtual address space so that guest
>> memory accesses can be turned 1:1 into raw direct accesses. I/O pages
>> would be unmapped, accesses handled via fault path.
>
> But if QEMU/TCG is doing a GVA->GPA translation as Wei-Ren said, I don't see how
> KVM can help.
>
> I could understand having multiple 32bit regions in QEMU's virtual space (no
> need for KVM), one per guest page table, and then simply adding an offset to
> every memory access to redirect it to the appropriate 32-bit region (1 region
> per guest page table).
>
> This could translate a single guest ld/st into a host ld+add+ld/st (the first
> load is to get the "region" offset for the currently executing guest context).
>
> With this, you can use 'mprotect' in QEMU to enforce the guest's page
> permissions (as long as the host supports it), and 'mmap' to share the host
> physical memory between the different 32-bit regions whenever the guest page
> tables share guest physical memory (again, as long as the host supports it).
>
> But I suppose having a guest with as many or more bits than the host is the
> common case, which hinders its applicability.

I once had similar thoughts. First, we would only simulate a 32-bit
guest on a 64-bit host in that case.
Second, I did some experiments, and I could not mmap more than about
8 GB of address space on a 64-bit Linux OS. Maybe there are some limits
in the host's Linux kernel.

Thanks
MK

>
>
> Lluis
>
> --
>  "And it's much the same thing with knowledge, for whenever you learn
>  something new, the whole world becomes that much richer."
>  -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
>  Tollbooth
>



-- 
www.skyeye.org


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-19  7:52           ` 陳韋任 (Wei-Ren Chen)
@ 2012-06-19  9:35             ` Michael.Kang
  0 siblings, 0 replies; 30+ messages in thread
From: Michael.Kang @ 2012-06-19  9:35 UTC (permalink / raw)
  To: 陳韋任 (Wei-Ren Chen)
  Cc: Blue Swirl, laurent.desnogues, stefanha, Lluís Vilanova, qemu-devel

On Tue, Jun 19, 2012 at 3:52 PM, 陳韋任 (Wei-Ren Chen)
<chenwj@iis.sinica.edu.tw> wrote:
>> But if QEMU/TCG is doing a GVA->GPA translation as Wei-Ren said, I don't see how
>> KVM can help.
>
>  Just to clarify: QEMU maintains a TLB (env->tlb_table) which stores GVA ->
> HVA mappings; it is used to speed up the address translation. On a TLB miss,
> QEMU calls cpu_arm_handle_mmu_fault (taking ARM as an example) to do the
> GVA -> GPA translation.
>
>> I could understand having multiple 32bit regions in QEMU's virtual space (no
>> need for KVM), one per guest page table, and then simply adding an offset to
>> every memory access to redirect it to the appropriate 32-bit region (1 region
>> per guest page table).
>>
>> This could translate a single guest ld/st into a host ld+add+ld/st (the first
>> load is to get the "region" offset for the currently executing guest context).
>
>  How does that differ from what QEMU is doing? Each time we fill the TLB, we
> add an offset to the GPA to get the HVA, then store the GVA -> HVA mapping into
> the TLB (IIUC). I don't see much difference here.
I think what QEMU is doing is mapping GPA to HVA. Lluís means we can map GVA
to HVA directly, so we would not even need to look up the TLB and could use one
host memory access instruction to simulate one guest memory access instruction.

Thanks
MK

>
> Regards,
> chenwj
>
> --
> Wei-Ren Chen (陳韋任)
> Computer Systems Lab, Institute of Information Science,
> Academia Sinica, Taiwan (R.O.C.)
> Tel:886-2-2788-3799 #1667
> Homepage: http://people.cs.nctu.edu.tw/~chenwj
>



-- 
www.skyeye.org


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
       [not found]           ` <20120619084939.GA37515@cs.nctu.edu.tw>
@ 2012-06-19  9:01             ` Orit Wasserman
  0 siblings, 0 replies; 30+ messages in thread
From: Orit Wasserman @ 2012-06-19  9:01 UTC (permalink / raw)
  To: "陳韋任 (Wei-Ren Chen)"

On 06/19/2012 11:49 AM, 陳韋任 (Wei-Ren Chen) wrote:
>   Mind if I CC this to the ML? :)
Sure, I will read the threads to understand more.

Orit
> 
>> Well, it was a while back (2008-9); the company was acquired by IBM a year later:
>> http://www.linux-kvm.org/wiki/images/9/98/KvmForum2008%24kdf2008_2.pdf
>> I think Stefan Hajnoczi worked there ...
>> The company used the technology for cross-platform guest support but claimed to get a speedup too
>> (for ppc); I don't think the speedup was related to the MMU but more to the instruction stream.
>> I hope this is helpful.
> 
>   Thanks.
>  
>> Do you have performance result for the cost of the address translation ?
>> If I understand you are concentrating on ARM ?
> 
>   The whole discussion thread is on [1], and you can get some feel about
> the cost of address translation here [2]. Yes, ARM is our target right now,
> but I think we are not limited to it.
> 
> Regards,
> chenwj
> 
> [1] http://www.mail-archive.com/qemu-devel@nongnu.org/msg116159.html
> [2] http://www.mail-archive.com/qemu-devel@nongnu.org/msg116404.html
> 


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-18 20:26         ` Lluís Vilanova
@ 2012-06-19  7:52           ` 陳韋任 (Wei-Ren Chen)
  2012-06-19  9:35             ` Michael.Kang
  2012-06-19  9:47           ` Michael.Kang
  1 sibling, 1 reply; 30+ messages in thread
From: 陳韋任 (Wei-Ren Chen) @ 2012-06-19  7:52 UTC (permalink / raw)
  To: Lluís Vilanova
  Cc: Blue Swirl, laurent.desnogues, qemu-devel,
	陳韋任 (Wei-Ren Chen),
	stefanha

> But if QEMU/TCG is doing a GVA->GPA translation as Wei-Ren said, I don't see how
> KVM can help.

  Just to clarify: QEMU maintains a TLB (env->tlb_table) which stores GVA ->
HVA mappings; it is used to speed up the address translation. On a TLB miss,
QEMU calls cpu_arm_handle_mmu_fault (taking ARM as an example) to do the
GVA -> GPA translation.
 
> I could understand having multiple 32bit regions in QEMU's virtual space (no
> need for KVM), one per guest page table, and then simply adding an offset to
> every memory access to redirect it to the appropriate 32-bit region (1 region
> per guest page table).
> 
> This could translate a single guest ld/st into a host ld+add+ld/st (the first
> load is to get the "region" offset for the currently executing guest context).

  How does that differ from what QEMU is doing? Each time we fill the TLB, we
add an offset to the GPA to get the HVA, then store the GVA -> HVA mapping into
the TLB (IIUC). I don't see much difference here.
 
Regards,
chenwj

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-18 19:36       ` Blue Swirl
@ 2012-06-18 20:26         ` Lluís Vilanova
  2012-06-19  7:52           ` 陳韋任 (Wei-Ren Chen)
  2012-06-19  9:47           ` Michael.Kang
  0 siblings, 2 replies; 30+ messages in thread
From: Lluís Vilanova @ 2012-06-18 20:26 UTC (permalink / raw)
  To: Blue Swirl
  Cc: laurent.desnogues, stefanha, qemu-devel,
	陳韋任 (Wei-Ren Chen)

Blue Swirl writes:

> On Mon, Jun 18, 2012 at 8:28 AM, 陳韋任 (Wei-Ren Chen)
> <chenwj@iis.sinica.edu.tw> wrote:
>>>   The reason why we want to do the measuring is we want to use KVM (sounds crazy
>>> idea) MMU virtualization to speedup the guest -> host memory address translation.
>>> I talked to some people on LinuxCon Japan, included Paolo, about this idea. The
>>> feedback I got is we can only use shadow page table rather than EPT/NPT to do
>>> the address translation (if possible!) since different ISA (ARM and x86, for
>>> example) have different page table format. Besides, QEMU has to use ioctl to ask
>>> KVM to get the translation result, but it's an overkill as the ARM page table
>>> is quite simple, which can be done in user mode very fast.
>> 
>>  Anyone would like to give a comment on this? ;)
>> 
>>  From the talk with Laurent on #qemu, he said the way he thought of is
>> translating GVA -> GPA manually (through software), then try to insert
>> GPA -> HPA into EPT, that's the only way HW can help.

> For some 32 bit guests on some 64 bit hosts, maybe KVM could indeed
> help. Just map the whole 4G guest virtual address space so that guest
> memory accesses can be turned 1:1 into raw direct accesses. I/O pages
> would be unmapped, accesses handled via fault path.

But if QEMU/TCG is doing a GVA->GPA translation as Wei-Ren said, I don't see how
KVM can help.

I could understand having multiple 32bit regions in QEMU's virtual space (no
need for KVM), one per guest page table, and then simply adding an offset to
every memory access to redirect it to the appropriate 32-bit region (1 region
per guest page table).

This could translate a single guest ld/st into a host ld+add+ld/st (the first
load is to get the "region" offset for the currently executing guest context).

With this, you can use 'mprotect' in QEMU to enforce the guest's page
permissions (as long as the host supports it), and 'mmap' to share the host
physical memory between the different 32-bit regions whenever the guest page
tables share guest physical memory (again, as long as the host supports it).

But I suppose having a guest with as many or more bits than the host is the
common case, which hinders its applicability.
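
To make that more concrete, a rough sketch of the shape it could take, under
the assumptions above (32-bit guest on a 64-bit host, and a host whose
mmap/mprotect can mirror the guest permissions), might be the following; none
of this is existing QEMU code and error handling is mostly omitted:

#include <stdint.h>
#include <sys/mman.h>

#define GUEST_SPACE (1ULL << 32)   /* one full 4 GB region per guest context */
#define MAX_CTX     16

static uint8_t *region_base[MAX_CTX];   /* one region per guest page table */

/* reserve a 4 GB host region for one guest address-space context; pages are
 * later populated with mmap/mprotect so they mirror the guest page table */
static int map_guest_context(int ctx)
{
    void *p = mmap(NULL, GUEST_SPACE, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) {
        return -1;
    }
    region_base[ctx] = p;
    return 0;
}

/* a 32-bit guest load then becomes host "ld + add + ld":
 * load the region base for the current context, add the GVA, load */
static uint32_t guest_ld32(int cur_ctx, uint32_t gva)
{
    uint8_t *base = region_base[cur_ctx];   /* ld: fetch the "region" offset */
    return *(uint32_t *)(base + gva);       /* add + ld: direct host access  */
}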


Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-18  8:28     ` 陳韋任 (Wei-Ren Chen)
  2012-06-18 19:36       ` Blue Swirl
@ 2012-06-18 19:38       ` Lluís Vilanova
       [not found]       ` <20120619075534.GB34488@cs.nctu.edu.tw>
  2 siblings, 0 replies; 30+ messages in thread
From: Lluís Vilanova @ 2012-06-18 19:38 UTC (permalink / raw)
  To: 陳韋任 (Wei-Ren Chen)
  Cc: laurent.desnogues, stefanha, qemu-devel

陳韋任 (Wei-Ren Chen) writes:

>> The reason why we want to do the measuring is we want to use KVM (sounds crazy
>> idea) MMU virtualization to speedup the guest -> host memory address translation.
>> I talked to some people on LinuxCon Japan, included Paolo, about this idea. The
>> feedback I got is we can only use shadow page table rather than EPT/NPT to do
>> the address translation (if possible!) since different ISA (ARM and x86, for
>> example) have different page table format. Besides, QEMU has to use ioctl to ask
>> KVM to get the translation result, but it's an overkill as the ARM page table
>> is quite simple, which can be done in user mode very fast.

>   Would anyone like to comment on this? ;)

>   From a talk with Laurent on #qemu, the way he thought of is to translate
> GVA -> GPA manually (in software), then try to insert the GPA -> HPA mapping
> into the EPT; that's the only way the HW can help.

Sorry, but I don't see the benefit here.


Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-18  8:28     ` 陳韋任 (Wei-Ren Chen)
@ 2012-06-18 19:36       ` Blue Swirl
  2012-06-18 20:26         ` Lluís Vilanova
  2012-06-18 19:38       ` Lluís Vilanova
       [not found]       ` <20120619075534.GB34488@cs.nctu.edu.tw>
  2 siblings, 1 reply; 30+ messages in thread
From: Blue Swirl @ 2012-06-18 19:36 UTC (permalink / raw)
  To: 陳韋任 (Wei-Ren Chen)
  Cc: laurent.desnogues, stefanha, qemu-devel, vilanova

On Mon, Jun 18, 2012 at 8:28 AM, 陳韋任 (Wei-Ren Chen)
<chenwj@iis.sinica.edu.tw> wrote:
>>   The reason why we want to do the measuring is we want to use KVM (sounds crazy
>> idea) MMU virtualization to speedup the guest -> host memory address translation.
>> I talked to some people on LinuxCon Japan, included Paolo, about this idea. The
>> feedback I got is we can only use shadow page table rather than EPT/NPT to do
>> the address translation (if possible!) since different ISA (ARM and x86, for
>> example) have different page table format. Besides, QEMU has to use ioctl to ask
>> KVM to get the translation result, but it's an overkill as the ARM page table
>> is quite simple, which can be done in user mode very fast.
>
>  Would anyone like to comment on this? ;)
>
>  From a talk with Laurent on #qemu, the way he thought of is to translate
> GVA -> GPA manually (in software), then try to insert the GPA -> HPA mapping
> into the EPT; that's the only way the HW can help.

For some 32 bit guests on some 64 bit hosts, maybe KVM could indeed
help. Just map the whole 4G guest virtual address space so that guest
memory accesses can be turned 1:1 into raw direct accesses. I/O pages
would be unmapped, accesses handled via fault path.

This of course depends on guest and host MMU models being sufficiently
similar (esp. page size).

>
> Regards,
> chenwj
>
> --
> Wei-Ren Chen (陳韋任)
> Computer Systems Lab, Institute of Information Science,
> Academia Sinica, Taiwan (R.O.C.)
> Tel:886-2-2788-3799 #1667
> Homepage: http://people.cs.nctu.edu.tw/~chenwj
>


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-18  4:59 YeongKyoon Lee
@ 2012-06-18 19:21 ` Blue Swirl
  0 siblings, 0 replies; 30+ messages in thread
From: Blue Swirl @ 2012-06-18 19:21 UTC (permalink / raw)
  To: yeongkyoon.lee
  Cc: Laurent Desnogues, qemu-devel, 陳韋任 (Wei-Ren Chen)

On Mon, Jun 18, 2012 at 4:59 AM, YeongKyoon Lee
<yeongkyoon.lee@samsung.com> wrote:
>> The idea looks nice, but instead of different TLB functions selected
>> at configure time, the optimization should be enabled by default.
>>
>> Maybe a 'call' instruction could be used to jump to the slow path,
>> that way the slow path could be shared.
>
> I had considered the approach of sharing the slow path, specifically the argument setup for the ld/st helper functions.
> However, the problem is that we don't know which runtime registers hold the arguments.

Hm, I didn't consider that.

>
> There are three possible solutions for sharing, I think.
> 1. Using a code stub table that includes all the possible argument register combinations
> 2. Using a register-information flag constant generated at TCG time, and a special stub which parses the flag and sets up the arguments
> 3. Setting the arguments in common for the fast and slow paths, which needs additional stack or register cleanup on the fast path
>
> I think solution #2 looks better for performance; however, it would have to be implemented using a jump and hand-written assembly, because using a call could bring in unwanted code and register clobbering from the callee's prologue.
> But implementing the parsing stub in assembly (or via code generation) might be painful; that is the weak point of solution #2.
>
> What do you think about it?

I agree that #2 looks best.

>
> __________________________________
> Principal Engineer
> VM Team
> Yeongkyoon Lee
>
> S-Core Co., Ltd.
> D.L.: +82-31-696-7249
> M.P.: +82-10-9965-1265
> __________________________________


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-14  3:18   ` 陳韋任 (Wei-Ren Chen)
  2012-06-14 22:30     ` Lluís Vilanova
@ 2012-06-18  8:28     ` 陳韋任 (Wei-Ren Chen)
  2012-06-18 19:36       ` Blue Swirl
                         ` (2 more replies)
  1 sibling, 3 replies; 30+ messages in thread
From: 陳韋任 (Wei-Ren Chen) @ 2012-06-18  8:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: laurent.desnogues, stefanha, vilanova

>   The reason why we want to do this measurement is that we want to use KVM MMU
> virtualization (a crazy-sounding idea) to speed up the guest -> host memory
> address translation. I talked to some people at LinuxCon Japan, including Paolo,
> about this idea. The feedback I got is that we could only use a shadow page table
> rather than EPT/NPT to do the address translation (if it is possible at all),
> since different ISAs (ARM and x86, for example) have different page table
> formats. Besides, QEMU would have to use an ioctl to ask KVM for the translation
> result, which is overkill, as the ARM page table is quite simple and the walk
> can be done in user mode very quickly.

  Would anyone like to comment on this? ;)

  From a talk with Laurent on #qemu, the way he thought of is to translate
GVA -> GPA manually (in software), then try to insert the GPA -> HPA mapping
into the EPT; that's the only way the HW can help.

Regards,
chenwj

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
@ 2012-06-18  4:59 YeongKyoon Lee
  2012-06-18 19:21 ` Blue Swirl
  0 siblings, 1 reply; 30+ messages in thread
From: YeongKyoon Lee @ 2012-06-18  4:59 UTC (permalink / raw)
  To: Blue Swirl
  Cc: Laurent Desnogues, qemu-devel, 陳韋任 (Wei-Ren Chen)

> The idea looks nice, but instead of different TLB functions selected
> at configure time, the optimization should be enabled by default.
>
> Maybe a 'call' instruction could be used to jump to the slow path,
> that way the slow path could be shared.

I had considered the approach of sharing the slow path, specifically the argument setup for the ld/st helper functions.
However, the problem is that we don't know which runtime registers hold the arguments.

There are three possible solutions for sharing, I think.
1. Using a code stub table that includes all the possible argument register combinations
2. Using a register-information flag constant generated at TCG time, and a special stub which parses the flag and sets up the arguments
3. Setting the arguments in common for the fast and slow paths, which needs additional stack or register cleanup on the fast path

I think solution #2 looks better for performance; however, it would have to be implemented using a jump and hand-written assembly, because using a call could bring in unwanted code and register clobbering from the callee's prologue.
But implementing the parsing stub in assembly (or via code generation) might be painful; that is the weak point of solution #2.

What do you think about it?
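
Just to illustrate what I mean by solution #2, a possible (purely hypothetical)
encoding of such a flag constant, and the decoding the shared stub would do,
could look like the code below; the field layout is invented here, and the real
register moves would of course be emitted as host code:

#include <stdint.h>

/* pack: bits [7:0] = address register index, [15:8] = data register index,
 * bits [23:16] = mmu_idx; generated as a constant at TCG translation time */
static inline uint32_t make_ldst_flag(unsigned addr_reg, unsigned data_reg,
                                      unsigned mmu_idx)
{
    return (addr_reg & 0xffu) | ((data_reg & 0xffu) << 8)
         | ((mmu_idx & 0xffu) << 16);
}

/* unpack: what the shared slow-path stub would do before setting up the
 * call to the ld/st MMU helper */
static inline void decode_ldst_flag(uint32_t flag, unsigned *addr_reg,
                                    unsigned *data_reg, unsigned *mmu_idx)
{
    *addr_reg = flag & 0xffu;
    *data_reg = (flag >> 8) & 0xffu;
    *mmu_idx  = (flag >> 16) & 0xffu;
}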

__________________________________
Principal Engineer 
VM Team 
Yeongkyoon Lee 

S-Core Co., Ltd.
D.L.: +82-31-696-7249
M.P.: +82-10-9965-1265
__________________________________


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-15  8:23       ` Laurent Desnogues
@ 2012-06-15 19:10         ` Lluís Vilanova
  0 siblings, 0 replies; 30+ messages in thread
From: Lluís Vilanova @ 2012-06-15 19:10 UTC (permalink / raw)
  To: Laurent Desnogues
  Cc: stefanha, qemu-devel, 陳韋任 (Wei-Ren Chen)

Laurent Desnogues writes:

> On Fri, Jun 15, 2012 at 12:30 AM, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
> [...]
>> Now that I think of it, you will have problems generating code to surround each
>> qemu_ld/st with a lightweight mechanism to get the time. In x86 it would be
>> rdtsc, but you want to generate a host rdtsc instruction inside the code
>> generated by QEMU's TCG, so you should also have to hack TCG (or the code
>> generation pointers) to issue an rdtsc instruction.

> Even rdtsc would introduce enough noise that it wouldn't be reliable
> for such a micro measurement:  as far as I understand it, this instruction
> can be reordered, so you need to flush the pipeline before issuing it.

> Intel has a document about that:
> download.intel.com/embedded/software/IA/324264.pdf
> The overhead of their proposed method is so high that it's likely it
> would take longer than the execution of the fast path itself.

> IMHO a mix of YeongKyoon Lee way to count ld/st and comparison
> between user mode and softmmu still seems to be the best approach
> (well unless you have access to a cycle accurate simulator :-).

Ah, true; I forgot about the architectural implications. Sometimes you just
assume the nice in-order world :)


Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-14 22:30     ` Lluís Vilanova
@ 2012-06-15  8:23       ` Laurent Desnogues
  2012-06-15 19:10         ` Lluís Vilanova
  0 siblings, 1 reply; 30+ messages in thread
From: Laurent Desnogues @ 2012-06-15  8:23 UTC (permalink / raw)
  To: Lluís Vilanova
  Cc: stefanha, qemu-devel, 陳韋任 (Wei-Ren Chen)

On Fri, Jun 15, 2012 at 12:30 AM, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
[...]
> Now that I think of it, you will have problems generating code to surround each
> qemu_ld/st with a lightweight mechanism to get the time. In x86 it would be
> rdtsc, but you want to generate a host rdtsc instruction inside the code
> generated by QEMU's TCG, so you should also have to hack TCG (or the code
> generation pointers) to issue an rdtsc instruction.

Even rdtsc would introduce enough noise that it wouldn't be reliable
for such a micro measurement:  as far as I understand it, this instruction
can be reordered, so you need to flush the pipeline before issuing it.

Intel has a document about that:
download.intel.com/embedded/software/IA/324264.pdf
The overhead of their proposed method is so high that it's likely it
would take longer than the execution of the fast path itself.

IMHO a mix of YeongKyoon Lee's way of counting ld/st and a comparison
between user mode and softmmu still seems to be the best approach
(well, unless you have access to a cycle-accurate simulator :-).
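
For reference, the serialized measurement that document proposes looks roughly
like the sketch below (x86-64 GCC/Clang inline assembly, illustration only),
which is exactly the kind of heavyweight bracketing that would dwarf a
few-instruction fast path:

#include <stdint.h>

static inline uint64_t tsc_begin(void)
{
    uint32_t lo, hi;
    /* CPUID serializes earlier instructions, then RDTSC takes the start stamp */
    __asm__ volatile("cpuid\n\t"
                     "rdtsc\n\t"
                     "mov %%eax, %0\n\t"
                     "mov %%edx, %1"
                     : "=r"(lo), "=r"(hi) : : "rax", "rbx", "rcx", "rdx");
    return ((uint64_t)hi << 32) | lo;
}

static inline uint64_t tsc_end(void)
{
    uint32_t lo, hi;
    /* RDTSCP waits for earlier instructions to retire; the trailing CPUID
       keeps later instructions from moving up across the measurement point */
    __asm__ volatile("rdtscp\n\t"
                     "mov %%eax, %0\n\t"
                     "mov %%edx, %1\n\t"
                     "cpuid"
                     : "=r"(lo), "=r"(hi) : : "rax", "rbx", "rcx", "rdx");
    return ((uint64_t)hi << 32) | lo;
}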


Laurent


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
@ 2012-06-15  2:10 YeongKyoon Lee
  0 siblings, 0 replies; 30+ messages in thread
From: YeongKyoon Lee @ 2012-06-15  2:10 UTC (permalink / raw)
  To: Llu?s Vilanova, 陳韋任 (Wei-Ren Chen)
  Cc: Stefan Hajnoczi, qemu-devel

[-- Attachment #1: Type: text/html, Size: 2969 bytes --]

[-- Attachment #2: 201206151110961_PYMC4CBB.gif --]
[-- Type: image/gif, Size: 10014 bytes --]


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-14  3:29     ` 陳韋任 (Wei-Ren Chen)
@ 2012-06-14 23:17       ` Lluís Vilanova
  0 siblings, 0 replies; 30+ messages in thread
From: Lluís Vilanova @ 2012-06-14 23:17 UTC (permalink / raw)
  To: 陳韋任 (Wei-Ren Chen); +Cc: Stefan Hajnoczi, qemu-devel

陳韋任 (Wei-Ren Chen) writes:

>> Unfortunately, I had the bad idea of rebasing all my series on top of the latest
>> makefile changes, and I'll have to go through each patch to check it's still
>> working (I'm sure some of them broke).

>   Need some help? :)

Well, it's just a matter of going through all the series I have, checking
whether they still compile, and adjusting the makefiles accordingly, so it
should not take me that much effort.

But thanks anyway :)

I'll try to find some time for it next week.


Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-14  3:18   ` 陳韋任 (Wei-Ren Chen)
@ 2012-06-14 22:30     ` Lluís Vilanova
  2012-06-15  8:23       ` Laurent Desnogues
  2012-06-18  8:28     ` 陳韋任 (Wei-Ren Chen)
  1 sibling, 1 reply; 30+ messages in thread
From: Lluís Vilanova @ 2012-06-14 22:30 UTC (permalink / raw)
  To: 陳韋任 (Wei-Ren Chen)
  Cc: laurent.desnogues, stefanha, qemu-devel

陳韋任 (Wei-Ren Chen) writes:

> On Wed, Jun 13, 2012 at 12:43:28PM +0200, Laurent Desnogues wrote:
>> On Wed, Jun 13, 2012 at 5:14 AM, 陳韋任 (Wei-Ren Chen)
>> <chenwj@iis.sinica.edu.tw> wrote:
>> > Hi all,
>> >
>> >  I suspect that guest memory access (qemu_ld/qemu_st) account for the major of
>> > time spent in system mode. I would like to know precisely how much (if possible).
>> > We use tools like perf [1] before, but since the logic of guest memory access aslo
>> > embedded in the host binary not only helper functions, the result cannot be
>> > relied. The current idea is adding helper functions before/after guest memory
>> > access logic. Take ARM guest on x86_64 host for example, should I add the helper
>> > functions before/after tcg_gen_qemu_{ld,st} in target-arm/translate.c or
>> > tcg_out_qemu_{ld,st} in tcg/i386/tcg-target.c? Or there is a better way to know
>> > how much time QEMU spend on handling guest memory access?
>> 
>> I'm afraid there's no easy way to measure that: any change you make
>> to generated code will completely change the timing given that the ld/st
>> fast path is only a few instructions long.

>   Lluís, what's your opinion on that? Do your tracepoints have the same timing
> issue, too?

They just give you a set of well-known events and a public API to insert
whatever you want in there, so whatever overhead you might have by directly
hacking into QEMU, you will have it also when using trace instrumentation.

Now that I think of it, you will have problems generating code to surround each
qemu_ld/st with a lightweight mechanism to get the time. On x86 it would be
rdtsc, but you want to generate a host rdtsc instruction inside the code
generated by QEMU's TCG, so you would also have to hack TCG (or the code
generation pointers) to issue an rdtsc instruction.


>> Another approach might be to run the program in user mode and then in system
>> mode (provided the guest OS is very light).

>   We ran SPEC2006 test input both in user and system mode (arm guest os). The
> result is that system mode is roughly 2x slower than user mode. Not sure if the
> result is reasonable. 

Well, you have all the MMU checks in system mode.

You might try checking what percentage of the application is actually
performing memory operations. This could help you accept or dismiss your
theory.


Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-13 10:43 ` Laurent Desnogues
  2012-06-14  3:18   ` 陳韋任 (Wei-Ren Chen)
@ 2012-06-14  3:31   ` 陳韋任 (Wei-Ren Chen)
  1 sibling, 0 replies; 30+ messages in thread
From: 陳韋任 (Wei-Ren Chen) @ 2012-06-14  3:31 UTC (permalink / raw)
  To: Laurent Desnogues; +Cc: qemu-devel, 陳韋任 (Wei-Ren Chen)

> As a side note, it might be interesting to gather statistics about the hit
> rate of the QEMU TLB.  Another thing to consider is speeding up the
> fast path;  see YeongKyoon Lee RFC patch:
> 
> http://www.mail-archive.com/qemu-devel@nongnu.org/msg91294.html

  I only see PATCH 0/3; any idea where the others are?

Regards,
chenwj

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-13 12:43   ` Lluís Vilanova
@ 2012-06-14  3:29     ` 陳韋任 (Wei-Ren Chen)
  2012-06-14 23:17       ` Lluís Vilanova
  0 siblings, 1 reply; 30+ messages in thread
From: 陳韋任 (Wei-Ren Chen) @ 2012-06-14  3:29 UTC (permalink / raw)
  To: Lluís Vilanova
  Cc: Stefan Hajnoczi, qemu-devel, 陳韋任 (Wei-Ren Chen)

> Unfortunately, I had the bad idea of rebasing all my series on top of the latest
> makefile changes, and I'll have to go through each patch to check it's still
> working (I'm sure some of them broke).

  Need some help? :)

Regards,
chenwj

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-13 10:43 ` Laurent Desnogues
@ 2012-06-14  3:18   ` 陳韋任 (Wei-Ren Chen)
  2012-06-14 22:30     ` Lluís Vilanova
  2012-06-18  8:28     ` 陳韋任 (Wei-Ren Chen)
  2012-06-14  3:31   ` 陳韋任 (Wei-Ren Chen)
  1 sibling, 2 replies; 30+ messages in thread
From: 陳韋任 (Wei-Ren Chen) @ 2012-06-14  3:18 UTC (permalink / raw)
  To: laurent.desnogues
  Cc: stefanha, qemu-devel, 陳韋任 (Wei-Ren Chen), vilanova

On Wed, Jun 13, 2012 at 12:43:28PM +0200, Laurent Desnogues wrote:
> On Wed, Jun 13, 2012 at 5:14 AM, 陳韋任 (Wei-Ren Chen)
> <chenwj@iis.sinica.edu.tw> wrote:
> > Hi all,
> >
> >  I suspect that guest memory access (qemu_ld/qemu_st) account for the major of
> > time spent in system mode. I would like to know precisely how much (if possible).
> > We use tools like perf [1] before, but since the logic of guest memory access aslo
> > embedded in the host binary not only helper functions, the result cannot be
> > relied. The current idea is adding helper functions before/after guest memory
> > access logic. Take ARM guest on x86_64 host for example, should I add the helper
> > functions before/after tcg_gen_qemu_{ld,st} in target-arm/translate.c or
> > tcg_out_qemu_{ld,st} in tcg/i386/tcg-target.c? Or there is a better way to know
> > how much time QEMU spend on handling guest memory access?
> 
> I'm afraid there's no easy way to measure that: any change you make
> to generated code will completely change the timing given that the ld/st
> fast path is only a few instructions long.

  Lluís, what's your opinion on that? Do your tracepoints have the same timing
issue, too?

> Another approach might be to run the program in user mode and then
> in system mode (provided the guest OS is very light).

  We ran the SPEC2006 test inputs both in user mode and in system mode (with an
ARM guest OS). The result is that system mode is roughly 2x slower than user
mode. We are not sure whether that result is reasonable.
 
> As a side note, it might be interesting to gather statistics about the hit
> rate of the QEMU TLB.  Another thing to consider is speeding up the
> fast path;  see YeongKyoon Lee RFC patch:
> 
> http://www.mail-archive.com/qemu-devel@nongnu.org/msg91294.html

  We have some results on the TLB hit rate at the link below:

  https://docs.google.com/spreadsheet/ccc?key=0Aq_07U3IjpY8dFN6dTczMldtQVRUSk9Qa2ZKZTZEZGc&pli=1#gid=0

Here is how we get the TLB hit rate: we use tcg_out_* calls to insert counting
code into tcg_out_tlb_load (tcg/i386/tcg-target.c). At the beginning of
tcg_out_tlb_load we count the total number of guest memory accesses, and on the
TLB-hit path we count the TLB hits. You can see the code at

  https://github.com/ZackClown/QEMU_1.0.1/commit/013a9f8e2611e25344bc095a9f72fdfbb0c64d06#diff-3  
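
Written as plain C, the bookkeeping that the emitted instructions perform
amounts to something like the following (the counter names are ours, and the
real change emits the increments with tcg_out_* rather than calling a
function):

#include <stdint.h>
#include <stdio.h>

static uint64_t total_guest_accesses;  /* bumped at the top of every TLB lookup */
static uint64_t tlb_hit_count;         /* bumped only on the TLB-hit path       */

static void count_tlb_lookup(int hit)
{
    total_guest_accesses++;
    if (hit) {
        tlb_hit_count++;
    }
}

static void report_tlb_hit_rate(void)
{
    if (total_guest_accesses != 0) {
        printf("TLB hit rate: %.2f%% (%llu of %llu accesses)\n",
               100.0 * (double)tlb_hit_count / (double)total_guest_accesses,
               (unsigned long long)tlb_hit_count,
               (unsigned long long)total_guest_accesses);
    }
}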

  The reason why we want to do this measurement is that we want to use KVM MMU
virtualization (a crazy-sounding idea) to speed up the guest -> host memory
address translation. I talked to some people at LinuxCon Japan, including Paolo,
about this idea. The feedback I got is that we could only use a shadow page table
rather than EPT/NPT to do the address translation (if it is possible at all),
since different ISAs (ARM and x86, for example) have different page table
formats. Besides, QEMU would have to use an ioctl to ask KVM for the translation
result, which is overkill, as the ARM page table is quite simple and the walk
can be done in user mode very quickly.

  Any comments are welcome.

Regards,
chenwj

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-13  9:18 ` Stefan Hajnoczi
@ 2012-06-13 12:43   ` Lluís Vilanova
  2012-06-14  3:29     ` 陳韋任 (Wei-Ren Chen)
  0 siblings, 1 reply; 30+ messages in thread
From: Lluís Vilanova @ 2012-06-13 12:43 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel, 陳韋任 (Wei-Ren Chen)

Stefan Hajnoczi writes:

> On Wed, Jun 13, 2012 at 4:14 AM, 陳韋任 (Wei-Ren Chen)
> <chenwj@iis.sinica.edu.tw> wrote:
>>  I suspect that guest memory access (qemu_ld/qemu_st) account for the major of
>> time spent in system mode. I would like to know precisely how much (if possible).
>> We use tools like perf [1] before, but since the logic of guest memory access aslo
>> embedded in the host binary not only helper functions, the result cannot be
>> relied. The current idea is adding helper functions before/after guest memory
>> access logic. Take ARM guest on x86_64 host for example, should I add the helper
>> functions before/after tcg_gen_qemu_{ld,st} in target-arm/translate.c or
>> tcg_out_qemu_{ld,st} in tcg/i386/tcg-target.c? Or there is a better way to know
>> how much time QEMU spend on handling guest memory access?

> Lluís: Can the instrumentation you've been working on do this?

Sure. I have tracepoints for memory accesses before they are actually
performed. It would just be a matter of adding another tracepoint after the
memory access operation has been performed (I had plans for adding this together
with physical memory address information, but I'm on other tasks for the time
being).

Given that memory access tracepoints are added through macros (by redefining the
memory access routine), it's trivial to add a tracepoint after the memory access
itself.

Unfortunately, I had the bad idea of rebasing all my series on top of the latest
makefile changes, and I'll have to go through each patch to check it's still
working (I'm sure some of them broke).


Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-13  3:14 陳韋任 (Wei-Ren Chen)
  2012-06-13  9:18 ` Stefan Hajnoczi
@ 2012-06-13 10:43 ` Laurent Desnogues
  2012-06-14  3:18   ` 陳韋任 (Wei-Ren Chen)
  2012-06-14  3:31   ` 陳韋任 (Wei-Ren Chen)
  1 sibling, 2 replies; 30+ messages in thread
From: Laurent Desnogues @ 2012-06-13 10:43 UTC (permalink / raw)
  To: 陳韋任 (Wei-Ren Chen); +Cc: qemu-devel

On Wed, Jun 13, 2012 at 5:14 AM, 陳韋任 (Wei-Ren Chen)
<chenwj@iis.sinica.edu.tw> wrote:
> Hi all,
>
>  I suspect that guest memory access (qemu_ld/qemu_st) account for the major of
> time spent in system mode. I would like to know precisely how much (if possible).
> We use tools like perf [1] before, but since the logic of guest memory access aslo
> embedded in the host binary not only helper functions, the result cannot be
> relied. The current idea is adding helper functions before/after guest memory
> access logic. Take ARM guest on x86_64 host for example, should I add the helper
> functions before/after tcg_gen_qemu_{ld,st} in target-arm/translate.c or
> tcg_out_qemu_{ld,st} in tcg/i386/tcg-target.c? Or there is a better way to know
> how much time QEMU spend on handling guest memory access?

I'm afraid there's no easy way to measure that: any change you make
to generated code will completely change the timing given that the ld/st
fast path is only a few instructions long.

Another approach might be to run the program in user mode and then
in system mode (provided the guest OS is very light).

As a side note, it might be interesting to gather statistics about the hit
rate of the QEMU TLB.  Another thing to consider is speeding up the
fast path;  see YeongKyoon Lee RFC patch:

http://www.mail-archive.com/qemu-devel@nongnu.org/msg91294.html


Laurent


* Re: [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
  2012-06-13  3:14 陳韋任 (Wei-Ren Chen)
@ 2012-06-13  9:18 ` Stefan Hajnoczi
  2012-06-13 12:43   ` Lluís Vilanova
  2012-06-13 10:43 ` Laurent Desnogues
  1 sibling, 1 reply; 30+ messages in thread
From: Stefan Hajnoczi @ 2012-06-13  9:18 UTC (permalink / raw)
  To: 陳韋任 (Wei-Ren Chen); +Cc: qemu-devel, Lluís Vilanova

On Wed, Jun 13, 2012 at 4:14 AM, 陳韋任 (Wei-Ren Chen)
<chenwj@iis.sinica.edu.tw> wrote:
>  I suspect that guest memory accesses (qemu_ld/qemu_st) account for the majority
> of the time spent in system mode, and I would like to know precisely how much (if
> possible). We have used tools like perf [1] before, but since the logic for guest
> memory accesses is also embedded in the host binary, not only in helper functions,
> the results cannot be relied upon. The current idea is to add helper functions
> before/after the guest memory access logic. Taking an ARM guest on an x86_64 host
> as an example, should I add the helper functions before/after tcg_gen_qemu_{ld,st}
> in target-arm/translate.c, or tcg_out_qemu_{ld,st} in tcg/i386/tcg-target.c? Or is
> there a better way to know how much time QEMU spends handling guest memory accesses?

Lluís: Can the instrumentation you've been working on do this?

Stefan


* [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time?
@ 2012-06-13  3:14 陳韋任 (Wei-Ren Chen)
  2012-06-13  9:18 ` Stefan Hajnoczi
  2012-06-13 10:43 ` Laurent Desnogues
  0 siblings, 2 replies; 30+ messages in thread
From: 陳韋任 (Wei-Ren Chen) @ 2012-06-13  3:14 UTC (permalink / raw)
  To: qemu-devel

Hi all,

  I suspect that guest memory accesses (qemu_ld/qemu_st) account for the majority
of the time spent in system mode, and I would like to know precisely how much (if
possible). We have used tools like perf [1] before, but since the logic for guest
memory accesses is also embedded in the host binary, not only in helper functions,
the results cannot be relied upon. The current idea is to add helper functions
before/after the guest memory access logic. Taking an ARM guest on an x86_64 host
as an example, should I add the helper functions before/after tcg_gen_qemu_{ld,st}
in target-arm/translate.c, or tcg_out_qemu_{ld,st} in tcg/i386/tcg-target.c? Or is
there a better way to know how much time QEMU spends handling guest memory accesses?
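
One possible shape for such helpers is sketched below; the names and the
accumulation scheme are invented here rather than taken from QEMU, and the
probes themselves are of course much heavier than the few-instruction fast path
they would be measuring:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

static uint64_t mem_ns_total;      /* accumulated wall-clock time in guest ld/st */
static uint64_t mem_access_count;  /* number of instrumented guest accesses      */
static struct timespec mem_t0;

void helper_mem_access_begin(void)     /* would be called right before a qemu_ld/st */
{
    clock_gettime(CLOCK_MONOTONIC, &mem_t0);
}

void helper_mem_access_end(void)       /* would be called right after a qemu_ld/st */
{
    struct timespec t1;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    mem_ns_total += (uint64_t)((t1.tv_sec - mem_t0.tv_sec) * 1000000000LL
                               + (t1.tv_nsec - mem_t0.tv_nsec));
    mem_access_count++;
}

void report_mem_access_time(void)      /* e.g. dumped when the emulator exits */
{
    printf("%llu guest accesses, %llu ns spent in them\n",
           (unsigned long long)mem_access_count,
           (unsigned long long)mem_ns_total);
}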

  Any suggestions/comments are welcome. Thanks!

Regards,
chenwj

[1] https://perf.wiki.kernel.org/index.php/Main_Page

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj


end of thread, other threads:[~2012-06-24  6:14 UTC | newest]

Thread overview: 30+ messages
2012-06-14 12:35 [Qemu-devel] How to measure guest memory access (qemu_ld/qemu_st) time? YeongKyoon Lee
2012-06-15 19:20 ` Blue Swirl
2012-06-18  6:57   ` 陳韋任 (Wei-Ren Chen)
2012-06-18 19:28     ` Blue Swirl
  -- strict thread matches above, loose matches on Subject: below --
2012-06-20  7:57 陳韋任 (Wei-Ren Chen)
2012-06-22  9:58 ` Xin Tong
2012-06-24  6:13   ` Blue Swirl
2012-06-18  4:59 YeongKyoon Lee
2012-06-18 19:21 ` Blue Swirl
2012-06-15  2:10 YeongKyoon Lee
2012-06-13  3:14 陳韋任 (Wei-Ren Chen)
2012-06-13  9:18 ` Stefan Hajnoczi
2012-06-13 12:43   ` Lluís Vilanova
2012-06-14  3:29     ` 陳韋任 (Wei-Ren Chen)
2012-06-14 23:17       ` Lluís Vilanova
2012-06-13 10:43 ` Laurent Desnogues
2012-06-14  3:18   ` 陳韋任 (Wei-Ren Chen)
2012-06-14 22:30     ` Lluís Vilanova
2012-06-15  8:23       ` Laurent Desnogues
2012-06-15 19:10         ` Lluís Vilanova
2012-06-18  8:28     ` 陳韋任 (Wei-Ren Chen)
2012-06-18 19:36       ` Blue Swirl
2012-06-18 20:26         ` Lluís Vilanova
2012-06-19  7:52           ` 陳韋任 (Wei-Ren Chen)
2012-06-19  9:35             ` Michael.Kang
2012-06-19  9:47           ` Michael.Kang
2012-06-19 10:51             ` Lluís Vilanova
2012-06-18 19:38       ` Lluís Vilanova
     [not found]       ` <20120619075534.GB34488@cs.nctu.edu.tw>
     [not found]         ` <4FE0347A.9040309@redhat.com>
     [not found]           ` <20120619084939.GA37515@cs.nctu.edu.tw>
2012-06-19  9:01             ` Orit Wasserman
2012-06-14  3:31   ` 陳韋任 (Wei-Ren Chen)
