* [Qemu-devel] MTTCG Tasks (kvmforum summary)
From: Alex Bennée @ 2015-09-04  7:49 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, claudio.fontana, a.rigo, Emilio G. Cota,
	Alexander Spyridakis, Pbonzini, fred.konrad

Hi,

At KVM Forum I sat down with Paolo and Frederic and we came up with the
current outstanding tasks for MTTCG. This is not comprehensive but
hopefully covers the big areas. They are sorted in the rough order we'd
like to get them up-streamed.

* linux-user patches (Paolo)

Paolo has already grabbed a bunch of Fred's patch set where it makes
sense on its own. The patches are up on the list and need review to
expedite their way into the main tree now that it is open for pull
requests.

See thread: 1439397664-70734-1-git-send-email-pbonzini@redhat.com

* TLB_EXCL based LL/SC patches (Alvise)

I think our consensus is that these provide a good basis for modelling
our atomics within TCG. I haven't had a chance to review Emilio's
series yet, which may approach this problem differently. I think the
core patches with the generic backend support make a good foundation
for further development work.

We need to iterate and review the non-MTTCG variant of the patch set
with a view to up-streaming it soon.
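
To make this concrete, here is a schematic of how a TLB_EXCL flag can
model LL/SC. The names and details below are mine, not Alvise's actual
patches, so treat it as a sketch of the approach rather than the code:

/* Schematic only. The idea: a spare TLB flag (TLB_EXCL here) forces
 * every store to a page with an active exclusive monitor through the
 * slow path, where a racing store can break the monitor. */
#include <stdbool.h>
#include <stdint.h>

#define TLB_EXCL (1 << 5)              /* assumed spare bit in addr_write */

typedef struct {
    uint64_t addr_write;               /* page tag plus permission bits */
} TLBEntrySketch;

static uint64_t excl_addr = UINT64_MAX;   /* exclusive monitor address */

static void helper_load_exclusive(TLBEntrySketch *te, uint64_t addr)
{
    excl_addr = addr;
    te->addr_write |= TLB_EXCL;        /* stores to this page now trap out */
}

/* Any store that hits TLB_EXCL lands here instead of the fast path. */
static void helper_store_slow(uint64_t addr)
{
    if (addr == excl_addr) {
        excl_addr = UINT64_MAX;        /* another CPU wrote: monitor broken */
    }
    /* ... perform the store itself ... */
}

static bool helper_store_exclusive(uint64_t addr)
{
    if (excl_addr != addr) {
        return false;                  /* SC fails; the guest retries */
    }
    /* ... perform the store ... */
    excl_addr = UINT64_MAX;
    return true;
}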

* Signal free qemu_cpu_kick (Paolo)

I don't know much about this patch set, but I assume it avoids the need
to catch signals and longjmp about just to wake up a vCPU?
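
For context, here is what I assume a signal-free kick roughly looks
like; the names are illustrative, not Paolo's actual series. The kicker
sets a flag and broadcasts a condition variable instead of raising a
signal the vCPU thread has to longjmp out of:

#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t halt_cond;      /* a halted vCPU thread sleeps here */
    volatile bool exit_request;    /* polled at translation-block exits */
} CPUStateSketch;

/* Called from another thread instead of pthread_kill(thread, SIG_IPI). */
static void cpu_kick_sketch(CPUStateSketch *cpu)
{
    pthread_mutex_lock(&cpu->lock);
    cpu->exit_request = true;      /* a running vCPU exits at the next TB */
    pthread_cond_broadcast(&cpu->halt_cond);   /* a halted vCPU wakes up */
    pthread_mutex_unlock(&cpu->lock);
}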

* RCU tb_flush (needs writing)

The idea has been floated to introduce an RCU-based translation buffer
so flushes can be done lazily and the buffers dumped once all threads
have stopped using them.

I have been pondering whether translation regions would be worth
looking into, so we could have translation buffers for contiguous
series of pages. That way we don't have to throw away all translations
on these big events. Currently, every time we roll over the translation
buffer we throw a bunch of perfectly good code away. This may or may
not be orthogonal to using RCU?
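
A rough sketch of the lazy flush, written against the userspace-RCU
(liburcu) API purely for illustration; QEMU's own RCU primitives differ
in detail and the buffer handling here is invented:

#include <urcu.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *code;
    size_t size;
} CodeGenBuffer;

static CodeGenBuffer *current_buffer;

/* vCPU threads only execute code from the buffer inside an RCU
 * read-side critical section (rcu_read_lock/rcu_read_unlock). */

static void tb_flush_lazy(void)
{
    CodeGenBuffer *old = current_buffer;
    CodeGenBuffer *fresh = calloc(1, sizeof(*fresh));

    fresh->size = 32 * 1024 * 1024;
    fresh->code = malloc(fresh->size);
    rcu_assign_pointer(current_buffer, fresh); /* new translations go here */

    synchronize_rcu();    /* wait until no thread can still see 'old' */
    free(old->code);      /* now safe to dump the stale translations */
    free(old);
}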

* Memory barrier support (need RFC for discussion)

I came to KVM Forum with a back-of-the-envelope idea that we could
implement one or two barrier ops (acquire/release?). Various other
types of memory-ordering behaviour have been suggested as well.

I'll try to pull together an RFC patch with a design outline for
discussion. It would be nice to be able to demonstrate barrier failures
in my test cases as well ;-)
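
To make the discussion concrete, this is the guarantee an
acquire/release pair has to give guest code, expressed with C11 atomics
(my illustration, not a proposed TCG interface):

#include <stdatomic.h>

static int payload;
static atomic_int flag;

static void producer(void)
{
    payload = 42;                       /* plain store */
    /* release: the payload store may not be reordered after the flag */
    atomic_store_explicit(&flag, 1, memory_order_release);
}

static int consumer(void)
{
    /* acquire: once we see the flag, we must also see the payload */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0) {
        ;                               /* spin until published */
    }
    return payload;                     /* always 42, never stale */
}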

* longjmp in cpu_exec

Paolo is fairly sure that if you take a page fault while IRQs are being
delivered, problems will occur with cpu->interrupt_request. Does it
need to take the BQL?

I'd like to see if we can get a torture test to stress this code
although it will require IPI support in the unit tests.

* tlb_flush and dmb behaviour (am I waiting for TLB flush?)

I think this means we need explicit memory barriers to sync updates to
the tlb.
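
Something like the following pattern, with C11 atomics standing in for
whatever barrier macros we end up with (my sketch, not existing code):
a remote tlb_flush must publish its TLB updates before signalling
completion, and the dmb side needs the matching acquire.

#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define TLB_ENTRIES 256
static uint64_t tlb[TLB_ENTRIES];
static atomic_bool flush_done;

static void remote_tlb_flush(void)        /* runs on the flushing vCPU */
{
    memset(tlb, 0xff, sizeof(tlb));       /* invalidate all entries */
    /* release: the invalidation must be visible before the flag */
    atomic_store_explicit(&flush_done, true, memory_order_release);
}

static void dmb_wait_for_flush(void)      /* the vCPU that issued dmb */
{
    /* acquire: seeing the flag implies seeing the cleared entries */
    while (!atomic_load_explicit(&flush_done, memory_order_acquire)) {
        ;
    }
}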

* tb_find_fast outside the lock

Moving tb_find_fast outside the lock is a big performance win, as
tb_find_fast sees a lot of contention from other threads. However,
there is concern that it needs to be properly protected.
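
One plausible shape for this (an assumption on my part, not the current
code): keep the fast path lock-free with an atomic lookup in the
per-CPU jump cache, and only take tb_lock on a miss:

#include <stdatomic.h>
#include <stdint.h>

#define TB_JMP_CACHE_SIZE 4096

typedef struct TB TB;

extern uint64_t tb_pc(TB *tb);                /* assumed helpers */
extern TB *tb_find_slow_locked(uint64_t pc);  /* takes/drops tb_lock */

typedef struct {
    _Atomic(TB *) tb_jmp_cache[TB_JMP_CACHE_SIZE];
} CPUSketch;

static TB *tb_find_fast_sketch(CPUSketch *cpu, uint64_t pc)
{
    unsigned hash = (pc >> 2) & (TB_JMP_CACHE_SIZE - 1);
    TB *tb = atomic_load_explicit(&cpu->tb_jmp_cache[hash],
                                  memory_order_acquire);

    if (tb && tb_pc(tb) == pc) {
        return tb;                    /* hot path: no lock taken */
    }
    tb = tb_find_slow_locked(pc);     /* contended path, protected */
    atomic_store_explicit(&cpu->tb_jmp_cache[hash], tb,
                          memory_order_release);
    return tb;
}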

* Additional review comments on Fred's branch
 - page->code_bitmap isn't protected by a lock
 - cpu_breakpoint_insert needs locks
 - check gdbstub works

* What to do about icount?

What is the impact of multi-threading on icount? Do we need to disable
it for MTTCG, or can it be correct per-CPU? Can it be updated in
lock-step?

We need some input from the guys that use icount the most.

Cheers,

-- 
Alex Bennée


* Re: [Qemu-devel] MTTCG Tasks (kvmforum summary)
From: Frederic Konrad @ 2015-09-04  8:10 UTC (permalink / raw)
  To: Alex Bennée, qemu-devel, mttcg
  Cc: mark.burton, claudio.fontana, a.rigo, Emilio G. Cota,
	Alexander Spyridakis, Pbonzini

Hi Alex,

On 04/09/2015 09:49, Alex Bennée wrote:
> Hi,
>
> At KVM Forum I sat down with Paolo and Frederic and we came up with the
> current outstanding tasks for MTTCG. This is not comprehensive but
> hopefully covers the big areas. They are sorted in the rough order we'd
> like to get them up-streamed.
>
> * linux-user patches (Paolo)
>
> Paolo has already grabbed a bunch of Fred's patch set where it makes
> sense on its own. The patches are up on the list and need review to
> expedite their way into the main tree now that it is open for pull
> requests.
>
> See thread: 1439397664-70734-1-git-send-email-pbonzini@redhat.com
>
> * TLB_EXCL based LL/SC patches (Alvise)
>
> I think our consensus is that these provide a good basis for modelling
> our atomics within TCG. I haven't had a chance to review Emilio's
> series yet, which may approach this problem differently. I think the
> core patches with the generic backend support make a good foundation
> for further development work.
>
> We need to iterate and review the non-MTTCG variant of the patch set
> with a view to up-streaming it soon.
>
> * Signal free qemu_cpu_kick (Paolo)
>
> I don't know much about this patch set, but I assume it avoids the need
> to catch signals and longjmp about just to wake up a vCPU?
>
> * RCU tb_flush (needs writing)
>
> The idea has been floated to introduce an RCU-based translation buffer
> so flushes can be done lazily and the buffers dumped once all threads
> have stopped using them.
>
> I have been pondering whether translation regions would be worth
> looking into, so we could have translation buffers for contiguous
> series of pages. That way we don't have to throw away all translations
> on these big events. Currently, every time we roll over the translation
> buffer we throw a bunch of perfectly good code away. This may or may
> not be orthogonal to using RCU?
I'm still not sure tb_flush needs so much effort.
tb_flush happens very rarely; just exiting everybody seems easier.

>
> * Memory barrier support (need RFC for discussion)
>
> I came to KVM Forum with a back-of-the-envelope idea that we could
> implement one or two barrier ops (acquire/release?). Various other
> types of memory-ordering behaviour have been suggested as well.
>
> I'll try to pull together an RFC patch with a design outline for
> discussion. It would be nice to be able to demonstrate barrier failures
> in my test cases as well ;-)
>
> * longjmp in cpu_exec
>
> Paolo is fairly sure that if you take a page fault while IRQs are
> being delivered, problems will occur with cpu->interrupt_request. Does
> it need to take the BQL?
>
> I'd like to see if we can get a torture test to stress this code
> although it will require IPI support in the unit tests.
>
> * tlb_flush and dmb behaviour (am I waiting for TLB flush?)
>
> I think this means we need explicit memory barriers to sync updates to
> the tlb.
>
> * tb_find_fast outside the lock
>
> Moving tb_find_fast outside the lock is a big performance win, as
> tb_find_fast sees a lot of contention from other threads. However,
> there is concern that it needs to be properly protected.
>
> * Additional review comments on Fred's branch
>   - page->code_bitmap isn't protected by a lock
>   - cpu_breakpoint_insert needs locks
Those ones are OK; I just didn't send all of that yet.

>   - check gdbstub works
>
> * What to do about icount?
>
> What is the impact of multi-threading on icount? Do we need to disable
> it for MTTCG, or can it be correct per-CPU? Can it be updated in
> lock-step?
>
> We need some input from the guys that use icount the most.
>
> Cheers,
>
Also, we might want to have everything in a branch: rebasing the
atomics series on e.g. 2.4.0 plus Paolo's two series, so I can rebase
MTTCG on top and pick up the MTTCG atomic parts, could be useful.

Thanks,
Fred


* Re: [Qemu-devel] MTTCG Tasks (kvmforum summary)
From: Paolo Bonzini @ 2015-09-04  9:25 UTC (permalink / raw)
  To: Alex Bennée, qemu-devel, mttcg
  Cc: mark.burton, claudio.fontana, a.rigo, Emilio G. Cota,
	Alexander Spyridakis, Edgar E. Iglesias, fred.konrad



On 04/09/2015 09:49, Alex Bennée wrote:
> * Signal free qemu_cpu_kick (Paolo)
> 
> I don't know much about this patch set, but I assume it avoids the need
> to catch signals and longjmp about just to wake up a vCPU?

It was part of Fred's patches, so I've extracted it to its own series.
Removing 150 lines of code can't hurt.

> * Memory barrier support (need RFC for discussion)
> 
> I came to KVM Forum with a back-of-the-envelope idea that we could
> implement one or two barrier ops (acquire/release?). Various other
> types of memory-ordering behaviour have been suggested as well.
> 
> I'll try to pull together an RFC patch with a design outline for
> discussion. It would be nice to be able to demonstrate barrier failures
> in my test cases as well ;-)

Emilio has something about it in his own MTTCG implementation.

> * longjmp in cpu_exec
> 
> Paolo is fairly sure that if you take a page fault while IRQs are
> being delivered, problems will occur with cpu->interrupt_request. Does
> it need to take the BQL?
> 
> I'd like to see if we can get a torture test to stress this code
> although it will require IPI support in the unit tests.

It's x86-specific (hardware interrupts push to the stack and can cause a
page fault or other exception), so a unit test can be written for it.

> * tlb_flush and dmb behaviour (am I waiting for TLB flush?)
> 
> I think this means we need explicit memory barriers to sync updates to
> the tlb.

Yes.

> * tb_find_fast outside the lock
> 
> Moving tb_find_fast outside the lock is a big performance win, as
> tb_find_fast sees a lot of contention from other threads. However,
> there is concern that it needs to be properly protected.

This, BTW, can be done for user-mode emulation first, so it can go in
early.  Same for RCU-ized code_gen_buffer.

> * What to do about icount?
> 
> What is the impact of multi-threading on icount? Do we need to disable
> it for MTTCG, or can it be correct per-CPU? Can it be updated in
> lock-step?
> 
> We need some input from the guys that use icount the most.

That means Edgar. :)

Paolo


* Re: [Qemu-devel] MTTCG Tasks (kvmforum summary)
From: Edgar E. Iglesias @ 2015-09-04  9:41 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, claudio.fontana, mark.burton, a.rigo, qemu-devel,
	Emilio G. Cota, Alexander Spyridakis, Edgar E. Iglesias,
	Alex Bennée, fred.konrad

On Fri, Sep 04, 2015 at 11:25:33AM +0200, Paolo Bonzini wrote:
> 
> 
> On 04/09/2015 09:49, Alex Bennée wrote:
> > * Signal free qemu_cpu_kick (Paolo)
> > 
> > I don't know much about this patch set, but I assume it avoids the
> > need to catch signals and longjmp about just to wake up a vCPU?
> 
> It was part of Fred's patches, so I've extracted it to its own series.
> Removing 150 lines of code can't hurt.
> 
> > * Memory barrier support (need RFC for discussion)
> > 
> > I came to KVM Forum with a back-of-the-envelope idea that we could
> > implement one or two barrier ops (acquire/release?). Various other
> > types of memory-ordering behaviour have been suggested as well.
> > 
> > I'll try to pull together an RFC patch with a design outline for
> > discussion. It would be nice to be able to demonstrate barrier failures
> > in my test cases as well ;-)
> 
> Emilio has something about it in his own MTTCG implementation.
> 
> > * longjmp in cpu_exec
> > 
> > Paolo is fairly sure that if you take a page fault while IRQs are
> > being delivered, problems will occur with cpu->interrupt_request.
> > Does it need to take the BQL?
> > 
> > I'd like to see if we can get a torture test to stress this code
> > although it will require IPI support in the unit tests.
> 
> It's x86-specific (hardware interrupts push to the stack and can cause a
> page fault or other exception), so a unit test can be written for it.
> 
> > * tlb_flush and dmb behaviour (am I waiting for TLB flush?)
> > 
> > I think this means we need explicit memory barriers to sync updates to
> > the tlb.
> 
> Yes.
> 
> > * tb_find_fast outside the lock
> > 
> > Moving tb_find_fast outside the lock is a big performance win, as
> > tb_find_fast sees a lot of contention from other threads. However,
> > there is concern that it needs to be properly protected.
> 
> This, BTW, can be done for user-mode emulation first, so it can go in
> early.  Same for RCU-ized code_gen_buffer.
> 
> > * What to do about icount?
> > 
> > What is the impact of multi-threading on icount? Do we need to
> > disable it for MTTCG, or can it be correct per-CPU? Can it be updated
> > in lock-step?
> > 
> > We need some input from the guys that use icount the most.
> 
> That means Edgar. :)

Hi!

IMO it would be nice if we could run the cores in some kind of
lock-step with a configurable number of instructions (X) that they can
run ahead.

For example, if X is 10000, every thread/core would checkpoint at
10000-insn boundaries and wait for the other cores. Between these
checkpoints, the cores will not be in sync. We might need to
consider synchronizing at I/O accesses as well, to avoid weird
timing issues when reading counter registers, for example.

Of course the devil will be in the details but an approach roughly
like that sounds useful to me.
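
Roughly, each vCPU thread would do something like this; just a sketch
of the idea, with all the details invented:

#include <pthread.h>
#include <stdint.h>

#define QUANTUM 10000       /* "X": insns a core may run unsynchronized */

static pthread_barrier_t checkpoint;   /* initialized to the vCPU count */

static void vcpu_loop_sketch(void (*exec_one_insn)(void))
{
    for (;;) {
        for (uint64_t i = 0; i < QUANTUM; i++) {
            exec_one_insn();    /* in reality, whole translation blocks */
        }
        /* Checkpoint: wait until every core reaches this boundary. */
        pthread_barrier_wait(&checkpoint);
    }
}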

Are there any other ideas floating around that may be better?

BTW, where can I find the latest series? Is it on a git-repo/branch
somewhere?

Best regards,
Edgar


* Re: [Qemu-devel] MTTCG Tasks (kvmforum summary)
From: dovgaluk @ 2015-09-04  9:45 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, claudio.fontana, mark.burton, qemu-devel, a.rigo,
	Emilio G. Cota, Alexander Spyridakis, Pbonzini, mttcg-request,
	fred.konrad

Hi!

Alex Bennée wrote on 2015-09-04 10:49:
> * What to do about icount?
> 
> What is the impact of multi-threading on icount? Do we need to disable
> it for MTTCG, or can it be correct per-CPU? Can it be updated in
> lock-step?

Why can't we have a separate icount for each CPU?
The virtual timer would then be assigned to one of them.

Pavel Dovgalyuk


* Re: [Qemu-devel] MTTCG Tasks (kvmforum summary)
From: Mark Burton @ 2015-09-04 10:18 UTC (permalink / raw)
  To: Edgar E. Iglesias
  Cc: mttcg, Alexander Spyridakis, Claudio Fontana, Alvise Rigo,
	qemu-devel, Emilio G. Cota, Edgar E. Iglesias, Paolo Bonzini,
	Alex Bennée, KONRAD Frédéric


> On 4 Sep 2015, at 11:41, Edgar E. Iglesias <edgar.iglesias@xilinx.com> wrote:
> 
> On Fri, Sep 04, 2015 at 11:25:33AM +0200, Paolo Bonzini wrote:
>> 
>> 
>> On 04/09/2015 09:49, Alex Bennée wrote:
>>> * Signal free qemu_cpu_kick (Paolo)
>>> 
>>> I don't know much about this patch set, but I assume it avoids the
>>> need to catch signals and longjmp about just to wake up a vCPU?
>> 
>> It was part of Fred's patches, so I've extracted it to its own series.
>> Removing 150 lines of code can't hurt.
>> 
>>> * Memory barrier support (need RFC for discussion)
>>> 
>>> I came to KVM Forum with a back-of-the-envelope idea that we could
>>> implement one or two barrier ops (acquire/release?). Various other
>>> types of memory-ordering behaviour have been suggested as well.
>>> 
>>> I'll try to pull together an RFC patch with a design outline for
>>> discussion. It would be nice to be able to demonstrate barrier failures
>>> in my test cases as well ;-)
>> 
>> Emilio has something about it in his own MTTCG implementation.
>> 
>>> * longjmp in cpu_exec
>>> 
>>> Paolo is fairly sure that if you take a page fault while IRQs are
>>> being delivered, problems will occur with cpu->interrupt_request.
>>> Does it need to take the BQL?
>>> 
>>> I'd like to see if we can get a torture test to stress this code
>>> although it will require IPI support in the unit tests.
>> 
>> It's x86-specific (hardware interrupts push to the stack and can cause a
>> page fault or other exception), so a unit test can be written for it.
>> 
>>> * tlb_flush and dmb behaviour (am I waiting for TLB flush?)
>>> 
>>> I think this means we need explicit memory barriers to sync updates to
>>> the tlb.
>> 
>> Yes.
>> 
>>> * tb_find_fast outside the lock
>>> 
>>> Moving tb_find_fast outside the lock is a big performance win, as
>>> tb_find_fast sees a lot of contention from other threads. However,
>>> there is concern that it needs to be properly protected.
>> 
>> This, BTW, can be done for user-mode emulation first, so it can go in
>> early.  Same for RCU-ized code_gen_buffer.
>> 
>>> * What to do about icount?
>>> 
>>> What is the impact of multi-threading on icount? Do we need to
>>> disable it for MTTCG, or can it be correct per-CPU? Can it be
>>> updated in lock-step?
>>> 
>>> We need some input from the guys that use icount the most.
>> 
>> That means Edgar. :)
> 
> Hi!
> 
> IMO it would be nice if we could run the cores in some kind of
> lock-step with a configurable number of instructions (X) that they can
> run ahead.
> 
> For example, if X is 10000, every thread/core would checkpoint at
> 10000-insn boundaries and wait for the other cores. Between these
> checkpoints, the cores will not be in sync. We might need to
> consider synchronizing at I/O accesses as well, to avoid weird
> timing issues when reading counter registers, for example.
> 
> Of course the devil will be in the details but an approach roughly
> like that sounds useful to me.

And "works" in other domains.
Theoretically we don't need to sync at I/O (dynamic quanta); for most
systems that have 'normal' I/O it's usually less efficient, I believe.
However, the trouble is that the user typically doesn't know, and
mucking about with quantum lengths, dynamic quantum switches etc. is
probably a royal pain in the butt. And if you don't set your quantum
right, the thing will run really slowly (or will break)...

The choices are a rock or a hard place. Dynamic quanta risk being slow
(you'll be forcing an expensive 'sync', with all CPUs having to exit
etc.) on each I/O access from each core... not great. Syncing with host
time (e.g. each CPU tries to sync with the host clock as best it can)
will fail when one or another CPU can't keep up... In the end you are
left handing the user a nice long bit of string and a message saying
"hang yourself here".

Cheers
Mark.

> 
> Are there any other ideas floating around that may be better?
> 
> BTW, where can I find the latest series? Is it on a git-repo/branch
> somewhere?
> 
> Best regards,
> Edgar




* Re: [Qemu-devel] MTTCG Tasks (kvmforum summary)
From: Lluís Vilanova @ 2015-09-04 12:38 UTC (permalink / raw)
  To: dovgaluk
  Cc: mttcg, mttcg-request, claudio.fontana, mark.burton, qemu-devel,
	a.rigo, Emilio G. Cota, Alexander Spyridakis, Pbonzini,
	Alex Bennée, fred.konrad

dovgaluk  writes:

> Hi!
> Alex Bennée wrote on 2015-09-04 10:49:
>> * What to do about icount?
>> 
>> What is the impact of multi-threading on icount? Do we need to
>> disable it for MTTCG, or can it be correct per-CPU? Can it be updated
>> in lock-step?

> Why can't we have a separate icount for each CPU?
> The virtual timer would then be assigned to one of them.

My understanding is that icount means, by design, that time should be
synchronized between cpus, where the number of executed instructions is
the time unit. If all elements worked under this assumption (I'm afraid
that's not the case for I/O devices), it should be possible to
reproduce executions by setting icount to 1.

Now, MTTCG faces the same icount accuracy problems that the current TCG
implementation deals with (only at a different scale). The naive
implementation is to execute 1 instruction per CPU in lockstep. TCG
currently relaxes this at the translation block level.

The MTTCG implementation could do something similar, but just at a different
(configurable?) granularity. Every N per-cpu instructions, synchronize all CPUs
until each has, at least, arrived at that time step, then proceed with the next
batch. Ideally, this synchronization delay (N) could be adapted dynamically.

My half cent.

Lluis

-- 
"And it's much the same thing with knowledge, for whenever you learn
something new, the whole world becomes that much richer."
-- The Princess of Pure Reason, as told by Norton Juster in The Phantom
Tollbooth


* Re: [Qemu-devel] MTTCG Tasks (kvmforum summary)
From: Mark Burton @ 2015-09-04 12:46 UTC (permalink / raw)
  To: Lluís Vilanova
  Cc: mttcg, Alexander Spyridakis, Claudio Fontana, qemu-devel,
	Alvise Rigo, Emilio G. Cota, dovgaluk, Pbonzini,
	Alex Bennée, KONRAD Frédéric


> On 4 Sep 2015, at 14:38, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
> 
> dovgaluk  writes:
> 
>> Hi!
>> Alex Bennée wrote on 2015-09-04 10:49:
>>> * What to do about icount?
>>> 
>>> What is the impact of multi-threading on icount? Do we need to
>>> disable it for MTTCG, or can it be correct per-CPU? Can it be
>>> updated in lock-step?
> 
>> Why can't we have a separate icount for each CPU?
>> The virtual timer would then be assigned to one of them.
> 
> My understanding is that icount means, by design, that time should be
> synchronized between cpus, where the number of executed instructions
> is the time unit. If all elements worked under this assumption (I'm
> afraid that's not the case for I/O devices), it should be possible to
> reproduce executions by setting icount to 1.
> 
> Now, MTTCG faces the same icount accuracy problems that the current
> TCG implementation deals with (only at a different scale). The naive
> implementation is to execute 1 instruction per CPU in lockstep. TCG
> currently relaxes this at the translation block level.
> 
> The MTTCG implementation could do something similar, but just at a different
> (configurable?) granularity. Every N per-cpu instructions, synchronize all CPUs
> until each has, at least, arrived at that time step, then proceed with the next
> batch. Ideally, this synchronization delay (N) could be adapted dynamically.

This is often called a Quantum.

Cheers
Mark

> 
> My half cent.
> 
> Lluis
> 
> -- 
> "And it's much the same thing with knowledge, for whenever you learn
> something new, the whole world becomes that much richer."
> -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
> Tollbooth




* Re: [Qemu-devel] MTTCG Tasks (kvmforum summary)
From: Lluís Vilanova @ 2015-09-04 13:00 UTC (permalink / raw)
  To: Mark Burton
  Cc: Edgar E. Iglesias, mttcg, Alexander Spyridakis, Claudio Fontana,
	qemu-devel, Alvise Rigo, Emilio G. Cota, Paolo Bonzini,
	Edgar E. Iglesias, Alex Bennée, KONRAD Frédéric

Mark Burton writes:
[...]
>>>> * What to do about icount?
>>>> 
>>>> What is the impact of multi-threading on icount? Do we need to
>>>> disable it for MTTCG, or can it be correct per-CPU? Can it be
>>>> updated in lock-step?
>>>> 
>>>> We need some input from the guys that use icount the most.
>>> 
>>> That means Edgar. :)
>> 
>> Hi!
>> 
>> IMO it would be nice if we could run the cores in some kind of
>> lock-step with a configurable number of instructions (X) that they
>> can run ahead.
>> 
>> For example, if X is 10000, every thread/core would checkpoint at
>> 10000-insn boundaries and wait for the other cores. Between these
>> checkpoints, the cores will not be in sync. We might need to
>> consider synchronizing at I/O accesses as well, to avoid weird
>> timing issues when reading counter registers, for example.
>> 
>> Of course the devil will be in the details but an approach roughly
>> like that sounds useful to me.

> And "works" in other domains.
> Theoretically we don't need to sync at I/O (dynamic quanta); for most
> systems that have 'normal' I/O it's usually less efficient, I believe.
> However, the trouble is that the user typically doesn't know, and
> mucking about with quantum lengths, dynamic quantum switches etc. is
> probably a royal pain in the butt. And if you don't set your quantum
> right, the thing will run really slowly (or will break)...

> The choices are a rock or a hard place. Dynamic quanta risk being slow
> (you'll be forcing an expensive 'sync', with all CPUs having to exit
> etc.) on each I/O access from each core... not great. Syncing with
> host time (e.g. each CPU tries to sync with the host clock as best it
> can) will fail when one or another CPU can't keep up... In the end you
> are left handing the user a nice long bit of string and a message
> saying "hang yourself here".

That price would not be paid when icount is disabled. Well, the code complexity
price is always paid... I meant runtime :)

Then, I think this depends on what type of guarantees you require from
icount. I see two possible semantics:

* All CPUs are *exactly* synchronized at icount granularity

  This means that every icount instructions everyone has to stop and
  synchronize.

* All CPUs are *loosely* synchronized at icount granularity

  You can implement it in a way that ensures that every cpu has *at least*
  reached a certain timestamp. So cpus can keep on running nonetheless.

The downside is that the latter loses the ability to do reproducible
runs, which IMHO are useful. A more complex option is to merge both:
icount sets the "synchronization granularity" and another parameter
sets the maximum delta between cpus (i.e., set it to 0 to get the first
option, and infinity for the second).
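
A sketch of what I mean, with invented names; delta_max = 0 gives the
first semantics and a huge value gives the second:

#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

#define NR_CPUS 4
static _Atomic uint64_t cpu_icount[NR_CPUS];  /* insns retired per cpu */
static uint64_t delta_max = 10000;            /* 0 => exact lockstep */

static uint64_t slowest_cpu(void)
{
    uint64_t min = UINT64_MAX;
    for (int i = 0; i < NR_CPUS; i++) {
        uint64_t c = atomic_load(&cpu_icount[i]);
        if (c < min) {
            min = c;
        }
    }
    return min;
}

/* Called by cpu 'me' after each batch of executed instructions. */
static void icount_throttle(int me, uint64_t executed)
{
    uint64_t mine = atomic_fetch_add(&cpu_icount[me], executed) + executed;

    while (mine > slowest_cpu() + delta_max) {
        sched_yield();      /* too far ahead: wait for the laggards */
    }
}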


Cheers,
  Lluis

-- 
"And it's much the same thing with knowledge, for whenever you learn
something new, the whole world becomes that much richer."
-- The Princess of Pure Reason, as told by Norton Juster in The Phantom
Tollbooth


* Re: [Qemu-devel] MTTCG Tasks (kvmforum summary)
From: dovgaluk @ 2015-09-04 13:10 UTC (permalink / raw)
  To: Lluís Vilanova
  Cc: Edgar E. Iglesias, mttcg, Paolo Bonzini, Claudio Fontana,
	Mark Burton, Alvise Rigo, qemu-devel, Emilio G. Cota,
	Alexander Spyridakis, Edgar E. Iglesias, mttcg-request,
	Alex Bennée, KONRAD Frédéric

Lluís Vilanova wrote on 2015-09-04 16:00:
> Mark Burton writes:
> [...]
>>>>> * What to do about icount?
>>>>> 
>>>>> What is the impact of multi-threading on icount? Do we need to
>>>>> disable it for MTTCG, or can it be correct per-CPU? Can it be
>>>>> updated in lock-step?
>>>>> 
>>>>> We need some input from the guys that use icount the most.
>>>> 
>>>> That means Edgar. :)
>>> 
>>> Hi!
>>> 
>>> IMO it would be nice if we could run the cores in some kind of
>>> lock-step with a configurable number of instructions (X) that they
>>> can run ahead.
>>> 
>>> For example, if X is 10000, every thread/core would checkpoint at
>>> 10000-insn boundaries and wait for the other cores. Between these
>>> checkpoints, the cores will not be in sync. We might need to
>>> consider synchronizing at I/O accesses as well, to avoid weird
>>> timing issues when reading counter registers, for example.
>>> 
>>> Of course the devil will be in the details but an approach roughly
>>> like that sounds useful to me.
> 
>> And "works" in other domains.
>> Theoretically we don't need to sync at I/O (dynamic quanta); for most
>> systems that have 'normal' I/O it's usually less efficient, I
>> believe. However, the trouble is that the user typically doesn't
>> know, and mucking about with quantum lengths, dynamic quantum
>> switches etc. is probably a royal pain in the butt. And if you don't
>> set your quantum right, the thing will run really slowly (or will
>> break)...
> 
>> The choices are a rock or a hard place. Dynamic quanta risk being
>> slow (you'll be forcing an expensive 'sync', with all CPUs having to
>> exit etc.) on each I/O access from each core... not great. Syncing
>> with host time (e.g. each CPU tries to sync with the host clock as
>> best it can) will fail when one or another CPU can't keep up... In
>> the end you are left handing the user a nice long bit of string and a
>> message saying "hang yourself here".
> 
> That price would not be paid when icount is disabled. Well, the code
> complexity price is always paid... I meant runtime :)
> 
> Then, I think this depends on what type of guarantees you require from
> icount. I see two possible semantics:
> 
> * All CPUs are *exactly* synchronized at icount granularity
> 
>   This means that every icount instructions everyone has to stop and
>   synchronize.
> 
> * All CPUs are *loosely* synchronized at icount granularity
> 
>   You can implement it in a way that ensures that every cpu has *at
>   least* reached a certain timestamp. So cpus can keep on running
>   nonetheless.
> 

Does a third possibility look sane?

* All CPUs synchronize at shared memory operations.
   When somebody tries to read/write shared memory, it should wait
   until all the others reach the same icount.

> The downside is that the latter loses the ability to do reproducible
> runs, which IMHO are useful. A more complex option is to merge both:
> icount sets the "synchronization granularity" and another parameter
> sets the maximum delta between cpus (i.e., set it to 0 to get the
> first option, and infinity for the second).


Pavel Dovgalyuk


* Re: [Qemu-devel] MTTCG Tasks (kvmforum summary)
From: Lluís Vilanova @ 2015-09-04 14:59 UTC (permalink / raw)
  To: dovgaluk
  Cc: Edgar E. Iglesias, mttcg, mttcg-request, Paolo Bonzini,
	Claudio Fontana, Mark Burton, Alvise Rigo, qemu-devel,
	Emilio G. Cota, Alexander Spyridakis, Edgar E. Iglesias,
	Alex Bennée, KONRAD Frédéric

dovgaluk  writes:

> Lluís Vilanova wrote on 2015-09-04 16:00:
>> Mark Burton writes:
>> [...]
>>>>>> * What to do about icount?
>>>>>> 
>>>>>> What is the impact of multi-threading on icount? Do we need to
>>>>>> disable it for MTTCG, or can it be correct per-CPU? Can it be
>>>>>> updated in lock-step?
>>>>>> 
>>>>>> We need some input from the guys that use icount the most.
>>>>> 
>>>>> That means Edgar. :)
>>>> 
>>>> Hi!
>>>> 
>>>> IMO it would be nice if we could run the cores in some kind of
>>>> lock-step with a configurable number of instructions (X) that they
>>>> can run ahead.
>>>> 
>>>> For example, if X is 10000, every thread/core would checkpoint at
>>>> 10000-insn boundaries and wait for the other cores. Between these
>>>> checkpoints, the cores will not be in sync. We might need to
>>>> consider synchronizing at I/O accesses as well, to avoid weird
>>>> timing issues when reading counter registers, for example.
>>>> 
>>>> Of course the devil will be in the details but an approach roughly
>>>> like that sounds useful to me.
>> 
>>> And "works" in other domains.
>>> Theoretically we don't need to sync at I/O (dynamic quanta); for
>>> most systems that have 'normal' I/O it's usually less efficient, I
>>> believe. However, the trouble is that the user typically doesn't
>>> know, and mucking about with quantum lengths, dynamic quantum
>>> switches etc. is probably a royal pain in the butt. And if you don't
>>> set your quantum right, the thing will run really slowly (or will
>>> break)...
>> 
>>> The choices are a rock or a hard place. Dynamic quanta risk being
>>> slow (you'll be forcing an expensive 'sync', with all CPUs having to
>>> exit etc.) on each I/O access from each core... not great. Syncing
>>> with host time (e.g. each CPU tries to sync with the host clock as
>>> best it can) will fail when one or another CPU can't keep up... In
>>> the end you are left handing the user a nice long bit of string and
>>> a message saying "hang yourself here".
>> 
>> That price would not be paid when icount is disabled. Well, the code
>> complexity price is always paid... I meant runtime :)
>> 
>> Then, I think this depends on what type of guarantees you require from
>> icount. I see two possible semantics:
>> 
>> * All CPUs are *exactly* synchronized at icount granularity
>> 
>> This means that every icount instructions everyone has to stop and
>> synchronize.
>> 
>> * All CPUs are *loosely* synchronized at icount granularity
>> 
>> You can implement it in a way that ensures that every cpu has *at least*
>> reached a certain timestamp. So cpus can keep on running nonetheless.
>> 

> Does a third possibility look sane?

> * All CPUs synchronize at shared memory operations.
>   When somebody tries to read/write shared memory, it should wait
>   until all the others reach the same icount.

I think that's too heavyweight: every memory access is a potential
shared-memory operation. You could refine it by tagging which pages are
shared across cores to limit the number of synchronizations, but all
pages would eventually end up shared.


Lluis

-- 
"And it's much the same thing with knowledge, for whenever you learn
something new, the whole world becomes that much richer."
-- The Princess of Pure Reason, as told by Norton Juster in The Phantom
Tollbooth

