* [Qemu-devel] RFC Multi-threaded TCG design document
@ 2015-06-12 16:37 Alex Bennée
  2015-06-15  9:13 ` Frederic Konrad
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Alex Bennée @ 2015-06-12 16:37 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: peter.maydell, mark.burton, agraf, guillaume.delbergue, pbonzini,
	alex.bennee, fred.konrad

Hi,

One thing that Peter has been asking for is a design document for the
way we are going to approach multi-threaded TCG emulation. I started
with the information that was captured on the wiki and tried to build on
that. It's almost certainly incomplete but I thought it would be worth
posting for wider discussion early rather than later.

One obvious omission at the moment is the lack of discussion about other
non-TLB shared data structures in QEMU (I'm thinking of the various
dirty page tracking bits, I'm sure there is more).

I've also deliberately tried to avoid documenting the design decisions
made in the current Greensocs patch series. This is so we can
concentrate on the big picture before getting side-tracked into the
implementation details.

I have now started digging into the Greensocs code in earnest and the
plan is that eventually the design and the implementation will converge
on a final, complete, documented solution ;-)

Anyway as ever I look forward to the comments and discussion:

STATUS: DRAFTING

Introduction
============

This document outlines the design for multi-threaded TCG emulation.
The original TCG implementation was single threaded and dealt with
multiple CPUs using simple round-robin scheduling. This simplified a
lot of things but became increasingly limiting as the systems being
emulated gained additional cores and per-core performance gains for host
systems started to level off.

Memory Consistency
==================

Between emulated guests and host systems there is a range of memory
consistency models. While emulating weakly ordered systems on strongly
ordered hosts shouldn't cause any problems, the same is not true for
the reverse setup.

The proposed design currently does not address the problem of
emulating strong ordering on a weakly ordered host, although even on
strongly ordered systems software should be using synchronisation
primitives to ensure correct operation.

Memory Barriers
---------------

Barriers (sometimes known as fences) provide a mechanism for software
to enforce a particular ordering of memory operations from the point
of view of external observers (e.g. another processor core). They can
apply to all memory operations or to just loads or just stores.

The Linux kernel has an excellent write-up on the various forms of
memory barrier and the guarantees they can provide [1].

Barriers are often wrapped around synchronisation primitives to
provide explicit memory ordering semantics. However they can be used
by themselves to provide safe lockless access by ensuring for example
a signal flag will always be set after a payload.
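
The flag-after-payload pattern mentioned above can be sketched in plain
C11 as follows. This is only an illustration of the ordering guarantee a
guest expects from its barrier instructions, not a proposal for how the
TCG op would be implemented; the mailbox structure and function names
are made up for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical mailbox shared between two threads (illustrative names). */
struct mailbox {
    int payload;        /* plain data, written before the flag */
    atomic_bool ready;  /* signal flag */
};

/* Producer: write the payload, then raise the flag. */
static void mailbox_publish(struct mailbox *mb, int value)
{
    mb->payload = value;
    /* Release fence: the payload store cannot move after the flag store. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&mb->ready, true, memory_order_relaxed);
}

/* Consumer: returns true and fills *out only once the flag has been seen. */
static bool mailbox_try_consume(struct mailbox *mb, int *out)
{
    if (!atomic_load_explicit(&mb->ready, memory_order_relaxed)) {
        return false;
    }
    /* Acquire fence: the payload load cannot move before the flag load. */
    atomic_thread_fence(memory_order_acquire);
    *out = mb->payload;
    return true;
}
```

Without the two fences a weakly ordered host would be free to make the
flag visible before the payload, which is exactly the case a guest
barrier is meant to rule out.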

DESIGN REQUIREMENT: Add a new tcg_memory_barrier op

This would enforce a strong load/store ordering so all loads/stores
complete at the memory barrier. On single-core non-SMP strongly
ordered backends this could become a NOP.

There may be a case for further refinement if this causes performance
bottlenecks.

Memory Control and Maintenance
------------------------------

This includes a class of instructions for controlling system cache
behaviour. While QEMU doesn't model cache behaviour, these instructions
are often issued after code modification has taken place to ensure the
changes take effect.

Synchronisation Primitives
--------------------------

There are two broad types of synchronisation primitives found in
modern ISAs: atomic instructions and exclusive regions.

The first type offers a single instruction which guarantees that
some sort of test and conditional store will be truly atomic w.r.t.
other cores sharing access to the memory. The classic example is the
x86 cmpxchg instruction.

The second type offers a pair of load/store instructions which
guarantee that a region of memory has not been touched between the
load and store instructions. An example of this is ARM's ldrex/strex
pair, where the strex instruction will return a flag indicating a
successful store only if no other CPU has accessed the memory region
since the ldrex.

Traditionally TCG has generated a series of operations that work
because they are within the context of a single translation block so
will have completed before another CPU is scheduled. However with
the ability to have multiple threads running to emulate multiple CPUs
we will need to explicitly expose these semantics.
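
As a rough sketch of the atomic-instruction semantics described above,
here are two helpers written with C11 atomics. C11 <stdatomic.h> is used
purely as a stand-in for whatever QEMU's own atomic primitives provide,
and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * cmpxchg-style helper: store 'newval' only if *addr equals 'expected'.
 * Returns the value that was observed in memory, as x86 cmpxchg does.
 */
static uint32_t helper_cmpxchg_u32(_Atomic uint32_t *addr,
                                   uint32_t expected, uint32_t newval)
{
    /* On failure 'expected' is overwritten with the current value. */
    atomic_compare_exchange_strong(addr, &expected, newval);
    return expected;
}

/*
 * Atomic add built from a compare-and-swap retry loop.
 * Returns the value held before the addition.
 */
static uint32_t helper_atomic_fetch_add_u32(_Atomic uint32_t *addr,
                                            uint32_t inc)
{
    uint32_t old = atomic_load(addr);
    /* If another thread changed *addr, 'old' is refreshed and we retry. */
    while (!atomic_compare_exchange_weak(addr, &old, old + inc)) {
    }
    return old;
}
```

The retry loop is the same shape a guest writes around ldrex/strex, with
the difference that compare-and-swap checks the value rather than
whether the location was touched, so it cannot detect an A-B-A change.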

DESIGN REQUIREMENTS:
 - atomics
   - Introduce some atomic TCG ops for the common semantics
   - The default fallback helper function will use qemu_atomics
   - Each backend can then add a more efficient implementation
 - load/store exclusive
   [AJB:
        There are currently a number of proposals of interest:
     - Greensocs tweaks to ldst ex (using locks)
     - Slow-path for atomic instruction translation [2]
     - Helper-based Atomic Instruction Emulation (AIE) [3]
    ]


Shared Data Structures
======================

Global TCG State
----------------

We need to protect the entire code generation cycle including any post
generation patching of the translated code. This also implies a shared
translation buffer which contains code running on all cores. Any
execution path that comes to the main run loop will need to hold a
mutex for code generation. This also includes times when we need to flush
code or jumps from the tb_cache.

DESIGN REQUIREMENT: Add locking around all code generation, patching
and jump cache modification
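
A minimal sketch of what "locking around code generation" could look
like, assuming a single global lock and a toy direct-mapped block cache.
The struct, the cache and the function names are invented for the
example and do not correspond to the real QEMU data structures.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define TB_CACHE_SIZE 64

/* Hypothetical, heavily simplified translation block. */
struct tb {
    uint64_t pc;
    int valid;
};

static struct tb tb_cache[TB_CACHE_SIZE];
static unsigned tb_generated;  /* counts "translations", for illustration */
static pthread_mutex_t tb_lock = PTHREAD_MUTEX_INITIALIZER;

/* Look up the block for 'pc', generating it under the lock if needed. */
static struct tb *tb_find_or_generate(uint64_t pc)
{
    struct tb *tb = &tb_cache[pc % TB_CACHE_SIZE];

    pthread_mutex_lock(&tb_lock);
    if (!tb->valid || tb->pc != pc) {
        /* Real code generation and patching would happen here. */
        tb->pc = pc;
        tb->valid = 1;
        tb_generated++;
    }
    pthread_mutex_unlock(&tb_lock);
    return tb;
}

/* Flushing takes the same lock as generation. */
static void tb_flush_all(void)
{
    pthread_mutex_lock(&tb_lock);
    for (size_t i = 0; i < TB_CACHE_SIZE; i++) {
        tb_cache[i].valid = 0;
    }
    pthread_mutex_unlock(&tb_lock);
}
```

The lock only serialises writers. It says nothing about a thread that is
already executing a block when that block is flushed or overwritten, so
on its own it is unlikely to be sufficient.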

Memory maps and TLBs
--------------------

The memory handling code is fairly critical to the speed of memory
access in the emulated system.

  - Memory regions (dividing up access to PIO, MMIO and RAM)
  - Dirty page tracking (for code gen, migration and display)
  - Virtual TLB (for translating guest address->real address)

There is both a fast path walked by the generated code and a slow
path when resolution is required. When the TLB tables are updated we
need to ensure this is done safely by bringing all executing
threads to a halt before making the modifications.

DESIGN REQUIREMENTS:

  - TLB Flush All/Page
    - can be across-CPUs
    - will need all other CPUs brought to a halt
  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
    - This is a per-CPU table - by definition can't race
    - updated by its own thread when the slow-path is forced

Emulated hardware state
-----------------------

Currently the hardware emulation has no protection against
multiple accesses. However guest systems accessing emulated hardware
should be carrying out their own locking to prevent multiple CPUs
confusing the hardware. Of course there is no guarantee that there
couldn't be a broken guest that doesn't lock, so you could get racing
accesses to the hardware.

There is also the class of paravirtualized hardware (VIRTIO) that works
in a purely MMIO mode, often setting flags directly in guest memory as
a result of a guest-triggered transaction.

DESIGN REQUIREMENTS:

  - Access to IO Memory should be serialised by an IOMem mutex
  - The mutex should be recursive (e.g. allowing pid to relock itself)
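
To make the "recursive" requirement concrete, here is a small sketch of
a lock that the current owner may take again. It is a hand-rolled spin
lock keyed on a caller-supplied vCPU index so that the example stays
self-contained; the names are hypothetical, and a real implementation
would more likely use a pthread mutex of type PTHREAD_MUTEX_RECURSIVE.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define IOMEM_NO_OWNER (-1)

/* Hypothetical recursive lock for IO memory accesses. */
struct iomem_lock {
    atomic_int owner;  /* index of the owning vCPU, or IOMEM_NO_OWNER */
    int depth;         /* nesting count, only touched by the owner */
};

static void iomem_lock_init(struct iomem_lock *l)
{
    atomic_init(&l->owner, IOMEM_NO_OWNER);
    l->depth = 0;
}

/* Try to take the lock for 'cpu'; re-entry by the owner always succeeds. */
static bool iomem_trylock(struct iomem_lock *l, int cpu)
{
    int expected = IOMEM_NO_OWNER;

    /* Only 'cpu' itself can have stored its own index here. */
    if (atomic_load_explicit(&l->owner, memory_order_relaxed) == cpu) {
        l->depth++;
        return true;
    }
    if (atomic_compare_exchange_strong_explicit(&l->owner, &expected, cpu,
                                                memory_order_acquire,
                                                memory_order_relaxed)) {
        l->depth = 1;
        return true;
    }
    return false;
}

/* Spin until the lock is held by 'cpu'. */
static void iomem_lock_acquire(struct iomem_lock *l, int cpu)
{
    while (!iomem_trylock(l, cpu)) {
    }
}

/* Drop one level of nesting; the last unlock releases the lock. */
static void iomem_unlock(struct iomem_lock *l, int cpu)
{
    assert(atomic_load_explicit(&l->owner, memory_order_relaxed) == cpu);
    if (--l->depth == 0) {
        atomic_store_explicit(&l->owner, IOMEM_NO_OWNER,
                              memory_order_release);
    }
}
```

A device model that triggers a nested MMIO access from inside its own
handler would then re-enter the lock instead of deadlocking.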

IO Subsystem
------------

The I/O subsystem is heavily used by KVM and has seen a lot of
improvements to offload I/O tasks to dedicated IOThreads. There should
be no additional locking required once we reach the Block Driver.

DESIGN REQUIREMENTS:

  - The dataplane should continue to be protected by the iothread locks


References
==========

[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
[2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
[3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297



-- 
Alex Bennée


* Re: [Qemu-devel] RFC Multi-threaded TCG design document
  2015-06-12 16:37 [Qemu-devel] RFC Multi-threaded TCG design document Alex Bennée
@ 2015-06-15  9:13 ` Frederic Konrad
  2015-06-15 10:06   ` Alex Bennée
  2015-06-15 13:06 ` alvise rigo
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Frederic Konrad @ 2015-06-15  9:13 UTC (permalink / raw)
  To: Alex Bennée, qemu-devel, mttcg
  Cc: peter.maydell, pbonzini, mark.burton, agraf, guillaume.delbergue

On 12/06/2015 18:37, Alex Bennée wrote:
> Hi,

Hi Alex,

I've completed some of the points below. We will also work on a design
decisions document to add to this one.

We probably want to merge that with what we did on the wiki?
http://wiki.qemu.org/Features/tcg-multithread

> One thing that Peter has been asking for is a design document for the
> way we are going to approach multi-threaded TCG emulation. I started
> with the information that was captured on the wiki and tried to build on
> that. It's almost certainly incomplete but I thought it would be worth
> posting for wider discussion early rather than later.
>
> One obvious omission at the moment is the lack of discussion about other
> non-TLB shared data structures in QEMU (I'm thinking of the various
> dirty page tracking bits, I'm sure there is more).
>
> I've also deliberately tried to avoid documenting the design decisions
> made in the current Greensoc's patch series. This is so we can
> concentrate on the big picture before getting side-tracked into the
> implementation details.
>
> I have now started digging into the Greensocs code in earnest and the
> plan is eventually the design and the implementation will converge on a
> final documented complete solution ;-)
>
> Anyway as ever I look forward to the comments and discussion:
>
> STATUS: DRAFTING
>
> Introduction
> ============
>
> This document outlines the design for multi-threaded TCG emulation.
> The original TCG implementation was single threaded and dealt with
> multiple CPUs by with simple round-robin scheduling. This simplified a
> lot of things but became increasingly limited as systems being
> emulated gained additional cores and per-core performance gains for host
> systems started to level off.
>
> Memory Consistency
> ==================
>
> Between emulated guests and host systems there are a range of memory
> consistency models. While emulating weakly ordered systems on strongly
> ordered hosts shouldn't cause any problems the same is not true for
> the reverse setup.
>
> The proposed design currently does not address the problem of
> emulating strong ordering on a weakly ordered host although even on
> strongly ordered systems software should be using synchronisation
> primitives to ensure correct operation.
>
> Memory Barriers
> ---------------
>
> Barriers (sometimes known as fences) provide a mechanism for software
> to enforce a particular ordering of memory operations from the point
> of view of external observers (e.g. another processor core). They can
> apply to any memory operations as well as just loads or stores.
>
> The Linux kernel has an excellent write-up on the various forms of
> memory barrier and the guarantees they can provide [1].
>
> Barriers are often wrapped around synchronisation primitives to
> provide explicit memory ordering semantics. However they can be used
> by themselves to provide safe lockless access by ensuring for example
> a signal flag will always be set after a payload.
>
> DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
>
> This would enforce a strong load/store ordering so all loads/stores
> complete at the memory barrier. On single-core non-SMP strongly
> ordered backends this could become a NOP.
>
> There may be a case for further refinement if this causes performance
> bottlenecks.
>
> Memory Control and Maintenance
> ------------------------------
>
> This includes a class of instructions for controlling system cache
> behaviour. While QEMU doesn't model cache behaviour these instructions
> are often seen when code modification has taken place to ensure the
> changes take effect.
>
> Synchronisation Primitives
> --------------------------
>
> There are two broad types of synchronisation primitives found in
> modern ISAs: atomic instructions and exclusive regions.
>
> The first type offer a simple atomic instruction which will guarantee
> some sort of test and conditional store will be truly atomic w.r.t.
> other cores sharing access to the memory. The classic example is the
> x86 cmpxchg instruction.
>
> The second type offer a pair of load/store instructions which offer a
> guarantee that an region of memory has not been touched between the
> load and store instructions. An example of this is ARM's ldrex/strex
> pair where the strex instruction will return a flag indicating a
> successful store only if no other CPU has accessed the memory region
> since the ldrex.
>
> Traditionally TCG has generated a series of operations that work
> because they are within the context of a single translation block so
> will have completed before another CPU is scheduled. However with
> the ability to have multiple threads running to emulate multiple CPUs
> we will need to explicitly expose these semantics.
>
> DESIGN REQUIREMENTS:
>   - atomics
>     - Introduce some atomic TCG ops for the common semantics
>     - The default fallback helper function will use qemu_atomics
>     - Each backend can then add a more efficient implementation
>   - load/store exclusive
>     [AJB:
>          There are currently a number proposals of interest:
>       - Greensocs tweaks to ldst ex (using locks)
>       - Slow-path for atomic instruction translation [2]
>       - Helper-based Atomic Instruction Emulation (AIE) [3]
>      ]
>
>
> Shared Data Structures
> ======================
>
> Global TCG State
> ----------------
>
> We need to protect the entire code generation cycle including any post
> generation patching of the translated code. This also implies a shared
> translation buffer which contains code running on all cores. Any
> execution path that comes to the main run loop will need to hold a
> mutex for code generation. This also includes times when we need flush
> code or jumps from the tb_cache.
>
> DESIGN REQUIREMENT: Add locking around all code generation, patching
> and jump cache modification
Actually from my point of view jump cache modification requires more than a
lock, as other VCPU threads can be executing code during the modification.

Fortunately this happens "only" with tlb_flush, tlb_page_flush, tb_flush and
tb_invalidate, which need all CPUs to be halted anyway.

>
> Memory maps and TLBs
> --------------------
>
> The memory handling code is fairly critical to the speed of memory
> access in the emulated system.
>
>    - Memory regions (dividing up access to PIO, MMIO and RAM)
>    - Dirty page tracking (for code gen, migration and display)
>    - Virtual TLB (for translating guest address->real address)
>
> There is a both a fast path walked by the generated code and a slow
> path when resolution is required. When the TLB tables are updated we
> need to ensure they are done in a safe way by bringing all executing
> threads to a halt before making the modifications.
>
> DESIGN REQUIREMENTS:
>
>    - TLB Flush All/Page
>      - can be across-CPUs
>      - will need all other CPUs brought to a halt
>    - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
>      - This is a per-CPU table - by definition can't race
>      - updated by it's own thread when the slow-path is forced
Actually, as we have approximately the same behaviour for all of these memory
handling operations (e.g. tb_flush, tb_*_invalidate, tlb_*_flush), which are all
playing with the TranslationBlock and the jump cache across CPUs, I think we
have to add a generic "exit and do something" mechanism for the CPU threads.
So every VCPU thread has a list of things to do when it exits (such as clearing
its own tb_jmp_cache during a tlb_flush, or waiting for the other CPUs and
flushing only one entry for tb_invalidate).
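
A minimal sketch of such an "exit and do something" list, assuming a
small fixed-size per-VCPU queue protected by a mutex. All of the names
here (struct vcpu, vcpu_queue_work and so on) are made up for the
illustration and are not taken from the existing patches.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_PENDING 8
#define JMP_CACHE_SIZE 4

typedef void (*vcpu_work_fn)(void *opaque);

struct vcpu_work {
    vcpu_work_fn fn;
    void *opaque;
};

/* Hypothetical per-VCPU state: just the deferred work list. */
struct vcpu {
    pthread_mutex_t work_lock;
    struct vcpu_work pending[MAX_PENDING];
    size_t n_pending;
};

static void vcpu_init(struct vcpu *cpu)
{
    pthread_mutex_init(&cpu->work_lock, NULL);
    cpu->n_pending = 0;
}

/* Called from any thread. Returns 0 on success, -1 if the list is full. */
static int vcpu_queue_work(struct vcpu *cpu, vcpu_work_fn fn, void *opaque)
{
    int ret = -1;

    pthread_mutex_lock(&cpu->work_lock);
    if (cpu->n_pending < MAX_PENDING) {
        cpu->pending[cpu->n_pending++] = (struct vcpu_work){ fn, opaque };
        ret = 0;
    }
    pthread_mutex_unlock(&cpu->work_lock);
    return ret;
}

/* Called by the VCPU's own thread once it has left translated code. */
static size_t vcpu_run_pending_work(struct vcpu *cpu)
{
    struct vcpu_work batch[MAX_PENDING];
    size_t n;

    /* Copy the list out so the callbacks run without the lock held. */
    pthread_mutex_lock(&cpu->work_lock);
    n = cpu->n_pending;
    for (size_t i = 0; i < n; i++) {
        batch[i] = cpu->pending[i];
    }
    cpu->n_pending = 0;
    pthread_mutex_unlock(&cpu->work_lock);

    for (size_t i = 0; i < n; i++) {
        batch[i].fn(batch[i].opaque);
    }
    return n;
}

/* Example work item: clear a (toy) per-VCPU jump cache. */
static void work_clear_jmp_cache(void *opaque)
{
    void **cache = opaque;

    for (size_t i = 0; i < JMP_CACHE_SIZE; i++) {
        cache[i] = NULL;
    }
}
```

The sketch leaves out the harder half of the problem, which is kicking
the target VCPU out of its execution loop and knowing when every VCPU
has reached the exit point.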

>
> Emulated hardware state
> -----------------------
>
> Currently the hardware emulation has no protection against
> multiple-accesses. However guest systems accessing emulated hardware
> should be carrying out their own locking to prevent multiple CPUs
> confusing the hardware. Of course there is no guarantee the there
> couldn't be a broken guest that doesn't lock so you could get racing
> accesses to the hardware.
>
> There is the class of paravirtualized hardware (VIRTIO) that works in
> a purely mmio mode. Often setting flags directly in guest memory as a
> result of a guest triggered transaction.
>
> DESIGN REQUIREMENTS:
>
>    - Access to IO Memory should be serialised by an IOMem mutex
>    - The mutex should be recursive (e.g. allowing pid to relock itself)
That might be done with the global mutex as it is today?
We need changes here anyway to have VCPU threads running in parallel.

Thanks,
Fred

> IO Subsystem
> ------------
>
> The I/O subsystem is heavily used by KVM and has seen a lot of
> improvements to offload I/O tasks to dedicated IOThreads. There should
> be no additional locking required once we reach the Block Driver.
>
> DESIGN REQUIREMENTS:
>
>    - The dataplane should continue to be protected by the iothread locks
>
>
> References
> ==========
>
> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297
>
>
>


* Re: [Qemu-devel] RFC Multi-threaded TCG design document
  2015-06-15  9:13 ` Frederic Konrad
@ 2015-06-15 10:06   ` Alex Bennée
  2015-06-15 10:51     ` Mark Burton
  0 siblings, 1 reply; 14+ messages in thread
From: Alex Bennée @ 2015-06-15 10:06 UTC (permalink / raw)
  To: Frederic Konrad
  Cc: mttcg, peter.maydell, mark.burton, qemu-devel, agraf,
	guillaume.delbergue, pbonzini


Frederic Konrad <fred.konrad@greensocs.com> writes:

> On 12/06/2015 18:37, Alex Bennée wrote:
>> Hi,
>
> Hi Alex,
>
> I've completed some of the points below. We will also work on a design
> decisions document to add to this one.
>
> We probably want to merge that with what we did on the wiki?
> http://wiki.qemu.org/Features/tcg-multithread

Well hopefully there is cross-over as I started with the wiki as a basis
;-)

Do we want to just keep the wiki as the live design document or put
pointers to the current drafts? I'm hoping eventually the page will just
point to the design in the doc directory at git.qemu.org.

>> One thing that Peter has been asking for is a design document for the
>> way we are going to approach multi-threaded TCG emulation. I started
>> with the information that was captured on the wiki and tried to build on
>> that. It's almost certainly incomplete but I thought it would be worth
>> posting for wider discussion early rather than later.
>>
>> One obvious omission at the moment is the lack of discussion about other
>> non-TLB shared data structures in QEMU (I'm thinking of the various
>> dirty page tracking bits, I'm sure there is more).
>>
>> I've also deliberately tried to avoid documenting the design decisions
>> made in the current Greensoc's patch series. This is so we can
>> concentrate on the big picture before getting side-tracked into the
>> implementation details.
>>
>> I have now started digging into the Greensocs code in earnest and the
>> plan is eventually the design and the implementation will converge on a
>> final documented complete solution ;-)
>>
>> Anyway as ever I look forward to the comments and discussion:
>>
>> STATUS: DRAFTING
>>
>> Introduction
>> ============
>>
>> This document outlines the design for multi-threaded TCG emulation.
>> The original TCG implementation was single threaded and dealt with
>> multiple CPUs by with simple round-robin scheduling. This simplified a
>> lot of things but became increasingly limited as systems being
>> emulated gained additional cores and per-core performance gains for host
>> systems started to level off.
>>
>> Memory Consistency
>> ==================
>>
>> Between emulated guests and host systems there are a range of memory
>> consistency models. While emulating weakly ordered systems on strongly
>> ordered hosts shouldn't cause any problems the same is not true for
>> the reverse setup.
>>
>> The proposed design currently does not address the problem of
>> emulating strong ordering on a weakly ordered host although even on
>> strongly ordered systems software should be using synchronisation
>> primitives to ensure correct operation.
>>
>> Memory Barriers
>> ---------------
>>
>> Barriers (sometimes known as fences) provide a mechanism for software
>> to enforce a particular ordering of memory operations from the point
>> of view of external observers (e.g. another processor core). They can
>> apply to any memory operations as well as just loads or stores.
>>
>> The Linux kernel has an excellent write-up on the various forms of
>> memory barrier and the guarantees they can provide [1].
>>
>> Barriers are often wrapped around synchronisation primitives to
>> provide explicit memory ordering semantics. However they can be used
>> by themselves to provide safe lockless access by ensuring for example
>> a signal flag will always be set after a payload.
>>
>> DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
>>
>> This would enforce a strong load/store ordering so all loads/stores
>> complete at the memory barrier. On single-core non-SMP strongly
>> ordered backends this could become a NOP.
>>
>> There may be a case for further refinement if this causes performance
>> bottlenecks.
>>
>> Memory Control and Maintenance
>> ------------------------------
>>
>> This includes a class of instructions for controlling system cache
>> behaviour. While QEMU doesn't model cache behaviour these instructions
>> are often seen when code modification has taken place to ensure the
>> changes take effect.
>>
>> Synchronisation Primitives
>> --------------------------
>>
>> There are two broad types of synchronisation primitives found in
>> modern ISAs: atomic instructions and exclusive regions.
>>
>> The first type offer a simple atomic instruction which will guarantee
>> some sort of test and conditional store will be truly atomic w.r.t.
>> other cores sharing access to the memory. The classic example is the
>> x86 cmpxchg instruction.
>>
>> The second type offer a pair of load/store instructions which offer a
>> guarantee that an region of memory has not been touched between the
>> load and store instructions. An example of this is ARM's ldrex/strex
>> pair where the strex instruction will return a flag indicating a
>> successful store only if no other CPU has accessed the memory region
>> since the ldrex.
>>
>> Traditionally TCG has generated a series of operations that work
>> because they are within the context of a single translation block so
>> will have completed before another CPU is scheduled. However with
>> the ability to have multiple threads running to emulate multiple CPUs
>> we will need to explicitly expose these semantics.
>>
>> DESIGN REQUIREMENTS:
>>   - atomics
>>     - Introduce some atomic TCG ops for the common semantics
>>     - The default fallback helper function will use qemu_atomics
>>     - Each backend can then add a more efficient implementation
>>   - load/store exclusive
>>     [AJB:
>>          There are currently a number proposals of interest:
>>       - Greensocs tweaks to ldst ex (using locks)
>>       - Slow-path for atomic instruction translation [2]
>>       - Helper-based Atomic Instruction Emulation (AIE) [3]
>>      ]
>>
>>
>> Shared Data Structures
>> ======================
>>
>> Global TCG State
>> ----------------
>>
>> We need to protect the entire code generation cycle including any post
>> generation patching of the translated code. This also implies a shared
>> translation buffer which contains code running on all cores. Any
>> execution path that comes to the main run loop will need to hold a
>> mutex for code generation. This also includes times when we need flush
>> code or jumps from the tb_cache.
>>
>> DESIGN REQUIREMENT: Add locking around all code generation, patching
>> and jump cache modification
> Actually from my point of view jump cache modification requires more than a
> lock, as other VCPU threads can be executing code during the modification.
>
> Fortunately this happens "only" with tlb_flush, tlb_page_flush, tb_flush and
> tb_invalidate, which need all CPUs to be halted anyway.

How about:

DESIGN REQUIREMENT:
       - Code generation and patching will be protected by a lock
       - Jump cache modification will assert all CPUs are halted

>>
>> Memory maps and TLBs
>> --------------------
>>
>> The memory handling code is fairly critical to the speed of memory
>> access in the emulated system.
>>
>>    - Memory regions (dividing up access to PIO, MMIO and RAM)
>>    - Dirty page tracking (for code gen, migration and display)
>>    - Virtual TLB (for translating guest address->real address)
>>
>> There is a both a fast path walked by the generated code and a slow
>> path when resolution is required. When the TLB tables are updated we
>> need to ensure they are done in a safe way by bringing all executing
>> threads to a halt before making the modifications.
>>
>> DESIGN REQUIREMENTS:
>>
>>    - TLB Flush All/Page
>>      - can be across-CPUs
>>      - will need all other CPUs brought to a halt
>>    - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
>>      - This is a per-CPU table - by definition can't race
>>      - updated by it's own thread when the slow-path is forced
> Actually, as we have approximately the same behaviour for all of these memory
> handling operations (e.g. tb_flush, tb_*_invalidate, tlb_*_flush), which are all
> playing with the TranslationBlock and the jump cache across CPUs, I think we
> have to add a generic "exit and do something" mechanism for the CPU threads.
> So every VCPU thread has a list of things to do when it exits (such as clearing
> its own tb_jmp_cache during a tlb_flush, or waiting for the other CPUs and
> flushing only one entry for tb_invalidate).

Sounds like I should write an additional section to describe the process
of halting CPUs and carrying out deferred per-CPU actions as well as
ensuring we can tell when they are all halted.

>> Emulated hardware state
>> -----------------------
>>
>> Currently the hardware emulation has no protection against
>> multiple-accesses. However guest systems accessing emulated hardware
>> should be carrying out their own locking to prevent multiple CPUs
>> confusing the hardware. Of course there is no guarantee the there
>> couldn't be a broken guest that doesn't lock so you could get racing
>> accesses to the hardware.
>>
>> There is the class of paravirtualized hardware (VIRTIO) that works in
>> a purely mmio mode. Often setting flags directly in guest memory as a
>> result of a guest triggered transaction.
>>
>> DESIGN REQUIREMENTS:
>>
>>    - Access to IO Memory should be serialised by an IOMem mutex
>>    - The mutex should be recursive (e.g. allowing pid to relock itself)
> That might be done with the global mutex as it is today?
> We need changes here anyway to have VCPU threads running in parallel.

I'm not sure re-using the global mutex is a good idea. I've had to hack
the global mutex to allow recursive locking to get around the virtio
hang I discovered last week. While it works, I'm uneasy about making such a
radical change upstream given how widely the global mutex is used, hence
the suggestion to have an explicit IOMem mutex.

Actually I'm surprised the iothread mutex just re-uses the global one.
I guess I need to talk to the IO guys as to why they took that
decision.

>
> Thanks,

Thanks for your quick review :-)

> Fred
>
>> IO Subsystem
>> ------------
>>
>> The I/O subsystem is heavily used by KVM and has seen a lot of
>> improvements to offload I/O tasks to dedicated IOThreads. There should
>> be no additional locking required once we reach the Block Driver.
>>
>> DESIGN REQUIREMENTS:
>>
>>    - The dataplane should continue to be protected by the iothread locks
>>
>>
>> References
>> ==========
>>
>> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
>> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
>> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297
>>
>>
>>

-- 
Alex Bennée


* Re: [Qemu-devel] RFC Multi-threaded TCG design document
  2015-06-15 10:06   ` Alex Bennée
@ 2015-06-15 10:51     ` Mark Burton
  2015-06-15 12:36       ` Alex Bennée
  2015-06-15 14:25       ` Alex Bennée
  0 siblings, 2 replies; 14+ messages in thread
From: Mark Burton @ 2015-06-15 10:51 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, Peter Maydell, Alexander Graf, QEMU Developers,
	Guillaume Delbergue, pbonzini, KONRAD Frédéric

I think we SHOULD use the wiki - and keep it current. A lot of what you have is in the wiki too, but I’d like to see the wiki updated.
We will add our stuff there too…

Cheers
Mark.



> On 15 Jun 2015, at 12:06, Alex Bennée <alex.bennee@linaro.org> wrote:
> 
> 
> Frederic Konrad <fred.konrad@greensocs.com> writes:
> 
>> On 12/06/2015 18:37, Alex Bennée wrote:
>>> Hi,
>> 
>> Hi Alex,
>> 
>> I've completed some of the points below. We will also work on a design 
>> decisions
>> document to add to this one.
>> 
>> We probably want to merge that with what we did on the wiki?
>> http://wiki.qemu.org/Features/tcg-multithread
> 
> Well hopefully there is cross-over as I started with the wiki as a basic
> ;-)
> 
> Do we want to just keep the wiki as the live design document or put
> pointers to the current drafts? I'm hoping eventually the page will just
> point to the design in the doc directory at git.qemu.org.
> 
>>> One thing that Peter has been asking for is a design document for the
>>> way we are going to approach multi-threaded TCG emulation. I started
>>> with the information that was captured on the wiki and tried to build on
>>> that. It's almost certainly incomplete but I thought it would be worth
>>> posting for wider discussion early rather than later.
>>> 
>>> One obvious omission at the moment is the lack of discussion about other
>>> non-TLB shared data structures in QEMU (I'm thinking of the various
>>> dirty page tracking bits, I'm sure there is more).
>>> 
>>> I've also deliberately tried to avoid documenting the design decisions
>>> made in the current Greensoc's patch series. This is so we can
>>> concentrate on the big picture before getting side-tracked into the
>>> implementation details.
>>> 
>>> I have now started digging into the Greensocs code in earnest and the
>>> plan is eventually the design and the implementation will converge on a
>>> final documented complete solution ;-)
>>> 
>>> Anyway as ever I look forward to the comments and discussion:
>>> 
>>> STATUS: DRAFTING
>>> 
>>> Introduction
>>> ============
>>> 
>>> This document outlines the design for multi-threaded TCG emulation.
>>> The original TCG implementation was single threaded and dealt with
>>> multiple CPUs by with simple round-robin scheduling. This simplified a
>>> lot of things but became increasingly limited as systems being
>>> emulated gained additional cores and per-core performance gains for host
>>> systems started to level off.
>>> 
>>> Memory Consistency
>>> ==================
>>> 
>>> Between emulated guests and host systems there are a range of memory
>>> consistency models. While emulating weakly ordered systems on strongly
>>> ordered hosts shouldn't cause any problems the same is not true for
>>> the reverse setup.
>>> 
>>> The proposed design currently does not address the problem of
>>> emulating strong ordering on a weakly ordered host although even on
>>> strongly ordered systems software should be using synchronisation
>>> primitives to ensure correct operation.
>>> 
>>> Memory Barriers
>>> ---------------
>>> 
>>> Barriers (sometimes known as fences) provide a mechanism for software
>>> to enforce a particular ordering of memory operations from the point
>>> of view of external observers (e.g. another processor core). They can
>>> apply to any memory operations as well as just loads or stores.
>>> 
>>> The Linux kernel has an excellent write-up on the various forms of
>>> memory barrier and the guarantees they can provide [1].
>>> 
>>> Barriers are often wrapped around synchronisation primitives to
>>> provide explicit memory ordering semantics. However they can be used
>>> by themselves to provide safe lockless access by ensuring for example
>>> a signal flag will always be set after a payload.
>>> 
>>> DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
>>> 
>>> This would enforce a strong load/store ordering so all loads/stores
>>> complete at the memory barrier. On single-core non-SMP strongly
>>> ordered backends this could become a NOP.
>>> 
>>> There may be a case for further refinement if this causes performance
>>> bottlenecks.
>>> 
>>> Memory Control and Maintenance
>>> ------------------------------
>>> 
>>> This includes a class of instructions for controlling system cache
>>> behaviour. While QEMU doesn't model cache behaviour these instructions
>>> are often seen when code modification has taken place to ensure the
>>> changes take effect.
>>> 
>>> Synchronisation Primitives
>>> --------------------------
>>> 
>>> There are two broad types of synchronisation primitives found in
>>> modern ISAs: atomic instructions and exclusive regions.
>>> 
>>> The first type offer a simple atomic instruction which will guarantee
>>> some sort of test and conditional store will be truly atomic w.r.t.
>>> other cores sharing access to the memory. The classic example is the
>>> x86 cmpxchg instruction.
>>> 
>>> The second type offer a pair of load/store instructions which offer a
>>> guarantee that a region of memory has not been touched between the
>>> load and store instructions. An example of this is ARM's ldrex/strex
>>> pair where the strex instruction will return a flag indicating a
>>> successful store only if no other CPU has accessed the memory region
>>> since the ldrex.
>>> 
>>> Traditionally TCG has generated a series of operations that work
>>> because they are within the context of a single translation block so
>>> will have completed before another CPU is scheduled. However with
>>> the ability to have multiple threads running to emulate multiple CPUs
>>> we will need to explicitly expose these semantics.
>>> 
>>> DESIGN REQUIREMENTS:
>>>  - atomics
>>>    - Introduce some atomic TCG ops for the common semantics
>>>    - The default fallback helper function will use qemu_atomics
>>>    - Each backend can then add a more efficient implementation
>>>  - load/store exclusive
>>>    [AJB:
>>>         There are currently a number of proposals of interest:
>>>      - Greensocs tweaks to ldst ex (using locks)
>>>      - Slow-path for atomic instruction translation [2]
>>>      - Helper-based Atomic Instruction Emulation (AIE) [3]
>>>     ]
>>> 
>>> 
>>> Shared Data Structures
>>> ======================
>>> 
>>> Global TCG State
>>> ----------------
>>> 
>>> We need to protect the entire code generation cycle including any post
>>> generation patching of the translated code. This also implies a shared
>>> translation buffer which contains code running on all cores. Any
>>> execution path that comes to the main run loop will need to hold a
>>> mutex for code generation. This also includes times when we need to flush
>>> code or jumps from the tb_cache.
>>> 
>>> DESIGN REQUIREMENT: Add locking around all code generation, patching
>>> and jump cache modification
>> Actually from my point of view jump cache modification requires more than a
>> lock as other VCPU threads can be executing code during the modification.
>> 
>> Fortunately this happens "only" with tlb_flush, tlb_page_flush, tb_flush and
>> tb_invalidate which need all CPUs to be halted anyway.
> 
> How about:
> 
> DESIGN REQUIREMENT:
>       - Code generation and patching will be protected by a lock
>       - Jump cache modification will assert all CPUs are halted
> 
>>> 
>>> Memory maps and TLBs
>>> --------------------
>>> 
>>> The memory handling code is fairly critical to the speed of memory
>>> access in the emulated system.
>>> 
>>>   - Memory regions (dividing up access to PIO, MMIO and RAM)
>>>   - Dirty page tracking (for code gen, migration and display)
>>>   - Virtual TLB (for translating guest address->real address)
>>> 
>>> There is both a fast path walked by the generated code and a slow
>>> path when resolution is required. When the TLB tables are updated we
>>> need to ensure they are done in a safe way by bringing all executing
>>> threads to a halt before making the modifications.
>>> 
>>> DESIGN REQUIREMENTS:
>>> 
>>>   - TLB Flush All/Page
>>>     - can be across-CPUs
>>>     - will need all other CPUs brought to a halt
>>>   - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
>>>     - This is a per-CPU table - by definition can't race
>>>     - updated by its own thread when the slow-path is forced
>> Actually as we have approximately the same behaviour for all of these memory
>> handling operations, e.g. tb_flush, tb_*_invalidate and tlb_*_flush, which
>> all play with the TranslationBlock and the jump cache across CPUs, I think
>> we have to add a generic "exit and do something" mechanism for the CPU
>> threads. So every VCPU thread has a list of things to do when it exits
>> (such as clearing its own tb_jmp_cache during a tlb_flush, or waiting for
>> the other CPUs and flushing only one entry for tb_invalidate).
> 
> Sounds like I should write an additional section to describe the process
> of halting CPUs and carrying out deferred per-CPU actions as well as
> ensuring we can tell when they are all halted.
> 
>>> Emulated hardware state
>>> -----------------------
>>> 
>>> Currently the hardware emulation has no protection against
>>> multiple-accesses. However guest systems accessing emulated hardware
>>> should be carrying out their own locking to prevent multiple CPUs
>>> confusing the hardware. Of course there is no guarantee that there
>>> couldn't be a broken guest that doesn't lock so you could get racing
>>> accesses to the hardware.
>>> 
>>> There is the class of paravirtualized hardware (VIRTIO) that works in
>>> a purely mmio mode. Often setting flags directly in guest memory as a
>>> result of a guest triggered transaction.
>>> 
>>> DESIGN REQUIREMENTS:
>>> 
>>>   - Access to IO Memory should be serialised by an IOMem mutex
>>>   - The mutex should be recursive (e.g. allowing pid to relock itself)
>> That might be done with the global mutex as it is today?
>> We need changes here anyway to have VCPU threads running in parallel.
> 
> I'm not sure re-using the global mutex is a good idea. I've had to hack
> the global mutex to allow recursive locking to get around the virtio
> hang I discovered last week. While it works I'm uneasy making such a
> radical change upstream given how widely the global mutex is used hence
> the suggestion to have an explicit IOMem mutex.
> 
> Actually I'm surprised the iothread mutex just re-uses the global one.
> I guess I need to talk to the IO guys as to why they took that
> decision.
> 
>> 
>> Thanks,
> 
> Thanks for your quick review :-)
> 
>> Fred
>> 
>>> IO Subsystem
>>> ------------
>>> 
>>> The I/O subsystem is heavily used by KVM and has seen a lot of
>>> improvements to offload I/O tasks to dedicated IOThreads. There should
>>> be no additional locking required once we reach the Block Driver.
>>> 
>>> DESIGN REQUIREMENTS:
>>> 
>>>   - The dataplane should continue to be protected by the iothread locks
>>> 
>>> 
>>> References
>>> ==========
>>> 
>>> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
>>> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
>>> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297
>>> 
>>> 
>>> 
> 
> -- 
> Alex Bennée


	 +44 (0)20 7100 3485 x 210
 +33 (0)5 33 52 01 77x 210

	+33 (0)603762104
	mark.burton

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] RFC Multi-threaded TCG design document
  2015-06-15 10:51     ` Mark Burton
@ 2015-06-15 12:36       ` Alex Bennée
  2015-06-15 14:25       ` Alex Bennée
  1 sibling, 0 replies; 14+ messages in thread
From: Alex Bennée @ 2015-06-15 12:36 UTC (permalink / raw)
  To: Mark Burton
  Cc: mttcg, Peter Maydell, Alexander Graf, QEMU Developers,
	Guillaume Delbergue, pbonzini, KONRAD Frédéric


Mark Burton <mark.burton@greensocs.com> writes:

> I think we SHOULD use the wiki - and keep it current. A lot of what you have is in the wiki too, but I’d like to see the wiki updated.
> We will add our stuff there too…

I'll do a pass today and update it to point to lists, discussions and
WIP trees.

>
> Cheers
> Mark.
>
>
>
>> On 15 Jun 2015, at 12:06, Alex Bennée <alex.bennee@linaro.org> wrote:
>> 
>> 
>> Frederic Konrad <fred.konrad@greensocs.com> writes:
>> 
>>> On 12/06/2015 18:37, Alex Bennée wrote:
>>>> Hi,
>>> 
>>> Hi Alex,
>>> 
>>> I've completed some of the points below. We will also work on a design 
>>> decisions
>>> document to add to this one.
>>> 
>>> We probably want to merge that with what we did on the wiki?
>>> http://wiki.qemu.org/Features/tcg-multithread
>> 
>> Well hopefully there is cross-over as I started with the wiki as a basis
>> ;-)
>> 
>> Do we want to just keep the wiki as the live design document or put
>> pointers to the current drafts? I'm hoping eventually the page will just
>> point to the design in the doc directory at git.qemu.org.
>> 
>>>> One thing that Peter has been asking for is a design document for the
>>>> way we are going to approach multi-threaded TCG emulation. I started
>>>> with the information that was captured on the wiki and tried to build on
>>>> that. It's almost certainly incomplete but I thought it would be worth
>>>> posting for wider discussion early rather than later.
>>>> 
>>>> One obvious omission at the moment is the lack of discussion about other
>>>> non-TLB shared data structures in QEMU (I'm thinking of the various
>>>> dirty page tracking bits, I'm sure there is more).
>>>> 
>>>> I've also deliberately tried to avoid documenting the design decisions
>>>> made in the current Greensoc's patch series. This is so we can
>>>> concentrate on the big picture before getting side-tracked into the
>>>> implementation details.
>>>> 
>>>> I have now started digging into the Greensocs code in earnest and the
>>>> plan is eventually the design and the implementation will converge on a
>>>> final documented complete solution ;-)
>>>> 
>>>> Anyway as ever I look forward to the comments and discussion:
>>>> 
>>>> STATUS: DRAFTING
>>>> 
>>>> Introduction
>>>> ============
>>>> 
>>>> This document outlines the design for multi-threaded TCG emulation.
>>>> The original TCG implementation was single threaded and dealt with
>>>> multiple CPUs with simple round-robin scheduling. This simplified a
>>>> lot of things but became increasingly limited as systems being
>>>> emulated gained additional cores and per-core performance gains for host
>>>> systems started to level off.
>>>> 
>>>> Memory Consistency
>>>> ==================
>>>> 
>>>> Between emulated guests and host systems there are a range of memory
>>>> consistency models. While emulating weakly ordered systems on strongly
>>>> ordered hosts shouldn't cause any problems the same is not true for
>>>> the reverse setup.
>>>> 
>>>> The proposed design currently does not address the problem of
>>>> emulating strong ordering on a weakly ordered host although even on
>>>> strongly ordered systems software should be using synchronisation
>>>> primitives to ensure correct operation.
>>>> 
>>>> Memory Barriers
>>>> ---------------
>>>> 
>>>> Barriers (sometimes known as fences) provide a mechanism for software
>>>> to enforce a particular ordering of memory operations from the point
>>>> of view of external observers (e.g. another processor core). They can
>>>> apply to any memory operations as well as just loads or stores.
>>>> 
>>>> The Linux kernel has an excellent write-up on the various forms of
>>>> memory barrier and the guarantees they can provide [1].
>>>> 
>>>> Barriers are often wrapped around synchronisation primitives to
>>>> provide explicit memory ordering semantics. However they can be used
>>>> by themselves to provide safe lockless access by ensuring for example
>>>> a signal flag will always be set after a payload.
>>>> 
>>>> DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
>>>> 
>>>> This would enforce a strong load/store ordering so all loads/stores
>>>> complete at the memory barrier. On single-core non-SMP strongly
>>>> ordered backends this could become a NOP.
>>>> 
>>>> There may be a case for further refinement if this causes performance
>>>> bottlenecks.
>>>> 
>>>> Memory Control and Maintenance
>>>> ------------------------------
>>>> 
>>>> This includes a class of instructions for controlling system cache
>>>> behaviour. While QEMU doesn't model cache behaviour these instructions
>>>> are often seen when code modification has taken place to ensure the
>>>> changes take effect.
>>>> 
>>>> Synchronisation Primitives
>>>> --------------------------
>>>> 
>>>> There are two broad types of synchronisation primitives found in
>>>> modern ISAs: atomic instructions and exclusive regions.
>>>> 
>>>> The first type offer a simple atomic instruction which will guarantee
>>>> some sort of test and conditional store will be truly atomic w.r.t.
>>>> other cores sharing access to the memory. The classic example is the
>>>> x86 cmpxchg instruction.
>>>> 
>>>> The second type offer a pair of load/store instructions which offer a
>>>> guarantee that a region of memory has not been touched between the
>>>> load and store instructions. An example of this is ARM's ldrex/strex
>>>> pair where the strex instruction will return a flag indicating a
>>>> successful store only if no other CPU has accessed the memory region
>>>> since the ldrex.
>>>> 
>>>> Traditionally TCG has generated a series of operations that work
>>>> because they are within the context of a single translation block so
>>>> will have completed before another CPU is scheduled. However with
>>>> the ability to have multiple threads running to emulate multiple CPUs
>>>> we will need to explicitly expose these semantics.
>>>> 
>>>> DESIGN REQUIREMENTS:
>>>>  - atomics
>>>>    - Introduce some atomic TCG ops for the common semantics
>>>>    - The default fallback helper function will use qemu_atomics
>>>>    - Each backend can then add a more efficient implementation
>>>>  - load/store exclusive
>>>>    [AJB:
>>>>         There are currently a number of proposals of interest:
>>>>      - Greensocs tweaks to ldst ex (using locks)
>>>>      - Slow-path for atomic instruction translation [2]
>>>>      - Helper-based Atomic Instruction Emulation (AIE) [3]
>>>>     ]
>>>> 
>>>> 
>>>> Shared Data Structures
>>>> ======================
>>>> 
>>>> Global TCG State
>>>> ----------------
>>>> 
>>>> We need to protect the entire code generation cycle including any post
>>>> generation patching of the translated code. This also implies a shared
>>>> translation buffer which contains code running on all cores. Any
>>>> execution path that comes to the main run loop will need to hold a
>>>> mutex for code generation. This also includes times when we need to flush
>>>> code or jumps from the tb_cache.
>>>> 
>>>> DESIGN REQUIREMENT: Add locking around all code generation, patching
>>>> and jump cache modification
>>> Actually from my point of view jump cache modification requires more than a
>>> lock as other VCPU threads can be executing code during the modification.
>>> 
>>> Fortunately this happens "only" with tlb_flush, tlb_page_flush, tb_flush and
>>> tb_invalidate which need all CPUs to be halted anyway.
>> 
>> How about:
>> 
>> DESIGN REQUIREMENT:
>>       - Code generation and patching will be protected by a lock
>>       - Jump cache modification will assert all CPUs are halted
>> 
>>>> 
>>>> Memory maps and TLBs
>>>> --------------------
>>>> 
>>>> The memory handling code is fairly critical to the speed of memory
>>>> access in the emulated system.
>>>> 
>>>>   - Memory regions (dividing up access to PIO, MMIO and RAM)
>>>>   - Dirty page tracking (for code gen, migration and display)
>>>>   - Virtual TLB (for translating guest address->real address)
>>>> 
>>>> There is both a fast path walked by the generated code and a slow
>>>> path when resolution is required. When the TLB tables are updated we
>>>> need to ensure they are done in a safe way by bringing all executing
>>>> threads to a halt before making the modifications.
>>>> 
>>>> DESIGN REQUIREMENTS:
>>>> 
>>>>   - TLB Flush All/Page
>>>>     - can be across-CPUs
>>>>     - will need all other CPUs brought to a halt
>>>>   - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
>>>>     - This is a per-CPU table - by definition can't race
>>>>     - updated by its own thread when the slow-path is forced
>>> Actually as we have approximately the same behaviour for all of these memory
>>> handling operations, e.g. tb_flush, tb_*_invalidate and tlb_*_flush, which
>>> all play with the TranslationBlock and the jump cache across CPUs, I think
>>> we have to add a generic "exit and do something" mechanism for the CPU
>>> threads. So every VCPU thread has a list of things to do when it exits
>>> (such as clearing its own tb_jmp_cache during a tlb_flush, or waiting for
>>> the other CPUs and flushing only one entry for tb_invalidate).
>> 
>> Sounds like I should write an additional section to describe the process
>> of halting CPUs and carrying out deferred per-CPU actions as well as
>> ensuring we can tell when they are all halted.
>> 
>>>> Emulated hardware state
>>>> -----------------------
>>>> 
>>>> Currently the hardware emulation has no protection against
>>>> multiple-accesses. However guest systems accessing emulated hardware
>>>> should be carrying out their own locking to prevent multiple CPUs
>>>> confusing the hardware. Of course there is no guarantee that there
>>>> couldn't be a broken guest that doesn't lock so you could get racing
>>>> accesses to the hardware.
>>>> 
>>>> There is the class of paravirtualized hardware (VIRTIO) that works in
>>>> a purely mmio mode. Often setting flags directly in guest memory as a
>>>> result of a guest triggered transaction.
>>>> 
>>>> DESIGN REQUIREMENTS:
>>>> 
>>>>   - Access to IO Memory should be serialised by an IOMem mutex
>>>>   - The mutex should be recursive (e.g. allowing pid to relock itself)
>>> That might be done with the global mutex as it is today?
>>> We need changes here anyway to have VCPU threads running in parallel.
>> 
>> I'm not sure re-using the global mutex is a good idea. I've had to hack
>> the global mutex to allow recursive locking to get around the virtio
>> hang I discovered last week. While it works I'm uneasy making such a
>> radical change upstream given how widely the global mutex is used hence
>> the suggestion to have an explicit IOMem mutex.
>> 
>> Actually I'm surprised the iothread mutex just re-uses the global one.
>> I guess I need to talk to the IO guys as to why they took that
>> decision.
>> 
>>> 
>>> Thanks,
>> 
>> Thanks for your quick review :-)
>> 
>>> Fred
>>> 
>>>> IO Subsystem
>>>> ------------
>>>> 
>>>> The I/O subsystem is heavily used by KVM and has seen a lot of
>>>> improvements to offload I/O tasks to dedicated IOThreads. There should
>>>> be no additional locking required once we reach the Block Driver.
>>>> 
>>>> DESIGN REQUIREMENTS:
>>>> 
>>>>   - The dataplane should continue to be protected by the iothread locks
>>>> 
>>>> 
>>>> References
>>>> ==========
>>>> 
>>>> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
>>>> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
>>>> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297
>>>> 
>>>> 
>>>> 
>> 
>> -- 
>> Alex Bennée
>
>
> 	 +44 (0)20 7100 3485 x 210
>  +33 (0)5 33 52 01 77x 210
>
> 	+33 (0)603762104
> 	mark.burton

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] RFC Multi-threaded TCG design document
  2015-06-12 16:37 [Qemu-devel] RFC Multi-threaded TCG design document Alex Bennée
  2015-06-15  9:13 ` Frederic Konrad
@ 2015-06-15 13:06 ` alvise rigo
  2015-06-15 14:25   ` Alex Bennée
  2015-06-17 11:58 ` Paolo Bonzini
  2015-06-17 16:57 ` Dr. David Alan Gilbert
  3 siblings, 1 reply; 14+ messages in thread
From: alvise rigo @ 2015-06-15 13:06 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, Peter Maydell, Mark Burton, QEMU Developers,
	Alexander Graf, guillaume.delbergue, Paolo Bonzini,
	KONRAD Frédéric

Hi Alex,

Let me just add one comment.

On Fri, Jun 12, 2015 at 6:37 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
> Hi,
>
> One thing that Peter has been asking for is a design document for the
> way we are going to approach multi-threaded TCG emulation. I started
> with the information that was captured on the wiki and tried to build on
> that. It's almost certainly incomplete but I thought it would be worth
> posting for wider discussion early rather than later.
>
> One obvious omission at the moment is the lack of discussion about other
> non-TLB shared data structures in QEMU (I'm thinking of the various
> dirty page tracking bits, I'm sure there is more).
>
> I've also deliberately tried to avoid documenting the design decisions
> made in the current Greensoc's patch series. This is so we can
> concentrate on the big picture before getting side-tracked into the
> implementation details.
>
> I have now started digging into the Greensocs code in earnest and the
> plan is eventually the design and the implementation will converge on a
> final documented complete solution ;-)
>
> Anyway as ever I look forward to the comments and discussion:
>
> STATUS: DRAFTING
>
> Introduction
> ============
>
> This document outlines the design for multi-threaded TCG emulation.
> The original TCG implementation was single threaded and dealt with
> multiple CPUs with simple round-robin scheduling. This simplified a
> lot of things but became increasingly limited as systems being
> emulated gained additional cores and per-core performance gains for host
> systems started to level off.
>
> Memory Consistency
> ==================
>
> Between emulated guests and host systems there are a range of memory
> consistency models. While emulating weakly ordered systems on strongly
> ordered hosts shouldn't cause any problems the same is not true for
> the reverse setup.
>
> The proposed design currently does not address the problem of
> emulating strong ordering on a weakly ordered host although even on
> strongly ordered systems software should be using synchronisation
> primitives to ensure correct operation.
>
> Memory Barriers
> ---------------
>
> Barriers (sometimes known as fences) provide a mechanism for software
> to enforce a particular ordering of memory operations from the point
> of view of external observers (e.g. another processor core). They can
> apply to any memory operations as well as just loads or stores.
>
> The Linux kernel has an excellent write-up on the various forms of
> memory barrier and the guarantees they can provide [1].
>
> Barriers are often wrapped around synchronisation primitives to
> provide explicit memory ordering semantics. However they can be used
> by themselves to provide safe lockless access by ensuring for example
> a signal flag will always be set after a payload.
>
> DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
>
> This would enforce a strong load/store ordering so all loads/stores
> complete at the memory barrier. On single-core non-SMP strongly
> ordered backends this could become a NOP.

I believe the main problem here is not just about translating guest
barriers to host barriers, but also about adding barriers in the TCG
generated code where they are needed i.e. when, in the guest code, the
synchronization/memory barriers don't wrap atomic instructions.

To give a concrete example, let's suppose a case where we emulate an x86
guest on ARM (on ARMv8 the situation should not be so complicated).  At
some point TCG will be asked to translate a Linux spin_lock(), that
eventually uses arch_spin_lock(). Simplifying a bit, what happens is
along the lines of:

- barrier() // meaning a compiler barrier
  - atomic update of the (spin)lock value
- barrier()

The architecture dependent part is of course the "atomic update of the
spinlock" implementation, which, on ARM, relies on ldrex/strex
instructions and eventually issues a full hardware memory barrier (dmb).
On the other hand, on x86, only the cmpxchg instruction is used
(coupled with a memory compiler clobber), but no hardware full memory
barrier is required because of a stronger memory model.  I'm pretty sure
that the TCG code generated from spin_lock() will not be the same as the
one present in an ARM kernel binary compiled with the latest
GCC, but still, that full memory barrier is likely to be required also
in the TCG generated code.
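
As a rough illustration of the point above (a sketch under assumptions,
not code from any patch series): if the guest's locked cmpxchg is
modelled with a sequentially consistent C11 compare-and-swap, the host
compiler is left to supply whatever barrier the host needs. The names
emulate_x86_cmpxchg32, guest_spin_trylock and guest_spin_unlock are
hypothetical and are not QEMU APIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not a QEMU API: emulate a guest x86
 * "lock cmpxchg" on a 32-bit location. Returns the value that was in
 * memory before the operation, which is what the guest would see in
 * EAX afterwards. */
static uint32_t emulate_x86_cmpxchg32(_Atomic uint32_t *addr,
                                      uint32_t expected, uint32_t desired)
{
    /* seq_cst on success and failure treats the guest instruction as
     * fully ordered. On a weakly ordered host this is where the extra
     * barrier comes from. On failure, 'expected' is overwritten with
     * the current memory value; on success it already holds it. */
    atomic_compare_exchange_strong_explicit(addr, &expected, desired,
                                            memory_order_seq_cst,
                                            memory_order_seq_cst);
    return expected;
}

/* What a guest spinlock acquire boils down to: swap 0 -> 1 and check
 * that the old value was 0. */
static bool guest_spin_trylock(_Atomic uint32_t *lock)
{
    return emulate_x86_cmpxchg32(lock, 0, 1) == 0;
}

/* The guest unlock is a plain store on x86; the assumption here is
 * that it needs at least release ordering when run on a weak host. */
static void guest_spin_unlock(_Atomic uint32_t *lock)
{
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

This only shows the conservative option of ordering every guest atomic
fully; whether that cost is acceptable is the open question.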

Now the question could be: looking at the bare flow of asm x86
instructions used to implement spin_lock(), how can we deduce that a dmb
instruction has to be added after the atomic instructions?  Should we
pair every guest atomic instruction with a dmb?

Regards,
alvise

>
> There may be a case for further refinement if this causes performance
> bottlenecks.
>
> Memory Control and Maintenance
> ------------------------------
>
> This includes a class of instructions for controlling system cache
> behaviour. While QEMU doesn't model cache behaviour these instructions
> are often seen when code modification has taken place to ensure the
> changes take effect.
>
> Synchronisation Primitives
> --------------------------
>
> There are two broad types of synchronisation primitives found in
> modern ISAs: atomic instructions and exclusive regions.
>
> The first type offer a simple atomic instruction which will guarantee
> some sort of test and conditional store will be truly atomic w.r.t.
> other cores sharing access to the memory. The classic example is the
> x86 cmpxchg instruction.
>
> The second type offer a pair of load/store instructions which offer a
> guarantee that a region of memory has not been touched between the
> load and store instructions. An example of this is ARM's ldrex/strex
> pair where the strex instruction will return a flag indicating a
> successful store only if no other CPU has accessed the memory region
> since the ldrex.
>
> Traditionally TCG has generated a series of operations that work
> because they are within the context of a single translation block so
> will have completed before another CPU is scheduled. However with
> the ability to have multiple threads running to emulate multiple CPUs
> we will need to explicitly expose these semantics.
>
> DESIGN REQUIREMENTS:
>  - atomics
>    - Introduce some atomic TCG ops for the common semantics
>    - The default fallback helper function will use qemu_atomics
>    - Each backend can then add a more efficient implementation
>  - load/store exclusive
>    [AJB:
>         There are currently a number of proposals of interest:
>      - Greensocs tweaks to ldst ex (using locks)
>      - Slow-path for atomic instruction translation [2]
>      - Helper-based Atomic Instruction Emulation (AIE) [3]
>     ]
>
>
> Shared Data Structures
> ======================
>
> Global TCG State
> ----------------
>
> We need to protect the entire code generation cycle including any post
> generation patching of the translated code. This also implies a shared
> translation buffer which contains code running on all cores. Any
> execution path that comes to the main run loop will need to hold a
> mutex for code generation. This also includes times when we need to flush
> code or jumps from the tb_cache.
>
> DESIGN REQUIREMENT: Add locking around all code generation, patching
> and jump cache modification
>
> Memory maps and TLBs
> --------------------
>
> The memory handling code is fairly critical to the speed of memory
> access in the emulated system.
>
>   - Memory regions (dividing up access to PIO, MMIO and RAM)
>   - Dirty page tracking (for code gen, migration and display)
>   - Virtual TLB (for translating guest address->real address)
>
> There is both a fast path walked by the generated code and a slow
> path when resolution is required. When the TLB tables are updated we
> need to ensure they are done in a safe way by bringing all executing
> threads to a halt before making the modifications.
>
> DESIGN REQUIREMENTS:
>
>   - TLB Flush All/Page
>     - can be across-CPUs
>     - will need all other CPUs brought to a halt
>   - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
>     - This is a per-CPU table - by definition can't race
>     - updated by its own thread when the slow-path is forced
>
> Emulated hardware state
> -----------------------
>
> Currently the hardware emulation has no protection against
> multiple-accesses. However guest systems accessing emulated hardware
> should be carrying out their own locking to prevent multiple CPUs
> confusing the hardware. Of course there is no guarantee that there
> couldn't be a broken guest that doesn't lock so you could get racing
> accesses to the hardware.
>
> There is the class of paravirtualized hardware (VIRTIO) that works in
> a purely mmio mode. Often setting flags directly in guest memory as a
> result of a guest triggered transaction.
>
> DESIGN REQUIREMENTS:
>
>   - Access to IO Memory should be serialised by an IOMem mutex
>   - The mutex should be recursive (e.g. allowing pid to relock itself)
>
> IO Subsystem
> ------------
>
> The I/O subsystem is heavily used by KVM and has seen a lot of
> improvements to offload I/O tasks to dedicated IOThreads. There should
> be no additional locking required once we reach the Block Driver.
>
> DESIGN REQUIREMENTS:
>
>   - The dataplane should continue to be protected by the iothread locks
>
>
> References
> ==========
>
> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297
>
>
>
> --
> Alex Bennée

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] RFC Multi-threaded TCG design document
  2015-06-15 13:06 ` alvise rigo
@ 2015-06-15 14:25   ` Alex Bennée
  0 siblings, 0 replies; 14+ messages in thread
From: Alex Bennée @ 2015-06-15 14:25 UTC (permalink / raw)
  To: alvise rigo
  Cc: mttcg, Peter Maydell, Mark Burton, QEMU Developers,
	Alexander Graf, guillaume.delbergue, Paolo Bonzini,
	KONRAD Frédéric


alvise rigo <a.rigo@virtualopensystems.com> writes:

> Hi Alex,
>
> Let me just add one comment.
>
<snip>
>>
>> Memory Barriers
>> ---------------
>>
>> Barriers (sometimes known as fences) provide a mechanism for software
>> to enforce a particular ordering of memory operations from the point
>> of view of external observers (e.g. another processor core). They can
>> apply to any memory operations as well as just loads or stores.
>>
>> The Linux kernel has an excellent write-up on the various forms of
>> memory barrier and the guarantees they can provide [1].
>>
>> Barriers are often wrapped around synchronisation primitives to
>> provide explicit memory ordering semantics. However they can be used
>> by themselves to provide safe lockless access by ensuring for example
>> a signal flag will always be set after a payload.
>>
>> DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
>>
>> This would enforce a strong load/store ordering so all loads/stores
>> complete at the memory barrier. On single-core non-SMP strongly
>> ordered backends this could become a NOP.
>
> I believe the main problem here is not just about translating guest
> barriers to host barriers, but also about adding barriers in the TCG
> generated code where they are needed i.e. when, in the guest code, the
> synchronization/memory barriers don't wrap atomic instructions.

Not all atomic instructions imply memory barriers. AIUI on ARMv8 you
only have explicit memory barriers if you use the
load-acquire/store-release variants of load/store exclusive.

>
> To give a concrete example, let's suppose a case where we emulate an x86
> guest on ARM (on ARMv8 the situation should be not so complicated).  At
> some point TCG will be asked to translate a Linux spin_lock(), that
> eventually uses arch_spin_lock(). Simplifying a bit, what happens is
> along the lines of:
>
> - barrier() // meaning a compiler barrier
>   - atomic update of the (spin)lock value
> - barrier()
>
> The architecture dependent part is of course the "atomic update of the
> spinlock" implementation, which, on ARM, relies on ldrex/strex
> instructions and eventually issues a full hardware memory barrier (dmb).
> On the other hand, on x86, only the cmpxchg instructions is used
> (coupled with a memory compiler clobber), but no hardware full memory
> barrier is required because of a stronger memory model.  I'm pretty sure
> that the TCG code generated from spin_lock() will not be the same as the
> one present in an ARM kernel binary compiled with the latest
> GCC, but still, that full memory barrier is likely to be required also
> in the TCG generated code.
>
> Now the question could be: looking at the bare flow of asm x86
> instructions used to implement spin_lock(), how can we deduce that a dmb
> instruction has to be added after the atomic instructions?  Should we
> pair every guest atomic instruction with a dmb?

I don't think so. We should follow the guest processor's semantics, which
AIUI for x86 are that cmpxchg enforces memory ordering across cores when
it carries the LOCK prefix. At that point we can prefix the cmpxchg
TCG ops with our new tcg_dmb barrier.

Without the LOCK prefix we still guarantee an atomic update but without
any explicit synchronisation between the cores.

In practice Linux at least uses LOCK prefixed cmpxchg instructions in
its synchronisation code.

x86 code will still emit s/m/lfence instructions to ensure external
devices see memory accesses in the right order. These should certainly
cause memory barrier TCG ops to be emitted.
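
To make the LOCK cmpxchg case concrete, here is a rough sketch of what a
helper-based fallback could look like using C11 atomics. The helper names
are made up for illustration and are not existing QEMU functions; a real
implementation would presumably be a TCG op with backend support rather
than a C helper. The point is only that a sequentially consistent
compare-and-swap lets the host compiler pick whatever barriers the host
needs.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical helper: emulate a guest "LOCK cmpxchg" on a 32-bit value.
 * Returns the value observed in memory, which is what cmpxchg leaves in
 * the accumulator whether or not the store happened.
 */
static uint32_t helper_lock_cmpxchg32(_Atomic uint32_t *addr,
                                      uint32_t expected, uint32_t desired)
{
    /*
     * seq_cst ordering asks the host toolchain for full ordering, e.g.
     * acquire/release exclusives or dmb on ARM, a plain lock cmpxchg on x86.
     * On failure, "expected" is overwritten with the current memory value.
     */
    atomic_compare_exchange_strong_explicit(addr, &expected, desired,
                                            memory_order_seq_cst,
                                            memory_order_seq_cst);
    return expected;
}

/* Hypothetical helper: a guest mfence mapped to a full host fence. */
static void helper_mfence(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}
```

A guest spin_lock() would then boil down to repeated calls along the lines
of helper_lock_cmpxchg32(&lock, 0, 1) until 0 is returned, with no extra
barrier inserted by the translator.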

>
> Regards,
> alvise
>

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] RFC Multi-threaded TCG design document
  2015-06-15 10:51     ` Mark Burton
  2015-06-15 12:36       ` Alex Bennée
@ 2015-06-15 14:25       ` Alex Bennée
  1 sibling, 0 replies; 14+ messages in thread
From: Alex Bennée @ 2015-06-15 14:25 UTC (permalink / raw)
  To: Mark Burton
  Cc: mttcg, Peter Maydell, Alexander Graf, QEMU Developers,
	Guillaume Delbergue, pbonzini, KONRAD Frédéric


Mark Burton <mark.burton@greensocs.com> writes:

> I think we SHOULD use the wiki - and keep it current. A lot of what you have is in the wiki too, but I’d like to see the wiki updated.
> We will add our stuff there too…

Well I've added some pointers for now. Reading the wiki reminded me of
the informal phone conference. What are the details for that?

>
> Cheers
> Mark.
>
>
>
>> On 15 Jun 2015, at 12:06, Alex Bennée <alex.bennee@linaro.org> wrote:
>> 
>> 
>> Frederic Konrad <fred.konrad@greensocs.com> writes:
>> 
>>> On 12/06/2015 18:37, Alex Bennée wrote:
>>>> Hi,
>>> 
>>> Hi Alex,
>>> 
>>> I've completed some of the points below. We will also work on a design 
>>> decisions
>>> document to add to this one.
>>> 
>>> We probably want to merge that with what we did on the wiki?
>>> http://wiki.qemu.org/Features/tcg-multithread
>> 
>> Well hopefully there is cross-over as I started with the wiki as a basis
>> ;-)
>> 
>> Do we want to just keep the wiki as the live design document or put
>> pointers to the current drafts? I'm hoping eventually the page will just
>> point to the design in the doc directory at git.qemu.org.
>> 
>>>> One thing that Peter has been asking for is a design document for the
>>>> way we are going to approach multi-threaded TCG emulation. I started
>>>> with the information that was captured on the wiki and tried to build on
>>>> that. It's almost certainly incomplete but I thought it would be worth
>>>> posting for wider discussion early rather than later.
>>>> 
>>>> One obvious omission at the moment is the lack of discussion about other
>>>> non-TLB shared data structures in QEMU (I'm thinking of the various
>>>> dirty page tracking bits, I'm sure there is more).
>>>> 
>>>> I've also deliberately tried to avoid documenting the design decisions
>>>> made in the current Greensoc's patch series. This is so we can
>>>> concentrate on the big picture before getting side-tracked into the
>>>> implementation details.
>>>> 
>>>> I have now started digging into the Greensocs code in earnest and the
>>>> plan is eventually the design and the implementation will converge on a
>>>> final documented complete solution ;-)
>>>> 
>>>> Anyway as ever I look forward to the comments and discussion:
>>>> 
>>>> STATUS: DRAFTING
>>>> 
>>>> Introduction
>>>> ============
>>>> 
>>>> This document outlines the design for multi-threaded TCG emulation.
>>>> The original TCG implementation was single threaded and dealt with
>>>> multiple CPUs with simple round-robin scheduling. This simplified a
>>>> lot of things but became increasingly limited as systems being
>>>> emulated gained additional cores and per-core performance gains for host
>>>> systems started to level off.
>>>> 
>>>> Memory Consistency
>>>> ==================
>>>> 
>>>> Between emulated guests and host systems there are a range of memory
>>>> consistency models. While emulating weakly ordered systems on strongly
>>>> ordered hosts shouldn't cause any problems the same is not true for
>>>> the reverse setup.
>>>> 
>>>> The proposed design currently does not address the problem of
>>>> emulating strong ordering on a weakly ordered host although even on
>>>> strongly ordered systems software should be using synchronisation
>>>> primitives to ensure correct operation.
>>>> 
>>>> Memory Barriers
>>>> ---------------
>>>> 
>>>> Barriers (sometimes known as fences) provide a mechanism for software
>>>> to enforce a particular ordering of memory operations from the point
>>>> of view of external observers (e.g. another processor core). They can
>>>> apply to any memory operations as well as just loads or stores.
>>>> 
>>>> The Linux kernel has an excellent write-up on the various forms of
>>>> memory barrier and the guarantees they can provide [1].
>>>> 
>>>> Barriers are often wrapped around synchronisation primitives to
>>>> provide explicit memory ordering semantics. However they can be used
>>>> by themselves to provide safe lockless access by ensuring for example
>>>> a signal flag will always be set after a payload.
>>>> 
>>>> DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
>>>> 
>>>> This would enforce a strong load/store ordering so all loads/stores
>>>> complete at the memory barrier. On single-core non-SMP strongly
>>>> ordered backends this could become a NOP.
>>>> 
>>>> There may be a case for further refinement if this causes performance
>>>> bottlenecks.
>>>> 
>>>> Memory Control and Maintenance
>>>> ------------------------------
>>>> 
>>>> This includes a class of instructions for controlling system cache
>>>> behaviour. While QEMU doesn't model cache behaviour these instructions
>>>> are often seen when code modification has taken place to ensure the
>>>> changes take effect.
>>>> 
>>>> Synchronisation Primitives
>>>> --------------------------
>>>> 
>>>> There are two broad types of synchronisation primitives found in
>>>> modern ISAs: atomic instructions and exclusive regions.
>>>> 
>>>> The first type offer a simple atomic instruction which will guarantee
>>>> some sort of test and conditional store will be truly atomic w.r.t.
>>>> other cores sharing access to the memory. The classic example is the
>>>> x86 cmpxchg instruction.
>>>> 
>>>> The second type offer a pair of load/store instructions which offer a
>>>> guarantee that a region of memory has not been touched between the
>>>> load and store instructions. An example of this is ARM's ldrex/strex
>>>> pair where the strex instruction will return a flag indicating a
>>>> successful store only if no other CPU has accessed the memory region
>>>> since the ldrex.
>>>> 
>>>> Traditionally TCG has generated a series of operations that work
>>>> because they are within the context of a single translation block so
>>>> will have completed before another CPU is scheduled. However with
>>>> the ability to have multiple threads running to emulate multiple CPUs
>>>> we will need to explicitly expose these semantics.
>>>> 
>>>> DESIGN REQUIREMENTS:
>>>>  - atomics
>>>>    - Introduce some atomic TCG ops for the common semantics
>>>>    - The default fallback helper function will use qemu_atomics
>>>>    - Each backend can then add a more efficient implementation
>>>>  - load/store exclusive
>>>>    [AJB:
>>>>         There are currently a number of proposals of interest:
>>>>      - Greensocs tweaks to ldst ex (using locks)
>>>>      - Slow-path for atomic instruction translation [2]
>>>>      - Helper-based Atomic Instruction Emulation (AIE) [3]
>>>>     ]
>>>> 
>>>> 
>>>> Shared Data Structures
>>>> ======================
>>>> 
>>>> Global TCG State
>>>> ----------------
>>>> 
>>>> We need to protect the entire code generation cycle including any post
>>>> generation patching of the translated code. This also implies a shared
>>>> translation buffer which contains code running on all cores. Any
>>>> execution path that comes to the main run loop will need to hold a
>>>> mutex for code generation. This also includes times when we need to flush
>>>> code or jumps from the tb_cache.
>>>> 
>>>> DESIGN REQUIREMENT: Add locking around all code generation, patching
>>>> and jump cache modification
>>> Actually, from my point of view, jump cache modification requires more than a
>>> lock, as other VCPU threads can be executing code during the modification.
>>> 
>>> Fortunately this happens "only" with tlb_flush, tlb_page_flush, tb_flush and
>>> tb_invalidate, which need all CPUs to be halted anyway.
>> 
>> How about:
>> 
>> DESIGN REQUIREMENT:
>>       - Code generation and patching will be protected by a lock
>>       - Jump cache modification will assert all CPUs are halted
>> 
>>>> 
>>>> Memory maps and TLBs
>>>> --------------------
>>>> 
>>>> The memory handling code is fairly critical to the speed of memory
>>>> access in the emulated system.
>>>> 
>>>>   - Memory regions (dividing up access to PIO, MMIO and RAM)
>>>>   - Dirty page tracking (for code gen, migration and display)
>>>>   - Virtual TLB (for translating guest address->real address)
>>>> 
>>>> There is both a fast path walked by the generated code and a slow
>>>> path when resolution is required. When the TLB tables are updated we
>>>> need to ensure they are done in a safe way by bringing all executing
>>>> threads to a halt before making the modifications.
>>>> 
>>>> DESIGN REQUIREMENTS:
>>>> 
>>>>   - TLB Flush All/Page
>>>>     - can be across-CPUs
>>>>     - will need all other CPUs brought to a halt
>>>>   - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
>>>>     - This is a per-CPU table - by definition can't race
>>>>     - updated by its own thread when the slow-path is forced
>>> Actually, as we have approximately the same behaviour for all of these
>>> memory handling operations (e.g. tb_flush, tb_*_invalidate, tlb_*_flush),
>>> which are all playing with the TranslationBlock and the jump cache
>>> across CPUs, I think we have to add a generic "exit and do something"
>>> mechanism for the CPU threads. Every VCPU thread would then have a list of
>>> things to do when it exits (such as clearing its own tb_jmp_cache during a
>>> tlb_flush, or waiting for the other CPUs and flushing only one entry for
>>> tb_invalidate).
>> 
>> Sounds like I should write an additional section to describe the process
>> of halting CPUs and carrying out deferred per-CPU actions as well as
>> ensuring we can tell when they are all halted.
>> 
>>>> Emulated hardware state
>>>> -----------------------
>>>> 
>>>> Currently the hardware emulation has no protection against
>>>> multiple-accesses. However guest systems accessing emulated hardware
>>>> should be carrying out their own locking to prevent multiple CPUs
>>>> confusing the hardware. Of course there is no guarantee that there
>>>> couldn't be a broken guest that doesn't lock so you could get racing
>>>> accesses to the hardware.
>>>> 
>>>> There is the class of paravirtualized hardware (VIRTIO) that works in
>>>> a purely mmio mode. Often setting flags directly in guest memory as a
>>>> result of a guest triggered transaction.
>>>> 
>>>> DESIGN REQUIREMENTS:
>>>> 
>>>>   - Access to IO Memory should be serialised by an IOMem mutex
>>>>   - The mutex should be recursive (e.g. allowing pid to relock itself)
>>> That might be done with the global mutex as it is today?
>>> We need changes here anyway to have VCPU threads running in parallel.
>> 
>> I'm not sure re-using the global mutex is a good idea. I've had to hack
>> the global mutex to allow recursive locking to get around the virtio
>> hang I discovered last week. While it works I'm uneasy making such a
>> radical change upstream given how widely the global mutex is used hence
>> the suggestion to have an explicit IOMem mutex.
>> 
>> Actually I'm surprised the iothread mutex just re-uses the global one.
>> I guess I need to talk to the IO guys as to why they took that
>> decision.
>> 
>>> 
>>> Thanks,
>> 
>> Thanks for your quick review :-)
>> 
>>> Fred
>>> 
>>>> IO Subsystem
>>>> ------------
>>>> 
>>>> The I/O subsystem is heavily used by KVM and has seen a lot of
>>>> improvements to offload I/O tasks to dedicated IOThreads. There should
>>>> be no additional locking required once we reach the Block Driver.
>>>> 
>>>> DESIGN REQUIREMENTS:
>>>> 
>>>>   - The dataplane should continue to be protected by the iothread locks
>>>> 
>>>> 
>>>> References
>>>> ==========
>>>> 
>>>> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
>>>> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
>>>> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297
>>>> 
>>>> 
>>>> 
>> 
>> -- 
>> Alex Bennée

-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] RFC Multi-threaded TCG design document
  2015-06-12 16:37 [Qemu-devel] RFC Multi-threaded TCG design document Alex Bennée
  2015-06-15  9:13 ` Frederic Konrad
  2015-06-15 13:06 ` alvise rigo
@ 2015-06-17 11:58 ` Paolo Bonzini
  2015-06-17 15:57   ` Alex Bennée
  2015-06-17 16:57 ` Dr. David Alan Gilbert
  3 siblings, 1 reply; 14+ messages in thread
From: Paolo Bonzini @ 2015-06-17 11:58 UTC (permalink / raw)
  To: Alex Bennée, qemu-devel, mttcg
  Cc: peter.maydell, fred.konrad, mark.burton, agraf, guillaume.delbergue



On 12/06/2015 18:37, Alex Bennée wrote:
> Emulated hardware state
> -----------------------
> 
> Currently the hardware emulation has no protection against
> multiple-accesses. However guest systems accessing emulated hardware
> should be carrying out their own locking to prevent multiple CPUs
> confusing the hardware. Of course there is no guarantee that there
> couldn't be a broken guest that doesn't lock so you could get racing
> accesses to the hardware.
> 
> There is the class of paravirtualized hardware (VIRTIO) that works in
> a purely mmio mode. Often setting flags directly in guest memory as a
> result of a guest triggered transaction.
> 
> DESIGN REQUIREMENTS:
> 
>   - Access to IO Memory should be serialised by an IOMem mutex

This should simply be the "big QEMU lock", which also protects the I/O
subsystem.

With BQL-free TCG (a subset of multi-threaded TCG), the code in
qemu_mutex_lock_iothread that forces VCPUs to relinquish the mutex can
be dropped.

Paolo

>   - The mutex should be recursive (e.g. allowing pid to relock itself)
> 
> IO Subsystem
> ------------
> 
> The I/O subsystem is heavily used by KVM and has seen a lot of
> improvements to offload I/O tasks to dedicated IOThreads. There should
> be no additional locking required once we reach the Block Driver.
> 
> DESIGN REQUIREMENTS:
> 
>   - The dataplane should continue to be protected by the iothread locks
> 
> 
> References
> ==========
> 
> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297
> 
> 
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] RFC Multi-threaded TCG design document
  2015-06-17 11:58 ` Paolo Bonzini
@ 2015-06-17 15:57   ` Alex Bennée
  2015-06-17 16:13     ` Paolo Bonzini
  0 siblings, 1 reply; 14+ messages in thread
From: Alex Bennée @ 2015-06-17 15:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, peter.maydell, mark.burton, qemu-devel, agraf,
	guillaume.delbergue, fred.konrad


Paolo Bonzini <pbonzini@redhat.com> writes:

> On 12/06/2015 18:37, Alex Bennée wrote:
>> Emulated hardware state
>> -----------------------
>> 
>> Currently the hardware emulation has no protection against
>> multiple-accesses. However guest systems accessing emulated hardware
>> should be carrying out their own locking to prevent multiple CPUs
>> confusing the hardware. Of course there is no guarantee that there
>> couldn't be a broken guest that doesn't lock so you could get racing
>> accesses to the hardware.
>> 
>> There is the class of paravirtualized hardware (VIRTIO) that works in
>> a purely mmio mode. Often setting flags directly in guest memory as a
>> result of a guest triggered transaction.
>> 
>> DESIGN REQUIREMENTS:
>> 
>>   - Access to IO Memory should be serialised by an IOMem mutex
>
> This should simply be the "big QEMU lock", which also protects the I/O
> subsystem.
>
> With BQL-free TCG (a subset of multi-threaded TCG), the code in
> qemu_mutex_lock_iothread that forces VCPUs to relinquish the mutex can
> be dropped.
>
> Paolo
>
>>   - The mutex should be recursive (e.g. allowing pid to relock
>>   itself)

Paolo,

But would there be a risk if we make the BQL recursive?

I had to do this because the iomem accesses either side of a virt-io
transaction would deadlock otherwise. 


-- 
Alex Bennée

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] RFC Multi-threaded TCG design document
  2015-06-17 15:57   ` Alex Bennée
@ 2015-06-17 16:13     ` Paolo Bonzini
  0 siblings, 0 replies; 14+ messages in thread
From: Paolo Bonzini @ 2015-06-17 16:13 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, peter.maydell, mark.burton, qemu-devel, agraf,
	guillaume.delbergue, fred.konrad



On 17/06/2015 17:57, Alex Bennée wrote:
> But would there be a risk if we make the BQL recursive?
> 
> I had to do this because the iomem accesses either side of a virt-io
> transaction would deadlock otherwise. 

The idea was to check if the BQL is taken, and if not take it in
memory_region_dispatch_read/memory_region_dispatch_write.
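
Roughly, and purely as an illustration of the "take it only if we don't
already hold it" idea, something like the sketch below. All names here
(bql_lock, bql_held, dispatch_read, device_read) are hypothetical stand-ins
rather than the real QEMU functions, and the per-thread flag is an assumed
way of answering "is the BQL taken by me?". It avoids needing a recursive
mutex because a thread never tries to lock twice.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static _Thread_local bool bql_held;   /* does this thread own big_lock? */

static void bql_lock(void)
{
    pthread_mutex_lock(&big_lock);
    bql_held = true;
}

static void bql_unlock(void)
{
    assert(bql_held);                 /* only the owner may unlock */
    bql_held = false;
    pthread_mutex_unlock(&big_lock);
}

/* Stand-in for a device model's MMIO read callback. */
static uint64_t fake_device_reg = 0x42;
static uint64_t device_read(void)
{
    return fake_device_reg;
}

/* Stand-in for the dispatch path: lock only if the caller has not. */
static uint64_t dispatch_read(void)
{
    bool release = false;
    uint64_t val;

    if (!bql_held) {
        bql_lock();
        release = true;
    }
    val = device_read();
    if (release) {
        bql_unlock();
    }
    return val;
}
```

A VCPU thread that enters dispatch_read() without the lock takes and drops
it around the access; a path that already holds it (for instance the
device side of a virtio transaction) passes straight through.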

Paolo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] RFC Multi-threaded TCG design document
  2015-06-12 16:37 [Qemu-devel] RFC Multi-threaded TCG design document Alex Bennée
                   ` (2 preceding siblings ...)
  2015-06-17 11:58 ` Paolo Bonzini
@ 2015-06-17 16:57 ` Dr. David Alan Gilbert
  2015-06-17 18:23   ` Mark Burton
  3 siblings, 1 reply; 14+ messages in thread
From: Dr. David Alan Gilbert @ 2015-06-17 16:57 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, peter.maydell, mark.burton, qemu-devel, agraf,
	guillaume.delbergue, pbonzini, fred.konrad

* Alex Bennée (alex.bennee@linaro.org) wrote:
> Hi,

> Shared Data Structures
> ======================
> 
> Global TCG State
> ----------------
> 
> We need to protect the entire code generation cycle including any post
> generation patching of the translated code. This also implies a shared
> translation buffer which contains code running on all cores. Any
> execution path that comes to the main run loop will need to hold a
> mutex for code generation. This also includes times when we need to flush
> code or jumps from the tb_cache.
> 
> DESIGN REQUIREMENT: Add locking around all code generation, patching
> and jump cache modification

I don't think that you require a shared translation buffer between
cores to do this - although it *might* be the easiest way.
You could have a per-core translation buffer, the only requirement is
that most invalidation operations happen on all the buffers
(although that might depend on the emulated architecture).
With a per-core translation buffer, each core could generate new translations
without locking the other cores as long as no one is doing invalidations.

> Memory maps and TLBs
> --------------------
> 
> The memory handling code is fairly critical to the speed of memory
> access in the emulated system.
> 
>   - Memory regions (dividing up access to PIO, MMIO and RAM)
>   - Dirty page tracking (for code gen, migration and display)
>   - Virtual TLB (for translating guest address->real address)
> 
> There is both a fast path walked by the generated code and a slow
> path when resolution is required. When the TLB tables are updated we
> need to ensure they are done in a safe way by bringing all executing
> threads to a halt before making the modifications.
> 
> DESIGN REQUIREMENTS:
> 
>   - TLB Flush All/Page
>     - can be across-CPUs
>     - will need all other CPUs brought to a halt
>   - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
>     - This is a per-CPU table - by definition can't race
>     - updated by its own thread when the slow-path is forced
> 
> Emulated hardware state
> -----------------------
> 
> Currently the hardware emulation has no protection against
> multiple-accesses. However guest systems accessing emulated hardware
> should be carrying out their own locking to prevent multiple CPUs
> confusing the hardware. Of course there is no guarantee that there
> couldn't be a broken guest that doesn't lock so you could get racing
> accesses to the hardware.
> 
> There is the class of paravirtualized hardware (VIRTIO) that works in
> a purely mmio mode. Often setting flags directly in guest memory as a
> result of a guest triggered transaction.
> 
> DESIGN REQUIREMENTS:
> 
>   - Access to IO Memory should be serialised by an IOMem mutex
>   - The mutex should be recursive (e.g. allowing pid to relock itself)
> 
> IO Subsystem
> ------------
> 
> The I/O subsystem is heavily used by KVM and has seen a lot of
> improvements to offload I/O tasks to dedicated IOThreads. There should
> be no additional locking required once we reach the Block Driver.
> 
> DESIGN REQUIREMENTS:
> 
>   - The dataplane should continue to be protected by the iothread locks

Watch out for where DMA invalidates the translated code.

Dave

> 
> 
> References
> ==========
> 
> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297
> 
> 
> 
> -- 
> Alex Bennée
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] RFC Multi-threaded TCG design document
  2015-06-17 16:57 ` Dr. David Alan Gilbert
@ 2015-06-17 18:23   ` Mark Burton
  2015-06-17 21:45     ` Frederic Konrad
  0 siblings, 1 reply; 14+ messages in thread
From: Mark Burton @ 2015-06-17 18:23 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: mttcg, Peter Maydell, QEMU Developers, Alexander Graf,
	Guillaume Delbergue, Paolo Bonzini, Alex Bennée,
	KONRAD Frédéric


> On 17 Jun 2015, at 18:57, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> 
> * Alex Bennée (alex.bennee@linaro.org) wrote:
>> Hi,
> 
>> Shared Data Structures
>> ======================
>> 
>> Global TCG State
>> ----------------
>> 
>> We need to protect the entire code generation cycle including any post
>> generation patching of the translated code. This also implies a shared
>> translation buffer which contains code running on all cores. Any
>> execution path that comes to the main run loop will need to hold a
>> mutex for code generation. This also includes times when we need to flush
>> code or jumps from the tb_cache.
>> 
>> DESIGN REQUIREMENT: Add locking around all code generation, patching
>> and jump cache modification
> 
> I don't think that you require a shared translation buffer between
> cores to do this - although it *might* be the easiest way.
> You could have a per-core translation buffer, the only requirement is
> that most invalidation operations happen on all the buffers
> (although that might depend on the emulated architecture).
> With a per-core translation buffer, each core could generate new translations
> without locking the other cores as long as no one is doing invalidations.

I agree it’s not a design requirement - however we’ve kind of gone round this loop in terms of getting things to work.
Fred will doubtless fill in some details, but basically it looks like restructuring TCG so that several code generators can run in parallel is a nightmare. We seem to get reasonable performance with just one CPU at a time generating TBs. At the same time, of course, the way QEMU is constructed there are actually several ‘layers’ of buffer - from the CPU-local ones through to the TB ‘pool’. So, actually, by accident or design, we benefit from a sort of caching structure.


> 
>> Memory maps and TLBs
>> --------------------
>> 
>> The memory handling code is fairly critical to the speed of memory
>> access in the emulated system.
>> 
>>  - Memory regions (dividing up access to PIO, MMIO and RAM)
>>  - Dirty page tracking (for code gen, migration and display)
>>  - Virtual TLB (for translating guest address->real address)
>> 
>> There is both a fast path walked by the generated code and a slow
>> path when resolution is required. When the TLB tables are updated we
>> need to ensure they are done in a safe way by bringing all executing
>> threads to a halt before making the modifications.
>> 
>> DESIGN REQUIREMENTS:
>> 
>>  - TLB Flush All/Page
>>    - can be across-CPUs
>>    - will need all other CPUs brought to a halt
>>  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
>>    - This is a per-CPU table - by definition can't race
>>    - updated by its own thread when the slow-path is forced
>> 
>> Emulated hardware state
>> -----------------------
>> 
>> Currently the hardware emulation has no protection against
>> multiple-accesses. However guest systems accessing emulated hardware
>> should be carrying out their own locking to prevent multiple CPUs
>> confusing the hardware. Of course there is no guarantee that there
>> couldn't be a broken guest that doesn't lock so you could get racing
>> accesses to the hardware.
>> 
>> There is the class of paravirtualized hardware (VIRTIO) that works in
>> a purely mmio mode. Often setting flags directly in guest memory as a
>> result of a guest triggered transaction.
>> 
>> DESIGN REQUIREMENTS:
>> 
>>  - Access to IO Memory should be serialised by an IOMem mutex
>>  - The mutex should be recursive (e.g. allowing pid to relock itself)
>> 
>> IO Subsystem
>> ------------
>> 
>> The I/O subsystem is heavily used by KVM and has seen a lot of
>> improvements to offload I/O tasks to dedicated IOThreads. There should
>> be no additional locking required once we reach the Block Driver.
>> 
>> DESIGN REQUIREMENTS:
>> 
>>  - The dataplane should continue to be protected by the iothread locks
> 
> Watch out for where DMA invalidates the translated code.
> 


Need to check - that might be a great catch!

Cheers

Mark.

> Dave
> 
>> 
>> 
>> References
>> ==========
>> 
>> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
>> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
>> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297
>> 
>> 
>> 
>> -- 
>> Alex Bennée
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


	 +44 (0)20 7100 3485 x 210
 +33 (0)5 33 52 01 77x 210

	+33 (0)603762104
	mark.burton

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] RFC Multi-threaded TCG design document
  2015-06-17 18:23   ` Mark Burton
@ 2015-06-17 21:45     ` Frederic Konrad
  0 siblings, 0 replies; 14+ messages in thread
From: Frederic Konrad @ 2015-06-17 21:45 UTC (permalink / raw)
  To: Mark Burton, Dr. David Alan Gilbert
  Cc: mttcg, Peter Maydell, QEMU Developers, Alexander Graf,
	Guillaume Delbergue, Paolo Bonzini, Alex Bennée

On 17/06/2015 20:23, Mark Burton wrote:
>> On 17 Jun 2015, at 18:57, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>>
>> * Alex Bennée (alex.bennee@linaro.org) wrote:
>>> Hi,
>>> Shared Data Structures
>>> ======================
>>>
>>> Global TCG State
>>> ----------------
>>>
>>> We need to protect the entire code generation cycle including any post
>>> generation patching of the translated code. This also implies a shared
>>> translation buffer which contains code running on all cores. Any
>>> execution path that comes to the main run loop will need to hold a
>>> mutex for code generation. This also includes times when we need to flush
>>> code or jumps from the tb_cache.
>>>
>>> DESIGN REQUIREMENT: Add locking around all code generation, patching
>>> and jump cache modification
>> I don't think that you require a shared translation buffer between
>> cores to do this - although it *might* be the easiest way.
>> You could have a per-core translation buffer, the only requirement is
>> that most invalidation operations happen on all the buffers
>> (although that might depend on the emulated architecture).
>> With a per-core translation buffer, each core could generate new translations
>> without locking the other cores as long as no one is doing invalidations.
> I agree it’s not a design requirement - however we’ve kind of gone round this loop in terms of getting things to work.
> Fred will doubtless fill in some details, but basically it looks like restructuring TCG so that several code generators can run in parallel is a nightmare. We seem to get reasonable performance with just one CPU at a time generating TBs. At the same time, of course, the way QEMU is constructed there are actually several ‘layers’ of buffer - from the CPU-local ones through to the TB ‘pool’. So, actually, by accident or design, we benefit from a sort of caching structure.
>
True, it seems to be very complex, at least on ARM, because of the
disassembly context etc. On the other hand, the invalidation might be
easier, I guess.
As for performance, I'm not sure which way is better.

Fred
>>> Memory maps and TLBs
>>> --------------------
>>>
>>> The memory handling code is fairly critical to the speed of memory
>>> access in the emulated system.
>>>
>>>   - Memory regions (dividing up access to PIO, MMIO and RAM)
>>>   - Dirty page tracking (for code gen, migration and display)
>>>   - Virtual TLB (for translating guest address->real address)
>>>
>>> There is both a fast path walked by the generated code and a slow
>>> path when resolution is required. When the TLB tables are updated we
>>> need to ensure they are done in a safe way by bringing all executing
>>> threads to a halt before making the modifications.
>>>
>>> DESIGN REQUIREMENTS:
>>>
>>>   - TLB Flush All/Page
>>>     - can be across-CPUs
>>>     - will need all other CPUs brought to a halt
>>>   - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
>>>     - This is a per-CPU table - by definition can't race
>>>     - updated by its own thread when the slow-path is forced
>>>
>>> Emulated hardware state
>>> -----------------------
>>>
>>> Currently the hardware emulation has no protection against
>>> multiple-accesses. However guest systems accessing emulated hardware
>>> should be carrying out their own locking to prevent multiple CPUs
>>> confusing the hardware. Of course there is no guarantee that there
>>> couldn't be a broken guest that doesn't lock so you could get racing
>>> accesses to the hardware.
>>>
>>> There is the class of paravirtualized hardware (VIRTIO) that works in
>>> a purely MMIO mode, often setting flags directly in guest memory as a
>>> result of a guest-triggered transaction.
>>>
>>> DESIGN REQUIREMENTS:
>>>
>>>   - Access to IO Memory should be serialised by an IOMem mutex
>>>   - The mutex should be recursive (i.e. allowing the owning thread to relock it)
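
The recursive-mutex requirement above can be illustrated with plain POSIX
threads. This is only a sketch under assumptions: sketch_iomem_lock,
sketch_iomem_lock_init, sketch_device_reg_write and sketch_mmio_write_nested
are hypothetical names, and the device is reduced to a single integer
register. The pthread calls themselves are standard POSIX.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

/* Hypothetical IOMem lock and a stand-in for emulated device state. */
static pthread_mutex_t sketch_iomem_lock;
static int sketch_device_reg;

/* Create the lock as a recursive mutex. Returns 0 on success. */
int sketch_iomem_lock_init(void)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);

    if (rc != 0) {
        return rc;
    }
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (rc == 0) {
        rc = pthread_mutex_init(&sketch_iomem_lock, &attr);
    }
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Inner access: takes the lock on its own. */
void sketch_device_reg_write(int value)
{
    pthread_mutex_lock(&sketch_iomem_lock);
    sketch_device_reg = value;
    pthread_mutex_unlock(&sketch_iomem_lock);
}

/*
 * Outer handler that already holds the lock and triggers a nested
 * access from the same thread. With a non-recursive mutex the inner
 * lock call would deadlock; here it just bumps the lock count.
 */
int sketch_mmio_write_nested(int value)
{
    int seen;

    pthread_mutex_lock(&sketch_iomem_lock);
    sketch_device_reg_write(value);
    seen = sketch_device_reg;
    pthread_mutex_unlock(&sketch_iomem_lock);
    return seen;
}
```

Other threads still block until the owner has released the lock as many
times as it acquired it, so accesses from different vCPUs stay serialised.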
>>>
>>> IO Subsystem
>>> ------------
>>>
>>> The I/O subsystem is heavily used by KVM and has seen a lot of
>>> improvements to offload I/O tasks to dedicated IOThreads. There should
>>> be no additional locking required once we reach the Block Driver.
>>>
>>> DESIGN REQUIREMENTS:
>>>
>>>   - The dataplane should continue to be protected by the iothread locks
>> Watch out for where DMA invalidates the translated code.
>>
>
> Need to check - that might be a great catch!
>
> Cheers
>
> Mark.
>
>> Dave
>>
>>>
>>> References
>>> ==========
>>>
>>> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
>>> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
>>> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297
>>>
>>>
>>>
>>> -- 
>>> Alex Bennée
>> --
>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

end of thread, other threads:[~2015-06-17 21:46 UTC | newest]

Thread overview: 14+ messages
2015-06-12 16:37 [Qemu-devel] RFC Multi-threaded TCG design document Alex Bennée
2015-06-15  9:13 ` Frederic Konrad
2015-06-15 10:06   ` Alex Bennée
2015-06-15 10:51     ` Mark Burton
2015-06-15 12:36       ` Alex Bennée
2015-06-15 14:25       ` Alex Bennée
2015-06-15 13:06 ` alvise rigo
2015-06-15 14:25   ` Alex Bennée
2015-06-17 11:58 ` Paolo Bonzini
2015-06-17 15:57   ` Alex Bennée
2015-06-17 16:13     ` Paolo Bonzini
2015-06-17 16:57 ` Dr. David Alan Gilbert
2015-06-17 18:23   ` Mark Burton
2015-06-17 21:45     ` Frederic Konrad
