* [Qemu-devel] exec: Safe work in quiescent state @ 2016-06-09 21:51 Sergey Fedorov 2016-06-15 12:59 ` Sergey Fedorov 0 siblings, 1 reply; 8+ messages in thread From: Sergey Fedorov @ 2016-06-09 21:51 UTC (permalink / raw) To: QEMU Developers Cc: MTTCG Devel, KONRAD Frédéric, Alvise Rigo, Emilio G. Cota, Alex Bennée, Paolo Bonzini, Richard Henderson, Peter Maydell Hi, For certain kinds of tasks we might need a quiescent state to perform an operation safely. A quiescent state means no CPU thread is executing, and probably the BQL is held as well. The tasks could include: - Translation buffer flush (user and system-mode) - Cross-CPU TLB flush (system-mode) - Exclusive operation emulation (user-mode) If we use a single shared translation buffer which is not managed by RCU and simply flushed when full, we'll need a quiescent state to flush it safely. In multi-threaded TCG, a cross-CPU TLB flush from TCG helpers could probably be made with async_run_on_cpu(). I suppose it is always the guest system that needs to synchronise this operation properly. And as soon as we request the target CPU to exit its execution loop to serve the asynchronous work, we should probably be okay to continue execution on the CPU that requested the operation while the target CPU executes to the end of its current TB before it actually flushes its TLB. As for slow-path LL/SC emulation in multi-threaded TCG, cross-CPU TLB flushes (actually TLB flushes on all CPUs) must be done synchronously and thus might require a quiescent state. Exclusive operation emulation in user-mode is currently implemented in this manner, see start_exclusive(). It might change to some generic mechanism of atomic/exclusive instruction emulation for system and user-mode. It looks like we need to implement a common mechanism to perform safe work in a quiescent state which could work in both system and user-mode, at least for safe translation buffer flush in user-mode and MTTCG. I'm going to implement such a mechanism. 
I would appreciate any suggestions, comments and remarks. Thanks, Sergey ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [Qemu-devel] exec: Safe work in quiescent state 2016-06-09 21:51 [Qemu-devel] exec: Safe work in quiescent state Sergey Fedorov @ 2016-06-15 12:59 ` Sergey Fedorov 2016-06-15 14:16 ` alvise rigo 2016-06-15 14:56 ` Alex Bennée 0 siblings, 2 replies; 8+ messages in thread From: Sergey Fedorov @ 2016-06-15 12:59 UTC (permalink / raw) To: QEMU Developers Cc: MTTCG Devel, KONRAD Frédéric, Alvise Rigo, Emilio G. Cota, Alex Bennée, Paolo Bonzini, Richard Henderson, Peter Maydell On 10/06/16 00:51, Sergey Fedorov wrote: > For certain kinds of tasks we might need a quiescent state to perform an > operation safely. Quiescent state means no CPU thread executing, and > probably BQL held as well. The tasks could include: > - Translation buffer flush (user and system-mode) > - Cross-CPU TLB flush (system-mode) > - Exclusive operation emulation (user-mode) > > If we use a single shared translation buffer which is not managed by RCU > and simply flushed when full, we'll need a quiescent state to flush it > safely. > > In multi-threaded TCG, cross-CPU TLB flush from TCG helpers could > probably be made with async_run_on_cpu(). I suppose it is always the > guest system that needs to synchronise this operation properly. And as > soon as we request the target CPU to exit its execution loop for serving > the asynchronous work, we should probably be okay to continue execution > on the CPU requested the operation while the target CPU executing till > the end of its current TB before it actually flushed its TLB. > > As of slow-path LL/SC emulation in multi-threaded TCG, cross-CPU TLB > flushes (actually TLB flushes on all CPUs) must me done synchronously > and thus might require quiescent state. > > Exclusive operation emulation in user-mode is currently implemented in > this manner, see for start_exclusive(). It might change to some generic > mechanism of atomic/exclusive instruction emulation for system and > user-mode. 
> > It looks like we need to implement a common mechanism to perform safe > work in a quiescent state which could work in both system and user-mode, > at least for safe translation bufferflush in user-mode and MTTCG. I'm > going to implement such a mechanism. I would appreciate any suggestions, > comments and remarks. Considering different attempts to implement similar functionality, I've got the following summary. Fred's original async_run_safe_work_on_cpu() [1]: - resembles async_run_on_cpu(); - introduces a per-CPU safe work queue, a per-CPU flag to prevent the CPU from executing code, and a global counter of pending jobs; - implements rather complicated scheduling of jobs relying on both the per-CPU flag and the global counter; - may not be entirely safe when draining work queues if multiple CPUs have scheduled safe work; - does not support user-mode emulation. Alex's reiteration of Fred's approach [2]: - maintains a single global safe work queue; - uses a GArray rather than a linked list to implement the work queue; - introduces a global counter of CPUs which have entered their execution loop; - makes use of the last CPU to exit its execution loop to drain the safe work queue; - still does not support user-mode emulation. Alvise's async_wait_run_on_cpu() [3]: - uses the same queue as async_run_on_cpu(); - the CPU that requested the job is recorded in qemu_work_item; - each CPU has a counter of the jobs it has requested; - the counter is decremented upon job completion; - only the target CPU is forced to exit the execution loop, i.e. the job is not run in a quiescent state; - does not support user-mode emulation. 
Emilio's cpu_tcg_sched_work() [4]: - exploits tb_lock() to force CPUs to exit their execution loop; - requires 'tb_lock' to be held when scheduling a job; - allows each CPU to schedule only a single job; - handles scheduled work right in cpu_exec(); - exploits synchronize_rcu() to wait for other CPUs to exit their execution loop; - implements a complicated synchronization scheme; - should support both system and user-mode emulation. As for the requirements for a common safe work mechanism, each use case has its own considerations. Translation buffer flush just requires that no CPU is executing generated code during the operation. Cross-CPU TLB flush basically requires that no CPU is performing TLB lookup/modification. Some architectures might require the TLB flush to be complete before the requesting CPU can continue execution; others might allow delaying it until some "synchronization point". In the case of ARM, one such synchronization point is the DMB instruction. We might allow the operation to be performed asynchronously and continue execution, but we'd need to end the TB and synchronize on each DMB instruction. That doesn't seem very efficient. So a simple approach that forces the operation to complete before executing anything else would probably make sense in both cases. Slow-path LL/SC emulation also requires a cross-CPU TLB flush to be complete before it can finish emulation of an LL instruction. Exclusive operation emulation in user-mode basically requires that no other CPU is executing generated code. However, I hope that both system and user-mode would use some common implementation of exclusive instruction emulation. It was pointed out that special care must be taken to avoid deadlocks [5, 6]. A simple and reliable approach might be to have all CPUs, including the requesting one, exit their execution loops and then serve all the pending requests. 
Distilling the requirements, the safe work mechanism should: - support both system and user-mode emulation; - allow scheduling an asynchronous operation to be performed out of the CPU execution loop; - guarantee that all CPUs are out of their execution loops before the operation can begin; - guarantee that no CPU enters its execution loop before all the scheduled operations are complete. If that sounds like a sane approach, I'll come up with a more specific solution to discuss. The solution could be merged into v2.7 along with safe translation buffer flush in user-mode as an actual use case. Safe cross-CPU TLB flush would become a part of the MTTCG work. Comments, suggestions, arguments etc. are welcome! [1] http://thread.gmane.org/gmane.comp.emulators.qemu/355323/focus=355632 [2] http://thread.gmane.org/gmane.comp.emulators.qemu/407030/focus=407039 [3] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=413982 [4] http://thread.gmane.org/gmane.comp.emulators.qemu/356765/focus=356789 [5] http://thread.gmane.org/gmane.comp.emulators.qemu/397295/focus=397301 [6] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=417231 Kind regards, Sergey ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [Qemu-devel] exec: Safe work in quiescent state 2016-06-15 12:59 ` Sergey Fedorov @ 2016-06-15 14:16 ` alvise rigo 2016-06-15 14:51 ` Alex Bennée 2016-06-15 14:56 ` Alex Bennée 1 sibling, 1 reply; 8+ messages in thread From: alvise rigo @ 2016-06-15 14:16 UTC (permalink / raw) To: Sergey Fedorov Cc: QEMU Developers, MTTCG Devel, KONRAD Frédéric, Emilio G. Cota, Alex Bennée, Paolo Bonzini, Richard Henderson, Peter Maydell Hi Sergey, Nice review of the implementations we have so far. Just few comments below. On Wed, Jun 15, 2016 at 2:59 PM, Sergey Fedorov <serge.fdrv@gmail.com> wrote: > On 10/06/16 00:51, Sergey Fedorov wrote: >> For certain kinds of tasks we might need a quiescent state to perform an >> operation safely. Quiescent state means no CPU thread executing, and >> probably BQL held as well. The tasks could include: >> - Translation buffer flush (user and system-mode) >> - Cross-CPU TLB flush (system-mode) >> - Exclusive operation emulation (user-mode) >> >> If we use a single shared translation buffer which is not managed by RCU >> and simply flushed when full, we'll need a quiescent state to flush it >> safely. >> >> In multi-threaded TCG, cross-CPU TLB flush from TCG helpers could >> probably be made with async_run_on_cpu(). I suppose it is always the >> guest system that needs to synchronise this operation properly. And as >> soon as we request the target CPU to exit its execution loop for serving >> the asynchronous work, we should probably be okay to continue execution >> on the CPU requested the operation while the target CPU executing till >> the end of its current TB before it actually flushed its TLB. >> >> As of slow-path LL/SC emulation in multi-threaded TCG, cross-CPU TLB >> flushes (actually TLB flushes on all CPUs) must me done synchronously >> and thus might require quiescent state. >> >> Exclusive operation emulation in user-mode is currently implemented in >> this manner, see for start_exclusive(). 
It might change to some generic >> mechanism of atomic/exclusive instruction emulation for system and >> user-mode. >> >> It looks like we need to implement a common mechanism to perform safe >> work in a quiescent state which could work in both system and user-mode, >> at least for safe translation bufferflush in user-mode and MTTCG. I'm >> going to implement such a mechanism. I would appreciate any suggestions, >> comments and remarks. > > Considering different attempts to implement similar functionality, I've > got the following summary. > > Fred's original async_run_safe_work_on_cpu() [1]: > - resembles async_run_on_cpu(); > - introduces a per-CPU safe work queue, a per-CPU flag to prevent the > CPU from executing code, and a global counter of pending jobs; > - implements rather complicated scheduling of jobs relying on both the > per-CPU flag and the global counter; > - may be not entirely safe when draining work queues if multiple CPUs > have scheduled safe work; > - does not support user-mode emulation. > > Alex's reiteration of Fred's approach [2]: > - maintains a single global safe work queue; > - uses GArray rather than linked list to implement the work queue; > - introduces a global counter of CPUs which have entered their execution > loop; > - makes use of the last CPU exited its execution loop to drain the safe > work queue; > - still does not support user-mode emulation. > > Alvise's async_wait_run_on_cpu() [3]: > - uses the same queue as async_run_on_cpu(); > - the CPU that requested the job is recorded in qemu_work_item; > - each CPU has a counter of such jobs it has requested; > - the counter is decremented upon job completion; > - only the target CPU is forced to exit the execution loop, i.e. the job > is not run in quiescent state; async_wait_run_on_cpu() kicks the target VCPU before calling cpu_exit() on the current VCPU, so all the VCPUs are forced to exit. Moreover, the current VCPU waits for all the tasks to be completed. 
> - does not support user-mode emulation. > > Emilio's cpu_tcg_sched_work() [4]: > - exploits tb_lock() to force CPUs exit their execution loop; > - requires 'tb_lock' to be held when scheduling a job; > - allows each CPU to schedule only a single job; > - handles scheduled work right in cpu_exec(); > - exploits synchronize_rcu() to wait for other CPUs to exit their > execution loop; > - implements a complicated synchronization scheme; > - should support both system and user-mode emulation. > > > As of requirements for common safe work mechanism, each use case has its > own considerations. > > Translation buffer flush just requires that no CPU is executing > generated code during the operation. > > Cross-CPU TLB flush basically requires no CPU is performing TLB > lookup/modification. Some architectures might require TLB flush be > complete before the requesting CPU can continue execution; other might > allow to delay it until some "synchronization point". In case of ARM, > one of such synchronization points is DMB instruction. We might allow > the operation to be performed asynchronously and continue execution, but > we'd need to end TB and synchronize on each DMB instruction. That > doesn't seem very efficient. So a simple approach to force the operation > to complete before executing anything else would probably make sense in > both cases. Slow-path LL/SC emulation also requires cross-CPU TLB flush > to be complete before it can finish emulation of a LL instruction. > > Exclusive operation emulation in user-mode basically requires that no > other CPU is executing generated code. However, I hope that both system > and user-mode would use some common implementation of exclusive > instruction emulation. > > It was pointed out that special care must be taken to avoid deadlocks > [5, 6]. A simple and reliable approach might be to exit all CPU's > execution loop including the requesting CPU and then serve all the > pending requests. 
> > Distilling the requirements, safe work mechanism should: > - support both system and user-mode emulation; > - allow to schedule an asynchronous operation to be performed out of CPU > execution loop; > - guarantee that all CPUs are out of execution loop before the operation > can begin; This requirement is probably not necessary if we need to queue TLB flushes on other VCPUs, since every VCPU will flush its own TLB. For this reason we probably need two mechanisms: - The first allows a VCPU to queue a job on all the others and wait for all of them to be done (as for a global TLB flush) - The second allows a VCPU to perform a task in a quiescent state, i.e. the task starts and finishes when all VCPUs are out of the execution loop (translation buffer flush) Does this make sense? > - guarantee that no CPU enters execution loop before all the scheduled > operations are complete. This is probably too much in some cases for the reasons given above. Best regards, alvise > > If that sounds like a sane approach, I'll come up with a more specific > solution to discuss. The solution could be merged into v2.7 along with > safe translation buffer flush in user-mode as an actual use case. Safe > cross-CPU TLB flush would become a part of MTTCG work. Comments, > suggestions, arguments etc. are welcome! > > [1] http://thread.gmane.org/gmane.comp.emulators.qemu/355323/focus=355632 > [2] http://thread.gmane.org/gmane.comp.emulators.qemu/407030/focus=407039 > [3] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=413982 > [4] http://thread.gmane.org/gmane.comp.emulators.qemu/356765/focus=356789 > [5] http://thread.gmane.org/gmane.comp.emulators.qemu/397295/focus=397301 > [6] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=417231 > > Kind regards, > Sergey ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [Qemu-devel] exec: Safe work in quiescent state 2016-06-15 14:16 ` alvise rigo @ 2016-06-15 14:51 ` Alex Bennée 2016-06-15 15:25 ` alvise rigo 0 siblings, 1 reply; 8+ messages in thread From: Alex Bennée @ 2016-06-15 14:51 UTC (permalink / raw) To: alvise rigo Cc: Sergey Fedorov, QEMU Developers, MTTCG Devel, KONRAD Frédéric, Emilio G. Cota, Paolo Bonzini, Richard Henderson, Peter Maydell alvise rigo <a.rigo@virtualopensystems.com> writes: > Hi Sergey, > > Nice review of the implementations we have so far. > Just few comments below. > > On Wed, Jun 15, 2016 at 2:59 PM, Sergey Fedorov <serge.fdrv@gmail.com> wrote: >> On 10/06/16 00:51, Sergey Fedorov wrote: >>> For certain kinds of tasks we might need a quiescent state to perform an >>> operation safely. Quiescent state means no CPU thread executing, and >>> probably BQL held as well. The tasks could include: <snip> >> >> Alvise's async_wait_run_on_cpu() [3]: >> - uses the same queue as async_run_on_cpu(); >> - the CPU that requested the job is recorded in qemu_work_item; >> - each CPU has a counter of such jobs it has requested; >> - the counter is decremented upon job completion; >> - only the target CPU is forced to exit the execution loop, i.e. the job >> is not run in quiescent state; > > async_wait_run_on_cpu() kicks the target VCPU before calling > cpu_exit() on the current VCPU, so all the VCPUs are forced to exit. > Moreover, the current VCPU waits for all the tasks to be completed. The effect of qemu_cpu_kick() for TCG is effectively just doing a cpu_exit() anyway. Once done, any TCG code will exit on its next inter-block transition. 
> <snip> >> Distilling the requirements, safe work mechanism should: >> - support both system and user-mode emulation; >> - allow to schedule an asynchronous operation to be performed out of CPU >> execution loop; >> - guarantee that all CPUs are out of execution loop before the operation >> can begin; > > This requirement is probably not necessary if we need to query TLB > flushes to other VCPUs, since every VCPU will flush its own TLB. > For this reason we probably need to mechanisms: > - The first allows a VCPU to query a job to all the others and wait > for all of them to be done (like for global TLB flush) Do we need to wait? > - The second allows a VCPU to perform a task in quiescent state i.e. > the task starts and finishes when all VCPUs are out of the execution > loop (translation buffer flush) If you really want to ensure everything is done then you can exit the block early. To get the sort of dsb() flush semantics mentioned you simply: - queue your async safe work - exit block on dsb() This ensures that by the time the TCG thread restarts for the next instruction, all pending work has been flushed. > Does this make sense? I think we want one way of doing things for anything that is cross-CPU and requires a degree of synchronisation. If it ends up being too expensive then we can look at more efficient special-casing solutions. > >> - guarantee that no CPU enters execution loop before all the scheduled >> operations are complete. > > This is probably too much in some cases for the reasons of before. > > Best regards, > alvise > >> >> If that sounds like a sane approach, I'll come up with a more specific >> solution to discuss. The solution could be merged into v2.7 along with >> safe translation buffer flush in user-mode as an actual use case. Safe >> cross-CPU TLB flush would become a part of MTTCG work. Comments, >> suggestions, arguments etc. are welcome! 
>> >> [1] http://thread.gmane.org/gmane.comp.emulators.qemu/355323/focus=355632 >> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/407030/focus=407039 >> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=413982 >> [4] http://thread.gmane.org/gmane.comp.emulators.qemu/356765/focus=356789 >> [5] http://thread.gmane.org/gmane.comp.emulators.qemu/397295/focus=397301 >> [6] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=417231 >> >> Kind regards, >> Sergey -- Alex Bennée ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [Qemu-devel] exec: Safe work in quiescent state 2016-06-15 14:51 ` Alex Bennée @ 2016-06-15 15:25 ` alvise rigo 2016-06-15 20:05 ` Sergey Fedorov 0 siblings, 1 reply; 8+ messages in thread From: alvise rigo @ 2016-06-15 15:25 UTC (permalink / raw) To: Alex Bennée Cc: Sergey Fedorov, QEMU Developers, MTTCG Devel, KONRAD Frédéric, Emilio G. Cota, Paolo Bonzini, Richard Henderson, Peter Maydell On Wed, Jun 15, 2016 at 4:51 PM, Alex Bennée <alex.bennee@linaro.org> wrote: > > alvise rigo <a.rigo@virtualopensystems.com> writes: > >> Hi Sergey, >> >> Nice review of the implementations we have so far. >> Just few comments below. >> >> On Wed, Jun 15, 2016 at 2:59 PM, Sergey Fedorov <serge.fdrv@gmail.com> wrote: >>> On 10/06/16 00:51, Sergey Fedorov wrote: >>>> For certain kinds of tasks we might need a quiescent state to perform an >>>> operation safely. Quiescent state means no CPU thread executing, and >>>> probably BQL held as well. The tasks could include: > <snip> >>> >>> Alvise's async_wait_run_on_cpu() [3]: >>> - uses the same queue as async_run_on_cpu(); >>> - the CPU that requested the job is recorded in qemu_work_item; >>> - each CPU has a counter of such jobs it has requested; >>> - the counter is decremented upon job completion; >>> - only the target CPU is forced to exit the execution loop, i.e. the job >>> is not run in quiescent state; >> >> async_wait_run_on_cpu() kicks the target VCPU before calling >> cpu_exit() on the current VCPU, so all the VCPUs are forced to exit. >> Moreover, the current VCPU waits for all the tasks to be completed. > > The effect of qemu_cpu_kick() for TCG is effectively just doing a > cpu_exit() anyway. Once done any TCG code will exit on it's next > intra-block transition. 
> >> > <snip> >>> Distilling the requirements, safe work mechanism should: >>> - support both system and user-mode emulation; >>> - allow to schedule an asynchronous operation to be performed out of CPU >>> execution loop; >>> - guarantee that all CPUs are out of execution loop before the operation >>> can begin; >> >> This requirement is probably not necessary if we need to query TLB >> flushes to other VCPUs, since every VCPU will flush its own TLB. >> For this reason we probably need to mechanisms: >> - The first allows a VCPU to query a job to all the others and wait >> for all of them to be done (like for global TLB flush) > > Do we need to wait? Yes, otherwise the instruction (like MCR, which allows TLB invalidation) is not completely emulated before executing the following one. Waiting is also required during LL emulation, since it avoids possible race conditions. > >> - The second allows a VCPU to perform a task in quiescent state i.e. >> the task starts and finishes when all VCPUs are out of the execution >> loop (translation buffer flush) > > If you really want to ensure everything is done then you can exit the > block early. To get the sort of dsb() flush semantics mentioned you > simply: > > - queue your async safe work > - exit block on dsb() > > This ensures by the time the TCG thread restarts for the next > instruction all pending work has been flushed. > >> Does this make sense? > > I think we want one way of doing things for anything that is Cross CPU > and requires a degree of synchronisation. If it ends up being too > expensive then we can look at more efficient special casing solutions. OK, I agree that we should start with an approach that fits the two use cases. Thank you, alvise > >> >>> - guarantee that no CPU enters execution loop before all the scheduled >>> operations are complete. >> >> This is probably too much in some cases for the reasons of before. 
>> >> Best regards, >> alvise >> >>> >>> If that sounds like a sane approach, I'll come up with a more specific >>> solution to discuss. The solution could be merged into v2.7 along with >>> safe translation buffer flush in user-mode as an actual use case. Safe >>> cross-CPU TLB flush would become a part of MTTCG work. Comments, >>> suggestions, arguments etc. are welcome! >>> >>> [1] http://thread.gmane.org/gmane.comp.emulators.qemu/355323/focus=355632 >>> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/407030/focus=407039 >>> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=413982 >>> [4] http://thread.gmane.org/gmane.comp.emulators.qemu/356765/focus=356789 >>> [5] http://thread.gmane.org/gmane.comp.emulators.qemu/397295/focus=397301 >>> [6] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=417231 >>> >>> Kind regards, >>> Sergey > > > -- > Alex Bennée ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [Qemu-devel] exec: Safe work in quiescent state 2016-06-15 15:25 ` alvise rigo @ 2016-06-15 20:05 ` Sergey Fedorov 0 siblings, 0 replies; 8+ messages in thread From: Sergey Fedorov @ 2016-06-15 20:05 UTC (permalink / raw) To: alvise rigo, Alex Bennée Cc: QEMU Developers, MTTCG Devel, KONRAD Frédéric, Emilio G. Cota, Paolo Bonzini, Richard Henderson, Peter Maydell On 15/06/16 18:25, alvise rigo wrote: > On Wed, Jun 15, 2016 at 4:51 PM, Alex Bennée <alex.bennee@linaro.org> wrote: >> alvise rigo <a.rigo@virtualopensystems.com> writes: >>> On Wed, Jun 15, 2016 at 2:59 PM, Sergey Fedorov <serge.fdrv@gmail.com> wrote: >>>> On 10/06/16 00:51, Sergey Fedorov wrote: >>>>> For certain kinds of tasks we might need a quiescent state to perform an >>>>> operation safely. Quiescent state means no CPU thread executing, and >>>>> probably BQL held as well. The tasks could include: >> <snip> >>>> Alvise's async_wait_run_on_cpu() [3]: >>>> - uses the same queue as async_run_on_cpu(); >>>> - the CPU that requested the job is recorded in qemu_work_item; >>>> - each CPU has a counter of such jobs it has requested; >>>> - the counter is decremented upon job completion; >>>> - only the target CPU is forced to exit the execution loop, i.e. the job >>>> is not run in quiescent state; >>> async_wait_run_on_cpu() kicks the target VCPU before calling >>> cpu_exit() on the current VCPU, so all the VCPUs are forced to exit. >>> Moreover, the current VCPU waits for all the tasks to be completed. >> The effect of qemu_cpu_kick() for TCG is effectively just doing a >> cpu_exit() anyway. Once done any TCG code will exit on it's next >> intra-block transition. I just meant that async_wait_run_on_cpu() does not stop all the CPUs: it only affects the current CPU and the target CPU. So this mechanism cannot be used for tb_flush(). 
>> <snip> >>>> Distilling the requirements, safe work mechanism should: >>>> - support both system and user-mode emulation; >>>> - allow to schedule an asynchronous operation to be performed out of CPU >>>> execution loop; >>>> - guarantee that all CPUs are out of execution loop before the operation >>>> can begin; >>> This requirement is probably not necessary if we need to query TLB >>> flushes to other VCPUs, since every VCPU will flush its own TLB. >>> For this reason we probably need to mechanisms: >>> - The first allows a VCPU to query a job to all the others and wait >>> for all of them to be done (like for global TLB flush) >> Do we need to wait? > Yes, otherwise the instruction (like MCR which allows to do TLB > invalidation) is not completely emulated before executing the > following one. I think I need to specify this in the requirements: the CPU which requested an asynchronous safe operation must exit its execution loop at the end of the current TB and wait for the operation to complete. Then a guest cross-CPU TLB invalidation instruction can force the end of the TB to ensure no further instructions get executed until the flush is complete. > During the LL emulation is also required since it avoids possible race > conditions. As was pointed out in [1], LL can be implemented using such a "safe work in quiescent state" mechanism. [1] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=418664 >>> - The second allows a VCPU to perform a task in quiescent state i.e. >>> the task starts and finishes when all VCPUs are out of the execution >>> loop (translation buffer flush) >> If you really want to ensure everything is done then you can exit the >> block early. To get the sort of dsb() flush semantics mentioned you >> simply: >> >> - queue your async safe work >> - exit block on dsb() >> >> This ensures by the time the TCG thread restarts for the next >> instruction all pending work has been flushed. 
Indeed, if we kick the CPU which requested the job and just end the TB at the DSB instruction, then the CPU will see the exit request and leave its execution loop to wait for operation completion. >>> Does this make sense? >> I think we want one way of doing things for anything that is Cross CPU >> and requires a degree of synchronisation. If it ends up being too >> expensive then we can look at more efficient special casing solutions. > OK, I agree that we should start with an approach that fits the two use cases. So, refining the requirements, the safe work mechanism should: - support both system and user-mode emulation; - allow scheduling an asynchronous operation to be performed out of the CPU execution loop; - force all CPUs to exit their execution loops at the end of the currently executed TB once an operation is scheduled; - guarantee that all CPUs are out of their execution loops before the operation can begin; - guarantee that no CPU enters its execution loop until all the scheduled operations are complete. Kind regards, Sergey ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [Qemu-devel] exec: Safe work in quiescent state 2016-06-15 12:59 ` Sergey Fedorov 2016-06-15 14:16 ` alvise rigo @ 2016-06-15 14:56 ` Alex Bennée 2016-06-15 19:16 ` Sergey Fedorov 1 sibling, 1 reply; 8+ messages in thread From: Alex Bennée @ 2016-06-15 14:56 UTC (permalink / raw) To: Sergey Fedorov Cc: QEMU Developers, MTTCG Devel, KONRAD Frédéric, Alvise Rigo, Emilio G. Cota, Paolo Bonzini, Richard Henderson, Peter Maydell Sergey Fedorov <serge.fdrv@gmail.com> writes: > On 10/06/16 00:51, Sergey Fedorov wrote: >> For certain kinds of tasks we might need a quiescent state to perform an >> operation safely. Quiescent state means no CPU thread executing, and >> probably BQL held as well. The tasks could include: <snip> > > Considering different attempts to implement similar functionality, I've > got the following summary. > > Fred's original async_run_safe_work_on_cpu() [1]: > - resembles async_run_on_cpu(); > - introduces a per-CPU safe work queue, a per-CPU flag to prevent the > CPU from executing code, and a global counter of pending jobs; > - implements rather complicated scheduling of jobs relying on both the > per-CPU flag and the global counter; > - may be not entirely safe when draining work queues if multiple CPUs > have scheduled safe work; > - does not support user-mode emulation. Just some quick comments for context: > Alex's reiteration of Fred's approach [2]: > - maintains a single global safe work queue; Having separate queues can lead to problems with draining queues as only one queue gets drained at a time and some threads exit more frequently than others. > - uses GArray rather than linked list to implement the work queue; This was to minimise g_malloc on job creation and working through the list. An awful lot of jobs just need the CPU id and a single parameter. This is why I made it the simple case. 
> - introduces a global counter of CPUs which have entered their execution > loop; > - makes use of the last CPU exited its execution loop to drain the safe > work queue; I suspect you can still race with other deferred work as those tasks are being done outside the exec loop. This should be fixable though. > - still does not support user-mode emulation. There is no particular reason it couldn't. However it would mean updating the linux-user cpu_exec loop, which most likely needs a good clean-up and re-factoring to avoid making the change to $ARCH loops. > > Alvise's async_wait_run_on_cpu() [3]: > - uses the same queue as async_run_on_cpu(); > - the CPU that requested the job is recorded in qemu_work_item; > - each CPU has a counter of such jobs it has requested; > - the counter is decremented upon job completion; > - only the target CPU is forced to exit the execution loop, i.e. the job > is not run in quiescent state; > - does not support user-mode emulation. > > Emilio's cpu_tcg_sched_work() [4]: > - exploits tb_lock() to force CPUs exit their execution loop; > - requires 'tb_lock' to be held when scheduling a job; > - allows each CPU to schedule only a single job; > - handles scheduled work right in cpu_exec(); > - exploits synchronize_rcu() to wait for other CPUs to exit their > execution loop; > - implements a complicated synchronization scheme; > - should support both system and user-mode emulation. > > > As of requirements for common safe work mechanism, each use case has its > own considerations. > > Translation buffer flush just requires that no CPU is executing > generated code during the operation. > > Cross-CPU TLB flush basically requires no CPU is performing TLB > lookup/modification. Some architectures might require TLB flush be > complete before the requesting CPU can continue execution; other might > allow to delay it until some "synchronization point". In case of ARM, > one of such synchronization points is DMB instruction. 
> We might allow
> the operation to be performed asynchronously and continue execution, but
> we'd need to end the TB and synchronize on each DMB instruction. That
> doesn't seem very efficient. So a simple approach of forcing the
> operation to complete before executing anything else would probably
> make sense in both cases. Slow-path LL/SC emulation also requires the
> cross-CPU TLB flush to be complete before it can finish emulation of
> an LL instruction.
>
> Exclusive operation emulation in user-mode basically requires that no
> other CPU is executing generated code. However, I hope that both system
> and user-mode would use some common implementation of exclusive
> instruction emulation.
>
> It was pointed out that special care must be taken to avoid deadlocks
> [5, 6]. A simple and reliable approach might be to exit all CPUs'
> execution loops, including the requesting CPU's, and then serve all
> the pending requests.
>
> Distilling the requirements, the safe work mechanism should:
> - support both system and user-mode emulation;
> - allow scheduling an asynchronous operation to be performed out of
>   the CPU execution loop;
> - guarantee that all CPUs are out of their execution loop before the
>   operation can begin;
> - guarantee that no CPU enters its execution loop before all the
>   scheduled operations are complete.
>
> If that sounds like a sane approach, I'll come up with a more specific
> solution to discuss. The solution could be merged into v2.7 along with
> safe translation buffer flush in user-mode as an actual use case. Safe
> cross-CPU TLB flush would become part of the MTTCG work. Comments,
> suggestions, arguments etc. are welcome!
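[The distilled requirements above can be sketched very roughly as a
counter of CPUs inside the execution loop plus a pending-work gate: the
last CPU to leave the loop drains the queue, and no CPU may (re)enter
while work is pending. The sketch below is single-threaded and purely
illustrative -- every name is invented, and a real implementation would
need locking and a mechanism to kick running CPUs out of their loops.]

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SAFE_WORK 16

typedef void (*SafeWorkFn)(void *data);

static struct { SafeWorkFn fn; void *data; } safe_work[MAX_SAFE_WORK];
static int safe_work_count;     /* pending quiescent-state jobs */
static int cpus_in_exec_loop;   /* CPUs currently executing TBs */

/* Request a job to run when no CPU is in its execution loop. A real
 * implementation would also kick all CPUs out of the loop here
 * (e.g. by setting an exit_request flag on each of them). */
static void schedule_safe_work(SafeWorkFn fn, void *data)
{
    assert(safe_work_count < MAX_SAFE_WORK);
    safe_work[safe_work_count].fn = fn;
    safe_work[safe_work_count].data = data;
    safe_work_count++;
}

/* A CPU may enter the loop only when no safe work is pending, which
 * guarantees no scheduled operation is overtaken by new execution. */
static bool cpu_try_enter_exec_loop(void)
{
    if (safe_work_count > 0) {
        return false;   /* must wait until pending work is drained */
    }
    cpus_in_exec_loop++;
    return true;
}

/* The last CPU to leave the loop drains the queue, so all jobs run in
 * a quiescent state: no CPU is executing generated code. */
static void cpu_exit_exec_loop(void)
{
    cpus_in_exec_loop--;
    if (cpus_in_exec_loop == 0) {
        for (int i = 0; i < safe_work_count; i++) {
            safe_work[i].fn(safe_work[i].data);
        }
        safe_work_count = 0;
    }
}
```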
>
> [1] http://thread.gmane.org/gmane.comp.emulators.qemu/355323/focus=355632
> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/407030/focus=407039
> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=413982
> [4] http://thread.gmane.org/gmane.comp.emulators.qemu/356765/focus=356789
> [5] http://thread.gmane.org/gmane.comp.emulators.qemu/397295/focus=397301
> [6] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=417231
>
> Kind regards,
> Sergey

--
Alex Bennée

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [Qemu-devel] exec: Safe work in quiescent state
  2016-06-15 14:56 ` Alex Bennée
@ 2016-06-15 19:16   ` Sergey Fedorov
  0 siblings, 0 replies; 8+ messages in thread
From: Sergey Fedorov @ 2016-06-15 19:16 UTC (permalink / raw)
To: Alex Bennée
Cc: QEMU Developers, MTTCG Devel, KONRAD Frédéric, Alvise Rigo,
    Emilio G. Cota, Paolo Bonzini, Richard Henderson, Peter Maydell

On 15/06/16 17:56, Alex Bennée wrote:
> Sergey Fedorov <serge.fdrv@gmail.com> writes:
(snip)
> Just some quick comments for context:
>
>> Alex's reiteration of Fred's approach [2]:
>> - maintains a single global safe work queue;
> Having separate queues can lead to problems with draining: only one
> queue gets drained at a time, and some threads exit more frequently
> than others.

I think it can't happen if we drain all the queues from all the CPUs,
as we should. The requirement is: stop all the CPUs and process all the
pending work. If we follow this requirement, I think it's not important
whether we have separate queues for each CPU or just a single global
queue.

>
>> - uses GArray rather than linked list to implement the work queue;
> This was to minimise g_malloc on job creation and working through the
> list. An awful lot of jobs just need the CPU id and a single parameter.
> This is why I made it the simple case.

I think it would be nice to avoid g_malloc while not using an array at
the same time. I have some thoughts on how to do this easily; let's see
the code ;-)

>> - introduces a global counter of CPUs which have entered their execution
>>   loop;
>> - makes use of the last CPU to exit its execution loop to drain the safe
>>   work queue;
> I suspect you can still race with other deferred work as those tasks are
> being done outside the exec loop. This should be fixable though.

Will keep an eye on this, thanks.

>
>> - still does not support user-mode emulation.
> There is no particular reason it couldn't.
> However it would mean
> updating the linux-user cpu_exec loop which most likely needs a good
> clean-up and re-factoring to avoid making the change to $ARCH loops.

Yes, you are right; I was just recording the facts here :)

Kind regards,
Sergey

^ permalink raw reply	[flat|nested] 8+ messages in thread
end of thread, other threads:[~2016-06-15 20:05 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-09 21:51 [Qemu-devel] exec: Safe work in quiescent state Sergey Fedorov
2016-06-15 12:59 ` Sergey Fedorov
2016-06-15 14:16   ` alvise rigo
2016-06-15 14:51     ` Alex Bennée
2016-06-15 15:25       ` alvise rigo
2016-06-15 20:05         ` Sergey Fedorov
2016-06-15 14:56   ` Alex Bennée
2016-06-15 19:16     ` Sergey Fedorov