* [Qemu-devel] exec: Safe work in quiescent state
@ 2016-06-09 21:51 Sergey Fedorov
  2016-06-15 12:59 ` Sergey Fedorov
  0 siblings, 1 reply; 8+ messages in thread
From: Sergey Fedorov @ 2016-06-09 21:51 UTC (permalink / raw)
  To: QEMU Developers
  Cc: MTTCG Devel, KONRAD Frédéric, Alvise Rigo,
	Emilio G. Cota, Alex Bennée, Paolo Bonzini,
	Richard Henderson, Peter Maydell

Hi,

For certain kinds of tasks we might need a quiescent state to perform an
operation safely. A quiescent state means no CPU thread executing, and
probably the BQL held as well. The tasks could include:
- Translation buffer flush (user and system-mode)
- Cross-CPU TLB flush (system-mode)
- Exclusive operation emulation (user-mode)

If we use a single shared translation buffer which is not managed by RCU
and simply flushed when full, we'll need a quiescent state to flush it
safely.

In multi-threaded TCG, cross-CPU TLB flush from TCG helpers could
probably be done with async_run_on_cpu(). I suppose it is always the
guest system that needs to synchronise this operation properly. And as
soon as we request the target CPU to exit its execution loop to serve
the asynchronous work, we should probably be okay to continue execution
on the CPU that requested the operation while the target CPU executes
till the end of its current TB before it actually flushes its TLB.

As for slow-path LL/SC emulation in multi-threaded TCG, cross-CPU TLB
flushes (actually TLB flushes on all CPUs) must be done synchronously
and thus might require a quiescent state.

Exclusive operation emulation in user-mode is currently implemented in
this manner; see start_exclusive(). It might change to some generic
mechanism of atomic/exclusive instruction emulation for system and
user-mode.
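For context, the pattern start_exclusive() implements can be modeled roughly as below. This is a standalone sketch with invented names (model_*), using raw pthreads rather than QEMU's own primitives; it is not the actual linux-user code:

```c
#include <pthread.h>

/* Rough standalone model of the user-mode start/end_exclusive pattern.
 * All names here are made up for illustration. */
static pthread_mutex_t excl_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t excl_cond = PTHREAD_COND_INITIALIZER;
static int exclusive_pending;   /* someone wants a quiescent state */
static int cpus_running;        /* CPU threads currently executing */

/* A vCPU thread calls this before (re-)entering its execution loop:
 * it may not enter while an exclusive section is pending. */
void model_cpu_exec_start(void)
{
    pthread_mutex_lock(&excl_lock);
    while (exclusive_pending) {
        pthread_cond_wait(&excl_cond, &excl_lock);
    }
    cpus_running++;
    pthread_mutex_unlock(&excl_lock);
}

/* ...and this when it leaves the loop. */
void model_cpu_exec_end(void)
{
    pthread_mutex_lock(&excl_lock);
    cpus_running--;
    pthread_cond_broadcast(&excl_cond); /* wake a waiting requester */
    pthread_mutex_unlock(&excl_lock);
}

/* Returns with excl_lock held and no vCPU executing: the quiescent
 * state.  A real implementation would also kick the vCPUs so they
 * leave their loops promptly. */
void model_start_exclusive(void)
{
    pthread_mutex_lock(&excl_lock);
    exclusive_pending = 1;
    while (cpus_running > 0) {
        pthread_cond_wait(&excl_cond, &excl_lock);
    }
}

void model_end_exclusive(void)
{
    exclusive_pending = 0;
    pthread_cond_broadcast(&excl_cond); /* let vCPUs re-enter */
    pthread_mutex_unlock(&excl_lock);
}
```

model_start_exclusive() returns with the lock held and every modeled vCPU outside its loop, which is the quiescent state described above; model_end_exclusive() releases it again.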

It looks like we need to implement a common mechanism to perform safe
work in a quiescent state which could work in both system and user-mode,
at least for safe translation buffer flush in user-mode and MTTCG. I'm
going to implement such a mechanism. I would appreciate any suggestions,
comments and remarks.

Thanks,
Sergey

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] exec: Safe work in quiescent state
  2016-06-09 21:51 [Qemu-devel] exec: Safe work in quiescent state Sergey Fedorov
@ 2016-06-15 12:59 ` Sergey Fedorov
  2016-06-15 14:16   ` alvise rigo
  2016-06-15 14:56   ` Alex Bennée
  0 siblings, 2 replies; 8+ messages in thread
From: Sergey Fedorov @ 2016-06-15 12:59 UTC (permalink / raw)
  To: QEMU Developers
  Cc: MTTCG Devel, KONRAD Frédéric, Alvise Rigo,
	Emilio G. Cota, Alex Bennée, Paolo Bonzini,
	Richard Henderson, Peter Maydell

On 10/06/16 00:51, Sergey Fedorov wrote:
> For certain kinds of tasks we might need a quiescent state to perform an
> operation safely. Quiescent state means no CPU thread executing, and
> probably BQL held as well. The tasks could include:
> - Translation buffer flush (user and system-mode)
> - Cross-CPU TLB flush (system-mode)
> - Exclusive operation emulation (user-mode)
>
> If we use a single shared translation buffer which is not managed by RCU
> and simply flushed when full, we'll need a quiescent state to flush it
> safely.
>
> In multi-threaded TCG, cross-CPU TLB flush from TCG helpers could
> probably be made with async_run_on_cpu(). I suppose it is always the
> guest system that needs to synchronise this operation properly. And as
> soon as we request the target CPU to exit its execution loop for serving
> the asynchronous work, we should probably be okay to continue execution
> on the CPU that requested the operation while the target CPU executes
> till the end of its current TB before it actually flushes its TLB.
>
> As of slow-path LL/SC emulation in multi-threaded TCG, cross-CPU TLB
> flushes (actually TLB flushes on all CPUs) must be done synchronously
> and thus might require quiescent state.
>
> Exclusive operation emulation in user-mode is currently implemented in
> this manner; see start_exclusive(). It might change to some generic
> mechanism of atomic/exclusive instruction emulation for system and
> user-mode.
>
> It looks like we need to implement a common mechanism to perform safe
> work in a quiescent state which could work in both system and user-mode,
> at least for safe translation buffer flush in user-mode and MTTCG. I'm
> going to implement such a mechanism. I would appreciate any suggestions,
> comments and remarks.

Considering different attempts to implement similar functionality, I've
got the following summary.

Fred's original async_run_safe_work_on_cpu() [1]:
- resembles async_run_on_cpu();
- introduces a per-CPU safe work queue, a per-CPU flag to prevent the
CPU from executing code, and a global counter of pending jobs;
- implements rather complicated scheduling of jobs relying on both the
per-CPU flag and the global counter;
- may not be entirely safe when draining work queues if multiple CPUs
have scheduled safe work;
- does not support user-mode emulation.

Alex's reiteration of Fred's approach [2]:
- maintains a single global safe work queue;
- uses GArray rather than linked list to implement the work queue;
- introduces a global counter of CPUs which have entered their execution
loop;
- makes use of the last CPU to exit its execution loop to drain the safe
work queue;
- still does not support user-mode emulation.

Alvise's async_wait_run_on_cpu() [3]:
- uses the same queue as async_run_on_cpu();
- the CPU that requested the job is recorded in qemu_work_item;
- each CPU has a counter of such jobs it has requested;
- the counter is decremented upon job completion;
- only the target CPU is forced to exit the execution loop, i.e. the job
is not run in quiescent state;
- does not support user-mode emulation.

Emilio's cpu_tcg_sched_work() [4]:
- exploits tb_lock() to force CPUs to exit their execution loop;
- requires 'tb_lock' to be held when scheduling a job;
- allows each CPU to schedule only a single job;
- handles scheduled work right in cpu_exec();
- exploits synchronize_rcu() to wait for other CPUs to exit their
execution loop;
- implements a complicated synchronization scheme;
- should support both system and user-mode emulation.


As for the requirements for a common safe work mechanism, each use case
has its own considerations.

Translation buffer flush just requires that no CPU is executing
generated code during the operation.

Cross-CPU TLB flush basically requires that no CPU is performing TLB
lookup/modification. Some architectures might require the TLB flush to
be complete before the requesting CPU can continue execution; others
might allow delaying it until some "synchronization point". In the case
of ARM, one such synchronization point is the DMB instruction. We might
allow the operation to be performed asynchronously and continue
execution, but we'd need to end the TB and synchronize on each DMB
instruction. That doesn't seem very efficient. So a simple approach that
forces the operation to complete before executing anything else would
probably make sense in both cases. Slow-path LL/SC emulation also
requires the cross-CPU TLB flush to be complete before it can finish
emulation of an LL instruction.

Exclusive operation emulation in user-mode basically requires that no
other CPU is executing generated code. However, I hope that both system
and user-mode would use some common implementation of exclusive
instruction emulation.

It was pointed out that special care must be taken to avoid deadlocks
[5, 6]. A simple and reliable approach might be to exit all CPUs'
execution loops, including the requesting CPU's, and then serve all the
pending requests.

Distilling the requirements, the safe work mechanism should:
- support both system and user-mode emulation;
- allow scheduling an asynchronous operation to be performed out of the
CPU execution loop;
- guarantee that all CPUs are out of the execution loop before the
operation can begin;
- guarantee that no CPU enters the execution loop before all the
scheduled operations are complete.
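A minimal sketch of a mechanism satisfying these four requirements might look as follows. All names are invented (model_*), raw pthreads stand in for QEMU's threading primitives, and a real version would also need to kick the vCPUs out of their loops:

```c
#include <pthread.h>
#include <stdlib.h>

typedef void (*safe_work_fn)(void *data);

struct safe_work_item {
    safe_work_fn fn;
    void *data;
    struct safe_work_item *next;
};

static pthread_mutex_t swlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t swcond = PTHREAD_COND_INITIALIZER;
static struct safe_work_item *safe_work_pending; /* single global queue */
static int cpus_in_loop;

/* Requirement: no CPU enters its execution loop while work is pending. */
void model_cpu_loop_enter(void)
{
    pthread_mutex_lock(&swlock);
    while (safe_work_pending) {
        pthread_cond_wait(&swcond, &swlock);
    }
    cpus_in_loop++;
    pthread_mutex_unlock(&swlock);
}

/* Requirement: work only begins once every CPU is out of its loop, so
 * the last CPU to leave drains the queue, then wakes the others. */
void model_cpu_loop_exit(void)
{
    pthread_mutex_lock(&swlock);
    if (--cpus_in_loop == 0) {
        while (safe_work_pending) {
            struct safe_work_item *wi = safe_work_pending;
            safe_work_pending = wi->next;
            wi->fn(wi->data);       /* quiescent: no CPU is executing */
            free(wi);
        }
        pthread_cond_broadcast(&swcond);
    }
    pthread_mutex_unlock(&swlock);
}

/* Asynchronous scheduling; the real thing would also kick every vCPU
 * (cpu_exit()) so they leave their loops promptly. */
void model_async_safe_work(safe_work_fn fn, void *data)
{
    struct safe_work_item *wi = malloc(sizeof(*wi));
    wi->fn = fn;
    wi->data = data;
    pthread_mutex_lock(&swlock);
    wi->next = safe_work_pending;
    safe_work_pending = wi;
    pthread_mutex_unlock(&swlock);
}

/* helper used in the usage example */
static int model_safe_work_runs;
static void model_count_work(void *data) { (void)data; model_safe_work_runs++; }
```

Here the last CPU to leave its loop drains the single global queue, and model_cpu_loop_enter() blocks while any work is pending, giving both guarantees in the list above; the scheme is mode-agnostic, so user-mode emulation could use it too.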

If that sounds like a sane approach, I'll come up with a more specific
solution to discuss. The solution could be merged into v2.7 along with
safe translation buffer flush in user-mode as an actual use case. Safe
cross-CPU TLB flush would become a part of MTTCG work. Comments,
suggestions, arguments etc. are welcome!

[1] http://thread.gmane.org/gmane.comp.emulators.qemu/355323/focus=355632
[2] http://thread.gmane.org/gmane.comp.emulators.qemu/407030/focus=407039
[3] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=413982
[4] http://thread.gmane.org/gmane.comp.emulators.qemu/356765/focus=356789
[5] http://thread.gmane.org/gmane.comp.emulators.qemu/397295/focus=397301
[6] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=417231

Kind regards,
Sergey


* Re: [Qemu-devel] exec: Safe work in quiescent state
  2016-06-15 12:59 ` Sergey Fedorov
@ 2016-06-15 14:16   ` alvise rigo
  2016-06-15 14:51     ` Alex Bennée
  2016-06-15 14:56   ` Alex Bennée
  1 sibling, 1 reply; 8+ messages in thread
From: alvise rigo @ 2016-06-15 14:16 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, KONRAD Frédéric,
	Emilio G. Cota, Alex Bennée, Paolo Bonzini,
	Richard Henderson, Peter Maydell

Hi Sergey,

Nice review of the implementations we have so far.
Just a few comments below.

On Wed, Jun 15, 2016 at 2:59 PM, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
> On 10/06/16 00:51, Sergey Fedorov wrote:
>> For certain kinds of tasks we might need a quiescent state to perform an
>> operation safely. Quiescent state means no CPU thread executing, and
>> probably BQL held as well. The tasks could include:
>> - Translation buffer flush (user and system-mode)
>> - Cross-CPU TLB flush (system-mode)
>> - Exclusive operation emulation (user-mode)
>>
>> If we use a single shared translation buffer which is not managed by RCU
>> and simply flushed when full, we'll need a quiescent state to flush it
>> safely.
>>
>> In multi-threaded TCG, cross-CPU TLB flush from TCG helpers could
>> probably be made with async_run_on_cpu(). I suppose it is always the
>> guest system that needs to synchronise this operation properly. And as
>> soon as we request the target CPU to exit its execution loop for serving
>> the asynchronous work, we should probably be okay to continue execution
>> on the CPU that requested the operation while the target CPU executes
>> till the end of its current TB before it actually flushes its TLB.
>>
>> As of slow-path LL/SC emulation in multi-threaded TCG, cross-CPU TLB
>> flushes (actually TLB flushes on all CPUs) must be done synchronously
>> and thus might require quiescent state.
>>
>> Exclusive operation emulation in user-mode is currently implemented in
>> this manner; see start_exclusive(). It might change to some generic
>> mechanism of atomic/exclusive instruction emulation for system and
>> user-mode.
>>
>> It looks like we need to implement a common mechanism to perform safe
>> work in a quiescent state which could work in both system and user-mode,
>> at least for safe translation buffer flush in user-mode and MTTCG. I'm
>> going to implement such a mechanism. I would appreciate any suggestions,
>> comments and remarks.
>
> Considering different attempts to implement similar functionality, I've
> got the following summary.
>
> Fred's original async_run_safe_work_on_cpu() [1]:
> - resembles async_run_on_cpu();
> - introduces a per-CPU safe work queue, a per-CPU flag to prevent the
> CPU from executing code, and a global counter of pending jobs;
> - implements rather complicated scheduling of jobs relying on both the
> per-CPU flag and the global counter;
> - may not be entirely safe when draining work queues if multiple CPUs
> have scheduled safe work;
> - does not support user-mode emulation.
>
> Alex's reiteration of Fred's approach [2]:
> - maintains a single global safe work queue;
> - uses GArray rather than linked list to implement the work queue;
> - introduces a global counter of CPUs which have entered their execution
> loop;
> - makes use of the last CPU exited its execution loop to drain the safe
> work queue;
> - still does not support user-mode emulation.
>
> Alvise's async_wait_run_on_cpu() [3]:
> - uses the same queue as async_run_on_cpu();
> - the CPU that requested the job is recorded in qemu_work_item;
> - each CPU has a counter of such jobs it has requested;
> - the counter is decremented upon job completion;
> - only the target CPU is forced to exit the execution loop, i.e. the job
> is not run in quiescent state;

async_wait_run_on_cpu() kicks the target VCPU before calling
cpu_exit() on the current VCPU, so all the VCPUs are forced to exit.
Moreover, the current VCPU waits for all the tasks to be completed.

> - does not support user-mode emulation.
>
> Emilio's cpu_tcg_sched_work() [4]:
> - exploits tb_lock() to force CPUs exit their execution loop;
> - requires 'tb_lock' to be held when scheduling a job;
> - allows each CPU to schedule only a single job;
> - handles scheduled work right in cpu_exec();
> - exploits synchronize_rcu() to wait for other CPUs to exit their
> execution loop;
> - implements a complicated synchronization scheme;
> - should support both system and user-mode emulation.
>
>
> As of requirements for common safe work mechanism, each use case has its
> own considerations.
>
> Translation buffer flush just requires that no CPU is executing
> generated code during the operation.
>
> Cross-CPU TLB flush basically requires no CPU is performing TLB
> lookup/modification. Some architectures might require TLB flush be
> complete before the requesting CPU can continue execution; other might
> allow to delay it until some "synchronization point". In case of ARM,
> one of such synchronization points is DMB instruction. We might allow
> the operation to be performed asynchronously and continue execution, but
> we'd need to end TB and synchronize on each DMB instruction. That
> doesn't seem very efficient. So a simple approach to force the operation
> to complete before executing anything else would probably make sense in
> both cases. Slow-path LL/SC emulation also requires cross-CPU TLB flush
> to be complete before it can finish emulation of a LL instruction.
>
> Exclusive operation emulation in user-mode basically requires that no
> other CPU is executing generated code. However, I hope that both system
> and user-mode would use some common implementation of exclusive
> instruction emulation.
>
> It was pointed out that special care must be taken to avoid deadlocks
> [5, 6]. A simple and reliable approach might be to exit all CPU's
> execution loop including the requesting CPU and then serve all the
> pending requests.
>
> Distilling the requirements, safe work mechanism should:
> - support both system and user-mode emulation;
> - allow to schedule an asynchronous operation to be performed out of CPU
> execution loop;
> - guarantee that all CPUs are out of execution loop before the operation
> can begin;

This requirement is probably not necessary if we need to request TLB
flushes from other VCPUs, since every VCPU will flush its own TLB.
For this reason we probably need two mechanisms:
- The first allows a VCPU to post a job to all the others and wait
for all of them to be done (like for a global TLB flush)
- The second allows a VCPU to perform a task in a quiescent state, i.e.
the task starts and finishes when all VCPUs are out of the execution
loop (translation buffer flush)
Does this make sense?

> - guarantee that no CPU enters execution loop before all the scheduled
> operations are complete.

This is probably too much in some cases, for the reasons above.

Best regards,
alvise

>
> If that sounds like a sane approach, I'll come up with a more specific
> solution to discuss. The solution could be merged into v2.7 along with
> safe translation buffer flush in user-mode as an actual use case. Safe
> cross-CPU TLB flush would become a part of MTTCG work. Comments,
> suggestions, arguments etc. are welcome!
>
> [1] http://thread.gmane.org/gmane.comp.emulators.qemu/355323/focus=355632
> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/407030/focus=407039
> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=413982
> [4] http://thread.gmane.org/gmane.comp.emulators.qemu/356765/focus=356789
> [5] http://thread.gmane.org/gmane.comp.emulators.qemu/397295/focus=397301
> [6] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=417231
>
> Kind regards,
> Sergey


* Re: [Qemu-devel] exec: Safe work in quiescent state
  2016-06-15 14:16   ` alvise rigo
@ 2016-06-15 14:51     ` Alex Bennée
  2016-06-15 15:25       ` alvise rigo
  0 siblings, 1 reply; 8+ messages in thread
From: Alex Bennée @ 2016-06-15 14:51 UTC (permalink / raw)
  To: alvise rigo
  Cc: Sergey Fedorov, QEMU Developers, MTTCG Devel,
	KONRAD Frédéric, Emilio G. Cota, Paolo Bonzini,
	Richard Henderson, Peter Maydell


alvise rigo <a.rigo@virtualopensystems.com> writes:

> Hi Sergey,
>
> Nice review of the implementations we have so far.
> Just few comments below.
>
> On Wed, Jun 15, 2016 at 2:59 PM, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
>> On 10/06/16 00:51, Sergey Fedorov wrote:
>>> For certain kinds of tasks we might need a quiescent state to perform an
>>> operation safely. Quiescent state means no CPU thread executing, and
>>> probably BQL held as well. The tasks could include:
<snip>
>>
>> Alvise's async_wait_run_on_cpu() [3]:
>> - uses the same queue as async_run_on_cpu();
>> - the CPU that requested the job is recorded in qemu_work_item;
>> - each CPU has a counter of such jobs it has requested;
>> - the counter is decremented upon job completion;
>> - only the target CPU is forced to exit the execution loop, i.e. the job
>> is not run in quiescent state;
>
> async_wait_run_on_cpu() kicks the target VCPU before calling
> cpu_exit() on the current VCPU, so all the VCPUs are forced to exit.
> Moreover, the current VCPU waits for all the tasks to be completed.

The effect of qemu_cpu_kick() for TCG is effectively just that of a
cpu_exit() anyway. Once done, any TCG code will exit on its next
inter-block transition.

>
<snip>
>> Distilling the requirements, safe work mechanism should:
>> - support both system and user-mode emulation;
>> - allow to schedule an asynchronous operation to be performed out of CPU
>> execution loop;
>> - guarantee that all CPUs are out of execution loop before the operation
>> can begin;
>
> This requirement is probably not necessary if we need to query TLB
> flushes to other VCPUs, since every VCPU will flush its own TLB.
> For this reason we probably need two mechanisms:
> - The first allows a VCPU to query a job to all the others and wait
> for all of them to be done (like for global TLB flush)

Do we need to wait?

> - The second allows a VCPU to perform a task in quiescent state i.e.
> the task starts and finishes when all VCPUs are out of the execution
> loop (translation buffer flush)

If you really want to ensure everything is done then you can exit the
block early. To get the sort of dsb() flush semantics mentioned you
simply:

  - queue your async safe work
  - exit the block on dsb()

  This ensures that by the time the TCG thread restarts for the next
  instruction all pending work has been flushed.
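These two steps can be illustrated with a toy single-threaded model (everything below is invented for illustration): a TLB-flush "instruction" only queues work, a dsb "instruction" ends the translation block early, and queued work is drained between blocks, i.e. outside the execution loop:

```c
#include <assert.h>

enum insn { I_NOP, I_TLBFLUSH, I_DSB, I_END };

static int pending_work;        /* queued async safe work items */
static int work_done_at = -1;   /* trace index when the work ran */

/* Execute one translation block: run insns until a DSB forces an early
 * block exit (or the trace ends).  Returns the next insn index. */
static int model_run_block(const enum insn *trace, int pc)
{
    for (;;) {
        enum insn i = trace[pc];
        if (i == I_END) {
            return pc;
        }
        pc++;
        if (i == I_TLBFLUSH) {
            pending_work++;     /* queue the flush, don't wait for it */
        }
        if (i == I_DSB) {
            return pc;          /* end the TB early */
        }
    }
}

/* Outer loop: pending safe work is drained between blocks, so by the
 * first instruction after a DSB it has already completed. */
static void model_run_cpu(const enum insn *trace)
{
    int pc = 0;
    for (;;) {
        while (pending_work) {
            pending_work--;
            work_done_at = pc;  /* record where the flush took effect */
        }
        if (trace[pc] == I_END) {
            break;
        }
        pc = model_run_block(trace, pc);
    }
}
```

Running the trace { TLBFLUSH, NOP, DSB, NOP } drains the queued flush right after the DSB's block exit (at trace index 3), before the following instruction executes, which is exactly the dsb() semantics sketched above.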

> Does this make sense?

I think we want one way of doing things for anything that is cross-CPU
and requires a degree of synchronisation. If it ends up being too
expensive then we can look at more efficient special-casing solutions.

>
>> - guarantee that no CPU enters execution loop before all the scheduled
>> operations are complete.
>
> This is probably too much in some cases for the reasons of before.
>
> Best regards,
> alvise
>
>>
>> If that sounds like a sane approach, I'll come up with a more specific
>> solution to discuss. The solution could be merged into v2.7 along with
>> safe translation buffer flush in user-mode as an actual use case. Safe
>> cross-CPU TLB flush would become a part of MTTCG work. Comments,
>> suggestions, arguments etc. are welcome!
>>
>> [1] http://thread.gmane.org/gmane.comp.emulators.qemu/355323/focus=355632
>> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/407030/focus=407039
>> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=413982
>> [4] http://thread.gmane.org/gmane.comp.emulators.qemu/356765/focus=356789
>> [5] http://thread.gmane.org/gmane.comp.emulators.qemu/397295/focus=397301
>> [6] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=417231
>>
>> Kind regards,
>> Sergey


--
Alex Bennée


* Re: [Qemu-devel] exec: Safe work in quiescent state
  2016-06-15 12:59 ` Sergey Fedorov
  2016-06-15 14:16   ` alvise rigo
@ 2016-06-15 14:56   ` Alex Bennée
  2016-06-15 19:16     ` Sergey Fedorov
  1 sibling, 1 reply; 8+ messages in thread
From: Alex Bennée @ 2016-06-15 14:56 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, KONRAD Frédéric,
	Alvise Rigo, Emilio G. Cota, Paolo Bonzini, Richard Henderson,
	Peter Maydell


Sergey Fedorov <serge.fdrv@gmail.com> writes:

> On 10/06/16 00:51, Sergey Fedorov wrote:
>> For certain kinds of tasks we might need a quiescent state to perform an
>> operation safely. Quiescent state means no CPU thread executing, and
>> probably BQL held as well. The tasks could include:
<snip>
>
> Considering different attempts to implement similar functionality, I've
> got the following summary.
>
> Fred's original async_run_safe_work_on_cpu() [1]:
> - resembles async_run_on_cpu();
> - introduces a per-CPU safe work queue, a per-CPU flag to prevent the
> CPU from executing code, and a global counter of pending jobs;
> - implements rather complicated scheduling of jobs relying on both the
> per-CPU flag and the global counter;
> - may not be entirely safe when draining work queues if multiple CPUs
> have scheduled safe work;
> - does not support user-mode emulation.

Just some quick comments for context:

> Alex's reiteration of Fred's approach [2]:
> - maintains a single global safe work queue;

Having separate queues can lead to problems with draining queues as only
one queue gets drained at a time and some threads exit more frequently
than others.

> - uses GArray rather than linked list to implement the work queue;

This was to minimise g_malloc calls on job creation and when working
through the list. An awful lot of jobs just need the CPU id and a single
parameter. This is why I made that the simple case.
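The underlying design point — storing small fixed-size job records by value so that queuing does not allocate a list node per job — can be sketched without GLib like this (invented names; the actual series uses a GArray):

```c
#include <stdlib.h>

/* Jobs are stored by value in one growable array, so queuing a job
 * costs at most an occasional realloc instead of a malloc per item. */
struct model_job {
    int cpu_index;                       /* most jobs need just this... */
    void (*fn)(int cpu_index, void *data);
    void *data;                          /* ...plus a single parameter */
};

static struct model_job *job_q;
static size_t job_q_len, job_q_cap;

void model_queue_job(int cpu_index, void (*fn)(int, void *), void *data)
{
    if (job_q_len == job_q_cap) {
        job_q_cap = job_q_cap ? job_q_cap * 2 : 8;
        job_q = realloc(job_q, job_q_cap * sizeof(*job_q));
    }
    job_q[job_q_len++] = (struct model_job){ cpu_index, fn, data };
}

void model_drain_jobs(void)
{
    for (size_t i = 0; i < job_q_len; i++) {
        job_q[i].fn(job_q[i].cpu_index, job_q[i].data);
    }
    job_q_len = 0;                       /* capacity is reused next time */
}

/* helper used in the usage example */
static int model_job_sum;
static void model_add_cpu(int cpu_index, void *data)
{
    (void)data;
    model_job_sum += cpu_index;
}
```

Draining resets only the length, not the capacity, so in steady state neither queuing nor draining allocates at all.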

> - introduces a global counter of CPUs which have entered their execution
> loop;
> - makes use of the last CPU exited its execution loop to drain the safe
> work queue;

I suspect you can still race with other deferred work as those tasks are
being done outside the exec loop. This should be fixable though.

> - still does not support user-mode emulation.

There is no particular reason it couldn't. However it would mean
updating the linux-user cpu_exec loop, which most likely needs a good
clean-up and re-factoring to avoid making the change in every $ARCH loop.

>
> Alvise's async_wait_run_on_cpu() [3]:
> - uses the same queue as async_run_on_cpu();
> - the CPU that requested the job is recorded in qemu_work_item;
> - each CPU has a counter of such jobs it has requested;
> - the counter is decremented upon job completion;
> - only the target CPU is forced to exit the execution loop, i.e. the job
> is not run in quiescent state;
> - does not support user-mode emulation.
>
> Emilio's cpu_tcg_sched_work() [4]:
> - exploits tb_lock() to force CPUs exit their execution loop;
> - requires 'tb_lock' to be held when scheduling a job;
> - allows each CPU to schedule only a single job;
> - handles scheduled work right in cpu_exec();
> - exploits synchronize_rcu() to wait for other CPUs to exit their
> execution loop;
> - implements a complicated synchronization scheme;
> - should support both system and user-mode emulation.
>
>
> As of requirements for common safe work mechanism, each use case has its
> own considerations.
>
> Translation buffer flush just requires that no CPU is executing
> generated code during the operation.
>
> Cross-CPU TLB flush basically requires no CPU is performing TLB
> lookup/modification. Some architectures might require TLB flush be
> complete before the requesting CPU can continue execution; other might
> allow to delay it until some "synchronization point". In case of ARM,
> one of such synchronization points is DMB instruction. We might allow
> the operation to be performed asynchronously and continue execution, but
> we'd need to end TB and synchronize on each DMB instruction. That
> doesn't seem very efficient. So a simple approach to force the operation
> to complete before executing anything else would probably make sense in
> both cases. Slow-path LL/SC emulation also requires cross-CPU TLB flush
> to be complete before it can finish emulation of a LL instruction.
>
> Exclusive operation emulation in user-mode basically requires that no
> other CPU is executing generated code. However, I hope that both system
> and user-mode would use some common implementation of exclusive
> instruction emulation.
>
> It was pointed out that special care must be taken to avoid deadlocks
> [5, 6]. A simple and reliable approach might be to exit all CPU's
> execution loop including the requesting CPU and then serve all the
> pending requests.
>
> Distilling the requirements, safe work mechanism should:
> - support both system and user-mode emulation;
> - allow to schedule an asynchronous operation to be performed out of CPU
> execution loop;
> - guarantee that all CPUs are out of execution loop before the operation
> can begin;
> - guarantee that no CPU enters execution loop before all the scheduled
> operations are complete.
>
> If that sounds like a sane approach, I'll come up with a more specific
> solution to discuss. The solution could be merged into v2.7 along with
> safe translation buffer flush in user-mode as an actual use case. Safe
> cross-CPU TLB flush would become a part of MTTCG work. Comments,
> suggestions, arguments etc. are welcome!
>
> [1] http://thread.gmane.org/gmane.comp.emulators.qemu/355323/focus=355632
> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/407030/focus=407039
> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=413982
> [4] http://thread.gmane.org/gmane.comp.emulators.qemu/356765/focus=356789
> [5] http://thread.gmane.org/gmane.comp.emulators.qemu/397295/focus=397301
> [6] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=417231
>
> Kind regards,
> Sergey


--
Alex Bennée


* Re: [Qemu-devel] exec: Safe work in quiescent state
  2016-06-15 14:51     ` Alex Bennée
@ 2016-06-15 15:25       ` alvise rigo
  2016-06-15 20:05         ` Sergey Fedorov
  0 siblings, 1 reply; 8+ messages in thread
From: alvise rigo @ 2016-06-15 15:25 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Sergey Fedorov, QEMU Developers, MTTCG Devel,
	KONRAD Frédéric, Emilio G. Cota, Paolo Bonzini,
	Richard Henderson, Peter Maydell

On Wed, Jun 15, 2016 at 4:51 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> alvise rigo <a.rigo@virtualopensystems.com> writes:
>
>> Hi Sergey,
>>
>> Nice review of the implementations we have so far.
>> Just few comments below.
>>
>> On Wed, Jun 15, 2016 at 2:59 PM, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
>>> On 10/06/16 00:51, Sergey Fedorov wrote:
>>>> For certain kinds of tasks we might need a quiescent state to perform an
>>>> operation safely. Quiescent state means no CPU thread executing, and
>>>> probably BQL held as well. The tasks could include:
> <snip>
>>>
>>> Alvise's async_wait_run_on_cpu() [3]:
>>> - uses the same queue as async_run_on_cpu();
>>> - the CPU that requested the job is recorded in qemu_work_item;
>>> - each CPU has a counter of such jobs it has requested;
>>> - the counter is decremented upon job completion;
>>> - only the target CPU is forced to exit the execution loop, i.e. the job
>>> is not run in quiescent state;
>>
>> async_wait_run_on_cpu() kicks the target VCPU before calling
>> cpu_exit() on the current VCPU, so all the VCPUs are forced to exit.
>> Moreover, the current VCPU waits for all the tasks to be completed.
>
> The effect of qemu_cpu_kick() for TCG is effectively just doing a
> cpu_exit() anyway. Once done any TCG code will exit on it's next
> intra-block transition.
>
>>
> <snip>
>>> Distilling the requirements, safe work mechanism should:
>>> - support both system and user-mode emulation;
>>> - allow to schedule an asynchronous operation to be performed out of CPU
>>> execution loop;
>>> - guarantee that all CPUs are out of execution loop before the operation
>>> can begin;
>>
>> This requirement is probably not necessary if we need to query TLB
>> flushes to other VCPUs, since every VCPU will flush its own TLB.
>> For this reason we probably need two mechanisms:
>> - The first allows a VCPU to query a job to all the others and wait
>> for all of them to be done (like for global TLB flush)
>
> Do we need to wait?

Yes, otherwise the instruction (like the MCR that performs TLB
invalidation) is not completely emulated before the following one
executes.
Waiting is also required during LL emulation, since it avoids possible
race conditions.

>
>> - The second allows a VCPU to perform a task in quiescent state i.e.
>> the task starts and finishes when all VCPUs are out of the execution
>> loop (translation buffer flush)
>
> If you really want to ensure everything is done then you can exit the
> block early. To get the sort of dsb() flush semantics mentioned you
> simply:
>
>   - queue your async safe work
>   - exit block on dsb()
>
>   This ensures by the time the TCG thread restarts for the next
>   instruction all pending work has been flushed.
>
>> Does this make sense?
>
> I think we want one way of doing things for anything that is Cross CPU
> and requires a degree of synchronisation. If it ends up being too
> expensive then we can look at more efficient special casing solutions.

OK, I agree that we should start with an approach that fits the two use cases.

Thank you,
alvise

>
>>
>>> - guarantee that no CPU enters execution loop before all the scheduled
>>> operations are complete.
>>
>> This is probably too much in some cases for the reasons of before.
>>
>> Best regards,
>> alvise
>>
>>>
>>> If that sounds like a sane approach, I'll come up with a more specific
>>> solution to discuss. The solution could be merged into v2.7 along with
>>> safe translation buffer flush in user-mode as an actual use case. Safe
>>> cross-CPU TLB flush would become a part of MTTCG work. Comments,
>>> suggestions, arguments etc. are welcome!
>>>
>>> [1] http://thread.gmane.org/gmane.comp.emulators.qemu/355323/focus=355632
>>> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/407030/focus=407039
>>> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=413982
>>> [4] http://thread.gmane.org/gmane.comp.emulators.qemu/356765/focus=356789
>>> [5] http://thread.gmane.org/gmane.comp.emulators.qemu/397295/focus=397301
>>> [6] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=417231
>>>
>>> Kind regards,
>>> Sergey
>
>
> --
> Alex Bennée


* Re: [Qemu-devel] exec: Safe work in quiescent state
  2016-06-15 14:56   ` Alex Bennée
@ 2016-06-15 19:16     ` Sergey Fedorov
  0 siblings, 0 replies; 8+ messages in thread
From: Sergey Fedorov @ 2016-06-15 19:16 UTC (permalink / raw)
  To: Alex Bennée
  Cc: QEMU Developers, MTTCG Devel, KONRAD Frédéric,
	Alvise Rigo, Emilio G. Cota, Paolo Bonzini, Richard Henderson,
	Peter Maydell

On 15/06/16 17:56, Alex Bennée wrote:
> Sergey Fedorov <serge.fdrv@gmail.com> writes:
(snip)
> Just some quick comments for context:
>
>> Alex's reiteration of Fred's approach [2]:
>> - maintains a single global safe work queue;
> Having separate queues can lead to problems with draining queues as only
> queue gets drained at a time and some threads exit more frequently than
> others.

I think it can't happen if we drain all the queues of all the CPUs, as
we should. The requirement is: stop all the CPUs and process all the
pending work. If we follow this requirement, it's not important whether
we have separate queues for each CPU or just a single global queue.

>
>> - uses GArray rather than linked list to implement the work queue;
> This was to minimise g_malloc on job creation and working through the
> list. An awful lot of jobs just need the CPU id and a single parameter.
> This is why I made it the simple case.

I think it would be nice to avoid g_malloc without resorting to an array
at the same time. I have some thoughts on how to do this easily, let's
see the code ;-)

>> - introduces a global counter of CPUs which have entered their execution
>> loop;
>> - makes use of the last CPU exited its execution loop to drain the safe
>> work queue;
> I suspect you can still race with other deferred work as those tasks are
> being done outside the exec loop. This should be fixable though.

Will keep an eye on this, thanks.

>
>> - still does not support user-mode emulation.
> There is no particular reason it couldn't. However it would mean
> updating the linux-user cpu_exec loop, which most likely needs a good
> clean-up and re-factoring to avoid making the change in every $ARCH loop.

Yes, you are right, I was just stating the facts here :)

Kind regards,
Sergey


* Re: [Qemu-devel] exec: Safe work in quiescent state
  2016-06-15 15:25       ` alvise rigo
@ 2016-06-15 20:05         ` Sergey Fedorov
  0 siblings, 0 replies; 8+ messages in thread
From: Sergey Fedorov @ 2016-06-15 20:05 UTC (permalink / raw)
  To: alvise rigo, Alex Bennée
  Cc: QEMU Developers, MTTCG Devel, KONRAD Frédéric,
	Emilio G. Cota, Paolo Bonzini, Richard Henderson, Peter Maydell

On 15/06/16 18:25, alvise rigo wrote:
> On Wed, Jun 15, 2016 at 4:51 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>> alvise rigo <a.rigo@virtualopensystems.com> writes:
>>> On Wed, Jun 15, 2016 at 2:59 PM, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
>>>> On 10/06/16 00:51, Sergey Fedorov wrote:
>>>>> For certain kinds of tasks we might need a quiescent state to perform an
>>>>> operation safely. Quiescent state means no CPU thread executing, and
>>>>> probably BQL held as well. The tasks could include:
>> <snip>
>>>> Alvise's async_wait_run_on_cpu() [3]:
>>>> - uses the same queue as async_run_on_cpu();
>>>> - the CPU that requested the job is recorded in qemu_work_item;
>>>> - each CPU has a counter of such jobs it has requested;
>>>> - the counter is decremented upon job completion;
>>>> - only the target CPU is forced to exit the execution loop, i.e. the job
>>>> is not run in quiescent state;
>>> async_wait_run_on_cpu() kicks the target VCPU before calling
>>> cpu_exit() on the current VCPU, so all the VCPUs are forced to exit.
>>> Moreover, the current VCPU waits for all the tasks to be completed.
>> For TCG, qemu_cpu_kick() effectively just does a cpu_exit() anyway.
>> Once done, any TCG code will exit on its next inter-block transition.

I was just meaning that async_wait_run_on_cpu() does not stop all the
CPUs: it only affects the current CPU and the target CPU. So this
mechanism cannot be used for tb_flush().

>> <snip>
>>>> Distilling the requirements, safe work mechanism should:
>>>> - support both system and user-mode emulation;
>>>> - allow to schedule an asynchronous operation to be performed out of CPU
>>>> execution loop;
>>>> - guarantee that all CPUs are out of execution loop before the operation
>>>> can begin;
>>> This requirement is probably not necessary if we need to queue TLB
>>> flushes to other VCPUs, since every VCPU will flush its own TLB.
>>> For this reason we probably need two mechanisms:
>>> - The first allows a VCPU to queue a job to all the others and wait
>>> for all of them to be done (like for a global TLB flush)
>> Do we need to wait?
> Yes, otherwise the instruction (such as MCR, which can perform TLB
> invalidation) is not completely emulated before executing the
> following one.

I think I need to specify this in the requirements: the CPU which
requested an asynchronous safe operation must exit its execution loop at
the end of the current TB and wait for the operation to complete. Then a
guest cross-CPU TLB invalidation instruction can force the end of the TB
to ensure no further instructions get executed until the flush is complete.

> It is also required during the LL emulation since it avoids possible
> race conditions.

As it was pointed in [1], LL can be implemented using such "safe work in
quiescent state" mechanism.

[1] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=418664


>>> - The second allows a VCPU to perform a task in quiescent state i.e.
>>> the task starts and finishes when all VCPUs are out of the execution
>>> loop (translation buffer flush)
>> If you really want to ensure everything is done then you can exit the
>> block early. To get the sort of dsb() flush semantics mentioned you
>> simply:
>>
>>   - queue your async safe work
>>   - exit block on dsb()
>>
>>   This ensures by the time the TCG thread restarts for the next
>>   instruction all pending work has been flushed.

Indeed, if we kick the CPU which requested the job and just end the TB
at the DSB instruction, then the CPU will see the exit request and leave
its execution loop to wait for the operation to complete.

>>> Does this make sense?
>> I think we want one way of doing things for anything that is Cross CPU
>> and requires a degree of synchronisation. If it ends up being too
>> expensive then we can look at more efficient special casing solutions.
> OK, I agree that we should start with an approach that fits the two use cases.

So, refining the requirements, the safe work mechanism should:
- support both system and user-mode emulation;
- allow scheduling an asynchronous operation to be performed outside the
CPU execution loop;
- force all CPUs to exit the execution loop at the end of the currently
executing TB once an operation is scheduled;
- guarantee that all CPUs are out of the execution loop before the
operation can begin;
- guarantee that no CPU enters the execution loop until all the scheduled
operations are complete.

Kind regards,
Sergey


end of thread, other threads:[~2016-06-15 20:05 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-09 21:51 [Qemu-devel] exec: Safe work in quiescent state Sergey Fedorov
2016-06-15 12:59 ` Sergey Fedorov
2016-06-15 14:16   ` alvise rigo
2016-06-15 14:51     ` Alex Bennée
2016-06-15 15:25       ` alvise rigo
2016-06-15 20:05         ` Sergey Fedorov
2016-06-15 14:56   ` Alex Bennée
2016-06-15 19:16     ` Sergey Fedorov
