* sharing intention for developing per-target, dynamically loadable accelerator modules
@ 2020-05-17 11:30 Claudio Fontana
  2020-05-18 18:18 ` Alex Bennée
  0 siblings, 1 reply; 4+ messages in thread
From: Claudio Fontana @ 2020-05-17 11:30 UTC (permalink / raw)
  To: qemu-devel, Philippe Mathieu-Daudé, Peter Maydell; +Cc: Paolo Bonzini

Hello all,

my intention would be to develop per-target, dynamically loadable accelerator modules.

This would make it possible to distribute a single QEMU base binary, and then provide accelerators as optional, separately installable binary packages,
with the first such optional package being TCG.

CONFIG_TCG would become 'm' as a result, and eventually so would CONFIG_KVM, CONFIG_HAX, CONFIG_WHPX, and CONFIG_HVF.

Here are some elements that seem to be needed:

1 - The CONFIG_MODULE part of the build system would need some extension to add per-target modules. I have some tentative results showing that this is possible (but a bit clunky at the moment).
    There is also some instability in the existing Makefile module infrastructure that shows up when trying to extend it.

2 - new "accelerator drivers" seem to be needed, either in addition to, or as additional functionality inside, the current AccelState.

3 - for target/i386 in particular, some refactoring work is needed to split out even more unrelated bits and pieces.
    The dependencies of the hw/i386 machine code on accelerator-specific code are also painful.

4 - CPU arch classes could be extended with per-accelerator methods. Initial experimentation shows this should probably work.
    One alternative would be playing with the dynamic linker (weak symbols, creative use of dlsym, etc.), but I have not sorted out the details of that option.

5 - cputlb, in particular tlb_flush and friends, is a separate problem, since it is not part of the CPU class. Should it be?

6 - a pain point is the fact that in USER mode the accel class is not applied, which causes a lot of uncleanliness all around
    (tcg_allowed living outside of the AccelClass).

7 - I have not really thought about the Kconfig aspects yet, because I am not super-familiar with them

8 - cpus.c needs some good splitting

... more things to find out and think about ...

Overall, I think this activity has the potential to provide benefits beyond the actual goal, in the form of cleanups, leaner configurations,
minor fixes, and maybe fixes for the CONFIG_MODULE instabilities, if any.

As an example, the first activity I plan to submit as an RFC is point 8 above:
a split between cpus.c and cpus-tcg.c that results in lots of TCG-specific code being removed from non-TCG builds (using CONFIG_TCG).

One thing that should be kept in check is the performance impact of the changes; in particular for point 4, hot paths should probably avoid going through too many pointer indirections.

Does anybody share similar goals? Is there any major obstacle or blocker that would put the feasibility into question?
Any suggestions on any of this? Points 4 and 5 come to mind in particular, as well as a better understanding of the reasons behind 6, or even suggestions on how best to do 2.

Anyway, I will continue to work on the first RFC covering some smaller initial steps, and hopefully I will have something to submit soon.

Ciao ciao,

Claudio

-- 
Claudio Fontana
Engineering Manager Virtualization, SUSE Labs Core

SUSE Software Solutions Italy Srl



* Re: sharing intention for developing per-target, dynamically loadable accelerator modules
  2020-05-17 11:30 sharing intention for developing per-target, dynamically loadable accelerator modules Claudio Fontana
@ 2020-05-18 18:18 ` Alex Bennée
  2020-05-19  7:53   ` Claudio Fontana
  0 siblings, 1 reply; 4+ messages in thread
From: Alex Bennée @ 2020-05-18 18:18 UTC (permalink / raw)
  To: Claudio Fontana
  Cc: Peter Maydell, Philippe Mathieu-Daudé, qemu-devel, Paolo Bonzini


Claudio Fontana <cfontana@suse.de> writes:

> Hello all,
>
> my intention would be to develop per-target, dynamically loadable accelerator modules.
>
> This would allow to distribute a single QEMU base binary, and then provide accelerators as optional additional binary packages to install,
> with the first separate optional package being TCG.
>
> CONFIG_TCG would become 'm' as a result, but then also CONFIG_KVM, CONFIG_HAX, CONFIG_WHPX, CONFIG_HVF.
>
> Here are some elements that seem to be needed:
>
> 1 - The module CONFIG_MODULE part of the build system would need some extension to add per-target modules. I have some tentative results that shows that this is possible (but a bit clunky atm).
>     There is some existing instability in the existing Makefile infrastructure of modules that shows up when trying to extend it.
>
> 2 - new "accelerator drivers" seems to be needed, either in addition or as additional functionality inside the current AccelState.
>
> 3 - for target/i386 in particular, there is some refactoring work needed to split even more different unrelated bits and pieces.
>     dependencies of hw/i386 machine stuff with accelerator-specific
> stuff are also painful.

There are a couple of per-arch hacks in the main TCG cpu loops that it
would be good to excise from the code.

>
> 4 - CPU Arch Classes could be extended with per-accelerator methods. Initial fooling around shows it should probably work.
>     One alternative would be trying to play with the dynamic linker (weak symbols, creative use of dlsym etc), but I have not sorted out the details of this option.
>
> 5 - cputlb, in particular tlb_flush and friends is a separate problem
> since it is not part of the cpuclass. Should it be?

tlb_flush and friends are TCG implementation details for softmmu that
ensure a lookup for any address will always fall back to a guest-specific
tlb_fill to rebuild the cache. The behaviour is not guest-specific
per se, although we do partition the table entries based on the guest
size.

Perhaps we can make it more dynamic, but it would have to ensure both the
slow path and the fast path use the same mask and shifts when
walking the table.

> 6 - a painpoint is represented by the fact that in USER mode, the accel class is not applied, which causes a lot of uncleanliness all around
>     (tcg_allowed outside of the AccelClass).

The user-mode run loops are a whole separate chunk of code. I don't know
if it is worth trying to make them pluggable, as you will only ever have
one per linux-user binary.

> 7 - I have not really thought about the KConfig aspects because I am not super-familiar
>
> 8 - cpus.c needs some good splitting

Although there are several different run loops in there, I think they
share a lot of commonality, which is why they are bundled together. They
all share the same backend for dealing with ioevents and the generic
pause/unpause machinery. But feel free to prove me wrong ;-)

> ... more things to find out and think about ...
>
> Overall, I think that the activity has the potential to provide benefits overall beyond the actual goal, in the form of cleanups, leaner configurations,
> minor fixes, maybe improving the CONFIG_MODULE instabilities if any
> etc.

There are certainly some ugly bits we could shave down in such an
exercise.

> As an example, the first activity I would plan to submit as RFC is point 8 above,
> there is the split between cpus.c and cpus-tcg.c that results in lots of TCG-specific code being removed from non-tcg builds (using CONFIG_TCG).
>
> One thing that should be kept in check is any performance impact of
> the changes, in particular for point 4, hot paths should probably
> avoid going through too many pointer indirections.

Maybe - certainly for TCG, you have pretty much lost once you have exited
the main execution loop, so I doubt it would show up much. Not so sure about
the HW accelerators, but most of the performance-sensitive stuff is dealt
with close to the ioctls, IIRC.

> Does anybody share similar goals? Any major obstacle or blocker that would put the feasibility into question?
> Any suggestion on any of this? In particular point 4 and 5 come to
> mind, as well as some better understanding of the reasons behind 6, or
> even suggestions on how to best to 2.

For linux-user, each CPU run loop is its own special snowflake, because
pretty much every architecture has its own special set of EXCP_ exits
to deal with various bits. There are per-arch EXCP_ flags for system
emulation as well, but not nearly as many.

>
> Anyway, I will continue to work on the first RFC for some smaller initial steps and hopefully have something to submit soon.
>
> Ciao ciao,
>
> Claudio


-- 
Alex Bennée



* Re: sharing intention for developing per-target, dynamically loadable accelerator modules
  2020-05-18 18:18 ` Alex Bennée
@ 2020-05-19  7:53   ` Claudio Fontana
  2020-05-21 11:28     ` Claudio Fontana
  0 siblings, 1 reply; 4+ messages in thread
From: Claudio Fontana @ 2020-05-19  7:53 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Peter Maydell, Philippe Mathieu-Daudé, qemu-devel, Paolo Bonzini

On 5/18/20 8:18 PM, Alex Bennée wrote:
> 
> Claudio Fontana <cfontana@suse.de> writes:
> 
>> Hello all,
>>
>> my intention would be to develop per-target, dynamically loadable accelerator modules.
>>
>> This would allow to distribute a single QEMU base binary, and then provide accelerators as optional additional binary packages to install,
>> with the first separate optional package being TCG.
>>
>> CONFIG_TCG would become 'm' as a result, but then also CONFIG_KVM, CONFIG_HAX, CONFIG_WHPX, CONFIG_HVF.
>>
>> Here are some elements that seem to be needed:
>>
>> 1 - The module CONFIG_MODULE part of the build system would need some extension to add per-target modules. I have some tentative results that shows that this is possible (but a bit clunky atm).
>>     There is some existing instability in the existing Makefile infrastructure of modules that shows up when trying to extend it.
>>
>> 2 - new "accelerator drivers" seems to be needed, either in addition or as additional functionality inside the current AccelState.
>>
>> 3 - for target/i386 in particular, there is some refactoring work needed to split even more different unrelated bits and pieces.
>>     dependencies of hw/i386 machine stuff with accelerator-specific
>> stuff are also painful.
> 
> There are a couple of per-arch hacks in the main TCG cpu loops it would
> be good to excise from the code.
> 
>>
>> 4 - CPU Arch Classes could be extended with per-accelerator methods. Initial fooling around shows it should probably work.
>>     One alternative would be trying to play with the dynamic linker (weak symbols, creative use of dlsym etc), but I have not sorted out the details of this option.
>>
>> 5 - cputlb, in particular tlb_flush and friends is a separate problem
>> since it is not part of the cpuclass. Should it be?
> 
> tlb_flush and friends are TCG implementation details for softmmu that
> ensure a lookup for any address will always return to a guest specific
> tlb_fill to rebuild the cache. The behaviour is not guest-specific
> per-se although we do partition the table entries based on the guest
> size.
> 
> Perhaps we can make it more dynamic but it would have to ensure both the
> slow path and the fast path are using the same mask and shifts when
> walking the table.
> 
>> 6 - a painpoint is represented by the fact that in USER mode, the accel class is not applied, which causes a lot of uncleanliness all around
>>     (tcg_allowed outside of the AccelClass).
> 
> The user-mode run loops are a whole separate chunk of code. I don't know
> if it is worth trying to make them plugable as you will only ever have
> one per linux-user binary.
> 
>> 7 - I have not really thought about the KConfig aspects because I am not super-familiar
>>
>> 8 - cpus.c needs some good splitting
> 
> Although there are several different run loops in there I think they
> share a lot of commonality which is why they are bundled together. They
> all share the same backend for dealing with ioevents and generic
> pause/unpause machinery. But feel free to prove me wrong ;-)

Hi Alex, I got my first compile, and it is currently on GitHub. I still need to split the series, though, and there is still cleanup needed.

https://github.com/hw-claudio/qemu.git
branch "cpus-refactoring"

just in case you are interested in a peek.

The idea currently results in:

 cpu-throttle.c                |  154 +++++++++
 cpu-timers.c                  |  784 +++++++++++++++++++++++++++++++++++++++++++++
 cpus-tcg.c                    |  515 ++++++++++++++++++++++++++++++
 cpus.c                        | 1405 +++++----------------------------------------------------------------------------

New interfaces are in:

include/sysemu/cpu-throttle.h |   50 +++
include/sysemu/cpu-timers.h   |   73 +++++
include/sysemu/cpus.h         |   44 ++-

cpu-throttle (new) is self-explanatory: it contains the cpu-throttle code from cpus.c.
cpu-timers (new) contains the icount, ticks, and clock timers from cpus.c.

cpus.h adds an interface for per-accel vcpu threads:

void qemu_register_start_vcpu(void (*f)(CPUState *cpu));
bool all_cpu_threads_idle(void);
bool cpu_can_run(CPUState *cpu);
void qemu_wait_io_event_common(CPUState *cpu);
void qemu_wait_io_event(CPUState *cpu);
void cpu_thread_signal_created(CPUState *cpu);
void cpu_thread_signal_destroyed(CPUState *cpu);
void cpu_handle_guest_debug(CPUState *cpu);

Very much still all WIP...

Ciao,

C


> 
>> ... more things to find out and think about ...
>>
>> Overall, I think that the activity has the potential to provide benefits overall beyond the actual goal, in the form of cleanups, leaner configurations,
>> minor fixes, maybe improving the CONFIG_MODULE instabilities if any
>> etc.
> 
> There are certainly some ugly bits we could shave down in such an
> exercise.
> 
>> As an example, the first activity I would plan to submit as RFC is point 8 above,
>> there is the split between cpus.c and cpus-tcg.c that results in lots of TCG-specific code being removed from non-tcg builds (using CONFIG_TCG).
>>
>> One thing that should be kept in check is any performance impact of
>> the changes, in particular for point 4, hot paths should probably
>> avoid going through too many pointer indirections.
> 
> Maybe - certainly for TCG you have pretty much lost if you have exited
> the main execution loop I doubt it would show up much. Not so sure about
> the HW accelerators. Most of the performance sensitive stuff is dealt
> with close to the ioctls IIRC.
> 
>> Does anybody share similar goals? Any major obstacle or blocker that would put the feasibility into question?
>> Any suggestion on any of this? In particular point 4 and 5 come to
>> mind, as well as some better understanding of the reasons behind 6, or
>> even suggestions on how to best to 2.
> 
> For linux-user each CPU run loop is it's own special snowflake because
> pretty much every architecture has it's own special set of EXCP_ exits
> to deal with various bits. There are per-arch EXCP_ flags for system
> emulation as well but not nearly as many.
> 
>>
>> Anyway, I will continue to work on the first RFC for some smaller initial steps and hopefully have something to submit soon.
>>
>> Ciao ciao,
>>
>> Claudio
> 
> 




* Re: sharing intention for developing per-target, dynamically loadable accelerator modules
  2020-05-19  7:53   ` Claudio Fontana
@ 2020-05-21 11:28     ` Claudio Fontana
  0 siblings, 0 replies; 4+ messages in thread
From: Claudio Fontana @ 2020-05-21 11:28 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Peter Maydell, Philippe Mathieu-Daudé, qemu-devel, Paolo Bonzini

On 5/19/20 9:53 AM, Claudio Fontana wrote:
> On 5/18/20 8:18 PM, Alex Bennée wrote:
>>
>> Claudio Fontana <cfontana@suse.de> writes:
>>
>>> Hello all,
>>>
>>> my intention would be to develop per-target, dynamically loadable accelerator modules.
>>>
>>> This would allow to distribute a single QEMU base binary, and then provide accelerators as optional additional binary packages to install,
>>> with the first separate optional package being TCG.
>>>
>>> CONFIG_TCG would become 'm' as a result, but then also CONFIG_KVM, CONFIG_HAX, CONFIG_WHPX, CONFIG_HVF.
>>>
>>> Here are some elements that seem to be needed:
>>>
>>> 1 - The module CONFIG_MODULE part of the build system would need some extension to add per-target modules. I have some tentative results that shows that this is possible (but a bit clunky atm).
>>>     There is some existing instability in the existing Makefile infrastructure of modules that shows up when trying to extend it.
>>>
>>> 2 - new "accelerator drivers" seems to be needed, either in addition or as additional functionality inside the current AccelState.
>>>
>>> 3 - for target/i386 in particular, there is some refactoring work needed to split even more different unrelated bits and pieces.
>>>     dependencies of hw/i386 machine stuff with accelerator-specific
>>> stuff are also painful.
>>
>> There are a couple of per-arch hacks in the main TCG cpu loops it would
>> be good to excise from the code.
>>
>>>
>>> 4 - CPU Arch Classes could be extended with per-accelerator methods. Initial fooling around shows it should probably work.
>>>     One alternative would be trying to play with the dynamic linker (weak symbols, creative use of dlsym etc), but I have not sorted out the details of this option.
>>>
>>> 5 - cputlb, in particular tlb_flush and friends is a separate problem
>>> since it is not part of the cpuclass. Should it be?
>>
>> tlb_flush and friends are TCG implementation details for softmmu that
>> ensure a lookup for any address will always return to a guest specific
>> tlb_fill to rebuild the cache. The behaviour is not guest-specific
>> per-se although we do partition the table entries based on the guest
>> size.
>>
>> Perhaps we can make it more dynamic but it would have to ensure both the
>> slow path and the fast path are using the same mask and shifts when
>> walking the table.
>>
>>> 6 - a painpoint is represented by the fact that in USER mode, the accel class is not applied, which causes a lot of uncleanliness all around
>>>     (tcg_allowed outside of the AccelClass).
>>
>> The user-mode run loops are a whole separate chunk of code. I don't know
>> if it is worth trying to make them plugable as you will only ever have
>> one per linux-user binary.
>>
>>> 7 - I have not really thought about the KConfig aspects because I am not super-familiar
>>>
>>> 8 - cpus.c needs some good splitting
>>
>> Although there are several different run loops in there I think they
>> share a lot of commonality which is why they are bundled together. They
>> all share the same backend for dealing with ioevents and generic
>> pause/unpause machinery. But feel free to prove me wrong ;-)
> 
> Hi Alex, I got my first compile, and it is currently in github, I still need to split the series though and there is still cleanup needed.
> 
> https://github.com/hw-claudio/qemu.git
> branch "cpus-refactoring"
> 
> just in case you are interested in a peek.
> 
> The idea results currently in:
> 
>  cpu-throttle.c                |  154 +++++++++
>  cpu-timers.c                  |  784 +++++++++++++++++++++++++++++++++++++++++++++
>  cpus-tcg.c                    |  515 ++++++++++++++++++++++++++++++
>  cpus.c                        | 1405 +++++----------------------------------------------------------------------------
> 
> New interfaces are in:
> 
> include/sysemu/cpu-throttle.h |   50 +++
> include/sysemu/cpu-timers.h   |   73 +++++
> include/sysemu/cpus.h         |   44 ++-
> 
> cpu-throttle (new) is self-explanatory, contains the cpu-throttle in cpus.c
> cpu-timers (new) contains the icount, ticks, clock timers from cpus.c
> 
> cpus.h adds an interface to per-accel vcpu threads:
> 
> qemu_register_start_vcpu(void (*f)(CPUState *cpu));
> bool all_cpu_threads_idle(void);
> bool cpu_can_run(CPUState *cpu);
> void qemu_wait_io_event_common(CPUState *cpu);
> void qemu_wait_io_event(CPUState *cpu);
> void cpu_thread_signal_created(CPUState *cpu);
> void cpu_thread_signal_destroyed(CPUState *cpu);
> void cpu_handle_guest_debug(CPUState *cpu);
> 
> Very much still all WIP...
> 
> Ciao,
> 
> C
> 
> 
>>
>>> ... more things to find out and think about ...
>>>
>>> Overall, I think that the activity has the potential to provide benefits overall beyond the actual goal, in the form of cleanups, leaner configurations,
>>> minor fixes, maybe improving the CONFIG_MODULE instabilities if any
>>> etc.
>>
>> There are certainly some ugly bits we could shave down in such an
>> exercise.
>>
>>> As an example, the first activity I would plan to submit as RFC is point 8 above,
>>> there is the split between cpus.c and cpus-tcg.c that results in lots of TCG-specific code being removed from non-tcg builds (using CONFIG_TCG).
>>>
>>> One thing that should be kept in check is any performance impact of
>>> the changes, in particular for point 4, hot paths should probably
>>> avoid going through too many pointer indirections.
>>
>> Maybe - certainly for TCG you have pretty much lost if you have exited
>> the main execution loop I doubt it would show up much. Not so sure about
>> the HW accelerators. Most of the performance sensitive stuff is dealt
>> with close to the ioctls IIRC.
>>
>>> Does anybody share similar goals? Any major obstacle or blocker that would put the feasibility into question?
>>> Any suggestion on any of this? In particular point 4 and 5 come to
>>> mind, as well as some better understanding of the reasons behind 6, or
>>> even suggestions on how to best to 2.
>>
>> For linux-user each CPU run loop is it's own special snowflake because
>> pretty much every architecture has it's own special set of EXCP_ exits
>> to deal with various bits. There are per-arch EXCP_ flags for system
>> emulation as well but not nearly as many.
>>
>>>
>>> Anyway, I will continue to work on the first RFC for some smaller initial steps and hopefully have something to submit soon.
>>>
>>> Ciao ciao,
>>>
>>> Claudio
>>
>>

I have tried something a bit more ambitious at

https://github.com/hw-claudio/qemu.git branch: "cpus-interface"

starting to look nice, with each single accelerator registering an interface to cpus.c,
so that the cpu thread can reside within that specific accelerator.

Some workarounds are eliminated as a result as well, for example for the kick. It also removes some Windows special-casing.

(Note: while doing this I noticed that Windows support was initially bolted on for HAX, so there might be some assumptions there that became wrong with the addition of WHPX; in particular, the DUMMY_APC thing. I left the "bug" there, but added a comment so it can be discussed.)

The most difficult part is splitting the patch: everything is so knit together that doing "one accel at a time", for example, is really challenging.

I will try to post what I have, and then look for feedback on how to split it, if necessary, for easier reviewing; but some comments on the approach would already be valuable.

Ciao,

C





