linux-arm-kernel.lists.infradead.org archive mirror
* [Question] How to test SDEI client driver
@ 2020-06-30  5:17 Gavin Shan
  2020-07-01 11:57 ` James Morse
  0 siblings, 1 reply; 11+ messages in thread
From: Gavin Shan @ 2020-06-30  5:17 UTC (permalink / raw)
  To: linux-arm-kernel; +Cc: mark.rutland, gshan, james.morse

Hi Folks,

I'm currently looking into the SDEI client driver and reworking it so that
it can provide capability/services to arm64/kvm to get it virtualized. The
primary reason is we want to use SDEI to deliver the asynchronous page fault
notification from host to guest.

The rework of the SDEI client driver, including some cleanup, is almost
done. Currently, I'm not sure how to test/verify the client driver. Any
suggestions regarding this are appreciated.

It seems that TRF (Trusted Firmware) is the only firmware with SDEI service
implemented and supported. If so, does it mean I need to install TRF on my
bare metal machine? I'm wondering how it can be installed and not sure if
there is any document about this.

Besides, GHES seems the only user of SDEI in the linux kernel. If so, is
there a way to inject the relevant errors and how?

Thanks in advance for your comments and suggestions.

Thanks,
Gavin



* Re: [Question] How to test SDEI client driver
  2020-06-30  5:17 [Question] How to test SDEI client driver Gavin Shan
@ 2020-07-01 11:57 ` James Morse
  2020-07-03  0:26   ` Gavin Shan
  0 siblings, 1 reply; 11+ messages in thread
From: James Morse @ 2020-07-01 11:57 UTC (permalink / raw)
  To: Gavin Shan, linux-arm-kernel; +Cc: mark.rutland

Hi Gavin,

On 30/06/2020 06:17, Gavin Shan wrote:
> I'm currently looking into the SDEI client driver and reworking it so that
> it can provide capability/services to arm64/kvm to get it virtualized.

What do you mean by virtualised? The expectation is the VMM would implement the 'firmware'
side of this. 'events' are most likely to come from the VMM, and having to handshake with
the kernel to work out if the event you want to inject is registered and enabled is
over-complicated. Supporting it in the VMM means you can notify a different vCPU if that
is appropriate, or take a different action if the event isn't registered.

This was all blocked on finding a future-proof way for tools like Qemu to consume
reference code from ATF.


> The
> primary reason is we want to use SDEI to deliver the asynchronous page fault
> notification from host to guest.

As an NMI?! Yuck!
The SDEI handler reads memory, you'd need to stop it being re-entrant. It exits through
the IRQ vector, (which is necessary for forward-progress given a synchronous RAS event,
and for KVM to trigger guest-exit before the 'real' work that is offloaded to an irq
handler can run); it's going to be 'fun' to have any guarantee of forward-progress if this
is involved with stage2.


> The rework of the SDEI client driver, including some cleanup,

Cleanup is always welcome.


> is almost
> done. Currently, I'm not sure how to test/verify the client driver.

...

> Any suggestions regarding this are appreciated.

> It seems that TRF (Trusted Firmware) is the only firmware with SDEI service
> implemented and supported.

This project calls itself TF-A. ATF is the other widely used name. (I've never seen TRF
before)


> If so, does it mean I need to install TRF on my bare metal machine? 
> I'm wondering how it can be installed and not sure if
> there is any document about this.

Firmware should come with the platform. You'd need to know intricate details about power
management and initialising parts of the SoC to port it.

ATF has a port for the fast-model/foundation model. I test this with ATF in the fast-model.


> Besides, GHES seems the only user of SDEI in the linux kernel. If so, is
> there a way to inject the relevant errors and how?

It is, and unfortunately last time I checked, upstream ATF doesn't have the firmware-first
stuff for this. It's too SoC specific.

I test this by binding the fast-model's SP804 one-shot interrupt controller as an event,
then plumbing that into GHES. It's more of a case study in why the bindable-irq stuff is
nasty than a usable error-injection method.
I can push the most recently rebased version of this, but you'd also need to hack-up a
HEST table with GHES entries to actually get it running.
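
For the client driver on its own, a throwaway module along these lines is usually enough
to see an event fire. This is only a sketch: the event number is whatever your firmware
exposes (or whatever the interrupt-bind hack above gives you), and the calls are the
in-kernel client API from include/linux/arm_sdei.h, so check them against your tree:

#include <linux/arm_sdei.h>
#include <linux/module.h>
#include <linux/printk.h>
#include <linux/ptrace.h>
#include <linux/types.h>

#define TEST_EVENT_NUM	0	/* placeholder: use the event your firmware exposes */

static int test_sdei_cb(u32 event, struct pt_regs *regs, void *arg)
{
	/* NMI-like context: do as little as possible here */
	pr_err("SDEI event %u fired, interrupted pc=%llx\n",
	       event, (unsigned long long)regs->pc);
	return 0;
}

static int __init test_sdei_init(void)
{
	int err;

	err = sdei_event_register(TEST_EVENT_NUM, test_sdei_cb, NULL);
	if (err)
		return err;

	err = sdei_event_enable(TEST_EVENT_NUM);
	if (err)
		sdei_event_unregister(TEST_EVENT_NUM);
	return err;
}

static void __exit test_sdei_exit(void)
{
	sdei_event_disable(TEST_EVENT_NUM);
	sdei_event_unregister(TEST_EVENT_NUM);
}

module_init(test_sdei_init);
module_exit(test_sdei_exit);
MODULE_LICENSE("GPL");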


But, unless you are working on EL3 firmware, or a VMM, I don't think SDEI is what you want.
What problem are you trying to solve?


Thanks,

James


* Re: [Question] How to test SDEI client driver
  2020-07-01 11:57 ` James Morse
@ 2020-07-03  0:26   ` Gavin Shan
  2020-07-08 16:11     ` James Morse
  0 siblings, 1 reply; 11+ messages in thread
From: Gavin Shan @ 2020-07-03  0:26 UTC (permalink / raw)
  To: James Morse, linux-arm-kernel; +Cc: mark.rutland, pbonzini, maz

Hi James,

On 7/1/20 9:57 PM, James Morse wrote:
> On 30/06/2020 06:17, Gavin Shan wrote:
>> I'm currently looking into the SDEI client driver and reworking it so that
>> it can provide capability/services to arm64/kvm to get it virtualized.
> 
> What do you mean by virtualised? The expectation is the VMM would implement the 'firmware'
> side of this. 'events' are most likely to come from the VMM, and having to handshake with
> the kernel to work out if the event you want to inject is registered and enabled is
> over-complicated. Supporting it in the VMM means you can notify a different vCPU if that
> is appropriate, or take a different action if the event isn't registered.
> 
> This was all blocked on finding a future-proof way for tools like Qemu to consume
> reference code from ATF.
> 

Sorry that I didn't explain the background last time. We plan to use SDEI to
deliver the notification (signal) from host to guest, needed by the asynchronous
page fault feature. The RFCv2 patchset was posted a while ago [1]. For the SDEI
events needed by the async page fault, they originate from KVM (the host). In order
to achieve the goal, KVM needs some code so that SDEI events can be injected and
delivered. Also, the SDEI-related hypercalls need to be handled as well.
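
For reference, by "SDEI-related hypercalls" I mean the SDEI 1.0 function ID space. A
sketch of the main calls a KVM-side implementation would have to cover is below; the
numbering follows the 0xC4000020 base used by include/uapi/linux/arm_sdei.h, but please
treat the exact values as illustrative and double-check them against the header/spec:

#define SDEI_1_0_FN_BASE			0xC4000020
#define SDEI_1_0_FN(n)				(SDEI_1_0_FN_BASE + (n))

#define SDEI_1_0_FN_SDEI_VERSION		SDEI_1_0_FN(0x00)
#define SDEI_1_0_FN_SDEI_EVENT_REGISTER		SDEI_1_0_FN(0x01)
#define SDEI_1_0_FN_SDEI_EVENT_ENABLE		SDEI_1_0_FN(0x02)
#define SDEI_1_0_FN_SDEI_EVENT_DISABLE		SDEI_1_0_FN(0x03)
#define SDEI_1_0_FN_SDEI_EVENT_CONTEXT		SDEI_1_0_FN(0x04)
#define SDEI_1_0_FN_SDEI_EVENT_COMPLETE		SDEI_1_0_FN(0x05)
#define SDEI_1_0_FN_SDEI_EVENT_COMPLETE_AND_RESUME	SDEI_1_0_FN(0x06)
#define SDEI_1_0_FN_SDEI_EVENT_UNREGISTER	SDEI_1_0_FN(0x07)
#define SDEI_1_0_FN_SDEI_EVENT_STATUS		SDEI_1_0_FN(0x08)
#define SDEI_1_0_FN_SDEI_EVENT_GET_INFO		SDEI_1_0_FN(0x09)
#define SDEI_1_0_FN_SDEI_EVENT_ROUTING_SET	SDEI_1_0_FN(0x0A)
#define SDEI_1_0_FN_SDEI_PE_MASK		SDEI_1_0_FN(0x0B)
#define SDEI_1_0_FN_SDEI_PE_UNMASK		SDEI_1_0_FN(0x0C)
/* ...plus the interrupt bind/release and private/shared reset calls */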

Since we're here, I plan to expand the scope so that the firmware-owned SDEI
events (private/shared) can be passed through to multiple VMs. Let's call them
passthrough events. These passthrough events can also be shared by multiple
VMs.

I would call this kind of feature SDEI virtualization, but there might be a
better name for it :)

[1] https://lore.kernel.org/kvmarm/924ee966-7412-f9ff-c2b0-598e4abbb05c@redhat.com/

> 
>> The
>> primary reason is we want to use SDEI to deliver the asynchronous page fault
>> notification from host to guest.
> 
> As an NMI?! Yuck!
> The SDEI handler reads memory, you'd need to stop it being re-entrant. It exits through
> the IRQ vector, (which is necessary for forward-progress given a synchronous RAS event,
> and for KVM to trigger guest-exit before the 'real' work that is offloaded to an irq
> handler can run); it's going to be 'fun' to have any guarantee of forward-progress if this
> is involved with stage2.
> 

Yeah, it's something similar to NMI. The notification (signal) has to be
delivered in synchronous mode. Yes, the SDEI specification already mentioned
this: the client handler should have all required resources in place before
the handler is going to run. However, I don't see it as a problem so far.
Let's wait and see if it's a real issue until I post the RFC patchset :)

> 
>> The rework of the SDEI client driver, including some cleanup,
> 
> Cleanup is always welcome.
> 

Thanks, James.

> 
>> is almost
>> done. Currently, I'm not sure how to test/verify the client driver.
> 
> ...
> 
>> Any suggestions regarding this are appreciated.
> 
>> It seems that TRF (Trusted Firmware) is the only firmware with SDEI service
>> implemented and supported.
> 
> This project calls itself TF-A. ATF is the other widely used name. (I've never seen TRF
> before)
> 

Yeah, I must have provided the wrong name. Here is the git repo I was
looking into:

    https://github.com/ARM-software/arm-trusted-firmware

> 
>> If so, does it mean I need to install TRF on my bare metal machine?
>> I'm wondering how it can be installed and not sure if
>> there is any document about this.
> 
> Firmware should come with the platform. You'd need to know intricate details about power
> management and initialising parts of the SoC to port it.
> 
> ATF has a port for the fast-model/foundation model. I test this with ATF in the fast-model.
> 

I have no idea about the fast-model and foundation model, and I got nothing
from the commands below in the ATF git repo:

[gwshan@localhost atf]$ git grep -i fast | grep -i model
[gwshan@localhost atf]$ git grep -i fundation | grep -i model

> 
>> Besides, GHES seems the only user of SDEI in the linux kernel. If so, is
>> there a way to inject the relevant errors and how?
> 
> It is, and unfortunately last time I checked, upstream ATF doesn't have the firmware-first
> stuff for this. It's too SoC specific.
> 
> I test this by binding the fast-model's SP804 one-shot interrupt controller as an event,
> then plumbing that into GHES. It's more of a case study in why the bindable-irq stuff is
> nasty than a usable error-injection method.
> I can push the most recently rebased version of this, but you'd also need to hack-up a
> HEST table with GHES entries to actually get it running.
> > 
> But, unless you are working on EL3 firmware, or a VMM, I don't think SDEI is what you want.
> What problem are you trying to solve?
> 

Thanks for the information. It seems I also need to emulate the SDEI event
myself in order to test it. The best way for me is to inject the SDEI event
from KVM. By the way, is the code you have part of the firmware used by a
bare-metal machine or a VM?

The issue we want to resolve is to deliver async page fault notification
as mentioned above. Please let me know if there are more concerns :)

Thanks,
Gavin




* Re: [Question] How to test SDEI client driver
  2020-07-03  0:26   ` Gavin Shan
@ 2020-07-08 16:11     ` James Morse
  2020-07-08 16:49       ` Paolo Bonzini
  0 siblings, 1 reply; 11+ messages in thread
From: James Morse @ 2020-07-08 16:11 UTC (permalink / raw)
  To: Gavin Shan; +Cc: mark.rutland, pbonzini, linux-arm-kernel, maz

Hi Gavin,

On 03/07/2020 01:26, Gavin Shan wrote:
> On 7/1/20 9:57 PM, James Morse wrote:
>> On 30/06/2020 06:17, Gavin Shan wrote:
>>> I'm currently looking into the SDEI client driver and reworking it so that
>>> it can provide capability/services to arm64/kvm to get it virtualized.
>>
>> What do you mean by virtualised? The expectation is the VMM would implement the 'firmware'
>> side of this. 'events' are most likely to come from the VMM, and having to handshake with
>> the kernel to work out if the event you want to inject is registered and enabled is
>> over-complicated. Supporting it in the VMM means you can notify a different vCPU if that
>> is appropriate, or take a different action if the event isn't registered.
>>
>> This was all blocked on finding a future-proof way for tools like Qemu to consume
>> reference code from ATF.

> Sorry that I didn't explain the background last time. We plan to use SDEI to
> deliver the notification (signal) from host to guest, needed by the asynchronous
> page fault feature. The RFCv2 patchset was posted a while ago [1].

Thanks. So this is to hint to the guest that you'd swapped its memory to disk. Yuck.

When would you do this?

Surely this is "performance of an over-committed host sucks".

~

Isn't this roughly equivalent to SMT CPUs taking a cache-miss? ...
If you pinned two vCPUs to one physical CPU, the host:scheduler would multiplex between
them. If one couldn't do useful work because it was waiting for memory, the other gets
all the slack time. (the TLB maintenance would hurt, but not as much as waiting for the disk)
The good news is the guest:scheduler already knows how to deal with this!
(and, it works for other OS too)


Wouldn't it be better to let the guest make the swapping decision? You could provide a
fast virtio swap device to the guest that is backed by maybe-swapped host memory. (you'd
need to get the host to swap the block device in preference to the guest memory, or
mlock() it)
The guest gets great performance, unless its swap was actually swapped. It might even be
possible to do this without a guest exit!
(I'm not aware of a way for user-space to give a preference on what gets swapped)

Done like this, you don't pay the penalty when the guest tries to swap out a page that the
host had already swapped.


I think re-using some of these existing concepts would be better than something that is
linux+kvm+aarch64 specific.


> For the SDEI
> events needed by the async page fault, they originate from KVM (the host). In order
> to achieve the goal, KVM needs some code so that SDEI events can be injected and
> delivered. Also, the SDEI-related hypercalls need to be handled as well.

I avoided doing this because it makes it massively complicated for the VMM. All that
in-kernel state now has to be migrated. KVM has to expose APIs to let the VMM inject
events, which gets nasty for shared events where some CPUs are masked, and others aren't.

Having something like Qemu drive the reference code from TFA is the right thing to do for
SDEI.


> Since we're here, I plan to expand the scope so that the firmware-owned SDEI
> events (private/shared) can be passed through to multiple VMs. Let's call them
> passthrough events. These passthrough events can also be shared by multiple
> VMs.

Why? Do you have an example where that is necessary?

This stuff is for things firmware needs to tell the OS urgently, e.g. RAS events,
platform over-temperature, or the reboot watchdog being about to fire.

I can't think of anything that firmware would know about, that a guest needs to know. It
violates the isolation and abstraction that running stuff in a guest is all about!


RAS events come the closest. For RAS events the host has to handle the error first, then
it notifies the VMM like linux would for any user-space process. The VMM can then, at its
option, replay the event into the guest using whatever mechanism it likes.
This decoupling is important to ensure the VMM does not need to know how the host learns
about RAS errors, and has free choice over how it tells the guest.


>>> The
>>> primary reason is we want to use SDEI to deliver the asynchronous page fault
>>> notification from host to guest.
>>
>> As an NMI?! Yuck!
>> The SDEI handler reads memory, you'd need to stop it being re-entrant. It exits through
>> the IRQ vector, (which is necessary for forward-progress given a synchronous RAS event,
>> and for KVM to trigger guest-exit before the 'real' work that is offloaded to an irq
>> handler can run); it's going to be 'fun' to have any guarantee of forward-progress if this
>> is involved with stage2.

> Yeah, it's something similar to NMI.

AArch64 doesn't define an NMI, but we use the term for anything that interrupts IRQ-masked
code. You want to schedule(), which you can't do from an NMI.


> The notification (signal) has to be delivered in synchronous mode.

Heh, so you're using SDEI to get into the IRQ handler synchronously, so you can
reschedule. You don't actually want the NMI properties, only the software defined
synchronous exception.


> Yes, the SDEI specification already mentioned
> this: the client handler should have all required resources in place before
> the handler is going to run. However, I don't see it as a problem so far.

What if they are swapped out? This thing becomes re-entrant ... which the spec forbids.
The host has no clue what is in guest memory.


> Let's wait and see if it's a real issue until I post the RFC patchset :)

It's not really a try-it-and-see thing!

[...]

>>> It seems that TRF (Trusted Firmware) is the only firmware with SDEI service
>>> implemented and supported.
>>
>> This project calls itself TF-A. ATF is the other widely used name. (I've never seen TRF
>> before)
>>
> 
> Yeah, I must have provided the wrong name. Here is the git repo I was
> looking into:
> 
>    https://github.com/ARM-software/arm-trusted-firmware
> 
>>
>>> If so, does it mean I need to install TRF on my bare metal machine?
>>> I'm wondering how it can be installed and not sure if
>>> there is any document about this.
>>
>> Firmware should come with the platform. You'd need to know intricate details about power
>> management and initialising parts of the SoC to port it.
>>
>> ATF has a port for the fast-model/foundation model. I test this with ATF in the fast-model.

> I have no idea about the fast-model and foundation model, and I got nothing
> from the commands below in the ATF git repo:

What are you using to test your kernel changes? Can it run EL3 software?

The foundation model can be downloaded here:
https://developer.arm.com/tools-and-software/simulation-models/fixed-virtual-platforms/arm-ecosystem-models


> [gwshan@localhost atf]$ git grep -i fast | grep -i model
> [gwshan@localhost atf]$ git grep -i fundation | grep -i model

The typo is why. Swap:

| morse@eglon:~/model/mpam/arm-trusted-firmware$ git grep -i foundation | grep model
| fdts/fvp-foundation-gicv2-psci.dts:     model = "FVP Foundation";
| fdts/fvp-foundation-gicv3-psci.dts:     model = "FVP Foundation";

'fvp' is the name atf uses for the platform.

The runes I had to build it with SDEI support are:
| make DEBUG=1 PLAT=fvp SDEI_SUPPORT=1 EL3_EXCEPTION_HANDLING=1 fip all



>>> Besides, GHES seems the only user of SDEI in the linux kernel. If so, is
>>> there a way to inject the relevant errors and how?
>>
>> It is, and unfortunately last time I checked, upstream ATF doesn't have the firmware-first
>> stuff for this. It's too SoC specific.
>>
>> I test this by binding the fast-model's SP804 one-shot interrupt controller as an event,
>> then plumbing that into GHES. It's more of a case study in why the bindable-irq stuff is
>> nasty than a usable error-injection method.
>> I can push the most recently rebased version of this, but you'd also need to hack-up a
>> HEST table with GHES entries to actually get it running.
>> But, unless you are working on EL3 firmware, or a VMM, I don't think SDEI is what you
>> want.
>> What problem are you trying to solve?

> Thanks for the information. It seems I also need to emulate the SDEI event
> myself in order to test it. The best way for me is to inject the SDEI event
> from KVM. By the way, is the code you have part of the firmware used by a
> bare-metal machine or a VM?
> 
> The issue we want to resolve is to deliver async page fault notification
> as mentioned above. Please let me know if there are more concerns :)

Re-entrancy and forward progress.

I'd love to know why additional complexity to tell the guest this stuff is better than the
two approaches described above.


Thanks,

James


* Re: [Question] How to test SDEI client driver
  2020-07-08 16:11     ` James Morse
@ 2020-07-08 16:49       ` Paolo Bonzini
  2020-07-09  5:33         ` Gavin Shan
  2020-07-09 18:30         ` James Morse
  0 siblings, 2 replies; 11+ messages in thread
From: Paolo Bonzini @ 2020-07-08 16:49 UTC (permalink / raw)
  To: James Morse, Gavin Shan; +Cc: mark.rutland, maz, linux-arm-kernel

On 08/07/20 18:11, James Morse wrote:
> Hi Gavin,
> 
> On 03/07/2020 01:26, Gavin Shan wrote:
>> On 7/1/20 9:57 PM, James Morse wrote:
>>> On 30/06/2020 06:17, Gavin Shan wrote:
>>>> I'm currently looking into the SDEI client driver and reworking it so that
>>>> it can provide capability/services to arm64/kvm to get it virtualized.
>>>
>>> What do you mean by virtualised? The expectation is the VMM would implement the 'firmware'
>>> side of this. 'events' are most likely to come from the VMM, and having to handshake with
>>> the kernel to work out if the event you want to inject is registered and enabled is
>>> over-complicated. Supporting it in the VMM means you can notify a different vCPU if that
>>> is appropriate, or take a different action if the event isn't registered.
>>>
>>> This was all blocked on finding a future-proof way for tools like Qemu to consume
>>> reference code from ATF.
> 
>> Sorry that I didn't explain the background last time. We plan to use SDEI to
>> deliver the notification (signal) from host to guest, needed by the asynchronous
>> page fault feature. The RFCv2 patchset was posted a while ago [1].
> 
> Thanks. So this is to hint to the guest that you'd swapped its memory to disk. Yuck.
> 
> When would you do this?

These days, the main reason is on-demand paging with live migration.
Instead of waiting to have a consistent version of guest memory on the
destination, memory that the guest has dirtied can be copied on demand
from source to destination while the guest is running.  Letting the
guest reschedule is surprisingly effective in this case, especially with
workloads that have a lot of threads.

> Isn't this roughly equivalent to SMT CPUs taking a cache-miss? ...
> If you pinned two vCPUs to one physical CPU, the host:scheduler would multiplex between
> them. If one couldn't do useful work because it was waiting for memory, the other gets
> all the slack time. (the TLB maintenance would hurt, but not as much as waiting for the disk)
> The good news is the guest:scheduler already knows how to deal with this!
> (and, it works for other OS too)

The order of magnitude of both the wait and the reschedule is too
different for SMT heuristics to be applicable here.  Especially, two SMT
pCPUs compete equally for fetch resources, while two vCPUs pinned to the
same pCPU would only reschedule a few hundred times per second.  Latency
would be in the milliseconds and jitter would be horrible.

> Wouldn't it be better to let the guest make the swapping decision? 
> You could provide a fast virtio swap device to the guest that is
> backed by maybe-swapped host memory.
I think you are describing something similar to "transcendent memory",
which Xen implemented about 10 years ago
(https://lwn.net/Articles/454795/).  Unfortunately you've probably never
heard about it for good reasons. :)

The main showstopper is that you cannot rely on guest cooperation (also
because it works surprisingly well without).

>> For the SDEI
>> events needed by the async page fault, they originate from KVM (the host). In order
>> to achieve the goal, KVM needs some code so that SDEI events can be injected and
>> delivered. Also, the SDEI-related hypercalls need to be handled as well.
> 
> I avoided doing this because it makes it massively complicated for the VMM. All that
> in-kernel state now has to be migrated. KVM has to expose APIs to let the VMM inject
> events, which gets nasty for shared events where some CPUs are masked, and others aren't.
> 
> Having something like Qemu drive the reference code from TFA is the right thing to do for
> SDEI.

Are there usecases for injecting SDEIs from QEMU?

If not, it can be done much more easily with KVM (and it would also
be really, really slow if each page fault had to be redirected
through QEMU), which wouldn't have more than a handful of SDEI events.
The in-kernel state is 4 64-bit values (EP address and argument, flags,
affinity) per event.
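
For concreteness, something like the following per-event struct is what I have in mind; it
is a hypothetical sketch, not existing code, and the VMM would save/restore it like any
other vCPU/VM state:

/* Hypothetical per-event state KVM would keep and the VMM would migrate */
struct kvm_sdei_event_state {
	__u64 ep_address;	/* handler entry point registered by the guest */
	__u64 ep_arg;		/* opaque argument passed to the handler */
	__u64 flags;		/* registered/enabled/pending/running, routing mode */
	__u64 affinity;		/* target PE for shared events routed to one PE */
};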

>> Yes, the SDEI specification already mentioned
>> this: the client handler should have all required resources in place before
>> the handler is going to run. However, I don't see it as a problem so far.
>
> What if they are swapped out? This thing becomes re-entrant ... which the spec forbids.
> The host has no clue what is in guest memory.

On x86 we don't do the notification if interrupts are disabled.  On ARM
I guess you'd do the same until SDEI_EVENT_COMPLETE (so yeah that would
be some state that has to be migrated).  In fact it would be nice if
SDEI_EVENT_COMPLETE meant "wait for synchronous page-in" while
SDEI_EVENT_COMPLETE_AND_RESUME meant "handle it asynchronously".

>> Let's wait and see if it's a real issue until I post the RFC patchset :)
> 
> It's not really a try-it-and-see thing!
On this we agree. ;)

Paolo



* Re: [Question] How to test SDEI client driver
  2020-07-08 16:49       ` Paolo Bonzini
@ 2020-07-09  5:33         ` Gavin Shan
  2020-07-09 18:31           ` James Morse
  2020-07-09 18:30         ` James Morse
  1 sibling, 1 reply; 11+ messages in thread
From: Gavin Shan @ 2020-07-09  5:33 UTC (permalink / raw)
  To: Paolo Bonzini, James Morse; +Cc: mark.rutland, maz, linux-arm-kernel

Hi James and Paolo,

On 7/9/20 2:49 AM, Paolo Bonzini wrote:
> On 08/07/20 18:11, James Morse wrote:
>> On 03/07/2020 01:26, Gavin Shan wrote:
>>> On 7/1/20 9:57 PM, James Morse wrote:
>>>> On 30/06/2020 06:17, Gavin Shan wrote:

[...]

>>
>>> Sorry that I didn't explain the background last time. We plan to use SDEI to
>>> deliver the notification (signal) from host to guest, needed by the asynchronous
>>> page fault feature. The RFCv2 patchset was posted a while ago [1].
>>
>> Thanks. So this is to hint to the guest that you'd swapped its memory to disk. Yuck.
>>
>> When would you do this?
> 
> These days, the main reason is on-demand paging with live migration.
> Instead of waiting to have a consistent version of guest memory on the
> destination, memory that the guest has dirtied can be copied on demand
> from source to destination while the guest is running.  Letting the
> guest reschedule is surprisingly effective in this case, especially with
> workloads that have a lot of threads.
> 

Paolo, thanks for the explanation.

[...]


>>> For the SDEI
>>> events needed by the async page fault, they originate from KVM (the host). In order
>>> to achieve the goal, KVM needs some code so that SDEI events can be injected and
>>> delivered. Also, the SDEI-related hypercalls need to be handled as well.
>>
>> I avoided doing this because it makes it massively complicated for the VMM. All that
>> in-kernel state now has to be migrated. KVM has to expose APIs to let the VMM inject
>> events, which gets nasty for shared events where some CPUs are masked, and others aren't.
>>
>> Having something like Qemu drive the reference code from TFA is the right thing to do for
>> SDEI.
> 
> Are there usecases for injecting SDEIs from QEMU?
> 
> If not, it can be done much more easily with KVM (and it would also
> be really, really slow if each page fault had to be redirected
> through QEMU), which wouldn't have more than a handful of SDEI events.
> The in-kernel state is 4 64-bit values (EP address and argument, flags,
> affinity) per event.
> 

I don't think there is an existing usercase to inject SDEIs from qemu.
However, there is one ioctl command reserved for this purpose
in my code, so that QEMU can inject an SDEI event if needed.

Yes, the implementation in my code injects the SDEI
event directly from KVM, on request received from a consumer like APF.

By the way, I just finished splitting the code into RFC patches.
Please let me know if I should post it to provide more details, or if it
should be deferred until this discussion is finished.

>>> Yes, the SDEI specification already mentioned
>>> this: the client handler should have all required resources in place before
>>> the handler is going to run. However, I don't see it as a problem so far.
>>
>> What if they are swapped out? This thing becomes re-entrant ... which the spec forbids.
>> The host has no clue what is in guest memory.
> 
> On x86 we don't do the notification if interrupts are disabled.  On ARM
> I guess you'd do the same until SDEI_EVENT_COMPLETE (so yeah that would
> be some state that has to be migrated).  In fact it would be nice if
> SDEI_EVENT_COMPLETE meant "wait for synchronous page-in" while
> SDEI_EVENT_COMPLETE_AND_RESUME meant "handle it asynchronously".
> 

I'm not sure I understand this issue completely. When the vCPU is preempted,
all registers should have been saved to vcpu->arch.ctxt. The SDEI context is
saved to vcpu->arch.ctxt as well. They will be restored when the vCPU gets
running again. From that perspective, it's not broken.

Yes, I plan to use a private event, which is only visible to KVM and the guest.
Also, it has critical priority. The new SDEI event can't be delivered until
the previous critical event is finished.

Paolo, it's an interesting idea to reuse SDEI_EVENT_COMPLETE/AND_RESUME. Do you
mean to use these two hypercalls to designate PAGE_NOT_READY and PAGE_READY
separately? If possible, please provide more details.

[...]

Thanks,
Gavin



* Re: [Question] How to test SDEI client driver
  2020-07-08 16:49       ` Paolo Bonzini
  2020-07-09  5:33         ` Gavin Shan
@ 2020-07-09 18:30         ` James Morse
  2020-07-09 18:50           ` Paolo Bonzini
  1 sibling, 1 reply; 11+ messages in thread
From: James Morse @ 2020-07-09 18:30 UTC (permalink / raw)
  To: Paolo Bonzini, Gavin Shan; +Cc: mark.rutland, maz, linux-arm-kernel

Hi Paolo, Gavin,

On 08/07/2020 17:49, Paolo Bonzini wrote:
> On 08/07/20 18:11, James Morse wrote:
>> On 03/07/2020 01:26, Gavin Shan wrote:
>>> On 7/1/20 9:57 PM, James Morse wrote:
>>>> On 30/06/2020 06:17, Gavin Shan wrote:
>>>>> I'm currently looking into the SDEI client driver and reworking it so that
>>>>> it can provide capability/services to arm64/kvm to get it virtualized.
>>>>
>>>> What do you mean by virtualised? The expectation is the VMM would implement the 'firmware'
>>>> side of this. 'events' are most likely to come from the VMM, and having to handshake with
>>>> the kernel to work out if the event you want to inject is registered and enabled is
>>>> over-complicated. Supporting it in the VMM means you can notify a different vCPU if that
>>>> is appropriate, or take a different action if the event isn't registered.
>>>>
>>>> This was all blocked on finding a future-proof way for tools like Qemu to consume
>>>> reference code from ATF.
>>
>>> Sorry that I didn't explain the background last time. We plan to use SDEI to
>>> deliver the notification (signal) from host to guest, needed by the asynchronous
>>> page fault feature. The RFCv2 patchset was posted a while ago [1].
>>
>> Thanks. So this is to hint to the guest that you'd swapped its memory to disk. Yuck.
>>
>> When would you do this?

> These days, the main reason is on-demand paging with live migration.
> Instead of waiting to have a consistent version of guest memory on the
> destination, memory that the guest has dirtied can be copied on demand
> from source to destination while the guest is running.  Letting the
> guest reschedule is surprisingly effective in this case, especially with
> workloads that have a lot of threads.

Aha, so nothing to do with swap. This makes more sense.
New bedtime reading: "Post-Copy Live Migration of Virtual Machines" [0]

I can see why this would be useful. Is it widely used, or a bit of a niche sport?
I don't recall seeing anything about it last time I played with migration...


>> Isn't this roughly equivalent to SMT CPUs taking a cache-miss? ...
>> If you pinned two vCPUs to one physical CPU, the host:scheduler would multiplex between
>> them. If one couldn't do useful work because it was waiting for memory, the other gets
>> all the slack time. (the TLB maintenance would hurt, but not as much as waiting for the disk)
>> The good news is the guest:scheduler already knows how to deal with this!
>> (and, it works for other OS too)
> 
> The order of magnitude of both the wait and the reschedule is too
> different for SMT heuristics to be applicable here.  Especially, two SMT
> pCPUs compete equally for fetch resources, while two vCPUs pinned to the
> same pCPU would only reschedule a few hundred times per second.  Latency
> would be in the milliseconds and jitter would be horrible.
> 
>> Wouldn't it be better to let the guest make the swapping decision? 
>> You could provide a fast virtio swap device to the guest that is
>> backed by maybe-swapped host memory.

> I think you are describing something similar to "transcendent memory",
> which Xen implemented about 10 years ago
> (https://lwn.net/Articles/454795/).  Unfortunately you've probably never
> heard about it for good reasons. :)

Heh. With a name like that I expect it to solve all my problems!

I'm trying to work out what the problem with existing ways of doing this would be...


> The main showstopper is that you cannot rely on guest cooperation (also
> because it works surprisingly well without).

Aren't we changing the guest kernel to support this? Certainly I agree the guest may not
know about anything.


>>> For the SDEI
>>> events needed by the async page fault, they originate from KVM (the host). In order
>>> to achieve the goal, KVM needs some code so that SDEI events can be injected and
>>> delivered. Also, the SDEI-related hypercalls need to be handled as well.
>>
>> I avoided doing this because it makes it massively complicated for the VMM. All that
>> in-kernel state now has to be migrated. KVM has to expose APIs to let the VMM inject
>> events, which gets nasty for shared events where some CPUs are masked, and others aren't.
>>
>> Having something like Qemu drive the reference code from TFA is the right thing to do for
>> SDEI.

> Are there usecases for injecting SDEIs from QEMU?

Yes. RAS.

When the VMM takes a SIGBUS:MCEERR_AO it can decide if and how to report this to the
guest. If it advertised firmware-first support at boot, there are about five options, of
which SDEI is one. It could emulate something platform specific, or do nothing at all.

The VMM owns the ACPI tables/DT, which advertise whether SDEI is supported, and for ACPI
where the firmware-first CPER regions are and how they are notified. We don't pass RAS
stuff into the guest directly, we treat the VMM like any other user-space process.


> If not, it can be done much more easily with KVM

The SDEI state would need to be exposed to Qemu to be migrated. If Qemu wants to use it
for a shared event which is masked on the local vCPU, we'd need to force the other vCPU to
exit to see if they can take it. It's not impossible, just very fiddly.


The mental-model I try to stick to is the VMM is the firmware for the guest, and KVM
'just' does the stuff it has to to maintain the illusion of real hardware, e.g. plumbing
stage2 page faults into mm as if they were taken from the VMM, and making the
timers+counters work.

Supporting SDEI in real firmware is done by manipulating system registers in EL3 firmware.
This falls firmly in the 'VMM is the firmware' court. It's possible for the VMM to inject
events using the existing KVM APIs; all that is missing is routing HVC to user-space for
the VMM to handle.
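
As a sketch only (KVM_EXIT_HYPERCALL and the smccc_get_*() helpers exist today; the SDEI
filtering and the user-space side do not), that routing could look roughly like:

#include <linux/arm_sdei.h>
#include <linux/kvm_host.h>
#include <kvm/arm_hypercalls.h>
#include <kvm/arm_psci.h>

/* Assumed width of the SDEI function-ID range; check it against the spec */
static bool is_sdei_call(u32 func_id)
{
	return func_id >= SDEI_1_0_FN_BASE && func_id <= SDEI_1_0_FN(0x3F);
}

int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
{
	u32 func_id = smccc_get_function(vcpu);

	if (is_sdei_call(func_id)) {
		struct kvm_run *run = vcpu->run;

		/* Hypothetical: bounce the whole SDEI range out to the VMM */
		run->exit_reason = KVM_EXIT_HYPERCALL;
		run->hypercall.nr = func_id;
		run->hypercall.args[0] = smccc_get_arg1(vcpu);
		run->hypercall.args[1] = smccc_get_arg2(vcpu);
		run->hypercall.args[2] = smccc_get_arg3(vcpu);
		/* ...the VMM emulates the call and sets x0 before resuming */
		return 0;	/* 0 == exit to user-space */
	}

	/* existing PSCI/SMCCC handling continues as today */
	return kvm_psci_call(vcpu);
}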


> (and it would also
> be really, really slow if each page fault had to be redirected
> through QEMU),

Isn't this already true for any post-copy live migration?
There must be some way of telling Qemu that this page is urgently needed ahead of whatever
it is copying at the moment.

There are always going to be pages we must have, and can't make progress until we do. (The
vectors, the irq handlers .. in modules ..)


> which wouldn't have more than a handful of SDEI events.
> The in-kernel state is 4 64-bit values (EP address and argument, flags,
> affinity) per event.

flags: normal/critical, registered, enabled, in-progress and pending.
Pending might be backed by an IRQ that changes behind your back.


>>> Yes, the SDEI specification already mentioned
>>> this: the client handler should have all required resources in place before
>>> the handler is going to run. However, I don't see it as a problem so far.
>>
>> What if they are swapped out? This thing becomes re-entrant ... which the spec forbids.
>> The host has no clue what is in guest memory.

> On x86 we don't do the notification if interrupts are disabled. 

... because you can't schedule()? What about CONFIG_PREEMPT? arm64 has that enabled in its
defconfig. I anticipate it's the most common option.

Even outside that, we have pseudo-NMI, which means interrupts are unmasked at the CPU, even
in spin_lock_irqsave() regions, but instead filtered out at the interrupt controller.

I don't think KVM can know what this means by inspection; the guest chooses interrupt
priorities to separate the 'common' IRQ from 'important', but KVM can't know it's not
'common' and 'devices being ignored'.


> On ARM
> I guess you'd do the same until SDEI_EVENT_COMPLETE (so yeah that would
> be some state that has to be migrated).

My problem with SDEI is the extra complexity it brings for features you don't want.
It's an NMI, that's the last thing you want as you can't schedule().
This use of SDEI is really for its synchronous exit through the irq handler, which you can
re-schedule from ... iff you took the event from a pre-emptible context...

Can we bypass the unnecessary NMI, and come in straight at the irq handler?

IRQ are asynchronous, but as this is a paravirt interface, the hypervisor can try to
guarantee a particular PPI (per cpu interrupt) that it generates is taken synchronously.
(I've yet to work out if the vGIC already does this, or we'd need to fake it in software)

By having a virtual-IRQ that the guest has registered, we can interpret the guest's
pseudo-NMI settings to know if this virtual-IRQ could be taken right now, which tells us
if the guest can handle the deferred stage2 fault, or it needs fixing before the guest can
make progress.

PPI are a scarce resource, so this would need some VMM involvement at boot to say which
PPI can be used. We do this for the PMU too.
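
(For the sake of argument, that could mirror the PMU's scheme, with the VMM doing a
KVM_SET_DEVICE_ATTR on each vCPU; the group/attribute names in this sketch are made up,
only the PMU equivalents exist today:)

#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

#define KVM_ARM_VCPU_APF_CTRL	4	/* made-up group, by analogy with the PMU's */
#define KVM_ARM_VCPU_APF_IRQ	0	/* made-up attribute */

static int set_apf_ppi(int vcpu_fd, int ppi)
{
	struct kvm_device_attr attr = {
		.group	= KVM_ARM_VCPU_APF_CTRL,
		.attr	= KVM_ARM_VCPU_APF_IRQ,
		.addr	= (uint64_t)(uintptr_t)&ppi,	/* e.g. a PPI in INTID 16..31 */
	};

	return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
}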

... I'd like to look into how x86 uses this, and what other hypervisors may do in this
area. (another nightmare is supporting similar but different things for KVM, Xen, HyperV
and VMWare. I'm sure other hypervisors are available...)


> In fact it would be nice if
> SDEI_EVENT_COMPLETE meant "wait for synchronous page-in" while
> SDEI_EVENT_COMPLETE_AND_RESUME meant "handle it asynchronously".

Sneaky. How does x86 do this? I assume there is a hypercall for 'got it' or 'not now'.
If we go the PPI route we could use the same. (Ideally as much as possible is done in
common code)



Thanks,

James

[0] https://kartikgopalan.github.io/publications/hines09postcopy_osr.pdf


* Re: [Question] How to test SDEI client driver
  2020-07-09  5:33         ` Gavin Shan
@ 2020-07-09 18:31           ` James Morse
  2020-07-10  9:08             ` Gavin Shan
  0 siblings, 1 reply; 11+ messages in thread
From: James Morse @ 2020-07-09 18:31 UTC (permalink / raw)
  To: Gavin Shan, Paolo Bonzini; +Cc: mark.rutland, maz, linux-arm-kernel

Hi Gavin,

On 09/07/2020 06:33, Gavin Shan wrote:
> On 7/9/20 2:49 AM, Paolo Bonzini wrote:
>> On 08/07/20 18:11, James Morse wrote:
>>> On 03/07/2020 01:26, Gavin Shan wrote:

>>>> For the SDEI
>>>> events needed by the async page fault, they originate from KVM (the host). In order
>>>> to achieve the goal, KVM needs some code so that SDEI events can be injected and
>>>> delivered. Also, the SDEI-related hypercalls need to be handled as well.
>>>
>>> I avoided doing this because it makes it massively complicated for the VMM. All that
>>> in-kernel state now has to be migrated. KVM has to expose APIs to let the VMM inject
>>> events, which gets nasty for shared events where some CPUs are masked, and others aren't.
>>>
>>> Having something like Qemu drive the reference code from TFA is the right thing to do for
>>> SDEI.
>>
>> Are there usecases for injecting SDEIs from QEMU?
>>
>> If not, it can be done much more easily with KVM (and it would also
>> would be really, really slow if each page fault had to be redirected
>> through QEMU), which wouldn't have more than a handful of SDEI events.
>> The in-kernel state is 4 64-bit values (EP address and argument, flags,
>> affinity) per event.

> I don't think there is an existing usercase to inject SDEIs from qemu.

use-case or user-space?

There was a series to add support for emulating firmware-first RAS. I think it got stuck
in the wider problem of how Qemu can consume reference code from TFA (the EL3 firmware) to
reduce the maintenance overhead. (every time Arm add something else up there, Qemu would
need to emulate it. It should be possible to consume the TFA reference code)


> However, there is one ioctl command reserved for this purpose
> in my code, so that QEMU can inject an SDEI event if needed.
> 
> Yes, the implementation in my code injects the SDEI
> event directly from KVM, on request received from a consumer like APF.

> By the way, I just finished splitting the code into RFC patches.
> Please let me know if I should post it to provide more details, or if it
> should be deferred until this discussion is finished.

I still need to go through the SDEI patches you posted. If you post a link to the branch I
can have a look to get a better idea of the shape of this thing...

(I've not gone looking for the x86 code yet)


>>>> Yes, the SDEI specification already mentioned
>>>> this: the client handler should have all required resources in place before
>>>> the handler is going to run. However, I don't see it as a problem so far.
>>>
>>> What if they are swapped out? This thing becomes re-entrant ... which the spec forbids.
>>> The host has no clue what is in guest memory.
>>
>> On x86 we don't do the notification if interrupts are disabled.  On ARM
>> I guess you'd do the same until SDEI_EVENT_COMPLETE (so yeah that would
>> be some state that has to be migrated).  In fact it would be nice if
>> SDEI_EVENT_COMPLETE meant "wait for synchronous page-in" while
>> SDEI_EVENT_COMPLETE_AND_RESUME meant "handle it asynchronously".

> I'm not sure I understand this issue completely. When the vCPU is preempted,
> all registers should have been saved to vcpu->arch.ctxt. The SDEI context is
> saved to vcpu->arch.ctxt as well. They will be restored when the vCPU gets
> running again. From that perspective, it's not broken.
> 
> Yes, I plan to use a private event, which is only visible to KVM and the guest.
> Also, it has critical priority. The new SDEI event can't be delivered until
> the previous critical event is finished.
> 
> Paolo, it's an interesting idea to reuse SDEI_EVENT_COMPLETE/AND_RESUME. Do you
> mean to use these two hypercalls to designate PAGE_NOT_READY and PAGE_READY
> separately? If possible, please provide more details.

No, I think this suggestion is for the guest to hint back to the hypervisor whether it can
take this stage2 delay, or it must have the page to make progress.

SDEI_EVENT_COMPLETE returns to wherever we came from, the arch code will do this if it
couldn't have taken an IRQ. If it could have taken an IRQ, it uses
SDEI_EVENT_COMPLETE_AND_RESUME to exit through the interrupt vector.

This is a trick that gives us two things: KVM guest exit when this is in use on real
hardware, and the irq-work handler runs to do the work we couldn't do in NMI context, both
before we return to the context that triggered the fault in the first place.
Both are needed for the RAS support.


The problem is invoking this whole thing when the guest can't do anything about it,
because it can't schedule(). You can't know this from outside the guest.


Thanks,

James


* Re: [Question] How to test SDEI client driver
  2020-07-09 18:30         ` James Morse
@ 2020-07-09 18:50           ` Paolo Bonzini
  0 siblings, 0 replies; 11+ messages in thread
From: Paolo Bonzini @ 2020-07-09 18:50 UTC (permalink / raw)
  To: James Morse, Gavin Shan; +Cc: mark.rutland, maz, linux-arm-kernel

On 09/07/20 20:30, James Morse wrote:
> I can see why this would be useful. Is it widely used, or a bit of a niche sport?
> I don't recall seeing anything about it last time I played with migration...

It's widely used (at least Google uses it a lot).

>> The main showstopper is that you cannot rely on guest cooperation (also
>> because it works surprisingly well without).
> 
> Aren't we changing the guest kernel to support this? Certainly I agree the guest may not
> know about anything.

Yes, but the fallback is synchronous page faults and if the guest
doesn't collaborate it's business as usual.

Now I see better what you meant: the "fake swap" device would not
prevent *other* memory from being swapped even without the guest's
consent.  However, in the case of e.g. postcopy live migration you're in
a bit of a bind, because the working set is both: 1) exactly the set of
pages that are unlikely to be ready on the destination 2) the set of
pages that the guest would choose not to place in the "fake swap".

>> Are there usecases for injecting SDEIs from QEMU?
> 
> Yes. RAS.
> 
> When the VMM takes a SIGBUS:MCEERR_AO it can decide if and how to report this to the
> guest. If it advertised firmware-first support at boot, there are about five options, of
> which SDEI is one. It could emulate something platform specific, or do nothing at all.

Ok, for x86 there's an ioctl to inject MCEs and I was not really sure
how ARM does it.  But it could still be a kernel-managed event, just one
that QEMU can trigger at will.

>> If not, it can be done much more easily with KVM
> 
> The SDEI state would need to be exposed to Qemu to be migrated. If Qemu wants to use it
> for a shared event which is masked on the local vCPU, we'd need to force the other vCPU to
> exit to see if they can take it. It's not impossible, just very fiddly.
> 
> The mental-model I try to stick to is the VMM is the firmware for the guest, and KVM
> 'just' does the stuff it has to to maintain the illusion of real hardware, e.g. plumbing
> stage2 page faults into mm as if they were taken from the VMM, and making the
> timers+counters work.

Actually I try to make the firmware for the guest... the firmware of the
guest (with paravirtualized help from the hypervisor or VMM when
needed).  I think we've argued about that with Marc a lot though, so it
may not be the most common view in the ARM world!

My model is that KVM does processor stuff, while the VMM does everything
else.  It doesn't always match, for example KVM does more GIC emulation
than would fit this model (IIRC it handles the distributor?).  But this
is why I would prefer to put the system register manipulation in KVM
rather than the VMM, possibly with ioctls on the vCPU file descriptor
for use from the VMM.

> 
>> (and it would also
>> be really, really slow if each page fault had to be redirected
>> through QEMU),
> 
> Isn't this already true for any post-copy live migration?
> There must be some way of telling Qemu that this page is urgently needed ahead of whatever
> it is copying at the moment.

It's done with userfaultfd, so it's entirely asynchronous.  It's
important to get the fault delivered quickly to the guest however,
because that affects the latency.

> There are always going to be pages we must have, and can't make progress until we do. (The
> vectors, the irq handlers .. in modules ..)

Yup but fortunately they don't change often.  For postcopy the
problematic pages are ironically those in the working set, not those
that never change (because those can be migrated just fine)!

>>>> Yes, the SDEI specification already mentioned
>>>> this: the client handler should have all required resources in place before
>>>> the handler is going to run. However, I don't see it as a problem so far.
>>>
>>> What if they are swapped out? This thing becomes re-entrant ... which the spec forbids.
>>> The host has no clue what is in guest memory.
> 
>> On x86 we don't do the notification if interrupts are disabled. 
> 
> ... because you can't schedule()? What about CONFIG_PREEMPT? arm64 has that enabled in its
> defconfig. I anticipate it's the most common option.

No, not because we can't schedule() but because it would be a reentrancy
nightmare.  Actually it's even stricter: we don't do the notification at
all if we're in supervisor mode.  As I said above, we don't expect that
to be a big deal because the pages with the most churn for live
migration will be userspace data.

>> On ARM
>> I guess you'd do the same until SDEI_EVENT_COMPLETE (so yeah that would
>> be some state that has to be migrated).
> 
> My problem with SDEI is the extra complexity it brings for features you don't want.
> It's an NMI, that's the last thing you want as you can't schedule()

Scheduling can be done outside the NMI handler as long as it's done
before returning to EL3.  But yeah, on x86 it's nice that the page fault
exception handler can schedule() just fine (see
kvm_async_pf_task_wait_schedule in arch/x86/kernel/kvm.c).

> PPI are a scarce resource, so this would need some VMM involvement at boot to say which
> PPI can be used. We do this for the PMU too.

Yep, this is part of why we didn't consider PPIs.

> ... I'd like to look into how x86 uses this, and what other hypervisors may do in this
> area. (another nightmare is supporting similar but different things for KVM, Xen, HyperV
> and VMWare. I'm sure other hypervisors are available...)

I'm not sure if any other hypervisor than KVM does it.  Well, IBM's
proprietary hypervisors do it, but only on POWER or s390.

>> In fact it would be nice if
>> SDEI_EVENT_COMPLETE meant "wait for synchronous page-in" while
>> SDEI_EVENT_COMPLETE_AND_RESUME meant "handle it asynchronously".
> 
> Sneaky. How does x86 do this? I assume there is a hypercall for 'got it' or 'not now'.
> If we go the PPI route we could use the same. (Ideally as much as possible is done in
> common code)

It doesn't do it yet, but it is planned to have a hypercall to inform
KVM of the choice.  I explained to Gavin what the v2.0 of the x86
interface will look like, so that ARM can already do it like that and
perhaps even share some code or data structures.

Paolo



* Re: [Question] How to test SDEI client driver
  2020-07-09 18:31           ` James Morse
@ 2020-07-10  9:08             ` Gavin Shan
  2020-07-10  9:38               ` Paolo Bonzini
  0 siblings, 1 reply; 11+ messages in thread
From: Gavin Shan @ 2020-07-10  9:08 UTC (permalink / raw)
  To: James Morse, Paolo Bonzini; +Cc: mark.rutland, maz, linux-arm-kernel

Hi James,

On 7/10/20 4:31 AM, James Morse wrote:
> On 09/07/2020 06:33, Gavin Shan wrote:
>> On 7/9/20 2:49 AM, Paolo Bonzini wrote:
>>> On 08/07/20 18:11, James Morse wrote:
>>>> On 03/07/2020 01:26, Gavin Shan wrote:
> 
>>>>> For the SDEI
>>>>> events needed by the async page fault, they originate from KVM (the host). In order
>>>>> to achieve the goal, KVM needs some code so that SDEI events can be injected and
>>>>> delivered. Also, the SDEI-related hypercalls need to be handled as well.
>>>>
>>>> I avoided doing this because it makes it massively complicated for the VMM. All that
>>>> in-kernel state now has to be migrated. KVM has to expose APIs to let the VMM inject
>>>> events, which gets nasty for shared events where some CPUs are masked, and others aren't.
>>>>
>>>> Having something like Qemu drive the reference code from TFA is the right thing to do for
>>>> SDEI.
>>>
>>> Are there usecases for injecting SDEIs from QEMU?
>>>
>>> If not, it can be done much more easily with KVM (and it would also
>>> would be really, really slow if each page fault had to be redirected
>>> through QEMU), which wouldn't have more than a handful of SDEI events.
>>> The in-kernel state is 4 64-bit values (EP address and argument, flags,
>>> affinity) per event.
> 
>> I don't think there is an existing usercase to inject SDEIs from qemu.
> 
> use-case or user-space?
> 
> There was a series to add support for emulating firmware-first RAS. I think it got stuck
> in the wider problem of how Qemu can consume reference code from TFA (the EL3 firmware) to
> reduce the maintenance overhead. (every time Arm add something else up there, Qemu would
> need to emulate it. It should be possible to consume the TFA reference code)
> 

I'm not sure if the patchset has ever been posted. If so, could you
please share the link to it? I might take a look when I get a
chance.

>> However, there is one ioctl command reserved for this purpose
>> in my code, so that QEMU can inject an SDEI event if needed.
>>
>> Yes, the implementation in my code injects the SDEI
>> event directly from KVM, on request received from a consumer like APF.
> 
>> By the way, I just finished splitting the code into RFC patches.
>> Please let me know if I should post it to provide more details, or if it
>> should be deferred until this discussion is finished.
> 
> I still need to go through the SDEI patches you posted. If you post a link to the branch I
> can have a look to get a better idea of the shape of this thing...
> 
> (I've not gone looking for the x86 code yet)
> 

Sure. Here is the link to the git repo:

https://github.com/gwshan/linux.git

branch ("sdei_client"): the sdei client driver rework series I posted.
branch ("sdei"): the patches to make SDEI virtualized, which is based on "sdei_client".

>>>>> Yes, the SDEI specification already mentioned
>>>>> this: the client handler should have all required resources in place before
>>>>> the handler is going to run. However, I don't see it as a problem so far.
>>>>
>>>> What if they are swapped out? This thing becomes re-entrant ... which the spec forbids.
>>>> The host has no clue what is in guest memory.
>>>
>>> On x86 we don't do the notification if interrupts are disabled.  On ARM
>>> I guess you'd do the same until SDEI_EVENT_COMPLETE (so yeah that would
>>> be some state that has to be migrated).  In fact it would be nice if
>>> SDEI_EVENT_COMPLETE meant "wait for synchronous page-in" while
>>> SDEI_EVENT_COMPLETE_AND_RESUME meant "handle it asynchronously".
> 
>> I'm not sure I understand this issue completely. When the vCPU is preempted,
>> all registers should have been saved to vcpu->arch.ctxt. The SDEI context is
>> saved to vcpu->arch.ctxt as well. They will be restored when the vCPU gets
>> running again. From that perspective, it's not broken.
>>
>> Yes, I plan to use a private event, which is only visible to KVM and the guest.
>> Also, it has critical priority. The new SDEI event can't be delivered until
>> the previous critical event is finished.
>>
>> Paolo, it's an interesting idea to reuse SDEI_EVENT_COMPLETE/AND_RESUME. Do you
>> mean to use these two hypercalls to designate PAGE_NOT_READY and PAGE_READY
>> separately? If possible, please provide more details.
> 
> No, I think this suggestion is for the guest to hint back to the hypervisor whether it can
> take this stage2 delay, or it must have the page to make progress.
> 
> SDEI_EVENT_COMPLETE returns to wherever we came from, the arch code will do this if it
> couldn't have taken an IRQ. If it could have taken an IRQ, it uses
> SDEI_EVENT_COMPLETE_AND_RESUME to exit through the interrupt vector.
> 
> This is a trick that gives us two things: KVM guest exit when this is in use on real
> hardware, and the irq-work handler runs to do the work we couldn't do in NMI context, both
> before we return to the context that triggered the fault in the first place.
> Both are needed for the RAS support.
> 

Ok, thanks for the information, which makes things much clearer.
So SDEI_EVENT_COMPLETE/AND_RESUME is issued depending on whether the current
process can be rescheduled. I think it's Paolo's idea?

> 
> The problem is invoking this whole thing when the guest can't do anything about it,
> because it can't schedule(). You can't know this from outside the guest.
> 

Yes, the interrupted process can't call schedule() before SDEI_EVENT_COMPLETE
at least because the SDEI event handler has to finish as quickly as possible.
However, consider this flow:

              process  ->         SDEI event trigger
                                        |
                                  SDEI event handler is called
                                        |
             schedule() <-        SDEI_EVENT_COMPLETE

As we don't have a schedule() call in place in advance, we might figure out one
way to insert the schedule() from the SDEI event handler.

Thanks,
Gavin



* Re: [Question] How to test SDEI client driver
  2020-07-10  9:08             ` Gavin Shan
@ 2020-07-10  9:38               ` Paolo Bonzini
  0 siblings, 0 replies; 11+ messages in thread
From: Paolo Bonzini @ 2020-07-10  9:38 UTC (permalink / raw)
  To: Gavin Shan, James Morse; +Cc: mark.rutland, maz, linux-arm-kernel

On 10/07/20 11:08, Gavin Shan wrote:
> Ok, thanks for the information, which makes things much clearer.
> So SDEI_EVENT_COMPLETE/AND_RESUME is issued depending on whether the current
> process can be rescheduled. I think it's Paolo's idea?

Yes. :)

>> The problem is invoking this whole thing when the guest can't do 
>> anything about it, because it can't schedule(). You can't know this
>> from outside the guest.
> 
> Yes, the interrupted process can't call schedule() before 
> SDEI_EVENT_COMPLETE at least because the SDEI event handler has to
> finish as quickly as possible.
> 
> [...] we might figure out one
> way to insert the schedule() from the SDEI event handler.

I think you could do smp_send_reschedule(smp_processor_id()) before
invoking SDEI_EVENT_COMPLETE_AND_RESUME.  As James said, after the
hypervisor processes SDEI_EVENT_COMPLETE_AND_RESUME the exit will be
through the reschedule interrupt.

Instead, if the hypervisor sees SDEI_EVENT_COMPLETE, it will wait for
synchronous page-in to complete, remove the async page fault from its
data structures, and resume execution.
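
In guest terms, the "page not present" handler could then look something like this sketch;
how the handler's choice reaches the SDEI exit path, and the reason/token plumbing, are
assumptions rather than existing code:

#include <linux/ptrace.h>
#include <linux/smp.h>
#include <linux/types.h>

/* Hypothetical: how the completion choice is reported to the SDEI exit path */
enum apf_sdei_action {
	APF_COMPLETE,			/* SDEI_EVENT_COMPLETE: host pages in synchronously */
	APF_COMPLETE_AND_RESUME,	/* SDEI_EVENT_COMPLETE_AND_RESUME: handle it async */
};

static enum apf_sdei_action apf_page_not_present(u32 token, struct pt_regs *regs)
{
	/* Mirror the x86 policy: only defer if we interrupted user space */
	if (!user_mode(regs))
		return APF_COMPLETE;

	/* remember 'token' so the later PAGE_READY notification can wake the task */

	/*
	 * Pend a reschedule IPI against ourselves: COMPLETE_AND_RESUME exits
	 * through the IRQ vector, so the interrupted task gets scheduled out
	 * until the page is resident.
	 */
	smp_send_reschedule(smp_processor_id());
	return APF_COMPLETE_AND_RESUME;
}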

Paolo


