* Notes on stubdoms and latency on ARM
@ 2017-05-18 19:00 Stefano Stabellini
  2017-05-19 19:45 ` Volodymyr Babchuk
  0 siblings, 1 reply; 49+ messages in thread
From: Stefano Stabellini @ 2017-05-18 19:00 UTC (permalink / raw)
  To: xen-devel
  Cc: vlad.babchuk, dario.faggioli, sstabellini, julien.grall, george.dunlap

Hi all,

Julien, Dario, George and I had a quick meeting to discuss stubdom
scheduling. These are my notes.


Description of the problem: need for a place to run emulators and
mediators outside of Xen, with low latency.

Explanation of what EL0 apps are. What should be their interface with
Xen? Could the interface be the regular hypercall interface? In that
case, what's the benefit compared to stubdoms?

The problem with stubdoms is latency and scheduling: it is not
deterministic. We could easily improve the null scheduler to introduce
some sort of non-preemptive scheduling of stubdoms on the same pcpus as
the guest vcpus. It would still require manually pinning vcpus to pcpus.

Then, we could add a sched_op hypercall to let the schedulers know that
a stubdom is tied to a specific guest domain. At that point, the
scheduling of stubdoms would become deterministic and automatic with the
null scheduler. It could be done for other schedulers too, but it would
be more work.
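
To make that concrete, here is a rough, self-contained sketch of what
the toolstack side of such a binding could look like. All names in it
(sched_bind_stubdom, xc_sched_bind_stubdom) are made up for this note
and do not exist in Xen today; the stub only logs the intent instead of
issuing a real hypercall.

#include <stdint.h>
#include <stdio.h>

typedef uint32_t domid_t;

/* Hypothetical request layout: "this stubdom services that guest". */
struct sched_bind_stubdom {
    domid_t stubdom_id;   /* domain running the emulator/mediator */
    domid_t target_id;    /* guest whose vcpus it services        */
};

/* Stand-in for a real hypercall wrapper: just log what would be sent. */
static int xc_sched_bind_stubdom(const struct sched_bind_stubdom *req)
{
    printf("bind stubdom %u to guest %u\n",
           (unsigned)req->stubdom_id, (unsigned)req->target_id);
    return 0;
}

int main(void)
{
    struct sched_bind_stubdom req = { .stubdom_id = 2, .target_id = 1 };
    return xc_sched_bind_stubdom(&req);
}

With that information, the null scheduler could always place the stubdom
vcpu on the same pcpu as the guest vcpu it services.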

The other issue with stubdoms is context switch times. Volodymyr showed
that minios has much higher context switch times compared to EL0 apps.
It is probably due to the GIC context switch, which is skipped for EL0
apps. Maybe we could skip the GIC context switch for stubdoms too, if we
knew that they are not going to use the vGIC. At that point, context
switch times should be very similar to EL0 apps.


ACTIONS:
Improve the null scheduler to enable decent stubdom scheduling on
latency-sensitive systems.
Investigate ways to improve context switch times on ARM.


Cheers,

Stefano


* Re: Notes on stubdoms and latency on ARM
  2017-05-18 19:00 Notes on stubdoms and latency on ARM Stefano Stabellini
@ 2017-05-19 19:45 ` Volodymyr Babchuk
  2017-05-22 21:41   ` Stefano Stabellini
                     ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Volodymyr Babchuk @ 2017-05-19 19:45 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Artem_Mygaiev, xen-devel, Andrii Anisov, Dario Faggioli,
	George Dunlap, Julien Grall

Hi Stefano,

On 18 May 2017 at 22:00, Stefano Stabellini <sstabellini@kernel.org> wrote:

> Description of the problem: need for a place to run emulators and
> mediators outside of Xen, with low latency.
>
> Explanation of what EL0 apps are. What should be their interface with
> Xen? Could the interface be the regular hypercall interface? In that
> case, what's the benefit compared to stubdoms?
I imagined this as a separate syscall interface (with finer policy
rules). But this can be discussed, of course.

> The problem with stubdoms is latency and scheduling. It is not
> deterministic. We could easily improve the null scheduler to introduce
> some sort of non-preemptive scheduling of stubdoms on the same pcpus of
> the guest vcpus. It would still require manually pinning vcpus to pcpus.
I see a couple of other problems with stubdoms. For example, we need a
mechanism to load a mediator stubdom before dom0.

> Then, we could add a sched_op hypercall to let the schedulers know that
> a stubdom is tied to a specific guest domain.
What if one stubdom serves multiple domains? This is the TEE use case.

> The other issue with stubdoms is context switch times. Volodymyr showed
> that minios has much higher context switch times compared to EL0 apps.
> It is probably due to GIC context switch, that is skipped for EL0 apps.
> Maybe we could skip GIC context switch for stubdoms too, if we knew that
> they are not going to use the VGIC. At that point, context switch times
> should be very similar to EL0 apps.
So you are suggesting creating something like a lightweight stubdom. I
generally like this idea. But AFAIK, the vGIC is used to deliver events
from the hypervisor to the stubdom. Do you want to propose another
mechanism?
Also, this sounds much like my EL0 PoC :)

> ACTIONS:
> Improve the null scheduler to enable decent stubdoms scheduling on
> latency sensitive systems.
I'm not very familiar with Xen schedulers. It looks like the null
scheduler is good for hard RT, but is not a good fit for a generic
consumer system. What do you think: is it possible to make the credit2
scheduler schedule stubdoms in the same way?

> Investigate ways to improve context switch times on ARM.
Do you have any tools to profile or trace the Xen core? Also, I don't
think that pure context switch time is the biggest issue. Even now it
allows 180,000 switches per second (if I'm not mistaken). I think
scheduling latency is more important.
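
(For scale: 180,000 switches per second is roughly 1s / 180,000 ~= 5.6 us
per switch, so the two switches needed for one emulation round trip cost
on the order of 11 us before any scheduling delay is added.)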

-- 
WBR Volodymyr Babchuk aka lorc [+380976646013]
mailto: vlad.babchuk@gmail.com


* Re: Notes on stubdoms and latency on ARM
  2017-05-19 19:45 ` Volodymyr Babchuk
@ 2017-05-22 21:41   ` Stefano Stabellini
  2017-05-26 19:28     ` Volodymyr Babchuk
  2017-05-23  7:11   ` Dario Faggioli
  2017-05-23  9:08   ` George Dunlap
  2 siblings, 1 reply; 49+ messages in thread
From: Stefano Stabellini @ 2017-05-22 21:41 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Artem_Mygaiev, Stefano Stabellini, Andrii Anisov, Dario Faggioli,
	George Dunlap, Julien Grall, xen-devel

On Fri, 19 May 2017, Volodymyr Babchuk wrote:
> On 18 May 2017 at 22:00, Stefano Stabellini <sstabellini@kernel.org> wrote:
> 
> > Description of the problem: need for a place to run emulators and
> > mediators outside of Xen, with low latency.
> >
> > Explanation of what EL0 apps are. What should be their interface with
> > Xen? Could the interface be the regular hypercall interface? In that
> > case, what's the benefit compared to stubdoms?
> I imagined this as separate syscall interface (with finer policy
> rules). But this can be discussed, of course.

Right, and to be clear, I am not against EL0 apps.


> > The problem with stubdoms is latency and scheduling. It is not
> > deterministic. We could easily improve the null scheduler to introduce
> > some sort of non-preemptive scheduling of stubdoms on the same pcpus of
> > the guest vcpus. It would still require manually pinning vcpus to pcpus.
> I see couple of other problems with stubdoms. For example, we need
> mechanism to load mediator stubdom before dom0.

This can be solved: unrelated to this discussion, I had already created a
project for Outreachy/GSoC to create multiple guests from device tree.

https://wiki.xenproject.org/wiki/Outreach_Program_Projects#Xen_on_ARM:_create_multiple_guests_from_device_tree


> > Then, we could add a sched_op hypercall to let the schedulers know that
> > a stubdom is tied to a specific guest domain.
> What if one stubdom serves multiple domains? This is TEE use case.

It can be done. Stubdoms are typically deployed one per domain but they
are not limited to that model.


> > The other issue with stubdoms is context switch times. Volodymyr showed
> > that minios has much higher context switch times compared to EL0 apps.
> > It is probably due to GIC context switch, that is skipped for EL0 apps.
> > Maybe we could skip GIC context switch for stubdoms too, if we knew that
> > they are not going to use the VGIC. At that point, context switch times
> > should be very similar to EL0 apps.
> So you are suggesting to create something like lightweight stubdom. I
> generally like this idea. But AFAIK, vGIC is used to deliver events
> from hypervisor to stubdom. Do you want to propose another mechanism?

There is no way out: if the stubdom needs events, then we'll have to
expose and context switch the vGIC. If it doesn't, then we can skip the
vGIC. However, we would have a similar problem with EL0 apps: I am
assuming that EL0 apps don't need to handle interrupts, but if they do,
then they might need something like a vGIC.


> Also, this is sounds much like my EL0 PoC :)

Yes :-)


> > ACTIONS:
> > Improve the null scheduler to enable decent stubdoms scheduling on
> > latency sensitive systems.
> I'm not very familiar with XEN schedulers. Looks like null scheduler
> is good for hard RT, but isn't fine for a generic consumer system. How
> do you think: is it possible to make credit2 scheduler to schedule
> stubdoms in the same way?

You can do more than that :-)
You can use credit2 and the null scheduler simultaneously on different
sets of physical cpus using cpupools. For example, you can use the null
scheduler on 2 physical cores and credit2 on the remaining cores.

To better answer your question: yes, it can be done with credit2 too;
however, it will obviously be more work (the null scheduler is trivial).


> > Investigate ways to improve context switch times on ARM.
> Do you have any tools to profile or trace XEN core? Also, I don't
> think that pure context switch time is the biggest issue. Even now, it
> allows 180 000 switches per second (if I'm not wrong). I think,
> scheduling latency is more important.

I am using the arch timer, manually reading the counter values. I know
it's not ideal but it does the job. I am sure that with a combination of
the null scheduler and vcpu pinning the scheduling latencies can be
drastically reduced.
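
For reference, a minimal, self-contained sketch of that kind of
measurement (assuming AArch64 and that the generic timer counter is
accessible at the current exception level; the operation under test is
left as a comment):

#include <stdint.h>
#include <stdio.h>

static inline uint64_t read_cntvct(void)
{
    uint64_t val;
    __asm__ volatile("mrs %0, cntvct_el0" : "=r"(val));
    return val;
}

static inline uint64_t read_cntfrq(void)
{
    uint64_t val;
    __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(val));
    return val;
}

int main(void)
{
    uint64_t t0 = read_cntvct();
    /* ... operation under test (e.g. a forced trap or context switch) ... */
    uint64_t t1 = read_cntvct();

    printf("elapsed: %llu ticks (~%llu ns)\n",
           (unsigned long long)(t1 - t0),
           (unsigned long long)((t1 - t0) * 1000000000ull / read_cntfrq()));
    return 0;
}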


* Re: Notes on stubdoms and latency on ARM
  2017-05-19 19:45 ` Volodymyr Babchuk
  2017-05-22 21:41   ` Stefano Stabellini
@ 2017-05-23  7:11   ` Dario Faggioli
  2017-05-26 20:09     ` Volodymyr Babchuk
  2017-05-23  9:08   ` George Dunlap
  2 siblings, 1 reply; 49+ messages in thread
From: Dario Faggioli @ 2017-05-23  7:11 UTC (permalink / raw)
  To: Volodymyr Babchuk, Stefano Stabellini
  Cc: Artem_Mygaiev, Julien Grall, xen-devel, Andrii Anisov, George Dunlap


On Fri, 2017-05-19 at 22:45 +0300, Volodymyr Babchuk wrote:
> On 18 May 2017 at 22:00, Stefano Stabellini <sstabellini@kernel.org>
> wrote:
> > ACTIONS:
> > Improve the null scheduler to enable decent stubdoms scheduling on
> > latency sensitive systems.
> 
> I'm not very familiar with XEN schedulers. 
>
Feel free to ask anything. :-)

> Looks like null scheduler
> is good for hard RT, but isn't fine for a generic consumer system. 
>
The null scheduler is meant to be useful when you have a static
scenario, no (or very little) overbooking (i.e., total nr of vCPUs ~= nr
of pCPUs), and want to cut the scheduling overhead to _zero_.

That may include certain classes of real-time workloads, but it is not
limited to such use cases.

> How
> do you think: is it possible to make credit2 scheduler to schedule
> stubdoms in the same way?
> 
It is indeed possible. Actually, it's in the plans to do exactly
something like that, as it could potentially be useful for a wide range
of use cases.

Doing it in the null scheduler is just easier, and we think it would be
a nice way to quickly have a proof of concept done. Afterwards, we'll
focus on other schedulers too.

> > Investigate ways to improve context switch times on ARM.
> 
> Do you have any tools to profile or trace XEN core? Also, I don't
> think that pure context switch time is the biggest issue. Even now,
> it
> allows 180 000 switches per second (if I'm not wrong). I think,
> scheduling latency is more important.
> 
What do you refer to when you say 'scheduling latency'? As in, the
latency between which events, happening on which component?

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: Notes on stubdoms and latency on ARM
  2017-05-19 19:45 ` Volodymyr Babchuk
  2017-05-22 21:41   ` Stefano Stabellini
  2017-05-23  7:11   ` Dario Faggioli
@ 2017-05-23  9:08   ` George Dunlap
  2017-05-26 19:43     ` Volodymyr Babchuk
  2 siblings, 1 reply; 49+ messages in thread
From: George Dunlap @ 2017-05-23  9:08 UTC (permalink / raw)
  To: Volodymyr Babchuk, Stefano Stabellini
  Cc: Artem_Mygaiev, Dario Faggioli, xen-devel, Andrii Anisov, Julien Grall

On 19/05/17 20:45, Volodymyr Babchuk wrote:
> Hi Stefano,
> 
> On 18 May 2017 at 22:00, Stefano Stabellini <sstabellini@kernel.org> wrote:
> 
>> Description of the problem: need for a place to run emulators and
>> mediators outside of Xen, with low latency.
>>
>> Explanation of what EL0 apps are. What should be their interface with
>> Xen? Could the interface be the regular hypercall interface? In that
>> case, what's the benefit compared to stubdoms?
> I imagined this as separate syscall interface (with finer policy
> rules). But this can be discussed, of course.

I think that's a natural place to start.  But then you start thinking
about the details: this thing needs to be able to manage its own address
space, send and receive event channels / interrupts, &c &c -- and it
actually ends up looking exactly like a subset of what a stubdomain can
already do.

In which case -- why invent a new interface, instead of just reusing the
existing one?

>> The problem with stubdoms is latency and scheduling. It is not
>> deterministic. We could easily improve the null scheduler to introduce
>> some sort of non-preemptive scheduling of stubdoms on the same pcpus of
>> the guest vcpus. It would still require manually pinning vcpus to pcpus.
> I see couple of other problems with stubdoms. For example, we need
> mechanism to load mediator stubdom before dom0.

There are a couple of options here.  You could do something like the
Xoar project [1] did, and have Xen boot a special-purpose "system
builder" domain, which would start both the mediator and then a dom0.
Or you could have a mechanism for passing more than one domain / initrd
to Xen, and pass Xen both the mediator stubdom as well as the kernel for
dom0.

[1] tjd.phlegethon.org/words/sosp11-xoar.pdf

>> Then, we could add a sched_op hypercall to let the schedulers know that
>> a stubdom is tied to a specific guest domain.
> What if one stubdom serves multiple domains? This is TEE use case.

Then you don't make that hypercall. :-)  In any case you certainly can't
use an EL0 app for that, at least the way we've been describing it.

 -George



* Re: Notes on stubdoms and latency on ARM
  2017-05-22 21:41   ` Stefano Stabellini
@ 2017-05-26 19:28     ` Volodymyr Babchuk
  2017-05-30 17:29       ` Stefano Stabellini
  2017-05-31 17:02       ` George Dunlap
  0 siblings, 2 replies; 49+ messages in thread
From: Volodymyr Babchuk @ 2017-05-26 19:28 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Artem_Mygaiev, xen-devel, Andrii Anisov, Dario Faggioli,
	George Dunlap, Julien Grall

Hello Stefano,

>> > The problem with stubdoms is latency and scheduling. It is not
>> > deterministic. We could easily improve the null scheduler to introduce
>> > some sort of non-preemptive scheduling of stubdoms on the same pcpus of
>> > the guest vcpus. It would still require manually pinning vcpus to pcpus.
>> I see couple of other problems with stubdoms. For example, we need
>> mechanism to load mediator stubdom before dom0.
>
> This can be solved: unrelated to this discussion, I had already created a
> project for Outreachy/GSoC to create multiple guests from device tree.
>
> https://wiki.xenproject.org/wiki/Outreach_Program_Projects#Xen_on_ARM:_create_multiple_guests_from_device_tree
Yes, that could be a solution.


>> > The other issue with stubdoms is context switch times. Volodymyr showed
>> > that minios has much higher context switch times compared to EL0 apps.
>> > It is probably due to GIC context switch, that is skipped for EL0 apps.
>> > Maybe we could skip GIC context switch for stubdoms too, if we knew that
>> > they are not going to use the VGIC. At that point, context switch times
>> > should be very similar to EL0 apps.
>> So you are suggesting to create something like lightweight stubdom. I
>> generally like this idea. But AFAIK, vGIC is used to deliver events
>> from hypervisor to stubdom. Do you want to propose another mechanism?
>
> There is no way out: if the stubdom needs events, then we'll have to
> expose and context switch the vGIC. If it doesn't, then we can skip the
> vGIC. However, we would have a similar problem with EL0 apps: I am
> assuming that EL0 apps don't need to handle interrupts, but if they do,
> then they might need something like a vGIC.
Hm. Correct me, but if we want a stubdom to handle some requests
(e.g. emulate MMIO access), then it needs events, and thus it needs
interrupts. At least, I'm not aware of any other mechanism that allows
the hypervisor to signal a domain.
On the other hand, an EL0 app (as I see them) does not need such events.
Basically, you just call the function `handle_mmio()` right in the app.
So apps can live without interrupts and they are still able to handle
requests.

>> I'm not very familiar with XEN schedulers. Looks like null scheduler
>> is good for hard RT, but isn't fine for a generic consumer system. How
>> do you think: is it possible to make credit2 scheduler to schedule
>> stubdoms in the same way?
>
> You can do more than that :-)
> You can use credit2 and the null scheduler simultaneously on different
> sets of physical cpus using cpupools. For example, you can use the null
> scheduler on 2 physical cores and credit2 on the remaining cores.
Wow. Didn't know that.


-- 
WBR Volodymyr Babchuk aka lorc [+380976646013]
mailto: vlad.babchuk@gmail.com


* Re: Notes on stubdoms and latency on ARM
  2017-05-23  9:08   ` George Dunlap
@ 2017-05-26 19:43     ` Volodymyr Babchuk
  2017-05-26 19:46       ` Volodymyr Babchuk
  0 siblings, 1 reply; 49+ messages in thread
From: Volodymyr Babchuk @ 2017-05-26 19:43 UTC (permalink / raw)
  To: George Dunlap
  Cc: Artem_Mygaiev, Stefano Stabellini, Andrii Anisov, Dario Faggioli,
	Julien Grall, xen-devel

Hi Dario,

>>> Explanation of what EL0 apps are. What should be their interface with
>>> Xen? Could the interface be the regular hypercall interface? In that
>>> case, what's the benefit compared to stubdoms?
>> I imagined this as separate syscall interface (with finer policy
>> rules). But this can be discussed, of course.
>
> I think that's a natural place to start.  But then you start thinking
> about the details: this thing needs to be able to manage its own address
> space, send and receive event channels / interrupts, &c &c -- and it
> actually ends up looking exactly like a subset of what a stubdomain can
> already do.
Actually, I don't want it to handle events, interrupts and such. I see
it almost as a synchronous function call. For example, when you need
something from it, you don't fire an interrupt into it. You just set the
function number in r0, set the parameters in r1-r7, set the PC to an
entry point and you are good to go.
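
Just to illustrate the model (this is a toy, self-contained sketch, not
Xen code; run_app() stands in for actually entering the app at EL0 and
function 1 is an arbitrary "emulate this MMIO access" request):

#include <stdint.h>
#include <stdio.h>

/* Register file handed to the app: function id in r0, arguments in
 * r1-r7, result returned in r0. */
struct app_regs {
    uint64_t r[8];
    uint64_t pc;            /* entry point the app starts at */
};

/* Stand-in for "enter the app and run it to completion". */
static void run_app(struct app_regs *regs)
{
    if (regs->r[0] == 1)    /* 1 = handle the MMIO access in r1/r2 */
        regs->r[0] = 0;     /* pretend it was emulated successfully */
    else
        regs->r[0] = (uint64_t)-1;
}

/* Synchronous call: fill the registers, set the PC, run, read r0. */
static int64_t app_call(uint64_t entry, uint64_t fn,
                        uint64_t arg0, uint64_t arg1)
{
    struct app_regs regs = { .r = { fn, arg0, arg1 }, .pc = entry };
    run_app(&regs);
    return (int64_t)regs.r[0];
}

int main(void)
{
    /* "Emulate a 4-byte MMIO access at 0x49000000." */
    printf("app returned %lld\n",
           (long long)app_call(0x400000, 1, 0x49000000, 4));
    return 0;
}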

> In which case -- why invent a new interface, instead of just reusing the
> existing one?
Hypercalls (from domains) and syscalls (from apps) are intersecting
sets, but neither is a subset of the other. One can merge them, but then
there will be calls that have meaning only for apps and there will be
calls that make sense only for domains. Honestly, I have no strong
opinion on which approach is better. I see pros and cons for each
variant.

>>> The problem with stubdoms is latency and scheduling. It is not
>>> deterministic. We could easily improve the null scheduler to introduce
>>> some sort of non-preemptive scheduling of stubdoms on the same pcpus of
>>> the guest vcpus. It would still require manually pinning vcpus to pcpus.
>> I see couple of other problems with stubdoms. For example, we need
>> mechanism to load mediator stubdom before dom0.
>
> There are a couple of options here.  You could do something like the
> Xoar project [1] did, and have Xen boot a special-purpose "system
> builder" domain, which would start both the mediator and then a dom0.
Wow. That's a very interesting idea.

>>> Then, we could add a sched_op hypercall to let the schedulers know that
>>> a stubdom is tied to a specific guest domain.
>> What if one stubdom serves multiple domains? This is TEE use case.
> Then you don't make that hypercall. :-)  In any case you certainly can't
> use an EL0 app for that, at least the way we've been describing it.
That depends on how many rights you give to an EL0 app. I think it is
possible to use it for this purpose. But actually, I'd like to see the
TEE mediator right in the hypervisor.

-- 
WBR Volodymyr Babchuk aka lorc [+380976646013]
mailto: vlad.babchuk@gmail.com


* Re: Notes on stubdoms and latency on ARM
  2017-05-26 19:43     ` Volodymyr Babchuk
@ 2017-05-26 19:46       ` Volodymyr Babchuk
  0 siblings, 0 replies; 49+ messages in thread
From: Volodymyr Babchuk @ 2017-05-26 19:46 UTC (permalink / raw)
  To: George Dunlap
  Cc: Artem_Mygaiev, Stefano Stabellini, Andrii Anisov, Dario Faggioli,
	Julien Grall, xen-devel

On 26 May 2017 at 12:43, Volodymyr Babchuk <vlad.babchuk@gmail.com> wrote:
> Hi Dario,
>
Oops, sorry, George. There were two emails in a row, yours and Dario's,
and I overlooked whom I was answering.


-- 
WBR Volodymyr Babchuk aka lorc [+380976646013]
mailto: vlad.babchuk@gmail.com


* Re: Notes on stubdoms and latency on ARM
  2017-05-23  7:11   ` Dario Faggioli
@ 2017-05-26 20:09     ` Volodymyr Babchuk
  2017-05-27  2:10       ` Dario Faggioli
  0 siblings, 1 reply; 49+ messages in thread
From: Volodymyr Babchuk @ 2017-05-26 20:09 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Artem_Mygaiev, Stefano Stabellini, Andrii Anisov, George Dunlap,
	Julien Grall, xen-devel

Hello Dario,
>> I'm not very familiar with XEN schedulers.
> Feel free to ask anything. :-)
I'm so unfamiliar that I don't even know what to ask :) But thank you.
Surely I'll have questions.

>> Looks like null scheduler
>> is good for hard RT, but isn't fine for a generic consumer system.
>>
> The null scheduler is meant at being useful when you have a static
> scenario, no (or very few) overbooking (i.e., total nr of vCPUs ~= nr
> of pCPUS), and what to cut to _zero_ the scheduling overhead.
>
> That may include certain class of real-time workloads, but it not
> limited to such use case.
Can't I achieve the same with any other scheduler by pinning one vcpu
to one pcpu?

>> How
>> do you think: is it possible to make credit2 scheduler to schedule
>> stubdoms in the same way?
>>
> It is indeed possible. Actually, it's actually in the plans to do
> exactly something like that, as it could potentially be useful for a
> wide range of use cases.
>
> Doing it in the null scheduler is just easier, and we think it would be
> a nice way to quickly have a proof of concept done. Afterwards, we'll
> focus on other schedulers too.
>
>> Do you have any tools to profile or trace XEN core? Also, I don't
>> think that pure context switch time is the biggest issue. Even now,
>> it
>> allows 180 000 switches per second (if I'm not wrong). I think,
>> scheduling latency is more important.
>>
> What do you refer to when you say 'scheduling latency'? As in, the
> latency between which events, happening on which component?
I'm worried about the interval between task switching events.
For example: vcpu1 is a vcpu of some domU and vcpu2 is a vcpu of the
stubdom that runs the device emulator for that domU.
vcpu1 issues an MMIO access that should be handled by vcpu2 and gets
blocked by the hypervisor. Then there will be two context switches:
vcpu1->vcpu2 to emulate that MMIO access and vcpu2->vcpu1 to continue
work. AFAIK, credit2 does not guarantee that vcpu2 will be scheduled
right after vcpu1 is blocked. It can schedule some vcpu3, then vcpu4,
and only then come back to vcpu2. That time interval between the event
"vcpu2 was made runnable" and the event "vcpu2 was scheduled on a pcpu"
is what I call 'scheduling latency'.
This latency can be minimized by a mechanism similar to priority
inheritance: if the scheduler knows that vcpu1 waits for vcpu2 and there
is time slice remaining for vcpu1, it should select vcpu2 as the next
scheduled vcpu. The problem is how to populate such dependencies.
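
A tiny, self-contained toy model of that idea (not Xen code; all names
are made up) could look like this: when the vcpu that just blocked is
known to be waiting on another vcpu, pick that vcpu next instead of
going round-robin through unrelated ones.

#include <stdio.h>

#define NR_VCPUS 4

struct vcpu {
    int id;
    int runnable;
    int waiting_on;         /* id of the vcpu we are blocked on, or -1 */
};

static struct vcpu vcpus[NR_VCPUS];

/* Pick what to run on the pcpu where 'prev' just blocked. */
static struct vcpu *pick_next(const struct vcpu *prev)
{
    if (prev->waiting_on >= 0 && vcpus[prev->waiting_on].runnable)
        return &vcpus[prev->waiting_on];    /* "inherit" the slot */

    for (int i = 0; i < NR_VCPUS; i++)      /* fallback: round robin */
        if (vcpus[i].runnable)
            return &vcpus[i];
    return NULL;                            /* nothing to run: idle */
}

int main(void)
{
    for (int i = 0; i < NR_VCPUS; i++)
        vcpus[i] = (struct vcpu){ .id = i, .runnable = 1, .waiting_on = -1 };

    /* vcpu1 (guest) blocks waiting on vcpu2 (its stubdom emulator). */
    vcpus[1].runnable = 0;
    vcpus[1].waiting_on = 2;

    struct vcpu *next = pick_next(&vcpus[1]);
    printf("next vcpu: %d\n", next ? next->id : -1);    /* prints 2 */
    return 0;
}

The hard part, as noted above, is populating waiting_on reliably.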

-- 
WBR Volodymyr Babchuk aka lorc [+380976646013]
mailto: vlad.babchuk@gmail.com


* Re: Notes on stubdoms and latency on ARM
  2017-05-26 20:09     ` Volodymyr Babchuk
@ 2017-05-27  2:10       ` Dario Faggioli
  0 siblings, 0 replies; 49+ messages in thread
From: Dario Faggioli @ 2017-05-27  2:10 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Artem_Mygaiev, Stefano Stabellini, Andrii Anisov, George Dunlap,
	Julien Grall, xen-devel


On Fri, 2017-05-26 at 13:09 -0700, Volodymyr Babchuk wrote:
> Hello Dario,
>
Hi,

> > Feel free to ask anything. :-)
> 
> I'm so unfamiliar, so even don't know what to ask :) But thank you.
> Surely I'll have questions.
> 
Sure. As soon as you have one, go ahead with it.

> > The null scheduler is meant at being useful when you have a static
> > scenario, no (or very few) overbooking (i.e., total nr of vCPUs ~=
> > nr
> > of pCPUS), and what to cut to _zero_ the scheduling overhead.
> > 
> > That may include certain class of real-time workloads, but it not
> > limited to such use case.
> 
> Can't I achieve the same with any other scheduler by pining one vcpu
> to one pcpu?
> 
Of course you can, but not with the same (small!!) level of overhead as
the null scheduler. In fact, even if you do 1-to-1 pinning of all the
vcpus, a general purpose scheduler (like Credit1 or Credit2) can't rely
on the assumption that something like that is indeed in effect, and that
it will always be.

For instance, suppose you have all vcpus except one each pinned to a
pCPU. That one remaining vcpu, in its turn, can run everywhere. The
scheduler always has to go and see which vcpu is the one that is free to
run everywhere, and whether it should (for instance) preempt any (and,
if yes, which) of the pinned ones.

Also, still in those schedulers, there may be multiple vcpus that are
pinned to the same pCPU. In that case, the scheduler, at each scheduling
decision, needs to figure out which ones (among all the vcpus) they are,
and which one has the right to run on the pCPU.

And, unfortunately, since pinning can change 100% asynchronously wrt the
scheduler, it's really not possible either to make assumptions or even
to try to capture some (special case) situation in a data structure.

Therefore, yes, if you configure 1-to-1 pinning in Credit1 or Credit2,
the actual schedule would be the same. But that would be achieved with
almost the same computational overhead as if the vcpus were free.

OTOH, the null scheduler is specifically designed for the (semi-)static
1-to-1 pinning use case, so the overhead it introduces (for making
scheduling decisions) is close to zero.

> > > Do you have any tools to profile or trace XEN core? Also, I don't
> > > think that pure context switch time is the biggest issue. Even
> > > now,
> > > it
> > > allows 180 000 switches per second (if I'm not wrong). I think,
> > > scheduling latency is more important.
> > > 
> > 
> > What do you refer to when you say 'scheduling latency'? As in, the
> > latency between which events, happening on which component?
> 
> I'm worried about interval between task switching events.
> For example: vcpu1 is vcpu of some domU and vcpu2 is vcpu of stubdom
> that runs device emulator for domU.
> vcpu1 issues MMIO access that should be handled by vcpu2 and gets
> blocked by hypervisor. Then there will be two context switches:
> vcpu1->vcpu2 to emulate that MMIO access and vcpu2->vcpu1 to continue
> work. AFAIK, credit2 does not guarantee that vcpu2 will be scheduled
> right after when vcpu1 will be blocked. It can schedule some vcpu3,
> then vcpu4 and only then come back to vcpu2.  That time interval
> between event "vcpu2 was made runable" and event "vcpu2 was scheduled
> on pcpu" is what I call 'scheduling latency'.
>
Yes, currently, that's true. Basically, from the scheduling point of
view, there's no particular relationship between a domain's vcpus and
the vcpus of the driver/stub-dom that services the domain itself.

But there's a plan to change that, as both Stefano and I said already,
and do something in all schedulers. We'll just start with null, because
it's the easiest. :-)

> This latency can be minimized by mechanism similar to priority
> inheritance: if scheduler knows that vcpu1 waits for vcpu2 and there
> are remaining time slice for vcpu1 it should select vcpu2 as next
> scheduled vcpu. Problem is how to populate such dependencies.
> 
I've spent my PhD studying and doing stuff around priority
inheritance... so something similar to that is exactly what I had in
mind. :-D

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: Notes on stubdoms and latency on ARM
  2017-05-26 19:28     ` Volodymyr Babchuk
@ 2017-05-30 17:29       ` Stefano Stabellini
  2017-05-30 17:33         ` Julien Grall
  2017-05-31  9:09         ` George Dunlap
  2017-05-31 17:02       ` George Dunlap
  1 sibling, 2 replies; 49+ messages in thread
From: Stefano Stabellini @ 2017-05-30 17:29 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Artem_Mygaiev, Stefano Stabellini, Andrii Anisov, Dario Faggioli,
	George Dunlap, Julien Grall, xen-devel

On Fri, 26 May 2017, Volodymyr Babchuk wrote:
> >> > The other issue with stubdoms is context switch times. Volodymyr showed
> >> > that minios has much higher context switch times compared to EL0 apps.
> >> > It is probably due to GIC context switch, that is skipped for EL0 apps.
> >> > Maybe we could skip GIC context switch for stubdoms too, if we knew that
> >> > they are not going to use the VGIC. At that point, context switch times
> >> > should be very similar to EL0 apps.
> >> So you are suggesting to create something like lightweight stubdom. I
> >> generally like this idea. But AFAIK, vGIC is used to deliver events
> >> from hypervisor to stubdom. Do you want to propose another mechanism?
> >
> > There is no way out: if the stubdom needs events, then we'll have to
> > expose and context switch the vGIC. If it doesn't, then we can skip the
> > vGIC. However, we would have a similar problem with EL0 apps: I am
> > assuming that EL0 apps don't need to handle interrupts, but if they do,
> > then they might need something like a vGIC.
> Hm. Correct me, but if we want make stubdom to handle some requests
> (e.g. emulate MMIO access), then it needs events, and thus it needs
> interrupts. At least, I'm not aware about any other mechanism, that
> allows hypervisor to signal to a domain.

The stubdom could do polling and avoid interrupts for example, but that
would probably not be desirable.


> On other hand, EL0 app (as I see them) does not need such events.
> Basically, you just call function `handle_mmio()` right in the app.
> So, apps can live without interrupts and they still be able to handle
> request.

That's true.


* Re: Notes on stubdoms and latency on ARM
  2017-05-30 17:29       ` Stefano Stabellini
@ 2017-05-30 17:33         ` Julien Grall
  2017-06-01 10:28           ` Julien Grall
  2017-05-31  9:09         ` George Dunlap
  1 sibling, 1 reply; 49+ messages in thread
From: Julien Grall @ 2017-05-30 17:33 UTC (permalink / raw)
  To: Stefano Stabellini, Volodymyr Babchuk
  Cc: Artem_Mygaiev, Dario Faggioli, xen-devel, Andrii Anisov, George Dunlap



On 30/05/17 18:29, Stefano Stabellini wrote:
> On Fri, 26 May 2017, Volodymyr Babchuk wrote:
>>>>> The other issue with stubdoms is context switch times. Volodymyr showed
>>>>> that minios has much higher context switch times compared to EL0 apps.
>>>>> It is probably due to GIC context switch, that is skipped for EL0 apps.
>>>>> Maybe we could skip GIC context switch for stubdoms too, if we knew that
>>>>> they are not going to use the VGIC. At that point, context switch times
>>>>> should be very similar to EL0 apps.
>>>> So you are suggesting to create something like lightweight stubdom. I
>>>> generally like this idea. But AFAIK, vGIC is used to deliver events
>>>> from hypervisor to stubdom. Do you want to propose another mechanism?
>>>
>>> There is no way out: if the stubdom needs events, then we'll have to
>>> expose and context switch the vGIC. If it doesn't, then we can skip the
>>> vGIC. However, we would have a similar problem with EL0 apps: I am
>>> assuming that EL0 apps don't need to handle interrupts, but if they do,
>>> then they might need something like a vGIC.
>> Hm. Correct me, but if we want make stubdom to handle some requests
>> (e.g. emulate MMIO access), then it needs events, and thus it needs
>> interrupts. At least, I'm not aware about any other mechanism, that
>> allows hypervisor to signal to a domain.
>
> The stubdom could do polling and avoid interrupts for example, but that
> would probably not be desirable.

The polling can be minimized if you block the vCPU when there is
nothing to do. It would get unblocked when you have to schedule it
because of a request.

>
>
>> On other hand, EL0 app (as I see them) does not need such events.
>> Basically, you just call function `handle_mmio()` right in the app.
>> So, apps can live without interrupts and they still be able to handle
>> request.
>
> That's true.
>

Cheers,

-- 
Julien Grall


* Re: Notes on stubdoms and latency on ARM
  2017-05-30 17:29       ` Stefano Stabellini
  2017-05-30 17:33         ` Julien Grall
@ 2017-05-31  9:09         ` George Dunlap
  2017-05-31 15:53           ` Dario Faggioli
  2017-05-31 17:45           ` Stefano Stabellini
  1 sibling, 2 replies; 49+ messages in thread
From: George Dunlap @ 2017-05-31  9:09 UTC (permalink / raw)
  To: Stefano Stabellini, Volodymyr Babchuk
  Cc: Artem_Mygaiev, Dario Faggioli, xen-devel, Andrii Anisov, Julien Grall

On 30/05/17 18:29, Stefano Stabellini wrote:
> On Fri, 26 May 2017, Volodymyr Babchuk wrote:
>>>>> The other issue with stubdoms is context switch times. Volodymyr showed
>>>>> that minios has much higher context switch times compared to EL0 apps.
>>>>> It is probably due to GIC context switch, that is skipped for EL0 apps.
>>>>> Maybe we could skip GIC context switch for stubdoms too, if we knew that
>>>>> they are not going to use the VGIC. At that point, context switch times
>>>>> should be very similar to EL0 apps.
>>>> So you are suggesting to create something like lightweight stubdom. I
>>>> generally like this idea. But AFAIK, vGIC is used to deliver events
>>>> from hypervisor to stubdom. Do you want to propose another mechanism?
>>>
>>> There is no way out: if the stubdom needs events, then we'll have to
>>> expose and context switch the vGIC. If it doesn't, then we can skip the
>>> vGIC. However, we would have a similar problem with EL0 apps: I am
>>> assuming that EL0 apps don't need to handle interrupts, but if they do,
>>> then they might need something like a vGIC.
>> Hm. Correct me, but if we want make stubdom to handle some requests
>> (e.g. emulate MMIO access), then it needs events, and thus it needs
>> interrupts. At least, I'm not aware about any other mechanism, that
>> allows hypervisor to signal to a domain.
> 
> The stubdom could do polling and avoid interrupts for example, but that
> would probably not be desirable.
> 
> 
>> On other hand, EL0 app (as I see them) does not need such events.
>> Basically, you just call function `handle_mmio()` right in the app.
>> So, apps can live without interrupts and they still be able to handle
>> request.
> 
> That's true.

Well if they're in a separate security zone, that's not going to work.
You have to have a defined interface between things and sanitize inputs
between them.  Furthermore, you probably want something like a stable
interface with some level of backwards compatibility, which is not
something the internal hypervisor interfaces are designed for.

 -George



* Re: Notes on stubdoms and latency on ARM
  2017-05-31  9:09         ` George Dunlap
@ 2017-05-31 15:53           ` Dario Faggioli
  2017-05-31 16:17             ` Volodymyr Babchuk
  2017-05-31 17:45           ` Stefano Stabellini
  1 sibling, 1 reply; 49+ messages in thread
From: Dario Faggioli @ 2017-05-31 15:53 UTC (permalink / raw)
  To: George Dunlap, Stefano Stabellini, Volodymyr Babchuk
  Cc: Artem_Mygaiev, Julien Grall, xen-devel, Andrii Anisov


On Wed, 2017-05-31 at 10:09 +0100, George Dunlap wrote:
> On 30/05/17 18:29, Stefano Stabellini wrote:
> > On Fri, 26 May 2017, Volodymyr Babchuk wrote:
> > > On other hand, EL0 app (as I see them) does not need such events.
> > > Basically, you just call function `handle_mmio()` right in the
> > > app.
> > > So, apps can live without interrupts and they still be able to
> > > handle
> > > request.
> > 
> > That's true.
> 
> Well if they're in a separate security zone, that's not going to
> work.
> You have to have a defined interface between things and sanitize
> inputs
> between them.  
>
Exactly, I was about to ask almost the same thing.

In fact, if you are "not" in Xen, as in, you are (and want to be there
by design) in an entity that is scheduled by Xen, and runs at a
different privilege level than Xen code, how come you can just call
random hypervisor functions?

Or am I still missing something (of either ARM in general, or of these
Apps in particular)?

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: Notes on stubdoms and latency on ARM
  2017-05-31 15:53           ` Dario Faggioli
@ 2017-05-31 16:17             ` Volodymyr Babchuk
  0 siblings, 0 replies; 49+ messages in thread
From: Volodymyr Babchuk @ 2017-05-31 16:17 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Artem_Mygaiev, xen-devel, Andrii Anisov, George Dunlap,
	Julien Grall, Stefano Stabellini

Hi Dario,

>> > > On other hand, EL0 app (as I see them) does not need such events.
>> > > Basically, you just call function `handle_mmio()` right in the
>> > > app.
>> > > So, apps can live without interrupts and they still be able to
>> > > handle
>> > > request.
>> >
>> > That's true.
>>
>> Well if they're in a separate security zone, that's not going to
>> work.
>> You have to have a defined interface between things and sanitize
>> inputs
>> between them.
>>
> Exactly, I was about to ask almost the same thing.
>
> In fact, if you are "not" in Xen, as in, you are (and want to be there
> by design) in an entity that is scheduled by Xen, and runs at a
> different privilege level than Xen code, how come you can just call
> random hypervisor functions?
It is impossible, indeed. As I said earlier, the interface between an
app and the hypervisor would be similar to the hypercall interface (or
it would be the hypercall interface itself).
ARM provides a native interface for syscalls in hypervisor mode. That
means that, if you wish, you can handle both hypercalls (as a
hypervisor) and syscalls (as an "OS" for apps).

-- 
WBR Volodymyr Babchuk aka lorc [+380976646013]
mailto: vlad.babchuk@gmail.com


* Re: Notes on stubdoms and latency on ARM
  2017-05-26 19:28     ` Volodymyr Babchuk
  2017-05-30 17:29       ` Stefano Stabellini
@ 2017-05-31 17:02       ` George Dunlap
  2017-06-17  0:14         ` Volodymyr Babchuk
  1 sibling, 1 reply; 49+ messages in thread
From: George Dunlap @ 2017-05-31 17:02 UTC (permalink / raw)
  To: Volodymyr Babchuk, Stefano Stabellini
  Cc: Artem_Mygaiev, Dario Faggioli, xen-devel, Andrii Anisov, Julien Grall

On 26/05/17 20:28, Volodymyr Babchuk wrote:
>> There is no way out: if the stubdom needs events, then we'll have to
>> expose and context switch the vGIC. If it doesn't, then we can skip the
>> vGIC. However, we would have a similar problem with EL0 apps: I am
>> assuming that EL0 apps don't need to handle interrupts, but if they do,
>> then they might need something like a vGIC.
> Hm. Correct me, but if we want make stubdom to handle some requests
> (e.g. emulate MMIO access), then it needs events, and thus it needs
> interrupts. At least, I'm not aware about any other mechanism, that
> allows hypervisor to signal to a domain.
> On other hand, EL0 app (as I see them) does not need such events.
> Basically, you just call function `handle_mmio()` right in the app.
> So, apps can live without interrupts and they still be able to handle
> request.

So remember that "interrupt" and "event" are basically the same as
"structured callback".  When anything happens that Xen wants to tell the
EL0 app about, it has to have a way of telling it.  If the EL0 app is
handling a device, it has to have some way of getting interrupts from
that device; if it needs to emulate devices sent to the guest, it needs
some way to tell Xen to deliver an interrupt to the guest.

Now, we could make the EL0 app interface "interruptless".  Xen could
write information about pending events in a shared memory region, and
the EL0 app could check that before calling some sort of block()
hypercall, and check it again when it returns from the block() call.
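
Purely as an illustration (block(), the shared layout and the event
numbers are all hypothetical, not an existing interface), the app side
of such an interruptless loop might look like this self-contained toy:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical region shared between Xen and the EL0 app: Xen sets
 * bits in 'pending', the app consumes and clears them. */
struct shared_events {
    volatile uint32_t pending;      /* one bit per event source */
};

static struct shared_events shared; /* stands in for a Xen-mapped page */

/* Stand-in for a hypothetical "sleep until something is pending" call;
 * here it just pretends Xen posted event 3 while we were blocked. */
static void block(void)
{
    shared.pending |= 1u << 3;
}

static void handle_event(unsigned int ev)
{
    printf("handling event %u\n", ev);
}

int main(void)
{
    shared.pending = (1u << 0) | (1u << 2);     /* two events queued */

    for (int round = 0; round < 2; round++) {
        /* Atomically grab-and-clear whatever has been posted so far. */
        uint32_t pending = __atomic_exchange_n(&shared.pending, 0,
                                               __ATOMIC_ACQ_REL);
        if (!pending) {
            block();                /* nothing to do: yield to Xen */
            continue;
        }
        while (pending) {
            unsigned int ev = __builtin_ctz(pending);   /* lowest bit */
            pending &= pending - 1;
            handle_event(ev);
        }
    }
    return 0;
}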

But the shared event information starts to look an awful lot like events
and/or pending bits on an interrupt controller -- the only difference
being that you aren't interrupted if you're already running.

I'm pretty sure you could run in this mode using the existing interfaces
if you didn't want the hassle of dealing with asynchrony.  If that's the
case, then why bother inventing an entirely new interface, with its own
bugs and duplication of functionality?  Why not just use what we already
have?

 -George


* Re: Notes on stubdoms and latency on ARM
  2017-05-31  9:09         ` George Dunlap
  2017-05-31 15:53           ` Dario Faggioli
@ 2017-05-31 17:45           ` Stefano Stabellini
  2017-06-01 10:48             ` Julien Grall
  2017-06-01 10:52             ` George Dunlap
  1 sibling, 2 replies; 49+ messages in thread
From: Stefano Stabellini @ 2017-05-31 17:45 UTC (permalink / raw)
  To: George Dunlap
  Cc: Artem_Mygaiev, Stefano Stabellini, Andrii Anisov,
	Volodymyr Babchuk, Dario Faggioli, Julien Grall, xen-devel

On Wed, 31 May 2017, George Dunlap wrote:
> On 30/05/17 18:29, Stefano Stabellini wrote:
> > On Fri, 26 May 2017, Volodymyr Babchuk wrote:
> >>>>> The other issue with stubdoms is context switch times. Volodymyr showed
> >>>>> that minios has much higher context switch times compared to EL0 apps.
> >>>>> It is probably due to GIC context switch, that is skipped for EL0 apps.
> >>>>> Maybe we could skip GIC context switch for stubdoms too, if we knew that
> >>>>> they are not going to use the VGIC. At that point, context switch times
> >>>>> should be very similar to EL0 apps.
> >>>> So you are suggesting to create something like lightweight stubdom. I
> >>>> generally like this idea. But AFAIK, vGIC is used to deliver events
> >>>> from hypervisor to stubdom. Do you want to propose another mechanism?
> >>>
> >>> There is no way out: if the stubdom needs events, then we'll have to
> >>> expose and context switch the vGIC. If it doesn't, then we can skip the
> >>> vGIC. However, we would have a similar problem with EL0 apps: I am
> >>> assuming that EL0 apps don't need to handle interrupts, but if they do,
> >>> then they might need something like a vGIC.
> >> Hm. Correct me, but if we want make stubdom to handle some requests
> >> (e.g. emulate MMIO access), then it needs events, and thus it needs
> >> interrupts. At least, I'm not aware about any other mechanism, that
> >> allows hypervisor to signal to a domain.
> > 
> > The stubdom could do polling and avoid interrupts for example, but that
> > would probably not be desirable.
> > 
> > 
> >> On other hand, EL0 app (as I see them) does not need such events.
> >> Basically, you just call function `handle_mmio()` right in the app.
> >> So, apps can live without interrupts and they still be able to handle
> >> request.
> > 
> > That's true.
> 
> Well if they're in a separate security zone, that's not going to work.
> You have to have a defined interface between things and sanitize inputs
> between them.

Why? The purpose of EL0 apps is not to do checks on VM traps in Xen but
in a different privilege level instead. Maybe I misunderstood what you
are saying? Specifically, what "inputs" do you think should be sanitized
in Xen before jumping into the EL0 app?


> Furthermore, you probably want something like a stable
> interface with some level of backwards compatibility, which is not
> something the internal hypervisor interfaces are designed for.

I don't think we should provide that. If the user wants a stable
interface, she can use domains. I suggested that the code for the EL0
app should come out of the Xen repository directly. As with the Xen
tools, they would be expected to always be in sync.


* Re: Notes on stubdoms and latency on ARM
  2017-05-30 17:33         ` Julien Grall
@ 2017-06-01 10:28           ` Julien Grall
  2017-06-17  0:17             ` Volodymyr Babchuk
  0 siblings, 1 reply; 49+ messages in thread
From: Julien Grall @ 2017-06-01 10:28 UTC (permalink / raw)
  To: Stefano Stabellini, Volodymyr Babchuk
  Cc: Artem_Mygaiev, Dario Faggioli, xen-devel, Andrii Anisov, George Dunlap

Hi,

On 30/05/17 18:33, Julien Grall wrote:
>
>
> On 30/05/17 18:29, Stefano Stabellini wrote:
>> On Fri, 26 May 2017, Volodymyr Babchuk wrote:
>>>>>> The other issue with stubdoms is context switch times. Volodymyr
>>>>>> showed
>>>>>> that minios has much higher context switch times compared to EL0
>>>>>> apps.
>>>>>> It is probably due to GIC context switch, that is skipped for EL0
>>>>>> apps.
>>>>>> Maybe we could skip GIC context switch for stubdoms too, if we
>>>>>> knew that
>>>>>> they are not going to use the VGIC. At that point, context switch
>>>>>> times
>>>>>> should be very similar to EL0 apps.
>>>>> So you are suggesting to create something like lightweight stubdom. I
>>>>> generally like this idea. But AFAIK, vGIC is used to deliver events
>>>>> from hypervisor to stubdom. Do you want to propose another mechanism?
>>>>
>>>> There is no way out: if the stubdom needs events, then we'll have to
>>>> expose and context switch the vGIC. If it doesn't, then we can skip the
>>>> vGIC. However, we would have a similar problem with EL0 apps: I am
>>>> assuming that EL0 apps don't need to handle interrupts, but if they do,
>>>> then they might need something like a vGIC.
>>> Hm. Correct me, but if we want make stubdom to handle some requests
>>> (e.g. emulate MMIO access), then it needs events, and thus it needs
>>> interrupts. At least, I'm not aware about any other mechanism, that
>>> allows hypervisor to signal to a domain.
>>
>> The stubdom could do polling and avoid interrupts for example, but that
>> would probably not be desirable.
>
> The polling can be minimized if you block the vCPU when there are
> nothing to do. It would get unblock when you have to schedule him
> because of a request.

Thinking a bit more about this: so far, we rely on the domain using the
vGIC interrupt controller, which requires the context switch.

We could also implement a dummy interrupt controller handling a
predefined, limited number of interrupts, which would allow asynchronous
support in stubdoms and an interface to support upcalls via the
interrupt exception vector.

This is something that would be more tricky to do with an EL0 app, as
there is no EL0 exception vector.

Cheers,

-- 
Julien Grall


* Re: Notes on stubdoms and latency on ARM
  2017-05-31 17:45           ` Stefano Stabellini
@ 2017-06-01 10:48             ` Julien Grall
  2017-06-01 10:52             ` George Dunlap
  1 sibling, 0 replies; 49+ messages in thread
From: Julien Grall @ 2017-06-01 10:48 UTC (permalink / raw)
  To: Stefano Stabellini, George Dunlap
  Cc: Volodymyr Babchuk, Artem_Mygaiev, Dario Faggioli, xen-devel,
	Andrii Anisov

Hi Stefano,

On 31/05/17 18:45, Stefano Stabellini wrote:
> On Wed, 31 May 2017, George Dunlap wrote:
>> On 30/05/17 18:29, Stefano Stabellini wrote:
>>> On Fri, 26 May 2017, Volodymyr Babchuk wrote:
>>>>>>> The other issue with stubdoms is context switch times. Volodymyr showed
>>>>>>> that minios has much higher context switch times compared to EL0 apps.
>>>>>>> It is probably due to GIC context switch, that is skipped for EL0 apps.
>>>>>>> Maybe we could skip GIC context switch for stubdoms too, if we knew that
>>>>>>> they are not going to use the VGIC. At that point, context switch times
>>>>>>> should be very similar to EL0 apps.
>>>>>> So you are suggesting to create something like lightweight stubdom. I
>>>>>> generally like this idea. But AFAIK, vGIC is used to deliver events
>>>>>> from hypervisor to stubdom. Do you want to propose another mechanism?
>>>>>
>>>>> There is no way out: if the stubdom needs events, then we'll have to
>>>>> expose and context switch the vGIC. If it doesn't, then we can skip the
>>>>> vGIC. However, we would have a similar problem with EL0 apps: I am
>>>>> assuming that EL0 apps don't need to handle interrupts, but if they do,
>>>>> then they might need something like a vGIC.
>>>> Hm. Correct me, but if we want make stubdom to handle some requests
>>>> (e.g. emulate MMIO access), then it needs events, and thus it needs
>>>> interrupts. At least, I'm not aware about any other mechanism, that
>>>> allows hypervisor to signal to a domain.
>>>
>>> The stubdom could do polling and avoid interrupts for example, but that
>>> would probably not be desirable.
>>>
>>>
>>>> On other hand, EL0 app (as I see them) does not need such events.
>>>> Basically, you just call function `handle_mmio()` right in the app.
>>>> So, apps can live without interrupts and they still be able to handle
>>>> request.
>>>
>>> That's true.
>>
>> Well if they're in a separate security zone, that's not going to work.
>> You have to have a defined interface between things and sanitize inputs
>> between them.
>
> Why? The purpose of EL0 apps is not to do checks on VM traps in Xen but
> in a different privilege level instead. Maybe I misunderstood what you
> are saying? Specifically, what "inputs" do you think should be sanitized
> in Xen before jumping into the EL0 app?
>
>
>> Furthermore, you probably want something like a stable
>> interface with some level of backwards compatibility, which is not
>> something the internal hypervisor interfaces are designed for.
>
> I don't think we should provide that. If the user wants a stable
> interface, she can use domains. I suggested that the code for the EL0
> app should come out of the Xen repository directly. Like for the Xen
> tools, they would be expected to be always in-sync.

Realistically, even if the EL0 apps are available directly in the Xen
repository, they will be built as standalone binaries. So any ABI change
will require inspecting/testing all the EL0 apps if the change is
subtle.

This sounds to me like a waste of time and resources compared to
providing a stable and clearly defined ABI.

Cheers,

-- 
Julien Grall


* Re: Notes on stubdoms and latency on ARM
  2017-05-31 17:45           ` Stefano Stabellini
  2017-06-01 10:48             ` Julien Grall
@ 2017-06-01 10:52             ` George Dunlap
  2017-06-01 10:54               ` George Dunlap
                                 ` (2 more replies)
  1 sibling, 3 replies; 49+ messages in thread
From: George Dunlap @ 2017-06-01 10:52 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Artem_Mygaiev, xen-devel, Andrii Anisov, Volodymyr Babchuk,
	Dario Faggioli, Julien Grall


> On May 31, 2017, at 6:45 PM, Stefano Stabellini <sstabellini@kernel.org> wrote:
> 
> On Wed, 31 May 2017, George Dunlap wrote:
>> On 30/05/17 18:29, Stefano Stabellini wrote:
>>> On Fri, 26 May 2017, Volodymyr Babchuk wrote:
>>>>>>> The other issue with stubdoms is context switch times. Volodymyr showed
>>>>>>> that minios has much higher context switch times compared to EL0 apps.
>>>>>>> It is probably due to GIC context switch, that is skipped for EL0 apps.
>>>>>>> Maybe we could skip GIC context switch for stubdoms too, if we knew that
>>>>>>> they are not going to use the VGIC. At that point, context switch times
>>>>>>> should be very similar to EL0 apps.
>>>>>> So you are suggesting to create something like lightweight stubdom. I
>>>>>> generally like this idea. But AFAIK, vGIC is used to deliver events
>>>>>> from hypervisor to stubdom. Do you want to propose another mechanism?
>>>>> 
>>>>> There is no way out: if the stubdom needs events, then we'll have to
>>>>> expose and context switch the vGIC. If it doesn't, then we can skip the
>>>>> vGIC. However, we would have a similar problem with EL0 apps: I am
>>>>> assuming that EL0 apps don't need to handle interrupts, but if they do,
>>>>> then they might need something like a vGIC.
>>>> Hm. Correct me, but if we want make stubdom to handle some requests
>>>> (e.g. emulate MMIO access), then it needs events, and thus it needs
>>>> interrupts. At least, I'm not aware about any other mechanism, that
>>>> allows hypervisor to signal to a domain.
>>> 
>>> The stubdom could do polling and avoid interrupts for example, but that
>>> would probably not be desirable.
>>> 
>>> 
>>>> On other hand, EL0 app (as I see them) does not need such events.
>>>> Basically, you just call function `handle_mmio()` right in the app.
>>>> So, apps can live without interrupts and they still be able to handle
>>>> request.
>>> 
>>> That's true.
>> 
>> Well if they're in a separate security zone, that's not going to work.
>> You have to have a defined interface between things and sanitize inputs
>> between them.
> 
> Why? The purpose of EL0 apps is not to do checks on VM traps in Xen but
> in a different privilege level instead. Maybe I misunderstood what you
> are saying? Specifically, what "inputs" do you think should be sanitized
> in Xen before jumping into the EL0 app?

>> Furthermore, you probably want something like a stable
>> interface with some level of backwards compatibility, which is not
>> something the internal hypervisor interfaces are designed for.
> 
> I don't think we should provide that. If the user wants a stable
> interface, she can use domains. I suggested that the code for the EL0
> app should come out of the Xen repository directly. Like for the Xen
> tools, they would be expected to be always in-sync.

Hmm, it sounds like perhaps I misunderstood you and Volodymyr.  I took “you just call function `handle_mmio()` right in the app” to mean that the *app* calls the *hypervisor* function named “handle_mmio”.  It sounds like what he (or at least you) actually meant was that the *hypervisor* calls the function named “handle_mmio” in the *app*?

But presumably the app will need to do privileged operations — change the guest’s state, read / write MMIO regions, &c.  We can theoretically have Xen ‘just call functions’ in the app; but we definitely *cannot* have the app ‘just call functions’ inside of Xen — that is, not if you actually want any additional security.

And that’s completely apart from the whole non-GPL discussion we had.  If you want non-GPL apps, I think you definitely want a nice clean interface, or you’ll have a hard time arguing that the resulting thing is not a derived work (in spite of the separate address spaces).

The two motivating factors for having apps were additional security and non-GPL implementations of device models / mediators.  Having the app being able to call into Xen undermines both.

 -George
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-06-01 10:52             ` George Dunlap
@ 2017-06-01 10:54               ` George Dunlap
  2017-06-01 12:40               ` Dario Faggioli
  2017-06-01 18:27               ` Stefano Stabellini
  2 siblings, 0 replies; 49+ messages in thread
From: George Dunlap @ 2017-06-01 10:54 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Artem_Mygaiev, xen-devel, Andrii Anisov, Volodymyr Babchuk,
	Dario Faggioli, Julien Grall


> On Jun 1, 2017, at 11:52 AM, George Dunlap <george.dunlap@citrix.com> wrote:
> 
> 
>> On May 31, 2017, at 6:45 PM, Stefano Stabellini <sstabellini@kernel.org> wrote:
>> 
>> On Wed, 31 May 2017, George Dunlap wrote:
>>> On 30/05/17 18:29, Stefano Stabellini wrote:
>>>> On Fri, 26 May 2017, Volodymyr Babchuk wrote:
>>>>>>>> The other issue with stubdoms is context switch times. Volodymyr showed
>>>>>>>> that minios has much higher context switch times compared to EL0 apps.
>>>>>>>> It is probably due to GIC context switch, that is skipped for EL0 apps.
>>>>>>>> Maybe we could skip GIC context switch for stubdoms too, if we knew that
>>>>>>>> they are not going to use the VGIC. At that point, context switch times
>>>>>>>> should be very similar to EL0 apps.
>>>>>>> So you are suggesting to create something like lightweight stubdom. I
>>>>>>> generally like this idea. But AFAIK, vGIC is used to deliver events
>>>>>>> from hypervisor to stubdom. Do you want to propose another mechanism?
>>>>>> 
>>>>>> There is no way out: if the stubdom needs events, then we'll have to
>>>>>> expose and context switch the vGIC. If it doesn't, then we can skip the
>>>>>> vGIC. However, we would have a similar problem with EL0 apps: I am
>>>>>> assuming that EL0 apps don't need to handle interrupts, but if they do,
>>>>>> then they might need something like a vGIC.
>>>>> Hm. Correct me, but if we want make stubdom to handle some requests
>>>>> (e.g. emulate MMIO access), then it needs events, and thus it needs
>>>>> interrupts. At least, I'm not aware about any other mechanism, that
>>>>> allows hypervisor to signal to a domain.
>>>> 
>>>> The stubdom could do polling and avoid interrupts for example, but that
>>>> would probably not be desirable.
>>>> 
>>>> 
>>>>> On other hand, EL0 app (as I see them) does not need such events.
>>>>> Basically, you just call function `handle_mmio()` right in the app.
>>>>> So, apps can live without interrupts and they still be able to handle
>>>>> request.
>>>> 
>>>> That's true.
>>> 
>>> Well if they're in a separate security zone, that's not going to work.
>>> You have to have a defined interface between things and sanitize inputs
>>> between them.
>> 
>> Why? The purpose of EL0 apps is not to do checks on VM traps in Xen but
>> in a different privilege level instead. Maybe I misunderstood what you
>> are saying? Specifically, what "inputs" do you think should be sanitized
>> in Xen before jumping into the EL0 app?
> 
>>> Furthermore, you probably want something like a stable
>>> interface with some level of backwards compatibility, which is not
>>> something the internal hypervisor interfaces are designed for.
>> 
>> I don't think we should provide that. If the user wants a stable
>> interface, she can use domains. I suggested that the code for the EL0
>> app should come out of the Xen repository directly. Like for the Xen
>> tools, they would be expected to be always in-sync.
> 
> Hmm, it sounds like perhaps I misunderstood you and Volodymyr.  I took “you just call function `handle_mmio()` right in the app” to mean that the *app* calls the *hypervisor* function named “handle_mmio”.  It sounds like what he (or at least you) actually meant was that the *hypervisor* calls the function named “handle_mmio” in the *app*?
> 
> But presumably the app will need to do privileged operations — change the guest’s state, read / write MMIO regions, &c.  We can theoretically have Xen ‘just call functions’ in the app; but we definitely *cannot* have the app ‘just call functions’ inside of Xen — that is, not if you actually want any additional security.
> 
> And that’s completely apart from the whole non-GPL discussion we had.  If you want non-GPL apps, I think you definitely want a nice clean interface, or you’ll have a hard time arguing that the resulting thing is not a derived work (in spite of the separate address spaces).
> 
> The two motivating factors for having apps were additional security and non-GPL implementations of device models / mediators.  Having the app being able to call into Xen undermines both.

And here I mean, “call Xen functions directly”, not “make well-defined hypercalls”.

 -George
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-06-01 10:52             ` George Dunlap
  2017-06-01 10:54               ` George Dunlap
@ 2017-06-01 12:40               ` Dario Faggioli
  2017-06-01 15:02                 ` George Dunlap
  2017-06-01 18:27               ` Stefano Stabellini
  2 siblings, 1 reply; 49+ messages in thread
From: Dario Faggioli @ 2017-06-01 12:40 UTC (permalink / raw)
  To: George Dunlap, Stefano Stabellini
  Cc: Volodymyr Babchuk, Artem_Mygaiev, Julien Grall, xen-devel, Andrii Anisov


[-- Attachment #1.1: Type: text/plain, Size: 1383 bytes --]

On Thu, 2017-06-01 at 12:52 +0200, George Dunlap wrote:
> > On May 31, 2017, at 6:45 PM, Stefano Stabellini <sstabellini@kernel
> > .org> wrote:
> > 
> > I don't think we should provide that. If the user wants a stable
> > interface, she can use domains. I suggested that the code for the
> > EL0
> > app should come out of the Xen repository directly. Like for the
> > Xen
> > tools, they would be expected to be always in-sync.
> 
> Hmm, it sounds like perhaps I misunderstood you and Volodymyr.  I
> took “you just call function `handle_mmio()` right in the app” to
> mean that the *app* calls the *hypervisor* function named
> “handle_mmio”.
>
Right. That's what I had understood too.

> It sounds like what he (or at least you) actually meant was that the
> *hypervisor* calls the function named “handle_mmio” in the *app*?
> 
Mmm... it's clearly me who is being dense, but what exactly do you
mean by "the hypervisor calls the function named handle_mmio() in the
app"? In particular the "in the app" part, and how the hypervisor is
going to be "in" the app...

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-06-01 12:40               ` Dario Faggioli
@ 2017-06-01 15:02                 ` George Dunlap
  0 siblings, 0 replies; 49+ messages in thread
From: George Dunlap @ 2017-06-01 15:02 UTC (permalink / raw)
  To: Dario Faggioli, Stefano Stabellini
  Cc: Volodymyr Babchuk, Artem_Mygaiev, Julien Grall, xen-devel, Andrii Anisov

On 01/06/17 13:40, Dario Faggioli wrote:
> On Thu, 2017-06-01 at 12:52 +0200, George Dunlap wrote:
>>> On May 31, 2017, at 6:45 PM, Stefano Stabellini <sstabellini@kernel
>>> .org> wrote:
>>>
>>> I don't think we should provide that. If the user wants a stable
>>> interface, she can use domains. I suggested that the code for the
>>> EL0
>>> app should come out of the Xen repository directly. Like for the
>>> Xen
>>> tools, they would be expected to be always in-sync.
>>
>> Hmm, it sounds like perhaps I misunderstood you and Volodymyr.  I
>> took “you just call function `handle_mmio()` right in the app” to
>> mean that the *app* calls the *hypervisor* function named
>> “handle_mmio”.
>>
> Right. That what I had understood too.
> 
>> It sounds like what he (or at least you) actually meant was that the
>> *hypervisor* calls the function named “handle_mmio” in the *app*?
>>
> Mmm... it's clearly me that am being dense, but what do you exactly
> mean with "the hypervisor calls the function named handle_mmio() in the
> app"? In particular the "in the app" part, and how is the hypervisor
> going to be "in" the app...

Well it sounds to me similar to what Linux would do with modules: the
module has the symbols encoded somewhere in it.  The hypervisor would
load the "app" binary; and when the appropriate device MMIO happened, it
would call the "handle_mmio()" function (which would be a bit more like
an entry point).

But it seems to me like having an interface where the app actively
registers callbacks for specific events is a lot easier than working out
how to store the dynamic linking information in the module and then
parse it in Xen.
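
To make that concrete, here is a purely hypothetical app-side sketch of
the "register callbacks" model (none of these names exist in Xen today;
they are invented just for illustration):

/* Hypothetical app-side interface; the types and calls below are made up
 * purely to illustrate "the app registers callbacks", as opposed to Xen
 * parsing symbols out of the app binary. */
struct mmio_access {
    unsigned long gpa;      /* guest physical address being accessed */
    unsigned long *data;    /* value written, or buffer for the read result */
    unsigned int size;
    int is_write;
};

typedef int (*mmio_handler_t)(struct mmio_access *access);

/* Provided by the (hypothetical) Xen<->app ABI. */
extern void app_register_callback(unsigned int event, mmio_handler_t fn);
#define APP_EVENT_MMIO 1

static int my_handle_mmio(struct mmio_access *access)
{
    /* emulate the access; for reads, store the result in *access->data */
    return 0;
}

void app_main(void)
{
    /* entry point called once by Xen; register the MMIO emulation callback */
    app_register_callback(APP_EVENT_MMIO, my_handle_mmio);
}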

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-06-01 10:52             ` George Dunlap
  2017-06-01 10:54               ` George Dunlap
  2017-06-01 12:40               ` Dario Faggioli
@ 2017-06-01 18:27               ` Stefano Stabellini
  2 siblings, 0 replies; 49+ messages in thread
From: Stefano Stabellini @ 2017-06-01 18:27 UTC (permalink / raw)
  To: George Dunlap
  Cc: Artem_Mygaiev, Stefano Stabellini, Andrii Anisov,
	Volodymyr Babchuk, Dario Faggioli, Julien Grall, xen-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4563 bytes --]

On Thu, 1 Jun 2017, George Dunlap wrote:
> > On May 31, 2017, at 6:45 PM, Stefano Stabellini <sstabellini@kernel.org> wrote:
> > 
> > On Wed, 31 May 2017, George Dunlap wrote:
> >> On 30/05/17 18:29, Stefano Stabellini wrote:
> >>> On Fri, 26 May 2017, Volodymyr Babchuk wrote:
> >>>>>>> The other issue with stubdoms is context switch times. Volodymyr showed
> >>>>>>> that minios has much higher context switch times compared to EL0 apps.
> >>>>>>> It is probably due to GIC context switch, that is skipped for EL0 apps.
> >>>>>>> Maybe we could skip GIC context switch for stubdoms too, if we knew that
> >>>>>>> they are not going to use the VGIC. At that point, context switch times
> >>>>>>> should be very similar to EL0 apps.
> >>>>>> So you are suggesting to create something like lightweight stubdom. I
> >>>>>> generally like this idea. But AFAIK, vGIC is used to deliver events
> >>>>>> from hypervisor to stubdom. Do you want to propose another mechanism?
> >>>>> 
> >>>>> There is no way out: if the stubdom needs events, then we'll have to
> >>>>> expose and context switch the vGIC. If it doesn't, then we can skip the
> >>>>> vGIC. However, we would have a similar problem with EL0 apps: I am
> >>>>> assuming that EL0 apps don't need to handle interrupts, but if they do,
> >>>>> then they might need something like a vGIC.
> >>>> Hm. Correct me, but if we want make stubdom to handle some requests
> >>>> (e.g. emulate MMIO access), then it needs events, and thus it needs
> >>>> interrupts. At least, I'm not aware about any other mechanism, that
> >>>> allows hypervisor to signal to a domain.
> >>> 
> >>> The stubdom could do polling and avoid interrupts for example, but that
> >>> would probably not be desirable.
> >>> 
> >>> 
> >>>> On other hand, EL0 app (as I see them) does not need such events.
> >>>> Basically, you just call function `handle_mmio()` right in the app.
> >>>> So, apps can live without interrupts and they still be able to handle
> >>>> request.
> >>> 
> >>> That's true.
> >> 
> >> Well if they're in a separate security zone, that's not going to work.
> >> You have to have a defined interface between things and sanitize inputs
> >> between them.
> > 
> > Why? The purpose of EL0 apps is not to do checks on VM traps in Xen but
> > in a different privilege level instead. Maybe I misunderstood what you
> > are saying? Specifically, what "inputs" do you think should be sanitized
> > in Xen before jumping into the EL0 app?
> 
> >> Furthermore, you probably want something like a stable
> >> interface with some level of backwards compatibility, which is not
> >> something the internal hypervisor interfaces are designed for.
> > 
> > I don't think we should provide that. If the user wants a stable
> > interface, she can use domains. I suggested that the code for the EL0
> > app should come out of the Xen repository directly. Like for the Xen
> > tools, they would be expected to be always in-sync.
> 
> Hmm, it sounds like perhaps I misunderstood you and Volodymyr.  I took “you just call function `handle_mmio()` right in the app” to mean that the *app* calls the *hypervisor* function named “handle_mmio”.  It sounds like what he (or at least you) actually meant was that the *hypervisor* calls the function named “handle_mmio” in the *app*?

Indeed, I certainly understood that Xen calls "handle_mmio" in an EL0 app.


> But presumably the app will need to do privileged operations — change the guest’s state, read / write MMIO regions, &c.  We can theoretically have Xen ‘just call functions’ in the app; but we definitely *cannot* have the app ‘just call functions’ inside of Xen — that is, not if you actually want any additional security.

Absolutely.


> And that’s completely apart from the whole non-GPL discussion we had.  If you want non-GPL apps, I think you definitely want a nice clean interface, or you’ll have a hard time arguing that the resulting thing is not a derived work (in spite of the separate address spaces).

That's right, I don't think EL0 apps are a good vehicle for non-GPL
components. Stubdoms are better for that.


> The two motivating factors for having apps were additional security and non-GPL implementations of device models / mediators.

I think the two motivating factors are additional security and extremely
low and deterministic latency.


> Having the app being able to call into Xen undermines both.

Indeed, but there needs to be a very small set of exposed calls, such as
(a rough sketch follows):

- (un)mapping memory of a VM
- injecting interrupts into a VM
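
Something like the following prototypes, purely illustrative (the names
are made up, nothing like this exists yet; only domid_t is an existing
Xen type):

/* Hypothetical minimal ABI exposed to an EL0 app. */
void *app_map_guest_page(domid_t domid, unsigned long gfn);  /* map one guest frame */
void  app_unmap_guest_page(void *va);                        /* tear the mapping down */
int   app_inject_irq(domid_t domid, unsigned int virq);      /* inject a virtual IRQ into the VM */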

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-05-31 17:02       ` George Dunlap
@ 2017-06-17  0:14         ` Volodymyr Babchuk
  2017-06-19  9:37           ` George Dunlap
  0 siblings, 1 reply; 49+ messages in thread
From: Volodymyr Babchuk @ 2017-06-17  0:14 UTC (permalink / raw)
  To: George Dunlap
  Cc: Artem_Mygaiev, Stefano Stabellini, Andrii Anisov, Dario Faggioli,
	Julien Grall, xen-devel

Hello George,

On 31 May 2017 at 20:02, George Dunlap <george.dunlap@citrix.com> wrote:
>>> There is no way out: if the stubdom needs events, then we'll have to
>>> expose and context switch the vGIC. If it doesn't, then we can skip the
>>> vGIC. However, we would have a similar problem with EL0 apps: I am
>>> assuming that EL0 apps don't need to handle interrupts, but if they do,
>>> then they might need something like a vGIC.
>> Hm. Correct me, but if we want make stubdom to handle some requests
>> (e.g. emulate MMIO access), then it needs events, and thus it needs
>> interrupts. At least, I'm not aware about any other mechanism, that
>> allows hypervisor to signal to a domain.
>> On other hand, EL0 app (as I see them) does not need such events.
>> Basically, you just call function `handle_mmio()` right in the app.
>> So, apps can live without interrupts and they still be able to handle
>> request.
>
> So remember that "interrupt" and "event" are basically the same as
> "structured callback".  When anything happens that Xen wants to tell the
> EL0 app about, it has to have a way of telling it.  If the EL0 app is
> handling a device, it has to have some way of getting interrupts from
> that device; if it needs to emulate devices sent to the guest, it needs
> some way to tell Xen to deliver an interrupt to the guest.
Basically yes. There should be a mechanism to request something from the
native application. The question is how this mechanism can be implemented.
The classical approach is an event-driven loop:

while (1) {
    wait_for_event();
    handle_event();
    return_back_results();
}

wait_for_event() can be anything from a WFI instruction to read() on a
socket. This is how stubdoms work. I agree with you: there is no sense
in repeating this in native apps.

> Now, we could make the EL0 app interface "interruptless".  Xen could
> write information about pending events in a shared memory region, and
> the EL0 app could check that before calling some sort of block()
> hypercall, and check it again when it returns from the block() call.

> But the shared event information starts to look an awful lot like events
> and/or pending bits on an interrupt controller -- the only difference
> being that you aren't interrupted if you're already running.

Actually, there is a third way, which I have used. I described it in the
original email (check out [1]).
Basically, the native application is dormant until it is needed by the
hypervisor. When the hypervisor wants some service from the app, it sets
up the parameters, switches the mode to EL0 and jumps to the app entry point.
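To illustrate the flow on the Xen side, a rough sketch (every app_* helper
and struct el0_app below are made up; this is not existing code, only
mmio_info_t is a real Xen type):

/* Hypothetical Xen-side flow for invoking an EL0 app synchronously on a
 * trapped MMIO access. */
static int app_emulate_mmio(struct el0_app *app, mmio_info_t *info)
{
    app_set_arguments(app, info);   /* put the request where the app can see it */
    app_switch_to_el0(app);         /* drop to EL0 at the app entry point; the app
                                       runs handle_mmio() and traps back into Xen */
    return app_get_result(app);     /* pick up the result left by the app */
}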
> I'm pretty sure you could run in this mode using the existing interfaces
> if you didn't want the hassle of dealing with asynchrony.  If that's the
> case, then why bother inventing an entirely new interface, with its own
> bugs and duplication of functionality?  Why not just use what we already
> have?
Because we are concerned about latency. In my benchmark, my native app
PoC is 1.6 times faster than a stubdom.


[1] http://marc.info/?l=xen-devel&m=149151018801649&w=2


-- 
WBR Volodymyr Babchuk aka lorc [+380976646013]
mailto: vlad.babchuk@gmail.com

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-06-01 10:28           ` Julien Grall
@ 2017-06-17  0:17             ` Volodymyr Babchuk
  0 siblings, 0 replies; 49+ messages in thread
From: Volodymyr Babchuk @ 2017-06-17  0:17 UTC (permalink / raw)
  To: Julien Grall
  Cc: Artem_Mygaiev, Stefano Stabellini, Andrii Anisov, Dario Faggioli,
	George Dunlap, xen-devel

Hello Julien,
>> The polling can be minimized if you block the vCPU when there are
>> nothing to do. It would get unblock when you have to schedule him
>> because of a request.
> Thinking a bit more about this. So far, we rely on the domain to use the
> vGIC interrupt controller which require the context switch.
>
> We could also implement a dummy interrupt controller to handle a predefined
> limited amount of interrupts which would allow asynchronous support in
> stubdom and an interface to support upcall via the interrupt exception
> vector.
>
> This is something that would be more tricky to do with EL0 app as there is
> no EL0 vector exception.
>
Actually, your idea about blocking the vcpu is very interesting. Then we
don't need the vGIC at all. For example, when the stubdomain has finished
handling a request, it can issue a hypercall "block me until a new
request arrives". Xen blocks the vcpu at that moment and unblocks it only
when another request is ready. This is a very promising idea. I need to
think about it further.
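
Something like this on the stubdom side (a minimal sketch, assuming a
hypothetical "block until request" hypercall; none of these names exist
today):

/* Hypothetical stubdom main loop. */
for (;;) {
    struct request *req = block_until_request(); /* vcpu is blocked in Xen here */
    handle_request(req);                         /* e.g. emulate the access */
    post_result(req);                            /* make the result visible to Xen */
}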

-- 
WBR Volodymyr Babchuk aka lorc [+380976646013]
mailto: vlad.babchuk@gmail.com

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-06-17  0:14         ` Volodymyr Babchuk
@ 2017-06-19  9:37           ` George Dunlap
  2017-06-19 17:54             ` Stefano Stabellini
  2017-06-19 18:26             ` Volodymyr Babchuk
  0 siblings, 2 replies; 49+ messages in thread
From: George Dunlap @ 2017-06-19  9:37 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Artem_Mygaiev, Stefano Stabellini, Andrii Anisov, Dario Faggioli,
	Julien Grall, xen-devel

On 17/06/17 01:14, Volodymyr Babchuk wrote:
> Hello George,
> 
> On 31 May 2017 at 20:02, George Dunlap <george.dunlap@citrix.com> wrote:
>>>> There is no way out: if the stubdom needs events, then we'll have to
>>>> expose and context switch the vGIC. If it doesn't, then we can skip the
>>>> vGIC. However, we would have a similar problem with EL0 apps: I am
>>>> assuming that EL0 apps don't need to handle interrupts, but if they do,
>>>> then they might need something like a vGIC.
>>> Hm. Correct me, but if we want make stubdom to handle some requests
>>> (e.g. emulate MMIO access), then it needs events, and thus it needs
>>> interrupts. At least, I'm not aware about any other mechanism, that
>>> allows hypervisor to signal to a domain.
>>> On other hand, EL0 app (as I see them) does not need such events.
>>> Basically, you just call function `handle_mmio()` right in the app.
>>> So, apps can live without interrupts and they still be able to handle
>>> request.
>>
>> So remember that "interrupt" and "event" are basically the same as
>> "structured callback".  When anything happens that Xen wants to tell the
>> EL0 app about, it has to have a way of telling it.  If the EL0 app is
>> handling a device, it has to have some way of getting interrupts from
>> that device; if it needs to emulate devices sent to the guest, it needs
>> some way to tell Xen to deliver an interrupt to the guest.
> Basically yes. There should be mechanism to request something from
> native application. Question is how this mechanism can be implemented.
> Classical approach is a even-driven loop:
> 
> while(1) {
>     wait_for_event();
>     handle_event_event();
>     return_back_results();
> }
> 
> wait_for_event() can by anything from WFI instruction to read() on
> socket. This is how stubdoms are working. I agree with you: there are
> no sense to repeat this in native apps.
> 
>> Now, we could make the EL0 app interface "interruptless".  Xen could
>> write information about pending events in a shared memory region, and
>> the EL0 app could check that before calling some sort of block()
>> hypercall, and check it again when it returns from the block() call.
> 
>> But the shared event information starts to look an awful lot like events
>> and/or pending bits on an interrupt controller -- the only difference
>> being that you aren't interrupted if you're already running.
> 
> Actually there are third way, which I have used. I described it in
> original email (check out [1]).
> Basically, native application is dead until it is needed by
> hypervisor. When hypervisor wants some services from app, it setups
> parameters, switches mode to EL0 and jumps at app entry point.

What's the difference between "jumps to an app entry point" and "jumps
to an interrupt handling routine"?  And what's the difference between
"Tells Xen about the location of the app entry point" and "tells Xen
about the location of the interrupt handling routine"?

If you want this "EL0 app" thing to be able to provide extra security
over just running the code inside of Xen, then the code must not be able
to DoS the host by spinning forever instead of returning.

What happens if two different pcpus in Xen decide they want to activate
some "app" functionality?

>> I'm pretty sure you could run in this mode using the existing interfaces
>> if you didn't want the hassle of dealing with asynchrony.  If that's the
>> case, then why bother inventing an entirely new interface, with its own
>> bugs and duplication of functionality?  Why not just use what we already
>> have?
> Because we are concerned about latency. In my benchmark, my native app
> PoC is 1.6 times faster than stubdom.

But given the conversation so far, it seems likely that that is mainly
due to the fact that context switching on ARM has not been optimized.

Just to be clear -- I'm not adamantly opposed to a new interface similar
to what you're describing above.  But I would be opposed to introducing
a new interface that doesn't achieve the stated goals (more secure, &c),
or a new interface that is the same as the old one but rewritten a bit.

The point of having this design discussion up front is to prevent a
situation where you spend months coding up something which is ultimately
rejected.  There are a lot of things that are hard to predict until
there's actually code to review, but at the moment the "jumps to an
interrupt handling routine" approach looks unpromising.

 -George


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-06-19  9:37           ` George Dunlap
@ 2017-06-19 17:54             ` Stefano Stabellini
  2017-06-19 18:36               ` Volodymyr Babchuk
  2017-06-19 18:26             ` Volodymyr Babchuk
  1 sibling, 1 reply; 49+ messages in thread
From: Stefano Stabellini @ 2017-06-19 17:54 UTC (permalink / raw)
  To: George Dunlap
  Cc: Artem_Mygaiev, Stefano Stabellini, Andrii Anisov,
	Volodymyr Babchuk, Dario Faggioli, Julien Grall, xen-devel

On Mon, 19 Jun 2017, George Dunlap wrote:
> On 17/06/17 01:14, Volodymyr Babchuk wrote:
> > Hello George,
> > 
> > On 31 May 2017 at 20:02, George Dunlap <george.dunlap@citrix.com> wrote:
> >>>> There is no way out: if the stubdom needs events, then we'll have to
> >>>> expose and context switch the vGIC. If it doesn't, then we can skip the
> >>>> vGIC. However, we would have a similar problem with EL0 apps: I am
> >>>> assuming that EL0 apps don't need to handle interrupts, but if they do,
> >>>> then they might need something like a vGIC.
> >>> Hm. Correct me, but if we want make stubdom to handle some requests
> >>> (e.g. emulate MMIO access), then it needs events, and thus it needs
> >>> interrupts. At least, I'm not aware about any other mechanism, that
> >>> allows hypervisor to signal to a domain.
> >>> On other hand, EL0 app (as I see them) does not need such events.
> >>> Basically, you just call function `handle_mmio()` right in the app.
> >>> So, apps can live without interrupts and they still be able to handle
> >>> request.
> >>
> >> So remember that "interrupt" and "event" are basically the same as
> >> "structured callback".  When anything happens that Xen wants to tell the
> >> EL0 app about, it has to have a way of telling it.  If the EL0 app is
> >> handling a device, it has to have some way of getting interrupts from
> >> that device; if it needs to emulate devices sent to the guest, it needs
> >> some way to tell Xen to deliver an interrupt to the guest.
> > Basically yes. There should be mechanism to request something from
> > native application. Question is how this mechanism can be implemented.
> > Classical approach is a even-driven loop:
> > 
> > while(1) {
> >     wait_for_event();
> >     handle_event_event();
> >     return_back_results();
> > }
> > 
> > wait_for_event() can by anything from WFI instruction to read() on
> > socket. This is how stubdoms are working. I agree with you: there are
> > no sense to repeat this in native apps.
> > 
> >> Now, we could make the EL0 app interface "interruptless".  Xen could
> >> write information about pending events in a shared memory region, and
> >> the EL0 app could check that before calling some sort of block()
> >> hypercall, and check it again when it returns from the block() call.
> > 
> >> But the shared event information starts to look an awful lot like events
> >> and/or pending bits on an interrupt controller -- the only difference
> >> being that you aren't interrupted if you're already running.
> > 
> > Actually there are third way, which I have used. I described it in
> > original email (check out [1]).
> > Basically, native application is dead until it is needed by
> > hypervisor. When hypervisor wants some services from app, it setups
> > parameters, switches mode to EL0 and jumps at app entry point.
> 
> What's the difference between "jumps to an app entry point" and "jumps
> to an interrupt handling routine"?  And what's the difference between
> "Tells Xen about the location of the app entry point" and "tells Xen
> about the location of the interrupt handling routine"?
> 
> If you want this "EL0 app" thing to be able to provide extra security
> over just running the code inside of Xen, then the code must not be able
> to DoS the host by spinning forever instead of returning.

I think that the "extra security" was mostly Julien's and my goal.
Volodymyr would be OK with having the code in Xen, if I recall correctly
from past conversations.

In any case, wouldn't the usual Xen timer interrupt prevent this scenario
from happening?


> What happens if two different pcpus in Xen decide they want to activate
> some "app" functionality?

It should work fine as long as the app code is written to cope with it
(spinlocks, etc.).


> >> I'm pretty sure you could run in this mode using the existing interfaces
> >> if you didn't want the hassle of dealing with asynchrony.  If that's the
> >> case, then why bother inventing an entirely new interface, with its own
> >> bugs and duplication of functionality?  Why not just use what we already
> >> have?
> > Because we are concerned about latency. In my benchmark, my native app
> > PoC is 1.6 times faster than stubdom.
> 
> But given the conversation so far, it seems likely that that is mainly
> due to the fact that context switching on ARM has not been optimized.

True. However, Volodymyr took the time to demonstrate the performance of
EL0 apps vs. stubdoms with a PoC, which is much more than most Xen
contributors do. Nobody has provided numbers for a faster ARM context switch
yet. I don't know on whom the burden of proving that a lighter context
switch can match the EL0 app numbers should fall. I am not sure it
would be fair to ask Volodymyr to do it.


> Just to be clear -- I'm not adamantly opposed to a new interface similar
> to what you're describing above.  But I would be opposed to introducing
> a new interface that doesn't achieve the stated goals (more secure, &c),
> or a new interface that is the same as the old one but rewritten a bit.
> 
> The point of having this design discussion up front is to prevent a
> situation where you spend months coding up something which is ultimately
> rejected.  There are a lot of things that are hard to predict until
> there's actually code to review, but at the moment the "jumps to an
> interrupt handling routine" approach looks unpromising.

Did you mean "jumps to an app entry point" or "jumps to an interrupt
handling routine"?

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-06-19  9:37           ` George Dunlap
  2017-06-19 17:54             ` Stefano Stabellini
@ 2017-06-19 18:26             ` Volodymyr Babchuk
  2017-06-20 10:00               ` Dario Faggioli
  1 sibling, 1 reply; 49+ messages in thread
From: Volodymyr Babchuk @ 2017-06-19 18:26 UTC (permalink / raw)
  To: George Dunlap
  Cc: Artem_Mygaiev, Stefano Stabellini, Andrii Anisov, Dario Faggioli,
	Julien Grall, xen-devel

Hi George,

On 19 June 2017 at 02:37, George Dunlap <george.dunlap@citrix.com> wrote:
>>>>> There is no way out: if the stubdom needs events, then we'll have to
>>>>> expose and context switch the vGIC. If it doesn't, then we can skip the
>>>>> vGIC. However, we would have a similar problem with EL0 apps: I am
>>>>> assuming that EL0 apps don't need to handle interrupts, but if they do,
>>>>> then they might need something like a vGIC.
>>>> Hm. Correct me, but if we want make stubdom to handle some requests
>>>> (e.g. emulate MMIO access), then it needs events, and thus it needs
>>>> interrupts. At least, I'm not aware about any other mechanism, that
>>>> allows hypervisor to signal to a domain.
>>>> On other hand, EL0 app (as I see them) does not need such events.
>>>> Basically, you just call function `handle_mmio()` right in the app.
>>>> So, apps can live without interrupts and they still be able to handle
>>>> request.
>>>
>>> So remember that "interrupt" and "event" are basically the same as
>>> "structured callback".  When anything happens that Xen wants to tell the
>>> EL0 app about, it has to have a way of telling it.  If the EL0 app is
>>> handling a device, it has to have some way of getting interrupts from
>>> that device; if it needs to emulate devices sent to the guest, it needs
>>> some way to tell Xen to deliver an interrupt to the guest.
>> Basically yes. There should be mechanism to request something from
>> native application. Question is how this mechanism can be implemented.
>> Classical approach is a even-driven loop:
>>
>> while(1) {
>>     wait_for_event();
>>     handle_event_event();
>>     return_back_results();
>> }
>>
>> wait_for_event() can by anything from WFI instruction to read() on
>> socket. This is how stubdoms are working. I agree with you: there are
>> no sense to repeat this in native apps.
>>
>>> Now, we could make the EL0 app interface "interruptless".  Xen could
>>> write information about pending events in a shared memory region, and
>>> the EL0 app could check that before calling some sort of block()
>>> hypercall, and check it again when it returns from the block() call.
>>
>>> But the shared event information starts to look an awful lot like events
>>> and/or pending bits on an interrupt controller -- the only difference
>>> being that you aren't interrupted if you're already running.
>>
>> Actually there are third way, which I have used. I described it in
>> original email (check out [1]).
>> Basically, native application is dead until it is needed by
>> hypervisor. When hypervisor wants some services from app, it setups
>> parameters, switches mode to EL0 and jumps at app entry point.
>
> What's the difference between "jumps to an app entry point" and "jumps
> to an interrupt handling routine"?
"Jumps to an app entry point" and "Unblocks vcpu that waits for an
interrupt". That would be more precise. There are two differences: the
first approach is synchronous, with no need to wait for the scheduler to
schedule the vcpu. Also, the vGIC code can be omitted, which decreases the
switch latency.


>  And what's the difference between
> "Tells Xen about the location of the app entry point" and "tells Xen
> about the location of the interrupt handling routine"?
There is no difference at all.

> If you want this "EL0 app" thing to be able to provide extra security
> over just running the code inside of Xen, then the code must not be able
> to DoS the host by spinning forever instead of returning.
Right. This is a problem. Fortunately, the app is running with interrupts
enabled, so the next timer tick will switch back to Xen. There you can
terminate an app which has been running for too long.

> What happens if two different pcpus in Xen decide they want to activate
> some "app" functionality?
There are two possibilities: we can make the app single-threaded; then the
second pcpu can be assigned another vcpu while the app is busy. But I
don't like this approach.
I think that all apps should be multi-threaded. They can use simple
spinlocks to control access to shared resources.
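
For example (just a sketch; the emulation state, types and helpers are
whatever the app itself provides, nothing here is existing code):

/* Two pcpus entering the app concurrently serialise on a spinlock before
 * touching shared emulation state. */
static spinlock_t state_lock;
static struct emul_state shared_state;

int handle_mmio(struct mmio_access *access)
{
    int ret;

    spin_lock(&state_lock);
    ret = emulate(&shared_state, access);
    spin_unlock(&state_lock);

    return ret;
}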

>>> I'm pretty sure you could run in this mode using the existing interfaces
>>> if you didn't want the hassle of dealing with asynchrony.  If that's the
>>> case, then why bother inventing an entirely new interface, with its own
>>> bugs and duplication of functionality?  Why not just use what we already
>>> have?
>> Because we are concerned about latency. In my benchmark, my native app
>> PoC is 1.6 times faster than stubdom.
>
> But given the conversation so far, it seems likely that that is mainly
> due to the fact that context switching on ARM has not been optimized.
Yes. The question is: can context switching on ARM be optimized further? I don't know.

> Just to be clear -- I'm not adamantly opposed to a new interface similar
> to what you're describing above.  But I would be opposed to introducing
> a new interface that doesn't achieve the stated goals (more secure, &c),
> or a new interface that is the same as the old one but rewritten a bit.
>
> The point of having this design discussion up front is to prevent a
> situation where you spend months coding up something which is ultimately
> rejected.  There are a lot of things that are hard to predict until
> there's actually code to review, but at the moment the "jumps to an
> interrupt handling routine" approach looks unpromising.
Yes, I agree with you. This is why I started those mail threads in
the first place. Actually, after all those discussions I lean more towards
some sort of lightweight domain-bound stubdoms (without vGICs, for
example). But I want to discuss all possibilities, including native
apps.
Actually, what we really need right now is hard numbers. I did one
benchmark, but that was an ideal use case. I'm going to do more
experiments: with 1 or 1.5 active vcpus per pcpu, with the p2m context
switch stripped off, etc.


-- 
WBR Volodymyr Babchuk aka lorc [+380976646013]
mailto: vlad.babchuk@gmail.com

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-06-19 17:54             ` Stefano Stabellini
@ 2017-06-19 18:36               ` Volodymyr Babchuk
  2017-06-20 10:11                 ` Dario Faggioli
  2017-06-20 10:45                 ` Julien Grall
  0 siblings, 2 replies; 49+ messages in thread
From: Volodymyr Babchuk @ 2017-06-19 18:36 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Artem_Mygaiev, xen-devel, Andrii Anisov, Dario Faggioli,
	George Dunlap, Julien Grall

Hi Stefano,

On 19 June 2017 at 10:54, Stefano Stabellini <sstabellini@kernel.org> wrote:

>> But given the conversation so far, it seems likely that that is mainly
>> due to the fact that context switching on ARM has not been optimized.
>
> True. However, Volodymyr took the time to demonstrate the performance of
> EL0 apps vs. stubdoms with a PoC, which is much more than most Xen
> contributors do. Nodoby provided numbers for a faster ARM context switch
> yet. I don't know on whom should fall the burden of proving that a
> lighter context switch can match the EL0 app numbers. I am not sure it
> would be fair to ask Volodymyr to do it.
Thanks. Actually, we discussed this topic internally today. The main
concern today is not SMCs and OP-TEE (I will be happy to do this
right in Xen), but vcoprocs and GPU virtualization. Because of legal
issues, we can't put this in Xen. And because of the nature of the vcpu
framework, we will need multiple calls to the vgpu driver per vcpu context
switch.
I'm going to create a worst-case scenario, where multiple vcpus are
active and there are no free pcpus, to see how the credit or credit2
scheduler will call my stubdom.
Also, I'm very interested in Julien's idea about a stubdom without a GIC.
Probably I'll try to hack something like that to see how it will
affect the overall switching latency.

-- 
WBR Volodymyr Babchuk aka lorc [+380976646013]
mailto: vlad.babchuk@gmail.com

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-06-19 18:26             ` Volodymyr Babchuk
@ 2017-06-20 10:00               ` Dario Faggioli
  2017-06-20 10:30                 ` George Dunlap
  0 siblings, 1 reply; 49+ messages in thread
From: Dario Faggioli @ 2017-06-20 10:00 UTC (permalink / raw)
  To: Volodymyr Babchuk, George Dunlap
  Cc: Artem_Mygaiev, Julien Grall, Stefano Stabellini, Andrii Anisov,
	xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 1268 bytes --]

On Mon, 2017-06-19 at 11:26 -0700, Volodymyr Babchuk wrote:
> On 19 June 2017 at 02:37, George Dunlap <george.dunlap@citrix.com>
> wrote:
> > If you want this "EL0 app" thing to be able to provide extra
> > security
> > over just running the code inside of Xen, then the code must not be
> > able
> > to DoS the host by spinning forever instead of returning.
> 
> Right. This is a problem. Fortunately, it is running with interrupts
> enabled, so next timer tick will switch back to XEN. There you can
> terminate app which is running too long.
> 
What timer tick? Xen does not have one. A scheduler may set one up, if
it's necessary for its own purposes, but that's entirely optional. For
example, Credit does have one; Credit2, RTDS and null do not.

Basically, (one of the) main purposes of this new "EL0 app mechanism"
is playing behind the scheduler's back. Well, fine, but then you're not
allowed to assume that the scheduler will rescue you if something goes
wrong.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-06-19 18:36               ` Volodymyr Babchuk
@ 2017-06-20 10:11                 ` Dario Faggioli
  2017-07-07 15:02                   ` Volodymyr Babchuk
  2017-06-20 10:45                 ` Julien Grall
  1 sibling, 1 reply; 49+ messages in thread
From: Dario Faggioli @ 2017-06-20 10:11 UTC (permalink / raw)
  To: Volodymyr Babchuk, Stefano Stabellini
  Cc: Artem_Mygaiev, Julien Grall, xen-devel, Andrii Anisov, George Dunlap


[-- Attachment #1.1: Type: text/plain, Size: 1914 bytes --]

On Mon, 2017-06-19 at 11:36 -0700, Volodymyr Babchuk wrote:
> On 19 June 2017 at 10:54, Stefano Stabellini <sstabellini@kernel.org>
> wrote:
> > True. However, Volodymyr took the time to demonstrate the
> > performance of
> > EL0 apps vs. stubdoms with a PoC, which is much more than most Xen
> > contributors do. Nodoby provided numbers for a faster ARM context
> > switch
> > yet. I don't know on whom should fall the burden of proving that a
> > lighter context switch can match the EL0 app numbers. I am not sure
> > it
> > would be fair to ask Volodymyr to do it.
> 
> Thanks. Actually, we discussed this topic internally today. Main
> concern today is not a SMCs and OP-TEE (I will be happy to do this
> right in XEN), but vcopros and GPU virtualization. Because of legal
> issues, we can't put this in XEN. And because of vcpu framework
> nature
> we will need multiple calls to vgpu driver per one vcpu context
> switch.
> I'm going to create worst case scenario, where multiple vcpu are
> active and there are no free pcpu, to see how credit or credit2
> scheduler will call my stubdom.
>
Well, that would be interesting and useful, thanks for offering to do
that.

Let's just keep in mind, though, that if the numbers turn out to
be bad (and we manage to trace that back to being due to scheduling),
then:
1) we can create a mechanism that bypasses the scheduler,
2) we can change the way stubdoms are scheduled.

Option 2) is something generic, would (most likely) benefit other use
cases too, and we've said many times we'd be up for it... so let's
please just not rule it out... :-)

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-06-20 10:00               ` Dario Faggioli
@ 2017-06-20 10:30                 ` George Dunlap
  0 siblings, 0 replies; 49+ messages in thread
From: George Dunlap @ 2017-06-20 10:30 UTC (permalink / raw)
  To: Dario Faggioli, Volodymyr Babchuk
  Cc: Artem_Mygaiev, Julien Grall, Stefano Stabellini, Andrii Anisov,
	xen-devel

On 20/06/17 11:00, Dario Faggioli wrote:
> On Mon, 2017-06-19 at 11:26 -0700, Volodymyr Babchuk wrote:
>> On 19 June 2017 at 02:37, George Dunlap <george.dunlap@citrix.com>
>> wrote:
>>> If you want this "EL0 app" thing to be able to provide extra
>>> security
>>> over just running the code inside of Xen, then the code must not be
>>> able
>>> to DoS the host by spinning forever instead of returning.
>>
>> Right. This is a problem. Fortunately, it is running with interrupts
>> enabled, so next timer tick will switch back to XEN. There you can
>> terminate app which is running too long.
>>
> What timer tick? Xen does not have one. A scheduler may setup one, if
> it's necessary for its own purposes, but that's entirely optional. For
> example, Credit does have one; Credit2, RTDS and null do not.
> 
> Basically, (one of the) main purposes of this new "EL0 app mechanism"
> is playing behind the scheduler back. Well, fine, but then you're not
> allowed to assume that the scheduler will rescue you if something goes
> wrong.

Well, another possibility would be to add "timeouts" to "calls" into the
EL0 app: i.e., part of the calling mechanism itself would be to set a
timer to come back into Xen and fail the call.

But what to do if you fail?  You could just stop executing the "app",
but there's no telling what state its memory will be in, nor any device
it's using.  It's probably not safe to continue using it.  Do you crash it?
Restart it?
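
Roughly, on the Xen side, that calling mechanism could look like this
(a sketch only; struct el0_app and the app_* helpers are invented, while
init_timer/set_timer/stop_timer/NOW() are Xen's existing timer interface):

/* Arm a one-shot watchdog timer before entering the app, so Xen regains
 * control and can fail the call if the app never returns. */
static void app_watchdog_fn(void *data)
{
    struct el0_app *app = data;

    app_abort(app);   /* fail the call; then crash or restart the app? */
}

static int app_call_with_timeout(struct el0_app *app, s_time_t budget)
{
    struct timer watchdog;
    int ret;

    init_timer(&watchdog, app_watchdog_fn, app, smp_processor_id());
    set_timer(&watchdog, NOW() + budget);

    ret = app_switch_to_el0(app);   /* returns when the app finishes or is aborted */

    stop_timer(&watchdog);
    return ret;
}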

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-06-19 18:36               ` Volodymyr Babchuk
  2017-06-20 10:11                 ` Dario Faggioli
@ 2017-06-20 10:45                 ` Julien Grall
  2017-06-20 16:23                   ` Volodymyr Babchuk
  1 sibling, 1 reply; 49+ messages in thread
From: Julien Grall @ 2017-06-20 10:45 UTC (permalink / raw)
  To: Volodymyr Babchuk, Stefano Stabellini
  Cc: Artem_Mygaiev, Dario Faggioli, xen-devel, Andrii Anisov, George Dunlap



On 06/19/2017 07:36 PM, Volodymyr Babchuk wrote:
> Hi Stefano,

Hi,

> On 19 June 2017 at 10:54, Stefano Stabellini <sstabellini@kernel.org> wrote:
> 
>>> But given the conversation so far, it seems likely that that is mainly
>>> due to the fact that context switching on ARM has not been optimized.
>>
>> True. However, Volodymyr took the time to demonstrate the performance of
>> EL0 apps vs. stubdoms with a PoC, which is much more than most Xen
>> contributors do. Nodoby provided numbers for a faster ARM context switch
>> yet. I don't know on whom should fall the burden of proving that a
>> lighter context switch can match the EL0 app numbers. I am not sure it
>> would be fair to ask Volodymyr to do it.
> Thanks. Actually, we discussed this topic internally today. Main
> concern today is not a SMCs and OP-TEE (I will be happy to do this
> right in XEN), but vcopros and GPU virtualization. Because of legal
> issues, we can't put this in XEN. And because of vcpu framework nature
> we will need multiple calls to vgpu driver per one vcpu context
> switch.
> I'm going to create worst case scenario, where multiple vcpu are
> active and there are no free pcpu, to see how credit or credit2
> scheduler will call my stubdom.
> Also, I'm very interested in Julien's idea about stubdom without GIC.
> Probably, I'll try to hack something like that to see how it will
> affect overall switching latency
This can only work if your stubdomain does not require interrupts.
However, if you are dealing with devices you likely need interrupts, am
I correct?

The problem would be the same with an EL0 app.

Cheers.

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-06-20 10:45                 ` Julien Grall
@ 2017-06-20 16:23                   ` Volodymyr Babchuk
  2017-06-21 10:38                     ` Julien Grall
  0 siblings, 1 reply; 49+ messages in thread
From: Volodymyr Babchuk @ 2017-06-20 16:23 UTC (permalink / raw)
  To: Julien Grall
  Cc: Artem_Mygaiev, xen-devel, Andrii Anisov, Dario Faggioli,
	George Dunlap, Stefano Stabellini

Hi Julien,

On 20 June 2017 at 03:45, Julien Grall <julien.grall@arm.com> wrote:
>> On 19 June 2017 at 10:54, Stefano Stabellini <sstabellini@kernel.org>
>> wrote:
>>
>>>> But given the conversation so far, it seems likely that that is mainly
>>>> due to the fact that context switching on ARM has not been optimized.
>>>
>>>
>>> True. However, Volodymyr took the time to demonstrate the performance of
>>> EL0 apps vs. stubdoms with a PoC, which is much more than most Xen
>>> contributors do. Nodoby provided numbers for a faster ARM context switch
>>> yet. I don't know on whom should fall the burden of proving that a
>>> lighter context switch can match the EL0 app numbers. I am not sure it
>>> would be fair to ask Volodymyr to do it.
>>
>> Thanks. Actually, we discussed this topic internally today. Main
>> concern today is not a SMCs and OP-TEE (I will be happy to do this
>> right in XEN), but vcopros and GPU virtualization. Because of legal
>> issues, we can't put this in XEN. And because of vcpu framework nature
>> we will need multiple calls to vgpu driver per one vcpu context
>> switch.
>> I'm going to create worst case scenario, where multiple vcpu are
>> active and there are no free pcpu, to see how credit or credit2
>> scheduler will call my stubdom.
>> Also, I'm very interested in Julien's idea about stubdom without GIC.
>> Probably, I'll try to hack something like that to see how it will
>> affect overall switching latency
>
> This can only work if your stubdomain does not require interrupt. However,
> if you are dealing with devices you likely need interrupts, am I correct?
Ah yes, you are correct. I was thinking about the OP-TEE use case, where
there are no interrupts. In the case of co-processor virtualization we
will probably need interrupts.

> The problem would be the same with an EL0 app.
In the case of EL0 there will be no problem, because EL0 can't handle
interrupts :) Xen should receive the interrupt and invoke the app. Yes,
this is another problem with apps, if we want to use them as device
drivers.

-- 
WBR Volodymyr Babchuk aka lorc [+380976646013]
mailto: vlad.babchuk@gmail.com

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-06-20 16:23                   ` Volodymyr Babchuk
@ 2017-06-21 10:38                     ` Julien Grall
  0 siblings, 0 replies; 49+ messages in thread
From: Julien Grall @ 2017-06-21 10:38 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Artem_Mygaiev, xen-devel, Andrii Anisov, Dario Faggioli,
	George Dunlap, Stefano Stabellini



On 20/06/17 17:23, Volodymyr Babchuk wrote:
> Hi Julien,

Hi Volodymyr,

>
> On 20 June 2017 at 03:45, Julien Grall <julien.grall@arm.com> wrote:
>>> On 19 June 2017 at 10:54, Stefano Stabellini <sstabellini@kernel.org>
>>> wrote:
>>>
>>>>> But given the conversation so far, it seems likely that that is mainly
>>>>> due to the fact that context switching on ARM has not been optimized.
>>>>
>>>>
>>>> True. However, Volodymyr took the time to demonstrate the performance of
>>>> EL0 apps vs. stubdoms with a PoC, which is much more than most Xen
>>>> contributors do. Nodoby provided numbers for a faster ARM context switch
>>>> yet. I don't know on whom should fall the burden of proving that a
>>>> lighter context switch can match the EL0 app numbers. I am not sure it
>>>> would be fair to ask Volodymyr to do it.
>>>
>>> Thanks. Actually, we discussed this topic internally today. Main
>>> concern today is not a SMCs and OP-TEE (I will be happy to do this
>>> right in XEN), but vcopros and GPU virtualization. Because of legal
>>> issues, we can't put this in XEN. And because of vcpu framework nature
>>> we will need multiple calls to vgpu driver per one vcpu context
>>> switch.
>>> I'm going to create worst case scenario, where multiple vcpu are
>>> active and there are no free pcpu, to see how credit or credit2
>>> scheduler will call my stubdom.
>>> Also, I'm very interested in Julien's idea about stubdom without GIC.
>>> Probably, I'll try to hack something like that to see how it will
>>> affect overall switching latency
>>
>> This can only work if your stubdomain does not require interrupt. However,
>> if you are dealing with devices you likely need interrupts, am I correct?
> Ah yes, you are correct. I thought about OP-TEE use case, when there
> are no interrupts. In case of co-processor virtualization we probably
> will need interrupts.
>
>> The problem would be the same with an EL0 app.
> In case of EL0 there will be no problem, because EL0 can't handle
> interrupts :) XEN should receive interrupt and invoke app. Yes, this
> is another problem with apps, if we want to use them as devices
> drivers.

Well, this is a bit more complex than that. When an interrupt arrives,
Xen may be running a vCPU that will not use that app, so you have to
ensure the time spent handling it does not get accounted to that vCPU.

The more I read the discussion, the more I think we should look at
optimizing the stubdom case. A Xen EL0 app should only be used for tiny
emulation for a given domain. Otherwise you end up re-inventing the domain.

Cheers,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-06-20 10:11                 ` Dario Faggioli
@ 2017-07-07 15:02                   ` Volodymyr Babchuk
  2017-07-07 16:41                     ` Dario Faggioli
  0 siblings, 1 reply; 49+ messages in thread
From: Volodymyr Babchuk @ 2017-07-07 15:02 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Artem_Mygaiev, xen-devel, Andrii Anisov, George Dunlap,
	Julien Grall, Stefano Stabellini

Hello Dario,

On 20 June 2017 at 13:11, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> On Mon, 2017-06-19 at 11:36 -0700, Volodymyr Babchuk wrote:
>> On 19 June 2017 at 10:54, Stefano Stabellini <sstabellini@kernel.org>
>> wrote:
>> > True. However, Volodymyr took the time to demonstrate the
>> > performance of
>> > EL0 apps vs. stubdoms with a PoC, which is much more than most Xen
>> > contributors do. Nodoby provided numbers for a faster ARM context
>> > switch
>> > yet. I don't know on whom should fall the burden of proving that a
>> > lighter context switch can match the EL0 app numbers. I am not sure
>> > it
>> > would be fair to ask Volodymyr to do it.
>>
>> Thanks. Actually, we discussed this topic internally today. Main
>> concern today is not a SMCs and OP-TEE (I will be happy to do this
>> right in XEN), but vcopros and GPU virtualization. Because of legal
>> issues, we can't put this in XEN. And because of vcpu framework
>> nature
>> we will need multiple calls to vgpu driver per one vcpu context
>> switch.
>> I'm going to create worst case scenario, where multiple vcpu are
>> active and there are no free pcpu, to see how credit or credit2
>> scheduler will call my stubdom.
>>
> Well, that would be interesting and useful, thanks for offering doing
> that.
Yeah, so I did that. And I got some puzzling results. I don't know why,
but when I have 4 (or fewer) active vcpus on 4 pcpus, my test takes
about 1 second to execute.
But if there are 5 (or more) active vcpus on 4 pcpus, it executes from
80 to 110 seconds.

The details are below, but first let me remind you of my setup.
I'm testing on an ARM64 machine with 4 Cortex-A57 cores. I wrote a
special test driver for Linux that calls the SMC instruction 100 000 times.
Also, I hacked MiniOS to act as a monitor for DomU. This means that
XEN traps the SMC invocation and asks MiniOS to handle it.
So, every SMC is handled in this way:

DomU->XEN->MiniOS->XEN->DomU.
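
A minimal sketch of what such a test driver can look like, assuming the
kernel's generic arm_smccc_smc() helper and the classic
single_open()/proc_create() pattern (the SMC function ID below is just a
placeholder, not the one the real driver uses):

#include <linux/module.h>
#include <linux/fs.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/arm-smccc.h>

#define SMC_BENCH_FID   0x82000001      /* placeholder function ID */
#define SMC_BENCH_CALLS 100000

static int smc_bench_show(struct seq_file *m, void *v)
{
        struct arm_smccc_res res;
        unsigned long i;

        seq_printf(m, "Will call SMC %d time(s)\n", SMC_BENCH_CALLS);
        for (i = 0; i < SMC_BENCH_CALLS; i++)
                /* Each call traps into XEN, which forwards it to the monitor. */
                arm_smccc_smc(SMC_BENCH_FID, 0, 0, 0, 0, 0, 0, 0, &res);
        seq_puts(m, "Done!\n");
        return 0;
}

static int smc_bench_open(struct inode *inode, struct file *file)
{
        return single_open(file, smc_bench_show, NULL);
}

static const struct file_operations smc_bench_fops = {
        .owner   = THIS_MODULE,
        .open    = smc_bench_open,
        .read    = seq_read,
        .llseek  = seq_lseek,
        .release = single_release,
};

static int __init smc_bench_init(void)
{
        proc_create("smc_bench", 0444, NULL, &smc_bench_fops);
        return 0;
}

static void __exit smc_bench_exit(void)
{
        remove_proc_entry("smc_bench", NULL);
}

module_init(smc_bench_init);
module_exit(smc_bench_exit);
MODULE_LICENSE("GPL");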

Now, let's get back to results.

** Case 1:
- Dom0 has 4 vcpus and is idle
- DomU has 4 vcpus and is idle
- MiniOS has 1 vcpu and is not idle, because its scheduler does
not call WFI.
I run test in DomU:

root@salvator-x-h3-xt:~# time -p cat /proc/smc_bench
Will call SMC 100000 time(s)
Done!
real 1.10
user 0.00
sys 1.10


** Case 2:
- Dom0 has 4 vcpus. All of them are executing an endless loop with a sh one-liner:
# while : ; do : ; done &
- DomU has 4 vcpus and is idle
- MiniOS has 1 vcpu and is not idle, because its scheduler does not call WFI.
- In total there are 6 vcpus active

I run test in DomU:
real 113.08
user 0.00
sys 113.04

** Case 3:
- Dom0 has 4 vcpus. Three of them are executing an endless loop with a sh one-liner:
# while : ; do : ; done &
- DomU has 4 vcpus and is idle
- MiniOS has 1 vcpu and is not idle, because its scheduler does not call WFI.
- In total there are 5 vcpus active

I run test in DomU:
real 88.55
user 0.00
sys 88.54

** Case 4:
- Dom0 has 4 vcpus. Two of them are executing an endless loop with a sh one-liner:
# while : ; do : ; done &
- DomU has 4 vcpus and is idle
- MiniOS has 1 vcpu and is not idle, because its scheduler does not call WFI.
- In total there are 4 vcpus active

I run test in DomU:
real 1.11
user 0.00
sys 1.11

** Case 5:
- Dom0 has 4 vcpus and is idle.
- DomU has 4 vcpus. Three of them are executing an endless loop with a sh one-liner:
# while : ; do : ; done &
- MiniOS has 1 vcpu and is not idle, because its scheduler does not call WFI.
- In total there are 5 vcpus active
I run test in DomU:

real 100.96
user 0.00
sys 100.94

** Case 6:
- Dom0 has 4 vcpus and is idle.
- DomU has 4 vcpus. Two of them are executing an endless loop with a sh one-liner:
# while : ; do : ; done &
- MiniOS has 1 vcpu and is not idle, because its scheduler does not call WFI.
- In total there are 4 vcpus active

I run test in DomU:
real 1.11
user 0.00
sys 1.10

** Case 7:
- Dom0 has 4 vcpus and is idle.
- DomU has 4 vcpus. Two of them are executing an endless loop with a sh one-liner:
# while : ; do : ; done &
- MiniOS has 1 vcpu and is not idle, because its scheduler does not call WFI.
- *MiniOS is running in a separate cpu pool with 1 pcpu*:
Name               CPUs   Sched     Active   Domain count
Pool-0               3    credit       y          2
minios               1    credit       y          1

I run test in DomU:
real 1.11
user 0.00
sys 1.10

** Case 8:
- Dom0 has 4 vcpus and is idle.
- DomU has 4 vcpus. Three of them are executing an endless loop with a sh one-liner:
# while : ; do : ; done &
- MiniOS has 1 vcpu and is not idle, because its scheduler does not call WFI.
- MiniOS is running in a separate cpu pool with 1 pcpu:

I run test in DomU:
real 100.12
user 0.00
sys 100.11


As you can see, I tried to move MiniOS to a separate cpu pool, but it
didn't help much.
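
For reference, setting up such a pool looks roughly like this (the pCPU
number is illustrative, and the exact cpupool config syntax may need
adjusting for your xl version):

# xl cpupool-cpu-remove Pool-0 3
# cat > minios-pool.cfg <<EOF
name = "minios"
sched = "credit"
cpus = ["3"]
EOF
# xl cpupool-create minios-pool.cfg
# xl cpupool-migrate mini-os minios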

Name                                        ID   Mem VCPUs     State   Time(s)  Cpupool
Domain-0                                     0   752     4    r-----    1566.1   Pool-0
DomU                                         1   255     4    -b----    4535.1   Pool-0
mini-os                                      2   128     1    r-----    2395.7   minios


I expected that it would be 20% to 50% slower when there are more
vCPUs than pCPUs. But it is 100 times slower and I can't explain this.
Probably something is very broken in my XEN. But I used 4.9 with some
hacks to make MiniOS work, and I didn't touch the scheduler at all.

-- 
WBR Volodymyr Babchuk aka lorc [+380976646013]
mailto: vlad.babchuk@gmail.com

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-07-07 15:02                   ` Volodymyr Babchuk
@ 2017-07-07 16:41                     ` Dario Faggioli
  2017-07-07 17:03                       ` Volodymyr Babchuk
  0 siblings, 1 reply; 49+ messages in thread
From: Dario Faggioli @ 2017-07-07 16:41 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Artem_Mygaiev, xen-devel, Andrii Anisov, George Dunlap,
	Julien Grall, Stefano Stabellini



On Fri, 2017-07-07 at 18:02 +0300, Volodymyr Babchuk wrote:
> Hello Dario,
> 
Hi!

> On 20 June 2017 at 13:11, Dario Faggioli <dario.faggioli@citrix.com>
> wrote:
> > On Mon, 2017-06-19 at 11:36 -0700, Volodymyr Babchuk wrote:
> > > 
> > > Thanks. Actually, we discussed this topic internally today. Main
> > > concern today is not a SMCs and OP-TEE (I will be happy to do
> > > this
> > > right in XEN), but vcopros and GPU virtualization. Because of
> > > legal
> > > issues, we can't put this in XEN. And because of vcpu framework
> > > nature
> > > we will need multiple calls to vgpu driver per one vcpu context
> > > switch.
> > > I'm going to create worst case scenario, where multiple vcpu are
> > > active and there are no free pcpu, to see how credit or credit2
> > > scheduler will call my stubdom.
> > > 
> > 
> > Well, that would be interesting and useful, thanks for offering
> > doing
> > that.
> 
> Yeah, so I did that. 
>
Ok, great! Thanks for doing and reporting about this. :-D

> And I have get some puzzling results. I don't know why,
> but when I have 4 (or less) active vcpus on 4 pcpus, my test  takes
> about 1 second to execute.
> But if there are 5 (or mode) active vcpus on 4 pcpus, it executes
> from
> 80 to 110 seconds.
> 
I see. So, I've got just a handful of minutes right now, to only
quickly look at the result and ask a couple of questions. Will think
about this more in the coming days...

> There will be the details, but first let me remind you my setup.
>  I'm testing on ARM64 machine with 4 Cortex A57 cores. I wrote
> special test driver for linux, that calls SMC instruction 100 000
> times.
> Also I hacked miniOS to act as monitor for DomU. This means that
> XEN traps SMC invocation and asks MiniOS to handle this.
>
Ok.

> So, every SMC is handled in this way:
> 
> DomU->XEN->MiniOS->XEN->DomU.
> 
Right. Nice work again.

> Now, let's get back to results.
> 
> ** Case 1:
> - Dom0 has 4 vcpus and is idle
> - DomU has 4 vcpus and is idle
> - Minios has 1 vcpu and is not idle, because it's scheduler does
> not calls WFI.
> I run test in DomU:
> 
> root@salvator-x-h3-xt:~# time -p cat /proc/smc_bench
> Will call SMC 100000 time(s)
>
So, given what you said above, this means that the vCPU that is running
this will block (when calling SMC) and resume (when the SMC is
handled) quite frequently, right?

Also, are you sure (e.g., because of how the Linux driver is done) that
this always happens on one vCPU?

> Done!
> real 1.10
> user 0.00
> sys 1.10

> ** Case 2:
> - Dom0 has 4 vcpus. They all are executing endless loop with sh
> oneliner:
> # while : ; do : ; done &
> - DomU has 4 vcpus and is idle
> - Minios has 1 vcpu and is not idle, because it's scheduler does not
> calls WFI.
>
Ah, I see. This is unideal IMO. It's fine for this POC, of course, but
I guess you've got plans to change this (if we decide to go the stubdom
route)?

> - In total there are 6 vcpus active
> 
> I run test in DomU:
> real 113.08
> user 0.00
> sys 113.04
> 
Ok, so there's contention for pCPUs. Dom0's vCPUs are CPU hogs, while,
if my assumption above is correct, the "SMC vCPU" of the DomU is I/O
bound, in the sense that it blocks on an operation --which turns out to
be an SMC call to MiniOS-- then resumes and blocks again almost
immediately.

Since you are using Credit, can you try to disable context switch rate
limiting? Something like:

# xl sched-credit -s -r 0

should work.
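
(If I remember the xl syntax right, running "xl sched-credit -s" with no
other options just prints the current scheduler-wide parameters, so you
can check the ratelimit value before and after changing it.)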

This looks to me like one of those typical scenarios where rate limiting
is counterproductive. In fact, every time that your SMC vCPU is woken
up, despite being boosted, it finds all the pCPUs busy, and it can't
preempt any of the vCPUs that are running there, until rate limiting
expires.

That means it has to wait an interval of time that varies between 0 and
1ms. This happens 100000 times, and 1ms*100000 is 100 seconds... Which
is roughly how long the test takes, in the overcommitted case.

> * Case 7
> - Dom0 has 4 vcpus and is idle.
> - DomU has 4 vcpus. Two of them are executing endless loop with sh
> oneliner:
> # while : ; do : ; done &
> - Minios have 1 vcpu and is not idle, because it's scheduler does not
> calls WFI.
> - *Minios is running on separate cpu pool with 1 pcpu*:
> Name               CPUs   Sched     Active   Domain count
> Pool-0               3    credit       y          2
> minios               1    credit       y          1
> 
> I run test in DomU:
> real 1.11
> user 0.00
> sys 1.10
> 
> * Case 8
> - Dom0 has 4 vcpus and is idle.
> - DomU has 4 vcpus. Three of them are executing endless loop with sh
> oneliner:
> # while : ; do : ; done &
> - Minios have 1 vcpu and is not idle, because it's scheduler does not
> calls WFI.
> - Minios is running on separate cpu pool with 1 pcpu:
> 
> I run test in DomU:
> real 100.12
> user 0.00
> sys 100.11
> 
> 
> As you can see, I tried to move minios to separate cpu pool. But it
> didn't helped a lot.
> 
Yes, but it again makes sense. In fact, now there are 3 CPUs in Pool-0,
and all are kept busy all the time by the 3 DomU vCPUs running endless
loops. So, when the DomU's SMC vCPU wakes up, it again has to wait for the
rate limit to expire on one of them.

> I expected that it would be 20% to 50% slower, when there are more
> vCPUs than pCPUs. But it is 100 times slower and I can't explain
> this.
> Probably, something is very broken in my XEN. But I used 4.9 with
> some
> hacks to make minios work. I didn't touched scheduler at all.
> 
If you can, try with rate limiting off and let me know. :-D

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-07-07 16:41                     ` Dario Faggioli
@ 2017-07-07 17:03                       ` Volodymyr Babchuk
  2017-07-07 21:12                         ` Stefano Stabellini
  2017-07-08 14:26                         ` Dario Faggioli
  0 siblings, 2 replies; 49+ messages in thread
From: Volodymyr Babchuk @ 2017-07-07 17:03 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Artem_Mygaiev, xen-devel, Andrii Anisov, George Dunlap,
	Julien Grall, Stefano Stabellini

Hi again,

On 7 July 2017 at 09:41, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> On Fri, 2017-07-07 at 18:02 +0300, Volodymyr Babchuk wrote:
>> Hello Dario,
>>
> Hi!
>
>> On 20 June 2017 at 13:11, Dario Faggioli <dario.faggioli@citrix.com>
>> wrote:
>> > On Mon, 2017-06-19 at 11:36 -0700, Volodymyr Babchuk wrote:
>> > >
>> > > Thanks. Actually, we discussed this topic internally today. Main
>> > > concern today is not a SMCs and OP-TEE (I will be happy to do
>> > > this
>> > > right in XEN), but vcopros and GPU virtualization. Because of
>> > > legal
>> > > issues, we can't put this in XEN. And because of vcpu framework
>> > > nature
>> > > we will need multiple calls to vgpu driver per one vcpu context
>> > > switch.
>> > > I'm going to create worst case scenario, where multiple vcpu are
>> > > active and there are no free pcpu, to see how credit or credit2
>> > > scheduler will call my stubdom.
>> > >
>> >
>> > Well, that would be interesting and useful, thanks for offering
>> > doing
>> > that.
>>
>> Yeah, so I did that.
>>
> Ok, great! Thanks for doing and reporting about this. :-D
>
>> And I have get some puzzling results. I don't know why,
>> but when I have 4 (or less) active vcpus on 4 pcpus, my test  takes
>> about 1 second to execute.
>> But if there are 5 (or mode) active vcpus on 4 pcpus, it executes
>> from
>> 80 to 110 seconds.
>>
> I see. So, I've got just a handful of minutes right now, to only
> quickly look at the result and ask a couple of questions. Will think
> about this more in the coming days...
>
>> There will be the details, but first let me remind you my setup.
>>  I'm testing on ARM64 machine with 4 Cortex A57 cores. I wrote
>> special test driver for linux, that calls SMC instruction 100 000
>> times.
>> Also I hacked miniOS to act as monitor for DomU. This means that
>> XEN traps SMC invocation and asks MiniOS to handle this.
>>
> Ok.
>
>> So, every SMC is handled in this way:
>>
>> DomU->XEN->MiniOS->XEN->DomU.
>>
> Right. Nice work again.
>
>> Now, let's get back to results.
>>
>> ** Case 1:
>> - Dom0 has 4 vcpus and is idle
>> - DomU has 4 vcpus and is idle
>> - Minios has 1 vcpu and is not idle, because it's scheduler does
>> not calls WFI.
>> I run test in DomU:
>>
>> root@salvator-x-h3-xt:~# time -p cat /proc/smc_bench
>> Will call SMC 100000 time(s)
>>
> So, given what you said above, this means that the vCPU that is running
> this will frequently block (when calling SMC) and resume (when SMC is
> handled) quite frequently, right?
Yes, exactly. There is a vm_event_vcpu_pause(v) call in monitor.c

>
> Also, are you sure (e.g., because of how the Linux driver is done) that
> this always happen on one vCPU?
No, I can't guarantee that. The Linux driver is single-threaded, but I did
nothing to pin it to a certain CPU.

>
>> Done!
>> real 1.10
>> user 0.00
>> sys 1.10
>
>> ** Case 2:
>> - Dom0 has 4 vcpus. They all are executing endless loop with sh
>> oneliner:
>> # while : ; do : ; done &
>> - DomU has 4 vcpus and is idle
>> - Minios has 1 vcpu and is not idle, because it's scheduler does not
>> calls WFI.
>>
> Ah, I see. This is unideal IMO. It's fine for this POC, of course, but
> I guess you've got plans to change this (if we decide to go the stubdom
> route)?
Sure. There is much to be done in MiniOS to make it production-grade.

>
>> - In total there are 6 vcpus active
>>
>> I run test in DomU:
>> real 113.08
>> user 0.00
>> sys 113.04
>>
> Ok, so there's contention for pCPUs. Dom0's vCPUs are CPU hogs, while,
> if my assumption above is correct, the "SMC vCPU" of the DomU is I/O
> bound, in the sense that it blocks on an operation --which turns out to
> be SMC call to MiniOS-- then resumes and block again almost
> immediately.
>
> Since you are using Credit, can you try to disable context switch rate
> limiting? Something like:
>
> # xl sched-credit -s -r 0
>
> should work.
Yep. You are right. In the environment described above (Case 2) I now
get much better results:

 real 1.85
user 0.00
sys 1.85


> This looks to me like one of those typical scenario where rate limiting
> is counterproductive. In fact, every time that your SMC vCPU is woken
> up, despite being boosted, it finds all the pCPUs busy, and it can't
> preempt any of the vCPUs that are running there, until rate limiting
> expires.
>
> That means it has to wait an interval of time that varies between 0 and
> 1ms. This happens 100000 times, and 1ms*100000 is 100 seconds... Which
> is roughly how the test takes, in the overcommitted case.
Yes, looks like that was the case. Does this mean that ratelimiting
should be disabled for any domain that is backed by a device model?
AFAIK, device models work in exactly the same way.

>> * Case 7
>> - Dom0 has 4 vcpus and is idle.
>> - DomU has 4 vcpus. Two of them are executing endless loop with sh
>> oneliner:
>> # while : ; do : ; done &
>> - Minios have 1 vcpu and is not idle, because it's scheduler does not
>> calls WFI.
>> - *Minios is running on separate cpu pool with 1 pcpu*:
>> Name               CPUs   Sched     Active   Domain count
>> Pool-0               3    credit       y          2
>> minios               1    credit       y          1
>>
>> I run test in DomU:
>> real 1.11
>> user 0.00
>> sys 1.10
>>
>> * Case 8
>> - Dom0 has 4 vcpus and is idle.
>> - DomU has 4 vcpus. Three of them are executing endless loop with sh
>> oneliner:
>> # while : ; do : ; done &
>> - Minios have 1 vcpu and is not idle, because it's scheduler does not
>> calls WFI.
>> - Minios is running on separate cpu pool with 1 pcpu:
>>
>> I run test in DomU:
>> real 100.12
>> user 0.00
>> sys 100.11
>>
>>
>> As you can see, I tried to move minios to separate cpu pool. But it
>> didn't helped a lot.
>>
> Yes, but it again makes sense. In fact, now there are 3 CPUs in Pool-0,
> and all are kept always busy by the the 3 DomU vCPUs running endless
> loops. So, when the DomU's SMC vCPU wakes up, has again to wait for the
> rate limit to expire on one of them.
Yes, as this was caused by the ratelimit, this makes perfect sense. Thank you.

I tried a number of different cases. Now execution time depends linearly
on the number of over-committed vCPUs (about +200ms for every busy vCPU).
That is what I expected.

-- 
WBR Volodymyr Babchuk aka lorc [+380976646013]
mailto: vlad.babchuk@gmail.com

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-07-07 17:03                       ` Volodymyr Babchuk
@ 2017-07-07 21:12                         ` Stefano Stabellini
  2017-07-12  6:14                           ` Dario Faggioli
  2017-07-08 14:26                         ` Dario Faggioli
  1 sibling, 1 reply; 49+ messages in thread
From: Stefano Stabellini @ 2017-07-07 21:12 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Artem_Mygaiev, xen-devel, Andrii Anisov, Dario Faggioli,
	George Dunlap, Julien Grall, Stefano Stabellini

On Fri, 7 Jul 2017, Volodymyr Babchuk wrote:
> >> I run test in DomU:
> >> real 113.08
> >> user 0.00
> >> sys 113.04
> >>
> > Ok, so there's contention for pCPUs. Dom0's vCPUs are CPU hogs, while,
> > if my assumption above is correct, the "SMC vCPU" of the DomU is I/O
> > bound, in the sense that it blocks on an operation --which turns out to
> > be SMC call to MiniOS-- then resumes and block again almost
> > immediately.
> >
> > Since you are using Credit, can you try to disable context switch rate
> > limiting? Something like:
> >
> > # xl sched-credit -s -r 0
> >
> > should work.
> Yep. You are right. In the environment described above (Case 2) I now
> get much better results:
> 
>  real 1.85
> user 0.00
> sys 1.85

From 113 to 1.85 -- WOW!

Obviously I am no scheduler expert, but shouldn't we advertise a bit
better a scheduler configuration option that makes things _one hundred
times faster_ ?! It's not even mentioned in
https://wiki.xen.org/wiki/Tuning_Xen_for_Performance!

Also, it is worrying to me that there are cases where, unless the user
tweaks the configuration, she is going to get 100x worse performance out
of her system.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-07-07 17:03                       ` Volodymyr Babchuk
  2017-07-07 21:12                         ` Stefano Stabellini
@ 2017-07-08 14:26                         ` Dario Faggioli
  1 sibling, 0 replies; 49+ messages in thread
From: Dario Faggioli @ 2017-07-08 14:26 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Artem_Mygaiev, xen-devel, Andrii Anisov, George Dunlap,
	Julien Grall, Stefano Stabellini



On Fri, 2017-07-07 at 10:03 -0700, Volodymyr Babchuk wrote:
> On 7 July 2017 at 09:41, Dario Faggioli <dario.faggioli@citrix.com>
> wrote:
> > 
> > Also, are you sure (e.g., because of how the Linux driver is done)
> > that
> > this always happen on one vCPU?
> 
> No, I can't guarantee that. Linux driver is single threaded, but I
> did
> nothing to pin in to a certain CPU.
> 
Ok, it was just to understand.

> > 
> > > - In total there are 6 vcpus active
> > > 
> > > I run test in DomU:
> > > real 113.08
> > > user 0.00
> > > sys 113.04
> > > 
> > 
> > Ok, so there's contention for pCPUs. Dom0's vCPUs are CPU hogs,
> > while,
> > if my assumption above is correct, the "SMC vCPU" of the DomU is
> > I/O
> > bound, in the sense that it blocks on an operation --which turns
> > out to
> > be SMC call to MiniOS-- then resumes and block again almost
> > immediately.
> > 
> > Since you are using Credit, can you try to disable context switch
> > rate
> > limiting? Something like:
> > 
> > # xl sched-credit -s -r 0
> > 
> > should work.
> 
> Yep. You are right. In the environment described above (Case 2) I now
> get much better results:
> 
>  real 1.85
> user 0.00
> sys 1.85
> 
Ok, glad to hear it worked! :-)

> > This looks to me like one of those typical scenario where rate
> > limiting
> > is counterproductive. In fact, every time that your SMC vCPU is
> > woken
> > up, despite being boosted, it finds all the pCPUs busy, and it
> > can't
> > preempt any of the vCPUs that are running there, until rate
> > limiting
> > expires.
> > 
> > That means it has to wait an interval of time that varies between 0
> > and
> > 1ms. This happens 100000 times, and 1ms*100000 is 100 seconds...
> > Which
> > is roughly how the test takes, in the overcommitted case.
> 
> Yes, looks like that was the case. Does this means that ratelimiting
> should be disabled for any domain that is backed up with device
> model?
> AFAIK, device models are working in the exactly same way.
> 
Rate limiting is a scheduler-wide thing. If it's on, the context-switch
rate of all domains is limited. If it's off, none is.

We'll have to see when we have something that is less of a proof-of-
concept, but it is very likely that, for your use case, rate-limiting
should just be kept disabled (you can do that with a Xen boot time
parameter, so that you don't have to issue the command every time).
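
If I remember the name right, that is the sched_ratelimit_us option
(please double check docs/misc/xen-command-line.markdown in your tree
for the exact spelling and default), i.e. something like this on the
Xen command line:

  sched_ratelimit_us=0

should give you, from boot, the same effect as the xl command above.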

> > Yes, but it again makes sense. In fact, now there are 3 CPUs in
> > Pool-0,
> > and all are kept always busy by the the 3 DomU vCPUs running
> > endless
> > loops. So, when the DomU's SMC vCPU wakes up, has again to wait for
> > the
> > rate limit to expire on one of them.
> 
> Yes, as this was caused by ratelimit, this makes perfect sense. Thank
> you.
> 
> I tried number of different cases. Now execution time depends
> linearly
> on number of over-committed vCPUs (about +200ms for every busy vCPU).
> That is what I'm expected.
>
Is this the case even when MiniOS is in its own cpupool? If yes, it
means that the slowdown is caused by contention between the vCPU that
is doing the SMC calls and the other vCPUs (of either the same or
other domains).

That should not really happen in this case (or, at least, it should not
grow linearly), since you are on Credit1, where the SMC vCPU should
pretty much always be boosted, and hence get scheduled almost
immediately, no matter how many CPU hogs there are around.

Depending on the specific details of your usecase/product, we can try
assigning different weights to the various domains... but I need to
think a bit more about this...
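
(Just to give the idea, and without having thought about what values
would actually make sense here, bumping the weight of the domain doing
the SMC calls would look something like this, the Credit default weight
being 256:

# xl sched-credit -d DomU -w 512

But, as I said, I need to think about whether that is actually the
right knob for this case.)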

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-07-07 21:12                         ` Stefano Stabellini
@ 2017-07-12  6:14                           ` Dario Faggioli
  2017-07-17  9:25                             ` George Dunlap
  0 siblings, 1 reply; 49+ messages in thread
From: Dario Faggioli @ 2017-07-12  6:14 UTC (permalink / raw)
  To: Stefano Stabellini, Volodymyr Babchuk
  Cc: Artem_Mygaiev, Julien Grall, xen-devel, Andrii Anisov, George Dunlap



On Fri, 2017-07-07 at 14:12 -0700, Stefano Stabellini wrote:
> On Fri, 7 Jul 2017, Volodymyr Babchuk wrote:
> > > > 
> > > Since you are using Credit, can you try to disable context switch
> > > rate
> > > limiting?
> >
> > Yep. You are right. In the environment described above (Case 2) I
> > now
> > get much better results:
> > 
> >  real 1.85
> > user 0.00
> > sys 1.85
> 
> From 113 to 1.85 -- WOW!
> 
> Obviously I am no scheduler expert, but shouldn't we advertise a bit
> better a scheduler configuration option that makes things _one
> hundred
> times faster_ ?! 
>
So, to be fair, so far we've been bitten this hard by this only on
artificially constructed test cases, where either some extreme
assumptions were made (e.g., that all the vCPUs except one always run at
100% load) or pinning was used in a weird and suboptimal way. And there
are workloads where it has been verified that it makes performance
better (poor SpecVirt results without it were the main motivation for
having it upstream, and on by default).

That being said, I personally have never liked rate-limiting; it always
looked to me like the wrong solution.

> It's not even mentioned in
> https://wiki.xen.org/wiki/Tuning_Xen_for_Performance!
> 
Well, for sure it should be mentioned here, you're right!

> Also, it is worrying to me that there are cases were, unless the user
> tweaks the configuration, she is going to get 100x worse performance
> out
> of her system.
>
As I said, it's hard to tell in advance whether it will have a good,
bad, or really bad impact on a specific workload.

I'm starting to think, though, that it may be good to switch to having
it off by default, and then document that if the system is going into
thrashing because of too frequent context switches, turning it on may
help.

I'll think about it, and see if I'll be able to run some benchmarks
with it on and off.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-07-12  6:14                           ` Dario Faggioli
@ 2017-07-17  9:25                             ` George Dunlap
  2017-07-17 10:04                               ` Julien Grall
  2017-07-20  8:49                               ` Dario Faggioli
  0 siblings, 2 replies; 49+ messages in thread
From: George Dunlap @ 2017-07-17  9:25 UTC (permalink / raw)
  To: Dario Faggioli, Stefano Stabellini, Volodymyr Babchuk
  Cc: Artem_Mygaiev, Julien Grall, xen-devel, Andrii Anisov

On 07/12/2017 07:14 AM, Dario Faggioli wrote:
> On Fri, 2017-07-07 at 14:12 -0700, Stefano Stabellini wrote:
>> On Fri, 7 Jul 2017, Volodymyr Babchuk wrote:
>>>>>
>>>> Since you are using Credit, can you try to disable context switch
>>>> rate
>>>> limiting?
>>>
>>> Yep. You are right. In the environment described above (Case 2) I
>>> now
>>> get much better results:
>>>
>>>  real 1.85
>>> user 0.00
>>> sys 1.85
>>
>> From 113 to 1.85 -- WOW!
>>
>> Obviously I am no scheduler expert, but shouldn't we advertise a bit
>> better a scheduler configuration option that makes things _one
>> hundred
>> times faster_ ?! 
>>
> So, to be fair, so far, we've bitten this hard by this only on
> artificially constructed test cases, where either some extreme
> assumption were made (e.g., that all the vCPUs except one always run at
> 100% load) or pinning was used in a weird and suboptimal way. And there
> are workload where it has been verified that it helps making
> performance better (poor SpecVIRT  results without it was the main
> motivation having it upstream, and on by default).
> 
> That being said, I personally have never liked rate-limiting, it always
> looked to me like the wrong solution.

In fact, I *think* the only reason it may have been introduced is that
there was a bug in the credit2 code at the time such that it always had
a single runqueue no matter what your actual pcpu topology was.

>> It's not even mentioned in
>> https://wiki.xen.org/wiki/Tuning_Xen_for_Performance!
>>
> Well, for sure it should be mentioned here, you're right!
> 
>> Also, it is worrying to me that there are cases were, unless the user
>> tweaks the configuration, she is going to get 100x worse performance
>> out
>> of her system.
>>
> As I said, it's hard to tell in advance whether it will have a good,
> bad, or really bad impact on a specific workload.
> 
> I'm starting to think, though, that it may be good to switch to having
> it off by default, and then document that if the system is going into
> trashing because of too frequent context switches, turning it on may
> help.
> 
> I'll think about it, and see if I'll be able to run some benchmarks
> with it on and off.

Thanks.  FYI the main benchmark that was used to justify its inclusion
(and on by default) was specvirt (I think).

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-07-17  9:25                             ` George Dunlap
@ 2017-07-17 10:04                               ` Julien Grall
  2017-07-17 11:28                                 ` George Dunlap
  2017-07-20  8:49                               ` Dario Faggioli
  1 sibling, 1 reply; 49+ messages in thread
From: Julien Grall @ 2017-07-17 10:04 UTC (permalink / raw)
  To: George Dunlap, Dario Faggioli, Stefano Stabellini, Volodymyr Babchuk
  Cc: Artem_Mygaiev, xen-devel, Andrii Anisov

Hi,

On 17/07/17 10:25, George Dunlap wrote:
> On 07/12/2017 07:14 AM, Dario Faggioli wrote:
>> On Fri, 2017-07-07 at 14:12 -0700, Stefano Stabellini wrote:
>>> On Fri, 7 Jul 2017, Volodymyr Babchuk wrote:
>>>>>>
>>>>> Since you are using Credit, can you try to disable context switch
>>>>> rate
>>>>> limiting?
>>>>
>>>> Yep. You are right. In the environment described above (Case 2) I
>>>> now
>>>> get much better results:
>>>>
>>>>  real 1.85
>>>> user 0.00
>>>> sys 1.85
>>>
>>> From 113 to 1.85 -- WOW!
>>>
>>> Obviously I am no scheduler expert, but shouldn't we advertise a bit
>>> better a scheduler configuration option that makes things _one
>>> hundred
>>> times faster_ ?!
>>>
>> So, to be fair, so far, we've bitten this hard by this only on
>> artificially constructed test cases, where either some extreme
>> assumption were made (e.g., that all the vCPUs except one always run at
>> 100% load) or pinning was used in a weird and suboptimal way. And there
>> are workload where it has been verified that it helps making
>> performance better (poor SpecVIRT  results without it was the main
>> motivation having it upstream, and on by default).
>>
>> That being said, I personally have never liked rate-limiting, it always
>> looked to me like the wrong solution.
>
> In fact, I *think* the only reason it may have been introduced is that
> there was a bug in the credit2 code at the time such that it always had
> a single runqueue no matter what your actual pcpu topology was.

FWIW, we don't yet parse the pCPU topology on ARM. AFAIU, we always tell
Xen each CPU is in its own core. Will this have any implications for the
scheduler?

Cheers,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-07-17 10:04                               ` Julien Grall
@ 2017-07-17 11:28                                 ` George Dunlap
  2017-07-19 11:21                                   ` Julien Grall
  2017-07-20  9:10                                   ` Dario Faggioli
  0 siblings, 2 replies; 49+ messages in thread
From: George Dunlap @ 2017-07-17 11:28 UTC (permalink / raw)
  To: Julien Grall, Dario Faggioli, Stefano Stabellini, Volodymyr Babchuk
  Cc: Artem_Mygaiev, xen-devel, Andrii Anisov

On 07/17/2017 11:04 AM, Julien Grall wrote:
> Hi,
> 
> On 17/07/17 10:25, George Dunlap wrote:
>> On 07/12/2017 07:14 AM, Dario Faggioli wrote:
>>> On Fri, 2017-07-07 at 14:12 -0700, Stefano Stabellini wrote:
>>>> On Fri, 7 Jul 2017, Volodymyr Babchuk wrote:
>>>>>>>
>>>>>> Since you are using Credit, can you try to disable context switch
>>>>>> rate
>>>>>> limiting?
>>>>>
>>>>> Yep. You are right. In the environment described above (Case 2) I
>>>>> now
>>>>> get much better results:
>>>>>
>>>>>  real 1.85
>>>>> user 0.00
>>>>> sys 1.85
>>>>
>>>> From 113 to 1.85 -- WOW!
>>>>
>>>> Obviously I am no scheduler expert, but shouldn't we advertise a bit
>>>> better a scheduler configuration option that makes things _one
>>>> hundred
>>>> times faster_ ?!
>>>>
>>> So, to be fair, so far, we've bitten this hard by this only on
>>> artificially constructed test cases, where either some extreme
>>> assumption were made (e.g., that all the vCPUs except one always run at
>>> 100% load) or pinning was used in a weird and suboptimal way. And there
>>> are workload where it has been verified that it helps making
>>> performance better (poor SpecVIRT  results without it was the main
>>> motivation having it upstream, and on by default).
>>>
>>> That being said, I personally have never liked rate-limiting, it always
>>> looked to me like the wrong solution.
>>
>> In fact, I *think* the only reason it may have been introduced is that
>> there was a bug in the credit2 code at the time such that it always had
>> a single runqueue no matter what your actual pcpu topology was.
> 
> FWIW, we don't yet parse the pCPU topology on ARM. AFAIU, we always tell
> Xen each CPU is in its own core. Will it have some implications in the
> scheduler?

Just checking -- you do mean its own core, as opposed to its own socket?
 (Or NUMA node?)

On any system without hyperthreading (or with HT disabled), that's what
an x86 system will see as well.

Most schedulers have one runqueue per logical cpu.  Credit2 has the
option of having one runqueue per logical cpu, one per core (i.e.,
hyperthreads share a runqueue), one runqueue per socket (i.e., all cores
on the same socket share a runqueue), or one socket across the whole
system.  I *think* we made one socket per core the default a while back
to deal with multithreading, but I may not be remembering correctly.

In any case, if you don't have threads, then reporting each logical cpu
as its own core is the right thing to do.

If you're mis-reporting sockets, then the scheduler will be unable to
take that into account.  But that's not usually going to be a major
issue, mainly because the scheduler is not actually in a position to
determine, most of the time, which is the optimal configuration.  If two
vcpus are communicating a lot, then the optimal configuration is to put
them on different cores of the same socket (so they can share an L3
cache); if two vcpus are computing independently, then the optimal
configuration is to put them on different sockets, so they can each have
their own L3 cache.  Xen isn't in a position to know which one is more
important, so it just assumes each vcpu is independent.

All that to say: It shouldn't be a major issue if you are mis-reporting
sockets. :-)

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-07-17 11:28                                 ` George Dunlap
@ 2017-07-19 11:21                                   ` Julien Grall
  2017-07-20  9:25                                     ` Dario Faggioli
  2017-07-20  9:10                                   ` Dario Faggioli
  1 sibling, 1 reply; 49+ messages in thread
From: Julien Grall @ 2017-07-19 11:21 UTC (permalink / raw)
  To: George Dunlap, Dario Faggioli, Stefano Stabellini, Volodymyr Babchuk
  Cc: Artem_Mygaiev, xen-devel, Andrii Anisov

Hi George,

On 17/07/17 12:28, George Dunlap wrote:
> On 07/17/2017 11:04 AM, Julien Grall wrote:
>> Hi,
>>
>> On 17/07/17 10:25, George Dunlap wrote:
>>> On 07/12/2017 07:14 AM, Dario Faggioli wrote:
>>>> On Fri, 2017-07-07 at 14:12 -0700, Stefano Stabellini wrote:
>>>>> On Fri, 7 Jul 2017, Volodymyr Babchuk wrote:
>>>>>>>>
>>>>>>> Since you are using Credit, can you try to disable context switch
>>>>>>> rate
>>>>>>> limiting?
>>>>>>
>>>>>> Yep. You are right. In the environment described above (Case 2) I
>>>>>> now
>>>>>> get much better results:
>>>>>>
>>>>>>  real 1.85
>>>>>> user 0.00
>>>>>> sys 1.85
>>>>>
>>>>> From 113 to 1.85 -- WOW!
>>>>>
>>>>> Obviously I am no scheduler expert, but shouldn't we advertise a bit
>>>>> better a scheduler configuration option that makes things _one
>>>>> hundred
>>>>> times faster_ ?!
>>>>>
>>>> So, to be fair, so far, we've bitten this hard by this only on
>>>> artificially constructed test cases, where either some extreme
>>>> assumption were made (e.g., that all the vCPUs except one always run at
>>>> 100% load) or pinning was used in a weird and suboptimal way. And there
>>>> are workload where it has been verified that it helps making
>>>> performance better (poor SpecVIRT  results without it was the main
>>>> motivation having it upstream, and on by default).
>>>>
>>>> That being said, I personally have never liked rate-limiting, it always
>>>> looked to me like the wrong solution.
>>>
>>> In fact, I *think* the only reason it may have been introduced is that
>>> there was a bug in the credit2 code at the time such that it always had
>>> a single runqueue no matter what your actual pcpu topology was.
>>
>> FWIW, we don't yet parse the pCPU topology on ARM. AFAIU, we always tell
>> Xen each CPU is in its own core. Will it have some implications in the
>> scheduler?
>
> Just checking -- you do mean its own core, as opposed to its own socket?
>  (Or NUMA node?)

I don't know much about the scheduler, so I might say something stupid
here :). Below is the code we have for ARM:

/* XXX these seem awfully x86ish... */
/* representing HT siblings of each logical CPU */
DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_mask);
/* representing HT and core siblings of each logical CPU */
DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_mask);

static void setup_cpu_sibling_map(int cpu)
{
     if ( !zalloc_cpumask_var(&per_cpu(cpu_sibling_mask, cpu)) ||
          !zalloc_cpumask_var(&per_cpu(cpu_core_mask, cpu)) )
         panic("No memory for CPU sibling/core maps");

     /* A CPU is a sibling with itself and is always on its own core. */
     cpumask_set_cpu(cpu, per_cpu(cpu_sibling_mask, cpu));
     cpumask_set_cpu(cpu, per_cpu(cpu_core_mask, cpu));
}

#define cpu_to_socket(_cpu) (0)

After calling setup_cpu_sibling_map, we never touch cpu_sibling_mask and 
cpu_core_mask for a given pCPU. So I would say that each logical CPU is 
in its own core, but they are all in the same socket at the moment.
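
Just to illustrate the direction that proper topology parsing could
take, here is a purely illustrative sketch (not code that exists today;
it assumes an MPIDR_AFFINITY_LEVEL()/cpu_logical_map()-style pair of
helpers is usable here, and that the MPIDR affinity fields actually
match the physical core/socket layout, which in general they don't have
to; the real fix is to parse the DT/ACPI topology bindings):

/* Illustrative only: derive the socket from an MPIDR affinity field
 * instead of hardcoding everything onto socket 0. */
#define cpu_to_socket(cpu) \
    MPIDR_AFFINITY_LEVEL(cpu_logical_map(cpu), 2)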

>
> On any system without hyperthreading (or with HT disabled), that's what
> an x86 system will see as well.
>
> Most schedulers have one runqueue per logical cpu.  Credit2 has the
> option of having one runqueue per logical cpu, one per core (i.e.,
> hyperthreads share a runqueue), one runqueue per socket (i.e., all cores
> on the same socket share a runqueue), or one socket across the whole
> system.  I *think* we made one socket per core the default a while back
> to deal with multithreading, but I may not be remembering correctly.
>
> In any case, if you don't have threads, then reporting each logical cpu
> as its own core is the right thing to do.

The architecture doesn't disallow HT on ARM, though I am not
aware of any cores using it today.

>
> If you're mis-reporting sockets, then the scheduler will be unable to
> take that into account.  But that's not usually going to be a major
> issue, mainly because the scheduler is not actually in a position to
> determine, most of the time, which is the optimal configuration.  If two
> vcpus are communicating a lot, then the optimal configuration is to put
> them on different cores of the same socket (so they can share an L3
> cache); if two vcpus are computing independently, then the optimal
> configuration is to put them on different sockets, so they can each have
> their own L3 cache.  Xen isn't in a position to know which one is more
> important, so it just assumes each vcpu is independent.
>
> All that to say: It shouldn't be a major issue if you are mis-reporting
> sockets. :-)

Good to know, thank you for the explanation! We might want to parse the 
bindings correctly to get a bit of improvement. I will add a task on jira.

Cheers,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-07-17  9:25                             ` George Dunlap
  2017-07-17 10:04                               ` Julien Grall
@ 2017-07-20  8:49                               ` Dario Faggioli
  1 sibling, 0 replies; 49+ messages in thread
From: Dario Faggioli @ 2017-07-20  8:49 UTC (permalink / raw)
  To: George Dunlap, Stefano Stabellini, Volodymyr Babchuk
  Cc: Artem_Mygaiev, Julien Grall, xen-devel, Andrii Anisov



On Mon, 2017-07-17 at 10:25 +0100, George Dunlap wrote:
> On 07/12/2017 07:14 AM, Dario Faggioli wrote:
> > 
> > That being said, I personally have never liked rate-limiting, it
> > always
> > looked to me like the wrong solution.
> 
> In fact, I *think* the only reason it may have been introduced is
> that
> there was a bug in the credit2 code at the time such that it always
> had
> a single runqueue no matter what your actual pcpu topology was.
> 
It was introduced because SpecVirt performance was bad: during
interrupt storms, the context-switch rate was really, really high.

It was all about Credit1... Work on Credit2 was stalled at the time,
and, AFAICR, no evaluation of Credit2 was involved:

https://wiki.xen.org/wiki/Credit_Scheduler#Context-Switch_Rate_Limiting
https://lists.xenproject.org/archives/html/xen-devel/2011-12/msg00897.html

(And in fact, it was not implemented in Credit2 until something like
last year, when Anshul wrote the code for it.)

SpecVirt performance was judged to be important enough (e.g., because
we've been told people were using it to compare us with other virt.
solutions) that this was set to on by default.

I don't know if that is still the case, as I've run many benchmarks
but never had the chance to try SpecVirt first hand myself. The fact is
that Credit1 does not have any measure in place to limit/control the
context-switch rate, and it has boosting, which means that rate-limiting
(as much as I may hate it :-P) is actually useful.

Whether we should have it disabled by default, and tell people (in
the documentation) to enable it if they think they're seeing the system
going into thrashing because of too frequent context switching, or
vice versa, is one of those things which is rather hard to tell. Let's see...

In Credit2, we do have CSCHED2_MIN_TIMER (which is not equivalent to
ratelimiting, of course, but it at least is something that goes in the
direction of trying to avoid too frequent interruptions), and (much
more importantly, IMO) we don't have boosting... So, I think it would be
interesting to try figuring out the role that rate-limiting plays when
Credit2 is in use (and then, maybe, if we find that there are
differences, find a way to have it enabled by default on Credit1 and
disabled on Credit2).

> > I'll think about it, and see if I'll be able to run some benchmarks
> > with it on and off.
> 
> Thanks.  FYI the main benchmark that was used to justify its
> inclusion
> (and on by default) was specvirt (I think).
> 
Yeah, I know. I'm not sure I will have the chance to run that soon,
though. I'll try a bunch of other workloads, and we'll see what I will
find. :-)

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-07-17 11:28                                 ` George Dunlap
  2017-07-19 11:21                                   ` Julien Grall
@ 2017-07-20  9:10                                   ` Dario Faggioli
  1 sibling, 0 replies; 49+ messages in thread
From: Dario Faggioli @ 2017-07-20  9:10 UTC (permalink / raw)
  To: George Dunlap, Julien Grall, Stefano Stabellini, Volodymyr Babchuk
  Cc: Artem_Mygaiev, xen-devel, Andrii Anisov



On Mon, 2017-07-17 at 12:28 +0100, George Dunlap wrote:
> Most schedulers have one runqueue per logical cpu.  Credit2 has the
> option of having one runqueue per logical cpu, one per core (i.e.,
> hyperthreads share a runqueue), one runqueue per socket (i.e., all
> cores
> on the same socket share a runqueue), or one socket across the whole
> system.  
>
You mean "or one runqueue across the whole system", I guess? :-)

> I *think* we made one socket per core the default a while back
> to deal with multithreading, but I may not be remembering correctly.
> 
We've had per-core runqueues as the default, to deal with hyperthreading,
for some time. Nowadays, handling hyperthreading is done independently
of the runqueue arrangement, and so the current default is one runqueue
per socket.
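
(For reference, the arrangement can be picked with the credit2_runqueue=
Xen boot parameter; IIRC the accepted values are along the lines of
core, socket, node and all, but docs/misc/xen-command-line.markdown is
the authoritative list. E.g.:

  credit2_runqueue=socket

would explicitly ask for the per-socket arrangement.)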

> In any case, if you don't have threads, then reporting each logical
> cpu as its own core is the right thing to do.
> 
Yep.

> If you're mis-reporting sockets, then the scheduler will be unable to
> take that into account.  
>
And if this means that each logical CPU is also reported as being its
own socket, then you have one runqueue per logical CPU.

> But that's not usually going to be a major
> issue, mainly because the scheduler is not actually in a position to
> determine, most of the time, which is the optimal configuration.  If
> two
> vcpus are communicating a lot, then the optimal configuration is to
> put
> them on different cores of the same socket (so they can share an L3
> cache); if two vcpus are computing independently, then the optimal
> configuration is to put them on different sockets, so they can each
> have
> their own L3 cache. 
>
This is all very true. However, if two CPUs share one runqueue, vCPUs
will seamlessly move between the two CPUs, without having to wait for
the load balancing logic to kick in. This is a rather cheap way of
achieving good fairness and load balancing, but is only effective if
this movement is also cheap, which, e.g., is probably the case if the
CPUs share some level of cache.

So, figuring out what the best runqueue arrangement is, is rather hard
to do automatically, as it depends both on the workload and on the
hardware characteristics of the platform, but having at least some
degree of runqueue sharing, among the CPUs that have some cache levels
in common, would be, IMO, our best bet.

And we do need topology information to try to do that. (We would also
need, in Credit2 code, to take more into account cache and memory
hierarchy information, rather than "just" CPU topology. We're already
working, for instance, on changing CSCHED2_MIGRATE_RESIST from being a
constant to varying depending on the amount of cache-sharing between two
CPUs.)

> All that to say: It shouldn't be a major issue if you are mis-
> reporting
> sockets. :-)
> 
Maybe yes, maybe not. It may actually be even better on some
combinations of platforms and workloads, indeed... but it also means
that the Credit2 load balancer is being invoked a lot, which may be
unideal.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Notes on stubdoms and latency on ARM
  2017-07-19 11:21                                   ` Julien Grall
@ 2017-07-20  9:25                                     ` Dario Faggioli
  0 siblings, 0 replies; 49+ messages in thread
From: Dario Faggioli @ 2017-07-20  9:25 UTC (permalink / raw)
  To: Julien Grall, George Dunlap, Stefano Stabellini, Volodymyr Babchuk
  Cc: Artem_Mygaiev, xen-devel, Andrii Anisov



On Wed, 2017-07-19 at 12:21 +0100, Julien Grall wrote:
> On 17/07/17 12:28, George Dunlap wrote:
> > Just checking -- you do mean its own core, as opposed to its own
> > socket?
> >  (Or NUMA node?)
> 
> I don't know much about the scheduler, so I might say something
> stupid 
> here :). Below the code we have for ARM
> 
> /* XXX these seem awfully x86ish... */
> /* representing HT siblings of each logical CPU */
> DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_mask);
> /* representing HT and core siblings of each logical CPU */
> DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_mask);
> 
> static void setup_cpu_sibling_map(int cpu)
> {
>      if ( !zalloc_cpumask_var(&per_cpu(cpu_sibling_mask, cpu)) ||
>           !zalloc_cpumask_var(&per_cpu(cpu_core_mask, cpu)) )
>          panic("No memory for CPU sibling/core maps");
> 
>      /* A CPU is a sibling with itself and is always on its own core.
> */
>      cpumask_set_cpu(cpu, per_cpu(cpu_sibling_mask, cpu));
>      cpumask_set_cpu(cpu, per_cpu(cpu_core_mask, cpu));
> }
> 
> #define cpu_to_socket(_cpu) (0)
> 
> After calling setup_cpu_sibling_map, we never touch cpu_sibling_mask
> and 
> cpu_core_mask for a given pCPU. So I would say that each logical CPU
> is 
> in its own core, but they are all in the same socket at the moment.
> 
Ah, fine... so you're in the exact opposite situation I was thinking
about and reasoning upon in the reply to George I've just sent! :-P

Ok, this basically means that, by default, on any ARM system, no matter
how big or small, Credit2 will always use just one runqueue, from which
_all_ the pCPUs will fish vCPUs to run.

As said already, it's impossible to tell whether this is either bad or
good, in the general case. It's good for fairness and load distribution
(load balancing happens automatically, without the actual load
balancing logic and code having to do anything at all!), but it's bad
for lock contention (every runq operation, e.g., wakeup, schedule,
etc., has to take the same lock).

I think this explains at least part of why Stefano's wakeup latency
numbers are rather bad with Credit2, on ARM, but that is not the case
for my tests on x86.

> > All that to say: It shouldn't be a major issue if you are mis-
> > reporting
> > sockets. :-)
> 
> Good to know, thank you for the explanation! We might want to parse
> the 
> bindings correctly to get a bit of improvement. I will add a task on
> jira.
> 
Yes, we should. Credit1 does not care about this information, but
Credit2 is specifically designed to take advantage of it (and possibly
even more!), so it needs to be accurate. :-D

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 49+ messages in thread
