* Re: [PATCH] xen/arm: introduce vwfi parameter
       [not found] ` <a271394a-6c76-027c-fb08-b3fe775224ba@arm.com>
@ 2017-02-17 22:50   ` Stefano Stabellini
  2017-02-18  1:47     ` Dario Faggioli
  2017-02-19 21:34     ` Julien Grall
  0 siblings, 2 replies; 39+ messages in thread
From: Stefano Stabellini @ 2017-02-17 22:50 UTC (permalink / raw)
  To: Julien Grall
  Cc: edgar.iglesias, Stefano Stabellini, george.dunlap,
	dario.faggioli, xen-devel, nd

CC'ing xen-devel, which I forgot to do on the original patch

On Fri, 17 Feb 2017, Julien Grall wrote:
> Hi Stefano,
> 
> On 02/16/2017 11:04 PM, Stefano Stabellini wrote:
> > Introduce new Xen command line parameter called "vwfi", which stands for
> > virtual wfi. The default is "sleep": on guest wfi, Xen calls vcpu_block
> > on the guest vcpu. The behavior can be changed setting vwfi to "idle",
> > in that case Xen calls vcpu_yield.
> > 
> > The result is strong reduction in irq latency (8050ns -> 3500ns) at the
> > cost of idle_loop being called less often, leading to higher power
> > consumption.
> 
> Please explain in which context this will be beneficial. My gut feeling is
> it will only make performance worse if multiple vCPUs of the same guest
> are running on the same pCPU.

I am not a scheduler expert, but I don't think so. Let me explain the
difference:

- vcpu_block blocks a vcpu until an event occurs, for example until it
  receives an interrupt

- vcpu_yield stops the vcpu from running until the next scheduler slot

In both cases the vcpu is not run until the next slot, so I don't think
it should make performance worse in multi-vcpu scenarios. But I can
do some tests to double-check.
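
To make the difference concrete, the patch boils down to something like
this in the guest WFI trap path (a simplified sketch, not the exact code;
the parameter parsing is omitted, the variable name is made up, and the
exact call signatures may differ):

    /* Simplified sketch of the vwfi handling, not the exact patch. */
    static bool_t vwfi_sleep = 1;     /* vwfi=sleep is the default */

    /* ... in the handler for a trapped guest WFI ... */
    if ( vwfi_sleep )
        vcpu_block();                 /* sleep until an event/interrupt arrives */
    else
        vcpu_yield();                 /* give up the pCPU, but stay runnable */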


> > Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
> > CC: dario.faggioli@citrix.com
> > ---
> >  docs/misc/xen-command-line.markdown | 11 +++++++++++
> >  xen/arch/arm/traps.c                | 17 +++++++++++++++--
> >  2 files changed, 26 insertions(+), 2 deletions(-)
> > 
> > diff --git a/docs/misc/xen-command-line.markdown
> > b/docs/misc/xen-command-line.markdown
> > index a11fdf9..5d003e4 100644
> > --- a/docs/misc/xen-command-line.markdown
> > +++ b/docs/misc/xen-command-line.markdown
> > @@ -1632,6 +1632,17 @@ Note that if **watchdog** option is also specified
> > vpmu will be turned off.
> >  As the virtualisation is not 100% safe, don't use the vpmu flag on
> >  production systems (see http://xenbits.xen.org/xsa/advisory-163.html)!
> > 
> > +### vwfi
> > +> `= sleep | idle
> > +
> > +> Default: `sleep`
> > +
> > +WFI is the ARM instruction to "wait for interrupt". This option, which
> > +is ARM specific, changes the way guest WFI is implemented in Xen. By
> > +default, Xen blocks the guest vcpu, putting it to sleep. When setting
> > +vwfi to `idle`, Xen idles the guest vcpu instead, resulting in lower
> > +interrupt latency, but higher power consumption.
> 
> The main point of using wfi is for power saving. With this change, you will
> end up in a busy loop and as you said consume more power.

That's not true: the vcpu is still descheduled until the next slot.
There is no busy loop (that would indeed be very bad).


> I don't think this is acceptable even to get a better interrupt latency. Some
> workload will care about interrupt latency and power.
> 
> I think a better approach would be to check whether the scheduler has another
> vCPU to run. If not wait for an interrupt in the trap.
> 
> This would save the context switch to the idle vCPU if we are still on the
> time slice of the vCPU.

From my limited understanding of how schedulers work, I think this
cannot work reliably. It is the scheduler that needs to tell the
arch-specific code to put a pcpu to sleep, not the other way around. I
would appreciate if Dario could confirm this though.
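
To illustrate what I mean: today it is only the idle vCPU, which the
scheduler picks when there is nothing else to run, that actually halts
the pCPU. Roughly (a sketch from memory, the real code differs in the
details):

    /* Rough sketch of the existing idle path, from memory. */
    void idle_loop(void)                     /* runs only in the idle vCPU */
    {
        for ( ; ; )
        {
            if ( cpu_is_haltable(smp_processor_id()) )
                asm volatile("wfi");         /* the pCPU enters low-power state */
            do_softirq();                    /* this is where the scheduler runs */
        }
    }

So the arch code only executes WFI after the scheduler has decided to run
the idle vCPU on that pCPU.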


> Likely this may not fit everyone, so I would add some knowledge to change the
> behavior of WFI depending on how many vCPU are scheduled on the current pCPU.
> But this could be done as a second step.


* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-17 22:50   ` [PATCH] xen/arm: introduce vwfi parameter Stefano Stabellini
@ 2017-02-18  1:47     ` Dario Faggioli
  2017-02-19 21:27       ` Julien Grall
  2017-02-19 21:34     ` Julien Grall
  1 sibling, 1 reply; 39+ messages in thread
From: Dario Faggioli @ 2017-02-18  1:47 UTC (permalink / raw)
  To: Stefano Stabellini, Julien Grall
  Cc: edgar.iglesias, george.dunlap, nd, xen-devel



On Fri, 2017-02-17 at 14:50 -0800, Stefano Stabellini wrote:
> On Fri, 17 Feb 2017, Julien Grall wrote:
> > Please explain in which context this will be beneficial. My gut
> > feeling is it will only make performance worse if multiple vCPUs of
> > the same guest are running on the same pCPU.
> 
> I am not a scheduler expert, but I don't think so. Let me explain the
> difference:
> 
> - vcpu_block blocks a vcpu until an event occurs, for example until
> it
>   receives an interrupt
> 
> - vcpu_yield stops the vcpu from running until the next scheduler
> slot
> 
So, what happens when you yield depends on how yield is implemented in
the specific scheduler, and on what other vcpus are runnable in the
system.

Currently, neither Credit1 nor Credit2 (nor the Linux scheduler,
AFAICR) really stops the yielding vcpu. Broadly speaking, the following
two scenarios are possible:
 - vcpu A yields, and there is one or more runnable but not already 
   running other vcpus. In this case, A is indeed descheduled and put 
   back in a scheduler runqueue in such a way that one or more of the 
   runnable but not running other vcpus have a chance to execute, 
   before the scheduler would consider A again. This may be 
   implemented by putting A on the tail of the runqueue, so all the 
   other vcpus will get a chance to run (this is basically what 
   happens in Credit1, modulo periodic runq sorting). Or it may be
   implemented by ignoring A for the next <number> scheduling 
   decisions after it yielded (this is basically what happens in 
   Credit2). Both approaches have pros and cons, but the common bottom
   line is that others are given a chance to run.

 - vcpu A yields, and there are no runnable but not running vcpus
   around. In this case, A gets to run again. Full stop.

And when a vcpu that has yielded is picked back up for execution
--either immediately or after a few others-- it can run again. And if
it yields again (and again, and again), we just go back to scenario 1 or
2 above.
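
In very rough pseudo-code (schematic only: none of our schedulers
implements exactly this, and all the helper names are made up):

    /* Schematic yield handling; helper names are made up. */
    void do_yield(struct vcpu *v)
    {
        if ( other_runnable_vcpus(v->processor) )
        {
            /* Scenario 1: deschedule v and give the others a chance,
             * e.g. by queueing v at the tail of the runqueue
             * (Credit1-like) or by skipping v for the next few
             * scheduling decisions (Credit2-like). */
            queue_at_tail(v);
            raise_schedule_softirq(v->processor);
        }
        /* Scenario 2: nothing else is runnable, v just runs again. */
    }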

> In both cases the vcpus is not run until the next slot, so I don't
> think
> it should make the performance worse in multi-vcpus scenarios. But I
> can
> do some tests to double check.
> 
All the above being said, I also don't think it will affect multi-vcpu
VMs' performance much. In fact, even if the yielding vcpu is never
really stopped, the other ones are indeed given a chance to execute if
they want to and are able to.

But it surely would not harm to verify with some tests.

> > The main point of using wfi is for power saving. With this change,
> > you will
> > end up in a busy loop and as you said consume more power.
> 
> That's not true: the vcpu is still descheduled until the next slot.
> There is no busy loop (that would be indeed very bad).
> 
Well, as a matter of fact there may be busy-looping involved... But
isn't that the main point of all this? AFAIR, idle=poll in Linux does
very much the same, and has the same risk of potentially letting tasks
busy loop.

What will never happen is that a yielding vcpu, by busy looping,
prevents other runnable (and non-yielding) vcpus from running. And if
it does, it's a bug. :-)

> > I don't think this is acceptable even to get a better interrupt
> > latency. Some
> > workload will care about interrupt latency and power.
> > 
> > I think a better approach would be to check whether the scheduler
> > has another
> > vCPU to run. If not wait for an interrupt in the trap.
> > 
> > This would save the context switch to the idle vCPU if we are still
> > on the
> > time slice of the vCPU.
> 
> From my limited understanding of how schedulers work, I think this
> cannot work reliably. It is the scheduler that needs to tell the
> arch-specific code to put a pcpu to sleep, not the other way around. 
>
Yes, that is basically true.

Another way to explain it would be by saying that, if there were other
vCPUs to run, we wouldn't have gone idle (and entered the idle loop).

In fact, in work conserving schedulers, if pCPU x becomes idle, it
means there is _nothing_ around that can execute on x itself. And our
schedulers are (with the exception of ARINC653, and if not using caps
in Credit1) work conserving, or at least they want and try to be as
work conserving as possible.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-18  1:47     ` Dario Faggioli
@ 2017-02-19 21:27       ` Julien Grall
  2017-02-20 10:43         ` George Dunlap
  2017-02-20 11:15         ` Dario Faggioli
  0 siblings, 2 replies; 39+ messages in thread
From: Julien Grall @ 2017-02-19 21:27 UTC (permalink / raw)
  To: Dario Faggioli, Stefano Stabellini
  Cc: edgar.iglesias, george.dunlap, nd, Punit Agrawal, xen-devel

Hi Dario,

On 02/18/2017 01:47 AM, Dario Faggioli wrote:
> On Fri, 2017-02-17 at 14:50 -0800, Stefano Stabellini wrote:
>> On Fri, 17 Feb 2017, Julien Grall wrote:
>>> Please explain in which context this will be beneficial. My gut
>>> feeling is it will only make performance worse if multiple vCPUs of
>>> the same guest are running on the same pCPU.
>>
>> I am not a scheduler expert, but I don't think so. Let me explain the
>> difference:
>>
>> - vcpu_block blocks a vcpu until an event occurs, for example until
>> it
>>   receives an interrupt
>>
>> - vcpu_yield stops the vcpu from running until the next scheduler
>> slot
>>
> So, what happens when you yield, depends on how yield is implemented in
> the specific scheduler, and what other vcpus are runnable in the
> system.
>
> Currently, neither Credit1 nor Credit2 (and nor the Linux scheduler,
> AFAICR) really stop the yielding vcpus. Broadly speaking, the following
> two scenarios are possible:
>  - vcpu A yields, and there is one or more runnable but not already
>    running other vcpus. In this case, A is indeed descheduled and put
>    back in a scheduler runqueue in such a way that one or more of the
>    runnable but not running other vcpus have a chance to execute,
>    before the scheduler would consider A again. This may be
>    implemented by putting A on the tail of the runqueue, so all the
>    other vcpus will get a chance to run (this is basically what
>    happens in Credit1, modulo periodic runq sorting). Or it may be
>    implemented by ignoring A for the next <number> scheduling
>    decisions after it yielded (this is basically what happens in
>    Credit2). Both approaches have pros and cons, but the common botton
>    line is that others are given a chance to run.
>
>  - vcpu A yields, and there are no runnable but not running vcpus
>    around. In this case, A gets to run again. Full stop.

Which turns out to be the busy looping I was mentioning when one vCPU is 
assigned to a pCPU. This is not the goal of WFI, and I would be really 
surprised if embedded folks were happy with a solution that uses more 
power.

> And when a vcpu that has yielded is picked up back for execution
> --either immediately or after a few others-- it can run again. And if
> it yields again (and again, and again), we just go back to option 1 or
> 2 above.
>
>> In both cases the vcpus is not run until the next slot, so I don't
>> think
>> it should make the performance worse in multi-vcpus scenarios. But I
>> can
>> do some tests to double check.
>>
> All the above being said, I also don't think it will affect much multi-
> vcpus VM's performance. In fact, even if the yielding vcpu is never
> really stopped, the other ones are indeed given a chance to execute if
> they want and are capable of.
>
> But sure it would not harm verifying with some tests.
>
>>> The main point of using wfi is for power saving. With this change,
>>> you will
>>> end up in a busy loop and as you said consume more power.
>>
>> That's not true: the vcpu is still descheduled until the next slot.
>> There is no busy loop (that would be indeed very bad).
>>
> Well, as a matter of fact there may be busy-looping involved... But
> isn't it the main point of this all. AFAIR, idle=pool in Linux does
> very much the same, and has the same risk of potentially letting tasks
> busy loop.
>
> What will never happen is that a yielding vcpu, by busy looping,
> prevents other runnable (and non yielding) vcpus to run. And if it
> does, it's a bug. :-)

I didn't say it will prevent another vCPU from running. But it will at 
least use slots that could have been used for a good purpose by another 
pCPU.

So for a similar workload Xen will perform worse with vwfi=idle, not 
even mentioning the power consumption...

>
>>> I don't think this is acceptable even to get a better interrupt
>>> latency. Some
>>> workload will care about interrupt latency and power.
>>>
>>> I think a better approach would be to check whether the scheduler
>>> has another
>>> vCPU to run. If not wait for an interrupt in the trap.
>>>
>>> This would save the context switch to the idle vCPU if we are still
>>> on the
>>> time slice of the vCPU.
>>
>> From my limited understanding of how schedulers work, I think this
>> cannot work reliably. It is the scheduler that needs to tell the
>> arch-specific code to put a pcpu to sleep, not the other way around.
>>
> Yes, that is basically true.
>
> Another way to explain it would be by saying that, if there were other
> vCPUs to run, we wouldn't have gone idle (and entered the idle loop).
>
> In fact, in work conserving schedulers, if pCPU x becomes idle, it
> means there is _nothing_ that can execute on x itself around. And our
> schedulers are (with the exception of ARRINC, and if not using caps in
> Credit1) work conserving, or at least they want and try to be an as
> much work conserving as possible.

My knowledge of the scheduler is limited. Does the scheduler take into 
account the cost of a context switch when scheduling? When do you decide 
to run the idle vCPU? Is it only when no other vCPUs are runnable, or do 
you have a heuristic?

Cheers,

-- 
Julien Grall


* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-17 22:50   ` [PATCH] xen/arm: introduce vwfi parameter Stefano Stabellini
  2017-02-18  1:47     ` Dario Faggioli
@ 2017-02-19 21:34     ` Julien Grall
  2017-02-20 11:35       ` Dario Faggioli
  2017-02-20 18:47       ` Stefano Stabellini
  1 sibling, 2 replies; 39+ messages in thread
From: Julien Grall @ 2017-02-19 21:34 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: edgar.iglesias, george.dunlap, dario.faggioli, Punit Agrawal,
	xen-devel, nd

Hi Stefano,

I have CCed another ARM person who has more knowledge than me on 
scheduling/power.

On 02/17/2017 10:50 PM, Stefano Stabellini wrote:
> CC'ing xen-devel, I forgot on the original patch
>
> On Fri, 17 Feb 2017, Julien Grall wrote:
>> Hi Stefano,
>>
>> On 02/16/2017 11:04 PM, Stefano Stabellini wrote:
>>> Introduce new Xen command line parameter called "vwfi", which stands for
>>> virtual wfi. The default is "sleep": on guest wfi, Xen calls vcpu_block
>>> on the guest vcpu. The behavior can be changed setting vwfi to "idle",
>>> in that case Xen calls vcpu_yield.
>>>
>>> The result is strong reduction in irq latency (8050ns -> 3500ns) at the
>>> cost of idle_loop being called less often, leading to higher power
>>> consumption.
>>
>> Please explain in which context this will be beneficial. My gut feeling is
>> it will only make performance worse if multiple vCPUs of the same guest
>> are running on the same pCPU.
>
> I am not a scheduler expert, but I don't think so. Let me explain the
> difference:
>
> - vcpu_block blocks a vcpu until an event occurs, for example until it
>   receives an interrupt
>
> - vcpu_yield stops the vcpu from running until the next scheduler slot
>
> In both cases the vcpus is not run until the next slot, so I don't think
> it should make the performance worse in multi-vcpus scenarios. But I can
> do some tests to double check.

You still haven't explained how you came up with those numbers. My guess 
is 1 vCPU per pCPU, but it is not clear from the commit message.

Looking at your answer, I think it would be important that everyone in 
this thread understands the purpose of WFI and how it differs from WFE.

The two instructions provide a way to tell the processor to go into a 
low-power state. It means the processor can turn off power to some parts 
(e.g. units, pipeline...) to save energy.

The instruction WFE (Wait For Event) will wait until the processor 
receives an event. The definition of event is quite wide; it could be an 
SEV on another processor or an implementation defined mechanism (see 
D1.17.1 in DDI 0487A.k_iss10775). An example of use is when a lock is 
already taken: the processor waiting for the lock would use WFE, and the 
processor releasing the lock would send an event (SEV).

The instruction WFI (Wait for Interrupt) will wait until the processor 
receives an interrupt. An example of use is when a processor has nothing 
to run. It would be normal for the software to put the processor in 
low-power mode until an interrupt comes. The software may not receive an 
interrupt for a while (see the recent example with the RCU bug in Xen, 
where a processor had nothing to do and was staying in low-power mode).

For both instructions it is normal to have a higher latency when 
receiving an interrupt. When software uses them, it knows that there 
will be an impact, but overall it expects some power to be saved. 
Whether the current numbers are acceptable is another question.
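
To give an idea, typical bare-metal usage looks roughly like the 
following (purely illustrative, not code from any particular project):

    /* Purely illustrative bare-metal patterns. */
    void wait_for_flag(volatile int *flag)
    {
        while ( !*flag )
            asm volatile("wfe");    /* low-power wait; the writer executes SEV */
    }

    void cpu_do_idle(void)
    {
        for ( ; ; )
            asm volatile("wfi");    /* low-power wait until an interrupt arrives */
    }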

Now, regarding what you said: let's imagine the scheduler deschedules 
the vCPU until the next slot; it will then run the vCPU again even if no 
interrupt has been received. This is a real waste of power and becomes 
worse if no interrupt comes for multiple slots.

In the multi-vcpu case, the guest using wfi will use more slots than it 
was doing before. This means fewer slots for vCPUs that actually have 
real work to do. So yes, this will have an impact for the same workload 
before and after your patch.

>
>
>>> Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
>>> CC: dario.faggioli@citrix.com
>>> ---
>>>  docs/misc/xen-command-line.markdown | 11 +++++++++++
>>>  xen/arch/arm/traps.c                | 17 +++++++++++++++--
>>>  2 files changed, 26 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/docs/misc/xen-command-line.markdown
>>> b/docs/misc/xen-command-line.markdown
>>> index a11fdf9..5d003e4 100644
>>> --- a/docs/misc/xen-command-line.markdown
>>> +++ b/docs/misc/xen-command-line.markdown
>>> @@ -1632,6 +1632,17 @@ Note that if **watchdog** option is also specified
>>> vpmu will be turned off.
>>>  As the virtualisation is not 100% safe, don't use the vpmu flag on
>>>  production systems (see http://xenbits.xen.org/xsa/advisory-163.html)!
>>>
>>> +### vwfi
>>> +> `= sleep | idle
>>> +
>>> +> Default: `sleep`
>>> +
>>> +WFI is the ARM instruction to "wait for interrupt". This option, which
>>> +is ARM specific, changes the way guest WFI is implemented in Xen. By
>>> +default, Xen blocks the guest vcpu, putting it to sleep. When setting
>>> +vwfi to `idle`, Xen idles the guest vcpu instead, resulting in lower
>>> +interrupt latency, but higher power consumption.
>>
>> The main point of using wfi is for power saving. With this change, you will
>> end up in a busy loop and as you said consume more power.
>
> That's not true: the vcpu is still descheduled until the next slot.
> There is no busy loop (that would be indeed very bad).

As Dario answered in a separate e-mail, this will depend on the 
scheduler. Regardless of that, for me you are still busy looping, 
because you are going from one slot to another until, maybe at some 
point, an interrupt comes. So yes, the power consumption is much worse 
if you have a guest vCPU doing nothing.

>
>
>> I don't think this is acceptable even to get a better interrupt latency. Some
>> workload will care about interrupt latency and power.
>>
>> I think a better approach would be to check whether the scheduler has another
>> vCPU to run. If not wait for an interrupt in the trap.
>>
>> This would save the context switch to the idle vCPU if we are still on the
>> time slice of the vCPU.
>
> From my limited understanding of how schedulers work, I think this
> cannot work reliably. It is the scheduler that needs to tell the
> arch-specific code to put a pcpu to sleep, not the other way around. I
> would appreciate if Dario could confirm this though.

If my understanding is correct, your workload is 1 vCPU per pCPU. So why 
do you need to trap WFI/WFE in this case?

Wouldn't it be easier to let the guest use them directly?
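
For reference, the trap itself is controlled by HCR_EL2.TWI/TWE, so not 
trapping would roughly mean something like the sketch below (hypothetical: 
the per-domain flag is made up and the macro names are from memory):

    /* Hypothetical sketch: let this domain execute WFI/WFE natively. */
    unsigned long hcr = READ_SYSREG(HCR_EL2);

    if ( d->arch.native_wfi )            /* made-up per-domain flag */
        hcr &= ~(HCR_TWI | HCR_TWE);     /* guest WFI/WFE are not trapped */
    else
        hcr |= HCR_TWI | HCR_TWE;        /* trap to Xen, as today */

    WRITE_SYSREG(hcr, HCR_EL2);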

>
>
>> Likely this may not fit everyone, so I would add some knowledge to change the
>> behavior of WFI depending on how many vCPU are scheduled on the current pCPU.
>> But this could be done as a second step.

Cheers,

-- 
Julien Grall


* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-19 21:27       ` Julien Grall
@ 2017-02-20 10:43         ` George Dunlap
  2017-02-20 11:15         ` Dario Faggioli
  1 sibling, 0 replies; 39+ messages in thread
From: George Dunlap @ 2017-02-20 10:43 UTC (permalink / raw)
  To: Julien Grall, Dario Faggioli, Stefano Stabellini
  Cc: edgar.iglesias, george.dunlap, nd, Punit Agrawal, xen-devel

On 19/02/17 21:27, Julien Grall wrote:
> Hi Dario,
> 
> On 02/18/2017 01:47 AM, Dario Faggioli wrote:
>> On Fri, 2017-02-17 at 14:50 -0800, Stefano Stabellini wrote:
>>> On Fri, 17 Feb 2017, Julien Grall wrote:
>>>> Please explain in which context this will be beneficial. My gut
>>>> feeling is it will only make performance worse if multiple vCPUs of
>>>> the same guest are running on the same pCPU.
>>>
>>> I am not a scheduler expert, but I don't think so. Let me explain the
>>> difference:
>>>
>>> - vcpu_block blocks a vcpu until an event occurs, for example until
>>> it
>>>   receives an interrupt
>>>
>>> - vcpu_yield stops the vcpu from running until the next scheduler
>>> slot
>>>
>> So, what happens when you yield, depends on how yield is implemented in
>> the specific scheduler, and what other vcpus are runnable in the
>> system.
>>
>> Currently, neither Credit1 nor Credit2 (and nor the Linux scheduler,
>> AFAICR) really stop the yielding vcpus. Broadly speaking, the following
>> two scenarios are possible:
>>  - vcpu A yields, and there is one or more runnable but not already
>>    running other vcpus. In this case, A is indeed descheduled and put
>>    back in a scheduler runqueue in such a way that one or more of the
>>    runnable but not running other vcpus have a chance to execute,
>>    before the scheduler would consider A again. This may be
>>    implemented by putting A on the tail of the runqueue, so all the
>>    other vcpus will get a chance to run (this is basically what
>>    happens in Credit1, modulo periodic runq sorting). Or it may be
>>    implemented by ignoring A for the next <number> scheduling
>>    decisions after it yielded (this is basically what happens in
>>    Credit2). Both approaches have pros and cons, but the common botton
>>    line is that others are given a chance to run.
>>
>>  - vcpu A yields, and there are no runnable but not running vcpus
>>    around. In this case, A gets to run again. Full stop.
> 
> Which turn to be the busy looping I was mentioning when one vCPU is
> assigned to a pCPU. This is not the goal of WFI and I would be really
> surprised that embedded folks will be happy with a solution using more
> power.

Yes, I'm afraid that if there are no other vcpus to run on the pcpu, it
would indeed be busy-waiting.  So 'idle' would be a misleading name for
that option; it should be 'poll'.

Yield is only advisory -- it says, "I'm doing something somewhat
low-priority; if there's something more useful you could be doing, go
ahead and do it."  There is no promise about what, if anything, a yield
call will do.  For a long time credit1 didn't actually do anything on a
yield -- it simply went through the scheduler, which normally ended up
choosing the exact same vcpu again (even when there were other things to
do).  Doing something sensible with that information is not actually as
easy as one might think.

So we should avoid designing any mechanisms that rely on a particular
implementation of yield().

 -George



* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-19 21:27       ` Julien Grall
  2017-02-20 10:43         ` George Dunlap
@ 2017-02-20 11:15         ` Dario Faggioli
  1 sibling, 0 replies; 39+ messages in thread
From: Dario Faggioli @ 2017-02-20 11:15 UTC (permalink / raw)
  To: Julien Grall, Stefano Stabellini
  Cc: edgar.iglesias, george.dunlap, nd, Punit Agrawal, xen-devel



On Sun, 2017-02-19 at 21:27 +0000, Julien Grall wrote:
> Hi Dario,
> 
Hi,

> On 02/18/2017 01:47 AM, Dario Faggioli wrote:
> >  - vcpu A yields, and there are no runnable but not running vcpus
> >    around. In this case, A gets to run again. Full stop.
> 
> Which turn to be the busy looping I was mentioning when one vCPU is 
> assigned to a pCPU. 
>
Absolutely. Actually, it would mean busy looping, no matter whether
vCPUs are assigned or not.

As I said already, it's not exactly identical, but it would have a very
similar behavior to Linux's idle=poll option:

http://tomoyo.osdn.jp/cgi-bin/lxr/source/Documentation/kernel-parameters.txt?v=linux-4.9.9#L1576
1576         idle=           [X86]
1577                         Format: idle=poll, idle=halt, idle=nomwait
1578                         Poll forces a polling idle loop that can slightly
1579                         improve the performance of waking up a idle CPU, but
1580                         will use a lot of power and make the system run hot.
1581                         Not recommended.

And as I've also said, I don't see it as a solution to wakeup latency
problems, nor one that I'd recommend using outside of testing and
debugging. It may perhaps be a useful testing and debugging aid, though.

> This is not the goal of WFI and I would be really 
> surprised that embedded folks will be happy with a solution using
> more 
> power.
> 
Me neither. It's a showstopper for anything that's battery powered or
may incur thermal/cooling issues.

So, just to be clear, I'm happy to help and assist in understanding the
scheduling background and implications, but I am equally happy to leave
to you the decision of whether or not this is something nice or
desirable to have (as an option) on ARM. :-)

I've never been a fan of it, and never used it, on Linux on x86, not
even when actually working on real-time and low-latency stuff. That
being said, I also personally think that having the option would do no
harm, but I understand the concern that, when an option is there, people
will try to use it in the weirdest ways, and then complain at your
'door' if their CPU goes on fire! :-O

> > What will never happen is that a yielding vcpu, by busy looping,
> > prevents other runnable (and non yielding) vcpus to run. And if it
> > does, it's a bug. :-)
> 
> I didn't say it will prevent another vCPU to run. But it will at
> least 
> use slot that could have been used for good purpose by another pCPU.
> 
Not really. Maybe I wasn't clear in explaining yielding, or maybe I'm
not getting what you're trying to say.

It indeed does depend a little bit on the implementation of yield, but
busy looping on yield() won't (or at least must not) be much different,
as far as the other vCPUs are concerned, from the pCPU sleeping in a
deep C-state (or the ARM equivalent). Performance aside, of course.

> So in similar workload Xen will perform worst with vwfi=idle, not
> even 
> mentioning the power consumption...
> 
It'd probably be a little bit more inefficient, even performance wise,
if, e.g., scheduler-specific yielding code acquires locks, or because
there is one more vCPU in the runqueues to be dealt with, but nothing
more than that. And whether or not this would be significant or
noticeable, I don't know (it should be measured, if interesting).

> > In fact, in work conserving schedulers, if pCPU x becomes idle, it
> > means there is _nothing_ that can execute on x itself around. And
> > our
> > schedulers are (with the exception of ARRINC, and if not using caps
> > in
> > Credit1) work conserving, or at least they want and try to be an as
> > much work conserving as possible.
> 
> My knowledge of the scheduler is limited. Does the scheduler take
> into 
> account the cost of context switch when scheduling? When do you
> decide 
> when to run the idle vCPU? Is it only the no other vCPU are runnable
> or 
> do you have an heuristic?
> 
Again, I'm not sure I understand. Context switches between running vCPUs
must happen where the scheduling algorithm decides they must happen.
You can try to design an algorithm that requires not too many context
switches, or introduce countermeasures (we have something like that),
but apart from these, I don't know what (else?) you may be referring to
when asking about "taking into account the cost of context switch".

We do try to take into account the cost of migration, i.e., moving a
vCPU from one pCPU to another... but that's an entirely different thing.

About the idle vCPU... I think the answer to your question is yes.
Credit and Credit2 are work conserving schedulers, so they only let a
pCPU go idle if there is no one wanting to run in the system (well, in
Credit2, this may not be 100% true until the load balancer gets to
execute, but in practice it happens very rarely and very infrequently).

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-19 21:34     ` Julien Grall
@ 2017-02-20 11:35       ` Dario Faggioli
  2017-02-20 18:43         ` Stefano Stabellini
  2017-02-20 18:47       ` Stefano Stabellini
  1 sibling, 1 reply; 39+ messages in thread
From: Dario Faggioli @ 2017-02-20 11:35 UTC (permalink / raw)
  To: Julien Grall, Stefano Stabellini
  Cc: edgar.iglesias, george.dunlap, nd, Punit Agrawal, xen-devel



On Sun, 2017-02-19 at 21:34 +0000, Julien Grall wrote:
> Hi Stefano,
> 
> I have CCed another ARM person who has more knowledge than me on 
> scheduling/power.
>
Ah, when I saw this, I thought you were Cc-ing my friend Juri, who
also works there, and does that stuff. :-)

> > In both cases the vcpus is not run until the next slot, so I don't
> > think
> > it should make the performance worse in multi-vcpus scenarios. But
> > I can
> > do some tests to double check.
> 
> Looking at your answer, I think it would be important that everyone
> in 
> this thread understand the purpose of WFI and how it differs with
> WFE.
> 
> The two instructions provides a way to tell the processor to go in 
> low-power state. It means the processor can turn off power on some
> parts 
> (e.g unit, pipeline...) to save energy.
> 
[snip]
>
> For both instruction it is normal to have an higher latency when 
> receiving an interrupt. When a software is using them, it knows that 
> there will have an impact, but overall it will expect some power to
> be 
> saved. Whether the current numbers are acceptable is another
> question.
> 
Ok, thanks for this useful information. I think I understand the idea
behind these two instructions/mechanisms now.

What (I think) Stefano is proposing is providing the user (of Xen on
ARM) with a way of making them behave differently.

Whether good or bad, I've expressed my thoughts, and it's your call in
the end. :-)
George also has a fair point, though. Using yield is a quick and *most
likely* effective way of achieving Linux's "idle=poll", but at the same
time a rather risky one, as it basically means the final behavior would
rely on how yield() behaves in the specific scheduler the user is
using, which may vary.

> Now, regarding what you said. Let's imagine the scheduler is 
> descheduling the vCPU until the next slot, it will run the vCPU
> after 
> even if no interrupt has been received. 
>
There really are no slots. There sort of are in Credit1, but preemption
can happen inside a "slot", so I wouldn't call them that there either.

> This is a real waste of power 
> and become worst if an interrupt is not coming for multiple slot.
> 
Undeniable. :-)

> In the case of multi-vcpu, the guest using wfi will use more slot
> than 
> it was doing before. This means less slot for vCPUs that actually
> have 
> real work to do. 
>
No, because it continuously yields. So, yes indeed there will be higher
scheduling overhead, but no stealing of otherwise useful computation
time. Not with the yield() implementations we have right now in the
code.

But I'm starting to think that we'd probably better take a step back
from deep inside the scheduler, and think, first, whether or not
having something similar to Linux's idle=poll is something we want, if
only for testing, debugging, or very specific use cases.

And only then, if the answer is yes, decide how to actually implement
it, whether or not to use yield, etc.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-20 11:35       ` Dario Faggioli
@ 2017-02-20 18:43         ` Stefano Stabellini
  2017-02-20 18:45           ` George Dunlap
  0 siblings, 1 reply; 39+ messages in thread
From: Stefano Stabellini @ 2017-02-20 18:43 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: edgar.iglesias, Stefano Stabellini, george.dunlap, Punit Agrawal,
	Julien Grall, xen-devel, nd


On Mon, 20 Feb 2017, Dario Faggioli wrote:
> On Sun, 2017-02-19 at 21:34 +0000, Julien Grall wrote:
> > Hi Stefano,
> > 
> > I have CCed another ARM person who has more knowledge than me on 
> > scheduling/power.
> >
> Ah, when I saw this, I thought you were Cc-ing my friend Juri, which
> also works there, and is doing that stuff. :-)
> 
> > > In both cases the vcpus is not run until the next slot, so I don't
> > > think
> > > it should make the performance worse in multi-vcpus scenarios. But
> > > I can
> > > do some tests to double check.
> > 
> > Looking at your answer, I think it would be important that everyone
> > in 
> > this thread understand the purpose of WFI and how it differs with
> > WFE.
> > 
> > The two instructions provides a way to tell the processor to go in 
> > low-power state. It means the processor can turn off power on some
> > parts 
> > (e.g unit, pipeline...) to save energy.
> > 
> [snip]
> >
> > For both instruction it is normal to have an higher latency when 
> > receiving an interrupt. When a software is using them, it knows that 
> > there will have an impact, but overall it will expect some power to
> > be 
> > saved. Whether the current numbers are acceptable is another
> > question.
> > 
> Ok, thanks for these useful information. I think I understand the idea
> behind these two instructions/mechanisms now.
> 
> What (I think) Stefano is proposing is providing the user (of Xen on
> ARM) with a way of making them behave differently.

That's right. It's not always feasible to change the code of the guest
the user is running. Maybe she cannot, or maybe she doesn't want to for
other reasons. Keep in mind that the developer of the operating system
in this example might have had very different expectations of irq
latency, given that, even with wfi, it is much lower on native.

When irq latency is way more important than power consumption to the
user (think of a train, or an industrial machine that needs to move
something in a given amount of time), this option provides value to her
at very little maintenance cost on our side.

Of course, even if we introduce this option, by no means should we stop
improving the irq latency in the normal cases.


> Whether good or bad, I've expressed my thoughts, and it's your call in
> the end. :-)
> George also has a fair point, though. Using yield is a quick and *most
> likely* effective way of achieving Linux's "idle=poll", but at the same
> time, a rather rather risky one, as it basically means the final
> behavior would relay on how yield() behave on the specific scheduler
> the user is using, which may vary.
> 
> > Now, regarding what you said. Let's imagine the scheduler is 
> > descheduling the vCPU until the next slot, it will run the vCPU
> > after 
> > even if no interrupt has been received. 
> >
> There really are no slots. There sort of are in Credit1, but preemption
> can happen inside a "slot", so I wouldn't call them such in there too.
> 
> > This is a real waste of power 
> > and become worst if an interrupt is not coming for multiple slot.
> > 
> Undeniable. :-)

Of course. But if your app needs less than 3000ns of latency, then it's
the only choice.


> > In the case of multi-vcpu, the guest using wfi will use more slot
> > than 
> > it was doing before. This means less slot for vCPUs that actually
> > have 
> > real work to do. 
> >
> No, because it continuously yields. So, yes indeed there will be higher
> scheduling overhead, but no stealing of otherwise useful computation
> time. Not with the yield() implementations we have right now in the
> code.
> 
> But I'm starting to think that we probably better make a step back from
> withing deep inside the scheduler, and think, first, whether or not
> having something similar to Linux's idle=poll is something we want, if
> only for testing, debugging, or very specific use cases.
> 
> And only then, if the answer is yes, decide how to actually implement
> it, whether or not to use yield, etc.

I think we want it, if the implementation is small and unintrusive.


* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-20 18:43         ` Stefano Stabellini
@ 2017-02-20 18:45           ` George Dunlap
  2017-02-20 18:49             ` Stefano Stabellini
  0 siblings, 1 reply; 39+ messages in thread
From: George Dunlap @ 2017-02-20 18:45 UTC (permalink / raw)
  To: Stefano Stabellini, Dario Faggioli
  Cc: edgar.iglesias, george.dunlap, Punit Agrawal, Julien Grall,
	xen-devel, nd

On 20/02/17 18:43, Stefano Stabellini wrote:
> On Mon, 20 Feb 2017, Dario Faggioli wrote:
>> On Sun, 2017-02-19 at 21:34 +0000, Julien Grall wrote:
>>> Hi Stefano,
>>>
>>> I have CCed another ARM person who has more knowledge than me on 
>>> scheduling/power.
>>>
>> Ah, when I saw this, I thought you were Cc-ing my friend Juri, which
>> also works there, and is doing that stuff. :-)
>>
>>>> In both cases the vcpus is not run until the next slot, so I don't
>>>> think
>>>> it should make the performance worse in multi-vcpus scenarios. But
>>>> I can
>>>> do some tests to double check.
>>>
>>> Looking at your answer, I think it would be important that everyone
>>> in 
>>> this thread understand the purpose of WFI and how it differs with
>>> WFE.
>>>
>>> The two instructions provides a way to tell the processor to go in 
>>> low-power state. It means the processor can turn off power on some
>>> parts 
>>> (e.g unit, pipeline...) to save energy.
>>>
>> [snip]
>>>
>>> For both instruction it is normal to have an higher latency when 
>>> receiving an interrupt. When a software is using them, it knows that 
>>> there will have an impact, but overall it will expect some power to
>>> be 
>>> saved. Whether the current numbers are acceptable is another
>>> question.
>>>
>> Ok, thanks for these useful information. I think I understand the idea
>> behind these two instructions/mechanisms now.
>>
>> What (I think) Stefano is proposing is providing the user (of Xen on
>> ARM) with a way of making them behave differently.
> 
> That's right. It's not always feasible to change the code of the guest
> the user is running. Maybe she cannot, or maybe she doesn't want to for
> other reasons. Keep in mind that the developer of the operating system
> in this example might have had very different expectations of irq
> latency, given that, even with wfi, is much lower on native. 
> 
> When irq latency is way more important than power consumption to the
> user (think of a train, or an industrial machine that needs to move
> something in a given amount of time), this option provides value to her
> at very little maintenance cost on our side.
> 
> Of course, even if we introduce this option, by no mean we should stop
> improving the irq latency in the normal cases.
> 
> 
>> Whether good or bad, I've expressed my thoughts, and it's your call in
>> the end. :-)
>> George also has a fair point, though. Using yield is a quick and *most
>> likely* effective way of achieving Linux's "idle=poll", but at the same
>> time, a rather rather risky one, as it basically means the final
>> behavior would relay on how yield() behave on the specific scheduler
>> the user is using, which may vary.
>>
>>> Now, regarding what you said. Let's imagine the scheduler is 
>>> descheduling the vCPU until the next slot, it will run the vCPU
>>> after 
>>> even if no interrupt has been received. 
>>>
>> There really are no slots. There sort of are in Credit1, but preemption
>> can happen inside a "slot", so I wouldn't call them such in there too.
>>
>>> This is a real waste of power 
>>> and become worst if an interrupt is not coming for multiple slot.
>>>
>> Undeniable. :-)
> 
> Of course. But if your app needs less than 3000ns of latency, then it's
> the only choice.
> 
> 
>>> In the case of multi-vcpu, the guest using wfi will use more slot
>>> than 
>>> it was doing before. This means less slot for vCPUs that actually
>>> have 
>>> real work to do. 
>>>
>> No, because it continuously yields. So, yes indeed there will be higher
>> scheduling overhead, but no stealing of otherwise useful computation
>> time. Not with the yield() implementations we have right now in the
>> code.
>>
>> But I'm starting to think that we probably better make a step back from
>> withing deep inside the scheduler, and think, first, whether or not
>> having something similar to Linux's idle=poll is something we want, if
>> only for testing, debugging, or very specific use cases.
>>
>> And only then, if the answer is yes, decide how to actually implement
>> it, whether or not to use yield, etc.
> 
> I think we want it, if the implementation is small and unintrusive.

But surely we want it to be per-domain, not system-wide?

 -George



* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-19 21:34     ` Julien Grall
  2017-02-20 11:35       ` Dario Faggioli
@ 2017-02-20 18:47       ` Stefano Stabellini
  2017-02-20 18:53         ` Julien Grall
  1 sibling, 1 reply; 39+ messages in thread
From: Stefano Stabellini @ 2017-02-20 18:47 UTC (permalink / raw)
  To: Julien Grall
  Cc: edgar.iglesias, Stefano Stabellini, george.dunlap,
	dario.faggioli, Punit Agrawal, xen-devel, nd

On Sun, 19 Feb 2017, Julien Grall wrote:
> Hi Stefano,
> 
> I have CCed another ARM person who has more knowledge than me on
> scheduling/power.
> 
> On 02/17/2017 10:50 PM, Stefano Stabellini wrote:
> > CC'ing xen-devel, I forgot on the original patch
> > 
> > On Fri, 17 Feb 2017, Julien Grall wrote:
> > > Hi Stefano,
> > > 
> > > On 02/16/2017 11:04 PM, Stefano Stabellini wrote:
> > > > Introduce new Xen command line parameter called "vwfi", which stands for
> > > > virtual wfi. The default is "sleep": on guest wfi, Xen calls vcpu_block
> > > > on the guest vcpu. The behavior can be changed setting vwfi to "idle",
> > > > in that case Xen calls vcpu_yield.
> > > > 
> > > > The result is strong reduction in irq latency (8050ns -> 3500ns) at the
> > > > cost of idle_loop being called less often, leading to higher power
> > > > consumption.
> > > 
> > > Please explain in which context this will be beneficial. My gut feeling is
> > > it will only make performance worse if multiple vCPUs of the same guest
> > > are running on the same pCPU.
> > 
> > I am not a scheduler expert, but I don't think so. Let me explain the
> > difference:
> > 
> > - vcpu_block blocks a vcpu until an event occurs, for example until it
> >   receives an interrupt
> > 
> > - vcpu_yield stops the vcpu from running until the next scheduler slot
> > 
> > In both cases the vcpus is not run until the next slot, so I don't think
> > it should make the performance worse in multi-vcpus scenarios. But I can
> > do some tests to double check.
> 
> You still haven't explained how you came up with those number? My guess is 1
> vCPU per pCPU but it is not clear from the commit message.

It's the same setup I used for
alpine.DEB.2.10.1702091603240.20549@sstabellini-ThinkPad-X260; I'll add
more info to the commit message.


> > > I don't think this is acceptable even to get a better interrupt latency.
> > > Some
> > > workload will care about interrupt latency and power.
> > > 
> > > I think a better approach would be to check whether the scheduler has
> > > another
> > > vCPU to run. If not wait for an interrupt in the trap.
> > > 
> > > This would save the context switch to the idle vCPU if we are still on the
> > > time slice of the vCPU.
> > 
> > From my limited understanding of how schedulers work, I think this
> > cannot work reliably. It is the scheduler that needs to tell the
> > arch-specific code to put a pcpu to sleep, not the other way around. I
> > would appreciate if Dario could confirm this though.
> 
> If my understanding is correct, your workload is 1 vCPU per pCPU. So why do
> you need to trap WFI/WFE in this case?
> 
> Would not it be easier to let the guest using them directly?

This is a good question, which I have already answered: I think it would
break the scheduler. Dario confirmed it in his reply
(1487382463.6732.146.camel@citrix.com).


* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-20 18:45           ` George Dunlap
@ 2017-02-20 18:49             ` Stefano Stabellini
  0 siblings, 0 replies; 39+ messages in thread
From: Stefano Stabellini @ 2017-02-20 18:49 UTC (permalink / raw)
  To: George Dunlap
  Cc: edgar.iglesias, Stefano Stabellini, george.dunlap,
	Dario Faggioli, Punit Agrawal, Julien Grall, xen-devel, nd

On Mon, 20 Feb 2017, George Dunlap wrote:
> On 20/02/17 18:43, Stefano Stabellini wrote:
> > On Mon, 20 Feb 2017, Dario Faggioli wrote:
> >> On Sun, 2017-02-19 at 21:34 +0000, Julien Grall wrote:
> >>> Hi Stefano,
> >>>
> >>> I have CCed another ARM person who has more knowledge than me on 
> >>> scheduling/power.
> >>>
> >> Ah, when I saw this, I thought you were Cc-ing my friend Juri, which
> >> also works there, and is doing that stuff. :-)
> >>
> >>>> In both cases the vcpus is not run until the next slot, so I don't
> >>>> think
> >>>> it should make the performance worse in multi-vcpus scenarios. But
> >>>> I can
> >>>> do some tests to double check.
> >>>
> >>> Looking at your answer, I think it would be important that everyone
> >>> in 
> >>> this thread understand the purpose of WFI and how it differs with
> >>> WFE.
> >>>
> >>> The two instructions provides a way to tell the processor to go in 
> >>> low-power state. It means the processor can turn off power on some
> >>> parts 
> >>> (e.g unit, pipeline...) to save energy.
> >>>
> >> [snip]
> >>>
> >>> For both instruction it is normal to have an higher latency when 
> >>> receiving an interrupt. When a software is using them, it knows that 
> >>> there will have an impact, but overall it will expect some power to
> >>> be 
> >>> saved. Whether the current numbers are acceptable is another
> >>> question.
> >>>
> >> Ok, thanks for these useful information. I think I understand the idea
> >> behind these two instructions/mechanisms now.
> >>
> >> What (I think) Stefano is proposing is providing the user (of Xen on
> >> ARM) with a way of making them behave differently.
> > 
> > That's right. It's not always feasible to change the code of the guest
> > the user is running. Maybe she cannot, or maybe she doesn't want to for
> > other reasons. Keep in mind that the developer of the operating system
> > in this example might have had very different expectations of irq
> > latency, given that, even with wfi, is much lower on native. 
> > 
> > When irq latency is way more important than power consumption to the
> > user (think of a train, or an industrial machine that needs to move
> > something in a given amount of time), this option provides value to her
> > at very little maintenance cost on our side.
> > 
> > Of course, even if we introduce this option, by no mean we should stop
> > improving the irq latency in the normal cases.
> > 
> > 
> >> Whether good or bad, I've expressed my thoughts, and it's your call in
> >> the end. :-)
> >> George also has a fair point, though. Using yield is a quick and *most
> >> likely* effective way of achieving Linux's "idle=poll", but at the same
> >> time, a rather rather risky one, as it basically means the final
> >> behavior would relay on how yield() behave on the specific scheduler
> >> the user is using, which may vary.
> >>
> >>> Now, regarding what you said. Let's imagine the scheduler is 
> >>> descheduling the vCPU until the next slot, it will run the vCPU
> >>> after 
> >>> even if no interrupt has been received. 
> >>>
> >> There really are no slots. There sort of are in Credit1, but preemption
> >> can happen inside a "slot", so I wouldn't call them such in there too.
> >>
> >>> This is a real waste of power 
> >>> and become worst if an interrupt is not coming for multiple slot.
> >>>
> >> Undeniable. :-)
> > 
> > Of course. But if your app needs less than 3000ns of latency, then it's
> > the only choice.
> > 
> > 
> >>> In the case of multi-vcpu, the guest using wfi will use more slot
> >>> than 
> >>> it was doing before. This means less slot for vCPUs that actually
> >>> have 
> >>> real work to do. 
> >>>
> >> No, because it continuously yields. So, yes indeed there will be higher
> >> scheduling overhead, but no stealing of otherwise useful computation
> >> time. Not with the yield() implementations we have right now in the
> >> code.
> >>
> >> But I'm starting to think that we probably better make a step back from
> >> withing deep inside the scheduler, and think, first, whether or not
> >> having something similar to Linux's idle=poll is something we want, if
> >> only for testing, debugging, or very specific use cases.
> >>
> >> And only then, if the answer is yes, decide how to actually implement
> >> it, whether or not to use yield, etc.
> > 
> > I think we want it, if the implementation is small and unintrusive.
> 
> But surely we want it to be per-domain, not system-wide?

Yes, per-domain would be ideal, but I thought that system-wide would be
good enough. I admit I got lazy :-)

I can plumb it through libxl/xl if that's the consensus.


* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-20 18:47       ` Stefano Stabellini
@ 2017-02-20 18:53         ` Julien Grall
  2017-02-20 19:20           ` Dario Faggioli
  0 siblings, 1 reply; 39+ messages in thread
From: Julien Grall @ 2017-02-20 18:53 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: edgar.iglesias, george.dunlap, dario.faggioli, Punit Agrawal,
	xen-devel, nd

Hi Stefano,

On 20/02/17 18:47, Stefano Stabellini wrote:
> On Sun, 19 Feb 2017, Julien Grall wrote:
>>>> I don't think this is acceptable even to get a better interrupt latency.
>>>> Some
>>>> workload will care about interrupt latency and power.
>>>>
>>>> I think a better approach would be to check whether the scheduler has
>>>> another
>>>> vCPU to run. If not wait for an interrupt in the trap.
>>>>
>>>> This would save the context switch to the idle vCPU if we are still on the
>>>> time slice of the vCPU.
>>>
>>> From my limited understanding of how schedulers work, I think this
>>> cannot work reliably. It is the scheduler that needs to tell the
>>> arch-specific code to put a pcpu to sleep, not the other way around. I
>>> would appreciate if Dario could confirm this though.
>>
>> If my understanding is correct, your workload is 1 vCPU per pCPU. So why do
>> you need to trap WFI/WFE in this case?
>>
>> Would not it be easier to let the guest using them directly?
>
> This is a good question, I have already answered: I think it would break
> the scheduler. Dario confirmed it in his reply
> (1487382463.6732.146.camel@citrix.com).

I don't think it will break the scheduler; we are trapping WFI/WFE to 
take advantage of the quiescence of the guest. If we don't do that, the 
guest will just waste its slot.

Even if the WFI is done by the guest, you will receive the scheduler 
interrupt in Xen and that's fine. Did I miss anything?

Cheers,

-- 
Julien Grall


* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-20 18:53         ` Julien Grall
@ 2017-02-20 19:20           ` Dario Faggioli
  2017-02-20 19:38             ` Julien Grall
  0 siblings, 1 reply; 39+ messages in thread
From: Dario Faggioli @ 2017-02-20 19:20 UTC (permalink / raw)
  To: Julien Grall, Stefano Stabellini
  Cc: edgar.iglesias, george.dunlap, nd, Punit Agrawal, xen-devel


On Mon, 2017-02-20 at 18:53 +0000, Julien Grall wrote:
> On 20/02/17 18:47, Stefano Stabellini wrote:
> > This is a good question, I have already answered: I think it would
> > break
> > the scheduler. Dario confirmed it in his reply
> > (1487382463.6732.146.camel@citrix.com).
> 
> I don't think it will break the scheduler, we are trapping WFI/WFE
> to 
> take advantage of the quiescence of the guest. If we don't do that,
> the 
> guest will just waste his slot.
> 
Err... Wait... WF* basically puts a _pCPU_ to sleep until something
happens (interrupt or event). This is not something we let a guest
_vCPU_ do!

E.g., if vCPU x of domain A wants to go idle with a WFI/WFE, but the
host is overbooked and currently really busy, Xen wants to run some
other vCPU (of either the same or another domain).

That's actually the whole point of virtualization, and the reason why
overbooking a host with more vCPUs (from multiple guests) than it has
pCPUs works at all. If we start letting guests put the host's pCPUs to
sleep, not only the scheduler, but many things would break, IMO!

> Even if the WFI is done by the guest, you will receive the scheduler 
> interrupt in Xen and that's fine. Did I miss anything?
> 
Maybe it's me that am missing what you actually mean here.

What do you mean by "you will receive the scheduler interrupt"? Right
now, HLT in the guest means block in Xen, which makes the scheduler run
and decide whether the pCPU should stay idle, or run someone else. I'd
say that no good can come from having HLT in the guest automatically
mean HLT on the pCPU as well.

So, I'm not sure what we're talking about, but what I'm quite sure of is
that we don't want a guest to be able to decide when, and until what
time/event, a pCPU goes idle.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-20 19:20           ` Dario Faggioli
@ 2017-02-20 19:38             ` Julien Grall
  2017-02-20 22:53               ` Dario Faggioli
  0 siblings, 1 reply; 39+ messages in thread
From: Julien Grall @ 2017-02-20 19:38 UTC (permalink / raw)
  To: Dario Faggioli, Stefano Stabellini
  Cc: edgar.iglesias, george.dunlap, nd, Punit Agrawal, xen-devel

Hi Dario,

On 20/02/17 19:20, Dario Faggioli wrote:
> On Mon, 2017-02-20 at 18:53 +0000, Julien Grall wrote:
>> On 20/02/17 18:47, Stefano Stabellini wrote:
>>> This is a good question, I have already answered: I think it would
>>> break
>>> the scheduler. Dario confirmed it in his reply
>>> (1487382463.6732.146.camel@citrix.com).
>>
>> I don't think it will break the scheduler, we are trapping WFI/WFE
>> to
>> take advantage of the quiescence of the guest. If we don't do that,
>> the
>> guest will just waste his slot.
>>
> Err... Wait... WF* basically puts a _pCPU_ to sleep until something
> happens (interrupt or event). This is not something we let a guest
> _vCPU_ do!
>
> E.g., if vCPU x of domain A wants to go idle with a WFI/WFE, but the
> host is overbooked and currently really busy, Xen wants to run some
> other vCPU (of either the same of another domain).
>
> That's actually the whole point of virtualization, and the reason why
> overbooking an host with more vCPUs (from multiple guests) than it has
> pCPUs works at all. If we start letting guests put the host's pCPUs to
> sleep, not only the scheduler, but many things would break, IMO!

I am not speaking about the general case but about when you get 1 vCPU 
pinned to 1 pCPU (I think this is Stefano's use case). No other vCPU will 
run on this pCPU, so it would be fine to let the guest do the WFI.

If you run multiple vCPUs on the same pCPU you would have a bigger 
interrupt latency. And blocking the vCPU or yielding will likely give the 
same numbers, unless you know the interrupt will come right now. But in 
that case, using WFI in the guest may not have been the right thing to do.

I have heard of use cases where people want to disable the scheduler 
(e.g. a nop scheduler) because they know only 1 vCPU will ever run on the 
pCPU. This is exactly the use case I am thinking about.

>
>> Even if the WFI is done by the guest, you will receive the scheduler
>> interrupt in Xen and that's fine. Did I miss anything?
>>
> Maybe it's me that am missing what you actually mean here.
>
> What do you mean with "you will receive the scheduler interrupt"? Right
> now, HLT in guest means block in Xen, which makes the scheduler run and
> decide whether the pCPU should stay idle, or run someone else. I'd say
> that no good can come from having HTL in guest automatically meaning
> also on the pCPU.
>
> So, I'm not sure what we're talking about, but what I'm quite sure is
> that we don't want a guest to be able to decide when and until what
> time/event, a pCPU goes idle.

Well, if the guest is not using WFI/WFE at all you would need an 
interrupt from the scheduler to get it rescheduled. So here it is similar: 
the scheduler would have set up a timer and the processor will wake up 
when receiving the timer interrupt and enter the hypervisor.

So, yes, in the end the guest will waste its slot. But this is what would 
have happened anyway if the guest were not using any of the WFI/WFE 
instructions.

Cheers,

-- 
Julien Grall


* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-20 19:38             ` Julien Grall
@ 2017-02-20 22:53               ` Dario Faggioli
  2017-02-21  0:38                 ` Stefano Stabellini
  2017-02-21  7:59                 ` Julien Grall
  0 siblings, 2 replies; 39+ messages in thread
From: Dario Faggioli @ 2017-02-20 22:53 UTC (permalink / raw)
  To: Julien Grall, Stefano Stabellini
  Cc: edgar.iglesias, george.dunlap, nd, Punit Agrawal, xen-devel


On Mon, 2017-02-20 at 19:38 +0000, Julien Grall wrote:
> On 20/02/17 19:20, Dario Faggioli wrote:
> > E.g., if vCPU x of domain A wants to go idle with a WFI/WFE, but
> > the
> > host is overbooked and currently really busy, Xen wants to run some
> > other vCPU (of either the same of another domain).
> > 
> > That's actually the whole point of virtualization, and the reason
> > why
> > overbooking an host with more vCPUs (from multiple guests) than it
> > has
> > pCPUs works at all. If we start letting guests put the host's pCPUs
> > to
> > sleep, not only the scheduler, but many things would break, IMO!
> 
> I am not speaking about general case but when you get 1 vCPU pinned
> to 1 
> pCPU (I think this is Stefano use case). No other vCPU will run on
> this 
> pCPU. So it would be fine to let the guest do the WFI.
> 
Mmm... ok, yes, in that case, it may make sense and work, from a, let's
say, purely functional perspective. But still I struggle to place this
in a bigger picture.

For instance, as you say, executing a WFI from a guest directly on
hardware only makes sense if we have 1:1 static pinning. Which means
it can't just be done by default, or with a boot parameter, because we
need to check and enforce that there's only 1:1 pinning around.

Is it possible to decide whether to trap and emulate WFI, or just
execute it, online, and change such a decision dynamically? And even if
yes, how would the whole thing work? When direct execution is
enabled for a domain, do we automatically enforce 1:1 pinning for that
domain, and kick all the other domains out of its pCPUs? What if they
have their own pinning, what if they also have 'direct WFI' behavior
enabled?

If it is not possible to change all this online and on a per-domain
basis, what do we do? When booted with the 'direct WFI' flag, do we only
accept 1:1 pinning? Who should enforce that, the setvcpuaffinity
hypercall?

These are just examples, my point being that in theory, if we consider
a very specific use case or set of use cases, there's a lot we can do. But
when you say "why don't you let the guest directly execute WFI", in
response to a patch and a discussion like this, people may think that
you are actually proposing doing it as a solution, which is not
possible without figuring out all the open questions above (actually,
probably, more) and without introducing a lot of cross-subsystem
policing inside Xen, which is often something we don't want.

But, if you let me say this again, it looks to me like we are trying to
solve too many problems all at once in this thread; should we try
slowing down/refocusing? :-)

> If you run multiple vCPU in the same pCPU you would have a bigger 
> interrupt latency. And blocked the vCPU or yield will likely have
> the 
> same number unless you know the interrupt will come right now. 
>
Maybe. At least on x86, that would depend on the actual load. If all
your pCPUs are more than 100% loaded, yes. If the load is less than
that, you may still see improvements.

> But in 
> that case, using WFI in the guest may not have been the right things
> to do.
> 
But if the guest is, let's say, Linux, does it use WFI or not? And is
it the right thing or not?

Again, the fact you're saying this probably means there's something I
am either missing or ignoring about ARM.

> I have heard use case where people wants to disable the scheduler
> (e.g a 
> nop scheduler) because they know only 1 vCPU will ever run on the
> pCPU. 
> This is exactly the use case I am thinking about.
> 
Sure! Except that, in Xen, we don't know whether we have, and always
will, only 1 vCPU ever running on each pCPU. Nor do we have a way to
enforce that, either in the toolstack or in the hypervisor. :-P

> > So, I'm not sure what we're talking about, but what I'm quite sure
> > is
> > that we don't want a guest to be able to decide when and until what
> > time/event, a pCPU goes idle.
> 
> Well, if the guest is not using the WFI/WFE at all you would need an 
> interrupt from the scheduler to get it running. 
>
If the guest is not using WFI, it's busy looping, isn't it?

> So here it is similar, 
> the scheduler would have setup a timer and the processor will awake
> when 
> receiving the timer interrupt to enter in the hypervisor.
> 
> So, yes in fine the guest will waste its slot. 
>
Did I say it already that this concept of "slots" does not apply here?
:-D

> Cheers,
> 
Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-20 22:53               ` Dario Faggioli
@ 2017-02-21  0:38                 ` Stefano Stabellini
  2017-02-21  8:10                   ` Julien Grall
  2017-02-21  7:59                 ` Julien Grall
  1 sibling, 1 reply; 39+ messages in thread
From: Stefano Stabellini @ 2017-02-21  0:38 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: edgar.iglesias, Stefano Stabellini, george.dunlap, Punit Agrawal,
	Julien Grall, xen-devel, nd

On Mon, 20 Feb 2017, Dario Faggioli wrote:
> On Mon, 2017-02-20 at 19:38 +0000, Julien Grall wrote:
> > On 20/02/17 19:20, Dario Faggioli wrote:
> > > E.g., if vCPU x of domain A wants to go idle with a WFI/WFE, but
> > > the
> > > host is overbooked and currently really busy, Xen wants to run some
> > > other vCPU (of either the same of another domain).
> > > 
> > > That's actually the whole point of virtualization, and the reason
> > > why
> > > overbooking an host with more vCPUs (from multiple guests) than it
> > > has
> > > pCPUs works at all. If we start letting guests put the host's pCPUs
> > > to
> > > sleep, not only the scheduler, but many things would break, IMO!
> > 
> > I am not speaking about general case but when you get 1 vCPU pinned
> > to 1 
> > pCPU (I think this is Stefano use case). No other vCPU will run on
> > this 
> > pCPU. So it would be fine to let the guest do the WFI.
> > 
> Mmm... ok, yes, in that case, it may make sense and work, from a, let's
> say, purely functional perspective. But still I struggle to place this
> in a bigger picture.

I feel the same way as you, Dario. That said, if we could make it work
without breaking too many assumptions in Xen, it would be a great
improvement for this use-case.


> For instance, as you say, executing a WFI from a guest directly on
> hardware, only makes sense if we have 1:1 static pinning. Which means
> it can't just be done by default, or with a boot parameter, because we
> need to check and enforce that there's only 1:1 pinning around.

That's right, but we don't have a way to recognize or enforce 1:1 static
pinning at the moment, do we? But maybe the nop scheduler we discussed
could be a step in the right direction.


> Is it possible to decide whether to trap and emulate WFI, or just
> execute it, online, and change such decision dynamically? And even if
> yes, how would the whole thing work? When the direct execution is
> enabled for a domain we automatically enforce 1:1 pinning for that
> domain, and kick all the other domain out of its pcpus? What if they
> have their own pinning, what if they also have 'direct WFI' behavior
> enabled?

Right, I asked myself those questions as well. That is why I wrote "it
breaks the scheduler" in the previous email. I don't think it can work
today, but it could work one day, building on top of the nop scheduler.


> If it is not possible to change all this online and on a per-domain
> basis, what do we do? When dooted with the 'direct WFI' flag, we only
> accept 1:1 pinning? Who should enforce that, the setvcpuaffinity
> hypercall?
> 
> These are just examples, my point being that in theory, if we consider
> a very specific usecase or set of usecase, there's a lot we can do. But
> when you say "why don't you let the guest directly execute WFI", in
> response to a patch and a discussion like this, people may think that
> you are actually proposing doing it as a solution, which is not
> possible without figuring out all the open questions above (actually,
> probably, more) and without introducing a lot of cross-subsystem
> policing inside Xen, which is often something we don't want.

+1


> But, if you let me say this again, it looks to me we are trying to
> solve too many problem all at once in this thread, should we try
> slowing down/refocusing? :-)

Indeed. I think this patch can improve some use-cases with little
maintenance cost. It's pretty straightforward to me.
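
For reference, the whole thing essentially boils down to a boot parameter
plus a two-way branch in the WFI trap path. Very roughly (a sketch only,
with names and signatures approximated from memory, not the literal patch):

    /* Sketch of the vwfi option (approximate, not the actual patch). */
    static bool_t __read_mostly vwfi_sleep = 1;  /* "sleep" (default) or "idle" */

    static void __init parse_vwfi(const char *s)
    {
        if ( !strcmp(s, "idle") )
            vwfi_sleep = 0;
    }
    custom_param("vwfi", parse_vwfi);

    /* ... and in the WFI trap handler in traps.c ... */
    if ( vwfi_sleep )
        vcpu_block();   /* sleep until an event/interrupt arrives */
    else
        vcpu_yield();   /* stay runnable, just give up the pCPU for now */

Making it per-domain, as George suggested, would mostly mean moving the
flag into the domain (or arch_domain) struct and plumbing it through the
toolstack.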


> > But in 
> > that case, using WFI in the guest may not have been the right things
> > to do.
> > 
> But if the guest is, let's say, Linux, does it use WFI or not? And is
> it the right thing or not?
> 
> Again, the fact you're saying this probably means there's something I
> am either missing or ignoring about ARM.

Linux uses WFI.
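
(For reference, the default arm64 idle path boils down to a wfi; very
roughly, simplified and from memory:

    /* Linux arm64 default idle handler, simplified (tracing etc. omitted). */
    void arch_cpu_idle(void)
    {
            cpu_do_idle();      /* dsb sy; wfi */
            local_irq_enable();
    }

so unless the user asks for idle=poll, or a cpuidle driver provides deeper
states, an idle Linux vCPU ends up executing WFI.)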


> > I have heard use case where people wants to disable the scheduler
> > (e.g a 
> > nop scheduler) because they know only 1 vCPU will ever run on the
> > pCPU. 
> > This is exactly the use case I am thinking about.
> > 
> Sure! Except that, in Xen, we don't know whether we have, and always
> will, 1 vCPU ever run on each pCPU. Nor we have a way to enforce that,
> neither in toolstack nor in the hypervisor. :-P

Exactly! I should have written it more clearly from the beginning.


* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-20 22:53               ` Dario Faggioli
  2017-02-21  0:38                 ` Stefano Stabellini
@ 2017-02-21  7:59                 ` Julien Grall
  2017-02-21  9:09                   ` Dario Faggioli
  1 sibling, 1 reply; 39+ messages in thread
From: Julien Grall @ 2017-02-21  7:59 UTC (permalink / raw)
  To: Dario Faggioli, Stefano Stabellini
  Cc: edgar.iglesias, george.dunlap, nd, Punit Agrawal, xen-devel

Hi Dario,

On 20/02/2017 22:53, Dario Faggioli wrote:
> On Mon, 2017-02-20 at 19:38 +0000, Julien Grall wrote:
>> On 20/02/17 19:20, Dario Faggioli wrote:
>>> E.g., if vCPU x of domain A wants to go idle with a WFI/WFE, but
>>> the
>>> host is overbooked and currently really busy, Xen wants to run some
>>> other vCPU (of either the same of another domain).
>>>
>>> That's actually the whole point of virtualization, and the reason
>>> why
>>> overbooking an host with more vCPUs (from multiple guests) than it
>>> has
>>> pCPUs works at all. If we start letting guests put the host's pCPUs
>>> to
>>> sleep, not only the scheduler, but many things would break, IMO!
>>
>> I am not speaking about general case but when you get 1 vCPU pinned
>> to 1
>> pCPU (I think this is Stefano use case). No other vCPU will run on
>> this
>> pCPU. So it would be fine to let the guest do the WFI.
>>
> Mmm... ok, yes, in that case, it may make sense and work, from a, let's
> say, purely functional perspective. But still I struggle to place this
> in a bigger picture.
>
> For instance, as you say, executing a WFI from a guest directly on
> hardware, only makes sense if we have 1:1 static pinning. Which means
> it can't just be done by default, or with a boot parameter, because we
> need to check and enforce that there's only 1:1 pinning around.

I agree it cannot be done by default. Similarly, the poll mode cannot be 
enabled by default, neither platform-wide nor per-domain, because you need 
to know that all vCPUs will be in polling mode.

But as I said, if vCPUs are not pinned this patch has very little 
advantage, because you may context switch between them when yielding.

>
> Is it possible to decide whether to trap and emulate WFI, or just
> execute it, online, and change such decision dynamically? And even if
> yes, how would the whole thing work? When the direct execution is
> enabled for a domain we automatically enforce 1:1 pinning for that
> domain, and kick all the other domain out of its pcpus? What if they
> have their own pinning, what if they also have 'direct WFI' behavior
> enabled?

It can be changed online; the WFI/WFE trapping is per pCPU (see 
HCR_EL2.{TWE,TWI}).
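
Just to illustrate, flipping it would look something like this (a sketch
only, assuming the usual HCR_TWI/HCR_TWE bit definitions and the sysreg
accessors we already have; not meant as a complete patch):

    /* Sketch: enable/disable WFI/WFE trapping on the current pCPU. */
    static void set_wfi_wfe_trapping(bool_t trap)
    {
        register_t hcr = READ_SYSREG(HCR_EL2);

        if ( trap )
            hcr |= (HCR_TWI | HCR_TWE);
        else
            hcr &= ~(HCR_TWI | HCR_TWE);

        WRITE_SYSREG(hcr, HCR_EL2);
        isb();
    }

It could, for instance, be called from the context switch path depending 
on the domain being scheduled in.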

>
> If it is not possible to change all this online and on a per-domain
> basis, what do we do? When dooted with the 'direct WFI' flag, we only
> accept 1:1 pinning? Who should enforce that, the setvcpuaffinity
> hypercall?
>
> These are just examples, my point being that in theory, if we consider
> a very specific usecase or set of usecase, there's a lot we can do. But
> when you say "why don't you let the guest directly execute WFI", in
> response to a patch and a discussion like this, people may think that
> you are actually proposing doing it as a solution, which is not
> possible without figuring out all the open questions above (actually,
> probably, more) and without introducing a lot of cross-subsystem
> policing inside Xen, which is often something we don't want.

I made this response because the patch sent by Stefano has a very 
specific use case that can be solved the same way. Everyone here is 
suggesting polling, but it has its own disadvantage: power consumption.

Anyway, I still think in both cases we are solving a specific problem 
without looking at what matters, i.e. why the scheduler takes so much 
time to block/unblock.

>
> But, if you let me say this again, it looks to me we are trying to
> solve too many problem all at once in this thread, should we try
> slowing down/refocusing? :-)
>
>> If you run multiple vCPU in the same pCPU you would have a bigger
>> interrupt latency. And blocked the vCPU or yield will likely have
>> the
>> same number unless you know the interrupt will come right now.
>>
> Maybe. At least on x86, that would depend on the actual load. If all
> your pCPUs are more than 100% loaded, yes. If the load is less than
> that, you may still see improvements.
>
>> But in
>> that case, using WFI in the guest may not have been the right things
>> to do.
>>
> But if the guest is, let's say, Linux, does it use WFI or not? And is
> it the right thing or not?
>
> Again, the fact you're saying this probably means there's something I
> am either missing or ignoring about ARM.

WFI/WFE is a way to be nice and save power. It is not mandatory to use 
them; a guest OS can perfectly well decide that it does not need them.

>> I have heard use case where people wants to disable the scheduler
>> (e.g a
>> nop scheduler) because they know only 1 vCPU will ever run on the
>> pCPU.
>> This is exactly the use case I am thinking about.
>>
> Sure! Except that, in Xen, we don't know whether we have, and always
> will, 1 vCPU ever run on each pCPU. Nor we have a way to enforce that,
> neither in toolstack nor in the hypervisor. :-P
>
>>> So, I'm not sure what we're talking about, but what I'm quite sure
>>> is
>>> that we don't want a guest to be able to decide when and until what
>>> time/event, a pCPU goes idle.
>>
>> Well, if the guest is not using the WFI/WFE at all you would need an
>> interrupt from the scheduler to get it running.
>>
> If the guest is not using WFI, it's busy looping, isn't it?

Yes, very similar to what we are implementing with the poll here.

>
>> So here it is similar,
>> the scheduler would have setup a timer and the processor will awake
>> when
>> receiving the timer interrupt to enter in the hypervisor.
>>
>> So, yes in fine the guest will waste its slot.
>>
> Did I say it already that this concept of "slots" does not apply here?
> :-D

Sorry, I forgot about this :/. I guess you use the term credit? If so, the 
guest will use its credits for nothing.

Cheers,

-- 
Julien Grall


* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-21  0:38                 ` Stefano Stabellini
@ 2017-02-21  8:10                   ` Julien Grall
  2017-02-21  9:24                     ` Dario Faggioli
  0 siblings, 1 reply; 39+ messages in thread
From: Julien Grall @ 2017-02-21  8:10 UTC (permalink / raw)
  To: Stefano Stabellini, Dario Faggioli
  Cc: edgar.iglesias, george.dunlap, nd, Punit Agrawal, xen-devel

Hi Stefano,

On 21/02/2017 00:38, Stefano Stabellini wrote:
> On Mon, 20 Feb 2017, Dario Faggioli wrote:
>> On Mon, 2017-02-20 at 19:38 +0000, Julien Grall wrote:
>>> On 20/02/17 19:20, Dario Faggioli wrote:
>>>> E.g., if vCPU x of domain A wants to go idle with a WFI/WFE, but
>>>> the
>>>> host is overbooked and currently really busy, Xen wants to run some
>>>> other vCPU (of either the same of another domain).
>>>>
>>>> That's actually the whole point of virtualization, and the reason
>>>> why
>>>> overbooking an host with more vCPUs (from multiple guests) than it
>>>> has
>>>> pCPUs works at all. If we start letting guests put the host's pCPUs
>>>> to
>>>> sleep, not only the scheduler, but many things would break, IMO!
>>>
>>> I am not speaking about general case but when you get 1 vCPU pinned
>>> to 1
>>> pCPU (I think this is Stefano use case). No other vCPU will run on
>>> this
>>> pCPU. So it would be fine to let the guest do the WFI.
>>>
>> Mmm... ok, yes, in that case, it may make sense and work, from a, let's
>> say, purely functional perspective. But still I struggle to place this
>> in a bigger picture.
>
> I feel the same way as you, Dario. That said, if we could make it work
> without breaking too many assumptions in Xen, it would be a great
> improvement for this use-case.

Again, no assumption is broken. Using WFI/WFE is just a nice way 
to say "you can sleep and save power", so a scheduler can decide to 
schedule another vCPU. Obviously a guest is free not to use them and 
could do a busy loop instead.

But as far as I can tell, the use-case you are trying to solve is each 
vCPU pinned to a specific pCPU. If you don't pin them, your patch will 
not improve much, because both yield and block may context switch your vCPU.

>
>> For instance, as you say, executing a WFI from a guest directly on
>> hardware, only makes sense if we have 1:1 static pinning. Which means
>> it can't just be done by default, or with a boot parameter, because we
>> need to check and enforce that there's only 1:1 pinning around.
>
> That's right, but we don't have a way to recognize or enforce 1:1 static
> pinning at the moment, do we? But maybe the nop scheduler we discussed
> could be a step in the right direction.
>
>
>> Is it possible to decide whether to trap and emulate WFI, or just
>> execute it, online, and change such decision dynamically? And even if
>> yes, how would the whole thing work? When the direct execution is
>> enabled for a domain we automatically enforce 1:1 pinning for that
>> domain, and kick all the other domain out of its pcpus? What if they
>> have their own pinning, what if they also have 'direct WFI' behavior
>> enabled?
>
> Right, I asked myself those questions as well. That is why I wrote "it
> breaks the scheduler" in the previous email. I don't think it can work
> today, but it could work one day, building on top of the nop scheduler.

WFE/WFI is only a hint for the scheduler to reschedule. If you don't 
trap them, the guest will still run until the end of its credit. It 
doesn't break anything.

>
>
>> If it is not possible to change all this online and on a per-domain
>> basis, what do we do? When dooted with the 'direct WFI' flag, we only
>> accept 1:1 pinning? Who should enforce that, the setvcpuaffinity
>> hypercall?
>>
>> These are just examples, my point being that in theory, if we consider
>> a very specific usecase or set of usecase, there's a lot we can do. But
>> when you say "why don't you let the guest directly execute WFI", in
>> response to a patch and a discussion like this, people may think that
>> you are actually proposing doing it as a solution, which is not
>> possible without figuring out all the open questions above (actually,
>> probably, more) and without introducing a lot of cross-subsystem
>> policing inside Xen, which is often something we don't want.
>
> +1
>
>
>> But, if you let me say this again, it looks to me we are trying to
>> solve too many problem all at once in this thread, should we try
>> slowing down/refocusing? :-)
>
> Indeed. I think this patch can improve some use-cases with little
> maintenance cost. It's pretty straightforward to me.

I am still missing the details of your use-case and can only speculate on 
it so far.

>>> But in
>>> that case, using WFI in the guest may not have been the right things
>>> to do.
>>>
>> But if the guest is, let's say, Linux, does it use WFI or not? And is
>> it the right thing or not?
>>
>> Again, the fact you're saying this probably means there's something I
>> am either missing or ignoring about ARM.
>
> Linux uses WFI

Linux may or may not use WFI. It is up to its scheduler to decide.

>
>>> I have heard use case where people wants to disable the scheduler
>>> (e.g a
>>> nop scheduler) because they know only 1 vCPU will ever run on the
>>> pCPU.
>>> This is exactly the use case I am thinking about.
>>>
>> Sure! Except that, in Xen, we don't know whether we have, and always
>> will, 1 vCPU ever run on each pCPU. Nor we have a way to enforce that,
>> neither in toolstack nor in the hypervisor. :-P
>
> Exactly! I should have written it more clearly from the beginning.
>

-- 
Julien Grall


* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-21  7:59                 ` Julien Grall
@ 2017-02-21  9:09                   ` Dario Faggioli
  2017-02-21 12:30                     ` Julien Grall
  0 siblings, 1 reply; 39+ messages in thread
From: Dario Faggioli @ 2017-02-21  9:09 UTC (permalink / raw)
  To: Julien Grall, Stefano Stabellini
  Cc: edgar.iglesias, george.dunlap, nd, Punit Agrawal, xen-devel


On Tue, 2017-02-21 at 07:59 +0000, Julien Grall wrote:
> On 20/02/2017 22:53, Dario Faggioli wrote:
> > For instance, as you say, executing a WFI from a guest directly on
> > hardware, only makes sense if we have 1:1 static pinning. Which
> > means
> > it can't just be done by default, or with a boot parameter, because
> > we
> > need to check and enforce that there's only 1:1 pinning around.
> 
> I agree it cannot be done by default. Similarly, the poll mode cannot
> be 
> done by default in platform nor by domain because you need to know
> that 
> all vCPUs will be in polling mode.
> 
No, that's the big difference. Polling (which, as far as this patch
goes, is yielding, in this case) is generic in the sense that, no
matter the pinned or non-pinned state, things work. Power is wasted,
but nothing breaks.

Not trapping WF* is not generic in the sense that, if you do it in the
pinned case, it (probably) works. If you lift the pinning, but leave the
direct WF* execution in place, everything breaks.

This is all I'm saying: if you say that not trapping is an alternative
to this patch, well, it is not. Not trapping _plus_ measures for
preventing things from breaking is an alternative.

Am I nitpicking? Perhaps... In which case, sorry. :-P

> But as I said, if vCPUs are not pinned this patch as very little 
> advantage because you may context switch between them when yielding.
> 
Smaller advantage, sure. How much smaller, hard to tell. That is the
reason why I see some potential value in this patch, especially if
converted to doing its thing per-domain, as George suggested. One can
try (and, when that happens, we'll show a big WARNING about wasting
power and heating up the CPUs!), and decide whether the result is good
or not for the specific use case.

> > Is it possible to decide whether to trap and emulate WFI, or just
> > execute it, online, and change such decision dynamically? And even
> > if
> > yes, how would the whole thing work? When the direct execution is
> > enabled for a domain we automatically enforce 1:1 pinning for that
> > domain, and kick all the other domain out of its pcpus? What if
> > they
> > have their own pinning, what if they also have 'direct WFI'
> > behavior
> > enabled?
> 
> It can be changed online, the WFI/WFE trapping is per pCPU (see 
> HCR_EL2.{TWE,TWI}
> 
Ok, thanks for the info. Not bad. With added logic (perhaps in the nop
scheduler), this looks like it could be useful.

> > These are just examples, my point being that in theory, if we
> > consider
> > a very specific usecase or set of usecase, there's a lot we can do.
> > But
> > when you say "why don't you let the guest directly execute WFI", in
> > response to a patch and a discussion like this, people may think
> > that
> > you are actually proposing doing it as a solution, which is not
> > possible without figuring out all the open questions above
> > (actually,
> > probably, more) and without introducing a lot of cross-subsystem
> > policing inside Xen, which is often something we don't want.
> 
> I made this response because the patch sent by Stefano as a very 
> specific use case that can be solved the same way. Everyone here is 
> suggesting polling but it has it is own disadvantage: power
> consumption.
> 
> Anyway, I still think in both case we are solving a specific problem 
> without looking at what matters. I.e Why the scheduler takes so much 
> time to block/unblock.
> 
Well, TBH, we still are not entirely sure who the culprit is for high
latency. There are spikes in Credit2, and I'm investigating that. But
apart from them? I think we need other numbers with which we can
compare the numbers that Stefano has collected.

I'll send code for the nop scheduler, and we will compare with what
we'll get with it. Another interesting data point would be knowing how
the numbers look like on baremetal, on the same platform and under
comparable conditions.

And I guess there are other components and layers, in the Xen
architecture, that may be causing increased latency, which we may have
not identified yet.

Anyway, the nop scheduler is probably the first thing we want to check. I'll
send the patches soon.

> > > So, yes in fine the guest will waste its slot.
> > > 
> > Did I say it already that this concept of "slots" does not apply
> > here?
> > :-D
> 
> Sorry forgot about this :/. I guess you use the term credit? If so,
> the 
> guest will use its credit for nothing.
> 
If the guest is alone, or in general the system is undersubscribed, it
would indeed burn its credits by continuously yielding in a busy loop,
but that doesn't matter, because there are enough pCPUs to run even vCPUs
that are out of credits.

If the guest is not alone, and the system is oversubscribed, it would
use a very tiny amount of its credits, every now and then, i.e., the
ones that are necessary to execute a WFI, and, for Xen, to issue a call
to sched_yield(). But after that, we will run someone else. This is to say
that the problem of this patch might be that, in the oversubscribed
case, it relies too much on the behavior of yield, but not that it does
nothing.

But maybe I'm nitpicking again. Sorry. I don't get to talk about these
inner (and very interesting, to me at least) scheduling details too
often, and when it happens, I tend to get excited and exaggerate! :-P

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-21  8:10                   ` Julien Grall
@ 2017-02-21  9:24                     ` Dario Faggioli
  2017-02-21 13:04                       ` Julien Grall
  0 siblings, 1 reply; 39+ messages in thread
From: Dario Faggioli @ 2017-02-21  9:24 UTC (permalink / raw)
  To: Julien Grall, Stefano Stabellini
  Cc: edgar.iglesias, george.dunlap, nd, Punit Agrawal, xen-devel


On Tue, 2017-02-21 at 08:10 +0000, Julien Grall wrote:
> On 21/02/2017 00:38, Stefano Stabellini wrote:
> > On Mon, 20 Feb 2017, Dario Faggioli wrote:
> > > Mmm... ok, yes, in that case, it may make sense and work, from a,
> > > let's
> > > say, purely functional perspective. But still I struggle to place
> > > this
> > > in a bigger picture.
> > 
> > I feel the same way as you, Dario. That said, if we could make it
> > work
> > without breaking too many assumptions in Xen, it would be a great
> > improvement for this use-case.
> 
> Again, there is no assumption broken. Using WFI/WFE is just a nice
> way 
> to say: "You can sleep and save power" so a scheduler can decide to 
> schedule another vCPU. Obviously a guest is free to not use them and 
> could do a busy loop instead.
> 
Again, other way round. It is the scheduler that, when a CPU would go
idle, decides whether to sleep or busy loop.

It's the Linux scheduler that, in Linux, on x86, decides whether to HLT
(or use MWAIT and that stuff) during the idle loop (which is what
happens by default) or not, and hence busy loop (which is what happens
if you pass 'idle=poll'). And, as far as Linux (and every OS running on
baremetal), that's it.

In Xen, it's the exact same thing. When the scheduler decides to run
the idle loop on a pCPU, it's Xen itself (it's, strictly speaking, not
really the scheduler, because the code is in arch/foo/domain.c, but,
whatever) that decides whether to sleep --with MWAIT, WFI, etc-- or to
stay awake. Staying awake would basically mean calling something like
cpu_relax() in idle_loop() (basically doing, on x86,
pm_idle=cpu_relax()). We currently don't have a way to tell Xen we want
that, but it may be added. That would be the _exact_ equivalent of
Linux's 'idle=poll'. And that is _not_ what this patch does.
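
Just to make the distinction concrete, such a (hypothetical!) Xen-side
'idle=poll' knob would look roughly like this on ARM (sketch only, names
approximate, IRQ masking and cpu_is_haltable() checks omitted):

    /* Sketch: a hypothetical idle=poll equivalent in Xen/ARM's idle loop. */
    static bool_t __read_mostly opt_idle_poll;  /* hypothetical boot option */

    void idle_loop(void)
    {
        for ( ; ; )
        {
            if ( cpu_is_offline(smp_processor_id()) )
                stop_cpu();

            if ( opt_idle_poll )
                cpu_relax();   /* spin: best wakeup latency, worst power */
            else
                wfi();         /* sleep until an interrupt/event arrives */

            do_tasklet();
            do_softirq();
        }
    }

That would be about what the pCPU does when idle. What this patch changes
is, instead, what we do when a guest vCPU goes idle, which is what the
next paragraph is about.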

In fact, still in Xen, we also have to decide what to do when one of
our guests' vCPUs goes idle. This is where, I feel, at least part of
the misunderstanding going on in this thread is actually happening...

> > Right, I asked myself those questions as well. That is why I wrote
> > "it
> > breaks the scheduler" in the previous email. I don't think it can
> > work
> > today, but it could work one day, building on top of the nop
> > scheduler.
> 
> WFE/WFI is only a hint for the scheduler to reschedule. If you don't 
> trap them, the guest will still run until the end of its credit. It 
> doesn't break anything.
> 
And in fact, I totally fail to understand what you mean here. By "don't
trap them" do you mean just ignore them? Or do you mean execute them on
hardware?

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-21  9:09                   ` Dario Faggioli
@ 2017-02-21 12:30                     ` Julien Grall
  2017-02-21 13:46                       ` George Dunlap
  2017-02-21 16:51                       ` Dario Faggioli
  0 siblings, 2 replies; 39+ messages in thread
From: Julien Grall @ 2017-02-21 12:30 UTC (permalink / raw)
  To: Dario Faggioli, Stefano Stabellini
  Cc: edgar.iglesias, george.dunlap, nd, Punit Agrawal, xen-devel

Hi Dario,

On 21/02/2017 09:09, Dario Faggioli wrote:
> On Tue, 2017-02-21 at 07:59 +0000, Julien Grall wrote:
>> On 20/02/2017 22:53, Dario Faggioli wrote:
>>> For instance, as you say, executing a WFI from a guest directly on
>>> hardware, only makes sense if we have 1:1 static pinning. Which
>>> means
>>> it can't just be done by default, or with a boot parameter, because
>>> we
>>> need to check and enforce that there's only 1:1 pinning around.
>>
>> I agree it cannot be done by default. Similarly, the poll mode cannot
>> be
>> done by default in platform nor by domain because you need to know
>> that
>> all vCPUs will be in polling mode.
>>
> No, that's the big difference. Polling (which, as far as this patch
> goes, is yielding, in this case) is generic in the sense that, no
> matter the pinned or non-pinned state, things work. Power is wasted,
> but nothing breaks.
>
> Not trapping WF* is not generic in the sense that, if you do in the
> pinned case, i (probably) works. If you lift the pinning, but leave the
> direct WF* execution in place, everything breaks.
>
> This is all I'm saying: that if you say, not trapping is an alternative
> to this patch, well, it is not. Not trapping _plus_ measures for
> preventing things to break, is an alternative.
>
> Am I nitpicking? Perhaps... In which case, sorry. :-P

I am sorry but I still don't understand why you say things will break if 
you don't trap WFI/WFE. Can you detail it?

>
>> But as I said, if vCPUs are not pinned this patch as very little
>> advantage because you may context switch between them when yielding.
>>
> Smaller advantage, sure. How much smaller, hard to tell. That is the
> reason why I see some potential value in this patch, especially if
> converted to doing its thing per-domain, as George suggested. One can
> try (and, when that happens, we'll show a big WARNING about wasting
> power an heating up the CPUs!), and decide whether the result is good
> or not for the specific use case.

I even think there will be no advantage at all in the multiple-vCPU case, 
because I would not be surprised if the overhead of blocking the vCPU came 
from switching back and forth to the idle vCPU, which requires 
saving/restoring the context of the same vCPU.

Anyway, having numbers here would help to confirm.

My concern with a per-domain or even system-wide solution is that you may 
have an idle vCPU where you don't expect any interrupt to come. In this 
case the vCPU will waste power, and with an unmodified (e.g. non-Xen-aware) 
app there is no way to suspend the vCPU today on Xen.

>
>>> Is it possible to decide whether to trap and emulate WFI, or just
>>> execute it, online, and change such decision dynamically? And even
>>> if
>>> yes, how would the whole thing work? When the direct execution is
>>> enabled for a domain we automatically enforce 1:1 pinning for that
>>> domain, and kick all the other domain out of its pcpus? What if
>>> they
>>> have their own pinning, what if they also have 'direct WFI'
>>> behavior
>>> enabled?
>>
>> It can be changed online, the WFI/WFE trapping is per pCPU (see
>> HCR_EL2.{TWE,TWI}
>>
> Ok, thanks for the info. Not bad. With added logic (perhaps in the nop
> scheduler), this looks like it could be useful.
>
>>> These are just examples, my point being that in theory, if we
>>> consider
>>> a very specific usecase or set of usecase, there's a lot we can do.
>>> But
>>> when you say "why don't you let the guest directly execute WFI", in
>>> response to a patch and a discussion like this, people may think
>>> that
>>> you are actually proposing doing it as a solution, which is not
>>> possible without figuring out all the open questions above
>>> (actually,
>>> probably, more) and without introducing a lot of cross-subsystem
>>> policing inside Xen, which is often something we don't want.
>>
>> I made this response because the patch sent by Stefano as a very
>> specific use case that can be solved the same way. Everyone here is
>> suggesting polling but it has it is own disadvantage: power
>> consumption.
>>
>> Anyway, I still think in both case we are solving a specific problem
>> without looking at what matters. I.e Why the scheduler takes so much
>> time to block/unblock.
>>
> Well, TBH, we still are not entirely sure who the culprit is for high
> latency. There are spikes in Credit2, and I'm investigating that. But
> apart from them? I think we need other numbers with which we can
> compare the numbers that Stefano has collected.

I think the problem is that we save/restore the vCPU state when 
switching to the idle vCPU.

Let's say only 1 vCPU can run on the pCPU; when that vCPU issues a 
WFI the following steps happen:
      * WFI trapped and vcpu blocked
      * save vCPU state
      * run idle_loop
-> Interrupt incoming for the guest
      * restore vCPU state
      * back to the guest

Saving/restoring on ARM requires context switching all the state of the 
VM (this is not saved in memory when entering the hypervisor). This 
includes things like system registers, interrupt controller state, FPU...

Context switching the interrupt controller and the FPU can take some 
time as you have lots of registers and some are only accessible through 
the memory interface (see GICv2 for instance).

So a context switch will likely hurt the performance of blocking the vCPU 
when only 1 vCPU ever runs on the pCPU.
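
In code terms, each block/unblock cycle costs us roughly this (a sketch, 
function names approximate, see xen/arch/arm/domain.c for the real thing; 
v is the one guest vCPU pinned to this pCPU):

    ctxt_switch_from(v);  /* save sysregs, VFP, vGIC/GIC state of the guest */
    /* ... idle vCPU runs; the pCPU mostly sits in wfi() ...               */
    ctxt_switch_to(v);    /* reload exactly the same state when the IRQ fires */

i.e. we pay the full save/restore even though nothing else ever runs on 
that pCPU.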

>
> I'll send code for the nop scheduler, and we will compare with what
> we'll get with it. Another interesting data point would be knowing how
> the numbers look like on baremetal, on the same platform and under
> comparable conditions.
>
> And I guess there are other components and layers, in the Xen
> architecture, that may be causing increased latency, which we may have
> not identified yet.
>
> Anyway, nop scheduler is probably first thing we want to check. I'll
> send the patches soon.
>
>>>> So, yes in fine the guest will waste its slot.
>>>>
>>> Did I say it already that this concept of "slots" does not apply
>>> here?
>>> :-D
>>
>> Sorry forgot about this :/. I guess you use the term credit? If so,
>> the
>> guest will use its credit for nothing.
>>
> If the guest is alone, or in general the system is undersubscribed, it
> would, by continuously yielding in a busy loop, but that doesn't
> matter, because there are enough pCPUs to run even vCPUs that are out
> of credits.
>
> If the guest is not alone, and the system is oversubscribed, it would
> use a very tiny amount of its credits, every now and then, i.e., the
> ones that are necessary to execute a WFI, and, for Xen, to issue a call
> to sched_yield(). But after that, we will run someone else. This to say
> that the problem of this patch might be that, in the oversubscribed
> case, it relies too much on the behavior of yield, but not that it does
> nothing.
>
> But maybe I'm nitpicking again. Sorry. I don't get to talk about these
> inner (and very interesting, to me at least) scheduling details too
> often, and when it happens, I tend to get excited and exaggerate! :-P

Let's take a step aside. The ARM ARM describes WFI as a "hint instruction 
that permits the processor to enter a low-power state until one of a 
number of asynchronous events occurs". Entering a low-power state 
means there will be an impact (maybe small) on interrupt latency, because 
the CPU has to leave the low-power state.

A baremetal application that uses WFI is aware of the impact and wishes to 
save power. If that application really cares about interrupt latency it 
will use polling and not WFI. It depends on how much interrupt latency you 
can tolerate.

Now, the same baremetal app running as a Xen guest will expect the same 
behavior. This is why WFI is implemented with block, but it has a high 
impact today (see above for a possible explanation). Moving to yield may 
have the same high impact because, as you said, the implementation will 
depend on the scheduler, and when multiple vCPUs are running on the same 
pCPU you would have to context switch, and that has a cost.

A user who wants to move his baremetal app into a guest will have to pay 
the price of virtualization overhead + power if he wants to get good 
interrupt latency results, even by using WFI. I would be surprised if it 
looks appealing to some people.

This is why for me implementing guest WFI as polling looks like an 
attempt to muddy the waters.

If you want good interrupt latency with virtualization, you would pin 
your vCPU and ensure no other vCPU will run on this pCPU. And then you 
can play with the scheduler to optimize it (e.g. avoiding pointless 
context switches...).

So, again, for me implementing guest WFI as polling looks like an attempt 
to muddy the waters. It is not going to solve the problem that the context 
switch takes time.

Cheers,

-- 
Julien Grall


* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-21  9:24                     ` Dario Faggioli
@ 2017-02-21 13:04                       ` Julien Grall
  0 siblings, 0 replies; 39+ messages in thread
From: Julien Grall @ 2017-02-21 13:04 UTC (permalink / raw)
  To: Dario Faggioli, Stefano Stabellini
  Cc: edgar.iglesias, george.dunlap, nd, Punit Agrawal, xen-devel

Hi Dario,

On 21/02/17 09:24, Dario Faggioli wrote:
> On Tue, 2017-02-21 at 08:10 +0000, Julien Grall wrote:
>> On 21/02/2017 00:38, Stefano Stabellini wrote:
>>> On Mon, 20 Feb 2017, Dario Faggioli wrote:
>>>> Mmm... ok, yes, in that case, it may make sense and work, from a,
>>>> let's
>>>> say, purely functional perspective. But still I struggle to place
>>>> this
>>>> in a bigger picture.
>>>
>>> I feel the same way as you, Dario. That said, if we could make it
>>> work
>>> without breaking too many assumptions in Xen, it would be a great
>>> improvement for this use-case.
>>
>> Again, there is no assumption broken. Using WFI/WFE is just a nice
>> way
>> to say: "You can sleep and save power" so a scheduler can decide to
>> schedule another vCPU. Obviously a guest is free to not use them and
>> could do a busy loop instead.
>>
> Again, other way round. It is a scheduler that, when a CPU would go
> idle, decides whether to sleep or busy loop.
>
> It's the Linux scheduler that, in Linux, on x86, decides whether to HLT
> (or use MWAIT and that stuff) during the idle loop (which is what
> happens by default) or not, and hence busy loop (which is what happens
> if you pass 'idle=poll'). And, as far as Linux (and every OS running on
> baremetal), that's it.
>
> In Xen, it's the exact same thing. When the scheduler decides to run
> the idle loop on a pCPU, it's Xen itself (it's, strictly speaking, not
> really the scheduler, because the code is in arch/foo/domain.c, but,
> whatever) that decides whether to sleep --with MWAIT, WFI, etc-- or to
> stay awake. Stay awake would basically mean calling something like
> cpu_relax() idle_loop() (basically, doing, on x86,
> pm_idle=cpu_relax()). We currently don't have a way to tell Xen we want
> that, but it may be added. That would be the _exact_ equivalent of
> Linux's 'idle=poll'. And that is _not_ what this patch does.
>
> In fact, still in Xen, we also have to decide what to do when one of
> our guests' vCPUs goes idle. This is where, I feel, at least part of
> the misunderstanding going on in this thread is actually happening...

Likely :). I would be happy to have idle_loop be a busy loop. But I 
don't think this will solve Stefano's problem (see my answer [1]), 
because the underlying problem is that using the idle vCPU on ARM is 
really expensive.

>
>>> Right, I asked myself those questions as well. That is why I wrote
>>> "it
>>> breaks the scheduler" in the previous email. I don't think it can
>>> work
>>> today, but it could work one day, building on top of the nop
>>> scheduler.
>>
>> WFE/WFI is only a hint for the scheduler to reschedule. If you don't
>> trap them, the guest will still run until the end of its credit. It
>> doesn't break anything.
>>
> And in fact, I totally fail to understand what you mean here. By "don't
> trap them" do you mean just ignore them? Or do you mean execute them on
> hardware?

I mean executing them on the hardware. My whole point was about WFI/WFE 
executed by the guest, not by Xen.

Cheers,

[1] <14575011-0042-8940-c19f-2482136ff91c@foss.arm.com>

-- 
Julien Grall


* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-21 12:30                     ` Julien Grall
@ 2017-02-21 13:46                       ` George Dunlap
  2017-02-21 15:07                         ` Dario Faggioli
  2017-02-21 15:14                         ` Julien Grall
  2017-02-21 16:51                       ` Dario Faggioli
  1 sibling, 2 replies; 39+ messages in thread
From: George Dunlap @ 2017-02-21 13:46 UTC (permalink / raw)
  To: Julien Grall, Dario Faggioli, Stefano Stabellini
  Cc: edgar.iglesias, george.dunlap, nd, Punit Agrawal, xen-devel

On 21/02/17 12:30, Julien Grall wrote:
> Hi Dario,
> 
> On 21/02/2017 09:09, Dario Faggioli wrote:
>> On Tue, 2017-02-21 at 07:59 +0000, Julien Grall wrote:
>>> On 20/02/2017 22:53, Dario Faggioli wrote:
>>>> For instance, as you say, executing a WFI from a guest directly on
>>>> hardware, only makes sense if we have 1:1 static pinning. Which
>>>> means
>>>> it can't just be done by default, or with a boot parameter, because
>>>> we
>>>> need to check and enforce that there's only 1:1 pinning around.
>>>
>>> I agree it cannot be done by default. Similarly, the poll mode cannot
>>> be
>>> done by default in platform nor by domain because you need to know
>>> that
>>> all vCPUs will be in polling mode.
>>>
>> No, that's the big difference. Polling (which, as far as this patch
>> goes, is yielding, in this case) is generic in the sense that, no
>> matter the pinned or non-pinned state, things work. Power is wasted,
>> but nothing breaks.
>>
>> Not trapping WF* is not generic in the sense that, if you do in the
>> pinned case, i (probably) works. If you lift the pinning, but leave the
>> direct WF* execution in place, everything breaks.
>>
>> This is all I'm saying: that if you say, not trapping is an alternative
>> to this patch, well, it is not. Not trapping _plus_ measures for
>> preventing things to break, is an alternative.
>>
>> Am I nitpicking? Perhaps... In which case, sorry. :-P
> 
> I am sorry but I still don't understand why you say things will break if
> you don't trap WFI/WFE. Can you detail it?
> 
>>
>>> But as I said, if vCPUs are not pinned this patch as very little
>>> advantage because you may context switch between them when yielding.
>>>
>> Smaller advantage, sure. How much smaller, hard to tell. That is the
>> reason why I see some potential value in this patch, especially if
>> converted to doing its thing per-domain, as George suggested. One can
>> try (and, when that happens, we'll show a big WARNING about wasting
>> power an heating up the CPUs!), and decide whether the result is good
>> or not for the specific use case.
> 
> I even think there will be no advantage at all in multiple vCPUs case
> because I would not be surprised that the overhead of vCPU block is
> because we switch back and forth to the idle vCPU requiring to
> save/restore the context of the same vCPU.
> 
> Anyway, having number here would help to confirm.
> 
> My concern of per-domain solution or even system wide is you may have an
> idle vCPU where you don't expect interrupt to come. In this case, your
> vCPU will waste power and an unmodified app (e.g non-Xen aware) as there
> is no solution to suspend the vCPU today on Xen.
> 
>>
>>>> Is it possible to decide whether to trap and emulate WFI, or just
>>>> execute it, online, and change such decision dynamically? And even
>>>> if
>>>> yes, how would the whole thing work? When the direct execution is
>>>> enabled for a domain we automatically enforce 1:1 pinning for that
>>>> domain, and kick all the other domain out of its pcpus? What if
>>>> they
>>>> have their own pinning, what if they also have 'direct WFI'
>>>> behavior
>>>> enabled?
>>>
>>> It can be changed online, the WFI/WFE trapping is per pCPU (see
>>> HCR_EL2.{TWE,TWI}
>>>
>> Ok, thanks for the info. Not bad. With added logic (perhaps in the nop
>> scheduler), this looks like it could be useful.
>>
>>>> These are just examples, my point being that in theory, if we
>>>> consider
>>>> a very specific usecase or set of usecase, there's a lot we can do.
>>>> But
>>>> when you say "why don't you let the guest directly execute WFI", in
>>>> response to a patch and a discussion like this, people may think
>>>> that
>>>> you are actually proposing doing it as a solution, which is not
>>>> possible without figuring out all the open questions above
>>>> (actually,
>>>> probably, more) and without introducing a lot of cross-subsystem
>>>> policing inside Xen, which is often something we don't want.
>>>
>>> I made this response because the patch sent by Stefano has a very
>>> specific use case that can be solved the same way. Everyone here is
>>> suggesting polling, but it has its own disadvantage: power
>>> consumption.
>>>
>>> Anyway, I still think that in both cases we are solving a specific
>>> problem without looking at what matters, i.e. why the scheduler takes
>>> so much time to block/unblock.
>>>
>> Well, TBH, we still are not entirely sure who the culprit is for high
>> latency. There are spikes in Credit2, and I'm investigating that. But
>> apart from them? I think we need other numbers with which we can
>> compare the numbers that Stefano has collected.
> 
> I think the problem is because we save/restore the vCPU state when
> switching to the idle vCPU.
> 
> Let's say only 1 vCPU can run on the pCPU; when the vCPU issues a
> WFI, the following steps happen:
>      * WFI trapped and vcpu blocked
>      * save vCPU state
>      * run idle_loop
> -> Interrupt incoming for the guest
>      * restore vCPU state
>      * back to the guest
> 
> Saving/restoring on ARM requires context switching all the state of the
> VM (this is not saved to memory when entering the hypervisor). This
> includes things like system registers, interrupt controller state, FPU...
> 
> Context switching the interrupt controller and the FPU can take some
> time, as there are lots of registers and some are only accessible through
> the memory interface (see GICv2 for instance).
> 
> So a context switch will likely hurt the performance of blocking the
> vcpu when only 1 vCPU is running per pCPU.
> 
>>
>> I'll send code for the nop scheduler, and we will compare with what
>> we'll get with it. Another interesting data point would be knowing how
>> the numbers look on baremetal, on the same platform and under
>> comparable conditions.
>>
>> And I guess there are other components and layers, in the Xen
>> architecture, that may be causing increased latency, which we may not
>> have identified yet.
>>
>> Anyway, the nop scheduler is probably the first thing we want to
>> check. I'll send the patches soon.
>>
>>>>> So, yes, in the end the guest will waste its slot.
>>>>>
>>>> Did I say it already that this concept of "slots" does not apply
>>>> here?
>>>> :-D
>>>
>>> Sorry, I forgot about this :/. I guess you use the term credit? If
>>> so, the guest will use its credits for nothing.
>>>
>> If the guest is alone, or in general the system is undersubscribed, it
>> would indeed, by continuously yielding in a busy loop; but that doesn't
>> matter, because there are enough pCPUs to run even vCPUs that are out
>> of credits.
>>
>> If the guest is not alone, and the system is oversubscribed, it would
>> use a very tiny amount of its credits, every now and then, i.e., the
>> ones that are necessary to execute a WFI, and, for Xen, to issue a call
>> to sched_yield(). But after that, we will run someone else. This is to say
>> that the problem of this patch might be that, in the oversubscribed
>> case, it relies too much on the behavior of yield, but not that it does
>> nothing.
>>
>> But maybe I'm nitpicking again. Sorry. I don't get to talk about these
>> inner (and very interesting, to me at least) scheduling details too
>> often, and when it happens, I tend to get excited and exaggerate! :-P
> 
> Let's take a step aside. The ARM ARM describes WFI as a "hint instruction
> that permits the processor to enter a low-power state until one of a
> number of asynchronous events occurs". Entering a low-power state
> means it will have an impact (maybe small) on interrupt latency, because
> the CPU has to leave the low-power state.
> 
> A baremetal application that uses WFI is aware of the impact and wishes
> to save power. If that application really cares about interrupt latency,
> it will use polling and not WFI. It depends on how much interrupt
> latency you can tolerate.
> 
> Now, the same baremetal application running as a Xen guest will expect
> the same behavior. This is why WFI is implemented with block, but it has
> a high impact today (see above for a possible explanation). Moving to
> yield may have the same high impact because, as you said, the
> implementation will depend on the scheduler, and when multiple vCPUs are
> running on the same pCPU you have to context switch, which has a cost.
> 
> A user who wants to move his baremetal app into a guest will have to pay
> the price of virtualization overhead + power if he wants good interrupt
> latency results, even when using WFI. I would be surprised if that looks
> appealing to many people.
> 
> This is why for me implementing guest WFI as polling looks like an
> attempt to muddy the waters.
> 
> If you want good interrupt latency with virtualization, you would pin
> your vCPU and ensure no other vCPU will run on this pCPU. And then you
> can play with the scheduler to optimize it (e.g. avoiding pointless
> context switches...).
> 
> So, for me, implementing guest WFI as polling looks like an attempt to
> muddy the waters. It is not going to solve the problem that the context
> switch takes time.

I think our options look like:

A.  Don't trap guest WFI at all -- allow it to 'halt' in
moderate-power-but-ready-for-interrupt mode.

B. Trap guest WFI and block normally.

C. Trap guest WFI and poll instead of idling.

D. Trap guest WFI and take that as a hint to "idle" in a non-deep sleep
state (perhaps, for instance, using the WFI instruction).

A is safe because the scheduler should already have set a timer to break
out of it if necessary.  The only potential issue here is that the guest
is burning its credits, meaning that other vcpus with something
potentially useful to do aren't being allowed to run; and then later
when this vcpu has something useful to do it may be prevented from
running because of low credits.  (This may be what Dario means when he
says it "breaks scheduling").

B, C, and D have the advantage that the guest will not be charged for
credits, and other useful work can be done if it's possible.

B and C have the disadvantage that the effect will be significantly
different under Xen than on real hardware: B will mean it will go into a
deep sleep state (and thus not be as responsive as on real hardware); C
has the disadvantage that there won't be any significant power savings.

Having the ability for an administrator to choose between A and D seems
like it would be the most useful.
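
To make that concrete, here is a minimal sketch of how such an
administrator-facing choice could be exposed as a boot-time string option.
The enum, the value names and the parser below are assumptions of this
sketch, not the actual implementation, and hooking it into Xen's
command-line machinery is deliberately left out.

/* Hypothetical WFI-policy knob -- illustrative names only. */
#include <string.h>

enum vwfi_mode {
    VWFI_TRAP,      /* options B/D: trap guest WFI and block/idle the vCPU */
    VWFI_NATIVE,    /* option A: let the guest execute WFI directly        */
};

static enum vwfi_mode vwfi = VWFI_TRAP;

/* Called with the value of a hypothetical "vwfi=" boot parameter. */
static void parse_vwfi(const char *s)
{
    if ( strcmp(s, "native") == 0 )
        vwfi = VWFI_NATIVE;
    else
        vwfi = VWFI_TRAP;    /* default: keep trapping */
}

The interesting part is of course not the parsing, but what the trap
handler and the HCR setup do with the chosen mode.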

 -George



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-21 13:46                       ` George Dunlap
@ 2017-02-21 15:07                         ` Dario Faggioli
  2017-02-21 17:49                           ` Stefano Stabellini
  2017-02-21 15:14                         ` Julien Grall
  1 sibling, 1 reply; 39+ messages in thread
From: Dario Faggioli @ 2017-02-21 15:07 UTC (permalink / raw)
  To: George Dunlap, Julien Grall, Stefano Stabellini
  Cc: edgar.iglesias, george.dunlap, nd, Punit Agrawal, xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 2423 bytes --]

On Tue, 2017-02-21 at 13:46 +0000, George Dunlap wrote:
>
> A.  Don't trap guest WFI at all -- allow it to 'halt' in
> moderate-power-but-ready-for-interrupt mode.
> 
> [..]
>
> A is safe because the scheduler should already have set a timer to
> break
> out of it if necessary.  The only potential issue here is that the
> guest
> is burning its credits, meaning that other vcpus with something
> potentially useful to do aren't being allowed to run; and then later
> when this vcpu has something useful to do it may be prevented from
> running because of low credits.  (This may be what Dario means when
> he
> says it "breaks scheduling").
> 
Are you also referring to the case when there are fewer vCPUs around
than the host has pCPUs (and, ideally, all vCPUs are pinned 1:1 to a
pCPU)? If yes, I agree that we're probably fine, but we have to check
and enforce all this to be the case.

If no, think of a situation where there is 1 vCPU running on a pCPU and
3 vCPUs in the runqueue (it may be a per-CPU Credit1 runqueue or a
Credit2 runqueue shared among some pCPUs). If the running vCPU goes
idle, let's say with WFI, we _don't_ want the pCPU to enter either
moderate or deep sleep; we want to pick up the first of the 3 other
vCPUs that are waiting in the runqueue.

This is what I mean when I say "breaks scheduling". :-)

Oh, actually, if --which I only now realize may be what you are
referring to, since you're talking about "guest burning its credits"--
you let the vCPU put the pCPU to sleep *but*, when it wakes up (or when
the scheduler runs again for whatever reason), you charge it for all
the time the pCPU was actually idle/sleeping, well, that may
actually not break scheduling, or cause disruption to the service of
other vCPUs... But indeed I'd consider that a rather counterintuitive
behavior.

In fact, it'd mean that the guest has issued WFI because it wanted to
sleep, and we do put it to sleep. But when it wakes up, we treat it like
it had busy-waited.

What would be the benefit of this? That we don't context switch (either
to idle or to someone else)?

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-21 13:46                       ` George Dunlap
  2017-02-21 15:07                         ` Dario Faggioli
@ 2017-02-21 15:14                         ` Julien Grall
  2017-02-21 16:59                           ` George Dunlap
  2017-02-21 18:03                           ` Stefano Stabellini
  1 sibling, 2 replies; 39+ messages in thread
From: Julien Grall @ 2017-02-21 15:14 UTC (permalink / raw)
  To: George Dunlap, Dario Faggioli, Stefano Stabellini
  Cc: edgar.iglesias, george.dunlap, nd, Punit Agrawal, xen-devel

Hi George,

On 21/02/17 13:46, George Dunlap wrote:
> I think our options look like:

Thank you for the summary of the options!

>
> A.  Don't trap guest WFI at all -- allow it to 'halt' in
> moderate-power-but-ready-for-interrupt mode.
>
> B. Trap guest WFI and block normally.
>
> C. Trap guest WFI and poll instead of idling.
>
> D. Trap guest WFI and take that as a hint to "idle" in a non-deep sleep
> state (perhaps, for instance, using the WFI instruction).
>
> A is safe because the scheduler should already have set a timer to break
> out of it if necessary.  The only potential issue here is that the guest
> is burning its credits, meaning that other vcpus with something
> potentially useful to do aren't being allowed to run; and then later
> when this vcpu has something useful to do it may be prevented from
> running because of low credits.  (This may be what Dario means when he
> says it "breaks scheduling").
>
> B, C, and D have the advantage that the guest will not be charged for
> credits, and other useful work can be done if it's possible.
>
> B and C have the disadvantage that the effect will be significantly
> different under Xen than on real hardware: B will mean it will go into a
> deep sleep state (and thus not be as responsive as on real hardare); C
> has the disadvantage that there won't be any significant power savings.

I'd like to correct one misunderstanding here. Today the idle_loop on
ARM is doing a WFI. This is not a deep-sleep state; it is fairly quick
to come back from. What really costs is the context switch of the vCPU
state during blocking.
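
To make the cost concrete, here is a rough sketch of that path; every
name below is an illustrative stand-in, not Xen's actual code:

/* Illustrative stand-ins only -- not Xen's real functions. */
struct vcpu;
void vcpu_mark_blocked(struct vcpu *v);   /* "vcpu blocked" step          */
void save_vcpu_state(struct vcpu *v);     /* sysregs, GIC state, FPU, ... */
void restore_vcpu_state(struct vcpu *v);
void run_idle_loop(void);                 /* pCPU sits in WFI here        */

/* Guest executes WFI and Xen traps it: */
void on_trapped_wfi(struct vcpu *v)
{
    vcpu_mark_blocked(v);
    save_vcpu_state(v);      /* full context switch out of the guest */
    run_idle_loop();         /* until an interrupt arrives           */
}

/* An interrupt for the guest arrives: */
void on_wakeup(struct vcpu *v)
{
    restore_vcpu_state(v);   /* full context switch back in: this
                              * save/restore pair is the cost when
                              * only one vCPU runs per pCPU          */
}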

So I think B and D are the same. Or did you expect D to not switch to 
the idle vCPU?

Note, it is possible to implement deeper sleep states on ARM, either
via PSCI or platform-specific code.

>
> Having the ability for an administrator to choose between A and D seems
> like it would be the most useful.

Cheers,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-21 12:30                     ` Julien Grall
  2017-02-21 13:46                       ` George Dunlap
@ 2017-02-21 16:51                       ` Dario Faggioli
  2017-02-21 17:39                         ` Stefano Stabellini
  1 sibling, 1 reply; 39+ messages in thread
From: Dario Faggioli @ 2017-02-21 16:51 UTC (permalink / raw)
  To: Julien Grall, Stefano Stabellini
  Cc: edgar.iglesias, george.dunlap, nd, Punit Agrawal, xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 2141 bytes --]

On Tue, 2017-02-21 at 12:30 +0000, Julien Grall wrote:
> On 21/02/2017 09:09, Dario Faggioli wrote:
> > Well, TBH, we still are not entirely sure who the culprit is for
> > high
> > latency. There are spikes in Credit2, and I'm investigating that.
> > But
> > apart from them? I think we need other numbers with which we can
> > compare the numbers that Stefano has collected.
> 
> I think the problem is because we save/restore the vCPU state when 
> switching to the idle vCPU.
> 
That may well be. Or at least, that may well be part of the problem. I
don't know enough of ARM to know whether it's the predominant cause of
high latencies or not.

On x86, on Linux, polling is used to prevent the CPU from going into deep
C-states. I was assuming things to be similar on ARM, and that's the
reason why I thought introducing a polling mode could have been useful
(although wasteful).

But you guys are the ones that know whether or not that is the case
(and in another email, you seem to say it's not).

> Let say the only 1 vCPU can run on the pCPU, when the vCPU is issuing
> a 
> WFI the following steps will happen:
>       * WFI trapped and vcpu blocked
>       * save vCPU state
>       * run idle_loop
> -> Interrupt incoming for the guest
>       * restore vCPU state
>       * back to the guest
> 
> Saving/restoring on ARM requires to context switch all the state of
> the 
> VM (this is not saved in memory when entering in the hypervisor).
> This 
> include things like system register, interrupt controller state,
> FPU...
> 
Yes. In fact, on x86, we have what we call 'lazy context switch', which
deals specifically with some aspect of this situation.

Indeed it seems like implementing the same on ARM --if you don't have
it already, and if possible-- would be useful in this case too.
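
For illustration only, a minimal sketch of that lazy idea with made-up
names: defer the expensive save when switching to the idle vCPU, and skip
the restore entirely if the same guest vCPU comes straight back.

struct vcpu;
int  is_idle_vcpu(const struct vcpu *v);
void save_vcpu_state(struct vcpu *v);      /* sysregs, GIC, FPU, ... */
void restore_vcpu_state(struct vcpu *v);

static struct vcpu *lazy_owner;  /* guest whose state is still live on the pCPU */

void context_switch(struct vcpu *prev, struct vcpu *next)
{
    if ( is_idle_vcpu(next) )
    {
        lazy_owner = prev;       /* leave prev's state in the registers */
        return;
    }

    if ( lazy_owner == next )    /* idle -> same vCPU: nothing to do */
    {
        lazy_owner = NULL;
        return;
    }

    if ( lazy_owner != NULL )    /* a different vCPU is coming in: flush now */
    {
        save_vcpu_state(lazy_owner);
        lazy_owner = NULL;
    }
    else if ( !is_idle_vcpu(prev) )
        save_vcpu_state(prev);

    restore_vcpu_state(next);
}

Whether something like this is feasible on ARM depends on which state
(GIC, FPU, system registers) can safely stay live while the idle vCPU
runs, which is exactly the question being discussed here.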

Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-21 15:14                         ` Julien Grall
@ 2017-02-21 16:59                           ` George Dunlap
  2017-02-21 18:03                           ` Stefano Stabellini
  1 sibling, 0 replies; 39+ messages in thread
From: George Dunlap @ 2017-02-21 16:59 UTC (permalink / raw)
  To: Julien Grall, Dario Faggioli, Stefano Stabellini
  Cc: edgar.iglesias, george.dunlap, nd, Punit Agrawal, xen-devel

On 21/02/17 15:14, Julien Grall wrote:
> Hi George,
> 
> On 21/02/17 13:46, George Dunlap wrote:
>> I think our options look like:
> 
> Thank you for the summary of the options!
> 
>>
>> A.  Don't trap guest WFI at all -- allow it to 'halt' in
>> moderate-power-but-ready-for-interrupt mode.
>>
>> B. Trap guest WFI and block normally.
>>
>> C. Trap guest WFI and poll instead of idling.
>>
>> D. Trap guest WFI and take that as a hint to "idle" in a non-deep sleep
>> state (perhaps, for instance, using the WFI instruction).
>>
>> A is safe because the scheduler should already have set a timer to break
>> out of it if necessary.  The only potential issue here is that the guest
>> is burning its credits, meaning that other vcpus with something
>> potentially useful to do aren't being allowed to run; and then later
>> when this vcpu has something useful to do it may be prevented from
>> running because of low credits.  (This may be what Dario means when he
>> says it "breaks scheduling").
>>
>> B, C, and D have the advantage that the guest will not be charged for
>> credits, and other useful work can be done if it's possible.
>>
>> B and C have the disadvantage that the effect will be significantly
>> different under Xen than on real hardware: B will mean it will go into a
>> deep sleep state (and thus not be as responsive as on real hardare); C
>> has the disadvantage that there won't be any significant power savings.
> 
> I'd like to correct one misunderstanding here. Today the idle_loop on
> ARM is doing a WFI. This is not a deep-sleep state, it is fairly quite
> quick to come back. What really cost if the context switch of the state
> of the vCPU during blocking.
> 
> So I think B and D are the same. Or did you expect D to not switch to
> the idle vCPU?
> 
> Note, it is possible to implement more deep-sleep state on ARM either
> via PSCI or platform specific code.

Oh, right; so it sounds a bit as if WFI is ARM's version of x86 HLT.  I
thought it was more special. :-)

Things get a bit tricky because one of the purposes of a hypervisor is
to deal with more advanced hardware so that the guest OS doesn't have
to.  For instance, it would make sense to have simple guest OSes that
only know how to do WFI, and then have Xen have the smarts to know if
and when to go into a deeper sleep state.  So you wouldn't normally want
to have WFI be a hint *not* to go into a deep sleep state, unless you
were sure that nearly all your guest operating systems would know how to
say "actually go ahead into a deep sleep state", and said that by default.

 -George


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-21 16:51                       ` Dario Faggioli
@ 2017-02-21 17:39                         ` Stefano Stabellini
  0 siblings, 0 replies; 39+ messages in thread
From: Stefano Stabellini @ 2017-02-21 17:39 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: edgar.iglesias, Stefano Stabellini, george.dunlap, Punit Agrawal,
	Julien Grall, xen-devel, nd

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2383 bytes --]

On Tue, 21 Feb 2017, Dario Faggioli wrote:
> On Tue, 2017-02-21 at 12:30 +0000, Julien Grall wrote:
> > On 21/02/2017 09:09, Dario Faggioli wrote:
> > > Well, TBH, we still are not entirely sure who the culprit is for
> > > high
> > > latency. There are spikes in Credit2, and I'm investigating that.
> > > But
> > > apart from them? I think we need other numbers with which we can
> > > compare the numbers that Stefano has collected.
> > 
> > I think the problem is because we save/restore the vCPU state when 
> > switching to the idle vCPU.
> > 
> That may well be. Or at least, that may well be part of the problem. I
> don't know enough of ARM to know whether it's the predominant cause of
> high latencies or not.
> 
> On x86, on Linux, polling is used to prevent the CPU to go in deep
> C-states. I was assuming things to be similar on ARM, and that's the
> reason why I thought introducing a polling mode could have been useful
> (although wasteful).
> 
> But you guys are the ones that know whether or not that is the case
> (and in another email, you seem to say it's not).

The total cost of context switching is about 1100ns, which is
significant.  The remaining 1300ns difference between vwfi=sleep and
vwfi=idle is due to sched_op(wait) and sched_op(schedule).

As I mentioned, I have a patch to zero the context switch time when we
go back and forth between a regular vcpu and the idle vcpu, but it's a
complex, ugly patch that I don't know if we want to maintain.


> > Let say the only 1 vCPU can run on the pCPU, when the vCPU is issuing
> > a 
> > WFI the following steps will happen:
> >       * WFI trapped and vcpu blocked
> >       * save vCPU state
> >       * run idle_loop
> > -> Interrupt incoming for the guest
> >       * restore vCPU state
> >       * back to the guest
> > 
> > Saving/restoring on ARM requires to context switch all the state of
> > the 
> > VM (this is not saved in memory when entering in the hypervisor).
> > This 
> > include things like system register, interrupt controller state,
> > FPU...
> > 
> Yes. In fact, on x86, we have what we call 'lazy context switch', which
> deals specifically with some aspect of this situation.
> 
> Indeed it seems like implementing the same on ARM --if you don't have
> it already, and if possible-- would be useful in this case too.

This sounds like what I have.

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-21 15:07                         ` Dario Faggioli
@ 2017-02-21 17:49                           ` Stefano Stabellini
  2017-02-21 17:56                             ` Julien Grall
  2017-02-21 18:17                             ` George Dunlap
  0 siblings, 2 replies; 39+ messages in thread
From: Stefano Stabellini @ 2017-02-21 17:49 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: edgar.iglesias, Stefano Stabellini, george.dunlap, Punit Agrawal,
	George Dunlap, Julien Grall, xen-devel, nd

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3134 bytes --]

On Tue, 21 Feb 2017, Dario Faggioli wrote:
> On Tue, 2017-02-21 at 13:46 +0000, George Dunlap wrote:
> >
> > A.  Don't trap guest WFI at all -- allow it to 'halt' in
> > moderate-power-but-ready-for-interrupt mode.
> > 
> > [..]
> >
> > A is safe because the scheduler should already have set a timer to
> > break
> > out of it if necessary.  The only potential issue here is that the
> > guest
> > is burning its credits, meaning that other vcpus with something
> > potentially useful to do aren't being allowed to run; and then later
> > when this vcpu has something useful to do it may be prevented from
> > running because of low credits.  (This may be what Dario means when
> > he
> > says it "breaks scheduling").
> > 
> Are you also referring to the case when there are less vCPUs around
> than the host has pCPUs (and, ideally, all vCPUs are pinned 1:1 to a
> pCPU)? If yes, I agree that we're probably fine, but we have to check
> and enforce all this to be the case.
> 
> If no, think at a situation where there is 1 vCPU running on a pCPU and
> 3 vCPUs in the runqueue (it may be a per-CPU Credit1 runqueue or a
> shared among some pCPUs Credit2 runqueue). If the running vCPU goes
> idle, let's say with WFI, we _don't_ want the pCPU to enter neither
> moderate nor deep sleep, we want to pick up the first of the 3 other
> vCPUs that are waiting in the runqueue.
> 
> This is what I mean when I say "breaks scheduling". :-)

That's right, I cannot see how this can be made to work correctly.


> Oh, actually, if --which I only now realize may be what you are
> referring to, since you're talking about "guest burning its credits"--
> you let the vCPU put the pCPU to sleep *but*, when it wakes up (or when
> the scheduler runs again for whatever reason), you charge to it for all
> the time the the pCPU was actually idle/sleeping, well, that may
> actually  not break scheduling, or cause disruption to the service of
> other vCPUs.... But indeed I'd consider it rather counter intuitive a
> behavior.

How can this be safe? There could be no interrupts programmed to wake up
the pcpu at all. In fact, I don't think today there would be any, unless
we set one up in Xen for the specific purpose of interrupting the pcpu
sleep.

I don't know the inner workings of the scheduler, but does it always send
an interrupt to the other pcpu to schedule something?

What if there are 2 vcpus pinned to the same pcpu? This cannot be fair.

Obviously this behavior cannot be the default. But even if we introduce
it as a command line option, when the user enables it, it can cause the
whole system to hang if she is not careful. Isn't that right? We have no
way to detect when we can do this safely.

I prefer an option such as vwfi=poll, where in the worst case the user
burns more power but there are no unexpected hangs.


> In fact, it'd mean that the guest has issued WFI because he wanted to
> sleep and we do put it to sleep. But when it wakes up, we treat it like
> it had busy waited.
>
> What would be the benefit of this? That we don't context switch (either
> to idle or to someone else)?

Yes, that would be the benefit.

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-21 17:49                           ` Stefano Stabellini
@ 2017-02-21 17:56                             ` Julien Grall
  2017-02-21 18:30                               ` Stefano Stabellini
  2017-02-21 18:17                             ` George Dunlap
  1 sibling, 1 reply; 39+ messages in thread
From: Julien Grall @ 2017-02-21 17:56 UTC (permalink / raw)
  To: Stefano Stabellini, Dario Faggioli
  Cc: edgar.iglesias, george.dunlap, Punit Agrawal, George Dunlap,
	xen-devel, nd

Hi Stefano,

On 21/02/17 17:49, Stefano Stabellini wrote:
> On Tue, 21 Feb 2017, Dario Faggioli wrote:
>> On Tue, 2017-02-21 at 13:46 +0000, George Dunlap wrote:
>> Oh, actually, if --which I only now realize may be what you are
>> referring to, since you're talking about "guest burning its credits"--
>> you let the vCPU put the pCPU to sleep *but*, when it wakes up (or when
>> the scheduler runs again for whatever reason), you charge to it for all
>> the time the the pCPU was actually idle/sleeping, well, that may
>> actually  not break scheduling, or cause disruption to the service of
>> other vCPUs.... But indeed I'd consider it rather counter intuitive a
>> behavior.
>
> How can this be safe? There could be no interrupts programmed to wake up
> the pcpu at all. In fact, I don't think today there would be any, unless
> we set one up in Xen for the specific purpose of interrupting the pcpu
> sleep.
>
> I don't know the inner working of the scheduler, but does it always send
> an interrupt to other pcpu to schedule something?

You still seem to assume that WFI/WFE is the only way to get a vCPU
unscheduled. If that were the case it would be utterly wrong, because you
cannot expect a guest to use them.

>
> What if there are 2 vcpu pinned to the same pcpu? This cannot be fair.

Why wouldn't it be fair? This is the same situation as a guest vCPU not 
using WFI/WFE.

Cheers,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-21 15:14                         ` Julien Grall
  2017-02-21 16:59                           ` George Dunlap
@ 2017-02-21 18:03                           ` Stefano Stabellini
  2017-02-21 18:24                             ` Julien Grall
  1 sibling, 1 reply; 39+ messages in thread
From: Stefano Stabellini @ 2017-02-21 18:03 UTC (permalink / raw)
  To: Julien Grall
  Cc: edgar.iglesias, Stefano Stabellini, george.dunlap,
	Dario Faggioli, Punit Agrawal, George Dunlap, xen-devel, nd

On Tue, 21 Feb 2017, Julien Grall wrote:
> Hi George,
> 
> On 21/02/17 13:46, George Dunlap wrote:
> > I think our options look like:
> 
> Thank you for the summary of the options!
> 
> > 
> > A.  Don't trap guest WFI at all -- allow it to 'halt' in
> > moderate-power-but-ready-for-interrupt mode.
> > 
> > B. Trap guest WFI and block normally.
> > 
> > C. Trap guest WFI and poll instead of idling.
> > 
> > D. Trap guest WFI and take that as a hint to "idle" in a non-deep sleep
> > state (perhaps, for instance, using the WFI instruction).
> > 
> > A is safe because the scheduler should already have set a timer to break
> > out of it if necessary.  The only potential issue here is that the guest
> > is burning its credits, meaning that other vcpus with something
> > potentially useful to do aren't being allowed to run; and then later
> > when this vcpu has something useful to do it may be prevented from
> > running because of low credits.  (This may be what Dario means when he
> > says it "breaks scheduling").
> > 
> > B, C, and D have the advantage that the guest will not be charged for
> > credits, and other useful work can be done if it's possible.
> > 
> > B and C have the disadvantage that the effect will be significantly
> > different under Xen than on real hardware: B will mean it will go into a
> > deep sleep state (and thus not be as responsive as on real hardare); C
> > has the disadvantage that there won't be any significant power savings.
> 
> I'd like to correct one misunderstanding here. Today the idle_loop on ARM is
> doing a WFI. This is not a deep-sleep state, it is fairly quite quick to come
> back. What really cost if the context switch of the state of the vCPU during
> blocking.
> 
> So I think B and D are the same. Or did you expect D to not switch to the idle
> vCPU?

I think that B and D are the same only in the scenario where each vcpu
is pinned to a different pcpu. However, we cannot automatically
configure this scenario and we cannot detect it when it happens.

Discussions on how to make this specific scenario better are fruitless
until we can detect and configure Xen for it. Your suggestion to do a
real wfi when the guest issues a virtual wfi is a good improvement at
that point in time. Now, we cannot do it, because we don't know when it
happens.

If Dario and George come up with a way to detect it or configure it, I
volunteer to write the wfi patch for Xen ARM. Until then, we can only
decide if it is worth having a vwfi option, either system wide or per
domain, to change vwfi behavior from sleep to poll. When we have a way
to configure 1vcpu=1pcpu, we'll be able to add one more option, for
example vwfi=passthrough, that will allow a vcpu to perform a physical
wfi, leading to optimal power saving and wake up times.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-21 17:49                           ` Stefano Stabellini
  2017-02-21 17:56                             ` Julien Grall
@ 2017-02-21 18:17                             ` George Dunlap
  2017-02-22 16:40                               ` Dario Faggioli
  1 sibling, 1 reply; 39+ messages in thread
From: George Dunlap @ 2017-02-21 18:17 UTC (permalink / raw)
  To: Stefano Stabellini, Dario Faggioli
  Cc: edgar.iglesias, george.dunlap, Punit Agrawal, Julien Grall,
	xen-devel, nd

On 21/02/17 17:49, Stefano Stabellini wrote:
> On Tue, 21 Feb 2017, Dario Faggioli wrote:
>> On Tue, 2017-02-21 at 13:46 +0000, George Dunlap wrote:
>>>
>>> A.  Don't trap guest WFI at all -- allow it to 'halt' in
>>> moderate-power-but-ready-for-interrupt mode.
>>>
>>> [..]
>>>
>>> A is safe because the scheduler should already have set a timer to
>>> break
>>> out of it if necessary.  The only potential issue here is that the
>>> guest
>>> is burning its credits, meaning that other vcpus with something
>>> potentially useful to do aren't being allowed to run; and then later
>>> when this vcpu has something useful to do it may be prevented from
>>> running because of low credits.  (This may be what Dario means when
>>> he
>>> says it "breaks scheduling").
>>>
>> Are you also referring to the case when there are less vCPUs around
>> than the host has pCPUs (and, ideally, all vCPUs are pinned 1:1 to a
>> pCPU)? If yes, I agree that we're probably fine, but we have to check
>> and enforce all this to be the case.
>>
>> If no, think at a situation where there is 1 vCPU running on a pCPU and
>> 3 vCPUs in the runqueue (it may be a per-CPU Credit1 runqueue or a
>> shared among some pCPUs Credit2 runqueue). If the running vCPU goes
>> idle, let's say with WFI, we _don't_ want the pCPU to enter neither
>> moderate nor deep sleep, we want to pick up the first of the 3 other
>> vCPUs that are waiting in the runqueue.
>>
>> This is what I mean when I say "breaks scheduling". :-)
> 
> That's right, I cannot see how this can be made to work correctly.
> 
> 
>> Oh, actually, if --which I only now realize may be what you are
>> referring to, since you're talking about "guest burning its credits"--
>> you let the vCPU put the pCPU to sleep *but*, when it wakes up (or when
>> the scheduler runs again for whatever reason), you charge to it for all
>> the time the the pCPU was actually idle/sleeping, well, that may
>> actually  not break scheduling, or cause disruption to the service of
>> other vCPUs.... But indeed I'd consider it rather counter intuitive a
>> behavior.
> 
> How can this be safe? There could be no interrupts programmed to wake up
> the pcpu at all. In fact, I don't think today there would be any, unless
> we set one up in Xen for the specific purpose of interrupting the pcpu
> sleep.
> 
> I don't know the inner working of the scheduler, but does it always send
> an interrupt to other pcpu to schedule something?

Letting a guest call WFI is as safe as letting a guest `while(1);`.  Xen
*always* has to set a timer interrupt before switching to the guest
context to make sure that it can pre-empt the vcpu -- otherwise any vcpu
could perform a DoS by simply continually executing instructions.

As long as the guest can't disable host interrupts, and as long as the
scheduler interrupt will break out of a guest WFI into Xen, it should be
safe.
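
In other words, the safety argument relies only on the preemption timer
that is armed before entering guest context; roughly, with generic names
rather than the scheduler's real interface:

struct vcpu;
void arm_preemption_timer(unsigned long ns);  /* raises an IRQ on this pCPU   */
void enter_guest(struct vcpu *v);             /* runs until the next trap/IRQ */

void run_vcpu(struct vcpu *v, unsigned long timeslice_ns)
{
    arm_preemption_timer(timeslice_ns);  /* always armed before running      */
    enter_guest(v);                      /* a guest sitting in WFI (or in
                                          * while(1)) is interrupted by that
                                          * timer at the latest              */
}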

> What if there are 2 vcpu pinned to the same pcpu? This cannot be fair.

From the scheduler's perspective, the WFI would be the same as the
`while(1)`.  If you had two vcpus doing while(1) on the same pcpu, the
credit2 scheduler would (approximately) let one run for 2ms, then the
other for 2ms, and so on, each getting 50%.  If instead you have one
doing a WFI, then the credit2 scheduler would do the same -- let the
first one do WFI for 2ms, then preempt it and let the other one spin for
2ms, then let the first one WFI for 2ms, &c.

(Obviously the exact behaviour would be significantly more random.)

>> What would be the benefit of this? That we don't context switch (either
>> to idle or to someone else)?
> 
> Yes, that would be the benefit.

If ARM had the equivalent of posted interrupts, then a pinned guest
could efficiently wait for interrupts with no additional latency from
virtualization without having to poll.

(Speaking of which -- that could be an interesting optimization even on
x86... if a pcpu has no vcpus waiting to run, then disable HLT exit.)

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-21 18:03                           ` Stefano Stabellini
@ 2017-02-21 18:24                             ` Julien Grall
  0 siblings, 0 replies; 39+ messages in thread
From: Julien Grall @ 2017-02-21 18:24 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: edgar.iglesias, george.dunlap, Dario Faggioli, Punit Agrawal,
	George Dunlap, xen-devel, nd

Hi Stefano,

On 21/02/17 18:03, Stefano Stabellini wrote:
> On Tue, 21 Feb 2017, Julien Grall wrote:
>> Hi George,
>>
>> On 21/02/17 13:46, George Dunlap wrote:
>>> I think our options look like:
>>
>> Thank you for the summary of the options!
>>
>>>
>>> A.  Don't trap guest WFI at all -- allow it to 'halt' in
>>> moderate-power-but-ready-for-interrupt mode.
>>>
>>> B. Trap guest WFI and block normally.
>>>
>>> C. Trap guest WFI and poll instead of idling.
>>>
>>> D. Trap guest WFI and take that as a hint to "idle" in a non-deep sleep
>>> state (perhaps, for instance, using the WFI instruction).
>>>
>>> A is safe because the scheduler should already have set a timer to break
>>> out of it if necessary.  The only potential issue here is that the guest
>>> is burning its credits, meaning that other vcpus with something
>>> potentially useful to do aren't being allowed to run; and then later
>>> when this vcpu has something useful to do it may be prevented from
>>> running because of low credits.  (This may be what Dario means when he
>>> says it "breaks scheduling").
>>>
>>> B, C, and D have the advantage that the guest will not be charged for
>>> credits, and other useful work can be done if it's possible.
>>>
>>> B and C have the disadvantage that the effect will be significantly
>>> different under Xen than on real hardware: B will mean it will go into a
>>> deep sleep state (and thus not be as responsive as on real hardare); C
>>> has the disadvantage that there won't be any significant power savings.
>>
>> I'd like to correct one misunderstanding here. Today the idle_loop on ARM is
>> doing a WFI. This is not a deep-sleep state, it is fairly quite quick to come
>> back. What really cost if the context switch of the state of the vCPU during
>> blocking.
>>
>> So I think B and D are the same. Or did you expect D to not switch to the idle
>> vCPU?
>
> I think that B and D are the same only in the scenario where each vcpu
> is pinned to a different pcpu. However, we cannot automatically
> configure this scenario and we cannot detect it when it happens.

Again, there is no deep sleep state supported on Xen today. The idle 
loop on Xen is a simple WFI. So there is *no* difference between B and D 
no matter the configuration: both options trap WFI and block the vCPU.

>
> Discussions on how to make this specific scenario better are fruitless
> until we can detect and configure Xen for it. Your suggestion to do a
> real wfi when the guest issues a virtual wfi is a good improvement at
> that point in time. Now, we cannot do it, because we don't know when it
> happens.
>
> If Dario and George come up with a way to detect it or configure it, I
> volunteer to write the wfi patch for Xen ARM. Until then, we can only
> decide if it is worth having a vwfi option, either system wide or per
> domain, to change vwfi behavior from sleep to poll. When we have a way
> to configure 1vcpu=1pcpu, we'll be able to add one more option, for
> example vwfi=passthrough, that will allow a vcpu to perform a physical
> wfi, leading to optimal power saving and wake up times.

Please read my answer to Dario
(<14575011-0042-8940-c19f-2482136ff91c@foss.arm.com>) regarding why a
baremetal app will use WFI and what the expectations of WFI are.

I will summarize my e-mail with that:

My shiny baremetal app is running nicely without virtualization, with
good power usage and good interrupt latency. Now I want to use
virtualization to abstract the hardware; it will have a small overhead
because of virtualization, but that is ok. Now you are telling me that if
I want good interrupt latency, I have to give up on power. I would likely
give up on using virtualization in this case; better to adapt my app to
the newer hardware. Do you really expect users to make a different
decision?

Cheers,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-21 17:56                             ` Julien Grall
@ 2017-02-21 18:30                               ` Stefano Stabellini
  2017-02-21 19:20                                 ` Julien Grall
  0 siblings, 1 reply; 39+ messages in thread
From: Stefano Stabellini @ 2017-02-21 18:30 UTC (permalink / raw)
  To: Julien Grall
  Cc: edgar.iglesias, Stefano Stabellini, george.dunlap,
	Dario Faggioli, Punit Agrawal, George Dunlap, xen-devel, nd

On Tue, 21 Feb 2017, Julien Grall wrote:
> Hi Stefano,
> 
> On 21/02/17 17:49, Stefano Stabellini wrote:
> > On Tue, 21 Feb 2017, Dario Faggioli wrote:
> > > On Tue, 2017-02-21 at 13:46 +0000, George Dunlap wrote:
> > > Oh, actually, if --which I only now realize may be what you are
> > > referring to, since you're talking about "guest burning its credits"--
> > > you let the vCPU put the pCPU to sleep *but*, when it wakes up (or when
> > > the scheduler runs again for whatever reason), you charge to it for all
> > > the time the the pCPU was actually idle/sleeping, well, that may
> > > actually  not break scheduling, or cause disruption to the service of
> > > other vCPUs.... But indeed I'd consider it rather counter intuitive a
> > > behavior.
> > 
> > How can this be safe? There could be no interrupts programmed to wake up
> > the pcpu at all. In fact, I don't think today there would be any, unless
> > we set one up in Xen for the specific purpose of interrupting the pcpu
> > sleep.
> > 
> > I don't know the inner working of the scheduler, but does it always send
> > an interrupt to other pcpu to schedule something?
> 
> You still seem to assume that WFI/WFE is the only way to get a vCPU
> unscheduled. If that was the case it would be utterly wrong because you cannot
> expect a guest to use them.
> 
> > 
> > What if there are 2 vcpu pinned to the same pcpu? This cannot be fair.
> 
> Why wouldn't it be fair? This is the same situation as a guest vCPU not using
> WFI/WFE.

I read your suggestion as trapping WFI in Xen, then, depending on
settings, executing WFI in the Xen trap handler to idle the pcpu. That
doesn't work. But I take it you suggested not trapping wfi (remove
HCR_TWI), executing the instruction in guest context. That is what we
used to do in the early days (before a780f750). It should be safe and
possibly even quick. I'll rerun the numbers and let you know.
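
For reference, a sketch of what removing the trap amounts to; the bit
positions match the architectural HCR_EL2.TWI/TWE definitions, but the
helper and the place it would be called from are assumptions of this
sketch:

/* Architectural HCR_EL2 trap bits. */
#define HCR_TWI  (1UL << 13)   /* trap guest WFI */
#define HCR_TWE  (1UL << 14)   /* trap guest WFE */

/* Hypothetical helper: choose whether guest WFI/WFE trap to Xen or
 * execute natively, e.g. based on a vwfi-style option. */
static unsigned long vwfi_adjust_hcr(unsigned long hcr, int trap_wfi)
{
    if ( trap_wfi )
        hcr |= HCR_TWI | HCR_TWE;     /* current behaviour: trap, then block */
    else
        hcr &= ~(HCR_TWI | HCR_TWE);  /* execute WFI/WFE in guest context    */

    return hcr;
}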

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-21 18:30                               ` Stefano Stabellini
@ 2017-02-21 19:20                                 ` Julien Grall
  2017-02-22  4:21                                   ` Edgar E. Iglesias
  0 siblings, 1 reply; 39+ messages in thread
From: Julien Grall @ 2017-02-21 19:20 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: edgar.iglesias, george.dunlap, Dario Faggioli, Punit Agrawal,
	George Dunlap, xen-devel, nd



On 21/02/2017 18:30, Stefano Stabellini wrote:
> On Tue, 21 Feb 2017, Julien Grall wrote:
>> Hi Stefano,
>>
>> On 21/02/17 17:49, Stefano Stabellini wrote:
>>> On Tue, 21 Feb 2017, Dario Faggioli wrote:
>>>> On Tue, 2017-02-21 at 13:46 +0000, George Dunlap wrote:
>>>> Oh, actually, if --which I only now realize may be what you are
>>>> referring to, since you're talking about "guest burning its credits"--
>>>> you let the vCPU put the pCPU to sleep *but*, when it wakes up (or when
>>>> the scheduler runs again for whatever reason), you charge to it for all
>>>> the time the the pCPU was actually idle/sleeping, well, that may
>>>> actually  not break scheduling, or cause disruption to the service of
>>>> other vCPUs.... But indeed I'd consider it rather counter intuitive a
>>>> behavior.
>>>
>>> How can this be safe? There could be no interrupts programmed to wake up
>>> the pcpu at all. In fact, I don't think today there would be any, unless
>>> we set one up in Xen for the specific purpose of interrupting the pcpu
>>> sleep.
>>>
>>> I don't know the inner working of the scheduler, but does it always send
>>> an interrupt to other pcpu to schedule something?
>>
>> You still seem to assume that WFI/WFE is the only way to get a vCPU
>> unscheduled. If that was the case it would be utterly wrong because you cannot
>> expect a guest to use them.
>>
>>>
>>> What if there are 2 vcpu pinned to the same pcpu? This cannot be fair.
>>
>> Why wouldn't it be fair? This is the same situation as a guest vCPU not using
>> WFI/WFE.
>
> I read your suggestion as trapping WFI in Xen, then, depending on
> settings, executing WFI in the Xen trap handler to idle the pcpu. That
> doesn't work. But I take you suggested not trapping wfi (remove
> HCR_TWI), executing the instruction in guest context. That is what we
> used to do in the early days (before a780f750). It should be safe and
> possibly even quick. I'll rerun the numbers and let you know.

My first suggestion was to emulate WFI in Xen, which I agree is not safe :).

I think not trapping WFI will have the best performance but may impact 
the credit of the vCPU as mentioned by Dario and George.

Cheers,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-21 19:20                                 ` Julien Grall
@ 2017-02-22  4:21                                   ` Edgar E. Iglesias
  2017-02-22 17:22                                     ` Stefano Stabellini
  0 siblings, 1 reply; 39+ messages in thread
From: Edgar E. Iglesias @ 2017-02-22  4:21 UTC (permalink / raw)
  To: Julien Grall
  Cc: Stefano Stabellini, george.dunlap, Dario Faggioli, Punit Agrawal,
	George Dunlap, xen-devel, nd

On Tue, Feb 21, 2017 at 07:20:29PM +0000, Julien Grall wrote:
> 
> 
> On 21/02/2017 18:30, Stefano Stabellini wrote:
> >On Tue, 21 Feb 2017, Julien Grall wrote:
> >>Hi Stefano,
> >>
> >>On 21/02/17 17:49, Stefano Stabellini wrote:
> >>>On Tue, 21 Feb 2017, Dario Faggioli wrote:
> >>>>On Tue, 2017-02-21 at 13:46 +0000, George Dunlap wrote:
> >>>>Oh, actually, if --which I only now realize may be what you are
> >>>>referring to, since you're talking about "guest burning its credits"--
> >>>>you let the vCPU put the pCPU to sleep *but*, when it wakes up (or when
> >>>>the scheduler runs again for whatever reason), you charge to it for all
> >>>>the time the the pCPU was actually idle/sleeping, well, that may
> >>>>actually  not break scheduling, or cause disruption to the service of
> >>>>other vCPUs.... But indeed I'd consider it rather counter intuitive a
> >>>>behavior.
> >>>
> >>>How can this be safe? There could be no interrupts programmed to wake up
> >>>the pcpu at all. In fact, I don't think today there would be any, unless
> >>>we set one up in Xen for the specific purpose of interrupting the pcpu
> >>>sleep.
> >>>
> >>>I don't know the inner working of the scheduler, but does it always send
> >>>an interrupt to other pcpu to schedule something?
> >>
> >>You still seem to assume that WFI/WFE is the only way to get a vCPU
> >>unscheduled. If that was the case it would be utterly wrong because you cannot
> >>expect a guest to use them.
> >>
> >>>
> >>>What if there are 2 vcpu pinned to the same pcpu? This cannot be fair.
> >>
> >>Why wouldn't it be fair? This is the same situation as a guest vCPU not using
> >>WFI/WFE.
> >
> >I read your suggestion as trapping WFI in Xen, then, depending on
> >settings, executing WFI in the Xen trap handler to idle the pcpu. That
> >doesn't work. But I take you suggested not trapping wfi (remove
> >HCR_TWI), executing the instruction in guest context. That is what we
> >used to do in the early days (before a780f750). It should be safe and
> >possibly even quick. I'll rerun the numbers and let you know.
> 
> My first suggestion was to emulate WFI in Xen, which I agree is not safe :).
> 
> I think not trapping WFI will have the best performance but may impact the
> credit of the vCPU as mentioned by Dario and George.

I agree, wfi in guest context or at least with everything prepared to return to
the current guest would be great.

An option to enable this would work fine for our use-cases. Or if we could
detect at runtime that it's the best approach given the scheduling (i.e.
exclusive vCPU/pCPU pinning), even better.

Cheers,
Edgar


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-21 18:17                             ` George Dunlap
@ 2017-02-22 16:40                               ` Dario Faggioli
  0 siblings, 0 replies; 39+ messages in thread
From: Dario Faggioli @ 2017-02-22 16:40 UTC (permalink / raw)
  To: George Dunlap, Stefano Stabellini
  Cc: edgar.iglesias, george.dunlap, Punit Agrawal, Julien Grall,
	xen-devel, nd


[-- Attachment #1.1: Type: text/plain, Size: 2161 bytes --]

On Tue, 2017-02-21 at 18:17 +0000, George Dunlap wrote:
> On 21/02/17 17:49, Stefano Stabellini wrote:
> > I don't know the inner working of the scheduler, but does it always
> > send
> > an interrupt to other pcpu to schedule something?
> 
> Letting a guest call WFI is as safe as letting a guest
> `while(1);`.  Xen
> *always* has to set a timer interrupt before switching to the guest
> context to make sure that it can pre-empt the vcpu -- otherwise any
> vcpu
> could perform a DoS by simply continually executing instructions.
> 
Yes, ensuring that preemption happens is indeed Xen's responsibility, and
it will indeed happen, as long as interrupts are enabled, as George says.

> > What if there are 2 vcpu pinned to the same pcpu? This cannot be
> > fair.
> 
> From the scheduler's perspective, the WFI would be the same as the
> `while(1)`.  
>
It is, provided you charge the vCPU for the time it spent in WFI, the
same as you'd charge it for time spent in a `while(1)`. This is
*probably* what happens already if we let WFI run on hardware, but I'd
double check and test.

> If you had two vcpus doing while(1) on the same vcpu, the
> credit2 scheduler would (approximately) let one run for 2ms, then the
> other for 2ms, and so on, each getting 50%.  
>
If you're talking about MAX_TIMER, it's 10ms these days. But yes, this
still means 50% each.

> If ARM had the equivalent of posted interrupts, then a pinned guest
> could efficiently wait for interrupts with no additional latency from
> virtualization without having to poll.
> 
> (Speaking of which -- that could be an interesting optimization even
> on
> x86... if a pcpu has no vcpus waiting to run, then disable HLT exit.)
> 
Sorry, this sounds interesting but I'm not sure I understand what you
mean (but let's not hijack this thread, maybe, and talk about it
somewhere else. :-)

Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-22  4:21                                   ` Edgar E. Iglesias
@ 2017-02-22 17:22                                     ` Stefano Stabellini
  2017-02-23  9:19                                       ` Edgar E. Iglesias
  0 siblings, 1 reply; 39+ messages in thread
From: Stefano Stabellini @ 2017-02-22 17:22 UTC (permalink / raw)
  To: Edgar E. Iglesias
  Cc: Stefano Stabellini, george.dunlap, Dario Faggioli, Punit Agrawal,
	George Dunlap, Julien Grall, xen-devel, nd

On Wed, 22 Feb 2017, Edgar E. Iglesias wrote:
> On Tue, Feb 21, 2017 at 07:20:29PM +0000, Julien Grall wrote:
> > 
> > 
> > On 21/02/2017 18:30, Stefano Stabellini wrote:
> > >On Tue, 21 Feb 2017, Julien Grall wrote:
> > >>Hi Stefano,
> > >>
> > >>On 21/02/17 17:49, Stefano Stabellini wrote:
> > >>>On Tue, 21 Feb 2017, Dario Faggioli wrote:
> > >>>>On Tue, 2017-02-21 at 13:46 +0000, George Dunlap wrote:
> > >>>>Oh, actually, if --which I only now realize may be what you are
> > >>>>referring to, since you're talking about "guest burning its credits"--
> > >>>>you let the vCPU put the pCPU to sleep *but*, when it wakes up (or when
> > >>>>the scheduler runs again for whatever reason), you charge to it for all
> > >>>>the time the the pCPU was actually idle/sleeping, well, that may
> > >>>>actually  not break scheduling, or cause disruption to the service of
> > >>>>other vCPUs.... But indeed I'd consider it rather counter intuitive a
> > >>>>behavior.
> > >>>
> > >>>How can this be safe? There could be no interrupts programmed to wake up
> > >>>the pcpu at all. In fact, I don't think today there would be any, unless
> > >>>we set one up in Xen for the specific purpose of interrupting the pcpu
> > >>>sleep.
> > >>>
> > >>>I don't know the inner working of the scheduler, but does it always send
> > >>>an interrupt to other pcpu to schedule something?
> > >>
> > >>You still seem to assume that WFI/WFE is the only way to get a vCPU
> > >>unscheduled. If that was the case it would be utterly wrong because you cannot
> > >>expect a guest to use them.
> > >>
> > >>>
> > >>>What if there are 2 vcpu pinned to the same pcpu? This cannot be fair.
> > >>
> > >>Why wouldn't it be fair? This is the same situation as a guest vCPU not using
> > >>WFI/WFE.
> > >
> > >I read your suggestion as trapping WFI in Xen, then, depending on
> > >settings, executing WFI in the Xen trap handler to idle the pcpu. That
> > >doesn't work. But I take you suggested not trapping wfi (remove
> > >HCR_TWI), executing the instruction in guest context. That is what we
> > >used to do in the early days (before a780f750). It should be safe and
> > >possibly even quick. I'll rerun the numbers and let you know.
> > 
> > My first suggestion was to emulate WFI in Xen, which I agree is not safe :).
> > 
> > I think not trapping WFI will have the best performance but may impact the
> > credit of the vCPU as mentioned by Dario and George.
> 
> I agree, wfi in guest context or at least with everything prepared to return to
> the current guest would be great.

And the new numbers look good:

          AVG     MAX     WARM_MAX
credit1   1850    2650    1950
credit2   1850    2950    1840

We are hitting the same levels as the non-WFI case. Nice!


> An option to enable this would work fine for our use-cases. Or if we could
> at runtime detect that it's the best approach given scheduling (i.e
> exclusive vCPU/pCPU pinning) even better.

The option is easy and we can do it today. We might be able to do
automatic enablement once we have a "nop" scheduler.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] xen/arm: introduce vwfi parameter
  2017-02-22 17:22                                     ` Stefano Stabellini
@ 2017-02-23  9:19                                       ` Edgar E. Iglesias
  0 siblings, 0 replies; 39+ messages in thread
From: Edgar E. Iglesias @ 2017-02-23  9:19 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: george.dunlap, Dario Faggioli, Punit Agrawal, George Dunlap,
	Julien Grall, xen-devel, nd

On Wed, Feb 22, 2017 at 09:22:25AM -0800, Stefano Stabellini wrote:
> On Wed, 22 Feb 2017, Edgar E. Iglesias wrote:
> > On Tue, Feb 21, 2017 at 07:20:29PM +0000, Julien Grall wrote:
> > > 
> > > 
> > > On 21/02/2017 18:30, Stefano Stabellini wrote:
> > > >On Tue, 21 Feb 2017, Julien Grall wrote:
> > > >>Hi Stefano,
> > > >>
> > > >>On 21/02/17 17:49, Stefano Stabellini wrote:
> > > >>>On Tue, 21 Feb 2017, Dario Faggioli wrote:
> > > >>>>On Tue, 2017-02-21 at 13:46 +0000, George Dunlap wrote:
> > > >>>>Oh, actually, if --which I only now realize may be what you are
> > > >>>>referring to, since you're talking about "guest burning its credits"--
> > > >>>>you let the vCPU put the pCPU to sleep *but*, when it wakes up (or when
> > > >>>>the scheduler runs again for whatever reason), you charge it for all
> > > >>>>the time the pCPU was actually idle/sleeping, well, that may
> > > >>>>actually not break scheduling, or cause disruption to the service of
> > > >>>>other vCPUs.... But indeed I'd consider it rather a counterintuitive
> > > >>>>behavior.
> > > >>>
> > > >>>How can this be safe? There could be no interrupts programmed to wake up
> > > >>>the pcpu at all. In fact, I don't think today there would be any, unless
> > > >>>we set one up in Xen for the specific purpose of interrupting the pcpu
> > > >>>sleep.
> > > >>>
> > > >>>I don't know the inner workings of the scheduler, but does it always send
> > > >>>an interrupt to another pcpu to schedule something?
> > > >>
> > > >>You still seem to assume that WFI/WFE is the only way to get a vCPU
> > > >>unscheduled. If that was the case it would be utterly wrong because you cannot
> > > >>expect a guest to use them.
> > > >>
> > > >>>
> > > >>>What if there are 2 vcpus pinned to the same pcpu? This cannot be fair.
> > > >>
> > > >>Why wouldn't it be fair? This is the same situation as a guest vCPU not using
> > > >>WFI/WFE.
> > > >
> > > >I read your suggestion as trapping WFI in Xen, then, depending on
> > > >settings, executing WFI in the Xen trap handler to idle the pcpu. That
> > > >doesn't work. But I take it you suggested not trapping wfi (remove
> > > >HCR_TWI), executing the instruction in guest context. That is what we
> > > >used to do in the early days (before a780f750). It should be safe and
> > > >possibly even quick. I'll rerun the numbers and let you know.
> > > 
> > > My first suggestion was to emulate WFI in Xen, which I agree is not safe :).
> > > 
> > > I think not trapping WFI will have the best performance but may impact the
> > > credit of the vCPU as mentioned by Dario and George.
> > 
> > I agree, wfi in guest context or at least with everything prepared to return to
> > the current guest would be great.
> 
> And the new numbers look good:
> 
>           AVG    MAX    WARM_MAX
> credit1   1850   2650   1950
> credit2   1850   2950   1840
> 
> We are hitting the same levels as the non-WFI case. Nice!

Yeah, very nice :-)

Thanks!
Edgar

> 
> 
> > An option to enable this would work fine for our use-cases. Or, even better,
> > if we could detect at runtime that it's the best approach given the
> > scheduling setup (i.e. exclusive vCPU/pCPU pinning).
> 
> The option is easy and we can do it today. We might be able to do
> automatic enablement once we have a "nop" scheduler.



end of thread, other threads:[~2017-02-23  9:20 UTC | newest]

Thread overview: 39+ messages
     [not found] <1487286292-29502-1-git-send-email-sstabellini@kernel.org>
     [not found] ` <a271394a-6c76-027c-fb08-b3fe775224ba@arm.com>
2017-02-17 22:50   ` [PATCH] xen/arm: introduce vwfi parameter Stefano Stabellini
2017-02-18  1:47     ` Dario Faggioli
2017-02-19 21:27       ` Julien Grall
2017-02-20 10:43         ` George Dunlap
2017-02-20 11:15         ` Dario Faggioli
2017-02-19 21:34     ` Julien Grall
2017-02-20 11:35       ` Dario Faggioli
2017-02-20 18:43         ` Stefano Stabellini
2017-02-20 18:45           ` George Dunlap
2017-02-20 18:49             ` Stefano Stabellini
2017-02-20 18:47       ` Stefano Stabellini
2017-02-20 18:53         ` Julien Grall
2017-02-20 19:20           ` Dario Faggioli
2017-02-20 19:38             ` Julien Grall
2017-02-20 22:53               ` Dario Faggioli
2017-02-21  0:38                 ` Stefano Stabellini
2017-02-21  8:10                   ` Julien Grall
2017-02-21  9:24                     ` Dario Faggioli
2017-02-21 13:04                       ` Julien Grall
2017-02-21  7:59                 ` Julien Grall
2017-02-21  9:09                   ` Dario Faggioli
2017-02-21 12:30                     ` Julien Grall
2017-02-21 13:46                       ` George Dunlap
2017-02-21 15:07                         ` Dario Faggioli
2017-02-21 17:49                           ` Stefano Stabellini
2017-02-21 17:56                             ` Julien Grall
2017-02-21 18:30                               ` Stefano Stabellini
2017-02-21 19:20                                 ` Julien Grall
2017-02-22  4:21                                   ` Edgar E. Iglesias
2017-02-22 17:22                                     ` Stefano Stabellini
2017-02-23  9:19                                       ` Edgar E. Iglesias
2017-02-21 18:17                             ` George Dunlap
2017-02-22 16:40                               ` Dario Faggioli
2017-02-21 15:14                         ` Julien Grall
2017-02-21 16:59                           ` George Dunlap
2017-02-21 18:03                           ` Stefano Stabellini
2017-02-21 18:24                             ` Julien Grall
2017-02-21 16:51                       ` Dario Faggioli
2017-02-21 17:39                         ` Stefano Stabellini
