xen-devel.lists.xenproject.org archive mirror
* Null scheduler and vwfi native problem
@ 2021-01-21 10:54 Anders Törnqvist
  2021-01-21 18:32 ` Dario Faggioli
  2021-01-21 19:16 ` Julien Grall
  0 siblings, 2 replies; 31+ messages in thread
From: Anders Törnqvist @ 2021-01-21 10:54 UTC (permalink / raw)
  To: xen-devel

Hi,

I see a problem with destroy and restart of a domain. Interrupts are not 
available when trying to restart a domain.

The situation seems very similar to the thread "null scheduler bug" 
https://lists.xenproject.org/archives/html/xen-devel/2018-09/msg01213.html.

The target system is an iMX8-based ARM board and Xen is a 4.13.0 version 
built from https://source.codeaurora.org/external/imx/imx-xen.git.

Xen is booted with sched=null vwfi=native.
One physical CPU core is pinned to the domu.
Some interrupts are passed through to the domu.

When destroying the domain with "xl destroy" etc. it does not complain, but 
then when trying to restart the domain
again with an "xl create <domain cfg>" I get:
(XEN) IRQ 210 is already used by domain 1

"xl list" does not contain the domain.

Repeating the "xl create" command 5-10 times eventually starts the 
domain without complaining about the IRQ.

Inspired by the discussion in the thread above I have put printks in 
the xen/common/domain.c file.
In the function domain_destroy I have a printk("End of domain_destroy 
function\n") at the end.
In the function complete_domain_destroy I have a printk("Begin of 
complete_domain_destroy function\n") at the beginning.
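
For reference, the printks sit roughly like this relative to the RCU 
deferral in domain_destroy (a simplified sketch from memory of the 4.13 
sources, not the exact code):

/* xen/common/domain.c -- simplified sketch, added printks marked */

static void complete_domain_destroy(struct rcu_head *head)
{
    printk("Begin of complete_domain_destroy function\n");  /* added */
    /* ... final teardown; per the discussion below, this is also where
     * passed-through IRQs end up being released ... */
}

void domain_destroy(struct domain *d)
{
    /* ... earlier tear-down steps ... */

    /* Final teardown is deferred until an RCU grace period elapses. */
    call_rcu(&d->rcu, complete_domain_destroy);

    printk("End of domain_destroy function\n");  /* added */
}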

With these printouts I get at "xl destroy":
(XEN) End of domain_destroy function

So it seems like the function complete_domain_destroy is not called.

"xl create" results in:
(XEN) IRQ 210 is already used by domain 1
(XEN) End of domain_destroy function

Then repeated "xl create" looks the same until after a few tries I also get:
(XEN) Begin of complete_domain_destroy function

After that the next "xl create" creates the domain.


I have also applied the patch from 
https://lists.xenproject.org/archives/html/xen-devel/2018-09/msg02469.html. 
This does not seem to change the results.

Starting the system without "sched=null vwfi=native" does not result in 
the problem.

BR
Anders




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-01-21 10:54 Null scheduler and vwfi native problem Anders Törnqvist
@ 2021-01-21 18:32 ` Dario Faggioli
  2021-01-21 19:40   ` Julien Grall
  2021-01-22  8:07   ` Anders Törnqvist
  2021-01-21 19:16 ` Julien Grall
  1 sibling, 2 replies; 31+ messages in thread
From: Dario Faggioli @ 2021-01-21 18:32 UTC (permalink / raw)
  To: Anders Törnqvist, xen-devel

On Thu, 2021-01-21 at 11:54 +0100, Anders Törnqvist wrote:
> Hi,
> 
Hello,

> I see a problem with destroy and restart of a domain. Interrupts are
> not 
> available when trying to restart a domain.
> 
> The situation seems very similar to the thread "null scheduler bug" 
>  
> https://lists.xenproject.org/archives/html/xen-devel/2018-09/msg01213.html
> .
> 
Right. Back then, PCI passthrough was involved, if I remember
correctly. Is it the case for you as well?

> The target system is a iMX8-based ARM board and Xen is a 4.13.0
> version 
> built from https://source.codeaurora.org/external/imx/imx-xen.git.
> 
Mmm, perhaps it's me, but neither going to that URL with a browser nor
trying to clone it shows me anything. What am I doing wrong?

> Xen is booted with sched=null vwfi=native.
> One physical CPU core is pinned to the domu.
> Some interrupts are passed through to the domu.
> 
Ok, I guess it is involved, since you say "some interrupts are passed
through..."

> When destroying the domain with xl destroy etc it does not complain
> but 
> then when trying to restart the domain
> again with a "xl create <domain cfg>" I get:
> (XEN) IRQ 210 is already used by domain 1
> 
> "xl list" does not contain the domain.
> 
> Repeating the "xl create" command 5-10 times eventually starts the 
> domain without complaining about the IRQ.
> 
> Inspired from the discussion in the thread above I have put printks
> in 
> the xen/common/domain.c file.
> In the function domain_destroy I have a printk("End of domain_destroy
> function\n") in the end.
> In the function complete_domain_destroy have a printk("Begin of 
> complete_domain_destroy function\n") in the beginning.
> 
> With these printouts I get at "xl destroy":
> (XEN) End of domain_destroy function
> 
> So it seems like the function complete_domain_destroy is not called.
> 
Ok, thanks for making these tests. It's helpful to have this
information right away.

> "xl create" results in:
> (XEN) IRQ 210 is already used by domain 1
> (XEN) End of domain_destroy function
> 
> Then repeated "xl create" looks the same until after a few tries I
> also get:
> (XEN) Begin of complete_domain_destroy function
> 
> After that the next "xl create" creates the domain.
> 
> 
> I have also applied the patch from 
>     
> https://lists.xenproject.org/archives/html/xen-devel/2018-09/msg02469.html
> . 
> This does not seem to change the results.
> 
Ah... Really? That's a bit unexpected, TBH.

Well, I'll think about it.

> Starting the system without "sched=null vwfi=native" does not result
> in 
> the problem.
>
Ok, how about, if you're up for some more testing:

 - booting with "sched=null" but not with "vwfi=native"
 - booting with "sched=null vwfi=native" but not doing the IRQ 
   passthrough that you mentioned above

?

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-01-21 10:54 Null scheduler and vwfi native problem Anders Törnqvist
  2021-01-21 18:32 ` Dario Faggioli
@ 2021-01-21 19:16 ` Julien Grall
  1 sibling, 0 replies; 31+ messages in thread
From: Julien Grall @ 2021-01-21 19:16 UTC (permalink / raw)
  To: Anders Törnqvist, xen-devel, Dario Faggioli, Stefano Stabellini



On 21/01/2021 10:54, Anders Törnqvist wrote:
> Hi,

Hi Anders,

Thank you for reporting the bug. I am adding Stefano and Dario as IIRC 
they were going to work on a solution.

Cheers,

> I see a problem with destroy and restart of a domain. Interrupts are not 
> available when trying to restart a domain.
> 
> The situation seems very similar to the thread "null scheduler bug" 
> https://lists.xenproject.org/archives/html/xen-devel/2018-09/msg01213.html.
> 
> The target system is a iMX8-based ARM board and Xen is a 4.13.0 version 
> built from https://source.codeaurora.org/external/imx/imx-xen.git.
> 
> Xen is booted with sched=null vwfi=native.
> One physical CPU core is pinned to the domu.
> Some interrupts are passed through to the domu.
> 
> When destroying the domain with xl destroy etc it does not complain but 
> then when trying to restart the domain
> again with a "xl create <domain cfg>" I get:
> (XEN) IRQ 210 is already used by domain 1
> 
> "xl list" does not contain the domain.
> 
> Repeating the "xl create" command 5-10 times eventually starts the 
> domain without complaining about the IRQ.
> 
> Inspired from the discussion in the thread above I have put printks in 
> the xen/common/domain.c file.
> In the function domain_destroy I have a printk("End of domain_destroy 
> function\n") in the end.
> In the function complete_domain_destroy have a printk("Begin of 
> complete_domain_destroy function\n") in the beginning.
> 
> With these printouts I get at "xl destroy":
> (XEN) End of domain_destroy function
> 
> So it seems like the function complete_domain_destroy is not called.
> 
> "xl create" results in:
> (XEN) IRQ 210 is already used by domain 1
> (XEN) End of domain_destroy function
> 
> Then repeated "xl create" looks the same until after a few tries I also 
> get:
> (XEN) Begin of complete_domain_destroy function
> 
> After that the next "xl create" creates the domain.
> 
> 
> I have also applied the patch from 
> https://lists.xenproject.org/archives/html/xen-devel/2018-09/msg02469.html. 
> This does not seem to change the results. 
> 
> Starting the system without "sched=null vwfi=native" does not result in 
> the problem.
> 
> BR
> Anders
> 
> 
> 

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-01-21 18:32 ` Dario Faggioli
@ 2021-01-21 19:40   ` Julien Grall
  2021-01-21 23:35     ` Dario Faggioli
  2021-01-22  8:07   ` Anders Törnqvist
  1 sibling, 1 reply; 31+ messages in thread
From: Julien Grall @ 2021-01-21 19:40 UTC (permalink / raw)
  To: Dario Faggioli, Anders Törnqvist, xen-devel, Stefano Stabellini

Hi Dario,

On 21/01/2021 18:32, Dario Faggioli wrote:
> On Thu, 2021-01-21 at 11:54 +0100, Anders Törnqvist wrote:
>> Hi,
>> I see a problem with destroy and restart of a domain. Interrupts are
>> not
>> available when trying to restart a domain.
>>
>> The situation seems very similar to the thread "null scheduler bug"
>>   
>> https://lists.xenproject.org/archives/html/xen-devel/2018-09/msg01213.html
>> .
>>
> Right. Back then, PCI passthrough was involved, if I remember
> correctly. Is it the case for you as well?

PCI passthrough is not yet supported on Arm :). However, the bug was 
reported with platform device passthrough.

[...]

>> "xl create" results in:
>> (XEN) IRQ 210 is already used by domain 1
>> (XEN) End of domain_destroy function
>>
>> Then repeated "xl create" looks the same until after a few tries I
>> also get:
>> (XEN) Begin of complete_domain_destroy function
>>
>> After that the next "xl create" creates the domain.
>>
>>
>> I have also applied the patch from
>>      
>> https://lists.xenproject.org/archives/html/xen-devel/2018-09/msg02469.html
>> .
>> This does not seem to change the results.
>>
> Ah... Really? That's a bit unexpected, TBH.
> 
> Well, I'll think about it. >
>> Starting the system without "sched=null vwfi=native" does not result
>> in
>> the problem.
>>
> Ok, how about, if you're up for some more testing:
> 
>   - booting with "sched=null" but not with "vwfi=native"
>   - booting with "sched=null vwfi=native" but not doing the IRQ
>     passthrough that you mentioned above
> 
> ?

I think we can skip the testing as the bug was fully diagnosed back 
then. Unfortunately, I don't think a patch was ever posted. The 
interesting bits start at [1]. Let me try to summarize here.

This has nothing to do with device passthrough, but the bug is easier to 
spot as interrupts are only going to be released when the domain is 
fully destroyed (we should really release them during the relinquish 
period...).

The last step of the domain destruction (complete_domain_destroy()) will 
*only* happen when all the CPUs are considered quiescent from the RCU PoV.

As you pointed out on that thread, the RCU implementation in Xen 
requires the pCPU to enter the hypervisor (via hypercalls, 
interrupts...) from time to time.

This assumption doesn't hold anymore when using "sched=null vwfi=native" 
because a vCPU will not exit when it is idling (vwfi=native) and there 
may not be any other source of interrupt on that vCPU.

Therefore the quiescent state will never be reached on the pCPU running 
that vCPU.

From Xen's PoV, any pCPU executing guest context can be considered 
quiescent. So one way to solve the problem would be to mark the pCPU 
when entering the guest.
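
A very rough sketch of the idea (the function names here are only 
illustrative; the Arm guest entry/exit paths live in arch/arm/traps.c, 
and Xen's RCU code already provides rcu_idle_enter()/rcu_idle_exit() 
for a similar purpose, IIRC):

/* Illustrative sketch only -- not a tested patch. */

/* Right before resuming the guest: treat this pCPU as quiescent. */
static void going_to_guest_sketch(void)
{
    rcu_idle_enter(smp_processor_id());
}

/* On any trap back into the hypervisor: take part in grace periods
 * again. */
static void back_in_hypervisor_sketch(void)
{
    rcu_idle_exit(smp_processor_id());
}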

Cheers,

[1] 
https://lore.kernel.org/xen-devel/acbeae1c-fda1-a079-322a-786d7528ecfc@arm.com/

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-01-21 19:40   ` Julien Grall
@ 2021-01-21 23:35     ` Dario Faggioli
  2021-01-22  8:06       ` Anders Törnqvist
  2021-01-22 14:02       ` Julien Grall
  0 siblings, 2 replies; 31+ messages in thread
From: Dario Faggioli @ 2021-01-21 23:35 UTC (permalink / raw)
  To: Julien Grall, Anders Törnqvist, xen-devel, Stefano Stabellini

On Thu, 2021-01-21 at 19:40 +0000, Julien Grall wrote:
> Hi Dario,
> 
Hi!

> On 21/01/2021 18:32, Dario Faggioli wrote:
> > On Thu, 2021-01-21 at 11:54 +0100, Anders Törnqvist wrote:
> > >   
> > > https://lists.xenproject.org/archives/html/xen-devel/2018-09/msg01213.html
> > > .
> > > 
> > Right. Back then, PCI passthrough was involved, if I remember
> > correctly. Is it the case for you as well?
> 
> PCI passthrough is not yet supported on Arm :). However, the bug was 
> reported with platform device passthrough.
> 
Yeah, well... That! Which indeed is not PCI. Sorry for the terminology
mismatch. :-)

> > Well, I'll think about it. >
> > > Starting the system without "sched=null vwfi=native" does not
> > > result
> > > in
> > > the problem.
> > > 
> > Ok, how about, if you're up for some more testing:
> > 
> >   - booting with "sched=null" but not with "vwfi=native"
> >   - booting with "sched=null vwfi=native" but not doing the IRQ
> >     passthrough that you mentioned above
> > 
> > ?
> 
> I think we can skip the testing as the bug was fully diagnostics back
> then. Unfortunately, I don't think a patch was ever posted.
>
True. But a hackish debug patch was provided and, back then, it
worked.

OTOH, Anders seems to be reporting that such a patch did not work here.
I also continue to think that we're facing the same or a very similar
problem... But I'm curious why applying the patch did not help this
time. And that's why I asked for more testing.

Anyway, it's true that we left the issue pending, so something like
this:

>  From Xen PoV, any pCPU executing guest context can be considered 
> quiescent. So one way to solve the problem would be to mark the pCPU 
> when entering to the guest.
> 
Should be done anyway.

We'll then see if it actually solves this problem too, or if this is
really something else.

Thanks for the summary, BTW. :-)

I'll try to work on a patch.

Regards

> [1] 
>     
> https://lore.kernel.org/xen-devel/acbeae1c-fda1-a079-322a-786d7528ecfc@arm.com/
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-01-21 23:35     ` Dario Faggioli
@ 2021-01-22  8:06       ` Anders Törnqvist
  2021-01-22  9:05         ` Dario Faggioli
  2021-01-22 14:26         ` Julien Grall
  2021-01-22 14:02       ` Julien Grall
  1 sibling, 2 replies; 31+ messages in thread
From: Anders Törnqvist @ 2021-01-22  8:06 UTC (permalink / raw)
  To: Dario Faggioli, Julien Grall, xen-devel, Stefano Stabellini

Thanks for the responses.

On 1/22/21 12:35 AM, Dario Faggioli wrote:
> On Thu, 2021-01-21 at 19:40 +0000, Julien Grall wrote:
>> Hi Dario,
>>
> Hi!
>
>> On 21/01/2021 18:32, Dario Faggioli wrote:
>>> On Thu, 2021-01-21 at 11:54 +0100, Anders Törnqvist wrote:
>>>>    
>>>> https://lists.xenproject.org/archives/html/xen-devel/2018-09/msg01213.html
>>>> .
>>>>
>>> Right. Back then, PCI passthrough was involved, if I remember
>>> correctly. Is it the case for you as well?
>> PCI passthrough is not yet supported on Arm :). However, the bug was
>> reported with platform device passthrough.
>>
> Yeah, well... That! Which indeed is not PCI. Sorry for the terminology
> mismatch. :-)
>
>>> Well, I'll think about it. >
>>>> Starting the system without "sched=null vwfi=native" does not
>>>> result
>>>> in
>>>> the problem.
>>>>
>>> Ok, how about, if you're up for some more testing:
>>>
>>>    - booting with "sched=null" but not with "vwfi=native"
>>>    - booting with "sched=null vwfi=native" but not doing the IRQ
>>>      passthrough that you mentioned above
>>>
>>> ?
>> I think we can skip the testing as the bug was fully diagnostics back
>> then. Unfortunately, I don't think a patch was ever posted.
>>
> True. But an hackish debug patch was provided and, back then, it
> worked.
>
> OTOH, Anders seems to be reporting that such a patch did not work here.
> I also continue to think that we're facing the same or a very similar
> problem... But I'm curious why applying the patch did not help this
> time. And that's why I asked for more testing.
I ran the tests as suggested, to shed some more light.

- booting with "sched=null" but not with "vwfi=native"
Without "vwfi=native" it works fine to destroy and to re-create the domain.
Both printouts come after a destroy:
(XEN) End of domain_destroy function
(XEN) End of complete_domain_destroy function


- booting with "sched=null vwfi=native" but not doing the IRQ 
passthrough that you mentioned above
"xl destroy" gives
(XEN) End of domain_destroy function

Then a "xl create" says nothing but the domain has not started correct. 
"xl list" look like this for the domain:
mydomu                                   2   512     1 ------       0.0

>
> Anyway, it's true that we left the issue pending, so something like
> this:
>
>>   From Xen PoV, any pCPU executing guest context can be considered
>> quiescent. So one way to solve the problem would be to mark the pCPU
>> when entering to the guest.
>>
> Should be done anyway.
>
> We'll then see if it actually solves this problem too, or if this is
> really something else.
>
> Thanks for the summary, BTW. :-)
>
> I'll try to work on a patch.
Thanks, just let me know if I can do some testing to assist.
>
> Regards
>
>> [1]
>>      
>> https://lore.kernel.org/xen-devel/acbeae1c-fda1-a079-322a-786d7528ecfc@arm.com/



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-01-21 18:32 ` Dario Faggioli
  2021-01-21 19:40   ` Julien Grall
@ 2021-01-22  8:07   ` Anders Törnqvist
  1 sibling, 0 replies; 31+ messages in thread
From: Anders Törnqvist @ 2021-01-22  8:07 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel

On 1/21/21 7:32 PM, Dario Faggioli wrote:
> On Thu, 2021-01-21 at 11:54 +0100, Anders Törnqvist wrote:
>> Hi,
>>
> Hello,
>
>> I see a problem with destroy and restart of a domain. Interrupts are
>> not
>> available when trying to restart a domain.
>>
>> The situation seems very similar to the thread "null scheduler bug"
>>   
>> https://lists.xenproject.org/archives/html/xen-devel/2018-09/msg01213.html
>> .
>>
> Right. Back then, PCI passthrough was involved, if I remember
> correctly. Is it the case for you as well?
>
>> The target system is a iMX8-based ARM board and Xen is a 4.13.0
>> version
>> built from https://source.codeaurora.org/external/imx/imx-xen.git.
>>
> Mmm, perhaps it's me, but neither going to that URL with a browser nor
> trying to clone it shows me anything. What am I doing wrong?
Sorry. The link is https://source.codeaurora.org/external/imx/imx-xen.
>
>> Xen is booted with sched=null vwfi=native.
>> One physical CPU core is pinned to the domu.
>> Some interrupts are passed through to the domu.
>>
> Ok, I guess it is involved, since you say "some interrupts are passed
> through..."
>
>> When destroying the domain with xl destroy etc it does not complain
>> but
>> then when trying to restart the domain
>> again with a "xl create <domain cfg>" I get:
>> (XEN) IRQ 210 is already used by domain 1
>>
>> "xl list" does not contain the domain.
>>
>> Repeating the "xl create" command 5-10 times eventually starts the
>> domain without complaining about the IRQ.
>>
>> Inspired from the discussion in the thread above I have put printks
>> in
>> the xen/common/domain.c file.
>> In the function domain_destroy I have a printk("End of domain_destroy
>> function\n") in the end.
>> In the function complete_domain_destroy have a printk("Begin of
>> complete_domain_destroy function\n") in the beginning.
>>
>> With these printouts I get at "xl destroy":
>> (XEN) End of domain_destroy function
>>
>> So it seems like the function complete_domain_destroy is not called.
>>
> Ok, thanks for making these tests. It's helpful to have this
> information right away.
>
>> "xl create" results in:
>> (XEN) IRQ 210 is already used by domain 1
>> (XEN) End of domain_destroy function
>>
>> Then repeated "xl create" looks the same until after a few tries I
>> also get:
>> (XEN) Begin of complete_domain_destroy function
>>
>> After that the next "xl create" creates the domain.
>>
>>
>> I have also applied the patch from
>>      
>> https://lists.xenproject.org/archives/html/xen-devel/2018-09/msg02469.html
>> .
>> This does not seem to change the results.
>>
> Ah... Really? That's a bit unexpected, TBH.
>
> Well, I'll think about it.
>
>> Starting the system without "sched=null vwfi=native" does not result
>> in
>> the problem.
>>
> Ok, how about, if you're up for some more testing:
>
>   - booting with "sched=null" but not with "vwfi=native"
>   - booting with "sched=null vwfi=native" but not doing the IRQ
>     passthrough that you mentioned above
>
> ?
>
> Regards




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-01-22  8:06       ` Anders Törnqvist
@ 2021-01-22  9:05         ` Dario Faggioli
  2021-01-22 14:26         ` Julien Grall
  1 sibling, 0 replies; 31+ messages in thread
From: Dario Faggioli @ 2021-01-22  9:05 UTC (permalink / raw)
  To: Anders Törnqvist, Julien Grall, xen-devel, Stefano Stabellini

On Fri, 2021-01-22 at 09:06 +0100, Anders Törnqvist wrote:
> On 1/22/21 12:35 AM, Dario Faggioli wrote:
> 
> 
> - booting with "sched=null" but not with "vwfi=native"
> Without "vwfi=native" it works fine to destroy and to re-create the
> domain.
> Both printouts comes after a destroy:
> (XEN) End of domain_destroy function
> (XEN) End of complete_domain_destroy function
> 
Ok, thanks for doing these tests.

The fact that not using "vwfi=native" makes things work seems to point
in the direction that myself and Julien (and you as well!) were
suspecting. I.e., it is the same issue as the one in the old xen-
devel thread.

I'm still a bit puzzled why the debug patch posted back then does not
work for you... but that's not really super important. Let's try to
come up with a new debug patch and, this time, a proper fix. :-)

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-01-21 23:35     ` Dario Faggioli
  2021-01-22  8:06       ` Anders Törnqvist
@ 2021-01-22 14:02       ` Julien Grall
  2021-01-22 17:30         ` Anders Törnqvist
  1 sibling, 1 reply; 31+ messages in thread
From: Julien Grall @ 2021-01-22 14:02 UTC (permalink / raw)
  To: Dario Faggioli, Anders Törnqvist, xen-devel, Stefano Stabellini

Hi Dario,

On 21/01/2021 23:35, Dario Faggioli wrote:
> On Thu, 2021-01-21 at 19:40 +0000, Julien Grall wrote:
>> Hi Dario,
>>
> Hi!
> 
>> On 21/01/2021 18:32, Dario Faggioli wrote:
>>> On Thu, 2021-01-21 at 11:54 +0100, Anders Törnqvist wrote:
>>>>    
>>>> https://lists.xenproject.org/archives/html/xen-devel/2018-09/msg01213.html
>>>> .
>>>>
>>> Right. Back then, PCI passthrough was involved, if I remember
>>> correctly. Is it the case for you as well?
>>
>> PCI passthrough is not yet supported on Arm :). However, the bug was
>> reported with platform device passthrough.
>>
> Yeah, well... That! Which indeed is not PCI. Sorry for the terminology
> mismatch. :-)
> 
>>> Well, I'll think about it. >
>>>> Starting the system without "sched=null vwfi=native" does not
>>>> result
>>>> in
>>>> the problem.
>>>>
>>> Ok, how about, if you're up for some more testing:
>>>
>>>    - booting with "sched=null" but not with "vwfi=native"
>>>    - booting with "sched=null vwfi=native" but not doing the IRQ
>>>      passthrough that you mentioned above
>>>
>>> ?
>>
>> I think we can skip the testing as the bug was fully diagnostics back
>> then. Unfortunately, I don't think a patch was ever posted.
>>
> True. But an hackish debug patch was provided and, back then, it
> worked.
> 
> OTOH, Anders seems to be reporting that such a patch did not work here.
> I also continue to think that we're facing the same or a very similar
> problem... But I'm curious why applying the patch did not help this
> time. And that's why I asked for more testing.

I wonder if this is because your patch doesn't modify rsinterval. So 
even if we call force_quiescent_state(), the softirq would only be 
raised for the current CPU.

I guess the following HACK could confirm the theory:

diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
index a5a27af3def0..50020bc34ddf 100644
--- a/xen/common/rcupdate.c
+++ b/xen/common/rcupdate.c
@@ -250,7 +250,7 @@ static void force_quiescent_state(struct rcu_data *rdp,
  {
      cpumask_t cpumask;
      raise_softirq(RCU_SOFTIRQ);
-    if (unlikely(rdp->qlen - rdp->last_rs_qlen > rsinterval)) {
+    if (1 || unlikely(rdp->qlen - rdp->last_rs_qlen > rsinterval)) {
          rdp->last_rs_qlen = rdp->qlen;
          /*
           * Don't send IPI to itself. With irqs disabled,

Cheers,

-- 
Julien Grall


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-01-22  8:06       ` Anders Törnqvist
  2021-01-22  9:05         ` Dario Faggioli
@ 2021-01-22 14:26         ` Julien Grall
  2021-01-22 17:44           ` Anders Törnqvist
  2021-01-25 16:11           ` Dario Faggioli
  1 sibling, 2 replies; 31+ messages in thread
From: Julien Grall @ 2021-01-22 14:26 UTC (permalink / raw)
  To: Anders Törnqvist, Dario Faggioli, xen-devel, Stefano Stabellini

Hi Anders,

On 22/01/2021 08:06, Anders Törnqvist wrote:
> On 1/22/21 12:35 AM, Dario Faggioli wrote:
>> On Thu, 2021-01-21 at 19:40 +0000, Julien Grall wrote:
> - booting with "sched=null vwfi=native" but not doing the IRQ 
> passthrough that you mentioned above
> "xl destroy" gives
> (XEN) End of domain_destroy function
> 
> Then a "xl create" says nothing but the domain has not started correct. 
> "xl list" look like this for the domain:
> mydomu                                   2   512     1 ------       0.0

This is odd. I would have expected ``xl create`` to fail if something 
went wrong with the domain creation.

The list of dashes suggests that the domain is:
    - Not running
    - Not blocked (i.e cannot run)
    - Not paused
    - Not shutdown

So this suggests the NULL scheduler didn't schedule the vCPU. Would it be 
possible to describe your setup:
   - How many pCPUs?
   - How many vCPUs did you give to dom0?
   - What was the number of the vCPUs given to the previous guest?

One possibility is the NULL scheduler doesn't release the pCPUs until 
the domain is fully destroyed. So if there is no pCPU free, it wouldn't 
be able to schedule the new domain.

However, I would have expected the NULL scheduler to refuse to create the 
domain if there is no pCPU available.

@Dario, @Stefano, do you know when the NULL scheduler decides to 
allocate the pCPU?

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-01-22 14:02       ` Julien Grall
@ 2021-01-22 17:30         ` Anders Törnqvist
  0 siblings, 0 replies; 31+ messages in thread
From: Anders Törnqvist @ 2021-01-22 17:30 UTC (permalink / raw)
  To: Julien Grall, Dario Faggioli, xen-devel, Stefano Stabellini

On 1/22/21 3:02 PM, Julien Grall wrote:
> Hi Dario,
>
> On 21/01/2021 23:35, Dario Faggioli wrote:
>> On Thu, 2021-01-21 at 19:40 +0000, Julien Grall wrote:
>>> Hi Dario,
>>>
>> Hi!
>>
>>> On 21/01/2021 18:32, Dario Faggioli wrote:
>>>> On Thu, 2021-01-21 at 11:54 +0100, Anders Törnqvist wrote:
>>>>> https://lists.xenproject.org/archives/html/xen-devel/2018-09/msg01213.html
>>>>> .
>>>>>
>>>> Right. Back then, PCI passthrough was involved, if I remember
>>>> correctly. Is it the case for you as well?
>>>
>>> PCI passthrough is not yet supported on Arm :). However, the bug was
>>> reported with platform device passthrough.
>>>
>> Yeah, well... That! Which indeed is not PCI. Sorry for the terminology
>> mismatch. :-)
>>
>>>> Well, I'll think about it. >
>>>>> Starting the system without "sched=null vwfi=native" does not
>>>>> result
>>>>> in
>>>>> the problem.
>>>>>
>>>> Ok, how about, if you're up for some more testing:
>>>>
>>>>    - booting with "sched=null" but not with "vwfi=native"
>>>>    - booting with "sched=null vwfi=native" but not doing the IRQ
>>>>      passthrough that you mentioned above
>>>>
>>>> ?
>>>
>>> I think we can skip the testing as the bug was fully diagnostics back
>>> then. Unfortunately, I don't think a patch was ever posted.
>>>
>> True. But an hackish debug patch was provided and, back then, it
>> worked.
>>
>> OTOH, Anders seems to be reporting that such a patch did not work here.
>> I also continue to think that we're facing the same or a very similar
>> problem... But I'm curious why applying the patch did not help this
>> time. And that's why I asked for more testing.
>
> I wonder if this is because your patch doesn't modify rsinterval. So 
> even if we call force_quiescent_state(), the softirq would only be 
> raised for the current CPU.
>
> I guess the following HACK could confirm the theory:
>
> diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
> index a5a27af3def0..50020bc34ddf 100644
> --- a/xen/common/rcupdate.c
> +++ b/xen/common/rcupdate.c
> @@ -250,7 +250,7 @@ static void force_quiescent_state(struct rcu_data *rdp,
>  {
>      cpumask_t cpumask;
>      raise_softirq(RCU_SOFTIRQ);
> -    if (unlikely(rdp->qlen - rdp->last_rs_qlen > rsinterval)) {
> +    if (1 || unlikely(rdp->qlen - rdp->last_rs_qlen > rsinterval)) {
>          rdp->last_rs_qlen = rdp->qlen;
>          /*
>           * Don't send IPI to itself. With irqs disabled,
>
> Cheers,
>
I applied the patch above. No change. The complete_domain_destroy 
function is not called when I destroy the domain.

/Anders



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-01-22 14:26         ` Julien Grall
@ 2021-01-22 17:44           ` Anders Törnqvist
  2021-01-25 15:45             ` Dario Faggioli
  2021-01-25 16:11           ` Dario Faggioli
  1 sibling, 1 reply; 31+ messages in thread
From: Anders Törnqvist @ 2021-01-22 17:44 UTC (permalink / raw)
  To: Julien Grall, Dario Faggioli, xen-devel, Stefano Stabellini

On 1/22/21 3:26 PM, Julien Grall wrote:
> Hi Anders,
>
> On 22/01/2021 08:06, Anders Törnqvist wrote:
>> On 1/22/21 12:35 AM, Dario Faggioli wrote:
>>> On Thu, 2021-01-21 at 19:40 +0000, Julien Grall wrote:
>> - booting with "sched=null vwfi=native" but not doing the IRQ 
>> passthrough that you mentioned above
>> "xl destroy" gives
>> (XEN) End of domain_destroy function
>>
>> Then a "xl create" says nothing but the domain has not started 
>> correct. "xl list" look like this for the domain:
>> mydomu                                   2   512     1 ------       0.0
>
> This is odd. I would have expected ``xl create`` to fail if something 
> went wrong with the domain creation.
>
> The list of dash, suggests that the domain is:
>    - Not running
>    - Not blocked (i.e cannot run)
>    - Not paused
>    - Not shutdown
>
> So this suggest the NULL scheduler didn't schedule the vCPU. Would it 
> be possible to describe your setup:
>   - How many pCPUs?
There are 6 pCPUs
>   - How many vCPUs did you give to dom0?
I gave it 5
>   - What was the number of the vCPUs given to the previous guest?

Nr 0.

Listing vcpus looks like this when the domain is running:

xl vcpu-list
Name                                ID  VCPU   CPU State   Time(s) Affinity (Hard / Soft)
Domain-0                             0     0    0   r--     101.7 0 / all
Domain-0                             0     1    1   r--     101.0 1 / all
Domain-0                             0     2    2   r--     101.0 2 / all
Domain-0                             0     3    3   r--     100.9 3 / all
Domain-0                             0     4    4   r--     100.9 4 / all
mydomu                              1     0    5   r--      89.5 5 / all

vCPU nr 0 is also for dom0. Is that normal?

>
> One possibility is the NULL scheduler doesn't release the pCPUs until 
> the domain is fully destroyed. So if there is no pCPU free, it 
> wouldn't be able to schedule the new domain.
>
> However, I would have expected the NULL scheduler to refuse the domain 
> to create if there is no pCPU available.
>
> @Dario, @Stefano, do you know when the NULL scheduler decides to 
> allocate the pCPU?
>
> Cheers,
>




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-01-22 17:44           ` Anders Törnqvist
@ 2021-01-25 15:45             ` Dario Faggioli
  0 siblings, 0 replies; 31+ messages in thread
From: Dario Faggioli @ 2021-01-25 15:45 UTC (permalink / raw)
  To: Anders Törnqvist, Julien Grall, xen-devel, Stefano Stabellini

On Fri, 2021-01-22 at 18:44 +0100, Anders Törnqvist wrote:
> Listing vcpus looks like this when the domain is running:
> 
> xl vcpu-list
> Name                                ID  VCPU   CPU State   Time(s) Affinity (Hard / Soft)
> Domain-0                             0     0    0   r--     101.7 0 / all
> Domain-0                             0     1    1   r--     101.0 1 / all
> Domain-0                             0     2    2   r--     101.0 2 / all
> Domain-0                             0     3    3   r--     100.9 3 / all
> Domain-0                             0     4    4   r--     100.9 4 / all
> mydomu                              1     0    5   r--      89.5 5 / all
> 
> vCPU nr 0 is also for dom0. Is that normal?
> 
Yeah, that's the vCPU ID numbering. Each VM/guest (including dom0) has
its own vCPUs, with IDs starting from 0.

What counts here, to make sure that the NULL scheduler "configuration"
is correct, is that each vCPU is associated with one and only one pCPU.

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-01-22 14:26         ` Julien Grall
  2021-01-22 17:44           ` Anders Törnqvist
@ 2021-01-25 16:11           ` Dario Faggioli
  2021-01-26 17:03             ` Anders Törnqvist
  1 sibling, 1 reply; 31+ messages in thread
From: Dario Faggioli @ 2021-01-25 16:11 UTC (permalink / raw)
  To: Julien Grall, Anders Törnqvist, xen-devel, Stefano Stabellini

On Fri, 2021-01-22 at 14:26 +0000, Julien Grall wrote:
> Hi Anders,
> 
> On 22/01/2021 08:06, Anders Törnqvist wrote:
> > On 1/22/21 12:35 AM, Dario Faggioli wrote:
> > > On Thu, 2021-01-21 at 19:40 +0000, Julien Grall wrote:
> > - booting with "sched=null vwfi=native" but not doing the IRQ 
> > passthrough that you mentioned above
> > "xl destroy" gives
> > (XEN) End of domain_destroy function
> > 
> > Then a "xl create" says nothing but the domain has not started
> > correct. 
> > "xl list" look like this for the domain:
> > mydomu                                   2   512     1 ------      
> > 0.0
> 
> This is odd. I would have expected ``xl create`` to fail if something
> went wrong with the domain creation.
>
So, Anders, would it be possible to issue:

# xl debug-keys r
# xl dmesg

and send the output to us?

Ideally, you'd do it:
 - with Julien's patch (the one he sent the other day, and that you 
   have already given a try to) applied
 - while you are in the state above, i.e., after having tried to 
   destroy a domain and failing
 - and maybe again after having tried to start a new domain

> One possibility is the NULL scheduler doesn't release the pCPUs until
> the domain is fully destroyed. So if there is no pCPU free, it
> wouldn't 
> be able to schedule the new domain.
> 
> However, I would have expected the NULL scheduler to refuse the
> domain 
> to create if there is no pCPU available.
> 
Yeah but, unfortunately, it is not easy for the scheduler to fail
domain creation at this stage (i.e., when we realize there are no
available pCPUs). That's the reason why the NULL scheduler has a
waitqueue, where vCPUs that cannot be put on any pCPU are parked.

Of course, this is a configuration error (or a bug, like maybe in this
case :-/), and we print warnings when it happens.

> @Dario, @Stefano, do you know when the NULL scheduler decides to 
> allocate the pCPU?
> 
On which pCPU to allocate a vCPU is decided in null_unit_insert(),
called from sched_alloc_unit() and sched_init_vcpu().

On the other hand, a vCPU is properly removed from its pCPU, hence
making the pCPU free to be assigned to some other vCPU, in
unit_deassign(), called from null_unit_remove(), which in turn is
called from sched_destroy_vcpu(), which is indeed called from
complete_domain_destroy().
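
Conceptually, the placement logic behaves like this (just a sketch to
illustrate the behaviour; the helper names below are made up, this is
not the actual sched_null.c code):

/* Illustrative sketch of NULL scheduler vCPU placement. */
static void null_place_unit_sketch(struct sched_unit *unit)
{
    int cpu = pick_free_pcpu(unit);      /* hypothetical helper */

    if ( cpu >= 0 )
        assign_unit_to_pcpu(unit, cpu);  /* strict 1:1 vCPU <-> pCPU */
    else
        park_on_waitqueue(unit);         /* no free pCPU available */
}

So, if the old domain's pCPU has not been released yet (because
complete_domain_destroy() has not run), the new domain's vCPU would end
up parked on that waitqueue.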

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-01-25 16:11           ` Dario Faggioli
@ 2021-01-26 17:03             ` Anders Törnqvist
  2021-01-26 22:31               ` Dario Faggioli
  0 siblings, 1 reply; 31+ messages in thread
From: Anders Törnqvist @ 2021-01-26 17:03 UTC (permalink / raw)
  To: Dario Faggioli, Julien Grall, xen-devel, Stefano Stabellini

On 1/25/21 5:11 PM, Dario Faggioli wrote:
> On Fri, 2021-01-22 at 14:26 +0000, Julien Grall wrote:
>> Hi Anders,
>>
>> On 22/01/2021 08:06, Anders Törnqvist wrote:
>>> On 1/22/21 12:35 AM, Dario Faggioli wrote:
>>>> On Thu, 2021-01-21 at 19:40 +0000, Julien Grall wrote:
>>> - booting with "sched=null vwfi=native" but not doing the IRQ
>>> passthrough that you mentioned above
>>> "xl destroy" gives
>>> (XEN) End of domain_destroy function
>>>
>>> Then a "xl create" says nothing but the domain has not started
>>> correct.
>>> "xl list" look like this for the domain:
>>> mydomu                                   2   512     1 ------
>>> 0.0
>> This is odd. I would have expected ``xl create`` to fail if something
>> went wrong with the domain creation.
>>
> So, Anders, would it be possible to issue a:
>
> # xl debug-keys r
> # xl dmesg
>
> And send it to us ?
>
> Ideally, you'd do it:
>   - with Julien's patch (the one he sent the other day, and that you
>     have already given a try to) applied
>   - while you are in the state above, i.e., after having tried to
>     destroy a domain and failing
>   - and maybe again after having tried to start a new domain
Here are some logs.

The system is booted as before with the patch and the domu config does 
not have the IRQs.


# xl list
Name                                        ID   Mem VCPUs State    Time(s)
Domain-0                                     0  3000     5 r-----     820.1
mydomu                                       1   511     1 r-----     157.0

# xl debug-keys r
(XEN) sched_smt_power_savings: disabled
(XEN) NOW=191793008000
(XEN) Online Cpus: 0-5
(XEN) Cpupool 0:
(XEN) Cpus: 0-5
(XEN) Scheduler: null Scheduler (null)
(XEN)     cpus_free =
(XEN) Domain info:
(XEN)     Domain: 0
(XEN)       1: [0.0] pcpu=0
(XEN)       2: [0.1] pcpu=1
(XEN)       3: [0.2] pcpu=2
(XEN)       4: [0.3] pcpu=3
(XEN)       5: [0.4] pcpu=4
(XEN)     Domain: 1
(XEN)       6: [1.0] pcpu=5
(XEN) Waitqueue:
(XEN) CPUs info:
(XEN) CPU[00] sibling={0}, core={0}, unit=d0v0
(XEN)     run: [0.0] pcpu=0
(XEN) CPU[01] sibling={1}, core={1}, unit=d0v1
(XEN)     run: [0.1] pcpu=1
(XEN) CPU[02] sibling={2}, core={2}, unit=d0v2
(XEN)     run: [0.2] pcpu=2
(XEN) CPU[03] sibling={3}, core={3}, unit=d0v3
(XEN)     run: [0.3] pcpu=3
(XEN) CPU[04] sibling={4}, core={4}, unit=d0v4
(XEN)     run: [0.4] pcpu=4
(XEN) CPU[05] sibling={5}, core={5}, unit=d1v0
(XEN)     run: [1.0] pcpu=5

# xl dmesg
(XEN) Checking for initrd in /chosen
(XEN) RAM: 0000000080200000 - 00000000ffffffff
(XEN) RAM: 0000000880000000 - 00000008ffffffff
(XEN)
(XEN) MODULE[0]: 0000000080400000 - 000000008054d848 Xen
(XEN) MODULE[1]: 0000000083000000 - 0000000083018000 Device Tree
(XEN) MODULE[2]: 0000000088000000 - 0000000089701200 Kernel
(XEN)  RESVD[0]: 0000000088000000 - 0000000090000000
(XEN)  RESVD[1]: 0000000083000000 - 0000000083018000
(XEN)  RESVD[2]: 0000000084000000 - 0000000085ffffff
(XEN)  RESVD[3]: 0000000086000000 - 00000000863fffff
(XEN)  RESVD[4]: 0000000090000000 - 00000000903fffff
(XEN)  RESVD[5]: 0000000090400000 - 0000000091ffffff
(XEN)  RESVD[6]: 0000000092000000 - 00000000921fffff
(XEN)  RESVD[7]: 0000000092200000 - 00000000923fffff
(XEN)  RESVD[8]: 0000000092400000 - 00000000943fffff
(XEN)  RESVD[9]: 0000000094400000 - 0000000094bfffff
(XEN)
(XEN) CMDLINE[0000000088000000]:chosen console=hvc0 earlycon=xen 
root=/dev/mmcblk0p3 mem=3000M hostname=myhost 
video=HDMI-A-1:1920x1080@60 imxdrm.legacyfb_depth=32   quiet loglevel=3 
logo.nologo vt.global_cursor_default=0
(XEN)
(XEN) Command line: console=dtuart dtuart=/serial@5a060000 
dom0_mem=3000M dom0_max_vcpus=5 hmp-unsafe=true dom0_vcpus_pin 
sched=null vwfi=native
(XEN) Domain heap initialised
(XEN) Booting using Device Tree
(XEN) partition id 4
(XEN) Domain name mydomu
(XEN) *****Initialized MU
(XEN) Looking for dtuart at "/serial@5a060000", options ""
  Xen 4.13.1-pre
(XEN) Xen version 4.13.1-pre (anders@builder.local) 
(aarch64-poky-linux-gcc (GCC) 8.3.0) debug=n  Fri Jan 22 17:32:33 UTC 2021
(XEN) Latest ChangeSet: Wed Feb 27 17:56:28 2019 +0800 git:b64b8df-dirty
(XEN) Processor: 410fd034: "ARM Limited", variant: 0x0, part 0xd03, rev 0x4
(XEN) 64-bit Execution:
(XEN)   Processor Features: 0000000001002222 0000000000000000
(XEN)     Exception Levels: EL3:64+32 EL2:64+32 EL1:64+32 EL0:64+32
(XEN)     Extensions: FloatingPoint AdvancedSIMD GICv3-SysReg
(XEN)   Debug Features: 0000000010305106 0000000000000000
(XEN)   Auxiliary Features: 0000000000000000 0000000000000000
(XEN)   Memory Model Features: 0000000000001122 0000000000000000
(XEN)   ISA Features:  0000000000011120 0000000000000000
(XEN) 32-bit Execution:
(XEN)   Processor Features: 00000131:10011011
(XEN)     Instruction Sets: AArch32 A32 Thumb Thumb-2 Jazelle
(XEN)     Extensions: GenericTimer Security
(XEN)   Debug Features: 03010066
(XEN)   Auxiliary Features: 00000000
(XEN)   Memory Model Features: 10201105 40000000 01260000 02102211
(XEN)  ISA Features: 02101110 13112111 21232042 01112131 00011142 00011121
(XEN) Generic Timer IRQ: phys=30 hyp=26 virt=27 Freq: 8000 KHz
(XEN) GICv3 initialization:
(XEN)       gic_dist_addr=0x00000051a00000
(XEN)       gic_maintenance_irq=25
(XEN)       gic_rdist_stride=0
(XEN)       gic_rdist_regions=1
(XEN)       redistributor regions:
(XEN)         - region 0: 0x00000051b00000 - 0x00000051bc0000
(XEN) GICv3 compatible with GICv2 cbase 0x00000052000000 vbase 
0x00000052020000
(XEN) GICv3: 544 lines, (IID 0001143b).
(XEN) GICv3: CPU0: Found redistributor in region 0 @000000004002d000
(XEN) XSM Framework v1.0.0 initialized
(XEN) Initialising XSM SILO mode
(XEN) Using scheduler: null Scheduler (null)
(XEN) Initializing null scheduler
(XEN) WARNING: This is experimental software in development.
(XEN) Use at your own risk.
(XEN) Allocated console ring of 16 KiB.
(XEN) Bringing up CPU1
(XEN) GICv3: CPU1: Found redistributor in region 0 @00000000400ad000
(XEN) Bringing up CPU2
(XEN) GICv3: CPU2: Found redistributor in region 0 @00000000400cd000
(XEN) Bringing up CPU3
(XEN) GICv3: CPU3: Found redistributor in region 0 @000000004004d000
(XEN) Bringing up CPU4
(XEN) GICv3: CPU4: Found redistributor in region 0 @000000004006d000
(XEN) Bringing up CPU5
(XEN) GICv3: CPU5: Found redistributor in region 0 @000000004008d000
(XEN) Brought up 6 CPUs
(XEN) I/O virtualisation enabled
(XEN)  - Dom0 mode: Relaxed
(XEN) Interrupt remapping enabled
(XEN) P2M: 40-bit IPA with 40-bit PA and 8-bit VMID
(XEN) P2M: 3 levels with order-1 root, VTCR 0x80023558
(XEN) *** LOADING DOMAIN 0 ***
(XEN) Loading d0 kernel from boot module @ 0000000088000000
(XEN) Allocating 1:1 mappings totalling 3000MB for dom0:
(XEN) BANK[0] 0x00000098000000-0x00000100000000 (1664MB)
(XEN) BANK[1] 0x00000880000000-0x000008c0000000 (1024MB)
(XEN) BANK[2] 0x000008d0000000-0x000008e0000000 (256MB)
(XEN) BANK[3] 0x000008ec800000-0x000008f0000000 (56MB)
(XEN) Grant table range: 0x00000080400000-0x00000080440000
(XEN) HACK: skip /imx8_gpu_ss setup!
(XEN) Allocating PPI 16 for event channel interrupt
(XEN) Loading zImage from 0000000088000000 to 
0000000098080000-0000000099781200
(XEN) Loading d0 DTB to 0x00000000a0000000-0x00000000a001772e
(XEN) Initial low memory virq threshold set at 0x4000 pages.
(XEN) Scrubbing Free RAM in background
(XEN) Std. Loglevel: Errors and warnings
(XEN) Guest Loglevel: Nothing (Rate-limited: Errors and warnings)
(XEN) ***************************************************
(XEN) WARNING: HMP COMPUTING HAS BEEN ENABLED.
(XEN) It has implications on the security and stability of the system,
(XEN) unless the cpu affinity of all domains is specified.
(XEN) ***************************************************
(XEN) 3... 2... 1...
(XEN) *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)
(XEN) Freed 336kB init memory.
(XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER4
(XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER8
(XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER12
(XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER16
(XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER20
(XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER24
(XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER28
(XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER32
(XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER36
(XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER40
(XEN) Power on resource 215
(XEN) printk: 11 messages suppressed.
(XEN) d1v0: vGICR: SGI: unhandled word write 0x000000ffffffff to ICACTIVER0
(XEN) sched_smt_power_savings: disabled
(XEN) NOW=191793008000
(XEN) Online Cpus: 0-5
(XEN) Cpupool 0:
(XEN) Cpus: 0-5
(XEN) Scheduler: null Scheduler (null)
(XEN)     cpus_free =
(XEN) Domain info:
(XEN)     Domain: 0
(XEN)       1: [0.0] pcpu=0
(XEN)       2: [0.1] pcpu=1
(XEN)       3: [0.2] pcpu=2
(XEN)       4: [0.3] pcpu=3
(XEN)       5: [0.4] pcpu=4
(XEN)     Domain: 1
(XEN)       6: [1.0] pcpu=5
(XEN) Waitqueue:
(XEN) CPUs info:
(XEN) CPU[00] sibling={0}, core={0}, unit=d0v0
(XEN)     run: [0.0] pcpu=0
(XEN) CPU[01] sibling={1}, core={1}, unit=d0v1
(XEN)     run: [0.1] pcpu=1
(XEN) CPU[02] sibling={2}, core={2}, unit=d0v2
(XEN)     run: [0.2] pcpu=2
(XEN) CPU[03] sibling={3}, core={3}, unit=d0v3
(XEN)     run: [0.3] pcpu=3
(XEN) CPU[04] sibling={4}, core={4}, unit=d0v4
(XEN)     run: [0.4] pcpu=4
(XEN) CPU[05] sibling={5}, core={5}, unit=d1v0
(XEN)     run: [1.0] pcpu=5


# xl destroy mydomu
(XEN) End of domain_destroy function

# xl list
Name                                        ID   Mem VCPUs State    Time(s)
Domain-0                                     0  3000     5 r-----    1057.9

# xl debug-keys r
(XEN) sched_smt_power_savings: disabled
(XEN) NOW=223871439875
(XEN) Online Cpus: 0-5
(XEN) Cpupool 0:
(XEN) Cpus: 0-5
(XEN) Scheduler: null Scheduler (null)
(XEN)     cpus_free =
(XEN) Domain info:
(XEN)     Domain: 0
(XEN)       1: [0.0] pcpu=0
(XEN)       2: [0.1] pcpu=1
(XEN)       3: [0.2] pcpu=2
(XEN)       4: [0.3] pcpu=3
(XEN)       5: [0.4] pcpu=4
(XEN)     Domain: 1
(XEN)       6: [1.0] pcpu=5
(XEN) Waitqueue:
(XEN) CPUs info:
(XEN) CPU[00] sibling={0}, core={0}, unit=d0v0
(XEN)     run: [0.0] pcpu=0
(XEN) CPU[01] sibling={1}, core={1}, unit=d0v1
(XEN)     run: [0.1] pcpu=1
(XEN) CPU[02] sibling={2}, core={2}, unit=d0v2
(XEN)     run: [0.2] pcpu=2
(XEN) CPU[03] sibling={3}, core={3}, unit=d0v3
(XEN)     run: [0.3] pcpu=3
(XEN) CPU[04] sibling={4}, core={4}, unit=d0v4
(XEN)     run: [0.4] pcpu=4
(XEN) CPU[05] sibling={5}, core={5}, unit=d1v0


# xl create mydomu.cfg
Parsing config from mydomu.cfg
(XEN) Power on resource 215

# xl list
Name                                        ID   Mem VCPUs State    Time(s)
Domain-0                                     0  3000     5 r-----    1152.1
mydomu                                       2   512     1 ------       0.0

# xl debug-keys r
(XEN) sched_smt_power_savings: disabled
(XEN) NOW=241210530250
(XEN) Online Cpus: 0-5
(XEN) Cpupool 0:
(XEN) Cpus: 0-5
(XEN) Scheduler: null Scheduler (null)
(XEN)     cpus_free =
(XEN) Domain info:
(XEN)     Domain: 0
(XEN)       1: [0.0] pcpu=0
(XEN)       2: [0.1] pcpu=1
(XEN)       3: [0.2] pcpu=2
(XEN)       4: [0.3] pcpu=3
(XEN)       5: [0.4] pcpu=4
(XEN)     Domain: 1
(XEN)       6: [1.0] pcpu=5
(XEN)     Domain: 2
(XEN)       7: [2.0] pcpu=-1
(XEN) Waitqueue: d2v0
(XEN) CPUs info:
(XEN) CPU[00] sibling={0}, core={0}, unit=d0v0
(XEN)     run: [0.0] pcpu=0
(XEN) CPU[01] sibling={1}, core={1}, unit=d0v1
(XEN)     run: [0.1] pcpu=1
(XEN) CPU[02] sibling={2}, core={2}, unit=d0v2
(XEN)     run: [0.2] pcpu=2
(XEN) CPU[03] sibling={3}, core={3}, unit=d0v3
(XEN)     run: [0.3] pcpu=3
(XEN) CPU[04] sibling={4}, core={4}, unit=d0v4
(XEN)     run: [0.4] pcpu=4
(XEN) CPU[05] sibling={5}, core={5}, unit=d1v0

# xl dmesg
(XEN) Checking for initrd in /chosen
(XEN) RAM: 0000000080200000 - 00000000ffffffff
(XEN) RAM: 0000000880000000 - 00000008ffffffff
(XEN)
(XEN) MODULE[0]: 0000000080400000 - 000000008054d848 Xen
(XEN) MODULE[1]: 0000000083000000 - 0000000083018000 Device Tree
(XEN) MODULE[2]: 0000000088000000 - 0000000089701200 Kernel
(XEN)  RESVD[0]: 0000000088000000 - 0000000090000000
(XEN)  RESVD[1]: 0000000083000000 - 0000000083018000
(XEN)  RESVD[2]: 0000000084000000 - 0000000085ffffff
(XEN)  RESVD[3]: 0000000086000000 - 00000000863fffff
(XEN)  RESVD[4]: 0000000090000000 - 00000000903fffff
(XEN)  RESVD[5]: 0000000090400000 - 0000000091ffffff
(XEN)  RESVD[6]: 0000000092000000 - 00000000921fffff
(XEN)  RESVD[7]: 0000000092200000 - 00000000923fffff
(XEN)  RESVD[8]: 0000000092400000 - 00000000943fffff
(XEN)  RESVD[9]: 0000000094400000 - 0000000094bfffff
(XEN)
(XEN) CMDLINE[0000000088000000]:chosen console=hvc0 earlycon=xen 
root=/dev/mmcblk0p3 mem=3000M hostname=myhost 
video=HDMI-A-1:1920x1080@60 imxdrm.legacyfb_depth=32   quiet loglevel=3 
logo.nologo vt.global_cursor_default=0
(XEN)
(XEN) Command line: console=dtuart dtuart=/serial@5a060000 
dom0_mem=3000M dom0_max_vcpus=5 hmp-unsafe=true dom0_vcpus_pin 
sched=null vwfi=native
(XEN) Domain heap initialised
(XEN) Booting using Device Tree
(XEN) partition id 4
(XEN) Domain name mydomu
(XEN) *****Initialized MU
(XEN) Looking for dtuart at "/serial@5a060000", options ""
  Xen 4.13.1-pre
(XEN) Xen version 4.13.1-pre (anders@builder.local) 
(aarch64-poky-linux-gcc (GCC) 8.3.0) debug=n  Fri Jan 22 17:32:33 UTC 2021
(XEN) Latest ChangeSet: Wed Feb 27 17:56:28 2019 +0800 git:b64b8df-dirty
(XEN) Processor: 410fd034: "ARM Limited", variant: 0x0, part 0xd03, rev 0x4
(XEN) 64-bit Execution:
(XEN)   Processor Features: 0000000001002222 0000000000000000
(XEN)     Exception Levels: EL3:64+32 EL2:64+32 EL1:64+32 EL0:64+32
(XEN)     Extensions: FloatingPoint AdvancedSIMD GICv3-SysReg
(XEN)   Debug Features: 0000000010305106 0000000000000000
(XEN)   Auxiliary Features: 0000000000000000 0000000000000000
(XEN)   Memory Model Features: 0000000000001122 0000000000000000
(XEN)   ISA Features:  0000000000011120 0000000000000000
(XEN) 32-bit Execution:
(XEN)   Processor Features: 00000131:10011011
(XEN)     Instruction Sets: AArch32 A32 Thumb Thumb-2 Jazelle
(XEN)     Extensions: GenericTimer Security
(XEN)   Debug Features: 03010066
(XEN)   Auxiliary Features: 00000000
(XEN)   Memory Model Features: 10201105 40000000 01260000 02102211
(XEN)  ISA Features: 02101110 13112111 21232042 01112131 00011142 00011121
(XEN) Generic Timer IRQ: phys=30 hyp=26 virt=27 Freq: 8000 KHz
(XEN) GICv3 initialization:
(XEN)       gic_dist_addr=0x00000051a00000
(XEN)       gic_maintenance_irq=25
(XEN)       gic_rdist_stride=0
(XEN)       gic_rdist_regions=1
(XEN)       redistributor regions:
(XEN)         - region 0: 0x00000051b00000 - 0x00000051bc0000
(XEN) GICv3 compatible with GICv2 cbase 0x00000052000000 vbase 
0x00000052020000
(XEN) GICv3: 544 lines, (IID 0001143b).
(XEN) GICv3: CPU0: Found redistributor in region 0 @000000004002d000
(XEN) XSM Framework v1.0.0 initialized
(XEN) Initialising XSM SILO mode
(XEN) Using scheduler: null Scheduler (null)
(XEN) Initializing null scheduler
(XEN) WARNING: This is experimental software in development.
(XEN) Use at your own risk.
(XEN) Allocated console ring of 16 KiB.
(XEN) Bringing up CPU1
(XEN) GICv3: CPU1: Found redistributor in region 0 @00000000400ad000
(XEN) Bringing up CPU2
(XEN) GICv3: CPU2: Found redistributor in region 0 @00000000400cd000
(XEN) Bringing up CPU3
(XEN) GICv3: CPU3: Found redistributor in region 0 @000000004004d000
(XEN) Bringing up CPU4
(XEN) GICv3: CPU4: Found redistributor in region 0 @000000004006d000
(XEN) Bringing up CPU5
(XEN) GICv3: CPU5: Found redistributor in region 0 @000000004008d000
(XEN) Brought up 6 CPUs
(XEN) I/O virtualisation enabled
(XEN)  - Dom0 mode: Relaxed
(XEN) Interrupt remapping enabled
(XEN) P2M: 40-bit IPA with 40-bit PA and 8-bit VMID
(XEN) P2M: 3 levels with order-1 root, VTCR 0x80023558
(XEN) *** LOADING DOMAIN 0 ***
(XEN) Loading d0 kernel from boot module @ 0000000088000000
(XEN) Allocating 1:1 mappings totalling 3000MB for dom0:
(XEN) BANK[0] 0x00000098000000-0x00000100000000 (1664MB)
(XEN) BANK[1] 0x00000880000000-0x000008c0000000 (1024MB)
(XEN) BANK[2] 0x000008d0000000-0x000008e0000000 (256MB)
(XEN) BANK[3] 0x000008ec800000-0x000008f0000000 (56MB)
(XEN) Grant table range: 0x00000080400000-0x00000080440000
(XEN) HACK: skip /imx8_gpu_ss setup!
(XEN) Allocating PPI 16 for event channel interrupt
(XEN) Loading zImage from 0000000088000000 to 
0000000098080000-0000000099781200
(XEN) Loading d0 DTB to 0x00000000a0000000-0x00000000a001772e
(XEN) Initial low memory virq threshold set at 0x4000 pages.
(XEN) Scrubbing Free RAM in background
(XEN) Std. Loglevel: Errors and warnings
(XEN) Guest Loglevel: Nothing (Rate-limited: Errors and warnings)
(XEN) ***************************************************
(XEN) WARNING: HMP COMPUTING HAS BEEN ENABLED.
(XEN) It has implications on the security and stability of the system,
(XEN) unless the cpu affinity of all domains is specified.
(XEN) ***************************************************
(XEN) 3... 2... 1...
(XEN) *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)
(XEN) Freed 336kB init memory.
(XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER4
(XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER8
(XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER12
(XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER16
(XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER20
(XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER24
(XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER28
(XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER32
(XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER36
(XEN) d0v0: vGICD: unhandled word write 0x000000ffffffff to ICACTIVER40
(XEN) Power on resource 215
(XEN) printk: 11 messages suppressed.
(XEN) d1v0: vGICR: SGI: unhandled word write 0x000000ffffffff to ICACTIVER0
(XEN) sched_smt_power_savings: disabled
(XEN) NOW=191793008000
(XEN) Online Cpus: 0-5
(XEN) Cpupool 0:
(XEN) Cpus: 0-5
(XEN) Scheduler: null Scheduler (null)
(XEN)     cpus_free =
(XEN) Domain info:
(XEN)     Domain: 0
(XEN)       1: [0.0] pcpu=0
(XEN)       2: [0.1] pcpu=1
(XEN)       3: [0.2] pcpu=2
(XEN)       4: [0.3] pcpu=3
(XEN)       5: [0.4] pcpu=4
(XEN)     Domain: 1
(XEN)       6: [1.0] pcpu=5
(XEN) Waitqueue:
(XEN) CPUs info:
(XEN) CPU[00] sibling={0}, core={0}, unit=d0v0
(XEN)     run: [0.0] pcpu=0
(XEN) CPU[01] sibling={1}, core={1}, unit=d0v1
(XEN)     run: [0.1] pcpu=1
(XEN) CPU[02] sibling={2}, core={2}, unit=d0v2
(XEN)     run: [0.2] pcpu=2
(XEN) CPU[03] sibling={3}, core={3}, unit=d0v3
(XEN)     run: [0.3] pcpu=3
(XEN) CPU[04] sibling={4}, core={4}, unit=d0v4
(XEN)     run: [0.4] pcpu=4
(XEN) CPU[05] sibling={5}, core={5}, unit=d1v0
(XEN)     run: [1.0] pcpu=5
(XEN) End of domain_destroy function
(XEN) sched_smt_power_savings: disabled
(XEN) NOW=223871439875
(XEN) Online Cpus: 0-5
(XEN) Cpupool 0:
(XEN) Cpus: 0-5
(XEN) Scheduler: null Scheduler (null)
(XEN)     cpus_free =
(XEN) Domain info:
(XEN)     Domain: 0
(XEN)       1: [0.0] pcpu=0
(XEN)       2: [0.1] pcpu=1
(XEN)       3: [0.2] pcpu=2
(XEN)       4: [0.3] pcpu=3
(XEN)       5: [0.4] pcpu=4
(XEN)     Domain: 1
(XEN)       6: [1.0] pcpu=5
(XEN) Waitqueue:
(XEN) CPUs info:
(XEN) CPU[00] sibling={0}, core={0}, unit=d0v0
(XEN)     run: [0.0] pcpu=0
(XEN) CPU[01] sibling={1}, core={1}, unit=d0v1
(XEN)     run: [0.1] pcpu=1
(XEN) CPU[02] sibling={2}, core={2}, unit=d0v2
(XEN)     run: [0.2] pcpu=2
(XEN) CPU[03] sibling={3}, core={3}, unit=d0v3
(XEN)     run: [0.3] pcpu=3
(XEN) CPU[04] sibling={4}, core={4}, unit=d0v4
(XEN)     run: [0.4] pcpu=4
(XEN) CPU[05] sibling={5}, core={5}, unit=d1v0
(XEN) Power on resource 215
(XEN) sched_smt_power_savings: disabled
(XEN) NOW=241210530250
(XEN) Online Cpus: 0-5
(XEN) Cpupool 0:
(XEN) Cpus: 0-5
(XEN) Scheduler: null Scheduler (null)
(XEN)     cpus_free =
(XEN) Domain info:
(XEN)     Domain: 0
(XEN)       1: [0.0] pcpu=0
(XEN)       2: [0.1] pcpu=1
(XEN)       3: [0.2] pcpu=2
(XEN)       4: [0.3] pcpu=3
(XEN)       5: [0.4] pcpu=4
(XEN)     Domain: 1
(XEN)       6: [1.0] pcpu=5
(XEN)     Domain: 2
(XEN)       7: [2.0] pcpu=-1
(XEN) Waitqueue: d2v0
(XEN) CPUs info:
(XEN) CPU[00] sibling={0}, core={0}, unit=d0v0
(XEN)     run: [0.0] pcpu=0
(XEN) CPU[01] sibling={1}, core={1}, unit=d0v1
(XEN)     run: [0.1] pcpu=1
(XEN) CPU[02] sibling={2}, core={2}, unit=d0v2
(XEN)     run: [0.2] pcpu=2
(XEN) CPU[03] sibling={3}, core={3}, unit=d0v3
(XEN)     run: [0.3] pcpu=3
(XEN) CPU[04] sibling={4}, core={4}, unit=d0v4
(XEN)     run: [0.4] pcpu=4
(XEN) CPU[05] sibling={5}, core={5}, unit=d1v0


I then repeated the "xl create" a few times until it caused the 
complete_domain_destroy function to be called.
Then the information looked like this:

# xl debug-keys r
(XEN) sched_smt_power_savings: disabled
(XEN) NOW=850134473000
(XEN) Online Cpus: 0-5
(XEN) Cpupool 0:
(XEN) Cpus: 0-5
(XEN) Scheduler: null Scheduler (null)
(XEN)     cpus_free =
(XEN) Domain info:
(XEN)     Domain: 0
(XEN)       1: [0.0] pcpu=0
(XEN)       2: [0.1] pcpu=1
(XEN)       3: [0.2] pcpu=2
(XEN)       4: [0.3] pcpu=3
(XEN)       5: [0.4] pcpu=4
(XEN)     Domain: 2
(XEN)       6: [2.0] pcpu=5
(XEN)     Domain: 3
(XEN)     Domain: 4
(XEN) Waitqueue:
(XEN) CPUs info:
(XEN) CPU[00] sibling={0}, core={0}, unit=d0v0
(XEN)     run: [0.0] pcpu=0
(XEN) CPU[01] sibling={1}, core={1}, unit=d0v1
(XEN)     run: [0.1] pcpu=1
(XEN) CPU[02] sibling={2}, core={2}, unit=d0v2
(XEN)     run: [0.2] pcpu=2
(XEN) CPU[03] sibling={3}, core={3}, unit=d0v3
(XEN)     run: [0.3] pcpu=3
(XEN) CPU[04] sibling={4}, core={4}, unit=d0v4
(XEN)     run: [0.4] pcpu=4
(XEN) CPU[05] sibling={5}, core={5}, unit=d2v0
(XEN)     run: [2.0] pcpu=5

# xl list
Name                                        ID   Mem VCPUs State    Time(s)
Domain-0                                     0  3000     5 r-----    4277.7
mydomu                                       2   511     1 r-----      15.6


>
>> One possibility is the NULL scheduler doesn't release the pCPUs until
>> the domain is fully destroyed. So if there is no pCPU free, it
>> wouldn't
>> be able to schedule the new domain.
>>
>> However, I would have expected the NULL scheduler to refuse the
>> domain
>> to create if there is no pCPU available.
>>
> Yeah but, unfortunately, the scheduler does not have it easy to fail
> domain creation at this stage (i.e., when we realize there are no
> available pCPUs). That's the reason why the NULL scheduler has a
> waitqueue, where vCPUs that cannot be put on any pCPU are put.
>
> Of course, this is a configuration error (or a bug, like maybe in this
> case :-/), and we print warnings when it happens.
>
>> @Dario, @Stefano, do you know when the NULL scheduler decides to
>> allocate the pCPU?
>>
> On which pCPU to allocate a vCPU is decided in null_unit_insert(),
> called from sched_alloc_unit() and sched_init_vcpu().
>
> On the other hand, a vCPU is properly removed from its pCPU, hence
> making the pCPU free for being assigned to some other vCPU, in
> unit_deassign(), called from null_unit_remove(), which in turn is
> called from sched_destroy_vcpu() Which is indeed called from
> complete_domain_destroy().
>
> Regards




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-01-26 17:03             ` Anders Törnqvist
@ 2021-01-26 22:31               ` Dario Faggioli
  2021-01-29  8:08                 ` Anders Törnqvist
  0 siblings, 1 reply; 31+ messages in thread
From: Dario Faggioli @ 2021-01-26 22:31 UTC (permalink / raw)
  To: Anders Törnqvist, Julien Grall, xen-devel, Stefano Stabellini

[-- Attachment #1: Type: text/plain, Size: 6402 bytes --]

On Tue, 2021-01-26 at 18:03 +0100, Anders Törnqvist wrote:
> On 1/25/21 5:11 PM, Dario Faggioli wrote:
> > On Fri, 2021-01-22 at 14:26 +0000, Julien Grall wrote:
> > > Hi Anders,
> > > 
> > > On 22/01/2021 08:06, Anders Törnqvist wrote:
> > > > On 1/22/21 12:35 AM, Dario Faggioli wrote:
> > > > > On Thu, 2021-01-21 at 19:40 +0000, Julien Grall wrote:
> > > > - booting with "sched=null vwfi=native" but not doing the IRQ
> > > > passthrough that you mentioned above
> > > > "xl destroy" gives
> > > > (XEN) End of domain_destroy function
> > > > 
> > > > Then a "xl create" says nothing but the domain has not started
> > > > correct.
> > > > "xl list" look like this for the domain:
> > > > mydomu                                   2   512     1 ------
> > > > 0.0
> > > This is odd. I would have expected ``xl create`` to fail if
> > > something
> > > went wrong with the domain creation.
> > > 
> > So, Anders, would it be possible to issue a:
> > 
> > # xl debug-keys r
> > # xl dmesg
> > 
> > And send it to us ?
> > 
> > Ideally, you'd do it:
> >   - with Julien's patch (the one he sent the other day, and that
> > you
> >     have already given a try to) applied
> >   - while you are in the state above, i.e., after having tried to
> >     destroy a domain and failing
> >   - and maybe again after having tried to start a new domain
> Here are some logs.
> 
Great, thanks a lot!

> The system is booted as before with the patch and the domu config
> does 
> not have the IRQs.
> 
Ok.

> # xl list
> Name                                        ID   Mem VCPUs State   
> Time(s)
> Domain-0                                     0  3000     5 r-----    
> 820.1
> mydomu                                       1   511     1 r-----    
> 157.0
> 
> # xl debug-keys r
> (XEN) sched_smt_power_savings: disabled
> (XEN) NOW=191793008000
> (XEN) Online Cpus: 0-5
> (XEN) Cpupool 0:
> (XEN) Cpus: 0-5
> (XEN) Scheduler: null Scheduler (null)
> (XEN)     cpus_free =
> (XEN) Domain info:
> (XEN)     Domain: 0
> (XEN)       1: [0.0] pcpu=0
> (XEN)       2: [0.1] pcpu=1
> (XEN)       3: [0.2] pcpu=2
> (XEN)       4: [0.3] pcpu=3
> (XEN)       5: [0.4] pcpu=4
> (XEN)     Domain: 1
> (XEN)       6: [1.0] pcpu=5
> (XEN) Waitqueue:
>
So far, so good. All vCPUs are running on their assigned pCPU, and
there is no vCPU wanting to run but not having a pCPU where to do so.

> (XEN) Command line: console=dtuart dtuart=/serial@5a060000 
> dom0_mem=3000M dom0_max_vcpus=5 hmp-unsafe=true dom0_vcpus_pin 
> sched=null vwfi=native
>
Oh, just as a side note (and most likely unrelated to the problem we're
discussing), you should be able to get rid of dom0_vcpus_pin.

The NULL scheduler will do something similar to what that option itself
does anyway. And with the benefit that, if you want, you can actually
change to what pCPUs the dom0's vCPU are pinned. While, if you use
dom0_vcpus_pin, you can't.

So using it has only downsides (and that's true in general, if you
ask me, but particularly so if using NULL).
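
(In case it's useful: if you ever do want an explicit 1:1 pinning for dom0's
vCPUs without the boot parameter, you can always set it up at runtime with
something like the following -- purely illustrative, adjust to your actual
vCPU/pCPU numbering:

# xl vcpu-pin Domain-0 0 0
# xl vcpu-pin Domain-0 1 1
# xl vcpu-pin Domain-0 2 2
# xl vcpu-pin Domain-0 3 3
# xl vcpu-pin Domain-0 4 4

...and, contrary to dom0_vcpus_pin, you can change or undo it later.)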

> # xl destroy mydomu
> (XEN) End of domain_destroy function
> 
> # xl list
> Name                                        ID   Mem VCPUs State   
> Time(s)
> Domain-0                                     0  3000     5 r-----   
> 1057.9
> 
> # xl debug-keys r
> (XEN) sched_smt_power_savings: disabled
> (XEN) NOW=223871439875
> (XEN) Online Cpus: 0-5
> (XEN) Cpupool 0:
> (XEN) Cpus: 0-5
> (XEN) Scheduler: null Scheduler (null)
> (XEN)     cpus_free =
> (XEN) Domain info:
> (XEN)     Domain: 0
> (XEN)       1: [0.0] pcpu=0
> (XEN)       2: [0.1] pcpu=1
> (XEN)       3: [0.2] pcpu=2
> (XEN)       4: [0.3] pcpu=3
> (XEN)       5: [0.4] pcpu=4
> (XEN)     Domain: 1
> (XEN)       6: [1.0] pcpu=5
>
Right. And from the fact that: 1) we only see the "End of
domain_destroy function" line in the logs, and 2) we see that the vCPU
is still listed here, we have our confirmation (like there was the
need for it :-/) that domain destruction is done only partially.

> # xl create mydomu.cfg
> Parsing config from mydomu.cfg
> (XEN) Power on resource 215
> 
> # xl list
> Name                                        ID   Mem VCPUs State   
> Time(s)
> Domain-0                                     0  3000     5 r-----   
> 1152.1
> mydomu                                       2   512     1 ------
>        0.0
> 
> # xl debug-keys r
> (XEN) sched_smt_power_savings: disabled
> (XEN) NOW=241210530250
> (XEN) Online Cpus: 0-5
> (XEN) Cpupool 0:
> (XEN) Cpus: 0-5
> (XEN) Scheduler: null Scheduler (null)
> (XEN)     cpus_free =
> (XEN) Domain info:
> (XEN)     Domain: 0
> (XEN)       1: [0.0] pcpu=0
> (XEN)       2: [0.1] pcpu=1
> (XEN)       3: [0.2] pcpu=2
> (XEN)       4: [0.3] pcpu=3
> (XEN)       5: [0.4] pcpu=4
> (XEN)     Domain: 1
> (XEN)       6: [1.0] pcpu=5
> (XEN)     Domain: 2
> (XEN)       7: [2.0] pcpu=-1
> (XEN) Waitqueue: d2v0
>
Yep, so, as we were suspecting, domain 1 was not destroyed properly.
Specifically, we did not get to the point where the vCPU is deallocated
and the pCPU to which such vCPU has been assigned by the NULL
scheduler is released.

This means that the new vCPU (i.e., d2v0) has, from the point of view
of the NULL scheduler, no pCPU where to run. And it's therefore parked
in the waitqueue.

There should be a warning about that, which I don't see... but perhaps
I'm just misremembering.

Anyway, cool, this makes things even more clear.

Thanks again for letting us see these logs.
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-01-26 22:31               ` Dario Faggioli
@ 2021-01-29  8:08                 ` Anders Törnqvist
  2021-01-29  8:18                   ` Jürgen Groß
  2021-01-30 17:59                   ` Dario Faggioli
  0 siblings, 2 replies; 31+ messages in thread
From: Anders Törnqvist @ 2021-01-29  8:08 UTC (permalink / raw)
  To: Dario Faggioli, Julien Grall, xen-devel, Stefano Stabellini

On 1/26/21 11:31 PM, Dario Faggioli wrote:
> On Tue, 2021-01-26 at 18:03 +0100, Anders Törnqvist wrote:
>> On 1/25/21 5:11 PM, Dario Faggioli wrote:
>>> On Fri, 2021-01-22 at 14:26 +0000, Julien Grall wrote:
>>>> Hi Anders,
>>>>
>>>> On 22/01/2021 08:06, Anders Törnqvist wrote:
>>>>> On 1/22/21 12:35 AM, Dario Faggioli wrote:
>>>>>> On Thu, 2021-01-21 at 19:40 +0000, Julien Grall wrote:
>>>>> - booting with "sched=null vwfi=native" but not doing the IRQ
>>>>> passthrough that you mentioned above
>>>>> "xl destroy" gives
>>>>> (XEN) End of domain_destroy function
>>>>>
>>>>> Then a "xl create" says nothing but the domain has not started
>>>>> correct.
>>>>> "xl list" look like this for the domain:
>>>>> mydomu                                   2   512     1 ------
>>>>> 0.0
>>>> This is odd. I would have expected ``xl create`` to fail if
>>>> something
>>>> went wrong with the domain creation.
>>>>
>>> So, Anders, would it be possible to issue a:
>>>
>>> # xl debug-keys r
>>> # xl dmesg
>>>
>>> And send it to us ?
>>>
>>> Ideally, you'd do it:
>>>    - with Julien's patch (the one he sent the other day, and that
>>> you
>>>      have already given a try to) applied
>>>    - while you are in the state above, i.e., after having tried to
>>>      destroy a domain and failing
>>>    - and maybe again after having tried to start a new domain
>> Here are some logs.
>>
> Great, thanks a lot!
>
>> The system is booted as before with the patch and the domu config
>> does
>> not have the IRQs.
>>
> Ok.
>
>> # xl list
>> Name                                        ID   Mem VCPUs State
>> Time(s)
>> Domain-0                                     0  3000     5 r-----
>> 820.1
>> mydomu                                       1   511     1 r-----
>> 157.0
>>
>> # xl debug-keys r
>> (XEN) sched_smt_power_savings: disabled
>> (XEN) NOW=191793008000
>> (XEN) Online Cpus: 0-5
>> (XEN) Cpupool 0:
>> (XEN) Cpus: 0-5
>> (XEN) Scheduler: null Scheduler (null)
>> (XEN)     cpus_free =
>> (XEN) Domain info:
>> (XEN)     Domain: 0
>> (XEN)       1: [0.0] pcpu=0
>> (XEN)       2: [0.1] pcpu=1
>> (XEN)       3: [0.2] pcpu=2
>> (XEN)       4: [0.3] pcpu=3
>> (XEN)       5: [0.4] pcpu=4
>> (XEN)     Domain: 1
>> (XEN)       6: [1.0] pcpu=5
>> (XEN) Waitqueue:
>>
> So far, so good. All vCPUs are running on their assigned pCPU, and
> there is no vCPU wanting to run but not having a pCPU where to do so.
>
>> (XEN) Command line: console=dtuart dtuart=/serial@5a060000
>> dom0_mem=3000M dom0_max_vcpus=5 hmp-unsafe=true dom0_vcpus_pin
>> sched=null vwfi=native
>>
> Oh, just as a side note (and most likely unrelated to the problem we're
> discussing), you should be able to get rid of dom0_vcpus_pin.
>
> The NULL scheduler will do something similar to what that option itself
> does anyway. And with the benefit that, if you want, you can actually
> change to what pCPUs the dom0's vCPU are pinned. While, if you use
> dom0_vcpus_pin, you can't.
>
> So using it has only downsides (and that's true in general, if you
> ask me, but particularly so if using NULL).
Thanks for the feedback.
I removed dom0_vcpus_pin. And, as you said, it seems to be unrelated to 
the problem we're discussing. The system still behaves the same.

When the dom0_vcpus_pin is removed. xl vcpu-list looks like this:

Name                                ID  VCPU   CPU State Time(s) 
Affinity (Hard / Soft)
Domain-0                             0     0    0   r--      29.4 all / all
Domain-0                             0     1    1   r--      28.7 all / all
Domain-0                             0     2    2   r--      28.7 all / all
Domain-0                             0     3    3   r--      28.6 all / all
Domain-0                             0     4    4   r--      28.6 all / all
mydomu                              1     0    5   r--      21.6 5 / all

 From this listing (with "all" as hard affinity for dom0) one might read 
it like dom0 is not pinned with hard affinity to any specific pCPUs at 
all but mydomu is pinned to pCPU 5.
Will the dom0_max_vcpus=5 in this case guarantee that dom0 only will run 
on pCPU 0-4 so that mydomu always will have pCPU 5 for itself only?

What if I would like mydomu to be the only domain that uses pCPU 2?

>
>> # xl destroy mydomu
>> (XEN) End of domain_destroy function
>>
>> # xl list
>> Name                                        ID   Mem VCPUs State
>> Time(s)
>> Domain-0                                     0  3000     5 r-----
>> 1057.9
>>
>> # xl debug-keys r
>> (XEN) sched_smt_power_savings: disabled
>> (XEN) NOW=223871439875
>> (XEN) Online Cpus: 0-5
>> (XEN) Cpupool 0:
>> (XEN) Cpus: 0-5
>> (XEN) Scheduler: null Scheduler (null)
>> (XEN)     cpus_free =
>> (XEN) Domain info:
>> (XEN)     Domain: 0
>> (XEN)       1: [0.0] pcpu=0
>> (XEN)       2: [0.1] pcpu=1
>> (XEN)       3: [0.2] pcpu=2
>> (XEN)       4: [0.3] pcpu=3
>> (XEN)       5: [0.4] pcpu=4
>> (XEN)     Domain: 1
>> (XEN)       6: [1.0] pcpu=5
>>
> Right. And from the fact that: 1) we only see the "End of
> domain_destroy function" line in the logs, and 2) we see that the vCPU
> is still listed here, we have our confirmation (like there was the
> need for it :-/) that domain destruction is done only partially.
Yes it looks like that.
>
>> # xl create mydomu.cfg
>> Parsing config from mydomu.cfg
>> (XEN) Power on resource 215
>>
>> # xl list
>> Name                                        ID   Mem VCPUs State
>> Time(s)
>> Domain-0                                     0  3000     5 r-----
>> 1152.1
>> mydomu                                       2   512     1 ------
>>         0.0
>>
>> # xl debug-keys r
>> (XEN) sched_smt_power_savings: disabled
>> (XEN) NOW=241210530250
>> (XEN) Online Cpus: 0-5
>> (XEN) Cpupool 0:
>> (XEN) Cpus: 0-5
>> (XEN) Scheduler: null Scheduler (null)
>> (XEN)     cpus_free =
>> (XEN) Domain info:
>> (XEN)     Domain: 0
>> (XEN)       1: [0.0] pcpu=0
>> (XEN)       2: [0.1] pcpu=1
>> (XEN)       3: [0.2] pcpu=2
>> (XEN)       4: [0.3] pcpu=3
>> (XEN)       5: [0.4] pcpu=4
>> (XEN)     Domain: 1
>> (XEN)       6: [1.0] pcpu=5
>> (XEN)     Domain: 2
>> (XEN)       7: [2.0] pcpu=-1
>> (XEN) Waitqueue: d2v0
>>
> Yep, so, as we were suspecting, domain 1 was not destroyed properly.
> Specifically, we did not get to the point where the vCPU is deallocated
> and the pCPU to which such vCPU has been assigned by the NULL
> scheduler is released.
>
> This means that the new vCPU (i.e., d2v0) has, from the point of view
> of the NULL scheduler, no pCPU where to run. And it's therefore parked
> in the waitqueue.
>
> There should be a warning about that, which I don't see... but perhaps
> I'm just misremembering.
>
> Anyway, cool, this makes things even more clear.
>
> Thanks again for letting us see these logs.

Thanks for the attention to this :-)

Any ideas for how to solve it?




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-01-29  8:08                 ` Anders Törnqvist
@ 2021-01-29  8:18                   ` Jürgen Groß
  2021-01-29 10:16                     ` Dario Faggioli
  2021-01-30 17:59                   ` Dario Faggioli
  1 sibling, 1 reply; 31+ messages in thread
From: Jürgen Groß @ 2021-01-29  8:18 UTC (permalink / raw)
  To: Anders Törnqvist, Dario Faggioli, Julien Grall, xen-devel,
	Stefano Stabellini


[-- Attachment #1.1.1: Type: text/plain, Size: 5145 bytes --]

On 29.01.21 09:08, Anders Törnqvist wrote:
> On 1/26/21 11:31 PM, Dario Faggioli wrote:
>> On Tue, 2021-01-26 at 18:03 +0100, Anders Törnqvist wrote:
>>> On 1/25/21 5:11 PM, Dario Faggioli wrote:
>>>> On Fri, 2021-01-22 at 14:26 +0000, Julien Grall wrote:
>>>>> Hi Anders,
>>>>>
>>>>> On 22/01/2021 08:06, Anders Törnqvist wrote:
>>>>>> On 1/22/21 12:35 AM, Dario Faggioli wrote:
>>>>>>> On Thu, 2021-01-21 at 19:40 +0000, Julien Grall wrote:
>>>>>> - booting with "sched=null vwfi=native" but not doing the IRQ
>>>>>> passthrough that you mentioned above
>>>>>> "xl destroy" gives
>>>>>> (XEN) End of domain_destroy function
>>>>>>
>>>>>> Then a "xl create" says nothing but the domain has not started
>>>>>> correct.
>>>>>> "xl list" look like this for the domain:
>>>>>> mydomu                                   2   512     1 ------
>>>>>> 0.0
>>>>> This is odd. I would have expected ``xl create`` to fail if
>>>>> something
>>>>> went wrong with the domain creation.
>>>>>
>>>> So, Anders, would it be possible to issue a:
>>>>
>>>> # xl debug-keys r
>>>> # xl dmesg
>>>>
>>>> And send it to us ?
>>>>
>>>> Ideally, you'd do it:
>>>>    - with Julien's patch (the one he sent the other day, and that
>>>> you
>>>>      have already given a try to) applied
>>>>    - while you are in the state above, i.e., after having tried to
>>>>      destroy a domain and failing
>>>>    - and maybe again after having tried to start a new domain
>>> Here are some logs.
>>>
>> Great, thanks a lot!
>>
>>> The system is booted as before with the patch and the domu config
>>> does
>>> not have the IRQs.
>>>
>> Ok.
>>
>>> # xl list
>>> Name                                        ID   Mem VCPUs State
>>> Time(s)
>>> Domain-0                                     0  3000     5 r-----
>>> 820.1
>>> mydomu                                       1   511     1 r-----
>>> 157.0
>>>
>>> # xl debug-keys r
>>> (XEN) sched_smt_power_savings: disabled
>>> (XEN) NOW=191793008000
>>> (XEN) Online Cpus: 0-5
>>> (XEN) Cpupool 0:
>>> (XEN) Cpus: 0-5
>>> (XEN) Scheduler: null Scheduler (null)
>>> (XEN)     cpus_free =
>>> (XEN) Domain info:
>>> (XEN)     Domain: 0
>>> (XEN)       1: [0.0] pcpu=0
>>> (XEN)       2: [0.1] pcpu=1
>>> (XEN)       3: [0.2] pcpu=2
>>> (XEN)       4: [0.3] pcpu=3
>>> (XEN)       5: [0.4] pcpu=4
>>> (XEN)     Domain: 1
>>> (XEN)       6: [1.0] pcpu=5
>>> (XEN) Waitqueue:
>>>
>> So far, so good. All vCPUs are running on their assigned pCPU, and
>> there is no vCPU wanting to run but not having a pCPU where to do so.
>>
>>> (XEN) Command line: console=dtuart dtuart=/serial@5a060000
>>> dom0_mem=3000M dom0_max_vcpus=5 hmp-unsafe=true dom0_vcpus_pin
>>> sched=null vwfi=native
>>>
>> Oh, just as a side note (and most likely unrelated to the problem we're
>> discussing), you should be able to get rid of dom0_vcpus_pin.
>>
>> The NULL scheduler will do something similar to what that option itself
>> does anyway. And with the benefit that, if you want, you can actually
>> change to what pCPUs the dom0's vCPU are pinned. While, if you use
>> dom0_vcpus_pin, you can't.
>>
>> So using it has only downsides (and that's true in general, if you
>> ask me, but particularly so if using NULL).
> Thanks for the feedback.
> I removed dom0_vcpus_pin. And, as you said, it seems to be unrelated to 
> the problem we're discussing. The system still behaves the same.
> 
> When the dom0_vcpus_pin is removed. xl vcpu-list looks like this:
> 
> Name                                ID  VCPU   CPU State Time(s) 
> Affinity (Hard / Soft)
> Domain-0                             0     0    0   r--      29.4 all / all
> Domain-0                             0     1    1   r--      28.7 all / all
> Domain-0                             0     2    2   r--      28.7 all / all
> Domain-0                             0     3    3   r--      28.6 all / all
> Domain-0                             0     4    4   r--      28.6 all / all
> mydomu                              1     0    5   r--      21.6 5 / all
> 
>  From this listing (with "all" as hard affinity for dom0) one might read 
> it like dom0 is not pinned with hard affinity to any specific pCPUs at 
> all but mydomu is pinned to pCPU 5.
> Will the dom0_max_vcpus=5 in this case guarantee that dom0 only will run 
> on pCPU 0-4 so that mydomu always will have pCPU 5 for itself only?

No.

> 
> What if I would like mydomu to be the only domain that uses pCPU 2?

Setup a cpupool with that pcpu assigned to it and put your domain into
that cpupool.
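
For example, something along these lines should do it (the pool name is just
an example, and the exact syntax is from memory, so double check it against
your xl):

# cat pool-rt.cfg
name = "pool-rt"
sched = "null"
cpus = ["2"]
# xl cpupool-cpu-remove Pool-0 2
# xl cpupool-create pool-rt.cfg
# xl cpupool-migrate mydomu pool-rt

After that, pCPU 2 belongs to the new pool and only mydomu can be scheduled
on it. You can also put pool = "pool-rt" in the domain's config file, so
that it starts in that pool directly.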


Juergen

[-- Attachment #1.1.2: OpenPGP_0xB0DE9DD628BF132F.asc --]
[-- Type: application/pgp-keys, Size: 3135 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-01-29  8:18                   ` Jürgen Groß
@ 2021-01-29 10:16                     ` Dario Faggioli
  2021-02-01  6:53                       ` Anders Törnqvist
  0 siblings, 1 reply; 31+ messages in thread
From: Dario Faggioli @ 2021-01-29 10:16 UTC (permalink / raw)
  To: Jürgen Groß,
	Anders Törnqvist, Julien Grall, xen-devel,
	Stefano Stabellini

[-- Attachment #1: Type: text/plain, Size: 6388 bytes --]

On Fri, 2021-01-29 at 09:18 +0100, Jürgen Groß wrote:
> On 29.01.21 09:08, Anders Törnqvist wrote:
> > > 
> > > So using it has only downsides (and that's true in general, if
> > > you
> > > ask me, but particularly so if using NULL).
> > Thanks for the feedback.
> > I removed dom0_vcpus_pin. And, as you said, it seems to be
> > unrelated to 
> > the problem we're discussing. 
>
Right. Don't put it back, and stay away from it, if you accept a bit of
advice. :-)

> > The system still behaves the same.
> > 
Yeah, that was expected.

> > When the dom0_vcpus_pin is removed. xl vcpu-list looks like this:
> > 
> > Name                                ID  VCPU   CPU State Time(s) 
> > Affinity (Hard / Soft)
> > Domain-0                             0     0    0   r--      29.4
> > all / all
> > Domain-0                             0     1    1   r--      28.7
> > all / all
> > Domain-0                             0     2    2   r--      28.7
> > all / all
> > Domain-0                             0     3    3   r--      28.6
> > all / all
> > Domain-0                             0     4    4   r--      28.6
> > all / all
> > mydomu                              1     0    5   r--      21.6 5
> > / all
> > 
Right, and it makes sense for it to look like this.

> >  From this listing (with "all" as hard affinity for dom0) one might
> > read 
> > it like dom0 is not pinned with hard affinity to any specific pCPUs
> > at 
> > all but mydomu is pinned to pCPU 5.
> > Will the dom0_max_vcpus=5 in this case guarantee that dom0 only
> > will run 
> > on pCPU 0-4 so that mydomu always will have pCPU 5 for itself only?
> 
> No.
>
Well, yes... if you use the NULL scheduler. Which is in use here. :-)

Basically, the NULL scheduler _always_ assigns one and only one vCPU to
each pCPU. This happens at domain (well, at the vCPU) creation time.
And it _never_ moves a vCPU away from the pCPU to which it has assigned
it.

And it also _never_ changes this vCPU-->pCPU assignment/relationship,
unless some special event happens (such as, either the vCPU and/or the
pCPU goes offline, is removed from the cpupool, you change the affinity
[as I'll explain below], etc).

This is the NULL scheduler's mission and only job, so it does that by
default, _without_ any need for an affinity to be specified.

So, how can affinity be useful in the NULL scheduler? Well, it's useful
if you want to control and decide to what pCPU a certain vCPU should
go.

So, let's make an example. Let's say you are in this situation:

Name                                ID  VCPU   CPU State Time(s) Affinity (Hard / Soft)
Domain-0                             0     0    0   r--     29.4   all / all
Domain-0                             0     1    1   r--     28.7   all / all
Domain-0                             0     2    2   r--     28.7   all / all
Domain-0                             0     3    3   r--     28.6   all / all
Domain-0                             0     4    4   r--     28.6   all / all

I.e., you have 6 CPUs, you have only dom0, dom0 has 5 vCPUs and you are
not using dom0_vcpus_pin.

The NULL scheduler has put d0v0 on pCPU 0. And d0v0 is the only vCPU
that can run on pCPU 0, despite its affinities being "all"... because
it's what the NULL scheduler does for you and it's the reason why one
uses it! :-)

Similarly, it has put d0v1 on pCPU 1, d0v2 on pCPU 2, d0v3 on pCPU 3
and d0v4 on pCPU 4. And the "exclusivity guarantee" explained above
for d0v0 and pCPU 0, applies to all these other vCPUs and pCPUs as
well.

With no affinity being specified, which vCPU is assigned to which pCPU
is entirely under the NULL scheduler control. It has its heuristics
inside, to try to do that in a smart way, but that's an
internal/implementation detail and is not relevant here.

If you now create a domU with 1 vCPU, that vCPU will be assigned to
pCPU 5.

Now, let's say that, for whatever reason, you absolutely want that d0v2
to run on pCPU 5, instead of being assigned and run on pCPU 2 (which is
what the NULL scheduler decided to pick for it). Well, what you do is
use xl, set the affinity of d0v2 to pCPU 5, and you will get something
like this as a result:

Name                                ID  VCPU   CPU State Time(s) Affinity (Hard / Soft)
Domain-0                             0     0    0   r--     29.4   all / all
Domain-0                             0     1    1   r--     28.7   all / all
Domain-0                             0     2    5   r--     28.7     5 / all
Domain-0                             0     3    3   r--     28.6   all / all
Domain-0                             0     4    4   r--     28.6   all / all
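
(For reference, the actual command for that would be something like:

# xl vcpu-pin Domain-0 2 5

i.e., hard-pin d0v2 to pCPU 5. For a domU you can get the same effect at
creation time with, e.g., cpus="5" in its config file.)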

So, affinity is indeed useful, even when using NULL, if you want to
diverge from the default behavior and enact a certain policy, maybe due
to the nature of your workload, the characteristics of your hardware,
or whatever.

It is not, however, necessary to set the affinity to:
 - have a vCPU always stay on one --and always the same one too--
   pCPU;
 - prevent any other vCPU from ever running on that pCPU.

That is guaranteed by the NULL scheduler itself. It just can't happen
that it behaves otherwise, because the whole point of doing it was to
make it simple (and fast :-)) *exactly* by avoiding teaching it how to
do such things. It can't do it, because the code for doing it is not
there... by design! :-D

And, BTW, if you now create a domU with 1 vCPU, that vCPU will be
assigned to pCPU  2.

> >
> > What if I would like mydomu to be the only domain that uses pCPU 2?
>
> Setup a cpupool with that pcpu assigned to it and put your domain into
> that cpupool.

Yes, with any other scheduler that is not NULL, that's the proper way
of doing it.

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-01-29  8:08                 ` Anders Törnqvist
  2021-01-29  8:18                   ` Jürgen Groß
@ 2021-01-30 17:59                   ` Dario Faggioli
  2021-02-01  6:55                     ` Anders Törnqvist
                                       ` (2 more replies)
  1 sibling, 3 replies; 31+ messages in thread
From: Dario Faggioli @ 2021-01-30 17:59 UTC (permalink / raw)
  To: Anders Törnqvist, Julien Grall, xen-devel, Stefano Stabellini


[-- Attachment #1.1: Type: text/plain, Size: 1772 bytes --]

On Fri, 2021-01-29 at 09:08 +0100, Anders Törnqvist wrote:
> On 1/26/21 11:31 PM, Dario Faggioli wrote:
> > Thanks again for letting us see these logs.
> 
> Thanks for the attention to this :-)
> 
> Any ideas for how to solve it?
> 
So, you're up for testing patches, right?

How about applying these two, and letting me know what happens? :-D

They are on top of current staging. I can try to rebase on something
else, if it's easier for you to test.

Besides being attached, they're also available here:

https://gitlab.com/xen-project/people/dfaggioli/xen/-/tree/rcu-quiet-fix

I could not test them properly on ARM, as I don't have an ARM system
handy, so everything is possible really... just let me know.

It should at least build fine, AFAICT from here:

https://gitlab.com/xen-project/people/dfaggioli/xen/-/pipelines/249101213
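
(If fetching the branch is easier for you than applying the attached
patches, something like this should work, assuming the repo and branch
names match the URL above:

$ git fetch https://gitlab.com/xen-project/people/dfaggioli/xen.git rcu-quiet-fix
$ git checkout -b rcu-quiet-fix FETCH_HEAD

but, again, the attachments are the authoritative version.)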

Julien, back in:

 https://lore.kernel.org/xen-devel/315740e1-3591-0e11-923a-718e06c36445@arm.com/


you said I should hook in enter_hypervisor_head(),
leave_hypervisor_tail(). Those functions are gone now and looking at
how the code changed, this is where I figured I should put the calls
(see the second patch). But feel free to educate me otherwise.

For x86 people that are listening... Do we have, in our beloved arch,
equally handy places (i.e., right before leaving Xen for a guest and
right after entering Xen from one), preferably in a C file, and for
all guests... like it seems to be the case on ARM?

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

[-- Attachment #1.2: 1-xen-rcu-rename-idle-ignore.patch --]
[-- Type: text/x-patch, Size: 10925 bytes --]

commit 2c38754fa733333a81e8dfab8abdfb18b9896e00
Author: Dario Faggioli <dfaggioli@suse.com>
Date:   Sat Jan 30 07:50:22 2021 +0000

    xen: rename RCU idle timer and cpumask
    
    Both the cpumask and the timer will be used in more generic
    circumstances, not only for CPUs that go idle. Change their names to
    reflect that.
    
    No functional change.
    
    Signed-off-by: Dario Faggioli <dfaggioli@suse.com>

diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
index a5a27af3de..e0bf842f13 100644
--- a/xen/common/rcupdate.c
+++ b/xen/common/rcupdate.c
@@ -55,8 +55,8 @@ static struct rcu_ctrlblk {
     int  next_pending;  /* Is the next batch already waiting?         */
 
     spinlock_t  lock __cacheline_aligned;
-    cpumask_t   cpumask; /* CPUs that need to switch in order ... */
-    cpumask_t   idle_cpumask; /* ... unless they are already idle */
+    cpumask_t   cpumask; /* CPUs that need to switch in order ...   */
+    cpumask_t   ignore_cpumask; /* ... unless they are already idle */
     /* for current batch to proceed.        */
 } __cacheline_aligned rcu_ctrlblk = {
     .cur = -300,
@@ -88,8 +88,8 @@ struct rcu_data {
     long            last_rs_qlen;     /* qlen during the last resched */
 
     /* 3) idle CPUs handling */
-    struct timer idle_timer;
-    bool idle_timer_active;
+    struct timer cb_timer;
+    bool cb_timer_active;
 
     bool            process_callbacks;
     bool            barrier_active;
@@ -121,22 +121,22 @@ struct rcu_data {
  * CPU that is going idle. The user can change this, via a boot time
  * parameter, but only up to 100ms.
  */
-#define IDLE_TIMER_PERIOD_MAX     MILLISECS(100)
-#define IDLE_TIMER_PERIOD_DEFAULT MILLISECS(10)
-#define IDLE_TIMER_PERIOD_MIN     MICROSECS(100)
+#define CB_TIMER_PERIOD_MAX     MILLISECS(100)
+#define CB_TIMER_PERIOD_DEFAULT MILLISECS(10)
+#define CB_TIMER_PERIOD_MIN     MICROSECS(100)
 
-static s_time_t __read_mostly idle_timer_period;
+static s_time_t __read_mostly cb_timer_period;
 
 /*
- * Increment and decrement values for the idle timer handler. The algorithm
+ * Increment and decrement values for the callback timer handler. The algorithm
  * works as follows:
  * - if the timer actually fires, and it finds out that the grace period isn't
- *   over yet, we add IDLE_TIMER_PERIOD_INCR to the timer's period;
+ *   over yet, we add CB_TIMER_PERIOD_INCR to the timer's period;
  * - if the timer actually fires and it finds the grace period over, we
  *   subtract IDLE_TIMER_PERIOD_DECR from the timer's period.
  */
-#define IDLE_TIMER_PERIOD_INCR    MILLISECS(10)
-#define IDLE_TIMER_PERIOD_DECR    MICROSECS(100)
+#define CB_TIMER_PERIOD_INCR    MILLISECS(10)
+#define CB_TIMER_PERIOD_DECR    MICROSECS(100)
 
 static DEFINE_PER_CPU(struct rcu_data, rcu_data);
 
@@ -364,7 +364,7 @@ static void rcu_start_batch(struct rcu_ctrlblk *rcp)
         * This barrier is paired with the one in rcu_idle_enter().
         */
         smp_mb();
-        cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->idle_cpumask);
+        cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->ignore_cpumask);
     }
 }
 
@@ -523,7 +523,7 @@ int rcu_needs_cpu(int cpu)
 {
     struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
 
-    return (rdp->curlist && !rdp->idle_timer_active) || rcu_pending(cpu);
+    return (rdp->curlist && !rdp->cb_timer_active) || rcu_pending(cpu);
 }
 
 /*
@@ -531,7 +531,7 @@ int rcu_needs_cpu(int cpu)
  * periodically poke rcu_pedning(), so that it will invoke the callback
  * not too late after the end of the grace period.
  */
-static void rcu_idle_timer_start(void)
+static void cb_timer_start(void)
 {
     struct rcu_data *rdp = &this_cpu(rcu_data);
 
@@ -543,48 +543,48 @@ static void rcu_idle_timer_start(void)
     if (likely(!rdp->curlist))
         return;
 
-    set_timer(&rdp->idle_timer, NOW() + idle_timer_period);
-    rdp->idle_timer_active = true;
+    set_timer(&rdp->cb_timer, NOW() + cb_timer_period);
+    rdp->cb_timer_active = true;
 }
 
-static void rcu_idle_timer_stop(void)
+static void cb_timer_stop(void)
 {
     struct rcu_data *rdp = &this_cpu(rcu_data);
 
-    if (likely(!rdp->idle_timer_active))
+    if (likely(!rdp->cb_timer_active))
         return;
 
-    rdp->idle_timer_active = false;
+    rdp->cb_timer_active = false;
 
     /*
      * In general, as the CPU is becoming active again, we don't need the
-     * idle timer, and so we want to stop it.
+     * callback timer, and so we want to stop it.
      *
-     * However, in case we are here because idle_timer has (just) fired and
+     * However, in case we are here because cb_timer has (just) fired and
      * has woken up the CPU, we skip stop_timer() now. In fact, when a CPU
      * wakes up from idle, this code always runs before do_softirq() has the
      * chance to check and deal with TIMER_SOFTIRQ. And if we stop the timer
      * now, the TIMER_SOFTIRQ handler will see it as inactive, and will not
-     * call rcu_idle_timer_handler().
+     * call cb_timer_handler().
      *
      * Therefore, if we see that the timer is expired already, we leave it
      * alone. The TIMER_SOFTIRQ handler will then run the timer routine, and
      * deactivate it.
      */
-    if ( !timer_is_expired(&rdp->idle_timer) )
-        stop_timer(&rdp->idle_timer);
+    if ( !timer_is_expired(&rdp->cb_timer) )
+        stop_timer(&rdp->cb_timer);
 }
 
-static void rcu_idle_timer_handler(void* data)
+static void cb_timer_handler(void* data)
 {
-    perfc_incr(rcu_idle_timer);
+    perfc_incr(rcu_callback_timer);
 
     if ( !cpumask_empty(&rcu_ctrlblk.cpumask) )
-        idle_timer_period = min(idle_timer_period + IDLE_TIMER_PERIOD_INCR,
-                                IDLE_TIMER_PERIOD_MAX);
+        cb_timer_period = min(cb_timer_period + CB_TIMER_PERIOD_INCR,
+                                CB_TIMER_PERIOD_MAX);
     else
-        idle_timer_period = max(idle_timer_period - IDLE_TIMER_PERIOD_DECR,
-                                IDLE_TIMER_PERIOD_MIN);
+        cb_timer_period = max(cb_timer_period - CB_TIMER_PERIOD_DECR,
+                                CB_TIMER_PERIOD_MIN);
 }
 
 void rcu_check_callbacks(int cpu)
@@ -608,7 +608,7 @@ static void rcu_move_batch(struct rcu_data *this_rdp, struct rcu_head *list,
 static void rcu_offline_cpu(struct rcu_data *this_rdp,
                             struct rcu_ctrlblk *rcp, struct rcu_data *rdp)
 {
-    kill_timer(&rdp->idle_timer);
+    kill_timer(&rdp->cb_timer);
 
     /* If the cpu going offline owns the grace period we can block
      * indefinitely waiting for it, so flush it here.
@@ -638,7 +638,7 @@ static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp,
     rdp->qs_pending = 0;
     rdp->cpu = cpu;
     rdp->blimit = blimit;
-    init_timer(&rdp->idle_timer, rcu_idle_timer_handler, rdp, cpu);
+    init_timer(&rdp->cb_timer, cb_timer_handler, rdp, cpu);
 }
 
 static int cpu_callback(
@@ -667,25 +667,39 @@ static struct notifier_block cpu_nfb = {
     .notifier_call = cpu_callback
 };
 
+/*
+ * We're changing the name of the parameter, to better reflect the fact that
+ * the timer is used for callbacks in general, when the CPU is either idle
+ * or executing guest code. We still accept the old parameter but, if both
+ * are specified, the new one ("rcu-callback-timer-period-ms") has priority.
+ */
+#define CB_TIMER_PERIOD_DEFAULT_MS ( CB_TIMER_PERIOD_DEFAULT / MILLISECS(1) )
+static unsigned int __initdata cb_timer_period_ms = CB_TIMER_PERIOD_DEFAULT_MS;
+integer_param("rcu-callback-timer-period-ms", cb_timer_period_ms);
+
+static unsigned int __initdata idle_timer_period_ms = CB_TIMER_PERIOD_DEFAULT_MS;
+integer_param("rcu-idle-timer-period-ms", idle_timer_period_ms);
+
 void __init rcu_init(void)
 {
     void *cpu = (void *)(long)smp_processor_id();
-    static unsigned int __initdata idle_timer_period_ms =
-                                    IDLE_TIMER_PERIOD_DEFAULT / MILLISECS(1);
-    integer_param("rcu-idle-timer-period-ms", idle_timer_period_ms);
+
+    if (idle_timer_period_ms != CB_TIMER_PERIOD_DEFAULT_MS &&
+        cb_timer_period_ms == CB_TIMER_PERIOD_DEFAULT_MS)
+        cb_timer_period_ms = idle_timer_period_ms;
 
     /* We don't allow 0, or anything higher than IDLE_TIMER_PERIOD_MAX */
-    if ( idle_timer_period_ms == 0 ||
-         idle_timer_period_ms > IDLE_TIMER_PERIOD_MAX / MILLISECS(1) )
+    if ( cb_timer_period_ms == 0 ||
+         cb_timer_period_ms > CB_TIMER_PERIOD_MAX / MILLISECS(1) )
     {
-        idle_timer_period_ms = IDLE_TIMER_PERIOD_DEFAULT / MILLISECS(1);
-        printk("WARNING: rcu-idle-timer-period-ms outside of "
+        cb_timer_period_ms = CB_TIMER_PERIOD_DEFAULT / MILLISECS(1);
+        printk("WARNING: rcu-callback-timer-period-ms outside of "
                "(0,%"PRI_stime"]. Resetting it to %u.\n",
-               IDLE_TIMER_PERIOD_MAX / MILLISECS(1), idle_timer_period_ms);
+               CB_TIMER_PERIOD_MAX / MILLISECS(1), cb_timer_period_ms);
     }
-    idle_timer_period = MILLISECS(idle_timer_period_ms);
+    cb_timer_period = MILLISECS(cb_timer_period_ms);
 
-    cpumask_clear(&rcu_ctrlblk.idle_cpumask);
+    cpumask_clear(&rcu_ctrlblk.ignore_cpumask);
     cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu);
     register_cpu_notifier(&cpu_nfb);
     open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
@@ -697,8 +711,8 @@ void __init rcu_init(void)
  */
 void rcu_idle_enter(unsigned int cpu)
 {
-    ASSERT(!cpumask_test_cpu(cpu, &rcu_ctrlblk.idle_cpumask));
-    cpumask_set_cpu(cpu, &rcu_ctrlblk.idle_cpumask);
+    ASSERT(!cpumask_test_cpu(cpu, &rcu_ctrlblk.ignore_cpumask));
+    cpumask_set_cpu(cpu, &rcu_ctrlblk.ignore_cpumask);
     /*
      * If some other CPU is starting a new grace period, we'll notice that
      * by seeing a new value in rcp->cur (different than our quiescbatch).
@@ -709,12 +723,12 @@ void rcu_idle_enter(unsigned int cpu)
      */
     smp_mb();
 
-    rcu_idle_timer_start();
+    cb_timer_start();
 }
 
 void rcu_idle_exit(unsigned int cpu)
 {
-    rcu_idle_timer_stop();
-    ASSERT(cpumask_test_cpu(cpu, &rcu_ctrlblk.idle_cpumask));
-    cpumask_clear_cpu(cpu, &rcu_ctrlblk.idle_cpumask);
+    cb_timer_stop();
+    ASSERT(cpumask_test_cpu(cpu, &rcu_ctrlblk.ignore_cpumask));
+    cpumask_clear_cpu(cpu, &rcu_ctrlblk.ignore_cpumask);
 }
diff --git a/xen/include/xen/perfc_defn.h b/xen/include/xen/perfc_defn.h
index 08b182ccd9..d142534383 100644
--- a/xen/include/xen/perfc_defn.h
+++ b/xen/include/xen/perfc_defn.h
@@ -12,7 +12,7 @@ PERFCOUNTER(calls_from_multicall,       "calls from multicall")
 PERFCOUNTER(irqs,                   "#interrupts")
 PERFCOUNTER(ipis,                   "#IPIs")
 
-PERFCOUNTER(rcu_idle_timer,         "RCU: idle_timer")
+PERFCOUNTER(rcu_callback_timer,     "RCU: callback_timer")
 
 /* Generic scheduler counters (applicable to all schedulers) */
 PERFCOUNTER(sched_irq,              "sched: timer")

[-- Attachment #1.3: 2-rcu-idle-guest.patch --]
[-- Type: text/x-patch, Size: 10488 bytes --]

commit 770539f499e02010ffa26b99973a4120eba5827c
Author: Dario Faggioli <dfaggioli@suse.com>
Date:   Sat Jan 30 08:16:59 2021 +0000

    xen: deal with vCPUs that do not yield when idle
    
    Our RCU implementation needs that a CPU goes through Xen, from time to
    time, e.g., for a context switch, to properly mark the end of a grace
    period. This usually happens often enough, and CPUs that go idle and stay
    like that for a while are handled specially (so that they are recorded
    as quiescent and "leave" the grace period before starting idling).
    
    In principle, even a CPU that starts executing guest code may/should be
    marked as quiescent (it certainly can't be in the middle of a read side
    RCU critical section if it's leaving Xen and entering the guest!). This
    isn't done and in general does not cause problems. However, if the NULL
    scheduler is used and the guest is configured to not go back in Xen when
    its vCPUs become idle (e.g., with "vwfi=native" on ARM), grace periods
    may extend for a very long time and RCU callbacks be delayed to the
    point that, for instance, a domain is not properly destroyed.
    
    To fix that, we must start marking a CPU as quiescent as soon as it
    enters the guest (and, vice versa, register it back to the current grace
    period when it enters Xen). In order to do that, some changes to the API
    of rcu_idle_enter/exit were necessary (and the functions were renamed
    too).
    
    Note that, exactly like in the case where the CPU goes idle, we need to
    arm the callback timer when we enter guest context. In fact, if a CPU
    enters a guest with an RCU callback queued and then stays in that context
    for long enough, we still risk not executing the callback for long
    enough to cause problems.
    
    XXX ARM only for now.
    
    Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
    ---
    - Implemented for ARM only for now. Julien, is this where I should put
      the calls to rcu_quiet_enter/exit?
    - x86 people, do we have an equally handy place to do the same on
      our lovely arch? :-)

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index bdd3d3e5b5..90726afe15 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -47,7 +47,7 @@ static void do_idle(void)
 {
     unsigned int cpu = smp_processor_id();
 
-    rcu_idle_enter(cpu);
+    rcu_quiet_enter();
     /* rcu_idle_enter() can raise TIMER_SOFTIRQ. Process it now. */
     process_pending_softirqs();
 
@@ -59,7 +59,7 @@ static void do_idle(void)
     }
     local_irq_enable();
 
-    rcu_idle_exit(cpu);
+    rcu_quiet_exit();
 }
 
 void idle_loop(void)
diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
index 6fa135050b..806870a38f 100644
--- a/xen/arch/arm/traps.c
+++ b/xen/arch/arm/traps.c
@@ -2048,6 +2048,8 @@ void enter_hypervisor_from_guest(void)
 {
     struct vcpu *v = current;
 
+    rcu_quiet_exit();
+
     /*
      * If we pended a virtual abort, preserve it until it gets cleared.
      * See ARM ARM DDI 0487A.j D1.14.3 (Virtual Interrupts) for details,
@@ -2326,6 +2328,8 @@ static bool check_for_vcpu_work(void)
  */
 void leave_hypervisor_to_guest(void)
 {
+    rcu_quiet_enter();
+
     local_irq_disable();
 
     /*
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index c092086b33..473a89e212 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -714,8 +714,8 @@ static void acpi_processor_idle(void)
 
     cpufreq_dbs_timer_suspend();
 
-    rcu_idle_enter(cpu);
-    /* rcu_idle_enter() can raise TIMER_SOFTIRQ. Process it now. */
+    rcu_quiet_enter();
+    /* rcu_quiet_enter() can raise TIMER_SOFTIRQ. Process it now. */
     process_pending_softirqs();
 
     /*
@@ -727,7 +727,7 @@ static void acpi_processor_idle(void)
     if ( !cpu_is_haltable(cpu) )
     {
         local_irq_enable();
-        rcu_idle_exit(cpu);
+        rcu_quiet_exit();
         cpufreq_dbs_timer_resume();
         return;
     }
@@ -852,7 +852,7 @@ static void acpi_processor_idle(void)
         /* Now in C0 */
         power->last_state = &power->states[0];
         local_irq_enable();
-        rcu_idle_exit(cpu);
+        rcu_quiet_exit();
         cpufreq_dbs_timer_resume();
         return;
     }
@@ -860,7 +860,7 @@ static void acpi_processor_idle(void)
     /* Now in C0 */
     power->last_state = &power->states[0];
 
-    rcu_idle_exit(cpu);
+    rcu_quiet_exit();
     cpufreq_dbs_timer_resume();
 
     if ( cpuidle_current_governor->reflect )
diff --git a/xen/arch/x86/cpu/mwait-idle.c b/xen/arch/x86/cpu/mwait-idle.c
index f0c6ff9d52..dc12559396 100644
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -778,8 +778,8 @@ static void mwait_idle(void)
 
 	cpufreq_dbs_timer_suspend();
 
-	rcu_idle_enter(cpu);
-	/* rcu_idle_enter() can raise TIMER_SOFTIRQ. Process it now. */
+	rcu_quiet_enter();
+	/* rcu_quiet_enter() can raise TIMER_SOFTIRQ. Process it now. */
 	process_pending_softirqs();
 
 	/* Interrupts must be disabled for C2 and higher transitions. */
@@ -787,7 +787,7 @@ static void mwait_idle(void)
 
 	if (!cpu_is_haltable(cpu)) {
 		local_irq_enable();
-		rcu_idle_exit(cpu);
+		rcu_quiet_exit();
 		cpufreq_dbs_timer_resume();
 		return;
 	}
@@ -829,7 +829,7 @@ static void mwait_idle(void)
 	if (!(lapic_timer_reliable_states & (1 << cx->type)))
 		lapic_timer_on();
 
-	rcu_idle_exit(cpu);
+	rcu_quiet_exit();
 	cpufreq_dbs_timer_resume();
 
 	if ( cpuidle_current_governor->reflect )
diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
index e0bf842f13..54a292ae75 100644
--- a/xen/common/rcupdate.c
+++ b/xen/common/rcupdate.c
@@ -55,8 +55,8 @@ static struct rcu_ctrlblk {
     int  next_pending;  /* Is the next batch already waiting?         */
 
     spinlock_t  lock __cacheline_aligned;
-    cpumask_t   cpumask; /* CPUs that need to switch in order ...   */
-    cpumask_t   ignore_cpumask; /* ... unless they are already idle */
+    cpumask_t   cpumask; /* CPUs that need to switch in order...      */
+    cpumask_t   ignore_cpumask; /* ...unless already idle or in guest */
     /* for current batch to proceed.        */
 } __cacheline_aligned rcu_ctrlblk = {
     .cur = -300,
@@ -87,7 +87,7 @@ struct rcu_data {
     int cpu;
     long            last_rs_qlen;     /* qlen during the last resched */
 
-    /* 3) idle CPUs handling */
+    /* 3) idle (or in guest mode) CPUs handling */
     struct timer cb_timer;
     bool cb_timer_active;
 
@@ -112,6 +112,12 @@ struct rcu_data {
  * 3) it is stopped immediately, if the CPU wakes up from idle and
  *    resumes 'normal' execution.
  *
+ * Note also that the same happens if a CPU starts executing a guest that
+ * (almost) never comes back into the hypervisor. This may be the case if
+ * the guest uses "idle=poll" / "vwfi=native". Therefore, we need to handle
+ * guest entry events in the same way as the CPU going idle, i.e., consider
+ * it quiesced and arm the timer.
+ *
  * About how far in the future the timer should be programmed each time,
  * it's hard to tell (guess!!). Since this mimics Linux's periodic timer
  * tick, take values used there as an indication. In Linux 2.6.21, tick
@@ -359,9 +365,10 @@ static void rcu_start_batch(struct rcu_ctrlblk *rcp)
         * Make sure the increment of rcp->cur is visible so, even if a
         * CPU that is about to go idle, is captured inside rcp->cpumask,
         * rcu_pending() will return false, which then means cpu_quiet()
-        * will be invoked, before the CPU would actually enter idle.
+        * will be invoked, before the CPU would actually go idle (or
+	* enter a guest).
         *
-        * This barrier is paired with the one in rcu_idle_enter().
+        * This barrier is paired with the one in rcu_quiet_enter().
         */
         smp_mb();
         cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->ignore_cpumask);
@@ -531,14 +538,15 @@ int rcu_needs_cpu(int cpu)
  * periodically poke rcu_pedning(), so that it will invoke the callback
  * not too late after the end of the grace period.
  */
-static void cb_timer_start(void)
+static void cb_timer_start(unsigned int cpu)
 {
-    struct rcu_data *rdp = &this_cpu(rcu_data);
+    struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
 
     /*
      * Note that we don't check rcu_pending() here. In fact, we don't want
      * the timer armed on CPUs that are in the process of quiescing while
-     * going idle, unless they really are the ones with a queued callback.
+     * going idle or entering guest mode, unless they really have queued
+     * callbacks.
      */
     if (likely(!rdp->curlist))
         return;
@@ -547,9 +555,9 @@ static void cb_timer_start(void)
     rdp->cb_timer_active = true;
 }
 
-static void cb_timer_stop(void)
+static void cb_timer_stop(unsigned int cpu)
 {
-    struct rcu_data *rdp = &this_cpu(rcu_data);
+    struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
 
     if (likely(!rdp->cb_timer_active))
         return;
@@ -706,11 +714,14 @@ void __init rcu_init(void)
 }
 
 /*
- * The CPU is becoming idle, so no more read side critical
- * sections, and one more step toward grace period.
+ * The CPU is becoming about to either idle or enter the guest. In any of
+ * these cases, it can't have any outstanding read side critical sections
+ * so this is one step toward the end of the grace period.
  */
-void rcu_idle_enter(unsigned int cpu)
+void rcu_quiet_enter()
 {
+    unsigned int cpu = smp_processor_id();
+
     ASSERT(!cpumask_test_cpu(cpu, &rcu_ctrlblk.ignore_cpumask));
     cpumask_set_cpu(cpu, &rcu_ctrlblk.ignore_cpumask);
     /*
@@ -723,12 +734,14 @@ void rcu_idle_enter(unsigned int cpu)
      */
     smp_mb();
 
-    cb_timer_start();
+    cb_timer_start(cpu);
 }
 
-void rcu_idle_exit(unsigned int cpu)
+void rcu_quiet_exit()
 {
-    cb_timer_stop();
+    unsigned int cpu = smp_processor_id();
+
+    cb_timer_stop(cpu);
     ASSERT(cpumask_test_cpu(cpu, &rcu_ctrlblk.ignore_cpumask));
     cpumask_clear_cpu(cpu, &rcu_ctrlblk.ignore_cpumask);
 }
diff --git a/xen/include/xen/rcupdate.h b/xen/include/xen/rcupdate.h
index 6f2587058e..f378cc2aa2 100644
--- a/xen/include/xen/rcupdate.h
+++ b/xen/include/xen/rcupdate.h
@@ -177,7 +177,7 @@ void call_rcu(struct rcu_head *head,
 
 void rcu_barrier(void);
 
-void rcu_idle_enter(unsigned int cpu);
-void rcu_idle_exit(unsigned int cpu);
+void rcu_quiet_enter(void);
+void rcu_quiet_exit(void);
 
 #endif /* __XEN_RCUPDATE_H */

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-01-29 10:16                     ` Dario Faggioli
@ 2021-02-01  6:53                       ` Anders Törnqvist
  0 siblings, 0 replies; 31+ messages in thread
From: Anders Törnqvist @ 2021-02-01  6:53 UTC (permalink / raw)
  To: Dario Faggioli, Jürgen Groß,
	Julien Grall, xen-devel, Stefano Stabellini

On 1/29/21 11:16 AM, Dario Faggioli wrote:
> On Fri, 2021-01-29 at 09:18 +0100, Jürgen Groß wrote:
>> On 29.01.21 09:08, Anders Törnqvist wrote:
>>>> So using it has only downsides (and that's true in general, if
>>>> you
>>>> ask me, but particularly so if using NULL).
>>> Thanks for the feedback.
>>> I removed dom0_vcpus_pin. And, as you said, it seems to be
>>> unrelated to
>>> the problem we're discussing.
> Right. Don't put it back, and stay away from it, if you accept a bit of
> advice. :-)
>
>>> The system still behaves the same.
>>>
> Yeah, that was expected.
>
>>> When the dom0_vcpus_pin is removed. xl vcpu-list looks like this:
>>>
>>> Name                                ID  VCPU   CPU State Time(s)
>>> Affinity (Hard / Soft)
>>> Domain-0                             0     0    0   r--      29.4
>>> all / all
>>> Domain-0                             0     1    1   r--      28.7
>>> all / all
>>> Domain-0                             0     2    2   r--      28.7
>>> all / all
>>> Domain-0                             0     3    3   r--      28.6
>>> all / all
>>> Domain-0                             0     4    4   r--      28.6
>>> all / all
>>> mydomu                              1     0    5   r--      21.6 5
>>> / all
>>>
> Right, and it makes sense for it to look like this.
>
>>>   From this listing (with "all" as hard affinity for dom0) one might
>>> read it as dom0 not being pinned with hard affinity to any specific
>>> pCPUs at all, while mydomu is pinned to pCPU 5.
>>> Will dom0_max_vcpus=5 in this case guarantee that dom0 will only run
>>> on pCPUs 0-4, so that mydomu always has pCPU 5 for itself only?
>> No.
>>
> Well, yes... if you use the NULL scheduler. Which is in use here. :-)
>
> Basically, the NULL scheduler _always_ assigns one and only one vCPU to
> each pCPU. This happens at domain (well, at the vCPU) creation time.
> And it _never_ moves a vCPU away from the pCPU to which it has been
> assigned.
>
> And it also _never_ changes this vCPU-->pCPU assignment/relationship,
> unless some special event happens (such as the vCPU and/or the pCPU
> going offline, being removed from the cpupool, you changing the
> affinity [as I'll explain below], etc).
>
> This is the NULL scheduler's mission and only job, so it does that by
> default, _without_ any need for an affinity to be specified.
>
> So, how can affinity be useful in the NULL scheduler? Well, it's useful
> if you want to control and decide to what pCPU a certain vCPU should
> go.
>
> So, let's make an example. Let's say you are in this situation:
>
> Name                                ID  VCPU   CPU State Time(s) Affinity (Hard / Soft)
> Domain-0                             0     0    0   r--     29.4   all / all
> Domain-0                             0     1    1   r--     28.7   all / all
> Domain-0                             0     2    2   r--     28.7   all / all
> Domain-0                             0     3    3   r--     28.6   all / all
> Domain-0                             0     4    4   r--     28.6   all / all
>
> I.e., you have 6 CPUs, you have only dom0, dom0 has 5 vCPUs and you are
> not using dom0_vcpus_pin.
>
> The NULL scheduler has put d0v0 on pCPU 0. And d0v0 is the only vCPU
> that can run on pCPU 0, despite its affinities being "all"... because
> it's what the NULL scheduler does for you and it's the reason why one
> uses it! :-)
>
> Similarly, it has put d0v1 on pCPU 1, d0v2 on pCPU 2, d0v3 on pCPU 3
> and d0v4 on pCPU 4. And the "exclusivity guarantee" explained above
> for d0v0 and pCPU 0, applies to all these other vCPUs and pCPUs as
> well.
>
> With no affinity being specified, which vCPU is assigned to which pCPU
> is entirely under the NULL scheduler control. It has its heuristics
> inside, to try to do that in a smart way, but that's an
> internal/implementation detail and is not relevant here.
>
> If you now create a domU with 1 vCPU, that vCPU will be assigned to
> pCPU 5.
>
> Now, let's say that, for whatever reason, you absolutely want that d0v2
> to run on pCPU 5, instead of being assigned and run on pCPU 2 (which is
> what the NULL scheduler decided to pick for it). Well, what you do is
> use xl, set the affinity of d0v2 to pCPU 5, and you will get something
> like this as a result:
>
> Name                                ID  VCPU   CPU State Time(s) Affinity (Hard / Soft)
> Domain-0                             0     0    0   r--     29.4   all / all
> Domain-0                             0     1    1   r--     28.7   all / all
> Domain-0                             0     2    5   r--     28.7     5 / all
> Domain-0                             0     3    3   r--     28.6   all / all
> Domain-0                             0     4    4   r--     28.6   all / all
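>
> (For instance, with something like "xl vcpu-pin Domain-0 2 5"; the
> domain name/ID and numbers are just from this example, adjust them to
> your setup.)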
>
> So, affinity is indeed useful, even when using NULL, if you want to
> diverge from the default behavior and enact a certain policy, maybe due
> to the nature of your workload, the characteristics of your hardware,
> or whatever.
>
> It is not, however, necessary to set the affinity in order to:
>   - have a vCPU always stay on one --and always the same one too--
>     pCPU;
>   - prevent any other vCPU from ever running on that pCPU.
>
> That is guaranteed by the NULL scheduler itself. It just can't happen
> that it behaves otherwise, because the whole point of doing it was to
> make it simple (and fast :-)) *exactly* by not teaching it how to
> do such things. It can't do it, because the code for doing it is not
> there... by design! :-D
>
> And, BTW, if you now create a domU with 1 vCPU, that vCPU will be
> assigned to pCPU  2.
Wow, what a great explanation. Thank you very much!
>> What if I would like mydomu to be the only domain that uses pCPU 2?
> Set up a cpupool with that pcpu assigned to it and put your domain into
> that cpupool.
>
> Yes, with any other scheduler that is not NULL, that's the proper way
> of doing it.
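>
> A rough sketch of what that can look like (the pool name and scheduler
> are made up here, adjust to taste):
>
>   # mypool.cfg -- hypothetical cpupool config file
>   name  = "mypool"
>   sched = "credit2"
>   cpus  = ["2"]
>
>   xl cpupool-cpu-remove Pool-0 2
>   xl cpupool-create mypool.cfg
>   xl cpupool-migrate mydomu mypool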
>
> Regards




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-01-30 17:59                   ` Dario Faggioli
@ 2021-02-01  6:55                     ` Anders Törnqvist
  2021-02-02  7:59                     ` Julien Grall
  2021-02-15  7:15                     ` Anders Törnqvist
  2 siblings, 0 replies; 31+ messages in thread
From: Anders Törnqvist @ 2021-02-01  6:55 UTC (permalink / raw)
  To: Dario Faggioli, Julien Grall, xen-devel, Stefano Stabellini

On 1/30/21 6:59 PM, Dario Faggioli wrote:
> On Fri, 2021-01-29 at 09:08 +0100, Anders Törnqvist wrote:
>> On 1/26/21 11:31 PM, Dario Faggioli wrote:
>>> Thanks again for letting us see these logs.
>> Thanks for the attention to this :-)
>>
>> Any ideas for how to solve it?
>>
> So, you're up for testing patches, right?
Absolutely. I will apply them and be back with the results. :-)

>
> How about applying these two, and letting me know what happens? :-D
>
> They are on top of current staging. I can try to rebase on something
> else, if it's easier for you to test.
>
> Besides being attached, they're also available here:
>
> https://gitlab.com/xen-project/people/dfaggioli/xen/-/tree/rcu-quiet-fix
>
> I could not test them properly on ARM, as I don't have an ARM system
> handy, so everything is possible really... just let me know.
>
> It should at least build fine, AFAICT from here:
>
> https://gitlab.com/xen-project/people/dfaggioli/xen/-/pipelines/249101213
>
> Julien, back in:
>
>   https://lore.kernel.org/xen-devel/315740e1-3591-0e11-923a-718e06c36445@arm.com/
>
>
> you said I should hook in enter_hypervisor_head(),
> leave_hypervisor_tail(). Those functions are gone now and looking at
> how the code changed, this is where I figured I should put the calls
> (see the second patch). But feel free to educate me otherwise.
>
> For x86 people that are listening... Do we have, in our beloved arch,
> equally handy places (i.e., right before leaving Xen for a guest and
> right after entering Xen from one), preferably in a C file, and for
> all guests... like it seems to be the case on ARM?
>
> Regards




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-01-30 17:59                   ` Dario Faggioli
  2021-02-01  6:55                     ` Anders Törnqvist
@ 2021-02-02  7:59                     ` Julien Grall
  2021-02-02 15:03                       ` Dario Faggioli
  2021-02-15  7:15                     ` Anders Törnqvist
  2 siblings, 1 reply; 31+ messages in thread
From: Julien Grall @ 2021-02-02  7:59 UTC (permalink / raw)
  To: Dario Faggioli, Anders Törnqvist, xen-devel, Stefano Stabellini

Hi Dario,

On 30/01/2021 17:59, Dario Faggioli wrote:
> On Fri, 2021-01-29 at 09:08 +0100, Anders Törnqvist wrote:
>> On 1/26/21 11:31 PM, Dario Faggioli wrote:
>>> Thanks again for letting us see these logs.
>>
>> Thanks for the attention to this :-)
>>
>> Any ideas for how to solve it?
>>
> So, you're up for testing patches, right?
> 
> How about applying these two, and letting me know what happens? :-D
> 
> They are on top of current staging. I can try to rebase on something
> else, if it's easier for you to test.
> 
> Besides being attached, they're also available here:
> 
> https://gitlab.com/xen-project/people/dfaggioli/xen/-/tree/rcu-quiet-fix
> 
> I could not test them properly on ARM, as I don't have an ARM system
> handy, so everything is possible really... just let me know.
> 
> It should at least build fine, AFAICT from here:
> 
> https://gitlab.com/xen-project/people/dfaggioli/xen/-/pipelines/249101213
> 
> Julien, back in:
> 
>   https://lore.kernel.org/xen-devel/315740e1-3591-0e11-923a-718e06c36445@arm.com/
> 
> 
> you said I should hook in enter_hypervisor_head(),
> leave_hypervisor_tail(). Those functions are gone now and looking at
> how the code changed, this is where I figured I should put the calls
> (see the second patch). But feel free to educate me otherwise.

enter_hypervisor_from_guest() and leave_hypervisor_to_guest() are the 
new functions.

I have had a quick look at your patch. The RCU call in 
leave_hypervisor_to_guest() needs to be placed just after the last call 
to check_for_pcpu_work().

Otherwise, you may be preempted while the pCPU is still marked as quiet for RCU.

The placement in enter_hypervisor_from_guest() doesn't matter too much, 
although I would consider calling it as late as possible.
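
In other words, something along these lines (heavily simplified sketch
of the two paths in xen/arch/arm/traps.c, not the literal code):

    void leave_hypervisor_to_guest(void)
    {
        local_irq_disable();

        check_for_vcpu_work();
        check_for_pcpu_work();

        /*
         * Quiesce only after the last check_for_pcpu_work(): doing it
         * any earlier would let the CPU pick up and run more hypervisor
         * work while already being marked as quiet for RCU.
         */
        rcu_quiet_enter();

        /* ... sync state with the vGIC etc. and return to the guest ... */
    }

    void enter_hypervisor_from_guest(void)
    {
        /* ... save/sanitise the guest state ... */

        /* As late as is convenient on this path. */
        rcu_quiet_exit();
    }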

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-02-02  7:59                     ` Julien Grall
@ 2021-02-02 15:03                       ` Dario Faggioli
  2021-02-02 15:23                         ` Julien Grall
  0 siblings, 1 reply; 31+ messages in thread
From: Dario Faggioli @ 2021-02-02 15:03 UTC (permalink / raw)
  To: Julien Grall, Anders Törnqvist, xen-devel, Stefano Stabellini

[-- Attachment #1: Type: text/plain, Size: 861 bytes --]

On Tue, 2021-02-02 at 07:59 +0000, Julien Grall wrote:
> Hi Dario,
> 
Hi!

> I have had a quick look at your place. The RCU call in 
> leave_hypervisor_to_guest() needs to be placed just after the last
> call 
> to check_for_pcpu_work().
> 
> Otherwise, you may be preempted and keep the RCU quiet.
> 
Ok, makes sense. I'll move it.

> The placement in enter_hypervisor_from_guest() doesn't matter too
> much, 
> although I would consider to call it as a late as possible.
> 
Mmmm... Can I ask why? In fact, I would have said "as soon as
possible".

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-02-02 15:03                       ` Dario Faggioli
@ 2021-02-02 15:23                         ` Julien Grall
  2021-02-03  7:31                           ` Dario Faggioli
  0 siblings, 1 reply; 31+ messages in thread
From: Julien Grall @ 2021-02-02 15:23 UTC (permalink / raw)
  To: Dario Faggioli, Anders Törnqvist, xen-devel, Stefano Stabellini
  Cc: Jan Beulich, Andrew Cooper, Juergen Gross

(Adding Andrew, Jan, Juergen for visibility)

Hi Dario,

On 02/02/2021 15:03, Dario Faggioli wrote:
> On Tue, 2021-02-02 at 07:59 +0000, Julien Grall wrote:
>> Hi Dario,
>>
>> I have had a quick look at your place. The RCU call in
>> leave_hypervisor_to_guest() needs to be placed just after the last
>> call
>> to check_for_pcpu_work().
>>
>> Otherwise, you may be preempted and keep the RCU quiet.
>>
> Ok, makes sense. I'll move it.
> 
>> The placement in enter_hypervisor_from_guest() doesn't matter too
>> much,
>> although I would consider to call it as a late as possible.
>>
> Mmmm... Can I ask why? In fact, I would have said "as soon as
> possible".

Because those functions only access data for the current vCPU/domain. 
This is already protected by the fact that the domain is running.

By leaving the "quiesce" mode later, you give an opportunity to the RCU 
to release memory earlier.

In reality, it is probably still too early as a pCPU can be considered 
quiesced until a call to rcu_lock*() (such as rcu_lock_domain()).

But this would require some investigation to check if we effectively 
protect all the regions with the RCU helpers. This is likely too 
complicated for 4.15.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-02-02 15:23                         ` Julien Grall
@ 2021-02-03  7:31                           ` Dario Faggioli
  2021-02-03  9:19                             ` Julien Grall
  0 siblings, 1 reply; 31+ messages in thread
From: Dario Faggioli @ 2021-02-03  7:31 UTC (permalink / raw)
  To: Julien Grall, Anders Törnqvist, xen-devel, Stefano Stabellini
  Cc: Jan Beulich, Andrew Cooper, Juergen Gross

[-- Attachment #1: Type: text/plain, Size: 2471 bytes --]

Hi again, 

On Tue, 2021-02-02 at 15:23 +0000, Julien Grall wrote:
> (Adding Andrew, Jan, Juergen for visibility)
> 
Thanks! :-)

> On 02/02/2021 15:03, Dario Faggioli wrote:
> > On Tue, 2021-02-02 at 07:59 +0000, Julien Grall wrote:
> > > The placement in enter_hypervisor_from_guest() doesn't matter too
> > > much,
> > > although I would consider to call it as a late as possible.
> > > 
> > Mmmm... Can I ask why? In fact, I would have said "as soon as
> > possible".
> 
> Because those functions only access data for the current vCPU/domain.
> This is already protected by the fact that the domain is running.
> 
Mmm.. ok, yes, I think it makes sense.

> By leaving the "quiesce" mode later, you give an opportunity to the
> RCU 
> to release memory earlier.
> 
Yeah. What I wanted to be sure of is that we put the CPU "back in the
race" :-) before any current or future use of RCUs.

> In reality, it is probably still too early as a pCPU can be
> considered 
> quiesced until a call to rcu_lock*() (such rcu_lock_domain()).
> 
Well, yes, in theory, we could track down which is the first RCU read
side crit. section on this path, and put the call right before that (if
I understood what you mean).

To me, however, this indeed looks too complex and difficult to
maintain, not only for 4.15 but in general. E.g., suppose we find such
a use of RCUs in function foo() called by bar() called by
enter_hypervisor_from_guest().

If someone at some point wants to use RCUs in bar(), how does she know
that she should also move the call to rcu_quiet_enter() from foo() to
there?

So, yes, I'll move it a little down, but still within
enter_hypervisor_from_guest().

In the meantime, I had a quick chat with Juergen about x86. In fact, I
had a look and could not find a place to put the
rcu_quiet_{exit,enter}() calls that is as convenient as what you have
here on ARM. I.e., two nice C functions that we traverse for all kinds
of guests, for HVM and SVM, etc.

Actually, I was quite skeptical about it but, you know, one can hope!
Juergen confirmed that there's no such thing, so I'll look at the
various entry.S files for the proper spots.

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-02-03  7:31                           ` Dario Faggioli
@ 2021-02-03  9:19                             ` Julien Grall
  2021-02-03 11:00                               ` Jürgen Groß
  0 siblings, 1 reply; 31+ messages in thread
From: Julien Grall @ 2021-02-03  9:19 UTC (permalink / raw)
  To: Dario Faggioli, Anders Törnqvist, xen-devel, Stefano Stabellini
  Cc: Jan Beulich, Andrew Cooper, Juergen Gross

Hi,

On 03/02/2021 07:31, Dario Faggioli wrote:
> On Tue, 2021-02-02 at 15:23 +0000, Julien Grall wrote:
>> In reality, it is probably still too early as a pCPU can be
>> considered
>> quiesced until a call to rcu_lock*() (such rcu_lock_domain()).
>>
> Well, yes, in theory, we could track down which is the first RCU read
> side crit. section on this path, and put the call right before that (if
> I understood what you mean).

Oh, that's not what I meant. This will indeed be far more complex than I 
originally had in mind.

AFAIU, the RCU uses critical sections to protect data. So "entering" a
critical section could be used as "the pCPU is not quiesced" and
"exiting" could be used as "the pCPU is quiesced".

The concern with my approach is that we would need to make sure that Xen
correctly uses the rcu helpers. I know Juergen worked on that recently,
but I don't know whether this is fully complete.
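
Roughly, the idea would be something like this (a hypothetical sketch
only, with a made-up per-CPU nesting counter; the real read-side
helpers, e.g. rcu_lock_domain(), obviously look different):

    static DEFINE_PER_CPU(unsigned int, rcu_crit_nesting);

    /* What every read-side "lock" helper would do on entry. */
    static void rcu_crit_enter(void)
    {
        if ( this_cpu(rcu_crit_nesting)++ == 0 )
            rcu_quiet_exit();   /* first read side: pCPU not quiesced */
    }

    /* What every read-side "unlock" helper would do on exit. */
    static void rcu_crit_exit(void)
    {
        if ( --this_cpu(rcu_crit_nesting) == 0 )
            rcu_quiet_enter();  /* no outstanding read sides: quiesced */
    }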

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-02-03  9:19                             ` Julien Grall
@ 2021-02-03 11:00                               ` Jürgen Groß
  2021-02-03 11:20                                 ` Julien Grall
  0 siblings, 1 reply; 31+ messages in thread
From: Jürgen Groß @ 2021-02-03 11:00 UTC (permalink / raw)
  To: Julien Grall, Dario Faggioli, Anders Törnqvist, xen-devel,
	Stefano Stabellini
  Cc: Jan Beulich, Andrew Cooper


[-- Attachment #1.1.1: Type: text/plain, Size: 1312 bytes --]

On 03.02.21 10:19, Julien Grall wrote:
> Hi,
> 
> On 03/02/2021 07:31, Dario Faggioli wrote:
>> On Tue, 2021-02-02 at 15:23 +0000, Julien Grall wrote:
>>> In reality, it is probably still too early as a pCPU can be
>>> considered
>>> quiesced until a call to rcu_lock*() (such rcu_lock_domain()).
>>>
>> Well, yes, in theory, we could track down which is the first RCU read
>> side crit. section on this path, and put the call right before that (if
>> I understood what you mean).
> 
> Oh, that's not what I meant. This will indeed be far more complex than I 
> originally had in mind.
> 
> AFAIU, the RCU uses critical section to protect data. So the "entering" 
> could be used as "the pCPU is not quiesced" and "exiting" could be used 
> as "the pCPU is quiesced".
> 
> The concern with my approach is we would need to make sure that Xen 
> correctly uses the rcu helpers. I know Juergen worked on that recently, 
> but I don't know whether this is fully complete.

I think it is complete, but I can't be sure, of course.

One bit missing (for catching some wrong uses of the helpers) is this
patch:

https://lists.xen.org/archives/html/xen-devel/2020-03/msg01759.html

I don't remember why it hasn't been taken, but I think there was a
specific reason for that.


Juergen

[-- Attachment #1.1.2: OpenPGP_0xB0DE9DD628BF132F.asc --]
[-- Type: application/pgp-keys, Size: 3135 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-02-03 11:00                               ` Jürgen Groß
@ 2021-02-03 11:20                                 ` Julien Grall
  2021-02-03 12:02                                   ` Jürgen Groß
  0 siblings, 1 reply; 31+ messages in thread
From: Julien Grall @ 2021-02-03 11:20 UTC (permalink / raw)
  To: Jürgen Groß,
	Dario Faggioli, Anders Törnqvist, xen-devel,
	Stefano Stabellini
  Cc: Jan Beulich, Andrew Cooper

Hi Juergen,

On 03/02/2021 11:00, Jürgen Groß wrote:
> On 03.02.21 10:19, Julien Grall wrote:
>> Hi,
>>
>> On 03/02/2021 07:31, Dario Faggioli wrote:
>>> On Tue, 2021-02-02 at 15:23 +0000, Julien Grall wrote:
>>>> In reality, it is probably still too early as a pCPU can be
>>>> considered
>>>> quiesced until a call to rcu_lock*() (such rcu_lock_domain()).
>>>>
>>> Well, yes, in theory, we could track down which is the first RCU read
>>> side crit. section on this path, and put the call right before that (if
>>> I understood what you mean).
>>
>> Oh, that's not what I meant. This will indeed be far more complex than 
>> I originally had in mind.
>>
>> AFAIU, the RCU uses critical section to protect data. So the 
>> "entering" could be used as "the pCPU is not quiesced" and "exiting" 
>> could be used as "the pCPU is quiesced".
>>
>> The concern with my approach is we would need to make sure that Xen 
>> correctly uses the rcu helpers. I know Juergen worked on that 
>> recently, but I don't know whether this is fully complete.
> 
> I think it is complete, but I can't be sure, of course.
> 
> One bit missing (for catching some wrong uses of the helpers) is this
> patch:
> 
> https://lists.xen.org/archives/html/xen-devel/2020-03/msg01759.html
> 
> I don't remember why it hasn't been taken, but I think there was a
> specific reason for that.

Looking at v8, the patch is suitably reviewed by Jan. So I am a bit
puzzled as to why this wasn't committed... I had to go back to v6 to
notice the following message:

"albeit to be honest I'm not fully convinced we need to go this far."

Was the implication that his Reviewed-by was conditional on someone else
answering the e-mail?

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-02-03 11:20                                 ` Julien Grall
@ 2021-02-03 12:02                                   ` Jürgen Groß
  0 siblings, 0 replies; 31+ messages in thread
From: Jürgen Groß @ 2021-02-03 12:02 UTC (permalink / raw)
  To: Julien Grall, Dario Faggioli, Anders Törnqvist, xen-devel,
	Stefano Stabellini
  Cc: Jan Beulich, Andrew Cooper


[-- Attachment #1.1.1: Type: text/plain, Size: 2611 bytes --]

On 03.02.21 12:20, Julien Grall wrote:
> Hi Juergen,
> 
> On 03/02/2021 11:00, Jürgen Groß wrote:
>> On 03.02.21 10:19, Julien Grall wrote:
>>> Hi,
>>>
>>> On 03/02/2021 07:31, Dario Faggioli wrote:
>>>> On Tue, 2021-02-02 at 15:23 +0000, Julien Grall wrote:
>>>>> In reality, it is probably still too early as a pCPU can be
>>>>> considered
>>>>> quiesced until a call to rcu_lock*() (such rcu_lock_domain()).
>>>>>
>>>> Well, yes, in theory, we could track down which is the first RCU read
>>>> side crit. section on this path, and put the call right before that (if
>>>> I understood what you mean).
>>>
>>> Oh, that's not what I meant. This will indeed be far more complex 
>>> than I originally had in mind.
>>>
>>> AFAIU, the RCU uses critical section to protect data. So the 
>>> "entering" could be used as "the pCPU is not quiesced" and "exiting" 
>>> could be used as "the pCPU is quiesced".
>>>
>>> The concern with my approach is we would need to make sure that Xen 
>>> correctly uses the rcu helpers. I know Juergen worked on that 
>>> recently, but I don't know whether this is fully complete.
>>
>> I think it is complete, but I can't be sure, of course.
>>
>> One bit missing (for catching some wrong uses of the helpers) is this
>> patch:
>>
>> https://lists.xen.org/archives/html/xen-devel/2020-03/msg01759.html
>>
>> I don't remember why it hasn't been taken, but I think there was a
>> specific reason for that.
> 
> Looking at v8 and the patch is suitably reviewed by Jan. So I am a bit 
> puzzled to why this wasn't committed... I had to go to v6 and notice the 
> following message:
> 
> "albeit to be honest I'm not fully convinced we need to go this far."
> 
> Was the implication that his reviewed-by was conditional to someone else 
> answering to the e-mail?

I have no record of that being the case.

Patches 1-3 of that series were needed for getting rid of
stop_machine_run() in rcu handling and to fix other problems. Patch 4
was adding some additional ASSERT()s for making sure no potential
deadlocks due to wrong rcu usage could creep in again.

Patch 5 was more a "nice to have" addition in order to avoid any
wrong usage of rcu which should have no real negative impact on the
system stability.

So I believe Jan as the committer didn't want to commit it himself, but
was fine with the overall idea and implementation.

I still think for code sanity it would be nice, but I was rather busy
with Xenstore and event channel security work at that time, so I didn't
urge anyone to take this patch.


Juergen

[-- Attachment #1.1.2: OpenPGP_0xB0DE9DD628BF132F.asc --]
[-- Type: application/pgp-keys, Size: 3135 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Null scheduler and vwfi native problem
  2021-01-30 17:59                   ` Dario Faggioli
  2021-02-01  6:55                     ` Anders Törnqvist
  2021-02-02  7:59                     ` Julien Grall
@ 2021-02-15  7:15                     ` Anders Törnqvist
  2 siblings, 0 replies; 31+ messages in thread
From: Anders Törnqvist @ 2021-02-15  7:15 UTC (permalink / raw)
  To: Dario Faggioli, Julien Grall, xen-devel, Stefano Stabellini

[-- Attachment #1: Type: text/plain, Size: 2408 bytes --]

On 1/30/21 6:59 PM, Dario Faggioli wrote:
> On Fri, 2021-01-29 at 09:08 +0100, Anders Törnqvist wrote:
>> On 1/26/21 11:31 PM, Dario Faggioli wrote:
>>> Thanks again for letting us see these logs.
>> Thanks for the attention to this :-)
>>
>> Any ideas for how to solve it?
>>
> So, you're up for testing patches, right?
>
> How about applying these two, and letting me know what happens? :-D

Great work guys!

Hi. I have now had the time to test the patches.

They could not be applied cleanly to the code version I am
using, which is commit b64b8df622963accf85b227e468fe12b2d56c128 from
https://source.codeaurora.org/external/imx/imx-xen.

I did some editing to get them into my code. I think I should have 
removed some sched_tick_suspend/sched_tick_resume calls also.
See the attached patches for what I have applied to the code.

Anyway, after applying the patches, including the original
rcu-quiesc-patch.patch, destroying the domu seems to work.
I have rebooted, done destroy/create only, and used the Xen watchdog to
reboot the domu, about 20 times in total, and so far the domu has always
been destroyed cleanly and a new instance of it could be started.

So it looks promising although my edited patches probably need some fixing.


>
> They are on top of current staging. I can try to rebase on something
> else, if it's easier for you to test.
>
> Besides being attached, they're also available here:
>
> https://gitlab.com/xen-project/people/dfaggioli/xen/-/tree/rcu-quiet-fix
>
> I could not test them properly on ARM, as I don't have an ARM system
> handy, so everything is possible really... just let me know.
>
> It should at least build fine, AFAICT from here:
>
> https://gitlab.com/xen-project/people/dfaggioli/xen/-/pipelines/249101213
>
> Julien, back in:
>
>   https://lore.kernel.org/xen-devel/315740e1-3591-0e11-923a-718e06c36445@arm.com/
>
>
> you said I should hook in enter_hypervisor_head(),
> leave_hypervisor_tail(). Those functions are gone now and looking at
> how the code changed, this is where I figured I should put the calls
> (see the second patch). But feel free to educate me otherwise.
>
> For x86 people that are listening... Do we have, in our beloved arch,
> equally handy places (i.e., right before leaving Xen for a guest and
> right after entering Xen from one), preferably in a C file, and for
> all guests... like it seems to be the case on ARM?
>
> Regards



[-- Attachment #2: 1_1-xen-rcu-rename-idle-ignore.patch --]
[-- Type: text/x-patch, Size: 10406 bytes --]

diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
index d6dc4b48db..42ab9dbbd6 100644
--- a/xen/common/rcupdate.c
+++ b/xen/common/rcupdate.c
@@ -52,8 +52,8 @@ static struct rcu_ctrlblk {
     int  next_pending;  /* Is the next batch already waiting?         */
 
     spinlock_t  lock __cacheline_aligned;
-    cpumask_t   cpumask; /* CPUs that need to switch in order ... */
-    cpumask_t   idle_cpumask; /* ... unless they are already idle */
+    cpumask_t   cpumask; /* CPUs that need to switch in order ...   */
+    cpumask_t   ignore_cpumask; /* ... unless they are already idle */
     /* for current batch to proceed.        */
 } __cacheline_aligned rcu_ctrlblk = {
     .cur = -300,
@@ -86,8 +86,8 @@ struct rcu_data {
     long            last_rs_qlen;     /* qlen during the last resched */
 
     /* 3) idle CPUs handling */
-    struct timer idle_timer;
-    bool idle_timer_active;
+    struct timer cb_timer;
+    bool cb_timer_active;
 };
 
 /*
@@ -116,22 +116,22 @@ struct rcu_data {
  * CPU that is going idle. The user can change this, via a boot time
  * parameter, but only up to 100ms.
  */
-#define IDLE_TIMER_PERIOD_MAX     MILLISECS(100)
-#define IDLE_TIMER_PERIOD_DEFAULT MILLISECS(10)
-#define IDLE_TIMER_PERIOD_MIN     MICROSECS(100)
+#define CB_TIMER_PERIOD_MAX     MILLISECS(100)
+#define CB_TIMER_PERIOD_DEFAULT MILLISECS(10)
+#define CB_TIMER_PERIOD_MIN     MICROSECS(100)
 
-static s_time_t __read_mostly idle_timer_period;
+static s_time_t __read_mostly cb_timer_period;
 
 /*
- * Increment and decrement values for the idle timer handler. The algorithm
+ * Increment and decrement values for the callback timer handler. The algorithm
  * works as follows:
  * - if the timer actually fires, and it finds out that the grace period isn't
- *   over yet, we add IDLE_TIMER_PERIOD_INCR to the timer's period;
+ *   over yet, we add CB_TIMER_PERIOD_INCR to the timer's period;
  * - if the timer actually fires and it finds the grace period over, we
  *   subtract IDLE_TIMER_PERIOD_DECR from the timer's period.
  */
-#define IDLE_TIMER_PERIOD_INCR    MILLISECS(10)
-#define IDLE_TIMER_PERIOD_DECR    MICROSECS(100)
+#define CB_TIMER_PERIOD_INCR    MILLISECS(10)
+#define CB_TIMER_PERIOD_DECR    MICROSECS(100)
 
 static DEFINE_PER_CPU(struct rcu_data, rcu_data);
 
@@ -309,7 +309,7 @@ static void rcu_start_batch(struct rcu_ctrlblk *rcp)
         * This barrier is paired with the one in rcu_idle_enter().
         */
         smp_mb();
-        cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->idle_cpumask);
+        cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->ignore_cpumask);
     }
 }
 
@@ -455,7 +455,7 @@ int rcu_needs_cpu(int cpu)
 {
     struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
 
-    return (rdp->curlist && !rdp->idle_timer_active) || rcu_pending(cpu);
+    return (rdp->curlist && !rdp->cb_timer_active) || rcu_pending(cpu);
 }
 
 /*
@@ -463,7 +463,7 @@ int rcu_needs_cpu(int cpu)
  * periodically poke rcu_pedning(), so that it will invoke the callback
  * not too late after the end of the grace period.
  */
-void rcu_idle_timer_start()
+static void cb_timer_start(void)
 {
     struct rcu_data *rdp = &this_cpu(rcu_data);
 
@@ -475,48 +475,48 @@ void rcu_idle_timer_start()
     if (likely(!rdp->curlist))
         return;
 
-    set_timer(&rdp->idle_timer, NOW() + idle_timer_period);
-    rdp->idle_timer_active = true;
+    set_timer(&rdp->cb_timer, NOW() + cb_timer_period);
+    rdp->cb_timer_active = true;
 }
 
-void rcu_idle_timer_stop()
+static void cb_timer_stop(void)
 {
     struct rcu_data *rdp = &this_cpu(rcu_data);
 
-    if (likely(!rdp->idle_timer_active))
+    if (likely(!rdp->cb_timer_active))
         return;
 
-    rdp->idle_timer_active = false;
+    rdp->cb_timer_active = false;
 
     /*
      * In general, as the CPU is becoming active again, we don't need the
-     * idle timer, and so we want to stop it.
+     * callback timer, and so we want to stop it.
      *
-     * However, in case we are here because idle_timer has (just) fired and
+     * However, in case we are here because cb_timer has (just) fired and
      * has woken up the CPU, we skip stop_timer() now. In fact, when a CPU
      * wakes up from idle, this code always runs before do_softirq() has the
      * chance to check and deal with TIMER_SOFTIRQ. And if we stop the timer
      * now, the TIMER_SOFTIRQ handler will see it as inactive, and will not
-     * call rcu_idle_timer_handler().
+     * call cb_timer_handler().
      *
      * Therefore, if we see that the timer is expired already, we leave it
      * alone. The TIMER_SOFTIRQ handler will then run the timer routine, and
      * deactivate it.
      */
-    if ( !timer_is_expired(&rdp->idle_timer) )
-        stop_timer(&rdp->idle_timer);
+    if ( !timer_is_expired(&rdp->cb_timer) )
+        stop_timer(&rdp->cb_timer);
 }
 
-static void rcu_idle_timer_handler(void* data)
+static void cb_timer_handler(void* data)
 {
-    perfc_incr(rcu_idle_timer);
+    perfc_incr(rcu_callback_timer);
 
     if ( !cpumask_empty(&rcu_ctrlblk.cpumask) )
-        idle_timer_period = min(idle_timer_period + IDLE_TIMER_PERIOD_INCR,
-                                IDLE_TIMER_PERIOD_MAX);
+        cb_timer_period = min(cb_timer_period + CB_TIMER_PERIOD_INCR,
+                                CB_TIMER_PERIOD_MAX);
     else
-        idle_timer_period = max(idle_timer_period - IDLE_TIMER_PERIOD_DECR,
-                                IDLE_TIMER_PERIOD_MIN);
+        cb_timer_period = max(cb_timer_period - CB_TIMER_PERIOD_DECR,
+                                CB_TIMER_PERIOD_MIN);
 }
 
 void rcu_check_callbacks(int cpu)
@@ -537,7 +537,7 @@ static void rcu_move_batch(struct rcu_data *this_rdp, struct rcu_head *list,
 static void rcu_offline_cpu(struct rcu_data *this_rdp,
                             struct rcu_ctrlblk *rcp, struct rcu_data *rdp)
 {
-    kill_timer(&rdp->idle_timer);
+    kill_timer(&rdp->cb_timer);
 
     /* If the cpu going offline owns the grace period we can block
      * indefinitely waiting for it, so flush it here.
@@ -567,7 +567,7 @@ static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp,
     rdp->qs_pending = 0;
     rdp->cpu = cpu;
     rdp->blimit = blimit;
-    init_timer(&rdp->idle_timer, rcu_idle_timer_handler, rdp, cpu);
+    init_timer(&rdp->cb_timer, cb_timer_handler, rdp, cpu);
 }
 
 static int cpu_callback(
@@ -596,25 +596,39 @@ static struct notifier_block cpu_nfb = {
     .notifier_call = cpu_callback
 };
 
+/*
+ * We're changing the name of the parameter, to better reflect the fact that
+ * the timer is used for callbacks in general, when the CPU is either idle
+ * or executing guest code. We still accept the old parameter but, if both
+ * are specified, the new one ("rcu-callback-timer-period-ms") has priority.
+ */
+#define CB_TIMER_PERIOD_DEFAULT_MS ( CB_TIMER_PERIOD_DEFAULT / MILLISECS(1) )
+static unsigned int __initdata cb_timer_period_ms = CB_TIMER_PERIOD_DEFAULT_MS;
+integer_param("rcu-callback-timer-period-ms", cb_timer_period_ms);
+
+static unsigned int __initdata idle_timer_period_ms = CB_TIMER_PERIOD_DEFAULT_MS;
+integer_param("rcu-idle-timer-period-ms", idle_timer_period_ms);
+
 void __init rcu_init(void)
 {
     void *cpu = (void *)(long)smp_processor_id();
-    static unsigned int __initdata idle_timer_period_ms =
-                                    IDLE_TIMER_PERIOD_DEFAULT / MILLISECS(1);
-    integer_param("rcu-idle-timer-period-ms", idle_timer_period_ms);
+
+    if (idle_timer_period_ms != CB_TIMER_PERIOD_DEFAULT_MS &&
+        cb_timer_period_ms == CB_TIMER_PERIOD_DEFAULT_MS)
+        cb_timer_period_ms = idle_timer_period_ms;
 
     /* We don't allow 0, or anything higher than IDLE_TIMER_PERIOD_MAX */
-    if ( idle_timer_period_ms == 0 ||
-         idle_timer_period_ms > IDLE_TIMER_PERIOD_MAX / MILLISECS(1) )
+    if ( cb_timer_period_ms == 0 ||
+         cb_timer_period_ms > CB_TIMER_PERIOD_MAX / MILLISECS(1) )
     {
-        idle_timer_period_ms = IDLE_TIMER_PERIOD_DEFAULT / MILLISECS(1);
-        printk("WARNING: rcu-idle-timer-period-ms outside of "
+        cb_timer_period_ms = CB_TIMER_PERIOD_DEFAULT / MILLISECS(1);
+        printk("WARNING: rcu-callback-timer-period-ms outside of "
                "(0,%"PRI_stime"]. Resetting it to %u.\n",
-               IDLE_TIMER_PERIOD_MAX / MILLISECS(1), idle_timer_period_ms);
+               CB_TIMER_PERIOD_MAX / MILLISECS(1), cb_timer_period_ms);
     }
-    idle_timer_period = MILLISECS(idle_timer_period_ms);
+    cb_timer_period = MILLISECS(cb_timer_period_ms);
 
-    cpumask_clear(&rcu_ctrlblk.idle_cpumask);
+    cpumask_clear(&rcu_ctrlblk.ignore_cpumask);
     cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu);
     register_cpu_notifier(&cpu_nfb);
     open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
@@ -626,8 +640,8 @@ void __init rcu_init(void)
  */
 void rcu_idle_enter(unsigned int cpu)
 {
-    ASSERT(!cpumask_test_cpu(cpu, &rcu_ctrlblk.idle_cpumask));
-    cpumask_set_cpu(cpu, &rcu_ctrlblk.idle_cpumask);
+    ASSERT(!cpumask_test_cpu(cpu, &rcu_ctrlblk.ignore_cpumask));
+    cpumask_set_cpu(cpu, &rcu_ctrlblk.ignore_cpumask);
     /*
      * If some other CPU is starting a new grace period, we'll notice that
      * by seeing a new value in rcp->cur (different than our quiescbatch).
@@ -637,10 +651,12 @@ void rcu_idle_enter(unsigned int cpu)
      * Se the comment before cpumask_andnot() in  rcu_start_batch().
      */
     smp_mb();
+    cb_timer_start();
 }
 
 void rcu_idle_exit(unsigned int cpu)
 {
-    ASSERT(cpumask_test_cpu(cpu, &rcu_ctrlblk.idle_cpumask));
-    cpumask_clear_cpu(cpu, &rcu_ctrlblk.idle_cpumask);
+    cb_timer_stop();
+    ASSERT(cpumask_test_cpu(cpu, &rcu_ctrlblk.ignore_cpumask));
+    cpumask_clear_cpu(cpu, &rcu_ctrlblk.ignore_cpumask);
 }
diff --git a/xen/include/xen/perfc_defn.h b/xen/include/xen/perfc_defn.h
index 08b182ccd9..d142534383 100644
--- a/xen/include/xen/perfc_defn.h
+++ b/xen/include/xen/perfc_defn.h
@@ -12,7 +12,7 @@ PERFCOUNTER(calls_from_multicall,       "calls from multicall")
 PERFCOUNTER(irqs,                   "#interrupts")
 PERFCOUNTER(ipis,                   "#IPIs")
 
-PERFCOUNTER(rcu_idle_timer,         "RCU: idle_timer")
+PERFCOUNTER(rcu_callback_timer,     "RCU: callback_timer")
 
 /* Generic scheduler counters (applicable to all schedulers) */
 PERFCOUNTER(sched_irq,              "sched: timer")

[-- Attachment #3: 2_1-rcu-idle-guest.patch --]
[-- Type: text/x-patch, Size: 8321 bytes --]

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index a9ca09acb2..e4439b2397 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -46,6 +46,8 @@ static void do_idle(void)
 {
     unsigned int cpu = smp_processor_id();
 
+    rcu_quiet_enter();
+
     sched_tick_suspend();
     /* sched_tick_suspend() can raise TIMER_SOFTIRQ. Process it now. */
     process_pending_softirqs();
@@ -59,6 +61,8 @@ static void do_idle(void)
     local_irq_enable();
 
     sched_tick_resume();
+
+    rcu_quiet_exit();
 }
 
 void idle_loop(void)
diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
index 1d2b762e22..5158a03746 100644
--- a/xen/arch/arm/traps.c
+++ b/xen/arch/arm/traps.c
@@ -2007,6 +2007,8 @@ void enter_hypervisor_from_guest(void)
 {
     struct vcpu *v = current;
 
+    rcu_quiet_exit();
+
     /*
      * If we pended a virtual abort, preserve it until it gets cleared.
      * See ARM ARM DDI 0487A.j D1.14.3 (Virtual Interrupts) for details,
@@ -2264,6 +2266,8 @@ static void check_for_vcpu_work(void)
  */
 void leave_hypervisor_to_guest(void)
 {
+    rcu_quiet_enter();
+
     local_irq_disable();
 
     check_for_vcpu_work();
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 836f524ef4..3d8dcec143 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -647,7 +647,8 @@ static void acpi_processor_idle(void)
     cpufreq_dbs_timer_suspend();
 
     sched_tick_suspend();
-    /* sched_tick_suspend() can raise TIMER_SOFTIRQ. Process it now. */
+    rcu_quiet_enter();
+    /* rcu_quiet_enter() can raise TIMER_SOFTIRQ. Process it now. */
     process_pending_softirqs();
 
     /*
@@ -660,6 +661,7 @@ static void acpi_processor_idle(void)
     {
         local_irq_enable();
         sched_tick_resume();
+        rcu_quiet_exit();
         cpufreq_dbs_timer_resume();
         return;
     }
@@ -785,6 +787,7 @@ static void acpi_processor_idle(void)
         power->last_state = &power->states[0];
         local_irq_enable();
         sched_tick_resume();
+        rcu_quiet_exit();
         cpufreq_dbs_timer_resume();
         return;
     }
@@ -793,6 +796,7 @@ static void acpi_processor_idle(void)
     power->last_state = &power->states[0];
 
     sched_tick_resume();
+    rcu_quiet_exit();
     cpufreq_dbs_timer_resume();
 
     if ( cpuidle_current_governor->reflect )
diff --git a/xen/arch/x86/cpu/mwait-idle.c b/xen/arch/x86/cpu/mwait-idle.c
index 52413e6da1..2657ec76f4 100644
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -756,7 +756,8 @@ static void mwait_idle(void)
 	cpufreq_dbs_timer_suspend();
 
 	sched_tick_suspend();
-	/* sched_tick_suspend() can raise TIMER_SOFTIRQ. Process it now. */
+	rcu_quiet_enter();
+	/* rcu_quiet_enter() can raise TIMER_SOFTIRQ. Process it now. */
 	process_pending_softirqs();
 
 	/* Interrupts must be disabled for C2 and higher transitions. */
@@ -765,6 +766,7 @@ static void mwait_idle(void)
 	if (!cpu_is_haltable(cpu)) {
 		local_irq_enable();
 		sched_tick_resume();
+		rcu_quiet_exit();
 		cpufreq_dbs_timer_resume();
 		return;
 	}
@@ -807,6 +809,7 @@ static void mwait_idle(void)
 		lapic_timer_on();
 
 	sched_tick_resume();
+	rcu_quiet_exit();
 	cpufreq_dbs_timer_resume();
 
 	if ( cpuidle_current_governor->reflect )
diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
index 42ab9dbbd6..a9c24b5889 100644
--- a/xen/common/rcupdate.c
+++ b/xen/common/rcupdate.c
@@ -52,8 +52,8 @@ static struct rcu_ctrlblk {
     int  next_pending;  /* Is the next batch already waiting?         */
 
     spinlock_t  lock __cacheline_aligned;
-    cpumask_t   cpumask; /* CPUs that need to switch in order ...   */
-    cpumask_t   ignore_cpumask; /* ... unless they are already idle */
+    cpumask_t   cpumask; /* CPUs that need to switch in order...      */
+    cpumask_t   ignore_cpumask; /* ...unless already idle or in guest */
     /* for current batch to proceed.        */
 } __cacheline_aligned rcu_ctrlblk = {
     .cur = -300,
@@ -85,7 +85,7 @@ struct rcu_data {
     struct rcu_head barrier;
     long            last_rs_qlen;     /* qlen during the last resched */
 
-    /* 3) idle CPUs handling */
+    /* 3) idle (or in guest mode) CPUs handling */
     struct timer cb_timer;
     bool cb_timer_active;
 };
@@ -107,6 +107,12 @@ struct rcu_data {
  * 3) it is stopped immediately, if the CPU wakes up from idle and
  *    resumes 'normal' execution.
  *
+ * Note also that the same happens if a CPU starts executing a guest that
+ * (almost) never comes back into the hypervisor. This may be the case if
+ * the guest uses "idle=poll" / "vwfi=native". Therefore, we need to handle
+ * guest entry events in the same way as the CPU going idle, i.e., consider
+ * it quiesced and arm the timer.
+ *
  * About how far in the future the timer should be programmed each time,
  * it's hard to tell (guess!!). Since this mimics Linux's periodic timer
  * tick, take values used there as an indication. In Linux 2.6.21, tick
@@ -304,9 +310,10 @@ static void rcu_start_batch(struct rcu_ctrlblk *rcp)
         * Make sure the increment of rcp->cur is visible so, even if a
         * CPU that is about to go idle, is captured inside rcp->cpumask,
         * rcu_pending() will return false, which then means cpu_quiet()
-        * will be invoked, before the CPU would actually enter idle.
+        * will be invoked, before the CPU would actually go idle (or
+	* enter a guest).
         *
-        * This barrier is paired with the one in rcu_idle_enter().
+        * This barrier is paired with the one in rcu_quiet_enter().
         */
         smp_mb();
         cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->ignore_cpumask);
@@ -463,14 +470,15 @@ int rcu_needs_cpu(int cpu)
  * periodically poke rcu_pedning(), so that it will invoke the callback
  * not too late after the end of the grace period.
  */
-static void cb_timer_start(void)
+static void cb_timer_start(unsigned int cpu)
 {
-    struct rcu_data *rdp = &this_cpu(rcu_data);
+    struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
 
     /*
      * Note that we don't check rcu_pending() here. In fact, we don't want
      * the timer armed on CPUs that are in the process of quiescing while
-     * going idle, unless they really are the ones with a queued callback.
+     * going idle or entering guest mode, unless they really have queued
+     * callbacks.
      */
     if (likely(!rdp->curlist))
         return;
@@ -479,9 +487,9 @@ static void cb_timer_start(void)
     rdp->cb_timer_active = true;
 }
 
-static void cb_timer_stop(void)
+static void cb_timer_stop(unsigned int cpu)
 {
-    struct rcu_data *rdp = &this_cpu(rcu_data);
+    struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
 
     if (likely(!rdp->cb_timer_active))
         return;
@@ -635,11 +643,14 @@ void __init rcu_init(void)
 }
 
 /*
- * The CPU is becoming idle, so no more read side critical
- * sections, and one more step toward grace period.
+ * The CPU is about to either go idle or enter the guest. In any of
+ * these cases, it can't have any outstanding read side critical sections
+ * so this is one step toward the end of the grace period.
  */
-void rcu_idle_enter(unsigned int cpu)
+void rcu_quiet_enter()
 {
+    unsigned int cpu = smp_processor_id();
+
     ASSERT(!cpumask_test_cpu(cpu, &rcu_ctrlblk.ignore_cpumask));
     cpumask_set_cpu(cpu, &rcu_ctrlblk.ignore_cpumask);
     /*
@@ -652,11 +663,15 @@ void rcu_idle_enter(unsigned int cpu)
      */
     smp_mb();
     cb_timer_start();
+    cb_timer_start(cpu);
 }
 
-void rcu_idle_exit(unsigned int cpu)
+
+void rcu_quiet_exit()
 {
-    cb_timer_stop();
+    unsigned int cpu = smp_processor_id();
+
+    cb_timer_stop(cpu);
     ASSERT(cpumask_test_cpu(cpu, &rcu_ctrlblk.ignore_cpumask));
     cpumask_clear_cpu(cpu, &rcu_ctrlblk.ignore_cpumask);
 }
diff --git a/xen/include/xen/rcupdate.h b/xen/include/xen/rcupdate.h
index 13850865ed..63db0f9887 100644
--- a/xen/include/xen/rcupdate.h
+++ b/xen/include/xen/rcupdate.h
@@ -145,8 +145,8 @@ void call_rcu(struct rcu_head *head,
 
 int rcu_barrier(void);
 
-void rcu_idle_enter(unsigned int cpu);
-void rcu_idle_exit(unsigned int cpu);
+void rcu_quiet_enter(void);
+void rcu_quiet_exit(void);
 
 void rcu_idle_timer_start(void);
 void rcu_idle_timer_stop(void);

[-- Attachment #4: 3_1-rcu-adaptations.patch --]
[-- Type: text/x-patch, Size: 1610 bytes --]

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 0902a15e8d..a8e203a1c1 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -935,7 +935,7 @@ static void complete_domain_destroy(struct rcu_head *head)
     struct domain *d = container_of(head, struct domain, rcu);
     struct vcpu *v;
     int i;
-
+    printk("complete_domain_destroy BEGIN\n");
     /*
      * Flush all state for the vCPU previously having run on the current CPU.
      * This is in particular relevant for x86 HVM ones on VMX, so that this
@@ -991,6 +991,7 @@ static void complete_domain_destroy(struct rcu_head *head)
     _domain_destroy(d);
 
     send_global_virq(VIRQ_DOM_EXC);
+    printk("complete_domain_destroy END\n");
 }
 
 /* Release resources belonging to task @p. */
diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
index a9c24b5889..1bdf4ecc53 100644
--- a/xen/common/rcupdate.c
+++ b/xen/common/rcupdate.c
@@ -662,7 +662,6 @@ void rcu_quiet_enter()
      * Se the comment before cpumask_andnot() in  rcu_start_batch().
      */
     smp_mb();
-    cb_timer_start();
     cb_timer_start(cpu);
 }
 
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 6d24a3a135..4a63c11ed1 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -3111,14 +3111,12 @@ void schedule_dump(struct cpupool *c)
 
 void sched_tick_suspend(void)
 {
-    rcu_idle_enter(smp_processor_id());
-    rcu_idle_timer_start();
+    rcu_quiet_enter();
 }
 
 void sched_tick_resume(void)
 {
-    rcu_idle_timer_stop();
-    rcu_idle_exit(smp_processor_id());
+    rcu_quiet_exit();
 }
 
 void wait(void)

[-- Attachment #5: rcu-quiesc-patch.patch --]
[-- Type: text/x-patch, Size: 1755 bytes --]

commit 0d2beb3d4125d65c415860d66974db9a5532ac84
Author: Dario Faggioli <dfaggioli@suse.com>
Date:   Wed Sep 26 11:47:06 2018 +0200

    xen: RCU: bootparam to force quiescence at every call.
    
    Signed-off-by: Dario Faggioli <dfaggioli@suse.com>

diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
index 0f4b1f2a5d..536eb17017 100644
--- a/xen/arch/arm/traps.c
+++ b/xen/arch/arm/traps.c
@@ -110,7 +110,10 @@ static enum {
 static int __init parse_vwfi(const char *s)
 {
 	if ( !strcmp(s, "native") )
+	{
+		rcu_always_quiesc = true;
 		vwfi = NATIVE;
+	}
 	else
 		vwfi = TRAP;
 
diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
index 3517790913..219dd2884f 100644
--- a/xen/common/rcupdate.c
+++ b/xen/common/rcupdate.c
@@ -140,6 +140,9 @@ static int qhimark = 10000;
 static int qlowmark = 100;
 static int rsinterval = 1000;
 
+bool rcu_always_quiesc = false;
+boolean_param("rcu_force_quiesc", rcu_always_quiesc);
+
 struct rcu_barrier_data {
     struct rcu_head head;
     atomic_t *cpu_count;
@@ -562,6 +565,13 @@ static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp,
     rdp->quiescbatch = rcp->completed;
     rdp->qs_pending = 0;
     rdp->cpu = cpu;
+    if ( rcu_always_quiesc )
+    {
+        blimit = INT_MAX;
+        qhimark = 0;
+        qlowmark = 0;
+        //rsinterval = 0;
+    }
     rdp->blimit = blimit;
     init_timer(&rdp->idle_timer, rcu_idle_timer_handler, rdp, cpu);
 }
diff --git a/xen/include/xen/rcupdate.h b/xen/include/xen/rcupdate.h
index 3402eb5caf..274a01acf6 100644
--- a/xen/include/xen/rcupdate.h
+++ b/xen/include/xen/rcupdate.h
@@ -56,6 +56,8 @@ struct rcu_head {
 } while (0)
 
 
+extern bool rcu_always_quiesc;
+
 int rcu_pending(int cpu);
 int rcu_needs_cpu(int cpu);
 

^ permalink raw reply related	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2021-02-15  7:15 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-21 10:54 Null scheduler and vwfi native problem Anders Törnqvist
2021-01-21 18:32 ` Dario Faggioli
2021-01-21 19:40   ` Julien Grall
2021-01-21 23:35     ` Dario Faggioli
2021-01-22  8:06       ` Anders Törnqvist
2021-01-22  9:05         ` Dario Faggioli
2021-01-22 14:26         ` Julien Grall
2021-01-22 17:44           ` Anders Törnqvist
2021-01-25 15:45             ` Dario Faggioli
2021-01-25 16:11           ` Dario Faggioli
2021-01-26 17:03             ` Anders Törnqvist
2021-01-26 22:31               ` Dario Faggioli
2021-01-29  8:08                 ` Anders Törnqvist
2021-01-29  8:18                   ` Jürgen Groß
2021-01-29 10:16                     ` Dario Faggioli
2021-02-01  6:53                       ` Anders Törnqvist
2021-01-30 17:59                   ` Dario Faggioli
2021-02-01  6:55                     ` Anders Törnqvist
2021-02-02  7:59                     ` Julien Grall
2021-02-02 15:03                       ` Dario Faggioli
2021-02-02 15:23                         ` Julien Grall
2021-02-03  7:31                           ` Dario Faggioli
2021-02-03  9:19                             ` Julien Grall
2021-02-03 11:00                               ` Jürgen Groß
2021-02-03 11:20                                 ` Julien Grall
2021-02-03 12:02                                   ` Jürgen Groß
2021-02-15  7:15                     ` Anders Törnqvist
2021-01-22 14:02       ` Julien Grall
2021-01-22 17:30         ` Anders Törnqvist
2021-01-22  8:07   ` Anders Törnqvist
2021-01-21 19:16 ` Julien Grall

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).