* xen/evtchn: Dom0 boot hangs using preempt_rt kernel 5.10
       [not found] <VI1PR08MB3629824170C1707255465D8BE46A9@VI1PR08MB3629.eurprd08.prod.outlook.com>
@ 2021-03-17 14:32 ` Luca Fancellu
  2021-03-18  7:54   ` Jürgen Groß
  0 siblings, 1 reply; 12+ messages in thread
From: Luca Fancellu @ 2021-03-17 14:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Juergen Gross, jgrall



Hi all,

We've been encountering an issue when using kernel 5.10 with preempt_rt support for Dom0: during Dom0 boot, it hits a BUG_ON(!irqs_disabled()) in the function evtchn_fifo_unmask, defined in events_fifo.c.

This is the call stack:

[   17.817018] ------------[ cut here ]------------
[   17.817021] kernel BUG at drivers/xen/events/events_fifo.c:258!
[   18.817079] Internal error: Oops - BUG: 0 [#1] PREEMPT_RT SMP
[   18.817081] Modules linked in: bridge stp llc ipv6
[   18.817086] CPU: 3 PID: 558 Comm: xenstored Not tainted 5.10.16-rt25-yocto-preempt-rt #1
[   18.817089] Hardware name: Arm Neoverse N1 System Development Platform (DT)
[   18.817090] pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
[   18.817092] pc : evtchn_fifo_unmask+0xd4/0xe0
[   18.817099] lr : xen_irq_lateeoi_locked+0xec/0x200
[   18.817102] sp : ffff8000123f3cc0
[   18.817102] x29: ffff8000123f3cc0 x28: ffff0000427b1d80
[   18.817104] x27: 0000000000000000 x26: 0000000000000000
[   18.817106] x25: 0000000000000001 x24: 0000000000000001
[   18.817107] x23: ffff0000412fc900 x22: 0000000000000004
[   18.817109] x21: 0000000000000000 x20: ffff000042e06990
[   18.817110] x19: ffff0000427b1d80 x18: 0000000000000010
[   18.817112] x17: 0000000000000000 x16: 0000000000000000
[   18.817113] x15: 0000000000000002 x14: 0000000000000001
[   18.817114] x13: 000000000001a7e8 x12: 0000000000000040
[   18.817116] x11: ffff000040400248 x10: ffff00004040024a
[   18.817117] x9 : ffff800011be5200 x8 : ffff000040400270
[   18.817119] x7 : 0000000000000000 x6 : 0000000000000003
[   18.817120] x5 : 0000000000000000 x4 : ffff000040400308
[   18.817121] x3 : ffff0000408a400c x2 : 0000000000000000
[   18.817122] x1 : 0000000000000000 x0 : ffff0000408a4000
[   18.817124] Call trace:
[   18.817125]  evtchn_fifo_unmask+0xd4/0xe0
[   18.817127]  xen_irq_lateeoi_locked+0xec/0x200
[   18.817129]  xen_irq_lateeoi+0x48/0x64
[   18.817131]  evtchn_write+0x124/0x15c
[   18.817134]  vfs_write+0xf0/0x2cc
[   18.817137]  ksys_write+0xe0/0x100
[   18.817139]  __arm64_sys_write+0x20/0x30
[   18.817142]  el0_svc_common.constprop.0+0x78/0x1a0
[   18.817145]  do_el0_svc+0x24/0x90
[   18.817147]  el0_svc+0x14/0x20
[   18.817151]  el0_sync_handler+0x1a4/0x1b0
[   18.817153]  el0_sync+0x174/0x180
[   18.817156] Code: 52800120 b90023e6 97e6d104 17fffff0 (d4210000)
[   18.817158] ---[ end trace 0000000000000002 ]---

The last kernel we tested was 5.4. Our analysis indicates that the root cause is the introduction of the lateeoi framework (xen/events: add a new "late EOI" evtchn framework) in conjunction with the preempt_rt patches, which keep irqs enabled between spinlock_t/rwlock_t _irqsave/_irqrestore operations.
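
To make the interaction concrete, here is a toy userspace model of it (not kernel code). Every model_* name is hypothetical and only stands in for the behaviour described above: with preempt_rt the _irqsave lock variant leaves interrupts enabled, so a check equivalent to BUG_ON(!irqs_disabled()) fires when unmask is reached from a syscall, while a lock that really masks interrupts keeps the check quiet.

```c
#include <stdbool.h>

/* Toy userspace model, NOT kernel code: every model_* name is hypothetical. */

/* Stands in for the interrupt mask of the one CPU being modeled. */
static bool model_irqs_off;

/* preempt_rt spinlock_t/rwlock_t: _irqsave saves the state, irqs stay on. */
static bool model_rt_lock_irqsave(void)
{
    return model_irqs_off;          /* saved flags; mask left untouched */
}

/* A lock whose _irqsave variant really masks interrupts. */
static bool model_raw_lock_irqsave(void)
{
    bool flags = model_irqs_off;

    model_irqs_off = true;
    return flags;
}

/* Both unlock variants put back whatever state was saved. */
static void model_unlock_irqrestore(bool flags)
{
    model_irqs_off = flags;
}

/* Same condition as BUG_ON(!irqs_disabled()) in the unmask path. */
static bool model_unmask_would_bug(void)
{
    return !model_irqs_off;
}

/* Unmask reached from process context (e.g. write()) under either lock. */
static bool model_unmask_from_task(bool lock_masks_irqs)
{
    bool flags, bug;

    model_irqs_off = false;         /* syscalls run with irqs enabled */
    flags = lock_masks_irqs ? model_raw_lock_irqsave()
                            : model_rt_lock_irqsave();
    bug = model_unmask_would_bug();
    model_unlock_irqrestore(flags);
    return bug;
}
```

In this model, model_unmask_from_task(false) reports that the check would fire, which matches the trace above, and model_unmask_from_task(true) reports that it would not.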

Given that many modifications were made to the mask/unmask operations, including a big one from Juergen Gross (xen/events: don't unmask an event channel when an eoi is pending), is the BUG_ON(...) still needed?

With the mentioned commit every call to a mask/unmask operation is protected by a spinlock, so I would like some feedback from those with more experience than me in this part of the code.

Thank you,

Luca



* Re: xen/evtchn: Dom0 boot hangs using preempt_rt kernel 5.10
  2021-03-17 14:32 ` xen/evtchn: Dom0 boot hangs using preempt_rt kernel 5.10 Luca Fancellu
@ 2021-03-18  7:54   ` Jürgen Groß
  2021-03-18  8:47     ` Luca Fancellu
  0 siblings, 1 reply; 12+ messages in thread
From: Jürgen Groß @ 2021-03-18  7:54 UTC (permalink / raw)
  To: Luca Fancellu, xen-devel; +Cc: jgrall



On 17.03.21 15:32, Luca Fancellu wrote:
> [...]
> 
> Given that many modifications were made to the mask/unmask operations, a 
> big one from Juergen Gross (xen/events: don't unmask an event channel 
> when an eoi is pending), is the BUG_ON(...) still needed?
> 
> With the mentioned commit every call to a mask/unmask operation is 
> protected by a spinlock, so I would like some feedback from those 
> with more experience than me in this part of the code.

I think this BUG_ON() can be removed.

Are you planning to send a patch?


Juergen


* Re: xen/evtchn: Dom0 boot hangs using preempt_rt kernel 5.10
  2021-03-18  7:54   ` Jürgen Groß
@ 2021-03-18  8:47     ` Luca Fancellu
  2021-03-19 11:50       ` Luca Fancellu
  0 siblings, 1 reply; 12+ messages in thread
From: Luca Fancellu @ 2021-03-18  8:47 UTC (permalink / raw)
  To: Jürgen Groß; +Cc: xen-devel, jgrall

Hi Juergen,

If you are willing to write the patch, I think it will be accepted faster. What about the BUG_ON(…) in evtchn_2l_unmask in events_2l.c?

Cheers,

Luca


* Re: xen/evtchn: Dom0 boot hangs using preempt_rt kernel 5.10
  2021-03-18  8:47     ` Luca Fancellu
@ 2021-03-19 11:50       ` Luca Fancellu
  2021-03-19 11:54         ` Jürgen Groß
  0 siblings, 1 reply; 12+ messages in thread
From: Luca Fancellu @ 2021-03-19 11:50 UTC (permalink / raw)
  To: Jürgen Groß; +Cc: xen-devel, jgrall

Hi Juergen,

Could you confirm whether backporting these two series to Linux kernel 5.10 is needed in order to remove the BUG_ON(…)?

https://patchwork.kernel.org/project/xen-devel/cover/20201210192536.118432146@linutronix.de/
https://patchwork.kernel.org/project/xen-devel/cover/20210306161833.4552-1-jgross@suse.com/

Thank you for your time.

Cheers,

Luca


* Re: xen/evtchn: Dom0 boot hangs using preempt_rt kernel 5.10
  2021-03-19 11:50       ` Luca Fancellu
@ 2021-03-19 11:54         ` Jürgen Groß
  2021-03-22 19:08           ` Luca Fancellu
  0 siblings, 1 reply; 12+ messages in thread
From: Jürgen Groß @ 2021-03-19 11:54 UTC (permalink / raw)
  To: Luca Fancellu; +Cc: xen-devel, jgrall



On 19.03.21 12:50, Luca Fancellu wrote:
> Hi Juergen,
> 
> Could you confirm that backporting these two series to the Linux kernel 5.10:
> 
> https://patchwork.kernel.org/project/xen-devel/cover/20201210192536.118432146@linutronix.de/

I don't see why this one would be needed?

> https://patchwork.kernel.org/project/xen-devel/cover/20210306161833.4552-1-jgross@suse.com/

This is already part of 5.10.24.


Juergen


* Re: xen/evtchn: Dom0 boot hangs using preempt_rt kernel 5.10
  2021-03-19 11:54         ` Jürgen Groß
@ 2021-03-22 19:08           ` Luca Fancellu
  2021-03-23 12:36             ` Jason Andryuk
  0 siblings, 1 reply; 12+ messages in thread
From: Luca Fancellu @ 2021-03-22 19:08 UTC (permalink / raw)
  To: Jürgen Groß; +Cc: xen-devel

Hi Juergen,

Yes, you are right, it was my mistake. As you said, this series (https://patchwork.kernel.org/project/xen-devel/cover/20210306161833.4552-1-jgross@suse.com/) is needed to remove the BUG_ON(…). Since I’m using Yocto, I can only build a preempt_rt kernel up to 5.10.23, so I’m applying that series on top of that version and then removing the BUG_ON(…).

What was not expected is that the Dom0 kernel now gets stuck at the “Setting domain 0 name, domid and JSON config…” step and the system seems unresponsive. It looks like a deadlock, but looking into the series we can’t spot anything, and the series was also tested by others in the community.
Do you have any idea what could cause that?

Thank you for your time.

Cheers,
Luca


* Re: xen/evtchn: Dom0 boot hangs using preempt_rt kernel 5.10
  2021-03-22 19:08           ` Luca Fancellu
@ 2021-03-23 12:36             ` Jason Andryuk
  2021-03-23 15:39               ` Luca Fancellu
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Andryuk @ 2021-03-23 12:36 UTC (permalink / raw)
  To: Luca Fancellu; +Cc: Jürgen Groß, xen-devel

On Mon, Mar 22, 2021 at 3:09 PM Luca Fancellu <luca.fancellu@arm.com> wrote:
>
> Hi Juergen,
>
> Yes, you are right, it was my mistake. As you said, this series (https://patchwork.kernel.org/project/xen-devel/cover/20210306161833.4552-1-jgross@suse.com/) is needed to remove the BUG_ON(…). Since I’m using Yocto, I can only build a preempt_rt kernel up to 5.10.23, so I’m applying that series on top of that version and then removing the BUG_ON(…).
>
> What was not expected is that the Dom0 kernel now gets stuck at the “Setting domain 0 name, domid and JSON config…” step and the system seems unresponsive. It looks like a deadlock, but looking into the series we can’t spot anything, and the series was also tested by others in the community.
> Do you have any idea what could cause that?

I think this can happen if xenstored isn't running.  Client access to
xenstore blocks until xenstored becomes available.  So check your init
scripts ordering.

Regards,
Jason



* Re: xen/evtchn: Dom0 boot hangs using preempt_rt kernel 5.10
  2021-03-23 12:36             ` Jason Andryuk
@ 2021-03-23 15:39               ` Luca Fancellu
  2021-03-23 17:06                 ` Luca Fancellu
  0 siblings, 1 reply; 12+ messages in thread
From: Luca Fancellu @ 2021-03-23 15:39 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: xen-devel

Hi Jason,

Thanks for your hints. Unfortunately, it doesn't seem to be an init problem: with the same init configuration I tried 5.10.23 (preempt_rt) without Juergen's series but with the BUG_ON removed, and it boots without problems. So it seems that applying the series changes something (on a preempt_rt kernel), and we are trying to figure out what.


Cheers,
Luca


* Re: xen/evtchn: Dom0 boot hangs using preempt_rt kernel 5.10
  2021-03-23 15:39               ` Luca Fancellu
@ 2021-03-23 17:06                 ` Luca Fancellu
  2021-03-23 19:26                   ` Julien Grall
  0 siblings, 1 reply; 12+ messages in thread
From: Luca Fancellu @ 2021-03-23 17:06 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: xen-devel, Jürgen Groß

Hi all,

I have an update: after changing the lock introduced by the series from spinlock_t to raw_spinlock_t, switching the lock/unlock calls to the raw_* versions, and keeping the BUG_ON(…) (which is now possible because the raw_* implementation disables interrupts on preempt_rt), the kernel boots correctly.
So it seems that the BUG_ON(…) is needed and the unmask function should run with interrupts disabled. Does anyone know why this change worked?

Cheers,
Luca


* Re: xen/evtchn: Dom0 boot hangs using preempt_rt kernel 5.10
  2021-03-23 17:06                 ` Luca Fancellu
@ 2021-03-23 19:26                   ` Julien Grall
  2021-03-23 22:59                     ` Julien Grall
  2021-03-24 10:37                     ` Luca Fancellu
  0 siblings, 2 replies; 12+ messages in thread
From: Julien Grall @ 2021-03-23 19:26 UTC (permalink / raw)
  To: Luca Fancellu, Jason Andryuk; +Cc: xen-devel, Jürgen Groß



On 23/03/2021 17:06, Luca Fancellu wrote:
> Hi all,

Hi,

Please avoid top-posting when replying to a comment. It makes the thread 
more difficult to follow.

> I have an update: after changing the lock introduced by the series from spinlock_t to raw_spinlock_t, switching the lock/unlock calls to the raw_* versions, and keeping the BUG_ON(…) (which is now possible because the raw_* implementation disables interrupts on preempt_rt), the kernel boots correctly.
> So it seems that the BUG_ON(…) is needed and the unmask function should run with interrupts disabled. Does anyone know why this change worked?

Do you mean why no-one spotted the issue before? If so, AFAIK, on 
vanilla Linux, spin_lock is still just a wrapper around raw_spinlock. IOW 
there is no option to replace it with an RT spinlock.

So if you don't apply the RT patches, you would not be able to trigger 
the issue.

As to the fix itself, I think using raw_spinlock_t is the correct thing 
to do because the lock is also used in interrupt context (even with RT 
enabled).

Would you be able to send a patch?

>>> On Mon, Mar 22, 2021 at 3:09 PM Luca Fancellu <luca.fancellu@arm.com> wrote:
>>>> [...]
>>>> What was not expected is that the Dom0 kernel now gets stuck at the “Setting domain 0 name, domid and JSON config…” step and the system seems unresponsive. It looks like a deadlock, but looking into the series we can’t spot anything, and the series was also tested by others in the community.

The deadlock is expected. When you enable RT spinlocks, interrupts 
will not be disabled even when you call spin_lock_irqsave().

As the lock is also used in interrupt context (e.g. with interrupts 
masked), this will lead to a deadlock because the lock can be held with 
interrupts unmasked.
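
The scenario can be sketched with a deliberately simplified single-CPU model in userspace C. This is not kernel code: the model_* names are hypothetical, and RT details such as threaded handlers are ignored. It only shows the ordering problem: if taking the lock does not mask interrupts, an interrupt that needs the same lock can arrive on the CPU that already holds it.

```c
#include <stdbool.h>

/* Simplified single-CPU model, NOT kernel code; model_* names are made up. */
struct model_cpu {
    bool irqs_off;      /* interrupt mask of the modeled CPU */
    bool lock_held;     /* the lock shared by task and interrupt paths */
};

/*
 * Returns true when an interrupt needing the lock is delivered while the
 * interrupted task still holds it. The handler can then never make
 * progress, because the holder cannot run until the handler returns.
 */
static bool model_irq_deadlocks(bool lock_masks_irqs)
{
    struct model_cpu cpu = { .irqs_off = false, .lock_held = false };
    bool irq_delivered;

    /* Task context takes the lock through the _irqsave variant. */
    cpu.irqs_off = lock_masks_irqs;
    cpu.lock_held = true;

    /* An event arrives; it is delivered only if interrupts are unmasked. */
    irq_delivered = !cpu.irqs_off;

    /* The handler wants the lock that the interrupted task holds. */
    return irq_delivered && cpu.lock_held;
}
```

Here model_irq_deadlocks(false) corresponds to the RT spinlock_t case, and model_irq_deadlocks(true) to a lock that really masks interrupts.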

This is quite a common error as developers are not yet used to testing 
with RT. I remember finding a few other instances like that when I worked 
on RT a couple of years ago.

For future reference, I think CONFIG_PROVE_LOCKING=y could help you to 
detect (potential) deadlocks.
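
As a rough sketch, a kernel config fragment for this kind of debugging might look like the one below. CONFIG_PROVE_RAW_LOCK_NESTING goes beyond the option named here; to my knowledge it depends on CONFIG_PROVE_LOCKING and makes lockdep complain when a spinlock_t is taken inside a raw_spinlock_t section, even on a non-RT build.

```
CONFIG_PROVE_LOCKING=y
CONFIG_PROVE_RAW_LOCK_NESTING=y
```

Both options add runtime overhead, so they are meant for debug builds only.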

Cheers,

-- 
Julien Grall



* Re: xen/evtchn: Dom0 boot hangs using preempt_rt kernel 5.10
  2021-03-23 19:26                   ` Julien Grall
@ 2021-03-23 22:59                     ` Julien Grall
  2021-03-24 10:37                     ` Luca Fancellu
  1 sibling, 0 replies; 12+ messages in thread
From: Julien Grall @ 2021-03-23 22:59 UTC (permalink / raw)
  To: Luca Fancellu, Jason Andryuk; +Cc: xen-devel, Jürgen Groß



On 23/03/2021 19:26, Julien Grall wrote:
> [...]
> 
> For future reference, I think CONFIG_PROVE_LOCKING=y could help you to 
> detect (potential) deadlock.

Actually, Linux has a config option that detects misuse of spinlocks. If I 
enable CONFIG_PROVE_RAW_LOCK_NESTING, I get the following:

[    1.286028] =============================
[    1.286781] [ BUG: Invalid wait context ]
[    1.287539] 5.12.0-rc4+ #222 Not tainted
[    1.288277] -----------------------------
[    1.289033] swapper/0/0 is trying to lock:
[    1.289799] ffff8881001b8358 (&info->lock){....}-{3:3}, at: do_unmask+0x1c/0x50
[    1.291132] other info that might help us debug this:
[    1.292066] context-{5:5}
[    1.292594] 2 locks held by swapper/0/0:
[    1.293339]  #0: ffff8881001bfae8 (&desc->request_mutex){+.+.}-{4:4}, at: __setup_irq+0xb9/0x8c0
[    1.294909]  #1: ffff8881001bf930 (&irq_desc_lock_class){....}-{2:2}, at: __setup_irq+0xea/0x8c0
[    1.296478] stack backtrace:
[    1.297062] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.12.0-rc4+ #222
[    1.298304] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[    1.300326] Call Trace:
[    1.300837]  dump_stack+0x7f/0xa1
[    1.301514]  __lock_acquire+0x86e/0x1e20
[    1.302281]  ? add_lock_to_list.constprop.58+0xac/0x1d0
[    1.303270]  lock_acquire+0x26d/0x400
[    1.303987]  ? do_unmask+0x1c/0x50
[    1.304670]  _raw_spin_lock_irqsave+0x48/0x60
[    1.305525]  ? do_unmask+0x1c/0x50
[    1.306208]  do_unmask+0x1c/0x50
[    1.306863]  unmask_irq.part.41+0x25/0x40
[    1.307655]  __irq_startup+0x68/0x70
[    1.308361]  irq_startup+0x59/0xf0
[    1.309046]  __setup_irq+0x7ea/0x8c0
[    1.309750]  request_threaded_irq+0xfb/0x160
[    1.310581]  ? xen_timerop_shutdown+0x10/0x10
[    1.311434]  bind_virq_to_irqhandler+0x47/0x80
[    1.312293]  xen_setup_timer+0x7f/0x190
[    1.313045]  xen_time_init+0x1bf/0x214
[    1.313780]  x86_late_time_init+0x17/0x30
[    1.314553]  start_kernel+0x851/0x90a
[    1.315276]  ? default_get_nmi_reason+0x10/0x10
[    1.316149]  xen_start_kernel+0x575/0x57f
[    1.316932]  startup_xen+0x3e/0x3e

This is on x86 (either PV or HVM). Interestingly, on Arm I get a 
different one:

[    0.112598] =============================
[    0.112745] [ BUG: Invalid wait context ]
[    0.112953] 5.12.0-rc4+ #98 Not tainted
[    0.113156] -----------------------------
[    0.113301] swapper/0/1 is trying to lock:
[    0.113461] ffffa10084d2f788 (evtchn_rwlock){....}-{3:3}, at: __xen_evtchn_do_upcall+0x64/0xf0
[    0.114041] other info that might help us debug this:
[    0.114229] context-{2:2}
[    0.114351] 2 locks held by swapper/0/1:
[    0.114608]  #0: ffffa10084ab1e78 (cpu_hotplug_lock){.+.+}-{0:0}, at: xen_guest_init+0x33c/0x380
[    0.115076]  #1: ffffa10084ab1c70 (cpuhp_state_mutex){+.+.}-{4:4}, at: __cpuhp_setup_state_cpuslocked+0x64/0x308
[    0.115530] stack backtrace:
[    0.115880] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.12.0-rc4+ #98
[    0.116371] Call trace:
[    0.116488]  dump_backtrace+0x0/0x1e0
[    0.116650]  show_stack+0x2c/0x48
[    0.116791]  dump_stack+0xf0/0x138
[    0.116933]  __lock_acquire+0x5ac/0x1a38
[    0.117085]  lock_acquire+0x3cc/0x4e0
[    0.117225]  _raw_read_lock+0x80/0x188
[    0.117370]  __xen_evtchn_do_upcall+0x64/0xf0
[    0.117536]  xen_hvm_evtchn_do_upcall+0x14/0x20
[    0.117706]  xen_arm_callback+0x14/0x20
[    0.117855]  handle_percpu_devid_irq+0xc8/0x428
[    0.118027]  generic_handle_irq+0x3c/0x58
[    0.118174]  __handle_domain_irq+0x68/0xc0
[    0.118323]  gic_handle_irq+0x64/0x160
[    0.118463]  el1_irq+0xb0/0x140
[    0.118593]  _raw_spin_unlock_irqrestore+0x6c/0xe8
[    0.118769]  __irq_put_desc_unlock+0x30/0x60
[    0.118935]  enable_percpu_irq+0x94/0xf0
[    0.119079]  xen_starting_cpu+0x70/0x1b8
[    0.119228]  cpuhp_invoke_callback+0xfc/0xf20
[    0.119395]  cpuhp_issue_call+0x220/0x250
[    0.119542]  __cpuhp_setup_state_cpuslocked+0x150/0x308
[    0.119725]  __cpuhp_setup_state+0x138/0x3b0
[    0.119891]  xen_guest_init+0x33c/0x380
[    0.120033]  do_one_initcall+0x90/0x470
[    0.120176]  kernel_init_freeable+0x250/0x440
[    0.120345]  kernel_init+0x1c/0x138
[    0.120480]  ret_from_fork+0x10/0x18

I am not sure why I don't get a splat for info->lock on Arm.

Anyway, I am quite confused by the second splat. It seems to suggest that 
it is not valid to call read_lock() while cpu_hotplug_lock is held. Does 
it ring any bells?
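
For anyone who wants to reproduce splats like the two above, a config fragment along the following lines should enable the checks. This is only a sketch: CONFIG_PROVE_LOCKING and CONFIG_PROVE_RAW_LOCK_NESTING are the options named in this thread, while CONFIG_DEBUG_KERNEL is listed on the assumption that it is a prerequisite for them in the kernel being built.

```
# Assumed prerequisite for the lock debugging options below
CONFIG_DEBUG_KERNEL=y
# Lockdep: reports (potential) deadlocks
CONFIG_PROVE_LOCKING=y
# Reports a spinlock_t/rwlock_t taken where only raw locks are allowed
CONFIG_PROVE_RAW_LOCK_NESTING=y
```

With these set, the "Invalid wait context" reports appear even on a kernel without the RT patches, which makes the problem visible before moving to preempt_rt.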

Cheers,

-- 
Julien Grall



* Re: xen/evtchn: Dom0 boot hangs using preempt_rt kernel 5.10
  2021-03-23 19:26                   ` Julien Grall
  2021-03-23 22:59                     ` Julien Grall
@ 2021-03-24 10:37                     ` Luca Fancellu
  1 sibling, 0 replies; 12+ messages in thread
From: Luca Fancellu @ 2021-03-24 10:37 UTC (permalink / raw)
  To: Julien Grall; +Cc: Jason Andryuk, xen-devel, Jürgen Groß



> On 23 Mar 2021, at 19:26, Julien Grall <julien@xen.org> wrote:
> 
> 
> 
> On 23/03/2021 17:06, Luca Fancellu wrote:
>> Hi all,
> 
> Hi,
> 
> Please avoid top posting when replying to a comment. It makes the thread more difficult to follow.
> 
>> I have an update: after changing the lock introduced by the series from spinlock_t to raw_spinlock_t, changing the lock/unlock calls to the raw_* versions, and keeping the BUG_ON(…) (which we can now do because the raw_* implementation disables interrupts on preempt_rt), the kernel boots correctly.
>> So it seems that the BUG_ON(…) is needed and the unmask function should run with interrupts disabled. Does anyone know why this change worked?
> 
> Do you mean why no-one spotted the issue before? If so, AFAIK, on vanilla Linux, spin_lock is still just a wrapper around raw_spinlock. IOW there is no option to replace it with an RT spinlock.
> 
> So if you don't apply the RT patches, you would not be able to trigger the issue.
> 
> As to the fix itself, I think using raw_spinlock_t is the correct thing to do because the lock is also used in interrupt context (even with RT enabled).
> 
> Would you be able to send a patch?

Yes, I’ll send a patch soon.
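
As a rough illustration of why the raw_* variant keeps the BUG_ON(…) satisfied while the plain spinlock does not on preempt_rt, here is a small userspace model. It is not kernel code: every name in it (model_lock, model_raw_lock_irqsave, model_rt_lock_irqsave, model_unmask_check and so on) is hypothetical, and the only behaviour it encodes is what was stated in this thread, namely that the raw_* lock disables interrupts and the RT spinlock leaves them enabled even with spin_lock_irqsave().

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace toy model only; all names below are hypothetical. */

/* Models the CPU interrupt mask: true means interrupts are enabled. */
bool irqs_enabled = true;

struct model_lock {
    bool held;
};

/* Models raw_spin_lock_irqsave(): takes the lock and masks interrupts. */
bool model_raw_lock_irqsave(struct model_lock *l)
{
    bool flags = irqs_enabled;   /* remember the previous mask state */
    irqs_enabled = false;
    l->held = true;
    return flags;
}

/* Models raw_spin_unlock_irqrestore(): restores the saved mask state. */
void model_raw_unlock_irqrestore(struct model_lock *l, bool flags)
{
    l->held = false;
    irqs_enabled = flags;
}

/*
 * Models spin_lock_irqsave() on a spinlock_t under preempt_rt, as
 * described in this thread: the lock is taken but interrupts stay
 * as they were.
 */
bool model_rt_lock_irqsave(struct model_lock *l)
{
    l->held = true;
    return irqs_enabled;
}

void model_rt_unlock_irqrestore(struct model_lock *l, bool flags)
{
    (void)flags;                 /* nothing to restore in this model */
    l->held = false;
}

/* Stands in for BUG_ON(!irqs_disabled()): true means the check passes. */
bool model_unmask_check(void)
{
    return !irqs_enabled;
}

/* Runs the unmask check while holding the raw-style lock. */
bool unmask_check_under_raw_lock(void)
{
    struct model_lock lock = { false };
    irqs_enabled = true;
    bool flags = model_raw_lock_irqsave(&lock);
    bool ok = model_unmask_check();
    model_raw_unlock_irqrestore(&lock, flags);
    return ok;
}

/* Runs the unmask check while holding the RT-style lock. */
bool unmask_check_under_rt_lock(void)
{
    struct model_lock lock = { false };
    irqs_enabled = true;
    bool flags = model_rt_lock_irqsave(&lock);
    bool ok = model_unmask_check();
    model_rt_unlock_irqrestore(&lock, flags);
    return ok;
}
```

In this model, unmask_check_under_raw_lock() passes the check and unmask_check_under_rt_lock() fails it, which mirrors what I saw: with the raw_* lock the BUG_ON(…) can stay in place. The model says nothing about the deadlock itself.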

> 
>>> On 23 Mar 2021, at 15:39, Luca Fancellu <luca.fancellu@arm.com> wrote:
>>> 
>>> Hi Jason,
>>> 
>>> Thanks for your hints. Unfortunately it does not seem to be an init problem, because with the same init configuration I tried 5.10.23 (preempt_rt) without Juergen's patch but with the BUG_ON removed, and it boots without problems. So it seems that applying the series changes something (on a preempt_rt kernel), and we are trying to figure out what.
>>> 
>>> 
>>>> On 23 Mar 2021, at 12:36, Jason Andryuk <jandryuk@gmail.com> wrote:
>>>> 
>>>> On Mon, Mar 22, 2021 at 3:09 PM Luca Fancellu <luca.fancellu@arm.com> wrote:
>>>>> 
>>>>> Hi Juergen,
>>>>> 
>>>>> Yes, you are right, it was my mistake. As you said, to remove the BUG_ON(…) this series (https://patchwork.kernel.org/project/xen-devel/cover/20210306161833.4552-1-jgross@suse.com/) is needed. Since I’m using yocto, I’m able to build a preempt_rt kernel up to 5.10.23, and for this reason I’m applying that series on top of this version and then removing the BUG_ON(…).
>>>>> 
>>>>> What was not expected is that the Dom0 kernel is now stuck at the “Setting domain 0 name, domid and JSON config…” step and the system seems unresponsive. It looks like a deadlock, but looking into the series we can’t spot anything, and the series was also tested by others from the community.
> 
> The deadlock is expected. When RT spinlocks are enabled, interrupts will not be disabled even when you call spin_lock_irqsave().
> 
> As the lock is also used in interrupt context (e.g. with interrupts masked), this will lead to a deadlock because the lock can be held with interrupts unmasked.
> 
> This is quite a common error, as developers are not yet used to testing with RT. I remember finding a few other instances like that when I worked on RT a couple of years ago.
> 
> For future reference, I think CONFIG_PROVE_LOCKING=y could help you detect (potential) deadlocks.
> 
> Cheers,
> 
> -- 
> Julien Grall



end of thread, other threads:[~2021-03-24 10:38 UTC | newest]

Thread overview: 12+ messages
     [not found] <VI1PR08MB3629824170C1707255465D8BE46A9@VI1PR08MB3629.eurprd08.prod.outlook.com>
2021-03-17 14:32 ` xen/evtchn: Dom0 boot hangs using preempt_rt kernel 5.10 Luca Fancellu
2021-03-18  7:54   ` Jürgen Groß
2021-03-18  8:47     ` Luca Fancellu
2021-03-19 11:50       ` Luca Fancellu
2021-03-19 11:54         ` Jürgen Groß
2021-03-22 19:08           ` Luca Fancellu
2021-03-23 12:36             ` Jason Andryuk
2021-03-23 15:39               ` Luca Fancellu
2021-03-23 17:06                 ` Luca Fancellu
2021-03-23 19:26                   ` Julien Grall
2021-03-23 22:59                     ` Julien Grall
2021-03-24 10:37                     ` Luca Fancellu
