* Re: Advice sought on RCU stalls on ARM64 WSL2
       [not found] <6a07-65e37700-1-4052b980@161649365>
@ 2024-03-02 19:43 ` Paul E. McKenney
       [not found]   ` <3718bb-65e38400-d-5cd53080@68830111>
  0 siblings, 1 reply; 9+ messages in thread
From: Paul E. McKenney @ 2024-03-02 19:43 UTC (permalink / raw)
  To: Max Boone; +Cc: boqun.feng, rcu

[ Adding Boqun and the rcu list on CC. ]

On Sat, Mar 02, 2024 at 07:59:08PM +0100, Max Boone wrote:
> 
> Dear Dr. McKenney,
> 
> For a couple of years now I've been the sometimes frustrated owner of a Microsoft Surface Pro X ARM64 device. It has been getting progressively better as more vendors target their builds at ARM64, but since the introduction of the device there have been issues with the Windows Subsystem for Linux (no more than an opinionated Hyper-V VM with extensive tooling) locking up and hanging.
> 
> When this happens, traces like the following are dumped in the kernel messages:
> https://github.com/microsoft/WSL/issues/9454#issuecomment-1942222109
> 
> In your talk "Decoding Those Inscrutable RCU CPU Stall Warnings" you mentioned that one should feel free to reach out when running into such issues. Building other kernel releases, switching modules off and on, and playing with the RCU grace-period times have so far not helped me (or others in that thread).
> 
> Anyways, I don't really know where to start looking and the call stacks aren't very informative (to my eye) either. I'm hoping you might help me find the direction to look for the root of this problem.

I am assuming that you have filed a bug with the Debian folks, and before
doing that, searched for similar bug reports.

At first glance, this is because things were stuck here:

[  967.115632]  clear_rseq_cs.isra.0+0x4c/0x60
[  967.116433]  do_notify_resume+0xf8/0xeb0
[  967.116960]  el0_svc+0x3c/0x50
[  967.117537]  el0t_64_sync_handler+0x9c/0x120
[  967.118323]  el0t_64_sync+0x158/0x15c

So including these function names (clear_rseq_cs() and so on) in your
search for similar bug reports would be a good idea.
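
As a minimal sketch of how the search terms could be pulled out of such
a trace: the snippet below extracts the symbol names from the frames
quoted above and strips compiler-generated clone suffixes such as
".isra.0". The helper name search_terms() and the list of suffixes are
illustrative assumptions, not part of any kernel tooling.

```python
import re

# Frames copied from the stall report quoted above.
TRACE = """\
[  967.115632]  clear_rseq_cs.isra.0+0x4c/0x60
[  967.116433]  do_notify_resume+0xf8/0xeb0
[  967.116960]  el0_svc+0x3c/0x50
[  967.117537]  el0t_64_sync_handler+0x9c/0x120
[  967.118323]  el0t_64_sync+0x158/0x15c
"""

# Matches "[ timestamp ]  symbol+0xOFFSET/0xSIZE" and captures the symbol.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

# Assumed set of clone suffixes to strip (for example ".isra.0").
CLONE_SUFFIX_RE = re.compile(r"\.(isra|constprop|part|cold)(\.\d+)?$")


def search_terms(trace: str) -> list[str]:
    """Return the function names in a call trace, in order, without duplicates."""
    terms = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line)
        if not match:
            continue
        name = match.group(1)
        # A symbol can carry several suffixes, so strip until nothing changes.
        while True:
            stripped = CLONE_SUFFIX_RE.sub("", name)
            if stripped == name:
                break
            name = stripped
        if name not in terms:
            terms.append(name)
    return terms


if __name__ == "__main__":
    print(" ".join(search_terms(TRACE)))
```

The resulting names can be pasted into a bug tracker or web search
together with "RCU stall".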

I am unfamiliar with that code.

So I added Boqun because he works with Linux on HyperV as part of his
day job and has a great deal of experience with RCU.  He will likely
have quite a number of questions for you including exact versions,
Debian bug number, the results of your web search, and so on.  He might
also know an ARM person to get involved in this.

Or maybe he knows the solution off the top of his head!

							Thanx, Paul


* Re: Advice sought on RCU stalls on ARM64 WSL2
       [not found]   ` <3718bb-65e38400-d-5cd53080@68830111>
@ 2024-03-03  0:19     ` Paul E. McKenney
  2024-03-04 16:33       ` Boqun Feng
  0 siblings, 1 reply; 9+ messages in thread
From: Paul E. McKenney @ 2024-03-03  0:19 UTC (permalink / raw)
  To: Max Boone; +Cc: boqun.feng, rcu

On Sat, Mar 02, 2024 at 08:53:54PM +0100, Max Boone wrote:
> 
> Thank you so much for the quick reply!
> 
> I haven't filed a bug with Debian specifically, as I'm running the Linux kernel built and provided by Microsoft, with Ubuntu as the OS on top. If it helps with the search I'd gladly run Debian and file a bug there, but I will still need to build my own kernel, as WSL requires some drivers (such as Hyper-V storage and sockets) to be built into the kernel (=y) instead of as modules (=m).

Ah, if you built your own kernel, then you are your own distro as far
as kernel issues are concerned.  ;-)

							Thanx, Paul

> I'll stick to using the rcu list from here on to avoid spam, thanks again!


* Re: Advice sought on RCU stalls on ARM64 WSL2
  2024-03-03  0:19     ` Paul E. McKenney
@ 2024-03-04 16:33       ` Boqun Feng
  2024-03-04 21:54         ` Max Boone
  0 siblings, 1 reply; 9+ messages in thread
From: Boqun Feng @ 2024-03-04 16:33 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Max Boone, rcu

On Sat, Mar 02, 2024 at 04:19:17PM -0800, Paul E. McKenney wrote:
> On Sat, Mar 02, 2024 at 08:53:54PM +0100, Max Boone wrote:
> > 
> > Thank you so much for the quick reply!
> > 
> > I haven't filed a bug with Debian specifically, as I'm running the Linux kernel built and provided by Microsoft, with Ubuntu as the OS on top. If it helps with the search I'd gladly run Debian and file a bug there, but I will still need to build my own kernel, as WSL requires some drivers (such as Hyper-V storage and sockets) to be built into the kernel (=y) instead of as modules (=m).
> 
> Ah, if you built your own kernel, then you are your own distro as far
> as kernel issues are concerned.  ;-)
> 
> 							Thanx, Paul
> 
> > I'll stick to using the rcu list from here on to avoid spam, thanks again!
> >
> > On Saturday, March 02, 2024 20:43 CET, "Paul E. McKenney" <paulmck@kernel.org> wrote:
> >  [ Adding Boqun and the rcu list on CC. ]
> > 

Thanks, Paul.

> > On Sat, Mar 02, 2024 at 07:59:08PM +0100, Max Boone wrote:
> > >
> > > Dear Dr. McKenney,
> > >
> > > For a couple of years now I've been the sometimes frustrated owner of a Microsoft Surface Pro X ARM64 device. It has been getting progressively better as more vendors target their builds at ARM64, but since the introduction of the device there have been issues with the Windows Subsystem for Linux (no more than an opinionated Hyper-V VM with extensive tooling) locking up and hanging.
> > >
> > > When this happens, traces like the following are dumped in the kernel messages:
> > > https://github.com/microsoft/WSL/issues/9454#issuecomment-1942222109
> > >
> > > In your talk "Decoding Those Inscrutable RCU CPU Stall Warnings" you mentioned that one should feel free to reach out when running into such issues. Building other kernel releases, switching modules off and on, and playing with the RCU grace-period times have so far not helped me (or others in that thread).
> > >
> > > Anyways, I don't really know where to start looking and the call stacks aren't very informative (to my eye) either. I'm hoping you might help me find the direction to look for the root of this problem.
> > 
> > I am assuming that you have filed a bug with the Debian folks, and before
> > doing that, searched for similar bug reports.
> > 
> > At first glance, this is because things were stuck here:
> > 
> > [ 967.115632] clear_rseq_cs.isra.0+0x4c/0x60
> > [ 967.116433] do_notify_resume+0xf8/0xeb0
> > [ 967.116960] el0_svc+0x3c/0x50
> > [ 967.117537] el0t_64_sync_handler+0x9c/0x120
> > [ 967.118323] el0t_64_sync+0x158/0x15c
> > 
> > So including these function names (clear_rseq_cs() and so on) in your
> > search for similar bug reports would be a good idea.
> > 
> > I am unfamiliar with that code.
> > 
> > So I added Boqun because he works with Linux on HyperV as part of his
> > day job and has a great deal of experience with RCU. He will likely
> > have quite a number of questions for you including exact versions,
> > Debian bug number, the results of your web search, and so on. He might
> > also know an ARM person to get involved in this.
> > 
> > Or maybe he knows the solution off the top of his head!
> > 

I haven't seen this issue before. It looks to me like the stall is caused by
clear_rseq_cs(), which is basically a put_user(), and I don't have an
immediate theory. Could you share the kernel repo and configuration you
used, so that I can see if I can reproduce this? (Note that I don't have
the same device as you, nor an ARM64 Windows system with the exact
Windows build you are using.)

Regards,
Boqun

> > Thanx, Paul
> > 
> > 
> >  
> 


* Re: Advice sought on RCU stalls on ARM64 WSL2
  2024-03-04 16:33       ` Boqun Feng
@ 2024-03-04 21:54         ` Max Boone
  2024-03-05  0:32           ` Joel Fernandes
  0 siblings, 1 reply; 9+ messages in thread
From: Max Boone @ 2024-03-04 21:54 UTC (permalink / raw)
  To: Boqun Feng, Paul E. McKenney; +Cc: rcu

On Mon Mar 4, 2024 at 4:33 PM UTC, Boqun Feng wrote:
> I haven't seen this issue before. It looks to me like the stall is caused by
> clear_rseq_cs(), which is basically a put_user(), and I don't have an
> immediate theory. Could you share the kernel repo and configuration you
> used, so that I can see if I can reproduce this? (Note that I don't have
> the same device as you, nor an ARM64 Windows system with the exact
> Windows build you are using.)
>
> Regards,
> Boqun

This happens on the default Windows Subsystem for Linux 2 setup, that is,
with the kernel built from the following sources:
- https://github.com/microsoft/WSL2-Linux-Kernel/tree/linux-msft-wsl-5.15.y

It also happens when I build the kernel myself from a more recent
release:
- https://github.com/maxboone/SQ2-Linux-Kernel-Builds

Microsoft should have a Development Kit (Volterra) with identical hardware 
to mine (and other Surface Pro X, Surface Pro 9 users) that run into the 
same issue with WSL2.

Moreover, the problem does not seem to occur on a regular Hyper-V VM 
(with a different / standard kernel though) at all, so reproduction 
might be difficult. The CPU that's shown by Hyper-V also differs from
the one visible in WSL2, maybe colleagues over at Redmond know what
CPU virtualization is used there?
- https://github.com/microsoft/WSL/issues/9454#issuecomment-1976411786

Thanks,
Max


* Re: Advice sought on RCU stalls on ARM64 WSL2
  2024-03-04 21:54         ` Max Boone
@ 2024-03-05  0:32           ` Joel Fernandes
  2024-03-05  5:57             ` Max Boone
  0 siblings, 1 reply; 9+ messages in thread
From: Joel Fernandes @ 2024-03-05  0:32 UTC (permalink / raw)
  To: Max Boone, Boqun Feng, Paul E. McKenney; +Cc: rcu

Hello, Max, Boqun and Paul,

On 3/4/2024 4:54 PM, Max Boone wrote:
> On Mon Mar 4, 2024 at 4:33 PM UTC, Boqun Feng wrote:
>> I haven't seen this issue before. It looks to me like the stall is caused by
>> clear_rseq_cs(), which is basically a put_user(), and I don't have an
>> immediate theory. Could you share the kernel repo and configuration you
>> used, so that I can see if I can reproduce this? (Note that I don't have
>> the same device as you, nor an ARM64 Windows system with the exact
>> Windows build you are using.)
> 
> This happens on the default Windows Subsystem for Linux 2 setup, that is,
> with the kernel built from the following sources:
> - https://github.com/microsoft/WSL2-Linux-Kernel/tree/linux-msft-wsl-5.15.y

FWIW, I use a Windows machine that has WSL2 (kernel version
5.15.133.1-microsoft-standard-WSL2) and I have never experienced any kind of
hang. Though, this is a desktop and not a laptop or battery powered device.
> 
> It also happens when I build the kernel myself from a more recent
> release:
> - https://github.com/maxboone/SQ2-Linux-Kernel-Builds
> 
> Microsoft should have a Development Kit (Volterra) with identical hardware 
> to mine (and other Surface Pro X, Surface Pro 9 users) that run into the 
> same issue with WSL2.

Right, so at least that's a data point: that it's Surface-specific (?). Have you
tried disabling power management to see if it still occurs? For example, disable
suspend, disable cpuidle, etc.

Have you tried to reproduce the issue with CONFIG_RSEQ=n and see if it happens?
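
A quick way to confirm what a given build actually has is to read the
option back from the kernel config. The sketch below is a hedged
example: option_state() and running_kernel_config() are hypothetical
helper names, and reading /proc/config.gz only works if the kernel was
built with CONFIG_IKCONFIG_PROC, which may not be the case here. The
same parser can be pointed at the .config in the build tree instead.

```python
import gzip
from pathlib import Path
from typing import Optional


def option_state(config_text: str, option: str) -> str:
    """Return the value of a Kconfig option ("y", "m", ...),
    "n" if it is explicitly not set, or "absent" if it is not mentioned."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {option} is not set":
            return "n"
        if line.startswith(f"{option}="):
            return line.split("=", 1)[1]
    return "absent"


def running_kernel_config(path: str = "/proc/config.gz") -> Optional[str]:
    """Return the running kernel's config text, or None if it is not exposed."""
    config = Path(path)
    if not config.exists():
        return None
    with gzip.open(config, "rt") as handle:
        return handle.read()


if __name__ == "__main__":
    text = running_kernel_config()
    if text is None:
        print("/proc/config.gz not available; check the build tree's .config")
    else:
        print("CONFIG_RSEQ:", option_state(text, "CONFIG_RSEQ"))
```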

Also, this GitHub thread looks awfully similar to the GitHub thread you pointed
to and has the same clear_rseq signature leading to the RCU stall. Over there it
is also a hang, but they say the CPU usage is at 100%:
https://github.com/microsoft/WSL/issues/8529

thanks,

 - Joel



* Re: Advice sought on RCU stalls on ARM64 WSL2
  2024-03-05  0:32           ` Joel Fernandes
@ 2024-03-05  5:57             ` Max Boone
  2024-03-05 14:32               ` Max Boone
  2024-03-05 14:50               ` Joel Fernandes
  0 siblings, 2 replies; 9+ messages in thread
From: Max Boone @ 2024-03-05  5:57 UTC (permalink / raw)
  To: Joel Fernandes, Boqun Feng, Paul E. McKenney; +Cc: rcu

On Tue Mar 5, 2024 at 12:32 AM UTC, Joel Fernandes wrote:
> FWIW, I use a Windows machine that has WSL2 (kernel version
> 5.15.133.1-microsoft-standard-WSL2) and I have never experienced any kind of
> hang. Though, this is a desktop and not a laptop or battery powered device.

Is that also an ARM64 machine? I have never seen this happen on an
x86_64 machine; there it runs like a charm. Out of curiosity, if you are
running an ARM64 desktop, may I ask which one? The Volterra Development
Kit is not available in the Netherlands.

> > 
> > It also happens when I build the kernel myself from a more recent
> > release:
> > - https://github.com/maxboone/SQ2-Linux-Kernel-Builds
> > 
> > Microsoft should have a Development Kit (Volterra) with identical hardware 
> > to mine (and other Surface Pro X, Surface Pro 9 users) that run into the 
> > same issue with WSL2.
>
> Right, so at least that's a data point: that it's Surface-specific (?). Have you
> tried disabling power management to see if it still occurs? For example, disable
> suspend, disable cpuidle, etc.

It also happens on non-Surface (but indeed mobile) devices, such as
Lenovo ThinkPads. However, the common denominator might be the Qualcomm
8cx chip (that Microsoft uses as SQ{1,2,3} -> 8cx Gen{1,2,3} with a 
beefier GPU).

Changes to power management settings in Windows don't seem to have any
effect, other than stalls taking longer to occur when the device never
sleeps. But the stalls also happen (often) when it doesn't sleep.

Power management inside WSL2 seems to be unavailable:

```
root@ProX2024:~# uname -r
6.7.7-WSL2-STABLE+
root@ProX2024:~# echo freeze > /sys/power/state
-bash: echo: write error: Function not implemented
root@ProX2024:~# ls /sys/devices/system/cpu/
cpu0  cpu2  cpu4  cpu6  cpufreq   kernel_max  offline  possible  present  vulnerabilities
cpu1  cpu3  cpu5  cpu7  isolated  modalias    online   power     uevent
```

However, it is available in a regular Hyper-V VM:

```
root@ubuntu0:~# uname -r
6.5.0-21-generic
root@ubuntu0:~# echo freeze > /sys/power/state
root@ubuntu0:~# ls /sys/devices/system/cpu
cpu0  cpu2  cpufreq  hotplug   kernel_max  offline  possible  present  uevent
cpu1  cpu3  cpuidle  isolated  modalias    online   power     smt      vulnerabilities
```
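
To make the difference between the two listings explicit, here is a
small sketch that diffs the sysfs entries shown above, ignoring the
per-CPU cpuN directories. The listings are copied from the two shell
sessions; the helper name non_cpu_entries() is just for illustration.

```python
import re

# Output of "ls /sys/devices/system/cpu" in WSL2 (from the session above).
WSL2_LISTING = """\
cpu0  cpu2  cpu4  cpu6  cpufreq   kernel_max  offline  possible  present  vulnerabilities
cpu1  cpu3  cpu5  cpu7  isolated  modalias    online   power     uevent
"""

# Output of the same command in the regular Hyper-V VM.
HYPERV_LISTING = """\
cpu0  cpu2  cpufreq  hotplug   kernel_max  offline  possible  present  uevent
cpu1  cpu3  cpuidle  isolated  modalias    online   power     smt      vulnerabilities
"""


def non_cpu_entries(listing: str) -> set[str]:
    """Return the entries of a listing, leaving out the cpuN directories."""
    return {name for name in listing.split() if not re.fullmatch(r"cpu\d+", name)}


missing_in_wsl2 = sorted(non_cpu_entries(HYPERV_LISTING) - non_cpu_entries(WSL2_LISTING))
missing_in_hyperv = sorted(non_cpu_entries(WSL2_LISTING) - non_cpu_entries(HYPERV_LISTING))

if __name__ == "__main__":
    print("only in Hyper-V:", missing_in_wsl2)
    print("only in WSL2:   ", missing_in_hyperv)
```

With these two listings, the entries present only in the Hyper-V VM are
cpuidle, hotplug and smt.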

> Have you tried to reproduce the issue with CONFIG_RSEQ=n and see if it happens?

Will build a new kernel today with that flag, and report back.

> Also, this GitHub thread looks awfully similar to the GitHub thread you pointed
> to and has the same clear_rseq signature leading to the RCU stall. Over there it
> is also a hang, but they say the CPU usage is at 100%:
> https://github.com/microsoft/WSL/issues/8529

Indeed, when the RCU stalls occur, the usage of the stalling core
ramps up to 100%. I had thought that was an effect of the stall, but I
will check whether the 100% usage is caused by the process that is stalling.

Cheers,
Max.


* Re: Advice sought on RCU stalls on ARM64 WSL2
  2024-03-05  5:57             ` Max Boone
@ 2024-03-05 14:32               ` Max Boone
  2024-03-05 15:00                 ` Joel Fernandes
  2024-03-05 14:50               ` Joel Fernandes
  1 sibling, 1 reply; 9+ messages in thread
From: Max Boone @ 2024-03-05 14:32 UTC (permalink / raw)
  To: Max Boone, Joel Fernandes, Boqun Feng, Paul E. McKenney; +Cc: rcu

On Tue Mar 5, 2024 at 5:57 AM UTC, Max Boone wrote:
> On Tue Mar 5, 2024 at 12:32 AM UTC, Joel Fernandes wrote:
> > Have you tried to reproduce the issue with CONFIG_RSEQ=n and see if it happens?
>
> Will build a new kernel today with that flag, and report back.

With CONFIG_RSEQ=n the stalls happen a lot less often and the system is
way more workable. On one occasion when it did freeze up it recovered on
its own, and I was able to get the full kernel messages for that stall:

```
[  675.812339] rcu: INFO: rcu_sched self-detected stall on CPU
[  675.814587] rcu:     3-....: (14893 ticks this GP) idle=762c/1/0x4000000000000000 softirq=6920/6920 fqs=6610
[  675.815606] rcu:     (t=15001 jiffies g=50497 q=1304 ncpus=8)
[  675.816520] CPU: 3 PID: 232 Comm: snapfuse Not tainted 6.7.7-WSL2-STABLE+ #2
[  675.816550] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  675.816553] pc : __arch_copy_to_user+0x1a0/0x240
[  675.817689] lr : _copy_to_iter+0xf0/0x560
[  675.818069] sp : ffff800082ceba80
[  675.818070] x29: ffff800082cebac0 x28: 0000000001b2c000 x27: 0000000000000005
[  675.818074] x26: 0000000000000000 x25: ffff00004c491000 x24: 0000000000000000
[  675.818076] x23: 0000000000001000 x22: 0000040000000000 x21: ffff800082cebd30
[  675.818079] x20: ffff800082cebd30 x19: 0000000000001000 x18: 0000000000000000
[  675.818081] x17: 0000000000000000 x16: 0000000000000000 x15: ffff00004c491000
[  675.818083] x14: 9887db4ae914c054 x13: 6bcd444ce14effe5 x12: 0b22b481c6001041
[  675.818086] x11: 7513c0250d7df247 x10: b85affa4063b12c7 x9 : 368beb85bc648557
[  675.818088] x8 : 217c88df9795370e x7 : a16d77942052b4ab x6 : 0000aaf844516fff
[  675.818090] x5 : 0000aaf844517e2f x4 : 0000000000000000 x3 : 0000000000003daf
[  675.818092] x2 : 0000000000000dc0 x1 : ffff00004c491210 x0 : 0000aaf844516e2f
[  675.818096] Call trace:
[  675.818143]  __arch_copy_to_user+0x1a0/0x240
[  675.818147]  copy_page_to_iter+0xbc/0x140
[  675.818150]  filemap_read+0x1b0/0x398
[  675.818427]  generic_file_read_iter+0x48/0x168
[  675.818429]  ext4_file_read_iter+0x58/0x288
[  675.818681]  vfs_read+0x1e8/0x280
[  675.818804]  ksys_pread64+0x90/0xf0
[  675.818806]  __arm64_sys_pread64+0x24/0x48
[  675.818807]  invoke_syscall.constprop.0+0x54/0x128
[  675.818912]  do_el0_svc+0x44/0xf0
[  675.818914]  el0_svc+0x24/0xb0
[  675.819041]  el0t_64_sync_handler+0x138/0x148
[  675.819043]  el0t_64_sync+0x14c/0x150
[  681.501178] block sda: the capability attribute has been deprecated.
[  741.700330] rcu: INFO: rcu_sched self-detected stall on CPU
[  741.701707] rcu:     4-....: (14940 ticks this GP) idle=8074/1/0x4000000000000000 softirq=13021/13037 fqs=6400
[  741.703152] rcu:     (t=15001 jiffies g=50713 q=5093 ncpus=8)
[  741.704017] CPU: 4 PID: 194 Comm: systemd-journal Not tainted 6.7.7-WSL2-STABLE+ #2
[  741.704047] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  741.704050] pc : __arch_copy_to_user+0x190/0x240
[  741.704424] lr : _copy_to_iter+0xf0/0x560
[  741.704565] sp : ffff800082cfb870
[  741.704566] x29: ffff800082cfb8b0 x28: ffff00000fe96b18 x27: 0000000000000085
[  741.704569] x26: 0000000000000000 x25: ffff0000061e1c00 x24: 0000000000000000
[  741.704571] x23: 0000000000000085 x22: ffff000117d2a600 x21: ffff800082cfbd90
[  741.704574] x20: ffff0000061e1c00 x19: 0000000000000085 x18: 0000000000000000
[  741.704608] x17: 0000000000000000 x16: 0000000000000000 x15: ffff0000061e1c00
[  741.704610] x14: 62616c6961766120 x13: 7365746164707520 x12: 6f6e207361682070
[  741.704612] x11: 616e73203a687365 x10: 7266657220746f6e x9 : 6e6163203a313937
[  741.704614] x8 : 3a6f672e73726570 x7 : 6c656865726f7473 x6 : 0000ab58c96fd6f0
[  741.704617] x5 : 0000ab58c96fd775 x4 : 0000000000000000 x3 : 0000000000000000
[  741.704619] x2 : 0000000000000005 x1 : ffff0000061e1c40 x0 : 0000ab58c96fd6f0
[  741.704621] Call trace:
[  741.704647]  __arch_copy_to_user+0x190/0x240
[  741.704651]  simple_copy_to_iter+0x48/0x98
[  741.704939]  __skb_datagram_iter+0x7c/0x280
[  741.704941]  skb_copy_datagram_iter+0x48/0xc8
[  741.704943]  unix_stream_read_actor+0x30/0x68
[  741.705137]  unix_stream_read_generic+0x304/0xb70
[  741.705139]  unix_stream_recvmsg+0xc0/0xd0
[  741.705140]  sock_recvmsg+0x88/0x108
[  741.705170]  ____sys_recvmsg+0x78/0x198
[  741.705171]  ___sys_recvmsg+0x80/0xf0
[  741.705173]  __sys_recvmsg+0x5c/0xd0
[  741.705175]  __arm64_sys_recvmsg+0x28/0x50
[  741.705177]  invoke_syscall.constprop.0+0x54/0x128
[  741.705316]  do_el0_svc+0xcc/0xf0
[  741.705317]  el0_svc+0x24/0xb0
[  741.705369]  el0t_64_sync_handler+0x138/0x148
[  741.705371]  el0t_64_sync+0x14c/0x150
[  743.232431] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 4-.... } 15347 jiffies s: 721 root: 0x10/.
[  743.234297] rcu: blocking rcu_node structures (internal RCU debug):
[  743.235477] Sending NMI from CPU 1 to CPUs 4:
[  743.235491] NMI backtrace for cpu 4
[  743.235531] CPU: 4 PID: 194 Comm: systemd-journal Not tainted 6.7.7-WSL2-STABLE+ #2
[  743.235535] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  743.235537] pc : __arch_copy_to_user+0x190/0x240
[  743.235598] lr : _copy_to_iter+0xf0/0x560
[  743.235603] sp : ffff800082cfb870
[  743.235604] x29: ffff800082cfb8b0 x28: ffff00000fe96b18 x27: 0000000000000085
[  743.235607] x26: 0000000000000000 x25: ffff0000061e1c00 x24: 0000000000000000
[  743.235610] x23: 0000000000000085 x22: ffff000117d2a600 x21: ffff800082cfbd90
[  743.235612] x20: ffff0000061e1c00 x19: 0000000000000085 x18: 0000000000000000
[  743.235614] x17: 0000000000000000 x16: 0000000000000000 x15: ffff0000061e1c00
[  743.235617] x14: 62616c6961766120 x13: 7365746164707520 x12: 6f6e207361682070
[  743.235619] x11: 616e73203a687365 x10: 7266657220746f6e x9 : 6e6163203a313937
[  743.235621] x8 : 3a6f672e73726570 x7 : 6c656865726f7473 x6 : 0000ab58c96fd6f0
[  743.235623] x5 : 0000ab58c96fd775 x4 : 0000000000000000 x3 : 0000000000000000
[  743.235626] x2 : 0000000000000005 x1 : ffff0000061e1c40 x0 : 0000ab58c96fd6f0
[  743.235628] Call trace:
[  743.235630]  __arch_copy_to_user+0x190/0x240
[  743.235632]  simple_copy_to_iter+0x48/0x98
[  743.235636]  __skb_datagram_iter+0x7c/0x280
[  743.235639]  skb_copy_datagram_iter+0x48/0xc8
[  743.235641]  unix_stream_read_actor+0x30/0x68
[  743.235644]  unix_stream_read_generic+0x304/0xb70
[  743.235646]  unix_stream_recvmsg+0xc0/0xd0
[  743.235647]  sock_recvmsg+0x88/0x108
[  743.235650]  ____sys_recvmsg+0x78/0x198
[  743.235651]  ___sys_recvmsg+0x80/0xf0
[  743.235653]  __sys_recvmsg+0x5c/0xd0
[  743.235655]  __arm64_sys_recvmsg+0x28/0x50
[  743.235657]  invoke_syscall.constprop.0+0x54/0x128
[  743.235661]  do_el0_svc+0xcc/0xf0
[  743.235663]  el0_svc+0x24/0xb0
[  743.235667]  el0t_64_sync_handler+0x138/0x148
[  743.235668]  el0t_64_sync+0x14c/0x150
```
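
One more detail can be read out of the systemd-journal register dump
above. As a sketch, and assuming that x7 to x14 are the registers the
arm64 copy routine uses to stage the data it is moving (an assumption
about the routine, not something established in this thread), decoding
them as little-endian ASCII gives readable text:

```python
# Register values copied from the CPU 4 (systemd-journal) dump above.
REGISTERS = {
    "x7": 0x6C656865726F7473,
    "x8": 0x3A6F672E73726570,
    "x9": 0x6E6163203A313937,
    "x10": 0x7266657220746F6E,
    "x11": 0x616E73203A687365,
    "x12": 0x6F6E207361682070,
    "x13": 0x7365746164707520,
    "x14": 0x62616C6961766120,
}

# Assumed staging order: x7 first, x14 last.
ORDER = [f"x{index}" for index in range(7, 15)]


def decode_registers(registers: dict[str, int], order: list[str]) -> str:
    """Concatenate 64-bit register values as little-endian bytes and decode them."""
    raw = b"".join(registers[name].to_bytes(8, "little") for name in order)
    return raw.decode("ascii", errors="replace")


if __name__ == "__main__":
    print(repr(decode_registers(REGISTERS, ORDER)))
```

The decoded bytes look like a fragment of a log message, which would fit
a copy of journal data to userspace being in flight when the stall was
reported. The registers in the snapfuse dump do not decode to text.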

Another time it didn't recover, and I was only able to get whatever was
printed to console:

```
[ 1559.425979] rcu:     7-....: (14977 ticks this GP) idle=d4ec/1/0x4000000000000000 softirq=18636/18636 fqs=5263
[ 1559.431367] rcu:     (t=15002 jiffies g=67965 q=36939 ncpus=8)
[ 1559.432083] rcu: rcu_sched kthread starved for 2866 jiffies! g67965 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[ 1559.433645] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 1559.434604] rcu: RCU grace-period kthread stack dump:
[ 1559.435549] rcu: Stack dump where RCU GP kthread last ran:
[ 1739.451891] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1739.452511] rcu:     7-....: (59616 ticks this GP) idle=d4ec/1/0x4000000000000000 softirq=18636/18636 fqs=5263
[ 1739.453498] rcu:     (t=60008 jiffies g=67965 q=36939 ncpus=8)
[ 1739.454053] rcu: rcu_sched kthread starved for 47871 jiffies! g67965 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[ 1739.455135] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 1739.456110] rcu: RCU grace-period kthread stack dump:
[ 1739.456687] rcu: Stack dump where RCU GP kthread last ran:
[ 1919.467822] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1919.468776] rcu:     7-....: (104010 ticks this GP) idle=d4ec/1/0x4000000000000000 softirq=18636/18636 fqs=5263
[ 1919.470405] rcu:     (t=105012 jiffies g=67965 q=36941 ncpus=8)
[ 1919.472599] rcu: rcu_sched kthread starved for 92875 jiffies! g67965 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[ 1919.474013] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 1919.474977] rcu: RCU grace-period kthread stack dump:
[ 1919.475641] rcu: Stack dump where RCU GP kthread last ran:
```

Unfortunately, the stack traces were not printed there.
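
Even without the traces, the repeated reports carry some information.
The sketch below estimates the apparent tick rate from the "t=... jiffies"
values and the printk timestamps of the three reports above, assuming
both clocks kept advancing consistently while the CPU was stuck. The
helper names are illustrative only.

```python
import re

# The "(t=... jiffies ...)" lines from the three console reports above.
REPORTS = """\
[ 1559.431367] rcu:     (t=15002 jiffies g=67965 q=36939 ncpus=8)
[ 1739.453498] rcu:     (t=60008 jiffies g=67965 q=36939 ncpus=8)
[ 1919.470405] rcu:     (t=105012 jiffies g=67965 q=36941 ncpus=8)
"""

LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\] rcu:\s+\(t=(\d+) jiffies", re.MULTILINE)


def parse_reports(text: str) -> list[tuple[float, int]]:
    """Return (timestamp in seconds, stall age in jiffies) for each report."""
    return [(float(ts), int(jiffies)) for ts, jiffies in LINE_RE.findall(text)]


def estimate_hz(samples: list[tuple[float, int]]) -> float:
    """Estimate jiffies per second from the first and last report."""
    (first_ts, first_jiffies), (last_ts, last_jiffies) = samples[0], samples[-1]
    return (last_jiffies - first_jiffies) / (last_ts - first_ts)


samples = parse_reports(REPORTS)
hz = estimate_hz(samples)

if __name__ == "__main__":
    print(f"apparent HZ: {hz:.1f}")
    for timestamp, jiffies in samples:
        print(f"at {timestamp:.0f}s the grace period was {jiffies / hz:.0f}s old")
```

Under that assumption the tick rate comes out at about 250 per second,
so the first report (t=15002 jiffies) corresponds to a grace period
that had already been stuck for roughly 60 seconds, and the same grace
period (g=67965) was still stuck 360 seconds later.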



* Re: Advice sought on RCU stalls on ARM64 WSL2
  2024-03-05  5:57             ` Max Boone
  2024-03-05 14:32               ` Max Boone
@ 2024-03-05 14:50               ` Joel Fernandes
  1 sibling, 0 replies; 9+ messages in thread
From: Joel Fernandes @ 2024-03-05 14:50 UTC (permalink / raw)
  To: Max Boone, Boqun Feng, Paul E. McKenney; +Cc: rcu

On 3/5/2024 12:57 AM, Max Boone wrote:
> On Tue Mar 5, 2024 at 12:32 AM UTC, Joel Fernandes wrote:
>> FWIW, I use a Windows machine that has WSL2 (kernel version
>> 5.15.133.1-microsoft-standard-WSL2) and I have never experienced any kind of
>> hang. Though, this is a desktop and not a laptop or battery powered device.
> 
> Is that also an ARM64 machine? I have never seen this happen on an
> x86_64 machine; there it runs like a charm. Out of curiosity, if you are
> running an ARM64 desktop, may I ask which one? The Volterra Development
> Kit is not available in the Netherlands.

It is x86 for me.

>> Also, this GitHub thread looks awfully similar to the GitHub thread you pointed
>> to and has the same clear_rseq signature leading to the RCU stall. Over there it
>> is also a hang, but they say the CPU usage is at 100%:
>> https://github.com/microsoft/WSL/issues/8529
> 
> Indeed, when the RCU stalls occur, the usage of the stalling core
> ramps up to 100%. I had thought that was an effect of the stall, but I
> will check whether the 100% usage is caused by the process that is stalling.

Right. And RCU needs CPU time available for its smooth operation (for RCU kthreads,
interrupt handling, softirq, etc.). So a CPU pegged at 100% for a long period of
time is not ideal, and in your case it likely causes the stall.

 - Joel



* Re: Advice sought on RCU stalls on ARM64 WSL2
  2024-03-05 14:32               ` Max Boone
@ 2024-03-05 15:00                 ` Joel Fernandes
  0 siblings, 0 replies; 9+ messages in thread
From: Joel Fernandes @ 2024-03-05 15:00 UTC (permalink / raw)
  To: Max Boone, Boqun Feng, Paul E. McKenney; +Cc: rcu



On 3/5/2024 9:32 AM, Max Boone wrote:
> On Tue Mar 5, 2024 at 5:57 AM UTC, Max Boone wrote:
>> On Tue Mar 5, 2024 at 12:32 AM UTC, Joel Fernandes wrote:
>>> Have you tried to reproduce the issue with CONFIG_RSEQ=n and see if it happens?
>>
>> Will build a new kernel today with that flag, and report back.
> 
> With CONFIG_RSEQ=n the stalls happen a lot less often and the system is
> way more workable. On one occasion when it did freeze up it recovered on
> its own, and I was able to get the full kernel messages for that stall:
> 
> ```
> [  675.812339] rcu: INFO: rcu_sched self-detected stall on CPU
> [  675.814587] rcu:     3-....: (14893 ticks this GP) idle=762c/1/0x4000000000000000 softirq=6920/6920 fqs=6610
> [  675.815606] rcu:     (t=15001 jiffies g=50497 q=1304 ncpus=8)
> [  675.816520] CPU: 3 PID: 232 Comm: snapfuse Not tainted 6.7.7-WSL2-STABLE+ #2
> [  675.816550] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [  675.816553] pc : __arch_copy_to_user+0x1a0/0x240
> [  675.817689] lr : _copy_to_iter+0xf0/0x560
> [  675.818069] sp : ffff800082ceba80
> [  675.818070] x29: ffff800082cebac0 x28: 0000000001b2c000 x27: 0000000000000005
> [  675.818074] x26: 0000000000000000 x25: ffff00004c491000 x24: 0000000000000000
> [  675.818076] x23: 0000000000001000 x22: 0000040000000000 x21: ffff800082cebd30
> [  675.818079] x20: ffff800082cebd30 x19: 0000000000001000 x18: 0000000000000000
> [  675.818081] x17: 0000000000000000 x16: 0000000000000000 x15: ffff00004c491000
> [  675.818083] x14: 9887db4ae914c054 x13: 6bcd444ce14effe5 x12: 0b22b481c6001041
> [  675.818086] x11: 7513c0250d7df247 x10: b85affa4063b12c7 x9 : 368beb85bc648557
> [  675.818088] x8 : 217c88df9795370e x7 : a16d77942052b4ab x6 : 0000aaf844516fff
> [  675.818090] x5 : 0000aaf844517e2f x4 : 0000000000000000 x3 : 0000000000003daf
> [  675.818092] x2 : 0000000000000dc0 x1 : ffff00004c491210 x0 : 0000aaf844516e2f
> [  675.818096] Call trace:
> [  675.818143]  __arch_copy_to_user+0x1a0/0x240
> [  675.818147]  copy_page_to_iter+0xbc/0x140
> [  675.818150]  filemap_read+0x1b0/0x398

This seems similar to the RSEQ one in the sense that RSEQ tries to poke into
userspace memory as well. Maybe it is related to fault handling? The RSEQ case has
code comments saying faults are possible. But I'm speculating. It might be worth
providing all these details (both the RSEQ stack and the copy_to_user one) to the
arm64 folks to see if they know of a hang in such a case. Maybe some patch
backports are missing.

> [  675.818427]  generic_file_read_iter+0x48/0x168
> [  675.818429]  ext4_file_read_iter+0x58/0x288
> [  675.818681]  vfs_read+0x1e8/0x280
> [  675.818804]  ksys_pread64+0x90/0xf0
> [  675.818806]  __arm64_sys_pread64+0x24/0x48
> [  675.818807]  invoke_syscall.constprop.0+0x54/0x128
> [  675.818912]  do_el0_svc+0x44/0xf0
> [  675.818914]  el0_svc+0x24/0xb0
> [  675.819041]  el0t_64_sync_handler+0x138/0x148
> [  675.819043]  el0t_64_sync+0x14c/0x150
> [  681.501178] block sda: the capability attribute has been deprecated.
> [  741.700330] rcu: INFO: rcu_sched self-detected stall on CPU
> [  741.701707] rcu:     4-....: (14940 ticks this GP) idle=8074/1/0x4000000000000000 softirq=13021/13037 fqs=6400
> [  741.703152] rcu:     (t=15001 jiffies g=50713 q=5093 ncpus=8)
> [  741.704017] CPU: 4 PID: 194 Comm: systemd-journal Not tainted 6.7.7-WSL2-STABLE+ #2
> [  741.704047] pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [  741.704050] pc : __arch_copy_to_user+0x190/0x240
> [  741.704424] lr : _copy_to_iter+0xf0/0x560
> [  741.704565] sp : ffff800082cfb870
> [  741.704566] x29: ffff800082cfb8b0 x28: ffff00000fe96b18 x27: 0000000000000085
> [  741.704569] x26: 0000000000000000 x25: ffff0000061e1c00 x24: 0000000000000000
> [  741.704571] x23: 0000000000000085 x22: ffff000117d2a600 x21: ffff800082cfbd90
> [  741.704574] x20: ffff0000061e1c00 x19: 0000000000000085 x18: 0000000000000000
> [  741.704608] x17: 0000000000000000 x16: 0000000000000000 x15: ffff0000061e1c00
> [  741.704610] x14: 62616c6961766120 x13: 7365746164707520 x12: 6f6e207361682070
> [  741.704612] x11: 616e73203a687365 x10: 7266657220746f6e x9 : 6e6163203a313937
> [  741.704614] x8 : 3a6f672e73726570 x7 : 6c656865726f7473 x6 : 0000ab58c96fd6f0
> [  741.704617] x5 : 0000ab58c96fd775 x4 : 0000000000000000 x3 : 0000000000000000
> [  741.704619] x2 : 0000000000000005 x1 : ffff0000061e1c40 x0 : 0000ab58c96fd6f0
> [  741.704621] Call trace:
> [  741.704647]  __arch_copy_to_user+0x190/0x240
> [  741.704651]  simple_copy_to_iter+0x48/0x98
> [  741.704939]  __skb_datagram_iter+0x7c/0x280
> [  741.704941]  skb_copy_datagram_iter+0x48/0xc8

And this is another user of copy_to_user(), so it's probably that.

> [ 1559.432083] rcu: rcu_sched kthread starved for 2866 jiffies! g67965 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
> [ 1559.433645] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.

This is an example of what I was saying in the other thread: RCU kthreads
need to get CPU time, otherwise RCU stalls can happen.

Thanks.
