Re: Advice sought on RCU stalls on ARM64 WSL2

From: Paul E. McKenney @ 2024-03-02 19:43 UTC
  To: Max Boone; +Cc: boqun.feng, rcu

[ Adding Boqun and the rcu list on CC. ]

On Sat, Mar 02, 2024 at 07:59:08PM +0100, Max Boone wrote:
> 
> Dear Dr. McKenney,
> 
> For a couple of years now I've been the sometimes-frustrated owner of a Microsoft Surface Pro X ARM64 device. It has been getting progressively better as more vendors target their builds at ARM64, but ever since the device was introduced there have been issues with the Windows Subsystem for Linux (no more than an opinionated Hyper-V VM with extensive tooling) locking up and hanging.
> 
> When this happens, traces like the following are dumped in the kernel messages:
> https://github.com/microsoft/WSL/issues/9454#issuecomment-1942222109
> 
> In your talk "Decoding Those Inscrutable RCU CPU Stall Warnings" you mentioned that one can feel free to reach out when bumping into such issues. Building other kernel releases, switching modules off and on, and playing with the RCU grace-period times have so far not worked for me (or for others in that thread).
> 
> Anyway, I don't really know where to start looking, and the call stacks aren't very informative (to my eye) either. I'm hoping you might help me find a direction in which to look for the root of this problem.

I am assuming that you have filed a bug with the Debian folks, and before
doing that, searched for similar bug reports.

At first glance, this is because things were stuck here:

[  967.115632]  clear_rseq_cs.isra.0+0x4c/0x60
[  967.116433]  do_notify_resume+0xf8/0xeb0
[  967.116960]  el0_svc+0x3c/0x50
[  967.117537]  el0t_64_sync_handler+0x9c/0x120
[  967.118323]  el0t_64_sync+0x158/0x15c

So including these function names (clear_rseq_cs() and so on) in your
search for similar bug reports would be a good idea.
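
A minimal sketch of pulling those function names out of a trace like the
one above, so they can be pasted into a search. It assumes the usual
"symbol+0xoffset/0xsize" frame format; the helper name frame_symbols()
and the FRAME_RE pattern are hypothetical, not part of any kernel tooling.

```python
import re

# Matches a kernel stack frame such as "clear_rseq_cs.isra.0+0x4c/0x60".
# Assumption: frames follow the "symbol+0xoffset/0xsize" layout.
FRAME_RE = re.compile(r"([A-Za-z_][\w.]*)\+0x[0-9a-fA-F]+/0x[0-9a-fA-F]+")


def frame_symbols(trace):
    """Return the function names in a trace, in order, without duplicates."""
    names = []
    for match in FRAME_RE.finditer(trace):
        # Drop compiler-generated suffixes such as ".isra.0".
        name = match.group(1).split(".")[0]
        if name not in names:
            names.append(name)
    return names


TRACE = """\
[  967.115632]  clear_rseq_cs.isra.0+0x4c/0x60
[  967.116433]  do_notify_resume+0xf8/0xeb0
[  967.116960]  el0_svc+0x3c/0x50
[  967.117537]  el0t_64_sync_handler+0x9c/0x120
[  967.118323]  el0t_64_sync+0x158/0x15c
"""

if __name__ == "__main__":
    # Print the names space-separated, ready to use as search terms.
    print(" ".join(frame_symbols(TRACE)))
```

The frames nearest the top of the trace (here clear_rseq_cs and
do_notify_resume) are usually the most specific search terms; the
el0_* entry points appear in almost any ARM64 syscall path.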

I am unfamiliar with that code.

So I added Boqun because he works with Linux on Hyper-V as part of his
day job and has a great deal of experience with RCU.  He will likely
have quite a number of questions for you, including exact versions,
Debian bug number, the results of your web search, and so on.  He might
also know an ARM person to get involved in this.

Or maybe he knows the solution off the top of his head!

							Thanx, Paul
