On Mon, Nov 23, 2020 at 07:32PM +0000, Mark Rutland wrote:
> On Fri, Nov 20, 2020 at 03:03:32PM +0100, Marco Elver wrote:
> > On Fri, Nov 20, 2020 at 10:30AM +0000, Mark Rutland wrote:
> > > On Thu, Nov 19, 2020 at 10:53:53PM +0000, Will Deacon wrote:
> > > > FWIW, arm64 is known broken wrt lockdep and irq tracing atm. Mark has been
> > > > looking at that and I think he is close to having something workable.
> > > >
> > > > Mark -- is there anything Marco and Paul can try out?
> > >
> > > I initially traced some issues back to commit:
> > >
> > > 044d0d6de9f50192 ("lockdep: Only trace IRQ edges")
> > >
> > > ... and that change of semantic could cause us to miss edges in some
> > > cases, but IIUC mostly where we haven't done the right thing in
> > > exception entry/return.
> > >
> > > I don't think my patches address this case yet, but my WIP (currently
> > > just fixing user<->kernel transitions) is at:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/irq-fixes
> > >
> > > I'm looking into the kernel<->kernel transitions now, and I know that we
> > > mess up RCU management for a small window around arch_cpu_idle, but it's
> > > not immediately clear to me if either of those cases could cause this
> > > report.
> >
> > Thank you -- I tried your irq-fixes, however that didn't seem to fix the
> > problem (still get warnings and then a panic). :-/
>
> I've just updated that branch with a new version which I hope covers
> kernel<->kernel transitions too. If you get a chance, would you mind
> giving that a spin?
>
> The HEAD commit should be:
>
> a51334f033f8ee88 ("HACK: check IRQ tracing has RCU watching")

Thank you! Your series appears to work and fixes the stalls and
deadlocks (3 trials)! I noticed there are a bunch of warnings in the log
that might be relevant (see attached).

Note, I also reverted

  sched/core: Allow try_invoke_on_locked_down_task() with irqs disabled

and that still works.

Thanks,
-- Marco

> Otherwise, I intend to clean that up and post it tomorrow (without the
> additional debug hacks). I've thrown my local Syzkaller instance at it
> in the meantime (and if I get the chance tomorrow I'll try to get
> rcutorture set up), and the only report I'm seeing so far looks genuine:
>
> | BUG: sleeping function called from invalid context in sta_info_move_state
>
> ... as that was reported on x86 too, per:
>
> https://syzkaller.appspot.com/bug?id=6c7899acf008be2ddcddb46a2567c2153193632a
>
> Thanks,
> Mark.