Hi Steve,

>> After some days of training:
>> https://training.linuxfoundation.org/training/linux-kernel-debugging-and-security/
>> I was able to get much closer to the problem :-)
>
> Well, knowing what to look for was not going to be easy.
>
> And I'm sure you were shocked to see what I posted as a fix ;-)

Not really, such problems typically have just one-line fixes :-)

> Assuming this does fix your issue, I sent out a real patch with the
> explanation of what happened in the change log, so that you can see why
> that change was your issue.

Yes, it does the trick, thanks very much!

>> In order to reproduce it and get reliable kexec crash dumps,
>> I needed to give the VM at least 3 cores.
>
> Yes, it did require having this run on multiple CPUs to have a race
> condition trigger, and with only two cores it would have been hard to
> hit. I ran it on 8 cores and it triggered rather easily.

I mean I was able to trigger the problem with 2 cores, but then the
machine was completely stuck, without triggering the kexec reboot to
get the crash dump. I don't understand why, as the additional CPU is
not really involved at all.

Any idea why kexec could be so unreliable? (Even with 3 or 6 CPUs it
sometimes just didn't trigger.)

>> While running './io_uring-cp-forever link-cp.c file' (from:
>> https://github.com/metze-samba/liburing/commits/io_uring-cp-forever )
>> in one window, the following simple sequence triggered the problem in
>> most cases:
>>
>> echo 1 > /sys/kernel/tracing/events/sched/sched_waking/enable
>> echo 1 > /sys/kernel/tracing/set_event_pid
>
> All it took was something busy that did a lot of wakeups while setting
> set_event_pid to be able to hit the race easily. As I stated, I
> triggered it by running hackbench instead of the io-uring code. In
> fact, this bug had nothing to do with io-uring; you were only
> triggering it because you were putting enough load on the system to
> make the race happen often.

Yes, it was just the example that made it most reliable for me to
trigger.

>> pid_list = rcu_dereference_raw(tr->filtered_pids);
>
> That "tr" comes from the trace_event_file that is passed in by the
> "data" field of the callback. Hence, this callback got the data field
> of the event_filter_pid_sched_wakeup_probe_pre() callback that is
> called before all events when the set_event_pid file is set. That
> means the "tr" being dereferenced was not the "tr" you were looking
> for.

Yes, that's what I assumed, but I didn't find the reason for it.

>> It seems it got inlined within trace_event_buffer_reserve()
>>
>> The strangest thing I found so far is this:
>>
>> crash> sym global_trace
>> ffffffffbcdb7e80 (d) global_trace
>> crash> list ftrace_trace_arrays
>> ffffffffbcdb7e70
>> ffffffffbcdb7e80
>>
>> trace_array has size 7672, but ffffffffbcdb7e70 is only 16 bytes away
>> from ffffffffbcdb7e80.
>
> ftrace_trace_arrays is a list_head, and I doubt you created any other
> instances, thus the list contains only one instance, and that is
> global_trace. Hence, it points to global_trace and itself. It just so
> happens that a list_head is 16 bytes.
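To make sure I got that picture right, here's my own little sketch; only
the generic struct list_head layout from include/linux/list.h is real,
the addresses and comments are just my annotation:

struct list_head {
	struct list_head *next, *prev;	/* 2 x 8 bytes = 16 bytes */
};

/*
 * With global_trace as the only registered trace_array (its embedded
 * list_head being the first member):
 *
 *   ftrace_trace_arrays.next -> &global_trace.list   (ffffffffbcdb7e80)
 *   global_trace.list.next   -> &ftrace_trace_arrays (ffffffffbcdb7e70)
 *
 * so a plain list walk that starts at &ftrace_trace_arrays also prints
 * the head itself, 16 bytes before global_trace.
 */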
Ah, that explains a lot:

crash> struct list_head -o ftrace_trace_arrays
struct list_head {
   [ffffffffbcdb7e70] struct list_head *next;
   [ffffffffbcdb7e78] struct list_head *prev;
}
SIZE: 16
crash> list ftrace_trace_arrays
ffffffffbcdb7e70
ffffffffbcdb7e80

I guess this is what I want:

crash> list -H ftrace_trace_arrays
ffffffffbcdb7e80

>> Also this:
>>
>> crash> struct trace_array.events -o global_trace
>> struct trace_array {
>>   [ffffffffbcdb9be0] struct list_head events;
>> }
>> crash> list -s trace_event_file.tr -o trace_event_file.list ffffffffbcdb9be0
>> ffffffffbcdb9be0
>>   tr = 0xffffffffbcdb7b20
>> ffff8ccb45cdfb00
>>   tr = 0xffffffffbcdb7e80
>> ffff8ccb45cdf580
>>   tr = 0xffffffffbcdb7e80
>> ffff8ccb45cdfe18
>>   tr = 0xffffffffbcdb7e80
>> ...
>>   tr = 0xffffffffbcdb7e80
>>
>> The first one, 0xffffffffbcdb7b20, is only 864 bytes away from
>> 0xffffffffbcdb7e80.
>
> I'm thinking it is confused by hitting the ftrace_trace_arrays
> list_head itself.

Yes, I needed the -H here too:

list -s trace_event_file.tr -o trace_event_file.list -H ffffffffbcdb9be0

>> It would be really great if you (or anyone else on the lists) could
>> have a closer look in order to get the problem fixed :-)
>
> Once I triggered it and started looking at what was happening, it
> didn't take me too long to figure out where the culprit was.

Great!

>> I've learned a lot this week and tried hard to find the problem
>> myself, but I have to move back to other work for now.
>
> I'm glad you learned a lot. There's a lot of complex code in there, and
> it's getting more complex, as you can see with the static_calls.
>
> This is because tracing tries hard to avoid the heisenbug effect. You
> know, you see a bug, turn on tracing, and then that bug goes away!
>
> Thus it pulls out all the tricks it can to be as unintrusive on the
> system as it can be. And that causes things to get complex really
> quickly.

I'll have a look; I may need something similar for my smbdirect socket
driver in the future.

> Cheers, and thanks for keeping up on this bug!

Thanks for fixing the bug!

Now I can finally use:

  trace-cmd record -e all -P $(pidof io_uring-cp-forever)

But that doesn't include the iou-wrk-* threads, and the '-c' option
seems to only work with forking.

Is there a way to specify "trace *all* threads of the given pid"?

(Note the threads are coming and going, so it's not possible to specify
-P more than once.)

Thanks!
metze
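P.S. Regarding the static_calls: as a note to myself (and since I might
want the same trick for the smbdirect driver), this is roughly how I
read the API in include/linux/static_call.h; the names below are just
my own example and untested:

#include <linux/static_call.h>

static int my_default_handler(int arg)
{
	return arg;
}

/* Defines the key plus a trampoline, initially routed to the default. */
DEFINE_STATIC_CALL(my_hook, my_default_handler);

static int do_work(int arg)
{
	/*
	 * Where the architecture supports it, this becomes a patched
	 * direct call instead of an indirect call through a function
	 * pointer.
	 */
	return static_call(my_hook)(arg);
}

static void set_handler(int (*fn)(int))
{
	/* Re-patches every call site of my_hook to use fn. */
	static_call_update(my_hook, fn);
}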