Hi Steve,

>> After some days of training:
>> https://training.linuxfoundation.org/training/linux-kernel-debugging-and-security/
>> I was able to get much closer to the problem :-)
>
> Well, knowing what to look for was not going to be easy.
>
> And I'm sure you were shocked to see what I posted as a fix ;-)

Not really, such problems typically have just one-line fixes :-)

> Assuming this does fix your issue, I sent out a real patch with the
> explanation of what happened in the change log, so that you can see why
> that change was your issue.

Yes, it does the trick, thanks very much!

>> In order to reproduce it and get reliable kexec crash dumps,
>> I needed to give the VM at least 3 cores.
>
> Yes, it did require having this run on multiple CPUs to have a race
> condition trigger, and with only two cores it would have been hard to
> hit. I ran it on 8 cores and it triggered rather easily.

I mean I was able to trigger the problem with 2 cores, but then the
machine was completely stuck, without triggering the kexec reboot to
get the crash dump. I don't understand why, as the additional CPU is
not really involved at all.

Any idea why kexec could be so unreliable? (Even with 3 or 6 CPUs it
sometimes just didn't trigger.)

>> While running './io_uring-cp-forever link-cp.c file' (from:
>> https://github.com/metze-samba/liburing/commits/io_uring-cp-forever )
>> in one window, the following simple sequence triggered the problem in
>> most cases:
>>
>> echo 1 > /sys/kernel/tracing/events/sched/sched_waking/enable
>> echo 1 > /sys/kernel/tracing/set_event_pid
>
> All it took was something busy that did a lot of wakeups while setting
> set_event_pid to be able to hit the race easily. As I stated, I
> triggered it by running hackbench instead of the io-uring code. In
> fact, this bug had nothing to do with io-uring; you were only
> triggering it because you were putting enough load on the system to
> make the race happen often.

Yes, it was just the example that made it most reliable for me to
trigger.

>> pid_list = rcu_dereference_raw(tr->filtered_pids);
>
> That "tr" comes from the trace_event_file that is passed in by the
> "data" field of the callback. Hence, this callback got the data field
> of the event_filter_pid_sched_wakeup_probe_pre() callback that is
> called before all events when the set_event_pid file is set. That
> means the "tr" being dereferenced was not the "tr" you were looking
> for.

Yes, that's what I assumed, but I didn't find the reason for it.

>> It seems it got inlined within trace_event_buffer_reserve()
>>
>> The strangest thing I found so far is this:
>>
>> crash> sym global_trace
>> ffffffffbcdb7e80 (d) global_trace
>> crash> list ftrace_trace_arrays
>> ffffffffbcdb7e70
>> ffffffffbcdb7e80
>>
>> trace_array has size 7672, but ffffffffbcdb7e70 is only 16 bytes away
>> from ffffffffbcdb7e80.
>
> ftrace_trace_arrays is a list_head, and I doubt you created any other
> instances, thus the list contains only one instance, and that is
> global_trace. Hence, it points to global_trace and itself. It just so
> happens that a list_head is 16 bytes.
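To make sure I got that picture right, here's my own little sketch; only
the generic struct list_head layout from include/linux/list.h is real,
the addresses and comments are just my annotation:

struct list_head {
	struct list_head *next, *prev;	/* 2 x 8 bytes = 16 bytes */
};

/*
 * With global_trace as the only registered trace_array (its embedded
 * list_head being the first member):
 *
 *   ftrace_trace_arrays.next -> &global_trace.list   (ffffffffbcdb7e80)
 *   global_trace.list.next   -> &ftrace_trace_arrays (ffffffffbcdb7e70)
 *
 * so a plain list walk that starts at &ftrace_trace_arrays also prints
 * the head itself, 16 bytes before global_trace.
 */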
Ah, that explains a lot:

crash> struct list_head -o ftrace_trace_arrays
struct list_head {
   [ffffffffbcdb7e70] struct list_head *next;
   [ffffffffbcdb7e78] struct list_head *prev;
}
SIZE: 16
crash> list ftrace_trace_arrays
ffffffffbcdb7e70
ffffffffbcdb7e80

I guess this is what I want:

crash> list -H ftrace_trace_arrays
ffffffffbcdb7e80

>> Also this:
>>
>> crash> struct trace_array.events -o global_trace
>> struct trace_array {
>>   [ffffffffbcdb9be0] struct list_head events;
>> }
>> crash> list -s trace_event_file.tr -o trace_event_file.list ffffffffbcdb9be0
>> ffffffffbcdb9be0
>>   tr = 0xffffffffbcdb7b20
>> ffff8ccb45cdfb00
>>   tr = 0xffffffffbcdb7e80
>> ffff8ccb45cdf580
>>   tr = 0xffffffffbcdb7e80
>> ffff8ccb45cdfe18
>>   tr = 0xffffffffbcdb7e80
>> ...
>>   tr = 0xffffffffbcdb7e80
>>
>> The first one, 0xffffffffbcdb7b20, is only 864 bytes away from
>> 0xffffffffbcdb7e80.
>
> I'm thinking it is confused by hitting the ftrace_trace_arrays
> list_head itself.

Yes, I needed the -H here too:

list -s trace_event_file.tr -o trace_event_file.list -H ffffffffbcdb9be0

>> It would be really great if you (or anyone else on the lists) could
>> have a closer look in order to get the problem fixed :-)
>
> Once I triggered it and started looking at what was happening, it
> didn't take me too long to figure out where the culprit was.

Great!

>> I've learned a lot this week and tried hard to find the problem
>> myself, but I have to move back to other work for now.
>
> I'm glad you learned a lot. There's a lot of complex code in there, and
> it's getting more complex, as you can see with the static_calls.
>
> This is because tracing tries hard to avoid the heisenbug effect. You
> know, you see a bug, turn on tracing, and then that bug goes away!
>
> Thus it pulls out all the tricks it can to be as unintrusive on the
> system as it can be. And that causes things to get complex really
> quickly.

I'll have a look; I may need something similar for my smbdirect socket
driver in the future.

> Cheers, and thanks for keeping up on this bug!

Thanks for fixing the bug!

Now I can finally use:

  trace-cmd record -e all -P $(pidof io_uring-cp-forever)

But that doesn't include the iou-wrk-* threads, and the '-c' option
seems to only work with forking.

Is there a way to specify "trace *all* threads of the given pid"?

(Note the threads are coming and going, so it's not possible to specify
-P more than once.)

Thanks!
metze
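P.S. Regarding the static_calls: as a note to myself (and since I might
want the same trick for the smbdirect driver), this is roughly how I
read the API in include/linux/static_call.h; the names below are just
my own example and untested:

#include <linux/static_call.h>

static int my_default_handler(int arg)
{
	return arg;
}

/* Defines the key plus a trampoline, initially routed to the default. */
DEFINE_STATIC_CALL(my_hook, my_default_handler);

static int do_work(int arg)
{
	/*
	 * Where the architecture supports it, this becomes a patched
	 * direct call instead of an indirect call through a function
	 * pointer.
	 */
	return static_call(my_hook)(arg);
}

static void set_handler(int (*fn)(int))
{
	/* Re-patches every call site of my_hook to use fn. */
	static_call_update(my_hook, fn);
}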