* Suggestions on how to debug kernel crashes where printk and gdb both does not work @ 2021-06-14 13:22 Dongliang Mu 2021-06-14 13:34 ` Pavel Skripkin 0 siblings, 1 reply; 12+ messages in thread From: Dongliang Mu @ 2021-06-14 13:22 UTC (permalink / raw) To: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel, linux-wpan, netdev, stefan, syzkaller-bugs, syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH Dear kernel developers, I was trying to debug the crash - memory leak in hwsim_add_one [1] recently. However, I encountered a disgusting issue: my breakpoint and printk/pr_alert in the functions that will be surely executed do not work. The stack trace is in the following. I wrote this email to ask for some suggestions on how to debug such cases? Thanks very much. Looking forward to your reply. [<ffffffff82b09f12>] kmalloc include/linux/slab.h:556 [inline] [<ffffffff82b09f12>] kzalloc include/linux/slab.h:686 [inline] [<ffffffff82b09f12>] hwsim_alloc_edge.constprop.0+0x22/0x80 drivers/net/ieee802154/mac802154_hwsim.c:385 [<ffffffff82b0b0f3>] hwsim_subscribe_all_others drivers/net/ieee802154/mac802154_hwsim.c:709 [inline] [<ffffffff82b0b0f3>] hwsim_add_one+0x3b3/0x640 drivers/net/ieee802154/mac802154_hwsim.c:802 [<ffffffff82b0b3c4>] hwsim_probe+0x44/0xd0 drivers/net/ieee802154/mac802154_hwsim.c:848 [<ffffffff82628bf1>] platform_probe+0x81/0x120 drivers/base/platform.c:1447 [<ffffffff82625679>] really_probe+0x159/0x500 drivers/base/dd.c:576 [<ffffffff82625aab>] driver_probe_device+0x8b/0x100 drivers/base/dd.c:763 [<ffffffff82626325>] device_driver_attach+0x105/0x110 drivers/base/dd.c:1039 [<ffffffff826263d1>] __driver_attach drivers/base/dd.c:1117 [inline] [<ffffffff826263d1>] __driver_attach+0xa1/0x140 drivers/base/dd.c:1070 [<ffffffff82622459>] bus_for_each_dev+0xa9/0x100 drivers/base/bus.c:305 [<ffffffff826244e0>] bus_add_driver+0x160/0x280 drivers/base/bus.c:622 [<ffffffff82627233>] driver_register+0xc3/0x150 drivers/base/driver.c:171 [<ffffffff874fa3dc>] hwsim_init_module+0xae/0x107 drivers/net/ieee802154/mac802154_hwsim.c:899 [<ffffffff81001083>] do_one_initcall+0x63/0x2e0 init/main.c:1249 [<ffffffff87489873>] do_initcall_level init/main.c:1322 [inline] [<ffffffff87489873>] do_initcalls init/main.c:1338 [inline] [<ffffffff87489873>] do_basic_setup init/main.c:1358 [inline] [<ffffffff87489873>] kernel_init_freeable+0x1f4/0x26e init/main.c:1560 [<ffffffff84359255>] kernel_init+0xc/0x1a7 init/main.c:1447 [<ffffffff810022ef>] ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 [1] https://groups.google.com/g/syzkaller-bugs/c/iEEZ_UOvoEY/m/6sm4bt6EAgAJ -- My best regards to you. No System Is Safe! Dongliang Mu ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work 2021-06-14 13:22 Suggestions on how to debug kernel crashes where printk and gdb both does not work Dongliang Mu @ 2021-06-14 13:34 ` Pavel Skripkin 2021-06-14 14:19 ` Dongliang Mu 0 siblings, 1 reply; 12+ messages in thread From: Pavel Skripkin @ 2021-06-14 13:34 UTC (permalink / raw) To: Dongliang Mu Cc: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel, linux-wpan, netdev, stefan, syzkaller-bugs, syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH On Mon, 14 Jun 2021 21:22:43 +0800 Dongliang Mu <mudongliangabcd@gmail.com> wrote: > Dear kernel developers, > > I was trying to debug the crash - memory leak in hwsim_add_one [1] > recently. However, I encountered a disgusting issue: my breakpoint and > printk/pr_alert in the functions that will be surely executed do not > work. The stack trace is in the following. I wrote this email to ask > for some suggestions on how to debug such cases? > > Thanks very much. Looking forward to your reply. > Hi, Dongliang! This bug is not similar to others on the dashboard. I spent some time debugging it a week ago. The main problem here, that memory allocation happens in the boot time: > [<ffffffff84359255>] kernel_init+0xc/0x1a7 init/main.c:1447 and reproducer simply tries to free this data. You can use ftrace to look at it. Smth like this: $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter would work. With regards, Pavel Skripkin ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work 2021-06-14 13:34 ` Pavel Skripkin @ 2021-06-14 14:19 ` Dongliang Mu 2021-06-14 14:25 ` Pavel Skripkin 0 siblings, 1 reply; 12+ messages in thread From: Dongliang Mu @ 2021-06-14 14:19 UTC (permalink / raw) To: Pavel Skripkin Cc: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel, linux-wpan, netdev, stefan, syzkaller-bugs, syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH On Mon, Jun 14, 2021 at 9:34 PM Pavel Skripkin <paskripkin@gmail.com> wrote: > > On Mon, 14 Jun 2021 21:22:43 +0800 > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > Dear kernel developers, > > > > I was trying to debug the crash - memory leak in hwsim_add_one [1] > > recently. However, I encountered a disgusting issue: my breakpoint and > > printk/pr_alert in the functions that will be surely executed do not > > work. The stack trace is in the following. I wrote this email to ask > > for some suggestions on how to debug such cases? > > > > Thanks very much. Looking forward to your reply. > > > > Hi, Dongliang! > > This bug is not similar to others on the dashboard. I spent some time > debugging it a week ago. The main problem here, that memory > allocation happens in the boot time: > > > [<ffffffff84359255>] kernel_init+0xc/0x1a7 init/main.c:1447 > Oh, nice catch. No wonder why my debugging does not work. :( > and reproducer simply tries to > free this data. You can use ftrace to look at it. Smth like this: > > $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter Thanks for your suggestion. Do you have any conclusions about this case? If you have found out the root cause and start writing patches, I will turn my focus to other cases. BTW, I only found another possible memory leak after some manual code review [1]. However, it is not the root cause for this crash. [1] https://lkml.org/lkml/2021/6/10/1297 > > would work. > > > With regards, > Pavel Skripkin ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work 2021-06-14 14:19 ` Dongliang Mu @ 2021-06-14 14:25 ` Pavel Skripkin 2021-06-14 14:40 ` Dongliang Mu 0 siblings, 1 reply; 12+ messages in thread From: Pavel Skripkin @ 2021-06-14 14:25 UTC (permalink / raw) To: Dongliang Mu Cc: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel, linux-wpan, netdev, stefan, syzkaller-bugs, syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH On Mon, 14 Jun 2021 22:19:10 +0800 Dongliang Mu <mudongliangabcd@gmail.com> wrote: > On Mon, Jun 14, 2021 at 9:34 PM Pavel Skripkin <paskripkin@gmail.com> > wrote: > > > > On Mon, 14 Jun 2021 21:22:43 +0800 > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > Dear kernel developers, > > > > > > I was trying to debug the crash - memory leak in hwsim_add_one [1] > > > recently. However, I encountered a disgusting issue: my > > > breakpoint and printk/pr_alert in the functions that will be > > > surely executed do not work. The stack trace is in the following. > > > I wrote this email to ask for some suggestions on how to debug > > > such cases? > > > > > > Thanks very much. Looking forward to your reply. > > > > > > > Hi, Dongliang! > > > > This bug is not similar to others on the dashboard. I spent some > > time debugging it a week ago. The main problem here, that memory > > allocation happens in the boot time: > > > > > [<ffffffff84359255>] kernel_init+0xc/0x1a7 init/main.c:1447 > > > > Oh, nice catch. No wonder why my debugging does not work. :( > > > and reproducer simply tries to > > free this data. You can use ftrace to look at it. Smth like this: > > > > $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter > > Thanks for your suggestion. > > Do you have any conclusions about this case? If you have found out the > root cause and start writing patches, I will turn my focus to other > cases. No, I had some busy days and I have nothing about this bug for now. I've just traced the reproducer execution and that's all :) I guess, some error handling paths are broken, but Im not sure > > BTW, I only found another possible memory leak after some manual code > review [1]. However, it is not the root cause for this crash. > > [1] https://lkml.org/lkml/2021/6/10/1297 > > > > > would work. > > > > > > With regards, > > Pavel Skripkin With regards, Pavel Skripkin ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work 2021-06-14 14:25 ` Pavel Skripkin @ 2021-06-14 14:40 ` Dongliang Mu 2021-06-14 14:47 ` Pavel Skripkin 0 siblings, 1 reply; 12+ messages in thread From: Dongliang Mu @ 2021-06-14 14:40 UTC (permalink / raw) To: Pavel Skripkin Cc: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel, linux-wpan, netdev, stefan, syzkaller-bugs, syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH On Mon, Jun 14, 2021 at 10:25 PM Pavel Skripkin <paskripkin@gmail.com> wrote: > > On Mon, 14 Jun 2021 22:19:10 +0800 > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > On Mon, Jun 14, 2021 at 9:34 PM Pavel Skripkin <paskripkin@gmail.com> > > wrote: > > > > > > On Mon, 14 Jun 2021 21:22:43 +0800 > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > > > Dear kernel developers, > > > > > > > > I was trying to debug the crash - memory leak in hwsim_add_one [1] > > > > recently. However, I encountered a disgusting issue: my > > > > breakpoint and printk/pr_alert in the functions that will be > > > > surely executed do not work. The stack trace is in the following. > > > > I wrote this email to ask for some suggestions on how to debug > > > > such cases? > > > > > > > > Thanks very much. Looking forward to your reply. > > > > > > > > > > Hi, Dongliang! > > > > > > This bug is not similar to others on the dashboard. I spent some > > > time debugging it a week ago. The main problem here, that memory > > > allocation happens in the boot time: > > > > > > > [<ffffffff84359255>] kernel_init+0xc/0x1a7 init/main.c:1447 > > > > > > > Oh, nice catch. No wonder why my debugging does not work. :( > > > > > and reproducer simply tries to > > > free this data. You can use ftrace to look at it. Smth like this: > > > > > > $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter > > > > Thanks for your suggestion. > > > > Do you have any conclusions about this case? If you have found out the > > root cause and start writing patches, I will turn my focus to other > > cases. > > No, I had some busy days and I have nothing about this bug for now. > I've just traced the reproducer execution and that's all :) > > I guess, some error handling paths are broken, but Im not sure In the beginning, I agreed with you. However, after I manually checked functions: hwsim_probe (initialization) and hwsim_remove (cleanup), then things may be different. The cleanup looks correct to me. I would like to debug but stuck with the debugging process. And there is another issue: the cleanup function also does not output anything or hit the breakpoint. I don't quite understand it since the cleanup is not at the boot time. Any idea? > > > > > > BTW, I only found another possible memory leak after some manual code > > review [1]. However, it is not the root cause for this crash. > > > > [1] https://lkml.org/lkml/2021/6/10/1297 > > > > > > > > would work. > > > > > > > > > With regards, > > > Pavel Skripkin > > > > > With regards, > Pavel Skripkin ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work 2021-06-14 14:40 ` Dongliang Mu @ 2021-06-14 14:47 ` Pavel Skripkin 2021-06-14 15:04 ` Dongliang Mu 0 siblings, 1 reply; 12+ messages in thread From: Pavel Skripkin @ 2021-06-14 14:47 UTC (permalink / raw) To: Dongliang Mu Cc: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel, linux-wpan, netdev, stefan, syzkaller-bugs, syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH On Mon, 14 Jun 2021 22:40:55 +0800 Dongliang Mu <mudongliangabcd@gmail.com> wrote: > On Mon, Jun 14, 2021 at 10:25 PM Pavel Skripkin > <paskripkin@gmail.com> wrote: > > > > On Mon, 14 Jun 2021 22:19:10 +0800 > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > On Mon, Jun 14, 2021 at 9:34 PM Pavel Skripkin > > > <paskripkin@gmail.com> wrote: > > > > > > > > On Mon, 14 Jun 2021 21:22:43 +0800 > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > > > > > Dear kernel developers, > > > > > > > > > > I was trying to debug the crash - memory leak in > > > > > hwsim_add_one [1] recently. However, I encountered a > > > > > disgusting issue: my breakpoint and printk/pr_alert in the > > > > > functions that will be surely executed do not work. The stack > > > > > trace is in the following. I wrote this email to ask for some > > > > > suggestions on how to debug such cases? > > > > > > > > > > Thanks very much. Looking forward to your reply. > > > > > > > > > > > > > Hi, Dongliang! > > > > > > > > This bug is not similar to others on the dashboard. I spent some > > > > time debugging it a week ago. The main problem here, that memory > > > > allocation happens in the boot time: > > > > > > > > > [<ffffffff84359255>] kernel_init+0xc/0x1a7 init/main.c:1447 > > > > > > > > > > Oh, nice catch. No wonder why my debugging does not work. :( > > > > > > > and reproducer simply tries to > > > > free this data. You can use ftrace to look at it. Smth like > > > > this: > > > > > > > > $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter > > > > > > Thanks for your suggestion. > > > > > > Do you have any conclusions about this case? If you have found > > > out the root cause and start writing patches, I will turn my > > > focus to other cases. > > > > No, I had some busy days and I have nothing about this bug for now. > > I've just traced the reproducer execution and that's all :) > > > > I guess, some error handling paths are broken, but Im not sure > > In the beginning, I agreed with you. However, after I manually checked > functions: hwsim_probe (initialization) and hwsim_remove (cleanup), > then things may be different. The cleanup looks correct to me. I would > like to debug but stuck with the debugging process. > > And there is another issue: the cleanup function also does not output > anything or hit the breakpoint. I don't quite understand it since the > cleanup is not at the boot time. > > Any idea? > Output from ftrace (syzkaller repro): root@syzkaller:~# cat /sys/kernel/tracing/trace # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 1) | hwsim_del_radio_nl() { 1) | hwsim_del() { 1) | hwsim_edge_unsubscribe_me() { 1) ! 310.041 us | hwsim_free_edge(); 1) ! 665.221 us | } 1) * 52999.05 us | } 1) * 53035.38 us | } Cleanup function is not the case, I think :) With regards, Pavel Skripkin ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work 2021-06-14 14:47 ` Pavel Skripkin @ 2021-06-14 15:04 ` Dongliang Mu 2021-06-14 20:30 ` Pavel Skripkin 0 siblings, 1 reply; 12+ messages in thread From: Dongliang Mu @ 2021-06-14 15:04 UTC (permalink / raw) To: Pavel Skripkin Cc: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel, linux-wpan, netdev, stefan, syzkaller-bugs, syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH On Mon, Jun 14, 2021 at 10:47 PM Pavel Skripkin <paskripkin@gmail.com> wrote: > > On Mon, 14 Jun 2021 22:40:55 +0800 > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > On Mon, Jun 14, 2021 at 10:25 PM Pavel Skripkin > > <paskripkin@gmail.com> wrote: > > > > > > On Mon, 14 Jun 2021 22:19:10 +0800 > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > > > On Mon, Jun 14, 2021 at 9:34 PM Pavel Skripkin > > > > <paskripkin@gmail.com> wrote: > > > > > > > > > > On Mon, 14 Jun 2021 21:22:43 +0800 > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > > > > > > > Dear kernel developers, > > > > > > > > > > > > I was trying to debug the crash - memory leak in > > > > > > hwsim_add_one [1] recently. However, I encountered a > > > > > > disgusting issue: my breakpoint and printk/pr_alert in the > > > > > > functions that will be surely executed do not work. The stack > > > > > > trace is in the following. I wrote this email to ask for some > > > > > > suggestions on how to debug such cases? > > > > > > > > > > > > Thanks very much. Looking forward to your reply. > > > > > > > > > > > > > > > > Hi, Dongliang! > > > > > > > > > > This bug is not similar to others on the dashboard. I spent some > > > > > time debugging it a week ago. The main problem here, that memory > > > > > allocation happens in the boot time: > > > > > > > > > > > [<ffffffff84359255>] kernel_init+0xc/0x1a7 init/main.c:1447 > > > > > > > > > > > > > Oh, nice catch. No wonder why my debugging does not work. :( > > > > > > > > > and reproducer simply tries to > > > > > free this data. You can use ftrace to look at it. Smth like > > > > > this: > > > > > > > > > > $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter > > > > > > > > Thanks for your suggestion. > > > > > > > > Do you have any conclusions about this case? If you have found > > > > out the root cause and start writing patches, I will turn my > > > > focus to other cases. > > > > > > No, I had some busy days and I have nothing about this bug for now. > > > I've just traced the reproducer execution and that's all :) > > > > > > I guess, some error handling paths are broken, but Im not sure > > > > In the beginning, I agreed with you. However, after I manually checked > > functions: hwsim_probe (initialization) and hwsim_remove (cleanup), > > then things may be different. The cleanup looks correct to me. I would > > like to debug but stuck with the debugging process. > > > > And there is another issue: the cleanup function also does not output > > anything or hit the breakpoint. I don't quite understand it since the > > cleanup is not at the boot time. > > > > Any idea? > > > > Output from ftrace (syzkaller repro): > > root@syzkaller:~# cat /sys/kernel/tracing/trace > # tracer: function_graph > # > # CPU DURATION FUNCTION CALLS > # | | | | | | | > 1) | hwsim_del_radio_nl() { > 1) | hwsim_del() { > 1) | hwsim_edge_unsubscribe_me() { > 1) ! 310.041 us | hwsim_free_edge(); > 1) ! 665.221 us | } > 1) * 52999.05 us | } > 1) * 53035.38 us | } > > Cleanup function is not the case, I think :) It seems like I spot the incorrect cleanup function (hwsim_remove is the right one is in my mind). Let me learn how to use ftrace to log the executed functions and then discuss this case with you guys. > > > > With regards, > Pavel Skripkin ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work 2021-06-14 15:04 ` Dongliang Mu @ 2021-06-14 20:30 ` Pavel Skripkin 2021-06-15 10:37 ` Dongliang Mu 0 siblings, 1 reply; 12+ messages in thread From: Pavel Skripkin @ 2021-06-14 20:30 UTC (permalink / raw) To: Dongliang Mu Cc: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel, linux-wpan, netdev, stefan, syzkaller-bugs, syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH On Mon, 14 Jun 2021 23:04:03 +0800 Dongliang Mu <mudongliangabcd@gmail.com> wrote: > On Mon, Jun 14, 2021 at 10:47 PM Pavel Skripkin > <paskripkin@gmail.com> wrote: > > > > On Mon, 14 Jun 2021 22:40:55 +0800 > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > On Mon, Jun 14, 2021 at 10:25 PM Pavel Skripkin > > > <paskripkin@gmail.com> wrote: > > > > > > > > On Mon, 14 Jun 2021 22:19:10 +0800 > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > > > > > On Mon, Jun 14, 2021 at 9:34 PM Pavel Skripkin > > > > > <paskripkin@gmail.com> wrote: > > > > > > > > > > > > On Mon, 14 Jun 2021 21:22:43 +0800 > > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > > > > > > > > > Dear kernel developers, > > > > > > > > > > > > > > I was trying to debug the crash - memory leak in > > > > > > > hwsim_add_one [1] recently. However, I encountered a > > > > > > > disgusting issue: my breakpoint and printk/pr_alert in the > > > > > > > functions that will be surely executed do not work. The > > > > > > > stack trace is in the following. I wrote this email to > > > > > > > ask for some suggestions on how to debug such cases? > > > > > > > > > > > > > > Thanks very much. Looking forward to your reply. > > > > > > > > > > > > > > > > > > > Hi, Dongliang! > > > > > > > > > > > > This bug is not similar to others on the dashboard. I spent > > > > > > some time debugging it a week ago. The main problem here, > > > > > > that memory allocation happens in the boot time: > > > > > > > > > > > > > [<ffffffff84359255>] kernel_init+0xc/0x1a7 > > > > > > > init/main.c:1447 > > > > > > > > > > > > > > > > Oh, nice catch. No wonder why my debugging does not work. :( > > > > > > > > > > > and reproducer simply tries to > > > > > > free this data. You can use ftrace to look at it. Smth like > > > > > > this: > > > > > > > > > > > > $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter > > > > > > > > > > Thanks for your suggestion. > > > > > > > > > > Do you have any conclusions about this case? If you have found > > > > > out the root cause and start writing patches, I will turn my > > > > > focus to other cases. > > > > > > > > No, I had some busy days and I have nothing about this bug for > > > > now. I've just traced the reproducer execution and that's all :) > > > > > > > > I guess, some error handling paths are broken, but Im not sure > > > > > > In the beginning, I agreed with you. However, after I manually > > > checked functions: hwsim_probe (initialization) and hwsim_remove > > > (cleanup), then things may be different. The cleanup looks > > > correct to me. I would like to debug but stuck with the debugging > > > process. > > > > > > And there is another issue: the cleanup function also does not > > > output anything or hit the breakpoint. I don't quite understand > > > it since the cleanup is not at the boot time. > > > > > > Any idea? > > > > > > > Output from ftrace (syzkaller repro): > > > > root@syzkaller:~# cat /sys/kernel/tracing/trace > > # tracer: function_graph > > # > > # CPU DURATION FUNCTION CALLS > > # | | | | | | | > > 1) | hwsim_del_radio_nl() { > > 1) | hwsim_del() { > > 1) | hwsim_edge_unsubscribe_me() { > > 1) ! 310.041 us | hwsim_free_edge(); > > 1) ! 665.221 us | } > > 1) * 52999.05 us | } > > 1) * 53035.38 us | } > > > > Cleanup function is not the case, I think :) > > It seems like I spot the incorrect cleanup function (hwsim_remove is > the right one is in my mind). Let me learn how to use ftrace to log > the executed functions and then discuss this case with you guys. > Hmmm, I think, there is a mess with lists. I just want to share my debug results, I have no idea about the fix for now. In hwsim_probe() edge for phy->idx = 1 is allocated, then reproduces sends a request to delete phy with idx == 0, so this check in hwsim_edge_unsubscribe_me(): if (e->endpoint->idx == phy->idx) { ... clean up code ... } won't be passed and edge won't be freed (because it was allocated for phy with idx == 1). Allocated edge for phy 1 becomes leaked after hwsim_del(). I can't really see the code where phy with idx == 1 can be deleted from list... Maybe, it's kmemleak bug. Similar strange case was with this one https://syzkaller.appspot.com/bug?id=3a325b8389fc41c1bc94de0f4ac437ed13cce584. I find it strange, that I could reach leaked pointers after kmemleak reported a leak. Im not familiar with kmemleak internals and I might be wrong With regards, Pavel Skripkin ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work 2021-06-14 20:30 ` Pavel Skripkin @ 2021-06-15 10:37 ` Dongliang Mu 2021-06-15 13:34 ` Pavel Skripkin 0 siblings, 1 reply; 12+ messages in thread From: Dongliang Mu @ 2021-06-15 10:37 UTC (permalink / raw) To: Pavel Skripkin Cc: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel, linux-wpan, netdev, stefan, syzkaller-bugs, syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH On Tue, Jun 15, 2021 at 4:30 AM Pavel Skripkin <paskripkin@gmail.com> wrote: > > On Mon, 14 Jun 2021 23:04:03 +0800 > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > On Mon, Jun 14, 2021 at 10:47 PM Pavel Skripkin > > <paskripkin@gmail.com> wrote: > > > > > > On Mon, 14 Jun 2021 22:40:55 +0800 > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > > > On Mon, Jun 14, 2021 at 10:25 PM Pavel Skripkin > > > > <paskripkin@gmail.com> wrote: > > > > > > > > > > On Mon, 14 Jun 2021 22:19:10 +0800 > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > > > > > > > On Mon, Jun 14, 2021 at 9:34 PM Pavel Skripkin > > > > > > <paskripkin@gmail.com> wrote: > > > > > > > > > > > > > > On Mon, 14 Jun 2021 21:22:43 +0800 > > > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > > > > > > > > > > > Dear kernel developers, > > > > > > > > > > > > > > > > I was trying to debug the crash - memory leak in > > > > > > > > hwsim_add_one [1] recently. However, I encountered a > > > > > > > > disgusting issue: my breakpoint and printk/pr_alert in the > > > > > > > > functions that will be surely executed do not work. The > > > > > > > > stack trace is in the following. I wrote this email to > > > > > > > > ask for some suggestions on how to debug such cases? > > > > > > > > > > > > > > > > Thanks very much. Looking forward to your reply. > > > > > > > > > > > > > > > > > > > > > > Hi, Dongliang! > > > > > > > > > > > > > > This bug is not similar to others on the dashboard. I spent > > > > > > > some time debugging it a week ago. The main problem here, > > > > > > > that memory allocation happens in the boot time: > > > > > > > > > > > > > > > [<ffffffff84359255>] kernel_init+0xc/0x1a7 > > > > > > > > init/main.c:1447 > > > > > > > > > > > > > > > > > > > Oh, nice catch. No wonder why my debugging does not work. :( > > > > > > > > > > > > > and reproducer simply tries to > > > > > > > free this data. You can use ftrace to look at it. Smth like > > > > > > > this: > > > > > > > > > > > > > > $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter > > > > > > > > > > > > Thanks for your suggestion. > > > > > > > > > > > > Do you have any conclusions about this case? If you have found > > > > > > out the root cause and start writing patches, I will turn my > > > > > > focus to other cases. > > > > > > > > > > No, I had some busy days and I have nothing about this bug for > > > > > now. I've just traced the reproducer execution and that's all :) > > > > > > > > > > I guess, some error handling paths are broken, but Im not sure > > > > > > > > In the beginning, I agreed with you. However, after I manually > > > > checked functions: hwsim_probe (initialization) and hwsim_remove > > > > (cleanup), then things may be different. The cleanup looks > > > > correct to me. I would like to debug but stuck with the debugging > > > > process. > > > > > > > > And there is another issue: the cleanup function also does not > > > > output anything or hit the breakpoint. I don't quite understand > > > > it since the cleanup is not at the boot time. > > > > > > > > Any idea? > > > > > > > > > > Output from ftrace (syzkaller repro): > > > > > > root@syzkaller:~# cat /sys/kernel/tracing/trace > > > # tracer: function_graph > > > # > > > # CPU DURATION FUNCTION CALLS > > > # | | | | | | | > > > 1) | hwsim_del_radio_nl() { > > > 1) | hwsim_del() { > > > 1) | hwsim_edge_unsubscribe_me() { > > > 1) ! 310.041 us | hwsim_free_edge(); > > > 1) ! 665.221 us | } > > > 1) * 52999.05 us | } > > > 1) * 53035.38 us | } > > > > > > Cleanup function is not the case, I think :) > > > > It seems like I spot the incorrect cleanup function (hwsim_remove is > > the right one is in my mind). Let me learn how to use ftrace to log > > the executed functions and then discuss this case with you guys. > > > > Hmmm, I think, there is a mess with lists. > > I just want to share my debug results, I have no idea about the fix for > now. > > In hwsim_probe() edge for phy->idx = 1 is allocated, then reproduces > sends a request to delete phy with idx == 0, so this check in > hwsim_edge_unsubscribe_me(): > > if (e->endpoint->idx == phy->idx) { > ... clean up code ... > } > > won't be passed and edge won't be freed (because it was allocated for > phy with idx == 1). Allocated edge for phy 1 becomes leaked after > hwsim_del(). I can't really see the code where phy with idx == 1 can > be deleted from list... Thanks for sharing your debugging result. hwsim_phys | --------------------------------- | | sub0 (edges) sub1 (edges) ----> e (idx = 1) ----> e (idx = 0) hwsim_del_radio_nl will call hwsim_del to delete phy (idx:1). However, in this function, it only deletes the e in the edge list of sub1. Then it deletes phy (i.e., sub0) from the hwsim_phys list. So it leaves the e in the edge list of sub0 non-free. I proposed a patch and test it successfully in the syzbot dashboard. diff --git a/drivers/net/ieee802154/mac802154_hwsim.c b/drivers/net/ieee802154/mac802154_hwsim.c index da9135231c07..b05159cff33a 100644 --- a/drivers/net/ieee802154/mac802154_hwsim.c +++ b/drivers/net/ieee802154/mac802154_hwsim.c @@ -824,9 +824,16 @@ static int hwsim_add_one(struct genl_info *info, struct device *dev, static void hwsim_del(struct hwsim_phy *phy) { struct hwsim_pib *pib; + struct hwsim_edge *e; hwsim_edge_unsubscribe_me(phy); + // remove the edges in the list + list_for_each_entry_rcu(e, &phy->edges, list) { + list_del_rcu(&e->list); + hwsim_free_edge(e); + } + list_del(&phy->list); rcu_read_lock(); I will send a patch later. > > Maybe, it's kmemleak bug. Similar strange case was with this one > https://syzkaller.appspot.com/bug?id=3a325b8389fc41c1bc94de0f4ac437ed13cce584. > I find it strange, that I could reach leaked pointers after kmemleak reported a > leak. Im not familiar with kmemleak internals and I might be wrong > > > With regards, > Pavel Skripkin ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work 2021-06-15 10:37 ` Dongliang Mu @ 2021-06-15 13:34 ` Pavel Skripkin 2021-06-16 2:02 ` Dongliang Mu 0 siblings, 1 reply; 12+ messages in thread From: Pavel Skripkin @ 2021-06-15 13:34 UTC (permalink / raw) To: Dongliang Mu Cc: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel, linux-wpan, netdev, stefan, syzkaller-bugs, syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH On Tue, 15 Jun 2021 18:37:14 +0800 Dongliang Mu <mudongliangabcd@gmail.com> wrote: > On Tue, Jun 15, 2021 at 4:30 AM Pavel Skripkin <paskripkin@gmail.com> > wrote: > > > > On Mon, 14 Jun 2021 23:04:03 +0800 > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > On Mon, Jun 14, 2021 at 10:47 PM Pavel Skripkin > > > <paskripkin@gmail.com> wrote: > > > > > > > > On Mon, 14 Jun 2021 22:40:55 +0800 > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > > > > > On Mon, Jun 14, 2021 at 10:25 PM Pavel Skripkin > > > > > <paskripkin@gmail.com> wrote: > > > > > > > > > > > > On Mon, 14 Jun 2021 22:19:10 +0800 > > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > > > > > > > > > On Mon, Jun 14, 2021 at 9:34 PM Pavel Skripkin > > > > > > > <paskripkin@gmail.com> wrote: > > > > > > > > > > > > > > > > On Mon, 14 Jun 2021 21:22:43 +0800 > > > > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > > > > > > > > > > > > > Dear kernel developers, > > > > > > > > > > > > > > > > > > I was trying to debug the crash - memory leak in > > > > > > > > > hwsim_add_one [1] recently. However, I encountered a > > > > > > > > > disgusting issue: my breakpoint and printk/pr_alert > > > > > > > > > in the functions that will be surely executed do not > > > > > > > > > work. The stack trace is in the following. I wrote > > > > > > > > > this email to ask for some suggestions on how to > > > > > > > > > debug such cases? > > > > > > > > > > > > > > > > > > Thanks very much. Looking forward to your reply. > > > > > > > > > > > > > > > > > > > > > > > > > Hi, Dongliang! > > > > > > > > > > > > > > > > This bug is not similar to others on the dashboard. I > > > > > > > > spent some time debugging it a week ago. The main > > > > > > > > problem here, that memory allocation happens in the > > > > > > > > boot time: > > > > > > > > > > > > > > > > > [<ffffffff84359255>] kernel_init+0xc/0x1a7 > > > > > > > > > init/main.c:1447 > > > > > > > > > > > > > > > > > > > > > > Oh, nice catch. No wonder why my debugging does not work. > > > > > > > :( > > > > > > > > > > > > > > > and reproducer simply tries to > > > > > > > > free this data. You can use ftrace to look at it. Smth > > > > > > > > like this: > > > > > > > > > > > > > > > > $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter > > > > > > > > > > > > > > Thanks for your suggestion. > > > > > > > > > > > > > > Do you have any conclusions about this case? If you have > > > > > > > found out the root cause and start writing patches, I > > > > > > > will turn my focus to other cases. > > > > > > > > > > > > No, I had some busy days and I have nothing about this bug > > > > > > for now. I've just traced the reproducer execution and > > > > > > that's all :) > > > > > > > > > > > > I guess, some error handling paths are broken, but Im not > > > > > > sure > > > > > > > > > > In the beginning, I agreed with you. However, after I manually > > > > > checked functions: hwsim_probe (initialization) and > > > > > hwsim_remove (cleanup), then things may be different. The > > > > > cleanup looks correct to me. I would like to debug but stuck > > > > > with the debugging process. > > > > > > > > > > And there is another issue: the cleanup function also does not > > > > > output anything or hit the breakpoint. I don't quite > > > > > understand it since the cleanup is not at the boot time. > > > > > > > > > > Any idea? > > > > > > > > > > > > > Output from ftrace (syzkaller repro): > > > > > > > > root@syzkaller:~# cat /sys/kernel/tracing/trace > > > > # tracer: function_graph > > > > # > > > > # CPU DURATION FUNCTION CALLS > > > > # | | | | | | | > > > > 1) | hwsim_del_radio_nl() { > > > > 1) | hwsim_del() { > > > > 1) | hwsim_edge_unsubscribe_me() { > > > > 1) ! 310.041 us | hwsim_free_edge(); > > > > 1) ! 665.221 us | } > > > > 1) * 52999.05 us | } > > > > 1) * 53035.38 us | } > > > > > > > > Cleanup function is not the case, I think :) > > > > > > It seems like I spot the incorrect cleanup function (hwsim_remove > > > is the right one is in my mind). Let me learn how to use ftrace > > > to log the executed functions and then discuss this case with you > > > guys. > > > > > > > Hmmm, I think, there is a mess with lists. > > > > I just want to share my debug results, I have no idea about the fix > > for now. > > > > In hwsim_probe() edge for phy->idx = 1 is allocated, then reproduces > > sends a request to delete phy with idx == 0, so this check in > > hwsim_edge_unsubscribe_me(): > > > > if (e->endpoint->idx == phy->idx) { > > ... clean up code ... > > } > > > > won't be passed and edge won't be freed (because it was allocated > > for phy with idx == 1). Allocated edge for phy 1 becomes leaked > > after hwsim_del(). I can't really see the code where phy with idx > > == 1 can be deleted from list... > > Thanks for sharing your debugging result. > > hwsim_phys > | > --------------------------------- > | | > sub0 (edges) sub1 (edges) > ----> e (idx = 1) ----> e (idx = 0) > > hwsim_del_radio_nl will call hwsim_del to delete phy (idx:1). > However, in this function, it only deletes the e in the edge list of > sub1. Then it deletes phy (i.e., sub0) from the hwsim_phys list. So it > leaves the e in the edge list of sub0 non-free. > > I proposed a patch and test it successfully in the syzbot dashboard. > Cool! I thougth about similar fix before going to bed, but I had really busy morning today :) > diff --git a/drivers/net/ieee802154/mac802154_hwsim.c > b/drivers/net/ieee802154/mac802154_hwsim.c > index da9135231c07..b05159cff33a 100644 > --- a/drivers/net/ieee802154/mac802154_hwsim.c > +++ b/drivers/net/ieee802154/mac802154_hwsim.c > @@ -824,9 +824,16 @@ static int hwsim_add_one(struct genl_info *info, > struct device *dev, > static void hwsim_del(struct hwsim_phy *phy) > { > struct hwsim_pib *pib; > + struct hwsim_edge *e; > > hwsim_edge_unsubscribe_me(phy); > > + // remove the edges in the list > + list_for_each_entry_rcu(e, &phy->edges, list) { > + list_del_rcu(&e->list); > + hwsim_free_edge(e); > + } > + I think, rcu_read_lock() and rcu_read_unlock() are needed here (like in hwsim_edge_unsubscribe_me()). Or you can delete this edges after deleting phy node from global list, then, i guess, rcu locking won't be needed here. > list_del(&phy->list); > > rcu_read_lock(); > > I will send a patch later. > > > > > > Maybe, it's kmemleak bug. Similar strange case was with this one > > https://syzkaller.appspot.com/bug?id=3a325b8389fc41c1bc94de0f4ac437ed13cce584. > > I find it strange, that I could reach leaked pointers after > > kmemleak reported a leak. Im not familiar with kmemleak internals > > and I might be wrong > > > > > > With regards, > > Pavel Skripkin With regards, Pavel Skripkin ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work 2021-06-15 13:34 ` Pavel Skripkin @ 2021-06-16 2:02 ` Dongliang Mu 2021-06-16 2:11 ` Dongliang Mu 0 siblings, 1 reply; 12+ messages in thread From: Dongliang Mu @ 2021-06-16 2:02 UTC (permalink / raw) To: Pavel Skripkin Cc: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel, linux-wpan, netdev, stefan, syzkaller-bugs, syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH On Tue, Jun 15, 2021 at 9:35 PM Pavel Skripkin <paskripkin@gmail.com> wrote: > > On Tue, 15 Jun 2021 18:37:14 +0800 > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > On Tue, Jun 15, 2021 at 4:30 AM Pavel Skripkin <paskripkin@gmail.com> > > wrote: > > > > > > On Mon, 14 Jun 2021 23:04:03 +0800 > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > > > On Mon, Jun 14, 2021 at 10:47 PM Pavel Skripkin > > > > <paskripkin@gmail.com> wrote: > > > > > > > > > > On Mon, 14 Jun 2021 22:40:55 +0800 > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > > > > > > > On Mon, Jun 14, 2021 at 10:25 PM Pavel Skripkin > > > > > > <paskripkin@gmail.com> wrote: > > > > > > > > > > > > > > On Mon, 14 Jun 2021 22:19:10 +0800 > > > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > > > > > > > > > > > On Mon, Jun 14, 2021 at 9:34 PM Pavel Skripkin > > > > > > > > <paskripkin@gmail.com> wrote: > > > > > > > > > > > > > > > > > > On Mon, 14 Jun 2021 21:22:43 +0800 > > > > > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > Dear kernel developers, > > > > > > > > > > > > > > > > > > > > I was trying to debug the crash - memory leak in > > > > > > > > > > hwsim_add_one [1] recently. However, I encountered a > > > > > > > > > > disgusting issue: my breakpoint and printk/pr_alert > > > > > > > > > > in the functions that will be surely executed do not > > > > > > > > > > work. The stack trace is in the following. I wrote > > > > > > > > > > this email to ask for some suggestions on how to > > > > > > > > > > debug such cases? > > > > > > > > > > > > > > > > > > > > Thanks very much. Looking forward to your reply. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi, Dongliang! > > > > > > > > > > > > > > > > > > This bug is not similar to others on the dashboard. I > > > > > > > > > spent some time debugging it a week ago. The main > > > > > > > > > problem here, that memory allocation happens in the > > > > > > > > > boot time: > > > > > > > > > > > > > > > > > > > [<ffffffff84359255>] kernel_init+0xc/0x1a7 > > > > > > > > > > init/main.c:1447 > > > > > > > > > > > > > > > > > > > > > > > > > Oh, nice catch. No wonder why my debugging does not work. > > > > > > > > :( > > > > > > > > > > > > > > > > > and reproducer simply tries to > > > > > > > > > free this data. You can use ftrace to look at it. Smth > > > > > > > > > like this: > > > > > > > > > > > > > > > > > > $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter > > > > > > > > > > > > > > > > Thanks for your suggestion. > > > > > > > > > > > > > > > > Do you have any conclusions about this case? If you have > > > > > > > > found out the root cause and start writing patches, I > > > > > > > > will turn my focus to other cases. > > > > > > > > > > > > > > No, I had some busy days and I have nothing about this bug > > > > > > > for now. I've just traced the reproducer execution and > > > > > > > that's all :) > > > > > > > > > > > > > > I guess, some error handling paths are broken, but Im not > > > > > > > sure > > > > > > > > > > > > In the beginning, I agreed with you. However, after I manually > > > > > > checked functions: hwsim_probe (initialization) and > > > > > > hwsim_remove (cleanup), then things may be different. The > > > > > > cleanup looks correct to me. I would like to debug but stuck > > > > > > with the debugging process. > > > > > > > > > > > > And there is another issue: the cleanup function also does not > > > > > > output anything or hit the breakpoint. I don't quite > > > > > > understand it since the cleanup is not at the boot time. > > > > > > > > > > > > Any idea? > > > > > > > > > > > > > > > > Output from ftrace (syzkaller repro): > > > > > > > > > > root@syzkaller:~# cat /sys/kernel/tracing/trace > > > > > # tracer: function_graph > > > > > # > > > > > # CPU DURATION FUNCTION CALLS > > > > > # | | | | | | | > > > > > 1) | hwsim_del_radio_nl() { > > > > > 1) | hwsim_del() { > > > > > 1) | hwsim_edge_unsubscribe_me() { > > > > > 1) ! 310.041 us | hwsim_free_edge(); > > > > > 1) ! 665.221 us | } > > > > > 1) * 52999.05 us | } > > > > > 1) * 53035.38 us | } > > > > > > > > > > Cleanup function is not the case, I think :) > > > > > > > > It seems like I spot the incorrect cleanup function (hwsim_remove > > > > is the right one is in my mind). Let me learn how to use ftrace > > > > to log the executed functions and then discuss this case with you > > > > guys. > > > > > > > > > > Hmmm, I think, there is a mess with lists. > > > > > > I just want to share my debug results, I have no idea about the fix > > > for now. > > > > > > In hwsim_probe() edge for phy->idx = 1 is allocated, then reproduces > > > sends a request to delete phy with idx == 0, so this check in > > > hwsim_edge_unsubscribe_me(): > > > > > > if (e->endpoint->idx == phy->idx) { > > > ... clean up code ... > > > } > > > > > > won't be passed and edge won't be freed (because it was allocated > > > for phy with idx == 1). Allocated edge for phy 1 becomes leaked > > > after hwsim_del(). I can't really see the code where phy with idx > > > == 1 can be deleted from list... > > > > Thanks for sharing your debugging result. > > > > hwsim_phys > > | > > --------------------------------- > > | | > > sub0 (edges) sub1 (edges) > > ----> e (idx = 1) ----> e (idx = 0) > > > > hwsim_del_radio_nl will call hwsim_del to delete phy (idx:1). > > However, in this function, it only deletes the e in the edge list of > > sub1. Then it deletes phy (i.e., sub0) from the hwsim_phys list. So it > > leaves the e in the edge list of sub0 non-free. > > > > I proposed a patch and test it successfully in the syzbot dashboard. > > > > Cool! I thougth about similar fix before going to bed, but I had really > busy morning today :) > > > diff --git a/drivers/net/ieee802154/mac802154_hwsim.c > > b/drivers/net/ieee802154/mac802154_hwsim.c > > index da9135231c07..b05159cff33a 100644 > > --- a/drivers/net/ieee802154/mac802154_hwsim.c > > +++ b/drivers/net/ieee802154/mac802154_hwsim.c > > @@ -824,9 +824,16 @@ static int hwsim_add_one(struct genl_info *info, > > struct device *dev, > > static void hwsim_del(struct hwsim_phy *phy) > > { > > struct hwsim_pib *pib; > > + struct hwsim_edge *e; > > > > hwsim_edge_unsubscribe_me(phy); > > > > + // remove the edges in the list > > + list_for_each_entry_rcu(e, &phy->edges, list) { > > + list_del_rcu(&e->list); > > + hwsim_free_edge(e); > > + } > > + > > I think, rcu_read_lock() and rcu_read_unlock() are needed here (like in > hwsim_edge_unsubscribe_me()). Or you can delete this edges after deleting > phy node from global list, then, i guess, rcu locking won't be needed > here. Yes, you're right. rcu_read_lock is needed here. However, from the code below list_del(&phy->list), I think we'd better still add rcu_read_lock for those statements. How do you think about the following patch? BTW, I've sent a patch with the prefix PATCH. Maybe we can discuss this patch there. diff --git a/drivers/net/ieee802154/mac802154_hwsim.c b/drivers/net/ieee802154/mac802154_hwsim.c index da9135231c07..cf659361a3fb 100644 --- a/drivers/net/ieee802154/mac802154_hwsim.c +++ b/drivers/net/ieee802154/mac802154_hwsim.c @@ -824,12 +824,17 @@ static int hwsim_add_one(struct genl_info *info, struct device *dev, static void hwsim_del(struct hwsim_phy *phy) { struct hwsim_pib *pib; + struct hwsim_edge *e; hwsim_edge_unsubscribe_me(phy); list_del(&phy->list); rcu_read_lock(); + list_for_each_entry_rcu(e, &phy->edges, list) { + list_del_rcu(&e->list); + hwsim_free_edge(e); + } pib = rcu_dereference(phy->pib); rcu_read_unlock(); > > > list_del(&phy->list); > > > > rcu_read_lock(); > > > > I will send a patch later. > > > > > > > > > > Maybe, it's kmemleak bug. Similar strange case was with this one > > > https://syzkaller.appspot.com/bug?id=3a325b8389fc41c1bc94de0f4ac437ed13cce584. > > > I find it strange, that I could reach leaked pointers after > > > kmemleak reported a leak. Im not familiar with kmemleak internals > > > and I might be wrong > > > > > > > > > With regards, > > > Pavel Skripkin > > > > > With regards, > Pavel Skripkin ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work 2021-06-16 2:02 ` Dongliang Mu @ 2021-06-16 2:11 ` Dongliang Mu 0 siblings, 0 replies; 12+ messages in thread From: Dongliang Mu @ 2021-06-16 2:11 UTC (permalink / raw) To: Pavel Skripkin Cc: alex.aring, David S. Miller, Jakub Kicinski, linux-kernel, linux-wpan, netdev, stefan, syzkaller-bugs, syzbot+b80c9959009a9325cdff, Dan Carpenter, Greg KH On Wed, Jun 16, 2021 at 10:02 AM Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > On Tue, Jun 15, 2021 at 9:35 PM Pavel Skripkin <paskripkin@gmail.com> wrote: > > > > On Tue, 15 Jun 2021 18:37:14 +0800 > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > On Tue, Jun 15, 2021 at 4:30 AM Pavel Skripkin <paskripkin@gmail.com> > > > wrote: > > > > > > > > On Mon, 14 Jun 2021 23:04:03 +0800 > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > > > > > On Mon, Jun 14, 2021 at 10:47 PM Pavel Skripkin > > > > > <paskripkin@gmail.com> wrote: > > > > > > > > > > > > On Mon, 14 Jun 2021 22:40:55 +0800 > > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > > > > > > > > > On Mon, Jun 14, 2021 at 10:25 PM Pavel Skripkin > > > > > > > <paskripkin@gmail.com> wrote: > > > > > > > > > > > > > > > > On Mon, 14 Jun 2021 22:19:10 +0800 > > > > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > > > > > > > > > > > > > On Mon, Jun 14, 2021 at 9:34 PM Pavel Skripkin > > > > > > > > > <paskripkin@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > On Mon, 14 Jun 2021 21:22:43 +0800 > > > > > > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > > Dear kernel developers, > > > > > > > > > > > > > > > > > > > > > > I was trying to debug the crash - memory leak in > > > > > > > > > > > hwsim_add_one [1] recently. However, I encountered a > > > > > > > > > > > disgusting issue: my breakpoint and printk/pr_alert > > > > > > > > > > > in the functions that will be surely executed do not > > > > > > > > > > > work. The stack trace is in the following. I wrote > > > > > > > > > > > this email to ask for some suggestions on how to > > > > > > > > > > > debug such cases? > > > > > > > > > > > > > > > > > > > > > > Thanks very much. Looking forward to your reply. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi, Dongliang! > > > > > > > > > > > > > > > > > > > > This bug is not similar to others on the dashboard. I > > > > > > > > > > spent some time debugging it a week ago. The main > > > > > > > > > > problem here, that memory allocation happens in the > > > > > > > > > > boot time: > > > > > > > > > > > > > > > > > > > > > [<ffffffff84359255>] kernel_init+0xc/0x1a7 > > > > > > > > > > > init/main.c:1447 > > > > > > > > > > > > > > > > > > > > > > > > > > > > Oh, nice catch. No wonder why my debugging does not work. > > > > > > > > > :( > > > > > > > > > > > > > > > > > > > and reproducer simply tries to > > > > > > > > > > free this data. You can use ftrace to look at it. Smth > > > > > > > > > > like this: > > > > > > > > > > > > > > > > > > > > $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter > > > > > > > > > > > > > > > > > > Thanks for your suggestion. > > > > > > > > > > > > > > > > > > Do you have any conclusions about this case? If you have > > > > > > > > > found out the root cause and start writing patches, I > > > > > > > > > will turn my focus to other cases. > > > > > > > > > > > > > > > > No, I had some busy days and I have nothing about this bug > > > > > > > > for now. I've just traced the reproducer execution and > > > > > > > > that's all :) > > > > > > > > > > > > > > > > I guess, some error handling paths are broken, but Im not > > > > > > > > sure > > > > > > > > > > > > > > In the beginning, I agreed with you. However, after I manually > > > > > > > checked functions: hwsim_probe (initialization) and > > > > > > > hwsim_remove (cleanup), then things may be different. The > > > > > > > cleanup looks correct to me. I would like to debug but stuck > > > > > > > with the debugging process. > > > > > > > > > > > > > > And there is another issue: the cleanup function also does not > > > > > > > output anything or hit the breakpoint. I don't quite > > > > > > > understand it since the cleanup is not at the boot time. > > > > > > > > > > > > > > Any idea? > > > > > > > > > > > > > > > > > > > Output from ftrace (syzkaller repro): > > > > > > > > > > > > root@syzkaller:~# cat /sys/kernel/tracing/trace > > > > > > # tracer: function_graph > > > > > > # > > > > > > # CPU DURATION FUNCTION CALLS > > > > > > # | | | | | | | > > > > > > 1) | hwsim_del_radio_nl() { > > > > > > 1) | hwsim_del() { > > > > > > 1) | hwsim_edge_unsubscribe_me() { > > > > > > 1) ! 310.041 us | hwsim_free_edge(); > > > > > > 1) ! 665.221 us | } > > > > > > 1) * 52999.05 us | } > > > > > > 1) * 53035.38 us | } > > > > > > > > > > > > Cleanup function is not the case, I think :) > > > > > > > > > > It seems like I spot the incorrect cleanup function (hwsim_remove > > > > > is the right one is in my mind). Let me learn how to use ftrace > > > > > to log the executed functions and then discuss this case with you > > > > > guys. > > > > > > > > > > > > > Hmmm, I think, there is a mess with lists. > > > > > > > > I just want to share my debug results, I have no idea about the fix > > > > for now. > > > > > > > > In hwsim_probe() edge for phy->idx = 1 is allocated, then reproduces > > > > sends a request to delete phy with idx == 0, so this check in > > > > hwsim_edge_unsubscribe_me(): > > > > > > > > if (e->endpoint->idx == phy->idx) { > > > > ... clean up code ... > > > > } > > > > > > > > won't be passed and edge won't be freed (because it was allocated > > > > for phy with idx == 1). Allocated edge for phy 1 becomes leaked > > > > after hwsim_del(). I can't really see the code where phy with idx > > > > == 1 can be deleted from list... > > > > > > Thanks for sharing your debugging result. > > > > > > hwsim_phys > > > | > > > --------------------------------- > > > | | > > > sub0 (edges) sub1 (edges) > > > ----> e (idx = 1) ----> e (idx = 0) > > > > > > hwsim_del_radio_nl will call hwsim_del to delete phy (idx:1). > > > However, in this function, it only deletes the e in the edge list of > > > sub1. Then it deletes phy (i.e., sub0) from the hwsim_phys list. So it > > > leaves the e in the edge list of sub0 non-free. > > > > > > I proposed a patch and test it successfully in the syzbot dashboard. > > > > > > > Cool! I thougth about similar fix before going to bed, but I had really > > busy morning today :) > > > > > diff --git a/drivers/net/ieee802154/mac802154_hwsim.c > > > b/drivers/net/ieee802154/mac802154_hwsim.c > > > index da9135231c07..b05159cff33a 100644 > > > --- a/drivers/net/ieee802154/mac802154_hwsim.c > > > +++ b/drivers/net/ieee802154/mac802154_hwsim.c > > > @@ -824,9 +824,16 @@ static int hwsim_add_one(struct genl_info *info, > > > struct device *dev, > > > static void hwsim_del(struct hwsim_phy *phy) > > > { > > > struct hwsim_pib *pib; > > > + struct hwsim_edge *e; > > > > > > hwsim_edge_unsubscribe_me(phy); > > > > > > + // remove the edges in the list > > > + list_for_each_entry_rcu(e, &phy->edges, list) { > > > + list_del_rcu(&e->list); > > > + hwsim_free_edge(e); > > > + } > > > + > > > > I think, rcu_read_lock() and rcu_read_unlock() are needed here (like in > > hwsim_edge_unsubscribe_me()). Or you can delete this edges after deleting > > phy node from global list, then, i guess, rcu locking won't be needed > > here. > > Yes, you're right. rcu_read_lock is needed here. However, from the > code below list_del(&phy->list), I think we'd better still add > rcu_read_lock for those statements. > > How do you think about the following patch? BTW, I've sent a patch > with the prefix PATCH. Maybe we can discuss this patch there. > > diff --git a/drivers/net/ieee802154/mac802154_hwsim.c > b/drivers/net/ieee802154/mac802154_hwsim.c > index da9135231c07..cf659361a3fb 100644 > --- a/drivers/net/ieee802154/mac802154_hwsim.c > +++ b/drivers/net/ieee802154/mac802154_hwsim.c > @@ -824,12 +824,17 @@ static int hwsim_add_one(struct genl_info *info, > struct device *dev, > static void hwsim_del(struct hwsim_phy *phy) > { > struct hwsim_pib *pib; > + struct hwsim_edge *e; > > hwsim_edge_unsubscribe_me(phy); > > list_del(&phy->list); > > rcu_read_lock(); > + list_for_each_entry_rcu(e, &phy->edges, list) { > + list_del_rcu(&e->list); > + hwsim_free_edge(e); > + } > pib = rcu_dereference(phy->pib); > rcu_read_unlock(); > I have sent a v2 patch to the mailing list, please discuss the patch in the corresponding thread [1]. [1] https://lkml.org/lkml/2021/6/15/1585. > > > > > list_del(&phy->list); > > > > > > rcu_read_lock(); > > > > > > I will send a patch later. > > > > > > > > > > > > > > Maybe, it's kmemleak bug. Similar strange case was with this one > > > > https://syzkaller.appspot.com/bug?id=3a325b8389fc41c1bc94de0f4ac437ed13cce584. > > > > I find it strange, that I could reach leaked pointers after > > > > kmemleak reported a leak. Im not familiar with kmemleak internals > > > > and I might be wrong > > > > > > > > > > > > With regards, > > > > Pavel Skripkin > > > > > > > > > > With regards, > > Pavel Skripkin ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2021-06-16 2:11 UTC | newest] Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-06-14 13:22 Suggestions on how to debug kernel crashes where printk and gdb both does not work Dongliang Mu 2021-06-14 13:34 ` Pavel Skripkin 2021-06-14 14:19 ` Dongliang Mu 2021-06-14 14:25 ` Pavel Skripkin 2021-06-14 14:40 ` Dongliang Mu 2021-06-14 14:47 ` Pavel Skripkin 2021-06-14 15:04 ` Dongliang Mu 2021-06-14 20:30 ` Pavel Skripkin 2021-06-15 10:37 ` Dongliang Mu 2021-06-15 13:34 ` Pavel Skripkin 2021-06-16 2:02 ` Dongliang Mu 2021-06-16 2:11 ` Dongliang Mu
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).