All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dongliang Mu <mudongliangabcd@gmail.com>
To: Pavel Skripkin <paskripkin@gmail.com>
Cc: alex.aring@gmail.com, "David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	linux-wpan@vger.kernel.org, netdev@vger.kernel.org,
	stefan@datenfreihafen.org,
	syzkaller-bugs <syzkaller-bugs@googlegroups.com>,
	syzbot+b80c9959009a9325cdff@syzkaller.appspotmail.com,
	Dan Carpenter <dan.carpenter@oracle.com>,
	Greg KH <gregkh@linuxfoundation.org>
Subject: Re: Suggestions on how to debug kernel crashes where printk and gdb both does not work
Date: Wed, 16 Jun 2021 10:11:08 +0800	[thread overview]
Message-ID: <CAD-N9QWH5ewqQrmo-h5Em9W=+kDB4JO0x==vE=hOH9f4MhQbJg@mail.gmail.com> (raw)
In-Reply-To: <CAD-N9QV7o_gQtH4tCCDGheFjtx_7xBh7hgtyfbZo71FebUuUrA@mail.gmail.com>

On Wed, Jun 16, 2021 at 10:02 AM Dongliang Mu <mudongliangabcd@gmail.com> wrote:
>
> On Tue, Jun 15, 2021 at 9:35 PM Pavel Skripkin <paskripkin@gmail.com> wrote:
> >
> > On Tue, 15 Jun 2021 18:37:14 +0800
> > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> >
> > > On Tue, Jun 15, 2021 at 4:30 AM Pavel Skripkin <paskripkin@gmail.com>
> > > wrote:
> > > >
> > > > On Mon, 14 Jun 2021 23:04:03 +0800
> > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > > >
> > > > > On Mon, Jun 14, 2021 at 10:47 PM Pavel Skripkin
> > > > > <paskripkin@gmail.com> wrote:
> > > > > >
> > > > > > On Mon, 14 Jun 2021 22:40:55 +0800
> > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > > > > >
> > > > > > > On Mon, Jun 14, 2021 at 10:25 PM Pavel Skripkin
> > > > > > > <paskripkin@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, 14 Jun 2021 22:19:10 +0800
> > > > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > On Mon, Jun 14, 2021 at 9:34 PM Pavel Skripkin
> > > > > > > > > <paskripkin@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon, 14 Jun 2021 21:22:43 +0800
> > > > > > > > > > Dongliang Mu <mudongliangabcd@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Dear kernel developers,
> > > > > > > > > > >
> > > > > > > > > > > I was trying to debug the crash - memory leak in
> > > > > > > > > > > hwsim_add_one [1] recently. However, I encountered a
> > > > > > > > > > > disgusting issue: my breakpoint and printk/pr_alert
> > > > > > > > > > > in the functions that will be surely executed do not
> > > > > > > > > > > work. The stack trace is in the following. I wrote
> > > > > > > > > > > this email to ask for some suggestions on how to
> > > > > > > > > > > debug such cases?
> > > > > > > > > > >
> > > > > > > > > > > Thanks very much. Looking forward to your reply.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hi, Dongliang!
> > > > > > > > > >
> > > > > > > > > > This bug is not similar to others on the dashboard. I
> > > > > > > > > > spent some time debugging it a week ago. The main
> > > > > > > > > > problem here, that memory allocation happens in the
> > > > > > > > > > boot time:
> > > > > > > > > >
> > > > > > > > > > > [<ffffffff84359255>] kernel_init+0xc/0x1a7
> > > > > > > > > > > init/main.c:1447
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Oh, nice catch. No wonder why my debugging does not work.
> > > > > > > > > :(
> > > > > > > > >
> > > > > > > > > > and reproducer simply tries to
> > > > > > > > > > free this data. You can use ftrace to look at it. Smth
> > > > > > > > > > like this:
> > > > > > > > > >
> > > > > > > > > > $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter
> > > > > > > > >
> > > > > > > > > Thanks for your suggestion.
> > > > > > > > >
> > > > > > > > > Do you have any conclusions about this case? If you have
> > > > > > > > > found out the root cause and start writing patches, I
> > > > > > > > > will turn my focus to other cases.
> > > > > > > >
> > > > > > > > No, I had some busy days and I have nothing about this bug
> > > > > > > > for now. I've just traced the reproducer execution and
> > > > > > > > that's all :)
> > > > > > > >
> > > > > > > > I guess, some error handling paths are broken, but Im not
> > > > > > > > sure
> > > > > > >
> > > > > > > In the beginning, I agreed with you. However, after I manually
> > > > > > > checked functions: hwsim_probe (initialization) and
> > > > > > > hwsim_remove (cleanup), then things may be different. The
> > > > > > > cleanup looks correct to me. I would like to debug but stuck
> > > > > > > with the debugging process.
> > > > > > >
> > > > > > > And there is another issue: the cleanup function also does not
> > > > > > > output anything or hit the breakpoint. I don't quite
> > > > > > > understand it since the cleanup is not at the boot time.
> > > > > > >
> > > > > > > Any idea?
> > > > > > >
> > > > > >
> > > > > > Output from ftrace (syzkaller repro):
> > > > > >
> > > > > > root@syzkaller:~# cat /sys/kernel/tracing/trace
> > > > > > # tracer: function_graph
> > > > > > #
> > > > > > # CPU  DURATION                  FUNCTION CALLS
> > > > > > # |     |   |                     |   |   |   |
> > > > > >  1)               |  hwsim_del_radio_nl() {
> > > > > >  1)               |    hwsim_del() {
> > > > > >  1)               |      hwsim_edge_unsubscribe_me() {
> > > > > >  1) ! 310.041 us  |        hwsim_free_edge();
> > > > > >  1) ! 665.221 us  |      }
> > > > > >  1) * 52999.05 us |    }
> > > > > >  1) * 53035.38 us |  }
> > > > > >
> > > > > > Cleanup function is not the case, I think :)
> > > > >
> > > > > It seems like I spot the incorrect cleanup function (hwsim_remove
> > > > > is the right one is in my mind). Let me learn how to use ftrace
> > > > > to log the executed functions and then discuss this case with you
> > > > > guys.
> > > > >
> > > >
> > > > Hmmm, I think, there is a mess with lists.
> > > >
> > > > I just want to share my debug results, I have no idea about the fix
> > > > for now.
> > > >
> > > > In hwsim_probe() edge for phy->idx = 1 is allocated, then reproduces
> > > > sends a request to delete phy with idx == 0, so this check in
> > > > hwsim_edge_unsubscribe_me():
> > > >
> > > >         if (e->endpoint->idx == phy->idx) {
> > > >                 ... clean up code ...
> > > >         }
> > > >
> > > > won't be passed and edge won't be freed (because it was allocated
> > > > for phy with idx == 1). Allocated edge for phy 1 becomes leaked
> > > > after hwsim_del(). I can't really see the code where phy with idx
> > > > == 1 can be deleted from list...
> > >
> > > Thanks for sharing your debugging result.
> > >
> > >               hwsim_phys
> > >                        |
> > >    ---------------------------------
> > >    |                                      |
> > > sub0 (edges)                 sub1 (edges)
> > >    ----> e (idx = 1)               ----> e (idx = 0)
> > >
> > > hwsim_del_radio_nl will call hwsim_del to delete phy (idx:1).
> > > However, in this function, it only deletes the e in the edge list of
> > > sub1. Then it deletes phy (i.e., sub0) from the hwsim_phys list. So it
> > > leaves the e in the edge list of sub0 non-free.
> > >
> > > I proposed a patch and test it successfully in the syzbot dashboard.
> > >
> >
> > Cool! I thougth about similar fix before going to bed, but I had really
> > busy morning today :)
> >
> > > diff --git a/drivers/net/ieee802154/mac802154_hwsim.c
> > > b/drivers/net/ieee802154/mac802154_hwsim.c
> > > index da9135231c07..b05159cff33a 100644
> > > --- a/drivers/net/ieee802154/mac802154_hwsim.c
> > > +++ b/drivers/net/ieee802154/mac802154_hwsim.c
> > > @@ -824,9 +824,16 @@ static int hwsim_add_one(struct genl_info *info,
> > > struct device *dev,
> > >  static void hwsim_del(struct hwsim_phy *phy)
> > >  {
> > >   struct hwsim_pib *pib;
> > > + struct hwsim_edge *e;
> > >
> > >   hwsim_edge_unsubscribe_me(phy);
> > >
> > > + // remove the edges in the list
> > > + list_for_each_entry_rcu(e, &phy->edges, list) {
> > > + list_del_rcu(&e->list);
> > > + hwsim_free_edge(e);
> > > + }
> > > +
> >
> > I think, rcu_read_lock() and rcu_read_unlock() are needed here (like in
> > hwsim_edge_unsubscribe_me()). Or you can delete this edges after deleting
> > phy node from global list, then, i guess, rcu locking won't be needed
> > here.
>
> Yes, you're right. rcu_read_lock is needed here. However, from the
> code below list_del(&phy->list), I think we'd better still add
> rcu_read_lock for those statements.
>
> How do you think about the following patch? BTW, I've sent a patch
> with the prefix PATCH. Maybe we can discuss this patch there.
>
> diff --git a/drivers/net/ieee802154/mac802154_hwsim.c
> b/drivers/net/ieee802154/mac802154_hwsim.c
> index da9135231c07..cf659361a3fb 100644
> --- a/drivers/net/ieee802154/mac802154_hwsim.c
> +++ b/drivers/net/ieee802154/mac802154_hwsim.c
> @@ -824,12 +824,17 @@ static int hwsim_add_one(struct genl_info *info,
> struct device *dev,
>  static void hwsim_del(struct hwsim_phy *phy)
>  {
>         struct hwsim_pib *pib;
> +       struct hwsim_edge *e;
>
>         hwsim_edge_unsubscribe_me(phy);
>
>         list_del(&phy->list);
>
>         rcu_read_lock();
> +       list_for_each_entry_rcu(e, &phy->edges, list) {
> +               list_del_rcu(&e->list);
> +               hwsim_free_edge(e);
> +       }
>         pib = rcu_dereference(phy->pib);
>         rcu_read_unlock();
>

I have sent a v2 patch to the mailing list, please discuss the patch
in the corresponding thread [1].

[1] https://lkml.org/lkml/2021/6/15/1585.

> >
> > >   list_del(&phy->list);
> > >
> > >   rcu_read_lock();
> > >
> > >  I will send a patch later.
> > >
> > >
> > > >
> > > > Maybe, it's kmemleak bug. Similar strange case was with this one
> > > > https://syzkaller.appspot.com/bug?id=3a325b8389fc41c1bc94de0f4ac437ed13cce584.
> > > > I find it strange, that I could reach leaked pointers after
> > > > kmemleak reported a leak. Im not familiar with kmemleak internals
> > > > and I might be wrong
> > > >
> > > >
> > > > With regards,
> > > > Pavel Skripkin
> >
> >
> >
> >
> > With regards,
> > Pavel Skripkin

      reply	other threads:[~2021-06-16  2:11 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-14 13:22 Suggestions on how to debug kernel crashes where printk and gdb both does not work Dongliang Mu
2021-06-14 13:34 ` Pavel Skripkin
2021-06-14 14:19   ` Dongliang Mu
2021-06-14 14:25     ` Pavel Skripkin
2021-06-14 14:40       ` Dongliang Mu
2021-06-14 14:47         ` Pavel Skripkin
2021-06-14 15:04           ` Dongliang Mu
2021-06-14 20:30             ` Pavel Skripkin
2021-06-15 10:37               ` Dongliang Mu
2021-06-15 13:34                 ` Pavel Skripkin
2021-06-16  2:02                   ` Dongliang Mu
2021-06-16  2:11                     ` Dongliang Mu [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAD-N9QWH5ewqQrmo-h5Em9W=+kDB4JO0x==vE=hOH9f4MhQbJg@mail.gmail.com' \
    --to=mudongliangabcd@gmail.com \
    --cc=alex.aring@gmail.com \
    --cc=dan.carpenter@oracle.com \
    --cc=davem@davemloft.net \
    --cc=gregkh@linuxfoundation.org \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-wpan@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=paskripkin@gmail.com \
    --cc=stefan@datenfreihafen.org \
    --cc=syzbot+b80c9959009a9325cdff@syzkaller.appspotmail.com \
    --cc=syzkaller-bugs@googlegroups.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.