All of lore.kernel.org
 help / color / mirror / Atom feed
From: Rocco.Yue <rocco.yue@mediatek.com>
To: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: "David S . Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	Matthias Brugger <matthias.bgg@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Masahiro Yamada <masahiroy@kernel.org>,
	"Nick Desaulniers" <ndesaulniers@google.com>,
	Andy Shevchenko <andriy.shevchenko@linux.intel.com>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
	Peter Enderborg <peter.enderborg@sony.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	"Vitor Massaru Iha" <vitor@massaru.org>,
	Sedat Dilek <sedat.dilek@gmail.com>,
	"Wei Yang" <richard.weiyang@gmail.com>,
	Cong Wang <cong.wang@bytedance.com>,
	"Di Zhu" <zhudi21@huawei.com>,
	Stephen Hemminger <stephen@networkplumber.org>,
	Francis Laniel <laniel_francis@privacyrequired.com>,
	Roopa Prabhu <roopa@cumulusnetworks.com>,
	Andrii Nakryiko <andrii@kernel.org>,
	"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
	netdev <netdev@vger.kernel.org>,
	linux-arm Mailing List <linux-arm-kernel@lists.infradead.org>,
	"moderated list:ARM/Mediatek SoC support"
	<linux-mediatek@lists.infradead.org>, <wsd_upsream@mediatek.com>
Subject: Re: [PATCH] rtnetlink: add rtnl_lock debug log
Date: Sat, 8 May 2021 16:53:32 +0800	[thread overview]
Message-ID: <1620464012.29475.76.camel@mbjsdccf07> (raw)
In-Reply-To: <CAHp75Vc=GMv5dJ1dJYr=B3W6c+nuPCfXa4wxLJOYPTuqrMskFg@mail.gmail.com>

> > We often encounter system hangs caused by certain processes
> > holding rtnl_lock for a long time. Even if there is a lock
> > detection mechanism in Linux, it is a bit troublesome and
> > affects the system performance. We hope to add a lightweight
> > debugging mechanism for detecting rtnl_lock.
> >
> > Up to now, we have discovered and solved some potential bugs
> > through such debug information of this lightweight rtnl_lock,
> > which is helpful for us.
> >
> > When you say Y for RTNL_LOCK_DEBUG, then the kernel will detect
> > if any function hold rtnl_lock too long and some key information
> > will be printed to help identify the issue point.
> >
> > i.e: from the following logs, we can clear know that the pid=5546
> 
> clearly
> 
> > RfxSender_4 process hold rtnl_lock for a long time, causing the
> 
> holds
> 
> > system hang. And we can also speculate that the delay operation
> 
> to hang
> 

Thanks for your review, Andy.

I Will fix them.

> > may be performed in devinet_ioctl(), resulting in rtnl_lock was
> > not released in time.
> >
> > <6>[  141.151364] ----------- rtnl_print_btrace start -----------
> 
> Can you, please, shrink this to the point?
> 

Will shrink these points.

> > <6>[  141.152079] RfxSender_4[5546][R] hold rtnl_lock more than 2 sec,
> > start time: 139129481562
> > <4>[  141.153114]  rtnl_lock+0x88/0xfc
> > <4>[  141.153523]  devinet_ioctl+0x190/0x1268
> > <4>[  141.154007]  inet_ioctl+0x108/0x1f4
> > <4>[  141.154449]  sock_do_ioctl+0x88/0x200
> > <4>[  141.154911]  sock_ioctl+0x4b0/0x884
> > <4>[  141.155367]  do_vfs_ioctl+0x6b0/0xcc4
> > <4>[  141.155830]  __arm64_sys_ioctl+0xc0/0xec
> > <4>[  141.156326]  el0_svc_common+0x130/0x2c0
> > <4>[  141.156810]  el0_svc_handler+0xd0/0xe0
> > <4>[  141.157283]  el0_svc+0x8/0xc
> > <4>[  141.157646] Call trace:
> > <4>[  141.157956]  dump_backtrace+0x0/0x240
> > <4>[  141.158418]  show_stack+0x18/0x24
> > <4>[  141.158836]  rtnl_print_btrace+0x138/0x1cc
> > <4>[  141.159362]  call_timer_fn+0x120/0x47c
> > <4>[  141.159834]  expire_timers+0x28c/0x420
> > <4>[  141.160306]  __run_timers+0x3d0/0x494
> > <4>[  141.160768]  run_timer_softirq+0x24/0x48
> > <4>[  141.161262]  __do_softirq+0x26c/0x968
> > <4>[  141.161725]  irq_exit+0x1f8/0x2b4
> > <4>[  141.162145]  __handle_domain_irq+0xdc/0x15c
> > <4>[  141.162672]  gic_handle_irq+0xe4/0x188
> > <4>[  141.163144]  el1_irq+0x104/0x200
> > <4>[  141.163559]  __const_udelay+0x118/0x1b0
> > <4>[  141.164044]  devinet_ioctl+0x1a0/0x1268
> > <4>[  141.164527]  inet_ioctl+0x108/0x1f4
> > <4>[  141.164968]  sock_do_ioctl+0x88/0x200
> > <4>[  141.165428]  sock_ioctl+0x4b0/0x884
> > <4>[  141.165868]  do_vfs_ioctl+0x6b0/0xcc4
> > <4>[  141.166330]  __arm64_sys_ioctl+0xc0/0xec
> > <4>[  141.166825]  el0_svc_common+0x130/0x2c0
> > <4>[  141.167308]  el0_svc_handler+0xd0/0xe0
> > <4>[  141.167786]  el0_svc+0x8/0xc
> > <6>[  141.168153] ------------ rtnl_print_btrace end -----------
> >
> > <6>[  147.321389] rtnl_lock is held by [5546] from
> > [139129481562] to [147321378812]
> 
> 
> ...
> 
> > +static struct rtnl_debug_btrace_t rtnl_instance = {
> > +       .task           = NULL,
> > +       .pid            = 0,
> > +       .start_time     = 0,
> > +       .end_time       = 0,
> > +       .nr_entries     = 0,
> 
> static assumes all 0:s, what's the point?
> 

will fix it in the patch v2.

> > +};
> 
> ...
> 
> > +static void rtnl_print_btrace(struct timer_list *unused)
> > +{
> > +       pr_info("----------- %s start -----------\n", __func__);
> > +       pr_info("%s[%d][%c] hold rtnl_lock more than 2 sec, start time: %llu\n",
> > +               rtnl_instance.task->comm,
> > +               rtnl_instance.pid,
> > +               task_state_to_char(rtnl_instance.task),
> > +               rtnl_instance.start_time);
> > +       stack_trace_print(rtnl_instance.addrs, rtnl_instance.nr_entries, 0);
> 
> > +       show_stack(rtnl_instance.task, NULL, KERN_DEBUG);
> 
> Unaligned debug level.
> 

KERN_INFO is expected, I will fix it in patch v2.

> > +       pr_info("------------ %s end -----------\n", __func__);
> 
> Looking into tons of these, perhaps you need to define pr_fmt(). I
> haven't checked if it's already defined, though.
> 

In the files related rtnetlink, pr_fmt() is not defined. I am not sure
if I can define one since there are other error logs in the rtnetlink.c.

In addition, there is "rtnl_lock" field in the log, I usually use it to
retrieve whether rtnl_lock is held for a long time.

If adding pr_fmt() is better, it is ok to me :-)

> > +}
> 
> ...
> 
> > +       if (rtnl_instance.end_time - rtnl_instance.start_time > 2000000000ULL) {
> 
> Perhaps you should use one of the defined constants from time64.h ?
> 

Will have a separate patch for this.

> > +               pr_info("rtnl_lock is held by [%d] from [%llu] to [%llu]\n",
> > +                       rtnl_instance.pid,
> > +                       rtnl_instance.start_time,
> > +                       rtnl_instance.end_time);
> > +       }
> 

Best Regard
Rocco

_______________________________________________
Linux-mediatek mailing list
Linux-mediatek@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-mediatek

WARNING: multiple messages have this Message-ID (diff)
From: Rocco.Yue <rocco.yue@mediatek.com>
To: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: "David S . Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	Matthias Brugger <matthias.bgg@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Masahiro Yamada <masahiroy@kernel.org>,
	"Nick Desaulniers" <ndesaulniers@google.com>,
	Andy Shevchenko <andriy.shevchenko@linux.intel.com>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
	Peter Enderborg <peter.enderborg@sony.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	"Vitor Massaru Iha" <vitor@massaru.org>,
	Sedat Dilek <sedat.dilek@gmail.com>,
	"Wei Yang" <richard.weiyang@gmail.com>,
	Cong Wang <cong.wang@bytedance.com>,
	"Di Zhu" <zhudi21@huawei.com>,
	Stephen Hemminger <stephen@networkplumber.org>,
	Francis Laniel <laniel_francis@privacyrequired.com>,
	Roopa Prabhu <roopa@cumulusnetworks.com>,
	Andrii Nakryiko <andrii@kernel.org>,
	"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
	netdev <netdev@vger.kernel.org>,
	linux-arm Mailing List <linux-arm-kernel@lists.infradead.org>,
	"moderated list:ARM/Mediatek SoC support"
	<linux-mediatek@lists.infradead.org>, <wsd_upsream@mediatek.com>
Subject: Re: [PATCH] rtnetlink: add rtnl_lock debug log
Date: Sat, 8 May 2021 16:53:32 +0800	[thread overview]
Message-ID: <1620464012.29475.76.camel@mbjsdccf07> (raw)
In-Reply-To: <CAHp75Vc=GMv5dJ1dJYr=B3W6c+nuPCfXa4wxLJOYPTuqrMskFg@mail.gmail.com>

> > We often encounter system hangs caused by certain processes
> > holding rtnl_lock for a long time. Even if there is a lock
> > detection mechanism in Linux, it is a bit troublesome and
> > affects the system performance. We hope to add a lightweight
> > debugging mechanism for detecting rtnl_lock.
> >
> > Up to now, we have discovered and solved some potential bugs
> > through such debug information of this lightweight rtnl_lock,
> > which is helpful for us.
> >
> > When you say Y for RTNL_LOCK_DEBUG, then the kernel will detect
> > if any function hold rtnl_lock too long and some key information
> > will be printed to help identify the issue point.
> >
> > i.e: from the following logs, we can clear know that the pid=5546
> 
> clearly
> 
> > RfxSender_4 process hold rtnl_lock for a long time, causing the
> 
> holds
> 
> > system hang. And we can also speculate that the delay operation
> 
> to hang
> 

Thanks for your review, Andy.

I Will fix them.

> > may be performed in devinet_ioctl(), resulting in rtnl_lock was
> > not released in time.
> >
> > <6>[  141.151364] ----------- rtnl_print_btrace start -----------
> 
> Can you, please, shrink this to the point?
> 

Will shrink these points.

> > <6>[  141.152079] RfxSender_4[5546][R] hold rtnl_lock more than 2 sec,
> > start time: 139129481562
> > <4>[  141.153114]  rtnl_lock+0x88/0xfc
> > <4>[  141.153523]  devinet_ioctl+0x190/0x1268
> > <4>[  141.154007]  inet_ioctl+0x108/0x1f4
> > <4>[  141.154449]  sock_do_ioctl+0x88/0x200
> > <4>[  141.154911]  sock_ioctl+0x4b0/0x884
> > <4>[  141.155367]  do_vfs_ioctl+0x6b0/0xcc4
> > <4>[  141.155830]  __arm64_sys_ioctl+0xc0/0xec
> > <4>[  141.156326]  el0_svc_common+0x130/0x2c0
> > <4>[  141.156810]  el0_svc_handler+0xd0/0xe0
> > <4>[  141.157283]  el0_svc+0x8/0xc
> > <4>[  141.157646] Call trace:
> > <4>[  141.157956]  dump_backtrace+0x0/0x240
> > <4>[  141.158418]  show_stack+0x18/0x24
> > <4>[  141.158836]  rtnl_print_btrace+0x138/0x1cc
> > <4>[  141.159362]  call_timer_fn+0x120/0x47c
> > <4>[  141.159834]  expire_timers+0x28c/0x420
> > <4>[  141.160306]  __run_timers+0x3d0/0x494
> > <4>[  141.160768]  run_timer_softirq+0x24/0x48
> > <4>[  141.161262]  __do_softirq+0x26c/0x968
> > <4>[  141.161725]  irq_exit+0x1f8/0x2b4
> > <4>[  141.162145]  __handle_domain_irq+0xdc/0x15c
> > <4>[  141.162672]  gic_handle_irq+0xe4/0x188
> > <4>[  141.163144]  el1_irq+0x104/0x200
> > <4>[  141.163559]  __const_udelay+0x118/0x1b0
> > <4>[  141.164044]  devinet_ioctl+0x1a0/0x1268
> > <4>[  141.164527]  inet_ioctl+0x108/0x1f4
> > <4>[  141.164968]  sock_do_ioctl+0x88/0x200
> > <4>[  141.165428]  sock_ioctl+0x4b0/0x884
> > <4>[  141.165868]  do_vfs_ioctl+0x6b0/0xcc4
> > <4>[  141.166330]  __arm64_sys_ioctl+0xc0/0xec
> > <4>[  141.166825]  el0_svc_common+0x130/0x2c0
> > <4>[  141.167308]  el0_svc_handler+0xd0/0xe0
> > <4>[  141.167786]  el0_svc+0x8/0xc
> > <6>[  141.168153] ------------ rtnl_print_btrace end -----------
> >
> > <6>[  147.321389] rtnl_lock is held by [5546] from
> > [139129481562] to [147321378812]
> 
> 
> ...
> 
> > +static struct rtnl_debug_btrace_t rtnl_instance = {
> > +       .task           = NULL,
> > +       .pid            = 0,
> > +       .start_time     = 0,
> > +       .end_time       = 0,
> > +       .nr_entries     = 0,
> 
> static assumes all 0:s, what's the point?
> 

will fix it in the patch v2.

> > +};
> 
> ...
> 
> > +static void rtnl_print_btrace(struct timer_list *unused)
> > +{
> > +       pr_info("----------- %s start -----------\n", __func__);
> > +       pr_info("%s[%d][%c] hold rtnl_lock more than 2 sec, start time: %llu\n",
> > +               rtnl_instance.task->comm,
> > +               rtnl_instance.pid,
> > +               task_state_to_char(rtnl_instance.task),
> > +               rtnl_instance.start_time);
> > +       stack_trace_print(rtnl_instance.addrs, rtnl_instance.nr_entries, 0);
> 
> > +       show_stack(rtnl_instance.task, NULL, KERN_DEBUG);
> 
> Unaligned debug level.
> 

KERN_INFO is expected, I will fix it in patch v2.

> > +       pr_info("------------ %s end -----------\n", __func__);
> 
> Looking into tons of these, perhaps you need to define pr_fmt(). I
> haven't checked if it's already defined, though.
> 

In the files related rtnetlink, pr_fmt() is not defined. I am not sure
if I can define one since there are other error logs in the rtnetlink.c.

In addition, there is "rtnl_lock" field in the log, I usually use it to
retrieve whether rtnl_lock is held for a long time.

If adding pr_fmt() is better, it is ok to me :-)

> > +}
> 
> ...
> 
> > +       if (rtnl_instance.end_time - rtnl_instance.start_time > 2000000000ULL) {
> 
> Perhaps you should use one of the defined constants from time64.h ?
> 

Will have a separate patch for this.

> > +               pr_info("rtnl_lock is held by [%d] from [%llu] to [%llu]\n",
> > +                       rtnl_instance.pid,
> > +                       rtnl_instance.start_time,
> > +                       rtnl_instance.end_time);
> > +       }
> 

Best Regard
Rocco

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

  reply	other threads:[~2021-05-08  9:06 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-29  7:02 [PATCH] rtnetlink: add rtnl_lock debug log Rocco Yue
2021-04-29  7:02 ` Rocco Yue
2021-04-29  7:02 ` Rocco Yue
2021-04-29  9:32 ` Andy Shevchenko
2021-04-29  9:32   ` Andy Shevchenko
2021-04-29  9:32   ` Andy Shevchenko
2021-05-08  8:53   ` Rocco.Yue [this message]
2021-05-08  8:53     ` Rocco.Yue
2021-04-29 12:07 ` kernel test robot
2021-04-29 12:07   ` kernel test robot
2021-04-30 11:48 ` Tetsuo Handa
2021-04-30 11:48   ` Tetsuo Handa
2021-04-30 11:48   ` Tetsuo Handa
2021-05-08  8:54   ` Rocco.Yue
2021-05-08  8:54     ` Rocco.Yue

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1620464012.29475.76.camel@mbjsdccf07 \
    --to=rocco.yue@mediatek.com \
    --cc=akpm@linux-foundation.org \
    --cc=andrii@kernel.org \
    --cc=andriy.shevchenko@linux.intel.com \
    --cc=andy.shevchenko@gmail.com \
    --cc=anshuman.khandual@arm.com \
    --cc=cong.wang@bytedance.com \
    --cc=davem@davemloft.net \
    --cc=kuba@kernel.org \
    --cc=laniel_francis@privacyrequired.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mediatek@lists.infradead.org \
    --cc=masahiroy@kernel.org \
    --cc=matthias.bgg@gmail.com \
    --cc=ndesaulniers@google.com \
    --cc=netdev@vger.kernel.org \
    --cc=penguin-kernel@i-love.sakura.ne.jp \
    --cc=peter.enderborg@sony.com \
    --cc=peterz@infradead.org \
    --cc=richard.weiyang@gmail.com \
    --cc=roopa@cumulusnetworks.com \
    --cc=sedat.dilek@gmail.com \
    --cc=stephen@networkplumber.org \
    --cc=tglx@linutronix.de \
    --cc=vitor@massaru.org \
    --cc=wsd_upsream@mediatek.com \
    --cc=zhudi21@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.