From: Tejun Heo <tj@kernel.org>
To: Ben Greear <greearb@candelatech.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>,
	Rusty Russell <rusty@rustcorp.com.au>,
	Joe Lawrence <joe.lawrence@stratus.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	stable@vger.kernel.org,
	"Luis R. Rodriguez" <mcgrof@qca.qualcomm.com>,
	Jouni Malinen <jouni@qca.qualcomm.com>,
	Vasanthakumar Thiagarajan <vthiagar@qca.qualcomm.com>,
	Senthil Balasubramanian <senthilb@qca.qualcomm.com>,
	linux-wireless@vger.kernel.org, ath9k-devel@venema.h4ckr.net,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>
Subject: Re: stop_machine lockup issue in 3.9.y.
Date: Thu, 6 Jun 2013 13:55:14 -0700
Message-ID: <20130606205514.GC5045@htj.dyndns.org>
In-Reply-To: <51B004CD.6080007@candelatech.com>

Hello, Ben.

On Wed, Jun 05, 2013 at 08:41:01PM -0700, Ben Greear wrote:
> On 06/05/2013 08:26 PM, Eric Dumazet wrote:
> >On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote:
> >>Ah, so, that's why it's showing up now.  We probably have had the same
> >>issue all along but it used to be masked by the softirq limiting.  Do
> >>you care to revive the 10 iterations limit so that it's limited by
> >>both the count and timing?  We do wanna find out why softirq is
> >>spinning indefinitely tho.
> >
> >Yes, no problem, I can do that.
> 
> Limiting it to 5000 fixes my problem, so if you wanted it larger than 10, that would
> be fine by me.

First of all, kudos for tracking the issue down.  While the removal of
the looping limit in softirq handling was the direct cause of the
problem becoming visible, it's very bothersome that we have a softirq
runaway at all.  Finding the perpetrator shouldn't be hard.  Something
like the following should work (untested).  Once we know which softirq
it is (probably the network one), we can dig deeper.

Thanks.

diff --git a/kernel/softirq.c b/kernel/softirq.c
index b5197dc..5af3682 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -212,6 +212,7 @@ asmlinkage void __do_softirq(void)
 	unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
 	int cpu;
 	unsigned long old_flags = current->flags;
+	int cnt = 0;
 
 	/*
 	 * Mask out PF_MEMALLOC s current task context is borrowed for the
@@ -244,6 +245,9 @@ restart:
 			kstat_incr_softirqs_this_cpu(vec_nr);
 
 			trace_softirq_entry(vec_nr);
+			if (++cnt >= 5000 && cnt < 5010)
+				printk("XXX __do_softirq: stuck handling softirqs, cnt=%d action=%pf\n",
+				       cnt, h->action);
 			h->action(h);
 			trace_softirq_exit(vec_nr);
 			if (unlikely(prev_count != preempt_count())) {


-- 
tejun
