linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Chris J Arges <chris.j.arges@canonical.com>
To: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Rafael David Tinoco <inaddy@ubuntu.com>,
	Peter Anvin <hpa@zytor.com>,
	Jiang Liu <jiang.liu@linux.intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>, Jens Axboe <axboe@kernel.dk>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Gema Gomez <gema.gomez-solano@canonical.com>,
	the arch/x86 maintainers <x86@kernel.org>
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks
Date: Tue, 7 Apr 2015 15:59:46 -0500	[thread overview]
Message-ID: <20150407205945.GA28212@canonical.com> (raw)
In-Reply-To: <20150407092121.GA9971@gmail.com>

On Tue, Apr 07, 2015 at 11:21:21AM +0200, Ingo Molnar wrote:
> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > On Mon, Apr 6, 2015 at 9:58 AM, Chris J Arges
> > <chris.j.arges@canonical.com> wrote:
> > >
> > > I noticed that with this patch it never reached 'csd: Detected
> > > non-responsive CSD lock...' because it seems that ts_delta never reached
> > > CSD_LOCK_TIMEOUT. I tried adjusting the TIMEOUT value and still got a
> > > hang without reaching this statement. I made the ts0,ts1 values global
> > > and put a counter into the while loop and found that the loop iterated
> > > about 670 million times before the softlockup was detected. In addition
> > > ts0 and ts1 both had the same values upon soft lockup, and thus would
> > > never trip the CSD_LOCK_TIMEOUT check.
> > 
> > Sounds like jiffies stops updating. Which doesn't sound unreasonable 
> > for when there is some IPI problem.
> 
> Yeah - although it weakens the 'IPI lost spuriously' hypothesis: we 
> ought to have irqs enabled here which normally doesn't stop jiffies 
> from updating, and the timer interrupt stopping suggests a much deeper 
> problem than just some lost IPI ...
> 
> > 
> > How about just changing the debug patch to count iterations, and 
> > print out a warning when it reaches ten million or so.
> 
> Yeah, or replace jiffies_to_ms() with:
> 
> 	sched_clock()/1000000
> 
> sched_clock() should be safe to call in these codepaths.
> 
> Like the attached patch. (Totally untested.)
> 

Ingo,

Looks like sched_lock() works in this case.

Adding the dump_stack() line caused various issues such as the VM oopsing on
boot or the softlockup never being detected properly (and thus not crashing).
So the below is running with your patch and 'dump_stack()' commented out.

Here is the log leading up to the soft lockup (I adjusted CSD_LOCK_TIMEOUT to 5s):
[   22.669630] kvm [1523]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
[   38.712710] csd: Detected non-responsive CSD lock (#1) on CPU#00, waiting 5.000 secs for CPU#01
[   38.712715] csd: Re-sending CSD lock (#1) IPI from CPU#00 to CPU#01
[   43.712709] csd: Detected non-responsive CSD lock (#2) on CPU#00, waiting 5.000 secs for CPU#01
[   43.712713] csd: Re-sending CSD lock (#2) IPI from CPU#00 to CPU#01
[   48.712708] csd: Detected non-responsive CSD lock (#3) on CPU#00, waiting 5.000 secs for CPU#01
[   48.712732] csd: Re-sending CSD lock (#3) IPI from CPU#00 to CPU#01
[   53.712708] csd: Detected non-responsive CSD lock (#4) on CPU#00, waiting 5.000 secs for CPU#01
[   53.712712] csd: Re-sending CSD lock (#4) IPI from CPU#00 to CPU#01
[   58.712707] csd: Detected non-responsive CSD lock (#5) on CPU#00, waiting 5.000 secs for CPU#01
[   58.712712] csd: Re-sending CSD lock (#5) IPI from CPU#00 to CPU#01
[   60.080005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ksmd:26]

Still we never seem to release the lock, even when resending the IPI.

Looking at the call_single_queue I see the following (I crashed during the soft lockup):

crash> p call_single_queue
PER-CPU DATA TYPE:
  struct llist_head call_single_queue;
PER-CPU ADDRESSES:
  [0]: ffff88013fc16580
  [1]: ffff88013fd16580
crash> list -s call_single_data ffff88013fc16580
ffff88013fc16580
struct call_single_data {
  llist = {
    next = 0x0
  }, 
  func = 0x0, 
  info = 0x0, 
  flags = 0
}
crash> list -s call_single_data ffff88013fd16580
ffff88013fd16580
struct call_single_data {
  llist = {
    next = 0xffff88013a517c08
  }, 
  func = 0x0, 
  info = 0x0, 
  flags = 0
}
ffff88013a517c08
struct call_single_data {
  llist = {
    next = 0x0
  }, 
  func = 0xffffffff81067f30 <flush_tlb_func>, 
  info = 0xffff88013a517d00, 
  flags = 3
}

This seems consistent with previous crash dumps.


As I mentioned here: https://lkml.org/lkml/2015/4/6/186
I'm able to reproduce this easily on certain hardware w/
b6b8a1451fc40412c57d10c94b62e22acab28f94 applied and not
9242b5b60df8b13b469bc6b7be08ff6ebb551ad3 on the L0 kernel. I think it makes
sense to get as clear of a picture with this more trivial reproducer, then
re-run this on a L0 w/ v4.0-rcX. Most likely the later case will take many days
to reproduce.

Thanks,
--chris
 

  reply	other threads:[~2015-04-07 21:00 UTC|newest]

Thread overview: 75+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-02-11 13:19 smp_call_function_single lockups Rafael David Tinoco
2015-02-11 18:18 ` Linus Torvalds
2015-02-11 19:59   ` Linus Torvalds
2015-02-11 20:42     ` Linus Torvalds
2015-02-12 16:38       ` Rafael David Tinoco
2015-02-18 22:25       ` Peter Zijlstra
2015-02-19 15:42         ` Rafael David Tinoco
2015-02-19 16:14           ` Linus Torvalds
2015-02-23 14:01             ` Rafael David Tinoco
2015-02-23 19:32               ` Linus Torvalds
2015-02-23 20:50                 ` Peter Zijlstra
2015-02-23 21:02                   ` Rafael David Tinoco
2015-02-19 16:16           ` Peter Zijlstra
2015-02-19 16:26           ` Linus Torvalds
2015-02-19 16:32             ` Rafael David Tinoco
2015-02-19 16:59               ` Linus Torvalds
2015-02-19 17:30                 ` Rafael David Tinoco
2015-02-19 17:39                 ` Linus Torvalds
2015-02-19 20:29                   ` Linus Torvalds
2015-02-19 21:59                     ` Linus Torvalds
2015-02-19 22:45                       ` Linus Torvalds
2015-03-31  3:15                         ` Chris J Arges
2015-03-31  4:28                           ` Linus Torvalds
2015-03-31 10:56                             ` [debug PATCHes] " Ingo Molnar
2015-03-31 22:38                               ` Chris J Arges
2015-04-01 12:39                                 ` Ingo Molnar
2015-04-01 14:10                                   ` Chris J Arges
2015-04-01 14:55                                     ` Ingo Molnar
2015-03-31  4:46                           ` Linus Torvalds
2015-03-31 15:08                           ` Linus Torvalds
2015-03-31 22:23                             ` Chris J Arges
2015-03-31 23:07                               ` Linus Torvalds
2015-04-01 14:32                                 ` Chris J Arges
2015-04-01 15:36                                   ` Linus Torvalds
2015-04-02  9:55                                     ` Ingo Molnar
2015-04-02 17:35                                       ` Linus Torvalds
2015-04-01 12:43                               ` Ingo Molnar
2015-04-01 16:10                                 ` Chris J Arges
2015-04-01 16:14                                   ` Linus Torvalds
2015-04-01 21:59                                     ` Chris J Arges
2015-04-02 17:31                                       ` Linus Torvalds
2015-04-02 18:26                                         ` Ingo Molnar
2015-04-02 18:51                                           ` Chris J Arges
2015-04-02 19:07                                             ` Ingo Molnar
2015-04-02 20:57                                               ` Linus Torvalds
2015-04-02 21:13                                               ` Chris J Arges
2015-04-03  5:43                                                 ` [PATCH] smp/call: Detect stuck CSD locks Ingo Molnar
2015-04-03  5:47                                                   ` Ingo Molnar
2015-04-06 16:58                                                   ` Chris J Arges
2015-04-06 17:32                                                     ` Linus Torvalds
2015-04-07  9:21                                                       ` Ingo Molnar
2015-04-07 20:59                                                         ` Chris J Arges [this message]
2015-04-07 21:15                                                           ` Linus Torvalds
2015-04-08  6:47                                                           ` Ingo Molnar
2015-04-13  3:56                                                             ` Chris J Arges
2015-04-13  6:14                                                               ` Ingo Molnar
2015-04-15 19:54                                                                 ` Chris J Arges
2015-04-16 11:04                                                                   ` Ingo Molnar
2015-04-16 15:58                                                                     ` Chris J Arges
2015-04-16 16:31                                                                       ` Ingo Molnar
2015-04-29 21:08                                                                         ` Chris J Arges
2015-05-11 14:00                                                                           ` Ingo Molnar
2015-05-20 18:19                                                                             ` Chris J Arges
2015-04-03  5:45                                                 ` smp_call_function_single lockups Ingo Molnar
2015-04-06 17:23                                         ` Chris J Arges
2015-02-20  9:30                     ` Ingo Molnar
2015-02-20 16:49                       ` Linus Torvalds
2015-02-20 19:41                         ` Ingo Molnar
2015-02-20 20:03                           ` Linus Torvalds
2015-02-20 20:11                             ` Ingo Molnar
2015-03-20 10:15       ` Peter Zijlstra
2015-03-20 16:26         ` Linus Torvalds
2015-03-20 17:14           ` Mike Galbraith
2015-04-01 14:22       ` Frederic Weisbecker
2015-04-18 10:13       ` [tip:locking/urgent] smp: Fix smp_call_function_single_async() locking tip-bot for Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150407205945.GA28212@canonical.com \
    --to=chris.j.arges@canonical.com \
    --cc=axboe@kernel.dk \
    --cc=fweisbec@gmail.com \
    --cc=gema.gomez-solano@canonical.com \
    --cc=hpa@zytor.com \
    --cc=inaddy@ubuntu.com \
    --cc=jiang.liu@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=torvalds@linux-foundation.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).