linux-kernel.vger.kernel.org archive mirror
From: Chris J Arges <chris.j.arges@canonical.com>
To: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Rafael David Tinoco <inaddy@ubuntu.com>,
	Peter Anvin <hpa@zytor.com>,
	Jiang Liu <jiang.liu@linux.intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>, Jens Axboe <axboe@kernel.dk>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Gema Gomez <gema.gomez-solano@canonical.com>,
	the arch/x86 maintainers <x86@kernel.org>
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks
Date: Mon, 06 Apr 2015 11:58:49 -0500	[thread overview]
Message-ID: <5522BB49.5060704@canonical.com> (raw)
In-Reply-To: <20150403054320.GA9863@gmail.com>

On 04/03/2015 12:43 AM, Ingo Molnar wrote:
> 
> * Chris J Arges <chris.j.arges@canonical.com> wrote:
> 
>> Ingo,
>>
>> I think tracking IPI calls from 'generic_exec_single' would make a lot
>> of sense. When you say poll for completion do you mean a loop after
>> 'arch_send_call_function_single_ipi' in kernel/smp.c? My main concern
>> would be to not alter the timings too much so we can still reproduce the
>> original problem.
>>
>> Another approach:
>> If we want to check for non-ACKed IPIs a possibility would be to add a
>> timestamp field to 'struct call_single_data' and just record jiffies
>> when the IPI gets called. Then have a per-cpu kthread check the
>> 'call_single_queue' percpu list periodically if (jiffies - timestamp) >
>> THRESHOLD. When we reach that condition print the stale entry in
>> call_single_queue, backtrace, then re-send the IPI.
>>
>> Let me know what makes the most sense to hack on.
> 
> Well, the thing is, putting this polling into an async kernel thread 
> loses a lot of data context and right of reference that we might need 
> to re-send an IPI.
> 
> And if the context is not lost we might as well send it from the 
> original, still-looping context - which is a lot simpler as well.
> 
> ( ... and on a deadlocked non-CONFIG_PREEMPT kernel the kernel thread
>   won't run at all, so it won't be able to detect deadlocks. )
> 
> So I'd really suggest instrumenting the already existing CSD polling, 
> which is already a slowpath, so it won't impact timing much.
>  
> I'd suggest the following, rather unintrusive approach:
> 
>  - first just take a jiffies timestamp and generate a warning message 
>    if more than 10 seconds elapsed after sending the IPI, without 
>    having heard from it.
> 
>  - then the IPI is resent. This means adding a bit of a control flow 
>    to csd_lock_wait().
> 
> Something like the patch below, which implements both steps:
> 
>   - It will detect and print CSD deadlocks both in the single- and 
>     multi- function call APIs, and in the pre-IPI CSD lock wait case 
>     as well.
> 
>   - It will resend an IPI if possible
> 
>   - It generates various messages in the deadlock case that should 
>     give us some idea about how the deadlock played out and whether it 
>     got resolved.
> 
> The timeout is set to 10 seconds, that should be plenty even in a 
> virtualization environment.
> 
> Only very lightly tested under a simple lkvm bootup: I verified that 
> the boilerplate message is displayed, and that it doesn't generate 
> false positive messages in light loads - but I haven't checked whether 
> the deadlock detection works at all.
> 
> Thanks,
> 
> 	Ingo
> ---
> 
>  kernel/smp.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 46 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/smp.c b/kernel/smp.c
> index f38a1e692259..e0eec1ab3ef2 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -98,22 +98,63 @@ void __init call_function_init(void)
>  	register_cpu_notifier(&hotplug_cfd_notifier);
>  }
>  
> +/* Locking timeout in ms: */
> +#define CSD_LOCK_TIMEOUT (10*1000ULL)
> +
> +/* Print this ID in every printk line we output, to be able to easily sort them apart: */
> +static int csd_bug_count;
> +
>  /*
>   * csd_lock/csd_unlock used to serialize access to per-cpu csd resources
>   *
>   * For non-synchronous ipi calls the csd can still be in use by the
>   * previous function call. For multi-cpu calls its even more interesting
>   * as we'll have to ensure no other cpu is observing our csd.
> + *
> + * ( The overhead of deadlock detection is not a big problem, this is a
> + *   cpu_relax() loop that is actively wasting CPU cycles to poll for
> + *   completion. )
>   */
> -static void csd_lock_wait(struct call_single_data *csd)
> +static void csd_lock_wait(struct call_single_data *csd, int cpu)
>  {
> -	while (csd->flags & CSD_FLAG_LOCK)
> +	int bug_id = 0;
> +	u64 ts0, ts1, ts_delta;
> +
> +	ts0 = jiffies_to_msecs(jiffies);
> +
> +	if (unlikely(!csd_bug_count)) {
> +		csd_bug_count++;
> +		printk("csd: CSD deadlock debugging initiated!\n");
> +	}
> +
> +	while (csd->flags & CSD_FLAG_LOCK) {
> +		ts1 = jiffies_to_msecs(jiffies);
> +
> +		ts_delta = ts1-ts0;

Ingo,

I noticed that with this patch the 'csd: Detected non-responsive CSD
lock...' message is never reached, because ts_delta never reaches
CSD_LOCK_TIMEOUT. I tried adjusting the TIMEOUT value and still got a
hang without reaching this statement. I made the ts0/ts1 values global,
put a counter into the while loop, and found that the loop iterated
about 670 million times before the softlockup was detected. In
addition, ts0 and ts1 had the same value at the time of the soft
lockup, so the CSD_LOCK_TIMEOUT check could never trip.

--chris


> +		if (unlikely(ts_delta >= CSD_LOCK_TIMEOUT)) { /* Uh oh, it took too long. Why? */
> +
> +			bug_id = csd_bug_count;
> +			csd_bug_count++;
> +
> +			ts0 = ts1; /* Re-start the timeout detection */
> +
> +			printk("csd: Detected non-responsive CSD lock (#%d) on CPU#%02d, waiting %Ld.%03Ld secs for CPU#%02d\n",
> +				bug_id, raw_smp_processor_id(), ts_delta/1000ULL, ts_delta % 1000ULL, cpu);
> +			if (cpu >= 0) {
> +				printk("csd: Re-sending CSD lock (#%d) IPI from CPU#%02d to CPU#%02d\n", bug_id, raw_smp_processor_id(), cpu);
> +				arch_send_call_function_single_ipi(cpu);
> +			}
> +			dump_stack();
> +		}
>  		cpu_relax();
> +	}
> +	if (unlikely(bug_id))
> +		printk("csd: CSD lock (#%d) got unstuck on CPU#%02d, CPU#%02d released the lock after all. Phew!\n", bug_id, raw_smp_processor_id(), cpu);
>  }
>  
>  static void csd_lock(struct call_single_data *csd)
>  {
> -	csd_lock_wait(csd);
> +	csd_lock_wait(csd, -1);
>  	csd->flags |= CSD_FLAG_LOCK;
>  
>  	/*
> @@ -191,7 +232,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
>  		arch_send_call_function_single_ipi(cpu);
>  
>  	if (wait)
> -		csd_lock_wait(csd);
> +		csd_lock_wait(csd, cpu);
>  
>  	return 0;
>  }
> @@ -446,7 +487,7 @@ void smp_call_function_many(const struct cpumask *mask,
>  			struct call_single_data *csd;
>  
>  			csd = per_cpu_ptr(cfd->csd, cpu);
> -			csd_lock_wait(csd);
> +			csd_lock_wait(csd, cpu);
>  		}
>  	}
>  }
> 



Thread overview: 75+ messages
2015-02-11 13:19 smp_call_function_single lockups Rafael David Tinoco
2015-02-11 18:18 ` Linus Torvalds
2015-02-11 19:59   ` Linus Torvalds
2015-02-11 20:42     ` Linus Torvalds
2015-02-12 16:38       ` Rafael David Tinoco
2015-02-18 22:25       ` Peter Zijlstra
2015-02-19 15:42         ` Rafael David Tinoco
2015-02-19 16:14           ` Linus Torvalds
2015-02-23 14:01             ` Rafael David Tinoco
2015-02-23 19:32               ` Linus Torvalds
2015-02-23 20:50                 ` Peter Zijlstra
2015-02-23 21:02                   ` Rafael David Tinoco
2015-02-19 16:16           ` Peter Zijlstra
2015-02-19 16:26           ` Linus Torvalds
2015-02-19 16:32             ` Rafael David Tinoco
2015-02-19 16:59               ` Linus Torvalds
2015-02-19 17:30                 ` Rafael David Tinoco
2015-02-19 17:39                 ` Linus Torvalds
2015-02-19 20:29                   ` Linus Torvalds
2015-02-19 21:59                     ` Linus Torvalds
2015-02-19 22:45                       ` Linus Torvalds
2015-03-31  3:15                         ` Chris J Arges
2015-03-31  4:28                           ` Linus Torvalds
2015-03-31 10:56                             ` [debug PATCHes] " Ingo Molnar
2015-03-31 22:38                               ` Chris J Arges
2015-04-01 12:39                                 ` Ingo Molnar
2015-04-01 14:10                                   ` Chris J Arges
2015-04-01 14:55                                     ` Ingo Molnar
2015-03-31  4:46                           ` Linus Torvalds
2015-03-31 15:08                           ` Linus Torvalds
2015-03-31 22:23                             ` Chris J Arges
2015-03-31 23:07                               ` Linus Torvalds
2015-04-01 14:32                                 ` Chris J Arges
2015-04-01 15:36                                   ` Linus Torvalds
2015-04-02  9:55                                     ` Ingo Molnar
2015-04-02 17:35                                       ` Linus Torvalds
2015-04-01 12:43                               ` Ingo Molnar
2015-04-01 16:10                                 ` Chris J Arges
2015-04-01 16:14                                   ` Linus Torvalds
2015-04-01 21:59                                     ` Chris J Arges
2015-04-02 17:31                                       ` Linus Torvalds
2015-04-02 18:26                                         ` Ingo Molnar
2015-04-02 18:51                                           ` Chris J Arges
2015-04-02 19:07                                             ` Ingo Molnar
2015-04-02 20:57                                               ` Linus Torvalds
2015-04-02 21:13                                               ` Chris J Arges
2015-04-03  5:43                                                 ` [PATCH] smp/call: Detect stuck CSD locks Ingo Molnar
2015-04-03  5:47                                                   ` Ingo Molnar
2015-04-06 16:58                                                   ` Chris J Arges [this message]
2015-04-06 17:32                                                     ` Linus Torvalds
2015-04-07  9:21                                                       ` Ingo Molnar
2015-04-07 20:59                                                         ` Chris J Arges
2015-04-07 21:15                                                           ` Linus Torvalds
2015-04-08  6:47                                                           ` Ingo Molnar
2015-04-13  3:56                                                             ` Chris J Arges
2015-04-13  6:14                                                               ` Ingo Molnar
2015-04-15 19:54                                                                 ` Chris J Arges
2015-04-16 11:04                                                                   ` Ingo Molnar
2015-04-16 15:58                                                                     ` Chris J Arges
2015-04-16 16:31                                                                       ` Ingo Molnar
2015-04-29 21:08                                                                         ` Chris J Arges
2015-05-11 14:00                                                                           ` Ingo Molnar
2015-05-20 18:19                                                                             ` Chris J Arges
2015-04-03  5:45                                                 ` smp_call_function_single lockups Ingo Molnar
2015-04-06 17:23                                         ` Chris J Arges
2015-02-20  9:30                     ` Ingo Molnar
2015-02-20 16:49                       ` Linus Torvalds
2015-02-20 19:41                         ` Ingo Molnar
2015-02-20 20:03                           ` Linus Torvalds
2015-02-20 20:11                             ` Ingo Molnar
2015-03-20 10:15       ` Peter Zijlstra
2015-03-20 16:26         ` Linus Torvalds
2015-03-20 17:14           ` Mike Galbraith
2015-04-01 14:22       ` Frederic Weisbecker
2015-04-18 10:13       ` [tip:locking/urgent] smp: Fix smp_call_function_single_async() locking tip-bot for Linus Torvalds
