From: Chris J Arges <chris.j.arges@canonical.com>
To: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Rafael David Tinoco <inaddy@ubuntu.com>,
	Peter Anvin <hpa@zytor.com>,
	Jiang Liu <jiang.liu@linux.intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>, Jens Axboe <axboe@kernel.dk>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Gema Gomez <gema.gomez-solano@canonical.com>,
	the arch/x86 maintainers <x86@kernel.org>
Subject: Re: [debug PATCHes] Re: smp_call_function_single lockups
Date: Tue, 31 Mar 2015 17:38:00 -0500
Message-ID: <20150331223800.GB12512@canonical.com>
In-Reply-To: <20150331105656.GA25180@gmail.com>

On Tue, Mar 31, 2015 at 12:56:56PM +0200, Ingo Molnar wrote:
> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > Ok, interesting. So the whole "we try to do an APIC ACK with the ISR 
> > bit clear" seems to be a real issue.
> 
> It's interesting in particular when it happens with an edge-triggered 
> interrupt source: it's much harder to miss level-triggered IRQs, which 
> stay around until actively handled. Edge-triggered irqs are more 
> vulnerable to losing events.
> 
> > > Anyway, maybe this sheds some more light on this issue. I can 
> > > reproduce this at will, so let me know of other experiments to do.
> 
> Btw., could you please describe (again) what your current best method 
> for reproduction is? It's been a long discussion ...
> 

Ingo,

To set this up, I've done the following on a Xeon E5620 / Xeon E312xx machine
(although I've heard from others who have reproduced this on other machines):

1) Create an L1 KVM VM with 2 vCPUs (the single-vCPU case doesn't reproduce)
2) Create an L2 KVM VM inside the L1 VM with 1 vCPU
3) Add the following to the L1 kernel cmdline:
nmi_watchdog=panic hung_task_panic=1 softlockup_panic=1 unknown_nmi_panic
4) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM

Sometimes this is sufficient to reproduce the issue; I've observed that running
KSM in the L1 VM can agitate it (KSM calls native_flush_tlb_others). If this
doesn't reproduce, then do the following:
5) Migrate the L2 vCPU randomly (via 'virsh vcpupin --live' or taskset) between
L1 vCPUs until the hang occurs.

I attempted to write a module that uses smp_call_function_single() calls to
trigger IPIs, but I have been unable to create a simpler reproducer.
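
For reference, the module was along these lines: a kthread bouncing empty
smp_call_function_single() calls across the online CPUs. A rough sketch
(names and details approximate, untested as pasted here):

/* ipi_stress.c: illustrative sketch only, not the exact module I ran. */
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/smp.h>
#include <linux/cpumask.h>
#include <linux/sched.h>
#include <linux/err.h>

static struct task_struct *ipi_task;

/* Empty IPI target: we only care about the IPI round trip itself. */
static void ipi_noop(void *info)
{
}

static int ipi_stress(void *data)
{
        int cpu = -1;

        while (!kthread_should_stop()) {
                /* Round-robin over the online CPUs, IPI'ing each in turn. */
                cpu = cpumask_next(cpu, cpu_online_mask);
                if (cpu >= nr_cpu_ids)
                        cpu = cpumask_first(cpu_online_mask);
                smp_call_function_single(cpu, ipi_noop, NULL, 1);
                cond_resched();
        }
        return 0;
}

static int __init ipi_stress_init(void)
{
        ipi_task = kthread_run(ipi_stress, NULL, "ipi_stress");
        return PTR_ERR_OR_ZERO(ipi_task);
}

static void __exit ipi_stress_exit(void)
{
        kthread_stop(ipi_task);
}

module_init(ipi_stress_init);
module_exit(ipi_stress_exit);
MODULE_LICENSE("GPL");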

> > Somebody else who knows the apic needs to also take a look, but I'd 
> > love to hear what the actual hardware interrupt is (from that 
> > "vector_irq[vector]" thing above.
> > 
> > I'm not recognizing 0xe1 as any of the hardcoded SMP vectors (they 
> > are 0xf0-0xff), so it sounds like an external one. But that then 
> > requires the whole mapping table thing.
> > 
> > Ingo/Peter/Jiang - is there anything else useful we could print out? 
> > I worry about the irq movement code. Can we add printk's to when an 
> > irq is chasing from one CPU to another and doing that 
> > "move_in_progress" thing? I've always been scared of that code.
> 
> 1)
> 
> So the irq_cfg::move_in_progress field originates from:
> 
>   610142927b5b ("[PATCH] x86_64 irq: Safely cleanup an irq after moving it.")
> 
> and I agree that it looks fragile, so it would be nice to see how much 
> (if at all?) it gets used, by sticking a few pr_info()s in it.
> 
> Debug patch for the vector movement state machine attached below. 
> Untested.
> 
> 2)
> 
> But ... having taken a quick look at the vector movement handling, I 
> am scared more than ever, and I cannot convince myself that it's race 
> free. It's possibly harmless, but still, famous last words ...
> 
> For example, most ->move_in_progress transitions happen with the 
> vector_lock held - with the exception of send_cleanup_vector(): there 
> we are only holding the desc->lock, but that's not enough! For 
> example, this codepath:
> 
> static void __irq_complete_move(struct irq_cfg *cfg, unsigned vector)
> {
>         unsigned me;
> 
>         if (likely(!cfg->move_in_progress))
>                 return;
> 
>         me = smp_processor_id();
> 
>         if (vector == cfg->vector && cpumask_test_cpu(me, cfg->domain))
>                 send_cleanup_vector(cfg);
> 
> ...
> 
> void send_cleanup_vector(struct irq_cfg *cfg)
> {
> ...
> 
>         cfg->move_in_progress = 0;
> }
> 
> is blatantly racy, unless I missed something obvious?
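
Agreed, that looks racy to me as well: ->move_in_progress is written under
two different locks here, and two different locks exclude nothing. As a
sanity check of the pattern (a userspace analogue only, not kernel code;
the lock names just mirror the discussion), two threads "protecting" shared
state with two different mutexes happily overlap their critical sections:

/* two_locks.c: build with 'gcc -O2 -pthread two_locks.c' */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static pthread_mutex_t vector_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t desc_lock   = PTHREAD_MUTEX_INITIALIZER;
static atomic_int in_cs;        /* threads currently "inside" */
static atomic_int overlap;      /* did we ever see both at once? */

static void *worker(void *arg)
{
        pthread_mutex_t *lock = arg;
        int i;

        for (i = 0; i < 10000000; i++) {
                pthread_mutex_lock(lock);
                if (atomic_fetch_add(&in_cs, 1) > 0)
                        atomic_store(&overlap, 1);
                atomic_fetch_sub(&in_cs, 1);
                pthread_mutex_unlock(lock);
        }
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        /* Each thread takes a *different* lock around the same state. */
        pthread_create(&a, NULL, worker, &vector_lock);
        pthread_create(&b, NULL, worker, &desc_lock);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("critical sections overlapped: %s\n",
               atomic_load(&overlap) ? "yes" : "no");
        return 0;
}

So unless something else serializes send_cleanup_vector() against
__assign_irq_vector(), holding desc->lock alone can't protect that flag.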
> 
> 2b)
> 
> Initially I thought this only affects slowpaths like driver unregister 
> or CPU hotplug, but I think this would be relevant for the fastpath of 
> edge triggered irqs as well:
> 
> void apic_ack_edge(struct irq_data *data)
> {
>         irq_complete_move(irqd_cfg(data));
>         irq_move_irq(data);
>         ack_APIC_irq();
> }
> 
> and irq_complete_move() checks and clears cfg->move_in_progress as 
> listed above.
> 
> So in most scenarios this is probably harmless, because it can in the 
> worst case result in a missed ->move_in_progress flag setting.
> 
> But it does not look harmless in the apic_set_affinity() code path: 
> there we call into assign_irq_vector() without the desc->lock held, 
> due to this gem in the IRQ core:
> 
> int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m)
> {
>         unsigned long flags;
>         struct irq_desc *desc = irq_get_desc_lock(irq, &flags, IRQ_GET_DESC_CHECK_GLOBAL);
> 
>         if (!desc)
>                 return -EINVAL;
>         desc->affinity_hint = m;
>         irq_put_desc_unlock(desc, flags);
>         /* set the initial affinity to prevent every interrupt being on CPU0 */
>         if (m)
>                 __irq_set_affinity(irq, m, false);
>         return 0;
> }
> 
> argh!
> 
> Now if this ever crosses paths with an assign_irq_vector() call on 
> another CPU, then I really cannot see what would save us from deep 
> trouble: we use cfg->vector while that is possibly being modified, 
> right?
> 
> So I'd suggest to put a printk into irq_set_affinity_hint() as well, 
> to make sure it's not used during this test. In particular:
> 
> drivers/virtio/virtio_pci_common.c:                     irq_set_affinity_hint(irq, NULL);
> drivers/virtio/virtio_pci_common.c:                     irq_set_affinity_hint(irq, mask);
> 
> might be triggering it in the virtualization code...
> 
> The second patch below adds the relevant pr_info(). (untested)
> 
> Now this too might be unrelated, because the affinity hint was added 
> ages ago, in:
> 
> e7a297b0d7d6 ("genirq: Add CPU mask affinity hint")
> 
> and the potentially racy nature of calling into set_affinity without 
> the desc lock held was probably not realized back then.
> 
> VirtIO's use of the affinity hint was likewise added four years ago, in:
> 
>   75a0a52be3c2 ("virtio: introduce an API to set affinity for a virtqueue")
> 
> 2c)
> 
> The other thing that worries me here is that we apparently send an 
> IRQ-move IPI while having an un-acked local APIC (!). I have vague 
> memories that this was problematic and fragile before.
> 
> To ease my worries in this area I've attached a third patch below that 
> changes the order around.
> 
> This too is ancient behavior - but might be more prominent on certain 
> hardware (and virtualization) variants, so it has a chance to be 
> relevant.
> 
> 3)
> 
> Furthermore, I also noticed that the whole vector_lock()/unlock() 
> business is actively misleading IMHO: the vector_lock looks local to 
> vector.c, while we have these two calls that essentially export it ...
> 
> Some of that damage was inflicted in this ancient commit:
> 
>   d388e5fdc461 ("x86: Restore proper vector locking during cpu hotplug")
> 
> the proper approach would be to protect against hotplug events in 
> assign_irq_vector(), not the other way around!
> 
> Probably not a functional problem, yet somewhat disgusting.
> 
> -------------------------------
> 
> So ... all in all, these various problems might be related to the 
> regression or not, but they sure need addressing.
> 
> Thanks,
> 
> 	Ingo
> 

I've merged the debug patches below with Linus' patch here:
https://lkml.org/lkml/2015/3/31/955

This was tested only on the L1 so far; I can put this on the L0 host and run
it there as well. The results:

[  124.897002] apic: vector c1, new-domain move in progress                    
[  124.954827] apic: vector d1, sent cleanup vector, move completed            
[  163.477270] apic: vector d1, new-domain move in progress                    
[  164.041938] apic: vector e1, sent cleanup vector, move completed            
[  213.466971] apic: vector e1, new-domain move in progress                    
[  213.775639] apic: vector 22, sent cleanup vector, move completed            
[  365.996747] apic: vector 22, new-domain move in progress                    
[  366.011136] apic: vector 42, sent cleanup vector, move completed            
[  393.836032] apic: vector 42, new-domain move in progress                    
[  393.837727] apic: vector 52, sent cleanup vector, move completed            
[  454.977514] apic: vector 52, new-domain move in progress                    
[  454.978880] apic: vector 62, sent cleanup vector, move completed            
[  467.055730] apic: vector 62, new-domain move in progress                    
[  467.058129] apic: vector 72, sent cleanup vector, move completed            
[  545.280125] apic: vector 72, new-domain move in progress                    
[  545.282801] apic: vector 82, sent cleanup vector, move completed            
[  567.631652] apic: vector 82, new-domain move in progress                    
[  567.632207] apic: vector 92, sent cleanup vector, move completed            
[  628.940638] apic: vector 92, new-domain move in progress                    
[  628.965274] apic: vector a2, sent cleanup vector, move completed            
[  635.187433] apic: vector a2, new-domain move in progress                    
[  635.191643] apic: vector b2, sent cleanup vector, move completed            
[  673.548020] apic: vector b2, new-domain move in progress                    
[  673.553843] apic: vector c2, sent cleanup vector, move completed            
[  688.221906] apic: vector c2, new-domain move in progress                    
[  688.229487] apic: vector d2, sent cleanup vector, move completed            
[  723.818916] apic: vector d2, new-domain move in progress                    
[  723.828970] apic: vector e2, sent cleanup vector, move completed            
[  733.485435] apic: vector e2, new-domain move in progress                    
[  733.615007] apic: vector 23, sent cleanup vector, move completed            
[  824.092036] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ksmd:26] 

Thanks,
--chris j arges

> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index 6cedd7914581..1dd1de9dd690 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -143,6 +143,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
>  			cpumask_andnot(cfg->old_domain, cfg->domain, tmp_mask);
>  			cfg->move_in_progress =
>  			   cpumask_intersects(cfg->old_domain, cpu_online_mask);
> +			if (cfg->move_in_progress)
> +				pr_info("apic: vector %02x, same-domain move in progress\n", cfg->vector);
>  			cpumask_and(cfg->domain, cfg->domain, tmp_mask);
>  			break;
>  		}
> @@ -178,6 +180,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
>  			cpumask_copy(cfg->old_domain, cfg->domain);
>  			cfg->move_in_progress =
>  			   cpumask_intersects(cfg->old_domain, cpu_online_mask);
> +			if (cfg->move_in_progress)
> +				pr_info("apic: vector %02x, new-domain move in progress\n", cfg->vector);
>  		}
>  		for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask)
>  			per_cpu(vector_irq, new_cpu)[vector] = irq;
> @@ -232,6 +236,7 @@ void clear_irq_vector(int irq, struct irq_cfg *cfg)
>  		}
>  	}
>  	cfg->move_in_progress = 0;
> +	pr_info("apic: vector %02x, vector cleared, move completed\n", cfg->vector);
>  	raw_spin_unlock_irqrestore(&vector_lock, flags);
>  }
>  
> @@ -391,6 +396,7 @@ void send_cleanup_vector(struct irq_cfg *cfg)
>  		free_cpumask_var(cleanup_mask);
>  	}
>  	cfg->move_in_progress = 0;
> +	pr_info("apic: vector %02x, sent cleanup vector, move completed\n", cfg->vector);
>  }
>  
>  asmlinkage __visible void smp_irq_move_cleanup_interrupt(void)
> 
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index c0a1100d911f..18300a7b8ed2 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -250,8 +250,10 @@ int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m)
>  	desc->affinity_hint = m;
>  	irq_put_desc_unlock(desc, flags);
>  	/* set the initial affinity to prevent every interrupt being on CPU0 */
> -	if (m)
> +	if (m) {
> +		pr_info("irq/core: Called irq_set_affinity_hint(%d)\n", irq);
>  		__irq_set_affinity(irq, m, false);
> +	}
>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(irq_set_affinity_hint);
> 
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index 6cedd7914581..833a981c5420 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -335,9 +340,11 @@ int apic_retrigger_irq(struct irq_data *data)
>  
>  void apic_ack_edge(struct irq_data *data)
>  {
> +	ack_APIC_irq();
> +
> +	/* Might generate IPIs, so do this after having ACKed the APIC: */
>  	irq_complete_move(irqd_cfg(data));
>  	irq_move_irq(data);
> -	ack_APIC_irq();
>  }
>  
>  /*
> 

