x86/vector: Allow to free vector for managed IRQ
diff mbox series

Message ID 20200312205830.81796-1-peterx@redhat.com
State New
Headers show
Series
  • x86/vector: Allow to free vector for managed IRQ
Related show

Commit Message

Peter Xu March 12, 2020, 8:58 p.m. UTC
After we introduced the "managed_irq" sub-parameter for isolcpus, it's
possible to free a kernel managed irq vector now.

It can be triggered easily by booting a VM with a few vcpus, with one
virtio-blk device and then mark some cores as HK_FLAG_MANAGED_IRQ (in
below case, there're 4 vcpus, with vcpu 3 isolated with managed_irq):

[    2.889911] ------------[ cut here ]------------
[    2.889964] WARNING: CPU: 3 PID: 0 at arch/x86/kernel/apic/vector.c:853 free_moved_vector+0x126/0x160
[    2.889964] Modules linked in: crc32c_intel serio_raw virtio_blk(+) qemu_fw_cfg
[    2.889968] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.6.0-rc1 #18
[    2.889969] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[    2.889970] RIP: 0010:free_moved_vector+0x126/0x160
[    2.889972] Code: 45 00 48 85 c0 75 df e9 2b ff ff ff 48 8b 05 f9 51 71 01 e8 3c 5a 11 00 85 c0 74 09 80 3d 8d 39 71 01 00 5
[    2.889972] RSP: 0000:ffffb5ac00110fa0 EFLAGS: 00010002
[    2.889973] RAX: 0000000000000001 RBX: ffffa00fad2d60c0 RCX: 0000000000000821
[    2.889974] RDX: 0000000000000000 RSI: 00000000fd2bd4ba RDI: ffffa00fad2d60c0
[    2.889974] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
[    2.889975] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000023
[    2.889975] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
[    2.889976] FS:  0000000000000000(0000) GS:ffffa00fbbd80000(0000) knlGS:0000000000000000
[    2.889977] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.889978] CR2: 00000000ffffffff CR3: 0000000123610001 CR4: 0000000000360ee0
[    2.889980] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    2.889980] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    2.889981] Call Trace:
[    2.889982]  <IRQ>
[    2.889987]  smp_irq_move_cleanup_interrupt+0xac/0xc6
[    2.889989]  irq_move_cleanup_interrupt+0xc/0x20
[    2.889990]  </IRQ>
[    2.889991] RIP: 0010:native_safe_halt+0xe/0x10
[    2.889992] Code: cc cc cc cc cc cc cc cc cc cc cc cc e9 07 00 00 00 0f 00 2d 56 82 4f 00 f4 c3 66 90 e9 07 00 00 00 0f 00 0
[    2.889993] RSP: 0000:ffffb5ac00083eb0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffffdf
[    2.889994] RAX: ffffa00fbb260000 RBX: 0000000000000003 RCX: 0000000000000000
[    2.889994] RDX: ffffa00fbb260000 RSI: 0000000000000006 RDI: ffffa00fbb260000
[    2.889995] RBP: 0000000000000003 R08: 000000cd42e4dffb R09: 0000000000000000
[    2.889995] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa00fbb260000
[    2.889996] R13: 0000000000000000 R14: 0000000000000000 R15: ffffa00fbb260000
[    2.890003]  default_idle+0x1f/0x140
[    2.890006]  do_idle+0x1fa/0x270
[    2.890010]  cpu_startup_entry+0x19/0x20
[    2.890012]  start_secondary+0x164/0x1b0
[    2.890014]  secondary_startup_64+0xa4/0xb0
[    2.890021] irq event stamp: 8758
[    2.890022] hardirqs last  enabled at (8755): [<ffffffffbbb105ca>] default_idle+0x1a/0x140
[    2.890023] hardirqs last disabled at (8756): [<ffffffffbb0039f7>] trace_hardirqs_off_thunk+0x1a/0x1c
[    2.890025] softirqs last  enabled at (8758): [<ffffffffbb0ecce8>] irq_enter+0x68/0x70
[    2.890026] softirqs last disabled at (8757): [<ffffffffbb0ecccd>] irq_enter+0x4d/0x70
[    2.890027] ---[ end trace deb5d563d2acb13f ]---

I believe the same thing will happen to bare metals.

When allocating the IRQ for the device, activate_managed() will try to
allocate a vector based on what we've calculated for kernel managed
IRQs (which does not take HK_FLAG_MANAGED_IRQ into account).  However
when we bind the IRQ to the IRQ handler, we'll do irq_startup() and
irq_do_set_affinity(), in which we will start to consider the whole
HK_FLAG_MANAGED_IRQ logic.  This means the chosen core can be
different from when we do the allocation.  When that happens, we'll
need to be able to properly free the old vector on the old core.

Remove the WARN_ON_ONCE() to allow this to happen.

Fixes: 11ea68f553e2 ("genirq, sched/isolation: Isolate from handling managed interrupts")
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Ming Lei <minlei@redhat.com>
CC: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kernel/apic/vector.c | 8 --------
 1 file changed, 8 deletions(-)

Comments

Ming Lei March 13, 2020, 3:13 a.m. UTC | #1
On Thu, Mar 12, 2020 at 04:58:30PM -0400, Peter Xu wrote:
> After we introduced the "managed_irq" sub-parameter for isolcpus, it's
> possible to free a kernel managed irq vector now.
> 
> It can be triggered easily by booting a VM with a few vcpus, with one
> virtio-blk device and then mark some cores as HK_FLAG_MANAGED_IRQ (in
> below case, there're 4 vcpus, with vcpu 3 isolated with managed_irq):
> 
> [    2.889911] ------------[ cut here ]------------
> [    2.889964] WARNING: CPU: 3 PID: 0 at arch/x86/kernel/apic/vector.c:853 free_moved_vector+0x126/0x160
> [    2.889964] Modules linked in: crc32c_intel serio_raw virtio_blk(+) qemu_fw_cfg
> [    2.889968] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.6.0-rc1 #18
> [    2.889969] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> [    2.889970] RIP: 0010:free_moved_vector+0x126/0x160
> [    2.889972] Code: 45 00 48 85 c0 75 df e9 2b ff ff ff 48 8b 05 f9 51 71 01 e8 3c 5a 11 00 85 c0 74 09 80 3d 8d 39 71 01 00 5
> [    2.889972] RSP: 0000:ffffb5ac00110fa0 EFLAGS: 00010002
> [    2.889973] RAX: 0000000000000001 RBX: ffffa00fad2d60c0 RCX: 0000000000000821
> [    2.889974] RDX: 0000000000000000 RSI: 00000000fd2bd4ba RDI: ffffa00fad2d60c0
> [    2.889974] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
> [    2.889975] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000023
> [    2.889975] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
> [    2.889976] FS:  0000000000000000(0000) GS:ffffa00fbbd80000(0000) knlGS:0000000000000000
> [    2.889977] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    2.889978] CR2: 00000000ffffffff CR3: 0000000123610001 CR4: 0000000000360ee0
> [    2.889980] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    2.889980] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    2.889981] Call Trace:
> [    2.889982]  <IRQ>
> [    2.889987]  smp_irq_move_cleanup_interrupt+0xac/0xc6
> [    2.889989]  irq_move_cleanup_interrupt+0xc/0x20
> [    2.889990]  </IRQ>
> [    2.889991] RIP: 0010:native_safe_halt+0xe/0x10
> [    2.889992] Code: cc cc cc cc cc cc cc cc cc cc cc cc e9 07 00 00 00 0f 00 2d 56 82 4f 00 f4 c3 66 90 e9 07 00 00 00 0f 00 0
> [    2.889993] RSP: 0000:ffffb5ac00083eb0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffffdf
> [    2.889994] RAX: ffffa00fbb260000 RBX: 0000000000000003 RCX: 0000000000000000
> [    2.889994] RDX: ffffa00fbb260000 RSI: 0000000000000006 RDI: ffffa00fbb260000
> [    2.889995] RBP: 0000000000000003 R08: 000000cd42e4dffb R09: 0000000000000000
> [    2.889995] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa00fbb260000
> [    2.889996] R13: 0000000000000000 R14: 0000000000000000 R15: ffffa00fbb260000
> [    2.890003]  default_idle+0x1f/0x140
> [    2.890006]  do_idle+0x1fa/0x270
> [    2.890010]  cpu_startup_entry+0x19/0x20
> [    2.890012]  start_secondary+0x164/0x1b0
> [    2.890014]  secondary_startup_64+0xa4/0xb0
> [    2.890021] irq event stamp: 8758
> [    2.890022] hardirqs last  enabled at (8755): [<ffffffffbbb105ca>] default_idle+0x1a/0x140
> [    2.890023] hardirqs last disabled at (8756): [<ffffffffbb0039f7>] trace_hardirqs_off_thunk+0x1a/0x1c
> [    2.890025] softirqs last  enabled at (8758): [<ffffffffbb0ecce8>] irq_enter+0x68/0x70
> [    2.890026] softirqs last disabled at (8757): [<ffffffffbb0ecccd>] irq_enter+0x4d/0x70
> [    2.890027] ---[ end trace deb5d563d2acb13f ]---
> 
> I believe the same thing will happen to bare metals.
> 
> When allocating the IRQ for the device, activate_managed() will try to
> allocate a vector based on what we've calculated for kernel managed
> IRQs (which does not take HK_FLAG_MANAGED_IRQ into account).  However
> when we bind the IRQ to the IRQ handler, we'll do irq_startup() and
> irq_do_set_affinity(), in which we will start to consider the whole
> HK_FLAG_MANAGED_IRQ logic.  This means the chosen core can be
> different from when we do the allocation.  When that happens, we'll
> need to be able to properly free the old vector on the old core.
> 
> Remove the WARN_ON_ONCE() to allow this to happen.
> 
> Fixes: 11ea68f553e2 ("genirq, sched/isolation: Isolate from handling managed interrupts")
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Ingo Molnar <mingo@redhat.com>
> CC: Peter Zijlstra <peterz@infradead.org>
> CC: Ming Lei <minlei@redhat.com>
> CC: Juri Lelli <juri.lelli@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/x86/kernel/apic/vector.c | 8 --------
>  1 file changed, 8 deletions(-)
> 
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index 2c5676b0a6e7..a1142260b123 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -837,14 +837,6 @@ static void free_moved_vector(struct apic_chip_data *apicd)
>  	unsigned int cpu = apicd->prev_cpu;
>  	bool managed = apicd->is_managed;
>  
> -	/*
> -	 * This should never happen. Managed interrupts are not
> -	 * migrated except on CPU down, which does not involve the
> -	 * cleanup vector. But try to keep the accounting correct
> -	 * nevertheless.
> -	 */
> -	WARN_ON_ONCE(managed);
> -

Since commit 11ea68f553e2, managed interrupt can be migrated on CPU up:

Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks,
Ming
Peter Xu March 13, 2020, 12:32 p.m. UTC | #2
On Fri, Mar 13, 2020 at 11:13:41AM +0800, Ming Lei wrote:
> On Thu, Mar 12, 2020 at 04:58:30PM -0400, Peter Xu wrote:
> > After we introduced the "managed_irq" sub-parameter for isolcpus, it's
> > possible to free a kernel managed irq vector now.
> > 
> > It can be triggered easily by booting a VM with a few vcpus, with one
> > virtio-blk device and then mark some cores as HK_FLAG_MANAGED_IRQ (in
> > below case, there're 4 vcpus, with vcpu 3 isolated with managed_irq):
> > 
> > [    2.889911] ------------[ cut here ]------------
> > [    2.889964] WARNING: CPU: 3 PID: 0 at arch/x86/kernel/apic/vector.c:853 free_moved_vector+0x126/0x160
> > [    2.889964] Modules linked in: crc32c_intel serio_raw virtio_blk(+) qemu_fw_cfg
> > [    2.889968] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.6.0-rc1 #18
> > [    2.889969] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> > [    2.889970] RIP: 0010:free_moved_vector+0x126/0x160
> > [    2.889972] Code: 45 00 48 85 c0 75 df e9 2b ff ff ff 48 8b 05 f9 51 71 01 e8 3c 5a 11 00 85 c0 74 09 80 3d 8d 39 71 01 00 5
> > [    2.889972] RSP: 0000:ffffb5ac00110fa0 EFLAGS: 00010002
> > [    2.889973] RAX: 0000000000000001 RBX: ffffa00fad2d60c0 RCX: 0000000000000821
> > [    2.889974] RDX: 0000000000000000 RSI: 00000000fd2bd4ba RDI: ffffa00fad2d60c0
> > [    2.889974] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
> > [    2.889975] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000023
> > [    2.889975] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
> > [    2.889976] FS:  0000000000000000(0000) GS:ffffa00fbbd80000(0000) knlGS:0000000000000000
> > [    2.889977] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [    2.889978] CR2: 00000000ffffffff CR3: 0000000123610001 CR4: 0000000000360ee0
> > [    2.889980] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [    2.889980] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [    2.889981] Call Trace:
> > [    2.889982]  <IRQ>
> > [    2.889987]  smp_irq_move_cleanup_interrupt+0xac/0xc6
> > [    2.889989]  irq_move_cleanup_interrupt+0xc/0x20
> > [    2.889990]  </IRQ>
> > [    2.889991] RIP: 0010:native_safe_halt+0xe/0x10
> > [    2.889992] Code: cc cc cc cc cc cc cc cc cc cc cc cc e9 07 00 00 00 0f 00 2d 56 82 4f 00 f4 c3 66 90 e9 07 00 00 00 0f 00 0
> > [    2.889993] RSP: 0000:ffffb5ac00083eb0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffffdf
> > [    2.889994] RAX: ffffa00fbb260000 RBX: 0000000000000003 RCX: 0000000000000000
> > [    2.889994] RDX: ffffa00fbb260000 RSI: 0000000000000006 RDI: ffffa00fbb260000
> > [    2.889995] RBP: 0000000000000003 R08: 000000cd42e4dffb R09: 0000000000000000
> > [    2.889995] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa00fbb260000
> > [    2.889996] R13: 0000000000000000 R14: 0000000000000000 R15: ffffa00fbb260000
> > [    2.890003]  default_idle+0x1f/0x140
> > [    2.890006]  do_idle+0x1fa/0x270
> > [    2.890010]  cpu_startup_entry+0x19/0x20
> > [    2.890012]  start_secondary+0x164/0x1b0
> > [    2.890014]  secondary_startup_64+0xa4/0xb0
> > [    2.890021] irq event stamp: 8758
> > [    2.890022] hardirqs last  enabled at (8755): [<ffffffffbbb105ca>] default_idle+0x1a/0x140
> > [    2.890023] hardirqs last disabled at (8756): [<ffffffffbb0039f7>] trace_hardirqs_off_thunk+0x1a/0x1c
> > [    2.890025] softirqs last  enabled at (8758): [<ffffffffbb0ecce8>] irq_enter+0x68/0x70
> > [    2.890026] softirqs last disabled at (8757): [<ffffffffbb0ecccd>] irq_enter+0x4d/0x70
> > [    2.890027] ---[ end trace deb5d563d2acb13f ]---
> > 
> > I believe the same thing will happen to bare metals.
> > 
> > When allocating the IRQ for the device, activate_managed() will try to
> > allocate a vector based on what we've calculated for kernel managed
> > IRQs (which does not take HK_FLAG_MANAGED_IRQ into account).  However
> > when we bind the IRQ to the IRQ handler, we'll do irq_startup() and
> > irq_do_set_affinity(), in which we will start to consider the whole
> > HK_FLAG_MANAGED_IRQ logic.  This means the chosen core can be
> > different from when we do the allocation.  When that happens, we'll
> > need to be able to properly free the old vector on the old core.
> > 
> > Remove the WARN_ON_ONCE() to allow this to happen.
> > 
> > Fixes: 11ea68f553e2 ("genirq, sched/isolation: Isolate from handling managed interrupts")
> > CC: Thomas Gleixner <tglx@linutronix.de>
> > CC: Ingo Molnar <mingo@redhat.com>
> > CC: Peter Zijlstra <peterz@infradead.org>
> > CC: Ming Lei <minlei@redhat.com>
> > CC: Juri Lelli <juri.lelli@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  arch/x86/kernel/apic/vector.c | 8 --------
> >  1 file changed, 8 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> > index 2c5676b0a6e7..a1142260b123 100644
> > --- a/arch/x86/kernel/apic/vector.c
> > +++ b/arch/x86/kernel/apic/vector.c
> > @@ -837,14 +837,6 @@ static void free_moved_vector(struct apic_chip_data *apicd)
> >  	unsigned int cpu = apicd->prev_cpu;
> >  	bool managed = apicd->is_managed;
> >  
> > -	/*
> > -	 * This should never happen. Managed interrupts are not
> > -	 * migrated except on CPU down, which does not involve the
> > -	 * cleanup vector. But try to keep the accounting correct
> > -	 * nevertheless.
> > -	 */
> > -	WARN_ON_ONCE(managed);
> > -
> 
> Since commit 11ea68f553e2, managed interrupt can be migrated on CPU up:

Right, my example was triggered even earlier than cpu up (every boot,
because right now we'll allocate the managed irq vector even during
allocation of the IRQ, then another time when setting the affinity).

And IIUC cpu down should also be able to trigger this, say:

  queue1 -> CPU 0,1
  queue2 -> CPU 2,3

Assuming housekeeping cores are 0-2 and CPU3 is isolated, queue2 IRQ
will need to target CPU2 first.  Then if we offline CPU2, the queue2
IRQ should need to move to CPU3 (which is the isolated core) instead.
Then we need to be able to free the old vector on CPU2.

> 
> Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks!
Thomas Gleixner March 13, 2020, 2:24 p.m. UTC | #3
Peter Xu <peterx@redhat.com> writes:

> After we introduced the "managed_irq" sub-parameter for isolcpus, it's
> possible to free a kernel managed irq vector now.
>
> It can be triggered easily by booting a VM with a few vcpus, with one
> virtio-blk device and then mark some cores as HK_FLAG_MANAGED_IRQ (in
> below case, there're 4 vcpus, with vcpu 3 isolated with managed_irq):
>
> [    2.889911] ------------[ cut here ]------------
> [    2.889964] WARNING: CPU: 3 PID: 0 at arch/x86/kernel/apic/vector.c:853 free_moved_vector+0x126/0x160

<SNIP>

> [    2.890026] softirqs last disabled at (8757): [<ffffffffbb0ecccd>] irq_enter+0x4d/0x70
> [    2.890027] ---[ end trace deb5d563d2acb13f ]---

What is this backtrace for? It's completly useless as it merily shows
that the warning triggers. Also even if it'd be useful then it wants to
be trimmed properly.

> I believe the same thing will happen to bare metals.

Believe is not really relevant in engineering.

The problem has nothing to do with virt or bare metal. It's a genuine
issue.

> When allocating the IRQ for the device, activate_managed() will try to
> allocate a vector based on what we've calculated for kernel managed
> IRQs (which does not take HK_FLAG_MANAGED_IRQ into account).  However
> when we bind the IRQ to the IRQ handler, we'll do irq_startup() and
> irq_do_set_affinity(), in which we will start to consider the whole
> HK_FLAG_MANAGED_IRQ logic.  This means the chosen core can be
> different from when we do the allocation.  When that happens, we'll
> need to be able to properly free the old vector on the old core.

There's lots of 'we' in that text. We do nothing really. Please describe
things in neutral and factual language.

Also there is another way to trigger this: Offline all non-isolated CPUs
in the mask and then bring one online again.

Ming, I really have to ask why these two situations were not tested
before the final submission of that isolation patch. Both issues have
been discussed during review of the different versions. So the warning
should have triggered back then already....

> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index 2c5676b0a6e7..a1142260b123 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -837,14 +837,6 @@ static void free_moved_vector(struct apic_chip_data *apicd)
>  	unsigned int cpu = apicd->prev_cpu;
>  	bool managed = apicd->is_managed;
>  
> -	/*
> -	 * This should never happen. Managed interrupts are not
> -	 * migrated except on CPU down, which does not involve the
> -	 * cleanup vector. But try to keep the accounting correct
> -	 * nevertheless.
> -	 */

While the comment is not longer correct, removing it is lame. This
should have an explanation why managed interrupts can end up here.

No need to resend. I fixed it up already.

Thanks,

        tglx
Peter Xu March 13, 2020, 3:19 p.m. UTC | #4
On Fri, Mar 13, 2020 at 03:24:08PM +0100, Thomas Gleixner wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > After we introduced the "managed_irq" sub-parameter for isolcpus, it's
> > possible to free a kernel managed irq vector now.
> >
> > It can be triggered easily by booting a VM with a few vcpus, with one
> > virtio-blk device and then mark some cores as HK_FLAG_MANAGED_IRQ (in
> > below case, there're 4 vcpus, with vcpu 3 isolated with managed_irq):
> >
> > [    2.889911] ------------[ cut here ]------------
> > [    2.889964] WARNING: CPU: 3 PID: 0 at arch/x86/kernel/apic/vector.c:853 free_moved_vector+0x126/0x160
> 
> <SNIP>
> 
> > [    2.890026] softirqs last disabled at (8757): [<ffffffffbb0ecccd>] irq_enter+0x4d/0x70
> > [    2.890027] ---[ end trace deb5d563d2acb13f ]---
> 
> What is this backtrace for? It's completly useless as it merily shows
> that the warning triggers. Also even if it'd be useful then it wants to
> be trimmed properly.

I thought it was a good habit to keep the facts of issues.  Backtrace
is one of them so I kept them.  It could, for example, help people who
spot the same issue in an old/downstream kernel so when they google or
grepping git-log they know the exact issue has been solved by some
commit, even without much knowledge on the internals (because they can
exactly compare the whole dmesg error).

> 
> > I believe the same thing will happen to bare metals.
> 
> Believe is not really relevant in engineering.
> 
> The problem has nothing to do with virt or bare metal. It's a genuine
> issue.
> 
> > When allocating the IRQ for the device, activate_managed() will try to
> > allocate a vector based on what we've calculated for kernel managed
> > IRQs (which does not take HK_FLAG_MANAGED_IRQ into account).  However
> > when we bind the IRQ to the IRQ handler, we'll do irq_startup() and
> > irq_do_set_affinity(), in which we will start to consider the whole
> > HK_FLAG_MANAGED_IRQ logic.  This means the chosen core can be
> > different from when we do the allocation.  When that happens, we'll
> > need to be able to properly free the old vector on the old core.
> 
> There's lots of 'we' in that text. We do nothing really. Please describe
> things in neutral and factual language.
> 
> Also there is another way to trigger this: Offline all non-isolated CPUs
> in the mask and then bring one online again.

Thanks for your suggestions on not using subjective words and so on.
I'll remember these.

However I think I still miss one thing in the puzzle (although it
turns out that we've agreed on removing the warning already, but just
in case I missed something important) - do you mean that offlining all
the non-isolated CPUs in the mask won't trigger this already?  Because
I also saw some similar comment somewhere else...

Here's my understanding - when offlining, we'll disable the CPU and
reach:

  - irq_migrate_all_off_this_cpu
    - migrate_one_irq
      - irq_do_set_affinity
        - calculate HK_FLAG_MANAGED_IRQ and so on...

Then we can still trigger this irq move event even before we bring
another housekeeping cpu online, right?  Or could you guide me on what
I have missed?

Thanks,
Thomas Gleixner March 13, 2020, 11:54 p.m. UTC | #5
Peter Xu <peterx@redhat.com> writes:
> On Fri, Mar 13, 2020 at 03:24:08PM +0100, Thomas Gleixner wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> What is this backtrace for? It's completly useless as it merily shows
>> that the warning triggers. Also even if it'd be useful then it wants to
>> be trimmed properly.
>
> I thought it was a good habit to keep the facts of issues.  Backtrace
> is one of them so I kept them.  It could, for example, help people who
> spot the same issue in an old/downstream kernel so when they google or
> grepping git-log they know the exact issue has been solved by some
> commit, even without much knowledge on the internals (because they can
> exactly compare the whole dmesg error).

This is not really good habit. Changelogs should contain factual
information which is relevant to understand the problem. Backtraces are
useful when the callchain leading to a bug/warn/oops _is_ relevant for
understanding the issue. For things like this where the backtrace is
completly out of context it's more of an distraction than of value.

Aside of that your 'spot the same issue' argument is wrong in this
context as this does not affect anything old/downstream. The feature was
merged during the 5.6 merge window. So it has not reached anything
downstream which needs backports. If distro people pick out such a
feature and backport it to their frankenkernels before the final release
then I really dont care.

One way to preserve the backtrace for google's sake is to write a cover
letter and stick the back trace there if it is not useful in the
changelog.

Also having 40+ lines of untrimmed backtrace is just a horrible
distraction. The timestamps are completely useless and depending on the
type of problem most of the other gunk which is emitted by a
bug/warn/oops backtrace is useless as well.

Why would you be interested in the register values for this problem or
the code or the interrupt flags tracer state? That thing triggered a
warning and nothing in the backtrace gives any clue about why that
happened, IOW it's pure distraction.

If the call chain is relevant to explain the problem, then a trimmed
down version is really sufficient. Here is an example:

     BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
     RIP: 0010:apic_ack_edge+0x1e/0x40
     Call Trace:
       handle_edge_irq+0x7d/0x1e0
       generic_handle_irq+0x27/0x30
       aer_inject_write+0x53a/0x720

This helps, because it illustrates exactly how this BUG was triggered
and it's reduced to exactly the 6 lines because everything else in the
original 50+ lines is useless.

> However I think I still miss one thing in the puzzle (although it
> turns out that we've agreed on removing the warning already, but just
> in case I missed something important) - do you mean that offlining all
> the non-isolated CPUs in the mask won't trigger this already?  Because
> I also saw some similar comment somewhere else...
>
> Here's my understanding - when offlining, we'll disable the CPU and
> reach:
>
>   - irq_migrate_all_off_this_cpu
>     - migrate_one_irq
>       - irq_do_set_affinity
>         - calculate HK_FLAG_MANAGED_IRQ and so on...
>
> Then we can still trigger this irq move event even before we bring
> another housekeeping cpu online, right?  Or could you guide me on what
> I have missed?

The migration when offlining the CPU to which an interrupt is affine
does not trigger this because that uses a different mechanism.

It clears the vector on the outgoing CPU brute force simply because this
CPU can't handle any interrupts anymore. There is some logic to catch
the device interrupt racing against this, but while careful it's not
perfect. There is a theoretical hole there which probably could be
triggered by carefully orchestrating things, but we can't do anything
about it except disabling CPU unplug :)

So the only two ways to trigger this are the ones I described in the
changelog.

Hope that helps.

Thanks,

        tglx
Ming Lei March 14, 2020, 2:04 a.m. UTC | #6
On Fri, Mar 13, 2020 at 03:24:08PM +0100, Thomas Gleixner wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > After we introduced the "managed_irq" sub-parameter for isolcpus, it's
> > possible to free a kernel managed irq vector now.
> >
> > It can be triggered easily by booting a VM with a few vcpus, with one
> > virtio-blk device and then mark some cores as HK_FLAG_MANAGED_IRQ (in
> > below case, there're 4 vcpus, with vcpu 3 isolated with managed_irq):
> >
> > [    2.889911] ------------[ cut here ]------------
> > [    2.889964] WARNING: CPU: 3 PID: 0 at arch/x86/kernel/apic/vector.c:853 free_moved_vector+0x126/0x160
> 
> <SNIP>
> 
> > [    2.890026] softirqs last disabled at (8757): [<ffffffffbb0ecccd>] irq_enter+0x4d/0x70
> > [    2.890027] ---[ end trace deb5d563d2acb13f ]---
> 
> What is this backtrace for? It's completly useless as it merily shows
> that the warning triggers. Also even if it'd be useful then it wants to
> be trimmed properly.
> 
> > I believe the same thing will happen to bare metals.
> 
> Believe is not really relevant in engineering.
> 
> The problem has nothing to do with virt or bare metal. It's a genuine
> issue.
> 
> > When allocating the IRQ for the device, activate_managed() will try to
> > allocate a vector based on what we've calculated for kernel managed
> > IRQs (which does not take HK_FLAG_MANAGED_IRQ into account).  However
> > when we bind the IRQ to the IRQ handler, we'll do irq_startup() and
> > irq_do_set_affinity(), in which we will start to consider the whole
> > HK_FLAG_MANAGED_IRQ logic.  This means the chosen core can be
> > different from when we do the allocation.  When that happens, we'll
> > need to be able to properly free the old vector on the old core.
> 
> There's lots of 'we' in that text. We do nothing really. Please describe
> things in neutral and factual language.
> 
> Also there is another way to trigger this: Offline all non-isolated CPUs
> in the mask and then bring one online again.
> 
> Ming, I really have to ask why these two situations were not tested
> before the final submission of that isolation patch. Both issues have
> been discussed during review of the different versions. So the warning
> should have triggered back then already....

Hi Thomas,

I run CPU hotplug & unplug stress test with isolcpus:managed_irq, however
the test just checks if the irq's effective vector is setup correctly and
IO can be run as expected.

Looks dmesg warning is missed to check, sorry for that.

Thanks,
Ming
Peter Xu March 14, 2020, 11:53 p.m. UTC | #7
On Sat, Mar 14, 2020 at 12:54:12AM +0100, Thomas Gleixner wrote:
> Peter Xu <peterx@redhat.com> writes:
> > On Fri, Mar 13, 2020 at 03:24:08PM +0100, Thomas Gleixner wrote:
> >> Peter Xu <peterx@redhat.com> writes:
> >> What is this backtrace for? It's completly useless as it merily shows
> >> that the warning triggers. Also even if it'd be useful then it wants to
> >> be trimmed properly.
> >
> > I thought it was a good habit to keep the facts of issues.  Backtrace
> > is one of them so I kept them.  It could, for example, help people who
> > spot the same issue in an old/downstream kernel so when they google or
> > grepping git-log they know the exact issue has been solved by some
> > commit, even without much knowledge on the internals (because they can
> > exactly compare the whole dmesg error).
> 
> This is not really good habit. Changelogs should contain factual
> information which is relevant to understand the problem. Backtraces are
> useful when the callchain leading to a bug/warn/oops _is_ relevant for
> understanding the issue. For things like this where the backtrace is
> completly out of context it's more of an distraction than of value.
> 
> Aside of that your 'spot the same issue' argument is wrong in this
> context as this does not affect anything old/downstream. The feature was
> merged during the 5.6 merge window. So it has not reached anything
> downstream which needs backports. If distro people pick out such a
> feature and backport it to their frankenkernels before the final release
> then I really dont care.
> 
> One way to preserve the backtrace for google's sake is to write a cover
> letter and stick the back trace there if it is not useful in the
> changelog.
> 
> Also having 40+ lines of untrimmed backtrace is just a horrible
> distraction. The timestamps are completely useless and depending on the
> type of problem most of the other gunk which is emitted by a
> bug/warn/oops backtrace is useless as well.
> 
> Why would you be interested in the register values for this problem or
> the code or the interrupt flags tracer state? That thing triggered a
> warning and nothing in the backtrace gives any clue about why that
> happened, IOW it's pure distraction.
> 
> If the call chain is relevant to explain the problem, then a trimmed
> down version is really sufficient. Here is an example:
> 
>      BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
>      RIP: 0010:apic_ack_edge+0x1e/0x40
>      Call Trace:
>        handle_edge_irq+0x7d/0x1e0
>        generic_handle_irq+0x27/0x30
>        aer_inject_write+0x53a/0x720
> 
> This helps, because it illustrates exactly how this BUG was triggered
> and it's reduced to exactly the 6 lines because everything else in the
> original 50+ lines is useless.

Fair enough.

> 
> > However I think I still miss one thing in the puzzle (although it
> > turns out that we've agreed on removing the warning already, but just
> > in case I missed something important) - do you mean that offlining all
> > the non-isolated CPUs in the mask won't trigger this already?  Because
> > I also saw some similar comment somewhere else...
> >
> > Here's my understanding - when offlining, we'll disable the CPU and
> > reach:
> >
> >   - irq_migrate_all_off_this_cpu
> >     - migrate_one_irq
> >       - irq_do_set_affinity
> >         - calculate HK_FLAG_MANAGED_IRQ and so on...
> >
> > Then we can still trigger this irq move event even before we bring
> > another housekeeping cpu online, right?  Or could you guide me on what
> > I have missed?
> 
> The migration when offlining the CPU to which an interrupt is affine
> does not trigger this because that uses a different mechanism.
> 
> It clears the vector on the outgoing CPU brute force simply because this
> CPU can't handle any interrupts anymore. There is some logic to catch
> the device interrupt racing against this, but while careful it's not
> perfect. There is a theoretical hole there which probably could be
> triggered by carefully orchestrating things, but we can't do anything
> about it except disabling CPU unplug :)

Ah I haven't thought about the "theoretical hole" and that far...  I
think I was only focusing on whether set_affinity() would happen, but
I ignored the fact that x86 __send_cleanup_vector() treated offlining
CPU in the special way to avoid sending IRQ_MOVE_CLEANUP_VECTOR at
all.  And that part makes perfect sense since we can't do much with an
offlined core.

> 
> So the only two ways to trigger this are the ones I described in the
> changelog.
> 
> Hope that helps.

Definitely.

Thanks for writting this up, Thomas!

Patch
diff mbox series

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 2c5676b0a6e7..a1142260b123 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -837,14 +837,6 @@  static void free_moved_vector(struct apic_chip_data *apicd)
 	unsigned int cpu = apicd->prev_cpu;
 	bool managed = apicd->is_managed;
 
-	/*
-	 * This should never happen. Managed interrupts are not
-	 * migrated except on CPU down, which does not involve the
-	 * cleanup vector. But try to keep the accounting correct
-	 * nevertheless.
-	 */
-	WARN_ON_ONCE(managed);
-
 	trace_vector_free_moved(apicd->irq, cpu, vector, managed);
 	irq_matrix_free(vector_matrix, cpu, vector, managed);
 	per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;