linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] x86/vector: Allow to free vector for managed IRQ
@ 2020-03-12 20:58 Peter Xu
  2020-03-13  3:13 ` Ming Lei
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Peter Xu @ 2020-03-12 20:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterx, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Ming Lei,
	Juri Lelli

After we introduced the "managed_irq" sub-parameter for isolcpus, it's
possible to free a kernel managed irq vector now.

It can be triggered easily by booting a VM with a few vcpus, with one
virtio-blk device and then mark some cores as HK_FLAG_MANAGED_IRQ (in
below case, there're 4 vcpus, with vcpu 3 isolated with managed_irq):

[    2.889911] ------------[ cut here ]------------
[    2.889964] WARNING: CPU: 3 PID: 0 at arch/x86/kernel/apic/vector.c:853 free_moved_vector+0x126/0x160
[    2.889964] Modules linked in: crc32c_intel serio_raw virtio_blk(+) qemu_fw_cfg
[    2.889968] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.6.0-rc1 #18
[    2.889969] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[    2.889970] RIP: 0010:free_moved_vector+0x126/0x160
[    2.889972] Code: 45 00 48 85 c0 75 df e9 2b ff ff ff 48 8b 05 f9 51 71 01 e8 3c 5a 11 00 85 c0 74 09 80 3d 8d 39 71 01 00 5
[    2.889972] RSP: 0000:ffffb5ac00110fa0 EFLAGS: 00010002
[    2.889973] RAX: 0000000000000001 RBX: ffffa00fad2d60c0 RCX: 0000000000000821
[    2.889974] RDX: 0000000000000000 RSI: 00000000fd2bd4ba RDI: ffffa00fad2d60c0
[    2.889974] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
[    2.889975] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000023
[    2.889975] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
[    2.889976] FS:  0000000000000000(0000) GS:ffffa00fbbd80000(0000) knlGS:0000000000000000
[    2.889977] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.889978] CR2: 00000000ffffffff CR3: 0000000123610001 CR4: 0000000000360ee0
[    2.889980] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    2.889980] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    2.889981] Call Trace:
[    2.889982]  <IRQ>
[    2.889987]  smp_irq_move_cleanup_interrupt+0xac/0xc6
[    2.889989]  irq_move_cleanup_interrupt+0xc/0x20
[    2.889990]  </IRQ>
[    2.889991] RIP: 0010:native_safe_halt+0xe/0x10
[    2.889992] Code: cc cc cc cc cc cc cc cc cc cc cc cc e9 07 00 00 00 0f 00 2d 56 82 4f 00 f4 c3 66 90 e9 07 00 00 00 0f 00 0
[    2.889993] RSP: 0000:ffffb5ac00083eb0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffffdf
[    2.889994] RAX: ffffa00fbb260000 RBX: 0000000000000003 RCX: 0000000000000000
[    2.889994] RDX: ffffa00fbb260000 RSI: 0000000000000006 RDI: ffffa00fbb260000
[    2.889995] RBP: 0000000000000003 R08: 000000cd42e4dffb R09: 0000000000000000
[    2.889995] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa00fbb260000
[    2.889996] R13: 0000000000000000 R14: 0000000000000000 R15: ffffa00fbb260000
[    2.890003]  default_idle+0x1f/0x140
[    2.890006]  do_idle+0x1fa/0x270
[    2.890010]  cpu_startup_entry+0x19/0x20
[    2.890012]  start_secondary+0x164/0x1b0
[    2.890014]  secondary_startup_64+0xa4/0xb0
[    2.890021] irq event stamp: 8758
[    2.890022] hardirqs last  enabled at (8755): [<ffffffffbbb105ca>] default_idle+0x1a/0x140
[    2.890023] hardirqs last disabled at (8756): [<ffffffffbb0039f7>] trace_hardirqs_off_thunk+0x1a/0x1c
[    2.890025] softirqs last  enabled at (8758): [<ffffffffbb0ecce8>] irq_enter+0x68/0x70
[    2.890026] softirqs last disabled at (8757): [<ffffffffbb0ecccd>] irq_enter+0x4d/0x70
[    2.890027] ---[ end trace deb5d563d2acb13f ]---

I believe the same thing will happen to bare metals.

When allocating the IRQ for the device, activate_managed() will try to
allocate a vector based on what we've calculated for kernel managed
IRQs (which does not take HK_FLAG_MANAGED_IRQ into account).  However
when we bind the IRQ to the IRQ handler, we'll do irq_startup() and
irq_do_set_affinity(), in which we will start to consider the whole
HK_FLAG_MANAGED_IRQ logic.  This means the chosen core can be
different from when we do the allocation.  When that happens, we'll
need to be able to properly free the old vector on the old core.

Remove the WARN_ON_ONCE() to allow this to happen.

Fixes: 11ea68f553e2 ("genirq, sched/isolation: Isolate from handling managed interrupts")
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Ming Lei <minlei@redhat.com>
CC: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kernel/apic/vector.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 2c5676b0a6e7..a1142260b123 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -837,14 +837,6 @@ static void free_moved_vector(struct apic_chip_data *apicd)
 	unsigned int cpu = apicd->prev_cpu;
 	bool managed = apicd->is_managed;
 
-	/*
-	 * This should never happen. Managed interrupts are not
-	 * migrated except on CPU down, which does not involve the
-	 * cleanup vector. But try to keep the accounting correct
-	 * nevertheless.
-	 */
-	WARN_ON_ONCE(managed);
-
 	trace_vector_free_moved(apicd->irq, cpu, vector, managed);
 	irq_matrix_free(vector_matrix, cpu, vector, managed);
 	per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] x86/vector: Allow to free vector for managed IRQ
  2020-03-12 20:58 [PATCH] x86/vector: Allow to free vector for managed IRQ Peter Xu
@ 2020-03-13  3:13 ` Ming Lei
  2020-03-13 12:32   ` Peter Xu
  2020-03-13 14:24 ` Thomas Gleixner
  2020-03-13 14:31 ` [tip: x86/urgent] x86/vector: Remove warning on managed interrupt migration tip-bot2 for Peter Xu
  2 siblings, 1 reply; 9+ messages in thread
From: Ming Lei @ 2020-03-13  3:13 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Ming Lei, Juri Lelli

On Thu, Mar 12, 2020 at 04:58:30PM -0400, Peter Xu wrote:
> After we introduced the "managed_irq" sub-parameter for isolcpus, it's
> possible to free a kernel managed irq vector now.
> 
> It can be triggered easily by booting a VM with a few vcpus, with one
> virtio-blk device and then mark some cores as HK_FLAG_MANAGED_IRQ (in
> below case, there're 4 vcpus, with vcpu 3 isolated with managed_irq):
> 
> [    2.889911] ------------[ cut here ]------------
> [    2.889964] WARNING: CPU: 3 PID: 0 at arch/x86/kernel/apic/vector.c:853 free_moved_vector+0x126/0x160
> [    2.889964] Modules linked in: crc32c_intel serio_raw virtio_blk(+) qemu_fw_cfg
> [    2.889968] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.6.0-rc1 #18
> [    2.889969] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> [    2.889970] RIP: 0010:free_moved_vector+0x126/0x160
> [    2.889972] Code: 45 00 48 85 c0 75 df e9 2b ff ff ff 48 8b 05 f9 51 71 01 e8 3c 5a 11 00 85 c0 74 09 80 3d 8d 39 71 01 00 5
> [    2.889972] RSP: 0000:ffffb5ac00110fa0 EFLAGS: 00010002
> [    2.889973] RAX: 0000000000000001 RBX: ffffa00fad2d60c0 RCX: 0000000000000821
> [    2.889974] RDX: 0000000000000000 RSI: 00000000fd2bd4ba RDI: ffffa00fad2d60c0
> [    2.889974] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
> [    2.889975] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000023
> [    2.889975] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
> [    2.889976] FS:  0000000000000000(0000) GS:ffffa00fbbd80000(0000) knlGS:0000000000000000
> [    2.889977] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    2.889978] CR2: 00000000ffffffff CR3: 0000000123610001 CR4: 0000000000360ee0
> [    2.889980] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    2.889980] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    2.889981] Call Trace:
> [    2.889982]  <IRQ>
> [    2.889987]  smp_irq_move_cleanup_interrupt+0xac/0xc6
> [    2.889989]  irq_move_cleanup_interrupt+0xc/0x20
> [    2.889990]  </IRQ>
> [    2.889991] RIP: 0010:native_safe_halt+0xe/0x10
> [    2.889992] Code: cc cc cc cc cc cc cc cc cc cc cc cc e9 07 00 00 00 0f 00 2d 56 82 4f 00 f4 c3 66 90 e9 07 00 00 00 0f 00 0
> [    2.889993] RSP: 0000:ffffb5ac00083eb0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffffdf
> [    2.889994] RAX: ffffa00fbb260000 RBX: 0000000000000003 RCX: 0000000000000000
> [    2.889994] RDX: ffffa00fbb260000 RSI: 0000000000000006 RDI: ffffa00fbb260000
> [    2.889995] RBP: 0000000000000003 R08: 000000cd42e4dffb R09: 0000000000000000
> [    2.889995] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa00fbb260000
> [    2.889996] R13: 0000000000000000 R14: 0000000000000000 R15: ffffa00fbb260000
> [    2.890003]  default_idle+0x1f/0x140
> [    2.890006]  do_idle+0x1fa/0x270
> [    2.890010]  cpu_startup_entry+0x19/0x20
> [    2.890012]  start_secondary+0x164/0x1b0
> [    2.890014]  secondary_startup_64+0xa4/0xb0
> [    2.890021] irq event stamp: 8758
> [    2.890022] hardirqs last  enabled at (8755): [<ffffffffbbb105ca>] default_idle+0x1a/0x140
> [    2.890023] hardirqs last disabled at (8756): [<ffffffffbb0039f7>] trace_hardirqs_off_thunk+0x1a/0x1c
> [    2.890025] softirqs last  enabled at (8758): [<ffffffffbb0ecce8>] irq_enter+0x68/0x70
> [    2.890026] softirqs last disabled at (8757): [<ffffffffbb0ecccd>] irq_enter+0x4d/0x70
> [    2.890027] ---[ end trace deb5d563d2acb13f ]---
> 
> I believe the same thing will happen to bare metals.
> 
> When allocating the IRQ for the device, activate_managed() will try to
> allocate a vector based on what we've calculated for kernel managed
> IRQs (which does not take HK_FLAG_MANAGED_IRQ into account).  However
> when we bind the IRQ to the IRQ handler, we'll do irq_startup() and
> irq_do_set_affinity(), in which we will start to consider the whole
> HK_FLAG_MANAGED_IRQ logic.  This means the chosen core can be
> different from when we do the allocation.  When that happens, we'll
> need to be able to properly free the old vector on the old core.
> 
> Remove the WARN_ON_ONCE() to allow this to happen.
> 
> Fixes: 11ea68f553e2 ("genirq, sched/isolation: Isolate from handling managed interrupts")
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Ingo Molnar <mingo@redhat.com>
> CC: Peter Zijlstra <peterz@infradead.org>
> CC: Ming Lei <minlei@redhat.com>
> CC: Juri Lelli <juri.lelli@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/x86/kernel/apic/vector.c | 8 --------
>  1 file changed, 8 deletions(-)
> 
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index 2c5676b0a6e7..a1142260b123 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -837,14 +837,6 @@ static void free_moved_vector(struct apic_chip_data *apicd)
>  	unsigned int cpu = apicd->prev_cpu;
>  	bool managed = apicd->is_managed;
>  
> -	/*
> -	 * This should never happen. Managed interrupts are not
> -	 * migrated except on CPU down, which does not involve the
> -	 * cleanup vector. But try to keep the accounting correct
> -	 * nevertheless.
> -	 */
> -	WARN_ON_ONCE(managed);
> -

Since commit 11ea68f553e2, managed interrupt can be migrated on CPU up:

Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks,
Ming


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] x86/vector: Allow to free vector for managed IRQ
  2020-03-13  3:13 ` Ming Lei
@ 2020-03-13 12:32   ` Peter Xu
  0 siblings, 0 replies; 9+ messages in thread
From: Peter Xu @ 2020-03-13 12:32 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Ming Lei, Juri Lelli

On Fri, Mar 13, 2020 at 11:13:41AM +0800, Ming Lei wrote:
> On Thu, Mar 12, 2020 at 04:58:30PM -0400, Peter Xu wrote:
> > After we introduced the "managed_irq" sub-parameter for isolcpus, it's
> > possible to free a kernel managed irq vector now.
> > 
> > It can be triggered easily by booting a VM with a few vcpus, with one
> > virtio-blk device and then mark some cores as HK_FLAG_MANAGED_IRQ (in
> > below case, there're 4 vcpus, with vcpu 3 isolated with managed_irq):
> > 
> > [    2.889911] ------------[ cut here ]------------
> > [    2.889964] WARNING: CPU: 3 PID: 0 at arch/x86/kernel/apic/vector.c:853 free_moved_vector+0x126/0x160
> > [    2.889964] Modules linked in: crc32c_intel serio_raw virtio_blk(+) qemu_fw_cfg
> > [    2.889968] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.6.0-rc1 #18
> > [    2.889969] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> > [    2.889970] RIP: 0010:free_moved_vector+0x126/0x160
> > [    2.889972] Code: 45 00 48 85 c0 75 df e9 2b ff ff ff 48 8b 05 f9 51 71 01 e8 3c 5a 11 00 85 c0 74 09 80 3d 8d 39 71 01 00 5
> > [    2.889972] RSP: 0000:ffffb5ac00110fa0 EFLAGS: 00010002
> > [    2.889973] RAX: 0000000000000001 RBX: ffffa00fad2d60c0 RCX: 0000000000000821
> > [    2.889974] RDX: 0000000000000000 RSI: 00000000fd2bd4ba RDI: ffffa00fad2d60c0
> > [    2.889974] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
> > [    2.889975] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000023
> > [    2.889975] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
> > [    2.889976] FS:  0000000000000000(0000) GS:ffffa00fbbd80000(0000) knlGS:0000000000000000
> > [    2.889977] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [    2.889978] CR2: 00000000ffffffff CR3: 0000000123610001 CR4: 0000000000360ee0
> > [    2.889980] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [    2.889980] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [    2.889981] Call Trace:
> > [    2.889982]  <IRQ>
> > [    2.889987]  smp_irq_move_cleanup_interrupt+0xac/0xc6
> > [    2.889989]  irq_move_cleanup_interrupt+0xc/0x20
> > [    2.889990]  </IRQ>
> > [    2.889991] RIP: 0010:native_safe_halt+0xe/0x10
> > [    2.889992] Code: cc cc cc cc cc cc cc cc cc cc cc cc e9 07 00 00 00 0f 00 2d 56 82 4f 00 f4 c3 66 90 e9 07 00 00 00 0f 00 0
> > [    2.889993] RSP: 0000:ffffb5ac00083eb0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffffdf
> > [    2.889994] RAX: ffffa00fbb260000 RBX: 0000000000000003 RCX: 0000000000000000
> > [    2.889994] RDX: ffffa00fbb260000 RSI: 0000000000000006 RDI: ffffa00fbb260000
> > [    2.889995] RBP: 0000000000000003 R08: 000000cd42e4dffb R09: 0000000000000000
> > [    2.889995] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa00fbb260000
> > [    2.889996] R13: 0000000000000000 R14: 0000000000000000 R15: ffffa00fbb260000
> > [    2.890003]  default_idle+0x1f/0x140
> > [    2.890006]  do_idle+0x1fa/0x270
> > [    2.890010]  cpu_startup_entry+0x19/0x20
> > [    2.890012]  start_secondary+0x164/0x1b0
> > [    2.890014]  secondary_startup_64+0xa4/0xb0
> > [    2.890021] irq event stamp: 8758
> > [    2.890022] hardirqs last  enabled at (8755): [<ffffffffbbb105ca>] default_idle+0x1a/0x140
> > [    2.890023] hardirqs last disabled at (8756): [<ffffffffbb0039f7>] trace_hardirqs_off_thunk+0x1a/0x1c
> > [    2.890025] softirqs last  enabled at (8758): [<ffffffffbb0ecce8>] irq_enter+0x68/0x70
> > [    2.890026] softirqs last disabled at (8757): [<ffffffffbb0ecccd>] irq_enter+0x4d/0x70
> > [    2.890027] ---[ end trace deb5d563d2acb13f ]---
> > 
> > I believe the same thing will happen to bare metals.
> > 
> > When allocating the IRQ for the device, activate_managed() will try to
> > allocate a vector based on what we've calculated for kernel managed
> > IRQs (which does not take HK_FLAG_MANAGED_IRQ into account).  However
> > when we bind the IRQ to the IRQ handler, we'll do irq_startup() and
> > irq_do_set_affinity(), in which we will start to consider the whole
> > HK_FLAG_MANAGED_IRQ logic.  This means the chosen core can be
> > different from when we do the allocation.  When that happens, we'll
> > need to be able to properly free the old vector on the old core.
> > 
> > Remove the WARN_ON_ONCE() to allow this to happen.
> > 
> > Fixes: 11ea68f553e2 ("genirq, sched/isolation: Isolate from handling managed interrupts")
> > CC: Thomas Gleixner <tglx@linutronix.de>
> > CC: Ingo Molnar <mingo@redhat.com>
> > CC: Peter Zijlstra <peterz@infradead.org>
> > CC: Ming Lei <minlei@redhat.com>
> > CC: Juri Lelli <juri.lelli@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  arch/x86/kernel/apic/vector.c | 8 --------
> >  1 file changed, 8 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> > index 2c5676b0a6e7..a1142260b123 100644
> > --- a/arch/x86/kernel/apic/vector.c
> > +++ b/arch/x86/kernel/apic/vector.c
> > @@ -837,14 +837,6 @@ static void free_moved_vector(struct apic_chip_data *apicd)
> >  	unsigned int cpu = apicd->prev_cpu;
> >  	bool managed = apicd->is_managed;
> >  
> > -	/*
> > -	 * This should never happen. Managed interrupts are not
> > -	 * migrated except on CPU down, which does not involve the
> > -	 * cleanup vector. But try to keep the accounting correct
> > -	 * nevertheless.
> > -	 */
> > -	WARN_ON_ONCE(managed);
> > -
> 
> Since commit 11ea68f553e2, managed interrupt can be migrated on CPU up:

Right, my example was triggered even earlier than cpu up (every boot,
because right now we'll allocate the managed irq vector even during
allocation of the IRQ, then another time when setting the affinity).

And IIUC cpu down should also be able to trigger this, say:

  queue1 -> CPU 0,1
  queue2 -> CPU 2,3

Assuming housekeeping cores are 0-2 and CPU3 is isolated, queue2 IRQ
will need to target CPU2 first.  Then if we offline CPU2, the queue2
IRQ should need to move to CPU3 (which is the isolated core) instead.
Then we need to be able to free the old vector on CPU2.

> 
> Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks!

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] x86/vector: Allow to free vector for managed IRQ
  2020-03-12 20:58 [PATCH] x86/vector: Allow to free vector for managed IRQ Peter Xu
  2020-03-13  3:13 ` Ming Lei
@ 2020-03-13 14:24 ` Thomas Gleixner
  2020-03-13 15:19   ` Peter Xu
  2020-03-14  2:04   ` Ming Lei
  2020-03-13 14:31 ` [tip: x86/urgent] x86/vector: Remove warning on managed interrupt migration tip-bot2 for Peter Xu
  2 siblings, 2 replies; 9+ messages in thread
From: Thomas Gleixner @ 2020-03-13 14:24 UTC (permalink / raw)
  To: Peter Xu, linux-kernel
  Cc: peterx, Ingo Molnar, Peter Zijlstra, Ming Lei, Juri Lelli

Peter Xu <peterx@redhat.com> writes:

> After we introduced the "managed_irq" sub-parameter for isolcpus, it's
> possible to free a kernel managed irq vector now.
>
> It can be triggered easily by booting a VM with a few vcpus, with one
> virtio-blk device and then mark some cores as HK_FLAG_MANAGED_IRQ (in
> below case, there're 4 vcpus, with vcpu 3 isolated with managed_irq):
>
> [    2.889911] ------------[ cut here ]------------
> [    2.889964] WARNING: CPU: 3 PID: 0 at arch/x86/kernel/apic/vector.c:853 free_moved_vector+0x126/0x160

<SNIP>

> [    2.890026] softirqs last disabled at (8757): [<ffffffffbb0ecccd>] irq_enter+0x4d/0x70
> [    2.890027] ---[ end trace deb5d563d2acb13f ]---

What is this backtrace for? It's completly useless as it merily shows
that the warning triggers. Also even if it'd be useful then it wants to
be trimmed properly.

> I believe the same thing will happen to bare metals.

Believe is not really relevant in engineering.

The problem has nothing to do with virt or bare metal. It's a genuine
issue.

> When allocating the IRQ for the device, activate_managed() will try to
> allocate a vector based on what we've calculated for kernel managed
> IRQs (which does not take HK_FLAG_MANAGED_IRQ into account).  However
> when we bind the IRQ to the IRQ handler, we'll do irq_startup() and
> irq_do_set_affinity(), in which we will start to consider the whole
> HK_FLAG_MANAGED_IRQ logic.  This means the chosen core can be
> different from when we do the allocation.  When that happens, we'll
> need to be able to properly free the old vector on the old core.

There's lots of 'we' in that text. We do nothing really. Please describe
things in neutral and factual language.

Also there is another way to trigger this: Offline all non-isolated CPUs
in the mask and then bring one online again.

Ming, I really have to ask why these two situations were not tested
before the final submission of that isolation patch. Both issues have
been discussed during review of the different versions. So the warning
should have triggered back then already....

> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index 2c5676b0a6e7..a1142260b123 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -837,14 +837,6 @@ static void free_moved_vector(struct apic_chip_data *apicd)
>  	unsigned int cpu = apicd->prev_cpu;
>  	bool managed = apicd->is_managed;
>  
> -	/*
> -	 * This should never happen. Managed interrupts are not
> -	 * migrated except on CPU down, which does not involve the
> -	 * cleanup vector. But try to keep the accounting correct
> -	 * nevertheless.
> -	 */

While the comment is not longer correct, removing it is lame. This
should have an explanation why managed interrupts can end up here.

No need to resend. I fixed it up already.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [tip: x86/urgent] x86/vector: Remove warning on managed interrupt migration
  2020-03-12 20:58 [PATCH] x86/vector: Allow to free vector for managed IRQ Peter Xu
  2020-03-13  3:13 ` Ming Lei
  2020-03-13 14:24 ` Thomas Gleixner
@ 2020-03-13 14:31 ` tip-bot2 for Peter Xu
  2 siblings, 0 replies; 9+ messages in thread
From: tip-bot2 for Peter Xu @ 2020-03-13 14:31 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Xu, Thomas Gleixner, Ming Lei, x86, LKML

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     469ff207b4c4033540b50bc59587dc915faa1367
Gitweb:        https://git.kernel.org/tip/469ff207b4c4033540b50bc59587dc915faa1367
Author:        Peter Xu <peterx@redhat.com>
AuthorDate:    Thu, 12 Mar 2020 16:58:30 -04:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 13 Mar 2020 15:29:26 +01:00

x86/vector: Remove warning on managed interrupt migration

The vector management code assumes that managed interrupts cannot be
migrated away from an online CPU. free_moved_vector() has a WARN_ON_ONCE()
which triggers when a managed interrupt vector association on a online CPU
is cleared. The CPU offline code uses a different mechanism which cannot
trigger this.

This assumption is not longer correct because the new CPU isolation feature
which affects the placement of managed interrupts must be able to move a
managed interrupt away from an online CPU.

There are two reasons why this can happen:

  1) When the interrupt is activated the affinity mask which was
     established in irq_create_affinity_masks() is handed in to
     the vector allocation code. This mask contains all CPUs to which
     the interrupt can be made affine to, but this does not take the
     CPU isolation 'managed_irq' mask into account.

     When the interrupt is finally requested by the device driver then the
     affinity is checked again and the CPU isolation 'managed_irq' mask is
     taken into account, which moves the interrupt to a non-isolated CPU if
     possible.

  2) The interrupt can be affine to an isolated CPU because the
     non-isolated CPUs in the calculated affinity mask are not online.

     Once a non-isolated CPU which is in the mask comes online the
     interrupt is migrated to this non-isolated CPU

In both cases the regular online migration mechanism is used which triggers
the WARN_ON_ONCE() in free_moved_vector().

Case #1 could have been addressed by taking the isolation mask into
account, but that would require a massive code change in the activation
logic and the eventual migration event was accepted as a reasonable
tradeoff when the isolation feature was developed. But even if #1 would be
addressed, #2 would still trigger it.

Of course the warning in free_moved_vector() was overlooked at that time
and the above two cases which have been discussed during patch review have
obviously never been tested before the final submission.

So keep it simple and remove the warning.

[ tglx: Rewrote changelog and added a comment to free_moved_vector() ]

Fixes: 11ea68f553e2 ("genirq, sched/isolation: Isolate from handling managed interrupts")
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>                                                                                                                                                                       
Link: https://lkml.kernel.org/r/20200312205830.81796-1-peterx@redhat.com

---
 arch/x86/kernel/apic/vector.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 2c5676b..48293d1 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -838,13 +838,15 @@ static void free_moved_vector(struct apic_chip_data *apicd)
 	bool managed = apicd->is_managed;
 
 	/*
-	 * This should never happen. Managed interrupts are not
-	 * migrated except on CPU down, which does not involve the
-	 * cleanup vector. But try to keep the accounting correct
-	 * nevertheless.
+	 * Managed interrupts are usually not migrated away
+	 * from an online CPU, but CPU isolation 'managed_irq'
+	 * can make that happen.
+	 * 1) Activation does not take the isolation into account
+	 *    to keep the code simple
+	 * 2) Migration away from an isolated CPU can happen when
+	 *    a non-isolated CPU which is in the calculated
+	 *    affinity mask comes online.
 	 */
-	WARN_ON_ONCE(managed);
-
 	trace_vector_free_moved(apicd->irq, cpu, vector, managed);
 	irq_matrix_free(vector_matrix, cpu, vector, managed);
 	per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] x86/vector: Allow to free vector for managed IRQ
  2020-03-13 14:24 ` Thomas Gleixner
@ 2020-03-13 15:19   ` Peter Xu
  2020-03-13 23:54     ` Thomas Gleixner
  2020-03-14  2:04   ` Ming Lei
  1 sibling, 1 reply; 9+ messages in thread
From: Peter Xu @ 2020-03-13 15:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Ming Lei, Juri Lelli

On Fri, Mar 13, 2020 at 03:24:08PM +0100, Thomas Gleixner wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > After we introduced the "managed_irq" sub-parameter for isolcpus, it's
> > possible to free a kernel managed irq vector now.
> >
> > It can be triggered easily by booting a VM with a few vcpus, with one
> > virtio-blk device and then mark some cores as HK_FLAG_MANAGED_IRQ (in
> > below case, there're 4 vcpus, with vcpu 3 isolated with managed_irq):
> >
> > [    2.889911] ------------[ cut here ]------------
> > [    2.889964] WARNING: CPU: 3 PID: 0 at arch/x86/kernel/apic/vector.c:853 free_moved_vector+0x126/0x160
> 
> <SNIP>
> 
> > [    2.890026] softirqs last disabled at (8757): [<ffffffffbb0ecccd>] irq_enter+0x4d/0x70
> > [    2.890027] ---[ end trace deb5d563d2acb13f ]---
> 
> What is this backtrace for? It's completly useless as it merily shows
> that the warning triggers. Also even if it'd be useful then it wants to
> be trimmed properly.

I thought it was a good habit to keep the facts of issues.  Backtrace
is one of them so I kept them.  It could, for example, help people who
spot the same issue in an old/downstream kernel so when they google or
grepping git-log they know the exact issue has been solved by some
commit, even without much knowledge on the internals (because they can
exactly compare the whole dmesg error).

> 
> > I believe the same thing will happen to bare metals.
> 
> Believe is not really relevant in engineering.
> 
> The problem has nothing to do with virt or bare metal. It's a genuine
> issue.
> 
> > When allocating the IRQ for the device, activate_managed() will try to
> > allocate a vector based on what we've calculated for kernel managed
> > IRQs (which does not take HK_FLAG_MANAGED_IRQ into account).  However
> > when we bind the IRQ to the IRQ handler, we'll do irq_startup() and
> > irq_do_set_affinity(), in which we will start to consider the whole
> > HK_FLAG_MANAGED_IRQ logic.  This means the chosen core can be
> > different from when we do the allocation.  When that happens, we'll
> > need to be able to properly free the old vector on the old core.
> 
> There's lots of 'we' in that text. We do nothing really. Please describe
> things in neutral and factual language.
> 
> Also there is another way to trigger this: Offline all non-isolated CPUs
> in the mask and then bring one online again.

Thanks for your suggestions on not using subjective words and so on.
I'll remember these.

However I think I still miss one thing in the puzzle (although it
turns out that we've agreed on removing the warning already, but just
in case I missed something important) - do you mean that offlining all
the non-isolated CPUs in the mask won't trigger this already?  Because
I also saw some similar comment somewhere else...

Here's my understanding - when offlining, we'll disable the CPU and
reach:

  - irq_migrate_all_off_this_cpu
    - migrate_one_irq
      - irq_do_set_affinity
        - calculate HK_FLAG_MANAGED_IRQ and so on...

Then we can still trigger this irq move event even before we bring
another housekeeping cpu online, right?  Or could you guide me on what
I have missed?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] x86/vector: Allow to free vector for managed IRQ
  2020-03-13 15:19   ` Peter Xu
@ 2020-03-13 23:54     ` Thomas Gleixner
  2020-03-14 23:53       ` Peter Xu
  0 siblings, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2020-03-13 23:54 UTC (permalink / raw)
  To: Peter Xu; +Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Ming Lei, Juri Lelli

Peter Xu <peterx@redhat.com> writes:
> On Fri, Mar 13, 2020 at 03:24:08PM +0100, Thomas Gleixner wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> What is this backtrace for? It's completly useless as it merily shows
>> that the warning triggers. Also even if it'd be useful then it wants to
>> be trimmed properly.
>
> I thought it was a good habit to keep the facts of issues.  Backtrace
> is one of them so I kept them.  It could, for example, help people who
> spot the same issue in an old/downstream kernel so when they google or
> grepping git-log they know the exact issue has been solved by some
> commit, even without much knowledge on the internals (because they can
> exactly compare the whole dmesg error).

This is not really good habit. Changelogs should contain factual
information which is relevant to understand the problem. Backtraces are
useful when the callchain leading to a bug/warn/oops _is_ relevant for
understanding the issue. For things like this where the backtrace is
completly out of context it's more of an distraction than of value.

Aside of that your 'spot the same issue' argument is wrong in this
context as this does not affect anything old/downstream. The feature was
merged during the 5.6 merge window. So it has not reached anything
downstream which needs backports. If distro people pick out such a
feature and backport it to their frankenkernels before the final release
then I really dont care.

One way to preserve the backtrace for google's sake is to write a cover
letter and stick the back trace there if it is not useful in the
changelog.

Also having 40+ lines of untrimmed backtrace is just a horrible
distraction. The timestamps are completely useless and depending on the
type of problem most of the other gunk which is emitted by a
bug/warn/oops backtrace is useless as well.

Why would you be interested in the register values for this problem or
the code or the interrupt flags tracer state? That thing triggered a
warning and nothing in the backtrace gives any clue about why that
happened, IOW it's pure distraction.

If the call chain is relevant to explain the problem, then a trimmed
down version is really sufficient. Here is an example:

     BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
     RIP: 0010:apic_ack_edge+0x1e/0x40
     Call Trace:
       handle_edge_irq+0x7d/0x1e0
       generic_handle_irq+0x27/0x30
       aer_inject_write+0x53a/0x720

This helps, because it illustrates exactly how this BUG was triggered
and it's reduced to exactly the 6 lines because everything else in the
original 50+ lines is useless.

> However I think I still miss one thing in the puzzle (although it
> turns out that we've agreed on removing the warning already, but just
> in case I missed something important) - do you mean that offlining all
> the non-isolated CPUs in the mask won't trigger this already?  Because
> I also saw some similar comment somewhere else...
>
> Here's my understanding - when offlining, we'll disable the CPU and
> reach:
>
>   - irq_migrate_all_off_this_cpu
>     - migrate_one_irq
>       - irq_do_set_affinity
>         - calculate HK_FLAG_MANAGED_IRQ and so on...
>
> Then we can still trigger this irq move event even before we bring
> another housekeeping cpu online, right?  Or could you guide me on what
> I have missed?

The migration when offlining the CPU to which an interrupt is affine
does not trigger this because that uses a different mechanism.

It clears the vector on the outgoing CPU brute force simply because this
CPU can't handle any interrupts anymore. There is some logic to catch
the device interrupt racing against this, but while careful it's not
perfect. There is a theoretical hole there which probably could be
triggered by carefully orchestrating things, but we can't do anything
about it except disabling CPU unplug :)

So the only two ways to trigger this are the ones I described in the
changelog.

Hope that helps.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] x86/vector: Allow to free vector for managed IRQ
  2020-03-13 14:24 ` Thomas Gleixner
  2020-03-13 15:19   ` Peter Xu
@ 2020-03-14  2:04   ` Ming Lei
  1 sibling, 0 replies; 9+ messages in thread
From: Ming Lei @ 2020-03-14  2:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Xu, linux-kernel, Ingo Molnar, Peter Zijlstra, Ming Lei,
	Juri Lelli

On Fri, Mar 13, 2020 at 03:24:08PM +0100, Thomas Gleixner wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > After we introduced the "managed_irq" sub-parameter for isolcpus, it's
> > possible to free a kernel managed irq vector now.
> >
> > It can be triggered easily by booting a VM with a few vcpus, with one
> > virtio-blk device and then mark some cores as HK_FLAG_MANAGED_IRQ (in
> > below case, there're 4 vcpus, with vcpu 3 isolated with managed_irq):
> >
> > [    2.889911] ------------[ cut here ]------------
> > [    2.889964] WARNING: CPU: 3 PID: 0 at arch/x86/kernel/apic/vector.c:853 free_moved_vector+0x126/0x160
> 
> <SNIP>
> 
> > [    2.890026] softirqs last disabled at (8757): [<ffffffffbb0ecccd>] irq_enter+0x4d/0x70
> > [    2.890027] ---[ end trace deb5d563d2acb13f ]---
> 
> What is this backtrace for? It's completly useless as it merily shows
> that the warning triggers. Also even if it'd be useful then it wants to
> be trimmed properly.
> 
> > I believe the same thing will happen to bare metals.
> 
> Believe is not really relevant in engineering.
> 
> The problem has nothing to do with virt or bare metal. It's a genuine
> issue.
> 
> > When allocating the IRQ for the device, activate_managed() will try to
> > allocate a vector based on what we've calculated for kernel managed
> > IRQs (which does not take HK_FLAG_MANAGED_IRQ into account).  However
> > when we bind the IRQ to the IRQ handler, we'll do irq_startup() and
> > irq_do_set_affinity(), in which we will start to consider the whole
> > HK_FLAG_MANAGED_IRQ logic.  This means the chosen core can be
> > different from when we do the allocation.  When that happens, we'll
> > need to be able to properly free the old vector on the old core.
> 
> There's lots of 'we' in that text. We do nothing really. Please describe
> things in neutral and factual language.
> 
> Also there is another way to trigger this: Offline all non-isolated CPUs
> in the mask and then bring one online again.
> 
> Ming, I really have to ask why these two situations were not tested
> before the final submission of that isolation patch. Both issues have
> been discussed during review of the different versions. So the warning
> should have triggered back then already....

Hi Thomas,

I run CPU hotplug & unplug stress test with isolcpus:managed_irq, however
the test just checks if the irq's effective vector is setup correctly and
IO can be run as expected.

Looks dmesg warning is missed to check, sorry for that.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] x86/vector: Allow to free vector for managed IRQ
  2020-03-13 23:54     ` Thomas Gleixner
@ 2020-03-14 23:53       ` Peter Xu
  0 siblings, 0 replies; 9+ messages in thread
From: Peter Xu @ 2020-03-14 23:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Ming Lei, Juri Lelli

On Sat, Mar 14, 2020 at 12:54:12AM +0100, Thomas Gleixner wrote:
> Peter Xu <peterx@redhat.com> writes:
> > On Fri, Mar 13, 2020 at 03:24:08PM +0100, Thomas Gleixner wrote:
> >> Peter Xu <peterx@redhat.com> writes:
> >> What is this backtrace for? It's completly useless as it merily shows
> >> that the warning triggers. Also even if it'd be useful then it wants to
> >> be trimmed properly.
> >
> > I thought it was a good habit to keep the facts of issues.  Backtrace
> > is one of them so I kept them.  It could, for example, help people who
> > spot the same issue in an old/downstream kernel so when they google or
> > grepping git-log they know the exact issue has been solved by some
> > commit, even without much knowledge on the internals (because they can
> > exactly compare the whole dmesg error).
> 
> This is not really good habit. Changelogs should contain factual
> information which is relevant to understand the problem. Backtraces are
> useful when the callchain leading to a bug/warn/oops _is_ relevant for
> understanding the issue. For things like this where the backtrace is
> completly out of context it's more of an distraction than of value.
> 
> Aside of that your 'spot the same issue' argument is wrong in this
> context as this does not affect anything old/downstream. The feature was
> merged during the 5.6 merge window. So it has not reached anything
> downstream which needs backports. If distro people pick out such a
> feature and backport it to their frankenkernels before the final release
> then I really dont care.
> 
> One way to preserve the backtrace for google's sake is to write a cover
> letter and stick the back trace there if it is not useful in the
> changelog.
> 
> Also having 40+ lines of untrimmed backtrace is just a horrible
> distraction. The timestamps are completely useless and depending on the
> type of problem most of the other gunk which is emitted by a
> bug/warn/oops backtrace is useless as well.
> 
> Why would you be interested in the register values for this problem or
> the code or the interrupt flags tracer state? That thing triggered a
> warning and nothing in the backtrace gives any clue about why that
> happened, IOW it's pure distraction.
> 
> If the call chain is relevant to explain the problem, then a trimmed
> down version is really sufficient. Here is an example:
> 
>      BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
>      RIP: 0010:apic_ack_edge+0x1e/0x40
>      Call Trace:
>        handle_edge_irq+0x7d/0x1e0
>        generic_handle_irq+0x27/0x30
>        aer_inject_write+0x53a/0x720
> 
> This helps, because it illustrates exactly how this BUG was triggered
> and it's reduced to exactly the 6 lines because everything else in the
> original 50+ lines is useless.

Fair enough.

> 
> > However I think I still miss one thing in the puzzle (although it
> > turns out that we've agreed on removing the warning already, but just
> > in case I missed something important) - do you mean that offlining all
> > the non-isolated CPUs in the mask won't trigger this already?  Because
> > I also saw some similar comment somewhere else...
> >
> > Here's my understanding - when offlining, we'll disable the CPU and
> > reach:
> >
> >   - irq_migrate_all_off_this_cpu
> >     - migrate_one_irq
> >       - irq_do_set_affinity
> >         - calculate HK_FLAG_MANAGED_IRQ and so on...
> >
> > Then we can still trigger this irq move event even before we bring
> > another housekeeping cpu online, right?  Or could you guide me on what
> > I have missed?
> 
> The migration when offlining the CPU to which an interrupt is affine
> does not trigger this because that uses a different mechanism.
> 
> It clears the vector on the outgoing CPU brute force simply because this
> CPU can't handle any interrupts anymore. There is some logic to catch
> the device interrupt racing against this, but while careful it's not
> perfect. There is a theoretical hole there which probably could be
> triggered by carefully orchestrating things, but we can't do anything
> about it except disabling CPU unplug :)

Ah I haven't thought about the "theoretical hole" and that far...  I
think I was only focusing on whether set_affinity() would happen, but
I ignored the fact that x86 __send_cleanup_vector() treated offlining
CPU in the special way to avoid sending IRQ_MOVE_CLEANUP_VECTOR at
all.  And that part makes perfect sense since we can't do much with an
offlined core.

> 
> So the only two ways to trigger this are the ones I described in the
> changelog.
> 
> Hope that helps.

Definitely.

Thanks for writting this up, Thomas!

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2020-03-15  1:55 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-12 20:58 [PATCH] x86/vector: Allow to free vector for managed IRQ Peter Xu
2020-03-13  3:13 ` Ming Lei
2020-03-13 12:32   ` Peter Xu
2020-03-13 14:24 ` Thomas Gleixner
2020-03-13 15:19   ` Peter Xu
2020-03-13 23:54     ` Thomas Gleixner
2020-03-14 23:53       ` Peter Xu
2020-03-14  2:04   ` Ming Lei
2020-03-13 14:31 ` [tip: x86/urgent] x86/vector: Remove warning on managed interrupt migration tip-bot2 for Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).