linux-kernel.vger.kernel.org archive mirror
* "irq/matrix: Spread interrupts on allocation" breaks nouveau in mainline kernel
@ 2018-01-23 22:01 Lyude Paul
  2018-01-24  1:26 ` Lyude Paul
  2018-01-24 12:50 ` Thomas Gleixner
  0 siblings, 2 replies; 16+ messages in thread
From: Lyude Paul @ 2018-01-23 22:01 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: hpa, keith.busch, mingo, tglx, linux-kernel

Hi! Sorry to be the bearer of bad news, but this patch actually seems to break
suspending and resuming with nouveau on my machine:

[   29.694755] PM: suspend entry (deep)
[   29.694773] PM: Syncing filesystems ... done.
[   29.696203] Freezing user space processes ... (elapsed 0.001 seconds) done.
[   29.697442] OOM killer disabled.
[   29.697448] Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done.
[   29.698232] Suspending console(s) (use no_console_suspend to debug)
[   29.698993] serial 00:05: disabled
[   29.708227] sd 4:0:0:0: [sda] Synchronizing SCSI cache
[   29.708428] sd 4:0:0:0: [sda] Stopping disk
[   30.614581] ACPI: Preparing to enter system sleep state S3
[   30.917726] PM: Saving platform NVS memory
[   30.917736] Disabling non-boot CPUs ...
[   30.925616] smpboot: CPU 1 is now offline
[   30.936915] smpboot: CPU 2 is now offline
[   30.952824] smpboot: CPU 3 is now offline
[   30.964764] smpboot: CPU 4 is now offline
[   30.980663] smpboot: CPU 5 is now offline
[   30.992692] smpboot: CPU 6 is now offline
[   31.002572] smpboot: CPU 7 is now offline
[   31.003130] ACPI: Low-level resume complete
[   31.003180] PM: Restoring platform NVS memory
[   31.003578] WARNING: CPU: 0 PID: 11523 at kernel/smp.c:291 smp_call_function_single+0xdc/0xe0
[   31.003578] Modules linked in: nouveau video mxm_wmi i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm vfat fat usbhid crc32_pclmul i2c_piix4 i2c_core shpchp k10temp wmi acpi_cpufreq crc32c_intel r8169 mii xhci_pci xhci_hcd w83627hf_wdt
[   31.003590] CPU: 0 PID: 11523 Comm: rtcwake Not tainted 4.15.0-rc8nouveau-clockgating+ #1
[   31.003591] Hardware name: MSI MS-7A39/A320M GAMING PRO (MS-7A39), BIOS 1.60 09/19/2017
[   31.003592] RIP: 0010:smp_call_function_single+0xdc/0xe0
[   31.003593] RSP: 0018:ffffc900004a3c40 EFLAGS: 00010046
[   31.003594] RAX: 0000000000000000 RBX: ffffc900004a3cdc RCX: 0000000000000001
[   31.003594] RDX: ffffc900004a3c98 RSI: ffffffff8137a180 RDI: 0000000000000000
[   31.003595] RBP: ffffc900004a3c70 R08: 0000000000000001 R09: 0000000000010000
[   31.003595] R10: ffffc900004a3c98 R11: 0000000000000000 R12: 0000000000000000
[   31.003596] R13: 0000000001000000 R14: ffffc900004a3d0c R15: 0000000000000000
[   31.003597] FS:  00007f03bee93540(0000) GS:ffff88021ae00000(0000) knlGS:0000000000000000
[   31.003597] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   31.003598] CR2: 00007fffb6673008 CR3: 000000020ddd4000 CR4: 00000000003406f0
[   31.003598] Call Trace:
[   31.003603]  ? rdmsr_safe_on_cpu+0x4b/0x70
[   31.003604]  rdmsr_safe_on_cpu+0x4b/0x70
[   31.003606]  get_block_address.isra.0+0x6e/0xe0
[   31.003607]  mce_amd_feature_init+0x63/0x2c0
[   31.003609]  mce_syscore_resume+0x1e/0x30
[   31.003611]  syscore_resume+0x4b/0x170
[   31.003613]  suspend_devices_and_enter+0x608/0x7e0
[   31.003614]  pm_suspend+0x315/0x380
[   31.003615]  state_store+0x7d/0xe0
[   31.003618]  kernfs_fop_write+0xfa/0x180
[   31.003620]  __vfs_write+0x23/0x130
[   31.003623]  ? SYSC_newfstat+0x29/0x40
[   31.003625]  ? _cond_resched+0x15/0x40
[   31.003626]  vfs_write+0xad/0x1a0
[   31.003627]  SyS_write+0x42/0x90
[   31.003629]  entry_SYSCALL_64_fastpath+0x24/0x87
[   31.003630] RIP: 0033:0x7f03be9ae8f4
[   31.003631] RSP: 002b:00007ffe6bf825f8 EFLAGS: 00000246
[   31.003632] Code: fe ff ff 8b 55 e8 83 e2 01 74 0a f3 90 8b 55 e8 83 e2 01 75 f6 48 83 c4 28 41 5a 5d 49 8d 62 f8 c3 8b 05 58 b6 48 01 85 c0 75 86 <0f> ff eb 82 0f 1f 44 00 00 f6 46 18 01 75 15 c7 46 18 01 00 00
[   31.003648] ---[ end trace 19fa2f7781ed5237 ]---
[   31.004025] Enabling non-boot CPUs ...
[   31.004052] x86: Booting SMP configuration:
[   31.004052] smpboot: Booting Node 0 Processor 1 APIC 0x1
[   31.006368]  cache: parent cpu1 should not be sleeping
[   31.006442] microcode: CPU1: patch_level=0x08001129
[   31.006509] CPU1 is up
[   31.006525] smpboot: Booting Node 0 Processor 2 APIC 0x2
[   31.008832]  cache: parent cpu2 should not be sleeping
[   31.008894] microcode: CPU2: patch_level=0x08001129
[   31.008966] CPU2 is up
[   31.008975] smpboot: Booting Node 0 Processor 3 APIC 0x3
[   31.011264]  cache: parent cpu3 should not be sleeping
[   31.011329] microcode: CPU3: patch_level=0x08001129
[   31.011404] CPU3 is up
[   31.011413] smpboot: Booting Node 0 Processor 4 APIC 0x8
[   31.013833]  cache: parent cpu4 should not be sleeping
[   31.013903] microcode: CPU4: patch_level=0x08001129
[   31.014025] CPU4 is up
[   31.014036] smpboot: Booting Node 0 Processor 5 APIC 0x9
[   31.016354]  cache: parent cpu5 should not be sleeping
[   31.016421] microcode: CPU5: patch_level=0x08001129
[   31.016534] CPU5 is up
[   31.016544] smpboot: Booting Node 0 Processor 6 APIC 0xa
[   31.018857]  cache: parent cpu6 should not be sleeping
[   31.018930] microcode: CPU6: patch_level=0x08001129
[   31.019047] CPU6 is up
[   31.019057] smpboot: Booting Node 0 Processor 7 APIC 0xb
[   31.021376]  cache: parent cpu7 should not be sleeping
[   31.021444] microcode: CPU7: patch_level=0x08001129
[   31.021579] CPU7 is up
[   31.022166] ACPI: Waking up from system sleep state S3
[   31.070791] usb usb1: root hub lost power or was reset
[   31.070794] usb usb2: root hub lost power or was reset
[   31.071628] serial 00:05: activated
[   31.080265] sd 4:0:0:0: [sda] Starting disk
[   31.126099] hpet_rtc_timer_reinit: 68 callbacks suppressed
[   31.126099] hpet1: lost 2 rtc interrupts
[   31.160913] r8169 0000:1e:00.0 enp30s0: link down
[   31.255563] do_IRQ: 1.35 No irq handler for vector
[   31.379537] ata6: SATA link down (SStatus 0 SControl 300)
[   31.379558] ata1: SATA link down (SStatus 0 SControl 300)
[   31.380306] ata2: SATA link down (SStatus 0 SControl 300)
[   31.435705] ata9: SATA link down (SStatus 0 SControl 300)
[   31.589932] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   31.590320] ata5.00: configured for UDMA/133
[   31.610043] usb 1-4: reset low-speed USB device number 2 using xhci_hcd
[   32.226138] usb 1-5: reset low-speed USB device number 3 using xhci_hcd
[   33.257867] nouveau 0000:22:00.0: DRM: EVO timeout
[   34.237185] r8169 0000:1e:00.0 enp30s0: link up
[   35.257880] nouveau 0000:22:00.0: DRM: base-0: timeout
[   37.258334] nouveau 0000:22:00.0: DRM: base-0: timeout
[   37.276084] OOM killer enabled.
[   37.276612] Restarting tasks ... done.
[   37.277722] PM: suspend exit

I haven't actually investigated why it does this yet, but a bisect of master
led me here.

* Re: "irq/matrix: Spread interrupts on allocation" breaks nouveau in mainline kernel
  2018-01-23 22:01 "irq/matrix: Spread interrupts on allocation" breaks nouveau in mainline kernel Lyude Paul
@ 2018-01-24  1:26 ` Lyude Paul
  2018-01-24 12:52   ` Thomas Gleixner
  2018-01-24 12:50 ` Thomas Gleixner
  1 sibling, 1 reply; 16+ messages in thread
From: Lyude Paul @ 2018-01-24  1:26 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: hpa, keith.busch, mingo, linux-kernel

JFYI: I confirmed this patch is definitely broken. I'm seeing nouveau get
assigned the same MSI vector as another device on the system, which would
explain why interrupts suddenly stop working. I'll keep looking into it further
tomorrow.


* Re: "irq/matrix: Spread interrupts on allocation" breaks nouveau in mainline kernel
  2018-01-23 22:01 "irq/matrix: Spread interrupts on allocation" breaks nouveau in mainline kernel Lyude Paul
  2018-01-24  1:26 ` Lyude Paul
@ 2018-01-24 12:50 ` Thomas Gleixner
  2018-01-24 13:38   ` Borislav Petkov
  1 sibling, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2018-01-24 12:50 UTC (permalink / raw)
  To: Lyude Paul
  Cc: H. Peter Anvin, keith.busch, Ingo Molnar, linux-kernel, Borislav Petkov

On Tue, 23 Jan 2018, Lyude Paul wrote:

> Hi! Sorry to be the bearer of bad news, but this patch actually seems to break
> suspending and resuming with nouveau on my machine:

> [   31.003578] WARNING: CPU: 0 PID: 11523 at kernel/smp.c:291
> smp_call_function_single+0xdc/0xe0

This warning has absolutely no relationship to that patch.

> [   31.003592] RIP: 0010:smp_call_function_single+0xdc/0xe0
> [   31.003603]  ? rdmsr_safe_on_cpu+0x4b/0x70
> [   31.003604]  rdmsr_safe_on_cpu+0x4b/0x70
> [   31.003606]  get_block_address.isra.0+0x6e/0xe0
> [   31.003607]  mce_amd_feature_init+0x63/0x2c0
> [   31.003609]  mce_syscore_resume+0x1e/0x30
> [   31.003611]  syscore_resume+0x4b/0x170
> [   31.003613]  suspend_devices_and_enter+0x608/0x7e0
> [   31.003614]  pm_suspend+0x315/0x380
> [   31.003615]  state_store+0x7d/0xe0
> [   31.003618]  kernfs_fop_write+0xfa/0x180
> [   31.003620]  __vfs_write+0x23/0x130
> [   31.003623]  ? SYSC_newfstat+0x29/0x40
> [   31.003625]  ? _cond_resched+0x15/0x40
> [   31.003626]  vfs_write+0xad/0x1a0
> [   31.003627]  SyS_write+0x42/0x90
> [   31.003629]  entry_SYSCALL_64_fastpath+0x24/0x87

Borislav?

Thanks,

	tglx

* Re: "irq/matrix: Spread interrupts on allocation" breaks nouveau in mainline kernel
  2018-01-24  1:26 ` Lyude Paul
@ 2018-01-24 12:52   ` Thomas Gleixner
  2018-01-24 17:49     ` Lyude Paul
  0 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2018-01-24 12:52 UTC (permalink / raw)
  To: Lyude Paul; +Cc: hpa, keith.busch, mingo, linux-kernel

On Tue, 23 Jan 2018, Lyude Paul wrote:

> JFYI: I confirmed this patch is definitely broken. I'm seeing nouveau get
> assigned the same MSI vector as another device on the system, which would
> explain why interrupts suddenly stop working. I'll keep looking into it further
> tomorrow.

How did you determine that it is the same MSI vector as another device?

Thanks,

	tglx

* Re: "irq/matrix: Spread interrupts on allocation" breaks nouveau in mainline kernel
  2018-01-24 12:50 ` Thomas Gleixner
@ 2018-01-24 13:38   ` Borislav Petkov
  0 siblings, 0 replies; 16+ messages in thread
From: Borislav Petkov @ 2018-01-24 13:38 UTC (permalink / raw)
  To: Thomas Gleixner, Yazen Ghannam
  Cc: Lyude Paul, H. Peter Anvin, keith.busch, Ingo Molnar, linux-kernel

On Wed, Jan 24, 2018 at 01:50:52PM +0100, Thomas Gleixner wrote:
> On Tue, 23 Jan 2018, Lyude Paul wrote:
> 
> > Hi! Sorry to be the bearer of bad news, but this patch actually seems to break
> > suspending and resuming with nouveau on my machine:
> 
> > [   31.003578] WARNING: CPU: 0 PID: 11523 at kernel/smp.c:291
> > smp_call_function_single+0xdc/0xe0
> 
> This warning has absolutely no relationship to that patch.
> 
> > [   31.003592] RIP: 0010:smp_call_function_single+0xdc/0xe0
> > [   31.003603]  ? rdmsr_safe_on_cpu+0x4b/0x70
> > [   31.003604]  rdmsr_safe_on_cpu+0x4b/0x70
> > [   31.003606]  get_block_address.isra.0+0x6e/0xe0
> > [   31.003607]  mce_amd_feature_init+0x63/0x2c0
> > [   31.003609]  mce_syscore_resume+0x1e/0x30
> > [   31.003611]  syscore_resume+0x4b/0x170
> > [   31.003613]  suspend_devices_and_enter+0x608/0x7e0
> > [   31.003614]  pm_suspend+0x315/0x380
> > [   31.003615]  state_store+0x7d/0xe0
> > [   31.003618]  kernfs_fop_write+0xfa/0x180
> > [   31.003620]  __vfs_write+0x23/0x130
> > [   31.003623]  ? SYSC_newfstat+0x29/0x40
> > [   31.003625]  ? _cond_resched+0x15/0x40
> > [   31.003626]  vfs_write+0xad/0x1a0
> > [   31.003627]  SyS_write+0x42/0x90
> > [   31.003629]  entry_SYSCALL_64_fastpath+0x24/0x87
> 
> Borislav?

Yeah

cfee4f6f0b20 ("x86/mce/AMD: Read MSRs on the CPU allocating the threshold blocks")

Yazen CCed.

Non-core banks are accessible only on the node-base CPU or something
like that. But we can't send IPIs with IRQs off, thus the warning.

Yazen, I'm thinking this whole get_block_address() dancing can be
simplified by creating a data structure containing *all* MCi_MISC*
addresses once during boot and then using it instead of reading it from
the MSRs each time.

Also - and I'm wishfully thinking simple here - that structure could
be global as I'd venture a guess that all MISC addresses are the same
system-wide and not node-specific.

But I might be missing something here.

Hmmm.
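
A rough userspace model of the idea above, in Python rather than kernel C, may
help make it concrete. Everything in it is an assumption for illustration:
`FakeMsrReader`, `build_misc_table`, the bank and block counts, and the address
formula are all made up and do not correspond to real MSR numbers or to any
existing kernel code. It only shows the shape of "read every address once at
boot, then look it up from a table".

```python
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[int, int]  # (bank, block)


class FakeMsrReader:
    """Stand-in for the per-access MSR read; counts how often it is called."""

    def __init__(self) -> None:
        self.reads = 0

    def __call__(self, bank: int, block: int) -> Optional[int]:
        self.reads += 1
        if block > 1:
            # Pretend only blocks 0 and 1 exist in each bank (hypothetical).
            return None
        # Placeholder address formula, NOT a real MSR layout.
        return 0x1000 + 0x10 * bank + block


def build_misc_table(
    read_addr: Callable[[int, int], Optional[int]], banks: int, blocks: int
) -> Dict[Key, int]:
    """Query every (bank, block) once and remember the valid addresses."""
    table: Dict[Key, int] = {}
    for bank in range(banks):
        for block in range(blocks):
            addr = read_addr(bank, block)
            if addr is not None:
                table[(bank, block)] = addr
    return table


reader = FakeMsrReader()
table = build_misc_table(reader, banks=4, blocks=3)  # done once, "at boot"
reads_at_boot = reader.reads

# Later lookups (e.g. on resume) touch only the table, never the reader.
addr = table.get((2, 1))
print(hex(addr), reader.reads == reads_at_boot)
```

In this model the resume path never calls the reader again, which is the
property that would avoid the cross-CPU read with interrupts off. Whether one
global table is enough depends on the open question above about the addresses
being the same system-wide.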

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

* Re: "irq/matrix: Spread interrupts on allocation" breaks nouveau in mainline kernel
  2018-01-24 12:52   ` Thomas Gleixner
@ 2018-01-24 17:49     ` Lyude Paul
  2018-01-24 19:13       ` Ghannam, Yazen
  0 siblings, 1 reply; 16+ messages in thread
From: Lyude Paul @ 2018-01-24 17:49 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: hpa, keith.busch, mingo, linux-kernel

Hi, please ignore the warning: it happens before and after the regressing
commit (I didn't actually mean to include it in the log I gave here, whoops).
As for how I determined nouveau is getting assigned the same IRQ vector as
another device, I checked using /sys/kernel/debug/irq. Additionally, when
nouveau does initialize properly after resume (e.g. after reverting this
patch) I see it get assigned a separate vector from the other devices.

On Wed, 2018-01-24 at 13:52 +0100, Thomas Gleixner wrote:
> On Tue, 23 Jan 2018, Lyude Paul wrote:
> 
> > JFYI: I confirmed this patch is definitely broken. I'm seeing nouveau get
> > assigned the same MSI vector as another device on the system, which would
> > explain why interrupts suddenly stop working. I'll keep looking into it
> > further
> > tomorrow.
> 
> How did you determine that it is the same MSI vector as another device?
> 
> Thanks,
> 
> 	tglx
> 
-- 
Cheers,
	Lyude Paul

* RE: "irq/matrix: Spread interrupts on allocation" breaks nouveau in mainline kernel
  2018-01-24 17:49     ` Lyude Paul
@ 2018-01-24 19:13       ` Ghannam, Yazen
  2018-01-24 19:56         ` Lyude Paul
  0 siblings, 1 reply; 16+ messages in thread
From: Ghannam, Yazen @ 2018-01-24 19:13 UTC (permalink / raw)
  To: Lyude Paul, Thomas Gleixner
  Cc: hpa, keith.busch, mingo, linux-kernel, Borislav Petkov

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Lyude Paul
> Sent: Wednesday, January 24, 2018 12:49 PM
> To: Thomas Gleixner <tglx@linutronix.de>
> Cc: hpa@zytor.com; keith.busch@intel.com; mingo@kernel.org; linux-
> kernel@vger.kernel.org
> Subject: Re: "irq/matrix: Spread interrupts on allocation" breaks nouveau in
> mainline kernel
> 
> Hi, please ignore the warning: it happens before and after the regressing
> commit (I didn't actually mean to include it on the log I gave here, whoops).
> As for how I determined nouveau is getting assigned the same IRQ vector as
> another device, I checked using /sys/kernel/debug/irq. Additionally, when
> nouveau does initialize properly after resume (e.g. after reverting this
> patch) I see it get assigned a separate vector from the other devices.
> 

+Boris. This thread seems to have split.

Lyude,
Does the warning show on mainline or does it only show when bisecting?

Sorry, I'm not sure what you mean by "it happens before and after the
regressing commit".


Boris,
In any case, I like your idea on saving the block addresses. I can look into this.

Thanks,
Yazen

* Re: "irq/matrix: Spread interrupts on allocation" breaks nouveau in mainline kernel
  2018-01-24 19:13       ` Ghannam, Yazen
@ 2018-01-24 19:56         ` Lyude Paul
  2018-01-24 20:02           ` Lyude Paul
  2018-01-25  8:54           ` Thomas Gleixner
  0 siblings, 2 replies; 16+ messages in thread
From: Lyude Paul @ 2018-01-24 19:56 UTC (permalink / raw)
  To: Ghannam, Yazen, Thomas Gleixner
  Cc: hpa, keith.busch, mingo, linux-kernel, Borislav Petkov

On Wed, 2018-01-24 at 19:13 +0000, Ghannam, Yazen wrote:
> > -----Original Message-----
> > From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> > owner@vger.kernel.org] On Behalf Of Lyude Paul
> > Sent: Wednesday, January 24, 2018 12:49 PM
> > To: Thomas Gleixner <tglx@linutronix.de>
> > Cc: hpa@zytor.com; keith.busch@intel.com; mingo@kernel.org; linux-
> > kernel@vger.kernel.org
> > Subject: Re: "irq/matrix: Spread interrupts on allocation" breaks nouveau
> > in
> > mainline kernel
> > 
> > Hi, please ignore the warning: it happens before and after the regressing
> > commit (I didn't actually mean to include it on the log I gave here,
> > whoops).
> > As for how I determined nouveau is getting assigned the same IRQ vector as
> > another device, I checked using /sys/kernel/debug/irq. Additionally, when
> > nouveau does initialize properly after resume (e.g. after reverting this
> > patch) I see it get assigned a separate vector from the other devices.
> > 
> 
> +Boris. This thread seems to have split.
> 
> Lyude,
> Does the warning show on mainline or does it only show when bisecting?
> 
> Sorry, I'm not sure what you mean by "it happens before and after the
> regressing commit".
Sorry about that! Let me clarify a little bit: this is a problem that shows up
on mainline. Normally when we suspend the GPU in nouveau, we free the IRQs
it's using before going into suspend
(drivers/gpu/drm/nouveau/nvkm/subdev/pci/base.c:88), then reserve IRQs again
on resume (drivers/gpu/drm/nouveau/nvkm/subdev/pci/base.c:134). Since this
patch got pushed to mainline, the IRQ we get from request_irq() ends up having
the same MSI vector as another device on the system:

Before suspend, nouveau's IRQ allocation:

    handler:  handle_edge_irq
    device:   0000:22:00.0
    status:   0x00000000
    istate:   0x00000000
    ddepth:   0
    wdepth:   0
    dstate:   0x01400200
                IRQD_ACTIVATED
                IRQD_IRQ_STARTED
                IRQD_SINGLE_TARGET
    node:     0
    affinity: 0-7
    effectiv: 1
    pending:  
    domain:  PCI-MSI-2
     hwirq:   0x1100000
     chip:    PCI-MSI
      flags:   0x10
                 IRQCHIP_SKIP_SET_WAKE
     parent:
        domain:  VECTOR
         hwirq:   0x2f
         chip:    APIC
          flags:   0x0
         Vector:    35
         Target:     1

After resume and allocating the interrupt for nouveau again, we get a message
from the kernel saying:

    [  217.150787] do_IRQ: 1.35 No irq handler for vector

As well, nouveau ends up getting no interrupts from the card and as a result
fails to come back up:

    [  219.153049] nouveau 0000:22:00.0: DRM: EVO timeout
    [  220.226254] r8169 0000:1e:00.0 enp30s0: link up
    [  221.153054] nouveau 0000:22:00.0: DRM: base-0: timeout
    [  223.153528] nouveau 0000:22:00.0: DRM: base-0: timeout

If we look through all of the other IRQ allocations, we'll find that now two
devices have the MSI vector 35:

nouveau:
    handler:  handle_edge_irq
    device:   0000:22:00.0
    status:   0x00000000
    istate:   0x00000000
    ddepth:   0
    wdepth:   0
    dstate:   0x01400200
                IRQD_ACTIVATED
                IRQD_IRQ_STARTED
                IRQD_SINGLE_TARGET
    node:     0
    affinity: 0-7
    effectiv: 1
    pending:  
    domain:  PCI-MSI-2
     hwirq:   0x1100000
     chip:    PCI-MSI
      flags:   0x10
                 IRQCHIP_SKIP_SET_WAKE
     parent:
        domain:  VECTOR
         hwirq:   0x2f
         chip:    APIC
          flags:   0x0
         Vector:    35
         Target:     1

and the PCI bridge (00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD]
Family 17h (Models 00h-0fh) PCIe GPP Bridge):

        handler:  handle_edge_irq
        device:   0000:00:01.3
        status:   0x00000000
        istate:   0x00000000
        ddepth:   0
        wdepth:   0
        dstate:   0x03400200
                    IRQD_ACTIVATED
                    IRQD_IRQ_STARTED
                    IRQD_SINGLE_TARGET
        node:     0
        affinity: 0-7
        effectiv: 0
        pending:  
        domain:  PCI-MSI-2
         hwirq:   0x5800
         chip:    PCI-MSI
          flags:   0x10
                     IRQCHIP_SKIP_SET_WAKE
         parent:
            domain:  VECTOR
             hwirq:   0x19
             chip:    APIC
              flags:   0x0
             Vector:    35
             Target:     0
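
The comparison of the two dumps above can also be done mechanically. The
following is a minimal sketch in Python, assuming the dump layout shown in this
message (`device:`, `Vector:` and `Target:` lines); the helper names
`parse_irq_dump` and `shared_vectors` are made up for illustration, and the
embedded dumps are trimmed excerpts of the ones above. It works on strings, so
reading the actual files under /sys/kernel/debug/irq is left to the reader.

```python
import re
from collections import defaultdict

# One pattern per field of interest in a single per-IRQ debugfs dump.
FIELD_RE = {
    "device": re.compile(r"^[ \t]*device:[ \t]*(\S+)", re.MULTILINE),
    "vector": re.compile(r"^[ \t]*Vector:[ \t]*(\d+)", re.MULTILINE),
    "target": re.compile(r"^[ \t]*Target:[ \t]*(\d+)", re.MULTILINE),
}


def parse_irq_dump(text):
    """Return {'device', 'vector', 'target'} or None if a field is missing."""
    entry = {}
    for name, pattern in FIELD_RE.items():
        match = pattern.search(text)
        if match is None:
            return None
        value = match.group(1)
        entry[name] = value if name == "device" else int(value)
    return entry


def shared_vectors(dumps):
    """Map each vector used by more than one device to [(device, target CPU)]."""
    by_vector = defaultdict(list)
    for text in dumps:
        entry = parse_irq_dump(text)
        if entry is not None:
            by_vector[entry["vector"]].append((entry["device"], entry["target"]))
    return {vec: users for vec, users in by_vector.items() if len(users) > 1}


NOUVEAU_DUMP = """\
handler:  handle_edge_irq
device:   0000:22:00.0
domain:  PCI-MSI-2
 parent:
    domain:  VECTOR
     hwirq:   0x2f
     Vector:    35
     Target:     1
"""

BRIDGE_DUMP = """\
handler:  handle_edge_irq
device:   0000:00:01.3
domain:  PCI-MSI-2
 parent:
    domain:  VECTOR
     hwirq:   0x19
     Vector:    35
     Target:     0
"""

for vector, users in shared_vectors([NOUVEAU_DUMP, BRIDGE_DUMP]).items():
    for device, target in users:
        print(f"vector {vector}: device {device} on target CPU {target}")
```

The target CPU is reported next to each device so that it is visible whether
two entries with the same vector number also point at the same CPU.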

Hope this helps clarify. I will keep looking at this from my end as well.

> 
> Boris,
> In any case, I like your idea on saving the block addresses. I can look into
> this.
> 
> Thanks,
> Yazen
-- 
Cheers,
	Lyude Paul

* Re: "irq/matrix: Spread interrupts on allocation" breaks nouveau in mainline kernel
  2018-01-24 19:56         ` Lyude Paul
@ 2018-01-24 20:02           ` Lyude Paul
  2018-01-25  3:29             ` Mike Galbraith
  2018-01-25  8:54           ` Thomas Gleixner
  1 sibling, 1 reply; 16+ messages in thread
From: Lyude Paul @ 2018-01-24 20:02 UTC (permalink / raw)
  To: Ghannam, Yazen, Thomas Gleixner
  Cc: hpa, keith.busch, mingo, linux-kernel, Borislav Petkov

Almost forgot to mention: I came across this patch because reverting it
locally on the mainline kernel makes request_irq() behave normally (it doesn't
attempt to allocate the same vector twice anymore) and nouveau starts doing
suspend/resume correctly again
> > 
> > Boris,
> > In any case, I like your idea on saving the block addresses. I can look
> > into
> > this.
> > 
> > Thanks,
> > Yazen
-- 
Cheers,
	Lyude Paul

* Re: "irq/matrix: Spread interrupts on allocation" breaks nouveau in mainline kernel
  2018-01-24 20:02           ` Lyude Paul
@ 2018-01-25  3:29             ` Mike Galbraith
  2018-01-25 18:29               ` Lyude Paul
  0 siblings, 1 reply; 16+ messages in thread
From: Mike Galbraith @ 2018-01-25  3:29 UTC (permalink / raw)
  To: Lyude Paul, Ghannam, Yazen, Thomas Gleixner
  Cc: hpa, keith.busch, mingo, linux-kernel, Borislav Petkov

On Wed, 2018-01-24 at 15:02 -0500, Lyude Paul wrote:
> Almost forgot to mention: I came across this patch because reverting it
> locally on the mainline kernel makes request_irq() behave normally (it doesn't
> attempt to allocate the same vector twice anymore) and nouveau starts doing
> suspend/resume correctly again

Ah, someone already hunted down my resume woes.  Yup, reverting
$subject fixed up my sole reason to use nouveau (to be able to _resume_
as well as suspend:).  If anyone needs a lab rat, just holler.

	-Mike

* Re: "irq/matrix: Spread interrupts on allocation" breaks nouveau in mainline kernel
  2018-01-24 19:56         ` Lyude Paul
  2018-01-24 20:02           ` Lyude Paul
@ 2018-01-25  8:54           ` Thomas Gleixner
  2018-01-25 18:23             ` Lyude Paul
  1 sibling, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2018-01-25  8:54 UTC (permalink / raw)
  To: Lyude Paul
  Cc: Ghannam, Yazen, hpa, keith.busch, mingo, linux-kernel, Borislav Petkov

On Wed, 24 Jan 2018, Lyude Paul wrote:
> Sorry about that! Let me clarify a little bit: this is a problem that shows up
> on mainline. Normally when we suspend the GPU in nouveau, we free the IRQs
> it's using before going into suspend
> (drivers/gpu/drm/nouveau/nvkm/subdev/pci/base.c:88), then reserve IRQs again
> on resume (drivers/gpu/drm/nouveau/nvkm/subdev/pci/base.c:134). Since this
> patch got pushed to mainline, the IRQ we get from request_irq() ends up having
> the same MSI vector as another device on the system:

It's not the same.

>     nouveau:
>      parent:
>         domain:  VECTOR
>          hwirq:   0x2f
>          chip:    APIC
>           flags:   0x0
>          Vector:    35
>          Target:     1

Vector 35 on CPU1

>     After resume and allocating the interrupt for nouveau again, we get a message
>     from the kernel saying: 
> 
>     [  217.150787] do_IRQ: 1.35 No irq handler for vector

That's because there is a pending irq on the old vector for unknown reasons.

>     As well, nouveau ends up getting no interrupts from the card and as a result
>     fails to come back up:
> 
>     [  219.153049] nouveau 0000:22:00.0: DRM: EVO timeout
>     [  220.226254] r8169 0000:1e:00.0 enp30s0: link up
>     [  221.153054] nouveau 0000:22:00.0: DRM: base-0: timeout
>     [  223.153528] nouveau 0000:22:00.0: DRM: base-0: timeout
> 
>     If we look through all of the other IRQ allocations, we'll find that now two
>     devices have the MSI vector 35:
> 
>     nouveau:
>      parent:
>         domain:  VECTOR
>          hwirq:   0x2f
>          chip:    APIC
>           flags:   0x0
>          Vector:    35
>          Target:     1

Vector 35 on CPU1

>     and the PCI bridge (00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD]
>     Family 17h (Models 00h-0fh) PCIe GPP Bridge):
> 
>          parent:
>             domain:  VECTOR
>              hwirq:   0x19
>              chip:    APIC
>               flags:   0x0
>              Vector:    35
>              Target:     0

Vector 35 on CPU0. Same vector but different CPUs. So it's NOT the same
thing.
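
To make the distinction concrete, here is a minimal userspace sketch (not
kernel code) that models the vector space as one table per CPU. All names
(toy_vector_table, toy_assign_vector, and so on) are hypothetical and only
illustrate that a vector number is meaningful only together with its target
CPU:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS    8
#define TOY_NR_VECTORS 256

/* Hypothetical model: one slot per (cpu, vector) pair, NULL means unassigned. */
static const char *toy_vector_table[TOY_NR_CPUS][TOY_NR_VECTORS];

/* Claim (cpu, vector) for owner; returns 0 on success, -1 if already taken. */
static int toy_assign_vector(unsigned int cpu, unsigned int vector,
                             const char *owner)
{
    assert(cpu < TOY_NR_CPUS && vector < TOY_NR_VECTORS);
    if (toy_vector_table[cpu][vector])
        return -1;
    toy_vector_table[cpu][vector] = owner;
    return 0;
}

/* Drop the assignment, as happens when the interrupt is freed. */
static void toy_release_vector(unsigned int cpu, unsigned int vector)
{
    assert(cpu < TOY_NR_CPUS && vector < TOY_NR_VECTORS);
    toy_vector_table[cpu][vector] = NULL;
}

/* Returns the owner of (cpu, vector), or NULL if nothing is installed there. */
static const char *toy_lookup_handler(unsigned int cpu, unsigned int vector)
{
    assert(cpu < TOY_NR_CPUS && vector < TOY_NR_VECTORS);
    return toy_vector_table[cpu][vector];
}
```

In this model, assigning the bridge to (0, 35) and nouveau to (1, 35) both
succeed because the slots are distinct. After toy_release_vector(1, 35), a
lookup of (1, 35) returns NULL while (0, 35) is untouched, which corresponds
to an interrupt arriving on CPU 1, vector 35 with no handler installed.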

The real issue is something completely different and the revert of this
patch merely papers over the underlying problem. I'm pretty sure that you
can trigger this even with the revert in place. Do the following before
suspend:

    echo 2 >/proc/irq/$NOUVEAUIRQ/smp_affinity_list

Then do suspend/resume and you should end up with the same situation.

I can't tell from your dmesg, but I'm pretty confident that

>     [  217.150787] do_IRQ: 1.35 No irq handler for vector

happens _before_ the nouveau driver requests the irq again. Can you please
add some printk to the code in question to verify that?

Thanks,

	tglx

* Re: "irq/matrix: Spread interrupts on allocation" breaks nouveau in mainline kernel
  2018-01-25  8:54           ` Thomas Gleixner
@ 2018-01-25 18:23             ` Lyude Paul
  2018-01-25 18:46               ` Thomas Gleixner
  0 siblings, 1 reply; 16+ messages in thread
From: Lyude Paul @ 2018-01-25 18:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ghannam, Yazen, hpa, keith.busch, mingo, linux-kernel, Borislav Petkov

I think you are right, apologies. Glad to know this isn't a regression in the
IRQ handling code :). It looks like our nouveau problems are probably coming
from the fact that we don't just leave IRQs set up through suspend/resume, which,
as far as I can tell, is probably not the correct thing to do.

Going to get some patches onto the mailing list for this, thanks for the help!

On Thu, 2018-01-25 at 09:54 +0100, Thomas Gleixner wrote:
> On Wed, 24 Jan 2018, Lyude Paul wrote:
> > Sorry about that! Let me clarify a little bit: this is a problem that shows
> > up
> > on mainline. Normally when we suspend the GPU in nouveau, we free the IRQs
> > it's using before going into suspend
> > (drivers/gpu/drm/nouveau/nvkm/subdev/pci/base.c:88), then reserve IRQs again
> > on resume (drivers/gpu/drm/nouveau/nvkm/subdev/pci/base.c:134). Since this
> > patch got pushed to mainline, the IRQ we get from request_irq() ends up
> > having
> > the same MSI vector as another device on the system:
> 
> It's not the same.
> 
> >     nouveau:
> >      parent:
> >         domain:  VECTOR
> >          hwirq:   0x2f
> >          chip:    APIC
> >           flags:   0x0
> >          Vector:    35
> >          Target:     1
> 
> Vector 35 on CPU1
> 
> >     After resume and allocating the interrupt for nouveau again, we get a
> > message
> >     from the kernel saying: 
> > 
> >     [  217.150787] do_IRQ: 1.35 No irq handler for vector
> 
> That's because there is a pending irq on the old vector for unknown reasons.
> 
> >     As well, nouveau ends up getting no interrupts from the card and as a
> > result
> >     fails to come back up:
> > 
> >     [  219.153049] nouveau 0000:22:00.0: DRM: EVO timeout
> >     [  220.226254] r8169 0000:1e:00.0 enp30s0: link up
> >     [  221.153054] nouveau 0000:22:00.0: DRM: base-0: timeout
> >     [  223.153528] nouveau 0000:22:00.0: DRM: base-0: timeout
> > 
> >     If we look through all of the other IRQ allocations, we'll find that now
> > two
> >     devices have the MSI vector 35:
> > 
> >     nouveau:
> >      parent:
> >         domain:  VECTOR
> >          hwirq:   0x2f
> >          chip:    APIC
> >           flags:   0x0
> >          Vector:    35
> >          Target:     1
> 
> Vector 35 on CPU1
> 
> >     and the PCI bridge (00:01.3 PCI bridge: Advanced Micro Devices, Inc.
> > [AMD]
> >     Family 17h (Models 00h-0fh) PCIe GPP Bridge):
> > 
> >          parent:
> >             domain:  VECTOR
> >              hwirq:   0x19
> >              chip:    APIC
> >               flags:   0x0
> >              Vector:    35
> >              Target:     0
> 
> Vector 35 on CPU0. Same vector but different CPUs. So it's NOT the same
> thing.
> 
> The real issue is something completely different and the revert of this
> patch merely papers over the underlying problem. I'm pretty sure that you
> can trigger this even with the revert in place. Do the following before
> suspend:
> 
>     echo 2 >/proc/irq/$NOUVEAUIRQ/smp_affinity_list
> 
> Then do suspend/resume and you should end up with the same situation.
> 
> I can't tell from your dmesg, but I'm pretty confident that
> 
> >     [  217.150787] do_IRQ: 1.35 No irq handler for vector
> 
> happens _before_ the nouveau driver requests the irq again. Can you please
> add some printk to the code in question to verify that?
> 
> Thanks,
> 
> 	tglx

* Re: "irq/matrix: Spread interrupts on allocation" breaks nouveau in mainline kernel
  2018-01-25  3:29             ` Mike Galbraith
@ 2018-01-25 18:29               ` Lyude Paul
  0 siblings, 0 replies; 16+ messages in thread
From: Lyude Paul @ 2018-01-25 18:29 UTC (permalink / raw)
  To: Mike Galbraith, Ghannam, Yazen, Thomas Gleixner
  Cc: hpa, keith.busch, mingo, linux-kernel, Borislav Petkov

Will cc you with some patches in a bit, think I have this problem figured out :)

On Thu, 2018-01-25 at 04:29 +0100, Mike Galbraith wrote:
> On Wed, 2018-01-24 at 15:02 -0500, Lyude Paul wrote:
> > Almost forgot to mention: I came across this patch because reverting it
> > locally on the mainline kernel makes request_irq() behave normally (it
> > doesn't
> > attempt to allocate the same vector twice anymore) and nouveau starts doing
> > suspend/resume correctly again
> 
> Ah, someone already hunted down my resume woes.  Yup, reverting
> $subject fixed up my sole reason to use nouveau (to be able to _resume_
> as well as suspend:).  If anyone needs a lab rat, just holler.
> 
> 	-Mike

* Re: "irq/matrix: Spread interrupts on allocation" breaks nouveau in mainline kernel
  2018-01-25 18:23             ` Lyude Paul
@ 2018-01-25 18:46               ` Thomas Gleixner
  2018-01-25 19:25                 ` Lyude Paul
  0 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2018-01-25 18:46 UTC (permalink / raw)
  To: Lyude Paul
  Cc: Ghannam, Yazen, hpa, keith.busch, mingo, linux-kernel, Borislav Petkov

On Thu, 25 Jan 2018, Lyude Paul wrote:

> I think you are right, apologies. Glad to know this isn't a regression in the
> IRQ handling code :). It looks like our nouveau problems are probably coming
> from the fact that we don't just leave IRQs set up through suspend/resume, which,
> as far as I can tell, is probably not the correct thing to do.

If you tear down the interrupt, then you have to make sure that it's
completely masked and disabled on the device side (including MSI).
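
As a rough illustration of why the ordering matters, the following userspace
sketch (not driver code) models a device that can still raise its interrupt
after the handler has been removed. The types and functions (toy_device,
toy_teardown_unsafe, toy_teardown_quiesced) are hypothetical and do not
correspond to any nouveau or kernel API:

```c
#include <stdbool.h>

/* Hypothetical bookkeeping for one interrupt line on the CPU side. */
struct toy_irq_line {
    bool handler_installed;
    unsigned int handled;   /* interrupts delivered to a handler */
    unsigned int stray;     /* interrupts that found no handler */
};

/* Hypothetical device-side state. */
struct toy_device {
    bool intr_masked;       /* device interrupt enable is off */
    bool msi_enabled;       /* MSI capability is enabled */
    struct toy_irq_line *line;
};

/* The device tries to signal; nothing reaches the CPU if it is quiesced. */
static void toy_device_raise(struct toy_device *dev)
{
    if (dev->intr_masked || !dev->msi_enabled)
        return;
    if (dev->line->handler_installed)
        dev->line->handled++;
    else
        dev->line->stray++;  /* analogous to "No irq handler for vector" */
}

/* Frees the handler but leaves the device able to signal. */
static void toy_teardown_unsafe(struct toy_device *dev)
{
    dev->line->handler_installed = false;
}

/* Quiesces the device first, then frees the handler. */
static void toy_teardown_quiesced(struct toy_device *dev)
{
    dev->intr_masked = true;
    dev->msi_enabled = false;
    dev->line->handler_installed = false;
}
```

In this toy model, a toy_device_raise() after toy_teardown_unsafe() increments
the stray counter, while the same call after toy_teardown_quiesced() is
dropped at the device and never reaches the CPU side.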

Thanks,

	tglx

* Re: "irq/matrix: Spread interrupts on allocation" breaks nouveau in mainline kernel
  2018-01-25 18:46               ` Thomas Gleixner
@ 2018-01-25 19:25                 ` Lyude Paul
  2018-01-25 20:12                   ` Thomas Gleixner
  0 siblings, 1 reply; 16+ messages in thread
From: Lyude Paul @ 2018-01-25 19:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ghannam, Yazen, hpa, keith.busch, mingo, linux-kernel, Borislav Petkov

On Thu, 2018-01-25 at 19:46 +0100, Thomas Gleixner wrote:
> On Thu, 25 Jan 2018, Lyude Paul wrote:
> 
> > I think you are right, apologies. Glad to know this isn't a regression in
> > the
> > IRQ handling code :). It looks like our nouveau problems are probably coming
> > from the fact that we don't just leave IRQs set up through suspend/resume,
> > which, as far as I can tell, is probably not the correct thing to do.
> 
> If you tear down the interrupt, then you have to make sure that it's
> completely masked and disabled on the device side (including MSI).
Does this only need to be done if we handle request_irq()/free_irq() ourselves,
or can we skip some of these steps if we let the kernel handle
disabling/enabling IRQs during s/r?
> 
> Thanks,
> 
> 	tglx

* Re: "irq/matrix: Spread interrupts on allocation" breaks nouveau in mainline kernel
  2018-01-25 19:25                 ` Lyude Paul
@ 2018-01-25 20:12                   ` Thomas Gleixner
  0 siblings, 0 replies; 16+ messages in thread
From: Thomas Gleixner @ 2018-01-25 20:12 UTC (permalink / raw)
  To: Lyude Paul
  Cc: Ghannam, Yazen, hpa, keith.busch, mingo, linux-kernel, Borislav Petkov

On Thu, 25 Jan 2018, Lyude Paul wrote:
> On Thu, 2018-01-25 at 19:46 +0100, Thomas Gleixner wrote:
> > On Thu, 25 Jan 2018, Lyude Paul wrote:
> > 
> > > I think you are right, apologies. Glad to know this isn't a regression in
> > > the
> > > IRQ handling code :). It looks like our nouveau problems are probably coming
> > > from the fact that we don't just leave IRQs set up through suspend/resume,
> > > which, as far as I can tell, is probably not the correct thing to do.
> > 
> > If you tear down the interrupt, then you have to make sure that it's
> > completely masked and disabled on the device side (including MSI).
> Does this only need to be done if we handle request_irq()/free_irq() ourselves,
> or can we skip some of these steps if we let the kernel handle
> disabling/enabling IRQs during s/r?

If you do not free the interrupt on suspend, then the core does the right
thing. Though you should not inflict an interrupt storm in that case either :)

Thanks,

	tglx

end of thread, other threads:[~2018-01-25 20:12 UTC | newest]

Thread overview: 16+ messages
2018-01-23 22:01 "irq/matrix: Spread interrupts on allocation" breaks nouveau in mainline kernel Lyude Paul
2018-01-24  1:26 ` Lyude Paul
2018-01-24 12:52   ` Thomas Gleixner
2018-01-24 17:49     ` Lyude Paul
2018-01-24 19:13       ` Ghannam, Yazen
2018-01-24 19:56         ` Lyude Paul
2018-01-24 20:02           ` Lyude Paul
2018-01-25  3:29             ` Mike Galbraith
2018-01-25 18:29               ` Lyude Paul
2018-01-25  8:54           ` Thomas Gleixner
2018-01-25 18:23             ` Lyude Paul
2018-01-25 18:46               ` Thomas Gleixner
2018-01-25 19:25                 ` Lyude Paul
2018-01-25 20:12                   ` Thomas Gleixner
2018-01-24 12:50 ` Thomas Gleixner
2018-01-24 13:38   ` Borislav Petkov
