linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* switcheroo registration vs switching race...
@ 2012-11-28  3:45 Daniel J Blueman
  2012-11-28  8:45 ` Takashi Iwai
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel J Blueman @ 2012-11-28  3:45 UTC (permalink / raw)
  To: Seth Forshee, Dave Airlie, Takashi Iwai; +Cc: Linux Kernel

Hi Seth, Dave, Takashi,

If I power down the unused discrete GPU before lightdm starts by
fiddling with the sysfs file [1] in the upstart script, I see a race
manifesting as the discrete GPU's HDA controller timing out to
commands [2].

Adding some debug, I see that the registered audio devices are put
into D3 before the GPU is, but it turns out that the discrete (and
internal) GPU's HDA controller gets registered a bit later, so the
list is empty. The symptom is since the HDA driver it's talking to
hardware which is now in D3.

We could add a mutex to nouveau to allow us to wait for the DGPU HDA
controller, but perhaps this should be solved at a higher level in the
vgaswitcheroo code; what do you think?

Thanks,
  Daniel

--- [1]

echo OFF >/sys/kernel/debug/vgaswitcheroo/switch

--- [2]

snd_hda_intel 0000:00:1b.0: enabling device (0000 -> 0002)
snd_hda_intel 0000:00:1b.0: irq 51 for MSI/MSI-X
input: HDA Intel PCH Mic as /devices/pci0000:00/0000:00:1b.0/sound/card0/input10
input: HDA Intel PCH Headphone as
/devices/pci0000:00/0000:00:1b.0/sound/card0/input11
nouveau [  VBIOS][0000:01:00.0] ... appears to be valid
nouveau [  VBIOS][0000:01:00.0] using image from PRAMIN
nouveau [  VBIOS][0000:01:00.0] BIT signature found
nouveau [  VBIOS][0000:01:00.0] version 80.07.26.04
nouveau [   MXM][0000:01:00.0] no VBIOS data, nothing to do
nouveau [   PFB][0000:01:00.0] RAM type: GDDR5
nouveau [   PFB][0000:01:00.0] RAM size: 1024 MiB
nouveau W[ PGRAPH][0000:01:00.0] disabled, PGRAPH=1 to enable
vga_switcheroo: enabled
[TTM] Zone kernel: Available graphics memory: 4076308 kiB
[TTM] Zone  dma32: Available graphics memory: 2097152 kiB
[TTM] Initializing pool allocator
[TTM] Initializing DMA pool allocator
nouveau [   DRM] VRAM: 1024 MiB
nouveau [   DRM] GART: 512 MiB
nouveau [   DRM] BIT BIOS found
nouveau [   DRM] Bios version 80.07.26.04
nouveau [   DRM] TMDS table version 2.0
nouveau [   DRM] DCB version 4.0
nouveau [   DRM] DCB outp 00: 048101b6 0f230010
nouveau [   DRM] DCB outp 01: 018212d6 0f220020
nouveau [   DRM] DCB outp 02: 01021212 00020020
nouveau [   DRM] DCB outp 03: 088324c6 0f220010
nouveau [   DRM] DCB outp 04: 08032402 00020010
nouveau [   DRM] DCB outp 05: 02843862 00020010
nouveau [   DRM] DCB conn 00: 00020047
nouveau [   DRM] DCB conn 01: 02208146
nouveau [   DRM] DCB conn 02: 01104246
nouveau [   DRM] DCB conn 03: 00410361
[drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[drm] No driver support for vblank timestamp query.
nouveau W[   DRM] voltage table 0x50 unknown
nouveau [   DRM] 4 available performance level(s)
nouveau [   DRM] 1: core 209MHz shader 419MHz memory 405MHz voltage 520mV
nouveau [   DRM] 2: core 390MHz shader 780MHz memory 1080MHz voltage 610mV
nouveau [   DRM] 3: core 1000MHz shader 2000MHz memory 1080MHz voltage 630mV
nouveau [   DRM] 4: core 1254MHz shader 2508MHz memory 1080MHz voltage 630mV
nouveau [   DRM] c:
nouveau E[   DRM] failed to create kernel channel, -22
No connectors reported connected with modes
[drm] Cannot find any crtc or sizes - going 1024x768
nouveau [   DRM] allocated 1024x768 fb: 0x60000, bo ffff880264974400
fb1: nouveaufb frame buffer device
[drm] Initialized nouveau 1.1.0 20120801 for 0000:01:00.0 on minor 1
snd_hda_intel 0000:01:00.1: enabling device (0000 -> 0002)
hda-intel: 0000:01:00.1: Handle VGA-switcheroo audio client
snd_hda_intel 0000:01:00.1: irq 52 for MSI/MSI-X
VGA switcheroo: switched nouveau off
nouveau [   DRM] suspending fbcon...
nouveau [   DRM] suspending display...
nouveau [   DRM] unpinning framebuffer(s)...
nouveau [   DRM] evicting buffers...
nouveau [   DRM] suspending client object trees...
input: HDA NVidia HDMI/DP,pcm=8 as
/devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input12
input: HDA NVidia HDMI/DP,pcm=7 as
/devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input13
input: HDA NVidia HDMI/DP,pcm=3 as
/devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input14
nouveau E[   I2C][0000:01:00.0] AUXCH(3): begin idle timeout 0xffffffff
nouveau E[   I2C][0000:01:00.0] AUXCH(2): begin idle timeout 0xffffffff
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: spurious response 0x0:0x0, last cmd=0x170503
hda-intel: azx_get_response timeout, switching to polling mode: last
cmd=0x001f0500
hda-intel: No response from codec, disabling MSI: last cmd=0x001f0500
hda_intel: azx_get_response timeout, switching to single_cmd mode:
last cmd=0x001f0500
hda-codec: out of range cmd 0:0:4:707:ffffffff
hda-codec: out of range cmd 0:0:4:707:fffffffc
hda-codec: out of range cmd 0:0:4:707:ffffffff
hda-codec: out of range cmd 0:0:4:707:fffffffc
hda-codec: out of range cmd 0:0:4:707:ffffffbf
hda-codec: out of range cmd 0:0:4:707:ffffffff
hda-codec: out of range cmd 0:0:4:707:fffffffc
hda-codec: out of range cmd 0:0:4:707:ffffffff
hda-codec: out of range cmd 0:0:4:707:fffffffc
hda-codec: out of range cmd 0:0:4:707:ffffffbf
hda-codec: out of range cmd 0:0:5:707:ffffffff
hda-codec: out of range cmd 0:0:5:707:fffffffc
hda-codec: out of range cmd 0:0:5:707:ffffffff
hda-codec: out of range cmd 0:0:5:707:fffffffc
hda-codec: out of range cmd 0:0:5:707:ffffffbf
hda-codec: out of range cmd 0:0:5:707:ffffffff
hda-codec: out of range cmd 0:0:5:707:fffffffc
hda-codec: out of range cmd 0:0:5:707:ffffffff
hda-codec: out of range cmd 0:0:5:707:fffffffc
hda-codec: out of range cmd 0:0:5:707:ffffffbf
hda-codec: out of range cmd 0:0:7:707:ffffffff
hda-codec: out of range cmd 0:0:7:707:fffffffc
hda-codec: out of range cmd 0:0:7:707:ffffffff
hda-codec: out of range cmd 0:0:7:707:fffffffc
hda-codec: out of range cmd 0:0:7:707:ffffffbf
hda-codec: out of range cmd 0:0:7:707:ffffffff
hda-codec: out of range cmd 0:0:7:707:fffffffc
hda-codec: out of range cmd 0:0:7:707:ffffffff
hda-codec: out of range cmd 0:0:7:707:fffffffc
hda-codec: out of range cmd 0:0:7:707:ffffffbf
hda-codec: out of range cmd 0:0:4:707:ffffffff
hda-codec: out of range cmd 0:0:4:707:fffffffc
hda-codec: out of range cmd 0:0:4:707:ffffffff
hda-codec: out of range cmd 0:0:4:707:fffffffc
azx_single_send_cmd: 179 callbacks suppressed
hda-codec: out of range cmd 0:0:4:707:ffffffbf
hda-codec: out of range cmd 0:0:4:707:ffffffff
hda-codec: out of range cmd 0:0:4:707:fffffffc
hda-codec: out of range cmd 0:0:4:707:ffffffff
hda-codec: out of range cmd 0:0:4:707:fffffffc
hda-codec: out of range cmd 0:0:4:707:ffffffbf
hda-codec: out of range cmd 0:0:4:707:ffffffff
hda-codec: out of range cmd 0:0:4:707:fffffffc
hda-codec: out of range cmd 0:0:4:707:ffffffff
hda-codec: out of range cmd 0:0:4:707:fffffffc
hda-codec: out of range cmd 0:0:4:707:ffffffbf
hda-codec: out of range cmd 0:0:5:707:ffffffff
hda-codec: out of range cmd 0:0:5:707:fffffffc
hda-codec: out of range cmd 0:0:5:707:ffffffff
hda-codec: out of range cmd 0:0:5:707:fffffffc
hda-codec: out of range cmd 0:0:5:707:ffffffbf
hda-codec: out of range cmd 0:0:5:707:ffffffff
hda-codec: out of range cmd 0:0:5:707:fffffffc
hda-codec: out of range cmd 0:0:5:707:ffffffff
hda-codec: out of range cmd 0:0:5:707:fffffffc
hda-codec: out of range cmd 0:0:5:707:ffffffbf
hda-codec: out of range cmd 0:0:7:707:ffffffff
hda-codec: out of range cmd 0:0:7:707:fffffffc
hda-codec: out of range cmd 0:0:7:707:ffffffff
hda-codec: out of range cmd 0:0:7:707:fffffffc
hda-codec: out of range cmd 0:0:7:707:ffffffbf
hda-codec: out of range cmd 0:0:7:707:ffffffff
hda-codec: out of range cmd 0:0:7:707:fffffffc
hda-codec: out of range cmd 0:0:7:707:ffffffff
hda-codec: out of range cmd 0:0:7:707:fffffffc
hda-codec: out of range cmd 0:0:7:707:ffffffbf
--
Daniel J Blueman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: switcheroo registration vs switching race...
  2012-11-28  3:45 switcheroo registration vs switching race Daniel J Blueman
@ 2012-11-28  8:45 ` Takashi Iwai
  2012-12-03 11:17   ` Takashi Iwai
  0 siblings, 1 reply; 17+ messages in thread
From: Takashi Iwai @ 2012-11-28  8:45 UTC (permalink / raw)
  To: Daniel J Blueman; +Cc: Seth Forshee, Dave Airlie, Linux Kernel

At Wed, 28 Nov 2012 11:45:07 +0800,
Daniel J Blueman wrote:
> 
> Hi Seth, Dave, Takashi,
> 
> If I power down the unused discrete GPU before lightdm starts by
> fiddling with the sysfs file [1] in the upstart script, I see a race
> manifesting as the discrete GPU's HDA controller timing out to
> commands [2].
> 
> Adding some debug, I see that the registered audio devices are put
> into D3 before the GPU is, but it turns out that the discrete (and
> internal) GPU's HDA controller gets registered a bit later, so the
> list is empty. The symptom is since the HDA driver it's talking to
> hardware which is now in D3.
> 
> We could add a mutex to nouveau to allow us to wait for the DGPU HDA
> controller, but perhaps this should be solved at a higher level in the
> vgaswitcheroo code; what do you think?

Maybe it's a side effect for the recent effort to fix another race in
the probe.  A part of them problem is that the registration is done at
the very last of probing.

Instead of delaying the registration, how about the patch below?


Takashi

---
diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
index 4bb52da..17fbd68 100644
--- a/sound/pci/hda/hda_intel.c
+++ b/sound/pci/hda/hda_intel.c
@@ -49,6 +49,7 @@
 #include <linux/pm_runtime.h>
 #include <linux/clocksource.h>
 #include <linux/time.h>
+#include <linux/completion.h>
 
 #ifdef CONFIG_X86
 /* for snoop control */
@@ -469,6 +470,7 @@ struct azx {
 	/* locks */
 	spinlock_t reg_lock;
 	struct mutex open_mutex;
+	struct completion probe_wait;
 
 	/* streams (x num_streams) */
 	struct azx_dev *azx_dev;
@@ -2790,6 +2792,7 @@ static bool azx_vs_can_switch(struct pci_dev *pci)
 	struct snd_card *card = pci_get_drvdata(pci);
 	struct azx *chip = card->private_data;
 
+	wait_for_completion(&chip->probe_wait);
 	if (chip->init_failed)
 		return false;
 	if (chip->disabled || !chip->bus)
@@ -2851,6 +2854,9 @@ static int azx_free(struct azx *chip)
 
 	azx_notifier_unregister(chip);
 
+	chip->init_failed = 1; /* to be sure */
+	complete(&chip->probe_wait);
+
 	if (use_vga_switcheroo(chip)) {
 		if (chip->disabled && chip->bus)
 			snd_hda_unlock_devices(chip->bus);
@@ -3156,6 +3162,7 @@ static int __devinit azx_create(struct snd_card *card, struct pci_dev *pci,
 	INIT_LIST_HEAD(&chip->pcm_list);
 	INIT_LIST_HEAD(&chip->list);
 	init_vga_switcheroo(chip);
+	init_completion(&chip->probe_wait);
 
 	chip->position_fix[0] = chip->position_fix[1] =
 		check_position_fix(chip, position_fix[dev]);
@@ -3462,6 +3469,13 @@ static int __devinit azx_probe(struct pci_dev *pci,
 	}
 #endif /* CONFIG_SND_HDA_PATCH_LOADER */
 
+	err = register_vga_switcheroo(chip);
+	if (err < 0) {
+		snd_printk(KERN_ERR SFX
+			   "Error registering VGA-switcheroo client\n");
+		goto out_free;
+	}
+
 	if (probe_now) {
 		err = azx_probe_continue(chip);
 		if (err < 0)
@@ -3473,14 +3487,8 @@ static int __devinit azx_probe(struct pci_dev *pci,
 	if (pci_dev_run_wake(pci))
 		pm_runtime_put_noidle(&pci->dev);
 
-	err = register_vga_switcheroo(chip);
-	if (err < 0) {
-		snd_printk(KERN_ERR SFX
-			   "Error registering VGA-switcheroo client\n");
-		goto out_free;
-	}
-
 	dev++;
+	complete(&chip->probe_wait);
 	return 0;
 
 out_free:

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: switcheroo registration vs switching race...
  2012-11-28  8:45 ` Takashi Iwai
@ 2012-12-03 11:17   ` Takashi Iwai
  2012-12-03 14:25     ` Daniel J Blueman
  0 siblings, 1 reply; 17+ messages in thread
From: Takashi Iwai @ 2012-12-03 11:17 UTC (permalink / raw)
  To: Daniel J Blueman; +Cc: Seth Forshee, Dave Airlie, Linux Kernel

At Wed, 28 Nov 2012 09:45:39 +0100,
Takashi Iwai wrote:
> 
> At Wed, 28 Nov 2012 11:45:07 +0800,
> Daniel J Blueman wrote:
> > 
> > Hi Seth, Dave, Takashi,
> > 
> > If I power down the unused discrete GPU before lightdm starts by
> > fiddling with the sysfs file [1] in the upstart script, I see a race
> > manifesting as the discrete GPU's HDA controller timing out to
> > commands [2].
> > 
> > Adding some debug, I see that the registered audio devices are put
> > into D3 before the GPU is, but it turns out that the discrete (and
> > internal) GPU's HDA controller gets registered a bit later, so the
> > list is empty. The symptom is since the HDA driver it's talking to
> > hardware which is now in D3.
> > 
> > We could add a mutex to nouveau to allow us to wait for the DGPU HDA
> > controller, but perhaps this should be solved at a higher level in the
> > vgaswitcheroo code; what do you think?
> 
> Maybe it's a side effect for the recent effort to fix another race in
> the probe.  A part of them problem is that the registration is done at
> the very last of probing.
> 
> Instead of delaying the registration, how about the patch below?

Ping.  If this really works, I'd like to queue it for 3.8 merge, at
least...


thanks,

Takashi

> 
> 
> Takashi
> 
> ---
> diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
> index 4bb52da..17fbd68 100644
> --- a/sound/pci/hda/hda_intel.c
> +++ b/sound/pci/hda/hda_intel.c
> @@ -49,6 +49,7 @@
>  #include <linux/pm_runtime.h>
>  #include <linux/clocksource.h>
>  #include <linux/time.h>
> +#include <linux/completion.h>
>  
>  #ifdef CONFIG_X86
>  /* for snoop control */
> @@ -469,6 +470,7 @@ struct azx {
>  	/* locks */
>  	spinlock_t reg_lock;
>  	struct mutex open_mutex;
> +	struct completion probe_wait;
>  
>  	/* streams (x num_streams) */
>  	struct azx_dev *azx_dev;
> @@ -2790,6 +2792,7 @@ static bool azx_vs_can_switch(struct pci_dev *pci)
>  	struct snd_card *card = pci_get_drvdata(pci);
>  	struct azx *chip = card->private_data;
>  
> +	wait_for_completion(&chip->probe_wait);
>  	if (chip->init_failed)
>  		return false;
>  	if (chip->disabled || !chip->bus)
> @@ -2851,6 +2854,9 @@ static int azx_free(struct azx *chip)
>  
>  	azx_notifier_unregister(chip);
>  
> +	chip->init_failed = 1; /* to be sure */
> +	complete(&chip->probe_wait);
> +
>  	if (use_vga_switcheroo(chip)) {
>  		if (chip->disabled && chip->bus)
>  			snd_hda_unlock_devices(chip->bus);
> @@ -3156,6 +3162,7 @@ static int __devinit azx_create(struct snd_card *card, struct pci_dev *pci,
>  	INIT_LIST_HEAD(&chip->pcm_list);
>  	INIT_LIST_HEAD(&chip->list);
>  	init_vga_switcheroo(chip);
> +	init_completion(&chip->probe_wait);
>  
>  	chip->position_fix[0] = chip->position_fix[1] =
>  		check_position_fix(chip, position_fix[dev]);
> @@ -3462,6 +3469,13 @@ static int __devinit azx_probe(struct pci_dev *pci,
>  	}
>  #endif /* CONFIG_SND_HDA_PATCH_LOADER */
>  
> +	err = register_vga_switcheroo(chip);
> +	if (err < 0) {
> +		snd_printk(KERN_ERR SFX
> +			   "Error registering VGA-switcheroo client\n");
> +		goto out_free;
> +	}
> +
>  	if (probe_now) {
>  		err = azx_probe_continue(chip);
>  		if (err < 0)
> @@ -3473,14 +3487,8 @@ static int __devinit azx_probe(struct pci_dev *pci,
>  	if (pci_dev_run_wake(pci))
>  		pm_runtime_put_noidle(&pci->dev);
>  
> -	err = register_vga_switcheroo(chip);
> -	if (err < 0) {
> -		snd_printk(KERN_ERR SFX
> -			   "Error registering VGA-switcheroo client\n");
> -		goto out_free;
> -	}
> -
>  	dev++;
> +	complete(&chip->probe_wait);
>  	return 0;
>  
>  out_free:

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: switcheroo registration vs switching race...
  2012-12-03 11:17   ` Takashi Iwai
@ 2012-12-03 14:25     ` Daniel J Blueman
  2012-12-03 14:40       ` Takashi Iwai
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel J Blueman @ 2012-12-03 14:25 UTC (permalink / raw)
  To: Takashi Iwai; +Cc: Seth Forshee, Dave Airlie, Linux Kernel

On 3 December 2012 19:17, Takashi Iwai <tiwai@suse.de> wrote:
> At Wed, 28 Nov 2012 09:45:39 +0100,
> Takashi Iwai wrote:
>>
>> At Wed, 28 Nov 2012 11:45:07 +0800,
>> Daniel J Blueman wrote:
>> >
>> > Hi Seth, Dave, Takashi,
>> >
>> > If I power down the unused discrete GPU before lightdm starts by
>> > fiddling with the sysfs file [1] in the upstart script, I see a race
>> > manifesting as the discrete GPU's HDA controller timing out to
>> > commands [2].
>> >
>> > Adding some debug, I see that the registered audio devices are put
>> > into D3 before the GPU is, but it turns out that the discrete (and
>> > internal) GPU's HDA controller gets registered a bit later, so the
>> > list is empty. The symptom is since the HDA driver it's talking to
>> > hardware which is now in D3.
>> >
>> > We could add a mutex to nouveau to allow us to wait for the DGPU HDA
>> > controller, but perhaps this should be solved at a higher level in the
>> > vgaswitcheroo code; what do you think?
>>
>> Maybe it's a side effect for the recent effort to fix another race in
>> the probe.  A part of them problem is that the registration is done at
>> the very last of probing.
>>
>> Instead of delaying the registration, how about the patch below?
>
> Ping.  If this really works, I'd like to queue it for 3.8 merge, at
> least...

Ping ack; I was trying to find time to understand another race that
occurs with GPU probing after switching, but is separate from the
situation before switching, here.

In the context of writing the switch, it looks like struct azx isn't
allocated by the time azx_vs_set_state accesses it [1,2]; racing with
azx_codec_create?

The full dmesg output is at: http://quora.org/2012/hda-switch-oops.txt

Thanks,
  Daniel

--- [1]

BUG: unable to handle kernel NULL pointer dereference at 0000000000000170
IP: [<ffffffffa01e4006>] azx_vs_set_state+0x26/0x1a0 [snd_hda_intel]
PGD 26323d067 PUD 264f58067 PMD 0
Oops: 0000 [#1] SMP
Modules linked in: snd_hda_codec_hdmi snd_hda_codec_cirrus rfcomm bnep
nls_iso8859_1 joydev hid_apple bcm5974 nouveau coretemp kvm_intel b43
kvm uvcvideo videobuf2_core videobuf2_vmalloc videobuf2_memops
ghash_clmulni_intel smsc75xx usbnet mii ttm snd_hda_intel(+)
snd_hda_codec snd_hwdep ssb i915 snd_pcm mxm_wmi snd_timer apple_gmux
applesmc mei lpc_ich microcode hwmon mfd_core input_polldev bcma snd
drm_kms_helper snd_page_alloc video apple_bl sdhci_pci sdhci mmc_core
CPU 1
Pid: 967, comm: sh Not tainted 3.7.0-rc7-expert+ #8 Apple Inc.
MacBookPro10,1/Mac-C3EC7CD22292981F
RIP: 0010:[<ffffffffa01e4006>] [<ffffffffa01e4006>]
azx_vs_set_state+0x26/0x1a0 [snd_hda_intel]
RSP: 0018:ffff88025198de48 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff880251960a00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880265b41098
RBP: ffff88025198de68 R08: 0000000000000003 R09: 0000000000001000
R10: 00007fffe481b730 R11: 0000000000000246 R12: ffff880265b41098
R13: 0000000000000000 R14: ffff88025198df50 R15: 0000000000000000
FS: 00007f4961480700(0000) GS:ffff88026f240000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000170 CR3: 0000000263cd3000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process sh (pid: 967, threadinfo ffff88025198c000, task ffff88025d635820)
Stack:
 ffff88025d635820 ffff880251960a00 0000000000000000 ffff88025198de98
 ffff88025198de88 ffffffff812b8e77 ffff880263ef1740 0000000000000004
 ffff88025198def8 ffffffff812b947c ffff88020a46464f ffffffff81107982
Call Trace:
 [<ffffffff812b8e77>] set_audio_state+0x67/0x70
 [<ffffffff812b947c>] vga_switcheroo_debugfs_write+0xbc/0x380
 [<ffffffff81107982>] ? __alloc_fd+0x42/0x110
 [<ffffffff81107aa9>] ? __fd_install+0x29/0x60
 [<ffffffff810ed703>] vfs_write+0xa3/0x160
 [<ffffffff810eda0d>] sys_write+0x4d/0xa0
 [<ffffffff810283f9>] ? do_page_fault+0x9/0x10
 [<ffffffff814b7ed6>] system_call_fastpath+0x1a/0x1f
Code: 00 00 00 00 00 55 48 89 e5 48 83 ec 20 4c 89 65 f0 4c 8d a7 98
00 00 00 4c 89 e7 48 89 5d e8 4c 89 6d f8 41 89 f5 e8 fa a4 0d e1 <48>
8b 98 70 01 00 00 0f b6 83 dd 01 00 00 a8 10 75 34 45 85 ed
RIP [<ffffffffa01e4006>] azx_vs_set_state+0x26/0x1a0 [snd_hda_intel]
 RSP <ffff88025198de48>
CR2: 0000000000000170

--- [2]

$ gdb ./sound/pci/hda/snd-hda-intel.ko
(gdb) list *(azx_vs_set_state+0x26)
0x3036 is in azx_vs_set_state (sound/pci/hda/hda_intel.c:2628).
2623
2624    static void azx_vs_set_state(struct pci_dev *pci,
2625                     enum vga_switcheroo_state state)
2626    {
2627        struct snd_card *card = pci_get_drvdata(pci);
2628        struct azx *chip = card->private_data;
2629        bool disabled;
2630
2631        if (chip->init_failed)
2632            return;
--
Daniel J Blueman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: switcheroo registration vs switching race...
  2012-12-03 14:25     ` Daniel J Blueman
@ 2012-12-03 14:40       ` Takashi Iwai
  2012-12-03 15:08         ` Daniel J Blueman
  0 siblings, 1 reply; 17+ messages in thread
From: Takashi Iwai @ 2012-12-03 14:40 UTC (permalink / raw)
  To: Daniel J Blueman; +Cc: Seth Forshee, Dave Airlie, Linux Kernel

At Mon, 3 Dec 2012 22:25:52 +0800,
Daniel J Blueman wrote:
> 
> On 3 December 2012 19:17, Takashi Iwai <tiwai@suse.de> wrote:
> > At Wed, 28 Nov 2012 09:45:39 +0100,
> > Takashi Iwai wrote:
> >>
> >> At Wed, 28 Nov 2012 11:45:07 +0800,
> >> Daniel J Blueman wrote:
> >> >
> >> > Hi Seth, Dave, Takashi,
> >> >
> >> > If I power down the unused discrete GPU before lightdm starts by
> >> > fiddling with the sysfs file [1] in the upstart script, I see a race
> >> > manifesting as the discrete GPU's HDA controller timing out to
> >> > commands [2].
> >> >
> >> > Adding some debug, I see that the registered audio devices are put
> >> > into D3 before the GPU is, but it turns out that the discrete (and
> >> > internal) GPU's HDA controller gets registered a bit later, so the
> >> > list is empty. The symptom is since the HDA driver it's talking to
> >> > hardware which is now in D3.
> >> >
> >> > We could add a mutex to nouveau to allow us to wait for the DGPU HDA
> >> > controller, but perhaps this should be solved at a higher level in the
> >> > vgaswitcheroo code; what do you think?
> >>
> >> Maybe it's a side effect for the recent effort to fix another race in
> >> the probe.  A part of them problem is that the registration is done at
> >> the very last of probing.
> >>
> >> Instead of delaying the registration, how about the patch below?
> >
> > Ping.  If this really works, I'd like to queue it for 3.8 merge, at
> > least...
> 
> Ping ack; I was trying to find time to understand another race that
> occurs with GPU probing after switching, but is separate from the
> situation before switching, here.
> 
> In the context of writing the switch, it looks like struct azx isn't
> allocated by the time azx_vs_set_state accesses it [1,2]; racing with
> azx_codec_create?

It was allocated, but it wasn't assigned properly in pci drvdata.

Below is the revised patch.  Just moved pci_set_drvdata() before
register_vga_switcheroo().  Could you retest with it?


thanks,

Takashi

---
diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
index 4bb52da..b7c482a 100644
--- a/sound/pci/hda/hda_intel.c
+++ b/sound/pci/hda/hda_intel.c
@@ -49,6 +49,7 @@
 #include <linux/pm_runtime.h>
 #include <linux/clocksource.h>
 #include <linux/time.h>
+#include <linux/completion.h>
 
 #ifdef CONFIG_X86
 /* for snoop control */
@@ -469,6 +470,7 @@ struct azx {
 	/* locks */
 	spinlock_t reg_lock;
 	struct mutex open_mutex;
+	struct completion probe_wait;
 
 	/* streams (x num_streams) */
 	struct azx_dev *azx_dev;
@@ -2790,6 +2792,7 @@ static bool azx_vs_can_switch(struct pci_dev *pci)
 	struct snd_card *card = pci_get_drvdata(pci);
 	struct azx *chip = card->private_data;
 
+	wait_for_completion(&chip->probe_wait);
 	if (chip->init_failed)
 		return false;
 	if (chip->disabled || !chip->bus)
@@ -2851,6 +2854,9 @@ static int azx_free(struct azx *chip)
 
 	azx_notifier_unregister(chip);
 
+	chip->init_failed = 1; /* to be sure */
+	complete(&chip->probe_wait);
+
 	if (use_vga_switcheroo(chip)) {
 		if (chip->disabled && chip->bus)
 			snd_hda_unlock_devices(chip->bus);
@@ -3156,6 +3162,7 @@ static int __devinit azx_create(struct snd_card *card, struct pci_dev *pci,
 	INIT_LIST_HEAD(&chip->pcm_list);
 	INIT_LIST_HEAD(&chip->list);
 	init_vga_switcheroo(chip);
+	init_completion(&chip->probe_wait);
 
 	chip->position_fix[0] = chip->position_fix[1] =
 		check_position_fix(chip, position_fix[dev]);
@@ -3462,17 +3469,8 @@ static int __devinit azx_probe(struct pci_dev *pci,
 	}
 #endif /* CONFIG_SND_HDA_PATCH_LOADER */
 
-	if (probe_now) {
-		err = azx_probe_continue(chip);
-		if (err < 0)
-			goto out_free;
-	}
-
 	pci_set_drvdata(pci, card);
 
-	if (pci_dev_run_wake(pci))
-		pm_runtime_put_noidle(&pci->dev);
-
 	err = register_vga_switcheroo(chip);
 	if (err < 0) {
 		snd_printk(KERN_ERR SFX
@@ -3480,11 +3478,22 @@ static int __devinit azx_probe(struct pci_dev *pci,
 		goto out_free;
 	}
 
+	if (probe_now) {
+		err = azx_probe_continue(chip);
+		if (err < 0)
+			goto out_free;
+	}
+
+	if (pci_dev_run_wake(pci))
+		pm_runtime_put_noidle(&pci->dev);
+
 	dev++;
+	complete(&chip->probe_wait);
 	return 0;
 
 out_free:
 	snd_card_free(card);
+	pci_set_drvdata(pci, NULL);
 	return err;
 }
 

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: switcheroo registration vs switching race...
  2012-12-03 14:40       ` Takashi Iwai
@ 2012-12-03 15:08         ` Daniel J Blueman
  2012-12-03 16:23           ` Takashi Iwai
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel J Blueman @ 2012-12-03 15:08 UTC (permalink / raw)
  To: Takashi Iwai; +Cc: Seth Forshee, Dave Airlie, Linux Kernel

On 3 December 2012 22:40, Takashi Iwai <tiwai@suse.de> wrote:
> At Mon, 3 Dec 2012 22:25:52 +0800,
> Daniel J Blueman wrote:
>>
>> On 3 December 2012 19:17, Takashi Iwai <tiwai@suse.de> wrote:
>> > At Wed, 28 Nov 2012 09:45:39 +0100,
>> > Takashi Iwai wrote:
>> >>
>> >> At Wed, 28 Nov 2012 11:45:07 +0800,
>> >> Daniel J Blueman wrote:
>> >> >
>> >> > Hi Seth, Dave, Takashi,
>> >> >
>> >> > If I power down the unused discrete GPU before lightdm starts by
>> >> > fiddling with the sysfs file [1] in the upstart script, I see a race
>> >> > manifesting as the discrete GPU's HDA controller timing out to
>> >> > commands [2].
>> >> >
>> >> > Adding some debug, I see that the registered audio devices are put
>> >> > into D3 before the GPU is, but it turns out that the discrete (and
>> >> > internal) GPU's HDA controller gets registered a bit later, so the
>> >> > list is empty. The symptom is since the HDA driver it's talking to
>> >> > hardware which is now in D3.
>> >> >
>> >> > We could add a mutex to nouveau to allow us to wait for the DGPU HDA
>> >> > controller, but perhaps this should be solved at a higher level in the
>> >> > vgaswitcheroo code; what do you think?
>> >>
>> >> Maybe it's a side effect for the recent effort to fix another race in
>> >> the probe.  A part of them problem is that the registration is done at
>> >> the very last of probing.
>> >>
>> >> Instead of delaying the registration, how about the patch below?
>> >
>> > Ping.  If this really works, I'd like to queue it for 3.8 merge, at
>> > least...
>>
>> Ping ack; I was trying to find time to understand another race that
>> occurs with GPU probing after switching, but is separate from the
>> situation before switching, here.
>>
>> In the context of writing the switch, it looks like struct azx isn't
>> allocated by the time azx_vs_set_state accesses it [1,2]; racing with
>> azx_codec_create?
>
> It was allocated, but it wasn't assigned properly in pci drvdata.
>
> Below is the revised patch.  Just moved pci_set_drvdata() before
> register_vga_switcheroo().  Could you retest with it?

Superb; this addresses the oops.

~1 second after the DGPU is put into D3, I still often see "hda-intel:
spurious response 0x0:0x0, last cmd=0x170500":
http://quora.org/2012/hda-switch-spurious.txt

Presumably this implies the read of the ring-buffer pointer returned
0xffffffff, so the HDA driver understands the pointer to have wrapped
and processes the 191 unwritten entries?

Daniel
--
Daniel J Blueman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: switcheroo registration vs switching race...
  2012-12-03 15:08         ` Daniel J Blueman
@ 2012-12-03 16:23           ` Takashi Iwai
  2012-12-03 16:50             ` Daniel J Blueman
  0 siblings, 1 reply; 17+ messages in thread
From: Takashi Iwai @ 2012-12-03 16:23 UTC (permalink / raw)
  To: Daniel J Blueman; +Cc: Seth Forshee, Dave Airlie, Linux Kernel

At Mon, 3 Dec 2012 23:08:28 +0800,
Daniel J Blueman wrote:
> 
> On 3 December 2012 22:40, Takashi Iwai <tiwai@suse.de> wrote:
> > At Mon, 3 Dec 2012 22:25:52 +0800,
> > Daniel J Blueman wrote:
> >>
> >> On 3 December 2012 19:17, Takashi Iwai <tiwai@suse.de> wrote:
> >> > At Wed, 28 Nov 2012 09:45:39 +0100,
> >> > Takashi Iwai wrote:
> >> >>
> >> >> At Wed, 28 Nov 2012 11:45:07 +0800,
> >> >> Daniel J Blueman wrote:
> >> >> >
> >> >> > Hi Seth, Dave, Takashi,
> >> >> >
> >> >> > If I power down the unused discrete GPU before lightdm starts by
> >> >> > fiddling with the sysfs file [1] in the upstart script, I see a race
> >> >> > manifesting as the discrete GPU's HDA controller timing out to
> >> >> > commands [2].
> >> >> >
> >> >> > Adding some debug, I see that the registered audio devices are put
> >> >> > into D3 before the GPU is, but it turns out that the discrete (and
> >> >> > internal) GPU's HDA controller gets registered a bit later, so the
> >> >> > list is empty. The symptom is since the HDA driver it's talking to
> >> >> > hardware which is now in D3.
> >> >> >
> >> >> > We could add a mutex to nouveau to allow us to wait for the DGPU HDA
> >> >> > controller, but perhaps this should be solved at a higher level in the
> >> >> > vgaswitcheroo code; what do you think?
> >> >>
> >> >> Maybe it's a side effect for the recent effort to fix another race in
> >> >> the probe.  A part of them problem is that the registration is done at
> >> >> the very last of probing.
> >> >>
> >> >> Instead of delaying the registration, how about the patch below?
> >> >
> >> > Ping.  If this really works, I'd like to queue it for 3.8 merge, at
> >> > least...
> >>
> >> Ping ack; I was trying to find time to understand another race that
> >> occurs with GPU probing after switching, but is separate from the
> >> situation before switching, here.
> >>
> >> In the context of writing the switch, it looks like struct azx isn't
> >> allocated by the time azx_vs_set_state accesses it [1,2]; racing with
> >> azx_codec_create?
> >
> > It was allocated, but it wasn't assigned properly in pci drvdata.
> >
> > Below is the revised patch.  Just moved pci_set_drvdata() before
> > register_vga_switcheroo().  Could you retest with it?
> 
> Superb; this addresses the oops.

OK, I'll queue it to sound tree for 3.8 kernel with Cc to stable.

> ~1 second after the DGPU is put into D3, I still often see "hda-intel:
> spurious response 0x0:0x0, last cmd=0x170500":
> http://quora.org/2012/hda-switch-spurious.txt

Hm, it's not clear who triggers these messages.  I'll try to check the
code paths.

> Presumably this implies the read of the ring-buffer pointer returned
> 0xffffffff, so the HDA driver understands the pointer to have wrapped
> and processes the 191 unwritten entries?

Good point.  Actually there is one bug that looks obviously wrong
(writing 32bit value to CORBWP).  Maybe it has been working just
because writing CORBRP doesn't influence except for the reset bit.

Reading CORBWP as a byte is OK, but this could be better in a word so
that we can check 0xffff as invalid.

A test patch is below.  Hopefully this improves the situation...


thanks,

Takashi

---
diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
index 4bb52da..ce990db 100644
--- a/sound/pci/hda/hda_intel.c
+++ b/sound/pci/hda/hda_intel.c
@@ -805,13 +805,18 @@ static int azx_corb_send_cmd(struct hda_bus *bus, u32 val)
 	spin_lock_irq(&chip->reg_lock);
 
 	/* add command to corb */
-	wp = azx_readb(chip, CORBWP);
+	wp = azx_readw(chip, CORBWP);
+	if (wp == 0xffff) {
+		/* something wrong, controller likely turned to D3 */
+		spin_unlock_irq(&chip->reg_lock);
+		return -1;
+	}
 	wp++;
 	wp %= ICH6_MAX_CORB_ENTRIES;
 
 	chip->rirb.cmds[addr]++;
 	chip->corb.buf[wp] = cpu_to_le32(val);
-	azx_writel(chip, CORBWP, wp);
+	azx_writew(chip, CORBWP, wp);
 
 	spin_unlock_irq(&chip->reg_lock);
 
@@ -827,7 +832,12 @@ static void azx_update_rirb(struct azx *chip)
 	unsigned int addr;
 	u32 res, res_ex;
 
-	wp = azx_readb(chip, RIRBWP);
+	wp = azx_readw(chip, RIRBWP);
+	if (wp == 0xffff) {
+		/* something wrong, controller likely turned to D3 */
+		return;
+	}
+
 	if (wp == chip->rirb.wp)
 		return;
 	chip->rirb.wp = wp;

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: switcheroo registration vs switching race...
  2012-12-03 16:23           ` Takashi Iwai
@ 2012-12-03 16:50             ` Daniel J Blueman
  2012-12-03 17:10               ` Takashi Iwai
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel J Blueman @ 2012-12-03 16:50 UTC (permalink / raw)
  To: Takashi Iwai; +Cc: Seth Forshee, Dave Airlie, Linux Kernel

On 4 December 2012 00:23, Takashi Iwai <tiwai@suse.de> wrote:
> At Mon, 3 Dec 2012 23:08:28 +0800,
> Daniel J Blueman wrote:
>>
>> On 3 December 2012 22:40, Takashi Iwai <tiwai@suse.de> wrote:
>> > At Mon, 3 Dec 2012 22:25:52 +0800,
>> > Daniel J Blueman wrote:
>> >>
>> >> On 3 December 2012 19:17, Takashi Iwai <tiwai@suse.de> wrote:
>> >> > At Wed, 28 Nov 2012 09:45:39 +0100,
>> >> > Takashi Iwai wrote:
>> >> >>
>> >> >> At Wed, 28 Nov 2012 11:45:07 +0800,
>> >> >> Daniel J Blueman wrote:
>> >> >> >
>> >> >> > Hi Seth, Dave, Takashi,
>> >> >> >
>> >> >> > If I power down the unused discrete GPU before lightdm starts by
>> >> >> > fiddling with the sysfs file [1] in the upstart script, I see a race
>> >> >> > manifesting as the discrete GPU's HDA controller timing out to
>> >> >> > commands [2].
>> >> >> >
>> >> >> > Adding some debug, I see that the registered audio devices are put
>> >> >> > into D3 before the GPU is, but it turns out that the discrete (and
>> >> >> > internal) GPU's HDA controller gets registered a bit later, so the
>> >> >> > list is empty. The symptom is since the HDA driver it's talking to
>> >> >> > hardware which is now in D3.
>> >> >> >
>> >> >> > We could add a mutex to nouveau to allow us to wait for the DGPU HDA
>> >> >> > controller, but perhaps this should be solved at a higher level in the
>> >> >> > vgaswitcheroo code; what do you think?
>> >> >>
>> >> >> Maybe it's a side effect for the recent effort to fix another race in
>> >> >> the probe.  A part of them problem is that the registration is done at
>> >> >> the very last of probing.
>> >> >>
>> >> >> Instead of delaying the registration, how about the patch below?
>> >> >
>> >> > Ping.  If this really works, I'd like to queue it for 3.8 merge, at
>> >> > least...
>> >>
>> >> Ping ack; I was trying to find time to understand another race that
>> >> occurs with GPU probing after switching, but is separate from the
>> >> situation before switching, here.
>> >>
>> >> In the context of writing the switch, it looks like struct azx isn't
>> >> allocated by the time azx_vs_set_state accesses it [1,2]; racing with
>> >> azx_codec_create?
>> >
>> > It was allocated, but it wasn't assigned properly in pci drvdata.
>> >
>> > Below is the revised patch.  Just moved pci_set_drvdata() before
>> > register_vga_switcheroo().  Could you retest with it?
>>
>> Superb; this addresses the oops.
>
> OK, I'll queue it to sound tree for 3.8 kernel with Cc to stable.
>
>> ~1 second after the DGPU is put into D3, I still often see "hda-intel:
>> spurious response 0x0:0x0, last cmd=0x170500":
>> http://quora.org/2012/hda-switch-spurious.txt
>
> Hm, it's not clear who triggers these messages.  I'll try to check the
> code paths.
>
>> Presumably this implies the read of the ring-buffer pointer returned
>> 0xffffffff, so the HDA driver understands the pointer to have wrapped
>> and processes the 191 unwritten entries?
>
> Good point.  Actually there is one bug that looks obviously wrong
> (writing 32bit value to CORBWP).  Maybe it has been working just
> because writing CORBRP doesn't influence except for the reset bit.
>
> Reading CORBWP as a byte is OK, but this could be better in a word so
> that we can check 0xffff as invalid.
>
> A test patch is below.  Hopefully this improves the situation...

I'll check this out tomorrow and also instrument the code to get a
backtrace, since there may still be an underlying race with the
previous patches:

[    8.203827] snd_hda_intel 0000:00:1b.0: enabling device (0000 -> 0002)
[    8.203936] snd_hda_intel 0000:00:1b.0: irq 51 for MSI/MSI-X
[   10.981297] VGA switcheroo: switched nouveau off
[   10.981383] nouveau  [     DRM] suspending fbcon...
[   10.981388] nouveau  [     DRM] suspending display...
[   10.981687] nouveau  [     DRM] unpinning framebuffer(s)...
[   10.981825] nouveau  [     DRM] evicting buffers...
[   10.992948] nouveau  [     DRM] suspending client object trees...
[   11.310697] hda-intel: azx_get_response timeout, switching to
polling mode: last cmd=0x000f0000
[   12.320236] hda-intel: No response from codec, disabling MSI: last
cmd=0x000f0000
[   13.329721] hda-intel: Codec #0 probe error; disabling it...
[   14.419163] hda_intel: azx_get_response timeout, switching to
single_cmd mode: last cmd=0x000f0000
[   14.419855] hda-intel: no codecs initialized
[   14.459150] snd_hda_intel 0000:01:00.1: Refused to change power
state, currently in D3
[   14.459176] hda-intel: 0000:01:00.1: Handle VGA-switcheroo audio client
[   14.459183] hda-intel: VGA controller for 0000:01:00.1 is disabled
[   14.459184] hda-intel: Delaying initialization

Full log at: http://quora.org/2012/hda-switch-spurious2.txt

The first response timeout message plausibly could be from the CS4206
(0:1b.0) or Nvidia DGPU (1:0.1); either way, there are no audio
devices after. Also, we see that checking for master abort in one
place won't be sufficient.

Stay tuned for backtraces tomorrow.

Thanks again so far!
  Daniel
--
Daniel J Blueman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: switcheroo registration vs switching race...
  2012-12-03 16:50             ` Daniel J Blueman
@ 2012-12-03 17:10               ` Takashi Iwai
  2012-12-04 12:58                 ` Daniel J Blueman
  0 siblings, 1 reply; 17+ messages in thread
From: Takashi Iwai @ 2012-12-03 17:10 UTC (permalink / raw)
  To: Daniel J Blueman; +Cc: Seth Forshee, Dave Airlie, Linux Kernel

At Tue, 4 Dec 2012 00:50:56 +0800,
Daniel J Blueman wrote:
> 
> On 4 December 2012 00:23, Takashi Iwai <tiwai@suse.de> wrote:
> > At Mon, 3 Dec 2012 23:08:28 +0800,
> > Daniel J Blueman wrote:
> >>
> >> On 3 December 2012 22:40, Takashi Iwai <tiwai@suse.de> wrote:
> >> > At Mon, 3 Dec 2012 22:25:52 +0800,
> >> > Daniel J Blueman wrote:
> >> >>
> >> >> On 3 December 2012 19:17, Takashi Iwai <tiwai@suse.de> wrote:
> >> >> > At Wed, 28 Nov 2012 09:45:39 +0100,
> >> >> > Takashi Iwai wrote:
> >> >> >>
> >> >> >> At Wed, 28 Nov 2012 11:45:07 +0800,
> >> >> >> Daniel J Blueman wrote:
> >> >> >> >
> >> >> >> > Hi Seth, Dave, Takashi,
> >> >> >> >
> >> >> >> > If I power down the unused discrete GPU before lightdm starts by
> >> >> >> > fiddling with the sysfs file [1] in the upstart script, I see a race
> >> >> >> > manifesting as the discrete GPU's HDA controller timing out to
> >> >> >> > commands [2].
> >> >> >> >
> >> >> >> > Adding some debug, I see that the registered audio devices are put
> >> >> >> > into D3 before the GPU is, but it turns out that the discrete (and
> >> >> >> > internal) GPU's HDA controller gets registered a bit later, so the
> >> >> >> > list is empty. The symptom is since the HDA driver it's talking to
> >> >> >> > hardware which is now in D3.
> >> >> >> >
> >> >> >> > We could add a mutex to nouveau to allow us to wait for the DGPU HDA
> >> >> >> > controller, but perhaps this should be solved at a higher level in the
> >> >> >> > vgaswitcheroo code; what do you think?
> >> >> >>
> >> >> >> Maybe it's a side effect for the recent effort to fix another race in
> >> >> >> the probe.  A part of them problem is that the registration is done at
> >> >> >> the very last of probing.
> >> >> >>
> >> >> >> Instead of delaying the registration, how about the patch below?
> >> >> >
> >> >> > Ping.  If this really works, I'd like to queue it for 3.8 merge, at
> >> >> > least...
> >> >>
> >> >> Ping ack; I was trying to find time to understand another race that
> >> >> occurs with GPU probing after switching, but is separate from the
> >> >> situation before switching, here.
> >> >>
> >> >> In the context of writing the switch, it looks like struct azx isn't
> >> >> allocated by the time azx_vs_set_state accesses it [1,2]; racing with
> >> >> azx_codec_create?
> >> >
> >> > It was allocated, but it wasn't assigned properly in pci drvdata.
> >> >
> >> > Below is the revised patch.  Just moved pci_set_drvdata() before
> >> > register_vga_switcheroo().  Could you retest with it?
> >>
> >> Superb; this addresses the oops.
> >
> > OK, I'll queue it to sound tree for 3.8 kernel with Cc to stable.
> >
> >> ~1 second after the DGPU is put into D3, I still often see "hda-intel:
> >> spurious response 0x0:0x0, last cmd=0x170500":
> >> http://quora.org/2012/hda-switch-spurious.txt
> >
> > Hm, it's not clear who triggers these messages.  I'll try to check the
> > code paths.
> >
> >> Presumably this implies the read of the ring-buffer pointer returned
> >> 0xffffffff, so the HDA driver understands the pointer to have wrapped
> >> and processes the 191 unwritten entries?
> >
> > Good point.  Actually there is one bug that looks obviously wrong
> > (writing 32bit value to CORBWP).  Maybe it has been working just
> > because writing CORBRP doesn't influence except for the reset bit.
> >
> > Reading CORBWP as a byte is OK, but this could be better in a word so
> > that we can check 0xffff as invalid.
> >
> > A test patch is below.  Hopefully this improves the situation...
> 
> I'll check this out tomorrow and also instrument the code to get a
> backtrace, since there may still be an underlying race with the
> previous patches:
> 
> [    8.203827] snd_hda_intel 0000:00:1b.0: enabling device (0000 -> 0002)
> [    8.203936] snd_hda_intel 0000:00:1b.0: irq 51 for MSI/MSI-X
> [   10.981297] VGA switcheroo: switched nouveau off
> [   10.981383] nouveau  [     DRM] suspending fbcon...
> [   10.981388] nouveau  [     DRM] suspending display...
> [   10.981687] nouveau  [     DRM] unpinning framebuffer(s)...
> [   10.981825] nouveau  [     DRM] evicting buffers...
> [   10.992948] nouveau  [     DRM] suspending client object trees...
> [   11.310697] hda-intel: azx_get_response timeout, switching to
> polling mode: last cmd=0x000f0000
> [   12.320236] hda-intel: No response from codec, disabling MSI: last
> cmd=0x000f0000
> [   13.329721] hda-intel: Codec #0 probe error; disabling it...
> [   14.419163] hda_intel: azx_get_response timeout, switching to
> single_cmd mode: last cmd=0x000f0000
> [   14.419855] hda-intel: no codecs initialized
> [   14.459150] snd_hda_intel 0000:01:00.1: Refused to change power
> state, currently in D3
> [   14.459176] hda-intel: 0000:01:00.1: Handle VGA-switcheroo audio client
> [   14.459183] hda-intel: VGA controller for 0000:01:00.1 is disabled
> [   14.459184] hda-intel: Delaying initialization
> 
> Full log at: http://quora.org/2012/hda-switch-spurious2.txt
> 
> The first response timeout message plausibly could be from the CS4206
> (0:1b.0) or Nvidia DGPU (1:0.1); either way, there are no audio
> devices after.

That's odd.  The Cirrus one (0000:00:1b.0) must be independent from
the vga switcheroo things for Nvidia...

The patch below is the revised patch of the first one.  Now the vga
switcheroo registration is done before the video controller D3 check,
so the race should be smaller.


> Also, we see that checking for master abort in one
> place won't be sufficient.
> 
> Stay tuned for backtraces tomorrow.

OK, thanks.


Takashi

---
From: Takashi Iwai <tiwai@suse.de>
Subject: [PATCH] ALSA: hda - Fix yet another race of VGA switcheroo

When the video is turned off while probing the HD-audio driver, the
audio driver gets stuck because the registration of vga_switcheroo
client is done at the end of probing phase without notified.

For fixing this race, here we introduced two things:
- The vga switcheroo is registered at the early stage before accessing
  any hardware register,
- Add a completion and make the can_switch callback waiting until all
  probing process has been finished.

Reported-by: Daniel J Blueman <daniel@quora.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
---
 sound/pci/hda/hda_intel.c | 60 ++++++++++++++++++++++++-----------------------
 1 file changed, 31 insertions(+), 29 deletions(-)

diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
index 4bb52da..b845be5 100644
--- a/sound/pci/hda/hda_intel.c
+++ b/sound/pci/hda/hda_intel.c
@@ -49,6 +49,7 @@
 #include <linux/pm_runtime.h>
 #include <linux/clocksource.h>
 #include <linux/time.h>
+#include <linux/completion.h>
 
 #ifdef CONFIG_X86
 /* for snoop control */
@@ -469,6 +470,7 @@ struct azx {
 	/* locks */
 	spinlock_t reg_lock;
 	struct mutex open_mutex;
+	struct completion probe_wait;
 
 	/* streams (x num_streams) */
 	struct azx_dev *azx_dev;
@@ -2790,6 +2792,7 @@ static bool azx_vs_can_switch(struct pci_dev *pci)
 	struct snd_card *card = pci_get_drvdata(pci);
 	struct azx *chip = card->private_data;
 
+	wait_for_completion(&chip->probe_wait);
 	if (chip->init_failed)
 		return false;
 	if (chip->disabled || !chip->bus)
@@ -2851,6 +2854,9 @@ static int azx_free(struct azx *chip)
 
 	azx_notifier_unregister(chip);
 
+	chip->init_failed = 1; /* to be sure */
+	complete(&chip->probe_wait);
+
 	if (use_vga_switcheroo(chip)) {
 		if (chip->disabled && chip->bus)
 			snd_hda_unlock_devices(chip->bus);
@@ -3156,6 +3162,7 @@ static int __devinit azx_create(struct snd_card *card, struct pci_dev *pci,
 	INIT_LIST_HEAD(&chip->pcm_list);
 	INIT_LIST_HEAD(&chip->list);
 	init_vga_switcheroo(chip);
+	init_completion(&chip->probe_wait);
 
 	chip->position_fix[0] = chip->position_fix[1] =
 		check_position_fix(chip, position_fix[dev]);
@@ -3183,26 +3190,6 @@ static int __devinit azx_create(struct snd_card *card, struct pci_dev *pci,
 		}
 	}
 
-	if (check_hdmi_disabled(pci)) {
-		snd_printk(KERN_INFO SFX "VGA controller for %s is disabled\n",
-			   pci_name(pci));
-		if (use_vga_switcheroo(chip)) {
-			snd_printk(KERN_INFO SFX "Delaying initialization\n");
-			chip->disabled = true;
-			goto ok;
-		}
-		kfree(chip);
-		pci_disable_device(pci);
-		return -ENXIO;
-	}
-
-	err = azx_first_init(chip);
-	if (err < 0) {
-		azx_free(chip);
-		return err;
-	}
-
- ok:
 	err = snd_device_new(card, SNDRV_DEV_LOWLEVEL, chip, &ops);
 	if (err < 0) {
 		snd_printk(KERN_ERR SFX "Error creating device [card]!\n");
@@ -3447,7 +3434,29 @@ static int __devinit azx_probe(struct pci_dev *pci,
 	if (err < 0)
 		goto out_free;
 	card->private_data = chip;
+
+	pci_set_drvdata(pci, card);
+
+	err = register_vga_switcheroo(chip);
+	if (err < 0) {
+		snd_printk(KERN_ERR SFX
+			   "Error registering VGA-switcheroo client\n");
+		goto out_free;
+	}
+
+	if (check_hdmi_disabled(pci)) {
+		snd_printk(KERN_INFO SFX "VGA controller for %s is disabled\n",
+			   pci_name(pci));
+		snd_printk(KERN_INFO SFX "Delaying initialization\n");
+		chip->disabled = true;
+	}
+
 	probe_now = !chip->disabled;
+	if (probe_now) {
+		err = azx_first_init(chip);
+		if (err < 0)
+			goto out_free;
+	}
 
 #ifdef CONFIG_SND_HDA_PATCH_LOADER
 	if (patch[dev] && *patch[dev]) {
@@ -3468,23 +3477,16 @@ static int __devinit azx_probe(struct pci_dev *pci,
 			goto out_free;
 	}
 
-	pci_set_drvdata(pci, card);
-
 	if (pci_dev_run_wake(pci))
 		pm_runtime_put_noidle(&pci->dev);
 
-	err = register_vga_switcheroo(chip);
-	if (err < 0) {
-		snd_printk(KERN_ERR SFX
-			   "Error registering VGA-switcheroo client\n");
-		goto out_free;
-	}
-
 	dev++;
+	complete(&chip->probe_wait);
 	return 0;
 
 out_free:
 	snd_card_free(card);
+	pci_set_drvdata(pci, NULL);
 	return err;
 }
 
-- 
1.8.0.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: switcheroo registration vs switching race...
  2012-12-03 17:10               ` Takashi Iwai
@ 2012-12-04 12:58                 ` Daniel J Blueman
  2012-12-04 13:23                   ` Takashi Iwai
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel J Blueman @ 2012-12-04 12:58 UTC (permalink / raw)
  To: Takashi Iwai; +Cc: Seth Forshee, Dave Airlie, Linux Kernel

On 4 December 2012 01:10, Takashi Iwai <tiwai@suse.de> wrote:
> At Tue, 4 Dec 2012 00:50:56 +0800,
> Daniel J Blueman wrote:
>>
>> On 4 December 2012 00:23, Takashi Iwai <tiwai@suse.de> wrote:
>> > At Mon, 3 Dec 2012 23:08:28 +0800,
>> > Daniel J Blueman wrote:
>> >>
>> >> On 3 December 2012 22:40, Takashi Iwai <tiwai@suse.de> wrote:
>> >> > At Mon, 3 Dec 2012 22:25:52 +0800,
>> >> > Daniel J Blueman wrote:
>> >> >>
>> >> >> On 3 December 2012 19:17, Takashi Iwai <tiwai@suse.de> wrote:
>> >> >> > At Wed, 28 Nov 2012 09:45:39 +0100,
>> >> >> > Takashi Iwai wrote:
>> >> >> >>
>> >> >> >> At Wed, 28 Nov 2012 11:45:07 +0800,
>> >> >> >> Daniel J Blueman wrote:
>> >> >> >> >
>> >> >> >> > Hi Seth, Dave, Takashi,
>> >> >> >> >
>> >> >> >> > If I power down the unused discrete GPU before lightdm starts by
>> >> >> >> > fiddling with the sysfs file [1] in the upstart script, I see a race
>> >> >> >> > manifesting as the discrete GPU's HDA controller timing out to
>> >> >> >> > commands [2].
>> >> >> >> >
>> >> >> >> > Adding some debug, I see that the registered audio devices are put
>> >> >> >> > into D3 before the GPU is, but it turns out that the discrete (and
>> >> >> >> > internal) GPU's HDA controller gets registered a bit later, so the
>> >> >> >> > list is empty. The symptom is since the HDA driver it's talking to
>> >> >> >> > hardware which is now in D3.
>> >> >> >> >
>> >> >> >> > We could add a mutex to nouveau to allow us to wait for the DGPU HDA
>> >> >> >> > controller, but perhaps this should be solved at a higher level in the
>> >> >> >> > vgaswitcheroo code; what do you think?
>> >> >> >>
>> >> >> >> Maybe it's a side effect for the recent effort to fix another race in
>> >> >> >> the probe.  A part of them problem is that the registration is done at
>> >> >> >> the very last of probing.
>> >> >> >>
>> >> >> >> Instead of delaying the registration, how about the patch below?
>> >> >> >
>> >> >> > Ping.  If this really works, I'd like to queue it for 3.8 merge, at
>> >> >> > least...
>> >> >>
>> >> >> Ping ack; I was trying to find time to understand another race that
>> >> >> occurs with GPU probing after switching, but is separate from the
>> >> >> situation before switching, here.
>> >> >>
>> >> >> In the context of writing the switch, it looks like struct azx isn't
>> >> >> allocated by the time azx_vs_set_state accesses it [1,2]; racing with
>> >> >> azx_codec_create?
>> >> >
>> >> > It was allocated, but it wasn't assigned properly in pci drvdata.
>> >> >
>> >> > Below is the revised patch.  Just moved pci_set_drvdata() before
>> >> > register_vga_switcheroo().  Could you retest with it?
>> >>
>> >> Superb; this addresses the oops.
>> >
>> > OK, I'll queue it to sound tree for 3.8 kernel with Cc to stable.
>> >
>> >> ~1 second after the DGPU is put into D3, I still often see "hda-intel:
>> >> spurious response 0x0:0x0, last cmd=0x170500":
>> >> http://quora.org/2012/hda-switch-spurious.txt
>> >
>> > Hm, it's not clear who triggers these messages.  I'll try to check the
>> > code paths.
>> >
>> >> Presumably this implies the read of the ring-buffer pointer returned
>> >> 0xffffffff, so the HDA driver understands the pointer to have wrapped
>> >> and processes the 191 unwritten entries?
>> >
>> > Good point.  Actually there is one bug that looks obviously wrong
>> > (writing 32bit value to CORBWP).  Maybe it has been working just
>> > because writing CORBRP doesn't influence except for the reset bit.
>> >
>> > Reading CORBWP as a byte is OK, but this could be better in a word so
>> > that we can check 0xffff as invalid.
>> >
>> > A test patch is below.  Hopefully this improves the situation...
>>
>> I'll check this out tomorrow and also instrument the code to get a
>> backtrace, since there may still be an underlying race with the
>> previous patches:

[...]

> That's odd.  The Cirrus one (0000:00:1b.0) must be independent from
> the vga switcheroo things for Nvidia...
>
> The patch below is the revised patch of the first one.  Now the vga
> switcheroo registration is done before the video controller D3 check,
> so the race should be smaller.

This patch improves things further of course; a major improvement over
without. ~15% of the time, I still get the 'spurious response'
messages in this callpath:

Pid: 473, comm: modprobe Not tainted 3.7.0-rc8-expert+ #15
Call Trace:
 [<ffffffffa0025776>] azx_update_rirb+0xe6/0xf0 [snd_hda_intel]
 [<ffffffffa0026d00>] azx_get_response+0xc0/0x270 [snd_hda_intel]
 [<ffffffffa0226801>] codec_exec_verb+0xc1/0x100 [snd_hda_codec]
 [<ffffffffa022799b>] snd_hda_codec_read+0x5b/0x90 [snd_hda_codec]
 [<ffffffff810545e9>] ? try_to_grab_pending+0xb9/0x130
 [<ffffffffa0227d35>] hda_set_power_state+0xd5/0x110 [snd_hda_codec]
 [<ffffffffa0226420>] hda_call_codec_resume+0x20/0x120 [snd_hda_codec]
 [<ffffffffa0226689>] snd_hda_power_save+0x129/0x1e0 [snd_hda_codec]
 [<ffffffffa022678f>] codec_exec_verb+0x4f/0x100 [snd_hda_codec]
 [<ffffffffa02268ab>] snd_hda_codec_write+0x6b/0xa0 [snd_hda_codec]
 [<ffffffffa0227281>] snd_hda_codec_write_cache+0x31/0xa0 [snd_hda_codec]
 [<ffffffffa022b965>] snd_hda_jack_detect_enable_callback+0x75/0xa0
[snd_hda_codec]
 [<ffffffffa022b9a1>] snd_hda_jack_detect_enable+0x11/0x20 [snd_hda_codec]
 [<ffffffffa00cf5a6>] generic_hdmi_init+0xb6/0xe0 [snd_hda_codec_hdmi]
 [<ffffffffa022ad67>] snd_hda_codec_build_controls+0x37/0x130 [snd_hda_codec]
 [<ffffffffa022b080>] ? snd_hda_codec_build_pcms+0x1a0/0x2e0 [snd_hda_codec]
 [<ffffffffa022ae88>] snd_hda_build_controls+0x28/0x80 [snd_hda_codec]
 [<ffffffffa0027ed2>] azx_probe_continue+0x2f2/0x410 [snd_hda_intel]
 [<ffffffffa0027a10>] ? azx_pcm_hw_params+0xe0/0xe0 [snd_hda_intel]
 [<ffffffffa0026c40>] ? azx_resume+0x110/0x110 [snd_hda_intel]
 [<ffffffffa00258a0>] ? azx_power_notify+0x40/0x40 [snd_hda_intel]
 [<ffffffffa00269e0>] ? azx_runtime_resume+0x40/0x40 [snd_hda_intel]
 [<ffffffffa0025860>] ? azx_vs_can_switch+0x90/0x90 [snd_hda_intel]
 [<ffffffffa0028966>] azx_probe+0x6c9/0x7a3 [snd_hda_intel]
 [<ffffffff8120ae74>] local_pci_probe+0x74/0x100
 [<ffffffff8120b641>] pci_device_probe+0x111/0x120
 [<ffffffff812be936>] driver_probe_device+0x76/0x240
 [<ffffffff812beb9b>] __driver_attach+0x9b/0xa0
 [<ffffffff812beb00>] ? driver_probe_device+0x240/0x240
 [<ffffffff812bcc9d>] bus_for_each_dev+0x4d/0x90
 [<ffffffff812be499>] driver_attach+0x19/0x20
 [<ffffffff812be050>] bus_add_driver+0x190/0x260
 [<ffffffffa002e000>] ? 0xffffffffa002dfff
 [<ffffffffa002e000>] ? 0xffffffffa002dfff
 [<ffffffff812bf202>] driver_register+0x72/0x170
 [<ffffffffa002e000>] ? 0xffffffffa002dfff
 [<ffffffffa002e000>] ? 0xffffffffa002dfff
 [<ffffffff8120b723>] __pci_register_driver+0x43/0x50
 [<ffffffffa002e01e>] azx_driver_init+0x1e/0x1000 [snd_hda_intel]
 [<ffffffff810002da>] do_one_initcall+0x11a/0x160
 [<ffffffff81087cfd>] sys_init_module+0xbd/0x220
 [<ffffffff814b7b96>] system_call_fastpath+0x1a/0x1f

Full boot log is at http://quora.org/2012/hda-switch-spurious3.txt .
Thank you for your work so far, Takashi!

I'll present a patch tomorrow to annotate context of the intel-hda
messages, since it'll help me check behaviour when diagnosing the GPU
switching issues.

Thanks again,
  Daniel
--
Daniel J Blueman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: switcheroo registration vs switching race...
  2012-12-04 12:58                 ` Daniel J Blueman
@ 2012-12-04 13:23                   ` Takashi Iwai
  2012-12-04 13:55                     ` Takashi Iwai
  0 siblings, 1 reply; 17+ messages in thread
From: Takashi Iwai @ 2012-12-04 13:23 UTC (permalink / raw)
  To: Daniel J Blueman; +Cc: Seth Forshee, Dave Airlie, Linux Kernel

At Tue, 4 Dec 2012 20:58:55 +0800,
Daniel J Blueman wrote:
> 
> On 4 December 2012 01:10, Takashi Iwai <tiwai@suse.de> wrote:
> > At Tue, 4 Dec 2012 00:50:56 +0800,
> > Daniel J Blueman wrote:
> >>
> >> On 4 December 2012 00:23, Takashi Iwai <tiwai@suse.de> wrote:
> >> > At Mon, 3 Dec 2012 23:08:28 +0800,
> >> > Daniel J Blueman wrote:
> >> >>
> >> >> On 3 December 2012 22:40, Takashi Iwai <tiwai@suse.de> wrote:
> >> >> > At Mon, 3 Dec 2012 22:25:52 +0800,
> >> >> > Daniel J Blueman wrote:
> >> >> >>
> >> >> >> On 3 December 2012 19:17, Takashi Iwai <tiwai@suse.de> wrote:
> >> >> >> > At Wed, 28 Nov 2012 09:45:39 +0100,
> >> >> >> > Takashi Iwai wrote:
> >> >> >> >>
> >> >> >> >> At Wed, 28 Nov 2012 11:45:07 +0800,
> >> >> >> >> Daniel J Blueman wrote:
> >> >> >> >> >
> >> >> >> >> > Hi Seth, Dave, Takashi,
> >> >> >> >> >
> >> >> >> >> > If I power down the unused discrete GPU before lightdm starts by
> >> >> >> >> > fiddling with the sysfs file [1] in the upstart script, I see a race
> >> >> >> >> > manifesting as the discrete GPU's HDA controller timing out to
> >> >> >> >> > commands [2].
> >> >> >> >> >
> >> >> >> >> > Adding some debug, I see that the registered audio devices are put
> >> >> >> >> > into D3 before the GPU is, but it turns out that the discrete (and
> >> >> >> >> > internal) GPU's HDA controller gets registered a bit later, so the
> >> >> >> >> > list is empty. The symptom is since the HDA driver it's talking to
> >> >> >> >> > hardware which is now in D3.
> >> >> >> >> >
> >> >> >> >> > We could add a mutex to nouveau to allow us to wait for the DGPU HDA
> >> >> >> >> > controller, but perhaps this should be solved at a higher level in the
> >> >> >> >> > vgaswitcheroo code; what do you think?
> >> >> >> >>
> >> >> >> >> Maybe it's a side effect for the recent effort to fix another race in
> >> >> >> >> the probe.  A part of them problem is that the registration is done at
> >> >> >> >> the very last of probing.
> >> >> >> >>
> >> >> >> >> Instead of delaying the registration, how about the patch below?
> >> >> >> >
> >> >> >> > Ping.  If this really works, I'd like to queue it for 3.8 merge, at
> >> >> >> > least...
> >> >> >>
> >> >> >> Ping ack; I was trying to find time to understand another race that
> >> >> >> occurs with GPU probing after switching, but is separate from the
> >> >> >> situation before switching, here.
> >> >> >>
> >> >> >> In the context of writing the switch, it looks like struct azx isn't
> >> >> >> allocated by the time azx_vs_set_state accesses it [1,2]; racing with
> >> >> >> azx_codec_create?
> >> >> >
> >> >> > It was allocated, but it wasn't assigned properly in pci drvdata.
> >> >> >
> >> >> > Below is the revised patch.  Just moved pci_set_drvdata() before
> >> >> > register_vga_switcheroo().  Could you retest with it?
> >> >>
> >> >> Superb; this addresses the oops.
> >> >
> >> > OK, I'll queue it to sound tree for 3.8 kernel with Cc to stable.
> >> >
> >> >> ~1 second after the DGPU is put into D3, I still often see "hda-intel:
> >> >> spurious response 0x0:0x0, last cmd=0x170500":
> >> >> http://quora.org/2012/hda-switch-spurious.txt
> >> >
> >> > Hm, it's not clear who triggers these messages.  I'll try to check the
> >> > code paths.
> >> >
> >> >> Presumably this implies the read of the ring-buffer pointer returned
> >> >> 0xffffffff, so the HDA driver understands the pointer to have wrapped
> >> >> and processes the 191 unwritten entries?
> >> >
> >> > Good point.  Actually there is one bug that looks obviously wrong
> >> > (writing 32bit value to CORBWP).  Maybe it has been working just
> >> > because writing CORBRP doesn't influence except for the reset bit.
> >> >
> >> > Reading CORBWP as a byte is OK, but this could be better in a word so
> >> > that we can check 0xffff as invalid.
> >> >
> >> > A test patch is below.  Hopefully this improves the situation...
> >>
> >> I'll check this out tomorrow and also instrument the code to get a
> >> backtrace, since there may still be an underlying race with the
> >> previous patches:
> 
> [...]
> 
> > That's odd.  The Cirrus one (0000:00:1b.0) must be independent from
> > the vga switcheroo things for Nvidia...
> >
> > The patch below is the revised patch of the first one.  Now the vga
> > switcheroo registration is done before the video controller D3 check,
> > so the race should be smaller.
> 
> This patch improves things further of course; a major improvement over
> without. ~15% of the time, I still get the 'spurious response'
> messages in this callpath:

A possible scenario is that the graphics went in D3 before probing
hd-audio, and the hd-audio continues to initialize the hardware
without knowing the graphics counterpart is disabled.

There is a code check_hdmi_disabled() in hda_intel.c that checks the 
availability of the video driver, and it might be that this doesn't
work as expected...

> 
> Pid: 473, comm: modprobe Not tainted 3.7.0-rc8-expert+ #15
> Call Trace:
>  [<ffffffffa0025776>] azx_update_rirb+0xe6/0xf0 [snd_hda_intel]
>  [<ffffffffa0026d00>] azx_get_response+0xc0/0x270 [snd_hda_intel]
>  [<ffffffffa0226801>] codec_exec_verb+0xc1/0x100 [snd_hda_codec]
>  [<ffffffffa022799b>] snd_hda_codec_read+0x5b/0x90 [snd_hda_codec]
>  [<ffffffff810545e9>] ? try_to_grab_pending+0xb9/0x130
>  [<ffffffffa0227d35>] hda_set_power_state+0xd5/0x110 [snd_hda_codec]
>  [<ffffffffa0226420>] hda_call_codec_resume+0x20/0x120 [snd_hda_codec]
>  [<ffffffffa0226689>] snd_hda_power_save+0x129/0x1e0 [snd_hda_codec]
>  [<ffffffffa022678f>] codec_exec_verb+0x4f/0x100 [snd_hda_codec]
>  [<ffffffffa02268ab>] snd_hda_codec_write+0x6b/0xa0 [snd_hda_codec]
>  [<ffffffffa0227281>] snd_hda_codec_write_cache+0x31/0xa0 [snd_hda_codec]
>  [<ffffffffa022b965>] snd_hda_jack_detect_enable_callback+0x75/0xa0
> [snd_hda_codec]
>  [<ffffffffa022b9a1>] snd_hda_jack_detect_enable+0x11/0x20 [snd_hda_codec]
>  [<ffffffffa00cf5a6>] generic_hdmi_init+0xb6/0xe0 [snd_hda_codec_hdmi]
>  [<ffffffffa022ad67>] snd_hda_codec_build_controls+0x37/0x130 [snd_hda_codec]
>  [<ffffffffa022b080>] ? snd_hda_codec_build_pcms+0x1a0/0x2e0 [snd_hda_codec]
>  [<ffffffffa022ae88>] snd_hda_build_controls+0x28/0x80 [snd_hda_codec]
>  [<ffffffffa0027ed2>] azx_probe_continue+0x2f2/0x410 [snd_hda_intel]
>  [<ffffffffa0027a10>] ? azx_pcm_hw_params+0xe0/0xe0 [snd_hda_intel]
>  [<ffffffffa0026c40>] ? azx_resume+0x110/0x110 [snd_hda_intel]
>  [<ffffffffa00258a0>] ? azx_power_notify+0x40/0x40 [snd_hda_intel]
>  [<ffffffffa00269e0>] ? azx_runtime_resume+0x40/0x40 [snd_hda_intel]
>  [<ffffffffa0025860>] ? azx_vs_can_switch+0x90/0x90 [snd_hda_intel]
>  [<ffffffffa0028966>] azx_probe+0x6c9/0x7a3 [snd_hda_intel]
>  [<ffffffff8120ae74>] local_pci_probe+0x74/0x100
>  [<ffffffff8120b641>] pci_device_probe+0x111/0x120
>  [<ffffffff812be936>] driver_probe_device+0x76/0x240
>  [<ffffffff812beb9b>] __driver_attach+0x9b/0xa0
>  [<ffffffff812beb00>] ? driver_probe_device+0x240/0x240
>  [<ffffffff812bcc9d>] bus_for_each_dev+0x4d/0x90
>  [<ffffffff812be499>] driver_attach+0x19/0x20
>  [<ffffffff812be050>] bus_add_driver+0x190/0x260
>  [<ffffffffa002e000>] ? 0xffffffffa002dfff
>  [<ffffffffa002e000>] ? 0xffffffffa002dfff
>  [<ffffffff812bf202>] driver_register+0x72/0x170
>  [<ffffffffa002e000>] ? 0xffffffffa002dfff
>  [<ffffffffa002e000>] ? 0xffffffffa002dfff
>  [<ffffffff8120b723>] __pci_register_driver+0x43/0x50
>  [<ffffffffa002e01e>] azx_driver_init+0x1e/0x1000 [snd_hda_intel]
>  [<ffffffff810002da>] do_one_initcall+0x11a/0x160
>  [<ffffffff81087cfd>] sys_init_module+0xbd/0x220
>  [<ffffffff814b7b96>] system_call_fastpath+0x1a/0x1f
> 
> Full boot log is at http://quora.org/2012/hda-switch-spurious3.txt .
> Thank you for your work so far, Takashi!
> 
> I'll present a patch tomorrow to annotate context of the intel-hda
> messages, since it'll help me check behaviour when diagnosing the GPU
> switching issues.

OK.  I'll also check a bit more about other possibilities.


thanks,

Takashi

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: switcheroo registration vs switching race...
  2012-12-04 13:23                   ` Takashi Iwai
@ 2012-12-04 13:55                     ` Takashi Iwai
  2012-12-04 14:46                       ` Daniel J Blueman
  0 siblings, 1 reply; 17+ messages in thread
From: Takashi Iwai @ 2012-12-04 13:55 UTC (permalink / raw)
  To: Daniel J Blueman; +Cc: Seth Forshee, Dave Airlie, Linux Kernel

At Tue, 04 Dec 2012 14:23:05 +0100,
Takashi Iwai wrote:
> 
> At Tue, 4 Dec 2012 20:58:55 +0800,
> Daniel J Blueman wrote:
> > 
> > On 4 December 2012 01:10, Takashi Iwai <tiwai@suse.de> wrote:
> > > At Tue, 4 Dec 2012 00:50:56 +0800,
> > > Daniel J Blueman wrote:
> > >>
> > >> On 4 December 2012 00:23, Takashi Iwai <tiwai@suse.de> wrote:
> > >> > At Mon, 3 Dec 2012 23:08:28 +0800,
> > >> > Daniel J Blueman wrote:
> > >> >>
> > >> >> On 3 December 2012 22:40, Takashi Iwai <tiwai@suse.de> wrote:
> > >> >> > At Mon, 3 Dec 2012 22:25:52 +0800,
> > >> >> > Daniel J Blueman wrote:
> > >> >> >>
> > >> >> >> On 3 December 2012 19:17, Takashi Iwai <tiwai@suse.de> wrote:
> > >> >> >> > At Wed, 28 Nov 2012 09:45:39 +0100,
> > >> >> >> > Takashi Iwai wrote:
> > >> >> >> >>
> > >> >> >> >> At Wed, 28 Nov 2012 11:45:07 +0800,
> > >> >> >> >> Daniel J Blueman wrote:
> > >> >> >> >> >
> > >> >> >> >> > Hi Seth, Dave, Takashi,
> > >> >> >> >> >
> > >> >> >> >> > If I power down the unused discrete GPU before lightdm starts by
> > >> >> >> >> > fiddling with the sysfs file [1] in the upstart script, I see a race
> > >> >> >> >> > manifesting as the discrete GPU's HDA controller timing out to
> > >> >> >> >> > commands [2].
> > >> >> >> >> >
> > >> >> >> >> > Adding some debug, I see that the registered audio devices are put
> > >> >> >> >> > into D3 before the GPU is, but it turns out that the discrete (and
> > >> >> >> >> > internal) GPU's HDA controller gets registered a bit later, so the
> > >> >> >> >> > list is empty. The symptom is since the HDA driver it's talking to
> > >> >> >> >> > hardware which is now in D3.
> > >> >> >> >> >
> > >> >> >> >> > We could add a mutex to nouveau to allow us to wait for the DGPU HDA
> > >> >> >> >> > controller, but perhaps this should be solved at a higher level in the
> > >> >> >> >> > vgaswitcheroo code; what do you think?
> > >> >> >> >>
> > >> >> >> >> Maybe it's a side effect for the recent effort to fix another race in
> > >> >> >> >> the probe.  A part of them problem is that the registration is done at
> > >> >> >> >> the very last of probing.
> > >> >> >> >>
> > >> >> >> >> Instead of delaying the registration, how about the patch below?
> > >> >> >> >
> > >> >> >> > Ping.  If this really works, I'd like to queue it for 3.8 merge, at
> > >> >> >> > least...
> > >> >> >>
> > >> >> >> Ping ack; I was trying to find time to understand another race that
> > >> >> >> occurs with GPU probing after switching, but is separate from the
> > >> >> >> situation before switching, here.
> > >> >> >>
> > >> >> >> In the context of writing the switch, it looks like struct azx isn't
> > >> >> >> allocated by the time azx_vs_set_state accesses it [1,2]; racing with
> > >> >> >> azx_codec_create?
> > >> >> >
> > >> >> > It was allocated, but it wasn't assigned properly in pci drvdata.
> > >> >> >
> > >> >> > Below is the revised patch.  Just moved pci_set_drvdata() before
> > >> >> > register_vga_switcheroo().  Could you retest with it?
> > >> >>
> > >> >> Superb; this addresses the oops.
> > >> >
> > >> > OK, I'll queue it to sound tree for 3.8 kernel with Cc to stable.
> > >> >
> > >> >> ~1 second after the DGPU is put into D3, I still often see "hda-intel:
> > >> >> spurious response 0x0:0x0, last cmd=0x170500":
> > >> >> http://quora.org/2012/hda-switch-spurious.txt
> > >> >
> > >> > Hm, it's not clear who triggers these messages.  I'll try to check the
> > >> > code paths.
> > >> >
> > >> >> Presumably this implies the read of the ring-buffer pointer returned
> > >> >> 0xffffffff, so the HDA driver understands the pointer to have wrapped
> > >> >> and processes the 191 unwritten entries?
> > >> >
> > >> > Good point.  Actually there is one bug that looks obviously wrong
> > >> > (writing 32bit value to CORBWP).  Maybe it has been working just
> > >> > because writing CORBRP doesn't influence except for the reset bit.
> > >> >
> > >> > Reading CORBWP as a byte is OK, but this could be better in a word so
> > >> > that we can check 0xffff as invalid.
> > >> >
> > >> > A test patch is below.  Hopefully this improves the situation...
> > >>
> > >> I'll check this out tomorrow and also instrument the code to get a
> > >> backtrace, since there may still be an underlying race with the
> > >> previous patches:
> > 
> > [...]
> > 
> > > That's odd.  The Cirrus one (0000:00:1b.0) must be independent from
> > > the vga switcheroo things for Nvidia...
> > >
> > > The patch below is the revised patch of the first one.  Now the vga
> > > switcheroo registration is done before the video controller D3 check,
> > > so the race should be smaller.
> > 
> > This patch improves things further of course; a major improvement over
> > without. ~15% of the time, I still get the 'spurious response'
> > messages in this callpath:
> 
> A possible scenario is that the graphics went in D3 before probing
> hd-audio, and the hd-audio continues to initialize the hardware
> without knowing the graphics counterpart is disabled.
> 
> There is a code check_hdmi_disabled() in hda_intel.c that checks the 
> availability of the video driver, and it might be that this doesn't
> work as expected...

I think I understand this path.  You are setting "OFF", right?
This will set the audio client OFF before can_switch() is called.
Thus it can be called even before the probing process finished.

In short, wait_for_completion() must be put at the beginning of
azx_vs_set_state() in addition to azx_vs_can_switch().

The revised patch is attached below.  Hopefully this sorts out all
races.


Takashi

---
diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
index 4bb52da..22ecadc 100644
--- a/sound/pci/hda/hda_intel.c
+++ b/sound/pci/hda/hda_intel.c
@@ -49,6 +49,7 @@
 #include <linux/pm_runtime.h>
 #include <linux/clocksource.h>
 #include <linux/time.h>
+#include <linux/completion.h>
 
 #ifdef CONFIG_X86
 /* for snoop control */
@@ -469,6 +470,7 @@ struct azx {
 	/* locks */
 	spinlock_t reg_lock;
 	struct mutex open_mutex;
+	struct completion probe_wait;
 
 	/* streams (x num_streams) */
 	struct azx_dev *azx_dev;
@@ -2745,6 +2747,7 @@ static void azx_vs_set_state(struct pci_dev *pci,
 	struct azx *chip = card->private_data;
 	bool disabled;
 
+	wait_for_completion(&chip->probe_wait);
 	if (chip->init_failed)
 		return;
 
@@ -2790,6 +2793,7 @@ static bool azx_vs_can_switch(struct pci_dev *pci)
 	struct snd_card *card = pci_get_drvdata(pci);
 	struct azx *chip = card->private_data;
 
+	wait_for_completion(&chip->probe_wait);
 	if (chip->init_failed)
 		return false;
 	if (chip->disabled || !chip->bus)
@@ -2851,6 +2855,9 @@ static int azx_free(struct azx *chip)
 
 	azx_notifier_unregister(chip);
 
+	chip->init_failed = 1; /* to be sure */
+	complete(&chip->probe_wait);
+
 	if (use_vga_switcheroo(chip)) {
 		if (chip->disabled && chip->bus)
 			snd_hda_unlock_devices(chip->bus);
@@ -3156,6 +3163,7 @@ static int __devinit azx_create(struct snd_card *card, struct pci_dev *pci,
 	INIT_LIST_HEAD(&chip->pcm_list);
 	INIT_LIST_HEAD(&chip->list);
 	init_vga_switcheroo(chip);
+	init_completion(&chip->probe_wait);
 
 	chip->position_fix[0] = chip->position_fix[1] =
 		check_position_fix(chip, position_fix[dev]);
@@ -3183,26 +3191,6 @@ static int __devinit azx_create(struct snd_card *card, struct pci_dev *pci,
 		}
 	}
 
-	if (check_hdmi_disabled(pci)) {
-		snd_printk(KERN_INFO SFX "VGA controller for %s is disabled\n",
-			   pci_name(pci));
-		if (use_vga_switcheroo(chip)) {
-			snd_printk(KERN_INFO SFX "Delaying initialization\n");
-			chip->disabled = true;
-			goto ok;
-		}
-		kfree(chip);
-		pci_disable_device(pci);
-		return -ENXIO;
-	}
-
-	err = azx_first_init(chip);
-	if (err < 0) {
-		azx_free(chip);
-		return err;
-	}
-
- ok:
 	err = snd_device_new(card, SNDRV_DEV_LOWLEVEL, chip, &ops);
 	if (err < 0) {
 		snd_printk(KERN_ERR SFX "Error creating device [card]!\n");
@@ -3447,7 +3435,29 @@ static int __devinit azx_probe(struct pci_dev *pci,
 	if (err < 0)
 		goto out_free;
 	card->private_data = chip;
+
+	pci_set_drvdata(pci, card);
+
+	err = register_vga_switcheroo(chip);
+	if (err < 0) {
+		snd_printk(KERN_ERR SFX
+			   "Error registering VGA-switcheroo client\n");
+		goto out_free;
+	}
+
+	if (check_hdmi_disabled(pci)) {
+		snd_printk(KERN_INFO SFX "VGA controller for %s is disabled\n",
+			   pci_name(pci));
+		snd_printk(KERN_INFO SFX "Delaying initialization\n");
+		chip->disabled = true;
+	}
+
 	probe_now = !chip->disabled;
+	if (probe_now) {
+		err = azx_first_init(chip);
+		if (err < 0)
+			goto out_free;
+	}
 
 #ifdef CONFIG_SND_HDA_PATCH_LOADER
 	if (patch[dev] && *patch[dev]) {
@@ -3468,23 +3478,16 @@ static int __devinit azx_probe(struct pci_dev *pci,
 			goto out_free;
 	}
 
-	pci_set_drvdata(pci, card);
-
 	if (pci_dev_run_wake(pci))
 		pm_runtime_put_noidle(&pci->dev);
 
-	err = register_vga_switcheroo(chip);
-	if (err < 0) {
-		snd_printk(KERN_ERR SFX
-			   "Error registering VGA-switcheroo client\n");
-		goto out_free;
-	}
-
 	dev++;
+	complete(&chip->probe_wait);
 	return 0;
 
 out_free:
 	snd_card_free(card);
+	pci_set_drvdata(pci, NULL);
 	return err;
 }
 

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: switcheroo registration vs switching race...
  2012-12-04 13:55                     ` Takashi Iwai
@ 2012-12-04 14:46                       ` Daniel J Blueman
  2012-12-04 15:03                         ` Takashi Iwai
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel J Blueman @ 2012-12-04 14:46 UTC (permalink / raw)
  To: Takashi Iwai; +Cc: Seth Forshee, Dave Airlie, Linux Kernel

On 4 December 2012 21:55, Takashi Iwai <tiwai@suse.de> wrote:
> At Tue, 04 Dec 2012 14:23:05 +0100,
> Takashi Iwai wrote:
>>
>> At Tue, 4 Dec 2012 20:58:55 +0800,
>> Daniel J Blueman wrote:
>> >
>> > On 4 December 2012 01:10, Takashi Iwai <tiwai@suse.de> wrote:
>> > > At Tue, 4 Dec 2012 00:50:56 +0800,
>> > > Daniel J Blueman wrote:
>> > >>
>> > >> On 4 December 2012 00:23, Takashi Iwai <tiwai@suse.de> wrote:
>> > >> > At Mon, 3 Dec 2012 23:08:28 +0800,
>> > >> > Daniel J Blueman wrote:
>> > >> >>
>> > >> >> On 3 December 2012 22:40, Takashi Iwai <tiwai@suse.de> wrote:
>> > >> >> > At Mon, 3 Dec 2012 22:25:52 +0800,
>> > >> >> > Daniel J Blueman wrote:
>> > >> >> >>
>> > >> >> >> On 3 December 2012 19:17, Takashi Iwai <tiwai@suse.de> wrote:
>> > >> >> >> > At Wed, 28 Nov 2012 09:45:39 +0100,
>> > >> >> >> > Takashi Iwai wrote:
>> > >> >> >> >>
>> > >> >> >> >> At Wed, 28 Nov 2012 11:45:07 +0800,
>> > >> >> >> >> Daniel J Blueman wrote:
>> > >> >> >> >> >
>> > >> >> >> >> > Hi Seth, Dave, Takashi,
>> > >> >> >> >> >
>> > >> >> >> >> > If I power down the unused discrete GPU before lightdm starts by
>> > >> >> >> >> > fiddling with the sysfs file [1] in the upstart script, I see a race
>> > >> >> >> >> > manifesting as the discrete GPU's HDA controller timing out to
>> > >> >> >> >> > commands [2].
>> > >> >> >> >> >
>> > >> >> >> >> > Adding some debug, I see that the registered audio devices are put
>> > >> >> >> >> > into D3 before the GPU is, but it turns out that the discrete (and
>> > >> >> >> >> > internal) GPU's HDA controller gets registered a bit later, so the
>> > >> >> >> >> > list is empty. The symptom is since the HDA driver it's talking to
>> > >> >> >> >> > hardware which is now in D3.
>> > >> >> >> >> >
>> > >> >> >> >> > We could add a mutex to nouveau to allow us to wait for the DGPU HDA
>> > >> >> >> >> > controller, but perhaps this should be solved at a higher level in the
>> > >> >> >> >> > vgaswitcheroo code; what do you think?
>> > >> >> >> >>
>> > >> >> >> >> Maybe it's a side effect for the recent effort to fix another race in
>> > >> >> >> >> the probe.  A part of them problem is that the registration is done at
>> > >> >> >> >> the very last of probing.
>> > >> >> >> >>
>> > >> >> >> >> Instead of delaying the registration, how about the patch below?
>> > >> >> >> >
>> > >> >> >> > Ping.  If this really works, I'd like to queue it for 3.8 merge, at
>> > >> >> >> > least...
>> > >> >> >>
>> > >> >> >> Ping ack; I was trying to find time to understand another race that
>> > >> >> >> occurs with GPU probing after switching, but is separate from the
>> > >> >> >> situation before switching, here.
>> > >> >> >>
>> > >> >> >> In the context of writing the switch, it looks like struct azx isn't
>> > >> >> >> allocated by the time azx_vs_set_state accesses it [1,2]; racing with
>> > >> >> >> azx_codec_create?
>> > >> >> >
>> > >> >> > It was allocated, but it wasn't assigned properly in pci drvdata.
>> > >> >> >
>> > >> >> > Below is the revised patch.  Just moved pci_set_drvdata() before
>> > >> >> > register_vga_switcheroo().  Could you retest with it?
>> > >> >>
>> > >> >> Superb; this addresses the oops.
>> > >> >
>> > >> > OK, I'll queue it to sound tree for 3.8 kernel with Cc to stable.
>> > >> >
>> > >> >> ~1 second after the DGPU is put into D3, I still often see "hda-intel:
>> > >> >> spurious response 0x0:0x0, last cmd=0x170500":
>> > >> >> http://quora.org/2012/hda-switch-spurious.txt
>> > >> >
>> > >> > Hm, it's not clear who triggers these messages.  I'll try to check the
>> > >> > code paths.
>> > >> >
>> > >> >> Presumably this implies the read of the ring-buffer pointer returned
>> > >> >> 0xffffffff, so the HDA driver understands the pointer to have wrapped
>> > >> >> and processes the 191 unwritten entries?
>> > >> >
>> > >> > Good point.  Actually there is one bug that looks obviously wrong
>> > >> > (writing 32bit value to CORBWP).  Maybe it has been working just
>> > >> > because writing CORBRP doesn't influence except for the reset bit.
>> > >> >
>> > >> > Reading CORBWP as a byte is OK, but this could be better in a word so
>> > >> > that we can check 0xffff as invalid.
>> > >> >
>> > >> > A test patch is below.  Hopefully this improves the situation...
>> > >>
>> > >> I'll check this out tomorrow and also instrument the code to get a
>> > >> backtrace, since there may still be an underlying race with the
>> > >> previous patches:
>> >
>> > [...]
>> >
>> > > That's odd.  The Cirrus one (0000:00:1b.0) must be independent from
>> > > the vga switcheroo things for Nvidia...
>> > >
>> > > The patch below is the revised patch of the first one.  Now the vga
>> > > switcheroo registration is done before the video controller D3 check,
>> > > so the race should be smaller.
>> >
>> > This patch improves things further of course; a major improvement over
>> > without. ~15% of the time, I still get the 'spurious response'
>> > messages in this callpath:
>>
>> A possible scenario is that the graphics went in D3 before probing
>> hd-audio, and the hd-audio continues to initialize the hardware
>> without knowing the graphics counterpart is disabled.
>>
>> There is a code check_hdmi_disabled() in hda_intel.c that checks the
>> availability of the video driver, and it might be that this doesn't
>> work as expected...
>
> I think I understand this path.  You are setting "OFF", right?
> This will set the audio client OFF before can_switch() is called.
> Thus it can be called even before the probing process finished.
>
> In short, wait_for_completion() must be put at the beginning of
> azx_vs_set_state() in addition to azx_vs_can_switch().
>
> The revised patch is attached below.  Hopefully this sorts out all
> races.

This works great and instead, I get "Cannot lock devices!" sometimes
(see http://quora.org/2012/hda-switch-lock.txt ). Feel free to add my
'Tested-by: Daniel J Blueman <daniel@quora.org>' or ping me if
anything else to test.

Thanks again!
  Daniel
--
Daniel J Blueman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: switcheroo registration vs switching race...
  2012-12-04 14:46                       ` Daniel J Blueman
@ 2012-12-04 15:03                         ` Takashi Iwai
  2012-12-04 15:54                           ` Daniel J Blueman
  0 siblings, 1 reply; 17+ messages in thread
From: Takashi Iwai @ 2012-12-04 15:03 UTC (permalink / raw)
  To: Daniel J Blueman; +Cc: Seth Forshee, Dave Airlie, Linux Kernel

At Tue, 4 Dec 2012 22:46:47 +0800,
Daniel J Blueman wrote:
> 
> On 4 December 2012 21:55, Takashi Iwai <tiwai@suse.de> wrote:
> > At Tue, 04 Dec 2012 14:23:05 +0100,
> > Takashi Iwai wrote:
> >>
> >> At Tue, 4 Dec 2012 20:58:55 +0800,
> >> Daniel J Blueman wrote:
> >> >
> >> > On 4 December 2012 01:10, Takashi Iwai <tiwai@suse.de> wrote:
> >> > > At Tue, 4 Dec 2012 00:50:56 +0800,
> >> > > Daniel J Blueman wrote:
> >> > >>
> >> > >> On 4 December 2012 00:23, Takashi Iwai <tiwai@suse.de> wrote:
> >> > >> > At Mon, 3 Dec 2012 23:08:28 +0800,
> >> > >> > Daniel J Blueman wrote:
> >> > >> >>
> >> > >> >> On 3 December 2012 22:40, Takashi Iwai <tiwai@suse.de> wrote:
> >> > >> >> > At Mon, 3 Dec 2012 22:25:52 +0800,
> >> > >> >> > Daniel J Blueman wrote:
> >> > >> >> >>
> >> > >> >> >> On 3 December 2012 19:17, Takashi Iwai <tiwai@suse.de> wrote:
> >> > >> >> >> > At Wed, 28 Nov 2012 09:45:39 +0100,
> >> > >> >> >> > Takashi Iwai wrote:
> >> > >> >> >> >>
> >> > >> >> >> >> At Wed, 28 Nov 2012 11:45:07 +0800,
> >> > >> >> >> >> Daniel J Blueman wrote:
> >> > >> >> >> >> >
> >> > >> >> >> >> > Hi Seth, Dave, Takashi,
> >> > >> >> >> >> >
> >> > >> >> >> >> > If I power down the unused discrete GPU before lightdm starts by
> >> > >> >> >> >> > fiddling with the sysfs file [1] in the upstart script, I see a race
> >> > >> >> >> >> > manifesting as the discrete GPU's HDA controller timing out to
> >> > >> >> >> >> > commands [2].
> >> > >> >> >> >> >
> >> > >> >> >> >> > Adding some debug, I see that the registered audio devices are put
> >> > >> >> >> >> > into D3 before the GPU is, but it turns out that the discrete (and
> >> > >> >> >> >> > internal) GPU's HDA controller gets registered a bit later, so the
> >> > >> >> >> >> > list is empty. The symptom is since the HDA driver it's talking to
> >> > >> >> >> >> > hardware which is now in D3.
> >> > >> >> >> >> >
> >> > >> >> >> >> > We could add a mutex to nouveau to allow us to wait for the DGPU HDA
> >> > >> >> >> >> > controller, but perhaps this should be solved at a higher level in the
> >> > >> >> >> >> > vgaswitcheroo code; what do you think?
> >> > >> >> >> >>
> >> > >> >> >> >> Maybe it's a side effect for the recent effort to fix another race in
> >> > >> >> >> >> the probe.  A part of them problem is that the registration is done at
> >> > >> >> >> >> the very last of probing.
> >> > >> >> >> >>
> >> > >> >> >> >> Instead of delaying the registration, how about the patch below?
> >> > >> >> >> >
> >> > >> >> >> > Ping.  If this really works, I'd like to queue it for 3.8 merge, at
> >> > >> >> >> > least...
> >> > >> >> >>
> >> > >> >> >> Ping ack; I was trying to find time to understand another race that
> >> > >> >> >> occurs with GPU probing after switching, but is separate from the
> >> > >> >> >> situation before switching, here.
> >> > >> >> >>
> >> > >> >> >> In the context of writing the switch, it looks like struct azx isn't
> >> > >> >> >> allocated by the time azx_vs_set_state accesses it [1,2]; racing with
> >> > >> >> >> azx_codec_create?
> >> > >> >> >
> >> > >> >> > It was allocated, but it wasn't assigned properly in pci drvdata.
> >> > >> >> >
> >> > >> >> > Below is the revised patch.  Just moved pci_set_drvdata() before
> >> > >> >> > register_vga_switcheroo().  Could you retest with it?
> >> > >> >>
> >> > >> >> Superb; this addresses the oops.
> >> > >> >
> >> > >> > OK, I'll queue it to sound tree for 3.8 kernel with Cc to stable.
> >> > >> >
> >> > >> >> ~1 second after the DGPU is put into D3, I still often see "hda-intel:
> >> > >> >> spurious response 0x0:0x0, last cmd=0x170500":
> >> > >> >> http://quora.org/2012/hda-switch-spurious.txt
> >> > >> >
> >> > >> > Hm, it's not clear who triggers these messages.  I'll try to check the
> >> > >> > code paths.
> >> > >> >
> >> > >> >> Presumably this implies the read of the ring-buffer pointer returned
> >> > >> >> 0xffffffff, so the HDA driver understands the pointer to have wrapped
> >> > >> >> and processes the 191 unwritten entries?
> >> > >> >
> >> > >> > Good point.  Actually there is one bug that looks obviously wrong
> >> > >> > (writing 32bit value to CORBWP).  Maybe it has been working just
> >> > >> > because writing CORBRP doesn't influence except for the reset bit.
> >> > >> >
> >> > >> > Reading CORBWP as a byte is OK, but this could be better in a word so
> >> > >> > that we can check 0xffff as invalid.
> >> > >> >
> >> > >> > A test patch is below.  Hopefully this improves the situation...
> >> > >>
> >> > >> I'll check this out tomorrow and also instrument the code to get a
> >> > >> backtrace, since there may still be an underlying race with the
> >> > >> previous patches:
> >> >
> >> > [...]
> >> >
> >> > > That's odd.  The Cirrus one (0000:00:1b.0) must be independent from
> >> > > the vga switcheroo things for Nvidia...
> >> > >
> >> > > The patch below is the revised patch of the first one.  Now the vga
> >> > > switcheroo registration is done before the video controller D3 check,
> >> > > so the race should be smaller.
> >> >
> >> > This patch improves things further of course; a major improvement over
> >> > without. ~15% of the time, I still get the 'spurious response'
> >> > messages in this callpath:
> >>
> >> A possible scenario is that the graphics went in D3 before probing
> >> hd-audio, and the hd-audio continues to initialize the hardware
> >> without knowing the graphics counterpart is disabled.
> >>
> >> There is a code check_hdmi_disabled() in hda_intel.c that checks the
> >> availability of the video driver, and it might be that this doesn't
> >> work as expected...
> >
> > I think I understand this path.  You are setting "OFF", right?
> > This will set the audio client OFF before can_switch() is called.
> > Thus it can be called even before the probing process finished.
> >
> > In short, wait_for_completion() must be put at the beginning of
> > azx_vs_set_state() in addition to azx_vs_can_switch().
> >
> > The revised patch is attached below.  Hopefully this sorts out all
> > races.
> 
> This works great and instead, I get "Cannot lock devices!" sometimes
> (see http://quora.org/2012/hda-switch-lock.txt ).

This means that some apps (likely udev, PulseAudio or whatever) are
accessing sound devices during you turn off the audio / D-GPU.  It
wouldn't be too big trouble, but it's still a bit risky.

We can put the disconnect fops like a real disconnection and restore
in return, but it will need lots of ALSA core code changes, so I won't
follow for now. 

> Feel free to add my
> 'Tested-by: Daniel J Blueman <daniel@quora.org>' or ping me if
> anything else to test.

OK, I queued the patch finally to sound git tree for-next branch.

Thanks for your quick testing!


Takashi

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: switcheroo registration vs switching race...
  2012-12-04 15:03                         ` Takashi Iwai
@ 2012-12-04 15:54                           ` Daniel J Blueman
  2012-12-04 16:04                             ` Takashi Iwai
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel J Blueman @ 2012-12-04 15:54 UTC (permalink / raw)
  To: Takashi Iwai; +Cc: Seth Forshee, Dave Airlie, Linux Kernel

On 4 December 2012 23:03, Takashi Iwai <tiwai@suse.de> wrote:
> At Tue, 4 Dec 2012 22:46:47 +0800,
> Daniel J Blueman wrote:
>>
>> On 4 December 2012 21:55, Takashi Iwai <tiwai@suse.de> wrote:
>> > At Tue, 04 Dec 2012 14:23:05 +0100,
>> > Takashi Iwai wrote:
>> >>
>> >> At Tue, 4 Dec 2012 20:58:55 +0800,
>> >> Daniel J Blueman wrote:
>> >> >
>> >> > On 4 December 2012 01:10, Takashi Iwai <tiwai@suse.de> wrote:
>> >> > > At Tue, 4 Dec 2012 00:50:56 +0800,
>> >> > > Daniel J Blueman wrote:
>> >> > >>
>> >> > >> On 4 December 2012 00:23, Takashi Iwai <tiwai@suse.de> wrote:
>> >> > >> > At Mon, 3 Dec 2012 23:08:28 +0800,
>> >> > >> > Daniel J Blueman wrote:
>> >> > >> >>
>> >> > >> >> On 3 December 2012 22:40, Takashi Iwai <tiwai@suse.de> wrote:
>> >> > >> >> > At Mon, 3 Dec 2012 22:25:52 +0800,
>> >> > >> >> > Daniel J Blueman wrote:
>> >> > >> >> >>
>> >> > >> >> >> On 3 December 2012 19:17, Takashi Iwai <tiwai@suse.de> wrote:
>> >> > >> >> >> > At Wed, 28 Nov 2012 09:45:39 +0100,
>> >> > >> >> >> > Takashi Iwai wrote:
>> >> > >> >> >> >>
>> >> > >> >> >> >> At Wed, 28 Nov 2012 11:45:07 +0800,
>> >> > >> >> >> >> Daniel J Blueman wrote:
>> >> > >> >> >> >> >
>> >> > >> >> >> >> > Hi Seth, Dave, Takashi,
>> >> > >> >> >> >> >
>> >> > >> >> >> >> > If I power down the unused discrete GPU before lightdm starts by
>> >> > >> >> >> >> > fiddling with the sysfs file [1] in the upstart script, I see a race
>> >> > >> >> >> >> > manifesting as the discrete GPU's HDA controller timing out to
>> >> > >> >> >> >> > commands [2].
>> >> > >> >> >> >> >
>> >> > >> >> >> >> > Adding some debug, I see that the registered audio devices are put
>> >> > >> >> >> >> > into D3 before the GPU is, but it turns out that the discrete (and
>> >> > >> >> >> >> > internal) GPU's HDA controller gets registered a bit later, so the
>> >> > >> >> >> >> > list is empty. The symptom is since the HDA driver it's talking to
>> >> > >> >> >> >> > hardware which is now in D3.
>> >> > >> >> >> >> >
>> >> > >> >> >> >> > We could add a mutex to nouveau to allow us to wait for the DGPU HDA
>> >> > >> >> >> >> > controller, but perhaps this should be solved at a higher level in the
>> >> > >> >> >> >> > vgaswitcheroo code; what do you think?
>> >> > >> >> >> >>
>> >> > >> >> >> >> Maybe it's a side effect for the recent effort to fix another race in
>> >> > >> >> >> >> the probe.  A part of them problem is that the registration is done at
>> >> > >> >> >> >> the very last of probing.
>> >> > >> >> >> >>
>> >> > >> >> >> >> Instead of delaying the registration, how about the patch below?
>> >> > >> >> >> >
>> >> > >> >> >> > Ping.  If this really works, I'd like to queue it for 3.8 merge, at
>> >> > >> >> >> > least...
>> >> > >> >> >>
>> >> > >> >> >> Ping ack; I was trying to find time to understand another race that
>> >> > >> >> >> occurs with GPU probing after switching, but is separate from the
>> >> > >> >> >> situation before switching, here.
>> >> > >> >> >>
>> >> > >> >> >> In the context of writing the switch, it looks like struct azx isn't
>> >> > >> >> >> allocated by the time azx_vs_set_state accesses it [1,2]; racing with
>> >> > >> >> >> azx_codec_create?
>> >> > >> >> >
>> >> > >> >> > It was allocated, but it wasn't assigned properly in pci drvdata.
>> >> > >> >> >
>> >> > >> >> > Below is the revised patch.  Just moved pci_set_drvdata() before
>> >> > >> >> > register_vga_switcheroo().  Could you retest with it?
>> >> > >> >>
>> >> > >> >> Superb; this addresses the oops.
>> >> > >> >
>> >> > >> > OK, I'll queue it to sound tree for 3.8 kernel with Cc to stable.
>> >> > >> >
>> >> > >> >> ~1 second after the DGPU is put into D3, I still often see "hda-intel:
>> >> > >> >> spurious response 0x0:0x0, last cmd=0x170500":
>> >> > >> >> http://quora.org/2012/hda-switch-spurious.txt
>> >> > >> >
>> >> > >> > Hm, it's not clear who triggers these messages.  I'll try to check the
>> >> > >> > code paths.
>> >> > >> >
>> >> > >> >> Presumably this implies the read of the ring-buffer pointer returned
>> >> > >> >> 0xffffffff, so the HDA driver understands the pointer to have wrapped
>> >> > >> >> and processes the 191 unwritten entries?
>> >> > >> >
>> >> > >> > Good point.  Actually there is one bug that looks obviously wrong
>> >> > >> > (writing 32bit value to CORBWP).  Maybe it has been working just
>> >> > >> > because writing CORBRP doesn't influence except for the reset bit.
>> >> > >> >
>> >> > >> > Reading CORBWP as a byte is OK, but this could be better in a word so
>> >> > >> > that we can check 0xffff as invalid.
>> >> > >> >
>> >> > >> > A test patch is below.  Hopefully this improves the situation...
>> >> > >>
>> >> > >> I'll check this out tomorrow and also instrument the code to get a
>> >> > >> backtrace, since there may still be an underlying race with the
>> >> > >> previous patches:
>> >> >
>> >> > [...]
>> >> >
>> >> > > That's odd.  The Cirrus one (0000:00:1b.0) must be independent from
>> >> > > the vga switcheroo things for Nvidia...
>> >> > >
>> >> > > The patch below is the revised patch of the first one.  Now the vga
>> >> > > switcheroo registration is done before the video controller D3 check,
>> >> > > so the race should be smaller.
>> >> >
>> >> > This patch improves things further of course; a major improvement over
>> >> > without. ~15% of the time, I still get the 'spurious response'
>> >> > messages in this callpath:
>> >>
>> >> A possible scenario is that the graphics went in D3 before probing
>> >> hd-audio, and the hd-audio continues to initialize the hardware
>> >> without knowing the graphics counterpart is disabled.
>> >>
>> >> There is a code check_hdmi_disabled() in hda_intel.c that checks the
>> >> availability of the video driver, and it might be that this doesn't
>> >> work as expected...
>> >
>> > I think I understand this path.  You are setting "OFF", right?
>> > This will set the audio client OFF before can_switch() is called.
>> > Thus it can be called even before the probing process finished.
>> >
>> > In short, wait_for_completion() must be put at the beginning of
>> > azx_vs_set_state() in addition to azx_vs_can_switch().
>> >
>> > The revised patch is attached below.  Hopefully this sorts out all
>> > races.
>>
>> This works great and instead, I get "Cannot lock devices!" sometimes
>> (see http://quora.org/2012/hda-switch-lock.txt ).
>
> This means that some apps (likely udev, PulseAudio or whatever) are
> accessing sound devices during you turn off the audio / D-GPU.  It
> wouldn't be too big trouble, but it's still a bit risky.

It looks like this prevents suspending, since the waiting process is
sleeping uninterruptably in snd_power_wait (updated
http://quora.org/2012/hda-switch-lock.txt with the details).

I'll workaround it for now by disabling HDMI audio.

Thanks,
  Daniel
--
Daniel J Blueman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: switcheroo registration vs switching race...
  2012-12-04 15:54                           ` Daniel J Blueman
@ 2012-12-04 16:04                             ` Takashi Iwai
  2012-12-05 14:25                               ` Daniel J Blueman
  0 siblings, 1 reply; 17+ messages in thread
From: Takashi Iwai @ 2012-12-04 16:04 UTC (permalink / raw)
  To: Daniel J Blueman; +Cc: Seth Forshee, Dave Airlie, Linux Kernel

At Tue, 4 Dec 2012 23:54:39 +0800,
Daniel J Blueman wrote:
> 
> On 4 December 2012 23:03, Takashi Iwai <tiwai@suse.de> wrote:
> > At Tue, 4 Dec 2012 22:46:47 +0800,
> > Daniel J Blueman wrote:
> >>
> >> On 4 December 2012 21:55, Takashi Iwai <tiwai@suse.de> wrote:
> >> > At Tue, 04 Dec 2012 14:23:05 +0100,
> >> > Takashi Iwai wrote:
> >> >>
> >> >> At Tue, 4 Dec 2012 20:58:55 +0800,
> >> >> Daniel J Blueman wrote:
> >> >> >
> >> >> > On 4 December 2012 01:10, Takashi Iwai <tiwai@suse.de> wrote:
> >> >> > > At Tue, 4 Dec 2012 00:50:56 +0800,
> >> >> > > Daniel J Blueman wrote:
> >> >> > >>
> >> >> > >> On 4 December 2012 00:23, Takashi Iwai <tiwai@suse.de> wrote:
> >> >> > >> > At Mon, 3 Dec 2012 23:08:28 +0800,
> >> >> > >> > Daniel J Blueman wrote:
> >> >> > >> >>
> >> >> > >> >> On 3 December 2012 22:40, Takashi Iwai <tiwai@suse.de> wrote:
> >> >> > >> >> > At Mon, 3 Dec 2012 22:25:52 +0800,
> >> >> > >> >> > Daniel J Blueman wrote:
> >> >> > >> >> >>
> >> >> > >> >> >> On 3 December 2012 19:17, Takashi Iwai <tiwai@suse.de> wrote:
> >> >> > >> >> >> > At Wed, 28 Nov 2012 09:45:39 +0100,
> >> >> > >> >> >> > Takashi Iwai wrote:
> >> >> > >> >> >> >>
> >> >> > >> >> >> >> At Wed, 28 Nov 2012 11:45:07 +0800,
> >> >> > >> >> >> >> Daniel J Blueman wrote:
> >> >> > >> >> >> >> >
> >> >> > >> >> >> >> > Hi Seth, Dave, Takashi,
> >> >> > >> >> >> >> >
> >> >> > >> >> >> >> > If I power down the unused discrete GPU before lightdm starts by
> >> >> > >> >> >> >> > fiddling with the sysfs file [1] in the upstart script, I see a race
> >> >> > >> >> >> >> > manifesting as the discrete GPU's HDA controller timing out to
> >> >> > >> >> >> >> > commands [2].
> >> >> > >> >> >> >> >
> >> >> > >> >> >> >> > Adding some debug, I see that the registered audio devices are put
> >> >> > >> >> >> >> > into D3 before the GPU is, but it turns out that the discrete (and
> >> >> > >> >> >> >> > internal) GPU's HDA controller gets registered a bit later, so the
> >> >> > >> >> >> >> > list is empty. The symptom is since the HDA driver it's talking to
> >> >> > >> >> >> >> > hardware which is now in D3.
> >> >> > >> >> >> >> >
> >> >> > >> >> >> >> > We could add a mutex to nouveau to allow us to wait for the DGPU HDA
> >> >> > >> >> >> >> > controller, but perhaps this should be solved at a higher level in the
> >> >> > >> >> >> >> > vgaswitcheroo code; what do you think?
> >> >> > >> >> >> >>
> >> >> > >> >> >> >> Maybe it's a side effect for the recent effort to fix another race in
> >> >> > >> >> >> >> the probe.  A part of them problem is that the registration is done at
> >> >> > >> >> >> >> the very last of probing.
> >> >> > >> >> >> >>
> >> >> > >> >> >> >> Instead of delaying the registration, how about the patch below?
> >> >> > >> >> >> >
> >> >> > >> >> >> > Ping.  If this really works, I'd like to queue it for 3.8 merge, at
> >> >> > >> >> >> > least...
> >> >> > >> >> >>
> >> >> > >> >> >> Ping ack; I was trying to find time to understand another race that
> >> >> > >> >> >> occurs with GPU probing after switching, but is separate from the
> >> >> > >> >> >> situation before switching, here.
> >> >> > >> >> >>
> >> >> > >> >> >> In the context of writing the switch, it looks like struct azx isn't
> >> >> > >> >> >> allocated by the time azx_vs_set_state accesses it [1,2]; racing with
> >> >> > >> >> >> azx_codec_create?
> >> >> > >> >> >
> >> >> > >> >> > It was allocated, but it wasn't assigned properly in pci drvdata.
> >> >> > >> >> >
> >> >> > >> >> > Below is the revised patch.  Just moved pci_set_drvdata() before
> >> >> > >> >> > register_vga_switcheroo().  Could you retest with it?
> >> >> > >> >>
> >> >> > >> >> Superb; this addresses the oops.
> >> >> > >> >
> >> >> > >> > OK, I'll queue it to sound tree for 3.8 kernel with Cc to stable.
> >> >> > >> >
> >> >> > >> >> ~1 second after the DGPU is put into D3, I still often see "hda-intel:
> >> >> > >> >> spurious response 0x0:0x0, last cmd=0x170500":
> >> >> > >> >> http://quora.org/2012/hda-switch-spurious.txt
> >> >> > >> >
> >> >> > >> > Hm, it's not clear who triggers these messages.  I'll try to check the
> >> >> > >> > code paths.
> >> >> > >> >
> >> >> > >> >> Presumably this implies the read of the ring-buffer pointer returned
> >> >> > >> >> 0xffffffff, so the HDA driver understands the pointer to have wrapped
> >> >> > >> >> and processes the 191 unwritten entries?
> >> >> > >> >
> >> >> > >> > Good point.  Actually there is one bug that looks obviously wrong
> >> >> > >> > (writing 32bit value to CORBWP).  Maybe it has been working just
> >> >> > >> > because writing CORBRP doesn't influence except for the reset bit.
> >> >> > >> >
> >> >> > >> > Reading CORBWP as a byte is OK, but this could be better in a word so
> >> >> > >> > that we can check 0xffff as invalid.
> >> >> > >> >
> >> >> > >> > A test patch is below.  Hopefully this improves the situation...
> >> >> > >>
> >> >> > >> I'll check this out tomorrow and also instrument the code to get a
> >> >> > >> backtrace, since there may still be an underlying race with the
> >> >> > >> previous patches:
> >> >> >
> >> >> > [...]
> >> >> >
> >> >> > > That's odd.  The Cirrus one (0000:00:1b.0) must be independent from
> >> >> > > the vga switcheroo things for Nvidia...
> >> >> > >
> >> >> > > The patch below is the revised patch of the first one.  Now the vga
> >> >> > > switcheroo registration is done before the video controller D3 check,
> >> >> > > so the race should be smaller.
> >> >> >
> >> >> > This patch improves things further of course; a major improvement over
> >> >> > without. ~15% of the time, I still get the 'spurious response'
> >> >> > messages in this callpath:
> >> >>
> >> >> A possible scenario is that the graphics went in D3 before probing
> >> >> hd-audio, and the hd-audio continues to initialize the hardware
> >> >> without knowing the graphics counterpart is disabled.
> >> >>
> >> >> There is a code check_hdmi_disabled() in hda_intel.c that checks the
> >> >> availability of the video driver, and it might be that this doesn't
> >> >> work as expected...
> >> >
> >> > I think I understand this path.  You are setting "OFF", right?
> >> > This will set the audio client OFF before can_switch() is called.
> >> > Thus it can be called even before the probing process finished.
> >> >
> >> > In short, wait_for_completion() must be put at the beginning of
> >> > azx_vs_set_state() in addition to azx_vs_can_switch().
> >> >
> >> > The revised patch is attached below.  Hopefully this sorts out all
> >> > races.
> >>
> >> This works great and instead, I get "Cannot lock devices!" sometimes
> >> (see http://quora.org/2012/hda-switch-lock.txt ).
> >
> > This means that some apps (likely udev, PulseAudio or whatever) are
> > accessing sound devices during you turn off the audio / D-GPU.  It
> > wouldn't be too big trouble, but it's still a bit risky.
> 
> It looks like this prevents suspending, since the waiting process is
> sleeping uninterruptably in snd_power_wait (updated
> http://quora.org/2012/hda-switch-lock.txt with the details).

Grrr, yet missing piece.  The patch below should fix it.

Takashi

---
From: Takashi Iwai <tiwai@suse.de>
Subject: [PATCH] ALSA: hda - Avoid doubly suspend after vga switcheroo

The HD-audio driver artificially calls the suspend and the resume code
path in the VGA switcheroo state changes.  When a machine goes to
suspend, it tries to suspend the device again, and it stalls at
snd_power_wait().

This patch adds checks whether the devices were already in (forced)
suspend in PM callbacks for avoiding the doubly suspend.

Reported-by: Daniel J Blueman <daniel@quora.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
---
 sound/pci/hda/hda_intel.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
index 22ecadc..6a251ec 100644
--- a/sound/pci/hda/hda_intel.c
+++ b/sound/pci/hda/hda_intel.c
@@ -2624,6 +2624,9 @@ static int azx_suspend(struct device *dev)
 	struct azx *chip = card->private_data;
 	struct azx_pcm *p;
 
+	if (chip->disabled)
+		return 0;
+
 	snd_power_change_state(card, SNDRV_CTL_POWER_D3hot);
 	azx_clear_irq_pending(chip);
 	list_for_each_entry(p, &chip->pcm_list, list)
@@ -2649,6 +2652,9 @@ static int azx_resume(struct device *dev)
 	struct snd_card *card = dev_get_drvdata(dev);
 	struct azx *chip = card->private_data;
 
+	if (chip->disabled)
+		return 0;
+
 	pci_set_power_state(pci, PCI_D0);
 	pci_restore_state(pci);
 	if (pci_enable_device(pci) < 0) {
-- 
1.8.0.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: switcheroo registration vs switching race...
  2012-12-04 16:04                             ` Takashi Iwai
@ 2012-12-05 14:25                               ` Daniel J Blueman
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel J Blueman @ 2012-12-05 14:25 UTC (permalink / raw)
  To: Takashi Iwai; +Cc: Seth Forshee, Dave Airlie, Linux Kernel

On 5 December 2012 00:04, Takashi Iwai <tiwai@suse.de> wrote:
> At Tue, 4 Dec 2012 23:54:39 +0800,
> Daniel J Blueman wrote:
>>
>> On 4 December 2012 23:03, Takashi Iwai <tiwai@suse.de> wrote:
>> > At Tue, 4 Dec 2012 22:46:47 +0800,
>> > Daniel J Blueman wrote:
>> >>
>> >> On 4 December 2012 21:55, Takashi Iwai <tiwai@suse.de> wrote:
>> >> > At Tue, 04 Dec 2012 14:23:05 +0100,
>> >> > Takashi Iwai wrote:
>> >> >>
>> >> >> At Tue, 4 Dec 2012 20:58:55 +0800,
>> >> >> Daniel J Blueman wrote:
>> >> >> >
>> >> >> > On 4 December 2012 01:10, Takashi Iwai <tiwai@suse.de> wrote:
>> >> >> > > At Tue, 4 Dec 2012 00:50:56 +0800,
>> >> >> > > Daniel J Blueman wrote:
>> >> >> > >>
>> >> >> > >> On 4 December 2012 00:23, Takashi Iwai <tiwai@suse.de> wrote:
>> >> >> > >> > At Mon, 3 Dec 2012 23:08:28 +0800,
>> >> >> > >> > Daniel J Blueman wrote:
>> >> >> > >> >>
>> >> >> > >> >> On 3 December 2012 22:40, Takashi Iwai <tiwai@suse.de> wrote:
>> >> >> > >> >> > At Mon, 3 Dec 2012 22:25:52 +0800,
>> >> >> > >> >> > Daniel J Blueman wrote:
>> >> >> > >> >> >>
>> >> >> > >> >> >> On 3 December 2012 19:17, Takashi Iwai <tiwai@suse.de> wrote:
>> >> >> > >> >> >> > At Wed, 28 Nov 2012 09:45:39 +0100,
>> >> >> > >> >> >> > Takashi Iwai wrote:
>> >> >> > >> >> >> >>
>> >> >> > >> >> >> >> At Wed, 28 Nov 2012 11:45:07 +0800,
>> >> >> > >> >> >> >> Daniel J Blueman wrote:
>> >> >> > >> >> >> >> >
>> >> >> > >> >> >> >> > Hi Seth, Dave, Takashi,
>> >> >> > >> >> >> >> >
>> >> >> > >> >> >> >> > If I power down the unused discrete GPU before lightdm starts by
>> >> >> > >> >> >> >> > fiddling with the sysfs file [1] in the upstart script, I see a race
>> >> >> > >> >> >> >> > manifesting as the discrete GPU's HDA controller timing out to
>> >> >> > >> >> >> >> > commands [2].
>> >> >> > >> >> >> >> >
>> >> >> > >> >> >> >> > Adding some debug, I see that the registered audio devices are put
>> >> >> > >> >> >> >> > into D3 before the GPU is, but it turns out that the discrete (and
>> >> >> > >> >> >> >> > internal) GPU's HDA controller gets registered a bit later, so the
>> >> >> > >> >> >> >> > list is empty. The symptom is since the HDA driver it's talking to
>> >> >> > >> >> >> >> > hardware which is now in D3.
>> >> >> > >> >> >> >> >
>> >> >> > >> >> >> >> > We could add a mutex to nouveau to allow us to wait for the DGPU HDA
>> >> >> > >> >> >> >> > controller, but perhaps this should be solved at a higher level in the
>> >> >> > >> >> >> >> > vgaswitcheroo code; what do you think?
>> >> >> > >> >> >> >>
>> >> >> > >> >> >> >> Maybe it's a side effect for the recent effort to fix another race in
>> >> >> > >> >> >> >> the probe.  A part of them problem is that the registration is done at
>> >> >> > >> >> >> >> the very last of probing.
>> >> >> > >> >> >> >>
>> >> >> > >> >> >> >> Instead of delaying the registration, how about the patch below?
>> >> >> > >> >> >> >
>> >> >> > >> >> >> > Ping.  If this really works, I'd like to queue it for 3.8 merge, at
>> >> >> > >> >> >> > least...
>> >> >> > >> >> >>
>> >> >> > >> >> >> Ping ack; I was trying to find time to understand another race that
>> >> >> > >> >> >> occurs with GPU probing after switching, but is separate from the
>> >> >> > >> >> >> situation before switching, here.
>> >> >> > >> >> >>
>> >> >> > >> >> >> In the context of writing the switch, it looks like struct azx isn't
>> >> >> > >> >> >> allocated by the time azx_vs_set_state accesses it [1,2]; racing with
>> >> >> > >> >> >> azx_codec_create?
>> >> >> > >> >> >
>> >> >> > >> >> > It was allocated, but it wasn't assigned properly in pci drvdata.
>> >> >> > >> >> >
>> >> >> > >> >> > Below is the revised patch.  Just moved pci_set_drvdata() before
>> >> >> > >> >> > register_vga_switcheroo().  Could you retest with it?
>> >> >> > >> >>
>> >> >> > >> >> Superb; this addresses the oops.
>> >> >> > >> >
>> >> >> > >> > OK, I'll queue it to sound tree for 3.8 kernel with Cc to stable.
>> >> >> > >> >
>> >> >> > >> >> ~1 second after the DGPU is put into D3, I still often see "hda-intel:
>> >> >> > >> >> spurious response 0x0:0x0, last cmd=0x170500":
>> >> >> > >> >> http://quora.org/2012/hda-switch-spurious.txt
>> >> >> > >> >
>> >> >> > >> > Hm, it's not clear who triggers these messages.  I'll try to check the
>> >> >> > >> > code paths.
>> >> >> > >> >
>> >> >> > >> >> Presumably this implies the read of the ring-buffer pointer returned
>> >> >> > >> >> 0xffffffff, so the HDA driver understands the pointer to have wrapped
>> >> >> > >> >> and processes the 191 unwritten entries?
>> >> >> > >> >
>> >> >> > >> > Good point.  Actually there is one bug that looks obviously wrong
>> >> >> > >> > (writing 32bit value to CORBWP).  Maybe it has been working just
>> >> >> > >> > because writing CORBRP doesn't influence except for the reset bit.
>> >> >> > >> >
>> >> >> > >> > Reading CORBWP as a byte is OK, but this could be better in a word so
>> >> >> > >> > that we can check 0xffff as invalid.
>> >> >> > >> >
>> >> >> > >> > A test patch is below.  Hopefully this improves the situation...
>> >> >> > >>
>> >> >> > >> I'll check this out tomorrow and also instrument the code to get a
>> >> >> > >> backtrace, since there may still be an underlying race with the
>> >> >> > >> previous patches:
>> >> >> >
>> >> >> > [...]
>> >> >> >
>> >> >> > > That's odd.  The Cirrus one (0000:00:1b.0) must be independent from
>> >> >> > > the vga switcheroo things for Nvidia...
>> >> >> > >
>> >> >> > > The patch below is the revised patch of the first one.  Now the vga
>> >> >> > > switcheroo registration is done before the video controller D3 check,
>> >> >> > > so the race should be smaller.
>> >> >> >
>> >> >> > This patch improves things further of course; a major improvement over
>> >> >> > without. ~15% of the time, I still get the 'spurious response'
>> >> >> > messages in this callpath:
>> >> >>
>> >> >> A possible scenario is that the graphics went in D3 before probing
>> >> >> hd-audio, and the hd-audio continues to initialize the hardware
>> >> >> without knowing the graphics counterpart is disabled.
>> >> >>
>> >> >> There is a code check_hdmi_disabled() in hda_intel.c that checks the
>> >> >> availability of the video driver, and it might be that this doesn't
>> >> >> work as expected...
>> >> >
>> >> > I think I understand this path.  You are setting "OFF", right?
>> >> > This will set the audio client OFF before can_switch() is called.
>> >> > Thus it can be called even before the probing process finished.
>> >> >
>> >> > In short, wait_for_completion() must be put at the beginning of
>> >> > azx_vs_set_state() in addition to azx_vs_can_switch().
>> >> >
>> >> > The revised patch is attached below.  Hopefully this sorts out all
>> >> > races.
>> >>
>> >> This works great and instead, I get "Cannot lock devices!" sometimes
>> >> (see http://quora.org/2012/hda-switch-lock.txt ).
>> >
>> > This means that some apps (likely udev, PulseAudio or whatever) are
>> > accessing sound devices during you turn off the audio / D-GPU.  It
>> > wouldn't be too big trouble, but it's still a bit risky.
>>
>> It looks like this prevents suspending, since the waiting process is
>> sleeping uninterruptably in snd_power_wait (updated
>> http://quora.org/2012/hda-switch-lock.txt with the details).
>
> Grrr, yet missing piece.  The patch below should fix it.
>
> Takashi
>
> ---
> From: Takashi Iwai <tiwai@suse.de>
> Subject: [PATCH] ALSA: hda - Avoid doubly suspend after vga switcheroo
>
> The HD-audio driver artificially calls the suspend and the resume code
> path in the VGA switcheroo state changes.  When a machine goes to
> suspend, it tries to suspend the device again, and it stalls at
> snd_power_wait().
>
> This patch adds checks whether the devices were already in (forced)
> suspend in PM callbacks for avoiding the doubly suspend.

So far, so good. I haven't had chance to test it deeper, but will drop
a mail if I find an issue, which seems unlikely from initial testing.

Excellent work and thanks!

Dan
--
Daniel J Blueman

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2012-12-05 14:25 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-28  3:45 switcheroo registration vs switching race Daniel J Blueman
2012-11-28  8:45 ` Takashi Iwai
2012-12-03 11:17   ` Takashi Iwai
2012-12-03 14:25     ` Daniel J Blueman
2012-12-03 14:40       ` Takashi Iwai
2012-12-03 15:08         ` Daniel J Blueman
2012-12-03 16:23           ` Takashi Iwai
2012-12-03 16:50             ` Daniel J Blueman
2012-12-03 17:10               ` Takashi Iwai
2012-12-04 12:58                 ` Daniel J Blueman
2012-12-04 13:23                   ` Takashi Iwai
2012-12-04 13:55                     ` Takashi Iwai
2012-12-04 14:46                       ` Daniel J Blueman
2012-12-04 15:03                         ` Takashi Iwai
2012-12-04 15:54                           ` Daniel J Blueman
2012-12-04 16:04                             ` Takashi Iwai
2012-12-05 14:25                               ` Daniel J Blueman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).