From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753091Ab2LDM67 (ORCPT ); Tue, 4 Dec 2012 07:58:59 -0500 Received: from mail-oa0-f46.google.com ([209.85.219.46]:34498 "EHLO mail-oa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752277Ab2LDM65 (ORCPT ); Tue, 4 Dec 2012 07:58:57 -0500 MIME-Version: 1.0 X-Originating-IP: [121.7.72.46] In-Reply-To: References: Date: Tue, 4 Dec 2012 20:58:55 +0800 Message-ID: Subject: Re: switcheroo registration vs switching race... From: Daniel J Blueman To: Takashi Iwai Cc: Seth Forshee , Dave Airlie , Linux Kernel Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 4 December 2012 01:10, Takashi Iwai wrote: > At Tue, 4 Dec 2012 00:50:56 +0800, > Daniel J Blueman wrote: >> >> On 4 December 2012 00:23, Takashi Iwai wrote: >> > At Mon, 3 Dec 2012 23:08:28 +0800, >> > Daniel J Blueman wrote: >> >> >> >> On 3 December 2012 22:40, Takashi Iwai wrote: >> >> > At Mon, 3 Dec 2012 22:25:52 +0800, >> >> > Daniel J Blueman wrote: >> >> >> >> >> >> On 3 December 2012 19:17, Takashi Iwai wrote: >> >> >> > At Wed, 28 Nov 2012 09:45:39 +0100, >> >> >> > Takashi Iwai wrote: >> >> >> >> >> >> >> >> At Wed, 28 Nov 2012 11:45:07 +0800, >> >> >> >> Daniel J Blueman wrote: >> >> >> >> > >> >> >> >> > Hi Seth, Dave, Takashi, >> >> >> >> > >> >> >> >> > If I power down the unused discrete GPU before lightdm starts by >> >> >> >> > fiddling with the sysfs file [1] in the upstart script, I see a race >> >> >> >> > manifesting as the discrete GPU's HDA controller timing out to >> >> >> >> > commands [2]. >> >> >> >> > >> >> >> >> > Adding some debug, I see that the registered audio devices are put >> >> >> >> > into D3 before the GPU is, but it turns out that the discrete (and >> >> >> >> > internal) GPU's HDA controller gets registered a bit later, so the >> >> >> >> > list is empty. The symptom is since the HDA driver it's talking to >> >> >> >> > hardware which is now in D3. >> >> >> >> > >> >> >> >> > We could add a mutex to nouveau to allow us to wait for the DGPU HDA >> >> >> >> > controller, but perhaps this should be solved at a higher level in the >> >> >> >> > vgaswitcheroo code; what do you think? >> >> >> >> >> >> >> >> Maybe it's a side effect for the recent effort to fix another race in >> >> >> >> the probe. A part of them problem is that the registration is done at >> >> >> >> the very last of probing. >> >> >> >> >> >> >> >> Instead of delaying the registration, how about the patch below? >> >> >> > >> >> >> > Ping. If this really works, I'd like to queue it for 3.8 merge, at >> >> >> > least... >> >> >> >> >> >> Ping ack; I was trying to find time to understand another race that >> >> >> occurs with GPU probing after switching, but is separate from the >> >> >> situation before switching, here. >> >> >> >> >> >> In the context of writing the switch, it looks like struct azx isn't >> >> >> allocated by the time azx_vs_set_state accesses it [1,2]; racing with >> >> >> azx_codec_create? >> >> > >> >> > It was allocated, but it wasn't assigned properly in pci drvdata. >> >> > >> >> > Below is the revised patch. Just moved pci_set_drvdata() before >> >> > register_vga_switcheroo(). Could you retest with it? >> >> >> >> Superb; this addresses the oops. >> > >> > OK, I'll queue it to sound tree for 3.8 kernel with Cc to stable. >> > >> >> ~1 second after the DGPU is put into D3, I still often see "hda-intel: >> >> spurious response 0x0:0x0, last cmd=0x170500": >> >> http://quora.org/2012/hda-switch-spurious.txt >> > >> > Hm, it's not clear who triggers these messages. I'll try to check the >> > code paths. >> > >> >> Presumably this implies the read of the ring-buffer pointer returned >> >> 0xffffffff, so the HDA driver understands the pointer to have wrapped >> >> and processes the 191 unwritten entries? >> > >> > Good point. Actually there is one bug that looks obviously wrong >> > (writing 32bit value to CORBWP). Maybe it has been working just >> > because writing CORBRP doesn't influence except for the reset bit. >> > >> > Reading CORBWP as a byte is OK, but this could be better in a word so >> > that we can check 0xffff as invalid. >> > >> > A test patch is below. Hopefully this improves the situation... >> >> I'll check this out tomorrow and also instrument the code to get a >> backtrace, since there may still be an underlying race with the >> previous patches: [...] > That's odd. The Cirrus one (0000:00:1b.0) must be independent from > the vga switcheroo things for Nvidia... > > The patch below is the revised patch of the first one. Now the vga > switcheroo registration is done before the video controller D3 check, > so the race should be smaller. This patch improves things further of course; a major improvement over without. ~15% of the time, I still get the 'spurious response' messages in this callpath: Pid: 473, comm: modprobe Not tainted 3.7.0-rc8-expert+ #15 Call Trace: [] azx_update_rirb+0xe6/0xf0 [snd_hda_intel] [] azx_get_response+0xc0/0x270 [snd_hda_intel] [] codec_exec_verb+0xc1/0x100 [snd_hda_codec] [] snd_hda_codec_read+0x5b/0x90 [snd_hda_codec] [] ? try_to_grab_pending+0xb9/0x130 [] hda_set_power_state+0xd5/0x110 [snd_hda_codec] [] hda_call_codec_resume+0x20/0x120 [snd_hda_codec] [] snd_hda_power_save+0x129/0x1e0 [snd_hda_codec] [] codec_exec_verb+0x4f/0x100 [snd_hda_codec] [] snd_hda_codec_write+0x6b/0xa0 [snd_hda_codec] [] snd_hda_codec_write_cache+0x31/0xa0 [snd_hda_codec] [] snd_hda_jack_detect_enable_callback+0x75/0xa0 [snd_hda_codec] [] snd_hda_jack_detect_enable+0x11/0x20 [snd_hda_codec] [] generic_hdmi_init+0xb6/0xe0 [snd_hda_codec_hdmi] [] snd_hda_codec_build_controls+0x37/0x130 [snd_hda_codec] [] ? snd_hda_codec_build_pcms+0x1a0/0x2e0 [snd_hda_codec] [] snd_hda_build_controls+0x28/0x80 [snd_hda_codec] [] azx_probe_continue+0x2f2/0x410 [snd_hda_intel] [] ? azx_pcm_hw_params+0xe0/0xe0 [snd_hda_intel] [] ? azx_resume+0x110/0x110 [snd_hda_intel] [] ? azx_power_notify+0x40/0x40 [snd_hda_intel] [] ? azx_runtime_resume+0x40/0x40 [snd_hda_intel] [] ? azx_vs_can_switch+0x90/0x90 [snd_hda_intel] [] azx_probe+0x6c9/0x7a3 [snd_hda_intel] [] local_pci_probe+0x74/0x100 [] pci_device_probe+0x111/0x120 [] driver_probe_device+0x76/0x240 [] __driver_attach+0x9b/0xa0 [] ? driver_probe_device+0x240/0x240 [] bus_for_each_dev+0x4d/0x90 [] driver_attach+0x19/0x20 [] bus_add_driver+0x190/0x260 [] ? 0xffffffffa002dfff [] ? 0xffffffffa002dfff [] driver_register+0x72/0x170 [] ? 0xffffffffa002dfff [] ? 0xffffffffa002dfff [] __pci_register_driver+0x43/0x50 [] azx_driver_init+0x1e/0x1000 [snd_hda_intel] [] do_one_initcall+0x11a/0x160 [] sys_init_module+0xbd/0x220 [] system_call_fastpath+0x1a/0x1f Full boot log is at http://quora.org/2012/hda-switch-spurious3.txt . Thank you for your work so far, Takashi! I'll present a patch tomorrow to annotate context of the intel-hda messages, since it'll help me check behaviour when diagnosing the GPU switching issues. Thanks again, Daniel -- Daniel J Blueman