From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753091Ab2LDM67 (ORCPT <rfc822;w@1wt.eu>);
	Tue, 4 Dec 2012 07:58:59 -0500
Received: from mail-oa0-f46.google.com ([209.85.219.46]:34498 "EHLO
	mail-oa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752277Ab2LDM65 (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Tue, 4 Dec 2012 07:58:57 -0500
MIME-Version: 1.0
X-Originating-IP: [121.7.72.46]
In-Reply-To: <s5h38znysbx.wl%tiwai@suse.de>
References: <CAMVG2st4KZOYk9qMK6FFRnJt0bCEz-0nFsQAa4zXc+sed_eVhQ@mail.gmail.com>
	<s5hwqx69kuk.wl%tiwai@suse.de>
	<s5ha9tv1j2a.wl%tiwai@suse.de>
	<CAMVG2svAdYxJqj=7oA8XpYfaV6MP14Ff_DR3ZwVG0cm-FV_0sA@mail.gmail.com>
	<s5hip8jyza2.wl%tiwai@suse.de>
	<CAMVG2suWOAXJYWJb+QrLH2KV3CJw+Ut6fRRuy0U_F6-9oV9YyA@mail.gmail.com>
	<s5ha9tvyuio.wl%tiwai@suse.de>
	<CAMVG2svjXk-dbhjuiAX4SpTctNbwPkyzOa47n9vA2o1rLiWbBw@mail.gmail.com>
	<s5h38znysbx.wl%tiwai@suse.de>
Date: Tue, 4 Dec 2012 20:58:55 +0800
Message-ID: <CAMVG2sse7VN31gVjGuQr+7R6To8Nwk-OPwFs5Nhauxh7-J=jOg@mail.gmail.com>
Subject: Re: switcheroo registration vs switching race...
From: Daniel J Blueman <daniel@quora.org>
To: Takashi Iwai <tiwai@suse.de>
Cc: Seth Forshee <seth.forshee@canonical.com>, Dave Airlie <airlied@linux.ie>,
        Linux Kernel <linux-kernel@vger.kernel.org>
Content-Type: text/plain; charset=ISO-8859-1
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 4 December 2012 01:10, Takashi Iwai <tiwai@suse.de> wrote:
> At Tue, 4 Dec 2012 00:50:56 +0800,
> Daniel J Blueman wrote:
>>
>> On 4 December 2012 00:23, Takashi Iwai <tiwai@suse.de> wrote:
>> > At Mon, 3 Dec 2012 23:08:28 +0800,
>> > Daniel J Blueman wrote:
>> >>
>> >> On 3 December 2012 22:40, Takashi Iwai <tiwai@suse.de> wrote:
>> >> > At Mon, 3 Dec 2012 22:25:52 +0800,
>> >> > Daniel J Blueman wrote:
>> >> >>
>> >> >> On 3 December 2012 19:17, Takashi Iwai <tiwai@suse.de> wrote:
>> >> >> > At Wed, 28 Nov 2012 09:45:39 +0100,
>> >> >> > Takashi Iwai wrote:
>> >> >> >>
>> >> >> >> At Wed, 28 Nov 2012 11:45:07 +0800,
>> >> >> >> Daniel J Blueman wrote:
>> >> >> >> >
>> >> >> >> > Hi Seth, Dave, Takashi,
>> >> >> >> >
>> >> >> >> > If I power down the unused discrete GPU before lightdm starts by
>> >> >> >> > fiddling with the sysfs file [1] in the upstart script, I see a race
>> >> >> >> > manifesting as the discrete GPU's HDA controller timing out to
>> >> >> >> > commands [2].
>> >> >> >> >
>> >> >> >> > Adding some debug, I see that the registered audio devices are put
>> >> >> >> > into D3 before the GPU is, but it turns out that the discrete (and
>> >> >> >> > internal) GPU's HDA controller gets registered a bit later, so the
>> >> >> >> > list is empty. The symptom is since the HDA driver it's talking to
>> >> >> >> > hardware which is now in D3.
>> >> >> >> >
>> >> >> >> > We could add a mutex to nouveau to allow us to wait for the DGPU HDA
>> >> >> >> > controller, but perhaps this should be solved at a higher level in the
>> >> >> >> > vgaswitcheroo code; what do you think?
>> >> >> >>
>> >> >> >> Maybe it's a side effect for the recent effort to fix another race in
>> >> >> >> the probe.  A part of them problem is that the registration is done at
>> >> >> >> the very last of probing.
>> >> >> >>
>> >> >> >> Instead of delaying the registration, how about the patch below?
>> >> >> >
>> >> >> > Ping.  If this really works, I'd like to queue it for 3.8 merge, at
>> >> >> > least...
>> >> >>
>> >> >> Ping ack; I was trying to find time to understand another race that
>> >> >> occurs with GPU probing after switching, but is separate from the
>> >> >> situation before switching, here.
>> >> >>
>> >> >> In the context of writing the switch, it looks like struct azx isn't
>> >> >> allocated by the time azx_vs_set_state accesses it [1,2]; racing with
>> >> >> azx_codec_create?
>> >> >
>> >> > It was allocated, but it wasn't assigned properly in pci drvdata.
>> >> >
>> >> > Below is the revised patch.  Just moved pci_set_drvdata() before
>> >> > register_vga_switcheroo().  Could you retest with it?
>> >>
>> >> Superb; this addresses the oops.
>> >
>> > OK, I'll queue it to sound tree for 3.8 kernel with Cc to stable.
>> >
>> >> ~1 second after the DGPU is put into D3, I still often see "hda-intel:
>> >> spurious response 0x0:0x0, last cmd=0x170500":
>> >> http://quora.org/2012/hda-switch-spurious.txt
>> >
>> > Hm, it's not clear who triggers these messages.  I'll try to check the
>> > code paths.
>> >
>> >> Presumably this implies the read of the ring-buffer pointer returned
>> >> 0xffffffff, so the HDA driver understands the pointer to have wrapped
>> >> and processes the 191 unwritten entries?
>> >
>> > Good point.  Actually there is one bug that looks obviously wrong
>> > (writing 32bit value to CORBWP).  Maybe it has been working just
>> > because writing CORBRP doesn't influence except for the reset bit.
>> >
>> > Reading CORBWP as a byte is OK, but this could be better in a word so
>> > that we can check 0xffff as invalid.
>> >
>> > A test patch is below.  Hopefully this improves the situation...
>>
>> I'll check this out tomorrow and also instrument the code to get a
>> backtrace, since there may still be an underlying race with the
>> previous patches:

[...]

> That's odd.  The Cirrus one (0000:00:1b.0) must be independent from
> the vga switcheroo things for Nvidia...
>
> The patch below is the revised patch of the first one.  Now the vga
> switcheroo registration is done before the video controller D3 check,
> so the race should be smaller.

This patch improves things further of course; a major improvement over
without. ~15% of the time, I still get the 'spurious response'
messages in this callpath:

Pid: 473, comm: modprobe Not tainted 3.7.0-rc8-expert+ #15
Call Trace:
 [<ffffffffa0025776>] azx_update_rirb+0xe6/0xf0 [snd_hda_intel]
 [<ffffffffa0026d00>] azx_get_response+0xc0/0x270 [snd_hda_intel]
 [<ffffffffa0226801>] codec_exec_verb+0xc1/0x100 [snd_hda_codec]
 [<ffffffffa022799b>] snd_hda_codec_read+0x5b/0x90 [snd_hda_codec]
 [<ffffffff810545e9>] ? try_to_grab_pending+0xb9/0x130
 [<ffffffffa0227d35>] hda_set_power_state+0xd5/0x110 [snd_hda_codec]
 [<ffffffffa0226420>] hda_call_codec_resume+0x20/0x120 [snd_hda_codec]
 [<ffffffffa0226689>] snd_hda_power_save+0x129/0x1e0 [snd_hda_codec]
 [<ffffffffa022678f>] codec_exec_verb+0x4f/0x100 [snd_hda_codec]
 [<ffffffffa02268ab>] snd_hda_codec_write+0x6b/0xa0 [snd_hda_codec]
 [<ffffffffa0227281>] snd_hda_codec_write_cache+0x31/0xa0 [snd_hda_codec]
 [<ffffffffa022b965>] snd_hda_jack_detect_enable_callback+0x75/0xa0
[snd_hda_codec]
 [<ffffffffa022b9a1>] snd_hda_jack_detect_enable+0x11/0x20 [snd_hda_codec]
 [<ffffffffa00cf5a6>] generic_hdmi_init+0xb6/0xe0 [snd_hda_codec_hdmi]
 [<ffffffffa022ad67>] snd_hda_codec_build_controls+0x37/0x130 [snd_hda_codec]
 [<ffffffffa022b080>] ? snd_hda_codec_build_pcms+0x1a0/0x2e0 [snd_hda_codec]
 [<ffffffffa022ae88>] snd_hda_build_controls+0x28/0x80 [snd_hda_codec]
 [<ffffffffa0027ed2>] azx_probe_continue+0x2f2/0x410 [snd_hda_intel]
 [<ffffffffa0027a10>] ? azx_pcm_hw_params+0xe0/0xe0 [snd_hda_intel]
 [<ffffffffa0026c40>] ? azx_resume+0x110/0x110 [snd_hda_intel]
 [<ffffffffa00258a0>] ? azx_power_notify+0x40/0x40 [snd_hda_intel]
 [<ffffffffa00269e0>] ? azx_runtime_resume+0x40/0x40 [snd_hda_intel]
 [<ffffffffa0025860>] ? azx_vs_can_switch+0x90/0x90 [snd_hda_intel]
 [<ffffffffa0028966>] azx_probe+0x6c9/0x7a3 [snd_hda_intel]
 [<ffffffff8120ae74>] local_pci_probe+0x74/0x100
 [<ffffffff8120b641>] pci_device_probe+0x111/0x120
 [<ffffffff812be936>] driver_probe_device+0x76/0x240
 [<ffffffff812beb9b>] __driver_attach+0x9b/0xa0
 [<ffffffff812beb00>] ? driver_probe_device+0x240/0x240
 [<ffffffff812bcc9d>] bus_for_each_dev+0x4d/0x90
 [<ffffffff812be499>] driver_attach+0x19/0x20
 [<ffffffff812be050>] bus_add_driver+0x190/0x260
 [<ffffffffa002e000>] ? 0xffffffffa002dfff
 [<ffffffffa002e000>] ? 0xffffffffa002dfff
 [<ffffffff812bf202>] driver_register+0x72/0x170
 [<ffffffffa002e000>] ? 0xffffffffa002dfff
 [<ffffffffa002e000>] ? 0xffffffffa002dfff
 [<ffffffff8120b723>] __pci_register_driver+0x43/0x50
 [<ffffffffa002e01e>] azx_driver_init+0x1e/0x1000 [snd_hda_intel]
 [<ffffffff810002da>] do_one_initcall+0x11a/0x160
 [<ffffffff81087cfd>] sys_init_module+0xbd/0x220
 [<ffffffff814b7b96>] system_call_fastpath+0x1a/0x1f

Full boot log is at http://quora.org/2012/hda-switch-spurious3.txt .
Thank you for your work so far, Takashi!

I'll present a patch tomorrow to annotate context of the intel-hda
messages, since it'll help me check behaviour when diagnosing the GPU
switching issues.

Thanks again,
  Daniel
--
Daniel J Blueman