From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755552Ab2DXQDv (ORCPT ); Tue, 24 Apr 2012 12:03:51 -0400 Received: from mx1.redhat.com ([209.132.183.28]:15101 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754625Ab2DXQDu (ORCPT ); Tue, 24 Apr 2012 12:03:50 -0400 Message-ID: <4F96CEDB.3060300@redhat.com> Date: Tue, 24 Apr 2012 12:03:39 -0400 From: Don Dutile User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.23) Gecko/20110927 Red Hat/3.1.15-1.el6_1 Thunderbird/3.1.15 MIME-Version: 1.0 To: Bjorn Helgaas CC: "Rafael J. Wysocki" , Linus Torvalds , Mikko Vinni , "linux-input@vger.kernel.org" , "linux-kernel@vger.kernel.org" , Dmitry Torokhov , Allen Kay , Jesse Barnes , "linux-pci@vger.kernel.org" Subject: Re: [PATCH] PCI: Fix regression in pci_restore_state() References: <1334310754.17013.YahooMailNeo@web161804.mail.bf1.yahoo.com> <201204152140.40747.rjw@sisk.pl> <201204232153.57351.rjw@sisk.pl> <4F95B67D.7020706@redhat.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04/23/2012 06:33 PM, Bjorn Helgaas wrote: > On Mon, Apr 23, 2012 at 2:07 PM, Don Dutile wrote: >> On 04/23/2012 03:53 PM, Rafael J. Wysocki wrote: >>> >>> On Monday, April 23, 2012, Bjorn Helgaas wrote: >>>> >>>> On Sun, Apr 15, 2012 at 1:40 PM, Rafael J. Wysocki wrote: >>>>> >>>>> On Sunday, April 15, 2012, Linus Torvalds wrote: >>>>>> >>>>>> On Sun, Apr 15, 2012 at 11:47 AM, Rafael J. Wysocki >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> mdelay(10) doesn't really look good either to me in this case, though. >>>>>> >>>>>> >>>>>> Oh, I agree. What kind of ass-backwards device actually needs that >>>>>> kind of crazy delays? It is almost certainly buggy. >>>>>> >>>>>> With retries, 10ms delays are totally unacceptable. There's something >>>>>> wrong. >>>>>> >>>>>> A single ms *may* be ok. >>>>>> >>>>>> Anyway, can you also split the actual "write _one_ register with >>>>>> retry" into a function of its own? The code looks like crap with those >>>>>> multiple levels of looping, with conditionals inside them etc. With a >>>>>> simple helper function, you could change the break into return, and it >>>>>> would look much better, I bet. >>>>> >>>>> >>>>> Sure. It appears cleaner this way. >>>>> >>>>> --- >>>>> From: Rafael J. Wysocki >>>>> Subject: PCI: Fix regression in pci_restore_state(), v3 >>>>> >>>>> Commit 26f41062f28de65e11d3cf353e52d0be73442be1 >>>>> >>>>> PCI: check for pci bar restore completion and retry >>>>> >>>>> attempted to address problems with PCI BAR restoration on systems >>>>> where FLR had not been completed before pci_restore_state() was >>>>> called, but it did that in an utterly wrong way. >>>>> >>>>> First off, instead of retrying the writes for the BAR registers >>>>> only, it did that for all of the PCI config space of the device, >>>>> including the status register (whose value after the write quite >>>>> obviously need not be the same as the written one). Second, it >>>>> added arbitrary delay to pci_restore_state() even for systems >>>>> where the PCI config space restoration was successful at first >>>>> attempt. Finally, the mdelay(10) it added to every iteration of the >>>>> writing loop was way too much of a delay for any reasonable device. >>>>> >>>>> All of this actually caused resume failures for some devices on >>>>> the Mikko's system. >>>>> >>>>> To fix the regression, make pci_restore_state() only retry the >>>>> writes for BAR registers and only wait if the first read from >>>>> the register doesn't return the written value. Additionaly, make >>>>> it wait for 1 ms, instead of 10 ms, after every failing attempt >>>>> to write into config space. >>>>> >>>>> Reported-by: Mikko Vinni >>>>> Signed-off-by: Rafael J. Wysocki >>>>> --- >>>>> drivers/pci/pci.c | 57 >>>>> ++++++++++++++++++++++++++++++++++++------------------ >>>>> 1 file changed, 39 insertions(+), 18 deletions(-) >>>>> >>>>> Index: linux/drivers/pci/pci.c >>>>> =================================================================== >>>>> --- linux.orig/drivers/pci/pci.c >>>>> +++ linux/drivers/pci/pci.c >>>>> @@ -967,16 +967,47 @@ pci_save_state(struct pci_dev *dev) >>>>> return 0; >>>>> } >>>>> >>>>> +static void pci_restore_config_dword(struct pci_dev *pdev, int offset, >>>>> + u32 saved_val, int retry) >>>>> +{ >>>>> + u32 val; >>>>> + >>>>> + pci_read_config_dword(pdev, offset,&val); >>>>> + if (val == saved_val) >>>>> + return; >>>>> + >>>>> + for (;;) { >>>>> + dev_dbg(&pdev->dev, "restoring config space at offset " >>>>> + "%#x (was %#x, writing %#x)\n", offset, val, >>>>> saved_val); >>>>> + pci_write_config_dword(pdev, offset, saved_val); >>>>> + if (retry--<= 0) >>>>> + return; >>>>> + >>>>> + pci_read_config_dword(pdev, offset,&val); >>>>> + if (val == saved_val) >>>>> + return; >>>>> + >>>>> + mdelay(1); >>>>> + } >>>>> +} >>>>> + >>>>> +static void pci_restore_config_space(struct pci_dev *pdev, int start, >>>>> int end, >>>>> + int retry) >>>>> +{ >>>>> + int index; >>>>> + >>>>> + for (index = end; index>= start; index--) >>>>> + pci_restore_config_dword(pdev, 4 * index, >>>>> + >>>>> pdev->saved_config_space[index], >>>>> + retry); >>>>> +} >>>>> + >>>>> /** >>>>> * pci_restore_state - Restore the saved state of a PCI device >>>>> * @dev: - PCI device that we're dealing with >>>>> */ >>>>> void pci_restore_state(struct pci_dev *dev) >>>>> { >>>>> - int i; >>>>> - u32 val; >>>>> - int tries; >>>>> - >>>>> if (!dev->state_saved) >>>>> return; >>>>> >>>>> @@ -984,24 +1015,14 @@ void pci_restore_state(struct pci_dev *d >>>>> pci_restore_pcie_state(dev); >>>>> pci_restore_ats_state(dev); >>>>> >>>>> + pci_restore_config_space(dev, 10, 15, 0); >>>>> /* >>>>> * The Base Address register should be programmed before the >>>>> command >>>>> * register(s) >>>>> */ >>>>> - for (i = 15; i>= 0; i--) { >>>>> - pci_read_config_dword(dev, i * 4,&val); >>>>> - tries = 10; >>>>> - while (tries&& val != dev->saved_config_space[i]) { >>>>> >>>>> - dev_dbg(&dev->dev, "restoring config " >>>>> - "space at offset %#x (was %#x, writing >>>>> %#x)\n", >>>>> - i, val, >>>>> (int)dev->saved_config_space[i]); >>>>> - pci_write_config_dword(dev,i * 4, >>>>> - dev->saved_config_space[i]); >>>>> - pci_read_config_dword(dev, i * 4,&val); >>>>> - mdelay(10); >>>>> - tries--; >>>>> - } >>>>> - } >>>>> + pci_restore_config_space(dev, 4, 9, 10); >>>>> + pci_restore_config_space(dev, 0, 3, 0); >>>>> + >>>>> pci_restore_pcix_state(dev); >>>>> pci_restore_msi_state(dev); >>>>> pci_restore_iov_state(dev); >>>> >>>> >>>> I'd feel better about this if there were a way to delay in the FLR >>>> path instead. If we delay in the restore path, we're only fixing one >>>> of the many ways config space can be written. Other paths that write >>>> config space will just silently fail. >>>> >>>> The PCIe spec (r3.0, sec 6.6.2) mentions waiting for the "pre-FLR >>>> value for Completion Timeout," but I don't see anything that looks >>>> like that in pcie_flr() or pci_af_flr(). Are there any other direct >>>> ways we can detect when the FLR is complete? >>> >>> >>> I'm not aware of any. >>> >>> Thanks, >>> Rafael >> >> I don't think so, either. >> I believe an ECN is being worked in the PCI-SIG >> to add such a notification, though. >> Even if adopted, need to wait for another crank of the hw before >> the notification can be used. > > I agree, we can't do something that works only on new hardware -- we > have to make the existing hardware in the field work. > > What about the "waiting for as much time as the pre-FLR value for > Completion Timeout" part? > > Or can we do something like asserting FLR, sleeping 100ms, then > attempting a write to something in config space and retrying until it > sticks? It's kludgy, but I'm not sure it's any worse than putting the > retries in the restore path, and it would have the advantage that > other writers of config space wouldn't have to worry. > > Bjorn Depending on system config, reading a port that is being FLR'd can cause AERs, which if a driver is registered for the endpoint, it will get AERs reported to the driver and potentially complicate the FLRhandling. This implies a hook to temp-disable AER during FLR, then turning it back on (hw &/or sw). - Don ps -- and there was a deadly embrace where an AER induced during an FLR that was initiated by userspace (libvirt writing to device reset file in sysfs) would lock up the system b/c the AER handler did a config space access, which used the same mutex. Not sure if that was cleaned up finally.... very corner-case-ish, but it shows how subtle multiple PCIe events can become complicated.