* Why do we check for "link-up" in *_pcie_valid_device()? @ 2017-12-14 22:58 Bjorn Helgaas 2017-12-15 18:39 ` Jingoo Han ` (2 more replies) 0 siblings, 3 replies; 13+ messages in thread From: Bjorn Helgaas @ 2017-12-14 22:58 UTC (permalink / raw) To: Jingoo Han, Joao Pinto, Ley Foon Tan, Shawn Lin, Michal Simek Cc: Jim Quinlan, Lorenzo Pieralisi, linux-pci, rfi, linux-rockchip Hi all, In the PCI config access path, the *_pcie_valid_device() functions in the dwc, altera, rockchip, and xilinx drivers all check whether the link is up. I think this is racy because the link may go down after we check but before we perform the config access. What would blow up if we removed the *_pcie_link_up() checks? I'd like to either remove the checks or add comments about why the race is acceptable. If we've covered this before, I apologize. Adding a comment will keep me from pestering you about this again in the future. Bjorn ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Why do we check for "link-up" in *_pcie_valid_device()? 2017-12-14 22:58 Why do we check for "link-up" in *_pcie_valid_device()? Bjorn Helgaas @ 2017-12-15 18:39 ` Jingoo Han 2017-12-15 19:04 ` Bjorn Helgaas 2017-12-22 13:02 ` Bharat Kumar Gogada 2018-01-02 12:24 ` Shawn Lin 2 siblings, 1 reply; 13+ messages in thread From: Jingoo Han @ 2017-12-15 18:39 UTC (permalink / raw) To: 'Bjorn Helgaas', 'Joao Pinto', 'Ley Foon Tan', 'Shawn Lin', 'Michal Simek' Cc: 'Jim Quinlan', 'Lorenzo Pieralisi', linux-pci, rfi, linux-rockchip On Thursday, December 14, 2017 5:58 PM, Bjorn Helgaas wrote: > > Hi all, > > In the PCI config access path, the *_pcie_valid_device() functions in > the dwc, altera, rockchip, and xilinx drivers all check whether the > link is up. > > I think this is racy because the link may go down after we check but > before we perform the config access. > > What would blow up if we removed the *_pcie_link_up() checks? The original intention is to avoid config access before link up. Also, I did not find any racy condition as you mentioned. However, if you think that we need to prevent the racy condition, someone can send a patch or add comments. Best regards, Jingoo Han > > I'd like to either remove the checks or add comments about why the > race is acceptable. If we've covered this before, I apologize. > Adding a comment will keep me from pestering you about this again in > the future. > > Bjorn ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Why do we check for "link-up" in *_pcie_valid_device()? 2017-12-15 18:39 ` Jingoo Han @ 2017-12-15 19:04 ` Bjorn Helgaas 2017-12-15 20:11 ` Bjorn Helgaas 0 siblings, 1 reply; 13+ messages in thread From: Bjorn Helgaas @ 2017-12-15 19:04 UTC (permalink / raw) To: Jingoo Han Cc: 'Joao Pinto', 'Ley Foon Tan', 'Shawn Lin', 'Michal Simek', 'Jim Quinlan', 'Lorenzo Pieralisi', linux-pci, rfi, linux-rockchip On Fri, Dec 15, 2017 at 01:39:50PM -0500, Jingoo Han wrote: > On Thursday, December 14, 2017 5:58 PM, Bjorn Helgaas wrote: > > In the PCI config access path, the *_pcie_valid_device() functions in > > the dwc, altera, rockchip, and xilinx drivers all check whether the > > link is up. > > > > I think this is racy because the link may go down after we check but > > before we perform the config access. > > > > What would blow up if we removed the *_pcie_link_up() checks? > > The original intention is to avoid config access before link up. > > Also, I did not find any racy condition as you mentioned. > However, if you think that we need to prevent the racy condition, > someone can send a patch or add comments. Here's the race: pci_read_config_word ... dw_pcie_rd_conf dw_pcie_valid_device dw_pcie_link_up # returns true; link currently up <-- link goes down for unrelated reason dw_pcie_rd_other_conf dw_pcie_read readl # issue config read The readl() that issues the config read in the PCIe domain races with the link-down event. If the readl() completes first, everything works as expected. If the link-down happens first, I don't know what happens. This is the core of my question. The hardware *should* do something sane because link-down is an asynchronous event that is unrelated to the config read and may happen at any time. It's just part of the PCIe environment and the spec defines mechanisms for dealing with and reporting the situation. > > I'd like to either remove the checks or add comments about why the > > race is acceptable. If we've covered this before, I apologize. > > Adding a comment will keep me from pestering you about this again in > > the future. > > > > Bjorn > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Why do we check for "link-up" in *_pcie_valid_device()? 2017-12-15 19:04 ` Bjorn Helgaas @ 2017-12-15 20:11 ` Bjorn Helgaas 0 siblings, 0 replies; 13+ messages in thread From: Bjorn Helgaas @ 2017-12-15 20:11 UTC (permalink / raw) To: Jingoo Han Cc: 'Joao Pinto', 'Ley Foon Tan', 'Shawn Lin', 'Michal Simek', 'Jim Quinlan', 'Lorenzo Pieralisi', linux-pci, rfi, linux-rockchip On Fri, Dec 15, 2017 at 01:04:26PM -0600, Bjorn Helgaas wrote: > On Fri, Dec 15, 2017 at 01:39:50PM -0500, Jingoo Han wrote: > > On Thursday, December 14, 2017 5:58 PM, Bjorn Helgaas wrote: > > > In the PCI config access path, the *_pcie_valid_device() functions in > > > the dwc, altera, rockchip, and xilinx drivers all check whether the > > > link is up. Sorry, Shawn, I included rockchip by mistake; it doesn't actually check for link-up. I'll try to remember to remove you and linux-rockchip from the cc list of future emails. > > > I think this is racy because the link may go down after we check but > > > before we perform the config access. > > > > > > What would blow up if we removed the *_pcie_link_up() checks? > > > > The original intention is to avoid config access before link up. > > > > Also, I did not find any racy condition as you mentioned. > > However, if you think that we need to prevent the racy condition, > > someone can send a patch or add comments. > > Here's the race: > > pci_read_config_word > ... > dw_pcie_rd_conf > dw_pcie_valid_device > dw_pcie_link_up # returns true; link currently up > <-- link goes down for unrelated reason > dw_pcie_rd_other_conf > dw_pcie_read > readl # issue config read > > The readl() that issues the config read in the PCIe domain races with > the link-down event. If the readl() completes first, everything works > as expected. If the link-down happens first, I don't know what > happens. This is the core of my question. > > The hardware *should* do something sane because link-down is an > asynchronous event that is unrelated to the config read and may happen > at any time. It's just part of the PCIe environment and the spec > defines mechanisms for dealing with and reporting the situation. To make this concrete, the patch I would propose is below. This isn't for merging yet; just for discussion and testing. Jim mentioned that the Broadcom STB host controller effects a synchronous or asynchronous abort when doing a config access when the link or power has gone down on the Endpoint. Possibly there's a similar issue on DesignWare, Altera, and Xilinx. I think the best way to handle that is to figure out how to deal with the abort cleanly. Using a test like *_pcie_link_up() to try to avoid it is a 99% solution that means we'll see rare failures that are very difficult to reproduce and debug. diff --git a/drivers/pci/dwc/pcie-designware-host.c b/drivers/pci/dwc/pcie-designware-host.c index 81e2157a7cfb..3dcdcdd6aa37 100644 --- a/drivers/pci/dwc/pcie-designware-host.c +++ b/drivers/pci/dwc/pcie-designware-host.c @@ -524,14 +524,6 @@ static int dw_pcie_wr_other_conf(struct pcie_port *pp, struct pci_bus *bus, static int dw_pcie_valid_device(struct pcie_port *pp, struct pci_bus *bus, int dev) { - struct dw_pcie *pci = to_dw_pcie_from_pp(pp); - - /* If there is no link, then there is no device */ - if (bus->number != pp->root_bus_nr) { - if (!dw_pcie_link_up(pci)) - return 0; - } - /* access only one slot on each root port */ if (bus->number == pp->root_bus_nr && dev > 0) return 0; diff --git a/drivers/pci/host/pcie-altera.c b/drivers/pci/host/pcie-altera.c index 5cc4f594d79a..17ba6cce9bd0 100644 --- a/drivers/pci/host/pcie-altera.c +++ b/drivers/pci/host/pcie-altera.c @@ -140,12 +140,6 @@ static void tlp_write_tx(struct altera_pcie *pcie, static bool altera_pcie_valid_device(struct altera_pcie *pcie, struct pci_bus *bus, int dev) { - /* If there is no link, then there is no device */ - if (bus->number != pcie->root_bus_nr) { - if (!altera_pcie_link_up(pcie)) - return false; - } - /* access only one slot on each root port */ if (bus->number == pcie->root_bus_nr && dev > 0) return false; diff --git a/drivers/pci/host/pcie-xilinx-nwl.c b/drivers/pci/host/pcie-xilinx-nwl.c index 65dea98b2643..db94df5db148 100644 --- a/drivers/pci/host/pcie-xilinx-nwl.c +++ b/drivers/pci/host/pcie-xilinx-nwl.c @@ -218,12 +218,6 @@ static bool nwl_pcie_valid_device(struct pci_bus *bus, unsigned int devfn) { struct nwl_pcie *pcie = bus->sysdata; - /* Check link before accessing downstream ports */ - if (bus->number != pcie->root_busno) { - if (!nwl_pcie_link_up(pcie)) - return false; - } - /* Only one device down on each root port */ if (bus->number == pcie->root_busno && devfn > 0) return false; diff --git a/drivers/pci/host/pcie-xilinx.c b/drivers/pci/host/pcie-xilinx.c index 7b5325990f5e..3cdb99708342 100644 --- a/drivers/pci/host/pcie-xilinx.c +++ b/drivers/pci/host/pcie-xilinx.c @@ -163,11 +163,6 @@ static bool xilinx_pcie_valid_device(struct pci_bus *bus, unsigned int devfn) { struct xilinx_pcie_port *port = bus->sysdata; - /* Check if link is up when trying to access downstream ports */ - if (bus->number != port->root_busno) - if (!xilinx_pcie_link_up(port)) - return false; - /* Only one device down on each root port */ if (bus->number == port->root_busno && devfn > 0) return false; ^ permalink raw reply related [flat|nested] 13+ messages in thread
* RE: Why do we check for "link-up" in *_pcie_valid_device()? 2017-12-14 22:58 Why do we check for "link-up" in *_pcie_valid_device()? Bjorn Helgaas 2017-12-15 18:39 ` Jingoo Han @ 2017-12-22 13:02 ` Bharat Kumar Gogada 2017-12-22 17:28 ` Bjorn Helgaas 2018-01-02 12:24 ` Shawn Lin 2 siblings, 1 reply; 13+ messages in thread From: Bharat Kumar Gogada @ 2017-12-22 13:02 UTC (permalink / raw) To: Bjorn Helgaas, Jingoo Han, joao.pinto, Ley Foon Tan, Shawn Lin, Michal Simek Cc: Jim Quinlan, lorenzo.pieralisi, linux-pci, rfi, linux-rockchip Hi all, In the PCI config access path, the *_pcie_valid_device() functions in the d= wc, altera, rockchip, and xilinx drivers all check whether the link is up. I think this is racy because the link may go down after we check but before= we perform the config access. What would blow up if we removed the *_pcie_link_up() checks? I'd like to either remove the checks or add comments about why the race is = acceptable. If we've covered this before, I apologize. Adding a comment will keep me from pestering you about this again in the fu= ture. Hi Bjorn, In both Xilinx driver cases when link is down, hardware responds by AXI DEC= ERR/SLVERR status which=20 causes an exception, synchronous external abort to CPU.=20 This causes system to hang, so we need this check for both of our drivers. We will add comments.=20 We have shutdown, Please expect delay in response till 1st Jan, 2018. Regards, Bharat ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Why do we check for "link-up" in *_pcie_valid_device()? 2017-12-22 13:02 ` Bharat Kumar Gogada @ 2017-12-22 17:28 ` Bjorn Helgaas 2018-01-02 11:37 ` Lorenzo Pieralisi 2018-01-05 14:26 ` Bharat Kumar Gogada 0 siblings, 2 replies; 13+ messages in thread From: Bjorn Helgaas @ 2017-12-22 17:28 UTC (permalink / raw) To: Bharat Kumar Gogada Cc: Jingoo Han, joao.pinto, Ley Foon Tan, Shawn Lin, Michal Simek, Jim Quinlan, lorenzo.pieralisi, linux-pci, rfi, linux-rockchip On Fri, Dec 22, 2017 at 01:02:28PM +0000, Bharat Kumar Gogada wrote: > Bjorn wrote: >> In the PCI config access path, the *_pcie_valid_device() functions >> in the dwc, altera, rockchip, and xilinx drivers all check whether >> the link is up. >> >> I think this is racy because the link may go down after we check but >> before we perform the config access. >> >> What would blow up if we removed the *_pcie_link_up() checks? >> >> I'd like to either remove the checks or add comments about why the >> race is acceptable. If we've covered this before, I apologize. >> Adding a comment will keep me from pestering you about this again in >> the future. > In both Xilinx driver cases when link is down, hardware responds by > AXI DECERR/SLVERR status which causes an exception, synchronous > external abort to CPU. This causes system to hang, so we need this > check for both of our drivers. We will add comments. This is a problem, and checking whether the link is up is a workaround but not a real solution. That means your system may hang if the link happens to go down at the wrong time. A real solution would be to handle the synchronous external abort so it doesn't cause a system hang. Bjorn ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Why do we check for "link-up" in *_pcie_valid_device()? 2017-12-22 17:28 ` Bjorn Helgaas @ 2018-01-02 11:37 ` Lorenzo Pieralisi 2018-01-05 14:26 ` Bharat Kumar Gogada 1 sibling, 0 replies; 13+ messages in thread From: Lorenzo Pieralisi @ 2018-01-02 11:37 UTC (permalink / raw) To: Bjorn Helgaas Cc: Bharat Kumar Gogada, Jingoo Han, joao.pinto, Ley Foon Tan, Shawn Lin, Michal Simek, Jim Quinlan, linux-pci, rfi, linux-rockchip On Fri, Dec 22, 2017 at 11:28:15AM -0600, Bjorn Helgaas wrote: > On Fri, Dec 22, 2017 at 01:02:28PM +0000, Bharat Kumar Gogada wrote: > > Bjorn wrote: > >> In the PCI config access path, the *_pcie_valid_device() functions > >> in the dwc, altera, rockchip, and xilinx drivers all check whether > >> the link is up. > >> > >> I think this is racy because the link may go down after we check but > >> before we perform the config access. > >> > >> What would blow up if we removed the *_pcie_link_up() checks? > >> > >> I'd like to either remove the checks or add comments about why the > >> race is acceptable. If we've covered this before, I apologize. > >> Adding a comment will keep me from pestering you about this again in > >> the future. > > > In both Xilinx driver cases when link is down, hardware responds by > > AXI DECERR/SLVERR status which causes an exception, synchronous > > external abort to CPU. This causes system to hang, so we need this > > check for both of our drivers. We will add comments. > > This is a problem, and checking whether the link is up is a workaround > but not a real solution. That means your system may hang if the link > happens to go down at the wrong time. > > A real solution would be to handle the synchronous external abort > so it doesn't cause a system hang. I still have no idea why checking (in a broken way) that the link is up for _every_ given config access is a solution. If, say, the link is down in xilinx_pcie_init_port(), what would bring it up or, worded differently, why checking that the link is up for every config access instead of host bridge probe time make this code any safer ? It is not safe anyway, it would be good to understand though why the code was written this way so that we can change it appropriately. Thanks, Lorenzo ^ permalink raw reply [flat|nested] 13+ messages in thread
* RE: Why do we check for "link-up" in *_pcie_valid_device()? 2017-12-22 17:28 ` Bjorn Helgaas 2018-01-02 11:37 ` Lorenzo Pieralisi @ 2018-01-05 14:26 ` Bharat Kumar Gogada 2018-01-05 15:43 ` Lorenzo Pieralisi 1 sibling, 1 reply; 13+ messages in thread From: Bharat Kumar Gogada @ 2018-01-05 14:26 UTC (permalink / raw) To: Bjorn Helgaas Cc: Jingoo Han, joao.pinto, Ley Foon Tan, Shawn Lin, Michal Simek, Jim Quinlan, lorenzo.pieralisi, linux-pci, rfi, linux-rockchip On Fri, Dec 22, 2017 at 01:02:28PM +0000, Bharat Kumar Gogada wrote: > Bjorn wrote: >> In the PCI config access path, the *_pcie_valid_device() functions in=20 >> the dwc, altera, rockchip, and xilinx drivers all check whether the=20 >> link is up. >>=20 >> I think this is racy because the link may go down after we check but=20 >> before we perform the config access. >>=20 >> What would blow up if we removed the *_pcie_link_up() checks? >>=20 >> I'd like to either remove the checks or add comments about why the=20 >> race is acceptable. If we've covered this before, I apologize. >> Adding a comment will keep me from pestering you about this again in=20 >> the future. > In both Xilinx driver cases when link is down, hardware responds by=20 > AXI DECERR/SLVERR status which causes an exception, synchronous=20 > external abort to CPU. This causes system to hang, so we need this=20 > check for both of our drivers. We will add comments. This is a problem, and checking whether the link is up is a workaround but = not a real solution. That means your system may hang if the link happens t= o go down at the wrong time. A real solution would be to handle the synchronous external abort so it doe= sn't cause a system hang. Yes, I agree that this is workaround. For pcie-xilinx.c for arm32, we can h= ave fault handling similar to "imx6q_pcie_abort_handler" in drivers/pci/dwc= /pci-imx6.c. Since this driver is same for Microblaze architecture also, it requires sep= arate handling. For pcie-xilinx-nwl.c ARM64 as per link [1], linux kernel will hang for the= above AXI responses.=20 As of now arm64 RAS is still work in progress [2]. =20 [1] https://www.spinics.net/lists/arm-kernel/msg624203.html [2] https://patchwork.kernel.org/patch/9973967/ The check can be removed, if above issues were addressed. Regards, Bharat =20 ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Why do we check for "link-up" in *_pcie_valid_device()? 2018-01-05 14:26 ` Bharat Kumar Gogada @ 2018-01-05 15:43 ` Lorenzo Pieralisi 2018-01-08 11:03 ` Lucas Stach 0 siblings, 1 reply; 13+ messages in thread From: Lorenzo Pieralisi @ 2018-01-05 15:43 UTC (permalink / raw) To: Bharat Kumar Gogada Cc: Bjorn Helgaas, Jingoo Han, joao.pinto, Ley Foon Tan, Shawn Lin, Michal Simek, Jim Quinlan, linux-pci, rfi, linux-rockchip On Fri, Jan 05, 2018 at 02:26:34PM +0000, Bharat Kumar Gogada wrote: > On Fri, Dec 22, 2017 at 01:02:28PM +0000, Bharat Kumar Gogada wrote: > > Bjorn wrote: > >> In the PCI config access path, the *_pcie_valid_device() functions in > >> the dwc, altera, rockchip, and xilinx drivers all check whether the > >> link is up. > >> > >> I think this is racy because the link may go down after we check but > >> before we perform the config access. > >> > >> What would blow up if we removed the *_pcie_link_up() checks? > >> > >> I'd like to either remove the checks or add comments about why the > >> race is acceptable. If we've covered this before, I apologize. > >> Adding a comment will keep me from pestering you about this again in > >> the future. > > > In both Xilinx driver cases when link is down, hardware responds by > > AXI DECERR/SLVERR status which causes an exception, synchronous > > external abort to CPU. This causes system to hang, so we need this > > check for both of our drivers. We will add comments. > > This is a problem, and checking whether the link is up is a workaround but not a real solution. That means your system may hang if the link happens to go down at the wrong time. > > A real solution would be to handle the synchronous external abort so it doesn't cause a system hang. > > Yes, I agree that this is workaround. For pcie-xilinx.c for arm32, we can have fault handling similar to "imx6q_pcie_abort_handler" in drivers/pci/dwc/pci-imx6.c. > Since this driver is same for Microblaze architecture also, it requires separate handling. > > For pcie-xilinx-nwl.c ARM64 as per link [1], linux kernel will hang for the above AXI responses. > As of now arm64 RAS is still work in progress [2]. > > [1] https://www.spinics.net/lists/arm-kernel/msg624203.html > > [2] https://patchwork.kernel.org/patch/9973967/ > > The check can be removed, if above issues were addressed. I do not see why the above "issues" should be addressed in order to remove that check - as it was pointed out in this thread it just does not solve anything, so what's the reason for keeping it ? Lorenzo ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Why do we check for "link-up" in *_pcie_valid_device()? 2018-01-05 15:43 ` Lorenzo Pieralisi @ 2018-01-08 11:03 ` Lucas Stach 2018-01-08 11:24 ` Lorenzo Pieralisi 0 siblings, 1 reply; 13+ messages in thread From: Lucas Stach @ 2018-01-08 11:03 UTC (permalink / raw) To: Lorenzo Pieralisi, Bharat Kumar Gogada Cc: Bjorn Helgaas, Jingoo Han, joao.pinto, Ley Foon Tan, Shawn Lin, Michal Simek, Jim Quinlan, linux-pci, rfi, linux-rockchip Am Freitag, den 05.01.2018, 15:43 +0000 schrieb Lorenzo Pieralisi: > On Fri, Jan 05, 2018 at 02:26:34PM +0000, Bharat Kumar Gogada wrote: > > On Fri, Dec 22, 2017 at 01:02:28PM +0000, Bharat Kumar Gogada > > wrote: > > > Bjorn wrote: > > > > In the PCI config access path, the *_pcie_valid_device() > > > > functions in > > > > the dwc, altera, rockchip, and xilinx drivers all check whether > > > > the > > > > link is up. > > > > > > > > I think this is racy because the link may go down after we > > > > check but > > > > before we perform the config access. > > > > > > > > What would blow up if we removed the *_pcie_link_up() checks? > > > > > > > > I'd like to either remove the checks or add comments about why > > > > the > > > > race is acceptable. If we've covered this before, I apologize. > > > > Adding a comment will keep me from pestering you about this > > > > again in > > > > the future. > > > In both Xilinx driver cases when link is down, hardware responds > > > by > > > AXI DECERR/SLVERR status which causes an exception, synchronous > > > external abort to CPU. This causes system to hang, so we need > > > this > > > check for both of our drivers. We will add comments. > > > > This is a problem, and checking whether the link is up is a > > workaround but not a real solution. That means your system may > > hang if the link happens to go down at the wrong time. > > > > A real solution would be to handle the synchronous external abort > > so it doesn't cause a system hang. > > > > Yes, I agree that this is workaround. For pcie-xilinx.c for arm32, > > we can have fault handling similar to "imx6q_pcie_abort_handler" in > > drivers/pci/dwc/pci-imx6.c. > > Since this driver is same for Microblaze architecture also, it > > requires separate handling. > > > > For pcie-xilinx-nwl.c ARM64 as per link [1], linux kernel will hang > > for the above AXI responses. > > As of now arm64 RAS is still work in progress [2]. > > > > [1] https://www.spinics.net/lists/arm-kernel/msg624203.html > > > > [2] https://patchwork.kernel.org/patch/9973967/ > > > > The check can be removed, if above issues were addressed. > > I do not see why the above "issues" should be addressed in order to > remove that check - as it was pointed out in this thread it just does > not solve anything, so what's the reason for keeping it ? I solves the issue that you hang the system on PCIe enumeration in 100% of the cases when the link is down and you don't have the abort handler in place. It doesn't solve the race issue, but that is a lot less likely to be hit in the real world. I guess it's not a good idea to remove something that covers 98% of the problem just because it doesn't cover the remaining 2%, right? Regards, Lucas ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Why do we check for "link-up" in *_pcie_valid_device()? 2018-01-08 11:03 ` Lucas Stach @ 2018-01-08 11:24 ` Lorenzo Pieralisi 0 siblings, 0 replies; 13+ messages in thread From: Lorenzo Pieralisi @ 2018-01-08 11:24 UTC (permalink / raw) To: Lucas Stach Cc: Bharat Kumar Gogada, Bjorn Helgaas, Jingoo Han, joao.pinto, Ley Foon Tan, Shawn Lin, Michal Simek, Jim Quinlan, linux-pci, rfi, linux-rockchip On Mon, Jan 08, 2018 at 12:03:34PM +0100, Lucas Stach wrote: > Am Freitag, den 05.01.2018, 15:43 +0000 schrieb Lorenzo Pieralisi: > > On Fri, Jan 05, 2018 at 02:26:34PM +0000, Bharat Kumar Gogada wrote: > > > On Fri, Dec 22, 2017 at 01:02:28PM +0000, Bharat Kumar Gogada > > > wrote: > > > > Bjorn wrote: > > > > > In the PCI config access path, the *_pcie_valid_device() > > > > > functions in > > > > > the dwc, altera, rockchip, and xilinx drivers all check whether > > > > > the > > > > > link is up. > > > > > > > > > > I think this is racy because the link may go down after we > > > > > check but > > > > > before we perform the config access. > > > > > > > > > > What would blow up if we removed the *_pcie_link_up() checks? > > > > > > > > > > I'd like to either remove the checks or add comments about why > > > > > the > > > > > race is acceptable. If we've covered this before, I apologize. > > > > > Adding a comment will keep me from pestering you about this > > > > > again in > > > > > the future. > > > > In both Xilinx driver cases when link is down, hardware responds > > > > by > > > > AXI DECERR/SLVERR status which causes an exception, synchronous > > > > external abort to CPU. This causes system to hang, so we need > > > > this > > > > check for both of our drivers. We will add comments. > > > > > > This is a problem, and checking whether the link is up is a > > > workaround but not a real solution. That means your system may > > > hang if the link happens to go down at the wrong time. > > > > > > A real solution would be to handle the synchronous external abort > > > so it doesn't cause a system hang. > > > > > > Yes, I agree that this is workaround. For pcie-xilinx.c for arm32, > > > we can have fault handling similar to "imx6q_pcie_abort_handler" in > > > drivers/pci/dwc/pci-imx6.c. > > > Since this driver is same for Microblaze architecture also, it > > > requires separate handling. > > > > > > For pcie-xilinx-nwl.c ARM64 as per link [1], linux kernel will hang > > > for the above AXI responses. > > > As of now arm64 RAS is still work in progress [2]. > > > > > > [1] https://www.spinics.net/lists/arm-kernel/msg624203.html > > > > > > [2] https://patchwork.kernel.org/patch/9973967/ > > > > > > The check can be removed, if above issues were addressed. > > > > I do not see why the above "issues" should be addressed in order to > > remove that check - as it was pointed out in this thread it just does > > not solve anything, so what's the reason for keeping it ? > > I solves the issue that you hang the system on PCIe enumeration in 100% > of the cases when the link is down and you don't have the abort handler > in place. There is a mechanism to detect if the link is up before starting enumeration or am I wrong ? Probably what we should be discussing here is what are the causes for the link to go down *unexpectedly* - in other words - what makes that racy check "likely" to work - which was the initial question, by the way. > It doesn't solve the race issue, but that is a lot less likely to be > hit in the real world. I guess it's not a good idea to remove something > that covers 98% of the problem just because it doesn't cover the > remaining 2%, right? See above - I want to understand what your 98% and 2% actually represent. Thanks, Lorenzo ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Why do we check for "link-up" in *_pcie_valid_device()? 2017-12-14 22:58 Why do we check for "link-up" in *_pcie_valid_device()? Bjorn Helgaas 2017-12-15 18:39 ` Jingoo Han 2017-12-22 13:02 ` Bharat Kumar Gogada @ 2018-01-02 12:24 ` Shawn Lin 2018-01-02 12:28 ` Shawn Lin 2 siblings, 1 reply; 13+ messages in thread From: Shawn Lin @ 2018-01-02 12:24 UTC (permalink / raw) To: Bjorn Helgaas Cc: Jingoo Han, Joao Pinto, Ley Foon Tan, Michal Simek, shawn.lin, Jim Quinlan, Lorenzo Pieralisi, linux-pci, rfi Hi Bjorn, On 2017/12/15 6:58, Bjorn Helgaas wrote: > Hi all, > > In the PCI config access path, the *_pcie_valid_device() functions in > the dwc, altera, rockchip, and xilinx drivers all check whether the > link is up. > > I think this is racy because the link may go down after we check but > before we perform the config access. > The link could be broken at any time, so the check if bogus. And we have accessors for performing the config access but we don't have these for performing memory access, so again the racy is always there. > What would blow up if we removed the *_pcie_link_up() checks? What Rockchip needs(probably for all arm64 system) is something like this patchset[1]. And for arm32, we could refer to imx6q_pcie_abort_handler in drivers/pci/dwc/pci-imx6.c Note that ATF(ARM trust firmware) could help capture these (a)synchronous, so we actually don't even need to resort to patchset[1], and handle these abort in the firmware. [1]https://lkml.org/lkml/2017/11/10/236 > > I'd like to either remove the checks or add comments about why the Remove it makes sense to me. > race is acceptable. If we've covered this before, I apologize. > Adding a comment will keep me from pestering you about this again in > the future. > > Bjorn > > > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Why do we check for "link-up" in *_pcie_valid_device()? 2018-01-02 12:24 ` Shawn Lin @ 2018-01-02 12:28 ` Shawn Lin 0 siblings, 0 replies; 13+ messages in thread From: Shawn Lin @ 2018-01-02 12:28 UTC (permalink / raw) To: Bjorn Helgaas Cc: shawn.lin, Jingoo Han, Joao Pinto, Ley Foon Tan, Michal Simek, Jim Quinlan, Lorenzo Pieralisi, linux-pci, rfi Hi Bjorn, On 2018/1/2 20:24, Shawn Lin wrote: > Hi Bjorn, > > On 2017/12/15 6:58, Bjorn Helgaas wrote: >> Hi all, >> >> In the PCI config access path, the *_pcie_valid_device() functions in >> the dwc, altera, rockchip, and xilinx drivers all check whether the >> link is up. >> >> I think this is racy because the link may go down after we check but >> before we perform the config access. >> > > The link could be broken at any time, so the check if bogus. And we > have accessors for performing the config access but we don't have these > for performing memory access, so again the racy is always there. > > >> What would blow up if we removed the *_pcie_link_up() checks? > > What Rockchip needs(probably for all arm64 system) is something > like this patchset[1]. And for arm32, we could refer to > imx6q_pcie_abort_handler in drivers/pci/dwc/pci-imx6.c > > Note that ATF(ARM trust firmware) could help capture these > (a)synchronous, so we actually don't even need to resort to patchset[1], > and handle these abort in the firmware. > Sorry for the wrong link, please refer to this one: https://lkml.org/lkml/2017/3/24/413 > > [1]https://lkml.org/lkml/2017/11/10/236 > > >> >> I'd like to either remove the checks or add comments about why the > > Remove it makes sense to me. > >> race is acceptable. If we've covered this before, I apologize. >> Adding a comment will keep me from pestering you about this again in >> the future. >> >> Bjorn >> >> >> > > > ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2018-01-08 11:24 UTC | newest] Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-12-14 22:58 Why do we check for "link-up" in *_pcie_valid_device()? Bjorn Helgaas 2017-12-15 18:39 ` Jingoo Han 2017-12-15 19:04 ` Bjorn Helgaas 2017-12-15 20:11 ` Bjorn Helgaas 2017-12-22 13:02 ` Bharat Kumar Gogada 2017-12-22 17:28 ` Bjorn Helgaas 2018-01-02 11:37 ` Lorenzo Pieralisi 2018-01-05 14:26 ` Bharat Kumar Gogada 2018-01-05 15:43 ` Lorenzo Pieralisi 2018-01-08 11:03 ` Lucas Stach 2018-01-08 11:24 ` Lorenzo Pieralisi 2018-01-02 12:24 ` Shawn Lin 2018-01-02 12:28 ` Shawn Lin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).