linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Peter Geis <pgwipeout@gmail.com>
To: Vincenzo Palazzo <vincenzopalazzodev@gmail.com>
Cc: Bjorn Helgaas <helgaas@kernel.org>,
	linux-pci@vger.kernel.org, robh@kernel.org, heiko@sntech.de,
	kw@linux.com, shawn.lin@rock-chips.com,
	linux-kernel@vger.kernel.org, lgirdwood@gmail.com,
	linux-rockchip@lists.infradead.org, broonie@kernel.org,
	bhelgaas@google.com,
	linux-kernel-mentees@lists.linuxfoundation.org,
	lpieralisi@kernel.org, linux-arm-kernel@lists.infradead.org,
	Dan Johansen <strit@manjaro.org>
Subject: Re: [PATCH v1] drivers: pci: introduce configurable delay for Rockchip PCIe bus scan
Date: Wed, 10 May 2023 15:46:06 -0400	[thread overview]
Message-ID: <CAMdYzYrPNwq7OC-_9PE18coO-y_NZuu9bAj4XX1r8q8Gy_EofA@mail.gmail.com> (raw)
In-Reply-To: <CSIJZYWBC38N.2M99O6W1PLR4B@vincent-arch>

On Wed, May 10, 2023 at 7:16 AM Vincenzo Palazzo
<vincenzopalazzodev@gmail.com> wrote:
>
> > On Tue, May 9, 2023 at 5:19 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >
> > > Hi Vincenzo,
> > >
> > > Thanks for raising this issue.  Let's see what we can do to address
> > > it.
> > >
> > > On Tue, May 09, 2023 at 05:39:12PM +0200, Vincenzo Palazzo wrote:
> > > > Add a configurable delay to the Rockchip PCIe driver to address
> > > > crashes that occur on some old devices, such as the Pine64 RockPro64.
> > > >
> > > > This issue is affecting the ARM community, but there is no
> > > > upstream solution for it yet.
> > >
> > > It sounds like this happens with several endpoints, right?  And I
> > > assume the endpoints work fine in other non-Rockchip systems?  If
> > > that's the case, my guess is the problem is with the Rockchip host
> > > controller and how it's initialized, not with the endpoints.
> > >
> > > The only delays and timeouts I see in the driver now are in
> > > rockchip_pcie_host_init_port(), where it waits for link training to
> > > complete.  I assume the link training did completely successfully
> > > since you don't mention either a gen1 or gen2 timeout (although the
> > > gen2 message is a dev_dbg() that normally wouldn't go to the console).
> > >
> > > I don't know that the spec contains a retrain timeout value.  Several
> > > other drivers use 1 second, while rockchip uses 500ms (for example,
> > > see LINK_RETRAIN_TIMEOUT and LINK_UP_TIMEOUT).
> > >
> > > I think we need to understand the issue better before adding a DT
> > > property and a module parameter.  Those are hard for users to deal
> > > with.  If we can figure out a value that works for everybody, it would
> > > be better to just hard-code it in the driver and use that all the
> > > time.
> >
> > Good Evening,
> >
> > The main issue with the rk3399 is the PCIe controller is buggy and
> > triggers a SoC panic when certain error conditions occur that should
> > be handled gracefully. One of those conditions is when an endpoint
> > requests an access to wait and retry later. Many years ago we ran that
> > issue to ground and with Robin Murphy's help we found that while it's
> > possible to gracefully handle that condition it required hijacking the
> > entire arm64 error handling routine. Not exactly scalable for just one
> > SoC. The configurable waits allow us to program reasonable times for
> > 90% of the endpoints that come up in the normal amount of time, while
> > being able to adjust it for the other 10% that do not. Some require
> > multiple seconds before they return without error. Part of the reason
> > we don't want to hardcode the wait time is because the probe isn't
> > handled asynchronously, so the kernel appears to hang while waiting
> > for the timeout.
>
> Yeah, I smell a hardware bug in my code. I hate waiting in this way inside
> the code, so it's usually wrong when you need to do something like that.

Correct, it's the unfortunate way of arm64 development. Almost none of
the SoCs implement all of their hardware in a completely specification
compliant way.

>
> During my research, I also found this patch (https://bugzilla.redhat.com/show_bug.cgi?id=2134177)
> that provides a fix in another possibly cleaner way.
>
> But I don't understand the reasoning behind it, so maybe I
> haven't spent enough time thinking about it.

That is a completely different issue, unrelated to the PCI crash.

>
> > I'm curious if it's been tested with this series on top:
> > https://lore.kernel.org/linux-arm-kernel/20230418074700.1083505-8-rick.wertenbroek@gmail.com/T/
> > I'm particularly curious if
> > [v5,04/11] PCI: rockchip: Add poll and timeout to wait for PHY PLLs to be locked
> > makes a difference in the behavior. Please test this and see if it
> > improves the timeouts you need for the endpoints you're testing
> > against.
>
> Mh, I can try to cherry-pick the commit and test it in my own test environment. Currently, I haven't been
> able to test it due to a lack of hardware, but I'm seeking a way to obtain one.
> Luckily, I have someone on the Manjaro arm team who can help me test it,
> so I'll try to do that.
>
> Cheers!
>
> Vincent.

  reply	other threads:[~2023-05-10 19:46 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-09 15:39 [PATCH v1] drivers: pci: introduce configurable delay for Rockchip PCIe bus scan Vincenzo Palazzo
2023-05-09 21:19 ` Bjorn Helgaas
2023-05-10  0:11   ` Peter Geis
2023-05-10 11:16     ` Vincenzo Palazzo
2023-05-10 19:46       ` Peter Geis [this message]
2023-05-10 20:47     ` Bjorn Helgaas
2023-05-11  1:07       ` Peter Geis
2023-05-12 10:46         ` Vincenzo Palazzo
2023-05-13  1:24           ` Bjorn Helgaas
2023-05-13 11:40             ` Peter Geis
2023-05-15 11:04               ` Vincenzo Palazzo
2023-05-15 16:51               ` Bjorn Helgaas
2023-05-15 20:52                 ` Peter Geis
2023-07-12 15:42               ` Vincenzo Palazzo
2023-05-10 11:35   ` Vincenzo Palazzo
2023-05-12 16:40   ` Vincenzo Palazzo
2023-05-10  7:57 ` Greg KH
2023-05-10 10:49   ` Vincenzo Palazzo
2023-11-20  4:15 ` Tom Fitzhenry
  -- strict thread matches above, loose matches on Subject: below --
2023-05-01 20:14 Vincenzo Palazzo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAMdYzYrPNwq7OC-_9PE18coO-y_NZuu9bAj4XX1r8q8Gy_EofA@mail.gmail.com \
    --to=pgwipeout@gmail.com \
    --cc=bhelgaas@google.com \
    --cc=broonie@kernel.org \
    --cc=heiko@sntech.de \
    --cc=helgaas@kernel.org \
    --cc=kw@linux.com \
    --cc=lgirdwood@gmail.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel-mentees@lists.linuxfoundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=linux-rockchip@lists.infradead.org \
    --cc=lpieralisi@kernel.org \
    --cc=robh@kernel.org \
    --cc=shawn.lin@rock-chips.com \
    --cc=strit@manjaro.org \
    --cc=vincenzopalazzodev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).