All of lore.kernel.org
 help / color / mirror / Atom feed
From: Thierry Reding <thierry.reding@gmail.com>
To: "David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>
Cc: Joakim Zhang <qiangqing.zhang@nxp.com>,
	Jon Hunter <jonathanh@nvidia.com>,
	Giuseppe Cavallaro <peppe.cavallaro@st.com>,
	Alexandre Torgue <alexandre.torgue@st.com>,
	Jose Abreu <joabreu@synopsys.com>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-tegra <linux-tegra@vger.kernel.org>
Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back
Date: Tue, 13 Apr 2021 18:06:39 +0200	[thread overview]
Message-ID: <YHXBj3rsEjf8Y+qn@orome.fritz.box> (raw)
In-Reply-To: <DB8PR04MB6795C779FD47D5712DAE11CDE64F9@DB8PR04MB6795.eurprd04.prod.outlook.com>

[-- Attachment #1: Type: text/plain, Size: 6333 bytes --]

On Tue, Apr 13, 2021 at 12:13:01PM +0000, Joakim Zhang wrote:
> 
> Hi Jon,
> 
> > -----Original Message-----
> > From: Jon Hunter <jonathanh@nvidia.com>
> > Sent: 2021年4月13日 16:41
> > To: Joakim Zhang <qiangqing.zhang@nxp.com>; Giuseppe Cavallaro
> > <peppe.cavallaro@st.com>; Alexandre Torgue <alexandre.torgue@st.com>;
> > Jose Abreu <joabreu@synopsys.com>
> > Cc: netdev@vger.kernel.org; Linux Kernel Mailing List
> > <linux-kernel@vger.kernel.org>; linux-tegra <linux-tegra@vger.kernel.org>;
> > Jakub Kicinski <kuba@kernel.org>
> > Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> > resume back
> > 
> > 
> > On 01/04/2021 17:28, Jon Hunter wrote:
> > >
> > > On 31/03/2021 12:41, Joakim Zhang wrote:
> > >
> > > ...
> > >
> > >>> In answer to your question, resuming from suspend does work on this
> > >>> board without your change. We have been testing suspend/resume now
> > >>> on this board since Linux v5.8 and so we have the ability to bisect
> > >>> such regressions. So it is clear to me that this is the change that caused
> > this, but I am not sure why.
> > >>
> > >> Yes, I know this issue is regression caused by my patch. I just want to
> > analyze the potential reasons. Due to the code change only related to the page
> > recycle and reallocate.
> > >> So I guess if this page operate need IOMMU works when IOMMU is enabled.
> > Could you help check if IOMMU driver resume before STMMAC? Our common
> > desire is to find the root cause, right?
> > >
> > >
> > > Yes of course that is the desire here indeed. I had assumed that the
> > > suspend/resume order was good because we have never seen any problems,
> > > but nonetheless it is always good to check. Using ftrace I enabled
> > > tracing of the appropriate suspend/resume functions and this is what I
> > > see ...
> > >
> > > # tracer: function
> > > #
> > > # entries-in-buffer/entries-written: 4/4   #P:6
> > > #
> > > #                                _-----=> irqs-off
> > > #                               / _----=> need-resched
> > > #                              | / _---=> hardirq/softirq
> > > #                              || / _--=> preempt-depth
> > > #                              ||| /     delay
> > > #           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
> > > #              | |         |   ||||      |         |
> > >          rtcwake-748     [000] ...1   536.700777:
> > stmmac_pltfr_suspend <-platform_pm_suspend
> > >          rtcwake-748     [000] ...1   536.735532:
> > arm_smmu_pm_suspend <-platform_pm_suspend
> > >          rtcwake-748     [000] ...1   536.757290:
> > arm_smmu_pm_resume <-platform_pm_resume
> > >          rtcwake-748     [003] ...1   536.856771:
> > stmmac_pltfr_resume <-platform_pm_resume
> > >
> > >
> > > So I don't see any ordering issues that could be causing this.
> > 
> > 
> > Another thing I have found is that for our platform, if the driver for the ethernet
> > PHY (in this case broadcom PHY) is enabled, then it fails to resume but if I
> > disable the PHY in the kernel configuration, then resume works. I have found
> > that if I move the reinit of the RX buffers to before the startup of the phy, then
> > it can resume OK with the PHY enabled.
> > 
> > Does the following work for you? Does your platform use a specific ethernet
> > PHY driver?
> 
> I am also looking into this issue these days, we use the Realtek RTL8211FDI PHY, driver is drivers/net/phy/realtek.c.
> 
> For our EQOS MAC integrated in our SoC, Rx side logic depends on RXC clock from PHY, so we need phylink_start before MAC.
> 
> I will test below code change tomorrow to see if it can work at my side, since it is only re-init memory, need not RXC clock.
> 
> 
> > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > index 208cae344ffa..071d15d86dbe 100644
> > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > @@ -5416,19 +5416,20 @@ int stmmac_resume(struct device *dev)
> >                         return ret;
> >         }
> > +       rtnl_lock();
> > +       mutex_lock(&priv->lock);
> > +       stmmac_reinit_rx_buffers(priv);
> > +       mutex_unlock(&priv->lock);
> > +
> >         if (!device_may_wakeup(priv->device) || !priv->plat->pmt) {
> > -               rtnl_lock();
> >                 phylink_start(priv->phylink);
> >                 /* We may have called phylink_speed_down before */
> >                 phylink_speed_up(priv->phylink);
> > -               rtnl_unlock();
> >         }
> > -       rtnl_lock();
> >         mutex_lock(&priv->lock);
> >         stmmac_reset_queues_param(priv);
> > -       stmmac_reinit_rx_buffers(priv);
> >         stmmac_free_tx_skbufs(priv);
> >         stmmac_clear_descriptors(priv);
> > 
> > 
> > It is still not clear to us why the existing call to
> > stmmac_clear_descriptors() is not sufficient to fix your problem.
> 
> During suspend/resume stress test, I found rx descriptor may not refill when system suspended, rx descriptor could be: 008 [0x00000000c4310080]: 0x0 0x40 0x0 0x34010040.
> When system resume back, stmmac_clear_descriptors() would change this rx descriptor to: 008 [0x00000000c4310080]: 0x0 0x40 0x0 0xb5010040, a broken rx descriptor.
> So at my side, stmmac_clear_descriptors() seems to be chief culprit. I have a idea if there is way to ensure all rx descriptors are refilled when suspend MAC.
> 
> > How often does the issue you see occur?
> Suspend about 2000 times.

Hi David, Jakub,

given where we are in the release cycle, I think it'd be best to revert
commit 9c63faaa931e ("net: stmmac: re-init rx buffers when mac resume
back") for now.

To summarize the discussion: the patch was meant as a workaround to fix
an occasional suspend/resume failure on one board that was not fully
root caused, and ends up causing fully reproducible suspend/resume
failures on at least one other board.

Joakim is looking at an alternative solution and Jon and I can provide
testing from the Tegra side for any fixes.

Do you want me to send a revert patch or can you revert directly on top
of your tree?

Thanks,
Thierry

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  reply	other threads:[~2021-04-13 16:07 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-24 10:50 Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back Jon Hunter
2021-03-24 12:20 ` Joakim Zhang
2021-03-24 12:39   ` Jon Hunter
2021-03-25  7:53     ` Joakim Zhang
2021-03-25  8:00       ` Jon Hunter
2021-03-25  8:12         ` Joakim Zhang
2021-03-30 12:50           ` Jon Hunter
2021-03-31  7:43             ` Joakim Zhang
2021-03-31 11:10               ` Joakim Zhang
2021-03-31 11:29                 ` Jon Hunter
2021-03-31 11:41                   ` Joakim Zhang
2021-04-01 16:28                     ` Jon Hunter
2021-04-13  8:41                       ` Jon Hunter
2021-04-13 12:13                         ` Joakim Zhang
2021-04-13 16:06                           ` Thierry Reding [this message]
2021-04-13 16:43                             ` Jakub Kicinski
2021-04-14  2:18                             ` Joakim Zhang
2021-04-14  7:40                               ` Thierry Reding
2021-04-14  8:06                                 ` Joakim Zhang
2021-04-14 12:22                                   ` Joakim Zhang
2021-03-31 11:11               ` Jon Hunter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YHXBj3rsEjf8Y+qn@orome.fritz.box \
    --to=thierry.reding@gmail.com \
    --cc=alexandre.torgue@st.com \
    --cc=davem@davemloft.net \
    --cc=joabreu@synopsys.com \
    --cc=jonathanh@nvidia.com \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-tegra@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=peppe.cavallaro@st.com \
    --cc=qiangqing.zhang@nxp.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.