linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back
@ 2021-03-24 10:50 Jon Hunter
  2021-03-24 12:20 ` Joakim Zhang
  0 siblings, 1 reply; 21+ messages in thread
From: Jon Hunter @ 2021-03-24 10:50 UTC (permalink / raw)
  To: Joakim Zhang
  Cc: netdev, Linux Kernel Mailing List, linux-tegra, Jakub Kicinski

Hi Joakim,

Starting with v5.12-rc3 I noticed that one of our boards, Tegra186
Jetson TX2, was not long resuming from suspend. Bisect points to commit
9c63faaa931e ("net: stmmac: re-init rx buffers when mac resume back")
and reverting this on top of mainline fixes the problem.

Interestingly, the board appears to partially resume from suspend and I
see ethernet appear to resume ...

 dwc-eth-dwmac 2490000.ethernet eth0: configuring for phy/rgmii link
 mode
 dwmac4: Master AXI performs any burst length
 dwc-eth-dwmac 2490000.ethernet eth0: No Safety Features support found
 dwc-eth-dwmac 2490000.ethernet eth0: Link is Up - 1Gbps/Full - flow
 control rx/tx

I don't see any crash, but then it hangs at some point. Please note that
this board is using NFS for mounting the rootfs.

Let me know if there is any more info I can provide or tests I can run.

Thanks
Jon



^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back
  2021-03-24 10:50 Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back Jon Hunter
@ 2021-03-24 12:20 ` Joakim Zhang
  2021-03-24 12:39   ` Jon Hunter
  0 siblings, 1 reply; 21+ messages in thread
From: Joakim Zhang @ 2021-03-24 12:20 UTC (permalink / raw)
  To: Jon Hunter; +Cc: netdev, Linux Kernel Mailing List, linux-tegra, Jakub Kicinski


> -----Original Message-----
> From: Jon Hunter <jonathanh@nvidia.com>
> Sent: 2021年3月24日 18:51
> To: Joakim Zhang <qiangqing.zhang@nxp.com>
> Cc: netdev@vger.kernel.org; Linux Kernel Mailing List
> <linux-kernel@vger.kernel.org>; linux-tegra <linux-tegra@vger.kernel.org>;
> Jakub Kicinski <kuba@kernel.org>
> Subject: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> resume back
> 
> Hi Joakim,
> 
> Starting with v5.12-rc3 I noticed that one of our boards, Tegra186 Jetson TX2,
> was not long resuming from suspend. Bisect points to commit 9c63faaa931e
> ("net: stmmac: re-init rx buffers when mac resume back") and reverting this on
> top of mainline fixes the problem.
> 
> Interestingly, the board appears to partially resume from suspend and I see
> ethernet appear to resume ...
> 
>  dwc-eth-dwmac 2490000.ethernet eth0: configuring for phy/rgmii link  mode
>  dwmac4: Master AXI performs any burst length  dwc-eth-dwmac
> 2490000.ethernet eth0: No Safety Features support found  dwc-eth-dwmac
> 2490000.ethernet eth0: Link is Up - 1Gbps/Full - flow  control rx/tx
> 
> I don't see any crash, but then it hangs at some point. Please note that this
> board is using NFS for mounting the rootfs.
> 
> Let me know if there is any more info I can provide or tests I can run.

Hi Jon,

Sorry for this breakage at your side.

You mean one of your boards? Does other boards with STMMAC can work fine?

We do daily test with NFS to mount rootfs, on issue found. And I add this patch at the resume patch, and on error check, this should not break suspend.
I even did the overnight stress test, there is no issue found.

Could you please do more test to see where the issue happen?

Best Regards,
Joakim Zhang
> Thanks
> Jon
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back
  2021-03-24 12:20 ` Joakim Zhang
@ 2021-03-24 12:39   ` Jon Hunter
  2021-03-25  7:53     ` Joakim Zhang
  0 siblings, 1 reply; 21+ messages in thread
From: Jon Hunter @ 2021-03-24 12:39 UTC (permalink / raw)
  To: Joakim Zhang
  Cc: netdev, Linux Kernel Mailing List, linux-tegra, Jakub Kicinski



On 24/03/2021 12:20, Joakim Zhang wrote:

...

> Sorry for this breakage at your side.
> 
> You mean one of your boards? Does other boards with STMMAC can work fine?

We have two devices with the STMMAC and one works OK and the other
fails. They are different generation of device and so there could be
some architectural differences which is causing this to only be seen on
one device.

> We do daily test with NFS to mount rootfs, on issue found. And I add this patch at the resume patch, and on error check, this should not break suspend.
> I even did the overnight stress test, there is no issue found.
> 
> Could you please do more test to see where the issue happen?

The issue occurs 100% of the time on the failing board and always on the
first resume from suspend. Is there any more debug I can enable to track
down what the problem is?

Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back
  2021-03-24 12:39   ` Jon Hunter
@ 2021-03-25  7:53     ` Joakim Zhang
  2021-03-25  8:00       ` Jon Hunter
  0 siblings, 1 reply; 21+ messages in thread
From: Joakim Zhang @ 2021-03-25  7:53 UTC (permalink / raw)
  To: Jon Hunter; +Cc: netdev, Linux Kernel Mailing List, linux-tegra, Jakub Kicinski


> -----Original Message-----
> From: Jon Hunter <jonathanh@nvidia.com>
> Sent: 2021年3月24日 20:39
> To: Joakim Zhang <qiangqing.zhang@nxp.com>
> Cc: netdev@vger.kernel.org; Linux Kernel Mailing List
> <linux-kernel@vger.kernel.org>; linux-tegra <linux-tegra@vger.kernel.org>;
> Jakub Kicinski <kuba@kernel.org>
> Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> resume back
> 
> 
> 
> On 24/03/2021 12:20, Joakim Zhang wrote:
> 
> ...
> 
> > Sorry for this breakage at your side.
> >
> > You mean one of your boards? Does other boards with STMMAC can work
> fine?
> 
> We have two devices with the STMMAC and one works OK and the other fails.
> They are different generation of device and so there could be some
> architectural differences which is causing this to only be seen on one device.
It's really strange, but I also don't know what architectural differences could affect this. Sorry.

> > We do daily test with NFS to mount rootfs, on issue found. And I add this
> patch at the resume patch, and on error check, this should not break suspend.
> > I even did the overnight stress test, there is no issue found.
> >
> > Could you please do more test to see where the issue happen?
> 
> The issue occurs 100% of the time on the failing board and always on the first
> resume from suspend. Is there any more debug I can enable to track down
> what the problem is?
> 

As commit messages described, the patch aims to re-init rx buffers address, since the address is not fixed, so I only can 
recycle and then re-allocate all of them. The page pool is allocated once when open the net device.

Could you please debug if it fails at some functions, such as page_pool_dev_alloc_pages() ?

Best Regards,
Joakim Zhang
> Jon
> 
> --
> nvpublic

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back
  2021-03-25  7:53     ` Joakim Zhang
@ 2021-03-25  8:00       ` Jon Hunter
  2021-03-25  8:12         ` Joakim Zhang
  0 siblings, 1 reply; 21+ messages in thread
From: Jon Hunter @ 2021-03-25  8:00 UTC (permalink / raw)
  To: Joakim Zhang
  Cc: netdev, Linux Kernel Mailing List, linux-tegra, Jakub Kicinski


On 25/03/2021 07:53, Joakim Zhang wrote:
> 
>> -----Original Message-----
>> From: Jon Hunter <jonathanh@nvidia.com>
>> Sent: 2021年3月24日 20:39
>> To: Joakim Zhang <qiangqing.zhang@nxp.com>
>> Cc: netdev@vger.kernel.org; Linux Kernel Mailing List
>> <linux-kernel@vger.kernel.org>; linux-tegra <linux-tegra@vger.kernel.org>;
>> Jakub Kicinski <kuba@kernel.org>
>> Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
>> resume back
>>
>>
>>
>> On 24/03/2021 12:20, Joakim Zhang wrote:
>>
>> ...
>>
>>> Sorry for this breakage at your side.
>>>
>>> You mean one of your boards? Does other boards with STMMAC can work
>> fine?
>>
>> We have two devices with the STMMAC and one works OK and the other fails.
>> They are different generation of device and so there could be some
>> architectural differences which is causing this to only be seen on one device.
> It's really strange, but I also don't know what architectural differences could affect this. Sorry.


Maybe caching somewhere? In other words, could there be any cache
flushing that we are missing here?

>>> We do daily test with NFS to mount rootfs, on issue found. And I add this
>> patch at the resume patch, and on error check, this should not break suspend.
>>> I even did the overnight stress test, there is no issue found.
>>>
>>> Could you please do more test to see where the issue happen?
>>
>> The issue occurs 100% of the time on the failing board and always on the first
>> resume from suspend. Is there any more debug I can enable to track down
>> what the problem is?
>>
> 
> As commit messages described, the patch aims to re-init rx buffers address, since the address is not fixed, so I only can 
> recycle and then re-allocate all of them. The page pool is allocated once when open the net device.
> 
> Could you please debug if it fails at some functions, such as page_pool_dev_alloc_pages() ?


Yes that was the first thing I tried, but no obvious failures from
allocating the pools.

Are you certain that the problem you are seeing, that is being fixed by
this change, is generic to all devices? The commit message states that
'descriptor write back by DMA could exhibit unusual behavior', is this a
known issue in the STMMAC controller? If so does this impact all
versions and what is the actual problem?

Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back
  2021-03-25  8:00       ` Jon Hunter
@ 2021-03-25  8:12         ` Joakim Zhang
  2021-03-30 12:50           ` Jon Hunter
  0 siblings, 1 reply; 21+ messages in thread
From: Joakim Zhang @ 2021-03-25  8:12 UTC (permalink / raw)
  To: Jon Hunter; +Cc: netdev, Linux Kernel Mailing List, linux-tegra, Jakub Kicinski


> -----Original Message-----
> From: Jon Hunter <jonathanh@nvidia.com>
> Sent: 2021年3月25日 16:01
> To: Joakim Zhang <qiangqing.zhang@nxp.com>
> Cc: netdev@vger.kernel.org; Linux Kernel Mailing List
> <linux-kernel@vger.kernel.org>; linux-tegra <linux-tegra@vger.kernel.org>;
> Jakub Kicinski <kuba@kernel.org>
> Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> resume back
> 
> 
> On 25/03/2021 07:53, Joakim Zhang wrote:
> >
> >> -----Original Message-----
> >> From: Jon Hunter <jonathanh@nvidia.com>
> >> Sent: 2021年3月24日 20:39
> >> To: Joakim Zhang <qiangqing.zhang@nxp.com>
> >> Cc: netdev@vger.kernel.org; Linux Kernel Mailing List
> >> <linux-kernel@vger.kernel.org>; linux-tegra
> >> <linux-tegra@vger.kernel.org>; Jakub Kicinski <kuba@kernel.org>
> >> Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers
> >> when mac resume back
> >>
> >>
> >>
> >> On 24/03/2021 12:20, Joakim Zhang wrote:
> >>
> >> ...
> >>
> >>> Sorry for this breakage at your side.
> >>>
> >>> You mean one of your boards? Does other boards with STMMAC can work
> >> fine?
> >>
> >> We have two devices with the STMMAC and one works OK and the other
> fails.
> >> They are different generation of device and so there could be some
> >> architectural differences which is causing this to only be seen on one device.
> > It's really strange, but I also don't know what architectural differences could
> affect this. Sorry.
> 
> 
> Maybe caching somewhere? In other words, could there be any cache flushing
> that we are missing here?
Have no idea, have not account into such case.

> >>> We do daily test with NFS to mount rootfs, on issue found. And I add
> >>> this
> >> patch at the resume patch, and on error check, this should not break
> suspend.
> >>> I even did the overnight stress test, there is no issue found.
> >>>
> >>> Could you please do more test to see where the issue happen?
> >>
> >> The issue occurs 100% of the time on the failing board and always on
> >> the first resume from suspend. Is there any more debug I can enable
> >> to track down what the problem is?
> >>
> >
> > As commit messages described, the patch aims to re-init rx buffers
> > address, since the address is not fixed, so I only can recycle and then
> re-allocate all of them. The page pool is allocated once when open the net
> device.
> >
> > Could you please debug if it fails at some functions, such as
> page_pool_dev_alloc_pages() ?
> 
> 
> Yes that was the first thing I tried, but no obvious failures from allocating the
> pools.
> 
> Are you certain that the problem you are seeing, that is being fixed by this
> change, is generic to all devices? The commit message states that 'descriptor
> write back by DMA could exhibit unusual behavior', is this a known issue in the
> STMMAC controller? If so does this impact all versions and what is the actual
> problem?

Yes, I confirm this patch fix issue at my side. It should not be a generic, it can reproduce at one of our boards.
To be honest, I have not found the root cause, this should be a workaround, I upstream it since I think it will not affect others which don't suffer from this.

Best Regards,
Joakim Zhang
> Jon
> 
> --
> nvpublic

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back
  2021-03-25  8:12         ` Joakim Zhang
@ 2021-03-30 12:50           ` Jon Hunter
  2021-03-31  7:43             ` Joakim Zhang
  0 siblings, 1 reply; 21+ messages in thread
From: Jon Hunter @ 2021-03-30 12:50 UTC (permalink / raw)
  To: Joakim Zhang
  Cc: netdev, Linux Kernel Mailing List, linux-tegra, Jakub Kicinski



On 25/03/2021 08:12, Joakim Zhang wrote:

...

>>>>> You mean one of your boards? Does other boards with STMMAC can work
>>>> fine?
>>>>
>>>> We have two devices with the STMMAC and one works OK and the other
>> fails.
>>>> They are different generation of device and so there could be some
>>>> architectural differences which is causing this to only be seen on one device.
>>> It's really strange, but I also don't know what architectural differences could
>> affect this. Sorry.


I realised that for the board which fails after this change is made, it
has the IOMMU enabled. The other board does not at the moment (although
work is in progress to enable). If I add 'iommu.passthrough=1' to
cmdline for the failing board, then it works again. So in my case, the
problem is linked to the IOMMU being enabled.

Does you platform enable the IOMMU?

Thanks
Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back
  2021-03-30 12:50           ` Jon Hunter
@ 2021-03-31  7:43             ` Joakim Zhang
  2021-03-31 11:10               ` Joakim Zhang
  2021-03-31 11:11               ` Jon Hunter
  0 siblings, 2 replies; 21+ messages in thread
From: Joakim Zhang @ 2021-03-31  7:43 UTC (permalink / raw)
  To: Jon Hunter; +Cc: netdev, Linux Kernel Mailing List, linux-tegra, Jakub Kicinski


> -----Original Message-----
> From: Jon Hunter <jonathanh@nvidia.com>
> Sent: 2021年3月30日 20:51
> To: Joakim Zhang <qiangqing.zhang@nxp.com>
> Cc: netdev@vger.kernel.org; Linux Kernel Mailing List
> <linux-kernel@vger.kernel.org>; linux-tegra <linux-tegra@vger.kernel.org>;
> Jakub Kicinski <kuba@kernel.org>
> Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> resume back
> 
> 
> 
> On 25/03/2021 08:12, Joakim Zhang wrote:
> 
> ...
> 
> >>>>> You mean one of your boards? Does other boards with STMMAC can
> >>>>> work
> >>>> fine?
> >>>>
> >>>> We have two devices with the STMMAC and one works OK and the other
> >> fails.
> >>>> They are different generation of device and so there could be some
> >>>> architectural differences which is causing this to only be seen on one
> device.
> >>> It's really strange, but I also don't know what architectural
> >>> differences could
> >> affect this. Sorry.
> 
> 
> I realised that for the board which fails after this change is made, it has the
> IOMMU enabled. The other board does not at the moment (although work is in
> progress to enable). If I add 'iommu.passthrough=1' to cmdline for the failing
> board, then it works again. So in my case, the problem is linked to the IOMMU
> being enabled.
> 
> Does you platform enable the IOMMU?

Hi Jon,

There is no IOMMU hardware available on our boards. But why IOMMU would affect it during suspend/resume, and no problem in normal mode?

Best Regards,
Joakim Zhang
> Thanks
> Jon
> 
> --
> nvpublic

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back
  2021-03-31  7:43             ` Joakim Zhang
@ 2021-03-31 11:10               ` Joakim Zhang
  2021-03-31 11:29                 ` Jon Hunter
  2021-03-31 11:11               ` Jon Hunter
  1 sibling, 1 reply; 21+ messages in thread
From: Joakim Zhang @ 2021-03-31 11:10 UTC (permalink / raw)
  To: Joakim Zhang, Jon Hunter
  Cc: netdev, Linux Kernel Mailing List, linux-tegra, Jakub Kicinski


> -----Original Message-----
> From: Joakim Zhang <qiangqing.zhang@nxp.com>
> Sent: 2021年3月31日 15:44
> To: Jon Hunter <jonathanh@nvidia.com>
> Cc: netdev@vger.kernel.org; Linux Kernel Mailing List
> <linux-kernel@vger.kernel.org>; linux-tegra <linux-tegra@vger.kernel.org>;
> Jakub Kicinski <kuba@kernel.org>
> Subject: RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> resume back
> 
> 
> > -----Original Message-----
> > From: Jon Hunter <jonathanh@nvidia.com>
> > Sent: 2021年3月30日 20:51
> > To: Joakim Zhang <qiangqing.zhang@nxp.com>
> > Cc: netdev@vger.kernel.org; Linux Kernel Mailing List
> > <linux-kernel@vger.kernel.org>; linux-tegra
> > <linux-tegra@vger.kernel.org>; Jakub Kicinski <kuba@kernel.org>
> > Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers
> > when mac resume back
> >
> >
> >
> > On 25/03/2021 08:12, Joakim Zhang wrote:
> >
> > ...
> >
> > >>>>> You mean one of your boards? Does other boards with STMMAC can
> > >>>>> work
> > >>>> fine?
> > >>>>
> > >>>> We have two devices with the STMMAC and one works OK and the
> > >>>> other
> > >> fails.
> > >>>> They are different generation of device and so there could be
> > >>>> some architectural differences which is causing this to only be
> > >>>> seen on one
> > device.
> > >>> It's really strange, but I also don't know what architectural
> > >>> differences could
> > >> affect this. Sorry.
> >
> >
> > I realised that for the board which fails after this change is made,
> > it has the IOMMU enabled. The other board does not at the moment
> > (although work is in progress to enable). If I add
> > 'iommu.passthrough=1' to cmdline for the failing board, then it works
> > again. So in my case, the problem is linked to the IOMMU being enabled.
> >
> > Does you platform enable the IOMMU?
> 
> Hi Jon,
> 
> There is no IOMMU hardware available on our boards. But why IOMMU would
> affect it during suspend/resume, and no problem in normal mode?

One more add, I saw drivers/iommu/tegra-gart.c(not sure if is this) support suspend/resume, is it possible iommu resume back after stmmac?

Best Regards,
Joakim Zhang
> Best Regards,
> Joakim Zhang
> > Thanks
> > Jon
> >
> > --
> > nvpublic

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back
  2021-03-31  7:43             ` Joakim Zhang
  2021-03-31 11:10               ` Joakim Zhang
@ 2021-03-31 11:11               ` Jon Hunter
  1 sibling, 0 replies; 21+ messages in thread
From: Jon Hunter @ 2021-03-31 11:11 UTC (permalink / raw)
  To: Joakim Zhang, Jakub Kicinski, Giuseppe Cavallaro,
	Alexandre Torgue, Jose Abreu
  Cc: netdev, Linux Kernel Mailing List, linux-tegra


On 31/03/2021 08:43, Joakim Zhang wrote:

...

>>>>>>> You mean one of your boards? Does other boards with STMMAC can
>>>>>>> work
>>>>>> fine?
>>>>>>
>>>>>> We have two devices with the STMMAC and one works OK and the other
>>>> fails.
>>>>>> They are different generation of device and so there could be some
>>>>>> architectural differences which is causing this to only be seen on one
>> device.
>>>>> It's really strange, but I also don't know what architectural
>>>>> differences could
>>>> affect this. Sorry.
>>
>>
>> I realised that for the board which fails after this change is made, it has the
>> IOMMU enabled. The other board does not at the moment (although work is in
>> progress to enable). If I add 'iommu.passthrough=1' to cmdline for the failing
>> board, then it works again. So in my case, the problem is linked to the IOMMU
>> being enabled.
>>
>> Does you platform enable the IOMMU?
> 
> Hi Jon,
> 
> There is no IOMMU hardware available on our boards. But why IOMMU would affect it during suspend/resume, and no problem in normal mode?


I am not sure either and I don't see anything obvious.

Guiseppe, Alexandre, Jose, do you see anything that is wrong with
Joakim's change 9c63faaa931e? This is completely breaking resume from
suspend on one of our boards and I would like to get your inputs?

Thanks
Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back
  2021-03-31 11:10               ` Joakim Zhang
@ 2021-03-31 11:29                 ` Jon Hunter
  2021-03-31 11:41                   ` Joakim Zhang
  0 siblings, 1 reply; 21+ messages in thread
From: Jon Hunter @ 2021-03-31 11:29 UTC (permalink / raw)
  To: Joakim Zhang, Giuseppe Cavallaro, Alexandre Torgue, Jose Abreu
  Cc: netdev, Linux Kernel Mailing List, linux-tegra, Jakub Kicinski


On 31/03/2021 12:10, Joakim Zhang wrote:

...

>>>>>>>> You mean one of your boards? Does other boards with STMMAC can
>>>>>>>> work
>>>>>>> fine?
>>>>>>>
>>>>>>> We have two devices with the STMMAC and one works OK and the
>>>>>>> other
>>>>> fails.
>>>>>>> They are different generation of device and so there could be
>>>>>>> some architectural differences which is causing this to only be
>>>>>>> seen on one
>>> device.
>>>>>> It's really strange, but I also don't know what architectural
>>>>>> differences could
>>>>> affect this. Sorry.
>>>
>>>
>>> I realised that for the board which fails after this change is made,
>>> it has the IOMMU enabled. The other board does not at the moment
>>> (although work is in progress to enable). If I add
>>> 'iommu.passthrough=1' to cmdline for the failing board, then it works
>>> again. So in my case, the problem is linked to the IOMMU being enabled.
>>>
>>> Does you platform enable the IOMMU?
>>
>> Hi Jon,
>>
>> There is no IOMMU hardware available on our boards. But why IOMMU would
>> affect it during suspend/resume, and no problem in normal mode?
> 
> One more add, I saw drivers/iommu/tegra-gart.c(not sure if is this) support suspend/resume, is it possible iommu resume back after stmmac?


This board is the tegra186-p2771-0000 (Jetson TX2) and uses the
arm,mmu-500 and not the above driver.

In answer to your question, resuming from suspend does work on this
board without your change. We have been testing suspend/resume now on
this board since Linux v5.8 and so we have the ability to bisect such
regressions. So it is clear to me that this is the change that caused
this, but I am not sure why.

Thanks
Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back
  2021-03-31 11:29                 ` Jon Hunter
@ 2021-03-31 11:41                   ` Joakim Zhang
  2021-04-01 16:28                     ` Jon Hunter
  0 siblings, 1 reply; 21+ messages in thread
From: Joakim Zhang @ 2021-03-31 11:41 UTC (permalink / raw)
  To: Jon Hunter, Giuseppe Cavallaro, Alexandre Torgue, Jose Abreu
  Cc: netdev, Linux Kernel Mailing List, linux-tegra, Jakub Kicinski


> -----Original Message-----
> From: Jon Hunter <jonathanh@nvidia.com>
> Sent: 2021年3月31日 19:29
> To: Joakim Zhang <qiangqing.zhang@nxp.com>; Giuseppe Cavallaro
> <peppe.cavallaro@st.com>; Alexandre Torgue <alexandre.torgue@st.com>;
> Jose Abreu <joabreu@synopsys.com>
> Cc: netdev@vger.kernel.org; Linux Kernel Mailing List
> <linux-kernel@vger.kernel.org>; linux-tegra <linux-tegra@vger.kernel.org>;
> Jakub Kicinski <kuba@kernel.org>
> Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> resume back
> 
> 
> On 31/03/2021 12:10, Joakim Zhang wrote:
> 
> ...
> 
> >>>>>>>> You mean one of your boards? Does other boards with STMMAC can
> >>>>>>>> work
> >>>>>>> fine?
> >>>>>>>
> >>>>>>> We have two devices with the STMMAC and one works OK and the
> >>>>>>> other
> >>>>> fails.
> >>>>>>> They are different generation of device and so there could be
> >>>>>>> some architectural differences which is causing this to only be
> >>>>>>> seen on one
> >>> device.
> >>>>>> It's really strange, but I also don't know what architectural
> >>>>>> differences could
> >>>>> affect this. Sorry.
> >>>
> >>>
> >>> I realised that for the board which fails after this change is made,
> >>> it has the IOMMU enabled. The other board does not at the moment
> >>> (although work is in progress to enable). If I add
> >>> 'iommu.passthrough=1' to cmdline for the failing board, then it
> >>> works again. So in my case, the problem is linked to the IOMMU being
> enabled.
> >>>
> >>> Does you platform enable the IOMMU?
> >>
> >> Hi Jon,
> >>
> >> There is no IOMMU hardware available on our boards. But why IOMMU
> >> would affect it during suspend/resume, and no problem in normal mode?
> >
> > One more add, I saw drivers/iommu/tegra-gart.c(not sure if is this) support
> suspend/resume, is it possible iommu resume back after stmmac?
> 
> 
> This board is the tegra186-p2771-0000 (Jetson TX2) and uses the
> arm,mmu-500 and not the above driver.

OK.

> In answer to your question, resuming from suspend does work on this board
> without your change. We have been testing suspend/resume now on this board
> since Linux v5.8 and so we have the ability to bisect such regressions. So it is
> clear to me that this is the change that caused this, but I am not sure why.

Yes, I know this issue is regression caused by my patch. I just want to analyze the potential reasons. Due to the code change only related to the page recycle and reallocate.
So I guess if this page operate need IOMMU works when IOMMU is enabled. Could you help check if IOMMU driver resume before STMMAC? Our common desire is to find the root cause, right?

Best Regards,
Joakim Zhang
> Thanks
> Jon
> 
> --
> nvpublic

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back
  2021-03-31 11:41                   ` Joakim Zhang
@ 2021-04-01 16:28                     ` Jon Hunter
  2021-04-13  8:41                       ` Jon Hunter
  0 siblings, 1 reply; 21+ messages in thread
From: Jon Hunter @ 2021-04-01 16:28 UTC (permalink / raw)
  To: Joakim Zhang, Giuseppe Cavallaro, Alexandre Torgue, Jose Abreu
  Cc: netdev, Linux Kernel Mailing List, linux-tegra, Jakub Kicinski


On 31/03/2021 12:41, Joakim Zhang wrote:

...

>> In answer to your question, resuming from suspend does work on this board
>> without your change. We have been testing suspend/resume now on this board
>> since Linux v5.8 and so we have the ability to bisect such regressions. So it is
>> clear to me that this is the change that caused this, but I am not sure why.
> 
> Yes, I know this issue is regression caused by my patch. I just want to analyze the potential reasons. Due to the code change only related to the page recycle and reallocate.
> So I guess if this page operate need IOMMU works when IOMMU is enabled. Could you help check if IOMMU driver resume before STMMAC? Our common desire is to find the root cause, right?


Yes of course that is the desire here indeed. I had assumed that the
suspend/resume order was good because we have never seen any problems,
but nonetheless it is always good to check. Using ftrace I enabled
tracing of the appropriate suspend/resume functions and this is what
I see ...

# tracer: function
#
# entries-in-buffer/entries-written: 4/4   #P:6
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| /     delay
#           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
#              | |         |   ||||      |         |
         rtcwake-748     [000] ...1   536.700777: stmmac_pltfr_suspend <-platform_pm_suspend
         rtcwake-748     [000] ...1   536.735532: arm_smmu_pm_suspend <-platform_pm_suspend
         rtcwake-748     [000] ...1   536.757290: arm_smmu_pm_resume <-platform_pm_resume
         rtcwake-748     [003] ...1   536.856771: stmmac_pltfr_resume <-platform_pm_resume


So I don't see any ordering issues that could be causing this. 

Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back
  2021-04-01 16:28                     ` Jon Hunter
@ 2021-04-13  8:41                       ` Jon Hunter
  2021-04-13 12:13                         ` Joakim Zhang
  0 siblings, 1 reply; 21+ messages in thread
From: Jon Hunter @ 2021-04-13  8:41 UTC (permalink / raw)
  To: Joakim Zhang, Giuseppe Cavallaro, Alexandre Torgue, Jose Abreu
  Cc: netdev, Linux Kernel Mailing List, linux-tegra, Jakub Kicinski


On 01/04/2021 17:28, Jon Hunter wrote:
> 
> On 31/03/2021 12:41, Joakim Zhang wrote:
> 
> ...
> 
>>> In answer to your question, resuming from suspend does work on this board
>>> without your change. We have been testing suspend/resume now on this board
>>> since Linux v5.8 and so we have the ability to bisect such regressions. So it is
>>> clear to me that this is the change that caused this, but I am not sure why.
>>
>> Yes, I know this issue is regression caused by my patch. I just want to analyze the potential reasons. Due to the code change only related to the page recycle and reallocate.
>> So I guess if this page operate need IOMMU works when IOMMU is enabled. Could you help check if IOMMU driver resume before STMMAC? Our common desire is to find the root cause, right?
> 
> 
> Yes of course that is the desire here indeed. I had assumed that the
> suspend/resume order was good because we have never seen any problems,
> but nonetheless it is always good to check. Using ftrace I enabled
> tracing of the appropriate suspend/resume functions and this is what
> I see ...
> 
> # tracer: function
> #
> # entries-in-buffer/entries-written: 4/4   #P:6
> #
> #                                _-----=> irqs-off
> #                               / _----=> need-resched
> #                              | / _---=> hardirq/softirq
> #                              || / _--=> preempt-depth
> #                              ||| /     delay
> #           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
> #              | |         |   ||||      |         |
>          rtcwake-748     [000] ...1   536.700777: stmmac_pltfr_suspend <-platform_pm_suspend
>          rtcwake-748     [000] ...1   536.735532: arm_smmu_pm_suspend <-platform_pm_suspend
>          rtcwake-748     [000] ...1   536.757290: arm_smmu_pm_resume <-platform_pm_resume
>          rtcwake-748     [003] ...1   536.856771: stmmac_pltfr_resume <-platform_pm_resume
> 
> 
> So I don't see any ordering issues that could be causing this. 


Another thing I have found is that for our platform, if the driver for
the ethernet PHY (in this case broadcom PHY) is enabled, then it fails
to resume but if I disable the PHY in the kernel configuration, then
resume works. I have found that if I move the reinit of the RX buffers
to before the startup of the phy, then it can resume OK with the PHY
enabled.

Does the following work for you? Does your platform use a specific
ethernet PHY driver?

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 208cae344ffa..071d15d86dbe 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -5416,19 +5416,20 @@ int stmmac_resume(struct device *dev)
                        return ret;
        }
+       rtnl_lock();
+       mutex_lock(&priv->lock);
+       stmmac_reinit_rx_buffers(priv);
+       mutex_unlock(&priv->lock);
+
        if (!device_may_wakeup(priv->device) || !priv->plat->pmt) {
-               rtnl_lock();
                phylink_start(priv->phylink);
                /* We may have called phylink_speed_down before */
                phylink_speed_up(priv->phylink);
-               rtnl_unlock();
        }
-       rtnl_lock();
        mutex_lock(&priv->lock);
        stmmac_reset_queues_param(priv);
-       stmmac_reinit_rx_buffers(priv);
        stmmac_free_tx_skbufs(priv);
        stmmac_clear_descriptors(priv);


It is still not clear to us why the existing call to
stmmac_clear_descriptors() is not sufficient to fix your problem.

How often does the issue you see occur?

Thanks
Jon

-- 
nvpublic

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back
  2021-04-13  8:41                       ` Jon Hunter
@ 2021-04-13 12:13                         ` Joakim Zhang
  2021-04-13 16:06                           ` Thierry Reding
  0 siblings, 1 reply; 21+ messages in thread
From: Joakim Zhang @ 2021-04-13 12:13 UTC (permalink / raw)
  To: Jon Hunter, Giuseppe Cavallaro, Alexandre Torgue, Jose Abreu
  Cc: netdev, Linux Kernel Mailing List, linux-tegra, Jakub Kicinski


Hi Jon,

> -----Original Message-----
> From: Jon Hunter <jonathanh@nvidia.com>
> Sent: 2021年4月13日 16:41
> To: Joakim Zhang <qiangqing.zhang@nxp.com>; Giuseppe Cavallaro
> <peppe.cavallaro@st.com>; Alexandre Torgue <alexandre.torgue@st.com>;
> Jose Abreu <joabreu@synopsys.com>
> Cc: netdev@vger.kernel.org; Linux Kernel Mailing List
> <linux-kernel@vger.kernel.org>; linux-tegra <linux-tegra@vger.kernel.org>;
> Jakub Kicinski <kuba@kernel.org>
> Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> resume back
> 
> 
> On 01/04/2021 17:28, Jon Hunter wrote:
> >
> > On 31/03/2021 12:41, Joakim Zhang wrote:
> >
> > ...
> >
> >>> In answer to your question, resuming from suspend does work on this
> >>> board without your change. We have been testing suspend/resume now
> >>> on this board since Linux v5.8 and so we have the ability to bisect
> >>> such regressions. So it is clear to me that this is the change that caused
> this, but I am not sure why.
> >>
> >> Yes, I know this issue is regression caused by my patch. I just want to
> analyze the potential reasons. Due to the code change only related to the page
> recycle and reallocate.
> >> So I guess if this page operate need IOMMU works when IOMMU is enabled.
> Could you help check if IOMMU driver resume before STMMAC? Our common
> desire is to find the root cause, right?
> >
> >
> > Yes of course that is the desire here indeed. I had assumed that the
> > suspend/resume order was good because we have never seen any problems,
> > but nonetheless it is always good to check. Using ftrace I enabled
> > tracing of the appropriate suspend/resume functions and this is what I
> > see ...
> >
> > # tracer: function
> > #
> > # entries-in-buffer/entries-written: 4/4   #P:6
> > #
> > #                                _-----=> irqs-off
> > #                               / _----=> need-resched
> > #                              | / _---=> hardirq/softirq
> > #                              || / _--=> preempt-depth
> > #                              ||| /     delay
> > #           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
> > #              | |         |   ||||      |         |
> >          rtcwake-748     [000] ...1   536.700777:
> stmmac_pltfr_suspend <-platform_pm_suspend
> >          rtcwake-748     [000] ...1   536.735532:
> arm_smmu_pm_suspend <-platform_pm_suspend
> >          rtcwake-748     [000] ...1   536.757290:
> arm_smmu_pm_resume <-platform_pm_resume
> >          rtcwake-748     [003] ...1   536.856771:
> stmmac_pltfr_resume <-platform_pm_resume
> >
> >
> > So I don't see any ordering issues that could be causing this.
> 
> 
> Another thing I have found is that for our platform, if the driver for the ethernet
> PHY (in this case broadcom PHY) is enabled, then it fails to resume but if I
> disable the PHY in the kernel configuration, then resume works. I have found
> that if I move the reinit of the RX buffers to before the startup of the phy, then
> it can resume OK with the PHY enabled.
> 
> Does the following work for you? Does your platform use a specific ethernet
> PHY driver?

I am also looking into this issue these days, we use the Realtek RTL8211FDI PHY, driver is drivers/net/phy/realtek.c.

For our EQOS MAC integrated in our SoC, Rx side logic depends on RXC clock from PHY, so we need phylink_start before MAC.

I will test below code change tomorrow to see if it can work at my side, since it is only re-init memory, need not RXC clock.


> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index 208cae344ffa..071d15d86dbe 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -5416,19 +5416,20 @@ int stmmac_resume(struct device *dev)
>                         return ret;
>         }
> +       rtnl_lock();
> +       mutex_lock(&priv->lock);
> +       stmmac_reinit_rx_buffers(priv);
> +       mutex_unlock(&priv->lock);
> +
>         if (!device_may_wakeup(priv->device) || !priv->plat->pmt) {
> -               rtnl_lock();
>                 phylink_start(priv->phylink);
>                 /* We may have called phylink_speed_down before */
>                 phylink_speed_up(priv->phylink);
> -               rtnl_unlock();
>         }
> -       rtnl_lock();
>         mutex_lock(&priv->lock);
>         stmmac_reset_queues_param(priv);
> -       stmmac_reinit_rx_buffers(priv);
>         stmmac_free_tx_skbufs(priv);
>         stmmac_clear_descriptors(priv);
> 
> 
> It is still not clear to us why the existing call to
> stmmac_clear_descriptors() is not sufficient to fix your problem.

During suspend/resume stress test, I found rx descriptor may not refill when system suspended, rx descriptor could be: 008 [0x00000000c4310080]: 0x0 0x40 0x0 0x34010040.
When system resume back, stmmac_clear_descriptors() would change this rx descriptor to: 008 [0x00000000c4310080]: 0x0 0x40 0x0 0xb5010040, a broken rx descriptor.
So at my side, stmmac_clear_descriptors() seems to be chief culprit. I have a idea if there is way to ensure all rx descriptors are refilled when suspend MAC.

> How often does the issue you see occur?
Suspend about 2000 times.

Best Regards,
Joakim Zhang
> Thanks
> Jon
> 
> --
> nvpublic

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back
  2021-04-13 12:13                         ` Joakim Zhang
@ 2021-04-13 16:06                           ` Thierry Reding
  2021-04-13 16:43                             ` Jakub Kicinski
  2021-04-14  2:18                             ` Joakim Zhang
  0 siblings, 2 replies; 21+ messages in thread
From: Thierry Reding @ 2021-04-13 16:06 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: Joakim Zhang, Jon Hunter, Giuseppe Cavallaro, Alexandre Torgue,
	Jose Abreu, netdev, Linux Kernel Mailing List, linux-tegra

[-- Attachment #1: Type: text/plain, Size: 6333 bytes --]

On Tue, Apr 13, 2021 at 12:13:01PM +0000, Joakim Zhang wrote:
> 
> Hi Jon,
> 
> > -----Original Message-----
> > From: Jon Hunter <jonathanh@nvidia.com>
> > Sent: 2021年4月13日 16:41
> > To: Joakim Zhang <qiangqing.zhang@nxp.com>; Giuseppe Cavallaro
> > <peppe.cavallaro@st.com>; Alexandre Torgue <alexandre.torgue@st.com>;
> > Jose Abreu <joabreu@synopsys.com>
> > Cc: netdev@vger.kernel.org; Linux Kernel Mailing List
> > <linux-kernel@vger.kernel.org>; linux-tegra <linux-tegra@vger.kernel.org>;
> > Jakub Kicinski <kuba@kernel.org>
> > Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> > resume back
> > 
> > 
> > On 01/04/2021 17:28, Jon Hunter wrote:
> > >
> > > On 31/03/2021 12:41, Joakim Zhang wrote:
> > >
> > > ...
> > >
> > >>> In answer to your question, resuming from suspend does work on this
> > >>> board without your change. We have been testing suspend/resume now
> > >>> on this board since Linux v5.8 and so we have the ability to bisect
> > >>> such regressions. So it is clear to me that this is the change that caused
> > this, but I am not sure why.
> > >>
> > >> Yes, I know this issue is regression caused by my patch. I just want to
> > analyze the potential reasons. Due to the code change only related to the page
> > recycle and reallocate.
> > >> So I guess if this page operate need IOMMU works when IOMMU is enabled.
> > Could you help check if IOMMU driver resume before STMMAC? Our common
> > desire is to find the root cause, right?
> > >
> > >
> > > Yes of course that is the desire here indeed. I had assumed that the
> > > suspend/resume order was good because we have never seen any problems,
> > > but nonetheless it is always good to check. Using ftrace I enabled
> > > tracing of the appropriate suspend/resume functions and this is what I
> > > see ...
> > >
> > > # tracer: function
> > > #
> > > # entries-in-buffer/entries-written: 4/4   #P:6
> > > #
> > > #                                _-----=> irqs-off
> > > #                               / _----=> need-resched
> > > #                              | / _---=> hardirq/softirq
> > > #                              || / _--=> preempt-depth
> > > #                              ||| /     delay
> > > #           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
> > > #              | |         |   ||||      |         |
> > >          rtcwake-748     [000] ...1   536.700777:
> > stmmac_pltfr_suspend <-platform_pm_suspend
> > >          rtcwake-748     [000] ...1   536.735532:
> > arm_smmu_pm_suspend <-platform_pm_suspend
> > >          rtcwake-748     [000] ...1   536.757290:
> > arm_smmu_pm_resume <-platform_pm_resume
> > >          rtcwake-748     [003] ...1   536.856771:
> > stmmac_pltfr_resume <-platform_pm_resume
> > >
> > >
> > > So I don't see any ordering issues that could be causing this.
> > 
> > 
> > Another thing I have found is that for our platform, if the driver for the ethernet
> > PHY (in this case broadcom PHY) is enabled, then it fails to resume but if I
> > disable the PHY in the kernel configuration, then resume works. I have found
> > that if I move the reinit of the RX buffers to before the startup of the phy, then
> > it can resume OK with the PHY enabled.
> > 
> > Does the following work for you? Does your platform use a specific ethernet
> > PHY driver?
> 
> I am also looking into this issue these days, we use the Realtek RTL8211FDI PHY, driver is drivers/net/phy/realtek.c.
> 
> For our EQOS MAC integrated in our SoC, Rx side logic depends on RXC clock from PHY, so we need phylink_start before MAC.
> 
> I will test below code change tomorrow to see if it can work at my side, since it is only re-init memory, need not RXC clock.
> 
> 
> > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > index 208cae344ffa..071d15d86dbe 100644
> > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > @@ -5416,19 +5416,20 @@ int stmmac_resume(struct device *dev)
> >                         return ret;
> >         }
> > +       rtnl_lock();
> > +       mutex_lock(&priv->lock);
> > +       stmmac_reinit_rx_buffers(priv);
> > +       mutex_unlock(&priv->lock);
> > +
> >         if (!device_may_wakeup(priv->device) || !priv->plat->pmt) {
> > -               rtnl_lock();
> >                 phylink_start(priv->phylink);
> >                 /* We may have called phylink_speed_down before */
> >                 phylink_speed_up(priv->phylink);
> > -               rtnl_unlock();
> >         }
> > -       rtnl_lock();
> >         mutex_lock(&priv->lock);
> >         stmmac_reset_queues_param(priv);
> > -       stmmac_reinit_rx_buffers(priv);
> >         stmmac_free_tx_skbufs(priv);
> >         stmmac_clear_descriptors(priv);
> > 
> > 
> > It is still not clear to us why the existing call to
> > stmmac_clear_descriptors() is not sufficient to fix your problem.
> 
> During suspend/resume stress test, I found rx descriptor may not refill when system suspended, rx descriptor could be: 008 [0x00000000c4310080]: 0x0 0x40 0x0 0x34010040.
> When system resume back, stmmac_clear_descriptors() would change this rx descriptor to: 008 [0x00000000c4310080]: 0x0 0x40 0x0 0xb5010040, a broken rx descriptor.
> So at my side, stmmac_clear_descriptors() seems to be chief culprit. I have a idea if there is way to ensure all rx descriptors are refilled when suspend MAC.
> 
> > How often does the issue you see occur?
> Suspend about 2000 times.

Hi David, Jakub,

given where we are in the release cycle, I think it'd be best to revert
commit 9c63faaa931e ("net: stmmac: re-init rx buffers when mac resume
back") for now.

To summarize the discussion: the patch was meant as a workaround to fix
an occasional suspend/resume failure on one board that was not fully
root caused, and ends up causing fully reproducible suspend/resume
failures on at least one other board.

Joakim is looking at an alternative solution and Jon and I can provide
testing from the Tegra side for any fixes.

Do you want me to send a revert patch or can you revert directly on top
of your tree?

Thanks,
Thierry

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back
  2021-04-13 16:06                           ` Thierry Reding
@ 2021-04-13 16:43                             ` Jakub Kicinski
  2021-04-14  2:18                             ` Joakim Zhang
  1 sibling, 0 replies; 21+ messages in thread
From: Jakub Kicinski @ 2021-04-13 16:43 UTC (permalink / raw)
  To: Thierry Reding
  Cc: David S. Miller, Joakim Zhang, Jon Hunter, Giuseppe Cavallaro,
	Alexandre Torgue, Jose Abreu, netdev, Linux Kernel Mailing List,
	linux-tegra

On Tue, 13 Apr 2021 18:06:39 +0200 Thierry Reding wrote:
> given where we are in the release cycle, I think it'd be best to revert
> commit 9c63faaa931e ("net: stmmac: re-init rx buffers when mac resume
> back") for now.
> 
> To summarize the discussion: the patch was meant as a workaround to fix
> an occasional suspend/resume failure on one board that was not fully
> root caused, and ends up causing fully reproducible suspend/resume
> failures on at least one other board.
> 
> Joakim is looking at an alternative solution and Jon and I can provide
> testing from the Tegra side for any fixes.
> 
> Do you want me to send a revert patch or can you revert directly on top
> of your tree?

Please send a revert with the justification in the commit log, and 
perhaps also:

Link: https://lore.kernel.org/r/708edb92-a5df-ecc4-3126-5ab36707e275@nvidia.com/

for those who want to dig deeper in the history. Make sure you 
CC relevant folks so they can review and express their opinions.

Thanks!

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back
  2021-04-13 16:06                           ` Thierry Reding
  2021-04-13 16:43                             ` Jakub Kicinski
@ 2021-04-14  2:18                             ` Joakim Zhang
  2021-04-14  7:40                               ` Thierry Reding
  1 sibling, 1 reply; 21+ messages in thread
From: Joakim Zhang @ 2021-04-14  2:18 UTC (permalink / raw)
  To: Thierry Reding, David S. Miller, Jakub Kicinski
  Cc: Jon Hunter, Giuseppe Cavallaro, Alexandre Torgue, Jose Abreu,
	netdev, Linux Kernel Mailing List, linux-tegra


> -----Original Message-----
> From: Thierry Reding <thierry.reding@gmail.com>
> Sent: 2021年4月14日 0:07
> To: David S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>
> Cc: Joakim Zhang <qiangqing.zhang@nxp.com>; Jon Hunter
> <jonathanh@nvidia.com>; Giuseppe Cavallaro <peppe.cavallaro@st.com>;
> Alexandre Torgue <alexandre.torgue@st.com>; Jose Abreu
> <joabreu@synopsys.com>; netdev@vger.kernel.org; Linux Kernel Mailing List
> <linux-kernel@vger.kernel.org>; linux-tegra <linux-tegra@vger.kernel.org>
> Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> resume back
> 
> On Tue, Apr 13, 2021 at 12:13:01PM +0000, Joakim Zhang wrote:
> >
> > Hi Jon,
> >
> > > -----Original Message-----
> > > From: Jon Hunter <jonathanh@nvidia.com>
> > > Sent: 2021年4月13日 16:41
> > > To: Joakim Zhang <qiangqing.zhang@nxp.com>; Giuseppe Cavallaro
> > > <peppe.cavallaro@st.com>; Alexandre Torgue
> > > <alexandre.torgue@st.com>; Jose Abreu <joabreu@synopsys.com>
> > > Cc: netdev@vger.kernel.org; Linux Kernel Mailing List
> > > <linux-kernel@vger.kernel.org>; linux-tegra
> > > <linux-tegra@vger.kernel.org>; Jakub Kicinski <kuba@kernel.org>
> > > Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers
> > > when mac resume back
> > >
> > >
> > > On 01/04/2021 17:28, Jon Hunter wrote:
> > > >
> > > > On 31/03/2021 12:41, Joakim Zhang wrote:
> > > >
> > > > ...
> > > >
> > > >>> In answer to your question, resuming from suspend does work on
> > > >>> this board without your change. We have been testing
> > > >>> suspend/resume now on this board since Linux v5.8 and so we have
> > > >>> the ability to bisect such regressions. So it is clear to me
> > > >>> that this is the change that caused
> > > this, but I am not sure why.
> > > >>
> > > >> Yes, I know this issue is regression caused by my patch. I just
> > > >> want to
> > > analyze the potential reasons. Due to the code change only related
> > > to the page recycle and reallocate.
> > > >> So I guess if this page operate need IOMMU works when IOMMU is
> enabled.
> > > Could you help check if IOMMU driver resume before STMMAC? Our
> > > common desire is to find the root cause, right?
> > > >
> > > >
> > > > Yes of course that is the desire here indeed. I had assumed that
> > > > the suspend/resume order was good because we have never seen any
> > > > problems, but nonetheless it is always good to check. Using ftrace
> > > > I enabled tracing of the appropriate suspend/resume functions and
> > > > this is what I see ...
> > > >
> > > > # tracer: function
> > > > #
> > > > # entries-in-buffer/entries-written: 4/4   #P:6
> > > > #
> > > > #                                _-----=> irqs-off
> > > > #                               / _----=> need-resched
> > > > #                              | / _---=> hardirq/softirq
> > > > #                              || / _--=> preempt-depth
> > > > #                              ||| /     delay
> > > > #           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
> > > > #              | |         |   ||||      |         |
> > > >          rtcwake-748     [000] ...1   536.700777:
> > > stmmac_pltfr_suspend <-platform_pm_suspend
> > > >          rtcwake-748     [000] ...1   536.735532:
> > > arm_smmu_pm_suspend <-platform_pm_suspend
> > > >          rtcwake-748     [000] ...1   536.757290:
> > > arm_smmu_pm_resume <-platform_pm_resume
> > > >          rtcwake-748     [003] ...1   536.856771:
> > > stmmac_pltfr_resume <-platform_pm_resume
> > > >
> > > >
> > > > So I don't see any ordering issues that could be causing this.
> > >
> > >
> > > Another thing I have found is that for our platform, if the driver
> > > for the ethernet PHY (in this case broadcom PHY) is enabled, then it
> > > fails to resume but if I disable the PHY in the kernel
> > > configuration, then resume works. I have found that if I move the
> > > reinit of the RX buffers to before the startup of the phy, then it can resume
> OK with the PHY enabled.
> > >
> > > Does the following work for you? Does your platform use a specific
> > > ethernet PHY driver?
> >
> > I am also looking into this issue these days, we use the Realtek RTL8211FDI
> PHY, driver is drivers/net/phy/realtek.c.
> >
> > For our EQOS MAC integrated in our SoC, Rx side logic depends on RXC clock
> from PHY, so we need phylink_start before MAC.
> >
> > I will test below code change tomorrow to see if it can work at my side, since
> it is only re-init memory, need not RXC clock.
> >
> >
> > > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > > b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > > index 208cae344ffa..071d15d86dbe 100644
> > > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > > @@ -5416,19 +5416,20 @@ int stmmac_resume(struct device *dev)
> > >                         return ret;
> > >         }
> > > +       rtnl_lock();
> > > +       mutex_lock(&priv->lock);
> > > +       stmmac_reinit_rx_buffers(priv);
> > > +       mutex_unlock(&priv->lock);
> > > +
> > >         if (!device_may_wakeup(priv->device) || !priv->plat->pmt) {
> > > -               rtnl_lock();
> > >                 phylink_start(priv->phylink);
> > >                 /* We may have called phylink_speed_down before */
> > >                 phylink_speed_up(priv->phylink);
> > > -               rtnl_unlock();
> > >         }
> > > -       rtnl_lock();
> > >         mutex_lock(&priv->lock);
> > >         stmmac_reset_queues_param(priv);
> > > -       stmmac_reinit_rx_buffers(priv);
> > >         stmmac_free_tx_skbufs(priv);
> > >         stmmac_clear_descriptors(priv);
> > >
> > >
> > > It is still not clear to us why the existing call to
> > > stmmac_clear_descriptors() is not sufficient to fix your problem.
> >
> > During suspend/resume stress test, I found rx descriptor may not refill when
> system suspended, rx descriptor could be: 008 [0x00000000c4310080]: 0x0
> 0x40 0x0 0x34010040.
> > When system resume back, stmmac_clear_descriptors() would change this rx
> descriptor to: 008 [0x00000000c4310080]: 0x0 0x40 0x0 0xb5010040, a broken
> rx descriptor.
> > So at my side, stmmac_clear_descriptors() seems to be chief culprit. I have a
> idea if there is way to ensure all rx descriptors are refilled when suspend MAC.
> >
> > > How often does the issue you see occur?
> > Suspend about 2000 times.
> 
> Hi David, Jakub,
> 
> given where we are in the release cycle, I think it'd be best to revert commit
> 9c63faaa931e ("net: stmmac: re-init rx buffers when mac resume
> back") for now.
> 
> To summarize the discussion: the patch was meant as a workaround to fix an
> occasional suspend/resume failure on one board that was not fully root caused,
> and ends up causing fully reproducible suspend/resume failures on at least one
> other board.
> 
> Joakim is looking at an alternative solution and Jon and I can provide testing
> from the Tegra side for any fixes.
> 
> Do you want me to send a revert patch or can you revert directly on top of your
> tree?

Hi Thierry, David, Jakub,

From my point of view, it is not a good choose to send a revert patch directly.

At my side, I have found the root cause. When system suspended, it is possible that there are packets have not been received yet, such as:
008 [0x00000000c4310080]: 0x0 0x40 0x0 0x34010040.

After system resume, stmmac_clear_descriptors() clear the descriptor, let it becomes below, it is a broken descriptor.
008 [0x00000000c4310080]: 0x0 0x40 0x0 0xb5010040

I think it is a software bug there, and I don't know why others have not reported it. This is a random issue, but there is a certain probability that it will occur.
My patch is a solution, may not a good solution, now it seems not a workaround.

At Joh's side, said it is related to IOMMU first, and then said re-init rx buffers before PHY start also can fix it, and this patch also only breaks one of their boards.
This makes me think there is a specific integration in their SoCs. I have not seen others report it is broken at their side. Theoretically, at least, this patch should have
no side effect.

In conclusion, we can revert this patch if we can find a better way to fix this issue (packets have not received when system suspended).

Best Regards,
Joakim Zhang
> Thanks,
> Thierry

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back
  2021-04-14  2:18                             ` Joakim Zhang
@ 2021-04-14  7:40                               ` Thierry Reding
  2021-04-14  8:06                                 ` Joakim Zhang
  0 siblings, 1 reply; 21+ messages in thread
From: Thierry Reding @ 2021-04-14  7:40 UTC (permalink / raw)
  To: Joakim Zhang
  Cc: David S. Miller, Jakub Kicinski, Jon Hunter, Giuseppe Cavallaro,
	Alexandre Torgue, Jose Abreu, netdev, Linux Kernel Mailing List,
	linux-tegra

[-- Attachment #1: Type: text/plain, Size: 11312 bytes --]

On Wed, Apr 14, 2021 at 02:18:58AM +0000, Joakim Zhang wrote:
> 
> > -----Original Message-----
> > From: Thierry Reding <thierry.reding@gmail.com>
> > Sent: 2021年4月14日 0:07
> > To: David S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>
> > Cc: Joakim Zhang <qiangqing.zhang@nxp.com>; Jon Hunter
> > <jonathanh@nvidia.com>; Giuseppe Cavallaro <peppe.cavallaro@st.com>;
> > Alexandre Torgue <alexandre.torgue@st.com>; Jose Abreu
> > <joabreu@synopsys.com>; netdev@vger.kernel.org; Linux Kernel Mailing List
> > <linux-kernel@vger.kernel.org>; linux-tegra <linux-tegra@vger.kernel.org>
> > Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> > resume back
> > 
> > On Tue, Apr 13, 2021 at 12:13:01PM +0000, Joakim Zhang wrote:
> > >
> > > Hi Jon,
> > >
> > > > -----Original Message-----
> > > > From: Jon Hunter <jonathanh@nvidia.com>
> > > > Sent: 2021年4月13日 16:41
> > > > To: Joakim Zhang <qiangqing.zhang@nxp.com>; Giuseppe Cavallaro
> > > > <peppe.cavallaro@st.com>; Alexandre Torgue
> > > > <alexandre.torgue@st.com>; Jose Abreu <joabreu@synopsys.com>
> > > > Cc: netdev@vger.kernel.org; Linux Kernel Mailing List
> > > > <linux-kernel@vger.kernel.org>; linux-tegra
> > > > <linux-tegra@vger.kernel.org>; Jakub Kicinski <kuba@kernel.org>
> > > > Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers
> > > > when mac resume back
> > > >
> > > >
> > > > On 01/04/2021 17:28, Jon Hunter wrote:
> > > > >
> > > > > On 31/03/2021 12:41, Joakim Zhang wrote:
> > > > >
> > > > > ...
> > > > >
> > > > >>> In answer to your question, resuming from suspend does work on
> > > > >>> this board without your change. We have been testing
> > > > >>> suspend/resume now on this board since Linux v5.8 and so we have
> > > > >>> the ability to bisect such regressions. So it is clear to me
> > > > >>> that this is the change that caused
> > > > this, but I am not sure why.
> > > > >>
> > > > >> Yes, I know this issue is regression caused by my patch. I just
> > > > >> want to
> > > > analyze the potential reasons. Due to the code change only related
> > > > to the page recycle and reallocate.
> > > > >> So I guess if this page operate need IOMMU works when IOMMU is
> > enabled.
> > > > Could you help check if IOMMU driver resume before STMMAC? Our
> > > > common desire is to find the root cause, right?
> > > > >
> > > > >
> > > > > Yes of course that is the desire here indeed. I had assumed that
> > > > > the suspend/resume order was good because we have never seen any
> > > > > problems, but nonetheless it is always good to check. Using ftrace
> > > > > I enabled tracing of the appropriate suspend/resume functions and
> > > > > this is what I see ...
> > > > >
> > > > > # tracer: function
> > > > > #
> > > > > # entries-in-buffer/entries-written: 4/4   #P:6
> > > > > #
> > > > > #                                _-----=> irqs-off
> > > > > #                               / _----=> need-resched
> > > > > #                              | / _---=> hardirq/softirq
> > > > > #                              || / _--=> preempt-depth
> > > > > #                              ||| /     delay
> > > > > #           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
> > > > > #              | |         |   ||||      |         |
> > > > >          rtcwake-748     [000] ...1   536.700777:
> > > > stmmac_pltfr_suspend <-platform_pm_suspend
> > > > >          rtcwake-748     [000] ...1   536.735532:
> > > > arm_smmu_pm_suspend <-platform_pm_suspend
> > > > >          rtcwake-748     [000] ...1   536.757290:
> > > > arm_smmu_pm_resume <-platform_pm_resume
> > > > >          rtcwake-748     [003] ...1   536.856771:
> > > > stmmac_pltfr_resume <-platform_pm_resume
> > > > >
> > > > >
> > > > > So I don't see any ordering issues that could be causing this.
> > > >
> > > >
> > > > Another thing I have found is that for our platform, if the driver
> > > > for the ethernet PHY (in this case broadcom PHY) is enabled, then it
> > > > fails to resume but if I disable the PHY in the kernel
> > > > configuration, then resume works. I have found that if I move the
> > > > reinit of the RX buffers to before the startup of the phy, then it can resume
> > OK with the PHY enabled.
> > > >
> > > > Does the following work for you? Does your platform use a specific
> > > > ethernet PHY driver?
> > >
> > > I am also looking into this issue these days, we use the Realtek RTL8211FDI
> > PHY, driver is drivers/net/phy/realtek.c.
> > >
> > > For our EQOS MAC integrated in our SoC, Rx side logic depends on RXC clock
> > from PHY, so we need phylink_start before MAC.
> > >
> > > I will test below code change tomorrow to see if it can work at my side, since
> > it is only re-init memory, need not RXC clock.
> > >
> > >
> > > > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > > > b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > > > index 208cae344ffa..071d15d86dbe 100644
> > > > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > > > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > > > @@ -5416,19 +5416,20 @@ int stmmac_resume(struct device *dev)
> > > >                         return ret;
> > > >         }
> > > > +       rtnl_lock();
> > > > +       mutex_lock(&priv->lock);
> > > > +       stmmac_reinit_rx_buffers(priv);
> > > > +       mutex_unlock(&priv->lock);
> > > > +
> > > >         if (!device_may_wakeup(priv->device) || !priv->plat->pmt) {
> > > > -               rtnl_lock();
> > > >                 phylink_start(priv->phylink);
> > > >                 /* We may have called phylink_speed_down before */
> > > >                 phylink_speed_up(priv->phylink);
> > > > -               rtnl_unlock();
> > > >         }
> > > > -       rtnl_lock();
> > > >         mutex_lock(&priv->lock);
> > > >         stmmac_reset_queues_param(priv);
> > > > -       stmmac_reinit_rx_buffers(priv);
> > > >         stmmac_free_tx_skbufs(priv);
> > > >         stmmac_clear_descriptors(priv);
> > > >
> > > >
> > > > It is still not clear to us why the existing call to
> > > > stmmac_clear_descriptors() is not sufficient to fix your problem.
> > >
> > > During suspend/resume stress test, I found rx descriptor may not refill when
> > system suspended, rx descriptor could be: 008 [0x00000000c4310080]: 0x0
> > 0x40 0x0 0x34010040.
> > > When system resume back, stmmac_clear_descriptors() would change this rx
> > descriptor to: 008 [0x00000000c4310080]: 0x0 0x40 0x0 0xb5010040, a broken
> > rx descriptor.
> > > So at my side, stmmac_clear_descriptors() seems to be chief culprit. I have a
> > idea if there is way to ensure all rx descriptors are refilled when suspend MAC.
> > >
> > > > How often does the issue you see occur?
> > > Suspend about 2000 times.
> > 
> > Hi David, Jakub,
> > 
> > given where we are in the release cycle, I think it'd be best to revert commit
> > 9c63faaa931e ("net: stmmac: re-init rx buffers when mac resume
> > back") for now.
> > 
> > To summarize the discussion: the patch was meant as a workaround to fix an
> > occasional suspend/resume failure on one board that was not fully root caused,
> > and ends up causing fully reproducible suspend/resume failures on at least one
> > other board.
> > 
> > Joakim is looking at an alternative solution and Jon and I can provide testing
> > from the Tegra side for any fixes.
> > 
> > Do you want me to send a revert patch or can you revert directly on top of your
> > tree?
> 
> Hi Thierry, David, Jakub,
> 
> From my point of view, it is not a good choose to send a revert patch directly.
> 
> At my side, I have found the root cause. When system suspended, it is
> possible that there are packets have not been received yet, such as:
> 008 [0x00000000c4310080]: 0x0 0x40 0x0 0x34010040.
> 
> After system resume, stmmac_clear_descriptors() clear the descriptor,
> let it becomes below, it is a broken descriptor.
> 008 [0x00000000c4310080]: 0x0 0x40 0x0 0xb5010040

So it sounds like that is what needs to be fixed. Reallocating all
buffers and rewriting the descriptors seems more of a sledgehammer
approach than a proper fix to this problem.

> I think it is a software bug there, and I don't know why others have
> not reported it. This is a random issue, but there is a certain
> probability that it will occur.

If this is really as rare as you say, I'm not completely surprised that
nobody has reported it.

> My patch is a solution, may not a good solution, now it seems not a
> workaround.

It's not an acceptable solution if it causes a regression.

> At Joh's side, said it is related to IOMMU first, and then said
> re-init rx buffers before PHY start also can fix it, and this patch
> also only breaks one of their boards.

It's certainly possible that IOMMU has some sort of impact on the
reproducibility of the issue, but it's also a fact that before this
patch the systems that are now broken had been working.

Also, it's not relevant how many boards are broken. If a patch breaks a
single setup that used to work, that's a regression. What your patch
does is basically exchanging one working setup for another. And the
regression is even worse than the issue that you were trying to fix:
Jetson TX2 reliably fails to resume properly *every time*, whereas you
confirmed that you're only seeing this particular issue about once in
2000 suspend/resume cycles.

That's not how we do kernel development. Jon reported the regression 3
weeks ago and nobody's come up with a fix that solves this properly and
for everyone. Given that we may only have 4 days left before the final
release, the safest course of action at this point is to revert and then
we can try again for the next cycle. Jon and I can help test any patches
on the Tegra side.

> This makes me think there is a specific integration in their SoCs.

Even if this was an integration issue, which I doubt, that's completely
irrelevant. What's relevant is that the setup was working before this
patch.

> I have not seen others report it is broken at their side.

Prior to your patch submission, did anybody report that suspend/resume
was broken for them? Also, if you are the only one seeing this issue,
perhaps this is an integration issue in your SoC?

As you can see this kind of argument makes us go in circles, hence why
we have the rule that when a patch causes a regression it either gets
fixed or reverted. Anything else leads to insanity.

> Theoretically, at least, this patch should have no side effect.

Sorry, but that's not a valid argument. Practically this is causing a
problem and that counts more than theory.

> In conclusion, we can revert this patch if we can find a better way to
> fix this issue (packets have not received when system suspended).

Again, not how it works. The patch should be reverted to restore
functionality for setups that were previously working. Then we can can
try to find a better way to fix this issue.

I'm not confident that we can find that proper fix within the next few
days, so let's try again for the next release cycle.

Thierry

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back
  2021-04-14  7:40                               ` Thierry Reding
@ 2021-04-14  8:06                                 ` Joakim Zhang
  2021-04-14 12:22                                   ` Joakim Zhang
  0 siblings, 1 reply; 21+ messages in thread
From: Joakim Zhang @ 2021-04-14  8:06 UTC (permalink / raw)
  To: Thierry Reding
  Cc: David S. Miller, Jakub Kicinski, Jon Hunter, Giuseppe Cavallaro,
	Alexandre Torgue, Jose Abreu, netdev, Linux Kernel Mailing List,
	linux-tegra


> -----Original Message-----
> From: Thierry Reding <thierry.reding@gmail.com>
> Sent: 2021年4月14日 15:41
> To: Joakim Zhang <qiangqing.zhang@nxp.com>
> Cc: David S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>;
> Jon Hunter <jonathanh@nvidia.com>; Giuseppe Cavallaro
> <peppe.cavallaro@st.com>; Alexandre Torgue <alexandre.torgue@st.com>;
> Jose Abreu <joabreu@synopsys.com>; netdev@vger.kernel.org; Linux Kernel
> Mailing List <linux-kernel@vger.kernel.org>; linux-tegra
> <linux-tegra@vger.kernel.org>
> Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> resume back
> 
> On Wed, Apr 14, 2021 at 02:18:58AM +0000, Joakim Zhang wrote:
> >
> > > -----Original Message-----
> > > From: Thierry Reding <thierry.reding@gmail.com>
> > > Sent: 2021年4月14日 0:07
> > > To: David S. Miller <davem@davemloft.net>; Jakub Kicinski
> > > <kuba@kernel.org>
> > > Cc: Joakim Zhang <qiangqing.zhang@nxp.com>; Jon Hunter
> > > <jonathanh@nvidia.com>; Giuseppe Cavallaro <peppe.cavallaro@st.com>;
> > > Alexandre Torgue <alexandre.torgue@st.com>; Jose Abreu
> > > <joabreu@synopsys.com>; netdev@vger.kernel.org; Linux Kernel Mailing
> > > List <linux-kernel@vger.kernel.org>; linux-tegra
> > > <linux-tegra@vger.kernel.org>
> > > Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers
> > > when mac resume back
> > >
> > > On Tue, Apr 13, 2021 at 12:13:01PM +0000, Joakim Zhang wrote:
> > > >
> > > > Hi Jon,
> > > >
> > > > > -----Original Message-----
> > > > > From: Jon Hunter <jonathanh@nvidia.com>
> > > > > Sent: 2021年4月13日 16:41
> > > > > To: Joakim Zhang <qiangqing.zhang@nxp.com>; Giuseppe Cavallaro
> > > > > <peppe.cavallaro@st.com>; Alexandre Torgue
> > > > > <alexandre.torgue@st.com>; Jose Abreu <joabreu@synopsys.com>
> > > > > Cc: netdev@vger.kernel.org; Linux Kernel Mailing List
> > > > > <linux-kernel@vger.kernel.org>; linux-tegra
> > > > > <linux-tegra@vger.kernel.org>; Jakub Kicinski <kuba@kernel.org>
> > > > > Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx
> > > > > buffers when mac resume back
> > > > >
> > > > >
> > > > > On 01/04/2021 17:28, Jon Hunter wrote:
> > > > > >
> > > > > > On 31/03/2021 12:41, Joakim Zhang wrote:
> > > > > >
> > > > > > ...
> > > > > >
> > > > > >>> In answer to your question, resuming from suspend does work
> > > > > >>> on this board without your change. We have been testing
> > > > > >>> suspend/resume now on this board since Linux v5.8 and so we
> > > > > >>> have the ability to bisect such regressions. So it is clear
> > > > > >>> to me that this is the change that caused
> > > > > this, but I am not sure why.
> > > > > >>
> > > > > >> Yes, I know this issue is regression caused by my patch. I
> > > > > >> just want to
> > > > > analyze the potential reasons. Due to the code change only
> > > > > related to the page recycle and reallocate.
> > > > > >> So I guess if this page operate need IOMMU works when IOMMU
> > > > > >> is
> > > enabled.
> > > > > Could you help check if IOMMU driver resume before STMMAC? Our
> > > > > common desire is to find the root cause, right?
> > > > > >
> > > > > >
> > > > > > Yes of course that is the desire here indeed. I had assumed
> > > > > > that the suspend/resume order was good because we have never
> > > > > > seen any problems, but nonetheless it is always good to check.
> > > > > > Using ftrace I enabled tracing of the appropriate
> > > > > > suspend/resume functions and this is what I see ...
> > > > > >
> > > > > > # tracer: function
> > > > > > #
> > > > > > # entries-in-buffer/entries-written: 4/4   #P:6
> > > > > > #
> > > > > > #                                _-----=> irqs-off
> > > > > > #                               / _----=> need-resched
> > > > > > #                              | / _---=> hardirq/softirq
> > > > > > #                              || / _--=> preempt-depth
> > > > > > #                              ||| /     delay
> > > > > > #           TASK-PID     CPU#  ||||   TIMESTAMP
> FUNCTION
> > > > > > #              | |         |   ||||      |         |
> > > > > >          rtcwake-748     [000] ...1   536.700777:
> > > > > stmmac_pltfr_suspend <-platform_pm_suspend
> > > > > >          rtcwake-748     [000] ...1   536.735532:
> > > > > arm_smmu_pm_suspend <-platform_pm_suspend
> > > > > >          rtcwake-748     [000] ...1   536.757290:
> > > > > arm_smmu_pm_resume <-platform_pm_resume
> > > > > >          rtcwake-748     [003] ...1   536.856771:
> > > > > stmmac_pltfr_resume <-platform_pm_resume
> > > > > >
> > > > > >
> > > > > > So I don't see any ordering issues that could be causing this.
> > > > >
> > > > >
> > > > > Another thing I have found is that for our platform, if the
> > > > > driver for the ethernet PHY (in this case broadcom PHY) is
> > > > > enabled, then it fails to resume but if I disable the PHY in the
> > > > > kernel configuration, then resume works. I have found that if I
> > > > > move the reinit of the RX buffers to before the startup of the
> > > > > phy, then it can resume
> > > OK with the PHY enabled.
> > > > >
> > > > > Does the following work for you? Does your platform use a
> > > > > specific ethernet PHY driver?
> > > >
> > > > I am also looking into this issue these days, we use the Realtek
> > > > RTL8211FDI
> > > PHY, driver is drivers/net/phy/realtek.c.
> > > >
> > > > For our EQOS MAC integrated in our SoC, Rx side logic depends on
> > > > RXC clock
> > > from PHY, so we need phylink_start before MAC.
> > > >
> > > > I will test below code change tomorrow to see if it can work at my
> > > > side, since
> > > it is only re-init memory, need not RXC clock.
> > > >
> > > >
> > > > > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > > > > b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > > > > index 208cae344ffa..071d15d86dbe 100644
> > > > > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > > > > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > > > > @@ -5416,19 +5416,20 @@ int stmmac_resume(struct device *dev)
> > > > >                         return ret;
> > > > >         }
> > > > > +       rtnl_lock();
> > > > > +       mutex_lock(&priv->lock);
> > > > > +       stmmac_reinit_rx_buffers(priv);
> > > > > +       mutex_unlock(&priv->lock);
> > > > > +
> > > > >         if (!device_may_wakeup(priv->device) || !priv->plat->pmt) {
> > > > > -               rtnl_lock();
> > > > >                 phylink_start(priv->phylink);
> > > > >                 /* We may have called phylink_speed_down before
> */
> > > > >                 phylink_speed_up(priv->phylink);
> > > > > -               rtnl_unlock();
> > > > >         }
> > > > > -       rtnl_lock();
> > > > >         mutex_lock(&priv->lock);
> > > > >         stmmac_reset_queues_param(priv);
> > > > > -       stmmac_reinit_rx_buffers(priv);
> > > > >         stmmac_free_tx_skbufs(priv);
> > > > >         stmmac_clear_descriptors(priv);
> > > > >
> > > > >
> > > > > It is still not clear to us why the existing call to
> > > > > stmmac_clear_descriptors() is not sufficient to fix your problem.
> > > >
> > > > During suspend/resume stress test, I found rx descriptor may not
> > > > refill when
> > > system suspended, rx descriptor could be: 008 [0x00000000c4310080]:
> > > 0x0
> > > 0x40 0x0 0x34010040.
> > > > When system resume back, stmmac_clear_descriptors() would change
> > > > this rx
> > > descriptor to: 008 [0x00000000c4310080]: 0x0 0x40 0x0 0xb5010040, a
> > > broken rx descriptor.
> > > > So at my side, stmmac_clear_descriptors() seems to be chief
> > > > culprit. I have a
> > > idea if there is way to ensure all rx descriptors are refilled when suspend
> MAC.
> > > >
> > > > > How often does the issue you see occur?
> > > > Suspend about 2000 times.
> > >
> > > Hi David, Jakub,
> > >
> > > given where we are in the release cycle, I think it'd be best to
> > > revert commit 9c63faaa931e ("net: stmmac: re-init rx buffers when
> > > mac resume
> > > back") for now.
> > >
> > > To summarize the discussion: the patch was meant as a workaround to
> > > fix an occasional suspend/resume failure on one board that was not
> > > fully root caused, and ends up causing fully reproducible
> > > suspend/resume failures on at least one other board.
> > >
> > > Joakim is looking at an alternative solution and Jon and I can
> > > provide testing from the Tegra side for any fixes.
> > >
> > > Do you want me to send a revert patch or can you revert directly on
> > > top of your tree?
> >
> > Hi Thierry, David, Jakub,
> >
> > From my point of view, it is not a good choose to send a revert patch directly.
> >
> > At my side, I have found the root cause. When system suspended, it is
> > possible that there are packets have not been received yet, such as:
> > 008 [0x00000000c4310080]: 0x0 0x40 0x0 0x34010040.
> >
> > After system resume, stmmac_clear_descriptors() clear the descriptor,
> > let it becomes below, it is a broken descriptor.
> > 008 [0x00000000c4310080]: 0x0 0x40 0x0 0xb5010040
> 
> So it sounds like that is what needs to be fixed. Reallocating all buffers and
> rewriting the descriptors seems more of a sledgehammer approach than a
> proper fix to this problem.
> 
> > I think it is a software bug there, and I don't know why others have
> > not reported it. This is a random issue, but there is a certain
> > probability that it will occur.
> 
> If this is really as rare as you say, I'm not completely surprised that nobody has
> reported it.
> 
> > My patch is a solution, may not a good solution, now it seems not a
> > workaround.
> 
> It's not an acceptable solution if it causes a regression.
> 
> > At Joh's side, said it is related to IOMMU first, and then said
> > re-init rx buffers before PHY start also can fix it, and this patch
> > also only breaks one of their boards.
> 
> It's certainly possible that IOMMU has some sort of impact on the
> reproducibility of the issue, but it's also a fact that before this patch the
> systems that are now broken had been working.
> 
> Also, it's not relevant how many boards are broken. If a patch breaks a single
> setup that used to work, that's a regression. What your patch does is basically
> exchanging one working setup for another. And the regression is even worse
> than the issue that you were trying to fix:
> Jetson TX2 reliably fails to resume properly *every time*, whereas you
> confirmed that you're only seeing this particular issue about once in
> 2000 suspend/resume cycles.
> 
> That's not how we do kernel development. Jon reported the regression 3 weeks
> ago and nobody's come up with a fix that solves this properly and for everyone.
> Given that we may only have 4 days left before the final release, the safest
> course of action at this point is to revert and then we can try again for the next
> cycle. Jon and I can help test any patches on the Tegra side.
> 
> > This makes me think there is a specific integration in their SoCs.
> 
> Even if this was an integration issue, which I doubt, that's completely irrelevant.
> What's relevant is that the setup was working before this patch.
> 
> > I have not seen others report it is broken at their side.
> 
> Prior to your patch submission, did anybody report that suspend/resume was
> broken for them? Also, if you are the only one seeing this issue, perhaps this is
> an integration issue in your SoC?
> 
> As you can see this kind of argument makes us go in circles, hence why we have
> the rule that when a patch causes a regression it either gets fixed or reverted.
> Anything else leads to insanity.
> 
> > Theoretically, at least, this patch should have no side effect.
> 
> Sorry, but that's not a valid argument. Practically this is causing a problem and
> that counts more than theory.
> 
> > In conclusion, we can revert this patch if we can find a better way to
> > fix this issue (packets have not received when system suspended).
> 
> Again, not how it works. The patch should be reverted to restore functionality
> for setups that were previously working. Then we can can try to find a better
> way to fix this issue.
> 
> I'm not confident that we can find that proper fix within the next few days, so
> let's try again for the next release cycle.

Hi Thierry,

Thanks for your detailed explanation, please send a revert patch first for your urgent requirement.

Also please describe this issue I met and the reason for this revert. We can start another mail loop to discuss this issue.
My original thought is not to make regression, I am also sad about this. We need someone is much familiar about STMMAC driver,
who could point us a better solution.

Best Regards,
Joakim Zhang
> Thierry

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back
  2021-04-14  8:06                                 ` Joakim Zhang
@ 2021-04-14 12:22                                   ` Joakim Zhang
  0 siblings, 0 replies; 21+ messages in thread
From: Joakim Zhang @ 2021-04-14 12:22 UTC (permalink / raw)
  To: Joakim Zhang, Thierry Reding
  Cc: David S. Miller, Jakub Kicinski, Jon Hunter, Giuseppe Cavallaro,
	Alexandre Torgue, Jose Abreu, netdev, Linux Kernel Mailing List,
	linux-tegra


> -----Original Message-----
> From: Joakim Zhang <qiangqing.zhang@nxp.com>
> Sent: 2021年4月14日 16:07
> To: Thierry Reding <thierry.reding@gmail.com>
> Cc: David S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>;
> Jon Hunter <jonathanh@nvidia.com>; Giuseppe Cavallaro
> <peppe.cavallaro@st.com>; Alexandre Torgue <alexandre.torgue@st.com>;
> Jose Abreu <joabreu@synopsys.com>; netdev@vger.kernel.org; Linux Kernel
> Mailing List <linux-kernel@vger.kernel.org>; linux-tegra
> <linux-tegra@vger.kernel.org>
> Subject: RE: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
> resume back
> 
> 
> > -----Original Message-----
> > From: Thierry Reding <thierry.reding@gmail.com>
> > Sent: 2021年4月14日 15:41
> > To: Joakim Zhang <qiangqing.zhang@nxp.com>
> > Cc: David S. Miller <davem@davemloft.net>; Jakub Kicinski
> > <kuba@kernel.org>; Jon Hunter <jonathanh@nvidia.com>; Giuseppe
> > Cavallaro <peppe.cavallaro@st.com>; Alexandre Torgue
> > <alexandre.torgue@st.com>; Jose Abreu <joabreu@synopsys.com>;
> > netdev@vger.kernel.org; Linux Kernel Mailing List
> > <linux-kernel@vger.kernel.org>; linux-tegra
> > <linux-tegra@vger.kernel.org>
> > Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers
> > when mac resume back
> >
> > On Wed, Apr 14, 2021 at 02:18:58AM +0000, Joakim Zhang wrote:
> > >
> > > > -----Original Message-----
> > > > From: Thierry Reding <thierry.reding@gmail.com>
> > > > Sent: 2021年4月14日 0:07
> > > > To: David S. Miller <davem@davemloft.net>; Jakub Kicinski
> > > > <kuba@kernel.org>
> > > > Cc: Joakim Zhang <qiangqing.zhang@nxp.com>; Jon Hunter
> > > > <jonathanh@nvidia.com>; Giuseppe Cavallaro
> > > > <peppe.cavallaro@st.com>; Alexandre Torgue
> > > > <alexandre.torgue@st.com>; Jose Abreu <joabreu@synopsys.com>;
> > > > netdev@vger.kernel.org; Linux Kernel Mailing List
> > > > <linux-kernel@vger.kernel.org>; linux-tegra
> > > > <linux-tegra@vger.kernel.org>
> > > > Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers
> > > > when mac resume back
> > > >
> > > > On Tue, Apr 13, 2021 at 12:13:01PM +0000, Joakim Zhang wrote:
> > > > >
> > > > > Hi Jon,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jon Hunter <jonathanh@nvidia.com>
> > > > > > Sent: 2021年4月13日 16:41
> > > > > > To: Joakim Zhang <qiangqing.zhang@nxp.com>; Giuseppe Cavallaro
> > > > > > <peppe.cavallaro@st.com>; Alexandre Torgue
> > > > > > <alexandre.torgue@st.com>; Jose Abreu <joabreu@synopsys.com>
> > > > > > Cc: netdev@vger.kernel.org; Linux Kernel Mailing List
> > > > > > <linux-kernel@vger.kernel.org>; linux-tegra
> > > > > > <linux-tegra@vger.kernel.org>; Jakub Kicinski
> > > > > > <kuba@kernel.org>
> > > > > > Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx
> > > > > > buffers when mac resume back
> > > > > >
> > > > > >
> > > > > > On 01/04/2021 17:28, Jon Hunter wrote:
> > > > > > >
> > > > > > > On 31/03/2021 12:41, Joakim Zhang wrote:
> > > > > > >
> > > > > > > ...
> > > > > > >
> > > > > > >>> In answer to your question, resuming from suspend does
> > > > > > >>> work on this board without your change. We have been
> > > > > > >>> testing suspend/resume now on this board since Linux v5.8
> > > > > > >>> and so we have the ability to bisect such regressions. So
> > > > > > >>> it is clear to me that this is the change that caused
> > > > > > this, but I am not sure why.
> > > > > > >>
> > > > > > >> Yes, I know this issue is regression caused by my patch. I
> > > > > > >> just want to
> > > > > > analyze the potential reasons. Due to the code change only
> > > > > > related to the page recycle and reallocate.
> > > > > > >> So I guess if this page operate need IOMMU works when IOMMU
> > > > > > >> is
> > > > enabled.
> > > > > > Could you help check if IOMMU driver resume before STMMAC? Our
> > > > > > common desire is to find the root cause, right?
> > > > > > >
> > > > > > >
> > > > > > > Yes of course that is the desire here indeed. I had assumed
> > > > > > > that the suspend/resume order was good because we have never
> > > > > > > seen any problems, but nonetheless it is always good to check.
> > > > > > > Using ftrace I enabled tracing of the appropriate
> > > > > > > suspend/resume functions and this is what I see ...
> > > > > > >
> > > > > > > # tracer: function
> > > > > > > #
> > > > > > > # entries-in-buffer/entries-written: 4/4   #P:6
> > > > > > > #
> > > > > > > #                                _-----=> irqs-off
> > > > > > > #                               / _----=> need-resched
> > > > > > > #                              | / _---=> hardirq/softirq
> > > > > > > #                              || / _--=> preempt-depth
> > > > > > > #                              ||| /     delay
> > > > > > > #           TASK-PID     CPU#  ||||   TIMESTAMP
> > FUNCTION
> > > > > > > #              | |         |   ||||      |         |
> > > > > > >          rtcwake-748     [000] ...1   536.700777:
> > > > > > stmmac_pltfr_suspend <-platform_pm_suspend
> > > > > > >          rtcwake-748     [000] ...1   536.735532:
> > > > > > arm_smmu_pm_suspend <-platform_pm_suspend
> > > > > > >          rtcwake-748     [000] ...1   536.757290:
> > > > > > arm_smmu_pm_resume <-platform_pm_resume
> > > > > > >          rtcwake-748     [003] ...1   536.856771:
> > > > > > stmmac_pltfr_resume <-platform_pm_resume
> > > > > > >
> > > > > > >
> > > > > > > So I don't see any ordering issues that could be causing this.
> > > > > >
> > > > > >
> > > > > > Another thing I have found is that for our platform, if the
> > > > > > driver for the ethernet PHY (in this case broadcom PHY) is
> > > > > > enabled, then it fails to resume but if I disable the PHY in
> > > > > > the kernel configuration, then resume works. I have found that
> > > > > > if I move the reinit of the RX buffers to before the startup
> > > > > > of the phy, then it can resume
> > > > OK with the PHY enabled.
> > > > > >
> > > > > > Does the following work for you? Does your platform use a
> > > > > > specific ethernet PHY driver?
> > > > >
> > > > > I am also looking into this issue these days, we use the Realtek
> > > > > RTL8211FDI
> > > > PHY, driver is drivers/net/phy/realtek.c.
> > > > >
> > > > > For our EQOS MAC integrated in our SoC, Rx side logic depends on
> > > > > RXC clock
> > > > from PHY, so we need phylink_start before MAC.
> > > > >
> > > > > I will test below code change tomorrow to see if it can work at
> > > > > my side, since
> > > > it is only re-init memory, need not RXC clock.
> > > > >
> > > > >
> > > > > > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > > > > > b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > > > > > index 208cae344ffa..071d15d86dbe 100644
> > > > > > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > > > > > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > > > > > @@ -5416,19 +5416,20 @@ int stmmac_resume(struct device *dev)
> > > > > >                         return ret;
> > > > > >         }
> > > > > > +       rtnl_lock();
> > > > > > +       mutex_lock(&priv->lock);
> > > > > > +       stmmac_reinit_rx_buffers(priv);
> > > > > > +       mutex_unlock(&priv->lock);
> > > > > > +
> > > > > >         if (!device_may_wakeup(priv->device) || !priv->plat->pmt) {
> > > > > > -               rtnl_lock();
> > > > > >                 phylink_start(priv->phylink);
> > > > > >                 /* We may have called phylink_speed_down
> > > > > > before
> > */
> > > > > >                 phylink_speed_up(priv->phylink);
> > > > > > -               rtnl_unlock();
> > > > > >         }
> > > > > > -       rtnl_lock();
> > > > > >         mutex_lock(&priv->lock);
> > > > > >         stmmac_reset_queues_param(priv);
> > > > > > -       stmmac_reinit_rx_buffers(priv);
> > > > > >         stmmac_free_tx_skbufs(priv);
> > > > > >         stmmac_clear_descriptors(priv);
> > > > > >
> > > > > >
> > > > > > It is still not clear to us why the existing call to
> > > > > > stmmac_clear_descriptors() is not sufficient to fix your problem.
> > > > >
> > > > > During suspend/resume stress test, I found rx descriptor may not
> > > > > refill when
> > > > system suspended, rx descriptor could be: 008 [0x00000000c4310080]:
> > > > 0x0
> > > > 0x40 0x0 0x34010040.
> > > > > When system resume back, stmmac_clear_descriptors() would change
> > > > > this rx
> > > > descriptor to: 008 [0x00000000c4310080]: 0x0 0x40 0x0 0xb5010040,
> > > > a broken rx descriptor.
> > > > > So at my side, stmmac_clear_descriptors() seems to be chief
> > > > > culprit. I have a
> > > > idea if there is way to ensure all rx descriptors are refilled
> > > > when suspend
> > MAC.
> > > > >
> > > > > > How often does the issue you see occur?
> > > > > Suspend about 2000 times.
> > > >
> > > > Hi David, Jakub,
> > > >
> > > > given where we are in the release cycle, I think it'd be best to
> > > > revert commit 9c63faaa931e ("net: stmmac: re-init rx buffers when
> > > > mac resume
> > > > back") for now.
> > > >
> > > > To summarize the discussion: the patch was meant as a workaround
> > > > to fix an occasional suspend/resume failure on one board that was
> > > > not fully root caused, and ends up causing fully reproducible
> > > > suspend/resume failures on at least one other board.
> > > >
> > > > Joakim is looking at an alternative solution and Jon and I can
> > > > provide testing from the Tegra side for any fixes.
> > > >
> > > > Do you want me to send a revert patch or can you revert directly
> > > > on top of your tree?
> > >
> > > Hi Thierry, David, Jakub,
> > >
> > > From my point of view, it is not a good choose to send a revert patch
> directly.
> > >
> > > At my side, I have found the root cause. When system suspended, it
> > > is possible that there are packets have not been received yet, such as:
> > > 008 [0x00000000c4310080]: 0x0 0x40 0x0 0x34010040.
> > >
> > > After system resume, stmmac_clear_descriptors() clear the
> > > descriptor, let it becomes below, it is a broken descriptor.
> > > 008 [0x00000000c4310080]: 0x0 0x40 0x0 0xb5010040
> >
> > So it sounds like that is what needs to be fixed. Reallocating all
> > buffers and rewriting the descriptors seems more of a sledgehammer
> > approach than a proper fix to this problem.
> >
> > > I think it is a software bug there, and I don't know why others have
> > > not reported it. This is a random issue, but there is a certain
> > > probability that it will occur.
> >
> > If this is really as rare as you say, I'm not completely surprised
> > that nobody has reported it.
> >
> > > My patch is a solution, may not a good solution, now it seems not a
> > > workaround.
> >
> > It's not an acceptable solution if it causes a regression.
> >
> > > At Joh's side, said it is related to IOMMU first, and then said
> > > re-init rx buffers before PHY start also can fix it, and this patch
> > > also only breaks one of their boards.
> >
> > It's certainly possible that IOMMU has some sort of impact on the
> > reproducibility of the issue, but it's also a fact that before this
> > patch the systems that are now broken had been working.
> >
> > Also, it's not relevant how many boards are broken. If a patch breaks
> > a single setup that used to work, that's a regression. What your patch
> > does is basically exchanging one working setup for another. And the
> > regression is even worse than the issue that you were trying to fix:
> > Jetson TX2 reliably fails to resume properly *every time*, whereas you
> > confirmed that you're only seeing this particular issue about once in
> > 2000 suspend/resume cycles.
> >
> > That's not how we do kernel development. Jon reported the regression 3
> > weeks ago and nobody's come up with a fix that solves this properly and for
> everyone.
> > Given that we may only have 4 days left before the final release, the
> > safest course of action at this point is to revert and then we can try
> > again for the next cycle. Jon and I can help test any patches on the Tegra
> side.
> >
> > > This makes me think there is a specific integration in their SoCs.
> >
> > Even if this was an integration issue, which I doubt, that's completely
> irrelevant.
> > What's relevant is that the setup was working before this patch.
> >
> > > I have not seen others report it is broken at their side.
> >
> > Prior to your patch submission, did anybody report that suspend/resume
> > was broken for them? Also, if you are the only one seeing this issue,
> > perhaps this is an integration issue in your SoC?
> >
> > As you can see this kind of argument makes us go in circles, hence why
> > we have the rule that when a patch causes a regression it either gets fixed or
> reverted.
> > Anything else leads to insanity.
> >
> > > Theoretically, at least, this patch should have no side effect.
> >
> > Sorry, but that's not a valid argument. Practically this is causing a
> > problem and that counts more than theory.
> >
> > > In conclusion, we can revert this patch if we can find a better way
> > > to fix this issue (packets have not received when system suspended).
> >
> > Again, not how it works. The patch should be reverted to restore
> > functionality for setups that were previously working. Then we can can
> > try to find a better way to fix this issue.
> >
> > I'm not confident that we can find that proper fix within the next few
> > days, so let's try again for the next release cycle.
> 
> Hi Thierry,
> 
> Thanks for your detailed explanation, please send a revert patch first for your
> urgent requirement.
> 
> Also please describe this issue I met and the reason for this revert. We can
> start another mail loop to discuss this issue.
> My original thought is not to make regression, I am also sad about this. We
> need someone is much familiar about STMMAC driver, who could point us a
> better solution.

Hi Thierry,

After you revert that patch, could you help test below code change if possible? I check the stmmac driver, from the first version, it will clear both rx and tx descriptors.
I change the behavior to only clear tx descriptors, not sure if it also have ang other risks, need more people review it. Now I am also doing the overnight stress test.

--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -5412,9 +5412,12 @@ static void stmmac_reset_queues_param(struct stmmac_priv *priv)
                tx_q->mss = 0;

                netdev_tx_reset_queue(netdev_get_tx_queue(priv->dev, queue));
+
+               stmmac_clear_tx_descriptors(priv, queue);
        }
 }

 /**
  * stmmac_resume - resume callback
  * @dev: device pointer
@@ -5470,9 +5473,7 @@ int stmmac_resume(struct device *dev)
        mutex_lock(&priv->lock);

        stmmac_reset_queues_param(priv);
-       stmmac_reinit_rx_buffers(priv);
        stmmac_free_tx_skbufs(priv);
-       stmmac_clear_descriptors(priv);

        stmmac_hw_setup(ndev, false);
        stmmac_init_coalesce(priv);

> Best Regards,
> Joakim Zhang
> > Thierry

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2021-04-14 12:22 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-24 10:50 Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back Jon Hunter
2021-03-24 12:20 ` Joakim Zhang
2021-03-24 12:39   ` Jon Hunter
2021-03-25  7:53     ` Joakim Zhang
2021-03-25  8:00       ` Jon Hunter
2021-03-25  8:12         ` Joakim Zhang
2021-03-30 12:50           ` Jon Hunter
2021-03-31  7:43             ` Joakim Zhang
2021-03-31 11:10               ` Joakim Zhang
2021-03-31 11:29                 ` Jon Hunter
2021-03-31 11:41                   ` Joakim Zhang
2021-04-01 16:28                     ` Jon Hunter
2021-04-13  8:41                       ` Jon Hunter
2021-04-13 12:13                         ` Joakim Zhang
2021-04-13 16:06                           ` Thierry Reding
2021-04-13 16:43                             ` Jakub Kicinski
2021-04-14  2:18                             ` Joakim Zhang
2021-04-14  7:40                               ` Thierry Reding
2021-04-14  8:06                                 ` Joakim Zhang
2021-04-14 12:22                                   ` Joakim Zhang
2021-03-31 11:11               ` Jon Hunter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).