Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)

* Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
@ 2018-09-28 15:54 Maciej S. Szmigiero
  2018-09-28 22:00 ` Chris Clayton
  0 siblings, 1 reply; 22+ messages in thread
From: Maciej S. Szmigiero @ 2018-09-28 15:54 UTC (permalink / raw)
  To: Chris Clayton
  Cc: David S. Miller, Azat Khuzhin, Greg Kroah-Hartman,
	Heiner Kallweit, Realtek linux nic maintainers, linux-kernel

Hi,

> Hi,
> 
> I upgraded my kernel to 4.18.10 recently and have since been experiencing network problems after resuming from a
> suspend to RAM or disk. I previously had 4.18.6 and that was OK.
> 
> The pattern of the problem is that when I first boot, the network is fine. But, after resume from suspend I find that
> the time taken for a ping of one of my ISP's nameservers increases from 14-15ms to more than 1000ms. Moreover, when I
> open a browser (chromium or firefox), it fails to retrieve my home page (https://www.google.co.uk) and pings of the
> nameserver fail with the message "Destination Host Unreachable". Often, I can revive the network by stopping it with
> /sbin/if{down,up} but sometimes it is necessary to also remove the r8169 module and load it again.

Please have a look at the following thread:
https://lkml.org/lkml/2018/9/25/1118

Maciej



* Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
  2018-09-28 15:54 R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev) Maciej S. Szmigiero
@ 2018-09-28 22:00 ` Chris Clayton
  2018-09-28 22:13   ` Heiner Kallweit
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Clayton @ 2018-09-28 22:00 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: David S. Miller, Azat Khuzhin, Greg Kroah-Hartman,
	Heiner Kallweit, Realtek linux nic maintainers, linux-kernel

Thanks Maciej.

On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
> Hi,
> 
>> Hi,
>>
>> I upgraded my kernel to 4.18.10 recently and have since been experiencing network problems after resuming from a
>> suspend to RAM or disk. I previously had 4.18.6 and that was OK.
>>
>> The pattern of the problem is that when I first boot, the network is fine. But, after resume from suspend I find that
>> the time taken for a ping of one of my ISP's nameservers increases from 14-15ms to more than 1000ms. Moreover, when I
>> open a browser (chromium or firefox), it fails to retrieve my home page (https://www.google.co.uk) and pings of the
>> nameserver fail with the message "Destination Host Unreachable". Often, I can revive the network by stopping it with
>> /sbin/if{down,up} but sometimes it is necessary to also remove the r8169 module and load it again.
> 
> Please have a look at the following thread:
> https://lkml.org/lkml/2018/9/25/1118
> 

I applied your patch for the 4.18 stable kernels to 4.18.10, but the problem is not solved by it. Similarly, I applied
Heiner's patch to the 4.19, but again the problem is not solved.

> Maciej
> 
Chris


* Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
  2018-09-28 22:00 ` Chris Clayton
@ 2018-09-28 22:13   ` Heiner Kallweit
  2018-09-29  7:25     ` Chris Clayton
  2018-10-04  8:41     ` Chris Clayton
  0 siblings, 2 replies; 22+ messages in thread
From: Heiner Kallweit @ 2018-09-28 22:13 UTC (permalink / raw)
  To: Chris Clayton, Maciej S. Szmigiero
  Cc: David S. Miller, Azat Khuzhin, Greg Kroah-Hartman,
	Realtek linux nic maintainers, linux-kernel

On 29.09.2018 00:00, Chris Clayton wrote:
> Thanks Maciej.
> 
> On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
>> Hi,
>>
>>> Hi,
>>>
>>> I upgraded my kernel to 4.18.10 recently and have since been experiencing network problems after resuming from a
>>> suspend to RAM or disk. I previously had 4.18.6 and that was OK.
>>>
>>> The pattern of the problem is that when I first boot, the network is fine. But, after resume from suspend I find that
>>> the time taken for a ping of one of my ISP's nameservers increases from 14-15ms to more than 1000ms. Moreover, when I
>>> open a browser (chromium or firefox), it fails to retrieve my home page (https://www.google.co.uk) and pings of the
>>> nameserver fail with the message "Destination Host Unreachable". Often, I can revive the network by stopping it with
>>> /sbin/if{down,up} but sometimes it is necessary to also remove the r8169 module and load it again.
>>
>> Please have a look at the following thread:
>> https://lkml.org/lkml/2018/9/25/1118
>>
> 
> I applied your patch for the 4.18 stable kernels to 4.18.10, but the problem is not solved by it. Similarly, I applied
> Heiner's patch to the 4.19, but again the problem is not solved.
> 
I think we talk about two different issues here. The one the fix is for has no link to suspend/resume.

Chris, the lspci output doesn't provide enough detail to determine the exact chip version.
Can you provide the dmesg part with the XID?
According to your lspci output neither MSI nor MSI-X is active.
Do you have to use nomsi for whatever reason?
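
A minimal sketch of how both pieces of information could be pulled out of the logs. It assumes the probe line has the form "r8169 ... XID <hex>, IRQ <n>" and that lspci -vv prints "MSI: Enable+/-" and "MSI-X: Enable+/-" capability lines; the function names extract_xid and msi_state are made up for illustration.

```shell
# Print the XID field from r8169 probe lines read on stdin.
extract_xid() {
    sed -n 's/.*r8169.*XID \([0-9A-Fa-f]*\).*/\1/p'
}

# Print the MSI / MSI-X enable flags found in `lspci -vv` output on stdin.
# Relies on `grep -o`, which GNU and BSD grep provide.
msi_state() {
    grep -oE 'MSI(-X)?: Enable[+-]'
}

# Intended use (not executed here):
#   dmesg | extract_xid
#   lspci -d :8168 -vv | msi_state
```

"Enable-" on both lines would mean the device is running on a legacy INTx interrupt.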

Heiner

>> Maciej
>>
> Chris
> 



* Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
  2018-09-28 22:13   ` Heiner Kallweit
@ 2018-09-29  7:25     ` Chris Clayton
  2018-09-29  7:38       ` Chris Clayton
  2018-10-04  8:41     ` Chris Clayton
  1 sibling, 1 reply; 22+ messages in thread
From: Chris Clayton @ 2018-09-29  7:25 UTC (permalink / raw)
  To: Heiner Kallweit, Maciej S. Szmigiero
  Cc: David S. Miller, Azat Khuzhin, Greg Kroah-Hartman,
	Realtek linux nic maintainers, linux-kernel



On 28/09/2018 23:13, Heiner Kallweit wrote:
> On 29.09.2018 00:00, Chris Clayton wrote:
>> Thanks Maciej.
>>
>> On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
>>> Hi,
>>>
>>>> Hi,
>>>>
>>>> I upgraded my kernel to 4.18.10 recently and have since been experiencing network problems after resuming from a
>>>> suspend to RAM or disk. I previously had 4.18.6 and that was OK.
>>>>
>>>> The pattern of the problem is that when I first boot, the network is fine. But, after resume from suspend I find that
>>>> the time taken for a ping of one of my ISP's nameservers increases from 14-15ms to more than 1000ms. Moreover, when I
>>>> open a browser (chromium or firefox), it fails to retrieve my home page (https://www.google.co.uk) and pings of the
>>>> nameserver fail with the message "Destination Host Unreachable". Often, I can revive the network by stopping it with
>>>> /sbin/if{down,up} but sometimes it is necessary to also remove the r8169 module and load it again.
>>>
>>> Please have a look at the following thread:
>>> https://lkml.org/lkml/2018/9/25/1118
>>>
>>
>> I applied your patch for the 4.18 stable kernels to 4.18.10, but the problem is not solved by it. Similarly, I applied
>> Heiner's patch to the 4.19, but again the problem is not solved.
>>
> I think we talk about two different issues here. The one the fix is for has no link to suspend/resume.
> 
> Chris, the lspci output doesn't provide enough detail to determine the exact chip version.
> Can you provide the dmesg part with the XID?

$ dmesg | grep -i r8169
[    5.320679] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[    5.321432] r8169 0000:05:00.2: can't disable ASPM; OS doesn't have ASPM control
[    5.322892] r8169 0000:05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 19
[    5.323786] r8169 0000:05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[   10.232077] r8169 0000:05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
[   10.235218] r8169 0000:05:00.2 eth0: link down
[   11.717460] r8169 0000:05:00.2 eth0: link up

$ dmesg | grep -i r8169
[    5.208040] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[    5.208677] r8169 0000:05:00.2: can't disable ASPM; OS doesn't have ASPM control
[    5.210066] r8169 0000:05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 29
[    5.210676] r8169 0000:05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[   10.456081] r8169 0000:05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
[   10.459217] r8169 0000:05:00.2 eth0: link down
[   10.459880] r8169 0000:05:00.2 eth0: link down
[   12.015158] r8169 0000:05:00.2 eth0: link up


> According to your lspci output neither MSI nor MSI-X is active.
> Do you have to use nomsi for whatever reason?

No, I do not use nomsi, but MSI wasn't enabled in my kernel config. I'm 99% sure that it used to be - I've no idea how
it got dropped. If I'm not sure about an option, I start by taking the recommendation in the kconfig help. Help on MSI
has a very clear "say Y".
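
A small sketch of how the MSI setting could be checked against a kernel config. CONFIG_PCI_MSI is the kconfig symbol for PCI MSI support; the function name pci_msi_enabled and the config locations in the usage comments are assumptions (/proc/config.gz only exists when the kernel was built with CONFIG_IKCONFIG_PROC).

```shell
# Succeed (exit 0) if the kernel config read on stdin has CONFIG_PCI_MSI=y.
pci_msi_enabled() {
    grep -q '^CONFIG_PCI_MSI=y$'
}

# Intended use (not executed here):
#   zcat /proc/config.gz | pci_msi_enabled && echo "MSI enabled"
#   pci_msi_enabled < /usr/src/linux/.config || echo "MSI not enabled"
```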

> 
> Heiner
> 
>>> Maciej
>>>
>> Chris
>>
> 


* Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
  2018-09-29  7:25     ` Chris Clayton
@ 2018-09-29  7:38       ` Chris Clayton
  0 siblings, 0 replies; 22+ messages in thread
From: Chris Clayton @ 2018-09-29  7:38 UTC (permalink / raw)
  To: Heiner Kallweit, Maciej S. Szmigiero
  Cc: David S. Miller, Azat Khuzhin, Greg Kroah-Hartman,
	Realtek linux nic maintainers, linux-kernel

Sorry, sent by accident. Note to self - don't attempt email until after second cup of coffee.

On 29/09/2018 08:25, Chris Clayton wrote:
> 
> 
> On 28/09/2018 23:13, Heiner Kallweit wrote:
>> On 29.09.2018 00:00, Chris Clayton wrote:
>>> Thanks Maciej.
>>>
>>> On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
>>>> Hi,
>>>>
>>>>> Hi,
>>>>>
>>>>> I upgraded my kernel to 4.18.10 recently and have since been experiencing network problems after resuming from a
>>>>> suspend to RAM or disk. I previously had 4.18.6 and that was OK.
>>>>>
>>>>> The pattern of the problem is that when I first boot, the network is fine. But, after resume from suspend I find that
>>>>> the time taken for a ping of one of my ISP's nameservers increases from 14-15ms to more than 1000ms. Moreover, when I
>>>>> open a browser (chromium or firefox), it fails to retrieve my home page (https://www.google.co.uk) and pings of the
>>>>> nameserver fail with the message "Destination Host Unreachable". Often, I can revive the network by stopping it with
>>>>> /sbin/if{down,up} but sometimes it is necessary to also remove the r8169 module and load it again.
>>>>
>>>> Please have a look at the following thread:
>>>> https://lkml.org/lkml/2018/9/25/1118
>>>>
>>>
>>> I applied your patch for the 4.18 stable kernels to 4.18.10, but the problem is not solved by it. Similarly, I applied
>>> Heiner's patch to the 4.19, but again the problem is not solved.
>>>
>> I think we talk about two different issues here. The one the fix is for has no link to suspend/resume.
>>
>> Chris, the lspci output doesn't provide enough detail to determine the exact chip version.
>> Can you provide the dmesg part with the XID?

I meant to say that I have now re-enabled MSI in 4.18.7 - the latest stable series kernel in which eth0 continues to
function reliably after a suspend/resume cycle. The second dmesg output below is taken from that kernel. The first one
was from an up-to-date 4.19 kernel.
> 
> $ dmesg | grep -i r8169
> [    5.320679] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
> [    5.321432] r8169 0000:05:00.2: can't disable ASPM; OS doesn't have ASPM control
> [    5.322892] r8169 0000:05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 19
> [    5.323786] r8169 0000:05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
> [   10.232077] r8169 0000:05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
> [   10.235218] r8169 0000:05:00.2 eth0: link down
> [   11.717460] r8169 0000:05:00.2 eth0: link up
> 
> $ dmesg | grep -i r8169
> [    5.208040] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
> [    5.208677] r8169 0000:05:00.2: can't disable ASPM; OS doesn't have ASPM control
> [    5.210066] r8169 0000:05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 29
> [    5.210676] r8169 0000:05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
> [   10.456081] r8169 0000:05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
> [   10.459217] r8169 0000:05:00.2 eth0: link down
> [   10.459880] r8169 0000:05:00.2 eth0: link down
> [   12.015158] r8169 0000:05:00.2 eth0: link up
> 
> 
>> According to your lspci output neither MSI nor MSI-X is active.
>> Do you have to use nomsi for whatever reason?
> 
> No, I do not use nomsi, but MSI wasn't enabled in my kernel config. I'm 99% sure that it used to be - I've no idea how
> it got dropped. If I'm not sure about an option, I start by taking the recommendation in the kconfig help. Help on MSI
> has a very clear "say Y".

As I said above I have re-enabled MSI.
> 
>>
>> Heiner
>>
>>>> Maciej
>>>>
>>> Chris
>>>
>>


* Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
  2018-09-28 22:13   ` Heiner Kallweit
  2018-09-29  7:25     ` Chris Clayton
@ 2018-10-04  8:41     ` Chris Clayton
  2018-10-07 19:36       ` Chris Clayton
  1 sibling, 1 reply; 22+ messages in thread
From: Chris Clayton @ 2018-10-04  8:41 UTC (permalink / raw)
  To: Heiner Kallweit, Maciej S. Szmigiero
  Cc: David S. Miller, Azat Khuzhin, Greg Kroah-Hartman,
	Realtek linux nic maintainers, linux-kernel

Hi Heiner,

Here's the reply to your questions. Sorry for the delay.

On 28/09/2018 23:13, Heiner Kallweit wrote:
> On 29.09.2018 00:00, Chris Clayton wrote:
>> Thanks Maciej.
>>
>> On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
>>> Hi,
>>>
>>>> Hi,
>>>>
>>>> I upgraded my kernel to 4.18.10 recently and have since been experiencing network problems after resuming from a
>>>> suspend to RAM or disk. I previously had 4.18.6 and that was OK.
>>>>
>>>> The pattern of the problem is that when I first boot, the network is fine. But, after resume from suspend I find that
>>>> the time taken for a ping of one of my ISP's nameservers increases from 14-15ms to more than 1000ms. Moreover, when I
>>>> open a browser (chromium or firefox), it fails to retrieve my home page (https://www.google.co.uk) and pings of the
>>>> nameserver fail with the message "Destination Host Unreachable". Often, I can revive the network by stopping it with
>>>> /sbin/if{down,up} but sometimes it is necessary to also remove the r8169 module and load it again.
>>>
>>> Please have a look at the following thread:
>>> https://lkml.org/lkml/2018/9/25/1118
>>>
>>
>> I applied your patch for the 4.18 stable kernels to 4.18.10, but the problem is not solved by it. Similarly, I applied
>> Heiner's patch to the 4.19, but again the problem is not solved.
>>
> I think we talk about two different issues here. The one the fix is for has no link to suspend/resume.
> 
> Chris, the lspci output doesn't provide enough detail to determine the exact chip version.
> Can you provide the dmesg part with the XID?

$ dmesg | grep r8169
[    5.274938] libphy: r8169: probed
[    5.276563] r8169 0000:05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 29
[    5.278158] r8169 0000:05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[    9.275275] RTL8211E Gigabit Ethernet r8169-502:00: attached PHY driver [RTL8211E Gigabit Ethernet] (mii_bus:phy_addr=r8169-502:00, irq=IGNORE)
[    9.460876] r8169 0000:05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
[   11.005336] r8169 0000:05:00.2 eth0: Link is Up - 100Mbps/Full - flow control rx/tx

> According to your lspci output neither MSI nor MSI-X is active.
> Do you have to use nomsi for whatever reason?
> 

No, I do not use nomsi, but MSI wasn't enabled in my kernel config. I'm 99% sure that it used to be - I've no idea how
it got dropped. If I'm not sure about an option, I start by taking the recommendation in the kconfig help. Help on MSI
has a very clear "say Y". I've re-enabled it now.

Chris

> Heiner
> 
>>> Maciej
>>>
>> Chris
>>
> 
> 


* Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
  2018-10-04  8:41     ` Chris Clayton
@ 2018-10-07 19:36       ` Chris Clayton
  2018-10-09 12:32         ` Maciej S. Szmigiero
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Clayton @ 2018-10-07 19:36 UTC (permalink / raw)
  To: Heiner Kallweit, Maciej S. Szmigiero
  Cc: David S. Miller, Azat Khuzhin, Greg Kroah-Hartman,
	Realtek linux nic maintainers, linux-kernel

Hi again,

I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
in the browser. Pinging one of my ISP's name servers doesn't fail quite so quickly but the reported time increases from
14-15ms to more than 1000ms.

Chris

On 04/10/2018 09:41, Chris Clayton wrote:
> Hi Heiner,
> 
> Here's the reply to your questions. Sorry for the delay.
> 
> On 28/09/2018 23:13, Heiner Kallweit wrote:
>> On 29.09.2018 00:00, Chris Clayton wrote:
>>> Thanks Maciej.
>>>
>>> On 28/09/2018 16:54, Maciej S. Szmigiero wrote:
>>>> Hi,
>>>>
>>>>> Hi,
>>>>>
>>>>> I upgraded my kernel to 4.18.10 recently and have since been experiencing network problems after resuming from a
>>>>> suspend to RAM or disk. I previously had 4.18.6 and that was OK.
>>>>>
>>>>> The pattern of the problem is that when I first boot, the network is fine. But, after resume from suspend I find that
>>>>> the time taken for a ping of one of my ISP's nameservers increases from 14-15ms to more than 1000ms. Moreover, when I
>>>>> open a browser (chromium or firefox), it fails to retrieve my home page (https://www.google.co.uk) and pings of the
>>>>> nameserver fail with the message "Destination Host Unreachable". Often, I can revive the network by stopping it with
>>>>> /sbin/if{down,up} but sometimes it is necessary to also remove the r8169 module and load it again.
>>>>
>>>> Please have a look at the following thread:
>>>> https://lkml.org/lkml/2018/9/25/1118
>>>>
>>>
>>> I applied your patch for the 4.18 stable kernels to 4.18.10, but the problem is not solved by it. Similarly, I applied
>>> Heiner's patch to the 4.19, but again the problem is not solved.
>>>
>> I think we talk about two different issues here. The one the fix is for has no link to suspend/resume.
>>
>> Chris, the lspci output doesn't provide enough detail to determine the exact chip version.
>> Can you provide the dmesg part with the XID?
> 
> $ dmesg | grep r8169
> [    5.274938] libphy: r8169: probed
> [    5.276563] r8169 0000:05:00.2 eth0: RTL8411, 80:fa:5b:08:d0:3d, XID 48800800, IRQ 29
> [    5.278158] r8169 0000:05:00.2 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
> [    9.275275] RTL8211E Gigabit Ethernet r8169-502:00: attached PHY driver [RTL8211E Gigabit Ethernet] (mii_bus:phy_addr=r8169-502:00, irq=IGNORE)
> [    9.460876] r8169 0000:05:00.2 eth0: No native access to PCI extended config space, falling back to CSI
> [   11.005336] r8169 0000:05:00.2 eth0: Link is Up - 100Mbps/Full - flow control rx/tx
> 
>> According to your lspci output neither MSI nor MSI-X is active.
>> Do you have to use nomsi for whatever reason?
>>
> 
> No, I do not use nomsi, but MSI wasn't enabled in my kernel config. I'm 99% sure that it used to be - I've no idea how
> it got dropped. If I'm not sure about an option, I start by taking the recommendation in the kconfig help. Help on MSI
> has a very clear "say Y". I've re-enabled it now.
> 
> Chris
> 
>> Heiner
>>
>>>> Maciej
>>>>
>>> Chris
>>>
>>
>>


* Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
  2018-10-07 19:36       ` Chris Clayton
@ 2018-10-09 12:32         ` Maciej S. Szmigiero
  2018-10-09 14:40           ` Chris Clayton
  0 siblings, 1 reply; 22+ messages in thread
From: Maciej S. Szmigiero @ 2018-10-09 12:32 UTC (permalink / raw)
  To: Chris Clayton
  Cc: Heiner Kallweit, David S. Miller, Azat Khuzhin,
	Greg Kroah-Hartman, Realtek linux nic maintainers, linux-kernel

On 07.10.2018 21:36, Chris Clayton wrote:
> Hi again,
> 
> I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
> regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
> browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
> in the browser. Pinging one of my ISP's name servers doesn't fail quite so quickly but the reported time increases from
> 14-15ms to more than 1000ms.

You can try comparing chip registers (ethtool -d eth0) in the working
state (before a suspend) and in the broken state (after a resume).
Maybe there will be some obvious difference.

The same goes for the PCI configuration (lspci -d :8168 -vv).
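
A minimal sketch of that comparison, assuming the interface is eth0 and that ethtool and lspci are run as root. The function names and snapshot file names are illustrative only.

```shell
# Write the register dump and PCI configuration for one interface to a file.
snapshot_nic() {
    iface=$1
    out=$2
    {
        echo "ethtool -d $iface"
        ethtool -d "$iface"
        echo "lspci -d :8168 -vv"
        lspci -d :8168 -vv
    } > "$out" 2>&1
}

# Print a unified diff of two snapshots.
# Returns 0 when they are identical, non-zero when they differ or diff fails.
compare_snapshots() {
    if diff -u "$1" "$2"; then
        echo "no differences between $1 and $2"
    else
        return 1
    fi
}

# Intended use (not executed here):
#   snapshot_nic eth0 r8169-pre-suspend
#   ... suspend, resume, wait for the network to misbehave ...
#   snapshot_nic eth0 r8169-post-suspend
#   compare_snapshots r8169-pre-suspend r8169-post-suspend
```

Taking the second snapshot only once the network is visibly broken avoids comparing two healthy states.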

> Chris

Maciej


* Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
  2018-10-09 12:32         ` Maciej S. Szmigiero
@ 2018-10-09 14:40           ` Chris Clayton
  2018-10-09 20:36             ` Heiner Kallweit
  2018-10-09 21:39             ` Heiner Kallweit
  0 siblings, 2 replies; 22+ messages in thread
From: Chris Clayton @ 2018-10-09 14:40 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Heiner Kallweit, David S. Miller, Azat Khuzhin,
	Greg Kroah-Hartman, Realtek linux nic maintainers, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1635 bytes --]

Thanks to Maciej and Heiner for their replies.

On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
> On 07.10.2018 21:36, Chris Clayton wrote:
>> Hi again,
>>
>> I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
>> regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
>> browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
>> in the browser. Pinging one of my ISP's name servers doesn't fail quite so quickly but the reported time increases from
>> 14-15ms to more than 1000ms.
> 
> You can try comparing chip registers (ethtool -d eth0) in the working
> state (before a suspend) and in the broken state (after a resume).
> Maybe there will be some obvious difference.
> 
> The same goes for the PCI configuration (lspci -d :8168 -vv).
> 
Maciej suggested comparing the output from lspci -vv for the ethernet device pre and post suspend. The two outputs are identical.

Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre and post suspend. Again, they are identical.
Heiner specifically suggested looking at the RxConfig. The value of that is 0x0002870e both pre and post suspend.
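
For a single register such as RxConfig, something along these lines could pull the value out of a saved dump. It is a sketch that assumes the "0xNN: Name   value" layout that ethtool -d prints for this chip; reg_value is a hypothetical helper name.

```shell
# Print the last column of every `ethtool -d` line on stdin that contains
# the given register name. For registers shown as two words (for example the
# ring addresses) this prints only the second word.
reg_value() {
    awk -v name="$1" 'index($0, name) { print $NF }'
}

# Intended use (not executed here):
#   pre=$(reg_value 'Rx Configuration' < r8169-pre-suspend)
#   post=$(reg_value 'Rx Configuration' < r8169-post-suspend)
#   [ "$pre" = "$post" ] && echo "RxConfig unchanged: $pre"
```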

I've attached the files I redirected the outputs to.

Please don't hesitate to ask for any other information needed to solve this problem. In the meantime, I've now got
scripts that stop the network during suspend and restart it during resume. (Those scripts were removed whilst I gathered
the diagnostics shown in the attachments.)
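
For anyone needing the same workaround, a hook of roughly this shape would do it. This is only a sketch, not the actual scripts used here: the interface name eth0, the choice to also reload the r8169 module, and the RUN dry-run variable are all assumptions, and how the hook gets called depends on the suspend tooling in use.

```shell
# Stop the network before suspend and bring it back after resume.
# Set RUN=echo to print the commands instead of running them (dry run).
net_hook() {
    case "$1" in
        suspend)
            ${RUN:-} /sbin/ifdown eth0
            ${RUN:-} rmmod r8169
            ;;
        resume)
            ${RUN:-} modprobe r8169
            ${RUN:-} /sbin/ifup eth0
            ;;
        *)
            echo "usage: net_hook suspend|resume" >&2
            return 2
            ;;
    esac
}

# Intended use (as root, not executed here):
#   net_hook suspend    # from the pre-suspend hook
#   net_hook resume     # from the post-resume hook
```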

Chris

>> Chris
> 
> Maciej
> 

[-- Attachment #2: r8169-post-suspend --]
[-- Type: text/plain, Size: 5653 bytes --]

ethtool -d eth0
===============
RealTek RTL8411 registers:
--------------------------------------------------------
0x00: MAC Address                      80:fa:5b:08:d0:3d
0x08: Multicast Address Filter     0x00000000 0x00000080
0x10: Dump Tally Counter Command   0x0c2ec000 0x00000004
0x20: Tx Normal Priority Ring Addr 0x07a0a000 0x00000004
0x28: Tx High Priority Ring Addr   0x00000000 0x00000000
0x30: Flash memory read/write                 0x00000000
0x34: Early Rx Byte Count                              0
0x36: Early Rx Status                               0x00
0x37: Command                                       0x0c
      Rx on, Tx on
0x3C: Interrupt Mask                              0x803f
      SERR LinkChg RxNoBuf TxErr TxOK RxErr RxOK 
0x3E: Interrupt Status                            0x0000
      
0x40: Tx Configuration                        0x4b800f80
0x44: Rx Configuration                        0x0002870e
0x48: Timer count                             0x00000000
0x4C: Missed packet counter                     0x000000
0x50: EEPROM Command                                0x10
0x51: Config 0                                      0x00
0x52: Config 1                                      0xcf
0x53: Config 2                                      0x3c
0x54: Config 3                                      0x60
0x55: Config 4                                      0x10
0x56: Config 5                                      0x02
0x58: Timer interrupt                         0x00000000
0x5C: Multiple Interrupt Select                   0x0000
0x60: PHY access                              0x80040de1
0x64: TBI control and status                  0x27ffff01
0x68: TBI Autonegotiation advertisement (ANAR)    0xf70c
0x6A: TBI Link partner ability (LPAR)             0x0002
0x6C: PHY status                                    0xeb
0x84: PM wakeup frame 0            0x00000000 0x00000000
0x8C: PM wakeup frame 1            0x00000000 0x00000000
0x94: PM wakeup frame 2 (low)      0x00000000 0x00000000
0x9C: PM wakeup frame 2 (high)     0x00000000 0x00000000
0xA4: PM wakeup frame 3 (low)      0x00000000 0x00000000
0xAC: PM wakeup frame 3 (high)     0x00000000 0x00000000
0xB4: PM wakeup frame 4 (low)      0xffffffff 0xffffffff
0xBC: PM wakeup frame 4 (high)     0x00000000 0x00000000
0xC4: Wakeup frame 0 CRC                          0x0000
0xC6: Wakeup frame 1 CRC                          0x0000
0xC8: Wakeup frame 2 CRC                          0x0000
0xCA: Wakeup frame 3 CRC                          0x0000
0xCC: Wakeup frame 4 CRC                          0x0000
0xDA: RX packet maximum size                      0x4000
0xE0: C+ Command                                  0x20e1
      VLAN de-tagging
      RX checksumming
0xE2: Interrupt Mitigation                        0x5151
      TxTimer:       5
      TxPackets:     1
      RxTimer:       5
      RxPackets:     1
0xE4: Rx Ring Addr                 0x07935000 0x00000004
0xEC: Early Tx threshold                            0x27
0xF0: Func Event                              0x0040003f
0xF4: Func Event Mask                         0x00000000
0xF8: Func Preset State                       0x00031eff
0xFC: Func Force Event                        0x00000000

lspci -d :8168 -vv
==================
pcilib: sysfs_read_vpd: read failed: Input/output error
05:00.2 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0a)
	Subsystem: CLEVO/KAPOK Computer RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 19
	Region 0: I/O ports at e000 [size=256]
	Region 2: Memory at f0004000 (64-bit, prefetchable) [size=4K]
	Region 4: Memory at f0000000 (64-bit, prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [70] Express (v2) Endpoint, MSI 01
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 10.000W
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 4096 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s unlimited, L1 <64us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s (ok), Width x1 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
		Vector table: BAR=4 offset=00000000
		PBA: BAR=4 offset=00000800
	Capabilities: [d0] Vital Product Data
		Not readable
	Kernel driver in use: r8169
	Kernel modules: r8169

[-- Attachment #3: r8169-pre-suspend --]
[-- Type: text/plain, Size: 5653 bytes --]

ethtool -d eth0
===============
RealTek RTL8411 registers:
--------------------------------------------------------
0x00: MAC Address                      80:fa:5b:08:d0:3d
0x08: Multicast Address Filter     0x00000000 0x00000080
0x10: Dump Tally Counter Command   0x0c2ec000 0x00000004
0x20: Tx Normal Priority Ring Addr 0x07a0a000 0x00000004
0x28: Tx High Priority Ring Addr   0x00000000 0x00000000
0x30: Flash memory read/write                 0x00000000
0x34: Early Rx Byte Count                              0
0x36: Early Rx Status                               0x00
0x37: Command                                       0x0c
      Rx on, Tx on
0x3C: Interrupt Mask                              0x803f
      SERR LinkChg RxNoBuf TxErr TxOK RxErr RxOK 
0x3E: Interrupt Status                            0x0000
      
0x40: Tx Configuration                        0x4b800f80
0x44: Rx Configuration                        0x0002870e
0x48: Timer count                             0x00000000
0x4C: Missed packet counter                     0x000000
0x50: EEPROM Command                                0x10
0x51: Config 0                                      0x00
0x52: Config 1                                      0xcf
0x53: Config 2                                      0x3c
0x54: Config 3                                      0x60
0x55: Config 4                                      0x10
0x56: Config 5                                      0x02
0x58: Timer interrupt                         0x00000000
0x5C: Multiple Interrupt Select                   0x0000
0x60: PHY access                              0x80040de1
0x64: TBI control and status                  0x27ffff01
0x68: TBI Autonegotiation advertisement (ANAR)    0xf70c
0x6A: TBI Link partner ability (LPAR)             0x0002
0x6C: PHY status                                    0xeb
0x84: PM wakeup frame 0            0x00000000 0x00000000
0x8C: PM wakeup frame 1            0x00000000 0x00000000
0x94: PM wakeup frame 2 (low)      0x00000000 0x00000000
0x9C: PM wakeup frame 2 (high)     0x00000000 0x00000000
0xA4: PM wakeup frame 3 (low)      0x00000000 0x00000000
0xAC: PM wakeup frame 3 (high)     0x00000000 0x00000000
0xB4: PM wakeup frame 4 (low)      0xffffffff 0xffffffff
0xBC: PM wakeup frame 4 (high)     0x00000000 0x00000000
0xC4: Wakeup frame 0 CRC                          0x0000
0xC6: Wakeup frame 1 CRC                          0x0000
0xC8: Wakeup frame 2 CRC                          0x0000
0xCA: Wakeup frame 3 CRC                          0x0000
0xCC: Wakeup frame 4 CRC                          0x0000
0xDA: RX packet maximum size                      0x4000
0xE0: C+ Command                                  0x20e1
      VLAN de-tagging
      RX checksumming
0xE2: Interrupt Mitigation                        0x5151
      TxTimer:       5
      TxPackets:     1
      RxTimer:       5
      RxPackets:     1
0xE4: Rx Ring Addr                 0x07935000 0x00000004
0xEC: Early Tx threshold                            0x27
0xF0: Func Event                              0x0040003f
0xF4: Func Event Mask                         0x00000000
0xF8: Func Preset State                       0x00031eff
0xFC: Func Force Event                        0x00000000

lspci -d :8168 -vv
==================
pcilib: sysfs_read_vpd: read failed: Input/output error
05:00.2 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0a)
	Subsystem: CLEVO/KAPOK Computer RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 19
	Region 0: I/O ports at e000 [size=256]
	Region 2: Memory at f0004000 (64-bit, prefetchable) [size=4K]
	Region 4: Memory at f0000000 (64-bit, prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [70] Express (v2) Endpoint, MSI 01
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 10.000W
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 4096 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s unlimited, L1 <64us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s (ok), Width x1 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
		Vector table: BAR=4 offset=00000000
		PBA: BAR=4 offset=00000800
	Capabilities: [d0] Vital Product Data
		Not readable
	Kernel driver in use: r8169
	Kernel modules: r8169

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
  2018-10-09 14:40           ` Chris Clayton
@ 2018-10-09 20:36             ` Heiner Kallweit
  2018-10-10  0:24               ` Maciej S. Szmigiero
  2018-10-09 21:39             ` Heiner Kallweit
  1 sibling, 1 reply; 22+ messages in thread
From: Heiner Kallweit @ 2018-10-09 20:36 UTC (permalink / raw)
  To: Chris Clayton, Maciej S. Szmigiero
  Cc: David S. Miller, Azat Khuzhin, Greg Kroah-Hartman,
	Realtek linux nic maintainers, linux-kernel

On 09.10.2018 16:40, Chris Clayton wrote:
> Thanks to Maciej and Heiner for their replies.
> 
> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>> On 07.10.2018 21:36, Chris Clayton wrote:
>>> Hi again,
>>>
>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
>>> regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
>>> browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
>>> in the browser. Pinging one of my ISP's name servers doesn't fail quite so quickly but the reported time increases from
>>> 14-15ms to more than 1000ms.
>>
>> You can try comparing chip registers (ethtool -d eth0) in the working
>> state (before a suspend) and in the broken state (after a resume).
>> Maybe there will be some obvious difference.
>>
>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>
> Maciej suggested comparing the output from lspci -vv for the ethernet device. They are identical.
> 
> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre and post suspend. Again, they are identical.
> Heiner specifically suggested looking at the RxConfig. The value of that is 0x0002870e both pre and post suspend.
> 
Hmm, this is very weird, especially taking into account that in your original
report you state that removing the call to rtl_init_rxcfg() from rtl_hw_start()
fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and
register values seem to be the same before and after resume. So how can the
chip behave differently?
So far my best guess is that some chip quirk causes it to accept writes to
register RxConfig, but to misinterpret or ignore the written value.
So far your report is the only one (affecting RTL8411), but we don't know
whether other chip versions are affected too.
One option could be to call rtl_init_rxcfg() for chip versions <= 06 only
because for them we know that they need this call.


> I've attached files I redirected the outputs to.
> 
> Please don't hesitate to ask for any other information needed to solve this problem. In the meantime, I've now got
> scripts that stop the network during suspend and restart it during resume. (Those scripts were removed whilst I gathered
> the diagnostics shown in the attachments.)
> 
> Chris
> 
>>> Chris
>>
>> Maciej
>>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
  2018-10-09 20:36             ` Heiner Kallweit
@ 2018-10-10  0:24               ` Maciej S. Szmigiero
  2018-10-10  8:09                 ` Chris Clayton
                                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Maciej S. Szmigiero @ 2018-10-10  0:24 UTC (permalink / raw)
  To: Chris Clayton
  Cc: Heiner Kallweit, David S. Miller, Azat Khuzhin,
	Greg Kroah-Hartman, Realtek linux nic maintainers, linux-kernel

On 09.10.2018 22:36, Heiner Kallweit wrote:
> On 09.10.2018 16:40, Chris Clayton wrote:
>> Thanks to Maciej and Heiner for their replies.
>>
>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>> Hi again,
>>>>
>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
>>>> regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
>>>> browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
>>>> in the browser. Pinging one of my ISP's name servers doesn't fail quite so quickly but the reported time increases from
>>>> 14-15ms to more than 1000ms.
>>>
>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>> state (before a suspend) and in the broken state (after a resume).
>>> Maybe there will be some obvious difference.
>>>
>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>
>> Maciej suggested comparing the output from lspci -vv for the ethernet device. They are identical.
>>
>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre and post suspend. Again, they are identical.
>> Heiner specifically suggested looking at the RxConfig. The value of that is 0x0002870e both pre and post suspend.
>>
> Hmm, this is very weird, especially taking into account that in your original
> report you state that removing the call to rtl_init_rxcfg() from rtl_hw_start()
> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and
> register values seem to be the same before and after resume. So how can the
> chip behave differently?
> So far my best guess is that some chip quirk causes it to accept writes to
> register RxConfig, but to misinterpret or ignore the written value.
> So far your report is the only one (affecting RTL8411), but we don't know
> whether other chip versions are affected too.

Also, it is interesting that even if one removes a call to
rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
written to moments later by rtl_set_rx_mode().

The only chip accesses in the meantime seem to be a write to TxConfig by
rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
to MAR0 earlier in rtl_set_rx_mode().

My proposals are:
1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
in rtl_hw_start().
Maybe the chip sometimes does not like RxConfig being written before
TxConfig.

2) Check the original value of RxConfig (after a resume) before
rtl_init_rxcfg() overwrites it (compile tested only):
--- r8169.c.ori
+++ r8169.c
@@ -5155,6 +5155,9 @@
 	/* Initially a 10 us delay. Turned it into a PCI commit. - FR */
 	RTL_R8(tp, IntrMask);
 	RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
+
+	pr_notice("RxConfig before init was %.8x\n",
+		(unsigned int)RTL_R32(tp, RxConfig));
 	rtl_init_rxcfg(tp);
 	rtl_set_tx_config_registers(tp);
 

This should be the value that you got when you removed the call to
rtl_init_rxcfg() for testing.
Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
writes (under the "default:" label for your NIC model).

Hope this helps,
Maciej

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
  2018-10-10  0:24               ` Maciej S. Szmigiero
@ 2018-10-10  8:09                 ` Chris Clayton
  2018-10-10  8:51                   ` Chris Clayton
  2018-10-10 22:30                 ` Chris Clayton
  2018-10-10 22:49                 ` Chris Clayton
  2 siblings, 1 reply; 22+ messages in thread
From: Chris Clayton @ 2018-10-10  8:09 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Heiner Kallweit, David S. Miller, Azat Khuzhin,
	Greg Kroah-Hartman, Realtek linux nic maintainers, linux-kernel



On 10/10/2018 01:24, Maciej S. Szmigiero wrote:
> On 09.10.2018 22:36, Heiner Kallweit wrote:
>> On 09.10.2018 16:40, Chris Clayton wrote:
>>> Thanks to Maciej and Heiner for their replies.
>>>
>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>>> Hi again,
>>>>>
>>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
>>>>> regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
>>>>> browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
>>>>> in the browser. Pinging one of my ISP's name servers doesn't fail quite so quickly but the reported time increases from
>>>>> 14-15ms to more than 1000ms.
>>>>
>>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>>> state (before a suspend) and in the broken state (after a resume).
>>>> Maybe there will be some obvious difference.
>>>>
>>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>>
>>> Maciej suggested comparing the output from lspci -vv for the ethernet device. They are identical.
>>>
>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre and post suspend. Again, they are identical.
>>> Heiner specifically suggested looking at the RxConfig. The value of that is 0x0002870e both pre and post suspend.
>>>
>> Hmm, this is very weird, especially taking into account that in your original
>> report you state that removing the call to rtl_init_rxcfg() from rtl_hw_start()
>> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and
>> register values seem to be the same before and after resume. So how can the
>> chip behave differently?
>> So far my best guess is that some chip quirk causes it to accept writes to
>> register RxConfig, but to misinterpret or ignore the written value.
>> So far your report is the only one (affecting RTL8411), but we don't know
>> whether other chip versions are affected too.
> 
> Also, it is interesting that even if one removes a call to
> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
> written to moments later by rtl_set_rx_mode().
> 
> The only chip accesses in the meantime seem to be a write to TxConfig by
> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
> to MAR0 earlier in rtl_set_rx_mode().
> 
> My proposals are:
> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
> in rtl_hw_start().
> Maybe the chip sometimes does not like RxConfig being written before
> TxConfig.
> 
After testing your first proposal, which made no difference, I found the following in the output from dmesg:

[  761.999468] ------------[ cut here ]------------
[  761.999471] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
[  761.999483] WARNING: CPU: 0 PID: 8938 at net/sched/sch_generic.c:461 dev_watchdog+0x1e9/0x1f0
[  761.999484] Modules linked in: btusb btintel r8169 rfcomm bnep iptable_filter xt_conntrack iptable_nat ipt_MASQUERADE
nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv4 uvcvideo videobuf2_vmalloc videobuf2_memops snd_hda_codec_via
videobuf2_v4l2 snd_hda_codec_hdmi snd_hda_codec_generic videobuf2_common usbhid realtek coretemp snd_hda_intel hwmon
snd_hda_codec x86_pkg_temp_thermal snd_hwdep libphy snd_hda_core [last unloaded: btintel]
[  761.999503] CPU: 0 PID: 8938 Comm: kworker/0:0 Not tainted 4.19.0-rc7 #328
[  761.999504] Hardware name: Notebook                         W65_67SZ                        /W65_67SZ
       , BIOS 1.03.05 02/26/2014
[  761.999508] Workqueue: events rtl_task [r8169]
[  761.999510] RIP: 0010:dev_watchdog+0x1e9/0x1f0
[  761.999512] Code: 00 48 63 4d e8 eb 99 4c 89 ef c6 05 b6 13 a6 00 01 e8 1b c7 fd ff 89 d9 4c 89 ee 48 c7 c7 40 53 e1
81 48 89 c2 e8 ae f4 a3 ff <0f> 0b eb c0 0f 1f 00 48 c7 47 08 00 00 00 00 48 c7 07 00 00 00 00
[  761.999513] RSP: 0018:ffff88040f803e98 EFLAGS: 00010282
[  761.999514] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
[  761.999516] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff88040f8153d0
[  761.999517] RBP: ffff88040ca9a3b8 R08: ffffffff813565f0 R09: 000000000000034e
[  761.999517] R10: 0000000000000007 R11: 0000000000000000 R12: ffff88040ca9a39c
[  761.999518] R13: ffff88040ca9a000 R14: 0000000000000001 R15: ffff8803ea17cc80
[  761.999520] FS:  0000000000000000(0000) GS:ffff88040f800000(0000) knlGS:0000000000000000
[  761.999521] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  761.999522] CR2: 00007f67280206b8 CR3: 000000000200a002 CR4: 00000000001606f0
[  761.999523] Call Trace:
[  761.999525]  <IRQ>
[  761.999527]  ? qdisc_reset+0xe0/0xe0
[  761.999529]  ? qdisc_reset+0xe0/0xe0
[  761.999532]  call_timer_fn+0x11/0x70
[  761.999534]  expire_timers+0x8e/0xa0
[  761.999535]  run_timer_softirq+0x7e/0x150
[  761.999538]  ? __hrtimer_run_queues+0x12b/0x1a0
[  761.999541]  ? recalibrate_cpu_khz+0x10/0x10
[  761.999543]  ? ktime_get+0x32/0x90
[  761.999546]  ? lapic_next_event+0x20/0x20
[  761.999549]  __do_softirq+0xcc/0x1fc
[  761.999552]  irq_exit+0x82/0xb0
[  761.999554]  smp_apic_timer_interrupt+0x61/0x90
[  761.999556]  apic_timer_interrupt+0xf/0x20
[  761.999557]  </IRQ>
[  761.999560] RIP: 0010:rtl_slow_event_work+0x2a/0x1f0 [r8169]
[  761.999562] Code: 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 10 4c 8b 67 10 65 48 8b 04 25 28 00 00 00 48 89 44 24 08
31 c0 48 8b 07 66 8b 68 3e <66> 23 af da 0d 00 00 48 8b 07 66 89 68 3e 40 f6 c5 40 0f 85 3b 01
[  761.999563] RSP: 0018:ffffc900014d7e40 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[  761.999564] RAX: ffffc900000b9000 RBX: ffff88040ca9a7c0 RCX: ffff88040f81f160
[  761.999565] RDX: ffff8803ea21b300 RSI: 0000000000000000 RDI: ffff88040ca9a7c0
[  761.999566] RBP: ffff88040ca90050 R08: 0000000000000000 R09: 000073746e657665
[  761.999567] R10: 8080808080808080 R11: ffff88040f81ea68 R12: ffff88040ca9a000
[  761.999568] R13: ffff88040ca9a000 R14: ffff88040f81f140 R15: 0000000000000000
[  761.999571]  ? __switch_to_asm+0x34/0x70
[  761.999573]  rtl_task+0x4f/0x70 [r8169]
[  761.999576]  process_one_work+0x1bc/0x2f0
[  761.999577]  worker_thread+0x28/0x3c0
[  761.999579]  ? process_one_work+0x2f0/0x2f0
[  761.999581]  kthread+0x109/0x120
[  761.999583]  ? kthread_park+0x80/0x80
[  761.999585]  ret_from_fork+0x35/0x40
[  761.999586] ---[ end trace fd5800440feffc06 ]---

I haven't seen this before, but maybe it's a consequence of swapping the order of the two function calls.

I'll work on the second proposal later today.

Chris
> 2) Check the original value of RxConfig (after a resume) before
> rtl_init_rxcfg() overwrites it (compile tested only):
> --- r8169.c.ori
> +++ r8169.c
> @@ -5155,6 +5155,9 @@
>  	/* Initially a 10 us delay. Turned it into a PCI commit. - FR */
>  	RTL_R8(tp, IntrMask);
>  	RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
> +
> +	pr_notice("RxConfig before init was %.8x\n",
> +		(unsigned int)RTL_R32(tp, RxConfig));
>  	rtl_init_rxcfg(tp);
>  	rtl_set_tx_config_registers(tp);
>  
> 
> This should be the value that you got when you removed the call to
> rtl_init_rxcfg() for testing.
> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
> writes (under the "default:" label for your NIC model).
> 
> Hope this helps,
> Maciej
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
  2018-10-10  8:09                 ` Chris Clayton
@ 2018-10-10  8:51                   ` Chris Clayton
  0 siblings, 0 replies; 22+ messages in thread
From: Chris Clayton @ 2018-10-10  8:51 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Heiner Kallweit, David S. Miller, Azat Khuzhin,
	Greg Kroah-Hartman, Realtek linux nic maintainers, linux-kernel

Sorry, I forgot that editing r8169.c and rebuilding would result in rc7+, so I tested the wrong kernel/module to get the
results I provided below. That, however, may make the results more interesting because they happened with a virgin rc7
kernel/module.

I'll test your proposals properly later.

Chris

On 10/10/2018 09:09, Chris Clayton wrote:
> 
> 
> On 10/10/2018 01:24, Maciej S. Szmigiero wrote:
>> On 09.10.2018 22:36, Heiner Kallweit wrote:
>>> On 09.10.2018 16:40, Chris Clayton wrote:
>>>> Thanks to Maciej and Heiner for their replies.
>>>>
>>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>>>> Hi again,
>>>>>>
>>>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
>>>>>> regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
>>>>>> browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
>>>>>> in the browser. Pinging one of my ISP's name servers doesn't fail quite so quickly but the reported time increases from
>>>>>> 14-15ms to more than 1000ms.
>>>>>
>>>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>>>> state (before a suspend) and in the broken state (after a resume).
>>>>> Maybe there will be some obvious difference.
>>>>>
>>>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>>>
>>>> Maciej suggested comparing the output from lspci -vv for the ethernet device. They are identical.
>>>>
>>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre and post suspend. Again, they are identical.
>>>> Heiner specifically suggested looking at the RxConfig. The value of that is 0x0002870e both pre and post suspend.
>>>>
>>> Hmm, this is very weird, especially taking into account that in your original
>>> report you state that removing the call to rtl_init_rxcfg() from rtl_hw_start()
>>> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and
>>> register values seem to be the same before and after resume. So how can the
>>> chip behave differently?
>>> So far my best guess is that some chip quirk causes it to accept writes to
>>> register RxConfig, but to misinterpret or ignore the written value.
>>> So far your report is the only one (affecting RTL8411), but we don't know
>>> whether other chip versions are affected too.
>>
>> Also, it is interesting that even if one removes a call to
>> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
>> written to moments later by rtl_set_rx_mode().
>>
>> The only chip accesses in the meantime seem to be a write to TxConfig by
>> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
>> to MAR0 earlier in rtl_set_rx_mode().
>>
>> My proposals are:
>> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
>> in rtl_hw_start().
>> Maybe the chip sometimes does not like RxConfig being written before
>> TxConfig.
>>
> After testing your first proposal, which made no difference, I found the following in the output from dmesg:
> 
> [  761.999468] ------------[ cut here ]------------
> [  761.999471] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
> [  761.999483] WARNING: CPU: 0 PID: 8938 at net/sched/sch_generic.c:461 dev_watchdog+0x1e9/0x1f0
> [  761.999484] Modules linked in: btusb btintel r8169 rfcomm bnep iptable_filter xt_conntrack iptable_nat ipt_MASQUERADE
> nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv4 uvcvideo videobuf2_vmalloc videobuf2_memops snd_hda_codec_via
> videobuf2_v4l2 snd_hda_codec_hdmi snd_hda_codec_generic videobuf2_common usbhid realtek coretemp snd_hda_intel hwmon
> snd_hda_codec x86_pkg_temp_thermal snd_hwdep libphy snd_hda_core [last unloaded: btintel]
> [  761.999503] CPU: 0 PID: 8938 Comm: kworker/0:0 Not tainted 4.19.0-rc7 #328
> [  761.999504] Hardware name: Notebook                         W65_67SZ                        /W65_67SZ
>        , BIOS 1.03.05 02/26/2014
> [  761.999508] Workqueue: events rtl_task [r8169]
> [  761.999510] RIP: 0010:dev_watchdog+0x1e9/0x1f0
> [  761.999512] Code: 00 48 63 4d e8 eb 99 4c 89 ef c6 05 b6 13 a6 00 01 e8 1b c7 fd ff 89 d9 4c 89 ee 48 c7 c7 40 53 e1
> 81 48 89 c2 e8 ae f4 a3 ff <0f> 0b eb c0 0f 1f 00 48 c7 47 08 00 00 00 00 48 c7 07 00 00 00 00
> [  761.999513] RSP: 0018:ffff88040f803e98 EFLAGS: 00010282
> [  761.999514] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
> [  761.999516] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff88040f8153d0
> [  761.999517] RBP: ffff88040ca9a3b8 R08: ffffffff813565f0 R09: 000000000000034e
> [  761.999517] R10: 0000000000000007 R11: 0000000000000000 R12: ffff88040ca9a39c
> [  761.999518] R13: ffff88040ca9a000 R14: 0000000000000001 R15: ffff8803ea17cc80
> [  761.999520] FS:  0000000000000000(0000) GS:ffff88040f800000(0000) knlGS:0000000000000000
> [  761.999521] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  761.999522] CR2: 00007f67280206b8 CR3: 000000000200a002 CR4: 00000000001606f0
> [  761.999523] Call Trace:
> [  761.999525]  <IRQ>
> [  761.999527]  ? qdisc_reset+0xe0/0xe0
> [  761.999529]  ? qdisc_reset+0xe0/0xe0
> [  761.999532]  call_timer_fn+0x11/0x70
> [  761.999534]  expire_timers+0x8e/0xa0
> [  761.999535]  run_timer_softirq+0x7e/0x150
> [  761.999538]  ? __hrtimer_run_queues+0x12b/0x1a0
> [  761.999541]  ? recalibrate_cpu_khz+0x10/0x10
> [  761.999543]  ? ktime_get+0x32/0x90
> [  761.999546]  ? lapic_next_event+0x20/0x20
> [  761.999549]  __do_softirq+0xcc/0x1fc
> [  761.999552]  irq_exit+0x82/0xb0
> [  761.999554]  smp_apic_timer_interrupt+0x61/0x90
> [  761.999556]  apic_timer_interrupt+0xf/0x20
> [  761.999557]  </IRQ>
> [  761.999560] RIP: 0010:rtl_slow_event_work+0x2a/0x1f0 [r8169]
> [  761.999562] Code: 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 10 4c 8b 67 10 65 48 8b 04 25 28 00 00 00 48 89 44 24 08
> 31 c0 48 8b 07 66 8b 68 3e <66> 23 af da 0d 00 00 48 8b 07 66 89 68 3e 40 f6 c5 40 0f 85 3b 01
> [  761.999563] RSP: 0018:ffffc900014d7e40 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
> [  761.999564] RAX: ffffc900000b9000 RBX: ffff88040ca9a7c0 RCX: ffff88040f81f160
> [  761.999565] RDX: ffff8803ea21b300 RSI: 0000000000000000 RDI: ffff88040ca9a7c0
> [  761.999566] RBP: ffff88040ca90050 R08: 0000000000000000 R09: 000073746e657665
> [  761.999567] R10: 8080808080808080 R11: ffff88040f81ea68 R12: ffff88040ca9a000
> [  761.999568] R13: ffff88040ca9a000 R14: ffff88040f81f140 R15: 0000000000000000
> [  761.999571]  ? __switch_to_asm+0x34/0x70
> [  761.999573]  rtl_task+0x4f/0x70 [r8169]
> [  761.999576]  process_one_work+0x1bc/0x2f0
> [  761.999577]  worker_thread+0x28/0x3c0
> [  761.999579]  ? process_one_work+0x2f0/0x2f0
> [  761.999581]  kthread+0x109/0x120
> [  761.999583]  ? kthread_park+0x80/0x80
> [  761.999585]  ret_from_fork+0x35/0x40
> [  761.999586] ---[ end trace fd5800440feffc06 ]---
> 
> I haven't seen this before, but maybe it's a consequence of swapping the order of the two function calls.
> 
> I'll work on the second proposal later today.
> 
> Chris
>> 2) Check the original value of RxConfig (after a resume) before
>> rtl_init_rxcfg() overwrites it (compile tested only):
>> --- r8169.c.ori
>> +++ r8169.c
>> @@ -5155,6 +5155,9 @@
>>  	/* Initially a 10 us delay. Turned it into a PCI commit. - FR */
>>  	RTL_R8(tp, IntrMask);
>>  	RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
>> +
>> +	pr_notice("RxConfig before init was %.8x\n",
>> +		(unsigned int)RTL_R32(tp, RxConfig));
>>  	rtl_init_rxcfg(tp);
>>  	rtl_set_tx_config_registers(tp);
>>  
>>
>> This should be the value that you got when you removed the call to
>> rtl_init_rxcfg() for testing.
>> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
>> writes (under the "default:" label for your NIC model).
>>
>> Hope this helps,
>> Maciej
>>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
  2018-10-10  0:24               ` Maciej S. Szmigiero
  2018-10-10  8:09                 ` Chris Clayton
@ 2018-10-10 22:30                 ` Chris Clayton
  2018-10-10 22:32                   ` Chris Clayton
  2018-10-10 22:49                 ` Chris Clayton
  2 siblings, 1 reply; 22+ messages in thread
From: Chris Clayton @ 2018-10-10 22:30 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Heiner Kallweit, David S. Miller, Azat Khuzhin,
	Greg Kroah-Hartman, Realtek linux nic maintainers, linux-kernel

OK, right kernel/module used this time. Please see findings below.

On 10/10/2018 01:24, Maciej S. Szmigiero wrote:
> On 09.10.2018 22:36, Heiner Kallweit wrote:
>> On 09.10.2018 16:40, Chris Clayton wrote:
>>> Thanks to Maciej and Heiner for their replies.
>>>
>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>>> Hi again,
>>>>>
>>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
>>>>> regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
>>>>> browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
>>>>> in the browser. Pinging one of my ISP's name servers doesn't fail quite so quickly but the reported time increases from
>>>>> 14-15ms to more than 1000ms.
>>>>
>>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>>> state (before a suspend) and in the broken state (after a resume).
>>>> Maybe there will be some obvious difference.
>>>>
>>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>>
>>> Maciej suggested comparing the output from lspci -vv for the ethernet device. They are identical.
>>>
>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre and post suspend. Again, they are identical.
>>> Heiner specifically suggested looking at the RxConfig. The value of that is 0x0002870e both pre and post suspend.
>>>
>> Hmm, this is very weird, especially taking into account that in your original
>> report you state that removing the call to rtl_init_rxcfg() from rtl_hw_start()
>> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and
>> register values seem to be the same before and after resume. So how can the
>> chip behave differently?
>> So far my best guess is that some chip quirk causes it to accept writes to
>> register RxConfig, but to misinterpret or ignore the written value.
>> So far your report is the only one (affecting RTL8411), but we don't know
>> whether other chip versions are affected too.
> 
> Also, it is interesting that even if one removes a call to
> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
> written to moments later by rtl_set_rx_mode().
> 
> The only chip accesses in the meantime seem to be a write to TxConfig by
> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
> to MAR0 earlier in rtl_set_rx_mode().
> 
> My proposals are:
> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
> in rtl_hw_start().
> Maybe the chip sometimes does not like RxConfig being written before
> TxConfig.
> 

This change made no difference. Networking still dies if I open a browser or leave ping running long enough.

> 2) Check the original value of RxConfig (after a resume) before
> rtl_init_rxcfg() overwrites it (compile tested only):
> --- r8169.c.ori
> +++ r8169.c
> @@ -5155,6 +5155,9 @@
>  	/* Initially a 10 us delay. Turned it into a PCI commit. - FR */
>  	RTL_R8(tp, IntrMask);
>  	RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
> +
> +	pr_notice("RxConfig before init was %.8x\n",
> +		(unsigned int)RTL_R32(tp, RxConfig));
>  	rtl_init_rxcfg(tp);
>  	rtl_set_tx_config_registers(tp);
>  
> 
> This should be the value that you got when you removed the call to
> rtl_init_rxcfg() for testing.
> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
> writes (under the "default:" label for your NIC model).

This might be more interesting. Through a combination of viewing the output from pr_notice() and the output from "ethtool
-d", I can see RxConfig with the following values:

	During boot:	0x00028700
	Before suspend:	0x0002870e
	During resume:	0x00024000
	Post resume:	0x0002870e

I then removed the call to rtl_init_rxcfg() from rtl_hw_start() and rebuilt, installed and rebooted. Now I see the
following values:

	During boot:	0x00028700
	Before suspend:	0x0002870e
	During resume:	0x00024000
	Post resume:	0x0002870e

> 
> Hope this helps,
> Maciej
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
  2018-10-10 22:30                 ` Chris Clayton
@ 2018-10-10 22:32                   ` Chris Clayton
  0 siblings, 0 replies; 22+ messages in thread
From: Chris Clayton @ 2018-10-10 22:32 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Heiner Kallweit, David S. Miller, Azat Khuzhin,
	Greg Kroah-Hartman, Realtek linux nic maintainers, linux-kernel

Too late at night to be doing this stuff. Clicked send instead of saving a draft. Sorry, please ignore.

On 10/10/2018 23:30, Chris Clayton wrote:
> OK, right kernel/module used this time. Please see findings below.
> 
> On 10/10/2018 01:24, Maciej S. Szmigiero wrote:
>> On 09.10.2018 22:36, Heiner Kallweit wrote:
>>> On 09.10.2018 16:40, Chris Clayton wrote:
>>>> Thanks to Maciej and Heiner for their replies.
>>>>
>>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>>>> Hi again,
>>>>>>
>>>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
>>>>>> regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
>>>>>> browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
>>>>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite so quickly but the reported time increases from
>>>>>> 14-15ms to more than 1000ms.
>>>>>
>>>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>>>> state (before a suspend) and in the broken state (after a resume).
>>>>> Maybe there will be some obvious in the difference.
>>>>>
>>>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>>>
>>>> Maciej suggested comparing the output from lspci -vv for the ethernet device. They are identical.
>>>>
>>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre and post suspend. Again, they are identical.
>>>> Heiner specifically suggested looking at the RxConfig. The value of that is 0x0002870e both pre and post suspend.
>>>>
>>> Hmm, this is very weird, especially taking into account that in your original
>>> report you state that removing the call to rtl_init_rxcfg() from rtl_hw_start()
>>> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and
>>> register values seem to be the same before and after resume. So how can the
>>> chip behave differently?
>>> So far my best guess is that some chip quirk causes it to accept writes to
>>> register RxConfig, but to misinterpret or ignore the written value.
>>> So far your report is the only one (affecting RTL8411), but we don't know
>>> whether other chip versions are affected too.
>>
>> Also, it is interesting that even if one removes a call to
>> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
>> written to moments later by rtl_set_rx_mode().
>>
>> The only chip accesses in the meantime seems to be a write to TxConfig by
>> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
>> to MAR0 earlier in rtl_set_rx_mode().
>>
>> My proposals are:
>> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
>> in rtl_hw_start().
>> Maybe the chip does not like sometimes that RxConfig is written before
>> TxConfig.
>>
> 
> This change made no difference. Networking still dies if I open a browser or leave ping running long enough.
> 
>> 2) Check the original value of RxConfig (after a resume) before
>> rtl_init_rxcfg() overwrites it (compile tested only):
>> --- r8169.c.ori
>> +++ r8169.c
>> @@ -5155,6 +5155,9 @@
>>  	/* Initially a 10 us delay. Turned it into a PCI commit. - FR */
>>  	RTL_R8(tp, IntrMask);
>>  	RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
>> +
>> +	pr_notice("RxConfig before init was %.8x\n",
>> +		(unsigned int)RTL_R32(tp, RxConfig));
>>  	rtl_init_rxcfg(tp);
>>  	rtl_set_tx_config_registers(tp);
>>  
>>
>> This should be the value that you got when you removed the call to
>> rtl_init_rxcfg() for testing.
>> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
>> writes (under the "default:" label for your NIC model).
> 
> This might be more interesting. Through a combination of viewing the output from pr_notice() and the output from "ethtool
> -d", I can see RxConfig with the following values
> 
> 	During boot:	0x00028700
> 	Before suspend:	0x0002870e
> 	During resume:	0x00024000
> 	Post resume:	0x0002870e
> 
> I then removed the call to rtl_init_rxcfg() from rtl_hw_start() and rebuilt, installed and rebooted. Now I see the
> following values:
> 
> 	During boot:	0x00028700
> 	Before suspend:	0x0002870e
> 	During resume:	0x00024000
> 	Post resume:	0x0002870e
> 
>>
>> Hope this helps,
>> Maciej
>>


* Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
  2018-10-10  0:24               ` Maciej S. Szmigiero
  2018-10-10  8:09                 ` Chris Clayton
  2018-10-10 22:30                 ` Chris Clayton
@ 2018-10-10 22:49                 ` Chris Clayton
  2018-10-11  0:12                   ` Maciej S. Szmigiero
  2 siblings, 1 reply; 22+ messages in thread
From: Chris Clayton @ 2018-10-10 22:49 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Heiner Kallweit, David S. Miller, Azat Khuzhin,
	Greg Kroah-Hartman, Realtek linux nic maintainers, linux-kernel

OK, right kernel/module used this time. Please see findings below.

On 10/10/2018 01:24, Maciej S. Szmigiero wrote:
> On 09.10.2018 22:36, Heiner Kallweit wrote:
>> On 09.10.2018 16:40, Chris Clayton wrote:
>>> Thanks to Maciej and Heiner for their replies.
>>>
>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>>> Hi again,
>>>>>
>>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
>>>>> regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
>>>>> browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
>>>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite so quickly but the reported time increases from
>>>>> 14-15ms to more than 1000ms.
>>>>
>>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>>> state (before a suspend) and in the broken state (after a resume).
>>>> Maybe there will be some obvious in the difference.
>>>>
>>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>>
>>> Maciej suggested comparing the output from lspci -vv for the ethernet device. They are identical.
>>>
>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre and post suspend. Again, they are identical.
>>> Heiner specifically suggested looking at the RxConfig. The value of that is 0x0002870e both pre and post suspend.
>>>
>> Hmm, this is very weird, especially taking into account that in your original
>> report you state that removing the call to rtl_init_rxcfg() from rtl_hw_start()
>> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and
>> register values seem to be the same before and after resume. So how can the
>> chip behave differently?
>> So far my best guess is that some chip quirk causes it to accept writes to
>> register RxConfig, but to misinterpret or ignore the written value.
>> So far your report is the only one (affecting RTL8411), but we don't know
>> whether other chip versions are affected too.
> 
> Also, it is interesting that even if one removes a call to
> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
> written to moments later by rtl_set_rx_mode().
> 
> The only chip accesses in the meantime seems to be a write to TxConfig by
> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
> to MAR0 earlier in rtl_set_rx_mode().
> 
> My proposals are:
> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
> in rtl_hw_start().
> Maybe the chip does not like sometimes that RxConfig is written before
> TxConfig.
> 

This change made no difference. Networking still dies if I open a browser or leave ping running long enough.

> 2) Check the original value of RxConfig (after a resume) before
> rtl_init_rxcfg() overwrites it (compile tested only):
> --- r8169.c.ori
> +++ r8169.c
> @@ -5155,6 +5155,9 @@
>  	/* Initially a 10 us delay. Turned it into a PCI commit. - FR */
>  	RTL_R8(tp, IntrMask);
>  	RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
> +
> +	pr_notice("RxConfig before init was %.8x\n",
> +		(unsigned int)RTL_R32(tp, RxConfig));
>  	rtl_init_rxcfg(tp);
>  	rtl_set_tx_config_registers(tp);
>  
> 
> This should be the value that you got when you removed the call to
> rtl_init_rxcfg() for testing.
> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
> writes (under the "default:" label for your NIC model).
> 

This might be more interesting. Through a combination of viewing the output from pr_notice() and the output from
"ethtool -d", I can see RxConfig with the following values

	During boot:	0x00028700
	Before suspend:	0x0002870e
	During resume:	0x00024000
	Post resume:	0x0002870e

As I did with 4.18.10 early on in the process, I removed the call to rtl_init_rxcfg() from rtl_hw_start() and rebuilt,
installed and rebooted. Now I see the following values:

	During boot:	0x00028700
	Before suspend:	0x0002870e
	During resume:	0x00024000
	Post resume:	0x0002400e

As with 4.18.10, networking now appears to be stable after the resume. Starting a browser results in my homepage being
displayed and I've spent a few minutes surfing with no interruptions. Similarly, ping runs without stopping. I simply
don't know enough to know what might now be enabled or disabled by this change in value, but hopefully it will provide a
clue to someone as to what is going on.
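
For anyone following along, the failing and the working "Post resume" values can be compared mechanically. Below is a
minimal sketch in plain userspace C (not driver code; reg_diff() and list_set_bits() are names invented purely for
illustration) that XORs two register snapshots and lists the bit positions that differ:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Bits that differ between two snapshots of a 32-bit register. */
static uint32_t reg_diff(uint32_t a, uint32_t b)
{
	return a ^ b;
}

/*
 * Write the positions of the set bits in 'mask' into 'buf' as a
 * space-separated list, highest bit first.
 * Returns the number of bit positions written.
 */
static int list_set_bits(uint32_t mask, char *buf, size_t len)
{
	int count = 0;
	size_t used = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	for (int bit = 31; bit >= 0; bit--) {
		int n;

		if (!(mask & (UINT32_C(1) << bit)))
			continue;
		n = snprintf(buf + used, len - used, "%s%d",
			     count ? " " : "", bit);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* buffer full: stop appending */
		used += (size_t)n;
		count++;
	}
	return count;
}
```

reg_diff(0x0002870e, 0x0002400e) gives 0x0000c700, so the two values differ in bits 15, 14, 10, 9 and 8 only. What
those bits mean is a separate question that this sketch does not answer.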

Chris

> Hope this helps,
> Maciej
> 


* Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
  2018-10-10 22:49                 ` Chris Clayton
@ 2018-10-11  0:12                   ` Maciej S. Szmigiero
  2018-10-11  8:24                     ` Chris Clayton
  0 siblings, 1 reply; 22+ messages in thread
From: Maciej S. Szmigiero @ 2018-10-11  0:12 UTC (permalink / raw)
  To: Chris Clayton
  Cc: Heiner Kallweit, David S. Miller, Azat Khuzhin,
	Greg Kroah-Hartman, Realtek linux nic maintainers, linux-kernel

On 11.10.2018 00:49, Chris Clayton wrote:
>> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
>> writes (under the "default:" label for your NIC model).
>>
> 
> This might be more interesting. Through a combination of viewing the output from pr_notice() and the output from
> "ethtool -d", I can see RxConfig with the following values
> 
> 	During boot:	0x00028700
> 	Before suspend:	0x0002870e
> 	During resume:	0x00024000
> 	Post resume:	0x0002870e
> 
> As I did with 4.18.10 early on in the process, I removed the call to rtl_init_rxcfg() from rtl_hw_start() and rebuilt,
> installed and rebooted. Now I see the following values:
> 
> 	During boot:	0x00028700
> 	Before suspend:	0x0002870e
> 	During resume:	0x00024000
> 	Post resume:	0x0002400e
> 

Now we can finally see some difference...
Besides the missing RX128_INT_EN (bit 15 or 0x8000) and RX_DMA_BURST
(bits 8-10 or 0x700), which rtl_init_rxcfg() would normally set (so this
is kind of expected), one can see that the working configuration
post-resume also has bit 14 (or 0x4000) set.

This bit is described in the driver as RX_MULTI_EN ("8111c only") and is
set by rtl_init_rxcfg() for example for RTL_GIGA_MAC_VER_35.

RTL_GIGA_MAC_VER_35 is described in the driver as being in the same
family as your RTL_GIGA_MAC_VER_38, so can you please try the following
change:
--- r8169.c
+++ r8169.c
@@ -4271,6 +4271,7 @@ static void rtl_init_rxcfg(struct rtl816
 	case RTL_GIGA_MAC_VER_18 ... RTL_GIGA_MAC_VER_24:
 	case RTL_GIGA_MAC_VER_34:
 	case RTL_GIGA_MAC_VER_35:
+	case RTL_GIGA_MAC_VER_38:
 		RTL_W32(tp, RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST);
 		break;
 	case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:

This will also set RX_MULTI_EN for your chip model (naturally, you need to
add back the call to rtl_init_rxcfg() in rtl_hw_start()).

If this does not help then I would try other values in the above write:
1) RTL_W32(tp, RxConfig, 0x00024000);
2) RTL_W32(tp, RxConfig, 0x00004000);
3) RTL_W32(tp, RxConfig, RX_DMA_BURST);
4) RTL_W32(tp, RxConfig, RX128_INT_EN);
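
To make it easier to read the RxConfig values when trying these, here is a
small userspace sketch (not driver code) that splits a value into the three
fields named above. The masks follow the bit numbers given in this mail;
rxcfg_describe() and RX_KNOWN_MASK are hypothetical names used only for
illustration, and all remaining bits are reported as "other" without
interpreting them:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Masks as described in this thread (bit 15, bit 14, bits 8-10). */
#define RX128_INT_EN	(UINT32_C(1) << 15)	/* 0x8000 */
#define RX_MULTI_EN	(UINT32_C(1) << 14)	/* 0x4000 */
#define RX_DMA_BURST	(UINT32_C(7) << 8)	/* 0x0700 */

/* Hypothetical: every bit this sketch knows a name for. */
#define RX_KNOWN_MASK	(RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST)

/*
 * Hypothetical helper: render an RxConfig value as text into 'buf'.
 * Returns what snprintf() returns.
 */
static int rxcfg_describe(uint32_t val, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len,
			"RX128_INT_EN=%d RX_MULTI_EN=%d RX_DMA_BURST=%u other=0x%08x",
			!!(val & RX128_INT_EN), !!(val & RX_MULTI_EN),
			(unsigned int)((val & RX_DMA_BURST) >> 8),
			(unsigned int)(val & ~RX_KNOWN_MASK));
}
```

For the failing post-resume value 0x0002870e this yields
"RX128_INT_EN=1 RX_MULTI_EN=0 RX_DMA_BURST=7 other=0x0002000e", and for the
working value 0x0002400e it yields
"RX128_INT_EN=0 RX_MULTI_EN=1 RX_DMA_BURST=0 other=0x0002000e".
The write proposed in the patch above, RX128_INT_EN | RX_MULTI_EN |
RX_DMA_BURST, evaluates to 0xc700.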

> Chris

Maciej


* Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
  2018-10-11  0:12                   ` Maciej S. Szmigiero
@ 2018-10-11  8:24                     ` Chris Clayton
  2018-10-11 12:23                       ` Maciej S. Szmigiero
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Clayton @ 2018-10-11  8:24 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Heiner Kallweit, David S. Miller, Azat Khuzhin,
	Greg Kroah-Hartman, Realtek linux nic maintainers, linux-kernel



On 11/10/2018 01:12, Maciej S. Szmigiero wrote:
> On 11.10.2018 00:49, Chris Clayton wrote:
>>> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
>>> writes (under the "default:" label for your NIC model).
>>>
>>
>> This might be more interesting. Through a combination of viewing the output from pr_notice() and the output from
>> "ethtool -d", I can see RxConfig with the following values
>>
>> 	During boot:	0x00028700
>> 	Before suspend:	0x0002870e
>> 	During resume:	0x00024000
>> 	Post resume:	0x0002870e
>>
>> As I did with 4.18.10 early on in the process, I removed the call to rtl_init_rxcfg() from rtl_hw_start() and rebuilt,
>> installed and rebooted. Now I see the following values:
>>
>> 	During boot:	0x00028700
>> 	Before suspend:	0x0002870e
>> 	During resume:	0x00024000
>> 	Post resume:	0x0002400e
>>
> 
> Now we can finally see some difference...
> Besides missing RX128_INT_EN (bit 15 or 0x8000) and RX_DMA_BURST
> (bits 8-10 or 0x700) - that rtl_init_rxcfg() would normally set so this
> is kind of expected - one can see that the working configuration
> post-resume has bit 14 (or 0x4000) set, too.
> 
> This bit is described in the driver as RX_MULTI_EN ("8111c only") and is
> set by rtl_init_rxcfg() for example for RTL_GIGA_MAC_VER_35.
> 
> RTL_GIGA_MAC_VER_35 is described in the driver as being in the same
> family as your RTL_GIGA_MAC_VER_38, so can you please try the following
> change:
> --- r8169.c
> +++ r8169.c
> @@ -4271,6 +4271,7 @@ static void rtl_init_rxcfg(struct rtl816
>  	case RTL_GIGA_MAC_VER_18 ... RTL_GIGA_MAC_VER_24:
>  	case RTL_GIGA_MAC_VER_34:
>  	case RTL_GIGA_MAC_VER_35:
> +	case RTL_GIGA_MAC_VER_38:
>  		RTL_W32(tp, RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST);
>  		break;
>  	case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:
> 
> This will add RX_MULTI_EN also for your chip model (you need to add back
> the call to rtl_init_rxcfg() to rtl_hw_start(), naturally).
>

That's done the trick. With the above change applied, my network runs fine after a suspend/resume cycle and the
ping times are back in the 14-15ms range.

Chris

> If this does not help then I would try other values in the above write:
> 1) RTL_W32(tp, RxConfig, 0x00024000);
> 2) RTL_W32(tp, RxConfig, 0x00004000);
> 3) RTL_W32(tp, RxConfig, RX_DMA_BURST);
> 4) RTL_W32(tp, RxConfig, RX128_INT_EN);
> 
>> Chris
> 
> Maciej
> 


* Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
  2018-10-11  8:24                     ` Chris Clayton
@ 2018-10-11 12:23                       ` Maciej S. Szmigiero
  2018-10-11 13:34                         ` Chris Clayton
  0 siblings, 1 reply; 22+ messages in thread
From: Maciej S. Szmigiero @ 2018-10-11 12:23 UTC (permalink / raw)
  To: Chris Clayton
  Cc: Heiner Kallweit, David S. Miller, Azat Khuzhin,
	Greg Kroah-Hartman, Realtek linux nic maintainers, linux-kernel

On 11.10.2018 10:24, Chris Clayton wrote:
> On 11/10/2018 01:12, Maciej S. Szmigiero wrote:
>> On 11.10.2018 00:49, Chris Clayton wrote:
>>>> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
>>>> writes (under the "default:" label for your NIC model).
>>>>
>>>
>>> This might be more interesting. Through a combination of viewing the output from pr_notice() and the output from
>>> "ethtool -d", I can see RxConfig with the following values
>>>
>>> 	During boot:	0x00028700
>>> 	Before suspend:	0x0002870e
>>> 	During resume:	0x00024000
>>> 	Post resume:	0x0002870e
>>>
>>> As I did with 4.18.10 early on in the process, I removed the call to rtl_init_rxcfg() from rtl_hw_start() and rebuilt,
>>> installed and rebooted. Now I see the following values:
>>>
>>> 	During boot:	0x00028700
>>> 	Before suspend:	0x0002870e
>>> 	During resume:	0x00024000
>>> 	Post resume:	0x0002400e
>>>
>>
>> Now we can finally see some difference...
>> Besides missing RX128_INT_EN (bit 15 or 0x8000) and RX_DMA_BURST
>> (bits 8-10 or 0x700) - that rtl_init_rxcfg() would normally set so this
>> is kind of expected - one can see that the working configuration
>> post-resume has bit 14 (or 0x4000) set, too.
>>
>> This bit is described in the driver as RX_MULTI_EN ("8111c only") and is
>> set by rtl_init_rxcfg() for example for RTL_GIGA_MAC_VER_35.
>>
>> RTL_GIGA_MAC_VER_35 is described in the driver as being in the same
>> family as your RTL_GIGA_MAC_VER_38, so can you please try the following
>> change:
>> --- r8169.c
>> +++ r8169.c
>> @@ -4271,6 +4271,7 @@ static void rtl_init_rxcfg(struct rtl816
>>  	case RTL_GIGA_MAC_VER_18 ... RTL_GIGA_MAC_VER_24:
>>  	case RTL_GIGA_MAC_VER_34:
>>  	case RTL_GIGA_MAC_VER_35:
>> +	case RTL_GIGA_MAC_VER_38:
>>  		RTL_W32(tp, RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST);
>>  		break;
>>  	case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:
>>
>> This will add RX_MULTI_EN also for your chip model (you need to add back
>> the call to rtl_init_rxcfg() to rtl_hw_start(), naturally).
>>
> 
> That's done the trick. With the above change applied, my network runs fine after a suspend/resume cycle and the
> ping times are back in the 14-15ms range.

Nice!

I will submit a patch. It would be great if you could test it and then
add a "Tested-by:" tag.
 
> Chris

Maciej


* Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
  2018-10-11 12:23                       ` Maciej S. Szmigiero
@ 2018-10-11 13:34                         ` Chris Clayton
  0 siblings, 0 replies; 22+ messages in thread
From: Chris Clayton @ 2018-10-11 13:34 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Heiner Kallweit, David S. Miller, Azat Khuzhin,
	Greg Kroah-Hartman, Realtek linux nic maintainers, linux-kernel



On 11/10/2018 13:23, Maciej S. Szmigiero wrote:
> On 11.10.2018 10:24, Chris Clayton wrote:
>> On 11/10/2018 01:12, Maciej S. Szmigiero wrote:
>>> On 11.10.2018 00:49, Chris Clayton wrote:
>>>>> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
>>>>> writes (under the "default:" label for your NIC model).
>>>>>
>>>>
>>>> This might be more interesting. Through a combination of viewing the output from pr_notice() and the output from
>>>> "ethtool -d", I can see RxConfig with the following values
>>>>
>>>> 	During boot:	0x00028700
>>>> 	Before suspend:	0x0002870e
>>>> 	During resume:	0x00024000
>>>> 	Post resume:	0x0002870e
>>>>
>>>> As I did with 4.18.10 early on in the process, I removed the call to rtl_init_rxcfg() from rtl_hw_start() and rebuilt,
>>>> installed and rebooted. Now I see the following values:
>>>>
>>>> 	During boot:	0x00028700
>>>> 	Before suspend:	0x0002870e
>>>> 	During resume:	0x00024000
>>>> 	Post resume:	0x0002400e
>>>>
>>>
>>> Now we can finally see some difference...
>>> Besides missing RX128_INT_EN (bit 15 or 0x8000) and RX_DMA_BURST
>>> (bits 8-10 or 0x700) - that rtl_init_rxcfg() would normally set so this
>>> is kind of expected - one can see that the working configuration
>>> post-resume has bit 14 (or 0x4000) set, too.
>>>
>>> This bit is described in the driver as RX_MULTI_EN ("8111c only") and is
>>> set by rtl_init_rxcfg() for example for RTL_GIGA_MAC_VER_35.
>>>
>>> RTL_GIGA_MAC_VER_35 is described in the driver as being in the same
>>> family as your RTL_GIGA_MAC_VER_38, so can you please try the following
>>> change:
>>> --- r8169.c
>>> +++ r8169.c
>>> @@ -4271,6 +4271,7 @@ static void rtl_init_rxcfg(struct rtl816
>>>  	case RTL_GIGA_MAC_VER_18 ... RTL_GIGA_MAC_VER_24:
>>>  	case RTL_GIGA_MAC_VER_34:
>>>  	case RTL_GIGA_MAC_VER_35:
>>> +	case RTL_GIGA_MAC_VER_38:
>>>  		RTL_W32(tp, RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST);
>>>  		break;
>>>  	case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:
>>>
>>> This will add RX_MULTI_EN also for your chip model (you need to add back
>>> the call to rtl_init_rxcfg() to rtl_hw_start(), naturally).
>>>
>>
>> That's done the trick. With the above change applied, my network runs fine after a suspend/resume cycle and the
>> ping times are back in the 14-15ms range.
> 
> Nice!
> 
> I will submit a patch, it would be great if you could test it and then
> add a "Tested-by:" tag.
>  

Will do, Maciej.

Thanks for solving this.
>> Chris
> 
> Maciej
> 


* Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
  2018-10-09 14:40           ` Chris Clayton
  2018-10-09 20:36             ` Heiner Kallweit
@ 2018-10-09 21:39             ` Heiner Kallweit
  2018-10-09 23:32               ` Chris Clayton
  1 sibling, 1 reply; 22+ messages in thread
From: Heiner Kallweit @ 2018-10-09 21:39 UTC (permalink / raw)
  To: Chris Clayton, Maciej S. Szmigiero
  Cc: Azat Khuzhin, Realtek linux nic maintainers, linux-kernel

On 09.10.2018 16:40, Chris Clayton wrote:
> Thanks to Maciej and Heiner for their replies.
> 
> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>> On 07.10.2018 21:36, Chris Clayton wrote:
>>> Hi again,
>>>
>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
>>> regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
>>> browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite so quickly but the reported time increases from
>>> 14-15ms to more than 1000ms.
>>
>> You can try comparing chip registers (ethtool -d eth0) in the working
>> state (before a suspend) and in the broken state (after a resume).
>> Maybe there will be some obvious in the difference.
>>
>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>
> Maciej suggested comparing the output from lspci -vv for the ethernet device. They are identical.
> 
> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre and post suspend. Again, they are identical.
> Heiner specifically suggested looking at the RxConfig. The value of that is 0x0002870e both pre and post suspend.
> 
> I've attached files I redirected the outputs to.
> 
> Please don't hesitate to ask for any other information needed to solve this problem. In the meantime, I've now got
> scripts that stop the network during suspend and restart it during resume. (Those scripts were removed whilst I gathered
> the diagnostics shown in the attachments.)
> 
I'd like to check whether it may be a timing issue. The following experimental patch
adds a PCI commit after writing register ChipCmd. Could you please check whether
it changes anything?

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 7d3f671e1..f3c359492 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -4641,6 +4641,7 @@ static void rtl_hw_start(struct  rtl8169_private *tp)
 	/* Initially a 10 us delay. Turned it into a PCI commit. - FR */
 	RTL_R8(tp, IntrMask);
 	RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
+	RTL_R8(tp, ChipCmd);
 	rtl_init_rxcfg(tp);
 	rtl_set_tx_config_registers(tp);
 
-- 
2.19.1
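
For readers unfamiliar with the term: a "PCI commit" here means reading a register back so that a preceding posted
(buffered) write is forced out to the device before execution continues. The following is only a toy userspace model
of that idea, not driver code; struct toy_dev, toy_write8() and toy_read8() are invented for illustration and do not
exist in the kernel:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a device behind a bus that may post (buffer) writes. */
struct toy_dev {
	uint8_t reg;		/* value the device currently sees */
	uint8_t posted;		/* write still sitting in the bus buffer */
	int has_posted;		/* non-zero while a write is pending */
};

/* A write returns immediately; the device may not have seen it yet. */
static void toy_write8(struct toy_dev *d, uint8_t val)
{
	assert(d != NULL);
	d->posted = val;
	d->has_posted = 1;
}

/*
 * A read cannot complete until earlier posted writes have reached the
 * device, so it flushes the pending write first.
 */
static uint8_t toy_read8(struct toy_dev *d)
{
	assert(d != NULL);
	if (d->has_posted) {
		d->reg = d->posted;
		d->has_posted = 0;
	}
	return d->reg;
}
```

In this model, after toy_write8() the device-side value is unchanged until toy_read8() is called. That is the effect
the added RTL_R8(tp, ChipCmd) in the patch above is meant to have before rtl_init_rxcfg() runs.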



* Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)
  2018-10-09 21:39             ` Heiner Kallweit
@ 2018-10-09 23:32               ` Chris Clayton
  0 siblings, 0 replies; 22+ messages in thread
From: Chris Clayton @ 2018-10-09 23:32 UTC (permalink / raw)
  To: Heiner Kallweit, Maciej S. Szmigiero
  Cc: Azat Khuzhin, Realtek linux nic maintainers, linux-kernel



On 09/10/2018 22:39, Heiner Kallweit wrote:
> On 09.10.2018 16:40, Chris Clayton wrote:
>> Thanks to Maciej and Heiner for their replies.
>>
>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>> Hi again,
>>>>
>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, but tried it anyway. I can confirm that the
>>>> regression is still present and my network still fails when, after a resume from suspend (to ram or disk), I open my
>>>> browser or my mail client. In both those cases the failure is almost immediate - e.g. my home page doesn't get displayed
>>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite so quickly but the reported time increases from
>>>> 14-15ms to more than 1000ms.
>>>
>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>> state (before a suspend) and in the broken state (after a resume).
>>> Maybe there will be some obvious in the difference.
>>>
>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>
>> Maciej suggested comparing the output from lspci -vv for the ethernet device. They are identical.
>>
>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" pre and post suspend. Again, they are identical.
>> Heiner specifically suggested looking at the RxConfig. The value of that is 0x0002870e both pre and post suspend.
>>
>> I've attached files I redirected the outputs to.
>>
>> Please don't hesitate to ask for any other information needed to solve this problem. In the meantime, I've now got
>> scripts that stop the network during suspend and restart it during resume. (Those scripts were removed whilst I gathered
>> the diagnostics shown in the attachments.)
>>
> I'd like to check whether it may be a timing issue. The following experimental patch
> adds a PCI commit after writing register ChipCmd. Could you please check whether
> it changes anything?
> 
> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index 7d3f671e1..f3c359492 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -4641,6 +4641,7 @@ static void rtl_hw_start(struct  rtl8169_private *tp)
>  	/* Initially a 10 us delay. Turned it into a PCI commit. - FR */
>  	RTL_R8(tp, IntrMask);
>  	RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
> +	RTL_R8(tp, ChipCmd);
>  	rtl_init_rxcfg(tp);
>  	rtl_set_tx_config_registers(tp);
>  
> 

Sorry, this patch doesn't make any difference - my network still fails. After a suspend/resume my browsers (chromium
and firefox) both fail to open my home page (https://www.google.co.uk). The ping time for one of my ISP's name servers
increases from 14-15ms to more than 1000ms, although after a few pings it does reduce. As the screen grab below
shows, the network does eventually fail:

$ ping NS1
PING ns1 (90.207.238.97): 56 data bytes
64 bytes from 90.207.238.97: icmp_seq=0 ttl=251 time=1017.289 ms
64 bytes from 90.207.238.97: icmp_seq=1 ttl=251 time=1018.051 ms
64 bytes from 90.207.238.97: icmp_seq=2 ttl=251 time=1015.271 ms
64 bytes from 90.207.238.97: icmp_seq=3 ttl=251 time=1015.495 ms
64 bytes from 90.207.238.97: icmp_seq=6 ttl=251 time=1015.646 ms
64 bytes from 90.207.238.97: icmp_seq=7 ttl=251 time=1022.609 ms
64 bytes from 90.207.238.97: icmp_seq=8 ttl=251 time=1015.612 ms
64 bytes from 90.207.238.97: icmp_seq=10 ttl=251 time=1015.551 ms
64 bytes from 90.207.238.97: icmp_seq=12 ttl=251 time=1015.446 ms
64 bytes from 90.207.238.97: icmp_seq=13 ttl=251 time=1015.657 ms
64 bytes from 90.207.238.97: icmp_seq=14 ttl=251 time=1015.614 ms
64 bytes from 90.207.238.97: icmp_seq=15 ttl=251 time=1015.651 ms
64 bytes from 90.207.238.97: icmp_seq=17 ttl=251 time=1015.459 ms
64 bytes from 90.207.238.97: icmp_seq=18 ttl=251 time=1015.443 ms
64 bytes from 90.207.238.97: icmp_seq=19 ttl=251 time=1015.936 ms
64 bytes from 90.207.238.97: icmp_seq=20 ttl=251 time=1015.681 ms
64 bytes from 90.207.238.97: icmp_seq=22 ttl=251 time=1015.410 ms
64 bytes from 90.207.238.97: icmp_seq=23 ttl=251 time=1015.487 ms
64 bytes from 90.207.238.97: icmp_seq=24 ttl=251 time=1016.169 ms
64 bytes from 90.207.238.97: icmp_seq=25 ttl=251 time=1015.659 ms
64 bytes from 90.207.238.97: icmp_seq=26 ttl=251 time=14.606 ms
64 bytes from 90.207.238.97: icmp_seq=30 ttl=251 time=32.765 ms
64 bytes from 90.207.238.97: icmp_seq=31 ttl=251 time=115.052 ms
64 bytes from 90.207.238.97: icmp_seq=33 ttl=251 time=757.115 ms
64 bytes from 90.207.238.97: icmp_seq=34 ttl=251 time=176.696 ms
64 bytes from 90.207.238.97: icmp_seq=35 ttl=251 time=1017.462 ms
64 bytes from 90.207.238.97: icmp_seq=36 ttl=251 time=16.394 ms
64 bytes from 90.207.238.97: icmp_seq=37 ttl=251 time=20.402 ms
64 bytes from 90.207.238.97: icmp_seq=38 ttl=251 time=37.795 ms
64 bytes from 90.207.238.97: icmp_seq=39 ttl=251 time=141.997 ms
92 bytes from laptop.local.lan (192.168.0.20): Destination Host Unreachable
92 bytes from laptop.local.lan (192.168.0.20): Destination Host Unreachable
...


Chris


end of thread

Thread overview: 22+ messages
2018-09-28 15:54 R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev) Maciej S. Szmigiero
2018-09-28 22:00 ` Chris Clayton
2018-09-28 22:13   ` Heiner Kallweit
2018-09-29  7:25     ` Chris Clayton
2018-09-29  7:38       ` Chris Clayton
2018-10-04  8:41     ` Chris Clayton
2018-10-07 19:36       ` Chris Clayton
2018-10-09 12:32         ` Maciej S. Szmigiero
2018-10-09 14:40           ` Chris Clayton
2018-10-09 20:36             ` Heiner Kallweit
2018-10-10  0:24               ` Maciej S. Szmigiero
2018-10-10  8:09                 ` Chris Clayton
2018-10-10  8:51                   ` Chris Clayton
2018-10-10 22:30                 ` Chris Clayton
2018-10-10 22:32                   ` Chris Clayton
2018-10-10 22:49                 ` Chris Clayton
2018-10-11  0:12                   ` Maciej S. Szmigiero
2018-10-11  8:24                     ` Chris Clayton
2018-10-11 12:23                       ` Maciej S. Szmigiero
2018-10-11 13:34                         ` Chris Clayton
2018-10-09 21:39             ` Heiner Kallweit
2018-10-09 23:32               ` Chris Clayton
