linux-kernel.vger.kernel.org archive mirror
* A weird problem of Realtek r8168 after resume from S3
@ 2018-12-13  2:20 Chris Chiu
  2018-12-14  3:33 ` Chris Chiu
  0 siblings, 1 reply; 17+ messages in thread
From: Chris Chiu @ 2018-12-13  2:20 UTC (permalink / raw)
  To: nic_swsd, davem, netdev, Linux Kernel, Linux Upstreaming Team

Hi,
    We have an Acer laptop whose Ethernet stops working after resuming
from S3. The Ethernet controller is the popular Realtek r8168. lspci shows
the following:
02:00.1 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 12)

    The problem is that the Ethernet interface is unreachable after resume.
Pinging over Ethernet always returns `Destination Host Unreachable`. However,
the interesting part is that when I run tcpdump on the problematic Ethernet
interface, networking comes back to life. It is dead again as soon as
I stop tcpdump.
One more thing: if I ping the problematic machine from another machine, it has
the same effect as running tcpdump. Maybe it's related to the register
settings for the RX path?

    I tried the latest 4.20-rc kernel, but the problem is still there. I also
tried some hw_reset and init calls in the resume path, but they had no effect.
Any suggestions?
Thanks

Chris


* Re: A weird problem of Realtek r8168 after resume from S3
  2018-12-13  2:20 A weird problem of Realtek r8168 after resume from S3 Chris Chiu
@ 2018-12-14  3:33 ` Chris Chiu
  2018-12-14  7:36   ` Heiner Kallweit
  0 siblings, 1 reply; 17+ messages in thread
From: Chris Chiu @ 2018-12-14  3:33 UTC (permalink / raw)
  To: nic_swsd, davem, netdev, Linux Kernel, Linux Upstreaming Team

On Thu, Dec 13, 2018 at 10:20 AM Chris Chiu <chiu@endlessm.com> wrote:
>
> Hi,
>     We got an acer laptop which has a problem with ethernet networking after
> resuming from S3. The ethernet is popular realtek r8168. The lspci shows as
> follows.
> 02:00.1 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 12)
>
>     The problem is the ethernet is not accessible after resume. Pinging via
> ethernet always shows the response `Destination Host Unreachable`. However,
> the interesting part is, when I run tcpdump to monitor the problematic ethernet
> interface, the networking is back to alive. But it's dead again after
> I stop tcpdump.
> One more thing, if I ping the problematic machine from others, it achieves the
> same effect as above tcpdump. Maybe it's about the register setting for RX path?
>
>     I tried the latest 4.20 rc version but the problem still there. I
> also tried some
> hw_reset or init thing in the resume path but no effect. Any
> suggestion for this?
> Thanks
>
> Chris

Gentle ping. Any additional information required?

Chris


* Re: A weird problem of Realtek r8168 after resume from S3
  2018-12-14  3:33 ` Chris Chiu
@ 2018-12-14  7:36   ` Heiner Kallweit
  2018-12-17 13:25     ` Chris Chiu
  0 siblings, 1 reply; 17+ messages in thread
From: Heiner Kallweit @ 2018-12-14  7:36 UTC (permalink / raw)
  To: Chris Chiu, nic_swsd, davem, netdev, Linux Kernel,
	Linux Upstreaming Team

On 14.12.2018 04:33, Chris Chiu wrote:
> On Thu, Dec 13, 2018 at 10:20 AM Chris Chiu <chiu@endlessm.com> wrote:
>>
>> Hi,
>>     We got an acer laptop which has a problem with ethernet networking after
>> resuming from S3. The ethernet is popular realtek r8168. The lspci shows as
>> follows.
>> 02:00.1 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 12)
>>
The output of "dmesg | grep r8169" would be helpful, especially the chip name and XID.

>>     The problem is the ethernet is not accessible after resume. Pinging via
>> ethernet always shows the response `Destination Host Unreachable`. However,
>> the interesting part is, when I run tcpdump to monitor the problematic ethernet
>> interface, the networking is back to alive. But it's dead again after
>> I stop tcpdump.
>> One more thing, if I ping the problematic machine from others, it achieves the
>> same effect as above tcpdump. Maybe it's about the register setting for RX path?
>>
You could compare the register dumps (ethtool -d) before and after S3 sleep
to find out whether there's a difference.
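
A minimal sketch of that comparison, assuming the two dumps were saved as
raw bytes (one way would be "ethtool -d eth0 raw on > before.bin" before
suspend and the same into after.bin after resume) and have already been
read into two equally sized buffers. The function name reg_dump_diff is
made up for this illustration and is not part of ethtool or the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Compare two raw register snapshots of the same length.
 * Every offset whose byte changed is printed to 'out' (pass NULL to
 * stay silent). Returns the number of differing bytes.
 */
size_t reg_dump_diff(const uint8_t *before, const uint8_t *after,
		     size_t len, FILE *out)
{
	size_t i, ndiff = 0;

	assert(before != NULL && after != NULL);

	for (i = 0; i < len; i++) {
		if (before[i] == after[i])
			continue;
		if (out)
			fprintf(out, "0x%02zx: %02x -> %02x\n", i,
				(unsigned int)before[i],
				(unsigned int)after[i]);
		ndiff++;
	}
	return ndiff;
}
```

Calling reg_dump_diff(before, after, len, stdout) lists only the offsets
that changed across the suspend cycle, which is easier to scan than two
full dumps side by side.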

>>     I tried the latest 4.20 rc version but the problem still there. I
>> also tried some
>> hw_reset or init thing in the resume path but no effect. Any
>> suggestion for this?
>> Thanks
>>
Did previous kernel versions work? If it's a regression, a bisect would be
appreciated, because with the chip versions I've got I can't reproduce the issue.

>> Chris
> 
> Gentle ping. Any additional information required?
> 
> Chris
> 
Heiner


* Re: A weird problem of Realtek r8168 after resume from S3
  2018-12-14  7:36   ` Heiner Kallweit
@ 2018-12-17 13:25     ` Chris Chiu
  2018-12-17 19:08       ` Heiner Kallweit
  2018-12-17 21:45       ` Heiner Kallweit
  0 siblings, 2 replies; 17+ messages in thread
From: Chris Chiu @ 2018-12-17 13:25 UTC (permalink / raw)
  To: Heiner Kallweit
  Cc: nic_swsd, davem, netdev, Linux Kernel, Linux Upstreaming Team

On Fri, Dec 14, 2018 at 3:37 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>
> On 14.12.2018 04:33, Chris Chiu wrote:
> > On Thu, Dec 13, 2018 at 10:20 AM Chris Chiu <chiu@endlessm.com> wrote:
> >>
> >> Hi,
> >>     We got an acer laptop which has a problem with ethernet networking after
> >> resuming from S3. The ethernet is popular realtek r8168. The lspci shows as
> >> follows.
> >> 02:00.1 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
> >> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 12)
> >>
> Helpful would be a "dmesg | grep r8169", especially chip name + XID.
>
[   22.362774] r8169 0000:02:00.1 (unnamed net_device) (uninitialized): mac_version = 0x2b
[   22.365580] libphy: r8169: probed
[   22.365958] r8169 0000:02:00.1 eth0: RTL8411, 00:e0:b8:1f:cb:83, XID 5c800800, IRQ 38
[   22.365961] r8169 0000:02:00.1 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]

> >>     The problem is the ethernet is not accessible after resume. Pinging via
> >> ethernet always shows the response `Destination Host Unreachable`. However,
> >> the interesting part is, when I run tcpdump to monitor the problematic ethernet
> >> interface, the networking is back to alive. But it's dead again after
> >> I stop tcpdump.
> >> One more thing, if I ping the problematic machine from others, it achieves the
> >> same effect as above tcpdump. Maybe it's about the register setting for RX path?
> >>
> You could compare the register dumps (ethtool -d) before and after S3 sleep
> to find out whether there's a difference.
>

Actually, I just found that I was pointing in the wrong direction. S3 suspend
does help to reproduce the problem, but it's not necessary. All I need to do is
ping for around 5 minutes and the network connection fails. I also found one
interesting thing: disabling MSI-X interrupts, as commit
[d49c88d7677ba737e9d2759a87db0402d5ab2607] did, fixes this problem, although
I don't understand the root cause. Is there anything I can do to help?


* Re: A weird problem of Realtek r8168 after resume from S3
  2018-12-17 13:25     ` Chris Chiu
@ 2018-12-17 19:08       ` Heiner Kallweit
  2018-12-18 13:25         ` Chris Chiu
  2018-12-17 21:45       ` Heiner Kallweit
  1 sibling, 1 reply; 17+ messages in thread
From: Heiner Kallweit @ 2018-12-17 19:08 UTC (permalink / raw)
  To: Chris Chiu; +Cc: nic_swsd, davem, netdev, Linux Kernel, Linux Upstreaming Team

On 17.12.2018 14:25, Chris Chiu wrote:
> On Fri, Dec 14, 2018 at 3:37 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>
>> On 14.12.2018 04:33, Chris Chiu wrote:
>>> On Thu, Dec 13, 2018 at 10:20 AM Chris Chiu <chiu@endlessm.com> wrote:
>>>>
>>>> Hi,
>>>>     We got an acer laptop which has a problem with ethernet networking after
>>>> resuming from S3. The ethernet is popular realtek r8168. The lspci shows as
>>>> follows.
>>>> 02:00.1 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
>>>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 12)
>>>>
>> Helpful would be a "dmesg | grep r8169", especially chip name + XID.
>>
> [   22.362774] r8169 0000:02:00.1 (unnamed net_device)
> (uninitialized): mac_version = 0x2b
> [   22.365580] libphy: r8169: probed
> [   22.365958] r8169 0000:02:00.1 eth0: RTL8411, 00:e0:b8:1f:cb:83,
> XID 5c800800, IRQ 38
> [   22.365961] r8169 0000:02:00.1 eth0: jumbo features [frames: 9200
> bytes, tx checksumming: ko]
> 
Thanks for the info.

>>>>     The problem is the ethernet is not accessible after resume. Pinging via
>>>> ethernet always shows the response `Destination Host Unreachable`. However,
>>>> the interesting part is, when I run tcpdump to monitor the problematic ethernet
>>>> interface, the networking is back to alive. But it's dead again after
>>>> I stop tcpdump.
>>>> One more thing, if I ping the problematic machine from others, it achieves the
>>>> same effect as above tcpdump. Maybe it's about the register setting for RX path?
>>>>
>> You could compare the register dumps (ethtool -d) before and after S3 sleep
>> to find out whether there's a difference.
>>
> 
> Actually, I just found I lead the wrong direction. The S3 suspend does
> help to reproduce,
> but it's not necessary. All I need to do is ping around 5 mins and the
> network connection
> fails.  And I also find one thing interesting, disabling the  MSI-X
> interrupt like commit
> [d49c88d7677ba737e9d2759a87db0402d5ab2607] can fix this problem.
> Although I don't
> understand the root cause. Anything I can do to help?
> 
This is indeed very, very weird. You say switching from MSI-X to MSI fixes
the issue, but pinging the machine from outside also brings the network back.
These two actions touch completely different areas.

The commit you mention was a workaround in the driver for a related issue;
the root cause was an MSI-X-related issue with certain Intel chipsets deep
in the PCI core. After that was fixed we removed the workaround again.
This shouldn't be related to your issue.

For now it is hard to say whether the issue is:
- a driver issue
- a hardware issue in the RTL8411
- an issue with the chipset on your mainboard

According to your description it doesn't take a special scenario to trigger
the issue, so most likely other users of Acer notebooks with RTL8411 are
affected too (after a brief check this should be at least Aspire
F15, V15, V7). Therefore I wonder why there aren't more reports.

This commit added MSI-X support: 6c6aa15fdea5 ("r8169: improve interrupt handling")
So you could test this revision and the one before.

In the end, if the issue really is caused by a side effect of using
MSI-X, then the question is whether we need to disable MSI-X for RTL8411
in general or just for RTL8411 with a certain subsystem ID.



* Re: A weird problem of Realtek r8168 after resume from S3
  2018-12-17 13:25     ` Chris Chiu
  2018-12-17 19:08       ` Heiner Kallweit
@ 2018-12-17 21:45       ` Heiner Kallweit
  2018-12-18 12:31         ` Chris Chiu
  1 sibling, 1 reply; 17+ messages in thread
From: Heiner Kallweit @ 2018-12-17 21:45 UTC (permalink / raw)
  To: Chris Chiu; +Cc: nic_swsd, davem, netdev, Linux Kernel, Linux Upstreaming Team

On 17.12.2018 14:25, Chris Chiu wrote:
> On Fri, Dec 14, 2018 at 3:37 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>
>> On 14.12.2018 04:33, Chris Chiu wrote:
>>> On Thu, Dec 13, 2018 at 10:20 AM Chris Chiu <chiu@endlessm.com> wrote:
>>>>
>>>> Hi,
>>>>     We got an acer laptop which has a problem with ethernet networking after
>>>> resuming from S3. The ethernet is popular realtek r8168. The lspci shows as
>>>> follows.
>>>> 02:00.1 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
>>>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 12)
>>>>
>> Helpful would be a "dmesg | grep r8169", especially chip name + XID.
>>
> [   22.362774] r8169 0000:02:00.1 (unnamed net_device)
> (uninitialized): mac_version = 0x2b
> [   22.365580] libphy: r8169: probed
> [   22.365958] r8169 0000:02:00.1 eth0: RTL8411, 00:e0:b8:1f:cb:83,
> XID 5c800800, IRQ 38
> [   22.365961] r8169 0000:02:00.1 eth0: jumbo features [frames: 9200
> bytes, tx checksumming: ko]
> 
>>>>     The problem is the ethernet is not accessible after resume. Pinging via
>>>> ethernet always shows the response `Destination Host Unreachable`. However,
>>>> the interesting part is, when I run tcpdump to monitor the problematic ethernet
>>>> interface, the networking is back to alive. But it's dead again after
>>>> I stop tcpdump.
>>>> One more thing, if I ping the problematic machine from others, it achieves the
>>>> same effect as above tcpdump. Maybe it's about the register setting for RX path?
>>>>
>> You could compare the register dumps (ethtool -d) before and after S3 sleep
>> to find out whether there's a difference.
>>
> 
> Actually, I just found I lead the wrong direction. The S3 suspend does
> help to reproduce,
> but it's not necessary. All I need to do is ping around 5 mins and the
> network connection
> fails.  And I also find one thing interesting, disabling the  MSI-X
> interrupt like commit
> [d49c88d7677ba737e9d2759a87db0402d5ab2607] can fix this problem.
> Although I don't
> understand the root cause. Anything I can do to help?
> 
One more thing: I checked the vendor driver, and it uses a different sequence
to initialize the ePHY. Could you please check whether the following patch
makes a difference? I don't have much hope but it's worth a try.

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 8462553e3..7cfb22e05 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -5097,11 +5097,16 @@ static void rtl_hw_start_8168g_2(struct rtl8169_private *tp)
 static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
 {
 	static const struct ephy_info e_info_8411_2[] = {
-		{ 0x00, 0x0000,	0x0008 },
-		{ 0x0c, 0x3df0,	0x0200 },
-		{ 0x0f, 0xffff,	0x5200 },
-		{ 0x19, 0x0020,	0x0000 },
-		{ 0x1e, 0x0000,	0x2000 }
+		{ 0x00, 0x0008,	0x0000 },
+		{ 0x0c, 0x37d0,	0x0820 },
+		{ 0x1e, 0x0000,	0x0001 },
+		{ 0x19, 0x8021,	0x0000 },
+		{ 0x1e, 0x0000,	0x2000 },
+		{ 0x0d, 0x0100,	0x0200 },
+		{ 0x00, 0x0000,	0x0080 },
+		{ 0x06, 0x0000,	0x0010 },
+		{ 0x04, 0x0000,	0x0010 },
+		{ 0x1d, 0x0000,	0x4000 },
 	};
 
 	rtl_hw_start_8168g(tp);
-- 
2.20.0
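
For readers unfamiliar with the table format: as far as I understand the
driver's rtl_ephy_init(), each entry is { offset, mask, bits }, and the
register at offset is read, the mask bits are cleared, the bits are set,
and the result is written back. The following is a userspace sketch of
that read-modify-write under this assumption. The regs array, EPHY_REGS
and ephy_apply() are stand-ins made up for the illustration; only the
table values come from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EPHY_REGS 0x20	/* assumed size of the simulated register file */

/* Same three fields as the table entries in the patch. */
struct ephy_info {
	unsigned int offset;	/* ePHY register number */
	uint16_t mask;		/* bits to clear */
	uint16_t bits;		/* bits to set */
};

/* New table from the patch, copied verbatim. */
const struct ephy_info e_info_8411_2_new[] = {
	{ 0x00, 0x0008,	0x0000 },
	{ 0x0c, 0x37d0,	0x0820 },
	{ 0x1e, 0x0000,	0x0001 },
	{ 0x19, 0x8021,	0x0000 },
	{ 0x1e, 0x0000,	0x2000 },
	{ 0x0d, 0x0100,	0x0200 },
	{ 0x00, 0x0000,	0x0080 },
	{ 0x06, 0x0000,	0x0010 },
	{ 0x04, 0x0000,	0x0010 },
	{ 0x1d, 0x0000,	0x4000 },
};

#define E_INFO_LEN (sizeof(e_info_8411_2_new) / sizeof(e_info_8411_2_new[0]))

/* Apply the entries in order: clear 'mask', then set 'bits'. */
void ephy_apply(uint16_t *regs, const struct ephy_info *e, size_t len)
{
	while (len-- > 0) {
		assert(e->offset < EPHY_REGS);
		regs[e->offset] =
			(uint16_t)((regs[e->offset] & ~e->mask) | e->bits);
		e++;
	}
}
```

Registers 0x00 and 0x1e appear twice in the new table, so the entries have
to be applied in order; the second write to 0x1e sees the bit set by the
first.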




* Re: A weird problem of Realtek r8168 after resume from S3
  2018-12-17 21:45       ` Heiner Kallweit
@ 2018-12-18 12:31         ` Chris Chiu
  0 siblings, 0 replies; 17+ messages in thread
From: Chris Chiu @ 2018-12-18 12:31 UTC (permalink / raw)
  To: Heiner Kallweit
  Cc: nic_swsd, davem, netdev, Linux Kernel, Linux Upstreaming Team

On Tue, Dec 18, 2018 at 5:45 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>
> On 17.12.2018 14:25, Chris Chiu wrote:
> > On Fri, Dec 14, 2018 at 3:37 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>
> >> On 14.12.2018 04:33, Chris Chiu wrote:
> >>> On Thu, Dec 13, 2018 at 10:20 AM Chris Chiu <chiu@endlessm.com> wrote:
> >>>>
> >>>> Hi,
> >>>>     We got an acer laptop which has a problem with ethernet networking after
> >>>> resuming from S3. The ethernet is popular realtek r8168. The lspci shows as
> >>>> follows.
> >>>> 02:00.1 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
> >>>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 12)
> >>>>
> >> Helpful would be a "dmesg | grep r8169", especially chip name + XID.
> >>
> > [   22.362774] r8169 0000:02:00.1 (unnamed net_device)
> > (uninitialized): mac_version = 0x2b
> > [   22.365580] libphy: r8169: probed
> > [   22.365958] r8169 0000:02:00.1 eth0: RTL8411, 00:e0:b8:1f:cb:83,
> > XID 5c800800, IRQ 38
> > [   22.365961] r8169 0000:02:00.1 eth0: jumbo features [frames: 9200
> > bytes, tx checksumming: ko]
> >
> >>>>     The problem is the ethernet is not accessible after resume. Pinging via
> >>>> ethernet always shows the response `Destination Host Unreachable`. However,
> >>>> the interesting part is, when I run tcpdump to monitor the problematic ethernet
> >>>> interface, the networking is back to alive. But it's dead again after
> >>>> I stop tcpdump.
> >>>> One more thing, if I ping the problematic machine from others, it achieves the
> >>>> same effect as above tcpdump. Maybe it's about the register setting for RX path?
> >>>>
> >> You could compare the register dumps (ethtool -d) before and after S3 sleep
> >> to find out whether there's a difference.
> >>
> >
> > Actually, I just found I lead the wrong direction. The S3 suspend does
> > help to reproduce,
> > but it's not necessary. All I need to do is ping around 5 mins and the
> > network connection
> > fails.  And I also find one thing interesting, disabling the  MSI-X
> > interrupt like commit
> > [d49c88d7677ba737e9d2759a87db0402d5ab2607] can fix this problem.
> > Although I don't
> > understand the root cause. Anything I can do to help?
> >
> One  more thing: I checked the vendor driver and it uses a different sequence
> to initialize the ePHY. Could you please check whether the following patch
> makes a difference? I don't have much hope but it's worth a try.
>
> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index 8462553e3..7cfb22e05 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -5097,11 +5097,16 @@ static void rtl_hw_start_8168g_2(struct rtl8169_private *tp)
>  static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>  {
>         static const struct ephy_info e_info_8411_2[] = {
> -               { 0x00, 0x0000, 0x0008 },
> -               { 0x0c, 0x3df0, 0x0200 },
> -               { 0x0f, 0xffff, 0x5200 },
> -               { 0x19, 0x0020, 0x0000 },
> -               { 0x1e, 0x0000, 0x2000 }
> +               { 0x00, 0x0008, 0x0000 },
> +               { 0x0c, 0x37d0, 0x0820 },
> +               { 0x1e, 0x0000, 0x0001 },
> +               { 0x19, 0x8021, 0x0000 },
> +               { 0x1e, 0x0000, 0x2000 },
> +               { 0x0d, 0x0100, 0x0200 },
> +               { 0x00, 0x0000, 0x0080 },
> +               { 0x06, 0x0000, 0x0010 },
> +               { 0x04, 0x0000, 0x0010 },
> +               { 0x1d, 0x0000, 0x4000 },
>         };
>
>         rtl_hw_start_8168g(tp);
> --
> 2.20.0
>
As you expected, I applied the PHY init change for this specific MAC_VER_43
and it makes no difference.



* Re: A weird problem of Realtek r8168 after resume from S3
  2018-12-17 19:08       ` Heiner Kallweit
@ 2018-12-18 13:25         ` Chris Chiu
  2018-12-18 18:21           ` Heiner Kallweit
  2018-12-18 20:28           ` Heiner Kallweit
  0 siblings, 2 replies; 17+ messages in thread
From: Chris Chiu @ 2018-12-18 13:25 UTC (permalink / raw)
  To: Heiner Kallweit
  Cc: nic_swsd, davem, netdev, Linux Kernel, Linux Upstreaming Team

On Tue, Dec 18, 2018 at 3:08 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>
> On 17.12.2018 14:25, Chris Chiu wrote:
> > On Fri, Dec 14, 2018 at 3:37 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>
> >> On 14.12.2018 04:33, Chris Chiu wrote:
> >>> On Thu, Dec 13, 2018 at 10:20 AM Chris Chiu <chiu@endlessm.com> wrote:
> >>>>
> >>>> Hi,
> >>>>     We got an acer laptop which has a problem with ethernet networking after
> >>>> resuming from S3. The ethernet is popular realtek r8168. The lspci shows as
> >>>> follows.
> >>>> 02:00.1 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
> >>>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 12)
> >>>>
> >> Helpful would be a "dmesg | grep r8169", especially chip name + XID.
> >>
> > [   22.362774] r8169 0000:02:00.1 (unnamed net_device)
> > (uninitialized): mac_version = 0x2b
> > [   22.365580] libphy: r8169: probed
> > [   22.365958] r8169 0000:02:00.1 eth0: RTL8411, 00:e0:b8:1f:cb:83,
> > XID 5c800800, IRQ 38
> > [   22.365961] r8169 0000:02:00.1 eth0: jumbo features [frames: 9200
> > bytes, tx checksumming: ko]
> >
> Thanks for the info.
>
> >>>>     The problem is the ethernet is not accessible after resume. Pinging via
> >>>> ethernet always shows the response `Destination Host Unreachable`. However,
> >>>> the interesting part is, when I run tcpdump to monitor the problematic ethernet
> >>>> interface, the networking is back to alive. But it's dead again after
> >>>> I stop tcpdump.
> >>>> One more thing, if I ping the problematic machine from others, it achieves the
> >>>> same effect as above tcpdump. Maybe it's about the register setting for RX path?
> >>>>
> >> You could compare the register dumps (ethtool -d) before and after S3 sleep
> >> to find out whether there's a difference.
> >>
> >
> > Actually, I just found I lead the wrong direction. The S3 suspend does
> > help to reproduce,
> > but it's not necessary. All I need to do is ping around 5 mins and the
> > network connection
> > fails.  And I also find one thing interesting, disabling the  MSI-X
> > interrupt like commit
> > [d49c88d7677ba737e9d2759a87db0402d5ab2607] can fix this problem.
> > Although I don't
> > understand the root cause. Anything I can do to help?
> >
> This is indeed very, very weird. You say switching from MSI-X to MSI fixes
> the issue, but also pinging the machine from outside brings back the network.
> Both actions affect totally different corners.
>
> The commit and related issue you mention was a workaround in the driver,
> the root cause was a MSI-X-related  issue with certain Intel chipsets deep
> in the PCI core. After this was fixed we removed the workaround again.
> This shouldn't be related to your issue.
>
> Hard to say for now is whether the issue is:
> - a driver issue
> - a hardware issue in the RTL8411
> - an issue with the chipset on your mainboard
>
> According to your description it doesn't take a special scenario to trigger
> the issue, so most likely also other users of Acer notebooks with RTL8411
> should be affected (after briefly checking this should be at least Aspire
> F15, V15, V7). Therefore I wonder why there aren't more reports.
>
> This commit added MSI-X support: 6c6aa15fdea5 ("r8169: improve interrupt handling")
> So you could test this revision and the one before.
>
> Eventually, if the issue really should be caused by a side effect of using
> MSI-X, then the question is whether we need to disable MSI-X for RTL8411
> in general or just for RTL8411 and a certain subsystem id.
>

I tried a kernel with HEAD at 6c6aa15fdea5 ("r8169: improve interrupt
handling") and the problem is still there. When I go back to the previous
revision, the problem goes away. So I think it's pretty much a side effect
of MSI-X. However, since you mentioned that you didn't hit this problem,
I'll ask the vendor to verify whether it also happens on other machines with
the same chip. Then we can decide whether to disable MSI-X for the specific
MAC version or just for a certain subsystem ID.



* Re: A weird problem of Realtek r8168 after resume from S3
  2018-12-18 13:25         ` Chris Chiu
@ 2018-12-18 18:21           ` Heiner Kallweit
  2018-12-19 14:37             ` Chris Chiu
  2018-12-18 20:28           ` Heiner Kallweit
  1 sibling, 1 reply; 17+ messages in thread
From: Heiner Kallweit @ 2018-12-18 18:21 UTC (permalink / raw)
  To: Chris Chiu; +Cc: nic_swsd, davem, netdev, Linux Kernel, Linux Upstreaming Team

On 18.12.2018 14:25, Chris Chiu wrote:
> On Tue, Dec 18, 2018 at 3:08 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>
>> On 17.12.2018 14:25, Chris Chiu wrote:
>>> On Fri, Dec 14, 2018 at 3:37 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>
>>>> On 14.12.2018 04:33, Chris Chiu wrote:
>>>>> On Thu, Dec 13, 2018 at 10:20 AM Chris Chiu <chiu@endlessm.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>     We got an acer laptop which has a problem with ethernet networking after
>>>>>> resuming from S3. The ethernet is popular realtek r8168. The lspci shows as
>>>>>> follows.
>>>>>> 02:00.1 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
>>>>>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 12)
>>>>>>
>>>> Helpful would be a "dmesg | grep r8169", especially chip name + XID.
>>>>
>>> [   22.362774] r8169 0000:02:00.1 (unnamed net_device)
>>> (uninitialized): mac_version = 0x2b
>>> [   22.365580] libphy: r8169: probed
>>> [   22.365958] r8169 0000:02:00.1 eth0: RTL8411, 00:e0:b8:1f:cb:83,
>>> XID 5c800800, IRQ 38
>>> [   22.365961] r8169 0000:02:00.1 eth0: jumbo features [frames: 9200
>>> bytes, tx checksumming: ko]
>>>
>> Thanks for the info.
>>
>>>>>>     The problem is the ethernet is not accessible after resume. Pinging via
>>>>>> ethernet always shows the response `Destination Host Unreachable`. However,
>>>>>> the interesting part is, when I run tcpdump to monitor the problematic ethernet
>>>>>> interface, the networking is back to alive. But it's dead again after
>>>>>> I stop tcpdump.
>>>>>> One more thing, if I ping the problematic machine from others, it achieves the
>>>>>> same effect as above tcpdump. Maybe it's about the register setting for RX path?
>>>>>>
>>>> You could compare the register dumps (ethtool -d) before and after S3 sleep
>>>> to find out whether there's a difference.
>>>>
>>>
>>> Actually, I just found I lead the wrong direction. The S3 suspend does
>>> help to reproduce,
>>> but it's not necessary. All I need to do is ping around 5 mins and the
>>> network connection
>>> fails.  And I also find one thing interesting, disabling the  MSI-X
>>> interrupt like commit
>>> [d49c88d7677ba737e9d2759a87db0402d5ab2607] can fix this problem.
>>> Although I don't
>>> understand the root cause. Anything I can do to help?
>>>
>> This is indeed very, very weird. You say switching from MSI-X to MSI fixes
>> the issue, but also pinging the machine from outside brings back the network.
>> Both actions affect totally different corners.
>>
>> The commit and related issue you mention was a workaround in the driver,
>> the root cause was a MSI-X-related  issue with certain Intel chipsets deep
>> in the PCI core. After this was fixed we removed the workaround again.
>> This shouldn't be related to your issue.
>>
>> Hard to say for now is whether the issue is:
>> - a driver issue
>> - a hardware issue in the RTL8411
>> - an issue with the chipset on your mainboard
>>
>> According to your description it doesn't take a special scenario to trigger
>> the issue, so most likely also other users of Acer notebooks with RTL8411
>> should be affected (after briefly checking this should be at least Aspire
>> F15, V15, V7). Therefore I wonder why there aren't more reports.
>>
>> This commit added MSI-X support: 6c6aa15fdea5 ("r8169: improve interrupt handling")
>> So you could test this revision and the one before.
>>
>> Eventually, if the issue really should be caused by a side effect of using
>> MSI-X, then the question is whether we need to disable MSI-X for RTL8411
>> in general or just for RTL8411 and a certain subsystem id.
>>
> 
> I tried the kernel with the head on 6c6aa15fdea5 ("r8169: improve
> interrupt handling"),
> the problem still there. Then I revert to the previous revision, the
> problem goes away.
> So I think it's pretty much the side effect of MSI-X. However, as you
> mentioned that
> you didn't hit this problem, I'll ask the vendor to verify if this
> problem also happens on
> other machines with the same chip. Then we can determine to disable for specific
> mac version or just a certain subsystem id.
> 
Thanks a lot for testing. OK, I have one more idea.
AFAICS RTL8411 also has an integrated card reader controller which is driven
by module rtsx_pci. Maybe if both components (card reader controller + ethernet)
use different interrupt types, RTL8411 can't properly handle this.
In case module rtsx_pci is loaded on your system, can you check whether not
loading it (e.g. by blacklisting) or removing it makes a difference?

Can you provide the "lspci -v" output for the card reader part of RTL8411?




* Re: A weird problem of Realtek r8168 after resume from S3
  2018-12-18 13:25         ` Chris Chiu
  2018-12-18 18:21           ` Heiner Kallweit
@ 2018-12-18 20:28           ` Heiner Kallweit
  2018-12-19 15:32             ` Chris Chiu
  1 sibling, 1 reply; 17+ messages in thread
From: Heiner Kallweit @ 2018-12-18 20:28 UTC (permalink / raw)
  To: Chris Chiu; +Cc: nic_swsd, davem, netdev, Linux Kernel, Linux Upstreaming Team

On 18.12.2018 14:25, Chris Chiu wrote:
> On Tue, Dec 18, 2018 at 3:08 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>
>> On 17.12.2018 14:25, Chris Chiu wrote:
>>> On Fri, Dec 14, 2018 at 3:37 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>
>>>> On 14.12.2018 04:33, Chris Chiu wrote:
>>>>> On Thu, Dec 13, 2018 at 10:20 AM Chris Chiu <chiu@endlessm.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>     We got an acer laptop which has a problem with ethernet networking after
>>>>>> resuming from S3. The ethernet is popular realtek r8168. The lspci shows as
>>>>>> follows.
>>>>>> 02:00.1 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
>>>>>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 12)
>>>>>>
>>>> Helpful would be a "dmesg | grep r8169", especially chip name + XID.
>>>>
>>> [   22.362774] r8169 0000:02:00.1 (unnamed net_device)
>>> (uninitialized): mac_version = 0x2b
>>> [   22.365580] libphy: r8169: probed
>>> [   22.365958] r8169 0000:02:00.1 eth0: RTL8411, 00:e0:b8:1f:cb:83,
>>> XID 5c800800, IRQ 38
>>> [   22.365961] r8169 0000:02:00.1 eth0: jumbo features [frames: 9200
>>> bytes, tx checksumming: ko]
>>>
>> Thanks for the info.
>>
>>>>>>     The problem is the ethernet is not accessible after resume. Pinging via
>>>>>> ethernet always shows the response `Destination Host Unreachable`. However,
>>>>>> the interesting part is, when I run tcpdump to monitor the problematic ethernet
>>>>>> interface, the networking is back to alive. But it's dead again after
>>>>>> I stop tcpdump.
>>>>>> One more thing, if I ping the problematic machine from others, it achieves the
>>>>>> same effect as above tcpdump. Maybe it's about the register setting for RX path?
>>>>>>
>>>> You could compare the register dumps (ethtool -d) before and after S3 sleep
>>>> to find out whether there's a difference.
>>>>
>>>
>>> Actually, I just found I lead the wrong direction. The S3 suspend does
>>> help to reproduce,
>>> but it's not necessary. All I need to do is ping around 5 mins and the
>>> network connection
>>> fails.  And I also find one thing interesting, disabling the  MSI-X
>>> interrupt like commit
>>> [d49c88d7677ba737e9d2759a87db0402d5ab2607] can fix this problem.
>>> Although I don't
>>> understand the root cause. Anything I can do to help?
>>>
>> This is indeed very, very weird. You say switching from MSI-X to MSI fixes
>> the issue, but also pinging the machine from outside brings back the network.
>> Both actions affect totally different corners.
>>
>> The commit and related issue you mention was a workaround in the driver,
>> the root cause was a MSI-X-related  issue with certain Intel chipsets deep
>> in the PCI core. After this was fixed we removed the workaround again.
>> This shouldn't be related to your issue.
>>
>> Hard to say for now is whether the issue is:
>> - a driver issue
>> - a hardware issue in the RTL8411
>> - an issue with the chipset on your mainboard
>>
>> According to your description it doesn't take a special scenario to trigger
>> the issue, so most likely also other users of Acer notebooks with RTL8411
>> should be affected (after briefly checking this should be at least Aspire
>> F15, V15, V7). Therefore I wonder why there aren't more reports.
>>
>> This commit added MSI-X support: 6c6aa15fdea5 ("r8169: improve interrupt handling")
>> So you could test this revision and the one before.
>>
>> Eventually, if the issue really should be caused by a side effect of using
>> MSI-X, then the question is whether we need to disable MSI-X for RTL8411
>> in general or just for RTL8411 and a certain subsystem id.
>>
> 
> I tried the kernel with the head on 6c6aa15fdea5 ("r8169: improve
> interrupt handling"),
> the problem still there. Then I revert to the previous revision, the
> problem goes away.
> So I think it's pretty much the side effect of MSI-X. However, as you
> mentioned that
> you didn't hit this problem, I'll ask the vendor to verify if this
> problem also happens on
> other machines with the same chip. Then we can determine to disable for specific
> mac version or just a certain subsystem id.
> 
>>>>>>     I tried the latest 4.20 rc version but the problem still there. I
>>>>>> also tried some
>>>>>> hw_reset or init thing in the resume path but no effect. Any
>>>>>> suggestion for this?
>>>>>> Thanks
>>>>>>
>>>> Did previous kernel versions work? If it's a regression, a bisect would be
>>>> appreciated, because with the chip versions I've got I can't reproduce the issue.
>>>>
>>>>>> Chris
>>>>>
>>>>> Gentle ping. Any additional information required?
>>>>>
>>>>> Chris
>>>>>
>>>> Heiner
>>>
>>
> 

As an additional note:
I found that the rtsx_pci driver doesn't support MSI-X currently.
The following patch adds MSI-X support (it's compile-tested only
because I don't have a system with RTL8411).
Would be interesting to see whether it makes a difference if both
components on this combo chip use MSI-X.

---
 drivers/misc/cardreader/rtsx_pcr.c | 51 ++++++++++--------------------
 include/linux/rtsx_pci.h           |  1 -
 2 files changed, 16 insertions(+), 36 deletions(-)

diff --git a/drivers/misc/cardreader/rtsx_pcr.c b/drivers/misc/cardreader/rtsx_pcr.c
index da445223f..d1349c248 100644
--- a/drivers/misc/cardreader/rtsx_pcr.c
+++ b/drivers/misc/cardreader/rtsx_pcr.c
@@ -35,10 +35,6 @@
 
 #include "rtsx_pcr.h"
 
-static bool msi_en = true;
-module_param(msi_en, bool, S_IRUGO | S_IWUSR);
-MODULE_PARM_DESC(msi_en, "Enable MSI");
-
 static DEFINE_IDR(rtsx_pci_idr);
 static DEFINE_SPINLOCK(rtsx_pci_lock);
 
@@ -1049,22 +1045,21 @@ static irqreturn_t rtsx_pci_isr(int irq, void *dev_id)
 
 static int rtsx_pci_acquire_irq(struct rtsx_pcr *pcr)
 {
-	pcr_dbg(pcr, "%s: pcr->msi_en = %d, pci->irq = %d\n",
-			__func__, pcr->msi_en, pcr->pci->irq);
+	int ret;
 
-	if (request_irq(pcr->pci->irq, rtsx_pci_isr,
-			pcr->msi_en ? 0 : IRQF_SHARED,
-			DRV_NAME_RTSX_PCI, pcr)) {
-		dev_err(&(pcr->pci->dev),
-			"rtsx_sdmmc: unable to grab IRQ %d, disabling device\n",
-			pcr->pci->irq);
-		return -1;
-	}
+	ret = pci_alloc_irq_vectors(pcr->pci, 1, 1, PCI_IRQ_ALL_TYPES);
+	if (ret < 0)
+		goto err;
 
-	pcr->irq = pcr->pci->irq;
-	pci_intx(pcr->pci, !pcr->msi_en);
+	ret = pci_request_irq(pcr->pci, 0, rtsx_pci_isr, NULL, pcr,
+			      DRV_NAME_RTSX_PCI);
+	if (ret)
+		goto err;
 
 	return 0;
+err:
+	pci_err(pcr->pci, "rtsx_sdmmc: unable to grab interrupt\n");
+	return ret;
 }
 
 static void rtsx_enable_aspm(struct rtsx_pcr *pcr)
@@ -1496,19 +1491,11 @@ static int rtsx_pci_probe(struct pci_dev *pcidev,
 	INIT_DELAYED_WORK(&pcr->carddet_work, rtsx_pci_card_detect);
 	INIT_DELAYED_WORK(&pcr->idle_work, rtsx_pci_idle_work);
 
-	pcr->msi_en = msi_en;
-	if (pcr->msi_en) {
-		ret = pci_enable_msi(pcidev);
-		if (ret)
-			pcr->msi_en = false;
-	}
-
 	ret = rtsx_pci_acquire_irq(pcr);
 	if (ret < 0)
-		goto disable_msi;
+		goto free_dma;
 
 	pci_set_master(pcidev);
-	synchronize_irq(pcr->irq);
 
 	ret = rtsx_pci_init_chip(pcr);
 	if (ret < 0)
@@ -1528,10 +1515,8 @@ static int rtsx_pci_probe(struct pci_dev *pcidev,
 	return 0;
 
 disable_irq:
-	free_irq(pcr->irq, (void *)pcr);
-disable_msi:
-	if (pcr->msi_en)
-		pci_disable_msi(pcr->pci);
+	pci_free_irq(pcr->pci, 0, pcr);
+free_dma:
 	dma_free_coherent(&(pcr->pci->dev), RTSX_RESV_BUF_LEN,
 			pcr->rtsx_resv_buf, pcr->rtsx_resv_buf_addr);
 unmap:
@@ -1568,9 +1553,7 @@ static void rtsx_pci_remove(struct pci_dev *pcidev)
 
 	dma_free_coherent(&(pcr->pci->dev), RTSX_RESV_BUF_LEN,
 			pcr->rtsx_resv_buf, pcr->rtsx_resv_buf_addr);
-	free_irq(pcr->irq, (void *)pcr);
-	if (pcr->msi_en)
-		pci_disable_msi(pcr->pci);
+	pci_free_irq(pcr->pci, 0, pcr);
 	iounmap(pcr->remap_addr);
 
 	pci_release_regions(pcidev);
@@ -1664,9 +1647,7 @@ static void rtsx_pci_shutdown(struct pci_dev *pcidev)
 	rtsx_pci_power_off(pcr, HOST_ENTER_S1);
 
 	pci_disable_device(pcidev);
-	free_irq(pcr->irq, (void *)pcr);
-	if (pcr->msi_en)
-		pci_disable_msi(pcr->pci);
+	pci_free_irq(pcr->pci, 0, pcr);
 }
 
 #else /* CONFIG_PM */
diff --git a/include/linux/rtsx_pci.h b/include/linux/rtsx_pci.h
index e964bbd03..10abfe7f2 100644
--- a/include/linux/rtsx_pci.h
+++ b/include/linux/rtsx_pci.h
@@ -1190,7 +1190,6 @@ struct rtsx_pcr {
 	/* pci resources */
 	unsigned long			addr;
 	void __iomem			*remap_addr;
-	int				irq;
 
 	/* host reserved buffer */
 	void				*rtsx_resv_buf;
-- 
2.20.0



* Re: A weird problem of Realtek r8168 after resume from S3
  2018-12-18 18:21           ` Heiner Kallweit
@ 2018-12-19 14:37             ` Chris Chiu
  0 siblings, 0 replies; 17+ messages in thread
From: Chris Chiu @ 2018-12-19 14:37 UTC (permalink / raw)
  To: Heiner Kallweit
  Cc: nic_swsd, davem, netdev, Linux Kernel, Linux Upstreaming Team

On Wed, Dec 19, 2018 at 2:22 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>
> On 18.12.2018 14:25, Chris Chiu wrote:
> > On Tue, Dec 18, 2018 at 3:08 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>
> >> On 17.12.2018 14:25, Chris Chiu wrote:
> >>> On Fri, Dec 14, 2018 at 3:37 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>>>
> >>>> On 14.12.2018 04:33, Chris Chiu wrote:
> >>>>> On Thu, Dec 13, 2018 at 10:20 AM Chris Chiu <chiu@endlessm.com> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>     We got an acer laptop which has a problem with ethernet networking after
> >>>>>> resuming from S3. The ethernet is popular realtek r8168. The lspci shows as
> >>>>>> follows.
> >>>>>> 02:00.1 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
> >>>>>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 12)
> >>>>>>
> >>>> Helpful would be a "dmesg | grep r8169", especially chip name + XID.
> >>>>
> >>> [   22.362774] r8169 0000:02:00.1 (unnamed net_device)
> >>> (uninitialized): mac_version = 0x2b
> >>> [   22.365580] libphy: r8169: probed
> >>> [   22.365958] r8169 0000:02:00.1 eth0: RTL8411, 00:e0:b8:1f:cb:83,
> >>> XID 5c800800, IRQ 38
> >>> [   22.365961] r8169 0000:02:00.1 eth0: jumbo features [frames: 9200
> >>> bytes, tx checksumming: ko]
> >>>
> >> Thanks for the info.
> >>
> >>>>>>     The problem is the ethernet is not accessible after resume. Pinging via
> >>>>>> ethernet always shows the response `Destination Host Unreachable`. However,
> >>>>>> the interesting part is, when I run tcpdump to monitor the problematic ethernet
> >>>>>> interface, the networking is back to alive. But it's dead again after
> >>>>>> I stop tcpdump.
> >>>>>> One more thing, if I ping the problematic machine from others, it achieves the
> >>>>>> same effect as above tcpdump. Maybe it's about the register setting for RX path?
> >>>>>>
> >>>> You could compare the register dumps (ethtool -d) before and after S3 sleep
> >>>> to find out whether there's a difference.
> >>>>
> >>>
> >>> Actually, I just found I lead the wrong direction. The S3 suspend does
> >>> help to reproduce,
> >>> but it's not necessary. All I need to do is ping around 5 mins and the
> >>> network connection
> >>> fails.  And I also find one thing interesting, disabling the  MSI-X
> >>> interrupt like commit
> >>> [d49c88d7677ba737e9d2759a87db0402d5ab2607] can fix this problem.
> >>> Although I don't
> >>> understand the root cause. Anything I can do to help?
> >>>
> >> This is indeed very, very weird. You say switching from MSI-X to MSI fixes
> >> the issue, but also pinging the machine from outside brings back the network.
> >> Both actions affect totally different corners.
> >>
> >> The commit and related issue you mention was a workaround in the driver,
> >> the root cause was a MSI-X-related  issue with certain Intel chipsets deep
> >> in the PCI core. After this was fixed we removed the workaround again.
> >> This shouldn't be related to your issue.
> >>
> >> Hard to say for now is whether the issue is:
> >> - a driver issue
> >> - a hardware issue in the RTL8411
> >> - an issue with the chipset on your mainboard
> >>
> >> According to your description it doesn't take a special scenario to trigger
> >> the issue, so most likely also other users of Acer notebooks with RTL8411
> >> should be affected (after briefly checking this should be at least Aspire
> >> F15, V15, V7). Therefore I wonder why there aren't more reports.
> >>
> >> This commit added MSI-X support: 6c6aa15fdea5 ("r8169: improve interrupt handling")
> >> So you could test this revision and the one before.
> >>
> >> Eventually, if the issue really should be caused by a side effect of using
> >> MSI-X, then the question is whether we need to disable MSI-X for RTL8411
> >> in general or just for RTL8411 and a certain subsystem id.
> >>
> >
> > I tried the kernel with the head on 6c6aa15fdea5 ("r8169: improve
> > interrupt handling"),
> > the problem still there. Then I revert to the previous revision, the
> > problem goes away.
> > So I think it's pretty much the side effect of MSI-X. However, as you
> > mentioned that
> > you didn't hit this problem, I'll ask the vendor to verify if this
> > problem also happens on
> > other machines with the same chip. Then we can determine to disable for specific
> > mac version or just a certain subsystem id.
> >
> Thanks a lot for testing. OK, I have one more idea.
> AFAICS RTL8411 also has an integrated card reader controller which is driven
> by module rtsx_pci. Maybe if both components (card reader controller + ethernet)
> use different interrupt types, RTL8411 can't properly handle this.
> In case module rtsx_pci is loaded on your system, can you check whether not
> loading it (e.g. by blacklisting) or removing it makes a difference?
>

I booted my kernel with rtsx_pci_ms and rtsx_pci blacklisted, but it
doesn't change anything.

> Can you provide the "lspci -v" output for the card reader part of RTL8411?

02:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd.
RTL8411B PCI Express Card Reader (rev 01)
        Subsystem: Acer Incorporated [ALI] RTL8411B PCI Express Card Reader
        Flags: bus master, fast devsel, latency 0, IRQ 34
        Memory at f0b05000 (32-bit, non-prefetchable) [size=4K]
        Expansion ROM at f0b10000 [disabled] [size=64K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [b0] MSI-X: Enable- Count=1 Masked-
        Capabilities: [d0] Vital Product Data
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Virtual Channel
        Capabilities: [160] Device Serial Number 00-00-00-00-00-00-00-00
        Capabilities: [170] Latency Tolerance Reporting
        Capabilities: [178] L1 PM Substates
        Kernel driver in use: rtsx_pci
        Kernel modules: rtsx_pci
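
For anyone comparing the interrupt mode of the two functions on this combo
chip, the relevant part of the output above is the Enable+/Enable- flag on
the MSI and MSI-X capability lines: the card reader currently has MSI
enabled and MSI-X disabled. A minimal sketch of checking this from saved
"lspci -v" text follows. It is only an illustration: lspci_cap_state(),
span_has() and sample_caps are hypothetical names (not part of pciutils or
the kernel), and the parsing assumes the textual layout shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Outcome of looking up one capability in "lspci -v" text. */
enum cap_state { CAP_NOT_FOUND, CAP_DISABLED, CAP_ENABLED };

/* Sample input: three capability lines copied from the output above. */
static const char sample_caps[] =
	"Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+\n"
	"Capabilities: [70] Express Endpoint, MSI 00\n"
	"Capabilities: [b0] MSI-X: Enable- Count=1 Masked-\n";

/* Return 1 if needle occurs within the first len bytes of s. */
static int span_has(const char *s, size_t len, const char *needle)
{
	size_t n = strlen(needle);
	size_t i;

	for (i = 0; i + n <= len; i++)
		if (strncmp(s + i, needle, n) == 0)
			return 1;
	return 0;
}

/*
 * Hypothetical helper: find the first line containing cap (for example
 * "MSI:" or "MSI-X:", including the colon so that "Endpoint, MSI 00"
 * does not match) and report the Enable flag on the rest of that line.
 */
static enum cap_state lspci_cap_state(const char *text, const char *cap)
{
	const char *hit, *eol;
	size_t len;

	assert(text != NULL && cap != NULL);

	hit = strstr(text, cap);
	if (hit == NULL)
		return CAP_NOT_FOUND;

	/* Restrict the search to the remainder of the matching line. */
	hit += strlen(cap);
	eol = strchr(hit, '\n');
	len = eol ? (size_t)(eol - hit) : strlen(hit);

	if (span_has(hit, len, "Enable+"))
		return CAP_ENABLED;
	if (span_has(hit, len, "Enable-"))
		return CAP_DISABLED;
	return CAP_NOT_FOUND;
}
```

Calling lspci_cap_state(sample_caps, "MSI:") and then the same with
"MSI-X:" gives the two states to compare. Running the same check on the
"lspci -v" text of the ethernet function (02:00.1) would show whether both
functions use the same interrupt type.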

>
> >>>>>>     I tried the latest 4.20 rc version but the problem still there. I
> >>>>>> also tried some
> >>>>>> hw_reset or init thing in the resume path but no effect. Any
> >>>>>> suggestion for this?
> >>>>>> Thanks
> >>>>>>
> >>>> Did previous kernel versions work? If it's a regression, a bisect would be
> >>>> appreciated, because with the chip versions I've got I can't reproduce the issue.
> >>>>
> >>>>>> Chris
> >>>>>
> >>>>> Gentle ping. Any additional information required?
> >>>>>
> >>>>> Chris
> >>>>>
> >>>> Heiner
> >>>
> >>
> >
>


* Re: A weird problem of Realtek r8168 after resume from S3
  2018-12-18 20:28           ` Heiner Kallweit
@ 2018-12-19 15:32             ` Chris Chiu
  2018-12-19 19:41               ` Heiner Kallweit
  0 siblings, 1 reply; 17+ messages in thread
From: Chris Chiu @ 2018-12-19 15:32 UTC (permalink / raw)
  To: Heiner Kallweit
  Cc: nic_swsd, davem, netdev, Linux Kernel, Linux Upstreaming Team

On Wed, Dec 19, 2018 at 4:28 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>
> On 18.12.2018 14:25, Chris Chiu wrote:
> > On Tue, Dec 18, 2018 at 3:08 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>
> >> On 17.12.2018 14:25, Chris Chiu wrote:
> >>> On Fri, Dec 14, 2018 at 3:37 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>>>
> >>>> On 14.12.2018 04:33, Chris Chiu wrote:
> >>>>> On Thu, Dec 13, 2018 at 10:20 AM Chris Chiu <chiu@endlessm.com> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>     We got an acer laptop which has a problem with ethernet networking after
> >>>>>> resuming from S3. The ethernet is popular realtek r8168. The lspci shows as
> >>>>>> follows.
> >>>>>> 02:00.1 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
> >>>>>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 12)
> >>>>>>
> >>>> Helpful would be a "dmesg | grep r8169", especially chip name + XID.
> >>>>
> >>> [   22.362774] r8169 0000:02:00.1 (unnamed net_device)
> >>> (uninitialized): mac_version = 0x2b
> >>> [   22.365580] libphy: r8169: probed
> >>> [   22.365958] r8169 0000:02:00.1 eth0: RTL8411, 00:e0:b8:1f:cb:83,
> >>> XID 5c800800, IRQ 38
> >>> [   22.365961] r8169 0000:02:00.1 eth0: jumbo features [frames: 9200
> >>> bytes, tx checksumming: ko]
> >>>
> >> Thanks for the info.
> >>
> >>>>>>     The problem is the ethernet is not accessible after resume. Pinging via
> >>>>>> ethernet always shows the response `Destination Host Unreachable`. However,
> >>>>>> the interesting part is, when I run tcpdump to monitor the problematic ethernet
> >>>>>> interface, the networking is back to alive. But it's dead again after
> >>>>>> I stop tcpdump.
> >>>>>> One more thing, if I ping the problematic machine from others, it achieves the
> >>>>>> same effect as above tcpdump. Maybe it's about the register setting for RX path?
> >>>>>>
> >>>> You could compare the register dumps (ethtool -d) before and after S3 sleep
> >>>> to find out whether there's a difference.
> >>>>
> >>>
> >>> Actually, I just found I lead the wrong direction. The S3 suspend does
> >>> help to reproduce,
> >>> but it's not necessary. All I need to do is ping around 5 mins and the
> >>> network connection
> >>> fails.  And I also find one thing interesting, disabling the  MSI-X
> >>> interrupt like commit
> >>> [d49c88d7677ba737e9d2759a87db0402d5ab2607] can fix this problem.
> >>> Although I don't
> >>> understand the root cause. Anything I can do to help?
> >>>
> >> This is indeed very, very weird. You say switching from MSI-X to MSI fixes
> >> the issue, but also pinging the machine from outside brings back the network.
> >> Both actions affect totally different corners.
> >>
> >> The commit and related issue you mention was a workaround in the driver,
> >> the root cause was a MSI-X-related  issue with certain Intel chipsets deep
> >> in the PCI core. After this was fixed we removed the workaround again.
> >> This shouldn't be related to your issue.
> >>
> >> Hard to say for now is whether the issue is:
> >> - a driver issue
> >> - a hardware issue in the RTL8411
> >> - an issue with the chipset on your mainboard
> >>
> >> According to your description it doesn't take a special scenario to trigger
> >> the issue, so most likely also other users of Acer notebooks with RTL8411
> >> should be affected (after briefly checking this should be at least Aspire
> >> F15, V15, V7). Therefore I wonder why there aren't more reports.
> >>
> >> This commit added MSI-X support: 6c6aa15fdea5 ("r8169: improve interrupt handling")
> >> So you could test this revision and the one before.
> >>
> >> Eventually, if the issue really should be caused by a side effect of using
> >> MSI-X, then the question is whether we need to disable MSI-X for RTL8411
> >> in general or just for RTL8411 and a certain subsystem id.
> >>
> >
> > I tried the kernel with the head on 6c6aa15fdea5 ("r8169: improve
> > interrupt handling"),
> > the problem still there. Then I revert to the previous revision, the
> > problem goes away.
> > So I think it's pretty much the side effect of MSI-X. However, as you
> > mentioned that
> > you didn't hit this problem, I'll ask the vendor to verify if this
> > problem also happens on
> > other machines with the same chip. Then we can determine to disable for specific
> > mac version or just a certain subsystem id.
> >
> >>>>>>     I tried the latest 4.20 rc version but the problem still there. I
> >>>>>> also tried some
> >>>>>> hw_reset or init thing in the resume path but no effect. Any
> >>>>>> suggestion for this?
> >>>>>> Thanks
> >>>>>>
> >>>> Did previous kernel versions work? If it's a regression, a bisect would be
> >>>> appreciated, because with the chip versions I've got I can't reproduce the issue.
> >>>>
> >>>>>> Chris
> >>>>>
> >>>>> Gentle ping. Any additional information required?
> >>>>>
> >>>>> Chris
> >>>>>
> >>>> Heiner
> >>>
> >>
> >
>
> As an additional note:
> I found that the rtsx_pci driver doesn't support MSI-X currently.
> The following patch adds MSI-X support (it's compile-tested only
> because I don't have a system with RTL8411).
> Would be interesting to see whether it makes a difference if both
> components on this combo chip use MSI-X.
>
> ---
>  drivers/misc/cardreader/rtsx_pcr.c | 51 ++++++++++--------------------
>  include/linux/rtsx_pci.h           |  1 -
>  2 files changed, 16 insertions(+), 36 deletions(-)
>
> diff --git a/drivers/misc/cardreader/rtsx_pcr.c b/drivers/misc/cardreader/rtsx_pcr.c
> index da445223f..d1349c248 100644
> --- a/drivers/misc/cardreader/rtsx_pcr.c
> +++ b/drivers/misc/cardreader/rtsx_pcr.c
> @@ -35,10 +35,6 @@
>
>  #include "rtsx_pcr.h"
>
> -static bool msi_en = true;
> -module_param(msi_en, bool, S_IRUGO | S_IWUSR);
> -MODULE_PARM_DESC(msi_en, "Enable MSI");
> -
>  static DEFINE_IDR(rtsx_pci_idr);
>  static DEFINE_SPINLOCK(rtsx_pci_lock);
>
> @@ -1049,22 +1045,21 @@ static irqreturn_t rtsx_pci_isr(int irq, void *dev_id)
>
>  static int rtsx_pci_acquire_irq(struct rtsx_pcr *pcr)
>  {
> -       pcr_dbg(pcr, "%s: pcr->msi_en = %d, pci->irq = %d\n",
> -                       __func__, pcr->msi_en, pcr->pci->irq);
> +       int ret;
>
> -       if (request_irq(pcr->pci->irq, rtsx_pci_isr,
> -                       pcr->msi_en ? 0 : IRQF_SHARED,
> -                       DRV_NAME_RTSX_PCI, pcr)) {
> -               dev_err(&(pcr->pci->dev),
> -                       "rtsx_sdmmc: unable to grab IRQ %d, disabling device\n",
> -                       pcr->pci->irq);
> -               return -1;
> -       }
> +       ret = pci_alloc_irq_vectors(pcr->pci, 1, 1, PCI_IRQ_ALL_TYPES);
> +       if (ret < 0)
> +               goto err;
>
> -       pcr->irq = pcr->pci->irq;
> -       pci_intx(pcr->pci, !pcr->msi_en);
> +       ret = pci_request_irq(pcr->pci, 0, rtsx_pci_isr, NULL, pcr,
> +                             DRV_NAME_RTSX_PCI);
> +       if (ret)
> +               goto err;
>
>         return 0;
> +err:
> +       pci_err(pcr->pci, "rtsx_sdmmc: unable to grab interrupt\n");
> +       return ret;
>  }
>
>  static void rtsx_enable_aspm(struct rtsx_pcr *pcr)
> @@ -1496,19 +1491,11 @@ static int rtsx_pci_probe(struct pci_dev *pcidev,
>         INIT_DELAYED_WORK(&pcr->carddet_work, rtsx_pci_card_detect);
>         INIT_DELAYED_WORK(&pcr->idle_work, rtsx_pci_idle_work);
>
> -       pcr->msi_en = msi_en;
> -       if (pcr->msi_en) {
> -               ret = pci_enable_msi(pcidev);
> -               if (ret)
> -                       pcr->msi_en = false;
> -       }
> -
>         ret = rtsx_pci_acquire_irq(pcr);
>         if (ret < 0)
> -               goto disable_msi;
> +               goto free_dma;
>
>         pci_set_master(pcidev);
> -       synchronize_irq(pcr->irq);
>
>         ret = rtsx_pci_init_chip(pcr);
>         if (ret < 0)
> @@ -1528,10 +1515,8 @@ static int rtsx_pci_probe(struct pci_dev *pcidev,
>         return 0;
>
>  disable_irq:
> -       free_irq(pcr->irq, (void *)pcr);
> -disable_msi:
> -       if (pcr->msi_en)
> -               pci_disable_msi(pcr->pci);
> +       pci_free_irq(pcr->pci, 0, pcr);
> +free_dma:
>         dma_free_coherent(&(pcr->pci->dev), RTSX_RESV_BUF_LEN,
>                         pcr->rtsx_resv_buf, pcr->rtsx_resv_buf_addr);
>  unmap:
> @@ -1568,9 +1553,7 @@ static void rtsx_pci_remove(struct pci_dev *pcidev)
>
>         dma_free_coherent(&(pcr->pci->dev), RTSX_RESV_BUF_LEN,
>                         pcr->rtsx_resv_buf, pcr->rtsx_resv_buf_addr);
> -       free_irq(pcr->irq, (void *)pcr);
> -       if (pcr->msi_en)
> -               pci_disable_msi(pcr->pci);
> +       pci_free_irq(pcr->pci, 0, pcr);
>         iounmap(pcr->remap_addr);
>
>         pci_release_regions(pcidev);
> @@ -1664,9 +1647,7 @@ static void rtsx_pci_shutdown(struct pci_dev *pcidev)
>         rtsx_pci_power_off(pcr, HOST_ENTER_S1);
>
>         pci_disable_device(pcidev);
> -       free_irq(pcr->irq, (void *)pcr);
> -       if (pcr->msi_en)
> -               pci_disable_msi(pcr->pci);
> +       pci_free_irq(pcr->pci, 0, pcr);
>  }
>
>  #else /* CONFIG_PM */
> diff --git a/include/linux/rtsx_pci.h b/include/linux/rtsx_pci.h
> index e964bbd03..10abfe7f2 100644
> --- a/include/linux/rtsx_pci.h
> +++ b/include/linux/rtsx_pci.h
> @@ -1190,7 +1190,6 @@ struct rtsx_pcr {
>         /* pci resources */
>         unsigned long                   addr;
>         void __iomem                    *remap_addr;
> -       int                             irq;
>
>         /* host reserved buffer */
>         void                            *rtsx_resv_buf;
> --
> 2.20.0
>

As mentioned in my last email, rtsx_pci seems to make no difference.
I still tried a kernel with this patch applied, and the problem
persists. I also tried the vendor driver, and it works without any
problem. I'd rather find the root cause than rely on a workaround.
Any better ideas?

Chris


* Re: A weird problem of Realtek r8168 after resume from S3
  2018-12-19 15:32             ` Chris Chiu
@ 2018-12-19 19:41               ` Heiner Kallweit
  2018-12-20  9:43                 ` Chris Chiu
  0 siblings, 1 reply; 17+ messages in thread
From: Heiner Kallweit @ 2018-12-19 19:41 UTC (permalink / raw)
  To: Chris Chiu; +Cc: nic_swsd, davem, netdev, Linux Kernel, Linux Upstreaming Team

On 19.12.2018 16:32, Chris Chiu wrote:
> On Wed, Dec 19, 2018 at 4:28 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>
>> On 18.12.2018 14:25, Chris Chiu wrote:
>>> On Tue, Dec 18, 2018 at 3:08 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>
>>>> On 17.12.2018 14:25, Chris Chiu wrote:
>>>>> On Fri, Dec 14, 2018 at 3:37 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>>>
>>>>>> On 14.12.2018 04:33, Chris Chiu wrote:
>>>>>>> On Thu, Dec 13, 2018 at 10:20 AM Chris Chiu <chiu@endlessm.com> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>     We got an acer laptop which has a problem with ethernet networking after
>>>>>>>> resuming from S3. The ethernet is popular realtek r8168. The lspci shows as
>>>>>>>> follows.
>>>>>>>> 02:00.1 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
>>>>>>>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 12)
>>>>>>>>
>>>>>> Helpful would be a "dmesg | grep r8169", especially chip name + XID.
>>>>>>
>>>>> [   22.362774] r8169 0000:02:00.1 (unnamed net_device)
>>>>> (uninitialized): mac_version = 0x2b
>>>>> [   22.365580] libphy: r8169: probed
>>>>> [   22.365958] r8169 0000:02:00.1 eth0: RTL8411, 00:e0:b8:1f:cb:83,
>>>>> XID 5c800800, IRQ 38
>>>>> [   22.365961] r8169 0000:02:00.1 eth0: jumbo features [frames: 9200
>>>>> bytes, tx checksumming: ko]
>>>>>
>>>> Thanks for the info.
>>>>
>>>>>>>>     The problem is the ethernet is not accessible after resume. Pinging via
>>>>>>>> ethernet always shows the response `Destination Host Unreachable`. However,
>>>>>>>> the interesting part is, when I run tcpdump to monitor the problematic ethernet
>>>>>>>> interface, the networking is back to alive. But it's dead again after
>>>>>>>> I stop tcpdump.
>>>>>>>> One more thing, if I ping the problematic machine from others, it achieves the
>>>>>>>> same effect as above tcpdump. Maybe it's about the register setting for RX path?
>>>>>>>>
>>>>>> You could compare the register dumps (ethtool -d) before and after S3 sleep
>>>>>> to find out whether there's a difference.
>>>>>>
>>>>>
>>>>> Actually, I just found I lead the wrong direction. The S3 suspend does
>>>>> help to reproduce,
>>>>> but it's not necessary. All I need to do is ping around 5 mins and the
>>>>> network connection
>>>>> fails.  And I also find one thing interesting, disabling the  MSI-X
>>>>> interrupt like commit
>>>>> [d49c88d7677ba737e9d2759a87db0402d5ab2607] can fix this problem.
>>>>> Although I don't
>>>>> understand the root cause. Anything I can do to help?
>>>>>
>>>> This is indeed very, very weird. You say switching from MSI-X to MSI fixes
>>>> the issue, but also pinging the machine from outside brings back the network.
>>>> Both actions affect totally different corners.
>>>>
>>>> The commit and related issue you mention was a workaround in the driver,
>>>> the root cause was a MSI-X-related  issue with certain Intel chipsets deep
>>>> in the PCI core. After this was fixed we removed the workaround again.
>>>> This shouldn't be related to your issue.
>>>>
>>>> Hard to say for now is whether the issue is:
>>>> - a driver issue
>>>> - a hardware issue in the RTL8411
>>>> - an issue with the chipset on your mainboard
>>>>
>>>> According to your description it doesn't take a special scenario to trigger
>>>> the issue, so most likely also other users of Acer notebooks with RTL8411
>>>> should be affected (after briefly checking this should be at least Aspire
>>>> F15, V15, V7). Therefore I wonder why there aren't more reports.
>>>>
>>>> This commit added MSI-X support: 6c6aa15fdea5 ("r8169: improve interrupt handling")
>>>> So you could test this revision and the one before.
>>>>
>>>> Eventually, if the issue really should be caused by a side effect of using
>>>> MSI-X, then the question is whether we need to disable MSI-X for RTL8411
>>>> in general or just for RTL8411 and a certain subsystem id.
>>>>
>>>
>>> I tried the kernel with the head on 6c6aa15fdea5 ("r8169: improve
>>> interrupt handling"),
>>> the problem still there. Then I revert to the previous revision, the
>>> problem goes away.
>>> So I think it's pretty much the side effect of MSI-X. However, as you
>>> mentioned that
>>> you didn't hit this problem, I'll ask the vendor to verify if this
>>> problem also happens on
>>> other machines with the same chip. Then we can determine to disable for specific
>>> mac version or just a certain subsystem id.
>>>
>>>>>>>>     I tried the latest 4.20 rc version but the problem still there. I
>>>>>>>> also tried some
>>>>>>>> hw_reset or init thing in the resume path but no effect. Any
>>>>>>>> suggestion for this?
>>>>>>>> Thanks
>>>>>>>>
>>>>>> Did previous kernel versions work? If it's a regression, a bisect would be
>>>>>> appreciated, because with the chip versions I've got I can't reproduce the issue.
>>>>>>
>>>>>>>> Chris
>>>>>>>
>>>>>>> Gentle ping. Any additional information required?
>>>>>>>
>>>>>>> Chris
>>>>>>>
>>>>>> Heiner
>>>>>
>>>>
>>>
>>
>> As an additional note:
>> I found that the rtsx_pci driver doesn't support MSI-X currently.
>> The following patch adds MSI-X support (it's compile-tested only
>> because I don't have a system with RTL8411).
>> Would be interesting to see whether it makes a difference if both
>> components on this combo chip use MSI-X.
>>
>> ---
>>  drivers/misc/cardreader/rtsx_pcr.c | 51 ++++++++++--------------------
>>  include/linux/rtsx_pci.h           |  1 -
>>  2 files changed, 16 insertions(+), 36 deletions(-)
>>
>> diff --git a/drivers/misc/cardreader/rtsx_pcr.c b/drivers/misc/cardreader/rtsx_pcr.c
>> index da445223f..d1349c248 100644
>> --- a/drivers/misc/cardreader/rtsx_pcr.c
>> +++ b/drivers/misc/cardreader/rtsx_pcr.c
>> @@ -35,10 +35,6 @@
>>
>>  #include "rtsx_pcr.h"
>>
>> -static bool msi_en = true;
>> -module_param(msi_en, bool, S_IRUGO | S_IWUSR);
>> -MODULE_PARM_DESC(msi_en, "Enable MSI");
>> -
>>  static DEFINE_IDR(rtsx_pci_idr);
>>  static DEFINE_SPINLOCK(rtsx_pci_lock);
>>
>> @@ -1049,22 +1045,21 @@ static irqreturn_t rtsx_pci_isr(int irq, void *dev_id)
>>
>>  static int rtsx_pci_acquire_irq(struct rtsx_pcr *pcr)
>>  {
>> -       pcr_dbg(pcr, "%s: pcr->msi_en = %d, pci->irq = %d\n",
>> -                       __func__, pcr->msi_en, pcr->pci->irq);
>> +       int ret;
>>
>> -       if (request_irq(pcr->pci->irq, rtsx_pci_isr,
>> -                       pcr->msi_en ? 0 : IRQF_SHARED,
>> -                       DRV_NAME_RTSX_PCI, pcr)) {
>> -               dev_err(&(pcr->pci->dev),
>> -                       "rtsx_sdmmc: unable to grab IRQ %d, disabling device\n",
>> -                       pcr->pci->irq);
>> -               return -1;
>> -       }
>> +       ret = pci_alloc_irq_vectors(pcr->pci, 1, 1, PCI_IRQ_ALL_TYPES);
>> +       if (ret < 0)
>> +               goto err;
>>
>> -       pcr->irq = pcr->pci->irq;
>> -       pci_intx(pcr->pci, !pcr->msi_en);
>> +       ret = pci_request_irq(pcr->pci, 0, rtsx_pci_isr, NULL, pcr,
>> +                             DRV_NAME_RTSX_PCI);
>> +       if (ret)
>> +               goto err;
>>
>>         return 0;
>> +err:
>> +       pci_err(pcr->pci, "rtsx_sdmmc: unable to grab interrupt\n");
>> +       return ret;
>>  }
>>
>>  static void rtsx_enable_aspm(struct rtsx_pcr *pcr)
>> @@ -1496,19 +1491,11 @@ static int rtsx_pci_probe(struct pci_dev *pcidev,
>>         INIT_DELAYED_WORK(&pcr->carddet_work, rtsx_pci_card_detect);
>>         INIT_DELAYED_WORK(&pcr->idle_work, rtsx_pci_idle_work);
>>
>> -       pcr->msi_en = msi_en;
>> -       if (pcr->msi_en) {
>> -               ret = pci_enable_msi(pcidev);
>> -               if (ret)
>> -                       pcr->msi_en = false;
>> -       }
>> -
>>         ret = rtsx_pci_acquire_irq(pcr);
>>         if (ret < 0)
>> -               goto disable_msi;
>> +               goto free_dma;
>>
>>         pci_set_master(pcidev);
>> -       synchronize_irq(pcr->irq);
>>
>>         ret = rtsx_pci_init_chip(pcr);
>>         if (ret < 0)
>> @@ -1528,10 +1515,8 @@ static int rtsx_pci_probe(struct pci_dev *pcidev,
>>         return 0;
>>
>>  disable_irq:
>> -       free_irq(pcr->irq, (void *)pcr);
>> -disable_msi:
>> -       if (pcr->msi_en)
>> -               pci_disable_msi(pcr->pci);
>> +       pci_free_irq(pcr->pci, 0, pcr);
>> +free_dma:
>>         dma_free_coherent(&(pcr->pci->dev), RTSX_RESV_BUF_LEN,
>>                         pcr->rtsx_resv_buf, pcr->rtsx_resv_buf_addr);
>>  unmap:
>> @@ -1568,9 +1553,7 @@ static void rtsx_pci_remove(struct pci_dev *pcidev)
>>
>>         dma_free_coherent(&(pcr->pci->dev), RTSX_RESV_BUF_LEN,
>>                         pcr->rtsx_resv_buf, pcr->rtsx_resv_buf_addr);
>> -       free_irq(pcr->irq, (void *)pcr);
>> -       if (pcr->msi_en)
>> -               pci_disable_msi(pcr->pci);
>> +       pci_free_irq(pcr->pci, 0, pcr);
>>         iounmap(pcr->remap_addr);
>>
>>         pci_release_regions(pcidev);
>> @@ -1664,9 +1647,7 @@ static void rtsx_pci_shutdown(struct pci_dev *pcidev)
>>         rtsx_pci_power_off(pcr, HOST_ENTER_S1);
>>
>>         pci_disable_device(pcidev);
>> -       free_irq(pcr->irq, (void *)pcr);
>> -       if (pcr->msi_en)
>> -               pci_disable_msi(pcr->pci);
>> +       pci_free_irq(pcr->pci, 0, pcr);
>>  }
>>
>>  #else /* CONFIG_PM */
>> diff --git a/include/linux/rtsx_pci.h b/include/linux/rtsx_pci.h
>> index e964bbd03..10abfe7f2 100644
>> --- a/include/linux/rtsx_pci.h
>> +++ b/include/linux/rtsx_pci.h
>> @@ -1190,7 +1190,6 @@ struct rtsx_pcr {
>>         /* pci resources */
>>         unsigned long                   addr;
>>         void __iomem                    *remap_addr;
>> -       int                             irq;
>>
>>         /* host reserved buffer */
>>         void                            *rtsx_resv_buf;
>> --
>> 2.20.0
>>
> 
> As mentioned in the last email, the rtsx_pci seems to make no
> difference. I still tried the kernel with this patch applied, the
> problem still persists. I also tried the vendor driver and it works
> without any problem. I'd rather like to find out the root cause
> instead of a workaround. Any better idea?
> 
Thanks for your efforts! The vendor driver doesn't support MSI-X,
so the issue doesn't occur there. I'm running out of ideas, so
I will write to a contact at Realtek who has provided helpful
information a few times already.

> Chris
> 
Heiner

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: A weird problem of Realtek r8168 after resume from S3
  2018-12-19 19:41               ` Heiner Kallweit
@ 2018-12-20  9:43                 ` Chris Chiu
  2018-12-20 18:48                   ` Heiner Kallweit
  2018-12-20 19:21                   ` Heiner Kallweit
  0 siblings, 2 replies; 17+ messages in thread
From: Chris Chiu @ 2018-12-20  9:43 UTC (permalink / raw)
  To: Heiner Kallweit
  Cc: nic_swsd, davem, netdev, Linux Kernel, Linux Upstreaming Team

On Thu, Dec 20, 2018 at 3:41 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>
> On 19.12.2018 16:32, Chris Chiu wrote:
> > On Wed, Dec 19, 2018 at 4:28 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>
> >> On 18.12.2018 14:25, Chris Chiu wrote:
> >>> On Tue, Dec 18, 2018 at 3:08 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>>>
> >>>> On 17.12.2018 14:25, Chris Chiu wrote:
> >>>>> On Fri, Dec 14, 2018 at 3:37 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>>>>>
> >>>>>> On 14.12.2018 04:33, Chris Chiu wrote:
> >>>>>>> On Thu, Dec 13, 2018 at 10:20 AM Chris Chiu <chiu@endlessm.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>     We got an acer laptop which has a problem with ethernet networking after
> >>>>>>>> resuming from S3. The ethernet is popular realtek r8168. The lspci shows as
> >>>>>>>> follows.
> >>>>>>>> 02:00.1 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
> >>>>>>>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 12)
> >>>>>>>>
> >>>>>> Helpful would be a "dmesg | grep r8169", especially chip name + XID.
> >>>>>>
> >>>>> [   22.362774] r8169 0000:02:00.1 (unnamed net_device)
> >>>>> (uninitialized): mac_version = 0x2b
> >>>>> [   22.365580] libphy: r8169: probed
> >>>>> [   22.365958] r8169 0000:02:00.1 eth0: RTL8411, 00:e0:b8:1f:cb:83,
> >>>>> XID 5c800800, IRQ 38
> >>>>> [   22.365961] r8169 0000:02:00.1 eth0: jumbo features [frames: 9200
> >>>>> bytes, tx checksumming: ko]
> >>>>>
> >>>> Thanks for the info.
> >>>>
> >>>>>>>>     The problem is the ethernet is not accessible after resume. Pinging via
> >>>>>>>> ethernet always shows the response `Destination Host Unreachable`. However,
> >>>>>>>> the interesting part is, when I run tcpdump to monitor the problematic ethernet
> >>>>>>>> interface, the networking is back to alive. But it's dead again after
> >>>>>>>> I stop tcpdump.
> >>>>>>>> One more thing, if I ping the problematic machine from others, it achieves the
> >>>>>>>> same effect as above tcpdump. Maybe it's about the register setting for RX path?
> >>>>>>>>
> >>>>>> You could compare the register dumps (ethtool -d) before and after S3 sleep
> >>>>>> to find out whether there's a difference.
> >>>>>>
> >>>>>
> >>>>> Actually, I just found I lead the wrong direction. The S3 suspend does
> >>>>> help to reproduce,
> >>>>> but it's not necessary. All I need to do is ping around 5 mins and the
> >>>>> network connection
> >>>>> fails.  And I also find one thing interesting, disabling the  MSI-X
> >>>>> interrupt like commit
> >>>>> [d49c88d7677ba737e9d2759a87db0402d5ab2607] can fix this problem.
> >>>>> Although I don't
> >>>>> understand the root cause. Anything I can do to help?
> >>>>>
> >>>> This is indeed very, very weird. You say switching from MSI-X to MSI fixes
> >>>> the issue, but also pinging the machine from outside brings back the network.
> >>>> Both actions affect totally different corners.
> >>>>
> >>>> The commit and related issue you mention was a workaround in the driver,
> >>>> the root cause was a MSI-X-related  issue with certain Intel chipsets deep
> >>>> in the PCI core. After this was fixed we removed the workaround again.
> >>>> This shouldn't be related to your issue.
> >>>>
> >>>> Hard to say for now is whether the issue is:
> >>>> - a driver issue
> >>>> - a hardware issue in the RTL8411
> >>>> - an issue with the chipset on your mainboard
> >>>>
> >>>> According to your description it doesn't take a special scenario to trigger
> >>>> the issue, so most likely also other users of Acer notebooks with RTL8411
> >>>> should be affected (after briefly checking this should be at least Aspire
> >>>> F15, V15, V7). Therefore I wonder why there aren't more reports.
> >>>>
> >>>> This commit added MSI-X support: 6c6aa15fdea5 ("r8169: improve interrupt handling")
> >>>> So you could test this revision and the one before.
> >>>>
> >>>> Eventually, if the issue really should be caused by a side effect of using
> >>>> MSI-X, then the question is whether we need to disable MSI-X for RTL8411
> >>>> in general or just for RTL8411 and a certain subsystem id.
> >>>>
> >>>
> >>> I tried the kernel with the head on 6c6aa15fdea5 ("r8169: improve
> >>> interrupt handling"),
> >>> the problem still there. Then I revert to the previous revision, the
> >>> problem goes away.
> >>> So I think it's pretty much the side effect of MSI-X. However, as you
> >>> mentioned that
> >>> you didn't hit this problem, I'll ask the vendor to verify if this
> >>> problem also happens on
> >>> other machines with the same chip. Then we can determine to disable for specific
> >>> mac version or just a certain subsystem id.
> >>>
> >>>>>>>>     I tried the latest 4.20 rc version but the problem still there. I
> >>>>>>>> also tried some
> >>>>>>>> hw_reset or init thing in the resume path but no effect. Any
> >>>>>>>> suggestion for this?
> >>>>>>>> Thanks
> >>>>>>>>
> >>>>>> Did previous kernel versions work? If it's a regression, a bisect would be
> >>>>>> appreciated, because with the chip versions I've got I can't reproduce the issue.
> >>>>>>
> >>>>>>>> Chris
> >>>>>>>
> >>>>>>> Gentle ping. Any additional information required?
> >>>>>>>
> >>>>>>> Chris
> >>>>>>>
> >>>>>> Heiner
> >>>>>
> >>>>
> >>>
> >>
> >> As an additional note:
> >> I found that the rtsx_pci driver doesn't support MSI-X currently.
> >> The following patch adds MSI-X support (it's compile-tested only
> >> because I don't have a system with RTL8411).
> >> Would be interesting to see whether it makes a difference if both
> >> components on this combo chip use MSI-X.
> >>
> >> ---
> >>  drivers/misc/cardreader/rtsx_pcr.c | 51 ++++++++++--------------------
> >>  include/linux/rtsx_pci.h           |  1 -
> >>  2 files changed, 16 insertions(+), 36 deletions(-)
> >>
> >> diff --git a/drivers/misc/cardreader/rtsx_pcr.c b/drivers/misc/cardreader/rtsx_pcr.c
> >> index da445223f..d1349c248 100644
> >> --- a/drivers/misc/cardreader/rtsx_pcr.c
> >> +++ b/drivers/misc/cardreader/rtsx_pcr.c
> >> @@ -35,10 +35,6 @@
> >>
> >>  #include "rtsx_pcr.h"
> >>
> >> -static bool msi_en = true;
> >> -module_param(msi_en, bool, S_IRUGO | S_IWUSR);
> >> -MODULE_PARM_DESC(msi_en, "Enable MSI");
> >> -
> >>  static DEFINE_IDR(rtsx_pci_idr);
> >>  static DEFINE_SPINLOCK(rtsx_pci_lock);
> >>
> >> @@ -1049,22 +1045,21 @@ static irqreturn_t rtsx_pci_isr(int irq, void *dev_id)
> >>
> >>  static int rtsx_pci_acquire_irq(struct rtsx_pcr *pcr)
> >>  {
> >> -       pcr_dbg(pcr, "%s: pcr->msi_en = %d, pci->irq = %d\n",
> >> -                       __func__, pcr->msi_en, pcr->pci->irq);
> >> +       int ret;
> >>
> >> -       if (request_irq(pcr->pci->irq, rtsx_pci_isr,
> >> -                       pcr->msi_en ? 0 : IRQF_SHARED,
> >> -                       DRV_NAME_RTSX_PCI, pcr)) {
> >> -               dev_err(&(pcr->pci->dev),
> >> -                       "rtsx_sdmmc: unable to grab IRQ %d, disabling device\n",
> >> -                       pcr->pci->irq);
> >> -               return -1;
> >> -       }
> >> +       ret = pci_alloc_irq_vectors(pcr->pci, 1, 1, PCI_IRQ_ALL_TYPES);
> >> +       if (ret < 0)
> >> +               goto err;
> >>
> >> -       pcr->irq = pcr->pci->irq;
> >> -       pci_intx(pcr->pci, !pcr->msi_en);
> >> +       ret = pci_request_irq(pcr->pci, 0, rtsx_pci_isr, NULL, pcr,
> >> +                             DRV_NAME_RTSX_PCI);
> >> +       if (ret)
> >> +               goto err;
> >>
> >>         return 0;
> >> +err:
> >> +       pci_err(pcr->pci, "rtsx_sdmmc: unable to grab interrupt\n");
> >> +       return ret;
> >>  }
> >>
> >>  static void rtsx_enable_aspm(struct rtsx_pcr *pcr)
> >> @@ -1496,19 +1491,11 @@ static int rtsx_pci_probe(struct pci_dev *pcidev,
> >>         INIT_DELAYED_WORK(&pcr->carddet_work, rtsx_pci_card_detect);
> >>         INIT_DELAYED_WORK(&pcr->idle_work, rtsx_pci_idle_work);
> >>
> >> -       pcr->msi_en = msi_en;
> >> -       if (pcr->msi_en) {
> >> -               ret = pci_enable_msi(pcidev);
> >> -               if (ret)
> >> -                       pcr->msi_en = false;
> >> -       }
> >> -
> >>         ret = rtsx_pci_acquire_irq(pcr);
> >>         if (ret < 0)
> >> -               goto disable_msi;
> >> +               goto free_dma;
> >>
> >>         pci_set_master(pcidev);
> >> -       synchronize_irq(pcr->irq);
> >>
> >>         ret = rtsx_pci_init_chip(pcr);
> >>         if (ret < 0)
> >> @@ -1528,10 +1515,8 @@ static int rtsx_pci_probe(struct pci_dev *pcidev,
> >>         return 0;
> >>
> >>  disable_irq:
> >> -       free_irq(pcr->irq, (void *)pcr);
> >> -disable_msi:
> >> -       if (pcr->msi_en)
> >> -               pci_disable_msi(pcr->pci);
> >> +       pci_free_irq(pcr->pci, 0, pcr);
> >> +free_dma:
> >>         dma_free_coherent(&(pcr->pci->dev), RTSX_RESV_BUF_LEN,
> >>                         pcr->rtsx_resv_buf, pcr->rtsx_resv_buf_addr);
> >>  unmap:
> >> @@ -1568,9 +1553,7 @@ static void rtsx_pci_remove(struct pci_dev *pcidev)
> >>
> >>         dma_free_coherent(&(pcr->pci->dev), RTSX_RESV_BUF_LEN,
> >>                         pcr->rtsx_resv_buf, pcr->rtsx_resv_buf_addr);
> >> -       free_irq(pcr->irq, (void *)pcr);
> >> -       if (pcr->msi_en)
> >> -               pci_disable_msi(pcr->pci);
> >> +       pci_free_irq(pcr->pci, 0, pcr);
> >>         iounmap(pcr->remap_addr);
> >>
> >>         pci_release_regions(pcidev);
> >> @@ -1664,9 +1647,7 @@ static void rtsx_pci_shutdown(struct pci_dev *pcidev)
> >>         rtsx_pci_power_off(pcr, HOST_ENTER_S1);
> >>
> >>         pci_disable_device(pcidev);
> >> -       free_irq(pcr->irq, (void *)pcr);
> >> -       if (pcr->msi_en)
> >> -               pci_disable_msi(pcr->pci);
> >> +       pci_free_irq(pcr->pci, 0, pcr);
> >>  }
> >>
> >>  #else /* CONFIG_PM */
> >> diff --git a/include/linux/rtsx_pci.h b/include/linux/rtsx_pci.h
> >> index e964bbd03..10abfe7f2 100644
> >> --- a/include/linux/rtsx_pci.h
> >> +++ b/include/linux/rtsx_pci.h
> >> @@ -1190,7 +1190,6 @@ struct rtsx_pcr {
> >>         /* pci resources */
> >>         unsigned long                   addr;
> >>         void __iomem                    *remap_addr;
> >> -       int                             irq;
> >>
> >>         /* host reserved buffer */
> >>         void                            *rtsx_resv_buf;
> >> --
> >> 2.20.0
> >>
> >
> > As mentioned in the last email, the rtsx_pci seems to make no
> > difference. I still tried the kernel with this patch applied, the
> > problem still persists. I also tried the vendor driver and it works
> > without any problem. I'd rather like to find out the root cause
> > instead of a workaround. Any better idea?
> >
> Thanks for your efforts! The vendor driver doesn't support MSI-X,
> therefore the issue doesn't occur. I'm running out of ideas, so
> I will write to a contact in Realtek who few times provided helpful
> information already.
>

Hi Heiner,
    After many repeated tests, I have to correct my previous findings
so that they don't lead us in the wrong direction. Sometimes the network
also fails for no known reason. Here's a summary.
1. An S3 suspend/resume cycle reproduces it 100% of the time. However,
echoing the different test levels (core, devices...) into
/sys/power/pm_test does not reproduce it.
2. The network can fail randomly at any time, sometimes during boot,
sometimes after a few minutes of web surfing.
3. After verifying many times, it's not about MSI-X. I repeatedly
booted my own kernel builds (w/ MSI-X workaround, w/ pci_alloc_irq,
w/o pci_alloc_irq); even the revision before 6c6aa15fdea5 ("r8169:
improve interrupt handling") still fails after S3. I got the wrong
impression earlier because I could access the internet w/o problems
for quite a long time.
4. When it happens, running tcpdump on this NIC always brings
network access back, but it fails again after tcpdump stops.
5. The vendor driver works w/o any problem. I'm still trying to find
the difference.
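
For reference, the pm_test runs from item 1 can be scripted roughly as
below. This is only a sketch under assumptions: it needs root and a
kernel built with PM debugging support, the level names are the ones
listed in the kernel's basic PM debugging documentation, and the helper
names (pm_test_level_valid, sysfs_write, pm_test_cycle) are made up for
illustration.

```c
#include <stdio.h>
#include <string.h>

/* Test levels as listed in the kernel's basic PM debugging docs. */
static const char *const pm_test_levels[] = {
	"none", "core", "processors", "platform", "devices", "freezer",
};

/* Return 1 if level is a known pm_test level, 0 otherwise. */
int pm_test_level_valid(const char *level)
{
	size_t i;

	for (i = 0; i < sizeof(pm_test_levels) / sizeof(pm_test_levels[0]); i++)
		if (strcmp(level, pm_test_levels[i]) == 0)
			return 1;
	return 0;
}

/* Write a string to a sysfs attribute; 0 on success, -1 on error. */
int sysfs_write(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs(val, f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/*
 * Select a pm_test level, then start a suspend-to-RAM cycle.
 * With a level other than "none" the kernel stops at that stage
 * and resumes again instead of fully entering S3.
 */
int pm_test_cycle(const char *level)
{
	if (!pm_test_level_valid(level))
		return -1;
	if (sysfs_write("/sys/power/pm_test", level) < 0)
		return -1;
	return sysfs_write("/sys/power/state", "mem");
}
```

Calling pm_test_cycle("devices") and then pm_test_cycle("core") gives
the partial cycles; pm_test_cycle("none") is a real S3 cycle, the only
case that reproduces the failure here.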

Sorry if I caused any confusion. I'd appreciate any kind of
useful information. Thanks.

> > Chris
> >
> Heiner

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: A weird problem of Realtek r8168 after resume from S3
  2018-12-20  9:43                 ` Chris Chiu
@ 2018-12-20 18:48                   ` Heiner Kallweit
  2018-12-20 19:21                   ` Heiner Kallweit
  1 sibling, 0 replies; 17+ messages in thread
From: Heiner Kallweit @ 2018-12-20 18:48 UTC (permalink / raw)
  To: Chris Chiu; +Cc: nic_swsd, davem, netdev, Linux Kernel, Linux Upstreaming Team

On 20.12.2018 10:43, Chris Chiu wrote:
> On Thu, Dec 20, 2018 at 3:41 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>
>> On 19.12.2018 16:32, Chris Chiu wrote:
>>> On Wed, Dec 19, 2018 at 4:28 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>
>>>> On 18.12.2018 14:25, Chris Chiu wrote:
>>>>> On Tue, Dec 18, 2018 at 3:08 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>>>
>>>>>> On 17.12.2018 14:25, Chris Chiu wrote:
>>>>>>> On Fri, Dec 14, 2018 at 3:37 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>>>>>
>>>>>>>> On 14.12.2018 04:33, Chris Chiu wrote:
>>>>>>>>> On Thu, Dec 13, 2018 at 10:20 AM Chris Chiu <chiu@endlessm.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>     We got an acer laptop which has a problem with ethernet networking after
>>>>>>>>>> resuming from S3. The ethernet is popular realtek r8168. The lspci shows as
>>>>>>>>>> follows.
>>>>>>>>>> 02:00.1 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
>>>>>>>>>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 12)
>>>>>>>>>>
>>>>>>>> Helpful would be a "dmesg | grep r8169", especially chip name + XID.
>>>>>>>>
>>>>>>> [   22.362774] r8169 0000:02:00.1 (unnamed net_device)
>>>>>>> (uninitialized): mac_version = 0x2b
>>>>>>> [   22.365580] libphy: r8169: probed
>>>>>>> [   22.365958] r8169 0000:02:00.1 eth0: RTL8411, 00:e0:b8:1f:cb:83,
>>>>>>> XID 5c800800, IRQ 38
>>>>>>> [   22.365961] r8169 0000:02:00.1 eth0: jumbo features [frames: 9200
>>>>>>> bytes, tx checksumming: ko]
>>>>>>>
>>>>>> Thanks for the info.
>>>>>>
>>>>>>>>>>     The problem is the ethernet is not accessible after resume. Pinging via
>>>>>>>>>> ethernet always shows the response `Destination Host Unreachable`. However,
>>>>>>>>>> the interesting part is, when I run tcpdump to monitor the problematic ethernet
>>>>>>>>>> interface, the networking is back to alive. But it's dead again after
>>>>>>>>>> I stop tcpdump.
>>>>>>>>>> One more thing, if I ping the problematic machine from others, it achieves the
>>>>>>>>>> same effect as above tcpdump. Maybe it's about the register setting for RX path?
>>>>>>>>>>
>>>>>>>> You could compare the register dumps (ethtool -d) before and after S3 sleep
>>>>>>>> to find out whether there's a difference.
>>>>>>>>
>>>>>>>
>>>>>>> Actually, I just found I lead the wrong direction. The S3 suspend does
>>>>>>> help to reproduce,
>>>>>>> but it's not necessary. All I need to do is ping around 5 mins and the
>>>>>>> network connection
>>>>>>> fails.  And I also find one thing interesting, disabling the  MSI-X
>>>>>>> interrupt like commit
>>>>>>> [d49c88d7677ba737e9d2759a87db0402d5ab2607] can fix this problem.
>>>>>>> Although I don't
>>>>>>> understand the root cause. Anything I can do to help?
>>>>>>>
>>>>>> This is indeed very, very weird. You say switching from MSI-X to MSI fixes
>>>>>> the issue, but also pinging the machine from outside brings back the network.
>>>>>> Both actions affect totally different corners.
>>>>>>
>>>>>> The commit and related issue you mention was a workaround in the driver,
>>>>>> the root cause was a MSI-X-related  issue with certain Intel chipsets deep
>>>>>> in the PCI core. After this was fixed we removed the workaround again.
>>>>>> This shouldn't be related to your issue.
>>>>>>
>>>>>> Hard to say for now is whether the issue is:
>>>>>> - a driver issue
>>>>>> - a hardware issue in the RTL8411
>>>>>> - an issue with the chipset on your mainboard
>>>>>>
>>>>>> According to your description it doesn't take a special scenario to trigger
>>>>>> the issue, so most likely also other users of Acer notebooks with RTL8411
>>>>>> should be affected (after briefly checking this should be at least Aspire
>>>>>> F15, V15, V7). Therefore I wonder why there aren't more reports.
>>>>>>
>>>>>> This commit added MSI-X support: 6c6aa15fdea5 ("r8169: improve interrupt handling")
>>>>>> So you could test this revision and the one before.
>>>>>>
>>>>>> Eventually, if the issue really should be caused by a side effect of using
>>>>>> MSI-X, then the question is whether we need to disable MSI-X for RTL8411
>>>>>> in general or just for RTL8411 and a certain subsystem id.
>>>>>>
>>>>>
>>>>> I tried the kernel with the head on 6c6aa15fdea5 ("r8169: improve
>>>>> interrupt handling"),
>>>>> the problem still there. Then I revert to the previous revision, the
>>>>> problem goes away.
>>>>> So I think it's pretty much the side effect of MSI-X. However, as you
>>>>> mentioned that
>>>>> you didn't hit this problem, I'll ask the vendor to verify if this
>>>>> problem also happens on
>>>>> other machines with the same chip. Then we can determine to disable for specific
>>>>> mac version or just a certain subsystem id.
>>>>>
>>>>>>>>>>     I tried the latest 4.20 rc version but the problem still there. I
>>>>>>>>>> also tried some
>>>>>>>>>> hw_reset or init thing in the resume path but no effect. Any
>>>>>>>>>> suggestion for this?
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>> Did previous kernel versions work? If it's a regression, a bisect would be
>>>>>>>> appreciated, because with the chip versions I've got I can't reproduce the issue.
>>>>>>>>
>>>>>>>>>> Chris
>>>>>>>>>
>>>>>>>>> Gentle ping. Any additional information required?
>>>>>>>>>
>>>>>>>>> Chris
>>>>>>>>>
>>>>>>>> Heiner
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>> As an additional note:
>>>> I found that the rtsx_pci driver doesn't support MSI-X currently.
>>>> The following patch adds MSI-X support (it's compile-tested only
>>>> because I don't have a system with RTL8411).
>>>> Would be interesting to see whether it makes a difference if both
>>>> components on this combo chip use MSI-X.
>>>>
>>>> ---
>>>>  drivers/misc/cardreader/rtsx_pcr.c | 51 ++++++++++--------------------
>>>>  include/linux/rtsx_pci.h           |  1 -
>>>>  2 files changed, 16 insertions(+), 36 deletions(-)
>>>>
>>>> diff --git a/drivers/misc/cardreader/rtsx_pcr.c b/drivers/misc/cardreader/rtsx_pcr.c
>>>> index da445223f..d1349c248 100644
>>>> --- a/drivers/misc/cardreader/rtsx_pcr.c
>>>> +++ b/drivers/misc/cardreader/rtsx_pcr.c
>>>> @@ -35,10 +35,6 @@
>>>>
>>>>  #include "rtsx_pcr.h"
>>>>
>>>> -static bool msi_en = true;
>>>> -module_param(msi_en, bool, S_IRUGO | S_IWUSR);
>>>> -MODULE_PARM_DESC(msi_en, "Enable MSI");
>>>> -
>>>>  static DEFINE_IDR(rtsx_pci_idr);
>>>>  static DEFINE_SPINLOCK(rtsx_pci_lock);
>>>>
>>>> @@ -1049,22 +1045,21 @@ static irqreturn_t rtsx_pci_isr(int irq, void *dev_id)
>>>>
>>>>  static int rtsx_pci_acquire_irq(struct rtsx_pcr *pcr)
>>>>  {
>>>> -       pcr_dbg(pcr, "%s: pcr->msi_en = %d, pci->irq = %d\n",
>>>> -                       __func__, pcr->msi_en, pcr->pci->irq);
>>>> +       int ret;
>>>>
>>>> -       if (request_irq(pcr->pci->irq, rtsx_pci_isr,
>>>> -                       pcr->msi_en ? 0 : IRQF_SHARED,
>>>> -                       DRV_NAME_RTSX_PCI, pcr)) {
>>>> -               dev_err(&(pcr->pci->dev),
>>>> -                       "rtsx_sdmmc: unable to grab IRQ %d, disabling device\n",
>>>> -                       pcr->pci->irq);
>>>> -               return -1;
>>>> -       }
>>>> +       ret = pci_alloc_irq_vectors(pcr->pci, 1, 1, PCI_IRQ_ALL_TYPES);
>>>> +       if (ret < 0)
>>>> +               goto err;
>>>>
>>>> -       pcr->irq = pcr->pci->irq;
>>>> -       pci_intx(pcr->pci, !pcr->msi_en);
>>>> +       ret = pci_request_irq(pcr->pci, 0, rtsx_pci_isr, NULL, pcr,
>>>> +                             DRV_NAME_RTSX_PCI);
>>>> +       if (ret)
>>>> +               goto err;
>>>>
>>>>         return 0;
>>>> +err:
>>>> +       pci_err(pcr->pci, "rtsx_sdmmc: unable to grab interrupt\n");
>>>> +       return ret;
>>>>  }
>>>>
>>>>  static void rtsx_enable_aspm(struct rtsx_pcr *pcr)
>>>> @@ -1496,19 +1491,11 @@ static int rtsx_pci_probe(struct pci_dev *pcidev,
>>>>         INIT_DELAYED_WORK(&pcr->carddet_work, rtsx_pci_card_detect);
>>>>         INIT_DELAYED_WORK(&pcr->idle_work, rtsx_pci_idle_work);
>>>>
>>>> -       pcr->msi_en = msi_en;
>>>> -       if (pcr->msi_en) {
>>>> -               ret = pci_enable_msi(pcidev);
>>>> -               if (ret)
>>>> -                       pcr->msi_en = false;
>>>> -       }
>>>> -
>>>>         ret = rtsx_pci_acquire_irq(pcr);
>>>>         if (ret < 0)
>>>> -               goto disable_msi;
>>>> +               goto free_dma;
>>>>
>>>>         pci_set_master(pcidev);
>>>> -       synchronize_irq(pcr->irq);
>>>>
>>>>         ret = rtsx_pci_init_chip(pcr);
>>>>         if (ret < 0)
>>>> @@ -1528,10 +1515,8 @@ static int rtsx_pci_probe(struct pci_dev *pcidev,
>>>>         return 0;
>>>>
>>>>  disable_irq:
>>>> -       free_irq(pcr->irq, (void *)pcr);
>>>> -disable_msi:
>>>> -       if (pcr->msi_en)
>>>> -               pci_disable_msi(pcr->pci);
>>>> +       pci_free_irq(pcr->pci, 0, pcr);
>>>> +free_dma:
>>>>         dma_free_coherent(&(pcr->pci->dev), RTSX_RESV_BUF_LEN,
>>>>                         pcr->rtsx_resv_buf, pcr->rtsx_resv_buf_addr);
>>>>  unmap:
>>>> @@ -1568,9 +1553,7 @@ static void rtsx_pci_remove(struct pci_dev *pcidev)
>>>>
>>>>         dma_free_coherent(&(pcr->pci->dev), RTSX_RESV_BUF_LEN,
>>>>                         pcr->rtsx_resv_buf, pcr->rtsx_resv_buf_addr);
>>>> -       free_irq(pcr->irq, (void *)pcr);
>>>> -       if (pcr->msi_en)
>>>> -               pci_disable_msi(pcr->pci);
>>>> +       pci_free_irq(pcr->pci, 0, pcr);
>>>>         iounmap(pcr->remap_addr);
>>>>
>>>>         pci_release_regions(pcidev);
>>>> @@ -1664,9 +1647,7 @@ static void rtsx_pci_shutdown(struct pci_dev *pcidev)
>>>>         rtsx_pci_power_off(pcr, HOST_ENTER_S1);
>>>>
>>>>         pci_disable_device(pcidev);
>>>> -       free_irq(pcr->irq, (void *)pcr);
>>>> -       if (pcr->msi_en)
>>>> -               pci_disable_msi(pcr->pci);
>>>> +       pci_free_irq(pcr->pci, 0, pcr);
>>>>  }
>>>>
>>>>  #else /* CONFIG_PM */
>>>> diff --git a/include/linux/rtsx_pci.h b/include/linux/rtsx_pci.h
>>>> index e964bbd03..10abfe7f2 100644
>>>> --- a/include/linux/rtsx_pci.h
>>>> +++ b/include/linux/rtsx_pci.h
>>>> @@ -1190,7 +1190,6 @@ struct rtsx_pcr {
>>>>         /* pci resources */
>>>>         unsigned long                   addr;
>>>>         void __iomem                    *remap_addr;
>>>> -       int                             irq;
>>>>
>>>>         /* host reserved buffer */
>>>>         void                            *rtsx_resv_buf;
>>>> --
>>>> 2.20.0
>>>>
>>>
>>> As mentioned in the last email, the rtsx_pci seems to make no
>>> difference. I still tried the kernel with this patch applied, the
>>> problem still persists. I also tried the vendor driver and it works
>>> without any problem. I'd rather like to find out the root cause
>>> instead of a workaround. Any better idea?
>>>
>> Thanks for your efforts! The vendor driver doesn't support MSI-X,
>> therefore the issue doesn't occur. I'm running out of ideas, so
>> I will write to a contact in Realtek who few times provided helpful
>> information already.
>>
> 
> Hi Heiner,
>     After lots of repeating tests, I have to correct my previous finding
> to prevent from leading the wrong way. Sometimes the network also
> fails with unknown reason. Here's the summarize.
> 1. The S3 suspend resume can reproduce it 100%. However, echo
> different types (core, devices...) in /sys/power/pm_test is not able
> to achieve the same thing.
> 2. The network could randomly fail at any time. Maybe during boot,
> sometimes fail after few minutes web surfing.
> 3. After many times of verifications, it's not about MSI-X. I repeatedly
> boot from my own build kernel (w/ MSI-X workaround, w/ pci_alloc_irq,
> w/o pci_alloc_irq), even the revision before 6c6aa15fdea5 ("r8169:
> improve interrupt handling")
> still fails after S3, but I get the wrong impression because I access the
> internet w/o problem for quite a long time.
> 4. When it happens, executing tcpdump on this NIC can always get
> network access back. But fails again after stop tcpdump.
> 5. Vendor driver works w/o any problem. I'm still trying to find the difference.
> 
> Sorry that if I caused any confusion. I'll appreciate if there's any kind of
> useful information. Thanks.
> 
Thanks for the update. One thing is slightly confusing me: You first thought
that switching from MSI-X to MSI fixes the issue, but also state that the issue
can be easily reproduced by resuming from suspend. Didn't you test a suspend
cycle with MSI?

A simplified summary of the issue would be: nothing goes out unless something
comes in. tcpdump switches the NIC to promiscuous mode, so all incoming packets
are accepted, not only the ones addressed to the NIC's own MAC.

As one more small test: Could you comment out the last statement in
rtl_hw_start_8411_2()? Then ASPM should be disabled on the chip side.

I got a reply from Realtek, but I first have to prepare a patch based on
their feedback.

>>> Chris
>>>
>> Heiner
> .
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: A weird problem of Realtek r8168 after resume from S3
  2018-12-20  9:43                 ` Chris Chiu
  2018-12-20 18:48                   ` Heiner Kallweit
@ 2018-12-20 19:21                   ` Heiner Kallweit
  2018-12-21 15:16                     ` Chris Chiu
  1 sibling, 1 reply; 17+ messages in thread
From: Heiner Kallweit @ 2018-12-20 19:21 UTC (permalink / raw)
  To: Chris Chiu; +Cc: nic_swsd, davem, netdev, Linux Kernel, Linux Upstreaming Team

On 20.12.2018 10:43, Chris Chiu wrote:
> On Thu, Dec 20, 2018 at 3:41 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>
>> On 19.12.2018 16:32, Chris Chiu wrote:
>>> On Wed, Dec 19, 2018 at 4:28 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>
>>>> On 18.12.2018 14:25, Chris Chiu wrote:
>>>>> On Tue, Dec 18, 2018 at 3:08 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>>>
>>>>>> On 17.12.2018 14:25, Chris Chiu wrote:
>>>>>>> On Fri, Dec 14, 2018 at 3:37 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>>>>>
>>>>>>>> On 14.12.2018 04:33, Chris Chiu wrote:
>>>>>>>>> On Thu, Dec 13, 2018 at 10:20 AM Chris Chiu <chiu@endlessm.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>     We got an acer laptop which has a problem with ethernet networking after
>>>>>>>>>> resuming from S3. The ethernet is popular realtek r8168. The lspci shows as
>>>>>>>>>> follows.
>>>>>>>>>> 02:00.1 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
>>>>>>>>>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 12)
>>>>>>>>>>
>>>>>>>> Helpful would be a "dmesg | grep r8169", especially chip name + XID.
>>>>>>>>
>>>>>>> [   22.362774] r8169 0000:02:00.1 (unnamed net_device)
>>>>>>> (uninitialized): mac_version = 0x2b
>>>>>>> [   22.365580] libphy: r8169: probed
>>>>>>> [   22.365958] r8169 0000:02:00.1 eth0: RTL8411, 00:e0:b8:1f:cb:83,
>>>>>>> XID 5c800800, IRQ 38
>>>>>>> [   22.365961] r8169 0000:02:00.1 eth0: jumbo features [frames: 9200
>>>>>>> bytes, tx checksumming: ko]
>>>>>>>
>>>>>> Thanks for the info.
>>>>>>
>>>>>>>>>>     The problem is the ethernet is not accessible after resume. Pinging via
>>>>>>>>>> ethernet always shows the response `Destination Host Unreachable`. However,
>>>>>>>>>> the interesting part is, when I run tcpdump to monitor the problematic ethernet
>>>>>>>>>> interface, the networking is back to alive. But it's dead again after
>>>>>>>>>> I stop tcpdump.
>>>>>>>>>> One more thing, if I ping the problematic machine from others, it achieves the
>>>>>>>>>> same effect as above tcpdump. Maybe it's about the register setting for RX path?
>>>>>>>>>>
>>>>>>>> You could compare the register dumps (ethtool -d) before and after S3 sleep
>>>>>>>> to find out whether there's a difference.
>>>>>>>>
>>>>>>>
>>>>>>> Actually, I just found I lead the wrong direction. The S3 suspend does
>>>>>>> help to reproduce,
>>>>>>> but it's not necessary. All I need to do is ping around 5 mins and the
>>>>>>> network connection
>>>>>>> fails.  And I also find one thing interesting, disabling the  MSI-X
>>>>>>> interrupt like commit
>>>>>>> [d49c88d7677ba737e9d2759a87db0402d5ab2607] can fix this problem.
>>>>>>> Although I don't
>>>>>>> understand the root cause. Anything I can do to help?
>>>>>>>
>>>>>> This is indeed very, very weird. You say switching from MSI-X to MSI fixes
>>>>>> the issue, but also pinging the machine from outside brings back the network.
>>>>>> Both actions affect totally different corners.
>>>>>>
>>>>>> The commit and related issue you mention was a workaround in the driver,
>>>>>> the root cause was a MSI-X-related  issue with certain Intel chipsets deep
>>>>>> in the PCI core. After this was fixed we removed the workaround again.
>>>>>> This shouldn't be related to your issue.
>>>>>>
>>>>>> Hard to say for now is whether the issue is:
>>>>>> - a driver issue
>>>>>> - a hardware issue in the RTL8411
>>>>>> - an issue with the chipset on your mainboard
>>>>>>
>>>>>> According to your description it doesn't take a special scenario to trigger
>>>>>> the issue, so most likely also other users of Acer notebooks with RTL8411
>>>>>> should be affected (after briefly checking this should be at least Aspire
>>>>>> F15, V15, V7). Therefore I wonder why there aren't more reports.
>>>>>>
>>>>>> This commit added MSI-X support: 6c6aa15fdea5 ("r8169: improve interrupt handling")
>>>>>> So you could test this revision and the one before.
>>>>>>
>>>>>> Eventually, if the issue really should be caused by a side effect of using
>>>>>> MSI-X, then the question is whether we need to disable MSI-X for RTL8411
>>>>>> in general or just for RTL8411 and a certain subsystem id.
>>>>>>
>>>>>
>>>>> I tried the kernel with the head on 6c6aa15fdea5 ("r8169: improve
>>>>> interrupt handling"),
>>>>> the problem still there. Then I revert to the previous revision, the
>>>>> problem goes away.
>>>>> So I think it's pretty much the side effect of MSI-X. However, as you
>>>>> mentioned that
>>>>> you didn't hit this problem, I'll ask the vendor to verify if this
>>>>> problem also happens on
>>>>> other machines with the same chip. Then we can determine to disable for specific
>>>>> mac version or just a certain subsystem id.
>>>>>
>>>>>>>>>>     I tried the latest 4.20 rc version but the problem still there. I
>>>>>>>>>> also tried some
>>>>>>>>>> hw_reset or init thing in the resume path but no effect. Any
>>>>>>>>>> suggestion for this?
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>> Did previous kernel versions work? If it's a regression, a bisect would be
>>>>>>>> appreciated, because with the chip versions I've got I can't reproduce the issue.
>>>>>>>>
>>>>>>>>>> Chris
>>>>>>>>>
>>>>>>>>> Gentle ping. Any additional information required?
>>>>>>>>>
>>>>>>>>> Chris
>>>>>>>>>
>>>>>>>> Heiner
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>> As an additional note:
>>>> I found that the rtsx_pci driver doesn't support MSI-X currently.
>>>> The following patch adds MSI-X support (it's compile-tested only
>>>> because I don't have a system with RTL8411).
>>>> Would be interesting to see whether it makes a difference if both
>>>> components on this combo chip use MSI-X.
>>>>
>>>> ---
>>>>  drivers/misc/cardreader/rtsx_pcr.c | 51 ++++++++++--------------------
>>>>  include/linux/rtsx_pci.h           |  1 -
>>>>  2 files changed, 16 insertions(+), 36 deletions(-)
>>>>
>>>> diff --git a/drivers/misc/cardreader/rtsx_pcr.c b/drivers/misc/cardreader/rtsx_pcr.c
>>>> index da445223f..d1349c248 100644
>>>> --- a/drivers/misc/cardreader/rtsx_pcr.c
>>>> +++ b/drivers/misc/cardreader/rtsx_pcr.c
>>>> @@ -35,10 +35,6 @@
>>>>
>>>>  #include "rtsx_pcr.h"
>>>>
>>>> -static bool msi_en = true;
>>>> -module_param(msi_en, bool, S_IRUGO | S_IWUSR);
>>>> -MODULE_PARM_DESC(msi_en, "Enable MSI");
>>>> -
>>>>  static DEFINE_IDR(rtsx_pci_idr);
>>>>  static DEFINE_SPINLOCK(rtsx_pci_lock);
>>>>
>>>> @@ -1049,22 +1045,21 @@ static irqreturn_t rtsx_pci_isr(int irq, void *dev_id)
>>>>
>>>>  static int rtsx_pci_acquire_irq(struct rtsx_pcr *pcr)
>>>>  {
>>>> -       pcr_dbg(pcr, "%s: pcr->msi_en = %d, pci->irq = %d\n",
>>>> -                       __func__, pcr->msi_en, pcr->pci->irq);
>>>> +       int ret;
>>>>
>>>> -       if (request_irq(pcr->pci->irq, rtsx_pci_isr,
>>>> -                       pcr->msi_en ? 0 : IRQF_SHARED,
>>>> -                       DRV_NAME_RTSX_PCI, pcr)) {
>>>> -               dev_err(&(pcr->pci->dev),
>>>> -                       "rtsx_sdmmc: unable to grab IRQ %d, disabling device\n",
>>>> -                       pcr->pci->irq);
>>>> -               return -1;
>>>> -       }
>>>> +       ret = pci_alloc_irq_vectors(pcr->pci, 1, 1, PCI_IRQ_ALL_TYPES);
>>>> +       if (ret < 0)
>>>> +               goto err;
>>>>
>>>> -       pcr->irq = pcr->pci->irq;
>>>> -       pci_intx(pcr->pci, !pcr->msi_en);
>>>> +       ret = pci_request_irq(pcr->pci, 0, rtsx_pci_isr, NULL, pcr,
>>>> +                             DRV_NAME_RTSX_PCI);
>>>> +       if (ret)
>>>> +               goto err;
>>>>
>>>>         return 0;
>>>> +err:
>>>> +       pci_err(pcr->pci, "rtsx_sdmmc: unable to grab interrupt\n");
>>>> +       return ret;
>>>>  }
>>>>
>>>>  static void rtsx_enable_aspm(struct rtsx_pcr *pcr)
>>>> @@ -1496,19 +1491,11 @@ static int rtsx_pci_probe(struct pci_dev *pcidev,
>>>>         INIT_DELAYED_WORK(&pcr->carddet_work, rtsx_pci_card_detect);
>>>>         INIT_DELAYED_WORK(&pcr->idle_work, rtsx_pci_idle_work);
>>>>
>>>> -       pcr->msi_en = msi_en;
>>>> -       if (pcr->msi_en) {
>>>> -               ret = pci_enable_msi(pcidev);
>>>> -               if (ret)
>>>> -                       pcr->msi_en = false;
>>>> -       }
>>>> -
>>>>         ret = rtsx_pci_acquire_irq(pcr);
>>>>         if (ret < 0)
>>>> -               goto disable_msi;
>>>> +               goto free_dma;
>>>>
>>>>         pci_set_master(pcidev);
>>>> -       synchronize_irq(pcr->irq);
>>>>
>>>>         ret = rtsx_pci_init_chip(pcr);
>>>>         if (ret < 0)
>>>> @@ -1528,10 +1515,8 @@ static int rtsx_pci_probe(struct pci_dev *pcidev,
>>>>         return 0;
>>>>
>>>>  disable_irq:
>>>> -       free_irq(pcr->irq, (void *)pcr);
>>>> -disable_msi:
>>>> -       if (pcr->msi_en)
>>>> -               pci_disable_msi(pcr->pci);
>>>> +       pci_free_irq(pcr->pci, 0, pcr);
>>>> +free_dma:
>>>>         dma_free_coherent(&(pcr->pci->dev), RTSX_RESV_BUF_LEN,
>>>>                         pcr->rtsx_resv_buf, pcr->rtsx_resv_buf_addr);
>>>>  unmap:
>>>> @@ -1568,9 +1553,7 @@ static void rtsx_pci_remove(struct pci_dev *pcidev)
>>>>
>>>>         dma_free_coherent(&(pcr->pci->dev), RTSX_RESV_BUF_LEN,
>>>>                         pcr->rtsx_resv_buf, pcr->rtsx_resv_buf_addr);
>>>> -       free_irq(pcr->irq, (void *)pcr);
>>>> -       if (pcr->msi_en)
>>>> -               pci_disable_msi(pcr->pci);
>>>> +       pci_free_irq(pcr->pci, 0, pcr);
>>>>         iounmap(pcr->remap_addr);
>>>>
>>>>         pci_release_regions(pcidev);
>>>> @@ -1664,9 +1647,7 @@ static void rtsx_pci_shutdown(struct pci_dev *pcidev)
>>>>         rtsx_pci_power_off(pcr, HOST_ENTER_S1);
>>>>
>>>>         pci_disable_device(pcidev);
>>>> -       free_irq(pcr->irq, (void *)pcr);
>>>> -       if (pcr->msi_en)
>>>> -               pci_disable_msi(pcr->pci);
>>>> +       pci_free_irq(pcr->pci, 0, pcr);
>>>>  }
>>>>
>>>>  #else /* CONFIG_PM */
>>>> diff --git a/include/linux/rtsx_pci.h b/include/linux/rtsx_pci.h
>>>> index e964bbd03..10abfe7f2 100644
>>>> --- a/include/linux/rtsx_pci.h
>>>> +++ b/include/linux/rtsx_pci.h
>>>> @@ -1190,7 +1190,6 @@ struct rtsx_pcr {
>>>>         /* pci resources */
>>>>         unsigned long                   addr;
>>>>         void __iomem                    *remap_addr;
>>>> -       int                             irq;
>>>>
>>>>         /* host reserved buffer */
>>>>         void                            *rtsx_resv_buf;
>>>> --
>>>> 2.20.0
>>>>
>>>
>>> As mentioned in the last email, the rtsx_pci seems to make no
>>> difference. I still tried the kernel with this patch applied, the
>>> problem still persists. I also tried the vendor driver and it works
>>> without any problem. I'd rather like to find out the root cause
>>> instead of a workaround. Any better idea?
>>>
>> Thanks for your efforts! The vendor driver doesn't support MSI-X,
>> therefore the issue doesn't occur. I'm running out of ideas, so
>> I will write to a contact in Realtek who few times provided helpful
>> information already.
>>
> 
> Hi Heiner,
>     After lots of repeating tests, I have to correct my previous finding
> to prevent from leading the wrong way. Sometimes the network also
> fails with unknown reason. Here's the summarize.
> 1. The S3 suspend resume can reproduce it 100%. However, echo
> different types (core, devices...) in /sys/power/pm_test is not able
> to achieve the same thing.
> 2. The network could randomly fail at any time. Maybe during boot,
> sometimes fail after few minutes web surfing.
> 3. After many times of verifications, it's not about MSI-X. I repeatedly
> boot from my own build kernel (w/ MSI-X workaround, w/ pci_alloc_irq,
> w/o pci_alloc_irq), even the revision before 6c6aa15fdea5 ("r8169:
> improve interrupt handling")
> still fails after S3, but I get the wrong impression because I access the
> internet w/o problem for quite a long time.
> 4. When it happens, executing tcpdump on this NIC can always get
> network access back. But fails again after stop tcpdump.
> 5. Vendor driver works w/o any problem. I'm still trying to find the difference.
> 
> Sorry that if I caused any confusion. I'll appreciate if there's any kind of
> useful information. Thanks.
> 
>>> Chris
>>>
>> Heiner
> .
> 

OK, here come two patches based on ideas from Realtek. Could you please test:
1. patch 1
2. patch 2
3. patch 1 + 2


patch 1:
diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 99bc3de90..eda35c0ce 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -5049,7 +5049,7 @@ static void rtl_hw_start_8168g(struct rtl8169_private *tp)
 	RTL_W8(tp, MaxTxPacketSize, EarlySize);
 
 	rtl_eri_write(tp, 0xc0, ERIAR_MASK_0011, 0x0000, ERIAR_EXGMAC);
-	rtl_eri_write(tp, 0xb8, ERIAR_MASK_0011, 0x0000, ERIAR_EXGMAC);
+	rtl_eri_write(tp, 0xb8, ERIAR_MASK_1111, 0, ERIAR_EXGMAC);
 
 	/* Adjust EEE LED frequency */
 	RTL_W8(tp, EEE_LED, RTL_R8(tp, EEE_LED) & ~0x07);
-- 
2.20.1
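
A note on what patch 1 changes: the second argument to rtl_eri_write() is
a byte-enable mask, so going from ERIAR_MASK_0011 to ERIAR_MASK_1111 should
clear all four bytes at 0xb8 instead of only the low two. Below is a small
userspace sketch of that idea. The mask values are the ones defined in
r8169.c; eri_model_write() is a made-up helper that models the assumed
byte-enable behaviour and is not driver code.

```c
#include <assert.h>
#include <stdint.h>

/* Mask values as defined in r8169.c. */
#define ERIAR_MASK_SHIFT 12
#define ERIAR_MASK_0011  (0x3 << ERIAR_MASK_SHIFT)
#define ERIAR_MASK_1111  (0xf << ERIAR_MASK_SHIFT)

/*
 * Hypothetical model of a byte-enabled 32-bit register write: bit n of the
 * 4-bit enable field selects whether byte n of 'val' replaces byte n of
 * 'old'. This only models the assumed hardware behaviour.
 */
static uint32_t eri_model_write(uint32_t old, uint32_t mask, uint32_t val)
{
	uint32_t byte_en = (mask >> ERIAR_MASK_SHIFT) & 0xf;
	uint32_t keep = 0;
	int i;

	/* The mask must only carry byte-enable bits. */
	assert((mask & ~(0xfu << ERIAR_MASK_SHIFT)) == 0);

	/* Build a bit mask of the bytes that are NOT enabled for writing. */
	for (i = 0; i < 4; i++)
		if (!(byte_en & (1u << i)))
			keep |= 0xffu << (8 * i);

	return (old & keep) | (val & ~keep);
}
```

Under this model, writing 0 with ERIAR_MASK_0011 leaves the upper 16 bits
of the old register value untouched, whereas ERIAR_MASK_1111 zeroes the
whole register.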


patch 2:
diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index eda35c0ce..921e258a3 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -5109,7 +5109,137 @@ static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
 	/* disable aspm and clock request before access ephy */
 	rtl_hw_aspm_clkreq_enable(tp, false);
 	rtl_ephy_init(tp, e_info_8411_2, ARRAY_SIZE(e_info_8411_2));
-	rtl_hw_aspm_clkreq_enable(tp, true);
+
+	r8168_mac_ocp_write(tp, 0xFC28, 0x0000);
+	r8168_mac_ocp_write(tp, 0xFC2A, 0x0000);
+	r8168_mac_ocp_write(tp, 0xFC2C, 0x0000);
+	r8168_mac_ocp_write(tp, 0xFC2E, 0x0000);
+	r8168_mac_ocp_write(tp, 0xFC30, 0x0000);
+	r8168_mac_ocp_write(tp, 0xFC32, 0x0000);
+	r8168_mac_ocp_write(tp, 0xFC34, 0x0000);
+	r8168_mac_ocp_write(tp, 0xFC36, 0x0000);
+	msleep(3);
+	r8168_mac_ocp_write(tp, 0xFC26, 0x0000);
+
+	r8168_mac_ocp_write( tp, 0xF800, 0xE008 );
+	r8168_mac_ocp_write( tp, 0xF802, 0xE00A );
+	r8168_mac_ocp_write( tp, 0xF804, 0xE00C );
+	r8168_mac_ocp_write( tp, 0xF806, 0xE00E );
+	r8168_mac_ocp_write( tp, 0xF808, 0xE027 );
+	r8168_mac_ocp_write( tp, 0xF80A, 0xE04F );
+	r8168_mac_ocp_write( tp, 0xF80C, 0xE05E );
+	r8168_mac_ocp_write( tp, 0xF80E, 0xE065 );
+	r8168_mac_ocp_write( tp, 0xF810, 0xC602 );
+	r8168_mac_ocp_write( tp, 0xF812, 0xBE00 );
+	r8168_mac_ocp_write( tp, 0xF814, 0x0000 );
+	r8168_mac_ocp_write( tp, 0xF816, 0xC502 );
+	r8168_mac_ocp_write( tp, 0xF818, 0xBD00 );
+	r8168_mac_ocp_write( tp, 0xF81A, 0x074C );
+	r8168_mac_ocp_write( tp, 0xF81C, 0xC302 );
+	r8168_mac_ocp_write( tp, 0xF81E, 0xBB00 );
+	r8168_mac_ocp_write( tp, 0xF820, 0x080A );
+	r8168_mac_ocp_write( tp, 0xF822, 0x6420 );
+	r8168_mac_ocp_write( tp, 0xF824, 0x48C2 );
+	r8168_mac_ocp_write( tp, 0xF826, 0x8C20 );
+	r8168_mac_ocp_write( tp, 0xF828, 0xC516 );
+	r8168_mac_ocp_write( tp, 0xF82A, 0x64A4 );
+	r8168_mac_ocp_write( tp, 0xF82C, 0x49C0 );
+	r8168_mac_ocp_write( tp, 0xF82E, 0xF009 );
+	r8168_mac_ocp_write( tp, 0xF830, 0x74A2 );
+	r8168_mac_ocp_write( tp, 0xF832, 0x8CA5 );
+	r8168_mac_ocp_write( tp, 0xF834, 0x74A0 );
+	r8168_mac_ocp_write( tp, 0xF836, 0xC50E );
+	r8168_mac_ocp_write( tp, 0xF838, 0x9CA2 );
+	r8168_mac_ocp_write( tp, 0xF83A, 0x1C11 );
+	r8168_mac_ocp_write( tp, 0xF83C, 0x9CA0 );
+	r8168_mac_ocp_write( tp, 0xF83E, 0xE006 );
+	r8168_mac_ocp_write( tp, 0xF840, 0x74F8 );
+	r8168_mac_ocp_write( tp, 0xF842, 0x48C4 );
+	r8168_mac_ocp_write( tp, 0xF844, 0x8CF8 );
+	r8168_mac_ocp_write( tp, 0xF846, 0xC404 );
+	r8168_mac_ocp_write( tp, 0xF848, 0xBC00 );
+	r8168_mac_ocp_write( tp, 0xF84A, 0xC403 );
+	r8168_mac_ocp_write( tp, 0xF84C, 0xBC00 );
+	r8168_mac_ocp_write( tp, 0xF84E, 0x0BF2 );
+	r8168_mac_ocp_write( tp, 0xF850, 0x0C0A );
+	r8168_mac_ocp_write( tp, 0xF852, 0xE434 );
+	r8168_mac_ocp_write( tp, 0xF854, 0xD3C0 );
+	r8168_mac_ocp_write( tp, 0xF856, 0x49D9 );
+	r8168_mac_ocp_write( tp, 0xF858, 0xF01F );
+	r8168_mac_ocp_write( tp, 0xF85A, 0xC526 );
+	r8168_mac_ocp_write( tp, 0xF85C, 0x64A5 );
+	r8168_mac_ocp_write( tp, 0xF85E, 0x1400 );
+	r8168_mac_ocp_write( tp, 0xF860, 0xF007 );
+	r8168_mac_ocp_write( tp, 0xF862, 0x0C01 );
+	r8168_mac_ocp_write( tp, 0xF864, 0x8CA5 );
+	r8168_mac_ocp_write( tp, 0xF866, 0x1C15 );
+	r8168_mac_ocp_write( tp, 0xF868, 0xC51B );
+	r8168_mac_ocp_write( tp, 0xF86A, 0x9CA0 );
+	r8168_mac_ocp_write( tp, 0xF86C, 0xE013 );
+	r8168_mac_ocp_write( tp, 0xF86E, 0xC519 );
+	r8168_mac_ocp_write( tp, 0xF870, 0x74A0 );
+	r8168_mac_ocp_write( tp, 0xF872, 0x48C4 );
+	r8168_mac_ocp_write( tp, 0xF874, 0x8CA0 );
+	r8168_mac_ocp_write( tp, 0xF876, 0xC516 );
+	r8168_mac_ocp_write( tp, 0xF878, 0x74A4 );
+	r8168_mac_ocp_write( tp, 0xF87A, 0x48C8 );
+	r8168_mac_ocp_write( tp, 0xF87C, 0x48CA );
+	r8168_mac_ocp_write( tp, 0xF87E, 0x9CA4 );
+	r8168_mac_ocp_write( tp, 0xF880, 0xC512 );
+	r8168_mac_ocp_write( tp, 0xF882, 0x1B00 );
+	r8168_mac_ocp_write( tp, 0xF884, 0x9BA0 );
+	r8168_mac_ocp_write( tp, 0xF886, 0x1B1C );
+	r8168_mac_ocp_write( tp, 0xF888, 0x483F );
+	r8168_mac_ocp_write( tp, 0xF88A, 0x9BA2 );
+	r8168_mac_ocp_write( tp, 0xF88C, 0x1B04 );
+	r8168_mac_ocp_write( tp, 0xF88E, 0xC508 );
+	r8168_mac_ocp_write( tp, 0xF890, 0x9BA0 );
+	r8168_mac_ocp_write( tp, 0xF892, 0xC505 );
+	r8168_mac_ocp_write( tp, 0xF894, 0xBD00 );
+	r8168_mac_ocp_write( tp, 0xF896, 0xC502 );
+	r8168_mac_ocp_write( tp, 0xF898, 0xBD00 );
+	r8168_mac_ocp_write( tp, 0xF89A, 0x0300 );
+	r8168_mac_ocp_write( tp, 0xF89C, 0x051E );
+	r8168_mac_ocp_write( tp, 0xF89E, 0xE434 );
+	r8168_mac_ocp_write( tp, 0xF8A0, 0xE018 );
+	r8168_mac_ocp_write( tp, 0xF8A2, 0xE092 );
+	r8168_mac_ocp_write( tp, 0xF8A4, 0xDE20 );
+	r8168_mac_ocp_write( tp, 0xF8A6, 0xD3C0 );
+	r8168_mac_ocp_write( tp, 0xF8A8, 0xC50F );
+	r8168_mac_ocp_write( tp, 0xF8AA, 0x76A4 );
+	r8168_mac_ocp_write( tp, 0xF8AC, 0x49E3 );
+	r8168_mac_ocp_write( tp, 0xF8AE, 0xF007 );
+	r8168_mac_ocp_write( tp, 0xF8B0, 0x49C0 );
+	r8168_mac_ocp_write( tp, 0xF8B2, 0xF103 );
+	r8168_mac_ocp_write( tp, 0xF8B4, 0xC607 );
+	r8168_mac_ocp_write( tp, 0xF8B6, 0xBE00 );
+	r8168_mac_ocp_write( tp, 0xF8B8, 0xC606 );
+	r8168_mac_ocp_write( tp, 0xF8BA, 0xBE00 );
+	r8168_mac_ocp_write( tp, 0xF8BC, 0xC602 );
+	r8168_mac_ocp_write( tp, 0xF8BE, 0xBE00 );
+	r8168_mac_ocp_write( tp, 0xF8C0, 0x0C4C );
+	r8168_mac_ocp_write( tp, 0xF8C2, 0x0C28 );
+	r8168_mac_ocp_write( tp, 0xF8C4, 0x0C2C );
+	r8168_mac_ocp_write( tp, 0xF8C6, 0xDC00 );
+	r8168_mac_ocp_write( tp, 0xF8C8, 0xC707 );
+	r8168_mac_ocp_write( tp, 0xF8CA, 0x1D00 );
+	r8168_mac_ocp_write( tp, 0xF8CC, 0x8DE2 );
+	r8168_mac_ocp_write( tp, 0xF8CE, 0x48C1 );
+	r8168_mac_ocp_write( tp, 0xF8D0, 0xC502 );
+	r8168_mac_ocp_write( tp, 0xF8D2, 0xBD00 );
+	r8168_mac_ocp_write( tp, 0xF8D4, 0x00AA );
+	r8168_mac_ocp_write( tp, 0xF8D6, 0xE0C0 );
+	r8168_mac_ocp_write( tp, 0xF8D8, 0xC502 );
+	r8168_mac_ocp_write( tp, 0xF8DA, 0xBD00 );
+	r8168_mac_ocp_write( tp, 0xF8DC, 0x0132 );
+	r8168_mac_ocp_write( tp, 0xFC26, 0x8000 );
+	r8168_mac_ocp_write( tp, 0xFC2A, 0x0743 );
+	r8168_mac_ocp_write( tp, 0xFC2C, 0x0801 );
+	r8168_mac_ocp_write( tp, 0xFC2E, 0x0BE9 );
+	r8168_mac_ocp_write( tp, 0xFC30, 0x02FD );
+	r8168_mac_ocp_write( tp, 0xFC32, 0x0C25 );
+	r8168_mac_ocp_write( tp, 0xFC34, 0x00A9 );
+	r8168_mac_ocp_write( tp, 0xFC36, 0x012D );
 }
 
 static void rtl_hw_start_8168h_1(struct rtl8169_private *tp)
-- 
2.20.1
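
Patch 2 above is a test patch only, hence the long run of single writes.
The 0xF800 block consists of consecutive 16-bit words, so it could also be
written from a table. The following is a rough userspace sketch of that
shape, not the form any final patch would necessarily take. struct fake_tp,
fake_mac_ocp_write() and fake_write_block() are hypothetical stand-ins that
just record the writes, and only the first eight words of the block are
reproduced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical stand-in for the driver's private struct: records writes. */
struct fake_tp {
	uint16_t reg[64];
	uint16_t val[64];
	size_t count;
};

/* Stub taking (tp, reg, value) like r8168_mac_ocp_write() in the patch. */
static void fake_mac_ocp_write(struct fake_tp *tp, uint16_t reg, uint16_t val)
{
	assert(tp->count < ARRAY_SIZE(tp->reg));
	tp->reg[tp->count] = reg;
	tp->val[tp->count] = val;
	tp->count++;
}

/* First words of the 0xF800 block from patch 2; the remainder is omitted. */
static const uint16_t ocp_patch_head[] = {
	0xE008, 0xE00A, 0xE00C, 0xE00E, 0xE027, 0xE04F, 0xE05E, 0xE065,
};

/* Write consecutive 16-bit words starting at 'base', stepping by 2. */
static void fake_write_block(struct fake_tp *tp, uint16_t base,
			     const uint16_t *words, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		fake_mac_ocp_write(tp, (uint16_t)(base + 2 * i), words[i]);
}
```

Calling fake_write_block(&tp, 0xF800, ocp_patch_head,
ARRAY_SIZE(ocp_patch_head)) on a zero-initialized struct fake_tp records the
same register/value pairs as the first eight 0xF8xx lines of the patch.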


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: A weird problem of Realtek r8168 after resume from S3
  2018-12-20 19:21                   ` Heiner Kallweit
@ 2018-12-21 15:16                     ` Chris Chiu
  0 siblings, 0 replies; 17+ messages in thread
From: Chris Chiu @ 2018-12-21 15:16 UTC (permalink / raw)
  To: Heiner Kallweit
  Cc: nic_swsd, davem, netdev, Linux Kernel, Linux Upstreaming Team

On Fri, Dec 21, 2018 at 3:22 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
>
> On 20.12.2018 10:43, Chris Chiu wrote:
> > On Thu, Dec 20, 2018 at 3:41 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>
> >> On 19.12.2018 16:32, Chris Chiu wrote:
> >>> On Wed, Dec 19, 2018 at 4:28 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>>>
> >>>> On 18.12.2018 14:25, Chris Chiu wrote:
> >>>>> On Tue, Dec 18, 2018 at 3:08 AM Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>>>>>
> >>>>>> On 17.12.2018 14:25, Chris Chiu wrote:
> >>>>>>> On Fri, Dec 14, 2018 at 3:37 PM Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> On 14.12.2018 04:33, Chris Chiu wrote:
> >>>>>>>>> On Thu, Dec 13, 2018 at 10:20 AM Chris Chiu <chiu@endlessm.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>     We got an acer laptop which has a problem with ethernet networking after
> >>>>>>>>>> resuming from S3. The ethernet is popular realtek r8168. The lspci shows as
> >>>>>>>>>> follows.
> >>>>>>>>>> 02:00.1 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
> >>>>>>>>>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 12)
> >>>>>>>>>>
> >>>>>>>> Helpful would be a "dmesg | grep r8169", especially chip name + XID.
> >>>>>>>>
> >>>>>>> [   22.362774] r8169 0000:02:00.1 (unnamed net_device)
> >>>>>>> (uninitialized): mac_version = 0x2b
> >>>>>>> [   22.365580] libphy: r8169: probed
> >>>>>>> [   22.365958] r8169 0000:02:00.1 eth0: RTL8411, 00:e0:b8:1f:cb:83,
> >>>>>>> XID 5c800800, IRQ 38
> >>>>>>> [   22.365961] r8169 0000:02:00.1 eth0: jumbo features [frames: 9200
> >>>>>>> bytes, tx checksumming: ko]
> >>>>>>>
> >>>>>> Thanks for the info.
> >>>>>>
> >>>>>>>>>>     The problem is the ethernet is not accessible after resume. Pinging via
> >>>>>>>>>> ethernet always shows the response `Destination Host Unreachable`. However,
> >>>>>>>>>> the interesting part is, when I run tcpdump to monitor the problematic ethernet
> >>>>>>>>>> interface, the networking is back to alive. But it's dead again after
> >>>>>>>>>> I stop tcpdump.
> >>>>>>>>>> One more thing, if I ping the problematic machine from others, it achieves the
> >>>>>>>>>> same effect as above tcpdump. Maybe it's about the register setting for RX path?
> >>>>>>>>>>
> >>>>>>>> You could compare the register dumps (ethtool -d) before and after S3 sleep
> >>>>>>>> to find out whether there's a difference.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Actually, I just found I lead the wrong direction. The S3 suspend does
> >>>>>>> help to reproduce,
> >>>>>>> but it's not necessary. All I need to do is ping around 5 mins and the
> >>>>>>> network connection
> >>>>>>> fails.  And I also find one thing interesting, disabling the  MSI-X
> >>>>>>> interrupt like commit
> >>>>>>> [d49c88d7677ba737e9d2759a87db0402d5ab2607] can fix this problem.
> >>>>>>> Although I don't
> >>>>>>> understand the root cause. Anything I can do to help?
> >>>>>>>
> >>>>>> This is indeed very, very weird. You say switching from MSI-X to MSI fixes
> >>>>>> the issue, but also pinging the machine from outside brings back the network.
> >>>>>> Both actions affect totally different corners.
> >>>>>>
> >>>>>> The commit and related issue you mention was a workaround in the driver,
> >>>>>> the root cause was a MSI-X-related  issue with certain Intel chipsets deep
> >>>>>> in the PCI core. After this was fixed we removed the workaround again.
> >>>>>> This shouldn't be related to your issue.
> >>>>>>
> >>>>>> Hard to say for now is whether the issue is:
> >>>>>> - a driver issue
> >>>>>> - a hardware issue in the RTL8411
> >>>>>> - an issue with the chipset on your mainboard
> >>>>>>
> >>>>>> According to your description it doesn't take a special scenario to trigger
> >>>>>> the issue, so most likely also other users of Acer notebooks with RTL8411
> >>>>>> should be affected (after briefly checking this should be at least Aspire
> >>>>>> F15, V15, V7). Therefore I wonder why there aren't more reports.
> >>>>>>
> >>>>>> This commit added MSI-X support: 6c6aa15fdea5 ("r8169: improve interrupt handling")
> >>>>>> So you could test this revision and the one before.
> >>>>>>
> >>>>>> Eventually, if the issue really should be caused by a side effect of using
> >>>>>> MSI-X, then the question is whether we need to disable MSI-X for RTL8411
> >>>>>> in general or just for RTL8411 and a certain subsystem id.
> >>>>>>
> >>>>>
> >>>>> I tried the kernel with the head on 6c6aa15fdea5 ("r8169: improve
> >>>>> interrupt handling"),
> >>>>> the problem still there. Then I revert to the previous revision, the
> >>>>> problem goes away.
> >>>>> So I think it's pretty much the side effect of MSI-X. However, as you
> >>>>> mentioned that
> >>>>> you didn't hit this problem, I'll ask the vendor to verify if this
> >>>>> problem also happens on
> >>>>> other machines with the same chip. Then we can determine to disable for specific
> >>>>> mac version or just a certain subsystem id.
> >>>>>
> >>>>>>>>>>     I tried the latest 4.20 rc version but the problem still there. I
> >>>>>>>>>> also tried some
> >>>>>>>>>> hw_reset or init thing in the resume path but no effect. Any
> >>>>>>>>>> suggestion for this?
> >>>>>>>>>> Thanks
> >>>>>>>>>>
> >>>>>>>> Did previous kernel versions work? If it's a regression, a bisect would be
> >>>>>>>> appreciated, because with the chip versions I've got I can't reproduce the issue.
> >>>>>>>>
> >>>>>>>>>> Chris
> >>>>>>>>>
> >>>>>>>>> Gentle ping. Any additional information required?
> >>>>>>>>>
> >>>>>>>>> Chris
> >>>>>>>>>
> >>>>>>>> Heiner
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>> As an additional note:
> >>>> I found that the rtsx_pci driver doesn't support MSI-X currently.
> >>>> The following patch adds MSI-X support (it's compile-tested only
> >>>> because I don't have a system with RTL8411).
> >>>> Would be interesting to see whether it makes a difference if both
> >>>> components on this combo chip use MSI-X.
> >>>>
> >>>> ---
> >>>>  drivers/misc/cardreader/rtsx_pcr.c | 51 ++++++++++--------------------
> >>>>  include/linux/rtsx_pci.h           |  1 -
> >>>>  2 files changed, 16 insertions(+), 36 deletions(-)
> >>>>
> >>>> diff --git a/drivers/misc/cardreader/rtsx_pcr.c b/drivers/misc/cardreader/rtsx_pcr.c
> >>>> index da445223f..d1349c248 100644
> >>>> --- a/drivers/misc/cardreader/rtsx_pcr.c
> >>>> +++ b/drivers/misc/cardreader/rtsx_pcr.c
> >>>> @@ -35,10 +35,6 @@
> >>>>
> >>>>  #include "rtsx_pcr.h"
> >>>>
> >>>> -static bool msi_en = true;
> >>>> -module_param(msi_en, bool, S_IRUGO | S_IWUSR);
> >>>> -MODULE_PARM_DESC(msi_en, "Enable MSI");
> >>>> -
> >>>>  static DEFINE_IDR(rtsx_pci_idr);
> >>>>  static DEFINE_SPINLOCK(rtsx_pci_lock);
> >>>>
> >>>> @@ -1049,22 +1045,21 @@ static irqreturn_t rtsx_pci_isr(int irq, void *dev_id)
> >>>>
> >>>>  static int rtsx_pci_acquire_irq(struct rtsx_pcr *pcr)
> >>>>  {
> >>>> -       pcr_dbg(pcr, "%s: pcr->msi_en = %d, pci->irq = %d\n",
> >>>> -                       __func__, pcr->msi_en, pcr->pci->irq);
> >>>> +       int ret;
> >>>>
> >>>> -       if (request_irq(pcr->pci->irq, rtsx_pci_isr,
> >>>> -                       pcr->msi_en ? 0 : IRQF_SHARED,
> >>>> -                       DRV_NAME_RTSX_PCI, pcr)) {
> >>>> -               dev_err(&(pcr->pci->dev),
> >>>> -                       "rtsx_sdmmc: unable to grab IRQ %d, disabling device\n",
> >>>> -                       pcr->pci->irq);
> >>>> -               return -1;
> >>>> -       }
> >>>> +       ret = pci_alloc_irq_vectors(pcr->pci, 1, 1, PCI_IRQ_ALL_TYPES);
> >>>> +       if (ret < 0)
> >>>> +               goto err;
> >>>>
> >>>> -       pcr->irq = pcr->pci->irq;
> >>>> -       pci_intx(pcr->pci, !pcr->msi_en);
> >>>> +       ret = pci_request_irq(pcr->pci, 0, rtsx_pci_isr, NULL, pcr,
> >>>> +                             DRV_NAME_RTSX_PCI);
> >>>> +       if (ret)
> >>>> +               goto err;
> >>>>
> >>>>         return 0;
> >>>> +err:
> >>>> +       pci_err(pcr->pci, "rtsx_sdmmc: unable to grab interrupt\n");
> >>>> +       return ret;
> >>>>  }
> >>>>
> >>>>  static void rtsx_enable_aspm(struct rtsx_pcr *pcr)
> >>>> @@ -1496,19 +1491,11 @@ static int rtsx_pci_probe(struct pci_dev *pcidev,
> >>>>         INIT_DELAYED_WORK(&pcr->carddet_work, rtsx_pci_card_detect);
> >>>>         INIT_DELAYED_WORK(&pcr->idle_work, rtsx_pci_idle_work);
> >>>>
> >>>> -       pcr->msi_en = msi_en;
> >>>> -       if (pcr->msi_en) {
> >>>> -               ret = pci_enable_msi(pcidev);
> >>>> -               if (ret)
> >>>> -                       pcr->msi_en = false;
> >>>> -       }
> >>>> -
> >>>>         ret = rtsx_pci_acquire_irq(pcr);
> >>>>         if (ret < 0)
> >>>> -               goto disable_msi;
> >>>> +               goto free_dma;
> >>>>
> >>>>         pci_set_master(pcidev);
> >>>> -       synchronize_irq(pcr->irq);
> >>>>
> >>>>         ret = rtsx_pci_init_chip(pcr);
> >>>>         if (ret < 0)
> >>>> @@ -1528,10 +1515,8 @@ static int rtsx_pci_probe(struct pci_dev *pcidev,
> >>>>         return 0;
> >>>>
> >>>>  disable_irq:
> >>>> -       free_irq(pcr->irq, (void *)pcr);
> >>>> -disable_msi:
> >>>> -       if (pcr->msi_en)
> >>>> -               pci_disable_msi(pcr->pci);
> >>>> +       pci_free_irq(pcr->pci, 0, pcr);
> >>>> +free_dma:
> >>>>         dma_free_coherent(&(pcr->pci->dev), RTSX_RESV_BUF_LEN,
> >>>>                         pcr->rtsx_resv_buf, pcr->rtsx_resv_buf_addr);
> >>>>  unmap:
> >>>> @@ -1568,9 +1553,7 @@ static void rtsx_pci_remove(struct pci_dev *pcidev)
> >>>>
> >>>>         dma_free_coherent(&(pcr->pci->dev), RTSX_RESV_BUF_LEN,
> >>>>                         pcr->rtsx_resv_buf, pcr->rtsx_resv_buf_addr);
> >>>> -       free_irq(pcr->irq, (void *)pcr);
> >>>> -       if (pcr->msi_en)
> >>>> -               pci_disable_msi(pcr->pci);
> >>>> +       pci_free_irq(pcr->pci, 0, pcr);
> >>>>         iounmap(pcr->remap_addr);
> >>>>
> >>>>         pci_release_regions(pcidev);
> >>>> @@ -1664,9 +1647,7 @@ static void rtsx_pci_shutdown(struct pci_dev *pcidev)
> >>>>         rtsx_pci_power_off(pcr, HOST_ENTER_S1);
> >>>>
> >>>>         pci_disable_device(pcidev);
> >>>> -       free_irq(pcr->irq, (void *)pcr);
> >>>> -       if (pcr->msi_en)
> >>>> -               pci_disable_msi(pcr->pci);
> >>>> +       pci_free_irq(pcr->pci, 0, pcr);
> >>>>  }
> >>>>
> >>>>  #else /* CONFIG_PM */
> >>>> diff --git a/include/linux/rtsx_pci.h b/include/linux/rtsx_pci.h
> >>>> index e964bbd03..10abfe7f2 100644
> >>>> --- a/include/linux/rtsx_pci.h
> >>>> +++ b/include/linux/rtsx_pci.h
> >>>> @@ -1190,7 +1190,6 @@ struct rtsx_pcr {
> >>>>         /* pci resources */
> >>>>         unsigned long                   addr;
> >>>>         void __iomem                    *remap_addr;
> >>>> -       int                             irq;
> >>>>
> >>>>         /* host reserved buffer */
> >>>>         void                            *rtsx_resv_buf;
> >>>> --
> >>>> 2.20.0
> >>>>
> >>>
> >>> As mentioned in the last email, the rtsx_pci seems to make no
> >>> difference. I still tried the kernel with this patch applied, the
> >>> problem still persists. I also tried the vendor driver and it works
> >>> without any problem. I'd rather like to find out the root cause
> >>> instead of a workaround. Any better idea?
> >>>
> >> Thanks for your efforts! The vendor driver doesn't support MSI-X,
> >> therefore the issue doesn't occur. I'm running out of ideas, so
> >> I will write to a contact in Realtek who few times provided helpful
> >> information already.
> >>
> >
> > Hi Heiner,
> >     After lots of repeating tests, I have to correct my previous finding
> > to prevent from leading the wrong way. Sometimes the network also
> > fails with unknown reason. Here's the summarize.
> > 1. The S3 suspend resume can reproduce it 100%. However, echo
> > different types (core, devices...) in /sys/power/pm_test is not able
> > to achieve the same thing.
> > 2. The network could randomly fail at any time. Maybe during boot,
> > sometimes fail after few minutes web surfing.
> > 3. After many times of verifications, it's not about MSI-X. I repeatedly
> > boot from my own build kernel (w/ MSI-X workaround, w/ pci_alloc_irq,
> > w/o pci_alloc_irq), even the revision before 6c6aa15fdea5 ("r8169:
> > improve interrupt handling")
> > still fails after S3, but I get the wrong impression because I access the
> > internet w/o problem for quite a long time.
> > 4. When it happens, executing tcpdump on this NIC can always get
> > network access back. But fails again after stop tcpdump.
> > 5. Vendor driver works w/o any problem. I'm still trying to find the difference.
> >
> > Sorry that if I caused any confusion. I'll appreciate if there's any kind of
> > useful information. Thanks.
> >
> >>> Chris
> >>>
> >> Heiner
> > .
> >

Sorry for complicating the problem. This laptop fails to enter suspend
with the old amdgpu driver, so kernel bisecting is not easy. And since the
problem is only 100% reproducible after S3 suspend/resume, it was
difficult for me to define a criterion for judging good/bad in each test.
The following test results are based on kernel 4.20.0-rc7, which has no
suspend/resume problem. I ran each test for 3 reboots.

>As one more small test: Could you comment out the last statement in
>rtl_hw_start_8411_2() ? Then ASPM should be disabled on the chip side.

In 2 of 3 reboots, it never got an IP address from DHCP, neither after
boot nor after each unplug/replug of the ethernet cable.
In the other reboot, it failed after resume from S3.

> OK, here come two patches based on ideas from Realtek. Could you please test:
> 1. patch 1
In 2 of 3 reboots, it failed after resume from S3.
In the other reboot, it never got an IP address from DHCP after boot.

> 2. patch 2
Same as the test result of patch 1 alone.

> 3. patch 1 + 2
>
In all 3 reboots, it never got an IP address from DHCP, neither after
boot nor after each unplug/replug of the ethernet cable.

I think we can simply take these results as all failed.
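
For context on patch 1: its only functional change is the mask passed
to rtl_eri_write(), from ERIAR_MASK_0011 to ERIAR_MASK_1111. Assuming
these masks act as per-byte write enables (low two bytes vs. all four
bytes of the 32-bit register at 0xb8), the difference can be modelled
with the small sketch below. The names masked_write, BE_0011 and
BE_1111 are made up for illustration and are not taken from the driver.

```c
#include <stdint.h>

/*
 * Hypothetical model of a byte-enable write to a 32-bit register.
 * This is NOT the driver's code; it only illustrates the assumed
 * meaning of the 0011 / 1111 masks.
 */
#define BE_0011 0x3u	/* assumed: enable bytes 0 and 1 only */
#define BE_1111 0xfu	/* assumed: enable all four bytes */

/*
 * Return the register value after writing 'val' over 'old' with the
 * 4-bit byte-enable mask 'be'. Bit n of 'be' set means byte n of the
 * register is replaced by byte n of 'val'; other bytes keep 'old'.
 */
uint32_t masked_write(uint32_t old, uint32_t val, unsigned int be)
{
	uint32_t bits = 0;
	unsigned int i;

	for (i = 0; i < 4; i++)
		if (be & (1u << i))
			bits |= (uint32_t)0xff << (8 * i);

	return (old & ~bits) | (val & bits);
}
```

Under that assumption, a zero write with BE_0011 over an old value of
0x12345678 leaves 0x12340000 behind, while BE_1111 clears the whole
register. So patch 1 would additionally clear the upper two bytes.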

>
> patch 1:
> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index 99bc3de90..eda35c0ce 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -5049,7 +5049,7 @@ static void rtl_hw_start_8168g(struct rtl8169_private *tp)
>         RTL_W8(tp, MaxTxPacketSize, EarlySize);
>
>         rtl_eri_write(tp, 0xc0, ERIAR_MASK_0011, 0x0000, ERIAR_EXGMAC);
> -       rtl_eri_write(tp, 0xb8, ERIAR_MASK_0011, 0x0000, ERIAR_EXGMAC);
> +       rtl_eri_write(tp, 0xb8, ERIAR_MASK_1111, 0, ERIAR_EXGMAC);
>
>         /* Adjust EEE LED frequency */
>         RTL_W8(tp, EEE_LED, RTL_R8(tp, EEE_LED) & ~0x07);
> --
> 2.20.1
>
>
> patch 2:
> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index eda35c0ce..921e258a3 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -5109,7 +5109,137 @@ static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>         /* disable aspm and clock request before access ephy */
>         rtl_hw_aspm_clkreq_enable(tp, false);
>         rtl_ephy_init(tp, e_info_8411_2, ARRAY_SIZE(e_info_8411_2));
> -       rtl_hw_aspm_clkreq_enable(tp, true);
> +
> +       r8168_mac_ocp_write(tp, 0xFC28, 0x0000);
> +       r8168_mac_ocp_write(tp, 0xFC2A, 0x0000);
> +       r8168_mac_ocp_write(tp, 0xFC2C, 0x0000);
> +       r8168_mac_ocp_write(tp, 0xFC2E, 0x0000);
> +       r8168_mac_ocp_write(tp, 0xFC30, 0x0000);
> +       r8168_mac_ocp_write(tp, 0xFC32, 0x0000);
> +       r8168_mac_ocp_write(tp, 0xFC34, 0x0000);
> +       r8168_mac_ocp_write(tp, 0xFC36, 0x0000);
> +       msleep(3);
> +       r8168_mac_ocp_write(tp, 0xFC26, 0x0000);
> +
> +       r8168_mac_ocp_write( tp, 0xF800, 0xE008 );
> +       r8168_mac_ocp_write( tp, 0xF802, 0xE00A );
> +       r8168_mac_ocp_write( tp, 0xF804, 0xE00C );
> +       r8168_mac_ocp_write( tp, 0xF806, 0xE00E );
> +       r8168_mac_ocp_write( tp, 0xF808, 0xE027 );
> +       r8168_mac_ocp_write( tp, 0xF80A, 0xE04F );
> +       r8168_mac_ocp_write( tp, 0xF80C, 0xE05E );
> +       r8168_mac_ocp_write( tp, 0xF80E, 0xE065 );
> +       r8168_mac_ocp_write( tp, 0xF810, 0xC602 );
> +       r8168_mac_ocp_write( tp, 0xF812, 0xBE00 );
> +       r8168_mac_ocp_write( tp, 0xF814, 0x0000 );
> +       r8168_mac_ocp_write( tp, 0xF816, 0xC502 );
> +       r8168_mac_ocp_write( tp, 0xF818, 0xBD00 );
> +       r8168_mac_ocp_write( tp, 0xF81A, 0x074C );
> +       r8168_mac_ocp_write( tp, 0xF81C, 0xC302 );
> +       r8168_mac_ocp_write( tp, 0xF81E, 0xBB00 );
> +       r8168_mac_ocp_write( tp, 0xF820, 0x080A );
> +       r8168_mac_ocp_write( tp, 0xF822, 0x6420 );
> +       r8168_mac_ocp_write( tp, 0xF824, 0x48C2 );
> +       r8168_mac_ocp_write( tp, 0xF826, 0x8C20 );
> +       r8168_mac_ocp_write( tp, 0xF828, 0xC516 );
> +       r8168_mac_ocp_write( tp, 0xF82A, 0x64A4 );
> +       r8168_mac_ocp_write( tp, 0xF82C, 0x49C0 );
> +       r8168_mac_ocp_write( tp, 0xF82E, 0xF009 );
> +       r8168_mac_ocp_write( tp, 0xF830, 0x74A2 );
> +       r8168_mac_ocp_write( tp, 0xF832, 0x8CA5 );
> +       r8168_mac_ocp_write( tp, 0xF834, 0x74A0 );
> +       r8168_mac_ocp_write( tp, 0xF836, 0xC50E );
> +       r8168_mac_ocp_write( tp, 0xF838, 0x9CA2 );
> +       r8168_mac_ocp_write( tp, 0xF83A, 0x1C11 );
> +       r8168_mac_ocp_write( tp, 0xF83C, 0x9CA0 );
> +       r8168_mac_ocp_write( tp, 0xF83E, 0xE006 );
> +       r8168_mac_ocp_write( tp, 0xF840, 0x74F8 );
> +       r8168_mac_ocp_write( tp, 0xF842, 0x48C4 );
> +       r8168_mac_ocp_write( tp, 0xF844, 0x8CF8 );
> +       r8168_mac_ocp_write( tp, 0xF846, 0xC404 );
> +       r8168_mac_ocp_write( tp, 0xF848, 0xBC00 );
> +       r8168_mac_ocp_write( tp, 0xF84A, 0xC403 );
> +       r8168_mac_ocp_write( tp, 0xF84C, 0xBC00 );
> +       r8168_mac_ocp_write( tp, 0xF84E, 0x0BF2 );
> +       r8168_mac_ocp_write( tp, 0xF850, 0x0C0A );
> +       r8168_mac_ocp_write( tp, 0xF852, 0xE434 );
> +       r8168_mac_ocp_write( tp, 0xF854, 0xD3C0 );
> +       r8168_mac_ocp_write( tp, 0xF856, 0x49D9 );
> +       r8168_mac_ocp_write( tp, 0xF858, 0xF01F );
> +       r8168_mac_ocp_write( tp, 0xF85A, 0xC526 );
> +       r8168_mac_ocp_write( tp, 0xF85C, 0x64A5 );
> +       r8168_mac_ocp_write( tp, 0xF85E, 0x1400 );
> +       r8168_mac_ocp_write( tp, 0xF860, 0xF007 );
> +       r8168_mac_ocp_write( tp, 0xF862, 0x0C01 );
> +       r8168_mac_ocp_write( tp, 0xF864, 0x8CA5 );
> +       r8168_mac_ocp_write( tp, 0xF866, 0x1C15 );
> +       r8168_mac_ocp_write( tp, 0xF868, 0xC51B );
> +       r8168_mac_ocp_write( tp, 0xF86A, 0x9CA0 );
> +       r8168_mac_ocp_write( tp, 0xF86C, 0xE013 );
> +       r8168_mac_ocp_write( tp, 0xF86E, 0xC519 );
> +       r8168_mac_ocp_write( tp, 0xF870, 0x74A0 );
> +       r8168_mac_ocp_write( tp, 0xF872, 0x48C4 );
> +       r8168_mac_ocp_write( tp, 0xF874, 0x8CA0 );
> +       r8168_mac_ocp_write( tp, 0xF876, 0xC516 );
> +       r8168_mac_ocp_write( tp, 0xF878, 0x74A4 );
> +       r8168_mac_ocp_write( tp, 0xF87A, 0x48C8 );
> +       r8168_mac_ocp_write( tp, 0xF87C, 0x48CA );
> +       r8168_mac_ocp_write( tp, 0xF87E, 0x9CA4 );
> +       r8168_mac_ocp_write( tp, 0xF880, 0xC512 );
> +       r8168_mac_ocp_write( tp, 0xF882, 0x1B00 );
> +       r8168_mac_ocp_write( tp, 0xF884, 0x9BA0 );
> +       r8168_mac_ocp_write( tp, 0xF886, 0x1B1C );
> +       r8168_mac_ocp_write( tp, 0xF888, 0x483F );
> +       r8168_mac_ocp_write( tp, 0xF88A, 0x9BA2 );
> +       r8168_mac_ocp_write( tp, 0xF88C, 0x1B04 );
> +       r8168_mac_ocp_write( tp, 0xF88E, 0xC508 );
> +       r8168_mac_ocp_write( tp, 0xF890, 0x9BA0 );
> +       r8168_mac_ocp_write( tp, 0xF892, 0xC505 );
> +       r8168_mac_ocp_write( tp, 0xF894, 0xBD00 );
> +       r8168_mac_ocp_write( tp, 0xF896, 0xC502 );
> +       r8168_mac_ocp_write( tp, 0xF898, 0xBD00 );
> +       r8168_mac_ocp_write( tp, 0xF89A, 0x0300 );
> +       r8168_mac_ocp_write( tp, 0xF89C, 0x051E );
> +       r8168_mac_ocp_write( tp, 0xF89E, 0xE434 );
> +       r8168_mac_ocp_write( tp, 0xF8A0, 0xE018 );
> +       r8168_mac_ocp_write( tp, 0xF8A2, 0xE092 );
> +       r8168_mac_ocp_write( tp, 0xF8A4, 0xDE20 );
> +       r8168_mac_ocp_write( tp, 0xF8A6, 0xD3C0 );
> +       r8168_mac_ocp_write( tp, 0xF8A8, 0xC50F );
> +       r8168_mac_ocp_write( tp, 0xF8AA, 0x76A4 );
> +       r8168_mac_ocp_write( tp, 0xF8AC, 0x49E3 );
> +       r8168_mac_ocp_write( tp, 0xF8AE, 0xF007 );
> +       r8168_mac_ocp_write( tp, 0xF8B0, 0x49C0 );
> +       r8168_mac_ocp_write( tp, 0xF8B2, 0xF103 );
> +       r8168_mac_ocp_write( tp, 0xF8B4, 0xC607 );
> +       r8168_mac_ocp_write( tp, 0xF8B6, 0xBE00 );
> +       r8168_mac_ocp_write( tp, 0xF8B8, 0xC606 );
> +       r8168_mac_ocp_write( tp, 0xF8BA, 0xBE00 );
> +       r8168_mac_ocp_write( tp, 0xF8BC, 0xC602 );
> +       r8168_mac_ocp_write( tp, 0xF8BE, 0xBE00 );
> +       r8168_mac_ocp_write( tp, 0xF8C0, 0x0C4C );
> +       r8168_mac_ocp_write( tp, 0xF8C2, 0x0C28 );
> +       r8168_mac_ocp_write( tp, 0xF8C4, 0x0C2C );
> +       r8168_mac_ocp_write( tp, 0xF8C6, 0xDC00 );
> +       r8168_mac_ocp_write( tp, 0xF8C8, 0xC707 );
> +       r8168_mac_ocp_write( tp, 0xF8CA, 0x1D00 );
> +       r8168_mac_ocp_write( tp, 0xF8CC, 0x8DE2 );
> +       r8168_mac_ocp_write( tp, 0xF8CE, 0x48C1 );
> +       r8168_mac_ocp_write( tp, 0xF8D0, 0xC502 );
> +       r8168_mac_ocp_write( tp, 0xF8D2, 0xBD00 );
> +       r8168_mac_ocp_write( tp, 0xF8D4, 0x00AA );
> +       r8168_mac_ocp_write( tp, 0xF8D6, 0xE0C0 );
> +       r8168_mac_ocp_write( tp, 0xF8D8, 0xC502 );
> +       r8168_mac_ocp_write( tp, 0xF8DA, 0xBD00 );
> +       r8168_mac_ocp_write( tp, 0xF8DC, 0x0132 );
> +       r8168_mac_ocp_write( tp, 0xFC26, 0x8000 );
> +       r8168_mac_ocp_write( tp, 0xFC2A, 0x0743 );
> +       r8168_mac_ocp_write( tp, 0xFC2C, 0x0801 );
> +       r8168_mac_ocp_write( tp, 0xFC2E, 0x0BE9 );
> +       r8168_mac_ocp_write( tp, 0xFC30, 0x02FD );
> +       r8168_mac_ocp_write( tp, 0xFC32, 0x0C25 );
> +       r8168_mac_ocp_write( tp, 0xFC34, 0x00A9 );
> +       r8168_mac_ocp_write( tp, 0xFC36, 0x012D );
>  }
>
>  static void rtl_hw_start_8168h_1(struct rtl8169_private *tp)
> --
> 2.20.1
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2018-12-21 15:16 UTC | newest]

Thread overview: 17+ messages
2018-12-13  2:20 A weird problem of Realtek r8168 after resume from S3 Chris Chiu
2018-12-14  3:33 ` Chris Chiu
2018-12-14  7:36   ` Heiner Kallweit
2018-12-17 13:25     ` Chris Chiu
2018-12-17 19:08       ` Heiner Kallweit
2018-12-18 13:25         ` Chris Chiu
2018-12-18 18:21           ` Heiner Kallweit
2018-12-19 14:37             ` Chris Chiu
2018-12-18 20:28           ` Heiner Kallweit
2018-12-19 15:32             ` Chris Chiu
2018-12-19 19:41               ` Heiner Kallweit
2018-12-20  9:43                 ` Chris Chiu
2018-12-20 18:48                   ` Heiner Kallweit
2018-12-20 19:21                   ` Heiner Kallweit
2018-12-21 15:16                     ` Chris Chiu
2018-12-17 21:45       ` Heiner Kallweit
2018-12-18 12:31         ` Chris Chiu
