stable.vger.kernel.org archive mirror
* [REGRESSION] Kernel 5.15 reboots / freezes upon ifup/ifdown
@ 2021-11-24  7:28 Stefan Dietrich
  2021-11-24  7:33 ` Greg KH
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Stefan Dietrich @ 2021-11-24  7:28 UTC
  To: stable; +Cc: regressions

Summary: When attempting to bring up or shut down a NIC manually or via
network-manager under 5.15, the machine reboots or freezes.

Occurs with: 5.15.4-051504-generic and earlier 5.15 mainline (
https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.15.4/) as well as
liquorix flavours.
Does not occur with: 5.14 and 5.13 (both with various flavours)


Hi all,

I'm experiencing a severe bug that causes the machine to reboot or
freeze when trying to log in and/or bring up/shut down a NIC. Here's a
brief description of the scenarios I've tested:

Scenario 1: enp6s0 managed manually using /etc/network/interfaces,
DHCP
a. Issuing ifdown enp6s0 in terminal will throw
	"/etc/resolvconf/update.d/libc: Warning: /etc/resolv.conf is
	not a symbolic link to /run/resolvconf/resolv.conf"
and cause the machine to reboot after ~10s of showing a blinking cursor

b. Issuing shutdown -h now or trying to shut down/reboot the machine via
GUI:
shutdown will stop on "stop job is running for ifdown enp6s0" and after
approx. 10..15s the countdown freezes. Repeated Alt-SysRq-REISUB does
not reboot the machine; a hard reset is required.

--

Scenario 2: enp6s0 managed manually using /etc/network/interfaces,
STATIC
a. Issuing ifdown enp6s0 in terminal will throw
	"send_packet: Operation not permitted
	dhclient.c:3010: Failed to send 300 byte long packet over
	fallback interface."
and cause the machine to reboot after ~10s of blinking cursor.

b. Issuing shutdown -h now or trying to shut down or reboot the machine
via GUI: shutdown will stop on "stop job is running for ifdown enp6s0"
and after approx. 10..15s the countdown freezes. Repeated
Alt-SysRq-REISUB does not reboot the machine; a hard reset is required.

--

Scenario 3: enp6s0 managed by NetworkManager
a. After booting and logging in either via GUI or TTY, the display will
stay blank and only show a blinking cursor and then freeze after
5..10s. Alt-SysRq-REISUB does not reboot the machine; a hard reset is
required.

--

Here's a snippet from the journal for Scenario 1a:

Nov 21 10:39:25 computer sudo[5606]:    user : TTY=pts/0 ; PWD=/home/user ; USER=root ; COMMAND=/usr/sbin/ifdown enp6s0
Nov 21 10:39:25 computer sudo[5606]: pam_unix(sudo:session): session opened for user root by (uid=0)
-- Reboot --
Nov 21 10:40:14 computer systemd-journald[478]: Journal started

--

I'm running Alder Lake i9 12900K but I have E-cores disabled in BIOS.
Here are some more specs with working kernel:

$ inxi -bxz
System:    Kernel: 5.14.0-19.2-liquorix-amd64 x86_64 bits: 64 compiler: N/A Desktop: Xfce 4.16.3
           Distro: Ubuntu 20.04.3 LTS (Focal Fossa)
Machine:   Type: Desktop System: ASUS product: N/A v: N/A serial: N/A
           Mobo: ASUSTeK model: ROG STRIX Z690-A GAMING WIFI D4 v: Rev 1.xx serial: <filter>
           UEFI [Legacy]: American Megatrends v: 0707 date: 11/10/2021
CPU:       8-Core: 12th Gen Intel Core i9-12900K type: MT MCP arch: N/A speed: 5381 MHz max: 3201 MHz
Graphics:  Device-1: NVIDIA vendor: Gigabyte driver: nvidia v: 470.86 bus ID: 01:00.0
           Display: server: X.Org 1.20.11 driver: nvidia tty: N/A
           Message: Unable to show advanced data. Required tool glxinfo missing.
Network:   Device-1: Intel vendor: ASUSTeK driver: igc v: kernel port: 4000 bus ID: 06:00.0


Please advise how I may assist in debugging!

Thanks.



* Re: [REGRESSION] Kernel 5.15 reboots / freezes upon ifup/ifdown
  2021-11-24  7:28 [REGRESSION] Kernel 5.15 reboots / freezes upon ifup/ifdown Stefan Dietrich
@ 2021-11-24  7:33 ` Greg KH
  2021-11-24  7:42   ` Stefan Dietrich
  2021-11-24 17:20   ` Stefan Dietrich
  2021-11-24  7:48 ` [REGRESSION] Kernel 5.15 reboots / freezes upon ifup/ifdown Thorsten Leemhuis
  2021-11-24  8:05 ` Stefan Dietrich
  2 siblings, 2 replies; 15+ messages in thread
From: Greg KH @ 2021-11-24  7:33 UTC
  To: Stefan Dietrich; +Cc: stable, regressions

On Wed, Nov 24, 2021 at 08:28:39AM +0100, Stefan Dietrich wrote:
> Summary: When attempting to bring up or shut down a NIC manually or via
> network-manager under 5.15, the machine reboots or freezes.
> 
> Occurs with: 5.15.4-051504-generic and earlier 5.15 mainline (
> https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.15.4/) as well as
> liquorix flavours.
> Does not occur with: 5.14 and 5.13 (both with various flavours)

Can you use 'git bisect' between 5.14 and 5.15 to find the problem
commit?

thanks,

greg k-h


* Re: [REGRESSION] Kernel 5.15 reboots / freezes upon ifup/ifdown
  2021-11-24  7:33 ` Greg KH
@ 2021-11-24  7:42   ` Stefan Dietrich
  2021-11-24 17:20   ` Stefan Dietrich
  1 sibling, 0 replies; 15+ messages in thread
From: Stefan Dietrich @ 2021-11-24  7:42 UTC
  To: Greg KH; +Cc: stable, regressions

Hi Greg,

I have never done a kernel bisect before, so I need to do some reading
first. I will report back a.s.a.p.


Stefan


On Wed, 2021-11-24 at 08:33 +0100, Greg KH wrote:
> On Wed, Nov 24, 2021 at 08:28:39AM +0100, Stefan Dietrich wrote:
> > Summary: When attempting to bring up or shut down a NIC manually or via
> > network-manager under 5.15, the machine reboots or freezes.
> >
> > Occurs with: 5.15.4-051504-generic and earlier 5.15 mainline (
> > https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.15.4/) as well as
> > liquorix flavours.
> > Does not occur with: 5.14 and 5.13 (both with various flavours)
>
> Can you use 'git bisect' between 5.14 and 5.15 to find the problem
> commit?
>
> thanks,
>
> greg k-h



* Re: [REGRESSION] Kernel 5.15 reboots / freezes upon ifup/ifdown
  2021-11-24  7:28 [REGRESSION] Kernel 5.15 reboots / freezes upon ifup/ifdown Stefan Dietrich
  2021-11-24  7:33 ` Greg KH
@ 2021-11-24  7:48 ` Thorsten Leemhuis
  2021-11-24  8:05 ` Stefan Dietrich
  2 siblings, 0 replies; 15+ messages in thread
From: Thorsten Leemhuis @ 2021-11-24  7:48 UTC
  To: Stefan Dietrich, stable; +Cc: regressions

Hi, this is your Linux kernel regression tracker speaking.

On 24.11.21 08:28, Stefan Dietrich wrote:
> Summary: When attempting to bring up or shut down a NIC manually or via
> network-manager under 5.15, the machine reboots or freezes.
> 
> Occurs with: 5.15.4-051504-generic and earlier 5.15 mainline (
> https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.15.4/) as well as
> liquorix flavours.
> Does not occur with: 5.14 and 5.13 (both with various flavours)

Thx for the report. Small detail: you CCed the stable list, but this
afaics is a mainline regression. Likely one in the network subsystem, so
it might be good to get the mailing list where the network developers
hang out in the loop. But as Greg already said: a bisection would help a
lot to find the root cause and thus the developers that need to take
care of this.

Anyway, to be sure this issue doesn't fall through the cracks unnoticed,
I'm adding it to regzbot, my Linux kernel regression tracking bot:

#regzbot ^introduced v5.14..v5.15
#regzbot ignore-activity

Ciao, Thorsten, your Linux kernel regression tracker.

P.S.: If you want to know more about regzbot, check out its
web-interface, the getting started guide, and/or the reference documentation:

https://linux-regtracking.leemhuis.info/regzbot/
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md

The last two documents will explain how you can interact with regzbot
yourself if you want to.

Hint for the reporter: when reporting a regression it's in your interest
to tell #regzbot about it in the report, as that will ensure the
regression gets on the radar of regzbot and the regression tracker, who
will make sure the report won't fall through the cracks unnoticed.

Hint for developers: you normally don't need to care about regzbot, just
fix the issue as you normally would. Just remember to include a 'Link:'
tag to the report in the commit message, as explained in
Documentation/process/submitting-patches.rst
That aspect was recently made more explicit in commit 1f57bd42b77c:
https://git.kernel.org/linus/1f57bd42b77c

P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
on my table. I can only look briefly into most of them. Unfortunately
I will therefore sometimes get things wrong or miss something important.
I hope that's not the case here; if you think it is, don't hesitate to
tell me about it in a public reply. That's in everyone's interest, as
what I wrote above might be misleading to everyone reading this; any
suggestion I gave might thus send someone reading this down the
wrong rabbit hole, which none of us wants.

BTW, I have no personal interest in this issue, which is tracked using
regzbot, my Linux kernel regression tracking bot
(https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
this mail to get things rolling again and hence don't need to be CCed on
all further activities wrt this regression.


* Re: [REGRESSION] Kernel 5.15 reboots / freezes upon ifup/ifdown
  2021-11-24  7:28 [REGRESSION] Kernel 5.15 reboots / freezes upon ifup/ifdown Stefan Dietrich
  2021-11-24  7:33 ` Greg KH
  2021-11-24  7:48 ` [REGRESSION] Kernel 5.15 reboots / freezes upon ifup/ifdown Thorsten Leemhuis
@ 2021-11-24  8:05 ` Stefan Dietrich
  2 siblings, 0 replies; 15+ messages in thread
From: Stefan Dietrich @ 2021-11-24  8:05 UTC
  To: stable, netdev, Thorsten Leemhuis; +Cc: regressions

Hi Thorsten,

thanks for the pointer. netdev should be in the loop now.


Stefan




* Re: [REGRESSION] Kernel 5.15 reboots / freezes upon ifup/ifdown
  2021-11-24  7:33 ` Greg KH
  2021-11-24  7:42   ` Stefan Dietrich
@ 2021-11-24 17:20   ` Stefan Dietrich
  2021-11-24 23:34     ` Jakub Kicinski
  1 sibling, 1 reply; 15+ messages in thread
From: Stefan Dietrich @ 2021-11-24 17:20 UTC
  To: Greg KH, netdev; +Cc: stable, regressions

Hi all,

six exciting hours and a lot of learning later, here it is.
Symptomatically, the critical commit appears for me between
5.14.21-051421-generic and 5.15.0-051500rc2-generic - I did not find an
amd64 build for rc1.

Please see the git-bisect output below and let me know how I may
further assist in debugging!


Cheers,
Stefan


a90ec84837325df4b9a6798c2cc0df202b5680bd is the first bad commit
commit a90ec84837325df4b9a6798c2cc0df202b5680bd
Author: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Date:   Mon Jul 26 20:36:57 2021 -0700

    igc: Add support for PTP getcrosststamp()

    i225 supports PCIe Precision Time Measurement (PTM), allowing us to
    support the PTP_SYS_OFFSET_PRECISE ioctl() in the driver via the
    getcrosststamp() function.

    The easiest way to expose the PTM registers would be to configure the PTM
    dialogs to run periodically, but the PTP_SYS_OFFSET_PRECISE ioctl()
    semantics are more aligned to using a kind of "one-shot" way of retrieving
    the PTM timestamps. But this causes a bit more code to be written: the
    trigger registers for the PTM dialogs are not cleared automatically.

    i225 can be configured to send "fake" packets with the PTM
    information, adding support for handling these types of packets is
    left for the future.

    PTM improves the accuracy of time synchronization, for example, using
    phc2sys, while a simple application is sending packets as fast as
    possible. First, without .getcrosststamp():

    phc2sys[191.382]: enp4s0 sys offset      -959 s2 freq    -454 delay   4492
    phc2sys[191.482]: enp4s0 sys offset       798 s2 freq   +1015 delay   4069
    phc2sys[191.583]: enp4s0 sys offset       962 s2 freq   +1418 delay   3849
    phc2sys[191.683]: enp4s0 sys offset       924 s2 freq   +1669 delay   3753
    phc2sys[191.783]: enp4s0 sys offset       664 s2 freq   +1686 delay   3349
    phc2sys[191.883]: enp4s0 sys offset       218 s2 freq   +1439 delay   2585
    phc2sys[191.983]: enp4s0 sys offset       761 s2 freq   +2048 delay   3750
    phc2sys[192.083]: enp4s0 sys offset       756 s2 freq   +2271 delay   4061
    phc2sys[192.183]: enp4s0 sys offset       809 s2 freq   +2551 delay   4384
    phc2sys[192.283]: enp4s0 sys offset      -108 s2 freq   +1877 delay   2480
    phc2sys[192.383]: enp4s0 sys offset     -1145 s2 freq    +807 delay   4438
    phc2sys[192.484]: enp4s0 sys offset       571 s2 freq   +2180 delay   3849
    phc2sys[192.584]: enp4s0 sys offset       241 s2 freq   +2021 delay   3389
    phc2sys[192.684]: enp4s0 sys offset       405 s2 freq   +2257 delay   3829
    phc2sys[192.784]: enp4s0 sys offset        17 s2 freq   +1991 delay   3273
    phc2sys[192.884]: enp4s0 sys offset       152 s2 freq   +2131 delay   3948
    phc2sys[192.984]: enp4s0 sys offset      -187 s2 freq   +1837 delay   3162
    phc2sys[193.084]: enp4s0 sys offset     -1595 s2 freq    +373 delay   4557
    phc2sys[193.184]: enp4s0 sys offset       107 s2 freq   +1597 delay   3740
    phc2sys[193.284]: enp4s0 sys offset       199 s2 freq   +1721 delay   4010
    phc2sys[193.385]: enp4s0 sys offset      -169 s2 freq   +1413 delay   3701
    phc2sys[193.485]: enp4s0 sys offset       -47 s2 freq   +1484 delay   3581
    phc2sys[193.585]: enp4s0 sys offset       -65 s2 freq   +1452 delay   3778
    phc2sys[193.685]: enp4s0 sys offset        95 s2 freq   +1592 delay   3888
    phc2sys[193.785]: enp4s0 sys offset       206 s2 freq   +1732 delay   4445
    phc2sys[193.885]: enp4s0 sys offset      -652 s2 freq    +936 delay   2521
    phc2sys[193.985]: enp4s0 sys offset      -203 s2 freq   +1189 delay   3391
    phc2sys[194.085]: enp4s0 sys offset      -376 s2 freq    +955 delay   2951
    phc2sys[194.185]: enp4s0 sys offset      -134 s2 freq   +1084 delay   3330
    phc2sys[194.285]: enp4s0 sys offset       -22 s2 freq   +1156 delay   3479
    phc2sys[194.386]: enp4s0 sys offset        32 s2 freq   +1204 delay   3602
    phc2sys[194.486]: enp4s0 sys offset       122 s2 freq   +1303 delay   3731

    Statistics for this run (total of 2179 lines), in nanoseconds:
      average: -1.12
      stdev: 634.80
      max: 1551
      min: -2215

    With .getcrosststamp() via PCIe PTM:

    phc2sys[367.859]: enp4s0 sys offset         6 s2 freq   +1727 delay      0
    phc2sys[367.959]: enp4s0 sys offset        -2 s2 freq   +1721 delay      0
    phc2sys[368.059]: enp4s0 sys offset         5 s2 freq   +1727 delay      0
    phc2sys[368.160]: enp4s0 sys offset        -1 s2 freq   +1723 delay      0
    phc2sys[368.260]: enp4s0 sys offset        -4 s2 freq   +1719 delay      0
    phc2sys[368.360]: enp4s0 sys offset        -5 s2 freq   +1717 delay      0
    phc2sys[368.460]: enp4s0 sys offset         1 s2 freq   +1722 delay      0
    phc2sys[368.560]: enp4s0 sys offset        -3 s2 freq   +1718 delay      0
    phc2sys[368.660]: enp4s0 sys offset         5 s2 freq   +1725 delay      0
    phc2sys[368.760]: enp4s0 sys offset        -1 s2 freq   +1721 delay      0
    phc2sys[368.860]: enp4s0 sys offset         0 s2 freq   +1721 delay      0
    phc2sys[368.960]: enp4s0 sys offset         0 s2 freq   +1721 delay      0
    phc2sys[369.061]: enp4s0 sys offset         4 s2 freq   +1725 delay      0
    phc2sys[369.161]: enp4s0 sys offset         1 s2 freq   +1724 delay      0
    phc2sys[369.261]: enp4s0 sys offset         4 s2 freq   +1727 delay      0
    phc2sys[369.361]: enp4s0 sys offset         8 s2 freq   +1732 delay      0
    phc2sys[369.461]: enp4s0 sys offset         7 s2 freq   +1733 delay      0
    phc2sys[369.561]: enp4s0 sys offset         4 s2 freq   +1733 delay      0
    phc2sys[369.661]: enp4s0 sys offset         1 s2 freq   +1731 delay      0
    phc2sys[369.761]: enp4s0 sys offset         1 s2 freq   +1731 delay      0
    phc2sys[369.861]: enp4s0 sys offset        -5 s2 freq   +1725 delay      0
    phc2sys[369.961]: enp4s0 sys offset        -4 s2 freq   +1725 delay      0
    phc2sys[370.062]: enp4s0 sys offset         2 s2 freq   +1730 delay      0
    phc2sys[370.162]: enp4s0 sys offset        -7 s2 freq   +1721 delay      0
    phc2sys[370.262]: enp4s0 sys offset        -3 s2 freq   +1723 delay      0
    phc2sys[370.362]: enp4s0 sys offset         1 s2 freq   +1726 delay      0
    phc2sys[370.462]: enp4s0 sys offset        -3 s2 freq   +1723 delay      0
    phc2sys[370.562]: enp4s0 sys offset        -1 s2 freq   +1724 delay      0
    phc2sys[370.662]: enp4s0 sys offset        -4 s2 freq   +1720 delay      0
    phc2sys[370.762]: enp4s0 sys offset        -7 s2 freq   +1716 delay      0
    phc2sys[370.862]: enp4s0 sys offset        -2 s2 freq   +1719 delay      0

    Statistics for this run (total of 2179 lines), in nanoseconds:
      average: 0.14
      stdev: 5.03
      max: 48
      min: -27

    For reference, the statistics for runs without PCIe congestion show
    that the improvements from enabling PTM are less dramatic. For two
    runs of 16466 entries:
      without PTM: avg -0.04 stdev 10.57 max 39 min -42
      with PTM: avg 0.01 stdev 4.20 max 19 min -16

    One possible explanation is that when PTM is not enabled, and there's a lot
    of traffic in the PCIe fabric, some register reads will take more time
    than the others because of congestion on the PCIe fabric.

    When PTM is enabled, even if the PTM dialogs take more time to
    complete under heavy traffic, the time measurements do not depend on
    the time to read the registers.

    This was implemented following the i225 EAS version 0.993.

    Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
    Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
    Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

 drivers/net/ethernet/intel/igc/igc.h         |   1 +
 drivers/net/ethernet/intel/igc/igc_defines.h |  31 +++++
 drivers/net/ethernet/intel/igc/igc_ptp.c     | 179 +++++++++++++++++++++++++++
 drivers/net/ethernet/intel/igc/igc_regs.h    |  23 ++++
 4 files changed, 234 insertions(+)
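
The "Statistics for this run" figures in the commit message above can
be derived from phc2sys log lines with a short script. The sketch below
is an assumption about how such numbers might be computed, not the
commit author's actual tooling: the function name phc2sys_offset_stats
is hypothetical, and it uses the population standard deviation, which
the commit message does not specify.

```shell
#!/bin/sh
# phc2sys_offset_stats: read phc2sys log lines on stdin and print the
# average, standard deviation, maximum and minimum of the "offset" values.
# Hypothetical helper; population stdev is an assumption.
phc2sys_offset_stats() {
    awk '
    {
        # The offset value is the field that follows the word "offset".
        for (i = 1; i < NF; i++) {
            if ($i == "offset") {
                v = $(i + 1) + 0
                if (n == 0 || v > max) max = v
                if (n == 0 || v < min) min = v
                sum += v
                sumsq += v * v
                n++
                break
            }
        }
    }
    END {
        if (n == 0) exit 1
        mean = sum / n
        var = sumsq / n - mean * mean
        if (var < 0) var = 0    # guard against rounding noise
        printf "average: %.2f\nstdev: %.2f\nmax: %d\nmin: %d\n", mean, sqrt(var), max, min
    }'
}

# Usage with the first four "with PTM" samples quoted in the commit message:
printf '%s\n' \
    'phc2sys[367.859]: enp4s0 sys offset         6 s2 freq   +1727 delay      0' \
    'phc2sys[367.959]: enp4s0 sys offset        -2 s2 freq   +1721 delay      0' \
    'phc2sys[368.059]: enp4s0 sys offset         5 s2 freq   +1727 delay      0' \
    'phc2sys[368.160]: enp4s0 sys offset        -1 s2 freq   +1723 delay      0' \
    | phc2sys_offset_stats
```

Looking the value up by the word "offset" rather than by a fixed column
keeps the helper working if the interface name or prefix changes.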


On Wed, 2021-11-24 at 08:33 +0100, Greg KH wrote:
> On Wed, Nov 24, 2021 at 08:28:39AM +0100, Stefan Dietrich wrote:
> > Summary: When attempting to bring up or shut down a NIC manually or via
> > network-manager under 5.15, the machine reboots or freezes.
> >
> > Occurs with: 5.15.4-051504-generic and earlier 5.15 mainline (
> > https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.15.4/) as well as
> > liquorix flavours.
> > Does not occur with: 5.14 and 5.13 (both with various flavours)
>
> Can you use 'git bisect' between 5.14 and 5.15 to find the problem
> commit?
>
> thanks,
>
> greg k-h



* Re: [REGRESSION] Kernel 5.15 reboots / freezes upon ifup/ifdown
  2021-11-24 17:20   ` Stefan Dietrich
@ 2021-11-24 23:34     ` Jakub Kicinski
  2021-11-25  1:07       ` Vinicius Costa Gomes
  0 siblings, 1 reply; 15+ messages in thread
From: Jakub Kicinski @ 2021-11-24 23:34 UTC
  To: Stefan Dietrich
  Cc: Greg KH, netdev, stable, regressions, Vinicius Costa Gomes,
	Dvora Fuxbrumer, Tony Nguyen, intel-wired-lan

On Wed, 24 Nov 2021 18:20:40 +0100 Stefan Dietrich wrote:
> Hi all,
> 
> six exciting hours and a lot of learning later, here it is.
> Symptomatically, the critical commit appears for me between
> 5.14.21-051421-generic and 5.15.0-051500rc2-generic - I did not find an
> amd64 build for rc1.
> 
> Please see the git-bisect output below and let me know how I may
> further assist in debugging!

Well, let's CC those involved, shall we? :)

Thanks for working thru the bisection!

> a90ec84837325df4b9a6798c2cc0df202b5680bd is the first bad commit
> commit a90ec84837325df4b9a6798c2cc0df202b5680bd
> Author: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> Date:   Mon Jul 26 20:36:57 2021 -0700
> 
>     igc: Add support for PTP getcrosststamp()


* Re: [REGRESSION] Kernel 5.15 reboots / freezes upon ifup/ifdown
  2021-11-24 23:34     ` Jakub Kicinski
@ 2021-11-25  1:07       ` Vinicius Costa Gomes
  2021-11-25  1:13         ` Jakub Kicinski
  2021-11-25  8:41         ` Stefan Dietrich
  0 siblings, 2 replies; 15+ messages in thread
From: Vinicius Costa Gomes @ 2021-11-25  1:07 UTC (permalink / raw)
  To: Jakub Kicinski, Stefan Dietrich
  Cc: Greg KH, netdev, stable, regressions, Dvora Fuxbrumer,
	Tony Nguyen, intel-wired-lan

Hi Stefan,

Jakub Kicinski <kuba@kernel.org> writes:

> On Wed, 24 Nov 2021 18:20:40 +0100 Stefan Dietrich wrote:
>> Hi all,
>> 
>> six exciting hours and a lot of learning later, here it is.
>> Symptomatically, the critical commit appears for me between
>> 5.14.21-051421-generic and 5.15.0-051500rc2-generic - I did not find
>> an amd64 build for rc1.
>> 
>> Please see the git-bisect output below and let me know how I may
>> further assist in debugging!
>
> Well, let's CC those involved, shall we? :)
>
> Thanks for working thru the bisection!
>
>> a90ec84837325df4b9a6798c2cc0df202b5680bd is the first bad commit
>> commit a90ec84837325df4b9a6798c2cc0df202b5680bd
>> Author: Vinicius Costa Gomes <vinicius.gomes@intel.com>
>> Date:   Mon Jul 26 20:36:57 2021 -0700
>> 
>>     igc: Add support for PTP getcrosststamp()

Oh! That's interesting.

Can you try disabling CONFIG_PCIE_PTM in your kernel config? If it
works, then it's a point in favor that this commit is indeed the
problematic one.

I am still trying to think of what could be causing the lockup you are
seeing.


Cheers,
-- 
Vinicius

* Re: [REGRESSION] Kernel 5.15 reboots / freezes upon ifup/ifdown
  2021-11-25  1:07       ` Vinicius Costa Gomes
@ 2021-11-25  1:13         ` Jakub Kicinski
  2021-11-25  8:41         ` Stefan Dietrich
  1 sibling, 0 replies; 15+ messages in thread
From: Jakub Kicinski @ 2021-11-25  1:13 UTC (permalink / raw)
  To: Vinicius Costa Gomes
  Cc: Stefan Dietrich, Greg KH, netdev, stable, regressions,
	Dvora Fuxbrumer, Tony Nguyen, intel-wired-lan

On Wed, 24 Nov 2021 17:07:16 -0800 Vinicius Costa Gomes wrote:
> Hi Stefan,
> 
> Jakub Kicinski <kuba@kernel.org> writes:
> 
> > On Wed, 24 Nov 2021 18:20:40 +0100 Stefan Dietrich wrote:  
> >> Hi all,
> >> 
> >> six exciting hours and a lot of learning later, here it is.
> >> Symptomatically, the critical commit appears for me between
> >> 5.14.21-051421-generic and 5.15.0-051500rc2-generic - I did not find
> >> an amd64 build for rc1.
> >> 
> >> Please see the git-bisect output below and let me know how I may
> >> further assist in debugging!  
> >
> > Well, let's CC those involved, shall we? :)
> >
> > Thanks for working thru the bisection!
> >  
> >> a90ec84837325df4b9a6798c2cc0df202b5680bd is the first bad commit
> >> commit a90ec84837325df4b9a6798c2cc0df202b5680bd
> >> Author: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> >> Date:   Mon Jul 26 20:36:57 2021 -0700
> >> 
> >>     igc: Add support for PTP getcrosststamp()  
> 
> Oh! That's interesting.
> 
> Can you try disabling CONFIG_PCIE_PTM in your kernel config? If it
> works, then it's a point in favor that this commit is indeed the
> problematic one.
> 
> I am still trying to think of what could be causing the lockup you are
> seeing.

Actually we just had another report pointing at commit f32a21376573
("ethtool: runtime-resume netdev parent before ethtool ioctl ops").
That seems more likely :(

* Re: [REGRESSION] Kernel 5.15 reboots / freezes upon ifup/ifdown
  2021-11-25  1:07       ` Vinicius Costa Gomes
  2021-11-25  1:13         ` Jakub Kicinski
@ 2021-11-25  8:41         ` Stefan Dietrich
  2021-12-01 11:45           ` Thorsten Leemhuis
  1 sibling, 1 reply; 15+ messages in thread
From: Stefan Dietrich @ 2021-11-25  8:41 UTC (permalink / raw)
  To: Vinicius Costa Gomes, Jakub Kicinski
  Cc: Greg KH, netdev, stable, regressions, Dvora Fuxbrumer,
	Tony Nguyen, intel-wired-lan

Hi Vinicius,

thanks - this was spot-on: disabling CONFIG_PCIE_PTM resolves the issue
for latest 5.15.4 (stable from git) for both manual and network-manager
NIC configuration.

Let me know if I may assist in debugging this further.


Cheers,
Stefan


On Wed, 2021-11-24 at 17:07 -0800, Vinicius Costa Gomes wrote:
> Hi Stefan,
>
> Jakub Kicinski <kuba@kernel.org> writes:
>
> > On Wed, 24 Nov 2021 18:20:40 +0100 Stefan Dietrich wrote:
> > > Hi all,
> > >
> > > six exciting hours and a lot of learning later, here it is.
> > > > Symptomatically, the critical commit appears for me between
> > > > 5.14.21-051421-generic and 5.15.0-051500rc2-generic - I did not find
> > > > an amd64 build for rc1.
> > >
> > > Please see the git-bisect output below and let me know how I may
> > > further assist in debugging!
> >
> > Well, let's CC those involved, shall we? :)
> >
> > Thanks for working thru the bisection!
> >
> > > a90ec84837325df4b9a6798c2cc0df202b5680bd is the first bad commit
> > > commit a90ec84837325df4b9a6798c2cc0df202b5680bd
> > > Author: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> > > Date:   Mon Jul 26 20:36:57 2021 -0700
> > >
> > >     igc: Add support for PTP getcrosststamp()
>
> Oh! That's interesting.
>
> Can you try disabling CONFIG_PCIE_PTM in your kernel config? If it
> works, then it's a point in favor that this commit is indeed the
> problematic one.
>
> I am still trying to think of what could be causing the lockup you
> are seeing.
>
>
> Cheers,


* Re: [REGRESSION] Kernel 5.15 reboots / freezes upon ifup/ifdown
  2021-11-25  8:41         ` Stefan Dietrich
@ 2021-12-01 11:45           ` Thorsten Leemhuis
  2021-12-01 17:47             ` Vinicius Costa Gomes
  0 siblings, 1 reply; 15+ messages in thread
From: Thorsten Leemhuis @ 2021-12-01 11:45 UTC (permalink / raw)
  To: Stefan Dietrich, Vinicius Costa Gomes, Jakub Kicinski
  Cc: Greg KH, netdev, stable, regressions, Dvora Fuxbrumer,
	Tony Nguyen, intel-wired-lan

Hi, this is your Linux kernel regression tracker speaking.

On 25.11.21 09:41, Stefan Dietrich wrote:
> 
> thanks - this was spot-on: disabling CONFIG_PCIE_PTM resolves the issue
> for latest 5.15.4 (stable from git) for both manual and network-manager
> NIC configuration.
> 
> Let me know if I may assist in debugging this further.

What is the status here? Afaics there hasn't been any progress for
nearly a week.

Vinicius, do you still have this on your radar? Or was there some progress?

Or is this really related to another issue, as Jakub suspected? Then it
might be solved by the patch here:

https://bugzilla.kernel.org/show_bug.cgi?id=215129

Ciao, Thorsten

> On Wed, 2021-11-24 at 17:07 -0800, Vinicius Costa Gomes wrote:
>> Hi Stefan,
>>
>> Jakub Kicinski <kuba@kernel.org> writes:
>>
>>> On Wed, 24 Nov 2021 18:20:40 +0100 Stefan Dietrich wrote:
>>>> Hi all,
>>>>
>>>> six exciting hours and a lot of learning later, here it is.
>>>> Symptomatically, the critical commit appears for me between
>>>> 5.14.21-051421-generic and 5.15.0-051500rc2-generic - I did not find
>>>> an amd64 build for rc1.
>>>>
>>>> Please see the git-bisect output below and let me know how I may
>>>> further assist in debugging!
>>>
>>> Well, let's CC those involved, shall we? :)
>>>
>>> Thanks for working thru the bisection!
>>>
>>>> a90ec84837325df4b9a6798c2cc0df202b5680bd is the first bad commit
>>>> commit a90ec84837325df4b9a6798c2cc0df202b5680bd
>>>> Author: Vinicius Costa Gomes <vinicius.gomes@intel.com>
>>>> Date:   Mon Jul 26 20:36:57 2021 -0700
>>>>
>>>>     igc: Add support for PTP getcrosststamp()
>>
>> Oh! That's interesting.
>>
>> Can you try disabling CONFIG_PCIE_PTM in your kernel config? If it
>> works, then it's a point in favor that this commit is indeed the
>> problematic one.
>>
>> I am still trying to think of what could be causing the lockup you
>> are seeing.
>>
>>

P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
on my table. I can only look briefly into most of them. Unfortunately
therefore I sometimes will get things wrong or miss something important.
I hope that's not the case here; if you think it is, don't hesitate to
tell me about it in a public reply. That's in everyone's interest, as
what I wrote above might be misleading to everyone reading this; any
suggestion I gave might thus send someone reading this down the
wrong rabbit hole, which none of us wants.

BTW, I have no personal interest in this issue, which is tracked using
regzbot, my Linux kernel regression tracking bot
(https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
this mail to get things rolling again and hence don't need to be CCed on
all further activities wrt this regression.

#regzbot poke

* Re: [REGRESSION] Kernel 5.15 reboots / freezes upon ifup/ifdown
  2021-12-01 11:45           ` Thorsten Leemhuis
@ 2021-12-01 17:47             ` Vinicius Costa Gomes
  2021-12-01 18:57               ` [PATCH] igc: Avoid possible deadlock during suspend/resume Vinicius Costa Gomes
  0 siblings, 1 reply; 15+ messages in thread
From: Vinicius Costa Gomes @ 2021-12-01 17:47 UTC (permalink / raw)
  To: Thorsten Leemhuis, Stefan Dietrich, Jakub Kicinski
  Cc: Greg KH, netdev, stable, regressions, Dvora Fuxbrumer,
	Tony Nguyen, intel-wired-lan

Hi,

Thorsten Leemhuis <regressions@leemhuis.info> writes:

> Hi, this is your Linux kernel regression tracker speaking.
>
> On 25.11.21 09:41, Stefan Dietrich wrote:
>> 
>> thanks - this was spot-on: disabling CONFIG_PCIE_PTM resolves the issue
>> for latest 5.15.4 (stable from git) for both manual and network-manager
>> NIC configuration.
>> 
>> Let me know if I may assist in debugging this further.
>
> What is the status here? Afaics there hasn't been any progress for
> nearly a week.
>
> Vinicius, do you still have this on your radar? Or was there some progress?
>
> Or is this really related to another issue, as Jakub suspected? Then it
> might be solved by the patch here:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=215129

What I am thinking right now is that we are facing a similar problem as
the bug above, only in the igc driver. The difference is that it's the
PCIe PTM messages (from the PCIe root) that are triggering the deadlock
in the suspend/resume path in igc.

I will produce a patch in a few moments, very similar to the one in the
bug report, let's see if it helps.
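
The locking pattern described above can be sketched in userspace as
follows. This is only an illustrative model, not the igc driver code:
struct model_lock, model_resume() and model_locked_caller() are
hypothetical names standing in for RTNL, the resume handler and a caller
that already holds RTNL when the runtime resume is triggered. Where this
model returns -EDEADLK, a real non-recursive mutex would block forever.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a non-recursive lock such as RTNL. */
struct model_lock {
	bool held;
};

/* Returns 0 on success, -EDEADLK where a real mutex would hang. */
static int model_lock_acquire(struct model_lock *lock)
{
	if (lock->held)
		return -EDEADLK;
	lock->held = true;
	return 0;
}

static void model_lock_release(struct model_lock *lock)
{
	lock->held = false;
}

/*
 * Models a resume handler with an "rpm" flag: when rpm is true the
 * caller is assumed to hold the lock already, so it is not taken again.
 */
static int model_resume(struct model_lock *lock, bool rpm)
{
	int err;

	if (!rpm) {
		err = model_lock_acquire(lock);
		if (err)
			return err;
	}

	/* ... the device would be reopened and reattached here ... */

	if (!rpm)
		model_lock_release(lock);

	return 0;
}

/* Models a path that holds the lock and then triggers a resume. */
static int model_locked_caller(struct model_lock *lock, bool rpm)
{
	int err;

	err = model_lock_acquire(lock);
	if (err)
		return err;

	err = model_resume(lock, rpm);
	model_lock_release(lock);

	return err;
}
```

Calling model_locked_caller() with rpm set to false reports the
self-deadlock, while rpm set to true completes. The trade-off is that
the handler then relies on every runtime-resume caller really holding
the lock, which is an assumption of this sketch.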

>
> Ciao, Thorsten
>
>> On Wed, 2021-11-24 at 17:07 -0800, Vinicius Costa Gomes wrote:
>>> Hi Stefan,
>>>
>>> Jakub Kicinski <kuba@kernel.org> writes:
>>>
>>>> On Wed, 24 Nov 2021 18:20:40 +0100 Stefan Dietrich wrote:
>>>>> Hi all,
>>>>>
>>>>> six exciting hours and a lot of learning later, here it is.
>>>>> Symptomatically, the critical commit appears for me between
>>>>> 5.14.21-051421-generic and 5.15.0-051500rc2-generic - I did not find
>>>>> an amd64 build for rc1.
>>>>>
>>>>> Please see the git-bisect output below and let me know how I may
>>>>> further assist in debugging!
>>>>
>>>> Well, let's CC those involved, shall we? :)
>>>>
>>>> Thanks for working thru the bisection!
>>>>
>>>>> a90ec84837325df4b9a6798c2cc0df202b5680bd is the first bad commit
>>>>> commit a90ec84837325df4b9a6798c2cc0df202b5680bd
>>>>> Author: Vinicius Costa Gomes <vinicius.gomes@intel.com>
>>>>> Date:   Mon Jul 26 20:36:57 2021 -0700
>>>>>
>>>>>     igc: Add support for PTP getcrosststamp()
>>>
>>> Oh! That's interesting.
>>>
>>> Can you try disabling CONFIG_PCIE_PTM in your kernel config? If it
>>> works, then it's a point in favor that this commit is indeed the
>>> problematic one.
>>>
>>> I am still trying to think of what could be causing the lockup you
>>> are seeing.
>>>
>>>
>
> P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
> on my table. I can only look briefly into most of them. Unfortunately
> therefore I sometimes will get things wrong or miss something important.
> I hope that's not the case here; if you think it is, don't hesitate to
> tell me about it in a public reply. That's in everyone's interest, as
> what I wrote above might be misleading to everyone reading this; any
> suggestion I gave might thus send someone reading this down the
> wrong rabbit hole, which none of us wants.
>
> BTW, I have no personal interest in this issue, which is tracked using
> regzbot, my Linux kernel regression tracking bot
> (https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
> this mail to get things rolling again and hence don't need to be CCed on
> all further activities wrt this regression.
>
> #regzbot poke

-- 
Vinicius

* [PATCH] igc: Avoid possible deadlock during suspend/resume
  2021-12-01 17:47             ` Vinicius Costa Gomes
@ 2021-12-01 18:57               ` Vinicius Costa Gomes
  2021-12-02  6:41                 ` Greg KH
  0 siblings, 1 reply; 15+ messages in thread
From: Vinicius Costa Gomes @ 2021-12-01 18:57 UTC (permalink / raw)
  To: roots
  Cc: Vinicius Costa Gomes, kuba, greg, netdev, intel-wired-lan,
	stable, regressions

Inspired by:
https://bugzilla.kernel.org/show_bug.cgi?id=215129

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
Just to see if it's indeed the same problem as the bug report above.

 drivers/net/ethernet/intel/igc/igc_main.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index 0e19b4d02e62..c58bf557a2a1 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -6619,7 +6619,7 @@ static void igc_deliver_wake_packet(struct net_device *netdev)
 	netif_rx(skb);
 }
 
-static int __maybe_unused igc_resume(struct device *dev)
+static int __maybe_unused __igc_resume(struct device *dev, bool rpm)
 {
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct net_device *netdev = pci_get_drvdata(pdev);
@@ -6661,20 +6661,27 @@ static int __maybe_unused igc_resume(struct device *dev)
 
 	wr32(IGC_WUS, ~0);
 
-	rtnl_lock();
+	if (!rpm)
+		rtnl_lock();
 	if (!err && netif_running(netdev))
 		err = __igc_open(netdev, true);
 
 	if (!err)
 		netif_device_attach(netdev);
-	rtnl_unlock();
+	if (!rpm)
+		rtnl_unlock();
 
 	return err;
 }
 
 static int __maybe_unused igc_runtime_resume(struct device *dev)
 {
-	return igc_resume(dev);
+	return __igc_resume(dev, true);
+}
+
+static int __maybe_unused igc_resume(struct device *dev)
+{
+	return __igc_resume(dev, false);
 }
 
 static int __maybe_unused igc_suspend(struct device *dev)
@@ -6738,7 +6745,7 @@ static pci_ers_result_t igc_io_error_detected(struct pci_dev *pdev,
  *  @pdev: Pointer to PCI device
  *
  *  Restart the card from scratch, as if from a cold-boot. Implementation
- *  resembles the first-half of the igc_resume routine.
+ *  resembles the first-half of the __igc_resume routine.
  **/
 static pci_ers_result_t igc_io_slot_reset(struct pci_dev *pdev)
 {
@@ -6777,7 +6784,7 @@ static pci_ers_result_t igc_io_slot_reset(struct pci_dev *pdev)
  *
  *  This callback is called when the error recovery driver tells us that
  *  its OK to resume normal operation. Implementation resembles the
- *  second-half of the igc_resume routine.
+ *  second-half of the __igc_resume routine.
  */
 static void igc_io_resume(struct pci_dev *pdev)
 {
-- 
2.33.1


* Re: [PATCH] igc: Avoid possible deadlock during suspend/resume
  2021-12-01 18:57               ` [PATCH] igc: Avoid possible deadlock during suspend/resume Vinicius Costa Gomes
@ 2021-12-02  6:41                 ` Greg KH
  2021-12-02  6:50                   ` Vinicius Costa Gomes
  0 siblings, 1 reply; 15+ messages in thread
From: Greg KH @ 2021-12-02  6:41 UTC (permalink / raw)
  To: Vinicius Costa Gomes
  Cc: roots, kuba, netdev, intel-wired-lan, stable, regressions

On Wed, Dec 01, 2021 at 10:57:31AM -0800, Vinicius Costa Gomes wrote:
> Inspired by:
> https://bugzilla.kernel.org/show_bug.cgi?id=215129
> 

This changelog does not say anything at all, sorry.  Please explain what
is happening here as the kernel documentation asks you to.

> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> ---
> Just to see if it's indeed the same problem as the bug report above.

<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read:
    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>

* Re: [PATCH] igc: Avoid possible deadlock during suspend/resume
  2021-12-02  6:41                 ` Greg KH
@ 2021-12-02  6:50                   ` Vinicius Costa Gomes
  0 siblings, 0 replies; 15+ messages in thread
From: Vinicius Costa Gomes @ 2021-12-02  6:50 UTC (permalink / raw)
  To: Greg KH; +Cc: roots, kuba, netdev, intel-wired-lan, stable, regressions

Greg KH <gregkh@linuxfoundation.org> writes:

> On Wed, Dec 01, 2021 at 10:57:31AM -0800, Vinicius Costa Gomes wrote:
>> Inspired by:
>> https://bugzilla.kernel.org/show_bug.cgi?id=215129
>> 
>
> This changelog does not say anything at all, sorry.  Please explain what
> is happening here as the kernel documentation asks you to.

It was intended as just some patch for the reporter to try while
narrowing the problem down. Sorry for the noise.

I should have thought about removing stable from CC.


Thank you,
-- 
Vinicius

Thread overview: 15+ messages
2021-11-24  7:28 [REGRESSION] Kernel 5.15 reboots / freezes upon ifup/ifdown Stefan Dietrich
2021-11-24  7:33 ` Greg KH
2021-11-24  7:42   ` Stefan Dietrich
2021-11-24 17:20   ` Stefan Dietrich
2021-11-24 23:34     ` Jakub Kicinski
2021-11-25  1:07       ` Vinicius Costa Gomes
2021-11-25  1:13         ` Jakub Kicinski
2021-11-25  8:41         ` Stefan Dietrich
2021-12-01 11:45           ` Thorsten Leemhuis
2021-12-01 17:47             ` Vinicius Costa Gomes
2021-12-01 18:57               ` [PATCH] igc: Avoid possible deadlock during suspend/resume Vinicius Costa Gomes
2021-12-02  6:41                 ` Greg KH
2021-12-02  6:50                   ` Vinicius Costa Gomes
2021-11-24  7:48 ` [REGRESSION] Kernel 5.15 reboots / freezes upon ifup/ifdown Thorsten Leemhuis
2021-11-24  8:05 ` Stefan Dietrich
