From: Paul Menzel <pmenzel@molgen.mpg.de>
To: intel-wired-lan@osuosl.org
Subject: [Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver
Date: Tue, 3 Sep 2019 11:39:35 +0200
Message-ID: <f5b988f9-25cf-db4a-53f5-9bb7edc8fb51@molgen.mpg.de>
In-Reply-To: <5B8DA87D05A7694D9FA63FD143655C1B9DCAC1FF@hasmsx108.ger.corp.intel.com>

Dear Tomas,


On 2019-09-03 11:28, Winkler, Tomas wrote:

>> On Tue, Sep 03, 2019 at 10:35:30AM +0200, Paul Menzel wrote:

>>> On 03.09.19 09:56, Gavin Lambert wrote:
>>>> On 2019-08-20 14:15, I wrote:
>>>>> Does anyone have any ideas about this? Either towards further
>>>>> investigation or to a possible resolution?
>>>>>
>>>>> This is at the point of hardware internals now, so I have no idea
>>>>> how to proceed in either area.
>>>>
>>>> To recap (plus some new info):
>>>>
>>>> 1. I am using a kernel module which uses the code from the e1000e
>>>> driver to communicate with the hardware without actually registering
>>>> it as a Linux netdev. (This is partly because it can get used in a
>>>> Xenomai context outside of Linux itself, although I'm not doing that
>>>> myself.) This historically works fine.
>>>>
>>>> 2. On certain Linux versions, I encountered an issue where
>>>> disconnecting the network cable and reconnecting it almost always
>>>> results in not being able to send any packets. (I cannot determine
>>>> if receiving packets works in this case, as the network design will
>>>> not receive packets unless some are sent first.) Restarting the
>>>> driver (rmmod+modprobe) does recover from this case (until the next
>>>> link loss), but simply replugging the cable never does.
>>>>
>>>> 3. The problem was observed with both I219-V and I219-LM (on
>>>> motherboard), but was *not* observed with 82571EB (PCIE). The
>>>> problem was not observed with a motherboard igb-based I211. I
>>>> suspect the issue is limited to motherboard-based e1000e adapters.
>>>> (Or perhaps there's something different about how the IGBs are
>>>> internally connected.)
>>>>
>>>> 4. The problem does not occur when the e1000e driver is registered
>>>> "normally" as a Linux netdev.
>>>>
>>>> 5. The problem was introduced by "mei: me: allow runtime pm for
>>>> platform with D0i3" (which has been backported to 4.4+, as far as
>>>> I can tell).
>>>> Excluding this commit reliably resolves the issue and including it
>>>> reliably breaks it.
>>>
>>> The commit hash in the master branch is
>>> cc365dcf0e56271bedf3de95f88922abe248e951 and is there since v4.16-rc1.
>>>
>>> Strange, that it is in 4.4 and 4.9, as it was only tagged for v4.13+.
>>>
>>>> 6. Applying the previously suggested patch
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/commit/drivers/net/ethernet/intel/e1000e?id=def4ec6dce393e2136b62a05712f35a7fa5f5e56
>>>> has no effect; the E1000_STATUS_PCIM_STATE bit is not set when the
>>>> issue occurs.
>>>>
>>>> 7. Given the content of the change in #5, I assumed that the problem
>>>> was power-management related, perhaps a side effect of the e1000e
>>>> driver not being registered as a netdev. (So perhaps something
>>>> thinks that no devices are in use and turns something off?)
>>>>
>>>> 8. I've previously posted register dumps from an e1000e in both the
>>>> "normal" and "link up but not transmitting" states. They seemed
>>>> very similar, but as I'm not familiar with the register meanings I
>>>> may have overlooked something significant. (Note that the dumps
>>>> were captured inside the watchdog task, when it detects link up but
>>>> before it sets E1000_TCTL_EN.)
>>>>
>>>> 9. I enabled debug logging in the mei driver; it logs a couple of
>>>> runtime_idles and then a runtime_suspend during system startup. (I
>>>> added a log to runtime_resume that is missing in the driver source,
>>>> but it appears this does not get called in my scenario.) Note that
>>>> the e1000e driver is still working ok after this.. at least at first.
>>>>
>>>> 10. "cat /sys/bus/devices/pci0000:00/0000:00:16.0/power/runtime_status"
>>>>     => "suspended"
>>>>     "cat /sys/bus/devices/pci0000:00/0000:00:16.0/mei/mei0/power/runtime_status"
>>>>     => "unsupported"
>>>>     "cat /sys/bus/devices/pci0000:00/0000:00:1f.0/power/runtime_status"
>>>>     => "active"
>>>>     "cat /sys/bus/devices/pci0000:00/0000:00:1f.6/power/runtime_status"
>>>>     => "active" (this is the actual NIC)
>>>>     These don't change between the working and non-working states.
>>>>     (It's possible that some other device does, but I haven't found it
>>>>     yet.)
>>>>
>>>> 11. I did try forcing the above to unsuspend, but this did not
>>>> recover from the e1000e issue.
>>>>
>>>> 12. I also tried calling e1000e_reset on link-down. This produces
>>>> different register output on link-up, but doesn't recover from the
>>>> issue.
>>>>
>>>> 13. I also tried recompiling the kernel with CONFIG_PM disabled (no
>>>> power management). This *does* resolve the problem (but is a very
>>>> big hammer).
>>>>
>>>> 14. Possibly also of interest is that if I do *both* #12 and #13,
>>>> the problem remains (suggesting #12 was counter-productive).
>>>>
>>>> FYI the hardware on one of the test machines is as follows:
>>>>     00:00.0 Host bridge: Intel Corporation Device 591f (rev 05)
>>>>     00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller
>>>> (x16) (rev 05)
>>>>     00:02.0 VGA compatible controller: Intel Corporation Device
>>>> 5912 (rev 04)
>>>>     00:08.0 System peripheral: Intel Corporation Skylake Gaussian
>>>> Mixture Model
>>>>     00:14.0 USB controller: Intel Corporation Sunrise Point-H USB
>>>> 3.0  xHCI Controller (rev 31)
>>>>     00:14.2 Signal processing controller: Intel Corporation Sunrise
>>>> Point-H Thermal subsystem (rev 31)
>>>>     00:15.0 Signal processing controller: Intel Corporation Sunrise
>>>> Point-H Serial IO I2C Controller #0 (rev 31)
>>>>     00:15.1 Signal processing controller: Intel Corporation Sunrise
>>>> Point-H Serial IO I2C Controller #1 (rev 31)
>>>>     00:16.0 Communication controller: Intel Corporation Sunrise
>>>> Point-H CSME HECI #1 (rev 31)
>>>>     00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA
>>>> controller [AHCI mode] (rev 31)
>>>>     00:1b.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Root
>>>> Port #19 (rev f1)
>>>>     00:1b.3 PCI bridge: Intel Corporation Sunrise Point-H PCI Root
>>>> Port #20 (rev f1)
>>>>     00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI
>>>> Express Root Port #5 (rev f1)
>>>>     00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI
>>>> Express Root Port #11 (rev f1)
>>>>     00:1e.0 Signal processing controller: Intel Corporation Sunrise
>>>> Point-H Serial IO UART #0 (rev 31)
>>>>     00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC
>>>> Controller (rev 31)
>>>>     00:1f.2 Memory controller: Intel Corporation Sunrise Point-H
>>>> PMC (rev 31)
>>>>     00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
>>>>     00:1f.6 Ethernet controller: Intel Corporation Ethernet
>>>> Connection (2) I219-LM (rev 31)
>>>>     02:00.0 Ethernet controller: Intel Corporation I211 Gigabit
>>>> Network Connection (rev 03)
>>>>     03:00.0 Ethernet controller: Intel Corporation I211 Gigabit
>>>> Network Connection (rev 03)
>>>>     05:00.0 Ethernet controller: Intel Corporation I211 Gigabit
>>>> Network Connection (rev 03)

(Tomas, your MUA wrapped the lines messing up the formatting.)

>>>> I'm happy to add any code instrumentation or make any other changes
>>>> needed to locate and resolve the problem, and I can readily
>>>> reproduce it
>>>> -- I'm just at a complete loss as to where to start looking, and am
>>>> still hoping for some suggestions in that regard.
>>>>
>>>> If there's anywhere (or anyone) else better for me to talk to about
>>>> this issue, please let me know that too.
>>>
>>> It is not clear to me whether this is still reproducible on Linux
>>> 5.3-rc7 (or Linus' master branch).
>>>
>>> If it is, this is definitely a regression, and the commits need to be
>>> reverted due to Linux' no regression policy.
>>
>> So I should revert this from 4.4.y and 4.9.y?
> 
> The issue is not in mei driver, it is in e1000 driver. To my best
> knowledge there should be a fix; Vitaly, can it please be backported
> to older kernels?

Tomas, backporting the commit that supposedly fixes this does *not* help.
Also, it does not matter for the no regression policy.

Let's wait until Gavin can confirm if it is happening with Linux 5.3-rc7.
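
For that retest, it might help to capture the runtime PM states from
item 10 in one go, once in the working state and once after replugging
the cable. A minimal sketch in Python follows. It assumes the PCI
addresses from the device list quoted above and reads them through the
/sys/bus/pci/devices/<address>/ symlinks, which is my assumption and
not the path quoted in item 10. The names DEVICES and
read_runtime_status are made up for illustration.

```python
from pathlib import Path

# PCI addresses taken from the device list quoted above.
# The labels are only for readable output.
DEVICES = {
    "0000:00:16.0": "CSME HECI #1 (mei)",
    "0000:00:1f.0": "LPC controller",
    "0000:00:1f.6": "I219-LM (the actual NIC)",
}


def read_runtime_status(devices=DEVICES, root="/sys/bus/pci/devices"):
    """Return {pci_address: runtime_status or None} for each device.

    `root` is an assumption: the per-device symlink directory that
    sysfs normally exposes for the PCI bus.
    """
    result = {}
    for addr in devices:
        path = Path(root) / addr / "power" / "runtime_status"
        try:
            result[addr] = path.read_text().strip()
        except OSError:
            # Device missing or attribute not readable on this system.
            result[addr] = None
    return result


if __name__ == "__main__":
    for addr, status in read_runtime_status().items():
        print(f"{addr} {DEVICES[addr]}: {status}")
```

Running it before unplugging and again after the link comes back, and
diffing the two outputs, would show whether any of these three devices
changes state. It does not cover other devices that might be involved.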


Kind regards,

Paul
