* [Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver
@ 2019-07-11  6:50 Gavin Lambert
  2019-07-12  3:23 ` Gavin Lambert
  0 siblings, 1 reply; 18+ messages in thread
From: Gavin Lambert @ 2019-07-11  6:50 UTC (permalink / raw)
  To: intel-wired-lan

This might be a bit of a tricky question, but I'm not really sure where 
else to ask.  Please cc me on any replies or I might overlook them.

I'm using a system with an e1000e network driver which has been patched 
to bypass the regular Linux network stack (because it can get called 
from a Xenomai RT context, among other reasons -- although in my case 
I'm not doing that).  The complete source for the patched version of the 
code can be found here:
     
https://github.com/ribalda/ethercat/blob/master/devices/e1000e/netdev-4.9-ethercat.c
(There are some minor changes to other files, but the majority of 
changes are only to this file.  You can see just the changes at 
https://gist.github.com/uecasm/5e36a15bda6ffd53079344fc443dcc5f/revisions 
.)

It was originally based on the in-kernel e1000e driver as of Linux 
4.9.65.  (I'm not the person who originally made the patches, but I am 
the person who rebased them to kernel 4.9 and I'm the one trying to 
maintain them for newer kernel versions.  Though I'm also not the person 
who made that github repo.)

On a Debian system with kernel linux-image-4.9.0-4-rt-amd64 (4.9.65) 
installed, this works perfectly.  It also works perfectly with 
linux-image-4.9.0-8-rt-amd64 (4.9.110).

However, with kernel linux-image-4.9.0-9-rt-amd64 (4.9.168) installed 
(and no other changes to the system other than building the patched 
e1000e module against this kernel's headers), something weird happens 
when the driver is running in its alternate "ecdev" mode.

Specifically, when the module is initially loaded, it works as expected 
and can send/receive without problems.  When link is removed (by 
disconnecting the Ethernet cable), it detects this as expected.  When 
link is restored, it detects this and reports it but is then unable to 
actually send any packets.  (Note: to send packets the external code 
calls the "ndo_start_xmit" operation directly, and to receive packets it 
calls "ec_poll".  Also note that it won't receive a packet unless it 
sends one first, due to the way that the network it's connected to 
works, so I can't tell if receives work or not when sends don't work.)  
Unloading and reloading the module fixes this, even if the link is 
initially down and then reconnected after the module is reloaded.  (So 
perhaps the problem is something it does at the link-loss event?)

Occasionally, it does manage to survive one or two replugs before 
getting into the problem state.  But once there, no amount of replugging 
appears to recover it; only reloading the module.

I do know that when it's in the failure state (not actually sending 
packets), e1000_xmit_frame continues to get all the way to the bottom 
and return NETDEV_TX_OK.

Note that the e1000e code being used is still the code as shown in the 
link above, not the code as exists in Linux 4.9.168.  I did try rebasing 
the ethercat patches onto the new driver version, but this didn't seem 
to change the behavior.

Also note that the bad behavior was observed on an I219-V and an 
I219-LM, but does not appear to happen with an 82571EB (these are the 
only devices I have handy at the moment).  The problem also doesn't 
occur when using the unpatched driver from 4.9.168 as a standard Linux 
network driver.

Obviously, something the patches are doing is causing problems, but it 
seems odd that the issue only occurs with certain hardware and with 
certain kernel versions.  Any ideas on what could be the cause and 
solution (or how to narrow it down further)?  I can easily make changes 
to the driver code; it's a lot harder to try kernel versions between the 
two above, however, but I might be able to do that too.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver
  2019-07-11  6:50 [Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver Gavin Lambert
@ 2019-07-12  3:23 ` Gavin Lambert
  2019-07-18  8:06   ` Gavin Lambert
  0 siblings, 1 reply; 18+ messages in thread
From: Gavin Lambert @ 2019-07-12  3:23 UTC (permalink / raw)
  To: intel-wired-lan

On 2019-07-11 18:50, I wrote:
> This might be a bit of a tricky question, but I'm not really sure
> where else to ask.  Please cc me on any replies or I might overlook
> them.
> 
> I'm using a system with an e1000e network driver which has been
> patched to bypass the regular Linux network stack (because it can get
> called from a Xenomai RT context, among other reasons -- although in
> my case I'm not doing that).  The complete source for the patched
> version of the code can be found here:
> 
> https://github.com/ribalda/ethercat/blob/master/devices/e1000e/netdev-4.9-ethercat.c
> (There are some minor changes to other files, but the majority of
> changes are only to this file.  You can see just the changes at
> https://gist.github.com/uecasm/5e36a15bda6ffd53079344fc443dcc5f/revisions
> .)
> 
> It was originally based on the in-kernel e1000e driver as of Linux
> 4.9.65.  (I'm not the person who originally made the patches, but I am
> the person who rebased them to kernel 4.9 and I'm the one trying to
> maintain them for newer kernel versions.  Though I'm also not the
> person who made that github repo.)
> 
> On a Debian system with kernel linux-image-4.9.0-4-rt-amd64 (4.9.65)
> installed, this works perfectly.  It also works perfectly with
> linux-image-4.9.0-8-rt-amd64 (4.9.110).
> 
> However, with kernel linux-image-4.9.0-9-rt-amd64 (4.9.168) installed
> (and no other changes to the system other than building the patched
> e1000e module against this kernel's headers), something weird happens
> when the driver is running in its alternate "ecdev" mode.
> 
> Specifically, when the module is initially loaded, it works as
> expected and can send/receive without problems.  When link is removed
> (by disconnecting the Ethernet cable), it detects this as expected.
> When link is restored, it detects this and reports it but is then
> unable to actually send any packets.  (Note: to send packets the
> external code calls the "ndo_start_xmit" operation directly, and to
> receive packets it calls "ec_poll".  Also note that it won't receive a
> packet unless it sends one first, due to the way that the network it's
> connected to works, so I can't tell if receives work or not when sends
> don't work.)  Unloading and reloading the module fixes this, even if
> the link is initially down and then reconnected after the module is
> reloaded.  (So perhaps the problem is something it does at the
> link-loss event?)
> 
> Occasionally, it does manage to survive one or two replugs before
> getting into the problem state.  But once there, no amount of
> replugging appears to recover it; only reloading the module.
> 
> I do know that when it's in the failure state (not actually sending
> packets), e1000_xmit_frame continues to get all the way to the bottom
> and return NETDEV_TX_OK.
> 
> Note that the e1000e code being used is still the code as shown in the
> link above, not the code as exists in Linux 4.9.168.  I did try
> rebasing the ethercat patches onto the new driver version, but this
> didn't seem to change the behavior.
> 
> Also note that the bad behavior was observed on an I219-V and an
> I219-LM, but does not appear to happen with an 82571EB (these are the
> only devices I have handy at the moment).  The problem also doesn't
> occur when using the unpatched driver from 4.9.168 as a standard Linux
> network driver.
> 
> Obviously, something the patches are doing is causing problems, but it
> seems odd that the issue only occurs with certain hardware and with
> certain kernel versions.  Any ideas on what could be the cause and
> solution (or how to narrow it down further)?  I can easily make
> changes to the driver code; it's a lot harder to try kernel versions
> between the two above, however, but I might be able to do that too.

(I wouldn't normally quote that much, but I haven't seen this message 
appear on the mailing list yet, so I'm not sure if it got through or 
not.)

Another data point: on linux-image-4.9.0-8-rt-amd64 (4.9.110), which 
works ok with the code previously given, if I apply the attached patch 
(which is the rebase to bring the base driver up to date with 4.9.168) 
then the same problem occurs.

So *either* applying this patch or updating to 4.9.168 without applying 
this patch introduces the problem.

Making the further change below to the code fixes the problem in 
4.9.110, but not in 4.9.168:

--- a/netdev-4.9-ethercat.c
+++ b/netdev-4.9-ethercat.c
@@ -5407,7 +5407,7 @@ static void e1000_watchdog_task(struct w
  			 * reset the controller to flush the Tx packet buffers.
  			 */
  			if ((adapter->flags & FLAG_RX_NEEDS_RESTART) ||
-			    e1000_desc_unused(tx_ring) + 1 < tx_ring->count)
+			    (!adapter->ecdev && e1000_desc_unused(tx_ring) + 1 < tx_ring->count))
  				adapter->flags |= FLAG_RESTART_NOW;
  			else
  				pm_schedule_suspend(netdev->dev.parent,

Since this was mostly just a rebase error (you can see a similar change 
in the old location of this code), I'm not sure if this helps narrow 
down the source of the problem between 4.9.110 and 4.9.168 or not.  I'm 
still looking for ideas for that.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: e1000e_problem.diff
Type: text/x-diff
Size: 7603 bytes
Desc: not available
URL: <http://lists.osuosl.org/pipermail/intel-wired-lan/attachments/20190712/081f7e46/attachment-0001.diff>


* [Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver
  2019-07-12  3:23 ` Gavin Lambert
@ 2019-07-18  8:06   ` Gavin Lambert
  2019-07-18  8:22     ` Paul Menzel
  2019-07-18  8:24     ` Neftin, Sasha
  0 siblings, 2 replies; 18+ messages in thread
From: Gavin Lambert @ 2019-07-18  8:06 UTC (permalink / raw)
  To: intel-wired-lan

On 2019-07-12 15:23, I wrote:
> On 2019-07-11 18:50, I wrote:
>> On a Debian system with kernel linux-image-4.9.0-4-rt-amd64 (4.9.65)
>> installed, this works perfectly.  It also works perfectly with
>> linux-image-4.9.0-8-rt-amd64 (4.9.110).
>> 
>> However, with kernel linux-image-4.9.0-9-rt-amd64 (4.9.168) installed
>> (and no other changes to the system other than building the patched
>> e1000e module against this kernel's headers), something weird happens
>> when the driver is running in its alternate "ecdev" mode.
[...]
> Since this was mostly just a rebase error (you can see a similar
> change in the old location of this code), I'm not sure if this helps
> narrow down the source of the problem between 4.9.110 and 4.9.168 or
> not.  I'm still looking for ideas for that.

Using this kernel tree:
   
https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git/log/?h=v4.9-rt&ofs=3120

I've identified that the code at tag v4.9.126 is "good" and the code at 
tag v4.9.127 is "bad".

I've done a bisect (twice, from different starting points) and both 
times settled on this commit as the one which introduced the problem I'm 
experiencing:

commit c0b809985a7a418fcc3361c239ae79250245282d (refs/bisect/bad)
Author: Tomas Winkler <tomas.winkler@intel.com>
Date:   Tue Jan 2 12:01:41 2018 +0200

     mei: me: allow runtime pm for platform with D0i3

     commit cc365dcf0e56271bedf3de95f88922abe248e951 upstream.

     From the pci power documentation:
     "The driver itself should not call pm_runtime_allow(), though. Instead,
     it should let user space or some platform-specific code do that (user space
     can do it via sysfs as stated above)..."

     However, the S0ix residency cannot be reached without MEI device getting
     into low power state. Hence, for mei devices that support D0i3, it's better
     to make runtime power management mandatory and not rely on the system
     integration such as udev rules.
     This policy cannot be applied globally as some older platforms
     were found to have broken power management.

     Cc: <stable@vger.kernel.org> v4.13+
     Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
     Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
     Reviewed-by: Alexander Usyskin <alexander.usyskin@intel.com>
     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

It is reproducible every time; if I build at the parent commit 
(3d3432580911) then the driver works, and if I add the commit above then 
it fails.

However it's unclear to me how this is affecting my modified e1000e 
driver in this way, except that it is perhaps power management related?

Since it appears to be a pm_runtime-related thing, just as an experiment 
I did try commenting out every single call to pm_runtime* functions in 
netdev.c, but this did not resolve the problem.  Ditto for anything with 
the word "suspend" in it.  I also tried adding e_info() logging calls to 
most places that used pm_ calls other than pm_runtime_get/put (and in 
particular, in all of the pm_ops callbacks), and none of them were hit 
during the problem events.

And even when it's not working, if I `cat` various things in 
`/sys/bus/pci/.../power/` on the adapter device, it appears to all be 
non-suspended, which makes me doubt that it really is a PM issue, unless 
I'm just looking in the wrong places.

Any ideas?


* [Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver
  2019-07-18  8:06   ` Gavin Lambert
@ 2019-07-18  8:22     ` Paul Menzel
  2019-07-18  8:24     ` Neftin, Sasha
  1 sibling, 0 replies; 18+ messages in thread
From: Paul Menzel @ 2019-07-18  8:22 UTC (permalink / raw)
  To: intel-wired-lan

[private answer]

Dear Gavin,


Your messages were delivered to the list subscribers.

On 18.07.19 10:06, Gavin Lambert wrote:
> On 2019-07-12 15:23, I wrote:
>> On 2019-07-11 18:50, I wrote:
>>> On a Debian system with kernel linux-image-4.9.0-4-rt-amd64 (4.9.65)
>>> installed, this works perfectly.  It also works perfectly with
>>> linux-image-4.9.0-8-rt-amd64 (4.9.110).
>>>
>>> However, with kernel linux-image-4.9.0-9-rt-amd64 (4.9.168) installed
>>> (and no other changes to the system other than building the patched
>>> e1000e module against this kernel's headers), something weird happens
>>> when the driver is running in its alternate "ecdev" mode.
> [...]
>> Since this was mostly just a rebase error (you can see a similar
>> change in the old location of this code), I'm not sure if this helps
>> narrow down the source of the problem between 4.9.110 and 4.9.168 or
>> not.  I'm still looking for ideas for that.
> 
> Using this kernel tree:
> https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git/log/?h=v4.9-rt&ofs=3120 
> 
> I've identified that the code at tag v4.9.126 is "good" and the code at 
> tag v4.9.127 is "bad".
> 
> I've done a bisect (twice, from different starting points) and both 
> times settled on this commit as the one which introduced the problem I'm 
> experiencing:
> 
> commit c0b809985a7a418fcc3361c239ae79250245282d (refs/bisect/bad)
> Author: Tomas Winkler <tomas.winkler@intel.com>
> Date:   Tue Jan 2 12:01:41 2018 +0200
> 
>      mei: me: allow runtime pm for platform with D0i3
> 
>      commit cc365dcf0e56271bedf3de95f88922abe248e951 upstream.
> 
>      From the pci power documentation:
>      "The driver itself should not call pm_runtime_allow(), though. Instead,
>      it should let user space or some platform-specific code do that (user space
>      can do it via sysfs as stated above)..."
> 
>      However, the S0ix residency cannot be reached without MEI device getting
>      into low power state. Hence, for mei devices that support D0i3, it's better
>      to make runtime power management mandatory and not rely on the system
>      integration such as udev rules.
>      This policy cannot be applied globally as some older platforms
>      were found to have broken power management.
> 
>      Cc: <stable@vger.kernel.org> v4.13+
>      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>      Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
>      Reviewed-by: Alexander Usyskin <alexander.usyskin@intel.com>
>      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

This commit was added in v4.16-rc1.

> It is reproducible every time; if I build at the parent commit 
> (3d3432580911) then the driver works, and if I add the commit above then 
> it fails.
> 
> However it's unclear to me how this is affecting my modified e1000e 
> driver in this way, except that it is perhaps power management related?
> 
> Since it appears to be a pm_runtime-related thing, just as an experiment 
> I did try commenting out every single call to pm_runtime* functions in 
> netdev.c, but this did not resolve the problem.  Ditto for anything with 
> the word "suspend" in it.  I also tried adding e_info() logging calls to 
> most places that used pm_ calls other than pm_runtime_get/put (and in 
> particular, in all of the pm_ops callbacks), and none of them were hit 
> during the problem events.
> 
> And even when it's not working, if I `cat` various things in 
> `/sys/bus/pci/.../power/` on the adapter device, it appears to all be 
> non-suspended, which makes me doubt that it really is a PM issue, unless 
> I'm just looking in the wrong places.

If you found a faulty commit, please CC the commit authors, reviewers, 
and subsystem maintainers and maybe even the regression address.

If you have time, please check the Linux master tree to see whether a 
fix has already been added or whether you still need to revert the commit.


Kind regards,

Paul


* [Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver
  2019-07-18  8:06   ` Gavin Lambert
  2019-07-18  8:22     ` Paul Menzel
@ 2019-07-18  8:24     ` Neftin, Sasha
  2019-07-19  0:40       ` Gavin Lambert
  1 sibling, 1 reply; 18+ messages in thread
From: Neftin, Sasha @ 2019-07-18  8:24 UTC (permalink / raw)
  To: intel-wired-lan

On 7/18/2019 11:06, Gavin Lambert wrote:
> On 2019-07-12 15:23, I wrote:
>> On 2019-07-11 18:50, I wrote:
>>> On a Debian system with kernel linux-image-4.9.0-4-rt-amd64 (4.9.65)
>>> installed, this works perfectly.  It also works perfectly with
>>> linux-image-4.9.0-8-rt-amd64 (4.9.110).
>>>
>>> However, with kernel linux-image-4.9.0-9-rt-amd64 (4.9.168) installed
>>> (and no other changes to the system other than building the patched
>>> e1000e module against this kernel's headers), something weird happens
>>> when the driver is running in its alternate "ecdev" mode.
> [...]
>> Since this was mostly just a rebase error (you can see a similar
>> change in the old location of this code), I'm not sure if this helps
>> narrow down the source of the problem between 4.9.110 and 4.9.168 or
>> not.  I'm still looking for ideas for that.
> 
> Using this kernel tree:
> https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git/log/?h=v4.9-rt&ofs=3120 
> 
> 
> I've identified that the code at tag v4.9.126 is "good" and the code at 
> tag v4.9.127 is "bad".
> 
> I've done a bisect (twice, from different starting points) and both 
> times settled on this commit as the one which introduced the problem I'm 
> experiencing:
> 
> commit c0b809985a7a418fcc3361c239ae79250245282d (refs/bisect/bad)
> Author: Tomas Winkler <tomas.winkler@intel.com>
> Date:   Tue Jan 2 12:01:41 2018 +0200
> 
>      mei: me: allow runtime pm for platform with D0i3
> 
>      commit cc365dcf0e56271bedf3de95f88922abe248e951 upstream.
> 
>      From the pci power documentation:
>      "The driver itself should not call pm_runtime_allow(), though. Instead,
>      it should let user space or some platform-specific code do that (user space
>      can do it via sysfs as stated above)..."
> 
>      However, the S0ix residency cannot be reached without MEI device getting
>      into low power state. Hence, for mei devices that support D0i3, it's better
>      to make runtime power management mandatory and not rely on the system
>      integration such as udev rules.
>      This policy cannot be applied globally as some older platforms
>      were found to have broken power management.
> 
>      Cc: <stable@vger.kernel.org> v4.13+
>      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>      Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
>      Reviewed-by: Alexander Usyskin <alexander.usyskin@intel.com>
>      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> 
> It is reproducible every time; if I build at the parent commit 
> (3d3432580911) then the driver works, and if I add the commit above then 
> it fails.
> 
> However it's unclear to me how this is affecting my modified e1000e 
> driver in this way, except that it is perhaps power management related?
> 
> Since it appears to be a pm_runtime-related thing, just as an experiment 
> I did try commenting out every single call to pm_runtime* functions in 
> netdev.c, but this did not resolve the problem.  Ditto for anything with 
> the word "suspend" in it.  I also tried adding e_info() logging calls to 
> most places that used pm_ calls other than pm_runtime_get/put (and in 
> particular, in all of the pm_ops callbacks), and none of them were hit 
> during the problem events.
> 
> And even when it's not working, if I `cat` various things in 
> `/sys/bus/pci/.../power/` on the adapter device, it appears to all be 
> non-suspended, which makes me doubt that it really is a PM issue, unless 
> I'm just looking in the wrong places.
> 
> Any ideas?
Please refer to commit def4ec6dce393e2136b62a05712f35a7fa5f5e56
in Jeff Kirsher's next-queue: 
https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/commit/drivers/net/ethernet/intel/e1000e?id=def4ec6dce393e2136b62a05712f35a7fa5f5e56

We are working to push this patch to upstream.
Thanks,
Sasha




* [Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver
  2019-07-18  8:24     ` Neftin, Sasha
@ 2019-07-19  0:40       ` Gavin Lambert
  2019-07-19  1:02         ` Gavin Lambert
  0 siblings, 1 reply; 18+ messages in thread
From: Gavin Lambert @ 2019-07-19  0:40 UTC (permalink / raw)
  To: intel-wired-lan

On 2019-07-18 20:24, Neftin, Sasha wrote:
> Please refer to commit def4ec6dce393e2136b62a05712f35a7fa5f5e56
> in Jeff Kirsher's next-queue:
> https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/commit/drivers/net/ethernet/intel/e1000e?id=def4ec6dce393e2136b62a05712f35a7fa5f5e56
> 
> We are working to push this patch to upstream.

Thanks, that does sound identical to my symptoms.

However I tried applying this patch to my driver in 4.9 and it does not 
resolve the problem.

Are some additional patches required as well?


FWIW, I added some extra logging around the new code.  I can confirm 
that it does execute on link regain but doesn't actually enter the loop 
in my problem case.  The pcim_state is 0x00080083 at the time.  So the 
e1000_phy_hw_reset is never actually called.  If I try changing it to 
call that unconditionally, then it can't successfully establish a link 
in the first place.


* [Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver
  2019-07-19  0:40       ` Gavin Lambert
@ 2019-07-19  1:02         ` Gavin Lambert
  2019-08-20  2:15           ` Gavin Lambert
  0 siblings, 1 reply; 18+ messages in thread
From: Gavin Lambert @ 2019-07-19  1:02 UTC (permalink / raw)
  To: intel-wired-lan

On 2019-07-19 12:40, I wrote:
> FWIW, I added some extra logging around the new code.  I can confirm
> that it does execute on link regain but doesn't actually enter the
> loop in my problem case.  The pcim_state is 0x00080083 at the time.
> So the e1000_phy_hw_reset is never actually called.  If I try changing
> it to call that unconditionally, then it can't successfully establish
> a link in the first place.

I added a call to e1000e_dump at the point of link regain, in hopes that 
it might shed more light.

On startup, when it does successfully link and send/receive packets:

     0000:00:1f.6: Register Dump
       Register Name   Value
      CTRL            58180240
      STATUS          00080083
      CTRL_EXT        995a1027
      ICR             00000000
      RCTL            04008002
      RDLEN           00001000
      RDH             00000000
      RDT             000000f0
      RDTR            00000000
      RXDCTL[0-1]     00010000 00010000
      ERT             00000000
      RDBAL           6061c000
      RDBAH           00000002
      RDFH            00000000
      RDFT            00000000
      RDFHS           00000000
      RDFTS           00000000
      RDFPC           00000000
      TCTL            3103f0f8
      TDBAL           5e8a0000
      TDBAH           00000002
      TDLEN           00001000
      TDH             00000000
      TDT             00000000
      TIDV            00000008
      TXDCTL[0-1]     0141001f 0141001f
      TADV            00000020
      TARC[0-1]       3d800403 45000403
      TDFH            00000d00
      TDFT            00000d00
      TDFHS           00000d00
      TDFTS           00000d00
      TDFPC           00000000
      ecm0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

On disconnecting and reconnecting the cable, when it does get link but 
then can't actually send any packets:

     0000:00:1f.6: Register Dump
       Register Name   Value
      CTRL            58180240
      STATUS          00080083
      CTRL_EXT        995a1027
      ICR             00000000
      RCTL            04008002
      RDLEN           00001000
      RDH             000000d1
      RDT             000000c0
      RDTR            00000000
      RXDCTL[0-1]     00010000 00010000
      ERT             00000000
      RDBAL           6061c000
      RDBAH           00000002
      RDFH            00000582
      RDFT            00000582
      RDFHS           00000582
      RDFTS           00000582
      RDFPC           00000000
      TCTL            3103f0fa
      TDBAL           5e8a0000
      TDBAH           00000002
      TDLEN           00001000
      TDH             00000050
      TDT             0000003d
      TIDV            00000008
      TXDCTL[0-1]     0141001f 0141001f
      TADV            00000020
      TARC[0-1]       3d800403 45000403
      TDFH            00000f0a
      TDFT            00000f1c
      TDFHS           00000f0a
      TDFTS           00000f0a
      TDFPC           00000000
      ecm0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx


* [Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver
  2019-07-19  1:02         ` Gavin Lambert
@ 2019-08-20  2:15           ` Gavin Lambert
  2019-09-03  7:56             ` Gavin Lambert
  0 siblings, 1 reply; 18+ messages in thread
From: Gavin Lambert @ 2019-08-20  2:15 UTC (permalink / raw)
  To: intel-wired-lan

On 2019-07-19 13:02, I wrote:
> On 2019-07-19 12:40, I wrote:
>> FWIW, I added some extra logging around the new code.  I can confirm
>> that it does execute on link regain but doesn't actually enter the
>> loop in my problem case.  The pcim_state is 0x00080083 at the time.
>> So the e1000_phy_hw_reset is never actually called.  If I try changing
>> it to call that unconditionally, then it can't successfully establish
>> a link in the first place.
> 
> I added a call to e1000e_dump at the point of link regain, in hopes
> that it might shed more light.
[register dumps clipped]

Does anyone have any ideas about this?  Either towards further 
investigation or to a possible resolution?

This is at the point of hardware internals now, so I have no idea how to 
proceed in either area.


* [Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver
  2019-08-20  2:15           ` Gavin Lambert
@ 2019-09-03  7:56             ` Gavin Lambert
  2019-09-03  8:35               ` Paul Menzel
  0 siblings, 1 reply; 18+ messages in thread
From: Gavin Lambert @ 2019-09-03  7:56 UTC (permalink / raw)
  To: intel-wired-lan

On 2019-08-20 14:15, I wrote:
> Does anyone have any ideas about this?  Either towards further
> investigation or to a possible resolution?
> 
> This is at the point of hardware internals now, so I have no idea how
> to proceed in either area.

To recap (plus some new info):

1. I am using a kernel module which uses the code from the e1000e driver 
to communicate with the hardware without actually registering it as a 
Linux netdev.  (This is partly because it can get used in a Xenomai 
context outside of Linux itself, although I'm not doing that myself.)  
This historically works fine.

2. On certain Linux versions, I encountered an issue where disconnecting 
the network cable and reconnecting it almost always results in not being 
able to send any packets.  (I cannot determine if receiving packets 
works in this case, as the network design will not receive packets 
unless some are sent first.)  Restarting the driver (rmmod+modprobe) 
does recover from this case (until the next link loss), but simply 
replugging the cable never does.

3. The problem was observed with both I219-V and I219-LM (on 
motherboard), but was *not* observed with 82571EB (PCIe).  The problem 
was not observed with a motherboard igb-based I211.  I suspect the issue 
is limited to motherboard-based e1000e adapters.  (Or perhaps there's 
something different about how the IGBs are internally connected.)

4. The problem does not occur when the e1000e driver is registered 
"normally" as a Linux netdev.

5. The problem was introduced by "mei: me: allow runtime pm for platform 
with D0i3" (which has been backported to 4.4+, as far as I can tell).  
Excluding this commit reliably resolves the issue and including it 
reliably breaks it.

6. Applying the previously suggested patch 
https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/commit/drivers/net/ethernet/intel/e1000e?id=def4ec6dce393e2136b62a05712f35a7fa5f5e56 
has no effect; the E1000_STATUS_PCIM_STATE bit is not set when the issue 
occurs.

7. Given the content of the change in #5, I assumed that the problem was 
power-management related, perhaps a side effect of the e1000e driver not 
being registered as a netdev.  (So perhaps something thinks that no 
devices are in use and turns something off?)

8. I've previously posted register dumps from an e1000e in both the 
"normal" and "link up but not transmitting" states.  They seemed very 
similar, but as I'm not familiar with the register meanings I may have 
overlooked something significant.  (Note that the dumps were captured 
inside the watchdog task, when it detects link up but before it sets 
E1000_TCTL_EN.)

9. I enabled debug logging in the mei driver; it logs a couple of 
runtime_idles and then a runtime_suspend during system startup.  (I 
added a log to runtime_resume that is missing in the driver source, but 
it appears this does not get called in my scenario.)  Note that the 
e1000e driver is still working ok after this.. at least at first.

10. "cat /sys/bus/devices/pci0000:00/0000:00:16.0/power/runtime_status" 
=> "suspended"
     "cat 
/sys/bus/devices/pci0000:00/0000:00:16.0/mei/mei0/power/runtime_status" 
=> "unsupported"
     "cat /sys/bus/devices/pci0000:00/0000:00:1f.0/power/runtime_status" 
=> "active"
     "cat /sys/bus/devices/pci0000:00/0000:00:1f.6/power/runtime_status" 
=> "active" (this is the actual NIC)
     These don't change between the working and non-working states.  
(It's possible that some other device does, but I haven't found it yet.)

11. I did try forcing the above to unsuspend, but this did not recover 
from the e1000e issue.

12. I also tried calling e1000e_reset on link-down.  This produces 
different register output on link-up, but doesn't recover from the 
issue.

13. I also tried recompiling the kernel with CONFIG_PM disabled (no 
power management).  This *does* resolve the problem (but is a very big 
hammer).

14. Possibly also of interest is that if I do *both* #12 and #13, the 
problem remains (suggesting #12 was counter-productive).
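Regarding #10 and the suspicion that some other device changes state: a minimal sketch of one way to look for it is to snapshot the runtime PM status of every PCI device in the working state, again in the non-working state, and compare the two. It assumes the per-device files are reachable under /sys/bus/pci/devices/<address>/power/runtime_status; the function names are made up for illustration, and child devices such as mei0 are not covered.

```python
from pathlib import Path


def snapshot_runtime_status(root="/sys/bus/pci/devices"):
    """Map each device name under `root` to its power/runtime_status text."""
    status = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return status
    for dev in sorted(root_path.iterdir()):
        node = dev / "power" / "runtime_status"
        try:
            status[dev.name] = node.read_text().strip()
        except OSError:
            # Attribute missing or unreadable for this device: skip it.
            continue
    return status


def diff_snapshots(before, after):
    """Return {device: (old, new)} for every device whose status differs."""
    changes = {}
    for name in sorted(set(before) | set(after)):
        old = before.get(name)
        new = after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes
```

Calling snapshot_runtime_status() once before unplugging the cable and once after link-up, then passing both results to diff_snapshots(), would list any PCI device whose state changed. An empty result only rules out PCI-level runtime PM changes, not anything happening inside the hardware.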

FYI the hardware on one of the test machines is as follows:
     00:00.0 Host bridge: Intel Corporation Device 591f (rev 05)
     00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller (x16) 
(rev 05)
     00:02.0 VGA compatible controller: Intel Corporation Device 5912 
(rev 04)
     00:08.0 System peripheral: Intel Corporation Skylake Gaussian 
Mixture Model
     00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 
xHCI Controller (rev 31)
     00:14.2 Signal processing controller: Intel Corporation Sunrise 
Point-H Thermal subsystem (rev 31)
     00:15.0 Signal processing controller: Intel Corporation Sunrise 
Point-H Serial IO I2C Controller #0 (rev 31)
     00:15.1 Signal processing controller: Intel Corporation Sunrise 
Point-H Serial IO I2C Controller #1 (rev 31)
     00:16.0 Communication controller: Intel Corporation Sunrise Point-H 
CSME HECI #1 (rev 31)
     00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA 
controller [AHCI mode] (rev 31)
     00:1b.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port 
#19 (rev f1)
     00:1b.3 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port 
#20 (rev f1)
     00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express 
Root Port #5 (rev f1)
     00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express 
Root Port #11 (rev f1)
     00:1e.0 Signal processing controller: Intel Corporation Sunrise 
Point-H Serial IO UART #0 (rev 31)
     00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller 
(rev 31)
     00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC 
(rev 31)
     00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
     00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection 
(2) I219-LM (rev 31)
     02:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network 
Connection (rev 03)
     03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network 
Connection (rev 03)
     05:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network 
Connection (rev 03)

I'm happy to add any code instrumentation or make any other changes 
needed to locate and resolve the problem, and I can readily reproduce it 
-- I'm just at a complete loss as to where to start looking, and am 
still hoping for some suggestions in that regard.

If there's anywhere (or anyone) else better for me to talk to about this 
issue, please let me know that too.
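Regarding the register dumps in #8: since the two dumps look very similar by eye, a bit-level comparison may help spot a difference that is easy to overlook. The sketch below assumes each dump is plain text with one "NAME VALUE" pair per line and the value in hex; that format and the function names are assumptions for illustration, not the format of the dumps posted earlier. Registers are assumed to be 32 bits wide.

```python
def parse_dump(text):
    """Parse lines of the assumed form 'NAME HEXVALUE' into {name: int}."""
    regs = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # not a NAME VALUE pair, ignore
        name, value = parts
        try:
            regs[name] = int(value, 16)
        except ValueError:
            continue  # value is not hex, ignore
    return regs


def diff_registers(good, bad):
    """List (name, good_value, bad_value, changed_bits) for differing registers."""
    diffs = []
    for name in sorted(set(good) & set(bad)):
        changed = good[name] ^ bad[name]
        if changed:
            bits = [bit for bit in range(32) if (changed >> bit) & 1]
            diffs.append((name, good[name], bad[name], bits))
    return diffs
```

The changed bit positions can then be looked up against the register definitions in the driver headers. Registers present in only one dump are ignored here.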


* [Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver
  2019-09-03  7:56             ` Gavin Lambert
@ 2019-09-03  8:35               ` Paul Menzel
  2019-09-03  9:20                 ` Greg Kroah-Hartman
  0 siblings, 1 reply; 18+ messages in thread
From: Paul Menzel @ 2019-09-03  8:35 UTC (permalink / raw)
  To: intel-wired-lan

Dear Gavin,


Thank you for following up on this.

On 03.09.19 09:56, Gavin Lambert wrote:
> On 2019-08-20 14:15, I wrote:
>> Does anyone have any ideas about this?  Either towards further
>> investigation or to a possible resolution?
>>
>> This is at the point of hardware internals now, so I have no idea how
>> to proceed in either area.
> 
> To recap (plus some new info):
> 
> 1. I am using a kernel module which uses the code from the e1000e driver 
> to communicate with the hardware without actually registering it as a 
> Linux netdev.  (This is partly because it can get used in a Xenomai
> context outside of Linux itself, although I'm not doing that myself.)
> This historically works fine.
>
> 2. On certain Linux versions, I encountered an issue where disconnecting
> the network cable and reconnecting it almost always results in not being
> able to send any packets.  (I cannot determine if receiving packets
> works in this case, as the network design will not receive packets
> unless some are sent first.)  Restarting the driver (rmmod+modprobe)
> does recover from this case (until the next link loss), but simply
> replugging the cable never does.
>
> 3. The problem was observed with both I219-V and I219-LM (on
> motherboard), but was *not* observed with 82571EB (PCIE).  The problem
> was not observed with a motherboard igb-based I211.  I suspect the issue
> is limited to motherboard-based e1000e adapters.  (Or perhaps there's
> something different about how the IGBs are internally connected.)
> 
> 4. The problem does not occur when the e1000e driver is registered 
> "normally" as a Linux netdev.
> 
> 5. The problem was introduced by "mei: me: allow runtime pm for platform 
> with D0i3" (which has been backported to 4.4+, as far as I can tell). 
> Excluding this commit reliably resolves the issue and including it 
> reliably breaks it.

The commit hash in the master branch is 
cc365dcf0e56271bedf3de95f88922abe248e951 and is there since v4.16-rc1.

Strange, that it is in 4.4 and 4.9, as it was only tagged for v4.13+.

> 6. Applying the previously suggested patch 
> https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/commit/drivers/net/ethernet/intel/e1000e?id=def4ec6dce393e2136b62a05712f35a7fa5f5e56 
> has no effect; the E1000_STATUS_PCIM_STATE bit is not set when the issue 
> occurs.
> 
> 7. Given the content of the change in #5, I assumed that the problem was 
> power-management related, perhaps a side effect of the e1000e driver not 
> being registered as a netdev.  (So perhaps something thinks that no
> devices are in use and turns something off?)
>
> 8. I've previously posted register dumps from an e1000e in both the
> "normal" and "link up but not transmitting" states.  They seemed very
> similar, but as I'm not familiar with the register meanings I may have
> overlooked something significant.  (Note that the dumps were captured
> inside the watchdog task, when it detects link up but before it sets
> E1000_TCTL_EN.)
>
> 9. I enabled debug logging in the mei driver; it logs a couple of
> runtime_idles and then a runtime_suspend during system startup.  (I
> added a log to runtime_resume that is missing in the driver source, but
> it appears this does not get called in my scenario.)  Note that the
> e1000e driver is still working ok after this.. at least at first.
> 
> 10. "cat /sys/bus/devices/pci0000:00/0000:00:16.0/power/runtime_status"
> => "suspended"
>      "cat
> /sys/bus/devices/pci0000:00/0000:00:16.0/mei/mei0/power/runtime_status"
> => "unsupported"
>      "cat /sys/bus/devices/pci0000:00/0000:00:1f.0/power/runtime_status"
> => "active"
>      "cat /sys/bus/devices/pci0000:00/0000:00:1f.6/power/runtime_status"
> => "active" (this is the actual NIC)
>      These don't change between the working and non-working states.
> (It's possible that some other device does, but I haven't found it yet.)
> 
> 11. I did try forcing the above to unsuspend, but this did not recover 
> from the e1000e issue.
> 
> 12. I also tried calling e1000e_reset on link-down.  This produces
> different register output on link-up, but doesn't recover from the issue.
>
> 13. I also tried recompiling the kernel with CONFIG_PM disabled (no
> power management).  This *does* resolve the problem (but is a very big
> hammer).
> 
> 14. Possibly also of interest is that if I do *both* #12 and #13, the 
> problem remains (suggesting #12 was counter-productive).
> 
> FYI the hardware on one of the test machines is as follows:
>      00:00.0 Host bridge: Intel Corporation Device 591f (rev 05)
>      00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller (x16) (rev 05)
>      00:02.0 VGA compatible controller: Intel Corporation Device 5912 (rev 04)
>      00:08.0 System peripheral: Intel Corporation Skylake Gaussian Mixture Model
>      00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)
>      00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
>      00:15.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #0 (rev 31)
>      00:15.1 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #1 (rev 31)
>      00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
>      00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
>      00:1b.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #19 (rev f1)
>      00:1b.3 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #20 (rev f1)
>      00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #5 (rev f1)
>      00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #11 (rev f1)
>      00:1e.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO UART #0 (rev 31)
>      00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
>      00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
>      00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
>      00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
>      02:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
>      03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
>      05:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
> 
> I'm happy to add any code instrumentation or make any other changes 
> needed to locate and resolve the problem, and I can readily reproduce it 
> -- I'm just at a complete loss as to where to start looking, and am 
> still hoping for some suggestions in that regard.
> 
> If there's anywhere (or anyone) else better for me to talk to about this 
> issue, please let me know that too.

It is not clear to me whether this is still reproducible on Linux 5.3-rc7
(or Linus' master branch).

If it is, this is definitely a regression, and the commits need to be
reverted due to Linux' no-regression policy.


Kind regards,

Paul


* [Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver
  2019-09-03  8:35               ` Paul Menzel
@ 2019-09-03  9:20                 ` Greg Kroah-Hartman
  2019-09-03  9:28                   ` Winkler, Tomas
  0 siblings, 1 reply; 18+ messages in thread
From: Greg Kroah-Hartman @ 2019-09-03  9:20 UTC (permalink / raw)
  To: intel-wired-lan

On Tue, Sep 03, 2019 at 10:35:30AM +0200, Paul Menzel wrote:
> Dear Gavin,
> 
> 
> Thank you for following up on this.
> 
> On 03.09.19 09:56, Gavin Lambert wrote:
> > On 2019-08-20 14:15, I wrote:
> > > > Does anyone have any ideas about this?  Either towards further
> > > investigation or to a possible resolution?
> > > 
> > > This is at the point of hardware internals now, so I have no idea how
> > > to proceed in either area.
> > 
> > To recap (plus some new info):
> > 
> > 1. I am using a kernel module which uses the code from the e1000e driver
> > to communicate with the hardware without actually registering it as a
> > > Linux netdev.  (This is partly because it can get used in a Xenomai
> > > context outside of Linux itself, although I'm not doing that myself.)
> > > This historically works fine.
> > >
> > > 2. On certain Linux versions, I encountered an issue where disconnecting
> > > the network cable and reconnecting it almost always results in not being
> > > able to send any packets.  (I cannot determine if receiving packets
> > > works in this case, as the network design will not receive packets
> > > unless some are sent first.)  Restarting the driver (rmmod+modprobe)
> > > does recover from this case (until the next link loss), but simply
> > > replugging the cable never does.
> > >
> > > 3. The problem was observed with both I219-V and I219-LM (on
> > > motherboard), but was *not* observed with 82571EB (PCIE).  The problem
> > > was not observed with a motherboard igb-based I211.  I suspect the issue
> > > is limited to motherboard-based e1000e adapters.  (Or perhaps there's
> > something different about how the IGBs are internally connected.)
> > 
> > 4. The problem does not occur when the e1000e driver is registered
> > "normally" as a Linux netdev.
> > 
> > 5. The problem was introduced by "mei: me: allow runtime pm for platform
> > with D0i3" (which has been backported to 4.4+, as far as I can tell).
> > Excluding this commit reliably resolves the issue and including it
> > reliably breaks it.
> 
> The commit hash in the master branch is
> cc365dcf0e56271bedf3de95f88922abe248e951 and is there since v4.16-rc1.
> 
> Strange, that it is in 4.4 and 4.9, as it was only tagged for v4.13+.
> 
> > 6. Applying the previously suggested patch https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/commit/drivers/net/ethernet/intel/e1000e?id=def4ec6dce393e2136b62a05712f35a7fa5f5e56
> > has no effect; the E1000_STATUS_PCIM_STATE bit is not set when the issue
> > occurs.
> > 
> > 7. Given the content of the change in #5, I assumed that the problem was
> > power-management related, perhaps a side effect of the e1000e driver not
> > > being registered as a netdev.  (So perhaps something thinks that no
> > > devices are in use and turns something off?)
> > >
> > > 8. I've previously posted register dumps from an e1000e in both the
> > > "normal" and "link up but not transmitting" states.  They seemed very
> > > similar, but as I'm not familiar with the register meanings I may have
> > > overlooked something significant.  (Note that the dumps were captured
> > > inside the watchdog task, when it detects link up but before it sets
> > > E1000_TCTL_EN.)
> > >
> > > 9. I enabled debug logging in the mei driver; it logs a couple of
> > > runtime_idles and then a runtime_suspend during system startup.  (I
> > > added a log to runtime_resume that is missing in the driver source, but
> > > it appears this does not get called in my scenario.)  Note that the
> > e1000e driver is still working ok after this.. at least at first.
> > 
> > 10. "cat /sys/bus/devices/pci0000:00/0000:00:16.0/power/runtime_status"
> > => "suspended"
> >  ??? "cat
> > /sys/bus/devices/pci0000:00/0000:00:16.0/mei/mei0/power/runtime_status"
> > => "unsupported"
> >  ??? "cat /sys/bus/devices/pci0000:00/0000:00:1f.0/power/runtime_status"
> > => "active"
> >  ??? "cat /sys/bus/devices/pci0000:00/0000:00:1f.6/power/runtime_status"
> > => "active" (this is the actual NIC)
> >  ??? These don't change between the working and non-working states.
> > (It's possible that some other device does, but I haven't found it yet.)
> > 
> > 11. I did try forcing the above to unsuspend, but this did not recover
> > from the e1000e issue.
> > 
> > > 12. I also tried calling e1000e_reset on link-down.  This produces
> > > different register output on link-up, but doesn't recover from the
> > > issue.
> > >
> > > 13. I also tried recompiling the kernel with CONFIG_PM disabled (no
> > > power management).  This *does* resolve the problem (but is a very big
> > hammer).
> > 
> > 14. Possibly also of interest is that if I do *both* #12 and #13, the
> > problem remains (suggesting #12 was counter-productive).
> > 
> > FYI the hardware on one of the test machines is as follows:
> >  ??? 00:00.0 Host bridge: Intel Corporation Device 591f (rev 05)
> >  ??? 00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller (x16) (rev 05)
> >  ??? 00:02.0 VGA compatible controller: Intel Corporation Device 5912 (rev 04)
> >  ??? 00:08.0 System peripheral: Intel Corporation Skylake Gaussian  Mixture Model
> >  ??? 00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0  xHCI Controller (rev 31)
> >  ??? 00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
> >  ??? 00:15.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #0 (rev 31)
> >  ??? 00:15.1 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #1 (rev 31)
> >  ??? 00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
> >  ??? 00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
> >  ??? 00:1b.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #19 (rev f1)
> >  ??? 00:1b.3 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #20 (rev f1)
> >  ??? 00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #5 (rev f1)
> >  ??? 00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #11 (rev f1)
> >  ??? 00:1e.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO UART #0 (rev 31)
> >  ??? 00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
> >  ??? 00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
> >  ??? 00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
> >  ??? 00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
> >  ??? 02:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
> >  ??? 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
> >  ??? 05:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
> > 
> > I'm happy to add any code instrumentation or make any other changes
> > needed to locate and resolve the problem, and I can readily reproduce it
> > -- I'm just at a complete loss as to where to start looking, and am
> > still hoping for some suggestions in that regard.
> > 
> > If there's anywhere (or anyone) else better for me to talk to about this
> > issue, please let me know that too.
> 
> > It is not clear to me whether this is still reproducible on Linux 5.3-rc7 (or
> > Linus' master branch).
> >
> > If it is, this is definitely a regression, and the commits need to be
> > reverted due to Linux' no-regression policy.

So I should revert this from 4.4.y and 4.9.y?

thanks,

greg k-h


* [Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver
  2019-09-03  9:20                 ` Greg Kroah-Hartman
@ 2019-09-03  9:28                   ` Winkler, Tomas
  2019-09-03  9:39                     ` Paul Menzel
  0 siblings, 1 reply; 18+ messages in thread
From: Winkler, Tomas @ 2019-09-03  9:28 UTC (permalink / raw)
  To: intel-wired-lan



> On Tue, Sep 03, 2019 at 10:35:30AM +0200, Paul Menzel wrote:
> > Dear Gavin,
> >
> >
> > Thank you for following up on this.
> >
> > On 03.09.19 09:56, Gavin Lambert wrote:
> > > On 2019-08-20 14:15, I wrote:
> > > > > Does anyone have any ideas about this?  Either towards further
> > > > investigation or to a possible resolution?
> > > >
> > > > This is at the point of hardware internals now, so I have no idea
> > > > how to proceed in either area.
> > >
> > > To recap (plus some new info):
> > >
> > > 1. I am using a kernel module which uses the code from the e1000e
> > > driver to communicate with the hardware without actually registering
> > > > it as a Linux netdev.  (This is partly because it can get used in a
> > > > Xenomai context outside of Linux itself, although I'm not doing that
> > > > myself.) This historically works fine.
> > > >
> > > > 2. On certain Linux versions, I encountered an issue where
> > > > disconnecting the network cable and reconnecting it almost always
> > > > results in not being able to send any packets.  (I cannot determine
> > > > if receiving packets works in this case, as the network design will
> > > > not receive packets unless some are sent first.)  Restarting the
> > > > driver (rmmod+modprobe) does recover from this case (until the next
> > > > link loss), but simply replugging the cable never does.
> > > >
> > > > 3. The problem was observed with both I219-V and I219-LM (on
> > > > motherboard), but was *not* observed with 82571EB (PCIE).  The
> > > > problem was not observed with a motherboard igb-based I211.  I
> > > suspect the issue is limited to motherboard-based e1000e adapters.
> > > (Or perhaps there's something different about how the IGBs are
> > > internally connected.)
> > >
> > > 4. The problem does not occur when the e1000e driver is registered
> > > "normally" as a Linux netdev.
> > >
> > > > 5. The problem was introduced by "mei: me: allow runtime pm for
> > > > platform with D0i3" (which has been backported to 4.4+, as far as I
> > > > can tell).  Excluding this commit reliably resolves the issue and
> > > > including it reliably breaks it.
> >
> > The commit hash in the master branch is
> > cc365dcf0e56271bedf3de95f88922abe248e951 and is there since v4.16-rc1.
> >
> > Strange, that it is in 4.4 and 4.9, as it was only tagged for v4.13+.
> >
> > > > 6. Applying the previously suggested patch
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/commit/drivers/net/ethernet/intel/e1000e?id=def4ec6dce393e2136b62a05712f35a7fa5f5e56
> > > > has no effect; the E1000_STATUS_PCIM_STATE bit is not set when the
> > > > issue occurs.
> > >
> > > 7. Given the content of the change in #5, I assumed that the problem
> > > was power-management related, perhaps a side effect of the e1000e
> > > > driver not being registered as a netdev.  (So perhaps something
> > > > thinks that no devices are in use and turns something off?)
> > > >
> > > > 8. I've previously posted register dumps from an e1000e in both the
> > > > "normal" and "link up but not transmitting" states.  They seemed
> > > > very similar, but as I'm not familiar with the register meanings I
> > > > may have overlooked something significant.  (Note that the dumps
> > > > were captured inside the watchdog task, when it detects link up but
> > > > before it sets E1000_TCTL_EN.)
> > > >
> > > > 9. I enabled debug logging in the mei driver; it logs a couple of
> > > > runtime_idles and then a runtime_suspend during system startup.  (I
> > > > added a log to runtime_resume that is missing in the driver source,
> > > > but it appears this does not get called in my scenario.)  Note that
> > > the e1000e driver is still working ok after this.. at least at first.
> > >
> > > 10. "cat /sys/bus/devices/pci0000:00/0000:00:16.0/power/runtime_status"
> > > => "suspended"
> > >  ??? "cat
> > >
> /sys/bus/devices/pci0000:00/0000:00:16.0/mei/mei0/power/runtime_status"
> > > => "unsupported"
> > >  ??? "cat /sys/bus/devices/pci0000:00/0000:00:1f.0/power/runtime_status"
> > > => "active"
> > >  ??? "cat /sys/bus/devices/pci0000:00/0000:00:1f.6/power/runtime_status"
> > > => "active" (this is the actual NIC)
> > >  ??? These don't change between the working and non-working states.
> > > (It's possible that some other device does, but I haven't found it
> > > yet.)
> > >
> > > 11. I did try forcing the above to unsuspend, but this did not
> > > recover from the e1000e issue.
> > >
> > > > 12. I also tried calling e1000e_reset on link-down.  This produces
> > > > different register output on link-up, but doesn't recover from the
> > > > issue.
> > > >
> > > > 13. I also tried recompiling the kernel with CONFIG_PM disabled (no
> > > > power management).  This *does* resolve the problem (but is a very
> > > big hammer).
> > >
> > > 14. Possibly also of interest is that if I do *both* #12 and #13,
> > > the problem remains (suggesting #12 was counter-productive).
> > >
> > > FYI the hardware on one of the test machines is as follows:
> > >  ??? 00:00.0 Host bridge: Intel Corporation Device 591f (rev 05)
> > >  ??? 00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller
> > > (x16) (rev 05)
> > >  ??? 00:02.0 VGA compatible controller: Intel Corporation Device
> > > 5912 (rev 04)
> > >  ??? 00:08.0 System peripheral: Intel Corporation Skylake Gaussian
> > > Mixture Model
> > >  ??? 00:14.0 USB controller: Intel Corporation Sunrise Point-H USB
> > > 3.0  xHCI Controller (rev 31)
> > >  ??? 00:14.2 Signal processing controller: Intel Corporation Sunrise
> > > Point-H Thermal subsystem (rev 31)
> > >  ??? 00:15.0 Signal processing controller: Intel Corporation Sunrise
> > > Point-H Serial IO I2C Controller #0 (rev 31)
> > >  ??? 00:15.1 Signal processing controller: Intel Corporation Sunrise
> > > Point-H Serial IO I2C Controller #1 (rev 31)
> > >  ??? 00:16.0 Communication controller: Intel Corporation Sunrise
> > > Point-H CSME HECI #1 (rev 31)
> > >  ??? 00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA
> > > controller [AHCI mode] (rev 31)
> > >  ??? 00:1b.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Root
> > > Port #19 (rev f1)
> > >  ??? 00:1b.3 PCI bridge: Intel Corporation Sunrise Point-H PCI Root
> > > Port #20 (rev f1)
> > >  ??? 00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI
> > > Express Root Port #5 (rev f1)
> > >  ??? 00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI
> > > Express Root Port #11 (rev f1)
> > >  ??? 00:1e.0 Signal processing controller: Intel Corporation Sunrise
> > > Point-H Serial IO UART #0 (rev 31)
> > >  ??? 00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC
> > > Controller (rev 31)
> > >  ??? 00:1f.2 Memory controller: Intel Corporation Sunrise Point-H
> > > PMC (rev 31)
> > >  ??? 00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
> > >  ??? 00:1f.6 Ethernet controller: Intel Corporation Ethernet
> > > Connection (2) I219-LM (rev 31)
> > >  ??? 02:00.0 Ethernet controller: Intel Corporation I211 Gigabit
> > > Network Connection (rev 03)
> > >  ??? 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit
> > > Network Connection (rev 03)
> > >  ??? 05:00.0 Ethernet controller: Intel Corporation I211 Gigabit
> > > Network Connection (rev 03)
> > >
> > > I'm happy to add any code instrumentation or make any other changes
> > > needed to locate and resolve the problem, and I can readily
> > > reproduce it
> > > -- I'm just at a complete loss as to where to start looking, and am
> > > still hoping for some suggestions in that regard.
> > >
> > > If there's anywhere (or anyone) else better for me to talk to about
> > > this issue, please let me know that too.
> >
> > > It is not clear to me whether this is still reproducible on Linux 5.3-rc7
> > > (or Linus' master branch).
> > >
> > > If it is, this is definitely a regression, and the commits need to be
> > > reverted due to Linux' no-regression policy.
> 
> So I should revert this from 4.4.y and 4.9.y?

The issue is not in the mei driver, it is in the e1000 driver. To the best of my knowledge there should be a fix. Vitaly, can it please be backported to older kernels?
Thanks
Tomas




* [Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver
  2019-09-03  9:28                   ` Winkler, Tomas
@ 2019-09-03  9:39                     ` Paul Menzel
  2019-09-03 11:00                       ` Gavin Lambert
  0 siblings, 1 reply; 18+ messages in thread
From: Paul Menzel @ 2019-09-03  9:39 UTC (permalink / raw)
  To: intel-wired-lan

Dear Tomas,


On 2019-09-03 11:28, Winkler, Tomas wrote:

>> On Tue, Sep 03, 2019 at 10:35:30AM +0200, Paul Menzel wrote:

>>> On 03.09.19 09:56, Gavin Lambert wrote:
>>>> On 2019-08-20 14:15, I wrote:
>>>>> Does anyone have any ideas about this?  Either towards further
>>>>> investigation or to a possible resolution?
>>>>>
>>>>> This is at the point of hardware internals now, so I have no idea
>>>>> how to proceed in either area.
>>>>
>>>> To recap (plus some new info):
>>>>
>>>> 1. I am using a kernel module which uses the code from the e1000e
>>>> driver to communicate with the hardware without actually registering
>>>> it as a Linux netdev.  (This is partly because it can get used in a
>>>> Xenomai context outside of Linux itself, although I'm not doing that
>>>> myself.) This historically works fine.
>>>>
>>>> 2. On certain Linux versions, I encountered an issue where
>>>> disconnecting the network cable and reconnecting it almost always
>>>> results in not being able to send any packets.  (I cannot determine
>>>> if receiving packets works in this case, as the network design will
>>>> not receive packets unless some are sent first.)  Restarting the
>>>> driver (rmmod+modprobe) does recover from this case (until the next
>>>> link loss), but simply replugging the cable never does.
>>>>
>>>> 3. The problem was observed with both I219-V and I219-LM (on
>>>> motherboard), but was *not* observed with 82571EB (PCIE).  The
>>>> problem was not observed with a motherboard igb-based I211.  I
>>>> suspect the issue is limited to motherboard-based e1000e adapters.
>>>> (Or perhaps there's something different about how the IGBs are
>>>> internally connected.)
>>>>
>>>> 4. The problem does not occur when the e1000e driver is registered
>>>> "normally" as a Linux netdev.
>>>>
>>>> 5. The problem was introduced by "mei: me: allow runtime pm for
>>>> platform with D0i3" (which has been backported to 4.4+, as far as I
>>>> can tell).  Excluding this commit reliably resolves the issue and
>>>> including it reliably breaks it.
>>>
>>> The commit hash in the master branch is
>>> cc365dcf0e56271bedf3de95f88922abe248e951 and is there since v4.16-rc1.
>>>
>>> Strange, that it is in 4.4 and 4.9, as it was only tagged for v4.13+.
>>>
>>>> 6. Applying the previously suggested patch
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/commit/drivers/net/ethernet/intel/e1000e?id=def4ec6dce393e2136b62a05712f35a7fa5f5e56
>>>> has no effect; the E1000_STATUS_PCIM_STATE bit is not set when the
>>>> issue occurs.
>>>>
>>>> 7. Given the content of the change in #5, I assumed that the problem
>>>> was power-management related, perhaps a side effect of the e1000e
>>>> driver not being registered as a netdev.  (So perhaps something
>>>> thinks that no devices are in use and turns something off?)
>>>>
>>>> 8. I've previously posted register dumps from an e1000e in both the
>>>> "normal" and "link up but not transmitting" states.  They seemed
>>>> very similar, but as I'm not familiar with the register meanings I
>>>> may have overlooked something significant.  (Note that the dumps
>>>> were captured inside the watchdog task, when it detects link up but
>>>> before it sets E1000_TCTL_EN.)
>>>>
>>>> 9. I enabled debug logging in the mei driver; it logs a couple of
>>>> runtime_idles and then a runtime_suspend during system startup.  (I
>>>> added a log to runtime_resume that is missing in the driver source,
>>>> but it appears this does not get called in my scenario.)  Note that
>>>> the e1000e driver is still working ok after this.. at least at first.
>>>>
>>>> 10. "cat /sys/bus/devices/pci0000:00/0000:00:16.0/power/runtime_status"
>>>> => "suspended"
>>>>      "cat
>>>> /sys/bus/devices/pci0000:00/0000:00:16.0/mei/mei0/power/runtime_status"
>>>> => "unsupported"
>>>>      "cat /sys/bus/devices/pci0000:00/0000:00:1f.0/power/runtime_status"
>>>> => "active"
>>>>      "cat /sys/bus/devices/pci0000:00/0000:00:1f.6/power/runtime_status"
>>>> => "active" (this is the actual NIC)
>>>>      These don't change between the working and non-working states.
>>>> (It's possible that some other device does, but I haven't found it
>>>> yet.)
>>>>
>>>> 11. I did try forcing the above to unsuspend, but this did not
>>>> recover from the e1000e issue.
>>>>
>>>> 12. I also tried calling e1000e_reset on link-down.  This produces
>>>> different register output on link-up, but doesn't recover from the
>>>> issue.
>>>>
>>>> 13. I also tried recompiling the kernel with CONFIG_PM disabled (no
>>>> power management).  This *does* resolve the problem (but is a very
>>>> big hammer).
>>>>
>>>> 14. Possibly also of interest is that if I do *both* #12 and #13,
>>>> the problem remains (suggesting #12 was counter-productive).
>>>>
>>>> FYI the hardware on one of the test machines is as follows:
>>>>     00:00.0 Host bridge: Intel Corporation Device 591f (rev 05)
>>>>     00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller (x16) (rev 05)
>>>>     00:02.0 VGA compatible controller: Intel Corporation Device 5912 (rev 04)
>>>>     00:08.0 System peripheral: Intel Corporation Skylake Gaussian Mixture Model
>>>>     00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0  xHCI Controller (rev 31)
>>>>     00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
>>>>     00:15.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #0 (rev 31)
>>>>     00:15.1 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #1 (rev 31)
>>>>     00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
>>>>     00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
>>>>     00:1b.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #19 (rev f1)
>>>>     00:1b.3 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #20 (rev f1)
>>>>     00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #5 (rev f1)
>>>>     00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #11 (rev f1)
>>>>     00:1e.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO UART #0 (rev 31)
>>>>     00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
>>>>     00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
>>>>     00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
>>>>     00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
>>>>     02:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
>>>>     03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
>>>>     05:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)

(Tomas, your MUA wrapped the lines messing up the formatting.)

>>>> I'm happy to add any code instrumentation or make any other changes
>>>> needed to locate and resolve the problem, and I can readily
>>>> reproduce it
>>>> -- I'm just at a complete loss as to where to start looking, and am
>>>> still hoping for some suggestions in that regard.
>>>>
>>>> If there's anywhere (or anyone) else better for me to talk to about
>>>> this issue, please let me know that too.
>>>
>>> It is not clear to me whether this is still reproducible on Linux 5.3-rc7
>>> (or Linus' master branch).
>>>
>>> If it is, this is definitely a regression, and the commits need to be
>>> reverted due to Linux' no-regression policy.
>>
>> So I should revert this from 4.4.y and 4.9.y?
> 
> The issue is not in the mei driver, it is in the e1000 driver. To my best
> knowledge there should be a fix; Vitaly, can it please be backported to
> older kernels?

Tomas, backporting the commit supposedly fixing this does *not* help.
Also, it does not matter for the no-regression policy.

Let's wait until Gavin can confirm if it is happening with Linux 5.3-rc7.


Kind regards,

Paul


* [Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver
  2019-09-03  9:39                     ` Paul Menzel
@ 2019-09-03 11:00                       ` Gavin Lambert
  2019-09-04 10:06                         ` Winkler, Tomas
  0 siblings, 1 reply; 18+ messages in thread
From: Gavin Lambert @ 2019-09-03 11:00 UTC (permalink / raw)
  To: intel-wired-lan

On 2019-09-03 21:39, Paul Menzel wrote:
> Dear Tomas,
> 
> On 2019-09-03 11:28, Winkler, Tomas wrote:
> 
>>> On Tue, Sep 03, 2019 at 10:35:30AM +0200, Paul Menzel wrote:
> 
>>>> On 03.09.19 09:56, Gavin Lambert wrote:
>>>>> On 2019-08-20 14:15, I wrote:
>>>>>> Does anyone have any ideas about this?  Either towards further
>>>>>> investigation or to a possible resolution?
>>>>>> 
>>>>>> This is at the point of hardware internals now, so I have no idea
>>>>>> how to proceed in either area.
>>>>> 
>>>>> To recap (plus some new info):
>>>>> 
>>>>> 1. I am using a kernel module which uses the code from the e1000e
>>>>> driver to communicate with the hardware without actually 
>>>>> registering
>>>>> it as a Linux netdev.  (This is partly because it can get used in a
>>>>> Xenomai context outside of Linux itself, although I'm not doing 
>>>>> that
>>>>> myself.) This historically works fine.
>>>>> 
>>>>> 2. On certain Linux versions, I encountered an issue where
>>>>> disconnecting the network cable and reconnecting it almost always
>>>>> results in not being able to send any packets.  (I cannot determine
>>>>> if receiving packets works in this case, as the network design will
>>>>> not receive packets unless some are sent first.)  Restarting the
>>>>> driver (rmmod+modprobe) does recover from this case (until the next
>>>>> link loss), but simply replugging the cable never does.
>>>>> 
>>>>> 3. The problem was observed with both I219-V and I219-LM (on
>>>>> motherboard), but was *not* observed with 82571EB (PCIE).  The
>>>>> problem was not observed with a motherboard igb-based I211.  I
>>>>> suspect the issue is limited to motherboard-based e1000e adapters.
>>>>> (Or perhaps there's something different about how the IGBs are
>>>>> internally connected.)
>>>>> 
>>>>> 4. The problem does not occur when the e1000e driver is registered
>>>>> "normally" as a Linux netdev.
>>>>> 
>>>>> 5. The problem was introduced by "mei: me: allow runtime pm for
>>>>> platform with D0i3" (which has been backported to 4.4+, as far as I
>>>>> can tell).
>>>>> Excluding this commit reliably resolves the issue and including it
>>>>> reliably breaks it.
>>>> 
>>>> The commit hash in the master branch is
>>>> cc365dcf0e56271bedf3de95f88922abe248e951 and is there since 
>>>> v4.16-rc1.
>>>> 
>>>> Strange, that it is in 4.4 and 4.9, as it was only tagged for 
>>>> v4.13+.
>>>> 
>>>>> 6. Applying the previously suggested patch
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/commit/drivers/net/ethernet/intel/e1000e?id=def4ec6dce393e2136b62a05712f35a7fa5f5e56
>>>>> has no effect; the E1000_STATUS_PCIM_STATE bit
>>>>> is not set when the issue occurs.
>>>>> 
>>>>> 7. Given the content of the change in #5, I assumed that the 
>>>>> problem
>>>>> was power-management related, perhaps a side effect of the e1000e
>>>>> driver not being registered as a netdev.  (So perhaps something
>>>>> thinks that no devices are in use and turns something off?)
>>>>> 
>>>>> 8. I've previously posted register dumps from an e1000e in both the
>>>>> "normal" and "link up but not transmitting" states.  They seemed
>>>>> very similar, but as I'm not familiar with the register meanings I
>>>>> may have overlooked something significant.  (Note that the dumps
>>>>> were captured inside the watchdog task, when it detects link up but
>>>>> before it sets
>>>>> E1000_TCTL_EN.)
>>>>> 
>>>>> 9. I enabled debug logging in the mei driver; it logs a couple of
>>>>> runtime_idles and then a runtime_suspend during system startup.  (I
>>>>> added a log to runtime_resume that is missing in the driver source,
>>>>> but it appears this does not get called in my scenario.)  Note that
>>>>> the e1000e driver is still working ok after this.. at least at 
>>>>> first.
>>>>> 
>>>>> 10. "cat /sys/bus/devices/pci0000:00/0000:00:16.0/power/runtime_status"
>>>>> => "suspended"
>>>>>     "cat /sys/bus/devices/pci0000:00/0000:00:16.0/mei/mei0/power/runtime_status"
>>>>> => "unsupported"
>>>>>     "cat /sys/bus/devices/pci0000:00/0000:00:1f.0/power/runtime_status"
>>>>> => "active"
>>>>>     "cat /sys/bus/devices/pci0000:00/0000:00:1f.6/power/runtime_status"
>>>>> => "active" (this is the actual NIC)
>>>>>     These don't change between the working and non-working states.
>>>>> (It's possible that some other device does, but I haven't found it
>>>>> yet.)
>>>>> 
>>>>> 11. I did try forcing the above to unsuspend, but this did not
>>>>> recover from the e1000e issue.
>>>>> 
>>>>> 12. I also tried calling e1000e_reset on link-down.  This produces
>>>>> different register output on link-up, but doesn't recover from the
>>>>> issue.
>>>>> 
>>>>> 13. I also tried recompiling the kernel with CONFIG_PM disabled (no
>>>>> power management).  This *does* resolve the problem (but is a very
>>>>> big hammer).
>>>>> 
>>>>> 14. Possibly also of interest is that if I do *both* #12 and #13,
>>>>> the problem remains (suggesting #12 was counter-productive).
>>>>> 
>>>>> FYI the hardware on one of the test machines is as follows:
>>>>>     00:00.0 Host bridge: Intel Corporation Device 591f (rev 05)
>>>>>     00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller (x16) (rev 05)
>>>>>     00:02.0 VGA compatible controller: Intel Corporation Device 5912 (rev 04)
>>>>>     00:08.0 System peripheral: Intel Corporation Skylake Gaussian Mixture Model
>>>>>     00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0  xHCI Controller (rev 31)
>>>>>     00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
>>>>>     00:15.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #0 (rev 31)
>>>>>     00:15.1 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #1 (rev 31)
>>>>>     00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
>>>>>     00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
>>>>>     00:1b.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #19 (rev f1)
>>>>>     00:1b.3 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #20 (rev f1)
>>>>>     00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #5 (rev f1)
>>>>>     00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #11 (rev f1)
>>>>>     00:1e.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO UART #0 (rev 31)
>>>>>     00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
>>>>>     00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
>>>>>     00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
>>>>>     00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
>>>>>     02:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
>>>>>     03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
>>>>>     05:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
> 
> (Tomas, your MUA wrapped the lines messing up the formatting.)
> 
>>>>> I'm happy to add any code instrumentation or make any other changes
>>>>> needed to locate and resolve the problem, and I can readily
>>>>> reproduce it
>>>>> -- I'm just at a complete loss as to where to start looking, and am
>>>>> still hoping for some suggestions in that regard.
>>>>> 
>>>>> If there's anywhere (or anyone) else better for me to talk to about
>>>>> this issue, please let me know that too.
>>>> 
>>>> It is not clear to me whether this is still reproducible on Linux
>>>> 5.3-rc7 (or Linus' master branch).
>>>>
>>>> If it is, this is definitely a regression, and the commits need to be
>>>> reverted due to Linux' no-regression policy.
>>> 
>>> So I should revert this from 4.4.y and 4.9.y?
>> 
>> The issue is not in the mei driver, it is in the e1000 driver. To my best
>> knowledge there should be a fix; Vitaly, can it please be backported to
>> older kernels?
> 
> Tomas, backporting the commit supposedly fixing this, does *not* help.
> Also, it does not matter for the no regression policy.
> 
> Let's wait until Gavin can confirm if it is happening with Linux
> 5.3-rc7.

As noted above (and in a prior email), the problem doesn't occur when 
using the driver "normally" within Linux.  The triggering environment is 
where the driver init/send/receive code is being executed directly 
*without* being registered as a Linux netdev.

It is likely that the "real problem" is some side effect of this, such
as something checking whether a child device is in use or powered down
when that device is not registered.

My environment is currently based on this tree:

> Using this kernel tree:
>   
> https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git/log/?h=v4.9-rt&ofs=3120
> 
> I've identified that the code at tag v4.9.126 is "good" and the
> code at tag v4.9.127 is "bad".
(I then narrowed it down to that specific commit.)

To reiterate, there is probably no problem with standard usage of the 
drivers as part of Linux.

But in this particular non-standard edge-case usage, there seems to be
some unfortunate interaction between the mei driver power management
change and link loss in the onboard e1000e, and I'm trying to figure out
the cause and hopefully a fix/workaround (or at least one less serious
than disabling power management entirely).
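
Since the runtime_status values checked by hand so far don't change, one
way to look for some other device that does is to snapshot every PCI
device's runtime_status in the working state and again in the non-working
state, then compare.  The following is only a minimal sketch of that idea,
not something that has been run on the affected machine.  It assumes the
usual /sys/bus/pci/devices layout, and the function names
(snapshot_runtime_status, diff_snapshots) are made up for illustration.

```python
from pathlib import Path


def snapshot_runtime_status(devices_dir="/sys/bus/pci/devices"):
    """Return {device name: runtime_status} for every entry in devices_dir.

    devices_dir is an assumption: the standard sysfs PCI device directory.
    """
    snapshot = {}
    for dev in sorted(Path(devices_dir).iterdir()):
        status_file = dev / "power" / "runtime_status"
        try:
            snapshot[dev.name] = status_file.read_text().strip()
        except OSError:
            # Attribute missing or unreadable; record that rather than fail.
            snapshot[dev.name] = "unreadable"
    return snapshot


def diff_snapshots(before, after):
    """Return {device: (before, after)} for devices whose status differs."""
    changed = {}
    for dev in sorted(set(before) | set(after)):
        old, new = before.get(dev), after.get(dev)
        if old != new:
            changed[dev] = (old, new)
    return changed


# Intended use (run once while transmit works, once after the replug):
#   working = snapshot_runtime_status()
#   ... unplug and replug the cable ...
#   broken = snapshot_runtime_status()
#   print(diff_snapshots(working, broken))
```

An empty result would only show that no PCI device changed its
runtime_status; it would not rule out a state change that sysfs doesn't
expose.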

Some more context from my original email:
> I'm using a system with an e1000e network driver which has been patched 
> to bypass the regular Linux network stack (because it can get called 
> from a Xenomai RT context, among other reasons -- although in my case 
> I'm not doing that).  The complete source for the patched version of 
> the code can be found here:
>     
> https://github.com/ribalda/ethercat/blob/master/devices/e1000e/netdev-4.9-ethercat.c
> (There are some minor changes to other files, but the majority of 
> changes are only to this file.  You can see just the changes at 
> https://gist.github.com/uecasm/5e36a15bda6ffd53079344fc443dcc5f/revisions 
> .)
> 
> It was originally based on the in-kernel e1000e driver as of Linux 
> 4.9.65.  (I'm not the person who originally made the patches, but I am 
> the person who rebased them to kernel 4.9 and I'm the one trying to 
> maintain them for newer kernel versions.  Though I'm also not the 
> person who made that github repo.)


* [Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver
  2019-09-03 11:00                       ` Gavin Lambert
@ 2019-09-04 10:06                         ` Winkler, Tomas
  2019-09-04 11:08                           ` Gavin Lambert
  0 siblings, 1 reply; 18+ messages in thread
From: Winkler, Tomas @ 2019-09-04 10:06 UTC (permalink / raw)
  To: intel-wired-lan

> 
> On 2019-09-03 21:39, Paul Menzel wrote:
> > Dear Tomas,
> >
> > On 2019-09-03 11:28, Winkler, Tomas wrote:
> >
> >>> On Tue, Sep 03, 2019 at 10:35:30AM +0200, Paul Menzel wrote:
> >
> >>>> On 03.09.19 09:56, Gavin Lambert wrote:
> >>>>> On 2019-08-20 14:15, I wrote:
> >>>>>> Does anyone have any ideas about this?  Either towards further
> >>>>>> investigation or to a possible resolution?
> >>>>>>
> >>>>>> This is at the point of hardware internals now, so I have no idea
> >>>>>> how to proceed in either area.
> >>>>>
> >>>>> To recap (plus some new info):
> >>>>>
> >>>>> 1. I am using a kernel module which uses the code from the e1000e
> >>>>> driver to communicate with the hardware without actually
> >>>>> registering it as a Linux netdev.  (This is partly because it can
> >>>>> get used in a Xenomai context outside of Linux itself, although
> >>>>> I'm not doing that
> >>>>> myself.) This historically works fine.
> >>>>>
> >>>>> 2. On certain Linux versions, I encountered an issue where
> >>>>> disconnecting the network cable and reconnecting it almost always
> >>>>> results in not being able to send any packets.  (I cannot
> >>>>> determine if receiving packets works in this case, as the network
> >>>>> design will not receive packets unless some are sent first.)
> >>>>> Restarting the driver (rmmod+modprobe) does recover from this case
> >>>>> (until the next link loss), but simply replugging the cable never does.
> >>>>>
> >>>>> 3. The problem was observed with both I219-V and I219-LM (on
> >>>>> motherboard), but was *not* observed with 82571EB (PCIE).  The
> >>>>> problem was not observed with a motherboard igb-based I211.  I
> >>>>> suspect the issue is limited to motherboard-based e1000e adapters.
> >>>>> (Or perhaps there's something different about how the IGBs are
> >>>>> internally connected.)
> >>>>>
> >>>>> 4. The problem does not occur when the e1000e driver is registered
> >>>>> "normally" as a Linux netdev.
> >>>>>
> >>>>> 5. The problem was introduced by "mei: me: allow runtime pm for
> >>>>> platform with D0i3" (which has been backported to 4.4+, as far as
> >>>>> I can tell).
> >>>>> Excluding this commit reliably resolves the issue and including it
> >>>>> reliably breaks it.
> >>>>
> >>>> The commit hash in the master branch is
> >>>> cc365dcf0e56271bedf3de95f88922abe248e951 and is there since
> >>>> v4.16-rc1.
> >>>>
> >>>> Strange, that it is in 4.4 and 4.9, as it was only tagged for
> >>>> v4.13+.
> >>>>
> >>>>> 6. Applying the previously suggested patch
> >>>>> https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/commit/drivers/net/ethernet/intel/e1000e?id=def4ec6dce393e2136b62a05712f35a7fa5f5e56
> >>>>> has no effect; the E1000_STATUS_PCIM_STATE
> >>>>> bit is not set when the issue occurs.
> >>>>>
> >>>>> 7. Given the content of the change in #5, I assumed that the
> >>>>> problem was power-management related, perhaps a side effect of the
> >>>>> e1000e driver not being registered as a netdev.  (So perhaps
> >>>>> something thinks that no devices are in use and turns something
> >>>>> off?)
> >>>>>
> >>>>> 8. I've previously posted register dumps from an e1000e in both
> >>>>> the "normal" and "link up but not transmitting" states.  They
> >>>>> seemed very similar, but as I'm not familiar with the register
> >>>>> meanings I may have overlooked something significant.  (Note that
> >>>>> the dumps were captured inside the watchdog task, when it detects
> >>>>> link up but before it sets
> >>>>> E1000_TCTL_EN.)
> >>>>>
> >>>>> 9. I enabled debug logging in the mei driver; it logs a couple of
> >>>>> runtime_idles and then a runtime_suspend during system startup.
> >>>>> (I added a log to runtime_resume that is missing in the driver
> >>>>> source, but it appears this does not get called in my scenario.)
> >>>>> Note that the e1000e driver is still working ok after this.. at
> >>>>> least at first.
> >>>>>
> >>>>> 10. "cat /sys/bus/devices/pci0000:00/0000:00:16.0/power/runtime_status"
> >>>>> => "suspended"
> >>>>>     "cat /sys/bus/devices/pci0000:00/0000:00:16.0/mei/mei0/power/runtime_status"
> >>>>> => "unsupported"
> >>>>>     "cat /sys/bus/devices/pci0000:00/0000:00:1f.0/power/runtime_status"
> >>>>> => "active"
> >>>>>     "cat /sys/bus/devices/pci0000:00/0000:00:1f.6/power/runtime_status"
> >>>>> => "active" (this is the actual NIC)
> >>>>>     These don't change between the working and non-working states.
> >>>>> (It's possible that some other device does, but I haven't found it
> >>>>> yet.)
> >>>>>
> >>>>> 11. I did try forcing the above to unsuspend, but this did not
> >>>>> recover from the e1000e issue.
> >>>>>
> >>>>> 12. I also tried calling e1000e_reset on link-down.  This produces
> >>>>> different register output on link-up, but doesn't recover from the
> >>>>> issue.
> >>>>>
> >>>>> 13. I also tried recompiling the kernel with CONFIG_PM disabled
> >>>>> (no power management).  This *does* resolve the problem (but is a
> >>>>> very big hammer).
> >>>>>
> >>>>> 14. Possibly also of interest is that if I do *both* #12 and #13,
> >>>>> the problem remains (suggesting #12 was counter-productive).
> >>>>>
> >>>>> FYI the hardware on one of the test machines is as follows:
> >>>>>     00:00.0 Host bridge: Intel Corporation Device 591f (rev 05)
> >>>>>     00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller (x16) (rev 05)
> >>>>>     00:02.0 VGA compatible controller: Intel Corporation Device 5912 (rev 04)
> >>>>>     00:08.0 System peripheral: Intel Corporation Skylake Gaussian Mixture Model
> >>>>>     00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0  xHCI Controller (rev 31)
> >>>>>     00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
> >>>>>     00:15.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #0 (rev 31)
> >>>>>     00:15.1 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #1 (rev 31)
> >>>>>     00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
> >>>>>     00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
> >>>>>     00:1b.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #19 (rev f1)
> >>>>>     00:1b.3 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #20 (rev f1)
> >>>>>     00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #5 (rev f1)
> >>>>>     00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #11 (rev f1)
> >>>>>     00:1e.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO UART #0 (rev 31)
> >>>>>     00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
> >>>>>     00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
> >>>>>     00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
> >>>>>     00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
> >>>>>     02:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
> >>>>>     03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
> >>>>>     05:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
> >
> > (Tomas, your MUA wrapped the lines messing up the formatting.)


Sorry, it's Outlook.

> >
> >>>>> I'm happy to add any code instrumentation or make any other
> >>>>> changes needed to locate and resolve the problem, and I can
> >>>>> readily reproduce it
> >>>>> -- I'm just at a complete loss as to where to start looking, and
> >>>>> am still hoping for some suggestions in that regard.
> >>>>>
> >>>>> If there's anywhere (or anyone) else better for me to talk to
> >>>>> about this issue, please let me know that too.
> >>>>
> >>>> It is not clear to me whether this is still reproducible on Linux
> >>>> 5.3-rc7 (or Linus' master branch).
> >>>>
> >>>> If it is, this is definitely a regression, and the commits need to
> >>>> be reverted due to Linux' no-regression policy.
> >>>
> >>> So I should revert this from 4.4.y and 4.9.y?
> >>
> >> The issue is not in the mei driver, it is in the e1000 driver. To my best
> >> knowledge there should be a fix; Vitaly, can it please be backported to
> >> older kernels?
> >
> > Tomas, backporting the commit supposedly fixing this, does *not* help.

I hope that Vitaly can address that.

> > Also, it does not matter for the no regression policy.

There are power consumption implications if you revert this commit for everyone, while the issue is present only on some platforms.
You can still disable runtime power management via sysfs, and make that permanent with a udev rule on your particular system,
e.g. ATTR{../../power/control}="on"
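
For the non-permanent sysfs variant, a minimal sketch follows. It assumes the mei device is 0000:00:16.0 (the CSME HECI entry in the lspci listing quoted above) and the usual /sys/bus/pci/devices layout. The helper name set_runtime_pm is made up for illustration. Writing "on" to power/control keeps the device active (runtime PM disabled); "auto" hands control back to runtime PM.

```python
from pathlib import Path


def set_runtime_pm(device_dir, mode):
    """Write 'on' or 'auto' to <device_dir>/power/control.

    'on' keeps the device powered (runtime PM disabled); 'auto' re-enables
    runtime PM. Returns the previous setting so it can be restored later.
    """
    if mode not in ("on", "auto"):
        raise ValueError("mode must be 'on' or 'auto'")
    control = Path(device_dir) / "power" / "control"
    previous = control.read_text().strip()
    control.write_text(mode + "\n")
    return previous


# Intended use, as root (the device address is an assumption taken from
# the lspci listing in this thread):
#   set_runtime_pm("/sys/bus/pci/devices/0000:00:16.0", "on")
```

This does not survive a reboot, which is why the udev rule is the better fit for a permanent setup.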

> >
> > Let's wait until Gavin can confirm if it is happening with Linux
> > 5.3-rc7.
> 
> As noted above (and in a prior email), the problem doesn't occur when using
> the driver "normally" within Linux.  The triggering environment is where the
> driver init/send/receive code is being executed directly
> *without* being registered as a Linux netdev.
> 
> It is likely that the "real problem" is some side effect of this, such as
> something checking if a child device is in use or powered down but it's not
> registered.
> 
> My environment is currently based on this tree:
> 
> > Using this kernel tree:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git/log/?h=v4.9-rt&ofs=3120
> >
> > I've identified that the code at tag v4.9.126 is "good" and the code
> > at tag v4.9.127 is "bad".
> (I then narrowed it down to that specific commit.)
> 
> To reiterate, there is probably no problem with standard usage of the
> drivers as part of Linux.
> 
> But in this particular non-standard-edge-case-usage, there seems to be some
> unfortunate interaction between the mei driver power management change
> and link-loss in onboard e1000e, and I'm trying to figure out the cause and
> hopefully a fix/workaround (or at least one less serious than disabling power
> management entirely).
This is some underlying issue; I don't think you will be able to resolve it yourself, the e1000 guys should provide the fix.
Unfortunately I cannot really fix this issue from the mei side.

> 
> Some more context from my original email:
> > I'm using a system with an e1000e network driver which has been
> > patched to bypass the regular Linux network stack (because it can get
> > called from a Xenomai RT context, among other reasons -- although in
> > my case I'm not doing that).  The complete source for the patched
> > version of the code can be found here:
> >
> > https://github.com/ribalda/ethercat/blob/master/devices/e1000e/netdev-4.9-ethercat.c
> > (There are some minor changes to other files, but the
> > majority of changes are only to this file.  You can see just the
> > changes at
> > https://gist.github.com/uecasm/5e36a15bda6ffd53079344fc443dcc5f/revisions
> > .)
> >
> > It was originally based on the in-kernel e1000e driver as of Linux
> > 4.9.65.  (I'm not the person who originally made the patches, but I am
> > the person who rebased them to kernel 4.9 and I'm the one trying to
> > maintain them for newer kernel versions.  Though I'm also not the
> > person who made that github repo.)

You will eventually need to incorporate the e1000 fix into your code base as well, once it is available.
For now the easiest workaround is to disable power management on mei from outside on affected platforms.

Tomas



* [Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver
  2019-09-04 10:06                         ` Winkler, Tomas
@ 2019-09-04 11:08                           ` Gavin Lambert
  2019-09-04 12:31                             ` Lifshits, Vitaly
  2019-09-05  3:59                             ` Gavin Lambert
  0 siblings, 2 replies; 18+ messages in thread
From: Gavin Lambert @ 2019-09-04 11:08 UTC (permalink / raw)
  To: intel-wired-lan

On 2019-09-04 22:06, Winkler, Tomas wrote:
>> 
>> On 2019-09-03 21:39, Paul Menzel wrote:
>> > Dear Tomas,
>> >
>> > On 2019-09-03 11:28, Winkler, Tomas wrote:
>> >
>> >>> On Tue, Sep 03, 2019 at 10:35:30AM +0200, Paul Menzel wrote:
>> >
>> >>>> On 03.09.19 09:56, Gavin Lambert wrote:
>> >>>>> On 2019-08-20 14:15, I wrote:
>> >>>>>> Does anyone have any ideas about this?  Either towards further
>> >>>>>> investigation or to a possible resolution?
>> >>>>>>
>> >>>>>> This is at the point of hardware internals now, so I have no idea
>> >>>>>> how to proceed in either area.
>> >>>>>
>> >>>>> To recap (plus some new info):
>> >>>>>
>> >>>>> 1. I am using a kernel module which uses the code from the e1000e
>> >>>>> driver to communicate with the hardware without actually
>> >>>>> registering it as a Linux netdev.  (This is partly because it can
>> >>>>> get used in a Xenomai context outside of Linux itself, although
>> >>>>> I'm not doing that
>> >>>>> myself.) This historically works fine.
>> >>>>>
>> >>>>> 2. On certain Linux versions, I encountered an issue where
>> >>>>> disconnecting the network cable and reconnecting it almost always
>> >>>>> results in not being able to send any packets.  (I cannot
>> >>>>> determine if receiving packets works in this case, as the network
>> >>>>> design will not receive packets unless some are sent first.)
>> >>>>> Restarting the driver (rmmod+modprobe) does recover from this case
>> >>>>> (until the next link loss), but simply replugging the cable never does.
>> >>>>>
>> >>>>> 3. The problem was observed with both I219-V and I219-LM (on
>> >>>>> motherboard), but was *not* observed with 82571EB (PCIE).  The
>> >>>>> problem was not observed with a motherboard igb-based I211.  I
>> >>>>> suspect the issue is limited to motherboard-based e1000e adapters.
>> >>>>> (Or perhaps there's something different about how the IGBs are
>> >>>>> internally connected.)
>> >>>>>
>> >>>>> 4. The problem does not occur when the e1000e driver is registered
>> >>>>> "normally" as a Linux netdev.
>> >>>>>
>> >>>>> 5. The problem was introduced by "mei: me: allow runtime pm for
>> >>>>> platform with D0i3" (which has been backported to 4.4+, as far as
>> >>>>> I can tell).
>> >>>>> Excluding this commit reliably resolves the issue and including it
>> >>>>> reliably breaks it.
>> >>>>
>> >>>> The commit hash in the master branch is
>> >>>> cc365dcf0e56271bedf3de95f88922abe248e951 and is there since
>> >>>> v4.16-rc1.
>> >>>>
>> >>>> Strange, that it is in 4.4 and 4.9, as it was only tagged for
>> >>>> v4.13+.
>> >>>>
>> >>>>> 6. Applying the previously suggested patch
>> >>>>> https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/commit/drivers/net/ethernet/intel/e1000e?id=def4ec6dce393e2136b62a05712f35a7fa5f5e56
>> >>>>> has no effect; the E1000_STATUS_PCIM_STATE
>> >>>>> bit is not set when the issue occurs.
>> >>>>>
>> >>>>> 7. Given the content of the change in #5, I assumed that the
>> >>>>> problem was power-management related, perhaps a side effect of the
>> >>>>> e1000e driver not being registered as a netdev.  (So perhaps
>> >>>>> something thinks that no devices are in use and turns something
>> >>>>> off?)
>> >>>>>
>> >>>>> 8. I've previously posted register dumps from an e1000e in both
>> >>>>> the "normal" and "link up but not transmitting" states.  They
>> >>>>> seemed very similar, but as I'm not familiar with the register
>> >>>>> meanings I may have overlooked something significant.  (Note that
>> >>>>> the dumps were captured inside the watchdog task, when it detects
>> >>>>> link up but before it sets
>> >>>>> E1000_TCTL_EN.)
>> >>>>>
>> >>>>> 9. I enabled debug logging in the mei driver; it logs a couple of
>> >>>>> runtime_idles and then a runtime_suspend during system startup.
>> >>>>> (I added a log to runtime_resume that is missing in the driver
>> >>>>> source, but it appears this does not get called in my scenario.)
>> >>>>> Note that the e1000e driver is still working ok after this.. at
>> >>>>> least at first.
>> >>>>>
>> >>>>> 10. "cat /sys/bus/devices/pci0000:00/0000:00:16.0/power/runtime_status"
>> >>>>> => "suspended"
>> >>>>>     "cat /sys/bus/devices/pci0000:00/0000:00:16.0/mei/mei0/power/runtime_status"
>> >>>>> => "unsupported"
>> >>>>>     "cat /sys/bus/devices/pci0000:00/0000:00:1f.0/power/runtime_status"
>> >>>>> => "active"
>> >>>>>     "cat /sys/bus/devices/pci0000:00/0000:00:1f.6/power/runtime_status"
>> >>>>> => "active" (this is the actual NIC)
>> >>>>>     These don't change between the working and non-working states.
>> >>>>> (It's possible that some other device does, but I haven't found it
>> >>>>> yet.)
>> >>>>>
>> >>>>> 11. I did try forcing the above to unsuspend, but this did not
>> >>>>> recover from the e1000e issue.
>> >>>>>
>> >>>>> 12. I also tried calling e1000e_reset on link-down.  This produces
>> >>>>> different register output on link-up, but doesn't recover from the
>> >>>>> issue.
>> >>>>>
>> >>>>> 13. I also tried recompiling the kernel with CONFIG_PM disabled
>> >>>>> (no power management).  This *does* resolve the problem (but is a
>> >>>>> very big hammer).
>> >>>>>
>> >>>>> 14. Possibly also of interest is that if I do *both* #12 and #13,
>> >>>>> the problem remains (suggesting #12 was counter-productive).
>> >>>>>
>> >>>>> FYI the hardware on one of the test machines is as follows:
>> >>>>>     00:00.0 Host bridge: Intel Corporation Device 591f (rev 05)
>> >>>>>     00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller (x16) (rev 05)
>> >>>>>     00:02.0 VGA compatible controller: Intel Corporation Device 5912 (rev 04)
>> >>>>>     00:08.0 System peripheral: Intel Corporation Skylake Gaussian Mixture Model
>> >>>>>     00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)
>> >>>>>     00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
>> >>>>>     00:15.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #0 (rev 31)
>> >>>>>     00:15.1 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #1 (rev 31)
>> >>>>>     00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
>> >>>>>     00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
>> >>>>>     00:1b.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #19 (rev f1)
>> >>>>>     00:1b.3 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #20 (rev f1)
>> >>>>>     00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #5 (rev f1)
>> >>>>>     00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #11 (rev f1)
>> >>>>>     00:1e.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO UART #0 (rev 31)
>> >>>>>     00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
>> >>>>>     00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
>> >>>>>     00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
>> >>>>>     00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
>> >>>>>     02:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
>> >>>>>     03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
>> >>>>>     05:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
>> >
>> > (Tomas, your MUA wrapped the lines messing up the formatting.)
> 
> 
> Sorry, it's outlook.
> 
>> >
>> >>>>> I'm happy to add any code instrumentation or make any other
>> >>>>> changes needed to locate and resolve the problem, and I can
>> >>>>> readily reproduce it
>> >>>>> -- I'm just at a complete loss as to where to start looking, and
>> >>>>> am still hoping for some suggestions in that regard.
>> >>>>>
>> >>>>> If there's anywhere (or anyone) else better for me to talk to
>> >>>>> about this issue, please let me know that too.
>> >>>>
>> >>>> It is not clear to me if this is still reproducible on Linux
>> >>>> 5.3-rc7 (or Linus' master branch).
>> >>>>
>> >>>> If it is, this is definitely a regression, and the commits need to
>> >>>> be reverted due to Linux' no regression policy.
>> >>>
>> >>> So I should revert this from 4.4.y and 4.9.y?
>> >>
>> >> The issue is not in the mei driver, it is in the e1000 driver. To my
>> >> best knowledge there should be a fix; Vitaly, can it please be
>> >> backported to older kernels?
>> >
>> > Tomas, backporting the commit supposedly fixing this, does *not* help.
> 
> I hope that Vitaly can address that.
> 
>> > Also, it does not matter for the no regression policy.
> 
> There are power consumption implications if you revert this commit for
> everyone, while the issue is present only on some platforms.

I wouldn't suggest reverting that change, at least not solely on my 
account (unless it's affecting more people).  It's not only me using 
this code but it's still a very niche case, and outside of "normal" 
Linux usage.

Although it seems a little odd that it ended up in 4.4 and 4.9 when the 
commit said it was intended for 4.13+.  But I don't know how those 
things work.

(Though in a way this was good for me -- it would have been a lot harder 
to run into this issue when switching from 4.9 to 4.19 [which would have 
been the next step] rather than from 4.9.110 to 4.9.168 [which is what 
actually happened].)

> You can still disable runtime power management via sysfs, and
> permanently using a udev rule on your particular system,
> e.g. ATTR{../../power/control}="on"

I'll do some more testing on this tomorrow, but I do recall trying 
setting power/control to "on" (via sysfs) for the device:

   00:16.0 Communication controller: Intel Corporation Sunrise Point-H 
CSME HECI #1 (rev 31)

which was the one that I noticed was suspended.  Is this the mei device?

In any case when I tried it before it didn't seem to help, but I think 
this was after link-down and things had already failed.  I'll try 
testing a few more cases, including doing it pre-emptively.

>> > Let's wait until Gavin can confirm if it is happening with Linux
>> > 5.3-rc7.
>> 
>> As noted above (and in a prior email), the problem doesn't occur when
>> using the driver "normally" within Linux.  The triggering environment
>> is where the driver init/send/receive code is being executed directly
>> *without* being registered as a Linux netdev.
>>
>> It is likely that the "real problem" is some side effect of this, such
>> as something checking if a child device is in use or powered down but
>> it's not registered.
>> 
>> My environment is currently based on this tree:
>> 
>> > Using this kernel tree:
>> >
>> > https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git/log/?h=v4.9-rt&ofs=3120
>> >
>> > I've identified that the code at tag v4.9.126 is "good" and the code
>> > at tag v4.9.127 is "bad".
>> (I then narrowed it down to that specific commit.)
>> 
>> To reiterate, there is probably no problem with standard usage of the
>> drivers as part of Linux.
>> 
>> But in this particular non-standard-edge-case-usage, there seems to be
>> some unfortunate interaction between the mei driver power management
>> change and link-loss in onboard e1000e, and I'm trying to figure out
>> the cause and hopefully a fix/workaround (or at least one less serious
>> than disabling power management entirely).
> This is some underlying issue; I don't think you will be able to
> resolve it yourself, the e1000 guys should provide the fix.
> Unfortunately I cannot really fix this issue from the mei side.
> 
>> 
>> Some more context from my original email:
>> > I'm using a system with an e1000e network driver which has been
>> > patched to bypass the regular Linux network stack (because it can get
>> > called from a Xenomai RT context, among other reasons -- although in
>> > my case I'm not doing that).  The complete source for the patched
>> > version of the code can be found here:
>> >
>> > https://github.com/ribalda/ethercat/blob/master/devices/e1000e/netdev-4.9-ethercat.c
>> > (There are some minor changes to other files, but the
>> > majority of changes are only to this file.  You can see just the
>> > changes at
>> > https://gist.github.com/uecasm/5e36a15bda6ffd53079344fc443dcc5f/revisions .)
>> >
>> > It was originally based on the in-kernel e1000e driver as of Linux
>> > 4.9.65.  (I'm not the person who originally made the patches, but I am
>> > the person who rebased them to kernel 4.9 and I'm the one trying to
>> > maintain them for newer kernel versions.  Though I'm also not the
>> > person who made that github repo.)
> 
> You will eventually need to incorporate the e1000 fix, once resolved,
> into your code base as well.
> For now the easiest workaround is to disable power management on mei
> from outside on affected platforms.

Yeah, I'm hoping that the eventual solution will be a code change to the 
e1000e driver.  The way the distribution is structured it's very easy to 
apply a fix there and much much harder to apply one at any other point.  
Though userspace rule changes are also feasible.


* [Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver
  2019-09-04 11:08                           ` Gavin Lambert
@ 2019-09-04 12:31                             ` Lifshits, Vitaly
  2019-09-05  3:59                             ` Gavin Lambert
  1 sibling, 0 replies; 18+ messages in thread
From: Lifshits, Vitaly @ 2019-09-04 12:31 UTC (permalink / raw)
  To: intel-wired-lan

On 9/4/2019 14:08, Gavin Lambert wrote:
> On 2019-09-04 22:06, Winkler, Tomas wrote:
>>>
>>> On 2019-09-03 21:39, Paul Menzel wrote:
>>> > Dear Tomas,
>>> >
>>> > On 2019-09-03 11:28, Winkler, Tomas wrote:
>>> >
>>> >>> On Tue, Sep 03, 2019 at 10:35:30AM +0200, Paul Menzel wrote:
>>> >
>>> >>>> On 03.09.19 09:56, Gavin Lambert wrote:
>>> >>>>> On 2019-08-20 14:15, I wrote:
>>> >>>>>> Does anyone have any ideas about this?  Either towards further
>>> >>>>>> investigation or to a possible resolution?
>>> >>>>>>
>>> >>>>>> This is at the point of hardware internals now, so I have no idea
>>> >>>>>> how to proceed in either area.
>>> >>>>>
>>> >>>>> To recap (plus some new info):
>>> >>>>>
>>> >>>>> 1. I am using a kernel module which uses the code from the e1000e
>>> >>>>> driver to communicate with the hardware without actually
>>> >>>>> registering it as a Linux netdev.  (This is partly because it can
>>> >>>>> get used in a Xenomai context outside of Linux itself, although
>>> >>>>> I'm not doing that myself.)  This historically works fine.
>>> >>>>>
>>> >>>>> 2. On certain Linux versions, I encountered an issue where
>>> >>>>> disconnecting the network cable and reconnecting it almost always
>>> >>>>> results in not being able to send any packets.  (I cannot
>>> >>>>> determine if receiving packets works in this case, as the network
>>> >>>>> design will not receive packets unless some are sent first.)
>>> >>>>> Restarting the driver (rmmod+modprobe) does recover from this case
>>> >>>>> (until the next link loss), but simply replugging the cable never does.
>>> >>>>>
>>> >>>>> 3. The problem was observed with both I219-V and I219-LM (on
>>> >>>>> motherboard), but was *not* observed with 82571EB (PCIE).  The
>>> >>>>> problem was not observed with a motherboard igb-based I211.  I
>>> >>>>> suspect the issue is limited to motherboard-based e1000e adapters.
>>> >>>>> (Or perhaps there's something different about how the IGBs are
>>> >>>>> internally connected.)
>>> >>>>>
>>> >>>>> 4. The problem does not occur when the e1000e driver is registered
>>> >>>>> "normally" as a Linux netdev.
>>> >>>>>
>>> >>>>> 5. The problem was introduced by "mei: me: allow runtime pm for
>>> >>>>> platform with D0i3" (which has been backported to 4.4+, as far as
>>> >>>>> I can tell).
>>> >>>>> Excluding this commit reliably resolves the issue and including it
>>> >>>>> reliably breaks it.
>>> >>>>
>>> >>>> The commit hash in the master branch is
>>> >>>> cc365dcf0e56271bedf3de95f88922abe248e951 and is there since
>>> >>>> v4.16-rc1.
>>> >>>>
>>> >>>> Strange, that it is in 4.4 and 4.9, as it was only tagged for
>>> >>>> v4.13+.
>>> >>>>
>>> >>>>> 6. Applying the previously suggested patch
>>> >>>>> https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/commit/drivers/net/ethernet/intel/e1000e?id=def4ec6dce393e2136b62a05712f35a7fa5f5e56
>>> >>>>> has no effect; the E1000_STATUS_PCIM_STATE
>>> >>>>> bit is not set when the issue occurs.
>>> >>>>>
>>> >>>>> 7. Given the content of the change in #5, I assumed that the
>>> >>>>> problem was power-management related, perhaps a side effect of the
>>> >>>>> e1000e driver not being registered as a netdev.  (So perhaps
>>> >>>>> something thinks that no devices are in use and turns something
>>> >>>>> off?)
>>> >>>>>
>>> >>>>> 8. I've previously posted register dumps from an e1000e in both
>>> >>>>> the "normal" and "link up but not transmitting" states.  They
>>> >>>>> seemed very similar, but as I'm not familiar with the register
>>> >>>>> meanings I may have overlooked something significant.  (Note that
>>> >>>>> the dumps were captured inside the watchdog task, when it detects
>>> >>>>> link up but before it sets
>>> >>>>> E1000_TCTL_EN.)
>>> >>>>>
>>> >>>>> 9. I enabled debug logging in the mei driver; it logs a couple of
>>> >>>>> runtime_idles and then a runtime_suspend during system startup.
>>> >>>>> (I added a log to runtime_resume that is missing in the driver
>>> >>>>> source, but it appears this does not get called in my scenario.)
>>> >>>>> Note that the e1000e driver is still working ok after this.. at
>>> >>>>> least at first.
>>> >>>>>
>>> >>>>> 10. "cat /sys/bus/devices/pci0000:00/0000:00:16.0/power/runtime_status"
>>> >>>>> => "suspended"
>>> >>>>>     "cat /sys/bus/devices/pci0000:00/0000:00:16.0/mei/mei0/power/runtime_status"
>>> >>>>> => "unsupported"
>>> >>>>>     "cat /sys/bus/devices/pci0000:00/0000:00:1f.0/power/runtime_status"
>>> >>>>> => "active"
>>> >>>>>     "cat /sys/bus/devices/pci0000:00/0000:00:1f.6/power/runtime_status"
>>> >>>>> => "active" (this is the actual NIC)
>>> >>>>>     These don't change between the working and non-working states.
>>> >>>>> (It's possible that some other device does, but I haven't found it
>>> >>>>> yet.)
>>> >>>>>
>>> >>>>> 11. I did try forcing the above to unsuspend, but this did not
>>> >>>>> recover from the e1000e issue.
>>> >>>>>
>>> >>>>> 12. I also tried calling e1000e_reset on link-down.  This produces
>>> >>>>> different register output on link-up, but doesn't recover from the
>>> >>>>> issue.
>>> >>>>>
>>> >>>>> 13. I also tried recompiling the kernel with CONFIG_PM disabled
>>> >>>>> (no power management).  This *does* resolve the problem (but is a
>>> >>>>> very big hammer).
>>> >>>>>
>>> >>>>> 14. Possibly also of interest is that if I do *both* #12 and #13,
>>> >>>>> the problem remains (suggesting #12 was counter-productive).
>>> >>>>>
>>> >>>>> FYI the hardware on one of the test machines is as follows:
>>> >>>>>     00:00.0 Host bridge: Intel Corporation Device 591f (rev 05)
>>> >>>>>     00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller (x16) (rev 05)
>>> >>>>>     00:02.0 VGA compatible controller: Intel Corporation Device 5912 (rev 04)
>>> >>>>>     00:08.0 System peripheral: Intel Corporation Skylake Gaussian Mixture Model
>>> >>>>>     00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)
>>> >>>>>     00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
>>> >>>>>     00:15.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #0 (rev 31)
>>> >>>>>     00:15.1 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #1 (rev 31)
>>> >>>>>     00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
>>> >>>>>     00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
>>> >>>>>     00:1b.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #19 (rev f1)
>>> >>>>>     00:1b.3 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #20 (rev f1)
>>> >>>>>     00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #5 (rev f1)
>>> >>>>>     00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #11 (rev f1)
>>> >>>>>     00:1e.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO UART #0 (rev 31)
>>> >>>>>     00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
>>> >>>>>     00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
>>> >>>>>     00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
>>> >>>>>     00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
>>> >>>>>     02:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
>>> >>>>>     03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
>>> >>>>>     05:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
>>> >
>>> > (Tomas, your MUA wrapped the lines messing up the formatting.)
>>
>>
>> Sorry, it's outlook.
>>
>>> >
>>> >>>>> I'm happy to add any code instrumentation or make any other
>>> >>>>> changes needed to locate and resolve the problem, and I can
>>> >>>>> readily reproduce it
>>> >>>>> -- I'm just at a complete loss as to where to start looking, and
>>> >>>>> am still hoping for some suggestions in that regard.
>>> >>>>>
>>> >>>>> If there's anywhere (or anyone) else better for me to talk to
>>> >>>>> about this issue, please let me know that too.
>>> >>>>
>>> >>>> It is not clear to me if this is still reproducible on Linux
>>> >>>> 5.3-rc7 (or Linus' master branch).
>>> >>>>
>>> >>>> If it is, this is definitely a regression, and the commits need to
>>> >>>> be reverted due to Linux' no regression policy.
>>> >>>
>>> >>> So I should revert this from 4.4.y and 4.9.y?
>>> >>
>>> >> The issue is not in the mei driver, it is in the e1000 driver. To my
>>> >> best knowledge there should be a fix; Vitaly, can it please be
>>> >> backported to older kernels?
>>> >
>>> > Tomas, backporting the commit supposedly fixing this, does *not* help.
>>
>> I hope that Vitaly can address that.

As far as I can see it's not the same issue we had in the upstream 
driver when the mei commit was added.

Backporting this commit is not possible and will not help.

>>
>>> > Also, it does not matter for the no regression policy.
>>
>> There are power consumption implications if you revert this commit for
>> everyone, while the issue is present only on some platforms.
>
> I wouldn't suggest reverting that change, at least not solely on my
> account (unless it's affecting more people).  It's not only me using
> this code but it's still a very niche case, and outside of "normal"
> Linux usage.
>
> Although it seems a little odd that it ended up in 4.4 and 4.9 when
> the commit said it was intended for 4.13+.  But I don't know how those
> things work.
>
> (Though in a way this was good for me -- it would have been a lot
> harder to run into this issue when switching from 4.9 to 4.19 [which
> would have been the next step] rather than from 4.9.110 to 4.9.168
> [which is what actually happened].)
>
>> You can still disable runtime power management via sysfs, and
>> permanently using a udev rule on your particular system,
>> e.g. ATTR{../../power/control}="on"
>
> I'll do some more testing on this tomorrow, but I do recall trying
> setting power/control to "on" (via sysfs) for the device:
>
>   00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
>
> which was the one that I noticed was suspended.  Is this the mei device?
>
> In any case when I tried it before it didn't seem to help, but I think
> this was after link-down and things had already failed.  I'll try
> testing a few more cases, including doing it pre-emptively.

I suggest testing this on kernel 5.3-rc7, as Paul advised, since the bug
wasn't reproduced on that kernel.

>
>>> > Let's wait until Gavin can confirm if it is happening with Linux
>>> > 5.3-rc7.
>>>
>>> As noted above (and in a prior email), the problem doesn't occur when
>>> using the driver "normally" within Linux.  The triggering environment
>>> is where the driver init/send/receive code is being executed directly
>>> *without* being registered as a Linux netdev.
>>>
>>> It is likely that the "real problem" is some side effect of this, such
>>> as something checking if a child device is in use or powered down but
>>> it's not registered.
>>>
>>> My environment is currently based on this tree:
>>>
>>> > Using this kernel tree:
>>> >
>>> > https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git/log/?h=v4.9-rt&ofs=3120
>>> >
>>> > I've identified that the code at tag v4.9.126 is "good" and the code
>>> > at tag v4.9.127 is "bad".
>>> (I then narrowed it down to that specific commit.)
>>>
>>> To reiterate, there is probably no problem with standard usage of the
>>> drivers as part of Linux.
>>>
>>> But in this particular non-standard-edge-case-usage, there seems to be
>>> some unfortunate interaction between the mei driver power management
>>> change and link-loss in onboard e1000e, and I'm trying to figure out
>>> the cause and hopefully a fix/workaround (or at least one less serious
>>> than disabling power management entirely).
>> This is some underlying issue; I don't think you will be able to
>> resolve it yourself, the e1000 guys should provide the fix.
>> Unfortunately I cannot really fix this issue from the mei side.
>>
>>>
>>> Some more context from my original email:
>>> > I'm using a system with an e1000e network driver which has been
>>> > patched to bypass the regular Linux network stack (because it can get
>>> > called from a Xenomai RT context, among other reasons -- although in
>>> > my case I'm not doing that).  The complete source for the patched
>>> > version of the code can be found here:
>>> >
>>> > https://github.com/ribalda/ethercat/blob/master/devices/e1000e/netdev-4.9-ethercat.c
>>> > (There are some minor changes to other files, but the
>>> > majority of changes are only to this file.  You can see just the
>>> > changes at
>>> > https://gist.github.com/uecasm/5e36a15bda6ffd53079344fc443dcc5f/revisions .)
>>> >
>>> > It was originally based on the in-kernel e1000e driver as of Linux
>>> > 4.9.65.  (I'm not the person who originally made the patches, but I am
>>> > the person who rebased them to kernel 4.9 and I'm the one trying to
>>> > maintain them for newer kernel versions.  Though I'm also not the
>>> > person who made that github repo.)
>>
>> You will eventually need to incorporate the e1000 fix, once resolved,
>> into your code base as well.
>> For now the easiest workaround is to disable power management on mei
>> from outside on affected platforms.
>
> Yeah, I'm hoping that the eventual solution will be a code change to
> the e1000e driver.  The way the distribution is structured it's very
> easy to apply a fix there and much much harder to apply one at any
> other point.  Though userspace rule changes are also feasible.

Please try our OOT driver, which can be found at:

https://sourceforge.net/projects/e1000/files/e1000e%20stable/3.5.1/

Also, please open a ticket for this issue on that SourceForge page.



* [Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver
  2019-09-04 11:08                           ` Gavin Lambert
  2019-09-04 12:31                             ` Lifshits, Vitaly
@ 2019-09-05  3:59                             ` Gavin Lambert
  1 sibling, 0 replies; 18+ messages in thread
From: Gavin Lambert @ 2019-09-05  3:59 UTC (permalink / raw)
  To: intel-wired-lan

On 2019-09-04 23:08, I wrote:
> On 2019-09-04 22:06, Winkler, Tomas wrote:
>> You can still disable runtime power management via sysfs, and
>> permanently using a udev rule on your particular system,
>> e.g. ATTR{../../power/control}="on"
> 
> I'll do some more testing on this tomorrow, but I do recall trying
> setting power/control to "on" (via sysfs) for the device:
> 
>   00:16.0 Communication controller: Intel Corporation Sunrise Point-H
> CSME HECI #1 (rev 31)
> 
> which was the one that I noticed was suspended.  Is this the mei 
> device?
> 
> In any case when I tried it before it didn't seem to help, but I think
> this was after link-down and things had already failed.  I'll try
> testing a few more cases, including doing it pre-emptively.

Good news: while forcing the mei_me device to "on" does not recover from 
the problem once it has happened, it does appear to prevent the problem 
from happening.  (I assume this is because it effectively reverts the 
problem commit without any actual code changes.)

I was able to get this to happen on boot by creating
/etc/udev/rules.d/20-mei.rules with the following content:

     ACTION=="add",KERNEL=="mei0",ATTR{../../power/control}="on"

(Initially I tried to match on DRIVER=="mei_me" to avoid the parent 
attribute reference, but this didn't trigger on boot.  The above seems 
to work though.)
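
For a one-off test without rebooting, the same attribute can be written by hand before the first link loss (as noted above, doing it after the failure does not recover). The following is only a minimal sketch in Python, assuming the mei_me PCI device sits at 0000:00:16.0 as in the lspci listing and that the script runs as root; the function name and the default path are illustrative assumptions, not part of any driver or tool.

```python
from pathlib import Path
from typing import Tuple

# Assumption: PCI address of the CSME HECI (mei_me) device taken from the
# lspci listing in this thread; it may differ on other machines.
MEI_PCI_DEVICE = Path("/sys/bus/pci/devices/0000:00:16.0")


def force_runtime_pm_on(device_dir: Path = MEI_PCI_DEVICE) -> Tuple[str, str]:
    """Disable runtime PM for one device by writing "on" to power/control.

    Returns (runtime_status, previous power/control value) as read before
    the write, so the caller can log what state the device was in.
    """
    power = Path(device_dir) / "power"
    # runtime_status reads e.g. "active", "suspended" or "unsupported".
    status = (power / "runtime_status").read_text().strip()
    control = power / "control"
    # power/control holds "auto" (runtime PM allowed) or "on" (kept active).
    previous = control.read_text().strip()
    if previous != "on":
        control.write_text("on\n")
    return status, previous
```

Calling force_runtime_pm_on() with no arguments acts on the assumed default device; passing another sysfs device directory targets a different device. This only mirrors what the udev rule does at boot and does not persist across reboots.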

This is probably a sufficient workaround for now to keep me happy.  Is 
there anything else you wanted me to test while I have the system handy?

(FWIW, I did previously verify that the original problem also occurs in 
Linux 4.19, though I don't recall the precise version at the moment.)


end of thread, other threads:[~2019-09-05  3:59 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-11  6:50 [Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver Gavin Lambert
2019-07-12  3:23 ` Gavin Lambert
2019-07-18  8:06   ` Gavin Lambert
2019-07-18  8:22     ` Paul Menzel
2019-07-18  8:24     ` Neftin, Sasha
2019-07-19  0:40       ` Gavin Lambert
2019-07-19  1:02         ` Gavin Lambert
2019-08-20  2:15           ` Gavin Lambert
2019-09-03  7:56             ` Gavin Lambert
2019-09-03  8:35               ` Paul Menzel
2019-09-03  9:20                 ` Greg Kroah-Hartman
2019-09-03  9:28                   ` Winkler, Tomas
2019-09-03  9:39                     ` Paul Menzel
2019-09-03 11:00                       ` Gavin Lambert
2019-09-04 10:06                         ` Winkler, Tomas
2019-09-04 11:08                           ` Gavin Lambert
2019-09-04 12:31                             ` Lifshits, Vitaly
2019-09-05  3:59                             ` Gavin Lambert
