regressions.lists.linux.dev archive mirror
* Re: [Intel-wired-lan] igc kernel module crashes on new hardware (Intel Ethernet I225-V)
       [not found]           ` <CAPAtJa_hbFbVXQbiNnb_byLqtZ-Dy_EBcvTFH9GyPqt__dFmLQ@mail.gmail.com>
@ 2022-11-18 22:43             ` Conor Dooley
  2022-11-18 22:54               ` Jakub Kicinski
  0 siblings, 1 reply; 10+ messages in thread
From: Conor Dooley @ 2022-11-18 22:43 UTC (permalink / raw)
  To: Ivan Smirnov
  Cc: Neftin, Sasha, Fuxbrumer, Devora, intel-wired-lan,
	Jakub Kicinski, Ruinskiy, Dima, Avivi, Amir, regressions

Hey,

On Wed, Nov 16, 2022 at 02:23:57PM -0800, Ivan Smirnov wrote:
> Hi folks,
> 
> Is there any update for the community? More and more folks are asking. We
> are all techies and happy to help debug.

Vested interest since I am suffering from the same issue (X670E-F
Gaming), but is it okay to add this to regzbot? Not sure whether it
counts as a regression or not since it's new hw with the existing driver,
but this seems to be falling through the cracks without a response for
several weeks.

Thanks,
Conor.

> On Thu, Nov 10, 2022 at 03:44 Ivan Smirnov <isgsmirnov@gmail.com> wrote:
> 
> > Some more data from another user. Do you guys have any preliminary
> > investigation you could share back with the community?
> >
> > Same issue, been struggling with it for the last month or so: both with
> > Ubuntu and Arch Linux. I have a dual-boot system with Windows 11, and did
> > not notice any issues with ethernet or wifi on Windows. So this indeed
> > seems like a firmware issue, particularly in igc. Not the adapter itself
> >
> > Running on Arch Linux kernel 6.0.7, same motherboard as in your post
> >
> > https://gist.github.com/LilDojd/2f030ecc5c5b6f8c3285725adfb8c456
> >
> >
> >
> >
> > On Thu, Nov 3, 2022 at 05:53 Ivan Smirnov <isgsmirnov@gmail.com> wrote:
> >
> >> Here is the gist from one reddit user:
> >> https://gist.github.com/DarkArc/50ffca5fc343e2ff8166bc81d3ff8335
> >>
> >> Here are my gists (crash free for now):
> >> https://gist.github.com/issmirnov/b9ac74d232e1865ae849a3e64dce2afe
> >>
> >> --
> >> Ivan Smirnov
> >> https://ivans.io/ | https://blog.ivansmirnov.name/
> >> https://www.linkedin.com/in/ismirnov |
> >> *https://ivansmirnov.name/ <https://ivansmirnov.name/>*
> >> *https://github.com/issmirnov <https://ivansmirnov.name/>*
> >>
> >>
> >> On Wed, Nov 2, 2022 at 10:54 AM Ivan Smirnov <isgsmirnov@gmail.com>
> >> wrote:
> >>
> >>> Hi folks,
> >>>
> >>> As usual, the computers know when the experts join the chat... I haven't
> >>> been able to reproduce the issue for the past few days. Yay for stability,
> >>> boo for debugging.
> >>>
> >>> I posted on the reddit thread
> >>> <https://www.reddit.com/r/buildapc/comments/xypn1m/network_card_intel_ethernet_controller_i225v_igc/> asking
> >>> other users to post their output. I'll do my best to keep an eye out for
> >>> this issue and get you the logs ASAP once I repro the crash.
> >>>
> >>> Thank you for your responsiveness - will keep you posted!
> >>>
> >>> Best,
> >>> - Ivan
> >>>
> >>>
> >>> On Tue, Nov 1, 2022 at 10:21 AM Neftin, Sasha <sasha.neftin@intel.com>
> >>> wrote:
> >>>
> >>>> On 11/1/2022 02:05, Jakub Kicinski wrote:
> >>>> > CC: intel-wired
> >>>> >
> >>>> > On Sun, 30 Oct 2022 14:44:57 -0600 Ivan Smirnov wrote:
> >>>> >> Hi folks,
> >>>> >>
> >>>> >> I found your commits on the linux kernel igc
> >>>> >> <https://github.com/torvalds/linux/commits/master/drivers/net/ethernet/intel/igc>
> >>>> >> folder. There appears to be a bug with the igc kernel module on Intel
> >>>> >> I225-V chips.
> >>>> >>
> >>>> >> Specifically, the probe fails at startup with error: "igc: probe of
> >>>> >> 0000:06:00.0 failed with error -13". When it does load, it crashes after a
> >>>> >> few hours with error "igc failed to read reg 0xc030".
> >>>> >>
> >>>> Could you provide dmesg -w -T | grep -i igc on the boot stage? ethtool
> >>>> -i?
> >>>> I've cc'd our PAE expert Amir who also could try to look at this
> >>>> problem.
> >>>>
> >>>> >> There are several affected users posting on
> >>>> >> https://www.reddit.com/r/buildapc/comments/xypn1m/network_card_intel_ethernet_controller_i225v_igc/
> >>>> >> with more details.
> >>>> >>
> >>>> >> Could I help you debug this? This problem has been reproduced on the
> >>>> >> following setups:
> >>>> >>
> >>>> >> 1. Asus TUF-GAMING-Z690-PLUS-WIFI-D4
> >>>> >> <https://www.asus.com/motherboards-components/motherboards/tuf-gaming/tuf-gaming-z690-plus-wifi-d4/>
> >>>> >> on Arch Linux, kernel 6.0.2-arch1-1
> >>>> >> 2. rog strix x670e-e gaming wifi
> >>>> >> <https://rog.asus.com/us/motherboards/rog-strix/rog-strix-x670e-e-gaming-wifi-model/>
> >>>> >> on Proxmox 7, as well as Ubuntu Linux (kernel 5.19, I believe)
> >>>> >>
> >>>> >> I'm happy to load any debug modules or provide additional logs as per
> >>>> >> your request.
> >>>> >>
> >>>> >> Thank you
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >>

> _______________________________________________
> Intel-wired-lan mailing list
> Intel-wired-lan@osuosl.org
> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan



* Re: [Intel-wired-lan] igc kernel module crashes on new hardware (Intel Ethernet I225-V)
  2022-11-18 22:43             ` [Intel-wired-lan] igc kernel module crashes on new hardware (Intel Ethernet I225-V) Conor Dooley
@ 2022-11-18 22:54               ` Jakub Kicinski
  2022-11-18 23:21                 ` Conor Dooley
  0 siblings, 1 reply; 10+ messages in thread
From: Jakub Kicinski @ 2022-11-18 22:54 UTC (permalink / raw)
  To: Conor Dooley
  Cc: Ivan Smirnov, Neftin, Sasha, Fuxbrumer, Devora, intel-wired-lan,
	Ruinskiy, Dima, Avivi, Amir, regressions

On Fri, 18 Nov 2022 22:43:29 +0000 Conor Dooley wrote:
> > Is there any update for the community? More and more folks are asking. We
> > are all techies and happy to help debug.  
> 
> Vested interest since I am suffering from the same issue (X670E-F
> Gaming), but is it okay to add this to regzbot? Not sure whether it
> counts as a regression or not since it's new hw with the existing driver,
> but this seems to be falling through the cracks without a response for
> several weeks.

Dunno, Thorsten will decide. The line has to be drawn somewhere
on "vendor doesn't care about Linux support" vs "we broke uAPI".
This is the kind of situation I was alluding to in my line of
questioning at the maintainer summit: https://lwn.net/Articles/908324/

Finding a kernel release which does not suffer from the problem
would certainly strengthen your case.


* Re: [Intel-wired-lan] igc kernel module crashes on new hardware (Intel Ethernet I225-V)
  2022-11-18 22:54               ` Jakub Kicinski
@ 2022-11-18 23:21                 ` Conor Dooley
  2022-11-19 18:06                   ` Neftin, Sasha
  2022-11-20 10:32                   ` Thorsten Leemhuis
  0 siblings, 2 replies; 10+ messages in thread
From: Conor Dooley @ 2022-11-18 23:21 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Ivan Smirnov, Neftin, Sasha, Fuxbrumer, Devora, intel-wired-lan,
	Ruinskiy, Dima, Avivi, Amir, regressions

On Fri, Nov 18, 2022 at 02:54:43PM -0800, Jakub Kicinski wrote:
> On Fri, 18 Nov 2022 22:43:29 +0000 Conor Dooley wrote:
> > > Is there any update for the community? More and more folks are asking. We
> > > are all techies and happy to help debug.  
> > 
> > Vested interest since I am suffering from the same issue (X670E-F
> > Gaming), but is it okay to add this to regzbot? Not sure whether it
> > counts as a regression or not since it's new hw with the existing driver,
> > but this seems to be falling through the cracks without a response for
> > several weeks.
> 
> Dunno, Thorsten's will decide. The line has to be drawn somewhere
> on "vendor doesn't care about Linux support" vs "we broke uAPI".
> This is the kind of situation I was alluding to in my line of
> questioning at the maintainer summit: https://lwn.net/Articles/908324/

Yeah & it is /regression/ tracking, which I don't (or rather didn't)
consider this situation to be. I'm a little unsure as to when I should
trigger regzbot in general:
- immediately when I find something?
- only if it goes a while with nothing constructive?
- is it okay to use it outside of "this used to work and now doesn't"?

Either way, I did some more googling and found this reddit thread:
https://www.reddit.com/r/intel/comments/lqb4km/for_people_having_i225v_connection_issues/

That's being reported against Windows & I dunno if the dude is using
firmware and driver interchangeably etc. But the disabling power saving
etc sounds oddly like the issue we have here, since that was a proposed
workaround in Ivan's 2022 reddit thread.

Supposedly I am on firmware-version 1082:8770, but /I/ have no idea
how that corresponds to Windows versioning. That may lend some credence
to your assertion about firmware being the source of many issues.

> Finding a kernel release which does not suffer from the problem
> would certainly strengthen your case.

Aye, likely to be a little difficult to do a meaningful bisection for
me at least, since the motherboard I have with the problem is an AM5
one for the new Zen4 stuff. I'm not an x86 person, so not entirely
sure when that support landed. I may do some poking tomorrow..
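
For anyone wanting to compare firmware versions across affected boards,
here is a minimal sketch that pulls the fields out of `ethtool -i` text.
It assumes the usual "key: value" layout of that output. The sample text
is illustrative only: the firmware-version is the one quoted above, the
bus address is the one from Ivan's probe error (a different machine), and
parse_ethtool_info is a hypothetical helper name.

```python
def parse_ethtool_info(text: str) -> dict[str, str]:
    """Parse 'key: value' lines as printed by `ethtool -i <iface>`."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as
        # "1082:8770" or "0000:06:00.0" stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


# Illustrative sample, not a real capture from any machine in this thread.
SAMPLE = """\
driver: igc
firmware-version: 1082:8770
bus-info: 0000:06:00.0
"""

info = parse_ethtool_info(SAMPLE)
print(info["driver"], info["firmware-version"])
```

Feeding it the real output of `ethtool -i <iface>` from each affected
system would make it easier to see whether the reports share a firmware
version.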



* Re: [Intel-wired-lan] igc kernel module crashes on new hardware (Intel Ethernet I225-V)
  2022-11-18 23:21                 ` Conor Dooley
@ 2022-11-19 18:06                   ` Neftin, Sasha
  2022-11-20 19:55                     ` Conor Dooley
  2022-11-20 10:32                   ` Thorsten Leemhuis
  1 sibling, 1 reply; 10+ messages in thread
From: Neftin, Sasha @ 2022-11-19 18:06 UTC (permalink / raw)
  To: Conor Dooley, Jakub Kicinski
  Cc: Ivan Smirnov, Fuxbrumer, Devora, intel-wired-lan, Ruinskiy, Dima,
	Avivi, Amir, regressions, Lifshits, Vitaly, naamax.meir, Meir,
	NaamaX

On 11/19/2022 01:21, Conor Dooley wrote:
> On Fri, Nov 18, 2022 at 02:54:43PM -0800, Jakub Kicinski wrote:
>> On Fri, 18 Nov 2022 22:43:29 +0000 Conor Dooley wrote:
>>>> Is there any update for the community? More and more folks are asking. We
>>>> are all techies and happy to help debug.
>>>
>>> Vested interest since I am suffering from the same issue (X670E-F
>>> Gaming), but is it okay to add this to regzbot? Not sure whether it
>>> counts as a regression or not since it's new hw with the existing driver,
>>> but this seems to be falling through the cracks without a response for
>>> several weeks.
>>
>> Dunno, Thorsten's will decide. The line has to be drawn somewhere
>> on "vendor doesn't care about Linux support" vs "we broke uAPI".
>> This is the kind of situation I was alluding to in my line of
>> questioning at the maintainer summit: https://lwn.net/Articles/908324/
> 
> Yeah & it is /regression/ tracking which I don't (or rather didn't)
> consider this situation to be. I'm generally a little unsure as to when
> I should trigger regzbot in general:
> - immediately when I find something?
> - only if it goes a while with nothing constructive?
> - is it okay to use it outside of "this used to work and now doesnt"?
> 
> Either way, but I did some more googling and found this reddit thread:
> https://www.reddit.com/r/intel/comments/lqb4km/for_people_having_i225v_connection_issues/
> 
> That's being reported against windows & I dunno if the dude is using
> firmware and driver interchangeably etc. But the disabling power saving
> etc sounds oddly like the issue we have here, since that was a proposed
> workaround in Ivan's 2022 reddit thread.
> 
> Supposedly I am on firmware-version 1082:8770, but /I/ I have no idea
> how that corresponds to windows versioning. That may lend some credence
> to your assertion about firmware being the source of many issues.
> 
>> Finding a kernel release which does not suffer from the problem
>> would certainly strengthen your case.
> 
> Aye, likely to be a little difficult to do a meaningful bisection for
> me at least, since the motherboard I have with the problem is an AM5
> one for the new Zen4 stuff. I'm not an x86 person, so not entirely
> sure when that support landed. I may do some poking tomorrow..
> 
I do not think we can resolve this problem on this forum.
Ivan's earlier report included the netdev error "PCIe link lost,
device now detached". Since the PCIe link drops unexpectedly, it could
lead to many problems (not only crashes).
Before you go to SW/FW bisection (changing the FW (NVM), going back to
an older kernel version), please contact your board vendor (ASUS). Why
does the PCIe link drop? Possible causes: a circuit problem on the
board, or the system performing power management flows without stopping
the driver.

The "failed to read reg 0xc030" error is just a symptom; it happens
after the PCIe link is lost.

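The ordering described here (link loss first, register read failure
afterwards) can be checked against a saved kernel log. What follows is
only a sketch: the two message fragments are the ones quoted in this
thread, the sample lines are made up for illustration and are not real
captures, and summarize_igc_log is a hypothetical helper name.

```python
import re

LINK_LOST = re.compile(r"PCIe link lost", re.IGNORECASE)
READ_FAIL = re.compile(r"failed to read reg (0x[0-9a-f]+)", re.IGNORECASE)


def summarize_igc_log(lines):
    """Record where the link was first lost and which register reads
    failed before and after that point (igc lines only)."""
    link_lost_at = None
    reads_before, reads_after = [], []
    for number, line in enumerate(lines, start=1):
        if "igc" not in line.lower():
            continue  # ignore messages from other drivers
        if link_lost_at is None and LINK_LOST.search(line):
            link_lost_at = number
            continue
        match = READ_FAIL.search(line)
        if match:
            bucket = reads_after if link_lost_at is not None else reads_before
            bucket.append(match.group(1).lower())
    return {
        "link_lost_at": link_lost_at,
        "reads_before_loss": reads_before,
        "reads_after_loss": reads_after,
    }


# Illustrative lines assembled from the phrases quoted in the thread.
SAMPLE_LOG = [
    "igc 0000:06:00.0 eth0: PCIe link lost, device now detached",
    "igc: failed to read reg 0xc030",
    "usb 1-1: new high-speed USB device",
]

summary = summarize_igc_log(SAMPLE_LOG)
print(summary)
```

If a real log ever shows read failures in "reads_before_loss", that
would suggest the read failure is not always just a consequence of the
link drop.
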



* Re: [Intel-wired-lan] igc kernel module crashes on new hardware (Intel Ethernet I225-V)
  2022-11-18 23:21                 ` Conor Dooley
  2022-11-19 18:06                   ` Neftin, Sasha
@ 2022-11-20 10:32                   ` Thorsten Leemhuis
  2022-11-20 18:40                     ` Conor Dooley
  1 sibling, 1 reply; 10+ messages in thread
From: Thorsten Leemhuis @ 2022-11-20 10:32 UTC (permalink / raw)
  To: Conor Dooley, Jakub Kicinski
  Cc: Ivan Smirnov, Neftin, Sasha, Fuxbrumer, Devora, intel-wired-lan,
	Ruinskiy, Dima, Avivi, Amir, regressions

On 19.11.22 00:21, Conor Dooley wrote:
> On Fri, Nov 18, 2022 at 02:54:43PM -0800, Jakub Kicinski wrote:
>> On Fri, 18 Nov 2022 22:43:29 +0000 Conor Dooley wrote:
>>>> Is there any update for the community? More and more folks are asking. We
>>>> are all techies and happy to help debug.  
>>>
>>> Vested interest since I am suffering from the same issue (X670E-F
>>> Gaming), but is it okay to add this to regzbot? Not sure whether it
>>> counts as a regression or not since it's new hw with the existing driver,
>>> but this seems to be falling through the cracks without a response for
>>> several weeks.
>>
>> Dunno, Thorsten's will decide. The line has to be drawn somewhere
>> on "vendor doesn't care about Linux support" vs "we broke uAPI".
>> This is the kind of situation I was alluding to in my line of
>> questioning at the maintainer summit: https://lwn.net/Articles/908324/
> 
> Yeah & it is /regression/ tracking which I don't (or rather didn't)
> consider this situation to be.

Yeah, looks like this is not something that looks track-worthy for
regzbot, at least for now. Maybe one day it makes sense to use an
improved regzbot for bug reports as well, but I'd like to focus on
establishing regression tracking properly first, which still requires a
lot of work.

> I'm generally a little unsure as to when
> I should trigger regzbot in general:
> - immediately when I find something?

Yes, ideally, as documented here:
https://docs.kernel.org/admin-guide/reporting-regressions.html

> - only if it goes a while with nothing constructive?

But that is fine as well. FWIW, none of us want bureaucracy. Even
I don't add each and every regression I see to the tracking yet.

> - is it okay to use it outside of "this used to work and now doesnt"?

Guess I should clarify in the above doc that this is unwanted.

Ciao, Thorsten


* Re: [Intel-wired-lan] igc kernel module crashes on new hardware (Intel Ethernet I225-V)
  2022-11-20 10:32                   ` Thorsten Leemhuis
@ 2022-11-20 18:40                     ` Conor Dooley
  0 siblings, 0 replies; 10+ messages in thread
From: Conor Dooley @ 2022-11-20 18:40 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: Jakub Kicinski, Ivan Smirnov, Neftin, Sasha, Fuxbrumer, Devora,
	intel-wired-lan, Ruinskiy, Dima, Avivi, Amir, regressions

On Sun, Nov 20, 2022 at 11:32:36AM +0100, Thorsten Leemhuis wrote:
> On 19.11.22 00:21, Conor Dooley wrote:
> > On Fri, Nov 18, 2022 at 02:54:43PM -0800, Jakub Kicinski wrote:
> >> On Fri, 18 Nov 2022 22:43:29 +0000 Conor Dooley wrote:
> >>>> Is there any update for the community? More and more folks are asking. We
> >>>> are all techies and happy to help debug.  
> >>>
> >>> Vested interest since I am suffering from the same issue (X670E-F
> >>> Gaming), but is it okay to add this to regzbot? Not sure whether it
> >>> counts as a regression or not since it's new hw with the existing driver,
> >>> but this seems to be falling through the cracks without a response for
> >>> several weeks.
> >>
> >> Dunno, Thorsten's will decide. The line has to be drawn somewhere
> >> on "vendor doesn't care about Linux support" vs "we broke uAPI".
> >> This is the kind of situation I was alluding to in my line of
> >> questioning at the maintainer summit: https://lwn.net/Articles/908324/
> > 
> > Yeah & it is /regression/ tracking which I don't (or rather didn't)
> > consider this situation to be.
> 
> Yeah, looks like this is not something that look track-worthy for
> regzbot -- at least for now, maybe it one day makes sense to use and
> improved regzbot for bug reports as well, but I'd like to focus on
> establishing regression tracking properly first, which still requires a
> lot of work.
> 
> > I'm generally a little unsure as to when
> > I should trigger regzbot in general:
> > - immediately when I find something?
> 
> Yes, ideally, as documented here:
> https://docs.kernel.org/admin-guide/reporting-regressions.html
> 
> > - only if it goes a while with nothing constructive?
> 
> But that is fine as well. But FWIW, we all don't want bureaucracy. Even
> I don't add each and every regression I see to the tracking yet.
> 
> > - is it okay to use it outside of "this used to work and now doesnt"?
> 
> Guess I should clarify that this is unwanted in above doc.

Right. I wasn't sure if it was okay to use it for "this never worked"
type of issues. Thanks Thorsten!



* Re: [Intel-wired-lan] igc kernel module crashes on new hardware (Intel Ethernet I225-V)
  2022-11-19 18:06                   ` Neftin, Sasha
@ 2022-11-20 19:55                     ` Conor Dooley
  2022-12-21 17:30                       ` Conor Dooley
  0 siblings, 1 reply; 10+ messages in thread
From: Conor Dooley @ 2022-11-20 19:55 UTC (permalink / raw)
  To: Neftin, Sasha
  Cc: Jakub Kicinski, Ivan Smirnov, Fuxbrumer, Devora, intel-wired-lan,
	Ruinskiy, Dima, Avivi, Amir, regressions, Lifshits, Vitaly,
	naamax.meir, Meir, NaamaX

On Sat, Nov 19, 2022 at 08:06:05PM +0200, Neftin, Sasha wrote:
> On 11/19/2022 01:21, Conor Dooley wrote:
> > On Fri, Nov 18, 2022 at 02:54:43PM -0800, Jakub Kicinski wrote:
> > > On Fri, 18 Nov 2022 22:43:29 +0000 Conor Dooley wrote:
> > > > > Is there any update for the community? More and more folks are asking. We
> > > > > are all techies and happy to help debug.
> > > > 
> > > > Vested interest since I am suffering from the same issue (X670E-F
> > > > Gaming), but is it okay to add this to regzbot? Not sure whether it
> > > > counts as a regression or not since it's new hw with the existing driver,
> > > > but this seems to be falling through the cracks without a response for
> > > > several weeks.
> > > 
> > > Dunno, Thorsten's will decide. The line has to be drawn somewhere
> > > on "vendor doesn't care about Linux support" vs "we broke uAPI".
> > > This is the kind of situation I was alluding to in my line of
> > > questioning at the maintainer summit: https://lwn.net/Articles/908324/
> > 
> > Yeah & it is /regression/ tracking which I don't (or rather didn't)
> > consider this situation to be. I'm generally a little unsure as to when
> > I should trigger regzbot in general:
> > - immediately when I find something?
> > - only if it goes a while with nothing constructive?
> > - is it okay to use it outside of "this used to work and now doesnt"?
> > 
> > Either way, but I did some more googling and found this reddit thread:
> > https://www.reddit.com/r/intel/comments/lqb4km/for_people_having_i225v_connection_issues/
> > 
> > That's being reported against windows & I dunno if the dude is using
> > firmware and driver interchangeably etc. But the disabling power saving
> > etc sounds oddly like the issue we have here, since that was a proposed
> > workaround in Ivan's 2022 reddit thread.
> > 
> > Supposedly I am on firmware-version 1082:8770, but /I/ I have no idea
> > how that corresponds to windows versioning. That may lend some credence
> > to your assertion about firmware being the source of many issues.
> > 
> > > Finding a kernel release which does not suffer from the problem
> > > would certainly strengthen your case.
> > 
> > Aye, likely to be a little difficult to do a meaningful bisection for
> > me at least, since the motherboard I have with the problem is an AM5
> > one for the new Zen4 stuff. I'm not an x86 person, so not entirely
> > sure when that support landed. I may do some poking tomorrow..
> > 
> I do not think we can resolve this problem on this forum.
> In early Ivan's report was reported error to netdev "PCIe link lost, device
> now detached"). Since the PCIe link unexpectedly drops it could lead to many
> problems (not only crashes).

Hmm, I'll take a look at what mine spits out next time it dies, but I
would imagine that you're correct and I see it too.

> Before you go to SW/FW bisection (change FW(NVM), go back with a kernel
> version) - please, contact your board vendor (ASUS). Why PCIe link drop?

I dunno, I suppose it just entered a lower power state!

> Circuit problem on board, the system performs power management flows and
> does not stop the driver.

My GPU and other PCI devices are returning from lower power modes properly.
I wonder what's different about this specific device. As I said, I'm not too
familiar with x86 stuff. Is there someone from AMD worth poking? The
output from lspci is a wall of AMD bridges w/ endpoints mixed in.

Doing a cursory look at other x670 stuff - the non-asus ones that I
looked at are not using Intel ethernet.

> "failed to read reg 0xc030" (just symptom) happen after PCIe link lost.

Per 47e16692b26b ("igb/igc: warn when fatal read failure happens"), it
looks as though this is not a *new* problem though as you guys have seen
this while testing.

I've got a 1 G NIC, and I like my dev machine to "just work", so I'll
probably throw that in and see how far that gets me. IIRC it's an igb one,
so it will at least make for a datapoint.

Thanks,
Conor.
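
Since power states keep coming up, one cheap datapoint is the runtime
PM setting the kernel exposes for the device in sysfs. A minimal sketch,
assuming the standard power/control attribute ("auto" lets the kernel
runtime-suspend the device, "on" keeps it powered). Whether runtime PM is
involved here at all is not established in this thread, the bus address
is the one from Ivan's probe error, and read_power_control is a
hypothetical helper name.

```python
from pathlib import Path


def read_power_control(bdf: str, sysfs_root: str = "/sys"):
    """Return the runtime PM policy ("auto" or "on") for a PCI device,
    or None if the attribute cannot be read."""
    path = (Path(sysfs_root) / "bus" / "pci" / "devices" / bdf
            / "power" / "control")
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    # Address taken from the probe error earlier in the thread; substitute
    # the address lspci reports for the I225-V on your own board.
    print(read_power_control("0000:06:00.0"))
```

The sysfs_root parameter exists only so the function can be pointed at
a copied or fake tree when the affected machine is not at hand.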



* Re: [Intel-wired-lan] igc kernel module crashes on new hardware (Intel Ethernet I225-V)
  2022-11-20 19:55                     ` Conor Dooley
@ 2022-12-21 17:30                       ` Conor Dooley
  2022-12-31 15:02                         ` Conor Dooley
  0 siblings, 1 reply; 10+ messages in thread
From: Conor Dooley @ 2022-12-21 17:30 UTC (permalink / raw)
  To: Neftin, Sasha
  Cc: Jakub Kicinski, Ivan Smirnov, Fuxbrumer, Devora, intel-wired-lan,
	Ruinskiy, Dima, Avivi, Amir, regressions, Lifshits, Vitaly,
	naamax.meir, Meir, NaamaX


On Sun, Nov 20, 2022 at 07:55:09PM +0000, Conor Dooley wrote:
> On Sat, Nov 19, 2022 at 08:06:05PM +0200, Neftin, Sasha wrote:
> > On 11/19/2022 01:21, Conor Dooley wrote:
> > > On Fri, Nov 18, 2022 at 02:54:43PM -0800, Jakub Kicinski wrote:
> > > > On Fri, 18 Nov 2022 22:43:29 +0000 Conor Dooley wrote:
> > > > > > Is there any update for the community? More and more folks are asking. We
> > > > > > are all techies and happy to help debug.
> > > > > 
> > > > > Vested interest since I am suffering from the same issue (X670E-F
> > > > > Gaming), but is it okay to add this to regzbot? Not sure whether it
> > > > > counts as a regression or not since it's new hw with the existing driver,
> > > > > but this seems to be falling through the cracks without a response for
> > > > > several weeks.
> > > > 
> > > > Dunno, Thorsten's will decide. The line has to be drawn somewhere
> > > > on "vendor doesn't care about Linux support" vs "we broke uAPI".
> > > > This is the kind of situation I was alluding to in my line of
> > > > questioning at the maintainer summit: https://lwn.net/Articles/908324/
> > > 
> > > Yeah & it is /regression/ tracking which I don't (or rather didn't)
> > > consider this situation to be. I'm generally a little unsure as to when
> > > I should trigger regzbot in general:
> > > - immediately when I find something?
> > > - only if it goes a while with nothing constructive?
> > > - is it okay to use it outside of "this used to work and now doesnt"?
> > > 
> > > Either way, but I did some more googling and found this reddit thread:
> > > https://www.reddit.com/r/intel/comments/lqb4km/for_people_having_i225v_connection_issues/
> > > 
> > > That's being reported against windows & I dunno if the dude is using
> > > firmware and driver interchangeably etc. But the disabling power saving
> > > etc sounds oddly like the issue we have here, since that was a proposed
> > > workaround in Ivan's 2022 reddit thread.
> > > 
> > > Supposedly I am on firmware-version 1082:8770, but /I/ I have no idea
> > > how that corresponds to windows versioning. That may lend some credence
> > > to your assertion about firmware being the source of many issues.
> > > 
> > > > Finding a kernel release which does not suffer from the problem
> > > > would certainly strengthen your case.
> > > 
> > > Aye, likely to be a little difficult to do a meaningful bisection for
> > > me at least, since the motherboard I have with the problem is an AM5
> > > one for the new Zen4 stuff. I'm not an x86 person, so not entirely
> > > sure when that support landed. I may do some poking tomorrow..
> > > 
> > I do not think we can resolve this problem on this forum.
> > In early Ivan's report was reported error to netdev "PCIe link lost, device
> > now detached"). Since the PCIe link unexpectedly drops it could lead to many
> > problems (not only crashes).
> 
> Hmm, I'll take a look at what mine spits out next time it dies, but I
> would imagine that you're correct and I see it too.

It does in fact say that, but interestingly only this peripheral has any
issues. My GPUs etc have no problem at all.

> > Before you go to SW/FW bisection (change FW(NVM), go back with a kernel
> > version) - please, contact your board vendor (ASUS). Why PCIe link drop?
> 
> I dunno, I suppose it just entered a lower power state!
> 
> > Circuit problem on board, the system performs power management flows and
> > does not stop the driver.
> 
> My GPU and other PCI devices are returning from lower power modes properly.
> I wonder what's different about this specific device. As I said, not too
> familiar with x86 stuff - is there someone from AMD worth poking as the
> output from lspci is a wall of AMD bridges w/ endpoints mixed in.
> 
> Doing a cursory look at other x670 stuff - the non-asus ones that I
> looked at are not using Intel ethernet.
> 
> > "failed to read reg 0xc030" (just symptom) happen after PCIe link lost.
> 
> Per 47e16692b26b ("igb/igc: warn when fatal read failure happens"), it
> looks as though this is not a *new* problem though as you guys have seen
> this while testing.
> 
> I've got a 1 G NIC, I like my dev machine to "just work" so I'll probably
> throw that in and see how far that gets me. IIRC it's an igb one so will
> at least make for a datapoint.

FWIW I gave up on the igc driver and am using my 1 G NIC instead; I
couldn't be bothered with the disruption. I'll give the BIOS stuff
mentioned elsewhere a go over Christmas now that v6.1.1 exists and see
if that helps. Hopefully it does!

Conor.




* Re: [Intel-wired-lan] igc kernel module crashes on new hardware (Intel Ethernet I225-V)
  2022-12-21 17:30                       ` Conor Dooley
@ 2022-12-31 15:02                         ` Conor Dooley
  2023-01-02 11:09                           ` Conor Dooley
  0 siblings, 1 reply; 10+ messages in thread
From: Conor Dooley @ 2022-12-31 15:02 UTC (permalink / raw)
  To: Neftin, Sasha
  Cc: Jakub Kicinski, Ivan Smirnov, Fuxbrumer, Devora, intel-wired-lan,
	Ruinskiy, Dima, Avivi, Amir, regressions, Lifshits, Vitaly,
	naamax.meir, Meir, NaamaX, helgaas


On Wed, Dec 21, 2022 at 05:30:54PM +0000, Conor Dooley wrote:
> On Sun, Nov 20, 2022 at 07:55:09PM +0000, Conor Dooley wrote:
> > On Sat, Nov 19, 2022 at 08:06:05PM +0200, Neftin, Sasha wrote:
> > > On 11/19/2022 01:21, Conor Dooley wrote:
> > > > On Fri, Nov 18, 2022 at 02:54:43PM -0800, Jakub Kicinski wrote:
> > > > > On Fri, 18 Nov 2022 22:43:29 +0000 Conor Dooley wrote:
> > > > > > > Is there any update for the community? More and more folks are asking. We
> > > > > > > are all techies and happy to help debug.
> > > > > > 
> > > > > > Vested interest since I am suffering from the same issue (X670E-F
> > > > > > Gaming), but is it okay to add this to regzbot? Not sure whether it
> > > > > > counts as a regression or not since it's new hw with the existing driver,
> > > > > > but this seems to be falling through the cracks without a response for
> > > > > > several weeks.
> > > > > 
> > > > > Dunno, Thorsten's will decide. The line has to be drawn somewhere
> > > > > on "vendor doesn't care about Linux support" vs "we broke uAPI".
> > > > > This is the kind of situation I was alluding to in my line of
> > > > > questioning at the maintainer summit: https://lwn.net/Articles/908324/
> > > > 
> > > > Yeah & it is /regression/ tracking which I don't (or rather didn't)
> > > > consider this situation to be. I'm generally a little unsure as to when
> > > > I should trigger regzbot in general:
> > > > - immediately when I find something?
> > > > - only if it goes a while with nothing constructive?
> > > > - is it okay to use it outside of "this used to work and now doesnt"?
> > > > 
> > > > Either way, but I did some more googling and found this reddit thread:
> > > > https://www.reddit.com/r/intel/comments/lqb4km/for_people_having_i225v_connection_issues/
> > > > 
> > > > That's being reported against windows & I dunno if the dude is using
> > > > firmware and driver interchangeably etc. But the disabling power saving
> > > > etc sounds oddly like the issue we have here, since that was a proposed
> > > > workaround in Ivan's 2022 reddit thread.
> > > > 
> > > > Supposedly I am on firmware-version 1082:8770, but /I/ I have no idea
> > > > how that corresponds to windows versioning. That may lend some credence
> > > > to your assertion about firmware being the source of many issues.
> > > > 
> > > > > Finding a kernel release which does not suffer from the problem
> > > > > would certainly strengthen your case.
> > > > 
> > > > Aye, likely to be a little difficult to do a meaningful bisection for
> > > > me at least, since the motherboard I have with the problem is an AM5
> > > > one for the new Zen4 stuff. I'm not an x86 person, so not entirely
> > > > sure when that support landed. I may do some poking tomorrow..
> > > > 
> > > I do not think we can resolve this problem on this forum.
> > > Ivan's early report to netdev included the error "PCIe link lost, device
> > > now detached". Since the PCIe link drops unexpectedly, it could lead to
> > > many problems (not only crashes).
> > 
> > Hmm, I'll take a look at what mine spits out next time it dies, but I
> > would imagine that you're correct and I see it too.
> 
> It does in fact say that, but interestingly only this peripheral has any
> issues. My GPUs etc have no problem at all.
> 
> > > Before you go to SW/FW bisection (changing the FW (NVM), going back a
> > > kernel version), please contact your board vendor (ASUS). Why does the
> > > PCIe link drop?
> > 
> > I dunno, I suppose it just entered a lower power state!
> > 
> > > Is it a circuit problem on the board? Does the system perform power
> > > management flows without stopping the driver?
> > 
> > My GPU and other PCI devices are returning from lower power modes properly.
> > I wonder what's different about this specific device. As I said, not too
> > familiar with x86 stuff - is there someone from AMD worth poking as the
> > output from lspci is a wall of AMD bridges w/ endpoints mixed in.
> > 
> > Doing a cursory look at other x670 stuff - the non-asus ones that I
> > looked at are not using Intel ethernet.
> > 
> > > "failed to read reg 0xc030" (just a symptom) happens after the PCIe link is lost.
> > 
> > Per 47e16692b26b ("igb/igc: warn when fatal read failure happens"), it
> > looks as though this is not a *new* problem though as you guys have seen
> > this while testing.
> > 
> > I've got a 1 G NIC, I like my dev machine to "just work" so I'll probably
> > throw that in and see how far that gets me. IIRC it's an igb one so will
> > at least make for a datapoint.
> 
> FWIW I gave up on the igc driver and am using my 1 G NIC, couldn't be
> bothered with the disruption. I'll give the bios stuff mentioned
> elsewhere a go over Christmas now that v6.1.1 exists and see if that
> helps. Hopefully it does!

Hallo, me again...

I didn't actually give the bios stuff a go in the end. I figured that
changing everything at once would likely not be a good idea - but what I
did do was try v6.1.1 & have now been running for 50-something hours
without any issues while using the igc iface.

Wholly unscientific of course, but I had noticed this thread:
https://lore.kernel.org/all/20221226225045.GA400369@bhelgaas/
and that commit c01163dbd1b8 ("PCI/PM: Always disable PTM for all devices
during suspend") was not part of the v6.0.y kernels I was running but
*is* in v6.1.y, which was my impetus for trying the kernel upgrade.

I checked v6.0.16-rc2 and that commit does not appear to have been
backported yet.
Perhaps some of the other "victims" in this thread who have not yet
tried changing BIOS etc, could give v6.1.y a go & see if they still have
issues.
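
For anyone wanting to repeat the "is that commit in my kernel?" check, here
is a rough sketch of one way to do it from a git checkout. It is not
necessarily how the check above was done. The function names and the
repo path are made up for illustration, and the second step is only a
heuristic: it assumes a stable backport mentions the upstream commit ID
in its commit message, so something like a Fixes: tag that merely cites
the commit would also match.

```python
import subprocess


def ancestor_cmd(commit: str, ref: str) -> list[str]:
    """git exits 0 if `commit` is an ancestor of `ref`, 1 if it is not."""
    return ["git", "merge-base", "--is-ancestor", commit, ref]


def backport_cmd(commit: str, ref: str) -> list[str]:
    """List commits reachable from `ref` whose message mentions `commit`."""
    return ["git", "log", "--oneline", f"--grep={commit}", ref]


def commit_in_ref(commit: str, ref: str, repo: str = ".") -> bool:
    """Best-effort check that `commit` (or a backport of it) is in `ref`."""
    # Direct containment: the commit itself is part of the ref's history.
    if subprocess.run(ancestor_cmd(commit, ref), cwd=repo).returncode == 0:
        return True
    # Heuristic for stable trees: a backport has a different hash, so look
    # for a commit message that cites the upstream commit ID instead.
    log = subprocess.run(backport_cmd(commit, ref), cwd=repo,
                         capture_output=True, text=True)
    return bool(log.stdout.strip())
```

With a tree that has the stable tags fetched, something like
commit_in_ref("c01163dbd1b8", "v6.1.1", repo="/path/to/linux") would then
answer the question for a given release.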

I may backport the aforementioned patch myself and see how it does, but
someone else trying v6.1.y & not seeing the iface dying would certainly
help with motivation :)

Thanks,
Conor.



* Re: [Intel-wired-lan] igc kernel module crashes on new hardware (Intel Ethernet I225-V)
  2022-12-31 15:02                         ` Conor Dooley
@ 2023-01-02 11:09                           ` Conor Dooley
  0 siblings, 0 replies; 10+ messages in thread
From: Conor Dooley @ 2023-01-02 11:09 UTC (permalink / raw)
  To: Neftin, Sasha
  Cc: Jakub Kicinski, Ivan Smirnov, Fuxbrumer, Devora, intel-wired-lan,
	Ruinskiy, Dima, Avivi, Amir, regressions, Lifshits, Vitaly,
	naamax.meir, Meir, NaamaX, helgaas

On Sat, Dec 31, 2022 at 03:02:57PM +0000, Conor Dooley wrote:

> I didn't actually give the bios stuff a go in the end. I figured that
> changing everything at once would likely not be a good idea - but what I
> did do was try v6.1.1 & have now been running for 50-something hours
> without any issues while using the igc iface.

Bah, it died last night at about the 90 hour mark. Still an order of
magnitude longer than I had got it to work continuously for before, but
not fixed :(

I'll give the bios a go I suppose, sorry for the noise!



end of thread, other threads:[~2023-01-02 11:09 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
     [not found] <CAPAtJa_o5q-sU+AD=G3y43H_5pBKnOZTQGXM99uszPXNkn8Z9A@mail.gmail.com>
     [not found] ` <20221031170535.77be0eb5@kernel.org>
     [not found]   ` <03f7dc73-3e7c-1e6d-275f-85539493cd7f@intel.com>
     [not found]     ` <CAPAtJa8qupPZZ0AiMWSxNKSd-WMg0MQDQeZcCO_Z-GGBu3jZCg@mail.gmail.com>
     [not found]       ` <CAPAtJa_-yMusW5-C3BDivMu=MOyfKF9VQkxQotX3L_P+Q48oMA@mail.gmail.com>
     [not found]         ` <CAPAtJa_nL5edyiN61ghXZxVUSDBFQQR3uiYJM0uo9mEao=RC0w@mail.gmail.com>
     [not found]           ` <CAPAtJa_hbFbVXQbiNnb_byLqtZ-Dy_EBcvTFH9GyPqt__dFmLQ@mail.gmail.com>
2022-11-18 22:43             ` [Intel-wired-lan] igc kernel module crashes on new hardware (Intel Ethernet I225-V) Conor Dooley
2022-11-18 22:54               ` Jakub Kicinski
2022-11-18 23:21                 ` Conor Dooley
2022-11-19 18:06                   ` Neftin, Sasha
2022-11-20 19:55                     ` Conor Dooley
2022-12-21 17:30                       ` Conor Dooley
2022-12-31 15:02                         ` Conor Dooley
2023-01-02 11:09                           ` Conor Dooley
2022-11-20 10:32                   ` Thorsten Leemhuis
2022-11-20 18:40                     ` Conor Dooley
