From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from wp530.webpack.hosteurope.de (wp530.webpack.hosteurope.de [80.237.130.52]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C0418ED1 for ; Sun, 10 Apr 2022 09:47:52 +0000 (UTC) Received: from ip4d144895.dynamic.kabel-deutschland.de ([77.20.72.149] helo=[192.168.66.200]); authenticated by wp530.webpack.hosteurope.de running ExIM with esmtpsa (TLS1.3:ECDHE_RSA_AES_128_GCM_SHA256:128) id 1ndUAd-0000N0-E0; Sun, 10 Apr 2022 11:47:47 +0200 Message-ID: <05aef888-219a-3563-40f4-9720c64955c2@leemhuis.info> Date: Sun, 10 Apr 2022 11:47:46 +0200 Precedence: bulk X-Mailing-List: regressions@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.7.0 Subject: Re: Bug 215689 - e1000e: regression for I219-V for kernel 5.14 onwards Content-Language: en-US To: "Neftin, Sasha" , Tony Nguyen , Jesse Brandeburg , "Fuxbrumer, Devora" , "Ruinskiy, Dima" , "naamax.meir" Cc: "regressions@lists.linux.dev" , intel-wired-lan , James References: <6801d0ef-9620-0f61-c107-c2c5524952ef@leemhuis.info> <1e0558eb-b1f1-edb5-e554-a41f2c160401@intel.com> <46696877-3dc9-0600-9a8f-eda42d029cd7@leemhuis.info> From: Thorsten Leemhuis In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-bounce-key: webpack.hosteurope.de;regressions@leemhuis.info;1649584072;35939c3e; X-HE-SMSGID: 1ndUAd-0000N0-E0 On 10.04.22 11:26, Neftin, Sasha wrote: > On 4/10/2022 11:21, Thorsten Leemhuis wrote: >> Hi, this is your Linux kernel regression tracker. Top-posting for once, >> to make this easily accessible to everyone. >> >> Hey Sasha and e1000e developers, what's up there? Two and a half weeks >> ago it seemed the root cause for this regression was found and a >> proposed patch to fix it was added to the bugzilla ticket and even >> tested by the reporter. But since then nothing happened afaics. What's >> up here? Or did I miss something? > Hello Thorsten, > Patch submitted to the IWL: > https://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue.git/commit/?h=dev-queue&id=7dd121b8d5735780b6a70db735d44b3e5b856130 Ahh, great, many thx. That's hard to find for an outsider like other people that run into this problem (IOW: mentioning it in the bugzilla ticket would have been nice). Guess at least I might have found it, if intel-wired-lad was archived on lore. Anyway: I have to wonder: why is this in the "next"-queue? That patch doesn't look really dangerous and it's fixing a regression from v5.14, so why not merge it this cycle? Ohh, and a explicit tag to get it backported to stable quickly might be good as well afaics: https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html Ciao, Thorsten >> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) >> >> P.S.: As the Linux kernel's regression tracker I'm getting a lot of >> reports on my table. I can only look briefly into most of them and lack >> knowledge about most of the areas they concern. I thus unfortunately >> will sometimes get things wrong or miss something important. I hope >> that's not the case here; if you think it is, don't hesitate to tell me >> in a public reply, it's in everyone's interest to set the public record >> straight. >> >> #regzbot poke >> >> On 24.03.22 20:36, Neftin, Sasha wrote: >>> On 3/24/2022 17:09, Neftin, Sasha wrote: >>>> On 3/24/2022 12:37, Thorsten Leemhuis wrote: >>>>> Hi, this is your Linux kernel regression tracker. >>>>> >>>>> I noticed a regression report in bugzilla.kernel.org that afaics >>>>> nobody >>>>> acted upon since it was reported about a week ago, that's why I >>>>> decided >>>>> to forward it to the lists and a few relevant people to the CC. To >>>>> quote >>>>> from https://bugzilla.kernel.org/show_bug.cgi?id=215689 : >>>>> >>>>>> [reply] [−] Description James 2022-03-15 13:45:38 UTC >>>>>> >>>>>> I run Arch linux on an Intel NUC 8i3BEH which has the following >>>>>> network card: >>>>>> >>>>>> 00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection >>>>>> (6) I219-V (rev 30) >>>>>>           DeviceName:  LAN >>>>>>           Subsystem: Intel Corporation Device 2074 >>>>>>           Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- >>>>>> ParErr- Stepping- SERR- FastB2B- DisINTx+ >>>>>>           Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >>>>>>> TAbort- SERR- >>>>>           Latency: 0 >>>>>>           Interrupt: pin A routed to IRQ 135 >>>>>>           Region 0: Memory at c0b00000 (32-bit, non-prefetchable) >>>>>> [size=128K] >>>>>>           Capabilities: [c8] Power Management version 3 >>>>>>                   Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA >>>>>> PME(D0+,D1-,D2-,D3hot+,D3cold+) >>>>>>                   Status: D0 NoSoftRst+ PME-Enable- DSel=0 >>>>>> DScale=1 PME- >>>>>>           Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+ >>>>>>                   Address: 00000000fee003d8  Data: 0000 >>>>>>           Kernel driver in use: e1000e >>>>>>           Kernel modules: e1000e >>>>>> >>>>>> I found a major regression since the previous few kernel versions >>>>>> which causes several odd issues, most noteably I use the machine to >>>>>> stream live tv via TVheadend and was finding this to be unusable >>>>>> (picture freezes and sound breaks up very badly with continuity >>>>>> errors in the TVheadend logfile). >>>>>> >>>>>> I found the issue was introduced since the 5.14 kernel, and have >>>>>> eventually got round to performing a git bisect, which landed upon >>>>>> the following commit: >>>>>> >>>>>> 44a13a5: e1000e: Fix the max snoop/no-snoop latency for 10M >>>>>> >>>>>> Indeed, if I revert this single commit then the problem is resolved. >>>>>> >>>>>> To help diagnose the issue I applied the following patch to capture >>>>>> the values of the lat_enc, max_ltr_enc vs lat_enc_d, max_ltr_enc_d >>>>>> variables: >>>>>> >>>>>> diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c >>>>>> b/drivers/net/ethernet/intel/e1000e/ich8lan.c >>>>>> index d60e2016d..f4e5ffbcd 100644 >>>>>> --- a/drivers/net/ethernet/intel/e1000e/ich8lan.c >>>>>> +++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c >>>>>> @@ -1012,6 +1012,7 @@ static s32 e1000_platform_pm_pch_lpt(struct >>>>>> e1000_hw *hw, bool link) >>>>>>           u16 max_ltr_enc_d = 0;  /* maximum LTR decoded by >>>>>> platform */ >>>>>>           u16 lat_enc_d = 0;      /* latency decoded */ >>>>>>           u16 lat_enc = 0;        /* latency encoded */ >>>>>> +       struct e1000_adapter *adapter = hw->adapter; >>>>>> >>>>>>           if (link) { >>>>>>                   u16 speed, duplex, scale = 0; >>>>>> @@ -1074,6 +1075,9 @@ static s32 e1000_platform_pm_pch_lpt(struct >>>>>> e1000_hw *hw, bool link) >>>>>>                                    ((max_ltr_enc & >>>>>> E1000_LTRV_SCALE_MASK) >>>>>>                                    >> E1000_LTRV_SCALE_SHIFT))); >>>>>> >>>>>> +               e_info("e1000e: lat_enc=%d max_ltr_enc=%d", lat_enc, >>>>>> max_ltr_enc); >>>>>> +               e_info("e1000e: lat_enc_d=%u max_ltr_enc_d=%u", >>>>>> lat_enc_d, max_ltr_enc_d); >>>>>> + >>>>>>                   if (lat_enc_d > max_ltr_enc_d) >>>>>>                           lat_enc = max_ltr_enc; >>>>>>           } >>>>>> >>>>>> With this in place I see the following in dmesg: >>>>>> >>>>>> [    3.241215] e1000e: Intel(R) PRO/1000 Network Driver >>>>>> [    3.241217] e1000e: Copyright(c) 1999 - 2015 Intel Corporation. >>>>>> [    3.243382] e1000e 0000:00:1f.6: Interrupt Throttling Rate >>>>>> (ints/sec) set to dynamic conservative mode >>>>>> [    3.749009] e1000e 0000:00:1f.6 0000:00:1f.6 (uninitialized): >>>>>> registered PHC clock >>>>>> [    3.824751] e1000e 0000:00:1f.6 eth0: (PCI Express:2.5GT/s:Width >>>>>> x1) 94:c6:91:ae:b3:7b >>>>>> [    3.824765] e1000e 0000:00:1f.6 eth0: Intel(R) PRO/1000 Network >>>>>> Connection >>>>>> [    3.824849] e1000e 0000:00:1f.6 eth0: MAC: 13, PHY: 12, PBA No: >>>>>> FFFFFF-0FF >>>>>> [    6.949327] e1000e 0000:00:1f.6 eth0: e1000e: lat_enc=2233 >>>>>> max_ltr_enc=4099 >>>>>> [    6.949331] e1000e 0000:00:1f.6 eth0: e1000e: lat_enc_d=58368 >>>>>> max_ltr_enc_d=0 >>>>>> [    6.951165] e1000e 0000:00:1f.6 eth0: NIC Link is Up 1000 Mbps >>>>>> Full Duplex, Flow Control: Rx/Tx >>>>>> >>>>>> Notice that lat_enc_d=58368 and max_ltr_enc_d=0 ! >>>>>> >>>>>> lat_enc_d is greater than max_ltr_enc_d so it's setting snoop >>>>>> latency to max_ltr_enc (i.e. 4099) where it would have previously >>>>>> been set to 2233 in this particular example. This seems to be where >>>>>> the problem lies. >>>>>> >>>>>> Prior to commit 44a13a5: >>>>>> >>>>>> if (lat_enc > max_ltr_enc) >>>>>>     lat_enc = max_ltr_enc; >>>>>> >>>>>> After commit 44a13a5: >>>>>> >>>>>> if (lat_enc_d > max_ltr_enc_d) >>>>>>     lat_enc = max_ltr_enc; >>>>>> >>>>>> >>>>>> I'm not sure whether it was intended for this new code to take >>>>>> effect for an I219 since the commit message on 44a13a5 indicates it >>>>>> was aimed at I217/I218. Seems strange that max_ltr_enc_d is getting >>>>>> set to 0? >>>>>> >>>>> >>>>> BTW, that commit is from Sasha Neftin. >>>> Hello Thorsten, >>>> I've expected follow decoded values (link 1G) >>>> lat_enc: 0x000008b9 => lat_enc_d: 189440 (1024*185) >>>> max_ltr_enc: 0x00001003 => max_ltr_enc_d: 3145728 (1048576*3) >>>> >>>> scale 0 - 1 >>>> scale 1 - 32 >>>> scale 2 - 1024 >>>> scale 3 - 32768 >>>> scale 4 - 1048576 (nano s) >>>> >>>> I've separated calculate: >>>> e_info("e1000e: 1* max_ltr_enc_d: %d\n", >>>>          max_ltr_enc & E1000_LTRV_VALUE_MASK); >>>> e_info("e1000e: 2* max_ltr_enc_d: %d\n", >>>>          (1U << (E1000_LTRV_SCALE_FACTOR * >>>>          ((max_ltr_enc & E1000_LTRV_SCALE_MASK) >>>>          >> E1000_LTRV_SCALE_SHIFT)))); >>>> I would expect: >>>> 1* max_ltr_enc_d (value): 3 >>>> 2* max_ltr_enc_d (scale): 1048576 >>>> and so: value * scale >>>> 1048576*3 = 3145728ns >>>> >>>> Please, let's check it. (I am wondering if over-calculate it) >>>> Thanks, >>>> Sasha >>> ok. Overflow... Instead of >>> +       u16 max_ltr_enc_d = 0;  /* maximum LTR decoded by platform */ >>> +       u16 lat_enc_d = 0;      /* latency decoded */ >>> >>> Should be: >>> +       u32 max_ltr_enc_d = 0;  /* maximum LTR decoded by platform */ >>> +       u32 lat_enc_d = 0;      /* latency decoded */ >>> I will process the patch address this overflow and some e_dbg to >>> eliminate calculation. >>> >>> sudo cat /sys/kernel/debug/pmc_core/ltr_show >>> SOUTHPORT_A                         LTR: RAW: 0x0 Non-Snoop(ns): >>> 0                   Snoop(ns): 0 >>> SOUTHPORT_B                         LTR: RAW: 0x0 Non-Snoop(ns): >>> 0                   Snoop(ns): 0 >>> SATA                                LTR: RAW: 0x900f Non-Snoop(ns): >>> 0                   Snoop(ns): 15728640 >>> GIGABIT_ETHERNET                    LTR: RAW: 0x88b988b9 Non-Snoop(ns): >>> 189440              Snoop(ns): 189440 >>> XHCI                                LTR: RAW: 0x891a Non-Snoop(ns): >>> 0                   Snoop(ns): 288768 >>> >>>>> >>>>> Could somebody take a look into this? Or was this discussed somewhere >>>>> else already? Or even fixed? >>>>> >>>>> Anyway, to get this tracked: >>>>> >>>>> #regzbot introduced: 44a13a5d99c71bf9e1676d9e51679daf4d7b3d73 >>>>> #regzbot from: James >>>>> #regzbot title: net: e1000e: instabilities on I219-V for kernel 5.14 >>>>> onwards >>>>> #regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=215689 >>>>> >>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' >>>>> hat) >>>>> >>>>> P.S.: As the Linux kernel's regression tracker I'm getting a lot of >>>>> reports on my table. I can only look briefly into most of them and >>>>> lack >>>>> knowledge about most of the areas they concern. I thus unfortunately >>>>> will sometimes get things wrong or miss something important. I hope >>>>> that's not the case here; if you think it is, don't hesitate to >>>>> tell me >>>>> in a public reply, it's in everyone's interest to set the public >>>>> record >>>>> straight. >>>>> >>>> >>> >>> >>> > Sasha >