dri-devel.lists.freedesktop.org archive mirror
* Fwd: Unexplainable packet drop starting at v6.4
@ 2023-07-18  0:51 Bagas Sanjaya
  2023-07-19 11:49 ` Thorsten Leemhuis
  2023-07-25 23:50 ` Bagas Sanjaya
  0 siblings, 2 replies; 6+ messages in thread
From: Bagas Sanjaya @ 2023-07-18  0:51 UTC (permalink / raw)
  To: Andrzej Kacprowski, Krystian Pradzynski, Stanislaw Gruszka,
	Jacek Lawrynowicz, Oded Gabbay, Jesse Brandeburg, Tony Nguyen,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Linux regression tracking (Thorsten Leemhuis),
	hq.dev+kernel
  Cc: Linux Networking, Linux Intel Ethernet Drivers,
	Linux Kernel Mailing List, Linux DRI Development,
	Linux Regressions

Hi,

I notice a regression report on Bugzilla [1]. Quoting from it:

> Hi,
> 
> After I updated to 6.4 through an Arch Linux kernel update, I suddenly noticed random packet loss on my router-like nodes. These nodes have the following networking-relevant configuration:
> 
> 1. Using Arch Linux
> 2. Network config through systemd-networkd
> 3. Using bird2 for BGP routing, but that is not relevant to this bug.
> 4. Using nftables for traffic control, but that does not seem relevant to this bug.
> 5. Not using fail2ban-like dynamic filtering tools, at least at the L3/L4 level
> 
> After ruling out systemd-networkd and nftables as causes, I tracked the issue down to the kernel.
> 
> Here's the tcpdump I'm seeing on one side of my node ""
> 
> ```
> sudo tcpdump -i fios_wan port 38851
> tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
> listening on fios_wan, link-type EN10MB (Ethernet), snapshot length 262144 bytes
> 10:33:06.073236 IP [BOS1_NODE].38851 > [REDACTED_PUBLIC_IPv4_1].38851: UDP, length 148
> 10:33:11.406607 IP [BOS1_NODE].38851 > [REDACTED_PUBLIC_IPv4_1].38851: UDP, length 148
> 10:33:16.739969 IP [BOS1_NODE].38851 > [REDACTED_PUBLIC_IPv4_1].38851: UDP, length 148
> 10:33:21.859856 IP [BOS1_NODE].38851 > [REDACTED_PUBLIC_IPv4_1].38851: UDP, length 148
> 10:33:27.193176 IP [BOS1_NODE].38851 > [REDACTED_PUBLIC_IPv4_1].38851: UDP, length 148
> 5 packets captured
> 5 packets received by filter
> 0 packets dropped by kernel
> ```
> 
> But on the other side, "[REDACTED_PUBLIC_IPv4_1]", tcpdump shows replies being sent in this WireGuard stream, so the packets are lost somewhere along the link.
> 
> From the other side, I can run "mtr" against "[BOS1_NODE]"'s public IP, and the point where the link is lost is right at "[BOS1_NODE]". That means "[BOS1_NODE]"'s networking stack completely drops inbound packets from specific IP addresses.
> 
> Some more digging
> 
> 1. This situation begins at varying delays after boot. Sometimes it triggers 30 seconds after booting, and sometimes only after 18 hours or more.
> 2. It can evolve into a worse case in which, when I run "ip neigh show", the IPv4 ARP table and IPv6 neighbor discovery entries start to appear as "invalid", meaning Internet connectivity is completely lost.
> 3. When this happened on the WAN-facing interface, the LAN-facing interfaces seemed OK. The WAN interface was an Intel X710-T4L using i40e, and the LAN side was using virtio.
> 4. I tried to bisect between 6.3 and 6.4, and the first bad commit reported was "a3efabee5878b8d7b1863debb78cb7129d07a346". But that commit is not relevant to networking at all, so it may be the wrong commit to look at. At the same time, because I haven't found a way to trigger the issue 100% reliably, some commits marked "good" during the bisect may actually be bad.
> 5. I also looked at "dmesg", but nothing interesting popped up. I'll make it available upon request.
> 
> This is my first bug report. Sorry for any confusion it may cause, and thanks for reading.

See Bugzilla for the full thread.
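
As a cross-check of the capture quoted above, the tcpdump lines can be parsed to count packets per direction and to measure the gaps between them. This is only a sketch: the `CAPTURE` text is copied from the report, while `parse_capture`, `summarize`, and the `local` argument are hypothetical names introduced for illustration.

```python
import re
from datetime import datetime

# Packet lines copied from the tcpdump output in the report.
CAPTURE = """\
10:33:06.073236 IP [BOS1_NODE].38851 > [REDACTED_PUBLIC_IPv4_1].38851: UDP, length 148
10:33:11.406607 IP [BOS1_NODE].38851 > [REDACTED_PUBLIC_IPv4_1].38851: UDP, length 148
10:33:16.739969 IP [BOS1_NODE].38851 > [REDACTED_PUBLIC_IPv4_1].38851: UDP, length 148
10:33:21.859856 IP [BOS1_NODE].38851 > [REDACTED_PUBLIC_IPv4_1].38851: UDP, length 148
10:33:27.193176 IP [BOS1_NODE].38851 > [REDACTED_PUBLIC_IPv4_1].38851: UDP, length 148
"""

LINE_RE = re.compile(
    r"^(\d\d:\d\d:\d\d\.\d+) IP (\S+) > (\S+): UDP, length (\d+)$"
)


def parse_capture(text):
    """Return (timestamp, source, destination, length) for each UDP line."""
    packets = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
            packets.append(
                (stamp, match.group(2), match.group(3), int(match.group(4)))
            )
    return packets


def summarize(packets, local):
    """Count outbound/inbound packets for `local` and list outbound gaps in seconds."""
    outbound = [p for p in packets if p[1].startswith(local)]
    inbound = [p for p in packets if p[2].startswith(local)]
    gaps = [
        (later[0] - earlier[0]).total_seconds()
        for earlier, later in zip(outbound, outbound[1:])
    ]
    return len(outbound), len(inbound), gaps


if __name__ == "__main__":
    sent, received, gaps = summarize(parse_capture(CAPTURE), "[BOS1_NODE]")
    print(f"outbound={sent} inbound={received}")
    print("gaps (s):", [f"{gap:.3f}" for gap in gaps])
```

All five packets are outbound and roughly 5.1 to 5.3 seconds apart, with nothing coming back on this interface. That pattern would be consistent with WireGuard handshake initiations (148 bytes each) being retried, though that reading is an inference from the capture and not something the report states.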

Thorsten: The reporter had a bad bisect (some bad commits were marked as good
instead), hence the SoB chain for the (unrelated) ivpu commit it landed on is
in the To: list. I also asked the reporter (also in To:) to provide dmesg and
to rerun the bisection, but he doesn't currently have a reliable reproducer.
Is this the best I can do?

Anyway, I'm adding this regression to be tracked in regzbot:

#regzbot introduced: a3efabee5878b8 https://bugzilla.kernel.org/show_bug.cgi?id=217678
#regzbot title: packet drop on Intel X710-T4L due to ivpu boot fix

Thanks.

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=217678

-- 
An old man doll... just what I always wanted! - Clara


* Re: Fwd: Unexplainable packet drop starting at v6.4
  2023-07-18  0:51 Fwd: Unexplainable packet drop starting at v6.4 Bagas Sanjaya
@ 2023-07-19 11:49 ` Thorsten Leemhuis
  2023-07-19 12:30   ` Bagas Sanjaya
  2023-07-25 23:50 ` Bagas Sanjaya
  1 sibling, 1 reply; 6+ messages in thread
From: Thorsten Leemhuis @ 2023-07-19 11:49 UTC (permalink / raw)
  To: Bagas Sanjaya, Andrzej Kacprowski, Krystian Pradzynski,
	Stanislaw Gruszka, Jacek Lawrynowicz, Oded Gabbay,
	Jesse Brandeburg, Tony Nguyen, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, hq.dev+kernel
  Cc: Linux Networking, Linux Intel Ethernet Drivers,
	Linux Kernel Mailing List, Linux DRI Development,
	Linux Regressions

On 18.07.23 02:51, Bagas Sanjaya wrote:
> 
> I notice a regression report on Bugzilla [1]. Quoting from it:
> 
>> After I updated to 6.4 through an Arch Linux kernel update, I suddenly noticed random packet loss on my router-like nodes. These nodes have the following networking-relevant configuration:
>>
>> 1. Using Arch Linux
>> 2. Network config through systemd-networkd
>> 3. Using bird2 for BGP routing, but that is not relevant to this bug.
>> 4. Using nftables for traffic control, but that does not seem relevant to this bug.
>> 5. Not using fail2ban-like dynamic filtering tools, at least at the L3/L4 level
>>
>> After ruling out systemd-networkd and nftables as causes, I tracked the issue down to the kernel.
> [...]
> See Bugzilla for the full thread.
> 
> Thorsten: The reporter had a bad bisect (some bad commits were marked as good
> instead), hence the SoB chain for the (unrelated) ivpu commit it landed on is
> in the To: list. I also asked the reporter (also in To:) to provide dmesg and
> to rerun the bisection, but he doesn't currently have a reliable reproducer.
> Is this the best I can do?

When a bisection has apparently gone sideways, it's best not to bother
the developers of the supposed culprit with it; they will most likely
just be annoyed by it (and then they might become annoyed by regression
tracking, which we need to avoid).

I'd have forwarded this to the network folks, but in a style along the
lines of "FYI, in case somebody has an idea or has heard about something
similar and thus can help; if not, no worries, reporter is repeating the
bisection".

> Anyway, I'm adding this regression to be tracked in regzbot:
> 
> #regzbot introduced: a3efabee5878b8 https://bugzilla.kernel.org/show_bug.cgi?id=217678
> #regzbot title: packet drop on Intel X710-T4L due to ivpu boot fix
>
> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=217678

Side note for the record: Stephen also forwarded this. And let me also
clear the commit you specified, as it sounds unlikely to be causing
this.

#regzbot introduced: v6.3..v6.4
#regzbot monitor:
https://lore.kernel.org/all/20230717115352.79aecc71@hermes.local/

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.


* Re: Fwd: Unexplainable packet drop starting at v6.4
  2023-07-19 11:49 ` Thorsten Leemhuis
@ 2023-07-19 12:30   ` Bagas Sanjaya
  2023-07-19 13:00     ` Thorsten Leemhuis
  0 siblings, 1 reply; 6+ messages in thread
From: Bagas Sanjaya @ 2023-07-19 12:30 UTC (permalink / raw)
  To: Thorsten Leemhuis, Andrzej Kacprowski, Krystian Pradzynski,
	Stanislaw Gruszka, Jacek Lawrynowicz, Oded Gabbay,
	Jesse Brandeburg, Tony Nguyen, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, hq.dev+kernel
  Cc: Linux Networking, Linux Intel Ethernet Drivers,
	Linux Kernel Mailing List, Linux DRI Development,
	Linux Regressions

On 7/19/23 18:49, Thorsten Leemhuis wrote:
> On 18.07.23 02:51, Bagas Sanjaya wrote:
>>
>> I notice a regression report on Bugzilla [1]. Quoting from it:
>>
>>> After I updated to 6.4 through an Arch Linux kernel update, I suddenly noticed random packet loss on my router-like nodes. These nodes have the following networking-relevant configuration:
>>>
>>> 1. Using Arch Linux
>>> 2. Network config through systemd-networkd
>>> 3. Using bird2 for BGP routing, but that is not relevant to this bug.
>>> 4. Using nftables for traffic control, but that does not seem relevant to this bug.
>>> 5. Not using fail2ban-like dynamic filtering tools, at least at the L3/L4 level
>>>
>>> After ruling out systemd-networkd and nftables as causes, I tracked the issue down to the kernel.
>> [...]
>> See Bugzilla for the full thread.
>>
>> Thorsten: The reporter had a bad bisect (some bad commits were marked as good
>> instead), hence the SoB chain for the (unrelated) ivpu commit it landed on is
>> in the To: list. I also asked the reporter (also in To:) to provide dmesg and
>> to rerun the bisection, but he doesn't currently have a reliable reproducer.
>> Is this the best I can do?
> 
> When a bisection has apparently gone sideways, it's best not to bother
> the developers of the supposed culprit with it; they will most likely
> just be annoyed by it (and then they might become annoyed by regression
> tracking, which we need to avoid).
> 

Do you mean I shouldn't Cc: the culprit's author in that case?

> I'd have forwarded this to the network folks, but in a style along the
> lines of "FYI, in case somebody has an idea or has heard about something
> similar and thus can help; if not, no worries, reporter is repeating the
> bisection".
> 

Aha! I missed that point. I already have networking folks in To: list.

Thanks!

-- 
An old man doll... just what I always wanted! - Clara



* Re: Fwd: Unexplainable packet drop starting at v6.4
  2023-07-19 12:30   ` Bagas Sanjaya
@ 2023-07-19 13:00     ` Thorsten Leemhuis
  0 siblings, 0 replies; 6+ messages in thread
From: Thorsten Leemhuis @ 2023-07-19 13:00 UTC (permalink / raw)
  To: Bagas Sanjaya, Andrzej Kacprowski, Krystian Pradzynski,
	Stanislaw Gruszka, Jacek Lawrynowicz, Oded Gabbay,
	Jesse Brandeburg, Tony Nguyen, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, hq.dev+kernel
  Cc: Linux Networking, Linux Intel Ethernet Drivers,
	Linux Kernel Mailing List, Linux DRI Development,
	Linux Regressions

On 19.07.23 14:30, Bagas Sanjaya wrote:
> On 7/19/23 18:49, Thorsten Leemhuis wrote:
>> On 18.07.23 02:51, Bagas Sanjaya wrote:
>>> I notice a regression report on Bugzilla [1]. Quoting from it:
>>>
>>>> After I updated to 6.4 through an Arch Linux kernel update, I suddenly noticed random packet loss on my router-like nodes. These nodes have the following networking-relevant configuration:
>>>>
>>>> 1. Using Arch Linux
>>>> 2. Network config through systemd-networkd
>>>> 3. Using bird2 for BGP routing, but that is not relevant to this bug.
>>>> 4. Using nftables for traffic control, but that does not seem relevant to this bug.
>>>> 5. Not using fail2ban-like dynamic filtering tools, at least at the L3/L4 level
>>>>
>>>> After ruling out systemd-networkd and nftables as causes, I tracked the issue down to the kernel.
>>> [...]
>>> See Bugzilla for the full thread.
>>>
>>> Thorsten: The reporter had a bad bisect (some bad commits were marked as good
>>> instead), hence the SoB chain for the (unrelated) ivpu commit it landed on is
>>> in the To: list. I also asked the reporter (also in To:) to provide dmesg and
>>> to rerun the bisection, but he doesn't currently have a reliable reproducer.
>>> Is this the best I can do?
>>
>> When a bisection has apparently gone sideways, it's best not to bother
>> the developers of the supposed culprit with it; they will most likely
>> just be annoyed by it (and then they might become annoyed by regression
>> tracking, which we need to avoid).
>
> Do you mean I shouldn't Cc: the culprit's author in that case?

Yes. If a bisection lands on a commit that seems like a pretty unlikely
culprit for the problem at hand (which even the reporter admitted in the
report), then ask the reporter to verify the result (ideally by trying
to revert it on top of latest mainline; checking the parent commit again
can sometimes do the trick as well) before involving the people who
authored and handled said change. Otherwise you just raise a false alarm,
and then people will be annoyed by our work or, if we are unlucky, start
to ignore us -- and we need to prevent that.
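
To illustrate why checking the parent commit again can expose a derailed bisection, here is a minimal simulation, assuming a bug that only shows up in some test runs. Commits are modeled as plain integers, and `bisect_first_bad`, `flaky_test`, and `recheck_parent` are hypothetical helpers, not part of git or regzbot.

```python
import random


def bisect_first_bad(n_commits, looks_bad):
    """Binary search like 'git bisect': commit 0 is known good, the last is known bad."""
    lo, hi = 0, n_commits - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if looks_bad(mid):
            hi = mid  # bug observed: first bad commit is at or before mid
        else:
            lo = mid  # bug not observed: commit gets marked good, rightly or wrongly
    return hi


def flaky_test(true_first_bad, hit_rate, rng):
    """Build a test in which a truly bad commit shows the bug only with probability hit_rate."""
    def looks_bad(commit):
        return commit >= true_first_bad and rng.random() < hit_rate
    return looks_bad


def recheck_parent(reported, looks_bad, attempts):
    """Test the parent of the reported commit several times; True means the bisect overshot."""
    parent = reported - 1
    return any(looks_bad(parent) for _ in range(attempts))


if __name__ == "__main__":
    rng = random.Random(2023)
    test = flaky_test(true_first_bad=400, hit_rate=0.3, rng=rng)
    reported = bisect_first_bad(1000, test)
    print("reported first bad commit:", reported)
    print("parent still shows the bug:", recheck_parent(reported, test, attempts=20))
```

Because a missed trigger can only mark a bad commit as good, never the other way around, the reported commit in this model is always at or after the real culprit. Repeated testing of its parent is therefore a cheap way to notice that the search overshot.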

Ciao, Thorsten


* Re: Fwd: Unexplainable packet drop starting at v6.4
  2023-07-18  0:51 Fwd: Unexplainable packet drop starting at v6.4 Bagas Sanjaya
  2023-07-19 11:49 ` Thorsten Leemhuis
@ 2023-07-25 23:50 ` Bagas Sanjaya
  2023-07-26  0:47   ` Jakub Kicinski
  1 sibling, 1 reply; 6+ messages in thread
From: Bagas Sanjaya @ 2023-07-25 23:50 UTC (permalink / raw)
  To: Andrzej Kacprowski, Krystian Pradzynski, Stanislaw Gruszka,
	Jacek Lawrynowicz, Oded Gabbay, Jesse Brandeburg, Tony Nguyen,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Linux regression tracking (Thorsten Leemhuis),
	hq.dev+kernel, Linus Torvalds
  Cc: Linux Networking, Linux Intel Ethernet Drivers,
	Linux Kernel Mailing List, Linux DRI Development,
	Linux Regressions


On Tue, Jul 18, 2023 at 07:51:24AM +0700, Bagas Sanjaya wrote:
> Hi,
> 
> I notice a regression report on Bugzilla [1]. Quoting from it:
> 
> > [...]
> 
> See Bugzilla for the full thread.
> 
> Thorsten: The reporter had a bad bisect (some bad commits were marked as good
> instead), hence the SoB chain for the (unrelated) ivpu commit it landed on is
> in the To: list. I also asked the reporter (also in To:) to provide dmesg and
> to rerun the bisection, but he doesn't currently have a reliable reproducer.
> Is this the best I can do?
> 
> Anyway, I'm adding this regression to be tracked in regzbot:
> 
> #regzbot introduced: a3efabee5878b8 https://bugzilla.kernel.org/show_bug.cgi?id=217678
> #regzbot title: packet drop on Intel X710-T4L due to ivpu boot fix
> 

This time, the bisection points to the v6.4 networking pull, so:

#regzbot introduced: 6e98b09da931a0

(also Cc: Linus.)

Thanks.

-- 
An old man doll... just what I always wanted! - Clara



* Re: Unexplainable packet drop starting at v6.4
  2023-07-25 23:50 ` Bagas Sanjaya
@ 2023-07-26  0:47   ` Jakub Kicinski
  0 siblings, 0 replies; 6+ messages in thread
From: Jakub Kicinski @ 2023-07-26  0:47 UTC (permalink / raw)
  To: Bagas Sanjaya
  Cc: Krystian Pradzynski, hq.dev+kernel, Linux Regressions,
	Linux Networking, Oded Gabbay, Jesse Brandeburg,
	Linux DRI Development, Linux Kernel Mailing List,
	Stanislaw Gruszka, Eric Dumazet,
	Linux regression tracking (Thorsten Leemhuis),
	Tony Nguyen, Jacek Lawrynowicz, Linux Intel Ethernet Drivers,
	Andrzej Kacprowski, Paolo Abeni, Linus Torvalds, David S. Miller

On Wed, 26 Jul 2023 06:50:52 +0700 Bagas Sanjaya wrote:
> This time, the bisection points to the v6.4 networking pull, so:
> 
> #regzbot introduced: 6e98b09da931a0

Ask the reporter to test 9b78d919632b, i.e. the tip of net-next before
the merge. It seems quite unlikely that the merge itself is the problem.
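
The reasoning here can be sketched on a toy history: if the tip of the merged branch already shows the problem while the mainline side of the merge does not, the suspects are the commits the pull brought in, not the merge commit. The graph below is invented for illustration (it does not model the real 6e98b09da931a0 or 9b78d919632b), and `ancestors` and `merged_in` are hypothetical helper names.

```python
# Toy history: each commit maps to its list of parents. For the merge,
# the first parent is the mainline side and the second is the branch tip.
HISTORY = {
    "base": [],
    "main1": ["base"],
    "net1": ["base"],
    "net2": ["net1"],
    "net_tip": ["net2"],
    "merge": ["main1", "net_tip"],
}


def ancestors(commit, parents):
    """Return the commit plus everything reachable through its parents."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def merged_in(merge, parents):
    """Commits the merge brought in: reachable from the branch tip but not from mainline."""
    first_parent, branch_tip = parents[merge]
    return ancestors(branch_tip, parents) - ancestors(first_parent, parents)


if __name__ == "__main__":
    # If "net_tip" is bad and "main1" is good, these are the commits left to bisect.
    print(sorted(merged_in("merge", HISTORY)))
```

Testing the branch tip first tells the reporter which side of the merge to continue bisecting on, which narrows the search far more than pointing at the merge commit does.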

