linux-kernel.vger.kernel.org archive mirror
* Intermittent performance regression related to ipset between 5.10 and 5.15
@ 2022-03-15 23:15 McLean, Patrick
  2022-03-16  9:17 ` Thorsten Leemhuis
  0 siblings, 1 reply; 20+ messages in thread
From: McLean, Patrick @ 2022-03-15 23:15 UTC (permalink / raw)
  To: netdev; +Cc: U'ren, Aaron, Brown, Russell, Rueger, Manuel, linux-kernel

When we upgraded from the 5.10 (5.10.61) series to the 5.15 (5.15.16) series, we encountered an intermittent performance regression that appears to be related to iptables / ipset. We noticed this regression on Kubernetes hosts that run kube-router and experience a high amount of churn to both iptables and ipsets. Specifically, when we run the nftables (iptables-1.8.7 / nftables-1.0.0) iptables wrapper xtables-nft-multi on the 5.15 series kernel, we get extremely laggy response times when iptables attempts to look up information on the ipsets that are used in the iptables definitions. The issue isn't reproducible on all hosts; across a fleet of ~50 hosts, we experienced it on ~40% of them. When the problem manifests, the time it takes to run unrestricted iptables list commands such as iptables -L or iptables-save gradually increases over the course of about 1 - 2 hours, growing from less than a second to sometimes over 2 minutes. After that 2-hour mark it seems to plateau and not grow any longer. Flushing tables or ipsets doesn't seem to have any effect on the issue, but rebooting the host does reset it. Occasionally, a machine that was exhibiting the problem no longer exhibits it after being rebooted.

We did try to debug this to find a root cause, but ultimately ran short on time. We were not able to bisect to narrow down the issue, as the problem isn't consistently reproducible. We were able to get some straces, where it appears that most of the time is spent in getsockopt() operations. During iptables operations, iptables appears to do some work to resolve the ipsets that are linked to the iptables definitions (perhaps getting the names of the ipsets themselves?), and on affected hosts that getsockopt request slowly takes more and more time. Here is an example strace of the operation in question:

0.000074 newfstatat(AT_FDCWD, "/etc/nsswitch.conf", {st_mode=S_IFREG|0644, st_size=539, ...}, 0) = 0 <0.000017>
0.000064 openat(AT_FDCWD, "/var/db/protocols.db", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) <0.000017>
0.000057 openat(AT_FDCWD, "/etc/protocols", O_RDONLY|O_CLOEXEC) = 4 <0.000013>
0.000034 newfstatat(4, "", {st_mode=S_IFREG|0644, st_size=6108, ...}, AT_EMPTY_PATH) = 0 <0.000009>
0.000032 lseek(4, 0, SEEK_SET)     = 0 <0.000008>
0.000025 read(4, "# /etc/protocols\n#\n# Internet (I"..., 4096) = 4096 <0.000010>
0.000036 close(4)                  = 0 <0.000008>
0.000028 write(1, "ANGME7BF25 - [0:0]\n:KUBE-POD-FW-"..., 4096) = 4096 <0.000028>
0.000049 socket(AF_INET, SOCK_RAW, IPPROTO_RAW) = 4 <0.000015>
0.000032 fcntl(4, F_SETFD, FD_CLOEXEC) = 0 <0.000008>
0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.109384>
0.109456 close(4)                  = 0 <0.000022>
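The option strace prints as `0x53 /* IP_??? */` appears to be SO_IP_SET, the socket option behind ipset's legacy get-by-name/get-by-index interface; strace simply has no symbolic name for it. Below is a small sketch that decodes the two reply buffers shown above. The constants and struct layouts are taken from my reading of the uapi header linux/netfilter/ipset/ip_set.h, so treat them as assumptions rather than an authoritative reference:

```python
# Decode the two SO_IP_SET getsockopt() payloads from the strace above.
# Layouts assumed from linux/netfilter/ipset/ip_set.h (uapi).
import struct

SO_IP_SET = 0x53               # the option strace shows as "0x53 /* IP_??? */"
IP_SET_OP_GET_BYNAME = 0x06    # struct ip_set_req_get_set: name -> index
IP_SET_OP_GET_BYINDEX = 0x07   # struct ip_set_req_get_set: index -> name
IP_SET_OP_VERSION = 0x100      # struct ip_set_req_version
IPSET_MAXNAMELEN = 32

def decode(buf: bytes) -> dict:
    """Decode an ip_set_req_* buffer: two native-endian u32s (op, version),
    optionally followed by a 32-byte NUL-padded set name."""
    op, version = struct.unpack_from("=II", buf)
    out = {"op": op, "version": version}
    if op == IP_SET_OP_GET_BYINDEX and len(buf) >= 8 + IPSET_MAXNAMELEN:
        out["set_name"] = buf[8:8 + IPSET_MAXNAMELEN].split(b"\0", 1)[0].decode()
    return out

# First call: the 8-byte "\0\1\0\0\7\0\0\0" buffer -> op 0x100 (VERSION), version 7.
print(decode(b"\x00\x01\x00\x00\x07\x00\x00\x00"))

# Second call: the 40-byte GET_BYINDEX reply. The set name in the strace is
# truncated, so the buffer here is illustrative, built from the visible prefix.
name = b"KUBE-DST-VBH27M7NWLDOZIE".ljust(IPSET_MAXNAMELEN, b"\0")
print(decode(b"\x07\x00\x00\x00\x07\x00\x00\x00" + name))
```

If the layout assumption holds, the first (fast) call is the protocol-version handshake and the slow 40-byte call resolves a set index to its name, which would explain iptables issuing one such call per referenced set.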

On a host that is not evidencing the performance regression, we normally see that operation take ~0.00001s, as opposed to 0.109384s. Additionally, on hosts that were evidencing the problem, we also saw high lock hold times with `klockstat` (unfortunately, at the time we did not know to run echo "0" > /proc/sys/kernel/kptr_restrict to get the callers in the output below).

klockstat -i 5 -n 10 (on a host experiencing the problem)
Caller   Avg Hold  Count   Max hold Total hold
b'[unknown]'  118490772     83  179899470 9834734132
b'[unknown]'  118416941     83  179850047 9828606138
# or somewhere later while iptables -vnL was running:
Caller   Avg Hold  Count   Max hold Total hold
b'[unknown]'  496466236     46 17919955720 22837446860
b'[unknown]'  496391064     46 17919893843 22833988950

klockstat -i 5 -n 10 (on a host not experiencing the problem)
Caller   Avg Hold  Count   Max hold Total hold
b'[unknown]'     120316   1510   85537797  181677885
b'[unknown]'    7119070     24   85527251  170857690
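The gradual growth described above could be tracked with a small harness that repeatedly times the list command and records the drift. This is only a sketch; the command is a placeholder (on an affected host one would substitute xtables-nft-multi iptables-save or whichever command is slow):

```python
# Minimal latency-trend harness: repeatedly time a command and report how the
# wall-clock runtime drifts over successive runs.
import subprocess
import time

def time_command(argv, runs=5, interval=0.0):
    """Return a list of wall-clock durations (seconds) for `runs` invocations."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        subprocess.run(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append(time.perf_counter() - t0)
        time.sleep(interval)
    return samples

if __name__ == "__main__":
    # e.g. argv = ["xtables-nft-multi", "iptables-save"] on an affected host;
    # ["true"] is used here only so the sketch runs anywhere.
    samples = time_command(["true"], runs=3)
    print("min=%.6fs max=%.6fs" % (min(samples), max(samples)))
```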


* Re: Intermittent performance regression related to ipset between 5.10 and 5.15
  2022-03-15 23:15 Intermittent performance regression related to ipset between 5.10 and 5.15 McLean, Patrick
@ 2022-03-16  9:17 ` Thorsten Leemhuis
  2022-04-11 10:36   ` Thorsten Leemhuis
  2022-07-01  5:18   ` Intermittent performance regression related to ipset between 5.10 and 5.15 #forregzbot Thorsten Leemhuis
  0 siblings, 2 replies; 20+ messages in thread
From: Thorsten Leemhuis @ 2022-03-16  9:17 UTC (permalink / raw)
  To: McLean, Patrick, netdev
  Cc: U'ren, Aaron, Brown, Russell, Rueger, Manuel, linux-kernel,
	regressions

[TLDR: I'm adding the regression report below to regzbot, the Linux
kernel regression tracking bot; all text you find below is compiled from
a few template paragraphs you might have encountered already from
similar mails.]

On 16.03.22 00:15, McLean, Patrick wrote:
> When we upgraded from the 5.10 (5.10.61) series to the 5.15 (5.15.16) series, we encountered an intermittent performance regression that appears to be related to iptables / ipset. This regression was noticed on Kubernetes hosts that run kube-router and experience a high amount of churn to both iptables and ipsets. Specifically, when we run the nftables (iptables-1.8.7 / nftables-1.0.0) iptables wrapper xtables-nft-multi on the 5.15 series kernel, we end up getting extremely laggy response times when iptables attempts to lookup information on the ipsets that are used in the iptables definition. This issue isn’t reproducible on all hosts. However, our experience has been that across a fleet of ~50 hosts we experienced this issue on ~40% of the hosts. When the problem evidences, the time that it takes to run unrestricted iptables list commands like iptables -L or iptables-save gradually increases over the course of about 1 - 2 hours. Growing from less than a second to run, to taking sometimes over 2 minutes to run. After that 2 hour mark it seems to plateau and not grow any longer. Flushing tables or ipsets doesn’t seem to have any affect on the issue. However, rebooting the host does reset the issue. Occasionally, a machine that was evidencing the problem may no longer evidence it after being rebooted.
> 
> We did try to debug this to find a root cause, but ultimately ran short on time. We were not able to perform a set of bisects to hopefully narrow down the issue as the problem isn’t consistently reproducible. We were able to get some straces where it appears that most of the time is spent on getsockopt() operations. It appears that during iptables operations, it attempts to do some work to resolve the ipsets that are linked to the iptables definitions (perhaps getting the names of the ipsets themselves?). Slowly that getsockopt request takes more and more time on affected hosts. Here is an example strace of the operation in question:
> 
> 0.000074 newfstatat(AT_FDCWD, "/etc/nsswitch.conf", {st_mode=S_IFREG|0644, st_size=539, ...}, 0) = 0 <0.000017>
> 0.000064 openat(AT_FDCWD, "/var/db/protocols.db", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) <0.000017>
> 0.000057 openat(AT_FDCWD, "/etc/protocols", O_RDONLY|O_CLOEXEC) = 4 <0.000013>
> 0.000034 newfstatat(4, "", {st_mode=S_IFREG|0644, st_size=6108, ...}, AT_EMPTY_PATH) = 0 <0.000009>
> 0.000032 lseek(4, 0, SEEK_SET)     = 0 <0.000008>
> 0.000025 read(4, "# /etc/protocols\n#\n# Internet (I"..., 4096) = 4096 <0.000010>
> 0.000036 close(4)                  = 0 <0.000008>
> 0.000028 write(1, "ANGME7BF25 - [0:0]\n:KUBE-POD-FW-"..., 4096) = 4096 <0.000028>
> 0.000049 socket(AF_INET, SOCK_RAW, IPPROTO_RAW) = 4 <0.000015>
> 0.000032 fcntl(4, F_SETFD, FD_CLOEXEC) = 0 <0.000008>
> 0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
> 0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.109384>
> 0.109456 close(4)                  = 0 <0.000022>
> 
> On a host that is not evidencing the performance regression we normally see that operation take ~ 0.00001 as opposed to 0.109384.Additionally, hosts that were evidencing the problem we also saw high lock times with `klockstat` (unfortunately at the time we did not know about or run echo "0" > /proc/sys/kernel/kptr_restrict to get the callers of the below commands).
> 
> klockstat -i 5 -n 10 (on a host experiencing the problem)
> Caller   Avg Hold  Count   Max hold Total hold
> b'[unknown]'  118490772     83  179899470 9834734132
> b'[unknown]'  118416941     83  179850047 9828606138
> # or somewhere later while iptables -vnL was running:
> Caller   Avg Hold  Count   Max hold Total hold
> b'[unknown]'  496466236     46 17919955720 22837446860
> b'[unknown]'  496391064     46 17919893843 22833988950
> 
> klockstat -i 5 -n 10 (on a host not experiencing the problem)
> Caller   Avg Hold  Count   Max hold Total hold
> b'[unknown]'     120316   1510   85537797  181677885
> b'[unknown]'    7119070     24   85527251  170857690

Hi, this is your Linux kernel regression tracker.

Thanks for the report.

CCing the regression mailing list, as it should be in the loop for all
regressions, as explained here:
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html

To be sure the issue below doesn't fall through the cracks unnoticed,
I'm adding it to regzbot, my Linux kernel regression tracking bot:

#regzbot ^introduced v5.10..v5.15
#regzbot title net: netfilter: Intermittent performance regression
related to ipset
#regzbot ignore-activity

If it turns out this isn't a regression, feel free to remove it from the
tracking by sending a reply to this thread containing a paragraph like
"#regzbot invalid: reason why this is invalid" (without the quotes).

Reminder for developers: when fixing the issue, please add a 'Link:'
tag pointing to the report (the mail quoted above) using
lore.kernel.org/r/, as explained in
'Documentation/process/submitting-patches.rst' and
'Documentation/process/5.Posting.rst'. Regzbot needs them to
automatically connect reports with fixes, but they are useful in
general, too.

I'm sending this to everyone that got the initial report, to make
everyone aware of the tracking. I also hope that messages like this
motivate people to directly get at least the regression mailing list and
ideally even regzbot involved when dealing with regressions, as messages
like this wouldn't be needed then. And don't worry, if I need to send
other mails regarding this regression only relevant for regzbot I'll
send them to the regressions lists only (with a tag in the subject so
people can filter them away). With a bit of luck no such messages will
be needed anyway.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I'm getting a lot of
reports on my table. I can only look briefly into most of them and lack
knowledge about most of the areas they concern. I thus unfortunately
will sometimes get things wrong or miss something important. I hope
that's not the case here; if you think it is, don't hesitate to tell me
in a public reply, it's in everyone's interest to set the public record
straight.

-- 
Additional information about regzbot:

If you want to know more about regzbot, check out its web-interface, the
getting start guide, and the references documentation:

https://linux-regtracking.leemhuis.info/regzbot/
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md

The last two documents explain how you can interact with regzbot
yourself if you want to.

Hint for reporters: when reporting a regression it's in your interest to
CC the regression list and tell regzbot about the issue, as that ensures
the regression makes it onto the radar of the Linux kernel's regression
tracker -- that's in your interest, as it ensures your report won't fall
through the cracks unnoticed.

Hint for developers: you normally don't need to care about regzbot once
it's involved. Fix the issue as you normally would, just remember to
include 'Link:' tag in the patch descriptions pointing to all reports
about the issue. This has been expected from developers even before
regzbot showed up for reasons explained in
'Documentation/process/submitting-patches.rst' and
'Documentation/process/5.Posting.rst'.


* Re: Intermittent performance regression related to ipset between 5.10 and 5.15
  2022-03-16  9:17 ` Thorsten Leemhuis
@ 2022-04-11 10:36   ` Thorsten Leemhuis
  2022-04-11 11:47     ` Jozsef Kadlecsik
  2022-07-01  5:18   ` Intermittent performance regression related to ipset between 5.10 and 5.15 #forregzbot Thorsten Leemhuis
  1 sibling, 1 reply; 20+ messages in thread
From: Thorsten Leemhuis @ 2022-04-11 10:36 UTC (permalink / raw)
  To: Jozsef Kadlecsik, Pablo Neira Ayuso, netfilter-devel
  Cc: U'ren, Aaron, Brown, Russell, Rueger, Manuel, linux-kernel,
	regressions, Florian Westphal, netdev, McLean, Patrick

[CCing netfilter and netfilter maintainers]

Hi, this is your Linux kernel regression tracker. Top-posting for once,
to make this easily accessible to everyone.

What's up here? Was this regression report ever addressed? It looks to
me like it fell through the cracks.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I'm getting a lot of
reports on my table. I can only look briefly into most of them and lack
knowledge about most of the areas they concern. I thus unfortunately
will sometimes get things wrong or miss something important. I hope
that's not the case here; if you think it is, don't hesitate to tell me
in a public reply, it's in everyone's interest to set the public record
straight.

#regzbot poke

On 16.03.22 10:17, Thorsten Leemhuis wrote:
> [TLDR: I'm adding the regression report below to regzbot, the Linux
> kernel regression tracking bot; all text you find below is compiled from
> a few templates paragraphs you might have encountered already already
> from similar mails.]
> 
> On 16.03.22 00:15, McLean, Patrick wrote:
>> When we upgraded from the 5.10 (5.10.61) series to the 5.15 (5.15.16) series, we encountered an intermittent performance regression that appears to be related to iptables / ipset. This regression was noticed on Kubernetes hosts that run kube-router and experience a high amount of churn to both iptables and ipsets. Specifically, when we run the nftables (iptables-1.8.7 / nftables-1.0.0) iptables wrapper xtables-nft-multi on the 5.15 series kernel, we end up getting extremely laggy response times when iptables attempts to lookup information on the ipsets that are used in the iptables definition. This issue isn’t reproducible on all hosts. However, our experience has been that across a fleet of ~50 hosts we experienced this issue on ~40% of the hosts. When the problem evidences, the time that it takes to run unrestricted iptables list commands like iptables -L or iptables-save gradually increases over the course of about 1 - 2 hours. Growing from less than a second to run, to taking sometimes over 2 minutes to run. After that 2 hour mark it seems to plateau and not grow any longer. Flushing tables or ipsets doesn’t seem to have any affect on the issue. However, rebooting the host does reset the issue. Occasionally, a machine that was evidencing the problem may no longer evidence it after being rebooted.
>>
>> We did try to debug this to find a root cause, but ultimately ran short on time. We were not able to perform a set of bisects to hopefully narrow down the issue as the problem isn’t consistently reproducible. We were able to get some straces where it appears that most of the time is spent on getsockopt() operations. It appears that during iptables operations, it attempts to do some work to resolve the ipsets that are linked to the iptables definitions (perhaps getting the names of the ipsets themselves?). Slowly that getsockopt request takes more and more time on affected hosts. Here is an example strace of the operation in question:
>>
>> 0.000074 newfstatat(AT_FDCWD, "/etc/nsswitch.conf", {st_mode=S_IFREG|0644, st_size=539, ...}, 0) = 0 <0.000017>
>> 0.000064 openat(AT_FDCWD, "/var/db/protocols.db", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) <0.000017>
>> 0.000057 openat(AT_FDCWD, "/etc/protocols", O_RDONLY|O_CLOEXEC) = 4 <0.000013>
>> 0.000034 newfstatat(4, "", {st_mode=S_IFREG|0644, st_size=6108, ...}, AT_EMPTY_PATH) = 0 <0.000009>
>> 0.000032 lseek(4, 0, SEEK_SET)     = 0 <0.000008>
>> 0.000025 read(4, "# /etc/protocols\n#\n# Internet (I"..., 4096) = 4096 <0.000010>
>> 0.000036 close(4)                  = 0 <0.000008>
>> 0.000028 write(1, "ANGME7BF25 - [0:0]\n:KUBE-POD-FW-"..., 4096) = 4096 <0.000028>
>> 0.000049 socket(AF_INET, SOCK_RAW, IPPROTO_RAW) = 4 <0.000015>
>> 0.000032 fcntl(4, F_SETFD, FD_CLOEXEC) = 0 <0.000008>
>> 0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
>> 0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.109384>
>> 0.109456 close(4)                  = 0 <0.000022>
>>
>> On a host that is not evidencing the performance regression we normally see that operation take ~ 0.00001 as opposed to 0.109384.Additionally, hosts that were evidencing the problem we also saw high lock times with `klockstat` (unfortunately at the time we did not know about or run echo "0" > /proc/sys/kernel/kptr_restrict to get the callers of the below commands).
>>
>> klockstat -i 5 -n 10 (on a host experiencing the problem)
>> Caller   Avg Hold  Count   Max hold Total hold
>> b'[unknown]'  118490772     83  179899470 9834734132
>> b'[unknown]'  118416941     83  179850047 9828606138
>> # or somewhere later while iptables -vnL was running:
>> Caller   Avg Hold  Count   Max hold Total hold
>> b'[unknown]'  496466236     46 17919955720 22837446860
>> b'[unknown]'  496391064     46 17919893843 22833988950
>>
>> klockstat -i 5 -n 10 (on a host not experiencing the problem)
>> Caller   Avg Hold  Count   Max hold Total hold
>> b'[unknown]'     120316   1510   85537797  181677885
>> b'[unknown]'    7119070     24   85527251  170857690
> 
> Hi, this is your Linux kernel regression tracker.
> 
> Thanks for the report.
> 
> CCing the regression mailing list, as it should be in the loop for all
> regressions, as explained here:
> https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html
> 
> To be sure below issue doesn't fall through the cracks unnoticed, I'm
> adding it to regzbot, my Linux kernel regression tracking bot:
> 
> #regzbot ^introduced v5.10..v5.15
> #regzbot title net: netfilter: Intermittent performance regression
> related to ipset
> #regzbot ignore-activity
> 
> If it turns out this isn't a regression, free free to remove it from the
> tracking by sending a reply to this thread containing a paragraph like
> "#regzbot invalid: reason why this is invalid" (without the quotes).
> 
> Reminder for developers: when fixing the issue, please add a 'Link:'
> tags pointing to the report (the mail quoted above) using
> lore.kernel.org/r/, as explained in
> 'Documentation/process/submitting-patches.rst' and
> 'Documentation/process/5.Posting.rst'. Regzbot needs them to
> automatically connect reports with fixes, but they are useful in
> general, too.
> 
> I'm sending this to everyone that got the initial report, to make
> everyone aware of the tracking. I also hope that messages like this
> motivate people to directly get at least the regression mailing list and
> ideally even regzbot involved when dealing with regressions, as messages
> like this wouldn't be needed then. And don't worry, if I need to send
> other mails regarding this regression only relevant for regzbot I'll
> send them to the regressions lists only (with a tag in the subject so
> people can filter them away). With a bit of luck no such messages will
> be needed anyway.
> 
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> 
> P.S.: As the Linux kernel's regression tracker I'm getting a lot of
> reports on my table. I can only look briefly into most of them and lack
> knowledge about most of the areas they concern. I thus unfortunately
> will sometimes get things wrong or miss something important. I hope
> that's not the case here; if you think it is, don't hesitate to tell me
> in a public reply, it's in everyone's interest to set the public record
> straight.
> 


* Re: Intermittent performance regression related to ipset between 5.10 and 5.15
  2022-04-11 10:36   ` Thorsten Leemhuis
@ 2022-04-11 11:47     ` Jozsef Kadlecsik
  2022-05-04 13:14       ` Thorsten Leemhuis
  0 siblings, 1 reply; 20+ messages in thread
From: Jozsef Kadlecsik @ 2022-04-11 11:47 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: Pablo Neira Ayuso, netfilter-devel, U'ren, Aaron, Brown,
	Russell, Rueger, Manuel, linux-kernel, regressions,
	Florian Westphal, netdev, McLean, Patrick


Hi,

On Mon, 11 Apr 2022, Thorsten Leemhuis wrote:

> On 16.03.22 10:17, Thorsten Leemhuis wrote:
> > [TLDR: I'm adding the regression report below to regzbot, the Linux
> > kernel regression tracking bot; all text you find below is compiled from
> > a few templates paragraphs you might have encountered already already
> > from similar mails.]
> > 
> > On 16.03.22 00:15, McLean, Patrick wrote:
>
> >> When we upgraded from the 5.10 (5.10.61) series to the 5.15 (5.15.16) 
> >> series, we encountered an intermittent performance regression that 
> >> appears to be related to iptables / ipset. This regression was 
> >> noticed on Kubernetes hosts that run kube-router and experience a 
> >> high amount of churn to both iptables and ipsets. Specifically, when 
> >> we run the nftables (iptables-1.8.7 / nftables-1.0.0) iptables 
> >> wrapper xtables-nft-multi on the 5.15 series kernel, we end up 
> >> getting extremely laggy response times when iptables attempts to 
> >> lookup information on the ipsets that are used in the iptables 
> >> definition. This issue isn’t reproducible on all hosts. However, our 
> >> experience has been that across a fleet of ~50 hosts we experienced 
> >> this issue on ~40% of the hosts. When the problem evidences, the time 
> >> that it takes to run unrestricted iptables list commands like 
> >> iptables -L or iptables-save gradually increases over the course of 
> >> about 1 - 2 hours. Growing from less than a second to run, to taking
> >> sometimes over 2 minutes to run. After that 2 hour mark it seems to
> >> plateau and not grow any longer. Flushing tables or ipsets doesn’t seem
> >> to have any affect on the issue. However, rebooting the host does reset
> >> the issue. Occasionally, a machine that was evidencing the problem may
> >> no longer evidence it after being rebooted.
> >>
> >> We did try to debug this to find a root cause, but ultimately ran 
> >> short on time. We were not able to perform a set of bisects to 
> >> hopefully narrow down the issue as the problem isn’t consistently 
> >> reproducible. We were able to get some straces where it appears that 
> >> most of the time is spent on getsockopt() operations. It appears that 
> >> during iptables operations, it attempts to do some work to resolve 
> >> the ipsets that are linked to the iptables definitions (perhaps 
> >> getting the names of the ipsets themselves?). Slowly that getsockopt 
> >> request takes more and more time on affected hosts. Here is an 
> >> example strace of the operation in question:

Yes, iptables list/save have to get the names of the referenced sets and 
that is performed via getsockopt() calls.

I went through all of the ipset-related patches between 5.10.6 (a 
copy&paste error on my part, but that just makes the range larger than 
needed) and 5.15.16, and as far as I can see none of them can be 
responsible for the regression. More data is required to locate the 
source of the slowdown.
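One way to gather such data might be to time the SO_IP_SET getsockopt() call directly, bypassing iptables, to check whether the slowdown lives entirely in the kernel call. A sketch follows; the constants and struct layout come from my reading of the uapi header linux/netfilter/ipset/ip_set.h (assumptions, not a reference), it needs ctypes because this getsockopt() uses the same buffer for request and reply, and actually running it requires CAP_NET_RAW:

```python
# Time the raw SO_IP_SET "version" handshake that iptables issues, without
# iptables in the way. Needs CAP_NET_RAW (root) to run for real.
import ctypes
import ctypes.util
import socket
import struct
import time

SO_IP_SET = 0x53          # the option strace printed as "0x53 /* IP_??? */"
IP_SET_OP_VERSION = 0x100

def build_version_request() -> bytes:
    """struct ip_set_req_version: u32 op, u32 version (kernel fills version in)."""
    return struct.pack("=II", IP_SET_OP_VERSION, 0)

def time_version_call(iterations=100) -> float:
    """Average seconds per SO_IP_SET version handshake (requires root)."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    with socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_RAW) as s:
        buf = ctypes.create_string_buffer(build_version_request(), 8)
        size = ctypes.c_uint(8)
        t0 = time.perf_counter()
        for _ in range(iterations):
            # In/out getsockopt(): the request is written into the same buffer
            # that receives the reply, which plain socket.getsockopt() can't do.
            if libc.getsockopt(s.fileno(), socket.SOL_IP, SO_IP_SET,
                               buf, ctypes.byref(size)) != 0:
                raise OSError(ctypes.get_errno(), "SO_IP_SET getsockopt failed")
            buf.raw = build_version_request()  # reset the op for the next call
        return (time.perf_counter() - t0) / iterations

if __name__ == "__main__":
    try:
        print("avg SO_IP_SET version call: %.6f s" % time_version_call())
    except (PermissionError, OSError) as exc:
        print("could not run (needs CAP_NET_RAW?):", exc)
```

If this call alone shows the same growth as iptables -L, the userspace side can be ruled out and attention can stay on the kernel's set-name resolution path.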

Best regards,
Jozsef

> >>
> >> 0.000074 newfstatat(AT_FDCWD, "/etc/nsswitch.conf", {st_mode=S_IFREG|0644, st_size=539, ...}, 0) = 0 <0.000017>
> >> 0.000064 openat(AT_FDCWD, "/var/db/protocols.db", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) <0.000017>
> >> 0.000057 openat(AT_FDCWD, "/etc/protocols", O_RDONLY|O_CLOEXEC) = 4 <0.000013>
> >> 0.000034 newfstatat(4, "", {st_mode=S_IFREG|0644, st_size=6108, ...}, AT_EMPTY_PATH) = 0 <0.000009>
> >> 0.000032 lseek(4, 0, SEEK_SET)     = 0 <0.000008>
> >> 0.000025 read(4, "# /etc/protocols\n#\n# Internet (I"..., 4096) = 4096 <0.000010>
> >> 0.000036 close(4)                  = 0 <0.000008>
> >> 0.000028 write(1, "ANGME7BF25 - [0:0]\n:KUBE-POD-FW-"..., 4096) = 4096 <0.000028>
> >> 0.000049 socket(AF_INET, SOCK_RAW, IPPROTO_RAW) = 4 <0.000015>
> >> 0.000032 fcntl(4, F_SETFD, FD_CLOEXEC) = 0 <0.000008>
> >> 0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
> >> 0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.109384>
> >> 0.109456 close(4)                  = 0 <0.000022>
> >>
> >> On a host that is not evidencing the performance regression we 
> >> normally see that operation take ~ 0.00001 as opposed to 
> >> 0.109384.Additionally, hosts that were evidencing the problem we also 
> >> saw high lock times with `klockstat` (unfortunately at the time we 
> >> did not know about or run echo "0" > /proc/sys/kernel/kptr_restrict 
> >> to get the callers of the below commands).
> >>
> >> klockstat -i 5 -n 10 (on a host experiencing the problem)
> >> Caller   Avg Hold  Count   Max hold Total hold
> >> b'[unknown]'  118490772     83  179899470 9834734132
> >> b'[unknown]'  118416941     83  179850047 9828606138
> >> # or somewhere later while iptables -vnL was running:
> >> Caller   Avg Hold  Count   Max hold Total hold
> >> b'[unknown]'  496466236     46 17919955720 22837446860
> >> b'[unknown]'  496391064     46 17919893843 22833988950
> >>
> >> klockstat -i 5 -n 10 (on a host not experiencing the problem)
> >> Caller   Avg Hold  Count   Max hold Total hold
> >> b'[unknown]'     120316   1510   85537797  181677885
> >> b'[unknown]'    7119070     24   85527251  170857690
> > 
> > Hi, this is your Linux kernel regression tracker.
> > 
> > Thanks for the report.
> > 
> > CCing the regression mailing list, as it should be in the loop for all
> > regressions, as explained here:
> > https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html
> > 
> > To be sure below issue doesn't fall through the cracks unnoticed, I'm
> > adding it to regzbot, my Linux kernel regression tracking bot:
> > 
> > #regzbot ^introduced v5.10..v5.15
> > #regzbot title net: netfilter: Intermittent performance regression
> > related to ipset
> > #regzbot ignore-activity
> > 
> > If it turns out this isn't a regression, free free to remove it from the
> > tracking by sending a reply to this thread containing a paragraph like
> > "#regzbot invalid: reason why this is invalid" (without the quotes).
> > 
> > Reminder for developers: when fixing the issue, please add a 'Link:'
> > tags pointing to the report (the mail quoted above) using
> > lore.kernel.org/r/, as explained in
> > 'Documentation/process/submitting-patches.rst' and
> > 'Documentation/process/5.Posting.rst'. Regzbot needs them to
> > automatically connect reports with fixes, but they are useful in
> > general, too.
> > 
> > I'm sending this to everyone that got the initial report, to make
> > everyone aware of the tracking. I also hope that messages like this
> > motivate people to directly get at least the regression mailing list and
> > ideally even regzbot involved when dealing with regressions, as messages
> > like this wouldn't be needed then. And don't worry, if I need to send
> > other mails regarding this regression only relevant for regzbot I'll
> > send them to the regressions lists only (with a tag in the subject so
> > people can filter them away). With a bit of luck no such messages will
> > be needed anyway.
> > 
> > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> > 
> > P.S.: As the Linux kernel's regression tracker I'm getting a lot of
> > reports on my table. I can only look briefly into most of them and lack
> > knowledge about most of the areas they concern. I thus unfortunately
> > will sometimes get things wrong or miss something important. I hope
> > that's not the case here; if you think it is, don't hesitate to tell me
> > in a public reply, it's in everyone's interest to set the public record
> > straight.
> > 
> 

-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics
          H-1525 Budapest 114, POB. 49, Hungary


* Re: Intermittent performance regression related to ipset between 5.10 and 5.15
  2022-04-11 11:47     ` Jozsef Kadlecsik
@ 2022-05-04 13:14       ` Thorsten Leemhuis
  2022-05-04 19:37         ` U'ren, Aaron
  0 siblings, 1 reply; 20+ messages in thread
From: Thorsten Leemhuis @ 2022-05-04 13:14 UTC (permalink / raw)
  To: McLean, Patrick
  Cc: Pablo Neira Ayuso, netfilter-devel, U'ren, Aaron, Brown,
	Russell, Rueger, Manuel, linux-kernel, regressions,
	Florian Westphal, netdev, Jozsef Kadlecsik

Hi, this is your Linux kernel regression tracker. Top-posting for once,
to make this easily accessible to everyone.

Patrick, did you see the comment from Jozsef? Are you having trouble
providing additional data or what's the status here from your side? Or
is that something we can forget?

Ciao, Thorsten

#regzbot poke

On 11.04.22 13:47, Jozsef Kadlecsik wrote:
> Hi,
> 
> On Mon, 11 Apr 2022, Thorsten Leemhuis wrote:
> 
>> On 16.03.22 10:17, Thorsten Leemhuis wrote:
>>> [TLDR: I'm adding the regression report below to regzbot, the Linux
>>> kernel regression tracking bot; all text you find below is compiled from
>>> a few templates paragraphs you might have encountered already already
>>> from similar mails.]
>>>
>>> On 16.03.22 00:15, McLean, Patrick wrote:
>>
>>>> When we upgraded from the 5.10 (5.10.61) series to the 5.15 (5.15.16) 
>>>> series, we encountered an intermittent performance regression that 
>>>> appears to be related to iptables / ipset. This regression was 
>>>> noticed on Kubernetes hosts that run kube-router and experience a 
>>>> high amount of churn to both iptables and ipsets. Specifically, when 
>>>> we run the nftables (iptables-1.8.7 / nftables-1.0.0) iptables 
>>>> wrapper xtables-nft-multi on the 5.15 series kernel, we end up 
>>>> getting extremely laggy response times when iptables attempts to 
>>>> lookup information on the ipsets that are used in the iptables 
>>>> definition. This issue isn’t reproducible on all hosts. However, our 
>>>> experience has been that across a fleet of ~50 hosts we experienced 
>>>> this issue on ~40% of the hosts. When the problem evidences, the time 
>>>> that it takes to run unrestricted iptables list commands like 
>>>> iptables -L or iptables-save gradually increases over the course of 
>>>> about 1 - 2 hours. Growing from less than a second to run, to taking
>>>> sometimes over 2 minutes to run. After that 2 hour mark it seems to
>>>> plateau and not grow any longer. Flushing tables or ipsets doesn’t seem
>>>> to have any affect on the issue. However, rebooting the host does reset
>>>> the issue. Occasionally, a machine that was evidencing the problem may
>>>> no longer evidence it after being rebooted.
>>>>
>>>> We did try to debug this to find a root cause, but ultimately ran 
>>>> short on time. We were not able to perform a set of bisects to 
>>>> hopefully narrow down the issue as the problem isn’t consistently 
>>>> reproducible. We were able to get some straces where it appears that 
>>>> most of the time is spent on getsockopt() operations. It appears that 
>>>> during iptables operations, it attempts to do some work to resolve 
>>>> the ipsets that are linked to the iptables definitions (perhaps 
>>>> getting the names of the ipsets themselves?). Slowly that getsockopt 
>>>> request takes more and more time on affected hosts. Here is an 
>>>> example strace of the operation in question:
> 
> Yes, iptables list/save have to get the names of the referenced sets and 
> that is performed via getsockopt() calls.
> 
> I went through all of the ipset related patches between 5.10.6 (a copy&paste 
> error, which only makes the range larger) and 5.15.16 and as far as I see none 
> of them can be responsible for the regression. More data is required to 
> locate the source of the slowdown.
> 
> Best regards,
> Jozsef
> 
>>>>
>>>> 0.000074 newfstatat(AT_FDCWD, "/etc/nsswitch.conf", {st_mode=S_IFREG|0644, st_size=539, ...}, 0) = 0 <0.000017>
>>>> 0.000064 openat(AT_FDCWD, "/var/db/protocols.db", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) <0.000017>
>>>> 0.000057 openat(AT_FDCWD, "/etc/protocols", O_RDONLY|O_CLOEXEC) = 4 <0.000013>
>>>> 0.000034 newfstatat(4, "", {st_mode=S_IFREG|0644, st_size=6108, ...}, AT_EMPTY_PATH) = 0 <0.000009>
>>>> 0.000032 lseek(4, 0, SEEK_SET)     = 0 <0.000008>
>>>> 0.000025 read(4, "# /etc/protocols\n#\n# Internet (I"..., 4096) = 4096 <0.000010>
>>>> 0.000036 close(4)                  = 0 <0.000008>
>>>> 0.000028 write(1, "ANGME7BF25 - [0:0]\n:KUBE-POD-FW-"..., 4096) = 4096 <0.000028>
>>>> 0.000049 socket(AF_INET, SOCK_RAW, IPPROTO_RAW) = 4 <0.000015>
>>>> 0.000032 fcntl(4, F_SETFD, FD_CLOEXEC) = 0 <0.000008>
>>>> 0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
>>>> 0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.109384>
>>>> 0.109456 close(4)                  = 0 <0.000022>
>>>>
>>>> On a host that is not evidencing the performance regression we 
>>>> normally see that operation take ~0.00001 seconds, as opposed to 
>>>> 0.109384. Additionally, on hosts that were evidencing the problem we 
>>>> also saw high lock hold times with `klockstat` (unfortunately, at the 
>>>> time we did not yet know to run echo "0" > /proc/sys/kernel/kptr_restrict 
>>>> to get the callers of the commands below).
>>>>
>>>> klockstat -i 5 -n 10 (on a host experiencing the problem)
>>>> Caller   Avg Hold  Count   Max hold Total hold
>>>> b'[unknown]'  118490772     83  179899470 9834734132
>>>> b'[unknown]'  118416941     83  179850047 9828606138
>>>> # or somewhere later while iptables -vnL was running:
>>>> Caller   Avg Hold  Count   Max hold Total hold
>>>> b'[unknown]'  496466236     46 17919955720 22837446860
>>>> b'[unknown]'  496391064     46 17919893843 22833988950
>>>>
>>>> klockstat -i 5 -n 10 (on a host not experiencing the problem)
>>>> Caller   Avg Hold  Count   Max hold Total hold
>>>> b'[unknown]'     120316   1510   85537797  181677885
>>>> b'[unknown]'    7119070     24   85527251  170857690
>>>
>>> Hi, this is your Linux kernel regression tracker.
>>>
>>> Thanks for the report.
>>>
>>> CCing the regression mailing list, as it should be in the loop for all
>>> regressions, as explained here:
>>> https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html
>>>
>>> To be sure below issue doesn't fall through the cracks unnoticed, I'm
>>> adding it to regzbot, my Linux kernel regression tracking bot:
>>>
>>> #regzbot ^introduced v5.10..v5.15
>>> #regzbot title net: netfilter: Intermittent performance regression
>>> related to ipset
>>> #regzbot ignore-activity
>>>
>>> If it turns out this isn't a regression, feel free to remove it from the
>>> tracking by sending a reply to this thread containing a paragraph like
>>> "#regzbot invalid: reason why this is invalid" (without the quotes).
>>>
>>> Reminder for developers: when fixing the issue, please add a 'Link:'
>>> tag pointing to the report (the mail quoted above) using
>>> lore.kernel.org/r/, as explained in
>>> 'Documentation/process/submitting-patches.rst' and
>>> 'Documentation/process/5.Posting.rst'. Regzbot needs them to
>>> automatically connect reports with fixes, but they are useful in
>>> general, too.
>>>
>>> I'm sending this to everyone that got the initial report, to make
>>> everyone aware of the tracking. I also hope that messages like this
>>> motivate people to directly get at least the regression mailing list and
>>> ideally even regzbot involved when dealing with regressions, as messages
>>> like this wouldn't be needed then. And don't worry, if I need to send
>>> other mails regarding this regression only relevant for regzbot I'll
>>> send them to the regressions lists only (with a tag in the subject so
>>> people can filter them away). With a bit of luck no such messages will
>>> be needed anyway.
>>>
>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>
>>> P.S.: As the Linux kernel's regression tracker I'm getting a lot of
>>> reports on my table. I can only look briefly into most of them and lack
>>> knowledge about most of the areas they concern. I thus unfortunately
>>> will sometimes get things wrong or miss something important. I hope
>>> that's not the case here; if you think it is, don't hesitate to tell me
>>> in a public reply, it's in everyone's interest to set the public record
>>> straight.
>>>
>>
> 
> -
> E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
> PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
> Address : Wigner Research Centre for Physics
>           H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Intermittent performance regression related to ipset between 5.10 and 5.15
  2022-05-04 13:14       ` Thorsten Leemhuis
@ 2022-05-04 19:37         ` U'ren, Aaron
  2022-05-30 13:50           ` Thorsten Leemhuis
  0 siblings, 1 reply; 20+ messages in thread
From: U'ren, Aaron @ 2022-05-04 19:37 UTC (permalink / raw)
  To: Thorsten Leemhuis, McLean, Patrick
  Cc: Pablo Neira Ayuso, netfilter-devel, Brown, Russell, Rueger,
	Manuel, linux-kernel, regressions, Florian Westphal, netdev,
	Jozsef Kadlecsik

Thanks for the reply Jozsef.
 
It’s good to have the confirmation about why iptables list/save perform so many getsockopt() calls.
 
In terms of providing more information to locate the source of the slowdown, do you have any recommendations on what information would be helpful?
 
The only thing that I was able to think of was doing a git bisect on it, but that’s a pretty large range, and the problem isn’t always 100% reproducible. It seems like something about the state of the system needs to trigger the issue. So that approach seemed non-optimal.
 
I’m reasonably certain that if we took enough of our machines back to 5.15.16 we could get some of them to evidence the problem again. If we reproduced the problem, what types of diagnostics or debug could we give you to help further track down this issue?
 
Thanks for your time and help!
 
-Aaron



From: Thorsten Leemhuis <regressions@leemhuis.info>
Date: Wednesday, May 4, 2022 at 8:15 AM
To: McLean, Patrick <Patrick.Mclean@sony.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, U'ren, Aaron <Aaron.U'ren@sony.com>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>, Jozsef Kadlecsik <kadlec@netfilter.org>
Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
Hi, this is your Linux kernel regression tracker. Top-posting for once,
to make this easily accessible to everyone.

Patrick, did you see the comment from Jozsef? Are you having trouble
providing additional data or what's the status here from your side? Or
is that something we can forget?

Ciao, Thorsten

#regzbot poke

On 11.04.22 13:47, Jozsef Kadlecsik wrote:
> Hi,
> 
> On Mon, 11 Apr 2022, Thorsten Leemhuis wrote:
> 
>> On 16.03.22 10:17, Thorsten Leemhuis wrote:
>>> [TLDR: I'm adding the regression report below to regzbot, the Linux
>>> kernel regression tracking bot; all text you find below is compiled from
>>> a few template paragraphs you might have encountered already
>>> from similar mails.]
>>>
>>> On 16.03.22 00:15, McLean, Patrick wrote:
>>
>>>> When we upgraded from the 5.10 (5.10.61) series to the 5.15 (5.15.16) 
>>>> series, we encountered an intermittent performance regression that 
>>>> appears to be related to iptables / ipset. This regression was 
>>>> noticed on Kubernetes hosts that run kube-router and experience a 
>>>> high amount of churn to both iptables and ipsets. Specifically, when 
>>>> we run the nftables (iptables-1.8.7 / nftables-1.0.0) iptables 
>>>> wrapper xtables-nft-multi on the 5.15 series kernel, we end up 
>>>> getting extremely laggy response times when iptables attempts to 
>>>> lookup information on the ipsets that are used in the iptables 
>>>> definition. This issue isn’t reproducible on all hosts. However, our 
>>>> experience has been that across a fleet of ~50 hosts we experienced 
>>>> this issue on ~40% of the hosts. When the problem evidences, the time 
>>>> that it takes to run unrestricted iptables list commands like 
>>>> iptables -L or iptables-save gradually increases over the course of 
>>>> about 1 - 2 hours. Growing from less than a second to run, to taking 
>>>> sometimes over 2 minutes to run. After that 2 hour mark it seems to 
>>>> plateau and not grow any longer. Flushing tables or ipsets doesn’t seem 
>>>> to have any effect on the issue. However, rebooting the host does reset 
>>>> the issue. Occasionally, a machine that was evidencing the problem may 
>>>> no longer evidence it after being rebooted.
>>>>
>>>> We did try to debug this to find a root cause, but ultimately ran 
>>>> short on time. We were not able to perform a set of bisects to 
>>>> hopefully narrow down the issue as the problem isn’t consistently 
>>>> reproducible. We were able to get some straces where it appears that 
>>>> most of the time is spent on getsockopt() operations. It appears that 
>>>> during iptables operations, it attempts to do some work to resolve 
>>>> the ipsets that are linked to the iptables definitions (perhaps 
>>>> getting the names of the ipsets themselves?). Slowly that getsockopt 
>>>> request takes more and more time on affected hosts. Here is an 
>>>> example strace of the operation in question:
> 
> Yes, iptables list/save have to get the names of the referenced sets and 
> that is performed via getsockopt() calls.
> 
> I went through all of the ipset related patches between 5.10.6 (a copy&paste 
> error, which only makes the range larger) and 5.15.16 and as far as I see none 
> of them can be responsible for the regression. More data is required to 
> locate the source of the slowdown.
> 
> Best regards,
> Jozsef
> 
>>>>
>>>> 0.000074 newfstatat(AT_FDCWD, "/etc/nsswitch.conf", {st_mode=S_IFREG|0644, st_size=539, ...}, 0) = 0 <0.000017>
>>>> 0.000064 openat(AT_FDCWD, "/var/db/protocols.db", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) <0.000017>
>>>> 0.000057 openat(AT_FDCWD, "/etc/protocols", O_RDONLY|O_CLOEXEC) = 4 <0.000013>
>>>> 0.000034 newfstatat(4, "", {st_mode=S_IFREG|0644, st_size=6108, ...}, AT_EMPTY_PATH) = 0 <0.000009>
>>>> 0.000032 lseek(4, 0, SEEK_SET)     = 0 <0.000008>
>>>> 0.000025 read(4, "# /etc/protocols\n#\n# Internet (I"..., 4096) = 4096 <0.000010>
>>>> 0.000036 close(4)                  = 0 <0.000008>
>>>> 0.000028 write(1, "ANGME7BF25 - [0:0]\n:KUBE-POD-FW-"..., 4096) = 4096 <0.000028>
>>>> 0.000049 socket(AF_INET, SOCK_RAW, IPPROTO_RAW) = 4 <0.000015>
>>>> 0.000032 fcntl(4, F_SETFD, FD_CLOEXEC) = 0 <0.000008>
>>>> 0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
>>>> 0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.109384>
>>>> 0.109456 close(4)                  = 0 <0.000022>
>>>>
>>>> On a host that is not evidencing the performance regression we 
>>>> normally see that operation take ~0.00001 seconds, as opposed to 
>>>> 0.109384. Additionally, on hosts that were evidencing the problem we 
>>>> also saw high lock hold times with `klockstat` (unfortunately, at the 
>>>> time we did not yet know to run echo "0" > /proc/sys/kernel/kptr_restrict 
>>>> to get the callers of the commands below).
>>>>
>>>> klockstat -i 5 -n 10 (on a host experiencing the problem)
>>>> Caller   Avg Hold  Count   Max hold Total hold
>>>> b'[unknown]'  118490772     83  179899470 9834734132
>>>> b'[unknown]'  118416941     83  179850047 9828606138
>>>> # or somewhere later while iptables -vnL was running:
>>>> Caller   Avg Hold  Count   Max hold Total hold
>>>> b'[unknown]'  496466236     46 17919955720 22837446860
>>>> b'[unknown]'  496391064     46 17919893843 22833988950
>>>>
>>>> klockstat -i 5 -n 10 (on a host not experiencing the problem)
>>>> Caller   Avg Hold  Count   Max hold Total hold
>>>> b'[unknown]'     120316   1510   85537797  181677885
>>>> b'[unknown]'    7119070     24   85527251  170857690
>>>
>>> Hi, this is your Linux kernel regression tracker.
>>>
>>> Thanks for the report.
>>>
>>> CCing the regression mailing list, as it should be in the loop for all
>>> regressions, as explained here:
>>> https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html
>>>
>>> To be sure below issue doesn't fall through the cracks unnoticed, I'm
>>> adding it to regzbot, my Linux kernel regression tracking bot:
>>>
>>> #regzbot ^introduced v5.10..v5.15
>>> #regzbot title net: netfilter: Intermittent performance regression
>>> related to ipset
>>> #regzbot ignore-activity
>>>
>>> If it turns out this isn't a regression, feel free to remove it from the
>>> tracking by sending a reply to this thread containing a paragraph like
>>> "#regzbot invalid: reason why this is invalid" (without the quotes).
>>>
>>> Reminder for developers: when fixing the issue, please add a 'Link:'
>>> tag pointing to the report (the mail quoted above) using
>>> lore.kernel.org/r/, as explained in
>>> 'Documentation/process/submitting-patches.rst' and
>>> 'Documentation/process/5.Posting.rst'. Regzbot needs them to
>>> automatically connect reports with fixes, but they are useful in
>>> general, too.
>>>
>>> I'm sending this to everyone that got the initial report, to make
>>> everyone aware of the tracking. I also hope that messages like this
>>> motivate people to directly get at least the regression mailing list and
>>> ideally even regzbot involved when dealing with regressions, as messages
>>> like this wouldn't be needed then. And don't worry, if I need to send
>>> other mails regarding this regression only relevant for regzbot I'll
>>> send them to the regressions lists only (with a tag in the subject so
>>> people can filter them away). With a bit of luck no such messages will
>>> be needed anyway.
>>>
>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>
>>> P.S.: As the Linux kernel's regression tracker I'm getting a lot of
>>> reports on my table. I can only look briefly into most of them and lack
>>> knowledge about most of the areas they concern. I thus unfortunately
>>> will sometimes get things wrong or miss something important. I hope
>>> that's not the case here; if you think it is, don't hesitate to tell me
>>> in a public reply, it's in everyone's interest to set the public record
>>> straight.
>>>
>>
> 
> -
> E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
> PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
> Address : Wigner Research Centre for Physics
>           H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Intermittent performance regression related to ipset between 5.10 and 5.15
  2022-05-04 19:37         ` U'ren, Aaron
@ 2022-05-30 13:50           ` Thorsten Leemhuis
  2022-05-31  7:41             ` Jozsef Kadlecsik
  0 siblings, 1 reply; 20+ messages in thread
From: Thorsten Leemhuis @ 2022-05-30 13:50 UTC (permalink / raw)
  To: U'ren, Aaron, McLean, Patrick
  Cc: Pablo Neira Ayuso, netfilter-devel, Brown, Russell, Rueger,
	Manuel, linux-kernel, regressions, Florian Westphal, netdev,
	Jozsef Kadlecsik

On 04.05.22 21:37, U'ren, Aaron wrote:
> Thanks for the reply Jozsef.
>  
> It’s good to have the confirmation about why iptables list/save perform so many getsockopt() calls.
>  
> In terms of providing more information to locate the source of the slowdown, do you have any recommendations on what information would be helpful?
>  
> The only thing that I was able to think of was doing a git bisect on it, but that’s a pretty large range, and the problem isn’t always 100% reproducible. It seems like something about the state of the system needs to trigger the issue. So that approach seemed non-optimal.
>  
> I’m reasonably certain that if we took enough of our machines back to 5.15.16 we could get some of them to evidence the problem again. If we reproduced the problem, what types of diagnostics or debug could we give you to help further track down this issue?
>  
> Thanks for your time and help!

Jozsef, I still have this issue on my list of tracked regressions, and it
looks like nothing has happened since the above mail (or did I miss it?). Could
you maybe provide some guidance to Aaron to get us all closer to the
root of the problem?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.


> From: Thorsten Leemhuis <regressions@leemhuis.info>
> Date: Wednesday, May 4, 2022 at 8:15 AM
> To: McLean, Patrick <Patrick.Mclean@sony.com>
> Cc: Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, U'ren, Aaron <Aaron.U'ren@sony.com>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>, Jozsef Kadlecsik <kadlec@netfilter.org>
> Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
> Hi, this is your Linux kernel regression tracker. Top-posting for once,
> to make this easily accessible to everyone.
> 
> Patrick, did you see the comment from Jozsef? Are you having trouble
> providing additional data or what's the status here from your side? Or
> is that something we can forget?
> 
> Ciao, Thorsten
> 
> #regzbot poke
> 
> On 11.04.22 13:47, Jozsef Kadlecsik wrote:
>> Hi,
>>
>> On Mon, 11 Apr 2022, Thorsten Leemhuis wrote:
>>
>>> On 16.03.22 10:17, Thorsten Leemhuis wrote:
>>>> [TLDR: I'm adding the regression report below to regzbot, the Linux
>>>> kernel regression tracking bot; all text you find below is compiled from
>>>> a few template paragraphs you might have encountered already
>>>> from similar mails.]
>>>>
>>>> On 16.03.22 00:15, McLean, Patrick wrote:
>>>
>>>>> When we upgraded from the 5.10 (5.10.61) series to the 5.15 (5.15.16) 
>>>>> series, we encountered an intermittent performance regression that 
>>>>> appears to be related to iptables / ipset. This regression was 
>>>>> noticed on Kubernetes hosts that run kube-router and experience a 
>>>>> high amount of churn to both iptables and ipsets. Specifically, when 
>>>>> we run the nftables (iptables-1.8.7 / nftables-1.0.0) iptables 
>>>>> wrapper xtables-nft-multi on the 5.15 series kernel, we end up 
>>>>> getting extremely laggy response times when iptables attempts to 
>>>>> lookup information on the ipsets that are used in the iptables 
>>>>> definition. This issue isn’t reproducible on all hosts. However, our 
>>>>> experience has been that across a fleet of ~50 hosts we experienced 
>>>>> this issue on ~40% of the hosts. When the problem evidences, the time 
>>>>> that it takes to run unrestricted iptables list commands like 
>>>>> iptables -L or iptables-save gradually increases over the course of 
>>>>> about 1 - 2 hours. Growing from less than a second to run, to taking 
>>>>> sometimes over 2 minutes to run. After that 2 hour mark it seems to 
>>>>> plateau and not grow any longer. Flushing tables or ipsets doesn’t seem 
>>>>> to have any effect on the issue. However, rebooting the host does reset 
>>>>> the issue. Occasionally, a machine that was evidencing the problem may 
>>>>> no longer evidence it after being rebooted.
>>>>>
>>>>> We did try to debug this to find a root cause, but ultimately ran 
>>>>> short on time. We were not able to perform a set of bisects to 
>>>>> hopefully narrow down the issue as the problem isn’t consistently 
>>>>> reproducible. We were able to get some straces where it appears that 
>>>>> most of the time is spent on getsockopt() operations. It appears that 
>>>>> during iptables operations, it attempts to do some work to resolve 
>>>>> the ipsets that are linked to the iptables definitions (perhaps 
>>>>> getting the names of the ipsets themselves?). Slowly that getsockopt 
>>>>> request takes more and more time on affected hosts. Here is an 
>>>>> example strace of the operation in question:
>>
>> Yes, iptables list/save have to get the names of the referenced sets and 
>> that is performed via getsockopt() calls.
>>
>> I went through all of the ipset related patches between 5.10.6 (a copy&paste 
>> error, which only makes the range larger) and 5.15.16 and as far as I see none 
>> of them can be responsible for the regression. More data is required to 
>> locate the source of the slowdown.
>>
>> Best regards,
>> Jozsef
>>
>>>>>
>>>>> 0.000074 newfstatat(AT_FDCWD, "/etc/nsswitch.conf", {st_mode=S_IFREG|0644, st_size=539, ...}, 0) = 0 <0.000017>
>>>>> 0.000064 openat(AT_FDCWD, "/var/db/protocols.db", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) <0.000017>
>>>>> 0.000057 openat(AT_FDCWD, "/etc/protocols", O_RDONLY|O_CLOEXEC) = 4 <0.000013>
>>>>> 0.000034 newfstatat(4, "", {st_mode=S_IFREG|0644, st_size=6108, ...}, AT_EMPTY_PATH) = 0 <0.000009>
>>>>> 0.000032 lseek(4, 0, SEEK_SET)     = 0 <0.000008>
>>>>> 0.000025 read(4, "# /etc/protocols\n#\n# Internet (I"..., 4096) = 4096 <0.000010>
>>>>> 0.000036 close(4)                  = 0 <0.000008>
>>>>> 0.000028 write(1, "ANGME7BF25 - [0:0]\n:KUBE-POD-FW-"..., 4096) = 4096 <0.000028>
>>>>> 0.000049 socket(AF_INET, SOCK_RAW, IPPROTO_RAW) = 4 <0.000015>
>>>>> 0.000032 fcntl(4, F_SETFD, FD_CLOEXEC) = 0 <0.000008>
>>>>> 0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
>>>>> 0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.109384>
>>>>> 0.109456 close(4)                  = 0 <0.000022>
>>>>>
>>>>> On a host that is not evidencing the performance regression we 
>>>>> normally see that operation take ~0.00001 seconds, as opposed to 
>>>>> 0.109384. Additionally, on hosts that were evidencing the problem we 
>>>>> also saw high lock hold times with `klockstat` (unfortunately, at the 
>>>>> time we did not yet know to run echo "0" > /proc/sys/kernel/kptr_restrict 
>>>>> to get the callers of the commands below).
>>>>>
>>>>> klockstat -i 5 -n 10 (on a host experiencing the problem)
>>>>> Caller   Avg Hold  Count   Max hold Total hold
>>>>> b'[unknown]'  118490772     83  179899470 9834734132
>>>>> b'[unknown]'  118416941     83  179850047 9828606138
>>>>> # or somewhere later while iptables -vnL was running:
>>>>> Caller   Avg Hold  Count   Max hold Total hold
>>>>> b'[unknown]'  496466236     46 17919955720 22837446860
>>>>> b'[unknown]'  496391064     46 17919893843 22833988950
>>>>>
>>>>> klockstat -i 5 -n 10 (on a host not experiencing the problem)
>>>>> Caller   Avg Hold  Count   Max hold Total hold
>>>>> b'[unknown]'     120316   1510   85537797  181677885
>>>>> b'[unknown]'    7119070     24   85527251  170857690
>>>>
>>>> Hi, this is your Linux kernel regression tracker.
>>>>
>>>> Thanks for the report.
>>>>
>>>> CCing the regression mailing list, as it should be in the loop for all
>>>> regressions, as explained here:
>>>> https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html
>>>>
>>>> To be sure below issue doesn't fall through the cracks unnoticed, I'm
>>>> adding it to regzbot, my Linux kernel regression tracking bot:
>>>>
>>>> #regzbot ^introduced v5.10..v5.15
>>>> #regzbot title net: netfilter: Intermittent performance regression
>>>> related to ipset
>>>> #regzbot ignore-activity
>>>>
>>>> If it turns out this isn't a regression, feel free to remove it from the
>>>> tracking by sending a reply to this thread containing a paragraph like
>>>> "#regzbot invalid: reason why this is invalid" (without the quotes).
>>>>
>>>> Reminder for developers: when fixing the issue, please add a 'Link:'
>>>> tag pointing to the report (the mail quoted above) using
>>>> lore.kernel.org/r/, as explained in
>>>> 'Documentation/process/submitting-patches.rst' and
>>>> 'Documentation/process/5.Posting.rst'. Regzbot needs them to
>>>> automatically connect reports with fixes, but they are useful in
>>>> general, too.
>>>>
>>>> I'm sending this to everyone that got the initial report, to make
>>>> everyone aware of the tracking. I also hope that messages like this
>>>> motivate people to directly get at least the regression mailing list and
>>>> ideally even regzbot involved when dealing with regressions, as messages
>>>> like this wouldn't be needed then. And don't worry, if I need to send
>>>> other mails regarding this regression only relevant for regzbot I'll
>>>> send them to the regressions lists only (with a tag in the subject so
>>>> people can filter them away). With a bit of luck no such messages will
>>>> be needed anyway.
>>>>
>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>>
>>>> P.S.: As the Linux kernel's regression tracker I'm getting a lot of
>>>> reports on my table. I can only look briefly into most of them and lack
>>>> knowledge about most of the areas they concern. I thus unfortunately
>>>> will sometimes get things wrong or miss something important. I hope
>>>> that's not the case here; if you think it is, don't hesitate to tell me
>>>> in a public reply, it's in everyone's interest to set the public record
>>>> straight.
>>>>
>>>
>>
>> -
>> E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
>> PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
>> Address : Wigner Research Centre for Physics
>>            H-1525 Budapest 114, POB. 49, Hungary
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Intermittent performance regression related to ipset between 5.10 and 5.15
  2022-05-30 13:50           ` Thorsten Leemhuis
@ 2022-05-31  7:41             ` Jozsef Kadlecsik
  2022-06-20  7:16               ` Thorsten Leemhuis
  0 siblings, 1 reply; 20+ messages in thread
From: Jozsef Kadlecsik @ 2022-05-31  7:41 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: U'ren, Aaron, McLean, Patrick, Pablo Neira Ayuso,
	netfilter-devel, Brown, Russell, Rueger, Manuel, linux-kernel,
	regressions, Florian Westphal, netdev

[-- Attachment #1: Type: text/plain, Size: 13225 bytes --]

Hi Aaron, Thorsten,

On Mon, 30 May 2022, Thorsten Leemhuis wrote:

> On 04.05.22 21:37, U'ren, Aaron wrote:
> >  It’s good to have the confirmation about why iptables list/save 
> > perform so many getsockopt() calls.

Every set lookup behind "iptables" needs two getsockopt() calls: you can 
see them in the strace logs. The first one checks the internal protocol 
number of ipset and the second one verifies/gets the processed set (ipset 
is an extension to iptables and therefore there's no internal state in 
which to save the protocol version number).

> >  In terms of providing more information to locate the source of the 
> > slowdown, do you have any recommendations on what information would be 
> > helpful?
> >  The only thing that I was able to think of was doing a git bisect on 
> > it, but that’s a pretty large range, and the problem isn’t always 100% 
> > reproducible. It seems like something about the state of the system 
> > needs to trigger the issue. So that approach seemed non-optimal.
> >  I’m reasonably certain that if we took enough of our machines back to 
> > 5.15.16 we could get some of them to evidence the problem again. If we 
> > reproduced the problem, what types of diagnostics or debug could we 
> > give you to help further track down this issue?

In your strace log

0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.1$
0.109456 close(4)                  = 0 <0.000022>

the only things which happen in the second getsockopt() call are to lock 
the NFNL_SUBSYS_IPSET mutex, walk the array of sets, compare the 
set names, save the result in the case of a match, and unlock the mutex. 
Nothing complicated, no deep, multi-level function calls. Just a few lines 
of code which haven't changed.

The only thing which can slow down the processing is the mutex handling. 
Do you perhaps have wait/wound mutex debugging accidentally enabled in the 
kernel? If not, then bisecting the mutex related patches might help.

You wrote that flushing tables or ipsets didn't seem to help. Does that 
literally mean flushing, i.e. the sets were emptied but not destroyed? Did 
you try both destroying and flushing?

> Jozsef, I still have this issue on my list of tracked regressions and it
> looks like nothing happens since above mail (or did I miss it?). Could
> you maybe provide some guidance to Aaron to get us all closer to the
> root of the problem?

I really hope it's an accidentally enabled debugging option in the kernel. 
Otherwise bisecting could help to uncover the issue.

Best regards,
Jozsef

> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> reports and sometimes miss something important when writing mails like
> this. If that's the case here, don't hesitate to tell me in a public
> reply, it's in everyone's interest to set the public record straight.
> 
> 
> > From: Thorsten Leemhuis <regressions@leemhuis.info>
> > Date: Wednesday, May 4, 2022 at 8:15 AM
> > To: McLean, Patrick <Patrick.Mclean@sony.com>
> > Cc: Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, U'ren, Aaron <Aaron.U'ren@sony.com>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>, Jozsef Kadlecsik <kadlec@netfilter.org>
> > Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
> > Hi, this is your Linux kernel regression tracker. Top-posting for once,
> > to make this easily accessible to everyone.
> > 
> > Patrick, did you see the comment from Jozsef? Are you having trouble
> > providing additional data or what's the status here from your side? Or
> > is that something we can forget?
> > 
> > Ciao, Thorsten
> > 
> > #regzbot poke
> > 
> > On 11.04.22 13:47, Jozsef Kadlecsik wrote:
> >> Hi,
> >>
> >> On Mon, 11 Apr 2022, Thorsten Leemhuis wrote:
> >>
> >>> On 16.03.22 10:17, Thorsten Leemhuis wrote:
> >>>> [TLDR: I'm adding the regression report below to regzbot, the Linux
> >>>> kernel regression tracking bot; all text you find below is compiled from
> >>>> a few template paragraphs you might have encountered already
> >>>> from similar mails.]
> >>>>
> >>>> On 16.03.22 00:15, McLean, Patrick wrote:
> >>>
> >>>>> When we upgraded from the 5.10 (5.10.61) series to the 5.15 (5.15.16) 
> >>>>> series, we encountered an intermittent performance regression that 
> >>>>> appears to be related to iptables / ipset. This regression was 
> >>>>> noticed on Kubernetes hosts that run kube-router and experience a 
> >>>>> high amount of churn to both iptables and ipsets. Specifically, when 
> >>>>> we run the nftables (iptables-1.8.7 / nftables-1.0.0) iptables 
> >>>>> wrapper xtables-nft-multi on the 5.15 series kernel, we end up 
> >>>>> getting extremely laggy response times when iptables attempts to 
> >>>>> lookup information on the ipsets that are used in the iptables 
> >>>>> definition. This issue isn’t reproducible on all hosts. However, our 
> >>>>> experience has been that across a fleet of ~50 hosts we experienced 
> >>>>> this issue on ~40% of the hosts. When the problem evidences, the time 
> >>>>> that it takes to run unrestricted iptables list commands like 
> >>>>> iptables -L or iptables-save gradually increases over the course of 
> >>>>> about 1 - 2 hours. Growing from less than a second to run, to taking
> >>>>> sometimes over 2 minutes to run. After that 2 hour mark it seems to
> >>>>> plateau and not grow any longer. Flushing tables or ipsets doesn’t seem
> >>>>> to have any effect on the issue. However, rebooting the host does reset
> >>>>> the issue. Occasionally, a machine that was evidencing the problem may
> >>>>> no longer evidence it after being rebooted.
> >>>>>
> >>>>> We did try to debug this to find a root cause, but ultimately ran 
> >>>>> short on time. We were not able to perform a set of bisects to 
> >>>>> hopefully narrow down the issue as the problem isn’t consistently 
> >>>>> reproducible. We were able to get some straces where it appears that 
> >>>>> most of the time is spent on getsockopt() operations. It appears that 
> >>>>> during iptables operations, it attempts to do some work to resolve 
> >>>>> the ipsets that are linked to the iptables definitions (perhaps 
> >>>>> getting the names of the ipsets themselves?). Slowly that getsockopt 
> >>>>> request takes more and more time on affected hosts. Here is an 
> >>>>> example strace of the operation in question:
> >>
> >> Yes, iptables list/save have to get the names of the referenced sets and 
> >> that is performed via getsockopt() calls.
> >>
> >> I went through all of the ipset related patches between 5.10.6 (a copy&paste 
> >> error, the checked range is just larger) and 5.15.16 and as far as I see none 
> >> of them can be responsible for the regression. More data is required to 
> >> locate the source of the slowdown.
> >>
> >> Best regards,
> >> Jozsef
> >>
> >>>>>
> >>>>> 0.000074 newfstatat(AT_FDCWD, "/etc/nsswitch.conf", {st_mode=S_IFREG|0644, st_size=539, ...}, 0) = 0 <0.000017>
> >>>>> 0.000064 openat(AT_FDCWD, "/var/db/protocols.db", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) <0.000017>
> >>>>> 0.000057 openat(AT_FDCWD, "/etc/protocols", O_RDONLY|O_CLOEXEC) = 4 <0.000013>
> >>>>> 0.000034 newfstatat(4, "", {st_mode=S_IFREG|0644, st_size=6108, ...}, AT_EMPTY_PATH) = 0 <0.000009>
> >>>>> 0.000032 lseek(4, 0, SEEK_SET)     = 0 <0.000008>
> >>>>> 0.000025 read(4, "# /etc/protocols\n#\n# Internet (I"..., 4096) = 4096 <0.000010>
> >>>>> 0.000036 close(4)                  = 0 <0.000008>
> >>>>> 0.000028 write(1, "ANGME7BF25 - [0:0]\n:KUBE-POD-FW-"..., 4096) = 4096 <0.000028>
> >>>>> 0.000049 socket(AF_INET, SOCK_RAW, IPPROTO_RAW) = 4 <0.000015>
> >>>>> 0.000032 fcntl(4, F_SETFD, FD_CLOEXEC) = 0 <0.000008>
> >>>>> 0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
> >>>>> 0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.109384>
> >>>>> 0.109456 close(4)                  = 0 <0.000022>
> >>>>>
> >>>>> On a host that is not evidencing the performance regression we 
> >>>>> normally see that operation take ~ 0.00001 as opposed to 
> >>>>> 0.109384. Additionally, on hosts that were evidencing the problem we 
> >>>>> also saw high lock times with `klockstat` (unfortunately at the time we 
> >>>>> did not know about or run echo "0" > /proc/sys/kernel/kptr_restrict 
> >>>>> to get the callers of the commands below).
> >>>>>
> >>>>> klockstat -i 5 -n 10 (on a host experiencing the problem)
> >>>>> Caller   Avg Hold  Count   Max hold Total hold
> >>>>> b'[unknown]'  118490772     83  179899470 9834734132
> >>>>> b'[unknown]'  118416941     83  179850047 9828606138
> >>>>> # or somewhere later while iptables -vnL was running:
> >>>>> Caller   Avg Hold  Count   Max hold Total hold
> >>>>> b'[unknown]'  496466236     46 17919955720 22837446860
> >>>>> b'[unknown]'  496391064     46 17919893843 22833988950
> >>>>>
> >>>>> klockstat -i 5 -n 10 (on a host not experiencing the problem)
> >>>>> Caller   Avg Hold  Count   Max hold Total hold
> >>>>> b'[unknown]'     120316   1510   85537797  181677885
> >>>>> b'[unknown]'    7119070     24   85527251  170857690
> >>>>
> >>>> Hi, this is your Linux kernel regression tracker.
> >>>>
> >>>> Thanks for the report.
> >>>>
> >>>> CCing the regression mailing list, as it should be in the loop for all
> >>>> regressions, as explained here:
> >>>> https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html
> >>>>
> >>>> To be sure below issue doesn't fall through the cracks unnoticed, I'm
> >>>> adding it to regzbot, my Linux kernel regression tracking bot:
> >>>>
> >>>> #regzbot ^introduced v5.10..v5.15
> >>>> #regzbot title net: netfilter: Intermittent performance regression
> >>>> related to ipset
> >>>> #regzbot ignore-activity
> >>>>
> >>>> If it turns out this isn't a regression, feel free to remove it from the
> >>>> tracking by sending a reply to this thread containing a paragraph like
> >>>> "#regzbot invalid: reason why this is invalid" (without the quotes).
> >>>>
> >>>> Reminder for developers: when fixing the issue, please add a 'Link:'
> >>>> tag pointing to the report (the mail quoted above) using
> >>>> lore.kernel.org/r/, as explained in
> >>>> 'Documentation/process/submitting-patches.rst' and
> >>>> 'Documentation/process/5.Posting.rst'. Regzbot needs them to
> >>>> automatically connect reports with fixes, but they are useful in
> >>>> general, too.
> >>>>
> >>>> I'm sending this to everyone that got the initial report, to make
> >>>> everyone aware of the tracking. I also hope that messages like this
> >>>> motivate people to directly get at least the regression mailing list and
> >>>> ideally even regzbot involved when dealing with regressions, as messages
> >>>> like this wouldn't be needed then. And don't worry, if I need to send
> >>>> other mails regarding this regression only relevant for regzbot I'll
> >>>> send them to the regressions lists only (with a tag in the subject so
> >>>> people can filter them away). With a bit of luck no such messages will
> >>>> be needed anyway.
> >>>>
> >>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> >>>>
> >>>> P.S.: As the Linux kernel's regression tracker I'm getting a lot of
> >>>> reports on my table. I can only look briefly into most of them and lack
> >>>> knowledge about most of the areas they concern. I thus unfortunately
> >>>> will sometimes get things wrong or miss something important. I hope
> >>>> that's not the case here; if you think it is, don't hesitate to tell me
> >>>> in a public reply, it's in everyone's interest to set the public record
> >>>> straight.
> >>>>
> >>>
> >>
> >> -
> >> E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
> >> PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
> >> Address : Wigner Research Centre for Physics
> >>            H-1525 Budapest 114, POB. 49, Hungary
> > 
> > 
> 

-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Intermittent performance regression related to ipset between 5.10 and 5.15
  2022-05-31  7:41             ` Jozsef Kadlecsik
@ 2022-06-20  7:16               ` Thorsten Leemhuis
  2022-06-30 14:59                 ` U'ren, Aaron
  0 siblings, 1 reply; 20+ messages in thread
From: Thorsten Leemhuis @ 2022-06-20  7:16 UTC (permalink / raw)
  To: U'ren, Aaron
  Cc: McLean, Patrick, Pablo Neira Ayuso, netfilter-devel, Brown,
	Russell, Rueger, Manuel, linux-kernel, regressions,
	Florian Westphal, netdev, Jozsef Kadlecsik

On 31.05.22 09:41, Jozsef Kadlecsik wrote:
> On Mon, 30 May 2022, Thorsten Leemhuis wrote:
>> On 04.05.22 21:37, U'ren, Aaron wrote:
>>>  It’s good to have the confirmation about why iptables list/save 
>>> perform so many getsockopt() calls.
> 
> Every set lookup behind "iptables" needs two getsockopt() calls: you can 
> see them in the strace logs. The first one checks the internal protocol 
> number of ipset and the second one verifies/gets the processed set (it's 
> an extension to iptables and therefore there's no internal state to save 
> the protocol version number).

Hi Aaron! Did any of the suggestions from Jozsef help to track down the
root cause? I have this issue on the list of tracked regressions and
wonder what the status is. Or can I mark this as resolved?

Side note: this is not a "something breaks" regression and it seems to
progress slowly, so I'm putting it on the backburner:

#regzbot backburner: performance regression where the culprit is hard to
track down

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.

>>>  In terms of providing more information to locate the source of the 
>>> slowdown, do you have any recommendations on what information would be 
>>> helpful?
>>>  The only thing that I was able to think of was doing a git bisect on 
>>> it, but that’s a pretty large range, and the problem isn’t always 100% 
>>> reproducible. It seems like something about the state of the system 
>>> needs to trigger the issue. So that approach seemed non-optimal.
>>>  I’m reasonably certain that if we took enough of our machines back to 
>>> 5.15.16 we could get some of them to evidence the problem again. If we 
>>> reproduced the problem, what types of diagnostics or debug could we 
>>> give you to help further track down this issue?
> 
> In your strace log
> 
> 0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
> 0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.1$
> 0.109456 close(4)                  = 0 <0.000022>
> 
> the only things which happen in the second sockopt function are to lock 
> the NFNL_SUBSYS_IPSET mutex, walk the array of the sets, compare the 
> setname, save the result in the case of a match and unlock the mutex. 
> Nothing complicated, no deep, multi-level function calls. Just a few 
> lines of code which haven't changed.
> 
> The only thing which can slow down the processing is the mutex handling. 
> Do you perhaps have wait/wound mutex debugging accidentally enabled in the 
> kernel? If not, then bisecting the mutex related patches might help.
> 
> You wrote that flushing tables or ipsets didn't seem to help. Did that 
> literally mean flushing, i.e. the sets were emptied but not destroyed? Did 
> you try both destroying and flushing?
> 
>> Jozsef, I still have this issue on my list of tracked regressions and it
>> looks like nothing happens since above mail (or did I miss it?). Could
>> you maybe provide some guidance to Aaron to get us all closer to the
>> root of the problem?
> 
> I really hope it's an accidentally enabled debugging option in the kernel. 
> Otherwise bisecting could help to uncover the issue.
> 
> Best regards,
> Jozsef
> 
>> P.S.: As the Linux kernel's regression tracker I deal with a lot of
>> reports and sometimes miss something important when writing mails like
>> this. If that's the case here, don't hesitate to tell me in a public
>> reply, it's in everyone's interest to set the public record straight.
>>
>>
>>> From: Thorsten Leemhuis <regressions@leemhuis.info>
>>> Date: Wednesday, May 4, 2022 at 8:15 AM
>>> To: McLean, Patrick <Patrick.Mclean@sony.com>
>>> Cc: Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, U'ren, Aaron <Aaron.U'ren@sony.com>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>, Jozsef Kadlecsik <kadlec@netfilter.org>
>>> Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
>>> Hi, this is your Linux kernel regression tracker. Top-posting for once,
>>> to make this easily accessible to everyone.
>>>
>>> Patrick, did you see the comment from Jozsef? Are you having trouble
>>> providing additional data or what's the status here from your side? Or
>>> is that something we can forget?
>>>
>>> Ciao, Thorsten
>>>
>>> #regzbot poke
>>>
>>> On 11.04.22 13:47, Jozsef Kadlecsik wrote:
>>>> Hi,
>>>>
>>>> On Mon, 11 Apr 2022, Thorsten Leemhuis wrote:
>>>>
>>>>> On 16.03.22 10:17, Thorsten Leemhuis wrote:
>>>>>> [TLDR: I'm adding the regression report below to regzbot, the Linux
>>>>>> kernel regression tracking bot; all text you find below is compiled from
>>>>>> a few template paragraphs you might have encountered already
>>>>>> from similar mails.]
>>>>>>
>>>>>> On 16.03.22 00:15, McLean, Patrick wrote:
>>>>>
>>>>>>> When we upgraded from the 5.10 (5.10.61) series to the 5.15 (5.15.16) 
>>>>>>> series, we encountered an intermittent performance regression that 
>>>>>>> appears to be related to iptables / ipset. This regression was 
>>>>>>> noticed on Kubernetes hosts that run kube-router and experience a 
>>>>>>> high amount of churn to both iptables and ipsets. Specifically, when 
>>>>>>> we run the nftables (iptables-1.8.7 / nftables-1.0.0) iptables 
>>>>>>> wrapper xtables-nft-multi on the 5.15 series kernel, we end up 
>>>>>>> getting extremely laggy response times when iptables attempts to 
>>>>>>> lookup information on the ipsets that are used in the iptables 
>>>>>>> definition. This issue isn’t reproducible on all hosts. However, our 
>>>>>>> experience has been that across a fleet of ~50 hosts we experienced 
>>>>>>> this issue on ~40% of the hosts. When the problem evidences, the time 
>>>>>>> that it takes to run unrestricted iptables list commands like 
>>>>>>> iptables -L or iptables-save gradually increases over the course of 
>>>>>>> about 1 - 2 hours. Growing from less than a second to run, to taking
>>>>>>> sometimes over 2 minutes to run. After that 2 hour mark it seems to
>>>>>>> plateau and not grow any longer. Flushing tables or ipsets doesn’t seem
>>>>>>> to have any effect on the issue. However, rebooting the host does reset
>>>>>>> the issue. Occasionally, a machine that was evidencing the problem may
>>>>>>> no longer evidence it after being rebooted.
>>>>>>>
>>>>>>> We did try to debug this to find a root cause, but ultimately ran 
>>>>>>> short on time. We were not able to perform a set of bisects to 
>>>>>>> hopefully narrow down the issue as the problem isn’t consistently 
>>>>>>> reproducible. We were able to get some straces where it appears that 
>>>>>>> most of the time is spent on getsockopt() operations. It appears that 
>>>>>>> during iptables operations, it attempts to do some work to resolve 
>>>>>>> the ipsets that are linked to the iptables definitions (perhaps 
>>>>>>> getting the names of the ipsets themselves?). Slowly that getsockopt 
>>>>>>> request takes more and more time on affected hosts. Here is an 
>>>>>>> example strace of the operation in question:
>>>>
>>>> Yes, iptables list/save have to get the names of the referenced sets and 
>>>> that is performed via getsockopt() calls.
>>>>
>>>> I went through all of the ipset related patches between 5.10.6 (a copy&paste 
>>>> error, the checked range is just larger) and 5.15.16 and as far as I see none 
>>>> of them can be responsible for the regression. More data is required to 
>>>> locate the source of the slowdown.
>>>>
>>>> Best regards,
>>>> Jozsef
>>>>
>>>>>>>
>>>>>>> 0.000074 newfstatat(AT_FDCWD, "/etc/nsswitch.conf", {st_mode=S_IFREG|0644, st_size=539, ...}, 0) = 0 <0.000017>
>>>>>>> 0.000064 openat(AT_FDCWD, "/var/db/protocols.db", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) <0.000017>
>>>>>>> 0.000057 openat(AT_FDCWD, "/etc/protocols", O_RDONLY|O_CLOEXEC) = 4 <0.000013>
>>>>>>> 0.000034 newfstatat(4, "", {st_mode=S_IFREG|0644, st_size=6108, ...}, AT_EMPTY_PATH) = 0 <0.000009>
>>>>>>> 0.000032 lseek(4, 0, SEEK_SET)     = 0 <0.000008>
>>>>>>> 0.000025 read(4, "# /etc/protocols\n#\n# Internet (I"..., 4096) = 4096 <0.000010>
>>>>>>> 0.000036 close(4)                  = 0 <0.000008>
>>>>>>> 0.000028 write(1, "ANGME7BF25 - [0:0]\n:KUBE-POD-FW-"..., 4096) = 4096 <0.000028>
>>>>>>> 0.000049 socket(AF_INET, SOCK_RAW, IPPROTO_RAW) = 4 <0.000015>
>>>>>>> 0.000032 fcntl(4, F_SETFD, FD_CLOEXEC) = 0 <0.000008>
>>>>>>> 0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
>>>>>>> 0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.109384>
>>>>>>> 0.109456 close(4)                  = 0 <0.000022>
>>>>>>>
>>>>>>> On a host that is not evidencing the performance regression we 
>>>>>>> normally see that operation take ~ 0.00001 as opposed to 
>>>>>>> 0.109384. Additionally, on hosts that were evidencing the problem we 
>>>>>>> also saw high lock times with `klockstat` (unfortunately at the time we 
>>>>>>> did not know about or run echo "0" > /proc/sys/kernel/kptr_restrict 
>>>>>>> to get the callers of the commands below).
>>>>>>>
>>>>>>> klockstat -i 5 -n 10 (on a host experiencing the problem)
>>>>>>> Caller   Avg Hold  Count   Max hold Total hold
>>>>>>> b'[unknown]'  118490772     83  179899470 9834734132
>>>>>>> b'[unknown]'  118416941     83  179850047 9828606138
>>>>>>> # or somewhere later while iptables -vnL was running:
>>>>>>> Caller   Avg Hold  Count   Max hold Total hold
>>>>>>> b'[unknown]'  496466236     46 17919955720 22837446860
>>>>>>> b'[unknown]'  496391064     46 17919893843 22833988950
>>>>>>>
>>>>>>> klockstat -i 5 -n 10 (on a host not experiencing the problem)
>>>>>>> Caller   Avg Hold  Count   Max hold Total hold
>>>>>>> b'[unknown]'     120316   1510   85537797  181677885
>>>>>>> b'[unknown]'    7119070     24   85527251  170857690
>>>>>>
>>>>>> Hi, this is your Linux kernel regression tracker.
>>>>>>
>>>>>> Thanks for the report.
>>>>>>
>>>>>> CCing the regression mailing list, as it should be in the loop for all
>>>>>> regressions, as explained here:
>>>>>> https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html
>>>>>>
>>>>>> To be sure below issue doesn't fall through the cracks unnoticed, I'm
>>>>>> adding it to regzbot, my Linux kernel regression tracking bot:
>>>>>>
>>>>>> #regzbot ^introduced v5.10..v5.15
>>>>>> #regzbot title net: netfilter: Intermittent performance regression
>>>>>> related to ipset
>>>>>> #regzbot ignore-activity
>>>>>>
>>>>>> If it turns out this isn't a regression, feel free to remove it from the
>>>>>> tracking by sending a reply to this thread containing a paragraph like
>>>>>> "#regzbot invalid: reason why this is invalid" (without the quotes).
>>>>>>
>>>>>> Reminder for developers: when fixing the issue, please add a 'Link:'
>>>>>> tag pointing to the report (the mail quoted above) using
>>>>>> lore.kernel.org/r/, as explained in
>>>>>> 'Documentation/process/submitting-patches.rst' and
>>>>>> 'Documentation/process/5.Posting.rst'. Regzbot needs them to
>>>>>> automatically connect reports with fixes, but they are useful in
>>>>>> general, too.
>>>>>>
>>>>>> I'm sending this to everyone that got the initial report, to make
>>>>>> everyone aware of the tracking. I also hope that messages like this
>>>>>> motivate people to directly get at least the regression mailing list and
>>>>>> ideally even regzbot involved when dealing with regressions, as messages
>>>>>> like this wouldn't be needed then. And don't worry, if I need to send
>>>>>> other mails regarding this regression only relevant for regzbot I'll
>>>>>> send them to the regressions lists only (with a tag in the subject so
>>>>>> people can filter them away). With a bit of luck no such messages will
>>>>>> be needed anyway.
>>>>>>
>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>>>>
>>>>>> P.S.: As the Linux kernel's regression tracker I'm getting a lot of
>>>>>> reports on my table. I can only look briefly into most of them and lack
>>>>>> knowledge about most of the areas they concern. I thus unfortunately
>>>>>> will sometimes get things wrong or miss something important. I hope
>>>>>> that's not the case here; if you think it is, don't hesitate to tell me
>>>>>> in a public reply, it's in everyone's interest to set the public record
>>>>>> straight.
>>>>>>
>>>>>
>>>>
>>>> -
>>>> E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
>>>> PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
>>>> Address : Wigner Research Centre for Physics
>>>>            H-1525 Budapest 114, POB. 49, Hungary
>>>
>>>
>>
> 
> -
> E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
> PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
> Address : Wigner Research Centre for Physics
>           H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Intermittent performance regression related to ipset between 5.10 and 5.15
  2022-06-20  7:16               ` Thorsten Leemhuis
@ 2022-06-30 14:59                 ` U'ren, Aaron
  2022-06-30 18:04                   ` Jakub Kicinski
  0 siblings, 1 reply; 20+ messages in thread
From: U'ren, Aaron @ 2022-06-30 14:59 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: McLean, Patrick, Pablo Neira Ayuso, netfilter-devel, Brown,
	Russell, Rueger, Manuel, linux-kernel, regressions,
	Florian Westphal, netdev, Jozsef Kadlecsik

Thorsten / Jozsef -

Thanks for continuing to follow up! I'm sorry that this has moved so slowly; it has taken us a while to find the time to fully track this issue down. However, I think that we have now figured out enough to make some more forward progress on this issue.

Jozsef, thanks for your insight into what is happening between those system calls. Regarding your question about wait/wound mutex debugging possibly being enabled, I can tell you that we definitely don't have that enabled on any of our regular machines. While we were debugging we did turn on quite a few debug options to help us track this issue down, and it is very possible that the strace that started off this thread was taken on a machine that did have that debug option enabled. Either way, the root issue occurs on hosts that definitely do not have wait/wound mutex debugging enabled.

The good news is that we finally got one of our development environments into a state where we could reliably reproduce the performance issue across reboots. This was a win because it meant that we were able to do a full bisect of the kernel and were able to tell relatively quickly whether or not the issue was present in the test kernels.

After bisecting for 3 days, I have been able to narrow it down to a single commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3976ca101990ca11ddf51f38bec7b86c19d0ca6f (netfilter: ipset: Expose the initval hash parameter to userspace)

I'm at a bit of a loss as to why this would cause such severe performance regressions, but I've proved it out multiple times now. I've even checked out a fresh version of the 5.15 kernel that we've been deploying with just this single commit reverted and found that the performance problems are completely resolved.

I'm hoping that maybe Jozsef will have some more insight into why this seemingly innocuous commit causes such large performance issues for us? If you have any additional patches or other things that you would like us to test, I will try to leave our environment in its current state for the next couple of days so that we can do so.

-Aaron

From: Thorsten Leemhuis <regressions@leemhuis.info>
Date: Monday, June 20, 2022 at 2:16 AM
To: U'ren, Aaron <Aaron.U'ren@sony.com>
Cc: McLean, Patrick <Patrick.Mclean@sony.com>, Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>, Jozsef Kadlecsik <kadlec@netfilter.org>
Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
On 31.05.22 09:41, Jozsef Kadlecsik wrote:
> On Mon, 30 May 2022, Thorsten Leemhuis wrote:
>> On 04.05.22 21:37, U'ren, Aaron wrote:
>>>  It’s good to have the confirmation about why iptables list/save 
>>> perform so many getsockopt() calls.
> 
> Every set lookup behind "iptables" needs two getsockopt() calls: you can 
> see them in the strace logs. The first one checks the internal protocol 
> number of ipset and the second one verifies/gets the processed set (it's 
> an extension to iptables and therefore there's no internal state to save 
> the protocol version number).

Hi Aaron! Did any of the suggestions from Jozsef help to track down the
root cause? I have this issue on the list of tracked regressions and
wonder what the status is. Or can I mark this as resolved?

Side note: this is not a "something breaks" regression and it seems to
progress slowly, so I'm putting it on the backburner:

#regzbot backburner: performance regression where the culprit is hard to
track down

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.

>>>  In terms of providing more information to locate the source of the 
>>> slowdown, do you have any recommendations on what information would be 
>>> helpful?
>>>  The only thing that I was able to think of was doing a git bisect on 
>>> it, but that’s a pretty large range, and the problem isn’t always 100% 
>>> reproducible. It seems like something about the state of the system 
>>> needs to trigger the issue. So that approach seemed non-optimal.
>>>  I’m reasonably certain that if we took enough of our machines back to 
>>> 5.15.16 we could get some of them to evidence the problem again. If we 
>>> reproduced the problem, what types of diagnostics or debug could we 
>>> give you to help further track down this issue?
> 
> In your strace log
> 
> 0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
> 0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.1$
> 0.109456 close(4)                  = 0 <0.000022>
> 
> the only things which happen in the second sockopt function are to lock 
> the NFNL_SUBSYS_IPSET mutex, walk the array of the sets, compare the 
> setname, save the result in the case of a match and unlock the mutex. 
> Nothing complicated, no deep, multi-level function calls. Just a few 
> lines of code which haven't changed.
> 
> The only thing which can slow down the processing is the mutex handling. 
> Do you perhaps have wait/wound mutex debugging accidentally enabled in the 
> kernel? If not, then bisecting the mutex related patches might help.
> 
> You wrote that flushing tables or ipsets didn't seem to help. Did that 
> literally mean flushing, i.e. the sets were emptied but not destroyed? Did 
> you try both destroying and flushing?
> 
>> Jozsef, I still have this issue on my list of tracked regressions and it
>> looks like nothing happens since above mail (or did I miss it?). Could
>> you maybe provide some guidance to Aaron to get us all closer to the
>> root of the problem?
> 
> I really hope it's an accidentally enabled debugging option in the kernel. 
> Otherwise bisecting could help to uncover the issue.
> 
> Best regards,
> Jozsef
> 
>> P.S.: As the Linux kernel's regression tracker I deal with a lot of
>> reports and sometimes miss something important when writing mails like
>> this. If that's the case here, don't hesitate to tell me in a public
>> reply, it's in everyone's interest to set the public record straight.
>>
>>
>>> From: Thorsten Leemhuis <regressions@leemhuis.info>
>>> Date: Wednesday, May 4, 2022 at 8:15 AM
>>> To: McLean, Patrick <Patrick.Mclean@sony.com>
>>> Cc: Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, U'ren, Aaron <Aaron.U'ren@sony.com>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>, Jozsef Kadlecsik <kadlec@netfilter.org>
>>> Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
>>> Hi, this is your Linux kernel regression tracker. Top-posting for once,
>>> to make this easily accessible to everyone.
>>>
>>> Patrick, did you see the comment from Jozsef? Are you having trouble
>>> providing additional data or what's the status here from your side? Or
>>> is that something we can forget?
>>>
>>> Ciao, Thorsten
>>>
>>> #regzbot poke
>>>
>>> On 11.04.22 13:47, Jozsef Kadlecsik wrote:
>>>> Hi,
>>>>
>>>> On Mon, 11 Apr 2022, Thorsten Leemhuis wrote:
>>>>
>>>>> On 16.03.22 10:17, Thorsten Leemhuis wrote:
>>>>>> [TLDR: I'm adding the regression report below to regzbot, the Linux
>>>>>> kernel regression tracking bot; all text you find below is compiled from
>>>>>> a few template paragraphs you might have encountered already
>>>>>> from similar mails.]
>>>>>>
>>>>>> On 16.03.22 00:15, McLean, Patrick wrote:
>>>>>
>>>>>>> When we upgraded from the 5.10 (5.10.61) series to the 5.15 (5.15.16) 
>>>>>>> series, we encountered an intermittent performance regression that 
>>>>>>> appears to be related to iptables / ipset. This regression was 
>>>>>>> noticed on Kubernetes hosts that run kube-router and experience a 
>>>>>>> high amount of churn to both iptables and ipsets. Specifically, when 
>>>>>>> we run the nftables (iptables-1.8.7 / nftables-1.0.0) iptables 
>>>>>>> wrapper xtables-nft-multi on the 5.15 series kernel, we end up 
>>>>>>> getting extremely laggy response times when iptables attempts to 
>>>>>>> lookup information on the ipsets that are used in the iptables 
>>>>>>> definition. This issue isn’t reproducible on all hosts. However, our 
>>>>>>> experience has been that across a fleet of ~50 hosts we experienced 
>>>>>>> this issue on ~40% of the hosts. When the problem evidences, the time 
>>>>>>> that it takes to run unrestricted iptables list commands like 
>>>>>>> iptables -L or iptables-save gradually increases over the course of 
>>>>>>> about 1 - 2 hours, growing from less than a second to sometimes 
>>>>>>> taking over 2 minutes to run. After that 2 hour mark it seems to 
>>>>>>> plateau and not grow any longer. Flushing tables or ipsets doesn't 
>>>>>>> seem to have any effect on the issue. However, rebooting the host 
>>>>>>> does reset the issue. Occasionally, a machine that was evidencing 
>>>>>>> the problem may no longer evidence it after being rebooted.
>>>>>>>
>>>>>>> We did try to debug this to find a root cause, but ultimately ran 
>>>>>>> short on time. We were not able to perform a set of bisects to 
>>>>>>> hopefully narrow down the issue as the problem isn’t consistently 
>>>>>>> reproducible. We were able to get some straces where it appears that 
>>>>>>> most of the time is spent on getsockopt() operations. It appears that 
>>>>>>> during iptables operations, it attempts to do some work to resolve 
>>>>>>> the ipsets that are linked to the iptables definitions (perhaps 
>>>>>>> getting the names of the ipsets themselves?). Slowly that getsockopt 
>>>>>>> request takes more and more time on affected hosts. Here is an 
>>>>>>> example strace of the operation in question:
>>>>
>>>> Yes, iptables list/save have to get the names of the referenced sets and 
>>>> that is performed via getsockopt() calls.
>>>>
>>>> I went through all of the ipset related patches between 5.10.6 and 
>>>> 5.15.16 (a copy&paste error, but that only makes the checked range 
>>>> larger) and as far as I can see none of them can be responsible for the 
>>>> regression. More data is required to locate the source of the slowdown.
>>>>
>>>> Best regards,
>>>> Jozsef
>>>>
>>>>>>>
>>>>>>> 0.000074 newfstatat(AT_FDCWD, "/etc/nsswitch.conf", {st_mode=S_IFREG|0644, st_size=539, ...}, 0) = 0 <0.000017>
>>>>>>> 0.000064 openat(AT_FDCWD, "/var/db/protocols.db", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) <0.000017>
>>>>>>> 0.000057 openat(AT_FDCWD, "/etc/protocols", O_RDONLY|O_CLOEXEC) = 4 <0.000013>
>>>>>>> 0.000034 newfstatat(4, "", {st_mode=S_IFREG|0644, st_size=6108, ...}, AT_EMPTY_PATH) = 0 <0.000009>
>>>>>>> 0.000032 lseek(4, 0, SEEK_SET)     = 0 <0.000008>
>>>>>>> 0.000025 read(4, "# /etc/protocols\n#\n# Internet (I"..., 4096) = 4096 <0.000010>
>>>>>>> 0.000036 close(4)                  = 0 <0.000008>
>>>>>>> 0.000028 write(1, "ANGME7BF25 - [0:0]\n:KUBE-POD-FW-"..., 4096) = 4096 <0.000028>
>>>>>>> 0.000049 socket(AF_INET, SOCK_RAW, IPPROTO_RAW) = 4 <0.000015>
>>>>>>> 0.000032 fcntl(4, F_SETFD, FD_CLOEXEC) = 0 <0.000008>
>>>>>>> 0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
>>>>>>> 0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.109384>
>>>>>>> 0.109456 close(4)                  = 0 <0.000022>
>>>>>>>
>>>>>>> On a host that is not evidencing the performance regression, we 
>>>>>>> normally see that operation take ~0.00001s as opposed to 0.109384s. 
>>>>>>> Additionally, on hosts that were evidencing the problem we also saw 
>>>>>>> high lock times with `klockstat` (unfortunately, at the time we did 
>>>>>>> not know to run echo "0" > /proc/sys/kernel/kptr_restrict 
>>>>>>> to get the callers of the below commands).
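The two getsockopt() payloads in the strace above can be decoded by hand. Below is a rough sketch assuming the layout of the ipset get/setsockopt request (a 32-bit op code and 32-bit protocol version, followed by a 32-byte set name for the lookup); the op constants named in the comments reflect my reading of the ipset uapi header, not something stated in this thread, and the set name is NUL-padded here because the strace log truncates it:

```python
import struct

IPSET_MAXNAMELEN = 32  # set name buffer size in the ipset uapi header

# Raw payloads of the two getsockopt(SOL_IP, 0x53 /* SO_IP_SET */) calls
# from the strace above.
version_buf = b"\x00\x01\x00\x00\x07\x00\x00\x00"
getset_buf = (b"\x07\x00\x00\x00\x07\x00\x00\x00"
              + b"KUBE-DST-VBH27M7NWLDOZIE".ljust(IPSET_MAXNAMELEN, b"\x00"))

# Both requests start with two little-endian 32-bit fields: op, version.
op, version = struct.unpack_from("<II", version_buf)
print(hex(op), version)   # 0x100 (likely IP_SET_OP_VERSION), protocol 7

op, version = struct.unpack_from("<II", getset_buf)
name = getset_buf[8:8 + IPSET_MAXNAMELEN].rstrip(b"\x00").decode()
print(hex(op), version, name)  # 0x7 (likely IP_SET_OP_GET_BYINDEX) + set name
```

So the fast first call is the protocol-version handshake and the slow second call is the per-set name lookup itself.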
>>>>>>>
>>>>>>> klockstat -i 5 -n 10 (on a host experiencing the problem)
>>>>>>> Caller   Avg Hold  Count   Max hold Total hold
>>>>>>> b'[unknown]'  118490772     83  179899470 9834734132
>>>>>>> b'[unknown]'  118416941     83  179850047 9828606138
>>>>>>> # or somewhere later while iptables -vnL was running:
>>>>>>> Caller   Avg Hold  Count   Max hold Total hold
>>>>>>> b'[unknown]'  496466236     46 17919955720 22837446860
>>>>>>> b'[unknown]'  496391064     46 17919893843 22833988950
>>>>>>>
>>>>>>> klockstat -i 5 -n 10 (on a host not experiencing the problem)
>>>>>>> Caller   Avg Hold  Count   Max hold Total hold
>>>>>>> b'[unknown]'     120316   1510   85537797  181677885
>>>>>>> b'[unknown]'    7119070     24   85527251  170857690
>>>>>>
>>>>>> Hi, this is your Linux kernel regression tracker.
>>>>>>
>>>>>> Thanks for the report.
>>>>>>
>>>>>> CCing the regression mailing list, as it should be in the loop for all
>>>>>> regressions, as explained here:
>>>>>> https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html
>>>>>>
>>>>>> To be sure below issue doesn't fall through the cracks unnoticed, I'm
>>>>>> adding it to regzbot, my Linux kernel regression tracking bot:
>>>>>>
>>>>>> #regzbot ^introduced v5.10..v5.15
>>>>>> #regzbot title net: netfilter: Intermittent performance regression
>>>>>> related to ipset
>>>>>> #regzbot ignore-activity
>>>>>>
>>>>>> If it turns out this isn't a regression, feel free to remove it from the
>>>>>> tracking by sending a reply to this thread containing a paragraph like
>>>>>> "#regzbot invalid: reason why this is invalid" (without the quotes).
>>>>>>
>>>>>> Reminder for developers: when fixing the issue, please add 'Link:'
>>>>>> tags pointing to the report (the mail quoted above) using
>>>>>> lore.kernel.org/r/, as explained in
>>>>>> 'Documentation/process/submitting-patches.rst' and
>>>>>> 'Documentation/process/5.Posting.rst'. Regzbot needs them to
>>>>>> automatically connect reports with fixes, but they are useful in
>>>>>> general, too.
>>>>>>
>>>>>> I'm sending this to everyone that got the initial report, to make
>>>>>> everyone aware of the tracking. I also hope that messages like this
>>>>>> motivate people to directly get at least the regression mailing list and
>>>>>> ideally even regzbot involved when dealing with regressions, as messages
>>>>>> like this wouldn't be needed then. And don't worry, if I need to send
>>>>>> other mails regarding this regression only relevant for regzbot I'll
>>>>>> send them to the regressions lists only (with a tag in the subject so
>>>>>> people can filter them away). With a bit of luck no such messages will
>>>>>> be needed anyway.
>>>>>>
>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>>>>
>>>>>> P.S.: As the Linux kernel's regression tracker I'm getting a lot of
>>>>>> reports on my table. I can only look briefly into most of them and lack
>>>>>> knowledge about most of the areas they concern. I thus unfortunately
>>>>>> will sometimes get things wrong or miss something important. I hope
>>>>>> that's not the case here; if you think it is, don't hesitate to tell me
>>>>>> in a public reply, it's in everyone's interest to set the public record
>>>>>> straight.
>>>>>>
>>>>>
>>>>
>>>> -
>>>> E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
>>>> PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
>>>> Address : Wigner Research Centre for Physics
>>>>            H-1525 Budapest 114, POB. 49, Hungary
>>>
>>>
>>
> 
> -
> E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
> PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
> Address : Wigner Research Centre for Physics
>           H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Intermittent performance regression related to ipset between 5.10 and 5.15
  2022-06-30 14:59                 ` U'ren, Aaron
@ 2022-06-30 18:04                   ` Jakub Kicinski
  2022-07-02  0:32                     ` U'ren, Aaron
  2022-07-02 17:40                     ` Jozsef Kadlecsik
  0 siblings, 2 replies; 20+ messages in thread
From: Jakub Kicinski @ 2022-06-30 18:04 UTC (permalink / raw)
  To: U'ren, Aaron
  Cc: Thorsten Leemhuis, McLean, Patrick, Pablo Neira Ayuso,
	netfilter-devel, Brown, Russell, Rueger, Manuel, linux-kernel,
	regressions, Florian Westphal, netdev, Jozsef Kadlecsik

Sounds like you're pretty close to figuring this out! Can you check 
if the user space is intentionally setting IPSET_ATTR_INITVAL?
Either that or IPSET_ATTR_GC was not as "unused" as initially thought.

Testing something like this could be a useful data point:

diff --git a/include/uapi/linux/netfilter/ipset/ip_set.h b/include/uapi/linux/netfilter/ipset/ip_set.h
index 6397d75899bc..7caf9b53d2a7 100644
--- a/include/uapi/linux/netfilter/ipset/ip_set.h
+++ b/include/uapi/linux/netfilter/ipset/ip_set.h
@@ -92,7 +92,7 @@ enum {
 	/* Reserve empty slots */
 	IPSET_ATTR_CADT_MAX = 16,
 	/* Create-only specific attributes */
-	IPSET_ATTR_INITVAL,	/* was unused IPSET_ATTR_GC */
+	IPSET_ATTR_GC,
 	IPSET_ATTR_HASHSIZE,
 	IPSET_ATTR_MAXELEM,
 	IPSET_ATTR_NETMASK,
@@ -104,6 +104,8 @@ enum {
 	IPSET_ATTR_REFERENCES,
 	IPSET_ATTR_MEMSIZE,
 
+	IPSET_ATTR_INITVAL,
+
 	__IPSET_ATTR_CREATE_MAX,
 };
 #define IPSET_ATTR_CREATE_MAX	(__IPSET_ATTR_CREATE_MAX - 1)


On Thu, 30 Jun 2022 14:59:14 +0000 U'ren, Aaron wrote:
> Thorsten / Jozsef -
> 
> Thanks for continuing to follow up! I'm sorry that this has moved so slowly; it has taken us a while to find the time to fully track this issue down. However, I think that we have figured out enough to make some more forward progress on this issue.
> 
> Jozsef, thanks for your insight into what is happening between those system calls. In regards to your question about wait/wound mutex debugging possibly being enabled, I can tell you that we definitely don't have that enabled on any of our regular machines. While we were debugging we did turn on quite a few debug options to help us try and track this issue down and it is very possible that the strace that was taken that started off this email was taken on a machine that did have that debug option enabled. Either way though, the root issue occurs on hosts that definitely do not have wait/wound mutex debugging enabled.
> 
> The good news is that we finally got one of our development environments into a state where we could reliably reproduce the performance issue across reboots. This was a win because it meant that we were able to do a full bisect of the kernel and were able to tell relatively quickly whether or not the issue was present in the test kernels.
> 
> After bisecting for 3 days, I have been able to narrow it down to a single commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3976ca101990ca11ddf51f38bec7b86c19d0ca6f (netfilter: ipset: Expose the initval hash parameter to userspace)
> 
> I'm at a bit of a loss as to why this would cause such severe performance regressions, but I've proved it out multiple times now. I've even checked out a fresh version of the 5.15 kernel that we've been deploying with just this single commit reverted and found that the performance problems are completely resolved.
> 
> I'm hoping that maybe Jozsef will have some more insight into why this seemingly innocuous commit causes such large performance issues for us. If you have any additional patches or other things that you would like us to test, I will try to leave our environment in its current state for the next couple of days so that we can do so.
> 
> -Aaron
> 
> From: Thorsten Leemhuis <regressions@leemhuis.info>
> Date: Monday, June 20, 2022 at 2:16 AM
> To: U'ren, Aaron <Aaron.U'ren@sony.com>
> Cc: McLean, Patrick <Patrick.Mclean@sony.com>, Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>, Jozsef Kadlecsik <kadlec@netfilter.org>
> Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
> On 31.05.22 09:41, Jozsef Kadlecsik wrote:
> > On Mon, 30 May 2022, Thorsten Leemhuis wrote:  
> >> On 04.05.22 21:37, U'ren, Aaron wrote:  
>  [...]  
> > 
> > Every set lookups behind "iptables" needs two getsockopt() calls: you can 
> > see them in the strace logs. The first one checks the internal protocol 
> > number of ipset and the second one verifies/gets the processed set (it's 
> > an extension to iptables and therefore there's no internal state to save 
> > the protocol version number).  
> 
> Hi Aaron! Did any of the suggestions from Jozsef help to track down the
> root case? I have this issue on the list of tracked regressions and
> wonder what the status is. Or can I mark this as resolved?
> 
> Side note: this is not a "something breaks" regression and it seems to
> progress slowly, so I'm putting it on the backburner:
> 
> #regzbot backburner: performance regression where the culprit is hard to
> track down
> 
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> 
> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> reports and sometimes miss something important when writing mails like
> this. If that's the case here, don't hesitate to tell me in a public
> reply, it's in everyone's interest to set the public record straight.
> 
>  [...]  
> > 
> > In your strace log
> > 
> > 0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
> > 0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.1$
> > 0.109456 close(4)                  = 0 <0.000022>
> > 
> > the only things which happen in the second sockopt function are to lock 
> > the NFNL_SUBSYS_IPSET mutex, walk the array of the sets, compare the 
> > setname, save the result in the case of a match and unlock the mutex. 
> > Nothing complicated, no deep, multi-level function calls. Just a few 
> > lines of code which haven't changed.
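The lookup described above can be modeled in a few lines of Python (a toy model, not kernel code) to show why the scan itself should be cheap and why attention falls on the single mutex that serializes every lookup:

```python
import threading

# Toy model of the kernel-side work behind the second getsockopt():
# one global NFNL_SUBSYS_IPSET mutex serializes every lookup, and the
# lookup itself is a plain linear walk comparing set names.
nfnl_ipset_mutex = threading.Lock()
ip_set_list = [{"name": f"KUBE-DST-{i:04d}"} for i in range(1000)]

def find_set_byname(name):
    with nfnl_ipset_mutex:  # every iptables/ipset lookup queues here
        for index, entry in enumerate(ip_set_list):
            if entry is not None and entry["name"] == name:
                return index
    return None

print(find_set_byname("KUBE-DST-0999"))  # 999
```

Even with a thousand sets the walk is trivial, which is why a ~0.1s per-call time points at the mutex handling (contention or a debug option) rather than the scan.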
> > 
> > The only thing which can slow down the processing is the mutex handling. 
> > Don't you have accidentally wait/wound mutex debugging enabled in the 
> > kernel? If not, then bisecting the mutex related patches might help.
> > 
> > You wrote that flushing tables or ipsets didn't seem to help. That 
> > literally meant flushing i.e. the sets were emptied but not destroyed? Did 
> > you try both destroying or flushing?
> >   
> >> Jozsef, I still have this issue on my list of tracked regressions and it
> >> looks like nothing happens since above mail (or did I miss it?). Could
> >> you maybe provide some guidance to Aaron to get us all closer to the
> >> root of the problem?  
> > 
> > I really hope it's an accidentally enabled debugging option in the kernel. 
> > Otherwise bisecting could help to uncover the issue.
> > 
> > Best regards,
> > Jozsef
> >   
> >> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> >> reports and sometimes miss something important when writing mails like
> >> this. If that's the case here, don't hesitate to tell me in a public
> >> reply, it's in everyone's interest to set the public record straight.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Intermittent performance regression related to ipset between 5.10 and 5.15 #forregzbot
  2022-03-16  9:17 ` Thorsten Leemhuis
  2022-04-11 10:36   ` Thorsten Leemhuis
@ 2022-07-01  5:18   ` Thorsten Leemhuis
  1 sibling, 0 replies; 20+ messages in thread
From: Thorsten Leemhuis @ 2022-07-01  5:18 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, regressions

TWIMC: this mail is primarily sent for documentation purposes and for
regzbot, my Linux kernel regression tracking bot. These mails usually
contain '#forregzbot' in the subject, to make them easy to spot and filter.

On 16.03.22 10:17, Thorsten Leemhuis wrote:
> [TLDR: I'm adding the regression report below to regzbot, the Linux
> kernel regression tracking bot; all text you find below is compiled from
> a few templates paragraphs you might have encountered already already
> from similar mails.]
> 
> On 16.03.22 00:15, McLean, Patrick wrote:
>> When we upgraded from the 5.10 (5.10.61) series to the 5.15 (5.15.16) series, we encountered an intermittent performance regression that appears to be related to iptables / ipset. This regression was noticed on Kubernetes hosts that run kube-router and experience a high amount of churn to both iptables and ipsets. Specifically, when we run the nftables (iptables-1.8.7 / nftables-1.0.0) iptables wrapper xtables-nft-multi on the 5.15 series kernel, we end up getting extremely laggy response times when iptables attempts to lookup information on the ipsets that are used in the iptables definition. This issue isn’t reproducible on all hosts. However, our experience has been that across a fleet of ~50 hosts we experienced this issue on ~40% of the hosts. When the problem evidences, the time that it takes to run unrestricted iptables list commands like iptables -L or iptables-save gradually increases over the course of about 1 - 2 hours, growing from less than a second to sometimes taking over 2 minutes to run. After that 2 hour mark it seems to plateau and not grow any longer. Flushing tables or ipsets doesn’t seem to have any effect on the issue. However, rebooting the host does reset the issue. Occasionally, a machine that was evidencing the problem may no longer evidence it after being rebooted.
>>
>> We did try to debug this to find a root cause, but ultimately ran short on time. We were not able to perform a set of bisects to hopefully narrow down the issue as the problem isn’t consistently reproducible. We were able to get some straces where it appears that most of the time is spent on getsockopt() operations. It appears that during iptables operations, it attempts to do some work to resolve the ipsets that are linked to the iptables definitions (perhaps getting the names of the ipsets themselves?). Slowly that getsockopt request takes more and more time on affected hosts. Here is an example strace of the operation in question:
> [...]
> #regzbot ^introduced v5.10..v5.15
> #regzbot title net: netfilter: Intermittent performance regression
> related to ipset
> #regzbot ignore-activity

#regzbot introduced 3976ca101990ca11ddf51f38bec7b86c19d0ca6f

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Intermittent performance regression related to ipset between 5.10 and 5.15
  2022-06-30 18:04                   ` Jakub Kicinski
@ 2022-07-02  0:32                     ` U'ren, Aaron
  2022-07-02 17:40                     ` Jozsef Kadlecsik
  1 sibling, 0 replies; 20+ messages in thread
From: U'ren, Aaron @ 2022-07-02  0:32 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Thorsten Leemhuis, McLean, Patrick, Pablo Neira Ayuso,
	netfilter-devel, Brown, Russell, Rueger, Manuel, linux-kernel,
	regressions, Florian Westphal, netdev, Jozsef Kadlecsik

Initial results so far are that adding back IPSET_ATTR_GC seems to make the difference. I've only done two rounds of testing so far, and we've had at least one false positive in the past with our environments, so I'd like to give it a little more time before saying for sure. I should be able to get back to you early next week more conclusively.

This was my patch:
diff --git a/include/uapi/linux/netfilter/ipset/ip_set.h b/include/uapi/linux/netfilter/ipset/ip_set.h
index 6397d75899bc..e5abb4d24e75 100644
--- a/include/uapi/linux/netfilter/ipset/ip_set.h
+++ b/include/uapi/linux/netfilter/ipset/ip_set.h
@@ -93,6 +93,7 @@ enum {
        IPSET_ATTR_CADT_MAX = 16,
        /* Create-only specific attributes */
        IPSET_ATTR_INITVAL,     /* was unused IPSET_ATTR_GC */
+       IPSET_ATTR_GC,
        IPSET_ATTR_HASHSIZE,
        IPSET_ATTR_MAXELEM,
        IPSET_ATTR_NETMASK,

It would be good to know, though, from the maintainers' perspective whether or not this even makes sense.

The primary user of iptables / ipsets on our hosts is kube-router, the Kubernetes networking framework we use. Its usage for each of these tools involves exec-ing out to the ipset (https://github.com/cloudnativelabs/kube-router/blob/master/pkg/utils/ipset.go#L176) and iptables based binaries (via the iptables wrapper scripts from the coreos Go library: https://github.com/coreos/go-iptables).

In regards to ipsets, it mostly calls "ipset save" (https://github.com/cloudnativelabs/kube-router/blob/master/pkg/utils/ipset.go#L535) and "ipset restore -exist" (https://github.com/cloudnativelabs/kube-router/blob/master/pkg/utils/ipset.go#L551). There are a few outliers in the code that create and populate ipsets individually (also using the binaries), but most of those have been converted to save / restore because of the performance penalties of exec-ing out constantly.

On our images we run ipset 7.15 and iptables 1.8.8.
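Given that usage pattern, a back-of-the-envelope check shows how a per-lookup slowdown compounds across a full listing. The per-lookup times come from the straces earlier in the thread; the number of set references is a made-up figure for illustration:

```python
# One getsockopt() set lookup per ipset reference in the ruleset; with
# heavy kube-router churn a ruleset can reference sets many hundreds of
# times (1200 below is a hypothetical count, not measured).
set_references = 1200
healthy_lookup = 0.00001    # seconds per lookup, unaffected host's strace
degraded_lookup = 0.109384  # seconds per lookup, affected host's strace

print(f"healthy:  {set_references * healthy_lookup:.2f}s")
print(f"degraded: {set_references * degraded_lookup:.1f}s")  # ~131s
```

That is consistent with the reported growth from sub-second iptables-save runs to listings taking over two minutes.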

-Aaron

From: Jakub Kicinski <kuba@kernel.org>
Date: Thursday, June 30, 2022 at 1:04 PM
To: U'ren, Aaron <Aaron.U'ren@sony.com>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>, McLean, Patrick <Patrick.Mclean@sony.com>, Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>, Jozsef Kadlecsik <kadlec@netfilter.org>
Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
Sounds like you're pretty close to figuring this out! Can you check 
if the user space is intentionally setting IPSET_ATTR_INITVAL?
Either that or IPSET_ATTR_GC was not as "unused" as initially thought.

Testing something like this could be a useful data point:

diff --git a/include/uapi/linux/netfilter/ipset/ip_set.h b/include/uapi/linux/netfilter/ipset/ip_set.h
index 6397d75899bc..7caf9b53d2a7 100644
--- a/include/uapi/linux/netfilter/ipset/ip_set.h
+++ b/include/uapi/linux/netfilter/ipset/ip_set.h
@@ -92,7 +92,7 @@ enum {
         /* Reserve empty slots */
         IPSET_ATTR_CADT_MAX = 16,
         /* Create-only specific attributes */
-       IPSET_ATTR_INITVAL,     /* was unused IPSET_ATTR_GC */
+       IPSET_ATTR_GC,
         IPSET_ATTR_HASHSIZE,
         IPSET_ATTR_MAXELEM,
         IPSET_ATTR_NETMASK,
@@ -104,6 +104,8 @@ enum {
         IPSET_ATTR_REFERENCES,
         IPSET_ATTR_MEMSIZE,
 
+       IPSET_ATTR_INITVAL,
+
         __IPSET_ATTR_CREATE_MAX,
 };
 #define IPSET_ATTR_CREATE_MAX   (__IPSET_ATTR_CREATE_MAX - 1)


On Thu, 30 Jun 2022 14:59:14 +0000 U'ren, Aaron wrote:
> Thorsten / Jozsef -
> 
> Thanks for continuing to follow up! I'm sorry that this has moved so slowly; it has taken us a while to find the time to fully track this issue down. However, I think that we have figured out enough to make some more forward progress on this issue.
> 
> Jozsef, thanks for your insight into what is happening between those system calls. In regards to your question about wait/wound mutex debugging possibly being enabled, I can tell you that we definitely don't have that enabled on any of our regular machines. While we were debugging we did turn on quite a few debug options to help us try and track this issue down and it is very possible that the strace that was taken that started off this email was taken on a machine that did have that debug option enabled. Either way though, the root issue occurs on hosts that definitely do not have wait/wound mutex debugging enabled.
> 
> The good news is that we finally got one of our development environments into a state where we could reliably reproduce the performance issue across reboots. This was a win because it meant that we were able to do a full bisect of the kernel and were able to tell relatively quickly whether or not the issue was present in the test kernels.
> 
> After bisecting for 3 days, I have been able to narrow it down to a single commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3976ca101990ca11ddf51f38bec7b86c19d0ca6f (netfilter: ipset: Expose the initval hash parameter to userspace)
> 
> I'm at a bit of a loss as to why this would cause such severe performance regressions, but I've proved it out multiple times now. I've even checked out a fresh version of the 5.15 kernel that we've been deploying with just this single commit reverted and found that the performance problems are completely resolved.
> 
> I'm hoping that maybe Jozsef will have some more insight into why this seemingly innocuous commit causes such large performance issues for us. If you have any additional patches or other things that you would like us to test, I will try to leave our environment in its current state for the next couple of days so that we can do so.
> 
> -Aaron
> 
> From: Thorsten Leemhuis <regressions@leemhuis.info>
> Date: Monday, June 20, 2022 at 2:16 AM
> To: U'ren, Aaron <Aaron.U'ren@sony.com>
> Cc: McLean, Patrick <Patrick.Mclean@sony.com>, Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>, Jozsef Kadlecsik <kadlec@netfilter.org>
> Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
> On 31.05.22 09:41, Jozsef Kadlecsik wrote:
> > On Mon, 30 May 2022, Thorsten Leemhuis wrote:  
> >> On 04.05.22 21:37, U'ren, Aaron wrote:  
>  [...]  
> > 
> > Every set lookups behind "iptables" needs two getsockopt() calls: you can 
> > see them in the strace logs. The first one checks the internal protocol 
> > number of ipset and the second one verifies/gets the processed set (it's 
> > an extension to iptables and therefore there's no internal state to save 
> > the protocol version number).  
> 
> Hi Aaron! Did any of the suggestions from Jozsef help to track down the
> root cause? I have this issue on the list of tracked regressions and
> wonder what the status is. Or can I mark this as resolved?
> 
> Side note: this is not a "something breaks" regression and it seems to
> progress slowly, so I'm putting it on the backburner:
> 
> #regzbot backburner: performance regression where the culprit is hard to
> track down
> 
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> 
> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> reports and sometimes miss something important when writing mails like
> this. If that's the case here, don't hesitate to tell me in a public
> reply, it's in everyone's interest to set the public record straight.
> 
>  [...]  
> > 
> > In your strace log
> > 
> > 0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
> > 0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.1$
> > 0.109456 close(4)                  = 0 <0.000022>
> > 
> > the only things which happen in the second sockopt function are to lock 
> > the NFNL_SUBSYS_IPSET mutex, walk the array of the sets, compare the 
> > setname, save the result in the case of a match and unlock the mutex. 
> > Nothing complicated, no deep, multi-level function calls. Just a few 
> > lines of code which haven't changed.
> > 
> > The only thing which can slow down the processing is the mutex handling. 
> > Don't you have accidentally wait/wound mutex debugging enabled in the 
> > kernel? If not, then bisecting the mutex related patches might help.
> > 
> > You wrote that flushing tables or ipsets didn't seem to help. That 
> > literally meant flushing i.e. the sets were emptied but not destroyed? Did 
> > you try both destroying or flushing?
> >   
> >> Jozsef, I still have this issue on my list of tracked regressions and it
> >> looks like nothing happens since above mail (or did I miss it?). Could
> >> you maybe provide some guidance to Aaron to get us all closer to the
> >> root of the problem?  
> > 
> > I really hope it's an accidentally enabled debugging option in the kernel. 
> > Otherwise bisecting could help to uncover the issue.
> > 
> > Best regards,
> > Jozsef
> >   
> >> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> >> reports and sometimes miss something important when writing mails like
> >> this. If that's the case here, don't hesitate to tell me in a public
> >> reply, it's in everyone's interest to set the public record straight.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Intermittent performance regression related to ipset between 5.10 and 5.15
  2022-06-30 18:04                   ` Jakub Kicinski
  2022-07-02  0:32                     ` U'ren, Aaron
@ 2022-07-02 17:40                     ` Jozsef Kadlecsik
  2022-07-08 20:08                       ` U'ren, Aaron
  1 sibling, 1 reply; 20+ messages in thread
From: Jozsef Kadlecsik @ 2022-07-02 17:40 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: U'ren, Aaron, Thorsten Leemhuis, McLean, Patrick,
	Pablo Neira Ayuso, netfilter-devel, Brown, Russell, Rueger,
	Manuel, linux-kernel, regressions, Florian Westphal, netdev

[-- Attachment #1: Type: text/plain, Size: 8263 bytes --]

Hi,

On Thu, 30 Jun 2022, Jakub Kicinski wrote:

> Sounds like you're pretty close to figuring this out! Can you check 
> if the user space is intentionally setting IPSET_ATTR_INITVAL?
> Either that or IPSET_ATTR_GC was not as "unused" as initially thought.

IPSET_ATTR_GC was really unused. It was an old remnant from the time when 
ipset userspace-kernel communication was through set/getsockopt. However, 
when it was migrated to netlink, just the symbol was kept but it was not 
used by either the userspace tool or the kernel.

Aaron, could you send me a way to reproduce the issue? I have no idea how 
that patch could be the reason. Setting/getting/using IPSET_ATTR_INITVAL 
is totally independent from listing iptables rules. But if you have got a 
reproducer then I can dig into it.

Best regards,
Jozsef

> Testing something like this could be a useful data point:
> 
> diff --git a/include/uapi/linux/netfilter/ipset/ip_set.h b/include/uapi/linux/netfilter/ipset/ip_set.h
> index 6397d75899bc..7caf9b53d2a7 100644
> --- a/include/uapi/linux/netfilter/ipset/ip_set.h
> +++ b/include/uapi/linux/netfilter/ipset/ip_set.h
> @@ -92,7 +92,7 @@ enum {
>  	/* Reserve empty slots */
>  	IPSET_ATTR_CADT_MAX = 16,
>  	/* Create-only specific attributes */
> -	IPSET_ATTR_INITVAL,	/* was unused IPSET_ATTR_GC */
> +	IPSET_ATTR_GC,
>  	IPSET_ATTR_HASHSIZE,
>  	IPSET_ATTR_MAXELEM,
>  	IPSET_ATTR_NETMASK,
> @@ -104,6 +104,8 @@ enum {
>  	IPSET_ATTR_REFERENCES,
>  	IPSET_ATTR_MEMSIZE,
>  
> +	IPSET_ATTR_INITVAL,
> +
>  	__IPSET_ATTR_CREATE_MAX,
>  };
>  #define IPSET_ATTR_CREATE_MAX	(__IPSET_ATTR_CREATE_MAX - 1)
> 
> 
> On Thu, 30 Jun 2022 14:59:14 +0000 U'ren, Aaron wrote:
> > Thorsten / Jozsef -
> > 
> > Thanks for continuing to follow up! I'm sorry that this has moved so slowly; it has taken us a while to find the time to fully track this issue down. However, I think that we have figured out enough to make some more forward progress on this issue.
> > 
> > Jozsef, thanks for your insight into what is happening between those system calls. In regards to your question about wait/wound mutex debugging possibly being enabled, I can tell you that we definitely don't have that enabled on any of our regular machines. While we were debugging we did turn on quite a few debug options to help us try and track this issue down and it is very possible that the strace that was taken that started off this email was taken on a machine that did have that debug option enabled. Either way though, the root issue occurs on hosts that definitely do not have wait/wound mutex debugging enabled.
> > 
> > The good news is that we finally got one of our development environments into a state where we could reliably reproduce the performance issue across reboots. This was a win because it meant that we were able to do a full bisect of the kernel and were able to tell relatively quickly whether or not the issue was present in the test kernels.
> > 
> > After bisecting for 3 days, I have been able to narrow it down to a single commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3976ca101990ca11ddf51f38bec7b86c19d0ca6f (netfilter: ipset: Expose the initval hash parameter to userspace)
> > 
> > I'm at a bit of a loss as to why this would cause such severe performance regressions, but I've proved it out multiple times now. I've even checked out a fresh version of the 5.15 kernel that we've been deploying with just this single commit reverted and found that the performance problems are completely resolved.
> > 
> > I'm hoping that maybe Jozsef will have some more insight into why this seemingly innocuous commit causes such large performance issues for us. If you have any additional patches or other things that you would like us to test, I will try to leave our environment in its current state for the next couple of days so that we can do so.
> > 
> > -Aaron
> > 
> > From: Thorsten Leemhuis <regressions@leemhuis.info>
> > Date: Monday, June 20, 2022 at 2:16 AM
> > To: U'ren, Aaron <Aaron.U'ren@sony.com>
> > Cc: McLean, Patrick <Patrick.Mclean@sony.com>, Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>, Jozsef Kadlecsik <kadlec@netfilter.org>
> > Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
> > On 31.05.22 09:41, Jozsef Kadlecsik wrote:
> > > On Mon, 30 May 2022, Thorsten Leemhuis wrote:  
> > >> On 04.05.22 21:37, U'ren, Aaron wrote:  
> >  [...]  
> > > 
> > > Every set lookup behind "iptables" needs two getsockopt() calls: you can 
> > > see them in the strace logs. The first one checks the internal protocol 
> > > number of ipset and the second one verifies/gets the processed set (it's 
> > > an extension to iptables and therefore there's no internal state to save 
> > > the protocol version number).  
> > 
> > Hi Aaron! Did any of the suggestions from Jozsef help to track down the
> > root cause? I have this issue on the list of tracked regressions and
> > wonder what the status is. Or can I mark this as resolved?
> > 
> > Side note: this is not a "something breaks" regression and it seems to
> > progress slowly, so I'm putting it on the backburner:
> > 
> > #regzbot backburner: performance regression where the culprit is hard to
> > track down
> > 
> > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> > 
> > P.S.: As the Linux kernel's regression tracker I deal with a lot of
> > reports and sometimes miss something important when writing mails like
> > this. If that's the case here, don't hesitate to tell me in a public
> > reply, it's in everyone's interest to set the public record straight.
> > 
> >  [...]  
> > > 
> > > In your strace log
> > > 
> > > 0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
> > > 0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.1$
> > > 0.109456 close(4)                  = 0 <0.000022>
> > > 
> > > the only things which happen in the second sockopt function are to lock 
> > > the NFNL_SUBSYS_IPSET mutex, walk the array of the sets, compare the 
> > > setname, save the result in the case of a match and unlock the mutex. 
> > > Nothing complicated, no deep, multi-level function calls. Just a few 
> > > lines of code which haven't changed.
> > > 
> > > The only thing which can slow down the processing is the mutex handling. 
> > > Do you perhaps have wait/wound mutex debugging accidentally enabled in the 
> > > kernel? If not, then bisecting the mutex-related patches might help.
> > > 
> > > You wrote that flushing tables or ipsets didn't seem to help. Did that 
> > > literally mean flushing, i.e. the sets were emptied but not destroyed? Did 
> > > you try both destroying and flushing?
> > >   
> > >> Jozsef, I still have this issue on my list of tracked regressions and it
> > >> looks like nothing happens since above mail (or did I miss it?). Could
> > >> you maybe provide some guidance to Aaron to get us all closer to the
> > >> root of the problem?  
> > > 
> > > I really hope it's an accidentally enabled debugging option in the kernel. 
> > > Otherwise bisecting could help to uncover the issue.
> > > 
> > > Best regards,
> > > Jozsef
> > >   
> > >> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> > >> reports and sometimes miss something important when writing mails like
> > >> this. If that's the case here, don't hesitate to tell me in a public
> > >> reply, it's in everyone's interest to set the public record straight.
> 

-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Intermittent performance regression related to ipset between 5.10 and 5.15
  2022-07-02 17:40                     ` Jozsef Kadlecsik
@ 2022-07-08 20:08                       ` U'ren, Aaron
  2022-07-29 20:21                         ` U'ren, Aaron
  0 siblings, 1 reply; 20+ messages in thread
From: U'ren, Aaron @ 2022-07-08 20:08 UTC (permalink / raw)
  To: Jozsef Kadlecsik, Jakub Kicinski
  Cc: Thorsten Leemhuis, McLean, Patrick, Pablo Neira Ayuso,
	netfilter-devel, Brown, Russell, Rueger, Manuel, linux-kernel,
	regressions, Florian Westphal, netdev

Jozsef / Jakub-

Given your latest email and the fact that just adding back in IPSET_ATTR_GC doesn't shed any light on the issue, I wanted to spend a lot more time testing. I also wanted to try to provide as much context for this issue as possible.

I think that the iptables slowness is just a symptom, not the cause, of the issue. After spending a lot more time with it, I can see that iptables only runs slowly when an existing "ipset restore" process is being run by kube-router simultaneously. Given the other information that you've provided, my hunch is that iptables slows down when ipset restore is running because they are both vying for the same mutex? Anyway, I think troubleshooting it from the direction of the iptables slowness is likely the wrong path to go down.

The true problem seems to be that when IPSET_ATTR_GC is not included, somehow nodes are able to get into a state where "ipset restore" goes from completing in less than a tenth of a second to taking 30 seconds to a minute to complete. The hard part is that I still don't know what causes a node to enter this state.

I have a Kubernetes cluster of about 7 nodes that I can reliably get into this state, but I have yet to be able to reproduce it consistently anywhere else. Other clusters will randomly exhibit the issue if IPSET_ATTR_GC is left out of the kernel, but not consistently. Since the email about 2 weeks ago in which we found the commit, we have also been running 6 clusters of 9+ nodes with IPSET_ATTR_GC enabled and have not had any issues.

Since we have a custom kernel configuration, I have also tried using the vanilla Ubuntu kernel configuration (taken from 5.15.0-40-generic) as well just to ensure that we didn't have some errant configuration option enabled. However, this also reliably reproduced the issue when IPSET_ATTR_GC was removed and just as reliably removed the issue when IPSET_ATTR_GC was added back in.

I have also verified that neither ipset, iptables, nor any of their dependent libraries has references to IPSET_ATTR_GC, going as far as to remove it from the ipset header file (https://git.netfilter.org/iptables/tree/include/linux/netfilter/ipset/ip_set.h#n86) and rebuild it (and all of the libraries and other tools) from scratch as a hail mary. No changes to user space seem to have an effect on this issue.

One other thing that I've done to help track down the issue is to add debug options to kube-router so that it outputs the file that it feeds into "ipset restore -exist". With this file, on nodes affected by this issue, I can reliably reproduce the issue by calling "ipset restore -exist <file" and see that it takes 30+ seconds to execute.

In the hope that it sheds some light and gives you some more context, I'm going to send you and Jakub a copy of the strace and the ipset file that was used, separately from this email.

At this point, I'm not sure how to proceed other than with the files that I'll be sending you. I'm highly confident that somehow the removal of IPSET_ATTR_GC is causing the issues that we see. At this point I've added and removed the options almost 20 times and done reboots across our cluster. Anytime that variable is missing, we see several nodes exhibit the performance issues immediately. Any time the variable is present, we see no nodes exhibit the performance issues.

Looking forward to hearing back from you and getting to the bottom of this very bizarre issue.

-Aaron

From: Jozsef Kadlecsik <kadlec@netfilter.org>
Date: Saturday, July 2, 2022 at 12:41 PM
To: Jakub Kicinski <kuba@kernel.org>
Cc: U'ren, Aaron <Aaron.U'ren@sony.com>, Thorsten Leemhuis <regressions@leemhuis.info>, McLean, Patrick <Patrick.Mclean@sony.com>, Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>
Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
Hi,

On Thu, 30 Jun 2022, Jakub Kicinski wrote:

> Sounds like you're pretty close to figuring this out! Can you check 
> if the user space is intentionally setting IPSET_ATTR_INITVAL?
> Either that or IPSET_ATTR_GC was not as "unused" as initially thought.

IPSET_ATTR_GC was really unused. It was an old remnant from the time when 
ipset userspace-kernel communication went through set/getsockopt. However, 
when it was migrated to netlink, only the symbol was kept; it was not 
used by either the userspace tool or the kernel.

Aaron, could you send me how to reproduce the issue? I have no idea how 
that patch could be the reason. Setting/getting/using IPSET_ATTR_INITVAL 
is totally independent from listing iptables rules. But if you have got a 
reproducer then I can dig into it.

Best regards,
Jozsef

> Testing something like this could be a useful data point:
> 
> diff --git a/include/uapi/linux/netfilter/ipset/ip_set.h b/include/uapi/linux/netfilter/ipset/ip_set.h
> index 6397d75899bc..7caf9b53d2a7 100644
> --- a/include/uapi/linux/netfilter/ipset/ip_set.h
> +++ b/include/uapi/linux/netfilter/ipset/ip_set.h
> @@ -92,7 +92,7 @@ enum {
>        /* Reserve empty slots */
>        IPSET_ATTR_CADT_MAX = 16,
>        /* Create-only specific attributes */
> -     IPSET_ATTR_INITVAL,     /* was unused IPSET_ATTR_GC */
> +     IPSET_ATTR_GC,
>        IPSET_ATTR_HASHSIZE,
>        IPSET_ATTR_MAXELEM,
>        IPSET_ATTR_NETMASK,
> @@ -104,6 +104,8 @@ enum {
>        IPSET_ATTR_REFERENCES,
>        IPSET_ATTR_MEMSIZE,
>  
> +     IPSET_ATTR_INITVAL,
> +
>        __IPSET_ATTR_CREATE_MAX,
>  };
>  #define IPSET_ATTR_CREATE_MAX        (__IPSET_ATTR_CREATE_MAX - 1)
> 
> 
> On Thu, 30 Jun 2022 14:59:14 +0000 U'ren, Aaron wrote:
> > Thorsten / Jozsef -
> > 
> > Thanks for continuing to follow up! I'm sorry that this has moved so slowly; it has taken us a while to find the time to fully track this issue down. However, I think that we have figured out enough to make some more forward progress on this issue.
> > 
> > Jozsef, thanks for your insight into what is happening between those system calls. In regards to your question about wait/wound mutex debugging possibly being enabled, I can tell you that we definitely don't have that enabled on any of our regular machines. While we were debugging we did turn on quite a few debug options to help us try and track this issue down and it is very possible that the strace that was taken that started off this email was taken on a machine that did have that debug option enabled. Either way though, the root issue occurs on hosts that definitely do not have wait/wound mutex debugging enabled.
> > 
> > The good news is that we finally got one of our development environments into a state where we could reliably reproduce the performance issue across reboots. This was a win because it meant that we were able to do a full bisect of the kernel and were able to tell relatively quickly whether or not the issue was present in the test kernels.
> > 
> > After bisecting for 3 days, I have been able to narrow it down to a single commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3976ca101990ca11ddf51f38bec7b86c19d0ca6f (netfilter: ipset: Expose the initval hash parameter to userspace)
> > 
> > I'm at a bit of a loss as to why this would cause such severe performance regressions, but I've proved it out multiple times now. I've even checked out a fresh version of the 5.15 kernel that we've been deploying with just this single commit reverted and found that the performance problems are completely resolved.
> > 
> > I'm hoping that maybe Jozsef will have some more insight into why this seemingly innocuous commit causes such large performance issues for us. If you have any additional patches or other things that you would like us to test, I will try to leave our environment in its current state for the next couple of days so that we can do so.
> > 
> > -Aaron
> > 
> > From: Thorsten Leemhuis <regressions@leemhuis.info>
> > Date: Monday, June 20, 2022 at 2:16 AM
> > To: U'ren, Aaron <Aaron.U'ren@sony.com>
> > Cc: McLean, Patrick <Patrick.Mclean@sony.com>, Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>, Jozsef Kadlecsik <kadlec@netfilter.org>
> > Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
> > On 31.05.22 09:41, Jozsef Kadlecsik wrote:
> > > On Mon, 30 May 2022, Thorsten Leemhuis wrote:  
> > >> On 04.05.22 21:37, U'ren, Aaron wrote:  
> >  [...]  
> > > 
> > > Every set lookup behind "iptables" needs two getsockopt() calls: you can 
> > > see them in the strace logs. The first one checks the internal protocol 
> > > number of ipset and the second one verifies/gets the processed set (it's 
> > > an extension to iptables and therefore there's no internal state to save 
> > > the protocol version number).  
> > 
> > Hi Aaron! Did any of the suggestions from Jozsef help to track down the
> > root cause? I have this issue on the list of tracked regressions and
> > wonder what the status is. Or can I mark this as resolved?
> > 
> > Side note: this is not a "something breaks" regression and it seems to
> > progress slowly, so I'm putting it on the backburner:
> > 
> > #regzbot backburner: performance regression where the culprit is hard to
> > track down
> > 
> > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> > 
> > P.S.: As the Linux kernel's regression tracker I deal with a lot of
> > reports and sometimes miss something important when writing mails like
> > this. If that's the case here, don't hesitate to tell me in a public
> > reply, it's in everyone's interest to set the public record straight.
> > 
> >  [...]  
> > > 
> > > In your strace log
> > > 
> > > 0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
> > > 0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.1$
> > > 0.109456 close(4)                  = 0 <0.000022>
> > > 
> > > the only things which happen in the second sockopt function are to lock 
> > > the NFNL_SUBSYS_IPSET mutex, walk the array of the sets, compare the 
> > > setname, save the result in the case of a match and unlock the mutex. 
> > > Nothing complicated, no deep, multi-level function calls. Just a few 
> > > lines of code which haven't changed.
> > > 
> > > The only thing which can slow down the processing is the mutex handling. 
> > > Do you perhaps have wait/wound mutex debugging accidentally enabled in the 
> > > kernel? If not, then bisecting the mutex-related patches might help.
> > > 
> > > You wrote that flushing tables or ipsets didn't seem to help. Did that 
> > > literally mean flushing, i.e. the sets were emptied but not destroyed? Did 
> > > you try both destroying and flushing?
> > >   
> > >> Jozsef, I still have this issue on my list of tracked regressions and it
> > >> looks like nothing happens since above mail (or did I miss it?). Could
> > >> you maybe provide some guidance to Aaron to get us all closer to the
> > >> root of the problem?  
> > > 
> > > I really hope it's an accidentally enabled debugging option in the kernel. 
> > > Otherwise bisecting could help to uncover the issue.
> > > 
> > > Best regards,
> > > Jozsef
> > >   
> > >> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> > >> reports and sometimes miss something important when writing mails like
> > >> this. If that's the case here, don't hesitate to tell me in a public
> > >> reply, it's in everyone's interest to set the public record straight.
> 

-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Intermittent performance regression related to ipset between 5.10 and 5.15
  2022-07-08 20:08                       ` U'ren, Aaron
@ 2022-07-29 20:21                         ` U'ren, Aaron
  2022-07-30  1:41                           ` Jakub Kicinski
  2022-07-30 10:43                           ` Jozsef Kadlecsik
  0 siblings, 2 replies; 20+ messages in thread
From: U'ren, Aaron @ 2022-07-29 20:21 UTC (permalink / raw)
  To: Jozsef Kadlecsik, Jakub Kicinski, Thorsten Leemhuis
  Cc: McLean, Patrick, Pablo Neira Ayuso, netfilter-devel, Brown,
	Russell, Rueger, Manuel, linux-kernel, regressions,
	Florian Westphal, netdev

Jozsef / Jakub / Thorsten-

Thanks for all of your help with this issue. I think that we can close this out now.

After continuing to dig into this problem some more, I eventually figured out that the problem was caused by how our userspace tooling was interacting with ipset save / restore and the new(ish) initval option that is included in saves / restores.

Specifically, kube-router runs an ipset save, processes the saved ipset data, massages it a bit based upon the state of the Kubernetes cluster, and then runs that data back through ipset restore. During this process, we create unique temporary sets based upon unique sets of options, rotate the new endpoints into the temporary sets, and then use swap instructions in order to minimize impact on the data path.

However, because we were only massaging options that were recognized and important to us, initval was left alone and blindly copied into our option strings for new and temporary sets. This caused initval to be used incorrectly (i.e. the same initval ID was used for multiple sets). I'm not 100% sure about all of the consequences of this, but it seems to have objectively caused some performance issues.

Additionally, since initval is intentionally unique between sets, this caused us to create many more temporary sets for swapping than was actually necessary. This caused obvious performance issues as restores now contained more instructions than they needed to.
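As a sketch of that failure mode (the set names and initval values below are made up): on 5.13+ kernels every set in `ipset save` output carries its own `initval`, so tooling that keys temporary sets on the verbatim option string sees every set's options as unique. Stripping the attribute before reusing the option strings restores the pre-5.13 behavior:

```shell
# Hypothetical 'ipset save' output on a 5.13+ kernel; note the unique
# per-set initval that older kernels did not emit.
cat > saved.txt <<'EOF'
create KUBE-DST-A hash:ip family inet hashsize 1024 maxelem 65536 initval 0x1f2e3d4c
create KUBE-DST-B hash:ip family inet hashsize 1024 maxelem 65536 initval 0x9a8b7c6d
EOF

# Dropping initval makes the two sets' option strings identical again,
# so one temporary set per distinct option set suffices.
sed 's/ initval 0x[0-9a-f]*//' saved.txt
```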

Reverting the commit removed the issue we saw because it removed the portion of the kernel that generated the initvals, which caused ipset save to revert to its previous (5.10 and below) behavior. Additionally, applying your patches had the same effect because, while I believed I was updating our userspace ipset tools in tandem, I found that the headers were actually being copied in from an alternate location and were still the vanilla headers. This meant that while the kernel was generating initval values, userspace actually recognized them as IPSET_ATTR_GC values, which were then unused.

This was a very long process to arrive at such a simple realization: the ipset save / restore format had changed. I apologize for the noise.

-Aaron

From: U'ren, Aaron <Aaron.U'ren@sony.com>
Date: Friday, July 8, 2022 at 3:08 PM
To: Jozsef Kadlecsik <kadlec@netfilter.org>, Jakub Kicinski <kuba@kernel.org>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>, McLean, Patrick <Patrick.Mclean@sony.com>, Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>
Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
Jozsef / Jakub-

Given your latest email and the fact that just adding back in IPSET_ATTR_GC doesn't shed any light on the issue, I wanted to spend a lot more time testing. I also wanted to try to provide as much context for this issue as possible.

I think that the iptables slowness is just a symptom, not the cause, of the issue. After spending a lot more time with it, I can see that iptables only runs slowly when an existing "ipset restore" process is being run by kube-router simultaneously. Given the other information that you've provided, my hunch is that iptables slows down when ipset restore is running because they are both vying for the same mutex? Anyway, I think troubleshooting it from the direction of the iptables slowness is likely the wrong path to go down.

The true problem seems to be that when IPSET_ATTR_GC is not included, somehow nodes are able to get into a state where "ipset restore" goes from completing in less than a tenth of a second to taking 30 seconds to a minute to complete. The hard part is that I still don't know what causes a node to enter this state.

I have a Kubernetes cluster of about 7 nodes that I can reliably get into this state, but I have yet to be able to reproduce it consistently anywhere else. Other clusters will randomly exhibit the issue if IPSET_ATTR_GC is left out of the kernel, but not consistently. Since the email about 2 weeks ago in which we found the commit, we have also been running 6 clusters of 9+ nodes with IPSET_ATTR_GC enabled and have not had any issues.

Since we have a custom kernel configuration, I have also tried using the vanilla Ubuntu kernel configuration (taken from 5.15.0-40-generic) as well just to ensure that we didn't have some errant configuration option enabled. However, this also reliably reproduced the issue when IPSET_ATTR_GC was removed and just as reliably removed the issue when IPSET_ATTR_GC was added back in.

I have also verified that neither ipset, iptables, nor any of their dependent libraries has references to IPSET_ATTR_GC, going as far as to remove it from the ipset header file (https://git.netfilter.org/iptables/tree/include/linux/netfilter/ipset/ip_set.h#n86) and rebuild it (and all of the libraries and other tools) from scratch as a hail mary. No changes to user space seem to have an effect on this issue.

One other thing that I've done to help track down the issue is to add debug options to kube-router so that it outputs the file that it feeds into "ipset restore -exist". With this file, on nodes affected by this issue, I can reliably reproduce the issue by calling "ipset restore -exist <file" and see that it takes 30+ seconds to execute.

In the hope that it sheds some light and gives you some more context, I'm going to send you and Jakub a copy of the strace and the ipset file that was used, separately from this email.

At this point, I'm not sure how to proceed other than with the files that I'll be sending you. I'm highly confident that somehow the removal of IPSET_ATTR_GC is causing the issues that we see. At this point I've added and removed the options almost 20 times and done reboots across our cluster. Anytime that variable is missing, we see several nodes exhibit the performance issues immediately. Any time the variable is present, we see no nodes exhibit the performance issues.

Looking forward to hearing back from you and getting to the bottom of this very bizarre issue.

-Aaron

From: Jozsef Kadlecsik <kadlec@netfilter.org>
Date: Saturday, July 2, 2022 at 12:41 PM
To: Jakub Kicinski <kuba@kernel.org>
Cc: U'ren, Aaron <Aaron.U'ren@sony.com>, Thorsten Leemhuis <regressions@leemhuis.info>, McLean, Patrick <Patrick.Mclean@sony.com>, Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>
Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
Hi,

On Thu, 30 Jun 2022, Jakub Kicinski wrote:

> Sounds like you're pretty close to figuring this out! Can you check 
> if the user space is intentionally setting IPSET_ATTR_INITVAL?
> Either that or IPSET_ATTR_GC was not as "unused" as initially thought.

IPSET_ATTR_GC was really unused. It was an old remnant from the time when 
ipset userspace-kernel communication went through set/getsockopt. However, 
when it was migrated to netlink, only the symbol was kept; it was not 
used by either the userspace tool or the kernel.

Aaron, could you send me how to reproduce the issue? I have no idea how 
that patch could be the reason. Setting/getting/using IPSET_ATTR_INITVAL 
is totally independent from listing iptables rules. But if you have got a 
reproducer then I can dig into it.

Best regards,
Jozsef

> Testing something like this could be a useful data point:
> 
> diff --git a/include/uapi/linux/netfilter/ipset/ip_set.h b/include/uapi/linux/netfilter/ipset/ip_set.h
> index 6397d75899bc..7caf9b53d2a7 100644
> --- a/include/uapi/linux/netfilter/ipset/ip_set.h
> +++ b/include/uapi/linux/netfilter/ipset/ip_set.h
> @@ -92,7 +92,7 @@ enum {
>        /* Reserve empty slots */
>        IPSET_ATTR_CADT_MAX = 16,
>        /* Create-only specific attributes */
> -     IPSET_ATTR_INITVAL,     /* was unused IPSET_ATTR_GC */
> +     IPSET_ATTR_GC,
>        IPSET_ATTR_HASHSIZE,
>        IPSET_ATTR_MAXELEM,
>        IPSET_ATTR_NETMASK,
> @@ -104,6 +104,8 @@ enum {
>        IPSET_ATTR_REFERENCES,
>        IPSET_ATTR_MEMSIZE,
>  
> +     IPSET_ATTR_INITVAL,
> +
>        __IPSET_ATTR_CREATE_MAX,
>  };
>  #define IPSET_ATTR_CREATE_MAX        (__IPSET_ATTR_CREATE_MAX - 1)
> 
> 
> On Thu, 30 Jun 2022 14:59:14 +0000 U'ren, Aaron wrote:
> > Thorsten / Jozsef -
> > 
> > Thanks for continuing to follow up! I'm sorry that this has moved so slowly; it has taken us a while to find the time to fully track this issue down. However, I think that we have figured out enough to make some more forward progress on this issue.
> > 
> > Jozsef, thanks for your insight into what is happening between those system calls. In regards to your question about wait/wound mutex debugging possibly being enabled, I can tell you that we definitely don't have that enabled on any of our regular machines. While we were debugging we did turn on quite a few debug options to help us try and track this issue down and it is very possible that the strace that was taken that started off this email was taken on a machine that did have that debug option enabled. Either way though, the root issue occurs on hosts that definitely do not have wait/wound mutex debugging enabled.
> > 
> > The good news is that we finally got one of our development environments into a state where we could reliably reproduce the performance issue across reboots. This was a win because it meant that we were able to do a full bisect of the kernel and were able to tell relatively quickly whether or not the issue was present in the test kernels.
> > 
> > After bisecting for 3 days, I have been able to narrow it down to a single commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3976ca101990ca11ddf51f38bec7b86c19d0ca6f (netfilter: ipset: Expose the initval hash parameter to userspace)
> > 
> > I'm at a bit of a loss as to why this would cause such severe performance regressions, but I've proved it out multiple times now. I've even checked out a fresh version of the 5.15 kernel that we've been deploying with just this single commit reverted and found that the performance problems are completely resolved.
> > 
> > I'm hoping that maybe Jozsef will have some more insight into why this seemingly innocuous commit causes such large performance issues for us? If you have any additional patches or other things that you would like us to test I will try to leave our environment in its current state for the next couple of days so that we can do so.
> > 
> > -Aaron
> > 
> > From: Thorsten Leemhuis <regressions@leemhuis.info>
> > Date: Monday, June 20, 2022 at 2:16 AM
> > To: U'ren, Aaron <Aaron.U'ren@sony.com>
> > Cc: McLean, Patrick <Patrick.Mclean@sony.com>, Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>, Jozsef Kadlecsik <kadlec@netfilter.org>
> > Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
> > On 31.05.22 09:41, Jozsef Kadlecsik wrote:
> > > On Mon, 30 May 2022, Thorsten Leemhuis wrote:  
> > >> On 04.05.22 21:37, U'ren, Aaron wrote:  
> >  [...]  
> > > 
> > Every set lookup behind "iptables" needs two getsockopt() calls: you can 
> > see them in the strace logs. The first one checks the internal protocol 
> > > number of ipset and the second one verifies/gets the processed set (it's 
> > > an extension to iptables and therefore there's no internal state to save 
> > > the protocol version number).  
> > 
> > Hi Aaron! Did any of the suggestions from Jozsef help to track down the
> > root cause? I have this issue on the list of tracked regressions and
> > wonder what the status is. Or can I mark this as resolved?
> > 
> > Side note: this is not a "something breaks" regression and it seems to
> > progress slowly, so I'm putting it on the backburner:
> > 
> > #regzbot backburner: performance regression where the culprit is hard to
> > track down
> > 
> > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> > 
> > P.S.: As the Linux kernel's regression tracker I deal with a lot of
> > reports and sometimes miss something important when writing mails like
> > this. If that's the case here, don't hesitate to tell me in a public
> > reply, it's in everyone's interest to set the public record straight.
> > 
> >  [...]  
> > > 
> > > In your strace log
> > > 
> > > 0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
> > > 0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.1$
> > > 0.109456 close(4)                  = 0 <0.000022>
> > > 
> > > the only things which happen in the second sockopt function are to lock 
> > > the NFNL_SUBSYS_IPSET mutex, walk the array of the sets, compare the 
> > > setname, save the result in the case of a match and unlock the mutex. 
> > > Nothing complicated, no deep, multi-level function calls. Just a few 
> > > lines of code which haven't changed.
> > > 
> > > The only thing which can slow down the processing is the mutex handling. 
> > > Don't you have accidentally wait/wound mutex debugging enabled in the 
> > > kernel? If not, then bisecting the mutex related patches might help.
> > > 
> > > You wrote that flushing tables or ipsets didn't seem to help. That 
> > > literally meant flushing i.e. the sets were emptied but not destroyed? Did 
> > > you try both destroying and flushing?
> > >   
> > >> Jozsef, I still have this issue on my list of tracked regressions and it
> > >> looks like nothing happens since above mail (or did I miss it?). Could
> > >> you maybe provide some guidance to Aaron to get us all closer to the
> > >> root of the problem?  
> > > 
> > > I really hope it's an accidentally enabled debugging option in the kernel. 
> > > Otherwise bisecting could help to uncover the issue.
> > > 
> > > Best regards,
> > > Jozsef
> > >   
> > >> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> > >> reports and sometimes miss something important when writing mails like
> > >> this. If that's the case here, don't hesitate to tell me in a public
> > >> reply, it's in everyone's interest to set the public record straight.
> 

-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Intermittent performance regression related to ipset between 5.10 and 5.15
  2022-07-29 20:21                         ` U'ren, Aaron
@ 2022-07-30  1:41                           ` Jakub Kicinski
  2022-07-30 10:43                           ` Jozsef Kadlecsik
  1 sibling, 0 replies; 20+ messages in thread
From: Jakub Kicinski @ 2022-07-30  1:41 UTC (permalink / raw)
  To: U'ren, Aaron
  Cc: Jozsef Kadlecsik, Thorsten Leemhuis, McLean, Patrick,
	Pablo Neira Ayuso, netfilter-devel, Brown, Russell, Rueger,
	Manuel, linux-kernel, regressions, Florian Westphal, netdev

On Fri, 29 Jul 2022 20:21:17 +0000 U'ren, Aaron wrote:
> Jozsef / Jakub / Thorsten-
> 
> Thanks for all of your help with this issue. I think that we can close this out now.
> 
> After continuing to dig into this problem some more, I eventually figured out that the problem was caused by how our userspace tooling was interacting with ipset save / restore and the new (ish) initval option that is included in saves / restores.
> 
> Specifically, kube-router runs an ipset save then processes the saved ipset data, massages it a bit based upon the state from the Kubernetes cluster, and then runs that data back through ipset restore. During this time, we create unique temporary sets based upon unique sets of options and then rotate the new endpoints into the temporary set and then use swap instructions in order to minimize impact to the data path.
> 
> However, because we were only massaging options that were recognized and important to us, initval was left alone and blindly copied into our option strings for new and temporary sets. This caused initval to be used incorrectly (i.e. the same initval ID was used for multiple sets). I'm not 100% sure about all of the consequences of this, but it seems to have objectively caused some performance issues.
> 
> Additionally, since initval is intentionally unique between sets, this caused us to create many more temporary sets for swapping than was actually necessary. This caused obvious performance issues as restores now contained more instructions than they needed to.
> 
> Reverting the commit removed the issue we saw because it removed the portion of the kernel that generated the initvals which caused ipset save to revert to its previous (5.10 and below) functionality. Additionally, applying your patches also had the same impact because while I believed I was updating our userspace ipset tools in tandem, I found that the headers were actually being copied in from an alternate location and were still using the vanilla headers. This meant that while the kernel was generating initval values, the userspace actually recognized it as IPSET_ATTR_GC values which were then unused.
> 
> This was a very long process to come to such a simple recognition about the ipset save / restore format having been changed. I apologize for the noise.

Thanks for working it out and explaining the root cause :)
I'm probably going to get the syntax wrong, but here goes nothing:

#regzbot invalid: user space mis-configuration

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Intermittent performance regression related to ipset between 5.10 and 5.15
  2022-07-29 20:21                         ` U'ren, Aaron
  2022-07-30  1:41                           ` Jakub Kicinski
@ 2022-07-30 10:43                           ` Jozsef Kadlecsik
  2022-08-01 23:38                             ` U'ren, Aaron
  1 sibling, 1 reply; 20+ messages in thread
From: Jozsef Kadlecsik @ 2022-07-30 10:43 UTC (permalink / raw)
  To: U'ren, Aaron
  Cc: Jakub Kicinski, Thorsten Leemhuis, McLean, Patrick,
	Pablo Neira Ayuso, netfilter-devel, Brown, Russell, Rueger,
	Manuel, linux-kernel, regressions, Florian Westphal, netdev

[-- Attachment #1: Type: text/plain, Size: 17610 bytes --]

Hi Aaron,

On Fri, 29 Jul 2022, U'ren, Aaron wrote:

> Thanks for all of your help with this issue. I think that we can close 
> this out now.
> 
> After continuing to dig into this problem some more, I eventually 
> figured out that the problem was caused by how our userspace 
> tooling was interacting with ipset save / restore and the new (ish) 
> initval option that is included in saves / restores.
> 
> Specifically, kube-router runs an ipset save then processes the saved 
> ipset data, massages it a bit based upon the state from the Kubernetes 
> cluster, and then runs that data back through ipset restore. During this 
> time, we create unique temporary sets based upon unique sets of options 
> and then rotate the new endpoints into the temporary set and then use 
> swap instructions in order to minimize impact to the data path.
> 
> However, because we were only massaging options that were recognized and 
> important to us, initval was left alone and blindly copied into our 
> option strings for new and temporary sets. This caused initval to be 
> used incorrectly (i.e. the same initval ID was used for multiple sets). 
> I'm not 100% sure about all of the consequences of this, but it seems to 
> have objectively caused some performance issues.

It's hard to say. initval is actually the arbitrary initial value for the 
jhash() macro in the kernel. The same initval is used for every element in 
a hash, so it's tied to the hash.

Earlier, initval was a totally hidden internal parameter in ipset. That 
meant that save/restore might fail to recreate a set identical to the saved 
one: the hash size could be different; the list of 
clashing elements could be different. Therefore I added the ability to 
save and restore initval as well.
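
To illustrate the format change behind that commit: on kernels with it, "ipset save" emits the seed as a new "initval" option in each create line. The set name and value below are invented for illustration, not taken from a real host:

```shell
# Hypothetical "ipset save" output on a kernel >= 5.15: the per-set jhash
# seed now appears as an "initval" option (name and value made up here).
cat <<'EOF'
create KUBE-DST-EXAMPLE hash:ip family inet hashsize 1024 maxelem 65536 initval 0x81726354
add KUBE-DST-EXAMPLE 10.0.0.1
EOF
```

Tooling that round-trips this text through "ipset restore" will carry the initval option along unless it deliberately handles it.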

> Additionally, since initval is intentionally unique between sets, this 
> caused us to create many more temporary sets for swapping than was 
> actually necessary. This caused obvious performance issues as restores 
> now contained more instructions than they needed to.
> 
> Reverting the commit removed the issue we saw because it removed the 
> portion of the kernel that generated the initvals which caused ipset 
> save to revert to its previous (5.10 and below) functionality. 
> Additionally, applying your patches also had the same impact because 
> while I believed I was updating our userspace ipset tools in tandem, I 
> found that the headers were actually being copied in from an alternate 
> location and were still using the vanilla headers. This meant that while 
> the kernel was generating initval values, the userspace actually 
> recognized it as IPSET_ATTR_GC values which were then unused.
> 
> This was a very long process to come to such a simple recognition about 
> the ipset save / restore format having been changed. I apologize for the 
> noise.

I simply could not imagine a scenario where exposing the initval value 
could result in any kind of regression...

Just to make sure I understood completely: what is your solution for the 
problem, then? Working with a patched kernel, which removes passing 
initval to userspace? Patched ipset tool, which does not send it? Modified 
tooling, which ignores the initval parameter?

I really appreciate your hard work!

Best regards,
Jozsef
 
> From: U'ren, Aaron <Aaron.U'ren@sony.com>
> Date: Friday, July 8, 2022 at 3:08 PM
> To: Jozsef Kadlecsik <kadlec@netfilter.org>, Jakub Kicinski <kuba@kernel.org>
> Cc: Thorsten Leemhuis <regressions@leemhuis.info>, McLean, Patrick <Patrick.Mclean@sony.com>, Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>
> Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
> Jozsef / Jakub-
> 
> Given your latest email and the fact that just adding back in IPSET_ATTR_GC doesn't shed any light on the issue I wanted to spend a lot more time testing. Also, I wanted to try to provide as much context for this issue as possible.
> 
> I think that the iptables slowness is just a symptom not the cause of the issue. After spending a lot more time with it, I can see that iptables only runs slowly when an existing "ipset restore" process is being run by kube-router simultaneously. Given the other information that you've provided, my hunch is that iptables slows down when ipset restore is running because they are both vying for the same mutex? Anyway, I think troubleshooting it from the direction of iptables slowness is likely the wrong path to go down.
> 
> The true problem seems to be that when IPSET_ATTR_GC is not included, somehow nodes are able to get into a state where "ipset restore" goes from completing in less than a 10th of a second, to taking 30 seconds to a minute to complete. The hard part is that I still don't know what causes a node to enter this state.
> 
> I have a Kubernetes cluster of about 7 nodes that I can reliably get into this state, but I have yet to be able to reproduce it consistently anywhere else. Other clusters will randomly exhibit the issue if IPSET_ATTR_GC is left out of the kernel, but not consistently. Since the email where we found the commit about 2 weeks ago, we have also been running 6 clusters of 9+ nodes with IPSET_ATTR_GC enabled and have not had any issues.
> 
> Since we have a custom kernel configuration, I have also tried using the vanilla Ubuntu kernel configuration (taken from 5.15.0-40-generic) as well just to ensure that we didn't have some errant configuration option enabled. However, this also reliably reproduced the issue when IPSET_ATTR_GC was removed and just as reliably removed the issue when IPSET_ATTR_GC was added back in.
> 
> I have also verified that neither ipset, iptables, or any of its dependent libraries have references to IPSET_ATTR_GC, going as far as to remove it from the ipset header file (https://git.netfilter.org/iptables/tree/include/linux/netfilter/ipset/ip_set.h#n86) and rebuild it (and all of the libraries and other tools) from scratch just as a hail mary. No changes to user-space seem to have an effect on this issue.
> 
> One other thing that I've done to help track down the issue is to add debug options to kube-router so that it outputs the file that it feeds into "ipset restore -exist". With this file, on nodes affected by this issue, I can reliably reproduce the issue by calling "ipset restore -exist <file" and see that it takes 30+ seconds to execute.
> 
> In a hope that maybe it sheds some light and gives you some more context, I'm going to be sending you and Jakub a copy of the strace and the ipset file that was used separately from this email.
> 
> At this point, I'm not sure how to proceed other than with the files that I'll be sending you. I'm highly confident that somehow the removal of IPSET_ATTR_GC is causing the issues that we see. At this point I've added and removed the options almost 20 times and done reboots across our cluster. Anytime that variable is missing, we see several nodes exhibit the performance issues immediately. Any time the variable is present, we see no nodes exhibit the performance issues.
> 
> Looking forward to hearing back from you and getting to the bottom of this very bizarre issue.
> 
> -Aaron
> 
> From: Jozsef Kadlecsik <kadlec@netfilter.org>
> Date: Saturday, July 2, 2022 at 12:41 PM
> To: Jakub Kicinski <kuba@kernel.org>
> Cc: U'ren, Aaron <Aaron.U'ren@sony.com>, Thorsten Leemhuis <regressions@leemhuis.info>, McLean, Patrick <Patrick.Mclean@sony.com>, Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>
> Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
> Hi,
> 
> On Thu, 30 Jun 2022, Jakub Kicinski wrote:
> 
> > Sounds like you're pretty close to figuring this out! Can you check 
> > if the user space is intentionally setting IPSET_ATTR_INITVAL?
> > Either that or IPSET_ATTR_GC was not as "unused" as initially thought.
> 
> IPSET_ATTR_GC was really unused. It was an old remnant from the time when 
> ipset userspace-kernel communication was through set/getsockopt. However, 
> when it was migrated to netlink, just the symbol was kept but it was not 
> used either with the userspace tool or the kernel.
> 
> Aaron, could you send me how to reproduce the issue? I have no idea how 
> that patch could be the reason. Setting/getting/using IPSET_ATTR_INITVAL 
> is totally independent from listing iptables rules. But if you have got a 
> reproducer then I can dig into it.
> 
> Best regards,
> Jozsef
> 
> > Testing something like this could be a useful data point:
> > 
> > diff --git a/include/uapi/linux/netfilter/ipset/ip_set.h b/include/uapi/linux/netfilter/ipset/ip_set.h
> > index 6397d75899bc..7caf9b53d2a7 100644
> > --- a/include/uapi/linux/netfilter/ipset/ip_set.h
> > +++ b/include/uapi/linux/netfilter/ipset/ip_set.h
> > @@ -92,7 +92,7 @@ enum {
> >        /* Reserve empty slots */
> >        IPSET_ATTR_CADT_MAX = 16,
> >        /* Create-only specific attributes */
> > -     IPSET_ATTR_INITVAL,     /* was unused IPSET_ATTR_GC */
> > +     IPSET_ATTR_GC,
> >        IPSET_ATTR_HASHSIZE,
> >        IPSET_ATTR_MAXELEM,
> >        IPSET_ATTR_NETMASK,
> > @@ -104,6 +104,8 @@ enum {
> >        IPSET_ATTR_REFERENCES,
> >        IPSET_ATTR_MEMSIZE,
> >  
> > +     IPSET_ATTR_INITVAL,
> > +
> >        __IPSET_ATTR_CREATE_MAX,
> >  };
> >  #define IPSET_ATTR_CREATE_MAX        (__IPSET_ATTR_CREATE_MAX - 1)
> > 
> > 
> > On Thu, 30 Jun 2022 14:59:14 +0000 U'ren, Aaron wrote:
> > > Thorsten / Jozsef -
> > > 
> > > Thanks for continuing to follow up! I'm sorry that this has moved so slowly; it has taken us a bit to find the time to fully track this issue down. However, I think that we have figured out enough to make some more forward progress on this issue.
> > > 
> > > Jozsef, thanks for your insight into what is happening between those system calls. In regard to your question about wait/wound mutex debugging possibly being enabled, I can tell you that we definitely don't have that enabled on any of our regular machines. While we were debugging we did turn on quite a few debug options to help us try and track this issue down, and it is very possible that the strace that started off this email was taken on a machine that did have that debug option enabled. Either way, though, the root issue occurs on hosts that definitely do not have wait/wound mutex debugging enabled.
> > > 
> > > The good news is that we finally got one of our development environments into a state where we could reliably reproduce the performance issue across reboots. This was a win because it meant that we were able to do a full bisect of the kernel and were able to tell relatively quickly whether or not the issue was present in the test kernels.
> > > 
> > > After bisecting for 3 days, I have been able to narrow it down to a single commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3976ca101990ca11ddf51f38bec7b86c19d0ca6f (netfilter: ipset: Expose the initval hash parameter to userspace)
> > > 
> > > I'm at a bit of a loss as to why this would cause such severe performance regressions, but I've proved it out multiple times now. I've even checked out a fresh version of the 5.15 kernel that we've been deploying with just this single commit reverted and found that the performance problems are completely resolved.
> > > 
> > > I'm hoping that maybe Jozsef will have some more insight into why this seemingly innocuous commit causes such large performance issues for us? If you have any additional patches or other things that you would like us to test I will try to leave our environment in its current state for the next couple of days so that we can do so.
> > > 
> > > -Aaron
> > > 
> > > From: Thorsten Leemhuis <regressions@leemhuis.info>
> > > Date: Monday, June 20, 2022 at 2:16 AM
> > > To: U'ren, Aaron <Aaron.U'ren@sony.com>
> > > Cc: McLean, Patrick <Patrick.Mclean@sony.com>, Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>, Jozsef Kadlecsik <kadlec@netfilter.org>
> > > Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
> > > On 31.05.22 09:41, Jozsef Kadlecsik wrote:
> > > > On Mon, 30 May 2022, Thorsten Leemhuis wrote:  
> > > >> On 04.05.22 21:37, U'ren, Aaron wrote:  
> > >  [...]  
> > > > 
> > > > Every set lookup behind "iptables" needs two getsockopt() calls: you can 
> > > > see them in the strace logs. The first one checks the internal protocol 
> > > > number of ipset and the second one verifies/gets the processed set (it's 
> > > > an extension to iptables and therefore there's no internal state to save 
> > > > the protocol version number).  
> > > 
> > > Hi Aaron! Did any of the suggestions from Jozsef help to track down the
> > > root cause? I have this issue on the list of tracked regressions and
> > > wonder what the status is. Or can I mark this as resolved?
> > > 
> > > Side note: this is not a "something breaks" regression and it seems to
> > > progress slowly, so I'm putting it on the backburner:
> > > 
> > > #regzbot backburner: performance regression where the culprit is hard to
> > > track down
> > > 
> > > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> > > 
> > > P.S.: As the Linux kernel's regression tracker I deal with a lot of
> > > reports and sometimes miss something important when writing mails like
> > > this. If that's the case here, don't hesitate to tell me in a public
> > > reply, it's in everyone's interest to set the public record straight.
> > > 
> > >  [...]  
> > > > 
> > > > In your strace log
> > > > 
> > > > 0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
> > > > 0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.1$
> > > > 0.109456 close(4)                  = 0 <0.000022>
> > > > 
> > > > the only things which happen in the second sockopt function are to lock 
> > > > the NFNL_SUBSYS_IPSET mutex, walk the array of the sets, compare the 
> > > > setname, save the result in the case of a match and unlock the mutex. 
> > > > Nothing complicated, no deep, multi-level function calls. Just a few 
> > > > lines of code which haven't changed.
> > > > 
> > > > The only thing which can slow down the processing is the mutex handling. 
> > > > Don't you have accidentally wait/wound mutex debugging enabled in the 
> > > > kernel? If not, then bisecting the mutex related patches might help.
> > > > 
> > > > You wrote that flushing tables or ipsets didn't seem to help. That 
> > > > literally meant flushing i.e. the sets were emptied but not destroyed? Did 
> > > > you try both destroying and flushing?
> > > >   
> > > >> Jozsef, I still have this issue on my list of tracked regressions and it
> > > >> looks like nothing happens since above mail (or did I miss it?). Could
> > > >> you maybe provide some guidance to Aaron to get us all closer to the
> > > >> root of the problem?  
> > > > 
> > > > I really hope it's an accidentally enabled debugging option in the kernel. 
> > > > Otherwise bisecting could help to uncover the issue.
> > > > 
> > > > Best regards,
> > > > Jozsef
> > > >   
> > > >> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> > > >> reports and sometimes miss something important when writing mails like
> > > >> this. If that's the case here, don't hesitate to tell me in a public
> > > >> reply, it's in everyone's interest to set the public record straight.
> > 
> 
> -
> E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
> PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
> Address : Wigner Research Centre for Physics
>           H-1525 Budapest 114, POB. 49, Hungary
> 

-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Intermittent performance regression related to ipset between 5.10 and 5.15
  2022-07-30 10:43                           ` Jozsef Kadlecsik
@ 2022-08-01 23:38                             ` U'ren, Aaron
  2022-08-04  6:28                               ` Jozsef Kadlecsik
  0 siblings, 1 reply; 20+ messages in thread
From: U'ren, Aaron @ 2022-08-01 23:38 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Jakub Kicinski, Thorsten Leemhuis, McLean, Patrick,
	Pablo Neira Ayuso, netfilter-devel, Brown, Russell, Rueger,
	Manuel, linux-kernel, regressions, Florian Westphal, netdev

Jozsef-

Our end solution was to change kube-router to ignore initval when it parsed the "ipset save" output (https://github.com/cloudnativelabs/kube-router/pull/1337/files?diff=unified&w=0). This means that initval is never passed into ipset restore either. This was essentially the same functionality that we had before initval was introduced to the userspace.
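
kube-router's actual fix is in its Go parsing code, but its effect can be sketched as a one-liner that drops the initval token from saved set options before they are reused. The input line below is invented for illustration:

```shell
# Strip the "initval 0x..." token from (simulated) "ipset save" output,
# approximating what kube-router now does while parsing; input is made up.
printf 'create KUBE-DST-EXAMPLE hash:ip family inet hashsize 1024 maxelem 65536 initval 0x81726354\n' \
  | sed 's/ initval 0x[0-9a-fA-F]*//'
# -> create KUBE-DST-EXAMPLE hash:ip family inet hashsize 1024 maxelem 65536
```

With the token stripped, sets that differ only in initval no longer look like sets with distinct option strings, so no extra temporary sets get created for them.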

You would obviously know better than I would, but if I understand your comment below correctly, I believe that because of how we work with ipset restore, we don't actually need to use the initval parameter. When we work with ipsets in kube-router, we copy the entries into a temporary set that is unique based upon that set's options, then swap it into its final name, and then flush the temporary set afterward.

Because of this approach to ipset logic, I believe that it should mitigate any potential clashing that might happen based upon name with changed options (like hash size), or clashing elements.
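
The copy/swap/flush cycle described above can be sketched as the text fed to "ipset restore"; the set names and addresses here are invented for illustration and are not kube-router's actual output:

```shell
# Sketch of one sync cycle's restore script: build a temporary set, swap it
# with the live set, then flush the temporary one (names/addresses made up).
cat <<'EOF'
create TMP-EXAMPLE hash:ip family inet hashsize 1024 maxelem 65536
add TMP-EXAMPLE 10.0.0.2
add TMP-EXAMPLE 10.0.0.3
swap TMP-EXAMPLE KUBE-DST-EXAMPLE
flush TMP-EXAMPLE
EOF
```

A script like this would be fed to "ipset restore -exist" (as root); the swap instruction exchanges the two sets' contents in one step, which is what keeps the impact on the data path minimal.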

However, if you believe that this is not the case, we could look into finding some correct way to save the initval and pass it back into the restore.

- Aaron


From: Jozsef Kadlecsik <kadlec@netfilter.org>
Date: Saturday, July 30, 2022 at 5:44 AM
To: U'ren, Aaron <Aaron.U'ren@sony.com>
Cc: Jakub Kicinski <kuba@kernel.org>, Thorsten Leemhuis <regressions@leemhuis.info>, McLean, Patrick <Patrick.Mclean@sony.com>, Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>
Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
Hi Aaron,

On Fri, 29 Jul 2022, U'ren, Aaron wrote:

> Thanks for all of your help with this issue. I think that we can close 
> this out now.
> 
> After continuing to dig into this problem some more, I eventually 
> figured out that the problem was caused by how our userspace 
> tooling was interacting with ipset save / restore and the new (ish) 
> initval option that is included in saves / restores.
> 
> Specifically, kube-router runs an ipset save then processes the saved 
> ipset data, massages it a bit based upon the state from the Kubernetes 
> cluster, and then runs that data back through ipset restore. During this 
> time, we create unique temporary sets based upon unique sets of options 
> and then rotate the new endpoints into the temporary set and then use 
> swap instructions in order to minimize impact to the data path.
> 
> However, because we were only massaging options that were recognized and 
> important to us, initval was left alone and blindly copied into our 
> option strings for new and temporary sets. This caused initval to be 
> used incorrectly (i.e. the same initval ID was used for multiple sets). 
> I'm not 100% sure about all of the consequences of this, but it seems to 
> have objectively caused some performance issues.

It's hard to say. initval is actually the arbitrary initial value for the 
jhash() macro in the kernel. The same initval is used for every element in 
a hash, so it's tied to the hash.

Earlier, initval was a totally hidden internal parameter in ipset. That 
meant that save/restore might fail to recreate a set identical to the saved 
one: the hash size could be different; the list of 
clashing elements could be different. Therefore I added the ability to 
save and restore initval as well.

> Additionally, since initval is intentionally unique between sets, this 
> caused us to create many more temporary sets for swapping than was 
> actually necessary. This caused obvious performance issues as restores 
> now contained more instructions than they needed to.
> 
> Reverting the commit removed the issue we saw because it removed the 
> portion of the kernel that generated the initvals which caused ipset 
> save to revert to its previous (5.10 and below) functionality. 
> Additionally, applying your patches also had the same impact because 
> while I believed I was updating our userspace ipset tools in tandem, I 
> found that the headers were actually being copied in from an alternate 
> location and were still using the vanilla headers. This meant that while 
> the kernel was generating initval values, the userspace actually 
> recognized it as IPSET_ATTR_GC values which were then unused.
> 
> This was a very long process to come to such a simple recognition about 
> the ipset save / restore format having been changed. I apologize for the 
> noise.

I simply could not imagine a scenario where exposing the initval value 
could result in any kind of regression...

Just to make sure I understood completely: what is your solution for the 
problem, then? Working with a patched kernel, which removes passing 
initval to userspace? Patched ipset tool, which does not send it? Modified 
tooling, which ignores the initval parameter?

I really appreciate your hard work!

Best regards,
Jozsef
 
> From: U'ren, Aaron <Aaron.U'ren@sony.com>
> Date: Friday, July 8, 2022 at 3:08 PM
> To: Jozsef Kadlecsik <kadlec@netfilter.org>, Jakub Kicinski <kuba@kernel.org>
> Cc: Thorsten Leemhuis <regressions@leemhuis.info>, McLean, Patrick <Patrick.Mclean@sony.com>, Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>
> Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
> Jozsef / Jakub-
> 
> Given your latest email and the fact that just adding back in IPSET_ATTR_GC doesn't shed any light on the issue, I wanted to spend a lot more time testing. Also, I wanted to try to provide as much context for this issue as possible.
> 
> I think that the iptables slowness is just a symptom, not the cause, of the issue. After spending a lot more time with it, I can see that iptables only runs slowly when an existing "ipset restore" process is being run by kube-router simultaneously. Given the other information that you've provided, my hunch is that iptables slows down when ipset restore is running because they are both vying for the same mutex? Anyway, I think troubleshooting it from the direction of iptables slowness is likely the wrong path to go down.
> 
> The true problem seems to be that when IPSET_ATTR_GC is not included, somehow nodes are able to get into a state where "ipset restore" goes from completing in less than a tenth of a second to taking 30 seconds to a minute to complete. The hard part is that I still don't know what causes a node to enter this state.
> 
> I have a Kubernetes cluster of about 7 nodes that I can reliably get into this state, but I have yet to be able to reproduce it consistently anywhere else. Other clusters will randomly exhibit the issue if IPSET_ATTR_GC is left out of the kernel, but not consistently. Since the email where we found the commit about 2 weeks ago, we have also been running 6 clusters of 9+ nodes with IPSET_ATTR_GC enabled and have not had any issues.
> 
> Since we have a custom kernel configuration, I have also tried using the vanilla Ubuntu kernel configuration (taken from 5.15.0-40-generic) as well just to ensure that we didn't have some errant configuration option enabled. However, this also reliably reproduced the issue when IPSET_ATTR_GC was removed and just as reliably removed the issue when IPSET_ATTR_GC was added back in.
> 
> I have also verified that neither ipset, iptables, nor any of their dependent libraries have references to IPSET_ATTR_GC, going as far as to remove it from the ipset header file (https://git.netfilter.org/iptables/tree/include/linux/netfilter/ipset/ip_set.h#n86) and rebuild it (and all of the libraries and other tools) from scratch just as a Hail Mary. No changes to user space seem to have an effect on this issue.
> 
> One other thing that I've done to help track down the issue is to add debug options to kube-router so that it outputs the file that it feeds into "ipset restore -exist". With this file, on nodes affected by this issue, I can reliably reproduce the issue by calling "ipset restore -exist <file" and see that it takes 30+ seconds to execute.
> 
> In the hope that it sheds some light and gives you some more context, I'm going to send you and Jakub a copy of the strace and the ipset file that was used, separately from this email.
> 
> At this point, I'm not sure how to proceed other than with the files that I'll be sending you. I'm highly confident that somehow the removal of IPSET_ATTR_GC is causing the issues that we see. At this point I've added and removed the option almost 20 times and done reboots across our cluster. Any time that variable is missing, we see several nodes exhibit the performance issues immediately. Any time the variable is present, we see no nodes exhibit the performance issues.
> 
> Looking forward to hearing back from you and getting to the bottom of this very bizarre issue.
> 
> -Aaron
> 
> From: Jozsef Kadlecsik <kadlec@netfilter.org>
> Date: Saturday, July 2, 2022 at 12:41 PM
> To: Jakub Kicinski <kuba@kernel.org>
> Cc: U'ren, Aaron <Aaron.U'ren@sony.com>, Thorsten Leemhuis <regressions@leemhuis.info>, McLean, Patrick <Patrick.Mclean@sony.com>, Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>
> Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
> Hi,
> 
> On Thu, 30 Jun 2022, Jakub Kicinski wrote:
> 
> > Sounds like you're pretty close to figuring this out! Can you check 
> > if the user space is intentionally setting IPSET_ATTR_INITVAL?
> > Either that or IPSET_ATTR_GC was not as "unused" as initially thought.
> 
> IPSET_ATTR_GC was really unused. It was an old remnant from the time when 
> ipset userspace-kernel communication went through set/getsockopt. However, 
> when it was migrated to netlink, only the symbol was kept; it was not 
> used by either the userspace tool or the kernel.
> 
> Aaron, could you send me how to reproduce the issue? I have no idea how 
> that patch could be the reason. Setting/getting/using IPSET_ATTR_INITVAL 
> is totally independent from listing iptables rules. But if you have got a 
> reproducer then I can dig into it.
> 
> Best regards,
> Jozsef
> 
> > Testing something like this could be a useful data point:
> > 
> > diff --git a/include/uapi/linux/netfilter/ipset/ip_set.h b/include/uapi/linux/netfilter/ipset/ip_set.h
> > index 6397d75899bc..7caf9b53d2a7 100644
> > --- a/include/uapi/linux/netfilter/ipset/ip_set.h
> > +++ b/include/uapi/linux/netfilter/ipset/ip_set.h
> > @@ -92,7 +92,7 @@ enum {
> >        /* Reserve empty slots */
> >        IPSET_ATTR_CADT_MAX = 16,
> >        /* Create-only specific attributes */
> > -     IPSET_ATTR_INITVAL,     /* was unused IPSET_ATTR_GC */
> > +     IPSET_ATTR_GC,
> >        IPSET_ATTR_HASHSIZE,
> >        IPSET_ATTR_MAXELEM,
> >        IPSET_ATTR_NETMASK,
> > @@ -104,6 +104,8 @@ enum {
> >        IPSET_ATTR_REFERENCES,
> >        IPSET_ATTR_MEMSIZE,
> >  
> > +     IPSET_ATTR_INITVAL,
> > +
> >        __IPSET_ATTR_CREATE_MAX,
> >  };
> >  #define IPSET_ATTR_CREATE_MAX        (__IPSET_ATTR_CREATE_MAX - 1)
> > 
> > 
> > On Thu, 30 Jun 2022 14:59:14 +0000 U'ren, Aaron wrote:
> > > Thorsten / Jozsef -
> > > 
> > > Thanks for continuing to follow up! I'm sorry that this has moved so slowly; it has taken us a bit to find the time to fully track this issue down. However, I think that we have figured out enough to make some more forward progress on this issue.
> > > 
> > > Jozsef, thanks for your insight into what is happening between those system calls. Regarding your question about wait/wound mutex debugging possibly being enabled, I can tell you that we definitely don't have that enabled on any of our regular machines. While we were debugging, we did turn on quite a few debug options to help us track this issue down, and it is very possible that the strace that started off this email thread was taken on a machine that did have that debug option enabled. Either way, though, the root issue occurs on hosts that definitely do not have wait/wound mutex debugging enabled.
> > > 
> > > The good news is that we finally got one of our development environments into a state where we could reliably reproduce the performance issue across reboots. This was a win because it meant that we were able to do a full bisect of the kernel and were able to tell relatively quickly whether or not the issue was present in the test kernels.
> > > 
> > > After bisecting for 3 days, I have been able to narrow it down to a single commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3976ca101990ca11ddf51f38bec7b86c19d0ca6f (netfilter: ipset: Expose the initval hash parameter to userspace)
> > > 
> > > I'm at a bit of a loss as to why this would cause such severe performance regressions, but I've proved it out multiple times now. I've even checked out a fresh version of the 5.15 kernel that we've been deploying with just this single commit reverted and found that the performance problems are completely resolved.
> > > 
> > > I'm hoping that maybe Jozsef will have some more insight into why this seemingly innocuous commit causes such large performance issues for us? If you have any additional patches or other things that you would like us to test, I will try to leave our environment in its current state for the next couple of days so that we can do so.
> > > 
> > > -Aaron
> > > 
> > > From: Thorsten Leemhuis <regressions@leemhuis.info>
> > > Date: Monday, June 20, 2022 at 2:16 AM
> > > To: U'ren, Aaron <Aaron.U'ren@sony.com>
> > > Cc: McLean, Patrick <Patrick.Mclean@sony.com>, Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>, Jozsef Kadlecsik <kadlec@netfilter.org>
> > > Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
> > > On 31.05.22 09:41, Jozsef Kadlecsik wrote:
> > > > On Mon, 30 May 2022, Thorsten Leemhuis wrote:  
> > > >> On 04.05.22 21:37, U'ren, Aaron wrote:  
> > >  [...]  
> > > > 
> > > > Every set lookup behind "iptables" needs two getsockopt() calls: you can 
> > > > see them in the strace logs. The first one checks the internal protocol 
> > > > number of ipset and the second one verifies/gets the processed set (it's 
> > > > an extension to iptables and therefore there's no internal state to save 
> > > > the protocol version number).  
> > > 
> > > Hi Aaron! Did any of the suggestions from Jozsef help to track down the
> > > root cause? I have this issue on the list of tracked regressions and
> > > wonder what the status is. Or can I mark this as resolved?
> > > 
> > > Side note: this is not a "something breaks" regression, and it seems to
> > > progress slowly, so I'm putting it on the backburner:
> > > 
> > > #regzbot backburner: performance regression where the culprit is hard to
> > > track down
> > > 
> > > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> > > 
> > > P.S.: As the Linux kernel's regression tracker I deal with a lot of
> > > reports and sometimes miss something important when writing mails like
> > > this. If that's the case here, don't hesitate to tell me in a public
> > > reply, it's in everyone's interest to set the public record straight.
> > > 
> > >  [...]  
> > > > 
> > > > In your strace log
> > > > 
> > > > 0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
> > > > 0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.1$
> > > > 0.109456 close(4)                  = 0 <0.000022>
> > > > 
> > > > the only things which happen in the second sockopt function are to lock 
> > > > the NFNL_SUBSYS_IPSET mutex, walk the array of the sets, compare the 
> > > > setname, save the result in the case of a match and unlock the mutex. 
> > > > Nothing complicated, no deep, multi-level function calls. Just a few lines 
> > > > of code which haven't changed.
> > > > 
> > > > The only thing which can slow down the processing is the mutex handling. 
> > > > Don't you have accidentally wait/wound mutex debugging enabled in the 
> > > > kernel? If not, then bisecting the mutex related patches might help.
> > > > 
> > > > You wrote that flushing tables or ipsets didn't seem to help. That 
> > > > literally meant flushing, i.e. the sets were emptied but not destroyed? Did 
> > > > you try both destroying and flushing?
> > > >   
> > > >> Jozsef, I still have this issue on my list of tracked regressions and it
> > > >> looks like nothing happens since above mail (or did I miss it?). Could
> > > >> you maybe provide some guidance to Aaron to get us all closer to the
> > > >> root of the problem?  
> > > > 
> > > > I really hope it's an accidentally enabled debugging option in the kernel. 
> > > > Otherwise bisecting could help to uncover the issue.
> > > > 
> > > > Best regards,
> > > > Jozsef
> > > >   
> > > >> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> > > >> reports and sometimes miss something important when writing mails like
> > > >> this. If that's the case here, don't hesitate to tell me in a public
> > > >> reply, it's in everyone's interest to set the public record straight.
> > 
> 
> -
> E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
> PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
> Address : Wigner Research Centre for Physics
>           H-1525 Budapest 114, POB. 49, Hungary
> 

-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Intermittent performance regression related to ipset between 5.10 and 5.15
  2022-08-01 23:38                             ` U'ren, Aaron
@ 2022-08-04  6:28                               ` Jozsef Kadlecsik
  0 siblings, 0 replies; 20+ messages in thread
From: Jozsef Kadlecsik @ 2022-08-04  6:28 UTC (permalink / raw)
  To: U'ren, Aaron
  Cc: Jakub Kicinski, Thorsten Leemhuis, McLean, Patrick,
	Pablo Neira Ayuso, netfilter-devel, Brown, Russell, Rueger,
	Manuel, linux-kernel, regressions, Florian Westphal, netdev

[-- Attachment #1: Type: text/plain, Size: 20792 bytes --]

Hi Aaron,

On Mon, 1 Aug 2022, U'ren, Aaron wrote:

> Our end solution was to change kube-router to ignore initval when it 
> parsed the "ipset save" output 
> (https://github.com/cloudnativelabs/kube-router/pull/1337/files?diff=unified&w=0). 
> This means that initval is never passed into ipset restore either. This 
> was essentially the same functionality that we had before initval was 
> introduced to the userspace.
> 
> You would obviously know better than I would, but if I understand your 
> comment below correctly, I believe that because of how we work with 
> ipset restore, we don't actually need to use the initval parameter. When 
> we work with ipsets in kube-router, we copy the entries into a temporary 
> set that is unique based upon that set's options, then swap it into its 
> final name, and then flush the temporary set afterward.
> 
> Because of this approach to ipset logic, I believe that it should 
> mitigate any potential clashing that might happen based upon name with 
> changed options (like hash size), or clashing elements.
> 
> However, if you believe that this is not the case, we could look into 
> finding some correct way to save the initval and pass it back into the 
> restore.
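For illustration only, the option-filtering workaround Aaron describes could be sketched as below. kube-router itself is written in Go, and the function and set names here are hypothetical, not kube-router's actual code:

```python
import re

def strip_initval(save_output: str) -> str:
    """Drop the 'initval 0x...' option from the 'create' lines of
    `ipset save` output, so downstream tooling neither keys temporary
    sets on it nor copies it into `ipset restore` input."""
    cleaned = []
    for line in save_output.splitlines():
        if line.startswith("create "):
            line = re.sub(r" initval 0x[0-9a-fA-F]+", "", line)
        cleaned.append(line)
    return "\n".join(cleaned)
```

Feeding the cleaned text back through `ipset restore` then lets the kernel pick a fresh initval for each newly created set, matching the pre-5.12 behavior.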

That's a good solution for the problem; there's no need to do anything 
else. By the way, it's good to see that swapping is used as it was designed.

Best regards,
Jozsef
 
> From: Jozsef Kadlecsik <kadlec@netfilter.org>
> Date: Saturday, July 30, 2022 at 5:44 AM
> To: U'ren, Aaron <Aaron.U'ren@sony.com>
> Cc: Jakub Kicinski <kuba@kernel.org>, Thorsten Leemhuis <regressions@leemhuis.info>, McLean, Patrick <Patrick.Mclean@sony.com>, Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>
> Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
> Hi Aaron,
> 
> On Fri, 29 Jul 2022, U'ren, Aaron wrote:
> 
> > Thanks for all of your help with this issue. I think that we can close 
> > this out now.
> > 
> > After continuing to dig into this problem some more, I eventually 
> > figured out that the problem was caused because of how our userspace 
> > tooling was interacting with ipset save / restore and the new (ish) 
> > initval option that is included in saves / restores.
> > 
> > Specifically, kube-router runs an ipset save then processes the saved 
> > ipset data, massages it a bit based upon the state from the Kubernetes 
> > cluster, and then runs that data back through ipset restore. During this 
> > time, we create unique temporary sets based upon unique sets of options 
> > and then rotate in the new endpoints into the temporary set and then use 
> > swap instructions in order to minimize impact to the data path.
> > 
> > However, because we were only massaging options that were recognized and 
> > important to us, initval was left alone and blindly copied into our 
> > option strings for new and temporary sets. This caused initval to be 
> > used incorrectly (i.e. the same initval ID was used for multiple sets). 
> > I'm not 100% sure about all of the consequences of this, but it seems to 
> > have objectively caused some performance issues.
> 
> It's hard to say. initval is actually the arbitrary initial value for the 
> jhash() macro in the kernel. The same initval is used for every element in 
> a hash, so it's tied to the hash.
> 
> Earlier, initval was a totally hidden internal parameter in ipset. That 
> meant that save/restore could fail to create a restored set that was 
> identical to the saved one: the hash size could differ, and the list of 
> clashing elements could differ. Therefore I added the ability to 
> save and restore initval as well.
> 
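To illustrate why preserving initval matters for reproducing a set exactly: the seed decides which elements collide in the hash table, so the same elements hashed under a different seed can land in a different in-kernel layout. A toy sketch, using CRC32 purely as a stand-in for the kernel's jhash:

```python
from zlib import crc32

def bucket_layout(elements, initval, nbuckets=8):
    """Map each element to a bucket with a seeded hash. The seed
    (initval) determines which elements clash, just as jhash's
    initval does for a real hash:ip set."""
    layout = {}
    for el in elements:
        layout.setdefault(crc32(el.encode(), initval) % nbuckets, []).append(el)
    return layout

ips = ["10.244.%d.%d" % (i, j) for i in range(4) for j in range(4)]
# Same seed -> identical layout; that is what saving/restoring initval
# guarantees. A different seed will generally distribute the same
# elements differently, changing which ones clash.
assert bucket_layout(ips, 0x12345678) == bucket_layout(ips, 0x12345678)
```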
> > Additionally, since initval is intentionally unique between sets, this 
> > caused us to create many more temporary sets for swapping than was 
> > actually necessary. This caused obvious performance issues as restores 
> > now contained more instructions than they needed to.
> > 
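For illustration, one update cycle of the kind described above boils down to an `ipset restore` input of roughly this shape (set names hypothetical; `swap` requires that the target set already exist with the same type):

```
create TMP-3PMNE4IW hash:ip family inet hashsize 1024 maxelem 65536
add TMP-3PMNE4IW 10.244.1.5
add TMP-3PMNE4IW 10.244.2.9
swap TMP-3PMNE4IW KUBE-DST-EXAMPLE
flush TMP-3PMNE4IW
```

Blindly copying a saved `initval` into each temporary set's create line makes the option string unique per source set, which is what multiplied the number of temporary sets and restore instructions.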
> > Reverting the commit removed the issue we saw because it removed the 
> > portion of the kernel that generated the initvals which caused ipset 
> > save to revert to its previous (5.10 and below) functionality. 
> > Additionally, applying your patches also had the same impact because 
> > while I believed I was updating our userspace ipset tools in tandem, I 
> > found that the headers were actually being copied in from an alternate 
> > location and were still using the vanilla headers. This meant that while 
> > the kernel was generating initval values, the userspace actually 
> > recognized it as IPSET_ATTR_GC values which were then unused.
> > 
> > This was a very long process to come to such a simple recognition about 
> > the ipset save / restore format having been changed. I apologize for the 
> > noise.
> 
> I simply could not imagine a scenario where exposing the initval value 
> could result in any kind of regression...
> 
> Just to make sure I understood completely: what is your solution for the 
> problem, then? Working with a patched kernel, which removes passing 
> initval to userspace? Patched ipset tool, which does not send it? Modified 
> tooling, which ignores the initval parameter?
> 
> I really appreciate your hard work!
> 
> Best regards,
> Jozsef
>  
> > From: U'ren, Aaron <Aaron.U'ren@sony.com>
> > Date: Friday, July 8, 2022 at 3:08 PM
> > To: Jozsef Kadlecsik <kadlec@netfilter.org>, Jakub Kicinski <kuba@kernel.org>
> > Cc: Thorsten Leemhuis <regressions@leemhuis.info>, McLean, Patrick <Patrick.Mclean@sony.com>, Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>
> > Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
> > Jozsef / Jakub-
> > 
> > Given your latest email and the fact that just adding back in IPSET_ATTR_GC doesn't shed any light on the issue I wanted to spend a lot more time testing. Also, I wanted to try to provide as much context for this issue as possible.
> > 
> > I think that the iptables slowness is just a symptom not the cause of the issue. After spending a lot more time with it, I can see that iptables only runs slowly when an existing "ipset restore" process is being run by kube-router simultaneously. Given the other information that you've provided, my hunch is that iptables slows down when ipset restore is running because they are both vying for the same mutex? Anyway, I think troubleshooting it from the direction of iptables slowness is likely the wrong path to go down.
> > 
> > The true problem seems to be that when IPSET_ATTR_GC is not included, somehow nodes are able to get into a state where "ipset restore" goes from completing in less than a 10th of a second, to taking 30 seconds to a minute to complete. The hard part, is that I still don't know what causes a node to enter this state.
> > 
> > I have a Kubernetes cluster of about 7 nodes that I can reliably get into this state, but I have yet to be able to reproduce it consistently anywhere else. Other clusters will randomly exhibit the issue if IPSET_ATTR_GC is left out of the kernel, but not consistently. Since the email where we found the commit about 2 weeks ago, we have also been running 6 clusters of 9+ nodes with IPSET_ATTR_GC enabled and have not had any issues.
> > 
> > Since we have a custom kernel configuration, I have also tried using the vanilla Ubuntu kernel configuration (taken from 5.15.0-40-generic) as well just to ensure that we didn't have some errant configuration option enabled. However, this also reliably reproduced the issue when IPSET_ATTR_GC was removed and just as reliably removed the issue when IPSET_ATTR_GC was added back in.
> > 
> > I have also verified that neither ipset, iptables, or any of its dependent libraries have references to IPSET_ATTR_GC, going as far as to remove it from the ipset header file (https://urldefense.com/v3/__https://git.netfilter.org/iptables/tree/include/linux/netfilter/ipset/ip_set.h*n86__;Iw!!JmoZiZGBv3RvKRSx!69L-2bZ2sI_yJHZDhKe799D2LnTMz-jfAVznMjfJK6jB68je36HDpX0ag_GDJRIQxS2lfs9imab8LPpnCPI$ ) and rebuild it (and all of the libraries and other tools) from scratch just as a hail mary. No changes to user-space seem to have an effect on this issue.
> > 
> > One other thing that I've done to help track down the issue is to add debug options to kube-router so that it outputs the file that it feeds into "ipset restore -exist". With this file, on nodes affected by this issue, I can reliably reproduce the issue by calling "ipset restore -exist <file" and see that it takes 30+ seconds to execute.
> > 
> > In a hope that maybe it sheds some light and gives you some more context, I'm going to be sending you and Jakub a copy of the strace and the ipset file that was used separately from this email.
> > 
> > At this point, I'm not sure how to proceed other than with the files that I'll be sending you. I'm highly confident that somehow the removal of IPSET_ATTR_GC is causing the issues that we see. At this point I've added and removed the options almost 20 times and done reboots across our cluster. Anytime that variable is missing, we see several nodes exhibit the performance issues immediately. Any time the variable is present, we see no nodes exhibit the performance issues.
> > 
> > Looking forward to hearing back from you and getting to the bottom of this very bizarre issue.
> > 
> > -Aaron
> > 
> > From: Jozsef Kadlecsik <kadlec@netfilter.org>
> > Date: Saturday, July 2, 2022 at 12:41 PM
> > To: Jakub Kicinski <kuba@kernel.org>
> > Cc: U'ren, Aaron <Aaron.U'ren@sony.com>, Thorsten Leemhuis <regressions@leemhuis.info>, McLean, Patrick <Patrick.Mclean@sony.com>, Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>
> > Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
> > Hi,
> > 
> > On Thu, 30 Jun 2022, Jakub Kicinski wrote:
> > 
> > > Sounds like you're pretty close to figuring this out! Can you check 
> > > if the user space is intentionally setting IPSET_ATTR_INITVAL?
> > > Either that or IPSET_ATTR_GC was not as "unused" as initially thought.
> > 
> > IPSET_ATTR_GC was really unused. It was an old remnant from the time when 
> > ipset userspace-kernel communication was through set/getsockopt. However, 
> > when it was migrated to netlink, just the symbol was kept but it was not 
> > used either with the userspace tool or the kernel.
> > 
> > Aaron, could you send me how to reproduce the issue? I have no idea how 
> > that patch could be the reason. Setting/getting/using IPSET_ATTR_INITVAL 
> > is totally independent from listing iptables rules. But if you have got a 
> > reproducer then I can dig into it.
> > 
> > Best regards,
> > Jozsef
> > 
> > > Testing something like this could be a useful data point:
> > > 
> > > diff --git a/include/uapi/linux/netfilter/ipset/ip_set.h b/include/uapi/linux/netfilter/ipset/ip_set.h
> > > index 6397d75899bc..7caf9b53d2a7 100644
> > > --- a/include/uapi/linux/netfilter/ipset/ip_set.h
> > > +++ b/include/uapi/linux/netfilter/ipset/ip_set.h
> > > @@ -92,7 +92,7 @@ enum {
> > >        /* Reserve empty slots */
> > >        IPSET_ATTR_CADT_MAX = 16,
> > >        /* Create-only specific attributes */
> > > -     IPSET_ATTR_INITVAL,     /* was unused IPSET_ATTR_GC */
> > > +     IPSET_ATTR_GC,
> > >        IPSET_ATTR_HASHSIZE,
> > >        IPSET_ATTR_MAXELEM,
> > >        IPSET_ATTR_NETMASK,
> > > @@ -104,6 +104,8 @@ enum {
> > >        IPSET_ATTR_REFERENCES,
> > >        IPSET_ATTR_MEMSIZE,
> > >  
> > > +     IPSET_ATTR_INITVAL,
> > > +
> > >        __IPSET_ATTR_CREATE_MAX,
> > >  };
> > >  #define IPSET_ATTR_CREATE_MAX        (__IPSET_ATTR_CREATE_MAX - 1)
> > > 
> > > 
> > > On Thu, 30 Jun 2022 14:59:14 +0000 U'ren, Aaron wrote:
> > > > Thorsten / Jozsef -
> > > > 
> > > > Thanks for continuing to follow up! I'm sorry that this has moved so slow, it has taken us a bit to find the time to fully track this issue down, however, I think that we have figured out enough to make some more forward progress on this issue.
> > > > 
> > > > Jozsef, thanks for your insight into what is happening between those system calls. In regards to your question about wait/wound mutex debugging possibly being enabled, I can tell you that we definitely don't have that enabled on any of our regular machines. While we were debugging we did turn on quite a few debug options to help us try and track this issue down and it is very possible that the strace that was taken that started off this email was taken on a machine that did have that debug option enabled. Either way though, the root issue occurs on hosts that definitely do not have wait/wound mutex debugging enabled.
> > > > 
> > > > The good news is that we finally got one of our development environments into a state where we could reliably reproduce the performance issue across reboots. This was a win because it meant that we were able to do a full bisect of the kernel and were able to tell relatively quickly whether or not the issue was present in the test kernels.
> > > > 
> > > > After bisecting for 3 days, I have been able to narrow it down to a single commit: https://urldefense.com/v3/__https:/git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3976ca101990ca11ddf51f38bec7b86c19d0ca6f__;!!JmoZiZGBv3RvKRSx!9YR_bFOCOkQzPaUftFL2NvuKLm8zPa4tQr_DI8CUZEenjK4Rak_OFmUrCpmiNOaUaiueGbgsEqk0IirIc4I$  (netfilter: ipset: Expose the initval hash parameter to userspace)
> > > > 
> > > > I'm at a bit of a loss as to why this would cause such severe performance regressions, but I've proved it out multiple times now. I've even checked out a fresh version of the 5.15 kernel that we've been deploying with just this single commit reverted and found that the performance problems are completely resolved.
> > > > 
> > > > I'm hoping that maybe Jozsef will have some more insight into why this seemingly innocuous commit causes such larger performance issues for us? If you have any additional patches or other things that you would like us to test I will try to leave our environment in its current state for the next couple of days so that we can do so.
> > > > 
> > > > -Aaron
> > > > 
> > > > From: Thorsten Leemhuis <regressions@leemhuis.info>
> > > > Date: Monday, June 20, 2022 at 2:16 AM
> > > > To: U'ren, Aaron <Aaron.U'ren@sony.com>
> > > > Cc: McLean, Patrick <Patrick.Mclean@sony.com>, Pablo Neira Ayuso <pablo@netfilter.org>, netfilter-devel@vger.kernel.org <netfilter-devel@vger.kernel.org>, Brown, Russell <Russell.Brown@sony.com>, Rueger, Manuel <manuel.rueger@sony.com>, linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>, regressions@lists.linux.dev <regressions@lists.linux.dev>, Florian Westphal <fw@strlen.de>, netdev@vger.kernel.org <netdev@vger.kernel.org>, Jozsef Kadlecsik <kadlec@netfilter.org>
> > > > Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
> > > > On 31.05.22 09:41, Jozsef Kadlecsik wrote:
> > > > > On Mon, 30 May 2022, Thorsten Leemhuis wrote:  
> > > > >> On 04.05.22 21:37, U'ren, Aaron wrote:  
> > > >  [...]  
> > > > > 
> > > > > Every set lookups behind "iptables" needs two getsockopt() calls: you can 
> > > > > see them in the strace logs. The first one check the internal protocol 
> > > > > number of ipset and the second one verifies/gets the processed set (it's 
> > > > > an extension to iptables and therefore there's no internal state to save 
> > > > > the protocol version number).  
> > > > 
> > > > Hi Aaron! Did any of the suggestions from Jozsef help to track down the
> > > > root case? I have this issue on the list of tracked regressions and
> > > > wonder what the status is. Or can I mark this as resolved?
> > > > 
> > > > Side note: this is not a "something breaks" regressions and it seems to
> > > > progress slowly, so I'm putting it on the backburner:
> > > > 
> > > > #regzbot backburner: performance regression where the culprit is hard to
> > > > track down
> > > > 
> > > > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> > > > 
> > > > P.S.: As the Linux kernel's regression tracker I deal with a lot of
> > > > reports and sometimes miss something important when writing mails like
> > > > this. If that's the case here, don't hesitate to tell me in a public
> > > > reply, it's in everyone's interest to set the public record straight.
> > > > 
> > > >  [...]  
> > > > > 
> > > > > In your strace log
> > > > > 
> > > > > 0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
> > > > > 0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.1$
> > > > > 0.109456 close(4)                  = 0 <0.000022>
> > > > > 
> > > > > the only things which happen in the second sockopt function are to lock 
> > > > > the NFNL_SUBSYS_IPSET mutex, walk the array of the sets, compare the 
> > > > > setname, save the result in the case of a match and unlock the mutex. 
> > > > > Nothing complicated, no deep, multi-level function calls. Just a few 
> > > > > lines of code which haven't changed.
> > > > > 
> > > > > The only thing which can slow down the processing is the mutex handling. 
> > > > > Do you perhaps have wait/wound mutex debugging accidentally enabled in 
> > > > > the kernel? If not, then bisecting the mutex related patches might help.
> > > > > 
> > > > > You wrote that flushing tables or ipsets didn't seem to help. Did that 
> > > > > literally mean flushing, i.e. the sets were emptied but not destroyed? 
> > > > > Did you try both destroying and flushing?
> > > > >   
> > > > >> Jozsef, I still have this issue on my list of tracked regressions and it
> > > > >> looks like nothing happens since above mail (or did I miss it?). Could
> > > > >> you maybe provide some guidance to Aaron to get us all closer to the
> > > > >> root of the problem?  
> > > > > 
> > > > > I really hope it's an accidentally enabled debugging option in the kernel. 
> > > > > Otherwise bisecting could help to uncover the issue.
> > > > > 
> > > > > Best regards,
> > > > > Jozsef
> > > > >   
> > > > >> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> > > > >> reports and sometimes miss something important when writing mails like
> > > > >> this. If that's the case here, don't hesitate to tell me in a public
> > > > >> reply, it's in everyone's interest to set the public record straight.
> > > 
> > 

-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.hu
PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics
          H-1525 Budapest 114, POB. 49, Hungary


end of thread, other threads:[~2022-08-04  6:29 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-15 23:15 Intermittent performance regression related to ipset between 5.10 and 5.15 McLean, Patrick
2022-03-16  9:17 ` Thorsten Leemhuis
2022-04-11 10:36   ` Thorsten Leemhuis
2022-04-11 11:47     ` Jozsef Kadlecsik
2022-05-04 13:14       ` Thorsten Leemhuis
2022-05-04 19:37         ` U'ren, Aaron
2022-05-30 13:50           ` Thorsten Leemhuis
2022-05-31  7:41             ` Jozsef Kadlecsik
2022-06-20  7:16               ` Thorsten Leemhuis
2022-06-30 14:59                 ` U'ren, Aaron
2022-06-30 18:04                   ` Jakub Kicinski
2022-07-02  0:32                     ` U'ren, Aaron
2022-07-02 17:40                     ` Jozsef Kadlecsik
2022-07-08 20:08                       ` U'ren, Aaron
2022-07-29 20:21                         ` U'ren, Aaron
2022-07-30  1:41                           ` Jakub Kicinski
2022-07-30 10:43                           ` Jozsef Kadlecsik
2022-08-01 23:38                             ` U'ren, Aaron
2022-08-04  6:28                               ` Jozsef Kadlecsik
2022-07-01  5:18   ` Intermittent performance regression related to ipset between 5.10 and 5.15 #forregzbot Thorsten Leemhuis
