LKML Archive on lore.kernel.org
* Regression: kernel 4.14 an later very slow with many ipsec tunnels
@ 2018-09-13 11:30 Wolfgang Walter
  2018-09-13 13:58 ` Florian Westphal
  0 siblings, 1 reply; 22+ messages in thread
From: Wolfgang Walter @ 2018-09-13 11:30 UTC
  To: netdev
  Cc: Florian Westphal, Steffen Klassert, David Miller,
	Linux Kernel Mailing List, Linus Torvalds

Hello,

thanks to the fix from Steffen Klassert I could now run 4.14.69 + his patch 
and 4.18.7 + his patch without oopsing immediately.

But I found that those kernels perform very badly. They perform so badly that
they are unusable for our router with about 3000 IPsec tunnels (tunnel mode,
network <-> network).

With 4.9 (and all other kernels I used in the last 10 years, on much less
potent hardware) I never had a comparable performance issue with networking.

4.18.7 is better than 4.14.69 but still remains unusable for us.

Even with very little traffic all 8 cores are 100% busy in ksoftirqd. As
soon as there is real traffic the network becomes practically unusable.

Packet latency goes from between 0.1ms and 1ms up to between 100ms and 500ms
(4.14) or between 15ms and 90ms (4.18).

Throughput also suffers a lot.

I have a simple test I run after every upgrade. This test basically copies
large files with scp to 60 different remote locations (involving IPsec),
limited to 1GBit/s combined, and in parallel I ping from different networks
over this router to machines in other networks of this router (no IPsec
tunnels involved).

With 4.9 and earlier, copying takes about 2 minutes and the pings all remain
under 2ms roundtrip.

With 4.14, copying these files takes more than one hour. The roundtrip time of
the ping is > 1 second.

With 4.18 this is much better: copying takes around 6 minutes and ping
roundtrip is between 30ms and 180ms. But even that is much worse than 4.9.

I think this dramatic loss in performance is due to the removal of the flow 
cache. I propose to revert that for 4.14. I also propose to revert it for the 
next longterm kernel if no other solution is found bringing back 4.9 
performance (at least about the same order of magnitude).

Maybe it should generally be reverted until a solution to the problem exists.

Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts


* Re: Regression: kernel 4.14 an later very slow with many ipsec tunnels
  2018-09-13 11:30 Regression: kernel 4.14 an later very slow with many ipsec tunnels Wolfgang Walter
@ 2018-09-13 13:58 ` Florian Westphal
  2018-09-13 15:46   ` Wolfgang Walter
  0 siblings, 1 reply; 22+ messages in thread
From: Florian Westphal @ 2018-09-13 13:58 UTC
  To: Wolfgang Walter
  Cc: netdev, Florian Westphal, Steffen Klassert, David Miller,
	Linux Kernel Mailing List, Linus Torvalds

Wolfgang Walter <linux@stwm.de> wrote:
> thanks to the fix from Steffen Klassert I could now run 4.14.69 + his patch 
> and 4.18.7 + his patch without oopsing immediately.
> 
> But I found that those kernels perform very bad. They perform so bad that they 
> are unusable for our router with about 3000 ipsec tunnels (tunnel mode network 
> <-> network).

Can you do a 'perf record -a -g sleep 5' with 4.18 and provide 'perf
report' result?

It would be good to see where those cycles are spent.


* Re: Regression: kernel 4.14 an later very slow with many ipsec tunnels
  2018-09-13 13:58 ` Florian Westphal
@ 2018-09-13 15:46   ` Wolfgang Walter
  2018-09-13 16:38     ` Florian Westphal
  0 siblings, 1 reply; 22+ messages in thread
From: Wolfgang Walter @ 2018-09-13 15:46 UTC
  To: Florian Westphal
  Cc: netdev, Steffen Klassert, David Miller,
	Linux Kernel Mailing List, Linus Torvalds

On Thursday, 13 September 2018, 15:58:44, Florian Westphal wrote:
> Wolfgang Walter <linux@stwm.de> wrote:
> > thanks to the fix from Steffen Klassert I could now run 4.14.69 + his
> > patch
> > and 4.18.7 + his patch without oopsing immediately.
> > 
> > But I found that those kernels perform very bad. They perform so bad that
> > they are unusable for our router with about 3000 ipsec tunnels (tunnel
> > mode network <-> network).
> 
> Can you do a 'perf record -a -g sleep 5' with 4.18 and provide 'perf
> report' result?
> 
> It would be good to see where those cycles are spent.

I'll try that, but it isn't easy as the router image does not contain
perf. I also have to do it on our production router. I'll try to do that
tomorrow evening.

What I can say is that it depends mainly on the number of policy rules and SAs.

Regards
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts


* Re: Regression: kernel 4.14 an later very slow with many ipsec tunnels
  2018-09-13 15:46   ` Wolfgang Walter
@ 2018-09-13 16:38     ` Florian Westphal
  2018-09-13 17:23       ` David Miller
  0 siblings, 1 reply; 22+ messages in thread
From: Florian Westphal @ 2018-09-13 16:38 UTC
  To: Wolfgang Walter
  Cc: Florian Westphal, netdev, Steffen Klassert, David Miller,
	Linux Kernel Mailing List, Linus Torvalds

Wolfgang Walter <linux@stwm.de> wrote:
> I'll try that but this isn't that easy as the router image does not contain 
> perf. I also have to do that on our production router. I try to do that 
> tomorrow evening.

No need if it's too difficult.

> What I can say is that it depends mainly on number of policy rules and SA.

That's already a good hint, I guess we're hitting long hash chains in
xfrm_policy_lookup_bytype().


* Re: Regression: kernel 4.14 an later very slow with many ipsec tunnels
  2018-09-13 16:38     ` Florian Westphal
@ 2018-09-13 17:23       ` David Miller
  2018-09-13 21:03         ` Florian Westphal
  0 siblings, 1 reply; 22+ messages in thread
From: David Miller @ 2018-09-13 17:23 UTC
  To: fw; +Cc: linux, netdev, steffen.klassert, linux-kernel, torvalds

From: Florian Westphal <fw@strlen.de>
Date: Thu, 13 Sep 2018 18:38:48 +0200

> Wolfgang Walter <linux@stwm.de> wrote:
>> What I can say is that it depends mainly on number of policy rules and SA.
> 
> Thats already a good hint, I guess we're hitting long hash chains in
> xfrm_policy_lookup_bytype().

I don't really see how recent changes can influence that.

And the bydst hashes have been dynamically sized for a very long time.

I might have missed something...


* Re: Regression: kernel 4.14 an later very slow with many ipsec tunnels
  2018-09-13 17:23       ` David Miller
@ 2018-09-13 21:03         ` Florian Westphal
  2018-09-13 21:12           ` David Miller
  2018-09-14  5:06           ` Steffen Klassert
  0 siblings, 2 replies; 22+ messages in thread
From: Florian Westphal @ 2018-09-13 21:03 UTC
  To: David Miller
  Cc: fw, linux, netdev, steffen.klassert, linux-kernel, torvalds,
	christophe.gouault

David Miller <davem@davemloft.net> wrote:
> From: Florian Westphal <fw@strlen.de>
> Date: Thu, 13 Sep 2018 18:38:48 +0200
> 
> > Wolfgang Walter <linux@stwm.de> wrote:
> >> What I can say is that it depends mainly on number of policy rules and SA.
> > 
> > Thats already a good hint, I guess we're hitting long hash chains in
> > xfrm_policy_lookup_bytype().
> 
> I don't really see how recent changes can influence that.

I don't think there is a recent change that did this.

Wolfgang says < 4.14 is ok, so this is likely related to flow cache removal.

For example, it looks like all prefixed policies end up in a linked list
(net->xfrm.policy_inexact) and are not even in a hash table.

I am staring at b58555f1767c9f4e330fcf168e4e753d2d9196e0
but can't figure out how to configure that away from the
'no hashing for prefixed policies' default or why we even have
policy_inexact in the first place :/

I'll look at this again tomorrow.


* Re: Regression: kernel 4.14 an later very slow with many ipsec tunnels
  2018-09-13 21:03         ` Florian Westphal
@ 2018-09-13 21:12           ` David Miller
  2018-09-14  5:06           ` Steffen Klassert
  1 sibling, 0 replies; 22+ messages in thread
From: David Miller @ 2018-09-13 21:12 UTC
  To: fw
  Cc: linux, netdev, steffen.klassert, linux-kernel, torvalds,
	christophe.gouault

From: Florian Westphal <fw@strlen.de>
Date: Thu, 13 Sep 2018 23:03:25 +0200

> I am staring at b58555f1767c9f4e330fcf168e4e753d2d9196e0
> but can't figure out how to configure that away from the
> 'no hashing for prefixed policies' default or why we even have
> policy_inexact in first place :/
> 
> I'll look at this again tomorrow.

The inexact list exists to handle prefixed input keys.

At the time that I wrote all of the control plane hash table
stuff, configurations I could find consisted of:

1) Entries with non-prefixed keys, which are easy to hash.
   The number of entries could be large (e.g. cell phone
   network)

2) A very small number of prefixed policies.

So hashing, when possible, falling back to the linked list
for prefixed stuff.

Beforehand we only had the linked list :-)
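
A minimal Python sketch of the two-tier scheme described above, for
illustration only: host-to-host policies go into a hash table, prefixed
policies go onto a list that is scanned on every lookup. The names used here
(Policy, PolicyDb, exact, inexact) are hypothetical and are not taken from the
kernel source; only the idea of hashing with a linked-list fallback comes from
this message.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    src: ipaddress.IPv4Network
    dst: ipaddress.IPv4Network
    priority: int  # lower value wins
    action: str


class PolicyDb:
    """Toy two-tier store: hash table for host policies, list for the rest."""

    def __init__(self):
        self.exact = {}    # (src address, dst address) -> list of policies
        self.inexact = []  # prefixed policies, scanned linearly

    def insert(self, pol):
        if pol.src.prefixlen == 32 and pol.dst.prefixlen == 32:
            key = (pol.src.network_address, pol.dst.network_address)
            self.exact.setdefault(key, []).append(pol)
        else:
            self.inexact.append(pol)

    def lookup(self, src, dst):
        src = ipaddress.ip_address(src)
        dst = ipaddress.ip_address(dst)
        candidates = list(self.exact.get((src, dst), []))
        # Every prefixed policy has to be tested against the packet.
        candidates += [p for p in self.inexact if src in p.src and dst in p.dst]
        return min(candidates, key=lambda p: p.priority, default=None)


net = ipaddress.ip_network
db = PolicyDb()
db.insert(Policy(net("10.1.2.1/32"), net("10.9.9.9/32"), 10, "host"))
db.insert(Policy(net("10.148.32.0/24"), net("10.148.13.224/29"), 20, "tunnel"))
```

In this sketch the cost of lookup() grows with the length of the inexact
list, which is harmless for a handful of prefixed policies and expensive for
thousands of them.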


* Re: Regression: kernel 4.14 an later very slow with many ipsec tunnels
  2018-09-13 21:03         ` Florian Westphal
  2018-09-13 21:12           ` David Miller
@ 2018-09-14  5:06           ` Steffen Klassert
  2018-09-14  5:54             ` Florian Westphal
  1 sibling, 1 reply; 22+ messages in thread
From: Steffen Klassert @ 2018-09-14  5:06 UTC
  To: Florian Westphal
  Cc: David Miller, linux, netdev, linux-kernel, torvalds, christophe.gouault

On Thu, Sep 13, 2018 at 11:03:25PM +0200, Florian Westphal wrote:
> David Miller <davem@davemloft.net> wrote:
> > From: Florian Westphal <fw@strlen.de>
> > Date: Thu, 13 Sep 2018 18:38:48 +0200
> > 
> > > Wolfgang Walter <linux@stwm.de> wrote:
> > >> What I can say is that it depends mainly on number of policy rules and SA.
> > > 
> > > Thats already a good hint, I guess we're hitting long hash chains in
> > > xfrm_policy_lookup_bytype().
> > 
> > I don't really see how recent changes can influence that.
> 
> I don't think there is a recent change that did this.
> 
> Walter says < 4.14 is ok, so this is likely related to flow cache removal.
> 
> F.e. it looks like all prefixed policies end up in a linked list
> (net->xfrm.policy_inexact) and are not even in a hash table.
> 
> I am staring at b58555f1767c9f4e330fcf168e4e753d2d9196e0
> but can't figure out how to configure that away from the
> 'no hashing for prefixed policies' default or why we even have
> policy_inexact in first place :/

The hash threshold can be configured like this:

ip x p set hthresh4 0 0

This sets the hash threshold to local /0 and remote /0 netmasks.
With this configuration, all policies should go to the hashtable.
This might help to balance the hash chains better.

Default hash thresholds are local /32 and remote /32 netmasks, so
all prefixed policies go to the inexact list.

To view the configuration:

ip -s -s x p count



* Re: Regression: kernel 4.14 an later very slow with many ipsec tunnels
  2018-09-14  5:06           ` Steffen Klassert
@ 2018-09-14  5:54             ` Florian Westphal
  2018-09-14  6:01               ` Steffen Klassert
                                 ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Florian Westphal @ 2018-09-14  5:54 UTC
  To: Steffen Klassert
  Cc: Florian Westphal, David Miller, linux, netdev, linux-kernel,
	torvalds, christophe.gouault

Steffen Klassert <steffen.klassert@secunet.com> wrote:
> On Thu, Sep 13, 2018 at 11:03:25PM +0200, Florian Westphal wrote:
> > David Miller <davem@davemloft.net> wrote:
> > > From: Florian Westphal <fw@strlen.de>
> > > Date: Thu, 13 Sep 2018 18:38:48 +0200
> > > 
> > > > Wolfgang Walter <linux@stwm.de> wrote:
> > > >> What I can say is that it depends mainly on number of policy rules and SA.
> > > > 
> > > > Thats already a good hint, I guess we're hitting long hash chains in
> > > > xfrm_policy_lookup_bytype().
> > > 
> > > I don't really see how recent changes can influence that.
> > 
> > I don't think there is a recent change that did this.
> > 
> > Walter says < 4.14 is ok, so this is likely related to flow cache removal.
> > 
> > F.e. it looks like all prefixed policies end up in a linked list
> > (net->xfrm.policy_inexact) and are not even in a hash table.
> > 
> > I am staring at b58555f1767c9f4e330fcf168e4e753d2d9196e0
> > but can't figure out how to configure that away from the
> > 'no hashing for prefixed policies' default or why we even have
> > policy_inexact in first place :/
> 
> The hash threshold can be configured like this:
> 
> ip x p set hthresh4 0 0
> 
> This sets the hash threshold to local /0 and remote /0 netmasks.
> With this configuration, all policies should go to the hashtable.

Yes, but won't they all be hashed to same bucket?

[ jhash(addr & 0, addr & 0) ] ?

> Default hash thresholds are local /32 and remote /32 netmasks, so
> all prefixed policies go to the inexact list.

Yes.

Wolfgang, before having to work on getting perf into your router image
can you perhaps share a bit of info about the policies you're using?

How many are there?  Are they prefixed or not ("10.1.2.1")?


* Re: Regression: kernel 4.14 an later very slow with many ipsec tunnels
  2018-09-14  5:54             ` Florian Westphal
@ 2018-09-14  6:01               ` Steffen Klassert
  2018-09-14  8:01                 ` Christophe Gouault
  2018-09-14 11:49               ` Wolfgang Walter
  2018-10-02 14:45               ` Wolfgang Walter
  2 siblings, 1 reply; 22+ messages in thread
From: Steffen Klassert @ 2018-09-14  6:01 UTC
  To: Florian Westphal
  Cc: David Miller, linux, netdev, linux-kernel, torvalds, christophe.gouault

On Fri, Sep 14, 2018 at 07:54:37AM +0200, Florian Westphal wrote:
> Steffen Klassert <steffen.klassert@secunet.com> wrote:
> > On Thu, Sep 13, 2018 at 11:03:25PM +0200, Florian Westphal wrote:
> > > David Miller <davem@davemloft.net> wrote:
> > > > From: Florian Westphal <fw@strlen.de>
> > > > Date: Thu, 13 Sep 2018 18:38:48 +0200
> > > > 
> > > > > Wolfgang Walter <linux@stwm.de> wrote:
> > > > >> What I can say is that it depends mainly on number of policy rules and SA.
> > > > > 
> > > > > Thats already a good hint, I guess we're hitting long hash chains in
> > > > > xfrm_policy_lookup_bytype().
> > > > 
> > > > I don't really see how recent changes can influence that.
> > > 
> > > I don't think there is a recent change that did this.
> > > 
> > > Walter says < 4.14 is ok, so this is likely related to flow cache removal.
> > > 
> > > F.e. it looks like all prefixed policies end up in a linked list
> > > (net->xfrm.policy_inexact) and are not even in a hash table.
> > > 
> > > I am staring at b58555f1767c9f4e330fcf168e4e753d2d9196e0
> > > but can't figure out how to configure that away from the
> > > 'no hashing for prefixed policies' default or why we even have
> > > policy_inexact in first place :/
> > 
> > The hash threshold can be configured like this:
> > 
> > ip x p set hthresh4 0 0
> > 
> > This sets the hash threshold to local /0 and remote /0 netmasks.
> > With this configuration, all policies should go to the hashtable.
> 
> Yes, but won't they all be hashed to same bucket?
> 
> [ jhash(addr & 0, addr & 0) ] ?

Hm, yes. Maybe something between /0 and /32 makes more sense.



* Re: Regression: kernel 4.14 an later very slow with many ipsec tunnels
  2018-09-14  6:01               ` Steffen Klassert
@ 2018-09-14  8:01                 ` Christophe Gouault
  0 siblings, 0 replies; 22+ messages in thread
From: Christophe Gouault @ 2018-09-14  8:01 UTC
  To: Steffen Klassert
  Cc: fw, David S. Miller, linux, netdev, linux-kernel, torvalds

On Fri, 14 Sep 2018 at 08:01, Steffen Klassert
<steffen.klassert@secunet.com> wrote:
> > > The hash threshold can be configured like this:
> > >
> > > ip x p set hthresh4 0 0
> > >
> > > This sets the hash threshold to local /0 and remote /0 netmasks.
> > > With this configuration, all policies should go to the hashtable.
> >
> > Yes, but won't they all be hashed to same bucket?
> >
> > [ jhash(addr & 0, addr & 0) ] ?
>
> Hm, yes. Maybe something between /0 and /32 makes more sense.

Indeed, hash thresholds not only determine which policies will be
hashed, but also the number of bits of the local and remote address
that will be used to calculate the hash key. Big thresholds mean
potentially fewer hashed policies, but better distribution in the hash
table, and vice versa.

A good trade off must be found depending on the prefix lengths used in
your policies.
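
The trade-off can be sketched in a few lines of Python. This is a simplified
model, not the kernel's hash function: bucket_key() and distribution() are
hypothetical helpers, the thresholds are applied to source and destination
directly (the real local/remote mapping depends on the policy direction), and
the 64 sample policies are made up. A policy whose prefix is shorter than the
threshold stays on the inexact list; otherwise only the thresholded bits of
its addresses feed the hash.

```python
import ipaddress
from collections import Counter


def bucket_key(src_net, dst_net, sbits, dbits):
    """Masked (src, dst) pair used as hash key, or None for the inexact list."""
    if src_net.prefixlen < sbits or dst_net.prefixlen < dbits:
        return None
    smask = (0xFFFFFFFF << (32 - sbits)) & 0xFFFFFFFF
    dmask = (0xFFFFFFFF << (32 - dbits)) & 0xFFFFFFFF
    return (int(src_net.network_address) & smask,
            int(dst_net.network_address) & dmask)


# Made-up policy set: one /16 on one side, 64 different /29s on the other.
policies = [
    (ipaddress.ip_network("10.0.0.0/16"),
     ipaddress.ip_network(f"10.1.{i}.0/29"))
    for i in range(64)
]


def distribution(sbits, dbits):
    """Return (policies left on the inexact list, number of distinct keys)."""
    keys = Counter(bucket_key(s, d, sbits, dbits) for s, d in policies)
    inexact = keys.pop(None, 0)
    return inexact, len(keys)


default_thresholds = distribution(32, 32)   # everything stays inexact
zero_thresholds = distribution(0, 0)        # everything hashed, one single key
tuned_thresholds = distribution(16, 24)     # everything hashed, keys spread out
```

For this sample set, thresholds of 16 and 24 hash every policy and still give
each one its own key, whereas 0 and 0 hash every policy onto the same key.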

Best regards,
Christophe


* Re: Regression: kernel 4.14 an later very slow with many ipsec tunnels
  2018-09-14  5:54             ` Florian Westphal
  2018-09-14  6:01               ` Steffen Klassert
@ 2018-09-14 11:49               ` Wolfgang Walter
  2018-10-02 14:45               ` Wolfgang Walter
  2 siblings, 0 replies; 22+ messages in thread
From: Wolfgang Walter @ 2018-09-14 11:49 UTC
  To: Florian Westphal
  Cc: Steffen Klassert, David Miller, netdev, linux-kernel, torvalds,
	christophe.gouault

On Friday, 14 September 2018, 07:54:37, Florian Westphal wrote:
> Steffen Klassert <steffen.klassert@secunet.com> wrote:
> > On Thu, Sep 13, 2018 at 11:03:25PM +0200, Florian Westphal wrote:
> > > David Miller <davem@davemloft.net> wrote:
> > > > From: Florian Westphal <fw@strlen.de>
> > > > Date: Thu, 13 Sep 2018 18:38:48 +0200
> > > > 
> > > > > Wolfgang Walter <linux@stwm.de> wrote:
> > > > >> What I can say is that it depends mainly on number of policy rules
> > > > >> and SA.
> > > > > 
> > > > > Thats already a good hint, I guess we're hitting long hash chains in
> > > > > xfrm_policy_lookup_bytype().
> > > > 
> > > > I don't really see how recent changes can influence that.
> > > 
> > > I don't think there is a recent change that did this.
> > > 
> > > Walter says < 4.14 is ok, so this is likely related to flow cache
> > > removal.
> > > 
> > > F.e. it looks like all prefixed policies end up in a linked list
> > > (net->xfrm.policy_inexact) and are not even in a hash table.
> > > 
> > > I am staring at b58555f1767c9f4e330fcf168e4e753d2d9196e0
> > > but can't figure out how to configure that away from the
> > > 'no hashing for prefixed policies' default or why we even have
> > > policy_inexact in first place :/
> > 
> > The hash threshold can be configured like this:
> > 
> > ip x p set hthresh4 0 0
> > 
> > This sets the hash threshold to local /0 and remote /0 netmasks.
> > With this configuration, all policies should go to the hashtable.
> 
> Yes, but won't they all be hashed to same bucket?
> 
> [ jhash(addr & 0, addr & 0) ] ?
> 
> > Default hash thresholds are local /32 and remote /32 netmasks, so
> > all prefixed policies go to the inexact list.
> 
> Yes.
> 
> Wolfgang, before having to work on getting perf into your router image
> can you perhaps share a bit of info about the policies you're using?
> 
> How many are there?  Are they prefixed or not ("10.1.2.1")?

All rules are tunnel rules. That is, they are rules like this (in strongswan
notation):

conn A-to-B
        left=111.111.111.111
        leftsubnet=10.148.32.0/24
        leftsigkey=....
        right=111.111.111.222
        rightsubnet=10.148.13.224/29
        rightsigkey=....
        esp=aes128ctr-sha1-ecp256-esn!
        ike=aes128ctr-sha1-ecp256!
        mobike=no
        type=tunnel
        ....

(... other options not important here).


leftsubnet and rightsubnet may have any prefix from /30 to /16 here (we do not
yet use IPv6 but will do so next year).

We have about 3000 of them.

strongswan installs IN, FWD and OUT rules for each of them in the kernel
security policy database with automatically generated priorities (SAs are
created when strongswan actually establishes a tunnel).

Also, some of the rules overlap in range, which means ordering is important.

With IKEv2 this may happen automatically for SAs even if you avoid it in your
rule set, as IKEv2 allows narrowing.

In policies you most often get this if you want to exempt a certain network
or host. We have about 70 of them at the moment.

We do not use other possible selectors besides src-addr-range and
dst-addr-range (you could additionally select by protocol (icmp, udp, tcp) and
by src- and dst-port-range). So theoretically you could have a ruleset where a
rule exempts all connections to dst port 22 for several networks, or applies
different encryption options, and so on.

A rule determines what has to be done with a packet (sent or received)
from an IPsec point of view: allow it without IPsec transformation, block it
completely, or require a certain IPsec transformation (use this or that
encryption scheme, use header compression, use transport or tunnel mode, ...).

So for any packet the kernel sends, it has to look up whether there are SAs
which match and from these choose the one with the highest priority (which is
the one with the lowest priority field). If there is none, it has to look up
whether there is a matching policy, again choosing the one with the highest
priority (and then let the IKE daemon actually establish an SA). For tunnel
mode it actually has to do this twice, I think, as the tunnel packet passes
IPsec again.

For every packet it receives which is not an IPsec packet, it has to do a
lookup in the policy database to see whether it should have been one (or
whether it is allowed or blocked). If no rule is found it is allowed without
encryption. We have 29,000 allow rules. I deactivated them for the tests with
4.14 and 4.18 as they make things horrible. They are automatically generated
from our declarative network description and we don't actually need them, as
they do not overlap with the remote networks tunneled via IPsec. They did not
impose any burden with 4.9 and earlier.

We sometimes need them (say, if 10.10.0.0/16 is remote but 10.10.1.0 is
local).

So this is basically the multidimensional packet classification problem: from
a set of m-dimensional blocks, find the one with the highest priority which
contains a given point.

The dimensions here are src-addr-range, dst-addr-range, protocol,
src-port-range and dst-port-range.

If your rule is itself a point you may hash it. You can only do this, though,
if it is certain that no other non-point rule with higher priority matches
this point, because there is no rule saying that a more specific rule beats a
less specific one (that would be ill-defined).
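
The classification problem stated above can be written down as a naive linear
scan. This is only a sketch in Python under simplifying assumptions: Rule and
classify() are hypothetical names, the two sample rules and their priority
values are made up, and a protocol of None stands for "any protocol".

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional, Tuple

ANY_PORT = (0, 65535)


@dataclass(frozen=True)
class Rule:
    src: ipaddress.IPv4Network
    dst: ipaddress.IPv4Network
    priority: int                 # lowest value has the highest priority
    action: str
    proto: Optional[int] = None   # None matches any protocol
    sport: Tuple[int, int] = ANY_PORT
    dport: Tuple[int, int] = ANY_PORT

    def contains(self, src, dst, proto, sport, dport):
        """True if the packet (a point) lies inside this rule (a block)."""
        return (src in self.src and dst in self.dst
                and self.proto in (None, proto)
                and self.sport[0] <= sport <= self.sport[1]
                and self.dport[0] <= dport <= self.dport[1])


def classify(rules, src, dst, proto, sport, dport):
    """Test every rule and return the matching one with the lowest priority value."""
    src, dst = ipaddress.ip_address(src), ipaddress.ip_address(dst)
    hits = [r for r in rules if r.contains(src, dst, proto, sport, dport)]
    return min(hits, key=lambda r: r.priority, default=None)


net = ipaddress.ip_network
rules = [
    # Exempt TCP port 22 towards a /24 from encryption ...
    Rule(net("10.148.32.0/24"), net("10.148.13.0/24"), 10, "allow",
         proto=6, dport=(22, 22)),
    # ... while a more specific /29 inside it is tunneled.
    Rule(net("10.148.32.0/24"), net("10.148.13.224/29"), 20, "encrypt"),
]

ssh = classify(rules, "10.148.32.5", "10.148.13.226", 6, 40000, 22)
web = classify(rules, "10.148.32.5", "10.148.13.226", 6, 40000, 80)
```

In the sample, the port 22 packet is matched by both rules and the less
specific rule wins because of its priority, which is why a point rule cannot
simply be hashed without looking at the other rules.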

Here is an example of how strongswan allows you to use all of the above
selectors for your rules. For example, you could write for leftsubnet:

leftsubnet=10.0.0.1[tcp/http],10.0.0.2[6/80]
leftsubnet=fec1::1[udp],10.0.0.0/16[/53]
leftsubnet=fec1::1[udp/%any],10.0.0.0/16[%any/53]
leftsubnet=fec1::1[udp/%any],10.0.0.0/16[%any/1024-32000]

So IPsec with a large policy database and without the xfrm flow cache is
comparable to a large netfilter ruleset (with only one chain) without conntrack.
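
To illustrate why a per-flow cache hides the cost of a large rule set, here is
a small Python sketch. It is not the kernel's former flow cache, only the
general principle: the result of the full scan is remembered per (src, dst)
pair and thrown away when the rule set changes. CachedClassifier, its fields
and the sample rules are all hypothetical.

```python
import ipaddress


class CachedClassifier:
    """Linear policy scan fronted by a per-flow result cache."""

    def __init__(self, rules):
        self.rules = list(rules)   # tuples: (src_net, dst_net, priority, action)
        self.generation = 0        # bumped on every rule change
        self.cache = {}            # (src, dst) -> (generation, result)
        self.full_scans = 0

    def add_rule(self, rule):
        self.rules.append(rule)
        self.generation += 1       # makes every cached entry stale

    def _scan(self, src, dst):
        self.full_scans += 1
        hits = [r for r in self.rules if src in r[0] and dst in r[1]]
        return min(hits, key=lambda r: r[2], default=None)

    def lookup(self, src, dst):
        entry = self.cache.get((src, dst))
        if entry is not None and entry[0] == self.generation:
            return entry[1]        # cache hit: no scan needed
        result = self._scan(ipaddress.ip_address(src), ipaddress.ip_address(dst))
        self.cache[(src, dst)] = (self.generation, result)
        return result


net = ipaddress.ip_network
clf = CachedClassifier([(net("10.0.0.0/16"), net("10.1.0.0/29"), 10, "encrypt")])
for _ in range(1000):
    hit = clf.lookup("10.0.0.1", "10.1.0.2")
scans_before_change = clf.full_scans
clf.add_rule((net("10.0.0.0/24"), net("10.1.0.0/29"), 5, "allow"))
hit_after_change = clf.lookup("10.0.0.1", "10.1.0.2")
```

A thousand packets of the same flow cost one full scan here. The sketch never
evicts entries, so a real cache would additionally need a size limit.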

Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts


* Re: Regression: kernel 4.14 an later very slow with many ipsec tunnels
  2018-09-14  5:54             ` Florian Westphal
  2018-09-14  6:01               ` Steffen Klassert
  2018-09-14 11:49               ` Wolfgang Walter
@ 2018-10-02 14:45               ` Wolfgang Walter
  2018-10-02 14:56                 ` Florian Westphal
  2 siblings, 1 reply; 22+ messages in thread
From: Wolfgang Walter @ 2018-10-02 14:45 UTC
  To: Florian Westphal
  Cc: Steffen Klassert, David Miller, netdev, linux-kernel, torvalds,
	christophe.gouault

Hello,

On Friday, 14 September 2018, 07:54:37, Florian Westphal wrote:
> Steffen Klassert <steffen.klassert@secunet.com> wrote:
> > On Thu, Sep 13, 2018 at 11:03:25PM +0200, Florian Westphal wrote:
> > > David Miller <davem@davemloft.net> wrote:
> > > > From: Florian Westphal <fw@strlen.de>
> > > > Date: Thu, 13 Sep 2018 18:38:48 +0200
> > > > 
> > > > > Wolfgang Walter <linux@stwm.de> wrote:
> > > > >> What I can say is that it depends mainly on number of policy rules
> > > > >> and SA.
> > > > > 
> > > > > Thats already a good hint, I guess we're hitting long hash chains in
> > > > > xfrm_policy_lookup_bytype().
> > > > 
> > > > I don't really see how recent changes can influence that.
> > > 
> > > I don't think there is a recent change that did this.
> > > 
> > > Walter says < 4.14 is ok, so this is likely related to flow cache
> > > removal.
> > > 
> > > F.e. it looks like all prefixed policies end up in a linked list
> > > (net->xfrm.policy_inexact) and are not even in a hash table.
> > > 
> > > I am staring at b58555f1767c9f4e330fcf168e4e753d2d9196e0
> > > but can't figure out how to configure that away from the
> > > 'no hashing for prefixed policies' default or why we even have
> > > policy_inexact in first place :/
> > 
> > The hash threshold can be configured like this:
> > 
> > ip x p set hthresh4 0 0
> > 
> > This sets the hash threshold to local /0 and remote /0 netmasks.
> > With this configuration, all policies should go to the hashtable.
> 
> Yes, but won't they all be hashed to same bucket?
> 
> [ jhash(addr & 0, addr & 0) ] ?
> 
> > Default hash thresholds are local /32 and remote /32 netmasks, so
> > all prefixed policies go to the inexact list.
> 
> Yes.
> 
> Wolfgang, before having to work on getting perf into your router image
> can you perhaps share a bit of info about the policies you're using?
> 
> How many are there?  Are they prefixed or not ("10.1.2.1")?

I haven't received a reply since my last message: has there been any
progress on fixing this performance regression that I missed?

Or are we stuck with longterm kernel 4.9 for a long time?


Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts


* Re: Regression: kernel 4.14 an later very slow with many ipsec tunnels
  2018-10-02 14:45               ` Wolfgang Walter
@ 2018-10-02 14:56                 ` Florian Westphal
  2018-10-02 17:34                   ` Wolfgang Walter
  0 siblings, 1 reply; 22+ messages in thread
From: Florian Westphal @ 2018-10-02 14:56 UTC
  To: Wolfgang Walter
  Cc: Florian Westphal, Steffen Klassert, David Miller, netdev,
	linux-kernel, torvalds, christophe.gouault

Wolfgang Walter <linux@stwm.de> wrote:
> Since my last reply to this message I didn't get a reply: is there any 
> progress how to fix this performance regression I missed?

Did you test/experiment with the hthresh config option?

> Or are we stuck here with longterm kernel 4.9 for a long time?

I'm experimenting with per-dst inexact lists in an rbtree but
this will take time.


* Re: Regression: kernel 4.14 an later very slow with many ipsec tunnels
  2018-10-02 14:56                 ` Florian Westphal
@ 2018-10-02 17:34                   ` Wolfgang Walter
  2018-10-02 21:35                     ` Florian Westphal
  0 siblings, 1 reply; 22+ messages in thread
From: Wolfgang Walter @ 2018-10-02 17:34 UTC
  To: Florian Westphal
  Cc: Steffen Klassert, David Miller, netdev, linux-kernel, torvalds,
	christophe.gouault

On Tuesday, 2 October 2018, 16:56:16, Florian Westphal wrote:
> Wolfgang Walter <linux@stwm.de> wrote:
> > Since my last reply to this message I didn't get a reply: is there any
> > progress how to fix this performance regression I missed?
> 
> Did you test/experiment with hthresh config option?

I did. It did not improve the situation.

I suppose that is because our masks range from /16 to /30, and in particular
we have, for example, /16 <=> /8 and vice versa.

When forwarding, every policy A => B also implies that you add a policy B =>
A.

I'm not familiar with when the policy database is consulted, but I think it
now has to be for every unencrypted packet, and for those all rules have to be
checked. And unencrypted traffic is a large part of the traffic on that
router.

That is: for unencrypted traffic neither the buckets of the hash nor the
inexact list may be large.

> 
> > Or are we stuck here with longterm kernel 4.9 for a long time?
> 
> I'm experimenting with per-dst inexact lists in an rbtree but
> this will take time.

Hmm, I doubt that this is worth the effort. And it is certainly not easy to do
correctly, as it would still have to obey the original order of the rules
(their priority).

You may have a lot of rules of the form say

	10.0.0.0/16 <=> 10.1.0.0/29 encrypt ....
	10.0.0.0/16 <=> 10.1.0.8/29 encrypt ....
	....

And things like that.

Also, you get something like that

	10.0.1.0/24 <=> 10.0.2.0/29 allow
	10.0.0.0/16 <=> 10.0.2.0/24 encrypt
	0.0.0.0 <=> 10.0.2.0/16 block

And people may use source port and/or destination port or protocol
(tcp/udp/icmp) to further tailor their ruleset.


Here is the approach HiPAC took for packet classification:

https://pdfs.semanticscholar.org/a0bb/9d31e2499fb659c9e0d9544072d2f3c25079.pdf
https://pdfs.semanticscholar.org/0dea/8ee87f596f200de2722cbe9480610dd1a0db.pdf

Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts


* Re: Regression: kernel 4.14 an later very slow with many ipsec tunnels
  2018-10-02 17:34                   ` Wolfgang Walter
@ 2018-10-02 21:35                     ` Florian Westphal
  2018-10-04 13:57                       ` Wolfgang Walter
  0 siblings, 1 reply; 22+ messages in thread
From: Florian Westphal @ 2018-10-02 21:35 UTC
  To: Wolfgang Walter
  Cc: Florian Westphal, Steffen Klassert, David Miller, netdev,
	linux-kernel, torvalds, christophe.gouault

Wolfgang Walter <linux@stwm.de> wrote:
> On Tuesday, 2 October 2018, 16:56:16, Florian Westphal wrote:
> > I'm experimenting with per-dst inexact lists in an rbtree but
> > this will take time.
> 
> Hmm, I doubt that this is worth the effort. And certainly not that easy 

Well, I'm not going to send a revert of the flowcache removal.

I'm willing to experiment with alternatives to a full iteration of the
inexact list but that's it.

> correctly done, as it still would have to obey the original order of the rules 
> (their priority).

Except that neither the priority nor the order in which it was added
matters in case the selector doesn't match.

I see no reason why we can't have inexact lists done per dst<->src pairs.

> You may have a lot of rules of the form say
> 
> 	10.0.0.0/16 <=> 10.1.0.0/29 encrypt ....
> 	10.0.0.0/16 <=> 10.1.0.8/29 encrypt ....

Sure.

> Also, you get something like that
> 
> 	10.0.1.0/24 <=> 10.0.2.0/29 allow
> 	10.0.0.0/16 <=> 10.0.2.0/24 encrypt
> 	0.0.0.0 <=> 10.0.2.0/16 block
> 
> And people may use source port and/or destination port or protocol 
> (tcp/udp/imcp) to further tailor there ruleset.

Yes. 0.0.0.0/0 handling will require some extra consideration.

So far I have not seen a show-stopper however.
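
A rough Python sketch of what splitting the inexact list by destination could
look like. It is not the rbtree design mentioned in this thread: as a
stand-in it uses a dictionary keyed by destination network and probes one
candidate bin per prefix length in use. PerDstIndex, the priority values and
the use of 10.0.0.0/16 and 0.0.0.0/0 for the third sample rule are
assumptions, and only one direction of each rule is modelled.

```python
import ipaddress
from collections import defaultdict


class PerDstIndex:
    """Inexact policies grouped into one bin per destination network."""

    def __init__(self):
        self.bins = defaultdict(list)   # dst network -> [(src_net, priority, action)]
        self.prefix_lengths = set()     # dst prefix lengths currently in use

    def insert(self, src, dst, priority, action):
        dst_net = ipaddress.ip_network(dst)
        self.bins[dst_net].append((ipaddress.ip_network(src), priority, action))
        self.prefix_lengths.add(dst_net.prefixlen)

    def lookup(self, src, dst):
        src = ipaddress.ip_address(src)
        best = None
        for plen in self.prefix_lengths:
            # The only bin of this prefix length that can contain dst.
            dst_net = ipaddress.ip_network(f"{dst}/{plen}", strict=False)
            for src_net, priority, action in self.bins.get(dst_net, ()):
                # Priority is compared only among policies whose selector matches.
                if src in src_net and (best is None or priority < best[0]):
                    best = (priority, action)
        return best


idx = PerDstIndex()
idx.insert("10.0.1.0/24", "10.0.2.0/29", 10, "allow")
idx.insert("10.0.0.0/16", "10.0.2.0/24", 20, "encrypt")
idx.insert("0.0.0.0/0", "10.0.0.0/16", 30, "block")

allowed = idx.lookup("10.0.1.5", "10.0.2.3")
encrypted = idx.lookup("10.0.5.5", "10.0.2.3")
blocked = idx.lookup("192.0.2.1", "10.0.2.200")
```

Policies in bins whose destination does not contain the packet are never
visited, yet the result still respects priority across all matching bins.
The sketch does not help when many policies share one destination, since
that bin is still scanned in full.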


* Re: Regression: kernel 4.14 an later very slow with many ipsec tunnels
  2018-10-02 21:35                     ` Florian Westphal
@ 2018-10-04 13:57                       ` Wolfgang Walter
  2018-10-25  9:38                         ` Wolfgang Walter
  0 siblings, 1 reply; 22+ messages in thread
From: Wolfgang Walter @ 2018-10-04 13:57 UTC
  To: Florian Westphal
  Cc: Steffen Klassert, David Miller, netdev, linux-kernel, torvalds,
	christophe.gouault

On Tuesday, 2 October 2018, 23:35:36, Florian Westphal wrote:
> Wolfgang Walter <linux@stwm.de> wrote:
> > On Tuesday, 2 October 2018, 16:56:16, Florian Westphal wrote:
> > > I'm experimenting with per-dst inexact lists in an rbtree but
> > > this will take time.
> > 
> > Hmm, I doubt that this is worth the effort. And certainly not that easy
> 
> Well, I'm not going to send a revert of the flowcache removal.
> 
> I'm willing to experiment with alternatives to a full iteration of the
> inexact list, but that's it.

If this brings performance back to pre-removal, I'm fine with that. I'm even 
fine if it is slower by a factor of 2.

I think it is a serious regression, and there is no workaround, and therefore 
it cannot stay like that.

So I still hope that reverting is an option if no acceptable solution can be 
found.

> 
> > correctly done, as it still would have to obey the original order of the
> > rules (their priority).
> 
> Except that neither the priority nor the order in which it was added
> matters if the selector doesn't match.

To match a packet one has to find all matching rules and choose the one with 
the lowest priority.
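
As a rough illustration of that lookup (a sketch only, not the kernel's xfrm
code: the `Policy` type, the `lookup` helper, the priority values and the
example addresses are all made up for this example), a full iteration over the
rule list could look like this in Python:

```python
import ipaddress
from typing import List, NamedTuple, Optional


class Policy(NamedTuple):
    """Hypothetical policy: a src/dst selector, a priority and an action."""
    src: ipaddress.IPv4Network
    dst: ipaddress.IPv4Network
    priority: int  # assumption: the lowest value wins
    action: str


def make_policy(src: str, dst: str, priority: int, action: str) -> Policy:
    return Policy(ipaddress.ip_network(src), ipaddress.ip_network(dst),
                  priority, action)


def lookup(policies: List[Policy], src_ip: str, dst_ip: str) -> Optional[Policy]:
    """Test every selector and keep the match with the lowest priority."""
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    best = None
    for pol in policies:
        if src in pol.src and dst in pol.dst:
            # No early exit: a later rule may still have a lower priority.
            if best is None or pol.priority < best.priority:
                best = pol
    return best


# Example rule set, stored deliberately out of priority order.
POLICIES = [
    make_policy("0.0.0.0/0", "10.0.0.0/16", 30, "block"),
    make_policy("10.0.0.0/16", "10.0.2.0/24", 20, "encrypt"),
    make_policy("10.0.1.0/24", "10.0.2.0/29", 10, "allow"),
]
```

The cost of one lookup grows linearly with the number of rules, because a rule
that does not match still has to be tested before it can be skipped.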

"indexing" by dst will not help much if you have a ruleset where a lot of 
rules share a dst. You also have to replicate rules whose dst is a prefix of 
another dst, as they may have a higher priority even if they are less 
specific.

Every such entry may again have such an "indexing" by dst. Only then would 
this be efficient.
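
A minimal sketch of such a per-dst index, under stated assumptions: this is
not the data structure from any patch in this thread, the `DstIndex` class and
its methods are hypothetical, and instead of replicating less specific rules
into every bucket it probes every prefix length that covers the destination
address. It only illustrates why a less specific dst cannot be ignored:

```python
import ipaddress
from collections import defaultdict
from typing import NamedTuple, Optional


class Policy(NamedTuple):
    src: ipaddress.IPv4Network
    dst: ipaddress.IPv4Network
    priority: int  # assumption: the lowest value wins
    action: str


class DstIndex:
    """Hypothetical index: one bucket of policies per destination network."""

    def __init__(self) -> None:
        self.buckets = defaultdict(list)

    def insert(self, src: str, dst: str, priority: int, action: str) -> None:
        pol = Policy(ipaddress.ip_network(src), ipaddress.ip_network(dst),
                     priority, action)
        self.buckets[pol.dst].append(pol)

    def lookup(self, src_ip: str, dst_ip: str) -> Optional[Policy]:
        src = ipaddress.ip_address(src_ip)
        best = None
        # Visit every network that covers dst_ip, from /32 down to /0.
        # A less specific dst may still carry the lowest priority.
        for prefixlen in range(32, -1, -1):
            covering = ipaddress.ip_network(f"{dst_ip}/{prefixlen}", strict=False)
            for pol in self.buckets.get(covering, ()):
                if src in pol.src and (best is None or pol.priority < best.priority):
                    best = pol
        return best


index = DstIndex()
index.insert("10.0.0.0/16", "10.1.0.0/29", 20, "encrypt")
index.insert("10.0.1.0/24", "10.1.0.0/29", 5, "allow")
index.insert("0.0.0.0/0", "10.1.0.0/16", 10, "block")
```

Rules that share a dst still land in one bucket that is scanned linearly, which
is the limitation described above.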

> 
> I see no reason why we can't have inexact lists done per dst<->src pairs.
> 
> > You may have a lot of rules of the form say
> > 
> > 	10.0.0.0/16 <=> 10.1.0.0/29 encrypt ....
> > 	10.0.0.0/16 <=> 10.1.0.8/29 encrypt ....
> 

<=> means (in the forwarding case) that the rule set contains the inverted 
rule (at least if you use it in the usual way). So 

	10.0.0.0/16 <=> 10.1.0.0/29 encrypt ....

means

	10.0.0.0/16 => 10.1.0.0/29
	10.1.0.0/29 => 10.0.0.0/16
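
The expansion can be sketched in a few lines of Python. The `Rule` type and
the `expand_bidirectional` helper are hypothetical names used only for this
illustration; no real tool is being quoted:

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    src: str
    dst: str
    action: str


def expand_bidirectional(left: str, right: str, action: str) -> List[Rule]:
    """Turn one 'left <=> right' rule into its two directional rules."""
    return [Rule(left, right, action), Rule(right, left, action)]


rules = expand_bidirectional("10.0.0.0/16", "10.1.0.0/29", "encrypt")
```

So a rule set written with N such lines holds 2N directional rules for
forwarded traffic.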

> Sure.
> 
> > Also, you get something like that
> > 
> > 	10.0.1.0/24 <=> 10.0.2.0/29 allow
> > 	10.0.0.0/16 <=> 10.0.2.0/24 encrypt
> > 	0.0.0.0 <=> 10.0.2.0/16 block
> > 
> > And people may use source port and/or destination port or protocol
> > (tcp/udp/icmp) to further tailor their ruleset.
> 
> Yes. 0.0.0.0/0 handling will require some extra consideration.
> 

There may also be rulesets like

	10.0.1.0/24 => 10.1.0.0/29 encrypt X
	10.0.0.0/16 => 10.1.0.0/29 encrypt Y

Or

	10.0.0.0/16 *  => 10.1.0.0/24 80 encrypt Y
	10.0.1.0/24 * => 10.1.0.0/17 * encrypt X
	10.0.0.0/16 *  => 10.1.0.0/20 * encrypt Z

> So far I have not seen a show-stopper however.

I wonder why there is no such thing for netfilter or the rules list in 
routing. nf does not have such a thing, either. This is the reason why I think 
that this is not that easy and for longterm kernel 4.14 the best solution will 
be a revert anyway.

Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts

* Re: Regression: kernel 4.14 an later very slow with many ipsec tunnels
  2018-10-04 13:57                       ` Wolfgang Walter
@ 2018-10-25  9:38                         ` Wolfgang Walter
  2018-10-25 17:34                           ` David Miller
  2018-10-25 22:45                           ` Florian Westphal
  0 siblings, 2 replies; 22+ messages in thread
From: Wolfgang Walter @ 2018-10-25  9:38 UTC (permalink / raw)
  To: netdev
  Cc: Florian Westphal, Steffen Klassert, David Miller, linux-kernel,
	torvalds, christophe.gouault, Greg KH

Hello,

there is now a new 4.19 which still has the big performance regression when 
many ipsec tunnels are configured (throughput and latency get worse by 10 to 
50 times) which makes any kernel > 4.9 unusable for our routers.

I still don't understand why a revert of the flow cache removal at least for 
the longterm kernels is such a bad option (maybe as a compile time option), 
especially as there is no workaround available.

We have used Linux in that scenario for more than 10 years, so I'm really 
rather unhappy, not to say in despair, that we will be stuck with 4.9 for the 
foreseeable future.

We would have detected and reported that performance regression much earlier 
if another bug in ipsec had not prevented us from running 4.14 and later until 
the end of August 2018 (see "kernels > v4.12 oops/crash with ipsec-traffic: 
bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and 
remove the operation of dst_free()").

On Thursday, 4 October 2018, 15:57:52, Wolfgang Walter wrote:
> On Tuesday, 2 October 2018, 23:35:36, Florian Westphal wrote:

[snip]

> > Well, I'm not going to send a revert of the flowcache removal.
> >
> > 
> > I'm willing to experiment with alternatives to a full iteration of the
> > inexact list, but that's it.
> 
> If this brings performance back to pre-removal, I'm fine with that. I'm even
> fine if it is slower by a factor of 2.
> 
> I think it is a serious regression, and there is no workaround, and therefore
> it cannot stay like that.
> 
> So I still hope that reverting is an option if no acceptable solution can be
> found.
> 
> > > correctly done, as it still would have to obey the original order of the
> > > rules (their priority).
> > 
> > Except that neither the priority nor the order in which it was added
> > matters if the selector doesn't match.
> 
> To match a packet one has to find all matching rules and choose the one with
> the lowest priority.
> 
> "indexing" by dst will not help much if you have a ruleset where a lot of
> rules share a dst. You also have to replicate rules whose dst is a prefix of
> another dst, as they may have a higher priority even if they are less
> specific.
> 
> Every such entry may again have such an "indexing" by dst. Only then would
> this be efficient.
>

[snip]
 
> 
> I wonder why there is no such thing for netfilter or the rules list in
> routing. nf does not have such a thing, either. This is the reason why I
> think that this is not that easy and for longterm kernel 4.14 the best
> solution will be a revert anyway.
> 
> Regards,

Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts

* Re: Regression: kernel 4.14 an later very slow with many ipsec tunnels
  2018-10-25  9:38                         ` Wolfgang Walter
@ 2018-10-25 17:34                           ` David Miller
  2018-10-25 19:24                             ` Florian Westphal
  2018-10-26 12:18                             ` Wolfgang Walter
  2018-10-25 22:45                           ` Florian Westphal
  1 sibling, 2 replies; 22+ messages in thread
From: David Miller @ 2018-10-25 17:34 UTC (permalink / raw)
  To: linux
  Cc: netdev, fw, steffen.klassert, linux-kernel, torvalds,
	christophe.gouault, gregkh

From: Wolfgang Walter <linux@stwm.de>
Date: Thu, 25 Oct 2018 11:38:19 +0200

> there is now a new 4.19 which still has the big performance regression when 
> many ipsec tunnels are configured (throughput and latency get worse by 10 to 
> 50 times) which makes any kernel > 4.9 unusable for our routers.
> 
> I still don't understand why a revert of the flow cache removal at least for 
> the longterm kernels is such a bad option (maybe as a compile time option), 
> especially as there is no workaround available.

You do know that the flow cache is DDoS targetable, right?

That's why we removed it, we did not make the change lightly.

Adding a DDoS vector back into the kernel is not an option, sorry.

Please work diligently with Florian and others to try and find ways to
soften the performance hit.

Thank you.

* Re: Regression: kernel 4.14 an later very slow with many ipsec tunnels
  2018-10-25 17:34                           ` David Miller
@ 2018-10-25 19:24                             ` Florian Westphal
  2018-10-26 12:18                             ` Wolfgang Walter
  1 sibling, 0 replies; 22+ messages in thread
From: Florian Westphal @ 2018-10-25 19:24 UTC (permalink / raw)
  To: David Miller
  Cc: linux, netdev, fw, steffen.klassert, linux-kernel, torvalds,
	christophe.gouault, gregkh

David Miller <davem@davemloft.net> wrote:
> Please work diligently with Florian and others to try and find ways to
> soften the performance hit.

I will send a patch series that pre-sorts inexact policies into rbtrees
at insert time as soon as net-next opens up again.

* Re: Regression: kernel 4.14 an later very slow with many ipsec tunnels
  2018-10-25  9:38                         ` Wolfgang Walter
  2018-10-25 17:34                           ` David Miller
@ 2018-10-25 22:45                           ` Florian Westphal
  1 sibling, 0 replies; 22+ messages in thread
From: Florian Westphal @ 2018-10-25 22:45 UTC (permalink / raw)
  To: Wolfgang Walter
  Cc: netdev, Florian Westphal, Steffen Klassert, David Miller,
	linux-kernel, torvalds, christophe.gouault, Greg KH

Wolfgang Walter <linux@stwm.de> wrote:
> there is now a new 4.19 which still has the big performance regression when 
> many ipsec tunnels are configured (throughput and latency get worse by 10 to 
> 50 times) which makes any kernel > 4.9 unusable for our routers.

https://git.breakpoint.cc/cgit/fw/net-next.git/log/?h=xfrm_pol_18

This is mostly untested. If you want to test it anyway and
find bugs, please feel free to report them to me.

Improvements to the test script in patch #1 are welcome as well (it's what
I've been using so far to test this).

* Re: Regression: kernel 4.14 an later very slow with many ipsec tunnels
  2018-10-25 17:34                           ` David Miller
  2018-10-25 19:24                             ` Florian Westphal
@ 2018-10-26 12:18                             ` Wolfgang Walter
  1 sibling, 0 replies; 22+ messages in thread
From: Wolfgang Walter @ 2018-10-26 12:18 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, fw, steffen.klassert, linux-kernel, torvalds,
	christophe.gouault, gregkh

On Thursday, 25 October 2018, 10:34:50, David Miller wrote:
> From: Wolfgang Walter <linux@stwm.de>
> Date: Thu, 25 Oct 2018 11:38:19 +0200
> 
> > there is now a new 4.19 which still has the big performance regression
> > when
> > many ipsec tunnels are configured (throughput and latency get worse by 10
> > to 50 times) which makes any kernel > 4.9 unusable for our routers.
> > 
> > I still don't understand why a revert of the flow cache removal at least
> > for the longterm kernels is such a bad option (maybe as a compile time
> > option), especially as there is no workaround available.
> 
> You do know that the flow cache is DDoS targetable, right?
> 
> That's why we removed it, we did not make the change lightly.

Though this is true, we now simply have a permanent DDoS situation. With the 
flow cache removed, a router with enough ipsec tunnels is now always as slow 
as it previously would have been under a DDoS attack.
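
A toy model of that argument, as a sketch under loud assumptions: the
`PolicyLookup` class, the `build_policies` helper and the plain dictionary
cache are hypothetical and far simpler than the removed flow cache. The sketch
only counts selector checks, with and without a cache keyed by (src, dst):

```python
import ipaddress


def build_policies(n):
    """n hypothetical tunnel policies, one /29 destination each."""
    base = int(ipaddress.ip_address("10.1.0.0"))
    src = ipaddress.ip_network("10.0.0.0/16")
    return [(src, ipaddress.ip_network(f"{ipaddress.ip_address(base + 8 * i)}/29"), i)
            for i in range(n)]


class PolicyLookup:
    def __init__(self, policies, use_cache):
        self.policies = policies
        self.cache = {} if use_cache else None  # unbounded: a toy, not a design
        self.selector_checks = 0

    def find(self, src_ip, dst_ip):
        key = (src_ip, dst_ip)
        if self.cache is not None and key in self.cache:
            return self.cache[key]  # cache hit: no selector is evaluated
        src = ipaddress.ip_address(src_ip)
        dst = ipaddress.ip_address(dst_ip)
        best = None
        for p_src, p_dst, prio in self.policies:
            self.selector_checks += 1
            if src in p_src and dst in p_dst and (best is None or prio < best):
                best = prio
        if self.cache is not None:
            self.cache[key] = best
        return best


policies = build_policies(300)
cached = PolicyLookup(policies, use_cache=True)
uncached = PolicyLookup(policies, use_cache=False)
for _ in range(50):  # 50 packets of one and the same flow
    cached.find("10.0.0.1", "10.1.0.3")
    uncached.find("10.0.0.1", "10.1.0.3")
```

With one long-lived flow the cached variant scans the rule list once and the
uncached variant scans it for every packet. If every packet belongs to a new
flow, the cached variant scans every time as well and its table keeps growing,
which is the attack case.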

This is not comparable to the routing cache situation where a fast, well 
tested solution already existed (for routes in a table; if you use a lot of 
rules for policy routing this may be a different story).

Further, I don't think that the DoS would have been that strong an argument 
for the removal of the routing cache if routing performance had dropped by a 
factor of 10 or more.

Also, the routing cache was even a problem with legitimate traffic, so I never 
had a problem with the moderate performance regression its removal caused here.

> 
> Adding a DDoS vector back into the kernel is not an option, sorry.

All kernels >= 4.14 are in our use case as bad as if they were under attack. 
They are completely unusable and I even can't 

> 
> Please work diligently with Florian and others to try and find ways to
> soften the performance hit.
> 

I proposed to revert this for the longterm kernels, and only behind a 
compile-time option which has to be switched on explicitly. Then we could 
start using 4.19. People not using ipsec, or using it with < 100 rules, 
would still live without the flow cache.

Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
