linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* ip_conntrack_hash() problem
@ 2002-09-04 12:33 Martin Wilck
  2002-09-04 12:56 ` Harald Welte
  0 siblings, 1 reply; 21+ messages in thread
From: Martin Wilck @ 2002-09-04 12:33 UTC (permalink / raw)
  To: Netfilter Mailing List
  Cc: Linux Kernel mailing list, Rusty Russell, Patrick Schaaf,
	Harald Welte, Andreas Kleen

Hi,

I posted a patch to netfilter-devel a week ago that fixes a severe
performance problem with ip_conntrack_hash() (see below).
Harald rejected it (sort of), telling me I should have read past threads
about the hash function first. 

http://marc.theaimsgroup.com/?l=netfilter-devel&m=103054090215896&w=2

I think I have to insist on this, though.

Although it certainly isn't the "optimal" hash function for
ip_conntrack, it fixes a problem that leads to extremely unbalanced
hashing in some situations, in particular in a simple
client<->router<->webserver scenario. 

In that case, all connection tuples from server to client, i.e. 50% of
all tuples, end up in the same bucket(!), as I showed in my posting to
netfilter-devel.

This happens if the hash size is a power of 2, which is the default on
most newer machines.

The fix is rather trivial (mainly the port numbers are accounted for
outside the ntohl() function), and therefore I'd like to ask again that
it be applied.

Unless I am mistaken, the past discussions were mainly concerned with
fine-tuning of the hash function, which is a topic my patch doesn't
address, and can easily be done on top of it.

Regards,
Martin

-- 
Martin Wilck                Phone: +49 5251 8 15113
Fujitsu Siemens Computers   Fax:   +49 5251 8 20409
Heinz-Nixdorf-Ring 1	    mailto:Martin.Wilck@Fujitsu-Siemens.com
D-33106 Paderborn           http://www.fujitsu-siemens.com/primergy






^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: ip_conntrack_hash() problem
  2002-09-04 12:33 ip_conntrack_hash() problem Martin Wilck
@ 2002-09-04 12:56 ` Harald Welte
  2002-09-04 13:24   ` Martin Wilck
  2002-09-04 13:40   ` Martin Wilck
  0 siblings, 2 replies; 21+ messages in thread
From: Harald Welte @ 2002-09-04 12:56 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Netfilter Mailing List, Linux Kernel mailing list, Rusty Russell,
	Patrick Schaaf, Andreas Kleen

On Wed, Sep 04, 2002 at 02:33:41PM +0200, Martin Wilck wrote:
> Hi,
> 
> I posted a patch to netfilter-devel a week ago that fixes a severe
> performance problem with ip_conntrack_hash() (see below).
> Harald rejected it (sort of), telling me I should have read past threads
> about the hash function first. 

no, I didn't reject it.  I just said you should contribute your work
with the work of the other people, so we end up having one
conntrack-optimization patch.

> Although it certainly isn't the "optimal" hash function for
> ip_conntrack, it fixes a problem that leads to extremely unbalanced
> hashing in some situations, in particular in a simple
> client<->router<->webserver scenario. 

It is an artificial case, in which you have a single client opening 
thousands of connections to a single port on a singles server.  Please 
correct me, if I misunderstood.

> This happens if the hash size is a power of 2, which is the default on
> most newer machines.

yes, this is the thing we should change right now.  All other
optimizations should be sorted out as a whole.

> The fix is rather trivial (mainly the port numbers are accounted for
> outside the ntohl() function), and therefore I'd like to ask again that
> it be applied.

Would you be satisfied with making the default hashsize no longer a
power of two?

> Unless I am mistaken, the past discussions were mainly concerned with
> fine-tuning of the hash function, which is a topic my patch doesn't
> address, and can easily be done on top of it.

no, exactly the 'power-of-two' has been discussed as well.

> Regards,
> Martin

-- 
Live long and prosper
- Harald Welte / laforge@gnumonks.org               http://www.gnumonks.org/
============================================================================
GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o? K- w--- O- M- 
V-- PS+ PE-- Y+ PGP++ t++ 5-- !X !R tv-- b+++ DI? !D G+ e* h+ r% y+(*)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: ip_conntrack_hash() problem
  2002-09-04 12:56 ` Harald Welte
@ 2002-09-04 13:24   ` Martin Wilck
  2002-09-04 13:26     ` Andi Kleen
  2002-09-04 13:40   ` Martin Wilck
  1 sibling, 1 reply; 21+ messages in thread
From: Martin Wilck @ 2002-09-04 13:24 UTC (permalink / raw)
  To: Harald Welte
  Cc: Netfilter Mailing List, Linux Kernel mailing list, Rusty Russell,
	Patrick Schaaf, Andreas Kleen

Am Mit, 2002-09-04 um 14.56 schrieb Harald Welte:

> It is an artificial case, in which you have a single client opening 
> thousands of connections to a single port on a singles server.  Please 
> correct me, if I misunderstood.

As I stated before, yes, it's artificial, but it is easy to come up with
real-world scenarios that come close. (proxy<->router<->webserver).

> > The fix is rather trivial (mainly the port numbers are accounted for
> > outside the ntohl() function), and therefore I'd like to ask again that
> > it be applied.
> 
> Would you be satisfied with making the default hashsize no longer a
> power of two?

I don't know. After all, users expect that if they set hashsize=4096,
the hashsize will be 4096. It should be possible to use that setting
without suffering extreme performance losses. IMO the right thing for
now is take the port numbers out of the ntohl() function.

> > Unless I am mistaken, the past discussions were mainly concerned with
> > fine-tuning of the hash function, which is a topic my patch doesn't
> > address, and can easily be done on top of it.
> 
> no, exactly the 'power-of-two' has been discussed as well.

Right. But as stated above, making certain hash sizes impossible would
change the usage. Are we sure that there are no user-space tools out
there that rely on the hashsize being equal to what they specify when
the module is loaded?

I think the hash function should be fixed, not the possible choice of
hash sizes, if that is at feasible.

Martin

-- 
Martin Wilck                Phone: +49 5251 8 15113
Fujitsu Siemens Computers   Fax:   +49 5251 8 20409
Heinz-Nixdorf-Ring 1	    mailto:Martin.Wilck@Fujitsu-Siemens.com
D-33106 Paderborn           http://www.fujitsu-siemens.com/primergy






^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: ip_conntrack_hash() problem
  2002-09-04 13:24   ` Martin Wilck
@ 2002-09-04 13:26     ` Andi Kleen
  2002-09-05  0:39       ` Rusty Russell
  2002-09-05  8:19       ` Harald Welte
  0 siblings, 2 replies; 21+ messages in thread
From: Andi Kleen @ 2002-09-04 13:26 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Harald Welte, Netfilter Mailing List, Linux Kernel mailing list,
	Rusty Russell, Patrick Schaaf, Andreas Kleen

> I think the hash function should be fixed, not the possible choice of
> hash sizes, if that is at feasible.
I also agree with Martin that it is better to fix the hash function in
this case. Restricting to magic hash table sizes looks like a bad hack.

-Andi

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: ip_conntrack_hash() problem
  2002-09-04 12:56 ` Harald Welte
  2002-09-04 13:24   ` Martin Wilck
@ 2002-09-04 13:40   ` Martin Wilck
  1 sibling, 0 replies; 21+ messages in thread
From: Martin Wilck @ 2002-09-04 13:40 UTC (permalink / raw)
  To: Harald Welte
  Cc: Netfilter Mailing List, Linux Kernel mailing list, Rusty Russell,
	Patrick Schaaf, Andreas Kleen

Just to make my previous statement clearer:

I think there's nothing wrong with a power-of-2 hashsize, as long as the
hash function contains no implicit or explicit multipliers that are also
powers of two. In general, multipliers should not have a greatest common
divisor (GCD) larger than 1 with the hash size. Unfortunately, in the
current implementation, ntohl() creates an implicit multiplier of 2^16
for the port numbers (on little-endian machines).

Martin

PS: For the sake of that, the patch also changed the multiplier for the
source port from 2 to 7, assuming that it's relatively unlikely to have
a hash size that is a multiple of 7, and knowing that  multiplying by 7
is cheap. Instead of 7, 31 or 127 also seem good candidates that are
even more unlikely to be divisors of the hash size.

I recommend to printk() a warning if the hash size turns out to have a
GCD >1 with multiple of any multiplier in the hash function.

-- 
Martin Wilck                Phone: +49 5251 8 15113
Fujitsu Siemens Computers   Fax:   +49 5251 8 20409
Heinz-Nixdorf-Ring 1	    mailto:Martin.Wilck@Fujitsu-Siemens.com
D-33106 Paderborn           http://www.fujitsu-siemens.com/primergy






^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: ip_conntrack_hash() problem
  2002-09-04 13:26     ` Andi Kleen
@ 2002-09-05  0:39       ` Rusty Russell
  2002-09-05  6:21         ` Patrick Schaaf
                           ` (2 more replies)
  2002-09-05  8:19       ` Harald Welte
  1 sibling, 3 replies; 21+ messages in thread
From: Rusty Russell @ 2002-09-05  0:39 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Harald Welte, Netfilter Mailing List, Linux Kernel mailing list,
	Rusty Russell, Patrick Schaaf

In message <20020904152626.A11438@wotan.suse.de> you write:
> > I think the hash function should be fixed, not the possible choice of
> > hash sizes, if that is at feasible.
> I also agree with Martin that it is better to fix the hash function in
> this case. Restricting to magic hash table sizes looks like a bad hack.

This work is already done:
http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Netfilter/conntrack_hashing.patch.gz

Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: ip_conntrack_hash() problem
  2002-09-05  0:39       ` Rusty Russell
@ 2002-09-05  6:21         ` Patrick Schaaf
  2002-09-05  6:24           ` David S. Miller
  2002-09-05 11:14           ` Rusty Russell
  2002-09-05  6:51         ` Patrick Schaaf
  2002-09-05  7:19         ` Martin Wilck
  2 siblings, 2 replies; 21+ messages in thread
From: Patrick Schaaf @ 2002-09-05  6:21 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Andi Kleen, Harald Welte, Netfilter Mailing List,
	Linux Kernel mailing list, Patrick Schaaf

On Thu, Sep 05, 2002 at 10:39:40AM +1000, Rusty Russell wrote:
> In message <20020904152626.A11438@wotan.suse.de> you write:
> > > I think the hash function should be fixed, not the possible choice of
> > > hash sizes, if that is at feasible.
> > I also agree with Martin that it is better to fix the hash function in
> > this case. Restricting to magic hash table sizes looks like a bad hack.
> 
> This work is already done:
> http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Netfilter/conntrack_hashing.patch.gz

Some comments:

A) secs_between_rehash is doubled every rehash, and never decreased
   again. This looks broken. What's the rationale?
B) I despise the (1 << ...htable_bits) construct, used in several places.
   It's nothing but obfuscation. Please reinstate ...htable_size, and
   use that, the code will be more readable.
C) did you measure how much time the rehashing takes, for a single run
   on a significant (2^16 buckets, at least) conntracking table?
D) did you run your hash_conntrack() against my cttest bucket occupation
   simulator? Can we see comparing pictures?

To conclude, I agree with using a multiplicative hash, but I'm a bit nervous
about the rehashing thing. A single random hash_base call would be enough
to satisfy _my_ paranoia. IMHO the rehashing should be a compile/run time
"be more paranoid" option.

best regards
  Patrick

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: ip_conntrack_hash() problem
  2002-09-05  6:21         ` Patrick Schaaf
@ 2002-09-05  6:24           ` David S. Miller
  2002-09-05  6:33             ` Patrick Schaaf
  2002-09-05 11:14           ` Rusty Russell
  1 sibling, 1 reply; 21+ messages in thread
From: David S. Miller @ 2002-09-05  6:24 UTC (permalink / raw)
  To: bof; +Cc: rusty, ak, laforge, netfilter-devel, linux-kernel

   From: Patrick Schaaf <bof@bof.de>
   Date: Thu, 5 Sep 2002 08:21:28 +0200

   B) I despise the (1 << ...htable_bits) construct, used in several places.
      It's nothing but obfuscation. Please reinstate ...htable_size, and
      use that, the code will be more readable.

You despise, but the processor doesn't.  Less data loads
means the code goes faster.

Franks a lot,
David S. Miller
davem@redhat.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: ip_conntrack_hash() problem
  2002-09-05  6:33             ` Patrick Schaaf
@ 2002-09-05  6:32               ` David S. Miller
  2002-09-05  6:39                 ` Patrick Schaaf
  0 siblings, 1 reply; 21+ messages in thread
From: David S. Miller @ 2002-09-05  6:32 UTC (permalink / raw)
  To: bof; +Cc: rusty, ak, laforge, netfilter-devel, linux-kernel

   From: Patrick Schaaf <bof@bof.de>
   Date: Thu, 5 Sep 2002 08:33:40 +0200
   
   So, I don't see how your (abstractly true) observation is relevant, here.
   
So we waste 4 bytes in the kernel for really no reason?
A value we can compute in half a cycle?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: ip_conntrack_hash() problem
  2002-09-05  6:24           ` David S. Miller
@ 2002-09-05  6:33             ` Patrick Schaaf
  2002-09-05  6:32               ` David S. Miller
  0 siblings, 1 reply; 21+ messages in thread
From: Patrick Schaaf @ 2002-09-05  6:33 UTC (permalink / raw)
  To: David S. Miller; +Cc: bof, rusty, ak, laforge, netfilter-devel, linux-kernel

On Wed, Sep 04, 2002 at 11:24:25PM -0700, David S. Miller wrote:
> 
>    B) I despise the (1 << ...htable_bits) construct, used in several places.
>       It's nothing but obfuscation. Please reinstate ...htable_size, and
>       use that, the code will be more readable.
> 
> You despise, but the processor doesn't.  Less data loads
> means the code goes faster.

Please explain. I don't think that matters here:

Both _bits and _size are unsigned int, same amount of stuff to load. 

The one single per-packet-path use is in hash_conntrack(), where
the _bits thing can be used without touching the _size thing.

All other places where the patch now uses _bits, really need _size,
and do the ugly computation by shifting. And all those other places
are called very rarely.

So, I don't see how your (abstractly true) observation is relevant, here.

best regards
  Patrick

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: ip_conntrack_hash() problem
  2002-09-05  6:32               ` David S. Miller
@ 2002-09-05  6:39                 ` Patrick Schaaf
  2002-09-05  6:40                   ` David S. Miller
  2002-09-05 13:37                   ` Rik van Riel
  0 siblings, 2 replies; 21+ messages in thread
From: Patrick Schaaf @ 2002-09-05  6:39 UTC (permalink / raw)
  To: David S. Miller; +Cc: bof, rusty, ak, laforge, netfilter-devel, linux-kernel

On Wed, Sep 04, 2002 at 11:32:26PM -0700, David S. Miller wrote:
>    
>    So, I don't see how your (abstractly true) observation is relevant, here.
>    
> So we waste 4 bytes in the kernel for really no reason?
> A value we can compute in half a cycle?

Sorry, but I was under the impression that code readability was worth
the occasional static-global additional 4 bytes. I must have been
mistaken.

best regards
  Patrick

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: ip_conntrack_hash() problem
  2002-09-05  6:39                 ` Patrick Schaaf
@ 2002-09-05  6:40                   ` David S. Miller
  2002-09-05 13:37                   ` Rik van Riel
  1 sibling, 0 replies; 21+ messages in thread
From: David S. Miller @ 2002-09-05  6:40 UTC (permalink / raw)
  To: bof; +Cc: rusty, ak, laforge, netfilter-devel, linux-kernel

   From: Patrick Schaaf <bof@bof.de>
   Date: Thu, 5 Sep 2002 08:39:32 +0200
   
   Sorry, but I was under the impression that code readability was worth
   the occasional static-global additional 4 bytes. I must have been
   mistaken.

For the level of readability you're pining for, yes you are
mistaken.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: ip_conntrack_hash() problem
  2002-09-05  0:39       ` Rusty Russell
  2002-09-05  6:21         ` Patrick Schaaf
@ 2002-09-05  6:51         ` Patrick Schaaf
  2002-09-05  7:19         ` Martin Wilck
  2 siblings, 0 replies; 21+ messages in thread
From: Patrick Schaaf @ 2002-09-05  6:51 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Andi Kleen, Harald Welte, Netfilter Mailing List,
	Linux Kernel mailing list, Patrick Schaaf

Rusty,

> This work is already done:
> http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Netfilter/conntrack_hashing.patch.gz

Regarding the rehash check in ip_conntrack_find_get, wouldn't it be
better to do that in the confirm function, where a new conntrack
is put into the list? That's called a lot less often than _find_get,
and should be logically equivalent. IOW, why wait until we _find_
an overly long list, when we can rehash at the point in time when
it _became_ overly long?

best regards
  Patrick


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: ip_conntrack_hash() problem
  2002-09-05  0:39       ` Rusty Russell
  2002-09-05  6:21         ` Patrick Schaaf
  2002-09-05  6:51         ` Patrick Schaaf
@ 2002-09-05  7:19         ` Martin Wilck
  2002-09-05 11:15           ` Rusty Russell
  2 siblings, 1 reply; 21+ messages in thread
From: Martin Wilck @ 2002-09-05  7:19 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Andi Kleen, Harald Welte, Netfilter Mailing List,
	Linux Kernel mailing list, Patrick Schaaf

Am Don, 2002-09-05 um 02.39 schrieb Rusty Russell:

> This work is already done:
> http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Netfilter/conntrack_hashing.patch.gz

Well, beautiful.
Will this go into the main line kernel any time soon?
(I'd be for that).

If there's still need for discussion regarding this patch, as the thread
suggests, I'd propose to go for an intermediate, less intrusive fix like
mine first.

Martin

-- 
Martin Wilck                Phone: +49 5251 8 15113
Fujitsu Siemens Computers   Fax:   +49 5251 8 20409
Heinz-Nixdorf-Ring 1	    mailto:Martin.Wilck@Fujitsu-Siemens.com
D-33106 Paderborn           http://www.fujitsu-siemens.com/primergy






^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: ip_conntrack_hash() problem
  2002-09-04 13:26     ` Andi Kleen
  2002-09-05  0:39       ` Rusty Russell
@ 2002-09-05  8:19       ` Harald Welte
  1 sibling, 0 replies; 21+ messages in thread
From: Harald Welte @ 2002-09-05  8:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Martin Wilck, Netfilter Mailing List, Linux Kernel mailing list,
	Rusty Russell, Patrick Schaaf

On Wed, Sep 04, 2002 at 03:26:26PM +0200, Andi Kleen wrote:
> > I think the hash function should be fixed, not the possible choice of
> > hash sizes, if that is at feasible.
> I also agree with Martin that it is better to fix the hash function in
> this case. Restricting to magic hash table sizes looks like a bad hack.

I wasn't proposing to restrict it.  Just make it chose a sane default.

> -Andi

-- 
Live long and prosper
- Harald Welte / laforge@gnumonks.org               http://www.gnumonks.org/
============================================================================
GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o? K- w--- O- M- 
V-- PS+ PE-- Y+ PGP++ t++ 5-- !X !R tv-- b+++ DI? !D G+ e* h+ r% y+(*)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: ip_conntrack_hash() problem
  2002-09-05  6:21         ` Patrick Schaaf
  2002-09-05  6:24           ` David S. Miller
@ 2002-09-05 11:14           ` Rusty Russell
  1 sibling, 0 replies; 21+ messages in thread
From: Rusty Russell @ 2002-09-05 11:14 UTC (permalink / raw)
  To: Patrick Schaaf
  Cc: Andi Kleen, Harald Welte, Netfilter Mailing List,
	Linux Kernel mailing list

In message <20020905082128.D19551@oknodo.bof.de> you write:
> On Thu, Sep 05, 2002 at 10:39:40AM +1000, Rusty Russell wrote:
> > In message <20020904152626.A11438@wotan.suse.de> you write:
> > > > I think the hash function should be fixed, not the possible choice of
> > > > hash sizes, if that is at feasible.
> > > I also agree with Martin that it is better to fix the hash function in
> > > this case. Restricting to magic hash table sizes looks like a bad hack.
> > 
> > This work is already done:
> > http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Netfilter/connt
rack_hashing.patch.gz
> 
> Some comments:
> 
> A) secs_between_rehash is doubled every rehash, and never decreased
>    again. This looks broken. What's the rationale?

It's deliberate.  The randomness available in the system becomes
better with time.  If hashing sucks, then we're screwed anyway: let's
not spuriously stop the world for no improvement.  In an ideal world
we'd say "feed a random number into ip_conntrack once you've booted",
but relying on userspace to do that to avoid DoS is inadvisable.

> B) I despise the (1 << ...htable_bits) construct, used in several places.
>    It's nothing but obfuscation. Please reinstate ...htable_size, and
>    use that, the code will be more readable.

Well, if you really dislike it, how about:

	static inline unsigned htable_size(void)
	{
		return 1 << ip_conntrack_htable_bits;
	}

I chose it because the number of *bits* is the fundament now.

> C) did you measure how much time the rehashing takes, for a single run
>    on a significant (2^16 buckets, at least) conntracking table?

Good question.  I expect it to take an awfully long time, and it will
only get slower when I make the locks per-hash chain (ie. grabbing
2^16 locks will take a long time).  Hence 'don't do it often!'.

We also have an opportunity to resize the hash table here (for the
multiple lock extention, this means grabbing the network write
brlock, which needs to be done from a userspace context).  We *could*
do this when someone frobs ip_conntrack_max though, if we wanted.

> D) did you run your hash_conntrack() against my cttest bucket occupation
>    simulator? Can we see comparing pictures?

No, I havent: I was hoping that you would do that 8)

> To conclude, I agree with using a multiplicative hash, but I'm a bit nervous
> about the rehashing thing. A single random hash_base call would be enough
> to satisfy _my_ paranoia. IMHO the rehashing should be a compile/run time
> "be more paranoid" option.

I'd be happy, too, *except* there's no entropy at boot 8(  Hence my
workaround.

In message <20020905085103.G19551@oknodo.bof.de> you write:
> Regarding the rehash check in ip_conntrack_find_get, wouldn't it be
> better to do that in the confirm function, where a new conntrack
> is put into the list? That's called a lot less often than _find_get,
> and should be logically equivalent. IOW, why wait until we _find_
> an overly long list, when we can rehash at the point in time when
> it _became_ overly long?

Yes, agreed, that's much nicer, esp. since we hold the lock anyway.
Done.

Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

Name: Conntrack Hashing Patch Rewrite
Author: Rusty Russell
Status: Experimental
Depends: 

D: This changes connection tracking's hash to use a secret and
D: multiplicitive the hashing.  The secret is rekeyed occasionally.

diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.33/net/ipv4/netfilter/ip_conntrack_core.c working-2.5.33-conntrack-hash/net/ipv4/netfilter/ip_conntrack_core.c
--- linux-2.5.33/net/ipv4/netfilter/ip_conntrack_core.c	2002-08-28 09:29:54.000000000 +1000
+++ working-2.5.33-conntrack-hash/net/ipv4/netfilter/ip_conntrack_core.c	2002-09-05 21:11:36.000000000 +1000
@@ -32,6 +32,9 @@
 #include <linux/stddef.h>
 #include <linux/sysctl.h>
 #include <linux/slab.h>
+#include <linux/hash.h>
+#include <linux/bitops.h>
+#include <linux/random.h>
 /* For ERR_PTR().  Yeah, I know... --RR */
 #include <linux/fs.h>
 
@@ -61,14 +64,62 @@ void (*ip_conntrack_destroyed)(struct ip
 LIST_HEAD(ip_conntrack_expect_list);
 LIST_HEAD(protocol_list);
 static LIST_HEAD(helpers);
-unsigned int ip_conntrack_htable_size = 0;
+unsigned int ip_conntrack_htable_bits;
 static int ip_conntrack_max = 0;
 static atomic_t ip_conntrack_count = ATOMIC_INIT(0);
 struct list_head *ip_conntrack_hash;
 static kmem_cache_t *ip_conntrack_cachep;
+static unsigned long hash_base;
+static unsigned long secs_between_rehash, min_next_rehash;
+
+static inline unsigned htable_size(void)
+{
+	return 1 << ip_conntrack_htable_bits;
+}
 
 extern struct ip_conntrack_protocol ip_conntrack_generic_protocol;
 
+/* Very slow, so only call occasionally in case we didn't have enough
+   randomness to avoid chain bombing by a determined attacker. */
+static void rehash_conntracks(void)
+{
+	unsigned int i;
+	unsigned long new_hash_base;
+	LIST_HEAD(conntracks);
+
+	MUST_BE_WRITE_LOCKED(&ip_conntrack_lock);
+
+	get_random_bytes(&new_hash_base, sizeof(new_hash_base));
+
+	/* Strip everything. */
+	for (i = 0; i < htable_size(); i++) {
+		struct list_head *l;
+		l = ip_conntrack_hash[i].next;
+		list_del_init(&ip_conntrack_hash[i]);
+		list_add(l, &conntracks);
+	}
+
+	/* Change hash over */
+	hash_base ^= new_hash_base;
+
+	/* Now add them back, one at a time */
+	while (!list_empty(&conntracks)) {
+		struct ip_conntrack_tuple_hash *h;
+		unsigned int hash;
+		struct list_head *l = conntracks.next;
+
+		list_del(l);
+		h = list_entry(l, struct ip_conntrack_tuple_hash, list);
+		hash = hash_conntrack(&h->tuple);
+		list_add(&h->list, &ip_conntrack_hash[hash]);
+	}
+
+	/* Double timeout if we can */
+	if (secs_between_rehash < ULONG_MAX/2/HZ)
+		secs_between_rehash *= 2;
+	min_next_rehash = jiffies + secs_between_rehash*HZ;
+}
+
 static inline int proto_cmpfn(const struct ip_conntrack_protocol *curr,
 			      u_int8_t protocol)
 {
@@ -111,17 +162,13 @@ ip_conntrack_put(struct ip_conntrack *ct
 static inline u_int32_t
 hash_conntrack(const struct ip_conntrack_tuple *tuple)
 {
-#if 0
-	dump_tuple(tuple);
-#endif
-	/* ntohl because more differences in low bits. */
-	/* To ensure that halves of the same connection don't hash
-	   clash, we add the source per-proto again. */
-	return (ntohl(tuple->src.ip + tuple->dst.ip
-		     + tuple->src.u.all + tuple->dst.u.all
-		     + tuple->dst.protonum)
-		+ ntohs(tuple->src.u.all))
-		% ip_conntrack_htable_size;
+	unsigned long hash = hash_base;
+
+	hash ^= tuple->src.ip;
+	hash ^= tuple->dst.ip << (BITS_PER_LONG - 32);
+	hash ^= (tuple->src.u.all + tuple->dst.u.all << 16)
+		 << (BITS_PER_LONG - 32)/2;
+	return hash_long(hash, ip_conntrack_htable_bits);
 }
 
 inline int
@@ -358,9 +405,11 @@ static void death_by_timeout(unsigned lo
 static inline int
 conntrack_tuple_cmp(const struct ip_conntrack_tuple_hash *i,
 		    const struct ip_conntrack_tuple *tuple,
-		    const struct ip_conntrack *ignored_conntrack)
+		    const struct ip_conntrack *ignored_conntrack,
+		    unsigned int *count)
 {
 	MUST_BE_READ_LOCKED(&ip_conntrack_lock);
+	if (count) (*count)++;
 	return i->ctrack != ignored_conntrack
 		&& ip_ct_tuple_equal(tuple, &i->tuple);
 }
@@ -375,7 +424,7 @@ __ip_conntrack_find(const struct ip_conn
 	h = LIST_FIND(&ip_conntrack_hash[hash_conntrack(tuple)],
 		      conntrack_tuple_cmp,
 		      struct ip_conntrack_tuple_hash *,
-		      tuple, ignored_conntrack);
+		      tuple, ignored_conntrack, NULL);
 	return h;
 }
 
@@ -420,7 +469,7 @@ ip_conntrack_get(struct sk_buff *skb, en
 int
 __ip_conntrack_confirm(struct nf_ct_info *nfct)
 {
-	unsigned int hash, repl_hash;
+	unsigned int hash, repl_hash, count;
 	struct ip_conntrack *ct;
 	enum ip_conntrack_info ctinfo;
 
@@ -447,17 +496,19 @@ __ip_conntrack_confirm(struct nf_ct_info
 	DEBUGP("Confirming conntrack %p\n", ct);
 
 	WRITE_LOCK(&ip_conntrack_lock);
+	count = 0;
 	/* See if there's one in the list already, including reverse:
            NAT could have grabbed it without realizing, since we're
            not in the hash.  If there is, we lost race. */
 	if (!LIST_FIND(&ip_conntrack_hash[hash],
 		       conntrack_tuple_cmp,
 		       struct ip_conntrack_tuple_hash *,
-		       &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, NULL)
+		       &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, NULL, &count)
 	    && !LIST_FIND(&ip_conntrack_hash[repl_hash],
 			  conntrack_tuple_cmp,
 			  struct ip_conntrack_tuple_hash *,
-			  &ct->tuplehash[IP_CT_DIR_REPLY].tuple, NULL)) {
+			  &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
+			  NULL, &count)) {
 		list_prepend(&ip_conntrack_hash[hash],
 			     &ct->tuplehash[IP_CT_DIR_ORIGINAL]);
 		list_prepend(&ip_conntrack_hash[repl_hash],
@@ -468,6 +519,16 @@ __ip_conntrack_confirm(struct nf_ct_info
 		ct->timeout.expires += jiffies;
 		add_timer(&ct->timeout);
 		atomic_inc(&ct->ct_general.use);
+
+		/* Check for really long hash chains: say 64* normal
+                   for the two of them. */
+		if (count / 64 > (ip_conntrack_max >> ip_conntrack_htable_bits)
+		    && time_after(jiffies, min_next_rehash)) {
+			printk("ip_conntrack: Rehashing for len %u / %u\n",
+			       count,
+			       (ip_conntrack_max >> ip_conntrack_htable_bits));
+			rehash_conntracks();
+		}
 		WRITE_UNLOCK(&ip_conntrack_lock);
 		return NF_ACCEPT;
 	}
@@ -646,7 +707,7 @@ init_conntrack(const struct ip_conntrack
 		/* Try dropping from random chain, or else from the
                    chain about to put into (in case they're trying to
                    bomb one hash chain). */
-		unsigned int next = (drop_next++)%ip_conntrack_htable_size;
+		unsigned int next = ((drop_next++) & (htable_size() - 1));
 
 		if (!early_drop(&ip_conntrack_hash[next])
 		    && !early_drop(&ip_conntrack_hash[hash])) {
@@ -1154,7 +1215,7 @@ void ip_conntrack_helper_unregister(stru
 	LIST_DELETE(&helpers, me);
 
 	/* Get rid of expecteds, set helpers to NULL. */
-	for (i = 0; i < ip_conntrack_htable_size; i++)
+	for (i = 0; i < htable_size(); i++)
 		LIST_FIND_W(&ip_conntrack_hash[i], unhelp,
 			    struct ip_conntrack_tuple_hash *, me);
 	WRITE_UNLOCK(&ip_conntrack_lock);
@@ -1262,7 +1323,7 @@ get_next_corpse(int (*kill)(const struct
 	unsigned int i;
 
 	READ_LOCK(&ip_conntrack_lock);
-	for (i = 0; !h && i < ip_conntrack_htable_size; i++) {
+	for (i = 0; !h && i < htable_size(); i++) {
 		h = LIST_FIND(&ip_conntrack_hash[i], do_kill,
 			      struct ip_conntrack_tuple_hash *, kill, data);
 	}
@@ -1411,23 +1472,16 @@ int __init ip_conntrack_init(void)
 
 	/* Idea from tcp.c: use 1/16384 of memory.  On i386: 32MB
 	 * machine has 256 buckets.  >= 1GB machines have 8192 buckets. */
- 	if (hashsize) {
- 		ip_conntrack_htable_size = hashsize;
- 	} else {
-		ip_conntrack_htable_size
-			= (((num_physpages << PAGE_SHIFT) / 16384)
-			   / sizeof(struct list_head));
+ 	if (!hashsize) {
+		hashsize = (((num_physpages << PAGE_SHIFT) / 16384)
+			    / sizeof(struct list_head));
 		if (num_physpages > (1024 * 1024 * 1024 / PAGE_SIZE))
-			ip_conntrack_htable_size = 8192;
-		if (ip_conntrack_htable_size < 16)
-			ip_conntrack_htable_size = 16;
+			hashsize = 8192;
+		if (hashsize < 16)
+			hashsize = 16;
 	}
-	ip_conntrack_max = 8 * ip_conntrack_htable_size;
-
-	printk("ip_conntrack version %s (%u buckets, %d max)"
-	       " - %d bytes per conntrack\n", IP_CONNTRACK_VERSION,
-	       ip_conntrack_htable_size, ip_conntrack_max,
-	       sizeof(struct ip_conntrack));
+	ip_conntrack_htable_bits = fls(hashsize);
+	ip_conntrack_max = 8 * htable_size();
 
 	ret = nf_register_sockopt(&so_getorigdst);
 	if (ret != 0) {
@@ -1471,6 +1525,10 @@ int __init ip_conntrack_init(void)
 	}
 #endif /*CONFIG_SYSCTL*/
 
+	/* Start with 1 second between rehashing, any time now */
+	min_next_rehash = jiffies-1;
+	secs_between_rehash = 1;
+
 	/* For use by ipt_REJECT */
 	ip_ct_attach = ip_conntrack_attach;
 	return ret;

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: ip_conntrack_hash() problem
  2002-09-05  7:19         ` Martin Wilck
@ 2002-09-05 11:15           ` Rusty Russell
  2002-09-05 11:54             ` Andi Kleen
  0 siblings, 1 reply; 21+ messages in thread
From: Rusty Russell @ 2002-09-05 11:15 UTC (permalink / raw)
  To: Martin Wilck
  Cc: Andi Kleen, Harald Welte, Netfilter Mailing List,
	Linux Kernel mailing list, Patrick Schaaf

In message <1031210342.9785.159.camel@biker.pdb.fsc.net> you write:
> Am Don, 2002-09-05 um 02.39 schrieb Rusty Russell:
> 
> > This work is already done:
> > http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Netfilter/connt
rack_hashing.patch.gz
> 
> Well, beautiful.
> Will this go into the main line kernel any time soon?
> (I'd be for that).
> 
> If there's still need for discussion regarding this patch, as the thread
> suggests, I'd propose to go for an intermediate, less intrusive fix like
> mine first.

I'll feed this through 2.5, once it's had some more testing (hint
hint!).  Then Harald will make a call on 2.4.

Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: ip_conntrack_hash() problem
  2002-09-05 11:15           ` Rusty Russell
@ 2002-09-05 11:54             ` Andi Kleen
  2002-09-05 17:55               ` Patrick Schaaf
  0 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2002-09-05 11:54 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Martin Wilck, Andi Kleen, Harald Welte, Netfilter Mailing List,
	Linux Kernel mailing list, Patrick Schaaf

> I'll feed this through 2.5, once it's had some more testing (hint
> hint!).  Then Harald will make a call on 2.4.

I would propose to include Martin's simple patch as a short term fix 
for 2.4 until Rusty's full work can get in. It fixes an bad performance
problems in some not too uncommon cases.

-Andi

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: ip_conntrack_hash() problem
  2002-09-05  6:39                 ` Patrick Schaaf
  2002-09-05  6:40                   ` David S. Miller
@ 2002-09-05 13:37                   ` Rik van Riel
  1 sibling, 0 replies; 21+ messages in thread
From: Rik van Riel @ 2002-09-05 13:37 UTC (permalink / raw)
  To: Patrick Schaaf
  Cc: David S. Miller, rusty, ak, laforge, netfilter-devel, linux-kernel

On Thu, 5 Sep 2002, Patrick Schaaf wrote:

> Sorry, but I was under the impression that code readability was worth
> the occasional static-global additional 4 bytes. I must have been
> mistaken.

If you want code readability, you could use a #define ;)

Rik
-- 
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: ip_conntrack_hash() problem
  2002-09-05 11:54             ` Andi Kleen
@ 2002-09-05 17:55               ` Patrick Schaaf
  2002-09-05 18:24                 ` Martin Wilck
  0 siblings, 1 reply; 21+ messages in thread
From: Patrick Schaaf @ 2002-09-05 17:55 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Rusty Russell, Martin Wilck, Harald Welte,
	Netfilter Mailing List, Linux Kernel mailing list,
	Patrick Schaaf

On Thu, Sep 05, 2002 at 01:54:40PM +0200, Andi Kleen wrote:
> > I'll feed this through 2.5, once it's had some more testing (hint
> > hint!).  Then Harald will make a call on 2.4.
> 
> I would propose to include Martin's simple patch as a short term fix 
> for 2.4 until Rusty's full work can get in. It fixes an bad performance
> problems in some not too uncommon cases.

As a short time fix, seeing that it's mostly even hash bucket counts
that give a problem, I would still propose just making the bucket count
the nearest odd number, i.e. basically htable_size |= 1 in the startup code.

I don't expect any user to notice such a miniscule change.

best regards
  Patrick


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: ip_conntrack_hash() problem
  2002-09-05 17:55               ` Patrick Schaaf
@ 2002-09-05 18:24                 ` Martin Wilck
  0 siblings, 0 replies; 21+ messages in thread
From: Martin Wilck @ 2002-09-05 18:24 UTC (permalink / raw)
  To: Patrick Schaaf
  Cc: Andi Kleen, Rusty Russell, Harald Welte, Netfilter Mailing List,
	Linux Kernel mailing list

Am Don, 2002-09-05 um 19.55 schrieb Patrick Schaaf:
> As a short time fix, seeing that it's mostly even hash bucket counts
> that give a problem, I would still propose just making the bucket count
> the nearest odd number, i.e. basically htable_size |= 1 in the startup code.
> 
> I don't expect any user to notice such a miniscule change.

Well then, as most people seem to think this is the way to go, let's do
it, and hope that Rusty's patch will be ready for prime-time soon.

Martin

-- 
Martin Wilck                Phone: +49 5251 8 15113
Fujitsu Siemens Computers   Fax:   +49 5251 8 20409
Heinz-Nixdorf-Ring 1	    mailto:Martin.Wilck@Fujitsu-Siemens.com
D-33106 Paderborn           http://www.fujitsu-siemens.com/primergy






^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2002-09-05 18:19 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-09-04 12:33 ip_conntrack_hash() problem Martin Wilck
2002-09-04 12:56 ` Harald Welte
2002-09-04 13:24   ` Martin Wilck
2002-09-04 13:26     ` Andi Kleen
2002-09-05  0:39       ` Rusty Russell
2002-09-05  6:21         ` Patrick Schaaf
2002-09-05  6:24           ` David S. Miller
2002-09-05  6:33             ` Patrick Schaaf
2002-09-05  6:32               ` David S. Miller
2002-09-05  6:39                 ` Patrick Schaaf
2002-09-05  6:40                   ` David S. Miller
2002-09-05 13:37                   ` Rik van Riel
2002-09-05 11:14           ` Rusty Russell
2002-09-05  6:51         ` Patrick Schaaf
2002-09-05  7:19         ` Martin Wilck
2002-09-05 11:15           ` Rusty Russell
2002-09-05 11:54             ` Andi Kleen
2002-09-05 17:55               ` Patrick Schaaf
2002-09-05 18:24                 ` Martin Wilck
2002-09-05  8:19       ` Harald Welte
2002-09-04 13:40   ` Martin Wilck

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).