* Route cache performance under stress
@ 2003-04-05 16:37 Florian Weimer
  2003-04-05 18:17 ` Martin Josefsson
                   ` (2 more replies)
  0 siblings, 3 replies; 227+ messages in thread
From: Florian Weimer @ 2003-04-05 16:37 UTC (permalink / raw)
  To: linux-kernel

Please read the following paper:

<http://www.cs.rice.edu/~scrosby/tr/HashAttack.pdf>

Then look at the 2.4 route cache implementation.

Short summary: It is possible to freeze machines with 1 GB of RAM and
more with a stream of 400 packets per second with carefully chosen
source addresses.  Not good.
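
(To make the attack concrete: the hash of that era mixes saddr, daddr
and TOS with no secret input, so an attacker can evaluate it offline
and keep only the spoofed source addresses that land in one chosen
bucket.  The user-space sketch below illustrates the idea;
model_hash() is a simplified stand-in, not the exact kernel function:

#include <stdint.h>
#include <stdio.h>

#define HASH_BITS 10			/* pretend rt_hash_mask + 1 == 1024 */
#define HASH_MASK ((1u << HASH_BITS) - 1)

static uint32_t model_hash(uint32_t saddr, uint32_t daddr, uint8_t tos)
{
	uint32_t h = saddr ^ daddr ^ tos;	/* no secret key anywhere */

	h ^= h >> 16;
	h ^= h >> 8;
	return h & HASH_MASK;
}

int main(void)
{
	uint32_t daddr = 0xc0a80001;		/* victim, 192.168.0.1 */
	uint32_t target = model_hash(0x0a000001, daddr, 0);
	uint32_t saddr;
	int found = 0;

	/* scan candidate spoofed sources; every hit chains in one bucket */
	for (saddr = 1; saddr != 0 && found < 400; saddr++) {
		if (model_hash(saddr, daddr, 0) == target) {
			printf("%u.%u.%u.%u\n", saddr >> 24,
			       (saddr >> 16) & 0xff, (saddr >> 8) & 0xff,
			       saddr & 0xff);
			found++;
		}
	}
	return 0;
}

Replay a list built this way at 400 packets per second and every
packet walks the same ever-growing chain, which is the freeze
described above.)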

The route cache is a DoS bottleneck in general (that's why I started
looking at it).  You have to apply rate-limits in the PREROUTING
chain, otherwise a modest packet flood will push the machine off the
network (even with truly random source addresses, not triggering hash
collisions).  The route cache partially defeats the purpose of SYN
cookies, too, because the kernel keeps (transient) state for spoofed
connection attempts in the route cache.

The following patch can be applied in an emergency, if you face the
hash collision DoS attack.  It drastically limits the size of the
cache (but not the bucket count), and decreases performance in some
applications, but it should keep the machine responsive during an
attack.

--- route.c	2003/04/05 12:41:51	1.1
+++ route.c	2003/04/05 12:42:42
@@ -2508,8 +2508,8 @@
 		rt_hash_table[i].chain = NULL;
 	}
 
-	ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
-	ip_rt_max_size = (rt_hash_mask + 1) * 16;
+	ipv4_dst_ops.gc_thresh = 512;
+	ip_rt_max_size = 2048;
 
 	devinet_init();
 	ip_fib_init();


(Yeah, I know, it's stupid, but it might help in an emergency.)

I wonder why the route cache is needed at all for hosts which don't
forward any IP packets, and why it has to include the source addresses
and TOS (for policy-based routing, probably).  Most hosts simply don't
face routing decisions complex enough for the cache to be a win.

If you don't believe me, hook a Linux box to a packet generator
(generating packets with random source addresses) and use iptables to
drop the packets, first in the INPUT chain (after the route cache),
then in the PREROUTING chain (before the route cache).  I've observed
an incredible difference (not in laboratory tests, but during actual
DoS attacks).

Netfilter ip_conntrack support might have similar issues, but you
can't use it in an uncooperative environment anyway, at least in my
experience.  (Note that there appears to be no way to disable
connection tracking while the code is in the kernel.)

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-04-05 16:37 Route cache performance under stress Florian Weimer
@ 2003-04-05 18:17 ` Martin Josefsson
  2003-04-05 18:34 ` Willy Tarreau
  2003-05-16 22:24 ` Simon Kirby
  2 siblings, 0 replies; 227+ messages in thread
From: Martin Josefsson @ 2003-04-05 18:17 UTC (permalink / raw)
  To: Florian Weimer; +Cc: linux-kernel, netdev, bert hubert

On Sat, 2003-04-05 at 18:37, Florian Weimer wrote:

> Netfilter ip_conntrack support might have similar issues, but you
> can't use it in a uncooperative environment anyway, at least in my
> experience.  (Note that there appears to be no way to disable
> connection tracking while the code is in the kernel.)

It's correct that ip_conntrack has similar issues. There's been some
work on the hash algorithm used, but no patch has been made yet.
And yes, it doesn't scale well at all (especially on SMP); I'm about to
start working on this a bit. Hopefully I can improve it somewhat.

If you've compiled ip_conntrack into your kernel there are only two
ways to disable it, and both need code modifications :)

Install a netfilter module that gets the packets before conntrack and
steals them; the downside is that you will bypass the rest of
iptables as well.

Apply a patch from patch-o-matic that adds a NOTRACK target which
instructs conntrack not to look at the packets marked by that target.

-- 
/Martin

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-04-05 16:37 Route cache performance under stress Florian Weimer
  2003-04-05 18:17 ` Martin Josefsson
@ 2003-04-05 18:34 ` Willy Tarreau
  2003-05-16 22:24 ` Simon Kirby
  2 siblings, 0 replies; 227+ messages in thread
From: Willy Tarreau @ 2003-04-05 18:34 UTC (permalink / raw)
  To: linux-kernel

Hello !

On Sat, Apr 05, 2003 at 06:37:43PM +0200, Florian Weimer wrote:
> Please read the following paper:
> 
> <http://www.cs.rice.edu/~scrosby/tr/HashAttack.pdf>

very interesting article.

> Then look at the 2.4 route cache implementation.

Since we need commutative source/dest addresses in many places, the use of
XOR is common practice. In fact, while working on hash tables a while ago, I
found that I could get very good results with something such as :

   RND1 = random_generated_at_start_time() ;
   RND2 = random_generated_at_start_time() ;
   /* RND2 may be 0 or equal to RND1, all cases seem OK */
   x = (RND1 - saddr) ^ (RND1 - daddr) ^ (RND2 + saddr + daddr);
   reduce(x) ...

With this method, I found no way to guess a predictable (saddr, daddr) couple
which gives the same result, and saddr/daddr are still commutative. It resists
common cases where saddr=daddr, saddr=~daddr, saddr=-daddr. And *I think* that
the random makes other cases difficult to predict. I'm not specialized in
crypto or whatever, so I cannot tell how to generate the best RND1/2, and it's
obvious to me that stupid values like 0 or -1 may not help a lot, but this is
still better than a trivial saddr ^ daddr, at a very low cost.
For example, the x86 encoding of the simple XOR hash would result in :
  mov saddr, %eax
  xor daddr, %eax
 => 2 cycles with result in %eax

The new calculation will look like :
  mov saddr, %ebx
  mov daddr, %ecx

  lea (%ebx,%ecx,1), %eax
  neg %ecx

  add RND2, %eax		// can be omitted if zero
  add RND1, %ecx

  neg %ebx
  xor %ecx, %eax

  add RND1, %ebx
  xor %ebx, %eax
  
=> 5 cycles on dual-pipeline CPUs, result in %eax, but uses 2 more regs.
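
(For readers who prefer C to assembler, here is the same construction
as a self-contained sketch; the key generation and the demo values are
illustrative, not taken from any kernel code:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* keys picked once at start-up; a kernel would use get_random_bytes() */
static uint32_t rnd1, rnd2;

/* keyed, yet still symmetric in (saddr, daddr) */
static uint32_t keyed_hash(uint32_t saddr, uint32_t daddr)
{
	return (rnd1 - saddr) ^ (rnd1 - daddr) ^ (rnd2 + saddr + daddr);
}

int main(void)
{
	uint32_t a = 0xc0a80001, b = 0x0a000001;

	srand(time(NULL));
	rnd1 = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
	rnd2 = ((uint32_t)rand() << 16) ^ (uint32_t)rand();

	/* forward and reverse direction of a flow map to the same value */
	printf("h(a,b)=%08x h(b,a)=%08x\n", keyed_hash(a, b), keyed_hash(b, a));
	return 0;
}

Swapping saddr and daddr only swaps the first two XOR operands and
leaves the third term unchanged, so the hash stays commutative as
required, while RND1/RND2 keep the bucket choice unpredictable to an
attacker who cannot read them.)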

Any comments ?

Regards,
Willy


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-04-05 16:37 Route cache performance under stress Florian Weimer
  2003-04-05 18:17 ` Martin Josefsson
  2003-04-05 18:34 ` Willy Tarreau
@ 2003-05-16 22:24 ` Simon Kirby
  2003-05-16 23:16   ` Florian Weimer
  2003-05-17  2:35   ` David S. Miller
  2 siblings, 2 replies; 227+ messages in thread
From: Simon Kirby @ 2003-05-16 22:24 UTC (permalink / raw)
  To: Florian Weimer; +Cc: linux-kernel

On Sat, Apr 05, 2003 at 06:37:43PM +0200, Florian Weimer wrote:

> Please read the following paper:
> 
> <http://www.cs.rice.edu/~scrosby/tr/HashAttack.pdf>
> 
> Then look at the 2.4 route cache implementation.
> 
> Short summary: It is possible to freeze machines with 1 GB of RAM and
> more with a stream of 400 packets per second with carefully chosen
> source addresses.  Not good.

Finally, somebody else has noticed this problem!  Unfortunately, I didn't
see this message until now.

I have been seeing this problem for over a year, and have had the same
problems you have with DoS attacks saturating the CPUs on our routers.

We have two Athlon 1800MP boxes doing routing on our network, and the CPU
saturates under embarrassingly low traffic rates with random source IPs.
We've noticed this a few times with DoS attacks generated internally and
with remote DoS attacks.  I too have had to abuse the PREROUTING chain
(in the mangle table to avoid loading the nat table which would bring in
connection tracking -- grr...I hate the way this works in iptables),
particularly with the MSSQL worm that burst out to the 'net that one
Friday night several few months ago.  Adding a single match udp port,
DROP rule to PREROUTING chain made the load go back down to normal
levels.  The same rule in the INPUT/FORWARD chain had no affect on the
CPU utilization (still saturated).

> The route cache is a DoS bottleneck in general (that's why I started
> looking at it).  You have to apply rate-limits in the PREROUTING
> chain, otherwise a modest packet flood will push the machine off the
> network (even with truly random source addresses, not triggering hash
> collisions).  The route cache partially defeats the purpose of SYN
> cookies, too, because the kernel keeps (transient) state for spoofed
> connection attempts in the route cache.

The idea, I believe, was that the route cache was supposed to stay as a
mostly O(1) overhead and not fall over in any specific cases.  As you
say, however, we also have problems with truly random IPs killing large
boxes.  This same box is capable of routing more than one gigabit of tiny
(64 byte) packets when the source IP is not random (using Tigon3 cards).

Under normal operation, it looks like most of the load we are seeing is in
fact normal route lookups.  We run BGP peering, and so there are a lot of
routes in the table.  Alexey suggested adding another level of hashing to
the fn_hash_lookup function, but that didn't seem to help very much.  The
last time I was looking at this, I enabled profiling on one router to see
what functions were using the most CPU.  Here are the results (71 days
uptime):

	http://blue.netnation.com/sim/ref/readprofile.r1
	http://blue.netnation.com/sim/ref/readprofile.r1.call_sorted
	http://blue.netnation.com/sim/ref/readprofile.r1.time_sorted

The results of this profile are from mostly normal operation.

I will try playing more with this code and look at your patch and paper.

Thanks,

Simon-

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-16 22:24 ` Simon Kirby
@ 2003-05-16 23:16   ` Florian Weimer
  2003-05-19 19:10     ` Simon Kirby
  2003-05-17  2:35   ` David S. Miller
  1 sibling, 1 reply; 227+ messages in thread
From: Florian Weimer @ 2003-05-16 23:16 UTC (permalink / raw)
  To: Simon Kirby; +Cc: linux-kernel

Simon Kirby <sim@netnation.org> writes:

> I hate the way this works in iptables), particularly with the MSSQL
> worm that burst out to the 'net that one Friday night several few
> months ago.  Adding a single match udp port, DROP rule to PREROUTING
> chain made the load go back down to normal levels.  The same rule in
> the INPUT/FORWARD chain had no affect on the CPU utilization (still
> saturated).

Yes, that's exactly the phenomenon, but operators traditionally
attributed it to other things running on the router (such as
accounting).

> Under normal operation, it looks like most load we are seeing is in fact
> normal route lookups.  We run BGP peering, and so there is a lot of
> routes in the table.

You should aggregate the routes before you load them into the kernel.
Hardly anybody seems to do this, but usually, you have much fewer
interfaces than prefixes 8-), so this could result in a huge win.
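
(To make the aggregation concrete: two sibling prefixes of the same
length that share a next hop can be replaced by their common parent,
and repeating that over a full feed with only a handful of next hops
shrinks it dramatically.  A toy sketch, with illustrative names and
addresses:

#include <stdint.h>
#include <stdio.h>

struct route { uint32_t prefix; int len; uint32_t nexthop; };

/* merge two adjacent halves of the same parent if the next hop matches */
static int try_merge(const struct route *a, const struct route *b,
		     struct route *out)
{
	uint32_t sibling_bit = 1u << (32 - a->len);

	if (a->len != b->len || a->nexthop != b->nexthop)
		return 0;
	if ((a->prefix ^ b->prefix) != sibling_bit)
		return 0;			/* not adjacent halves */

	out->prefix = a->prefix & ~sibling_bit;	/* the common parent */
	out->len = a->len - 1;
	out->nexthop = a->nexthop;
	return 1;
}

int main(void)
{
	/* 10.0.0.0/24 and 10.0.1.0/24 via the same gateway... */
	struct route a = { 0x0a000000, 24, 0xc0a80001 };
	struct route b = { 0x0a000100, 24, 0xc0a80001 };
	struct route m;

	if (try_merge(&a, &b, &m))
		printf("merged into %08x/%d\n", m.prefix, m.len);
	return 0;
}

Run to a fixed point, this is the kind of preprocessing a routing
daemon could do before pushing routes into the kernel.)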

Anyway, using data structures tailored to the current Internet routing
table, it's certainly possible to do destination-only routing using
half a dozen memory lookups or so (or a few indirect calls, I'm not
sure which option is cheaper).

> I will try playing more with this code and look at your patch and paper.

Well, I didn't write the paper, I found it after discovering the
problem in the kernel.  This complexity attack has been resolved, but
this won't help people like you who have to run Linux on an
uncooperative network.

The patch I posted won't help you as it increases the load
considerably unless most of your flows consist of one packet.  (And
there's no need for patching, you can go ahead and just change the
value via /proc.)

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-16 22:24 ` Simon Kirby
  2003-05-16 23:16   ` Florian Weimer
@ 2003-05-17  2:35   ` David S. Miller
  2003-05-17  7:31     ` Florian Weimer
  1 sibling, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-05-17  2:35 UTC (permalink / raw)
  To: Simon Kirby; +Cc: Florian Weimer, linux-kernel

On Fri, 2003-05-16 at 15:24, Simon Kirby wrote:
> I have been seeing this problem for over a year, and have had the same
> problems you have with DoS attacks saturating the CPUs on our routers.

Have a look at current kernels and see if they solve your problem.
They undoubtedly should, and I consider this issue resolved.

-- 
David S. Miller <davem@redhat.com>

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-17  2:35   ` David S. Miller
@ 2003-05-17  7:31     ` Florian Weimer
  2003-05-17 22:09       ` David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: Florian Weimer @ 2003-05-17  7:31 UTC (permalink / raw)
  To: David S. Miller; +Cc: Simon Kirby, linux-kernel

"David S. Miller" <davem@redhat.com> writes:

> On Fri, 2003-05-16 at 15:24, Simon Kirby wrote:
>> I have been seeing this problem for over a year, and have had the same
>> problems you have with DoS attacks saturating the CPUs on our routers.
>
> Have a look at current kernels and see if they solve your problem.
> They undoubtedly should, and I consider this issue resolved.

The hash collision problem appears to be resolved, but not the more
general performance issues.  Or are there any kernels without a
routing cache?

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-17  7:31     ` Florian Weimer
@ 2003-05-17 22:09       ` David S. Miller
  2003-05-18  9:21         ` Florian Weimer
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-05-17 22:09 UTC (permalink / raw)
  To: fw; +Cc: sim, linux-kernel

   From: Florian Weimer <fw@deneb.enyo.de>
   Date: Sat, 17 May 2003 09:31:04 +0200
   
   The hash collision problem appears to be resolved, but not the more
   general performance issues.  Or are there any kernels without a
   routing cache?

I think your criticism of the routing cache is not well
founded.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-17 22:09       ` David S. Miller
@ 2003-05-18  9:21         ` Florian Weimer
  2003-05-18  9:31           ` David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: Florian Weimer @ 2003-05-18  9:21 UTC (permalink / raw)
  To: David S. Miller; +Cc: sim, linux-kernel

"David S. Miller" <davem@redhat.com> writes:

>    From: Florian Weimer <fw@deneb.enyo.de>
>    Date: Sat, 17 May 2003 09:31:04 +0200
>    
>    The hash collision problem appears to be resolved, but not the more
>    general performance issues.  Or are there any kernels without a
>    routing cache?
>
> I think your criticism of the routing cache is not well
> founded.

Well, what would change your mind?

I don't really care, as I maintain just one Linux router which routes
substantial bandwidth and which could easily be protected by upstream
ACLs in the case of an emergency.  Others rely on Linux routers for
their ISP business, and they see similar problems as well, and these
people *do* care about this problem.  Some of them contacted me and
told me that they were ignored when they described it.  They certainly
want this problem to be fixed, as using FreeBSD is not always an
option. 8-(

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-18  9:21         ` Florian Weimer
@ 2003-05-18  9:31           ` David S. Miller
  2003-05-19 17:36             ` Jamal Hadi
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-05-18  9:31 UTC (permalink / raw)
  To: fw; +Cc: linux-kernel, kuznet, netdev, linux-net

   From: Florian Weimer <fw@deneb.enyo.de>
   Date: Sun, 18 May 2003 11:21:14 +0200

[ Please don't CC: sim@netnation.org any more, his address
  bounces at least for me (maybe his site rejects ECN, it is
  the most likely problem if it works for other people) ]

   "David S. Miller" <davem@redhat.com> writes:
   
   > I think your criticism of the routing cache is not well
   > founded.
   
   Well, what would change your mind?

I'll start to listen when you start to demonstrate that you understand
why the input routing cache is there and what problems it solves.

More people will also start to listen when you actually discuss this
matter on the proper list(s) (which isn't linux-kernel, since
linux-net and netdev@oss.sgi.com are the proper places).  Most of the
net hackers have zero time to follow the enormous amount of traffic
that exists on linux-kernel and pick out the networking bits.
Frankly, I /dev/null linux-kernel from time to time as well.

The fact is, our routing cache slow path is _FAST_.  And we garbage
collect routing cache entries, so the attacker's entries are deleted
quickly while the entries for legitimate flows stick around.  And
especially during an attack you want your legitimate traffic using the
routing cache.

I've never seen you mention this attribute of how the routing cache
works, nor have I seen you say anything which even suggests that you
are aware of this.  You could even make this apparent by proposing a
replacement for the input routing cache.  But remember, it has to
provide all of the functionality that is there today.

Nobody has demonstrated that there is a performance problem due to the
input routing cache once the hashing DoS is eliminated, which it is
in current kernels.  Take this as my challenge to you. :-)

   using FreeBSD is not always an option

Yeah, that dinosaur :-)

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-18  9:31           ` David S. Miller
@ 2003-05-19 17:36             ` Jamal Hadi
  2003-05-19 19:18               ` Ralph Doncaster
  0 siblings, 1 reply; 227+ messages in thread
From: Jamal Hadi @ 2003-05-19 17:36 UTC (permalink / raw)
  To: David S. Miller; +Cc: fw, linux-kernel, kuznet, netdev, linux-net


Florian,
I actually asked you to run some tests last time you showed up
on netdev but never heard back. Maybe we can get some results now
that the complaining is still continuing. Note, we cant just invent
things because "CISCO is doing it like that". That doesnt cut it.
What we need is data to substantiate things and then we move from there.

And oh, i am pretty sure we can beat any of the BSDs forwarding rate.
Anyone wants a duel, lets meet at the water fountain by the town
hall at sunrise.

cheers,
jamal


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-16 23:16   ` Florian Weimer
@ 2003-05-19 19:10     ` Simon Kirby
  0 siblings, 0 replies; 227+ messages in thread
From: Simon Kirby @ 2003-05-19 19:10 UTC (permalink / raw)
  To: Florian Weimer; +Cc: linux-kernel, linux-net

[ Apologies all -- I had my address incorrectly set to sim@netnation.org
  for some reason. ]

On Sat, May 17, 2003 at 01:16:08AM +0200, Florian Weimer wrote:

> > Under normal operation, it looks like most load we are seeing is in fact
> > normal route lookups.  We run BGP peering, and so there is a lot of
> > routes in the table.
> 
> You should aggregate the routes before you load them into the kernel.
> Hardly anybody seems to do this, but usually, you have much fewer
> interfaces than prefixes 8-), so this could result in a huge win.

Hmm... Looking around, I wasn't able to find an option in Zebra to do
this.  Do you know the command to do this?

> Anyway, using data structures tailored to the current Internet routing
> table, it's certainly possible to do destination-only routing using
> half a dozen memory lookups or so (or a few indirect calls, I'm not
> sure which option is cheaper).

Would this still route packets to destinations which would otherwise be
unreachable, then?  While this isn't a big issue, it would be nice to
stop unroutable traffic before it leaves our networks (mostly in the case
where a customer machine is generating bad traffic).

I did experiment with trying to increase the routing (normal, not cache)
hash table another level, but it didn't seem to have much effect.  I
believe I would have to change the algorithm somewhat to prefer falling
into larger hash buckets sooner than it does at the moment.  I seem
to recall that it would let the hash buckets get rather large before
expanding them.  I haven't had a chance to look at this very deeply, but
the profile I linked to before does show that fn_hash_lookup() does
indeed use more CPU than any other function, so it may be worth looking
at more.  (Aggregating routes would definitely improve the situation in
any case.)

> The patch I posted won't help you as it increases the load
> considerably unless most of your flows consist of one packet.  (And
> there's no need for patching, you can go ahead and just change the
> value via /proc.)

Yes.  I have fiddled with this before, and making the changes you
suggested actually doubled the load in normal operation.  I would assume
this is putting even more pressure on fn_hash_lookup().

Simon-

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-19 17:36             ` Jamal Hadi
@ 2003-05-19 19:18               ` Ralph Doncaster
  2003-05-19 22:37                 ` Jamal Hadi
  0 siblings, 1 reply; 227+ messages in thread
From: Ralph Doncaster @ 2003-05-19 19:18 UTC (permalink / raw)
  To: Jamal Hadi; +Cc: David S. Miller, fw, linux-kernel, kuznet, netdev, linux-net

When I looked at the route-cache code, efficient wasn't the word that came
to mind.  Whether the problem is in the route-cache or not, getting
>100kpps out of a linux router with <= 1Ghz of CPU is not at all an easy
task.  I've tried 2.2 and 2.4 (up to 2.4.20) with 3c905CX cards, with and
without NAPI, on a 750Mhz AMD.  I've never reached 100kpps without
userland (zebra) getting starved.  I've even tried the e1000 with 2.4.20,
and it still doesn't cut it (about 50% better performance than the 3Com).
This is always with a full routing table (~110K routes).

If I actually had the time to do the code, I'd try dumping the route-cache
altogether and keep the forwarding table as an r-tree (probably 2 levels
of 2048 entries since average prefix size is /22).  Frequently-used routes
would lookup faster due to CPU cache hits.  I'd have all the crap for
source-based routing ifdef'd out when firewalling is not compiled in.
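
(A rough sketch of that lookup structure -- two levels of 2048 entries
cover the first 22 bits of the destination, so a hit costs two
dependent memory reads.  Prefixes longer than /22 would need an
overflow structure, omitted here, and all names are illustrative:

#include <stdint.h>
#include <stddef.h>

#define L1_BITS 11
#define L2_BITS 11
#define L1_SIZE (1u << L1_BITS)
#define L2_SIZE (1u << L2_BITS)

struct nexthop;				/* whatever the forwarding path needs */

struct l2_table {
	struct nexthop *nh[L2_SIZE];
};

struct l1_table {
	struct l2_table *l2[L1_SIZE];	/* NULL means nothing below */
};

static struct nexthop *lookup(const struct l1_table *t, uint32_t daddr)
{
	const struct l2_table *l2 = t->l2[daddr >> (32 - L1_BITS)];

	if (!l2)
		return NULL;
	return l2->nh[(daddr >> (32 - L1_BITS - L2_BITS)) & (L2_SIZE - 1)];
}

Hot first-level entries tend to stay resident in the CPU cache, which
is where the "frequently-used routes would lookup faster" claim comes
from.)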

My next try will be with FreeBSD, using device polling and the e1000 cards
(since it seems there are no polling patches for the 3c905CX under
FreeBSD).  From the description of how polling under FreeBSD works
http://info.iet.unipi.it/~luigi/polling/
vs NAPI under linux, polling sounds better due to the ability to configure
the polling cycle and CPU load triggers.  From the testing and reading
I've done so far, NAPI doesn't seem to kick in until after 75-80% CPU
load.  With less than 25kpps coming into the box zebra seems to take
almost 10x longer to bring up a session with full routes than it does with
no packet load.  Since CPU load before zebra becomes active is 70-75%, it
would seem a lot of cycles are being wasted on context switching when zebra
gets busy.

If there is a way to get the routing performance I'm looking for in Linux,
I'd really like to know.  I've been searching and asking for over a year
now.  When I initially talked to Jamal about it, he told me NAPI was the
answer.  It does help, but from my experience it's not the answer.  I get
the impression nobody involved in the code has tested under real-world
conditions.  If that is, in fact, the problem, then I can provide an ebgp
multihop full feed and a synflood utility for stress testing.  If the
linux routing and ethernet driver code is improved so I can handle 50kpps
of inbound regular traffic, a 50kpps random-source DOS, and still have 50%
CPU left for Zebra then Cisco might have something to worry about...

Ralph Doncaster, president
6042147 Canada Inc. o/a IStop.com

On Mon, 19 May 2003, Jamal Hadi wrote:

>
> Florian,
> I actually asked you to run some tests last time you showed up
> on netdev but never heard back. Maybe we can get some results now
> that the complaining is still continuing. Note, we cant just invent
> things because "CISCO is doing it like that". That doesnt cut it.
> What we need is data to substantiate things and then we move from there.
>
> And oh, i am pretty sure we can beat any of the BSDs forwarding rate.
> Anyone wants a duel, lets meet at the water fountain by the town
> hall at sunrise.
>
> cheers,
> jamal
>
>
>

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-19 19:18               ` Ralph Doncaster
@ 2003-05-19 22:37                 ` Jamal Hadi
  2003-05-20  1:10                   ` Simon Kirby
  0 siblings, 1 reply; 227+ messages in thread
From: Jamal Hadi @ 2003-05-19 22:37 UTC (permalink / raw)
  To: ralph+d; +Cc: David S. Miller, fw, kuznet, netdev, linux-net




Took Linux kernel off the cc list.

On Mon, 19 May 2003, Ralph Doncaster wrote:

> When I looked at the route-cache code, efficient wasn't the word the came
> to mind.  Whether the problem is in the route-cache or not, getting
> 100kpps out of a linux router with <= 1Ghz of CPU is not at all an easy
> task.  I've tried 2.2 and 2.4 (up to 2.4.20) with 3c905CX cards, with and
> without NAPI, on a 750Mhz AMD.  I've never reached 100kpps without
> userland (zebra) getting starved.  I've even tried the e1000 with 2.4.20,
> and it still doesn't cut it (about 50% better performance than the 3Com).
> This is always with a full routing table (~110K routes).
>

I just tested a small userland app which does some pseudo routing in
userland. With NAPI i am able to do 148Kpps; without it, on the same
hardware, about 32Kpps.
I cant test beyond 148Kpps because thats the max pps a 100Mbps card can
do. The point i am making is i dont see the user space starvation.
Granted this is not the same thing you are testing.

> If I actually had the time to do the code, I'd try dumping the route-cache
> altogether and keep the forwarding table as an r-tree (probably 2 levels
> of 2048 entries since average prefix size is /22).  Frequently-used routes
> would lookup faster due to CPU cache hits.  I'd have all the crap for
> source-based routing ifdef'd out when firewalling is not compiled in.
>

I think theres definite benefit to the flow/dst cache as is. Modern routing
really should not be just about destination address lookup. Thats whats
practical today (as opposed to the 80s). I agree that we should be
flexible enough to not enforce that everybody use the complexity of
looking up via 5 tuples and maintaining flows at that level - if the
cache lookup is the bottleneck. Theres a recent patch that made it into
2.5.69 which resolves (or so it seems - havent tried it myself) the
cache bucket distribution problem. This was a major problem before.
The second-level issue is: on cache misses, how fast can you look up?
So far we are saying "fast enough". Someone needs to prove it is not.


> My next try will be with FreeBSD, using device polling and the e1000 cards
> (since it seems there are no polling patches for the 3c905CX under
> FreeBSD).  From the description of how polling under FreeBSD works
> http://info.iet.unipi.it/~luigi/polling/
> vs NAPI under linux, polling sounds better due to the ability to configure
> the polling cycle and CPU load triggers.  From the testing and reading
> I've done so far, NAPI doesn't seem to kick in until after 75-80% CPU
> load.  With less than 25kpps coming into the box zebra seems to take
> almost 10x longer to bring up a session with full routes than it does with
> no packet load.  Since CPU load before zebra becomes active is 70-75%, it
> would seem a lot of cycles is being wasted on context switching when zebra
> gets busy.
>

Not interested in BSD. When they can beat Linuxs numbers i'll be
interested.

> If there is a way to get the routing performance I'm looking for in Linux,
> I'd really like to know.  I've been searching an asking for over a year
> now.  When I initially talked to Jamal about it, he told me NAPI was the
> answer.  It does help, but from my experience it's not the answer.  I get
> the impression nobody involved in the code has has tested under real-world
> conditions.  If that is, in fact, the problem then I can provide an ebgp
> multihop full feed and a synflood utility for stress testing.  If the
> linux routing and ethernet driver code is improved so I can handle 50kpps
> of inbound regular traffic, a 50kpps random-source DOS, and still have 50%
> CPU left for Zebra then Cisco might have something to worry about...
>

I think we could do 50Kpps in a DOS environment.
We live in the same city. I may be able to spare half a weekend day and
meet up with you for some testing.

cheers,
jamal

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-19 22:37                 ` Jamal Hadi
@ 2003-05-20  1:10                   ` Simon Kirby
  2003-05-20  1:14                     ` David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: Simon Kirby @ 2003-05-20  1:10 UTC (permalink / raw)
  To: Jamal Hadi; +Cc: netdev, linux-net

On Mon, May 19, 2003 at 06:37:43PM -0400, Jamal Hadi wrote:

> cache bucket distribution. This was a major problem before.

Was the hash distribution broken before even in the truly random case?
If so, the patch would likely help.  If not, it shouldn't really affect
the DoS case or the normal traffic case, because I doubt we have been the
target of any hashing-specific attacks.

I've been looking through the patch and there seem to be quite a few
changes, including some that may optimize it.  I'm about to test it out
on a test machine and do some benchmarks.  I'll post my results in the
next day or so.

Thanks,

Simon-

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-20  1:10                   ` Simon Kirby
@ 2003-05-20  1:14                     ` David S. Miller
  2003-05-20  1:23                       ` Jamal Hadi
  2003-05-21  0:09                       ` Simon Kirby
  0 siblings, 2 replies; 227+ messages in thread
From: David S. Miller @ 2003-05-20  1:14 UTC (permalink / raw)
  To: sim; +Cc: hadi, netdev, linux-net

   From: Simon Kirby <sim@netnation.com>
   Date: Mon, 19 May 2003 18:10:53 -0700
   
   I doubt we have been the
   target of any hashing-specific attacks.

I bet you have been; the weakness in the hash has been very well
publicized, and the script kiddies aren't using the truly random
version of the attacks anymore.  Just google for juno-z.101f.c; this
(or some derivative) is the DoS attack program people are actually
using.

It was the only major place where the routing cache was weak
performance wise.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-20  1:14                     ` David S. Miller
@ 2003-05-20  1:23                       ` Jamal Hadi
  2003-05-20  1:24                         ` David S. Miller
  2003-05-26  7:18                         ` Florian Weimer
  2003-05-21  0:09                       ` Simon Kirby
  1 sibling, 2 replies; 227+ messages in thread
From: Jamal Hadi @ 2003-05-20  1:23 UTC (permalink / raw)
  To: David S. Miller; +Cc: sim, netdev, linux-net



On Mon, 19 May 2003, David S. Miller wrote:

> I bet you have been, the weakness in the hash has been very well
> publicized and the script kiddies aren't using the truly random
> version of the attacks anymore.  Just google for juno-z.101f.c, this
> (or some derivative) is the DoS people attack program are actually
> using.

Also used to attack CISCOs by them kiddies btw. We stand much better
than any CISCO doing caching.

cheers,
jamal

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-20  1:23                       ` Jamal Hadi
@ 2003-05-20  1:24                         ` David S. Miller
  2003-05-20  2:13                           ` Jamal Hadi
  2003-05-26  7:18                         ` Florian Weimer
  1 sibling, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-05-20  1:24 UTC (permalink / raw)
  To: hadi; +Cc: sim, netdev, linux-net

   From: Jamal Hadi <hadi@shell.cyberus.ca>
   Date: Mon, 19 May 2003 21:23:08 -0400 (EDT)
   
   Also used to attack CISCOs by them kiddies btw. We stand much better
   than any CISCO doing caching.

I have to assume that the source address selection operates
differently for attacking cisco equipment; our hashes being
identical would really be unbelievable :-)

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-20  1:24                         ` David S. Miller
@ 2003-05-20  2:13                           ` Jamal Hadi
  2003-05-20  5:01                             ` Pekka Savola
  2003-05-20  6:46                             ` David S. Miller
  0 siblings, 2 replies; 227+ messages in thread
From: Jamal Hadi @ 2003-05-20  2:13 UTC (permalink / raw)
  To: David S. Miller; +Cc: sim, netdev, linux-net




I dont think the hashes are similar - its the effect of hitting the
slow path. I was told by someone who tested this on a pricey CISCO
that they simply die unless capable of a feature called CEF.

cheers,
jamal

On Mon, 19 May 2003, David S. Miller wrote:

>    From: Jamal Hadi <hadi@shell.cyberus.ca>
>    Date: Mon, 19 May 2003 21:23:08 -0400 (EDT)
>
>    Also used to attack CISCOs by them kiddies btw. We stand much better
>    than any CISCO doing caching.
>
> I have to assume that the source address selection operates
> differently for attacking cisco equiptment, our hashes being
> identical would really be unbelievable :-)
>
>
>

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-20  2:13                           ` Jamal Hadi
@ 2003-05-20  5:01                             ` Pekka Savola
  2003-05-20 11:47                               ` Jamal Hadi
  2003-05-20  6:46                             ` David S. Miller
  1 sibling, 1 reply; 227+ messages in thread
From: Pekka Savola @ 2003-05-20  5:01 UTC (permalink / raw)
  To: Jamal Hadi; +Cc: David S. Miller, sim, netdev, linux-net

On Mon, 19 May 2003, Jamal Hadi wrote:
> I dont think the hashes are similar - its the effect into the
> slow path. I was told by someone who tested this on a priicey CISCO
> that they simply die unless capable of a feature called CEF.

Yes, but pretty much nobody is using Cisco without CEF, except in last-mile,
low-end devices.

> On Mon, 19 May 2003, David S. Miller wrote:
> 
> >    From: Jamal Hadi <hadi@shell.cyberus.ca>
> >    Date: Mon, 19 May 2003 21:23:08 -0400 (EDT)
> >
> >    Also used to attack CISCOs by them kiddies btw. We stand much better
> >    than any CISCO doing caching.
> >
> > I have to assume that the source address selection operates
> > differently for attacking cisco equiptment, our hashes being
> > identical would really be unbelievable :-)
> >
> >
> >

-- 
Pekka Savola                 "You each name yourselves king, yet the
Netcore Oy                    kingdom bleeds."
Systems. Networks. Security. -- George R.R. Martin: A Clash of Kings


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-20  2:13                           ` Jamal Hadi
  2003-05-20  5:01                             ` Pekka Savola
@ 2003-05-20  6:46                             ` David S. Miller
  2003-05-20 12:04                               ` Jamal Hadi
  1 sibling, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-05-20  6:46 UTC (permalink / raw)
  To: hadi; +Cc: sim, netdev, linux-net

   From: Jamal Hadi <hadi@shell.cyberus.ca>
   Date: Mon, 19 May 2003 22:13:33 -0400 (EDT)
   
   I dont think the hashes are similar - its the effect into the
   slow path. I was told by someone who tested this on a priicey CISCO
   that they simply die unless capable of a feature called CEF.

I found a description of this thing on Cisco's web site.  Amusingly it
seems to contradict itself: it says that the CEF FIB is fully
populated and has a 1-to-1 correspondence to the routing table, yet it
says that the first access to some destination is what creates
CEF entries.

Go figure! :-)

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-20  5:01                             ` Pekka Savola
@ 2003-05-20 11:47                               ` Jamal Hadi
  2003-05-20 11:55                                 ` Pekka Savola
  0 siblings, 1 reply; 227+ messages in thread
From: Jamal Hadi @ 2003-05-20 11:47 UTC (permalink / raw)
  To: Pekka Savola; +Cc: David S. Miller, sim, netdev, linux-net



On Tue, 20 May 2003, Pekka Savola wrote:

> On Mon, 19 May 2003, Jamal Hadi wrote:
> > I dont think the hashes are similar - its the effect into the
> > slow path. I was told by someone who tested this on a priicey CISCO
> > that they simply die unless capable of a feature called CEF.
>
> Yes, but pretty much nobody is using Cisco without CEF, except in the last
> mile, low-end devices.
>

So it's not a GSR-only feature. At the edges though, wouldnt it be
important to do more sexy things than just route based on a destination
address?

cheers,
jamal

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-20 11:47                               ` Jamal Hadi
@ 2003-05-20 11:55                                 ` Pekka Savola
  0 siblings, 0 replies; 227+ messages in thread
From: Pekka Savola @ 2003-05-20 11:55 UTC (permalink / raw)
  To: Jamal Hadi; +Cc: David S. Miller, sim, netdev, linux-net

On Tue, 20 May 2003, Jamal Hadi wrote:
> On Tue, 20 May 2003, Pekka Savola wrote:
> > On Mon, 19 May 2003, Jamal Hadi wrote:
> > > I dont think the hashes are similar - its the effect into the
> > > slow path. I was told by someone who tested this on a priicey CISCO
> > > that they simply die unless capable of a feature called CEF.
> >
> > Yes, but pretty much nobody is using Cisco without CEF, except in the last
> > mile, low-end devices.
> >
> 
> so not a GSR thing only feature. At the edges though, wouldnt it be
> important to do more sexy things than just route based on a destination
> address?

Indeed.  For example, policy-based routing (e.g. source address dependent
routing) has been claimed to be in the CEF path now (previously it was in
the slow path), but I certainly would "like" to be shown wrong. :-)

By low-end edge devices I basically meant all DSL, ISDN, cablemodem etc.
equipment.  I don't know much about "midrange" Cisco gear, but basically
everything service providers use (at least the 7xxx, 10xxx, and 12xxx series)
does support CEF (or more complicated variations of it).

-- 
Pekka Savola                 "You each name yourselves king, yet the
Netcore Oy                    kingdom bleeds."
Systems. Networks. Security. -- George R.R. Martin: A Clash of Kings


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-20  6:46                             ` David S. Miller
@ 2003-05-20 12:04                               ` Jamal Hadi
  2003-05-21  0:36                                 ` David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: Jamal Hadi @ 2003-05-20 12:04 UTC (permalink / raw)
  To: David S. Miller; +Cc: sim, netdev, linux-net



On Mon, 19 May 2003, David S. Miller wrote:

> I found a description of this thing on Cisco's web site.  Amusingly it
> seems to contradict itself, it says that the CEF FIB is fully
> populated and has a 1-to-1 correspondance to the routing table yet it
> says that the first access to some destination is what creates
> CEF entries.
>

It seems to be done at interrupt level, sort of like Linux fast switching
(not to be confused with CISCO fast switching); however, unlike Linux fast
switching which looks up based on the dst cache, they do lookups on a
FIB with next-hop entries already resolved (sort of like the hh cache we have).
Theres something akin to a user land process which makes sure the
neighbors are resolved all the time - most routing protocol stacks
already do this today with BGP. I dont think Zebra does.
What i am wondering is what if they have to do more than routing? Dont
they end up with the same (if not worse) effect?

Having said all the above, i think it would be worth seeing what the
effect of improving the slow path is (make it a multi-way trie).
Actually before that someone needs to prove the slow path is slow ;->
Note: It may make sense that we have options to totally remove
the cache lookups if necessary - no one has proved a need for it at this
point.

cheers,
jamal

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-20  1:14                     ` David S. Miller
  2003-05-20  1:23                       ` Jamal Hadi
@ 2003-05-21  0:09                       ` Simon Kirby
  2003-05-21  0:13                         ` David S. Miller
  1 sibling, 1 reply; 227+ messages in thread
From: Simon Kirby @ 2003-05-21  0:09 UTC (permalink / raw)
  To: David S. Miller; +Cc: hadi, netdev, linux-net

On Mon, May 19, 2003 at 06:14:05PM -0700, David S. Miller wrote:

> I bet you have been, the weakness in the hash has been very well
> publicized and the script kiddies aren't using the truly random
> version of the attacks anymore.  Just google for juno-z.101f.c, this
> (or some derivative) is the DoS people attack program are actually
> using.

Hmm, I see no difference.  I've been using juno-z.101f.c to spam a test
box (PIII 800 Mhz, 3C996B/BCM5701), and the box easily chokes when I hit
it with my pimpin' 466 MHz Celery (running at 542 MHz) with an eepro100.

NAPI is definitely impressive, though.  When hitting it with my "udpspam"
program which doesn't quite saturate the CPU (and doesn't spoof the
source), it does about 80k interrupts/sec and behaves normally.  When I
run "juno" on it, eth0 interrupts stop entirely and it works nicely in
polling mode.  Userspace still works on the console, but remote SSH is a
bit dodgy because it still drops about half of the rx packets due to CPU
saturation.  Juno reports sending 4,272,266 bytes/second, which is
34,178,128 bits/second.

It's rather difficult to follow, but I don't see any "h4r h4r, expl0it
th3 L1nux h4sh" comments or anything in the code that seems to attempt to
exploit the hash algorithms in (older) Linux.  It seems to be using very
crappy pseudo-random code to generate source IPs.  Perhaps another
variant actually attempts to exploit the hash.

Once I figure out a good way of getting near but not quite saturation, I
will attempt to compare -rc1 and -rc2 to see if the (crappy) random
source handling capacity has increased at all.

Simon-

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-21  0:09                       ` Simon Kirby
@ 2003-05-21  0:13                         ` David S. Miller
  2003-05-26  9:29                           ` Florian Weimer
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-05-21  0:13 UTC (permalink / raw)
  To: sim; +Cc: hadi, netdev, linux-net

   From: Simon Kirby <sim@netnation.com>
   Date: Tue, 20 May 2003 17:09:36 -0700
   
   It's rather difficult to follow, but I don't see any "h4r h4r, expl0it
   th3 L1nux h4sh" comments or anything in the code that seems to attempt to
   exploit the hash algorithms in (older) Linux.

Look at the vc[] table and how it uses this in rndip().

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-20 12:04                               ` Jamal Hadi
@ 2003-05-21  0:36                                 ` David S. Miller
  2003-05-21 13:03                                   ` Jamal Hadi
  2003-05-22  8:40                                   ` Simon Kirby
  0 siblings, 2 replies; 227+ messages in thread
From: David S. Miller @ 2003-05-21  0:36 UTC (permalink / raw)
  To: hadi; +Cc: sim, netdev, linux-net

   From: Jamal Hadi <hadi@shell.cyberus.ca>
   Date: Tue, 20 May 2003 08:04:00 -0400 (EDT)

   Note: It may make sense that we have options to totaly remove
   the cache lookups if necessary - noone has proved a need for it at
   this point.

There is a need, thinking otherwise is quite a narrow viewpoint :-)
Let me explain.

Forward looking, Alexey and myself plan to extend the per-cpu flow
cache we designed for IPSEC policy lookups to apply to routing
and socket lookup.  There are two reasons to make this:

1) Per-cpu'ness.
2) Input route lookup turns into a "flow" lookup and thus may
   give you a TCP socket, for example.  It is the most exciting
   part of this work.

It can even be applied to netfilter entries.  It really is the
grand unified theory of flow handling :-)  You can look at
net/core/flow.c; it is the initial prototype and it is working
and being used already for IPSEC policies.  There are only minor
adjustments necessary before we can begin trying to apply it to
other things, but Alexey and myself know how to make them.
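
(The shape of the idea rather than the net/core/flow.c interface: each
CPU owns a private hash table of resolved flow entries, so the fast
path never touches a lock shared with other processors.  All names
below are illustrative:

#include <stdint.h>
#include <stddef.h>

#define NR_CPUS		4
#define FLOW_HASH_SIZE	1024

struct flow_key {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	uint8_t  proto;
};

struct flow_entry {
	struct flow_key key;
	void *object;			/* dst entry, socket, policy, ... */
	struct flow_entry *next;
};

static struct flow_entry *flow_table[NR_CPUS][FLOW_HASH_SIZE];

static unsigned int flow_hash(const struct flow_key *k)
{
	/* any keyed mix would do; a plain fold keeps the sketch short */
	uint32_t h = k->saddr ^ k->daddr ^
		     ((uint32_t)k->sport << 16 | k->dport) ^ k->proto;

	h ^= h >> 16;
	return h & (FLOW_HASH_SIZE - 1);
}

static void *flow_lookup(int cpu, const struct flow_key *k)
{
	struct flow_entry *e;

	/* per-CPU chain: no lock shared with the other processors */
	for (e = flow_table[cpu][flow_hash(k)]; e; e = e->next)
		if (e->key.saddr == k->saddr && e->key.daddr == k->daddr &&
		    e->key.sport == k->sport && e->key.dport == k->dport &&
		    e->key.proto == k->proto)
			return e->object;
	return NULL;			/* miss: fall back to the slow path */
}

On a miss you fall back to the full lookup, which is exactly where the
fib_validate_source() argument below comes in.)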

So the real argument: Eliminating source-based keying of input
routes is a flawed idea.  Firstly, independent of POLICY based routing
(which is what it was originally made for), being able to block by
source address on input is a useful feature.  Secondly, if one must
call fib_validate_source() on each input packet, it destroys any
possibility of making per-cpu flow caching a reality.  This is
because fib_validate_source() must walk the inetdev list and thus
grab a shared SMP lock.

Note that any attempt to remove source based keying of routing cache
entries on input (or eliminating the cache entirely) has this problem.

It also becomes quite cumbersome to move all of this logic over to
ip_input() or similar.  And because it will always use a shared SMP
lock, it is guaranteed to be slower than the cache, especially for
well-behaved flows.  So keep in mind that not all traffic is DoS :-)

(As a side note, an interesting area of discourse would be to see
 if DoS traffic can be somehow patternized, either explicitly in
 the kernel or via descriptions from the user.  People do this today
 via netfilter, but I feel we might be able to do something more
 powerful at the flow caching level, ie. do not build cache entries
 for things looking like unary-packet DoS flows)
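
(One way such a heuristic could look -- a sketch of the idea floated
above, not an existing kernel mechanism: keep a small probation table
of counters and only build a real cache entry once the same flow key
has been seen a second time within a short window.  All names are
illustrative:

#include <stdint.h>

#define PROBE_SIZE	4096
#define PROBE_WINDOW	2	/* seconds */

struct probe_slot {
	uint32_t key;		/* folded flow key */
	uint32_t last_seen;	/* coarse timestamp */
	uint8_t  hits;
};

static struct probe_slot probe[PROBE_SIZE];

/* returns nonzero when it is worth building a (costly) cache entry */
static int flow_worth_caching(uint32_t key, uint32_t now)
{
	struct probe_slot *s = &probe[key & (PROBE_SIZE - 1)];

	if (s->key != key || now - s->last_seen > PROBE_WINDOW) {
		/* first sighting, or a stale/colliding slot: just count */
		s->key = key;
		s->hits = 1;
		s->last_seen = now;
		return 0;
	}
	s->last_seen = now;
	if (s->hits < 255)
		s->hits++;
	return s->hits >= 2;	/* second packet of the flow: cache it */
}

A legitimate flow pays one extra slow-path lookup for its first packet;
a random-source flood almost never shows the same key twice in the
window and so never earns a cache entry.)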

None of this means that slowpath should not be improved if necessary.
On the contrary, I would welcome good kernel profiling output from
someone such as sim@netnation during such stress tests.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-21  0:36                                 ` David S. Miller
@ 2003-05-21 13:03                                   ` Jamal Hadi
  2003-05-23  5:42                                     ` David S. Miller
  2003-05-22  8:40                                   ` Simon Kirby
  1 sibling, 1 reply; 227+ messages in thread
From: Jamal Hadi @ 2003-05-21 13:03 UTC (permalink / raw)
  To: David S. Miller; +Cc: sim, netdev, linux-net



On Tue, 20 May 2003, David S. Miller wrote:

>    From: Jamal Hadi <hadi@shell.cyberus.ca>
>    Date: Tue, 20 May 2003 08:04:00 -0400 (EDT)
>
>    Note: It may make sense that we have options to totaly remove
>    the cache lookups if necessary - noone has proved a need for it at
>    this point.
>
> There is a need, thinking otherwise is quite a narrow viewpoint :-)
> Let me explain.
>
> Forward looking, Alexey and myself plan to extend the per-cpu flow
> cache we designed for IPSEC policy lookups to apply to routing
> and socket lookup.  There are two reasons to make this:
>
> 1) Per-cpu'ness.

IPIs to synchronize?

> 2) Input route lookup turns into a "flow" lookup and thus may
>    give you a TCP socket, for example.  It is the most exciting
>    part of this work.
>

For packets that are being forwarded or even host bound, why start at
routing? This should be done much further below. Not sure how to deal
with packets originating from the host.
For example i moved the ingress qdisc to way down before IP is hit
(I can post the patch) and it works quite well at the moment only with
the u32 classifier (you could use the route classifier for ip packets).
I have a packet editor action so i can do some form of ARP/MAC address
aliasing ;->
This also gives you opportunity to drop early. A flow index could be
created there that could be used to index into the route table for
example. Maybe routing by fwmark would then make sense.

> It can even be applied to netfilter entries.  It really is the
> grand unified theory of flow handling :-)  You can look to
> net/core/flow.c, it is the initial prototype and it is working
> and being used already for IPSEC policies.  There are only minor
> adjustments necessary before we can begin trying to apply it to
> other things, but Alexey and myself know how to make them.
>

I did look at the code initially when it showed up. It does look sane.
In fact i raised the issue about the same time of whether pushing and popping
these structures was the best way to go. Another approach would
be to use a "hub and spoke" dispatch based scheme which i use in the
effort to get better traffic control actions. Also the structure itself
had the grandiose view that routing is the mother of them all,
i.e. you "fit everything around routing" not "fit routing around other
things". Note: routing aint the only sexy thing these days, so a unified
theory based on one sexy thing may be unfair to other sexy things;->

> So the real argument: Eliminating sourced based keying of input
> routes is a flawed idea.  Firstly, independant of POLICY based routing
> (which is what it was originally made for) being able to block by
> source address on input is a useful feature.  Secondly, if one must
> make "fib_validate_source()" on each input packet, it destroys all
> the posibility to make per-cpu flow caching a reality.  This is
> because fib_validate_source() must walk the inetdev list and thus
> grab a shared SMP lock.
>

I think the flowi must be captured way before IP is hit and reused
by IP and other sublayers. Policy-routing drops or attempts to
fib_validate_source() the packets should utilize that scheme (i.e. install
filters below IP) and tag (fwmark) or drop them on the floor before they
hit IP.

> Note that any attempt to remove source based keying of routing cache
> entries on input (or eliminating the cache entirely) has this problem.
>

nod.

> It also becomes quite cumbersome to move all of this logic over to
> ip_input() or similar.  And because it will always use a shared SMP
> lock it is guarenteed to be slower than the cache especially for
> well-behaved flows.  So keep in mind that not all traffic is DoS :-)
>

true. I think post 2.6 we should just rip apart the infrastructure
and rethink things ;-> (should i go into hiding now?;->)

> (As a side note, and interesting area of discourse would be to see
>  if DoS traffic can be somehow patternized, either explicitly in
>  the kernel or via descriptions from the user.  People do this today
>  via netfilter, but I feel we might be able to do something more
>  powerful at the flow caching level, ie. do not build cache entries
>  for things looking like unary-packet DoS flow)
>

Should be pretty easy to do with a filter framework at the lower
layers such as the one i did with ingress qdisc.

> None of this means that slowpath should not be improved if necessary.
> On the contrary, I would welcome good kernel profiling output from
> someone such as sim@netnation during such stress tests.
>

nod.

cheers,
jamal

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-21  0:36                                 ` David S. Miller
  2003-05-21 13:03                                   ` Jamal Hadi
@ 2003-05-22  8:40                                   ` Simon Kirby
  2003-05-22  8:58                                     ` David S. Miller
  1 sibling, 1 reply; 227+ messages in thread
From: Simon Kirby @ 2003-05-22  8:40 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, linux-net

On Tue, May 20, 2003 at 05:36:07PM -0700, David S. Miller wrote:

> None of this means that slowpath should not be improved if necessary.
> On the contrary, I would welcome good kernel profiling output from
> someone such as sim@netnation during such stress tests.

I decided to try some profiling while waiting for kernel compiles.
It seems that having a full BGP table is slowing things down a lot.

I put 2.4.21-rc2 (with the new hash) on the test box.  I modified juno
to include a busy delay loop (to try to avoid timer aliasing throwing
off the remote profile and to be short enough to generate sufficient
traffic), and tuned it to leave about 30% idle CPU on the testing box.
I fired up juno, ran "readprofile -r", and let it sit for a while.
readprofile results:

   384 do_gettimeofday                            2.6667
   199 ipt_route_hook                             3.1094
  1092 fib_lookup                                 3.4125
  1286 ip_packet_match                            3.8274
   248 fib_rule_put                               3.8750
  3209 rt_intern_hash                             4.1784
   852 dst_destroy                                4.8409
  1923 fn_hash_lookup                             6.6771
  1325 kmem_cache_free                            8.2812
  1387 dst_alloc                                  9.6319
  3857 tg3_interrupt                             11.4792
  3848 do_softirq                                16.0333
  7354 ip_route_input                            17.0231
  8814 tg3_poll                                  28.9934
 17370 handle_IRQ_event                         108.5625
 26413 default_idle                             412.7031

I then faked a whole slew of routing table entries to look like normal
BGP routes.  "ip -o route | wc -l" shows 181012 entries, which is similar
to the actual routers.  readprofile results:

   289 do_gettimeofday                            2.0069
   669 fib_lookup                                 2.0906
   158 fib_rule_put                               2.4688
   367 tg3_recycle_rx                             2.5486
   889 ip_packet_match                            2.6458
  2375 rt_intern_hash                             3.0924
   636 dst_destroy                                3.6136
   868 dst_alloc                                  6.0278
  2029 tg3_interrupt                              6.0387
  1037 kmem_cache_free                            6.4813
  5364 ip_route_input                            12.4167
   993 default_idle                              15.5156
  7593 tg3_poll                                  24.9770
  9631 handle_IRQ_event                          60.1938
 26552 fn_hash_lookup                            92.1944

Hmm!  I guess the routing table size makes a slight difference to
performance there.

Full readprofile output available here:

	http://blue.netnation.com/sim/ref/

I'm not sure if this is a "good" profile or not... I can try with
oprofile or something instead if that gives more useful results.

I think I wrote a loadable module to dump the hash distribution a while
back, but I can't remember where I put it.  I'll try writing something
like that again and see if there's anything interesting.

Simon-

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-22  8:40                                   ` Simon Kirby
@ 2003-05-22  8:58                                     ` David S. Miller
  2003-05-22 10:40                                       ` David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-05-22  8:58 UTC (permalink / raw)
  To: sim; +Cc: netdev, linux-net, kuznet

   From: Simon Kirby <sim@netnation.com>
   Date: Thu, 22 May 2003 01:40:03 -0700

     9631 handle_IRQ_event                          60.1938

Are you using APIC irqs?

    26552 fn_hash_lookup                            92.1944
   
   Hmm!  I guess the routing table size has a slight difference on
   performance there.

I assume you have CONFIG_IP_ROUTE_LARGE_TABLES enabled.

If not, try with that turned on.  If you had it on, or enabling it
makes little difference, it is time to play with FZ_MAX_DIVISOR
and fn_hash().

All of your BGP routes have the same prefix length, right?  Yes, with
the ~181000 routes you have, the fib zone hash in its current state
will fall to pieces, I am afraid.  (Even with a perfect hash, ~181,000
routes over 1024 buckets means chain lengths on the order of ~180
entries :-((( )

   I'm not sure if this is a "good" profile or not... I can try with
   oprofile or something instead if that gives more useful results.
   
It is good, thanks.

Alexey, I will try to make something...

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-22  8:58                                     ` David S. Miller
@ 2003-05-22 10:40                                       ` David S. Miller
  2003-05-22 11:15                                         ` Martin Josefsson
  2003-05-22 11:44                                         ` Simon Kirby
  0 siblings, 2 replies; 227+ messages in thread
From: David S. Miller @ 2003-05-22 10:40 UTC (permalink / raw)
  To: sim; +Cc: netdev, linux-net, kuznet

   From: "David S. Miller" <davem@redhat.com>
   Date: Thu, 22 May 2003 01:58:15 -0700 (PDT)
   
   Alexey, I will try to make something...
   
Simon (and others who want to benchmark :-), give this patch below a
try.

It applies cleanly to both 2.4.x and 2.5.x kernels.

Alexey, note the funny inaccurate comment found here; it totally
invalidates the "fast computer" comment found a few lines below it.

Actually, much of this code wants some major cleanups.  It is even
quite costly to do these "u32 struct" things, especially on RISC.
Alexey no longer does major surgery in this area, so they may be
undone. :)

The next experiment could be to reimplement fn_hash() as:

#include <linux/jhash.h>

static fn_hash_idx_t fn_hash(fn_key_t key, struct fn_zone *fz)
{
        u32 h = ntohl(key.datum)>>(32 - fz->fz_order);
        h = jhash_1word(h, 0);
        h &= FZ_HASHMASK(fz);
        return *(fn_hash_idx_t*)&h;
}

or something like that.  This assumes we find some problems with hash
distribution when using a huge number of routes.  Someone will need to
add fib_hash lookup statistics in order to determine this.
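
As a rough illustration of the statistic that would answer this, here is
a minimal userspace sketch (the node type and field names are made up
for the example, not the kernel's fib_hash structures) that walks a
chained hash table and reports the average and worst-case chain length:

#include <stdio.h>

/* Made-up chained hash node, standing in for struct fib_node. */
struct node {
	struct node *next;
};

/* Count entries per bucket and report the average and the longest chain. */
static void chain_stats(struct node **table, unsigned int divisor)
{
	unsigned long total = 0;
	unsigned int i, longest = 0;

	for (i = 0; i < divisor; i++) {
		unsigned int len = 0;
		struct node *n;

		for (n = table[i]; n; n = n->next)
			len++;
		total += len;
		if (len > longest)
			longest = len;
	}
	printf("%lu entries / %u buckets, avg %.2f, longest chain %u\n",
	       total, divisor, (double)total / divisor, longest);
}

int main(void)
{
	struct node a = { 0 }, b = { &a }, c = { 0 };
	struct node *table[4] = { &b, 0, &c, 0 };

	chain_stats(table, 4);	/* prints: 3 entries / 4 buckets, avg 0.75, longest chain 2 */
	return 0;
}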

Anyways, testers please let us know the results.  Note you must
have CONFIG_IP_ROUTE_LARGE_TABLES (and thus CONFIG_IP_ADVANCED_ROUTER)
in order to even make use of this stuff.

Thanks.

--- net/ipv4/fib_hash.c.~1~	Thu May 22 02:47:17 2003
+++ net/ipv4/fib_hash.c	Thu May 22 03:27:12 2003
@@ -89,7 +89,7 @@
 	int		fz_nent;	/* Number of entries	*/
 
 	int		fz_divisor;	/* Hash divisor		*/
-	u32		fz_hashmask;	/* (1<<fz_divisor) - 1	*/
+	u32		fz_hashmask;	/* (fz_divisor - 1)	*/
 #define FZ_HASHMASK(fz)	((fz)->fz_hashmask)
 
 	int		fz_order;	/* Zone order		*/
@@ -149,7 +149,30 @@
 
 static rwlock_t fib_hash_lock = RW_LOCK_UNLOCKED;
 
-#define FZ_MAX_DIVISOR 1024
+#define FZ_MAX_DIVISOR ((PAGE_SIZE<<MAX_ORDER) / sizeof(struct fib_node *))
+
+static unsigned long size_to_order(unsigned long size)
+{
+	unsigned long order;
+
+	for (order = 0; order < MAX_ORDER; order++) {
+		if ((PAGE_SIZE << order) >= size)
+			break;
+	}
+	return order;
+}
+
+static struct fib_node **fz_hash_alloc(int divisor)
+{
+	unsigned long size = divisor * sizeof(struct fib_node *);
+
+	if (divisor <= 1024) {
+		return kmalloc(size, GFP_KERNEL);
+	} else {
+		return (struct fib_node **)
+			__get_free_pages(GFP_KERNEL, size_to_order(size));
+	}
+}
 
 #ifdef CONFIG_IP_ROUTE_LARGE_TABLES
 
@@ -174,6 +197,15 @@
 	}
 }
 
+static void fz_hash_free(struct fib_node **hash, int divisor)
+{
+	if (divisor <= 1024)
+		kfree(hash);
+	else
+		free_pages((unsigned long) hash,
+			   size_to_order(divisor * sizeof(struct fib_node *)));
+}
+
 static void fn_rehash_zone(struct fn_zone *fz)
 {
 	struct fib_node **ht, **old_ht;
@@ -185,24 +217,30 @@
 	switch (old_divisor) {
 	case 16:
 		new_divisor = 256;
-		new_hashmask = 0xFF;
 		break;
 	case 256:
 		new_divisor = 1024;
-		new_hashmask = 0x3FF;
 		break;
 	default:
-		printk(KERN_CRIT "route.c: bad divisor %d!\n", old_divisor);
-		return;
+		if ((old_divisor << 1) > FZ_MAX_DIVISOR) {
+			printk(KERN_CRIT "route.c: bad divisor %d!\n", old_divisor);
+			return;
+		}
+		new_divisor = (old_divisor << 1);
+		break;
 	}
+
+	new_hashmask = (new_divisor - 1);
+
 #if RT_CACHE_DEBUG >= 2
 	printk("fn_rehash_zone: hash for zone %d grows from %d\n", fz->fz_order, old_divisor);
 #endif
 
-	ht = kmalloc(new_divisor*sizeof(struct fib_node*), GFP_KERNEL);
+	ht = fz_hash_alloc(new_divisor);
 
 	if (ht)	{
 		memset(ht, 0, new_divisor*sizeof(struct fib_node*));
+
 		write_lock_bh(&fib_hash_lock);
 		old_ht = fz->fz_hash;
 		fz->fz_hash = ht;
@@ -210,7 +248,8 @@
 		fz->fz_divisor = new_divisor;
 		fn_rebuild_zone(fz, old_ht, old_divisor);
 		write_unlock_bh(&fib_hash_lock);
-		kfree(old_ht);
+
+		fz_hash_free(old_ht, old_divisor);
 	}
 }
 #endif /* CONFIG_IP_ROUTE_LARGE_TABLES */
@@ -233,12 +272,11 @@
 	memset(fz, 0, sizeof(struct fn_zone));
 	if (z) {
 		fz->fz_divisor = 16;
-		fz->fz_hashmask = 0xF;
 	} else {
 		fz->fz_divisor = 1;
-		fz->fz_hashmask = 0;
 	}
-	fz->fz_hash = kmalloc(fz->fz_divisor*sizeof(struct fib_node*), GFP_KERNEL);
+	fz->fz_hashmask = (fz->fz_divisor - 1);
+	fz->fz_hash = fz_hash_alloc(fz->fz_divisor);
 	if (!fz->fz_hash) {
 		kfree(fz);
 		return NULL;
@@ -468,7 +506,7 @@
 		return err;
 
 #ifdef CONFIG_IP_ROUTE_LARGE_TABLES
-	if (fz->fz_nent > (fz->fz_divisor<<2) &&
+	if (fz->fz_nent > (fz->fz_divisor<<1) &&
 	    fz->fz_divisor < FZ_MAX_DIVISOR &&
 	    (z==32 || (1<<z) > fz->fz_divisor))
 		fn_rehash_zone(fz);

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-22 10:40                                       ` David S. Miller
@ 2003-05-22 11:15                                         ` Martin Josefsson
  2003-05-23  1:00                                           ` David S. Miller
                                                             ` (2 more replies)
  2003-05-22 11:44                                         ` Simon Kirby
  1 sibling, 3 replies; 227+ messages in thread
From: Martin Josefsson @ 2003-05-22 11:15 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, linux-net

On Thu, 2003-05-22 at 12:40, David S. Miller wrote:

> +static unsigned long size_to_order(unsigned long size)
> +{
> +	unsigned long order;
> +
> +	for (order = 0; order < MAX_ORDER; order++) {
> +		if ((PAGE_SIZE << order) >= size)
> +			break;
> +	}
> +	return order;
> +}

Any reason you're not using get_order() ?

-- 
/Martin

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-22 10:40                                       ` David S. Miller
  2003-05-22 11:15                                         ` Martin Josefsson
@ 2003-05-22 11:44                                         ` Simon Kirby
  2003-05-22 13:03                                           ` Martin Josefsson
                                                             ` (2 more replies)
  1 sibling, 3 replies; 227+ messages in thread
From: Simon Kirby @ 2003-05-22 11:44 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, linux-net, kuznet

On Thu, May 22, 2003 at 03:40:58AM -0700, David S. Miller wrote:

> Simon (and others who want to benchmark :-), give this patch below a
> try.

Compiling it now... :)

There's no APIC on this test box, so it's using the XT-PIC.  Production
has an APIC, so I assume the overhead would be less there.

CONFIG_IP_ROUTE_LARGE_TABLES is and was enabled.

My dumb route populator script was just creating /32 routes.

> Anyways, testers please let us know the results.  Note you must
> have CONFIG_IP_ROUTE_LARGE_TABLES (and thus CONFIG_IP_ADVANCED_ROUTER)
> in order to even make use of this stuff.

Nice!  I tested with 300,000 routing table entries and there is no
discernable difference in performance from having an empty table. 
vmstat shows the same idle time as when the routing table is empty.

I enabled the hash growing debug printk, and so I saw this while
populating the route table:

fn_rehash_zone: hash for zone 32 grows from 16
fn_rehash_zone: hash for zone 32 grows from 256
fn_rehash_zone: hash for zone 32 grows from 1024
fn_rehash_zone: hash for zone 32 grows from 2048
fn_rehash_zone: hash for zone 32 grows from 4096
fn_rehash_zone: hash for zone 32 grows from 8192
fn_rehash_zone: hash for zone 32 grows from 16384
fn_rehash_zone: hash for zone 32 grows from 32768
fn_rehash_zone: hash for zone 32 grows from 65536
fn_rehash_zone: hash for zone 32 grows from 131072

I had originally written a patch to try to extend it to 8192, but I think
it was broken.  This definitely seems to fix it.

If you'd like I can try to regenerate a profile, but you probably already
know what it will look like.

Thanks!

Simon- (Zzz...)

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-22 11:44                                         ` Simon Kirby
@ 2003-05-22 13:03                                           ` Martin Josefsson
  2003-05-23  0:55                                             ` David S. Miller
  2003-05-22 22:33                                           ` David S. Miller
  2003-05-23  0:59                                           ` David S. Miller
  2 siblings, 1 reply; 227+ messages in thread
From: Martin Josefsson @ 2003-05-22 13:03 UTC (permalink / raw)
  To: Simon Kirby; +Cc: David S. Miller, netdev, linux-net, kuznet

On Thu, 2003-05-22 at 13:44, Simon Kirby wrote:

> Nice!  I tested with 300,000 routing table entries and there is no
> discernable difference in performance from having an empty table. 
> vmstat shows the same idle time as when the routing table is empty.

How much memory does a table that large use?

-- 
/Martin

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-22 11:44                                         ` Simon Kirby
  2003-05-22 13:03                                           ` Martin Josefsson
@ 2003-05-22 22:33                                           ` David S. Miller
  2003-05-29 20:51                                             ` Simon Kirby
  2003-05-23  0:59                                           ` David S. Miller
  2 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-05-22 22:33 UTC (permalink / raw)
  To: sim; +Cc: netdev, linux-net, kuznet

   From: Simon Kirby <sim@netnation.com>
   Date: Thu, 22 May 2003 04:44:38 -0700

   If you'd like I can try to regenerate a profile, but you probably
   already know what it will look like.

I obviously know some things that will change, but I am still
very much interested in new profiles.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-22 13:03                                           ` Martin Josefsson
@ 2003-05-23  0:55                                             ` David S. Miller
  0 siblings, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-05-23  0:55 UTC (permalink / raw)
  To: gandalf; +Cc: sim, netdev, linux-net, kuznet

   From: Martin Josefsson <gandalf@wlug.westbo.se>
   Date: 22 May 2003 15:03:07 +0200

   On Thu, 2003-05-22 at 13:44, Simon Kirby wrote:
   
   > Nice!  I tested with 300,000 routing table entries and there is no
   > discernable difference in performance from having an empty table. 
   > vmstat shows the same idle time as when the routing table is empty.
   
   How much memory does a table that large use?
   
300,000 * sizeof(struct fib_node)

where the second factor is:

	(2 * sizeof_pointer_on_this_architecture) + /* 8 or 16 bytes */
	sizeof(u32) +                               /* 4 bytes */
	(4 * sizeof(u8))                            /* 4 bytes */

So that's 16 bytes on 32-bit systems, and 24 bytes on 64-bit systems.

Therefore 300,000 routes take up 4.8MB on 32-bit systems and 7.2MB
on 64-bit ones.

I cannot fathom a way to make these any smaller :-)

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-22 11:44                                         ` Simon Kirby
  2003-05-22 13:03                                           ` Martin Josefsson
  2003-05-22 22:33                                           ` David S. Miller
@ 2003-05-23  0:59                                           ` David S. Miller
  2 siblings, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-05-23  0:59 UTC (permalink / raw)
  To: sim; +Cc: netdev, linux-net, kuznet

   From: Simon Kirby <sim@netnation.com>
   Date: Thu, 22 May 2003 04:44:38 -0700
   
   There's no APIC on this test box, so it's using the XT-PIC.
   Production has an APIC, so I assume the overhead would be less
   there.

It is absolutely critical for routing throughput.  XT-PIC overhead
severely limits your maximum packets/second switching rate.

Another thing to watch out for is drivers still using PIO to
access device registers (3c59x comes to mind).

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-22 11:15                                         ` Martin Josefsson
@ 2003-05-23  1:00                                           ` David S. Miller
  2003-05-23  1:01                                           ` David S. Miller
  2003-05-24  0:41                                           ` Andrew Morton
  2 siblings, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-05-23  1:00 UTC (permalink / raw)
  To: gandalf; +Cc: netdev, linux-net

   From: Martin Josefsson <gandalf@wlug.westbo.se>
   Date: 22 May 2003 13:15:39 +0200

   On Thu, 2003-05-22 at 12:40, David S. Miller wrote:
   
   > +static unsigned long size_to_order(unsigned long size)
   
   Any reason you're not using get_order() ?

Because I'm stupid, I'll fix that thanks :-)

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-22 11:15                                         ` Martin Josefsson
  2003-05-23  1:00                                           ` David S. Miller
@ 2003-05-23  1:01                                           ` David S. Miller
  2003-05-23  8:21                                             ` Andi Kleen
  2003-05-24  0:41                                           ` Andrew Morton
  2 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-05-23  1:01 UTC (permalink / raw)
  To: gandalf; +Cc: netdev, linux-net

   From: Martin Josefsson <gandalf@wlug.westbo.se>
   Date: 22 May 2003 13:15:39 +0200

   On Thu, 2003-05-22 at 12:40, David S. Miller wrote:
   
   > +static unsigned long size_to_order(unsigned long size)
   
   Any reason you're not using get_order() ?

Actually, get_order() apparently only works on powers of
two, which 'size' is definitely not.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-21 13:03                                   ` Jamal Hadi
@ 2003-05-23  5:42                                     ` David S. Miller
  0 siblings, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-05-23  5:42 UTC (permalink / raw)
  To: hadi; +Cc: sim, netdev, linux-net

   From: Jamal Hadi <hadi@shell.cyberus.ca>
   Date: Wed, 21 May 2003 09:03:19 -0400 (EDT)

   On Tue, 20 May 2003, David S. Miller wrote:
   
   > Forward looking, Alexey and myself plan to extend the per-cpu flow
   > cache we designed for IPSEC policy lookups to apply to routing
   > and socket lookup.  There are two reasons to make this:
   >
   > 1) Per-cpu'ness.
   
   IPIs to synchronize?
   
It is a good question.  IPIs are one way to coordinate a flush or
state synchronization.

But this method is perhaps overblown for things like netfilter and
IPSEC policy configuration changes.

One way we can deal with those is via a generation count.  Any time
you insert/delete a netfilter or IPSEC policy rule, it potentially
affects each and every flow cache entry.  So bumping the generation
count and checking this at flow lookup time is how we solve that
problem.

It is the same thing to do to handle routing table changes as well.
Anyways, this is what net/core/flow.c supports now.
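
To make the generation count idea concrete, here is a minimal userspace
sketch (the types and names are illustrative only, not the actual
net/core/flow.c code): a config change just bumps a global counter, and
lookups treat entries stamped with an older generation as misses.

#include <stdatomic.h>
#include <stddef.h>

static atomic_uint flow_cache_genid;

/* Illustrative flow cache entry: a key, the cached object, and the
 * generation it was filled under. */
struct flow_entry {
	unsigned int key;
	void *object;
	unsigned int genid;
};

/* Inserting/deleting a rule only bumps the counter; no per-entry work. */
static void flow_cache_invalidate(void)
{
	atomic_fetch_add(&flow_cache_genid, 1);
}

/* Entries from an older generation are ignored, so stale state is
 * discarded lazily the next time it is looked up. */
static void *flow_lookup(struct flow_entry *e, unsigned int key)
{
	if (e->key == key && e->genid == atomic_load(&flow_cache_genid))
		return e->object;
	return NULL;
}

int main(void)
{
	struct flow_entry e = { 42, &e, atomic_load(&flow_cache_genid) };
	void *hit = flow_lookup(&e, 42);	/* found */

	flow_cache_invalidate();		/* e.g. a rule was changed */
	void *miss = flow_lookup(&e, 42);	/* stale, treated as a miss */

	return (hit && !miss) ? 0 : 1;
}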

Where this model does not fit is for sockets.  They tend to change the
state of exactly one flow.  We will need mechanisms by which to handle
this.

But, there is a flaw with the generation count scheme... One thing
Alexey has reminded me is that you can't defer cache flushing to
lookup time, because if traffic stops then the whole engine deadlocks
since nothing will release the references inside of the flow cache.

This brings me to another topic which is attempting to even avoid
the reference counting.  This is a very difficult problem, but the
benefits are large: it means that all the data can be shared by
cpus read-only, because no writes occur to grab the reference to the
object the flow cache entry points to (socket, route, netfilter rule,
IPSEC policy, etc.)

   > 2) Input route lookup turns into a "flow" lookup and thus may
   >    give you a TCP socket, for example.  It is the most exciting
   >    part of this work.
   
   For packets that are being forwarded or even host bound, why start at
   routing?

It is just how I describe where this occurs.  It has nothing to
do with routing.  Route lookups just so happen to be the first
thing we do when we receive an IPv4 packet :-)

   This should be done much further below.

I don't understand; what I have described is as far into the basement
as one can possibly go :-)  If you go any deeper, you do not even know
how to parse the packet.

   This also gives you opportunity to drop early. A flow index could be
   created there that could be used to index into the route table for
   example. Maybe routing by fwmark would then make sense.
   
Flow is made up of protocol-specific details.  Please look at
include/net/flow.h:struct flowi; it is how we describe the identity of
what I am calling a flow.

   Also the structure itself had the grandiose view that routing is
   the mother of them all i.e you "fit everything around routing" not
   "fit routing around other things".

Routing describes the virtual path a packet takes within the
stack.  It tells us what to do with the packet; therefore it in fact
is the "mother of them all".  It is all that the networking stack does. :-)
   
Show me some example where you are describing how the stack will
handle a packet and that this is not some form of routing :-)

   I think the flowi must be captured way before IP is hit and reused
   by IP and other sublayers. policy routing dropping or attempts to
   fib_validate_source() the packets should  utilize that scheme (i.e install
   filters below ip) and tag(fwmark) or drop them on the floor before they
   hit IP.

If you do not yet know the packet is IP, you have no way to even parse
it into a flowi.

Our wires are crossed...

Look, forget that I said that we will make flow determination where we
make input route lookups right now.  Replace this with "the first
thing we will do with an IP packet is build a flowi (by parsing it)
and then look up that flow matching this key".
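
For what it is worth, a toy sketch of "parse the packet into a flow key"
might look like the following (the struct and field names are mine for
illustration, not the real struct flowi, and only IPv4 with TCP/UDP
ports is handled):

#include <stdint.h>
#include <string.h>

/* Toy flow key: roughly the fields a flowi would carry for IPv4. */
struct flow_key {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	uint8_t  proto, tos;
};

/* Parse an IPv4 header (and TCP/UDP ports, if present) into the key.
 * Returns 0 on success, -1 if the buffer is too short or not IPv4. */
static int build_flow_key(const uint8_t *pkt, size_t len, struct flow_key *fl)
{
	size_t ihl;

	if (len < 20 || (pkt[0] >> 4) != 4)
		return -1;
	ihl = (pkt[0] & 0x0F) * 4;
	if (ihl < 20 || len < ihl)
		return -1;

	memset(fl, 0, sizeof(*fl));
	fl->tos   = pkt[1];
	fl->proto = pkt[9];
	memcpy(&fl->saddr, pkt + 12, 4);
	memcpy(&fl->daddr, pkt + 16, 4);

	/* Ports only exist for TCP/UDP and only if that header is there. */
	if ((fl->proto == 6 || fl->proto == 17) && len >= ihl + 4) {
		memcpy(&fl->sport, pkt + ihl, 2);
		memcpy(&fl->dport, pkt + ihl + 2, 2);
	}
	return 0;
}

int main(void)
{
	/* Minimal 20-byte IPv4 header: version/IHL = 0x45, proto = 6 (TCP). */
	uint8_t pkt[24] = { 0x45 };
	struct flow_key fl;

	pkt[9] = 6;
	return build_flow_key(pkt, sizeof(pkt), &fl);
}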

   I think post 2.6 we should just rip apart the infrastructure
   and rethink things ;-> (should i go into hiding now?;->)
   
I think we suggest very similar things.  Look, for policy dropped
flows they will not make it much further than the first few lines of
ip_input.c:ip_rcv().  It must be called by netif_receive_skb() anyways,
and all calling it says is "this is an ipv4 packet", and we must know this
to be able to parse it.

   Should be pretty easy to do with a filter framework at the lower
   layers such as the one i did with ingress qdisc.
   
Ok, publish this code so we can talk in a more precise language.
:-)

If it is some "if (proto == ETH_P_IP) { ... parse ipv4 header" I will
be very disappointed.

   > None of this means that slowpath should not be improved if necessary.
   > On the contrary, I would welcome good kernel profiling output from
   > someone such as sim@netnation during such stress tests.
   
   nod.
   
I note that we have apparently killed the worst of these daemons over
the past 24 hours :-)

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-23  1:01                                           ` David S. Miller
@ 2003-05-23  8:21                                             ` Andi Kleen
  2003-05-23  8:22                                               ` David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: Andi Kleen @ 2003-05-23  8:21 UTC (permalink / raw)
  To: David S. Miller; +Cc: gandalf, netdev, linux-net

On Thu, 22 May 2003 18:01:52 -0700 (PDT)
"David S. Miller" <davem@redhat.com> wrote:

>    From: Martin Josefsson <gandalf@wlug.westbo.se>
>    Date: 22 May 2003 13:15:39 +0200
> 
>    On Thu, 2003-05-22 at 12:40, David S. Miller wrote:
>    
>    > +static unsigned long size_to_order(unsigned long size)
>    
>    Any reason you're not using get_order() ?
> 
> Actually, get_order() apparently only works on powers of
> two, which 'size' is definitely not.

Are you sure? I always used it on all kinds of sizes.

The algorithm looks to me like it works on any size. A quick test
confirms that too.

(i386 version)
static __inline__ int get_order(unsigned long size)
{
        int order;

        size = (size-1) >> (PAGE_SHIFT-1);
        order = -1;
        do {
                size >>= 1;
                order++;
        } while (size);
        return order;
}
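
For example, here is a worked trace (my own, assuming PAGE_SHIFT = 12,
i.e. 4 KB pages) for a non-power-of-two size of 5000 bytes:

	size  = (5000 - 1) >> 11 = 2
	order = -1
	2 >> 1 = 1  ->  order = 0
	1 >> 1 = 0  ->  order = 1

so get_order(5000) returns 1, i.e. a two-page (8 KB) allocation, which
does cover 5000 bytes.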


-Andi
> 

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-23  8:21                                             ` Andi Kleen
@ 2003-05-23  8:22                                               ` David S. Miller
  2003-05-23  9:03                                                 ` Andi Kleen
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-05-23  8:22 UTC (permalink / raw)
  To: ak; +Cc: gandalf, netdev, linux-net

   From: Andi Kleen <ak@suse.de>
   Date: Fri, 23 May 2003 10:21:13 +0200

   On Thu, 22 May 2003 18:01:52 -0700 (PDT)
   "David S. Miller" <davem@redhat.com> wrote:
   
   > Actually, get_order() apparently only works on powers of
   > two, which 'size' is definitely not.
   
   Are you sure? I always used it on all kinds of sizes.
   
   The algorithm looks for me like it works on any size. A quick test
   confirms that too.

I believe you.

Then what does that comment above it mean? :-)

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-23  8:22                                               ` David S. Miller
@ 2003-05-23  9:03                                                 ` Andi Kleen
  2003-05-23  9:59                                                   ` David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: Andi Kleen @ 2003-05-23  9:03 UTC (permalink / raw)
  To: David S. Miller; +Cc: gandalf, netdev, linux-net

On Fri, 23 May 2003 01:22:05 -0700 (PDT)
"David S. Miller" <davem@redhat.com> wrote:


> I believe you.
> 
> Then what does that comment above it mean? :-)

I guess it refers to the implementation, not the code.

For pure 2^n you could implement it much more efficiently using ffz()
(not that it really matters of course, most orders are 0)

-Andi


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-23  9:03                                                 ` Andi Kleen
@ 2003-05-23  9:59                                                   ` David S. Miller
  0 siblings, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-05-23  9:59 UTC (permalink / raw)
  To: ak; +Cc: gandalf, netdev, linux-net

   From: Andi Kleen <ak@suse.de>
   Date: Fri, 23 May 2003 11:03:01 +0200

   On Fri, 23 May 2003 01:22:05 -0700 (PDT)
   "David S. Miller" <davem@redhat.com> wrote:
   
   > Then what does that comment above it mean? :-)
   
   I guess it refers to the implementation, not the code.

Ok, I fixed the fib_hash.c stuff to use get_order().

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-22 11:15                                         ` Martin Josefsson
  2003-05-23  1:00                                           ` David S. Miller
  2003-05-23  1:01                                           ` David S. Miller
@ 2003-05-24  0:41                                           ` Andrew Morton
  2003-05-26  2:29                                             ` David S. Miller
  2 siblings, 1 reply; 227+ messages in thread
From: Andrew Morton @ 2003-05-24  0:41 UTC (permalink / raw)
  To: Martin Josefsson; +Cc: davem, netdev, linux-net

Martin Josefsson <gandalf@wlug.westbo.se> wrote:
>
> Any reason you're not using get_order() ?

hm.

a) mips64 and cris seem to have forgotten to implement it.

b) ppc and ia64 just had to sneak a bit of asm into their version

c) all other architectures did a copy-n-paste.


I get the feeling that we need just a single copy of this guy, in
<linux/bitops.h>



^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-24  0:41                                           ` Andrew Morton
@ 2003-05-26  2:29                                             ` David S. Miller
  0 siblings, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-05-26  2:29 UTC (permalink / raw)
  To: akpm; +Cc: gandalf, netdev, linux-net

   From: Andrew Morton <akpm@digeo.com>
   Date: Fri, 23 May 2003 17:41:56 -0700
   
   a) mips64 and cris seem to have forgotten to implement it.
   
I consider these platforms unmaintained in 2.5.x currently :-)

Without it, several generic things simply don't build.
SCSI tape is one.  Admittedly the rest of the spots are in
obscure or arch specific places.

   b) ppc and ia64 just had to sneak a bit of asm into their version
   
These are just optimizations, and I'm surprised x86 doesn't use
some clever instructions too.

   I get the feeling that we need just a single copy of this guy, in
   <linux/bitops.h>
   
True, I really doubt this is worth optimizing except to be cute.


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-20  1:23                       ` Jamal Hadi
  2003-05-20  1:24                         ` David S. Miller
@ 2003-05-26  7:18                         ` Florian Weimer
  2003-05-26  7:29                           ` David S. Miller
  1 sibling, 1 reply; 227+ messages in thread
From: Florian Weimer @ 2003-05-26  7:18 UTC (permalink / raw)
  To: Jamal Hadi; +Cc: netdev, linux-net

Jamal Hadi <hadi@shell.cyberus.ca> writes:

> Also used to attack CISCOs by them kiddies btw. We stand much better
> than any CISCO doing caching.

Cisco IOS doesn't have this hash collision problem; they moved away
from hash tables ages ago.  You are probably just seeing CPU
starvation (Cisco routers aren't equipped with the fastest available
CPUs *sigh*, and you lose if routing is not performed by other means).

BTW, CEF is just a marketing term.  There's a plethora of
implementations, ranging from software-only to ASICs to special memory
chips (associative arrays with wildcards).  These implementations have
vastly different implications for router performance.  Most notably,
CEF is not a cache (not even in the software case): the data structures
are changed when updated routing information is encountered, not
when packets that need to be routed are received.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-26  7:18                         ` Florian Weimer
@ 2003-05-26  7:29                           ` David S. Miller
  2003-05-26  9:34                             ` Florian Weimer
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-05-26  7:29 UTC (permalink / raw)
  To: fw; +Cc: hadi, netdev, linux-net

   From: Florian Weimer <fw@deneb.enyo.de>
   Date: Mon, 26 May 2003 09:18:19 +0200
   
   Cisco IOS doesn't have this hash collisions problem, they have
   moved away from hash tables ages ago.
   
Let their loss be our gain :-)  No, I am serious, their solution to
misbehaving flows seems to be just using the slow path always and
continually optimizing the slow path.

   the data structure are changed when updated routing information is
   encountered and not when packets are received which need to be
   routed.

Yes, they say this in the marketing literature too. :)

Now, how about some real explanation about what they are actually
doing?  Are they replicating the routing table all over the place?
That's one possibility, and would match up to their saying that more
router memory is required when using CEF.

The other possibility is that it's a faster-trie thing generated
from the normal routing tables.  Since CEF apparently works with QoS
and other features, the key must be many bits wide.  Probably similar
in size to our flowi's.

So some bit branching trie based upon flow parameters.  There are
hundreds of patented such schemes :-)

Anyways, you keep saying that flow hashing is stupid; can you propose
an alternative?  Really, I don't mean to be rude, but you do a lot of
complaining about how what we have now sucks and very little actual
suggesting of a usable alternative.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-21  0:13                         ` David S. Miller
@ 2003-05-26  9:29                           ` Florian Weimer
  0 siblings, 0 replies; 227+ messages in thread
From: Florian Weimer @ 2003-05-26  9:29 UTC (permalink / raw)
  To: David S. Miller; +Cc: sim, hadi, netdev, linux-net

"David S. Miller" <davem@redhat.com> writes:

>    From: Simon Kirby <sim@netnation.com>
>    Date: Tue, 20 May 2003 17:09:36 -0700
>    
>    It's rather difficult to follow, but I don't see any "h4r h4r, expl0it
>    th3 L1nux h4sh" comments or anything in the code that seems to attempt to
>    exploit the hash algorithms in (older) Linux.
>
> Look at the vc[] table and how it uses this in rndip().

The vc[] table is used to generate packets which don't fall victim to
widely implemented source address checks (e.g. "ip verify unicast
source reachable-via any" on recent Cisco routers).

I've checked the generated packets and they appear to be distributed
rather evenly among about 3,000 of the 8,192 hash buckets (with the
old hash function, of course), so juno-z.101f.c does not specifically
choose source addresses to trigger collisions.

(BTW, that's the reason why I consider the hash collision DoS attack
not too relevant in practice -- anybody who wants to DoS my machine
can probably send lots of packets to it.  juno-z.101f.c just works
well enough, even if it doesn't saturate all available bandwidth.)

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-26  7:29                           ` David S. Miller
@ 2003-05-26  9:34                             ` Florian Weimer
  2003-05-27  6:32                               ` David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: Florian Weimer @ 2003-05-26  9:34 UTC (permalink / raw)
  To: David S. Miller; +Cc: hadi, netdev, linux-net

"David S. Miller" <davem@redhat.com> writes:

> Let their loss be our gain :-)  No, I am serious, their solution to
> misbehaving flows seems to be just using slow path always and
> continually optimize the slow path.

Exactly, as a result you get stateless IP forwarding whose performance
is mostly independent of the traffic characteristics.

> Now, how about some real explanation about what they are actually
> doing?  Are they replicating the routing table all over the place?

They do this for dCEF.  In this case the CEF data structures are
replicated on every linecard that supports autonomous routing
decisions.  (This is essential for GSRs because the internal bus is
too narrow for almost any communications.  You are lucky if the
routing tables updates do not saturate it.)

> That's one possibility, and would match up to their saying that more
> router memory is required when using CEF.

CEF is essentially yet another copy of the routing table and therefore
requires memory (they do not aggregate prefixes, so some memory is
required for a full Internet routing table.)

> The other possibility is that it's a faster-trie thing generated
> from the normal routing tables.

Yeah, it's some kind of a trie according to a few Cisco documents
(which are a bit self-contradictory, though).

> Since CEF aparently works with QoS and other features, the key must
> be many bits wide.  Probably similar in size to our flowi's.

I don't think IOS QoS is based on (d)CEF.  It's true that on some
Cisco routers, QoS requires (d)CEF-enabled linecards, but I believe
this is just a software design issue and not inherently tied to the
CEF data structures.

So far I've only seen CEF tables with IPv4 addresses as indices. 8-)

> So some bit branching trie based upon flow parameters.  There are
> hundreds of patented such schemes :-)

Just ignore them. 8-)

> Anyways, you keep saying that flow hashing is stupid, can you propose
> an alternative?

Only for pure IPv4 CIDR routing (based on prefixes and destination
addresses).  I'd try the following scheme: split the destination
address into two parts, and use the more significant one as an index
into a table of (function pointer, closure data pointer) pairs.  These
functions return a pointer to the adjacency information.  They can be
implemented in various ways, depending on the structure of the less
significant part (e.g. if only one subnet is routed differently from
the others, a few comparisons are sufficient to identify it).  As a
result, the routing decision could be made with one or two indirect calls
and a couple of memory accesses.
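
Something like the following userspace sketch, perhaps (all the type and
function names here are mine, invented for illustration; the only slot
implementation shown is the trivial case where a whole /16 shares one
adjacency):

#include <stdint.h>
#include <stdio.h>

struct adjacency {
	const char *nexthop;	/* stand-in for L2 addressing information */
};

/* One slot per value of the 16 most significant destination bits:
 * a lookup function plus closure data chosen when the table is built. */
struct slot {
	const struct adjacency *(*lookup)(const void *data, uint16_t low16);
	const void *data;
};

/* Trivial slot implementation: the whole /16 shares one adjacency. */
static const struct adjacency *lookup_single(const void *data, uint16_t low16)
{
	(void)low16;
	return data;
}

/* Routing decision: one table index plus one indirect call. */
static const struct adjacency *route(const struct slot *slots, uint32_t daddr)
{
	const struct slot *s = &slots[daddr >> 16];

	return s->lookup(s->data, (uint16_t)(daddr & 0xFFFF));
}

static const struct adjacency adj_default = { "gateway via eth0" };
static struct slot table[1 << 16];

int main(void)
{
	/* Build step: every /16 resolves via the trivial single-route slot. */
	for (unsigned int i = 0; i < (1u << 16); i++)
		table[i] = (struct slot){ lookup_single, &adj_default };

	/* Route 192.0.2.1 and print the chosen next hop. */
	printf("%s\n", route(table, (192u << 24) | (2u << 8) | 1u)->nexthop);
	return 0;
}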

For hosts, if the routing table contains less than (say) ten routes,
order it by decreasing prefix length and scan it sequentially for a
match.

In all cases, L2 addresses should be stored indexed by the least
significant bits of the corresponding IP addresses (no hashing
required).

Of course, this will result in vastly decreased functionality (no
arbitrary netmasks, no policy-based routing, code will be fine-tuned
for typical Internet routing tables), so this proposal definitely
comes at a price.

(In the meantime, it might be beneficial to use more buckets in the
routing cache and rely less on collision chaining.)

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-26  9:34                             ` Florian Weimer
@ 2003-05-27  6:32                               ` David S. Miller
  2003-06-08 11:39                                 ` Florian Weimer
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-05-27  6:32 UTC (permalink / raw)
  To: fw; +Cc: hadi, netdev, linux-net

   From: Florian Weimer <fw@deneb.enyo.de>
   Date: Mon, 26 May 2003 11:34:37 +0200
   
   Of course, this will result in vastly decreased functionality (no
   arbitrary netmasks, no policy-based routing, code will be fine-tuned
   for typical Internet routing tables), so this proposal definitely
   comes at a price.

As a general purpose operating system, where people DO in fact use
these features quite regularly, we cannot make these kinds of choices
without making them optional and definitely non-default behavior.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-22 22:33                                           ` David S. Miller
@ 2003-05-29 20:51                                             ` Simon Kirby
  2003-06-02 10:58                                               ` Robert Olsson
  0 siblings, 1 reply; 227+ messages in thread
From: Simon Kirby @ 2003-05-29 20:51 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, linux-net, kuznet

On Thu, May 22, 2003 at 03:33:30PM -0700, David S. Miller wrote:

>    If you'd like I can try to regenerate a profile, but you probably
>    already know what it will look like.
> 
> I obviously know some things that will change, but I am still
> very much interested in new profiles.

Sorry for the delay -- I was away for a few days.  Here are profile
results from the same machine (still with XT-PIC), the same 300000 route
entries, and your original patch that fixes the hashing.  I should also
mention that in all of these tests I have one filter rule in the INPUT
chain (after routing) to avoid sending back zillions of ICMP packets out
to the spoofed source IPs.

...
    27 check_pgt_cache                            0.8438
  1430 ip_rcv_finish                              2.4870
   135 ipv4_dst_destroy                           2.8125
   357 cpu_idle                                   3.1875
  7714 ip_route_input_slow                        3.3481
   434 fib_rules_policy                           3.8750
  2952 ip_rcv                                     5.2714
    85 kmem_cache_alloc                           5.3125
  2188 netif_receive_skb                          5.4700
  2734 alloc_skb                                  5.6958
   822 skb_release_data                           5.7083
  2161 __kfree_skb                                5.8723
   572 ip_local_deliver                           5.9583
  1023 __constant_c_and_count_memset              6.3937
  3801 fib_validate_source                        6.7875
  6778 rt_garbage_collect                         7.1801
   497 __fib_res_prefsrc                          7.7656
  3035 inet_select_addr                           8.2473
  2717 tcp_match                                  8.4906
   552 ipt_hook                                   8.6250
   706 kmalloc                                    8.8250
  1561 kfree                                      8.8693
  1287 jhash_3words                               8.9375
  5937 nf_hook_slow                              10.9136
  2532 fib_semantic_match                        12.1731
  2356 eth_type_trans                            12.2708
  2166 nf_iterate                                12.3068
  4446 net_rx_action                             12.6307
  1622 kfree_skbmem                              12.6719
   842 rt_hash_code                              13.1562
 16030 ipt_do_table                              14.5199
  2104 tg3_recycle_rx                            14.6111
 13795 tg3_rx                                    14.6133
  5667 __kmem_cache_alloc                        17.7094
  1193 ipt_route_hook                            18.6406
  2851 do_gettimeofday                           19.7986
  7423 fib_lookup                                23.1969
  1497 fib_rule_put                              23.3906
  8803 ip_packet_match                           26.1994
  4970 dst_destroy                               28.2386
 22479 rt_intern_hash                            29.2695
  8804 kmem_cache_free                           55.0250
  8380 dst_alloc                                 58.1944
 18252 fn_hash_lookup                            63.3750
 25473 tg3_interrupt                             75.8125
 24036 do_softirq                               100.1500
 51355 ip_route_input                           118.8773
 57304 tg3_poll                                 188.5000
111691 handle_IRQ_event                         698.0688
168828 default_idle                             2637.9375

Full profile output available here:

	http://blue.netnation.com/sim/ref/
	readprofile.full_route_table_hash_fixed.*

Note that if I increase the packet rate and NAPI kicks in, all of the
handle_IRQ and similar overhead basically disappears because it no longer
uses IRQs.  Pretty spiffy.  Here is a profile of that:

...
    25 tasklet_hi_action                          0.1562
    46 timer_bh                                   0.2054
    97 net_rx_action                              0.2756
    93 tg3_vlan_rx                                0.3875
   158 tg3_poll                                   0.5197
  1630 ip_rcv_finish                              2.8348
   142 ipv4_dst_destroy                           2.9583
   429 fib_rules_policy                           3.8304
  8959 ip_route_input_slow                        3.8885
  2438 ip_rcv                                     4.3536
  2504 alloc_skb                                  5.2167
  1991 __kfree_skb                                5.4103
  2279 netif_receive_skb                          5.6975
   929 skb_release_data                           6.4514
   669 ip_local_deliver                           6.9688
  1175 __constant_c_and_count_memset              7.3438
  2367 tcp_match                                  7.3969
   124 kmem_cache_alloc                           7.7500
  4535 fib_validate_source                        8.0982
   598 __fib_res_prefsrc                          9.3438
  8896 rt_garbage_collect                         9.4237
  3582 inet_select_addr                           9.7337
  1747 kfree                                      9.9261
   717 ipt_hook                                  11.2031
   938 kmalloc                                   11.7250
  1747 jhash_3words                              12.1319
  6879 nf_hook_slow                              12.6452
  2439 eth_type_trans                            12.7031
  1695 kfree_skbmem                              13.2422
  2358 nf_iterate                                13.3977
   872 rt_hash_code                              13.6250
  2933 fib_semantic_match                        14.1010
 16553 ipt_do_table                              14.9937
 15339 tg3_rx                                    16.2489
  2482 tg3_recycle_rx                            17.2361
  5967 __kmem_cache_alloc                        18.6469
  1237 ipt_route_hook                            19.3281
  3120 do_gettimeofday                           21.6667
  8299 ip_packet_match                           24.6994
  8031 fib_lookup                                25.0969
  1877 fib_rule_put                              29.3281
  6088 dst_destroy                               34.5909
 26833 rt_intern_hash                            34.9388
 10666 kmem_cache_free                           66.6625
 20193 fn_hash_lookup                            70.1146
 10516 dst_alloc                                 73.0278
 64803 ip_route_input                           150.0069

Full profile output available as:

	readprofile.full_route_table_hash_fixed_napi.*

Hmm.. I see there is some redundant hashing going on in
ip_route_input_slow() (called only from ip_route_input() which already
calculates the hash), but my patch to fix that adds yet another argument
to ip_route_input_slow(), which isn't that pretty.  It looks like that function
isn't using much CPU anyway.

Why is ip_route_input() so heavy still?  This kernel is compiled with
CONFIG_SMP, which makes the read_lock() calls actually do something, but
it looks like they should be fairly light.  Should I add an iteration
counter to the for loop, perhaps?

Simon-

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-29 20:51                                             ` Simon Kirby
@ 2003-06-02 10:58                                               ` Robert Olsson
  2003-06-02 15:18                                                 ` Simon Kirby
  2003-06-09 17:19                                                 ` David S. Miller
  0 siblings, 2 replies; 227+ messages in thread
From: Robert Olsson @ 2003-06-02 10:58 UTC (permalink / raw)
  To: Simon Kirby; +Cc: David S. Miller, netdev, linux-net, kuznet


Simon Kirby writes:
 > Full profile output available here:
 > 
 > 	http://blue.netnation.com/sim/ref/
 > 	readprofile.full_route_table_hash_fixed_napi.*
 > 
 > Note that if I increase the packet rate and NAPI kicks in, all of the
 > handle_IRQ and similar overhead basically disappears because it no longer
 > uses IRQs.  Pretty spiffy.  Here is a profile of that:
 > Full profile output available as:


  8896 rt_garbage_collect                         9.4237
  8959 ip_route_input_slow                        3.8885
 10516 dst_alloc                                 73.0278
 10666 kmem_cache_free                           66.6625
 15339 tg3_rx                                    16.2489
 16553 ipt_do_table                              14.9937
 20193 fn_hash_lookup                            70.1146
 26833 rt_intern_hash                            34.9388
 64803 ip_route_input                           150.0069

 From a DoS perspective this is a more interesting experiment than the one where you limited
 the input rate to leave about 30% idle CPU.

 A new dst is coming in all the time: it is first searched for in the hash (ip_route_input),
 not found, and so the ip_route_input_slow/fn_hash_lookup/dst_alloc/rt_intern_hash path is
 taken to add a new dst entry...

 And later GC has to remove all entries with spin_lock_bh held (no packet processing 
 runs). I see packet drops exactly when GC runs. Tuning GC might help, but it's something 
 to observe.

 
 I had an idea to rate-limit new flows and try to isolate the device causing the DoS.
 Something like this (in ip_route_input):

 [We don't have a hash entry]

        /* 
           DoS check... Rate down but do not stop GC and creation of new 
           hash entries until GC frees resources. We limit per interface 
           so hogger dev(s) will be hit hardest. As a side effect we get 
           dst_overrun per device.

        */

        entries = atomic_read(&ipv4_dst_ops.entries);

        if (entries > ip_rt_max_size) {
                int drp = 4;

                if( dev->dst_hash_overrun++ % drp ) {

                        if (net_ratelimit())
                                printk(KERN_WARNING "dst creation throttled\n");

                        return -ECONNREFUSED;
                }

       /* Also make sure the slow path gets a chance to create the dst entry */

                if (ipv4_dst_ops.gc && ipv4_dst_ops.gc()) {
                        RT_CACHE_STAT_INC(gc_dst_overflow);
                        return -ENOBUFS;
                }
        }
  
 [ip_route_input_slow comes here] 


 But more thinking is needed...

 Cheers.
							--ro
  


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-02 10:58                                               ` Robert Olsson
@ 2003-06-02 15:18                                                 ` Simon Kirby
  2003-06-02 16:36                                                   ` Robert Olsson
  2003-06-09 17:19                                                 ` David S. Miller
  1 sibling, 1 reply; 227+ messages in thread
From: Simon Kirby @ 2003-06-02 15:18 UTC (permalink / raw)
  To: Robert Olsson; +Cc: David S. Miller, netdev, linux-net, kuznet

On Mon, Jun 02, 2003 at 12:58:31PM +0200, Robert Olsson wrote:

>  New dst is coming all the time first seached in hash (ip_route_input) and not found
>  so ip_route_input_slow/fn_hash_lookup/dst_alloc/rt_intern_hash path is taken to add
>  a new dst entry...
> 
>  And later GC have to remove all enties with spin_lock_bh hold (no packet processing 
>  runs). I see packet drops exactly when GC runs. Tuning GC might help but it's something 
>  to observe.
> 
>  I had some idea to rate-limit new flows and try to isolate the device causing the DoS 
>  Something like (ip_route_input):
...
>                         if (net_ratelimit())
>                                 printk(KERN_WARNING "dst creation throttled\n");
> 
>                         return -ECONNREFUSED;

This reminds me of the situation we experienced with the dst cache
overflowing in early 2.2 kernels.  This was a long time ago, when our
traffic was only about 10 Mbits/second.  We had recently upgraded from a
2.0 kernel.  The dst cache was overflowing due to a bug in the garbage
collector, and at the time, no messages were printed.  It took me a
_long_ time to figure out why connections to a server I hadn't previously
connected to in a while would only work every so often, and not
immediately like they should.  I'm afraid this approach will have a
similar effect, albeit (hopefully) only under an attack.

Is it possible to have a dst LRU or a simpler approximation of such and
recycle dst entries rather than deallocating/reallocating them?  This
would relieve a lot of work from the garbage collector and avoid the
periodic large garbage collection latency.  It could be tuned to only
occur in an attack (I remember Alexey saying that the deferred garbage
collection was implemented to reduce latency in normal operation).

Would this work?  Cross-CPU thrashing issues?

Simon-

[        Simon Kirby        ][        Network Operations        ]
[     sim@netnation.com     ][   NetNation Communications Inc.  ]
[  Opinions expressed are not necessarily those of my employer. ]

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-02 15:18                                                 ` Simon Kirby
@ 2003-06-02 16:36                                                   ` Robert Olsson
  2003-06-02 18:05                                                     ` Simon Kirby
  2003-06-09 17:21                                                     ` David S. Miller
  0 siblings, 2 replies; 227+ messages in thread
From: Robert Olsson @ 2003-06-02 16:36 UTC (permalink / raw)
  To: Simon Kirby; +Cc: Robert Olsson, David S. Miller, netdev, linux-net, kuznet


Simon Kirby writes:

 > This reminds me of the situation we experienced with the dst cache
 > overflowing in early 2.2 kernels.  This was a long time ago, when our
 > traffic was only about 10 Mbits/second.  We had recently upgraded from a
 > 2.0 kernel.  The dst cache was overflowing due to a bug in the garbage
 > collector, and at the time, no messages were printed.  It took me a
 > _long_ time to figure out why connections to a server I hadn't previously
 > connected to in a while would only work every so often, and not
 > immediately like they should.  I'm affraid this approach will have a
 > similar effect, albeit (hopefully) only under an attack.

 We are given more work than we have resources for (max_size); what else
 can we do but refuse?  But yes, by then we have already invested quite a bit of work.

 Also remember we are looking into runs where 100% of incoming traffic has one 
 new dst for every packet. So how is the situation in "real life"? 
 In the case of multiple devices, at least NAPI gives each dev its share.

 > Is it possible to have a dst LRU or a simpler approximation of such and
 > recycle dst entries rather than deallocating/reallocating them?  This
 > would relieve a lot of work from the garbage collector and avoid the
 > periodic large garbage collection latency.  It could be tuned to only
 > occur in an attack (I remember Alexey saying that the deferred garbage
 > collection was implemented to reduce latency in normal operation).

 I don't see how this can be done. Others may?

 Cheers.
						--ro
 

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-02 16:36                                                   ` Robert Olsson
@ 2003-06-02 18:05                                                     ` Simon Kirby
  2003-06-09 17:21                                                     ` David S. Miller
  1 sibling, 0 replies; 227+ messages in thread
From: Simon Kirby @ 2003-06-02 18:05 UTC (permalink / raw)
  To: Robert Olsson; +Cc: David S. Miller, netdev, linux-net, kuznet

On Mon, Jun 02, 2003 at 06:36:37PM +0200, Robert Olsson wrote:

 >  We are given more work than we have resources for (max_size); what else
 >  can we do but refuse?  But yes, by then we have already invested quite a bit of work.

Well, this is the problem.  We do not and cannot know which entries we
really want to remember (legitimate traffic).  Adding code to actually
refuse new dst entries is just going to make the DoS effective, which is
NOT what we want.

 >  Also remember we are looking into runs where 100% of incoming traffic has one 
 >  new dst for every packet. So how is the situation in "real life"? 
 >  In the case of multiple devices, at least NAPI gives each dev its share.

Right, so, when we are traffic saturated, we want to make sure the whole
route cache and route path is as fast as possible.  Recycling dst entries
by simpy rewriting and rehashing them rather than allocating new and
eventually freeing them all in the garbage collection cycle should reduce
allocator overhead.  If this is only done when the table is full, I don't
see any downside...if this is in fact doable, that is. :)
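
Here is a toy userspace sketch of that recycling idea (nothing here is
the actual kernel dst code; the structures are invented, and locking and
hash reinsertion are omitted): when the cache is at its limit, the
oldest entry is reused in place instead of being freed and reallocated.

#include <stdlib.h>
#include <string.h>

/* Toy cache entry, standing in for a dst/rtable entry. */
struct entry {
	unsigned int key;
	struct entry *lru_prev, *lru_next;
};

struct cache {
	unsigned int count, max;
	struct entry *lru_head, *lru_tail;	/* head = newest, tail = oldest */
};

/* Unlink and return the oldest entry, or NULL if the cache is empty. */
static struct entry *pop_oldest(struct cache *c)
{
	struct entry *e = c->lru_tail;

	if (!e)
		return NULL;
	c->lru_tail = e->lru_prev;
	if (c->lru_tail)
		c->lru_tail->lru_next = NULL;
	else
		c->lru_head = NULL;
	c->count--;
	return e;
}

/* When the cache is at its limit, recycle the oldest entry in place
 * instead of going through free()/malloc(); otherwise allocate. */
static struct entry *get_entry(struct cache *c)
{
	struct entry *e;

	if (c->count >= c->max && (e = pop_oldest(c)) != NULL) {
		memset(e, 0, sizeof(*e));	/* rewrite in place */
		return e;
	}
	return calloc(1, sizeof(struct entry));
}

int main(void)
{
	struct cache c = { 0, 1, NULL, NULL };	/* max_size = 1 */
	struct entry *a = get_entry(&c);	/* allocated fresh */

	/* pretend 'a' was keyed and inserted; it is now the only entry */
	a->key = 1;
	c.lru_head = c.lru_tail = a;
	c.count = 1;

	struct entry *b = get_entry(&c);	/* cache full: 'a' is recycled */
	int recycled = (b == a);

	free(b);
	return recycled ? 0 : 1;
}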

Simon-

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-05-27  6:32                               ` David S. Miller
@ 2003-06-08 11:39                                 ` Florian Weimer
  2003-06-08 12:05                                   ` David S. Miller
  2003-06-08 17:58                                   ` Pekka Savola
  0 siblings, 2 replies; 227+ messages in thread
From: Florian Weimer @ 2003-06-08 11:39 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, linux-net

"David S. Miller" <davem@redhat.com> writes:

>    Of course, this will result in vastly decreased functionality (no
>    arbitrary netmasks, no policy-based routing, code will be fine-tuned
>    for typical Internet routing tables), so this proposal definitely
>    comes at a price.
>
> As a general purpose operating system, where people DO in fact use
> these features quite regularly,

Even non-CIDR netmasks?  AFAIK, it's hard to find dedicated networking
devices (and routing protocols!) which support them. 8-/

Anyway, I've played a bit with something inspired by CEF (more
precisely speaking, one diagram in the IOS internals book and some IOS
diagnostic output).

Basically, it's a 256-way trie, with "adjacency information" at the
leaves (consisting of L2 addressing information and the prefix
length).  The leaves contain a full list of child nodes which refer
back to the leaf itself.  This allows for branch-free routing
(see below).  (A further optimization would not allocate the
self-referencing pointers for leaves which are at the fourth layer of
the trie, but this is unlikely to have a huge performance impact.)

The trie has 7862 internal nodes for my copy of the Internet routing
table, which amounts to 8113584 bytes (excluding memory management
overhead; twice the value for 64-bit architectures).  The number of
internal nodes does not depend on the number of interfaces/peerings,
and prefix filtering based on their lengths (/27 or even /24) doesn't
make a huge difference either.

For each adjacency, space for the L2 addressing information is
required plus 256 pointers for the self-references (of course, for
each relevant prefix length, so you have a few kilobytes for a typical
peering).

The routing function looks like this:

struct cef_entry *
cef_route (struct cef_table *table, ipv4_t address)
{
	unsigned char octet1 = address >> 24;
	unsigned char octet2 = (address >> 16) & 0xFF;
	unsigned char octet3 = (address >> 8) & 0xFF;
	unsigned char octet4 = address & 0xFF;
			
	struct cef_entry * entry1 = table->children[octet1];
	struct cef_entry * entry2 = entry1->table[octet2];
	struct cef_entry * entry3 = entry2->table[octet3];
	struct cef_entry * entry4 = entry3->table[octet4];

	return entry4;
}

For the full routing table with "maximum" adjacency information
(different L2 addressing information for each origin AS) and
"real-world" addresses (captured at the border of a medium-size
network, local addresses filtered), the function needs about 82 cycles
per routing decision on my Athlon XP (including function call
overhead).  For random addresses, we have 155 cycles.  In a simulation
of a moderate peering (only 94 adjacencies, simulated interfaces to
half a dozen AS concentrated in Germany), I measured 45 cycles per
routing decision for real-world traffic, and 70 cycles for random
addresses.  (More peerings result in more adjacencies which lead to
fewer cache hits.)

You can save 1K (or 2K on 64-bit architectures) per adjacency if you
introduce data-dependent branches:

struct cef_entry *
cef_route (struct cef_table *table, ipv4_t address)
{
	unsigned char octet1 = address >> 24;
	struct cef_entry * entry1 = table->children[octet1];

	if (entry1->prefix_length < 0) {
		unsigned char octet2 = (address >> 16) & 0xFF;
		struct cef_entry * entry2 = entry1->table[octet2];
		if (entry2->prefix_length < 0) {
			unsigned char octet3 = (address >> 8) & 0xFF;
			struct cef_entry * entry3 = entry2->table[octet3];
			if (entry3->prefix_length < 0) {
				unsigned char octet4 = address & 0xFF;
				struct cef_entry * entry4 = entry3->table[octet4];
				return entry4;
			} else {
				return entry3;
			}
		} else {
			return entry2;
		}
	} else {
		return entry1;
	}
}

However, this decreases performance (even on my Athlon XP with just
256 KB cache).

At the moment, I've got a userspace prototype for simulations which
can build the trie and make routing decisions.  Removing entries is a
bit tricky and requires more data because formerly overridden prefixes
might have to be resurrected.  I'm unsure which data structures should
be used to solve this problem.  Memory management is a related
question, too.  And locking. *sigh*
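
One possible (and memory-hungry) way to attack the resurrection
problem -- purely a sketch of the idea, not something the prototype
does -- is to log, for every slot a more specific prefix overwrites,
the entry it displaced, so that deleting the prefix can put the old
entries back:

struct cef_override {
	struct cef_entry	*node;		/* node whose slot was overwritten */
	unsigned char		 index;		/* which of the 256 slots */
	struct cef_entry	*displaced;	/* what used to be there */
	struct cef_override	*next;		/* per-prefix list of overrides */
};

static void cef_undo_overrides(struct cef_override *list)
{
	while (list) {
		struct cef_override *o = list;

		o->node->table[o->index] = o->displaced;
		list = o->next;
		/* free(o) in the userspace prototype */
	}
}

Stacked overrides (a /24 on top of a /20 on top of a /16) come out
correctly as long as deletions undo the logs in reverse order of
insertion; out-of-order deletion would need a per-slot list instead.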

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-08 11:39                                 ` Florian Weimer
@ 2003-06-08 12:05                                   ` David S. Miller
  2003-06-08 13:10                                     ` Florian Weimer
  2003-06-08 17:58                                   ` Pekka Savola
  1 sibling, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-08 12:05 UTC (permalink / raw)
  To: fw; +Cc: netdev, linux-net

   From: Florian Weimer <fw@deneb.enyo.de>
   Date: Sun, 08 Jun 2003 13:39:49 +0200

   "David S. Miller" <davem@redhat.com> writes:
   
   > As a general purpose operating system, where people DO in fact use
   > these features quite regularly,
   
   Even non-CIDR netmasks?  AFAIK, it's hard to find dedicated networking
   devices (and routing protocols!) which support them. 8-/
   
Yes, people use source-based routing to block specific IPs and
subnets; it's also needed for Mobile IPv4.

   Anyway, I've played a bit with something inspired by CEF (more
   precisely speaking, one diagram in the IOS internals book and some IOS
   diagnostic output).
   
Thanks, Alexey and I will need to study this deeply.

Although I hope it's not "too similar" to what CEF does, because
undoubtedly Cisco has a bazillion patents in this area.  This is
actually an argument for coming up with our own algorithms without
any knowledge of what CEF does or might do. :(

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-08 12:05                                   ` David S. Miller
@ 2003-06-08 13:10                                     ` Florian Weimer
  2003-06-08 23:49                                       ` Simon Kirby
  2003-06-10  3:05                                       ` Steven Blake
  0 siblings, 2 replies; 227+ messages in thread
From: Florian Weimer @ 2003-06-08 13:10 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, linux-net

"David S. Miller" <davem@redhat.com> writes:

> Although, I hope it's not "too similar" to what CEF does because
> undoubtedly Cisco has a bazillion patents in this area.

Most things in this area are patented, and the patents are extremely
fuzzy (e.g. policy-based routing with a hierarchical sequence of
decisions has been patented countless times). 8-(

> This is actually an argument for coming up with our own algorithms
> without any knowledge of what CEF does or might do. :(

The branchless variant is not described in the IOS book, and I can't
tell if Cisco routers use it.  If this idea is really novel, we are in
pretty good shape because we no longer use trees, tries or whatever,
but a DFA. 8-)

Further parameters which could be tweaked are the kind of adjacency
information (where to store the L2 information, whether to include the
prefix length in the adjacency record, etc.).

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-08 11:39                                 ` Florian Weimer
  2003-06-08 12:05                                   ` David S. Miller
@ 2003-06-08 17:58                                   ` Pekka Savola
  1 sibling, 0 replies; 227+ messages in thread
From: Pekka Savola @ 2003-06-08 17:58 UTC (permalink / raw)
  To: Florian Weimer; +Cc: David S. Miller, netdev, linux-net

On Sun, 8 Jun 2003, Florian Weimer wrote:
> "David S. Miller" <davem@redhat.com> writes:
> 
> >    Of course, this will result in vastly decreased functionality (no
> >    arbitary netmasks, no policy-based routing, code will be fine-tuned
> >    for typical Internet routing tables), so this proposal definitely
> >    comes at a price.
> >
> > As a general purpose operating system, where people DO in fact use
> > these features quite regularly,
> 
> Even non-CIDR netmasks?  AFAIK, it's hard to find dedicated networking
> devices (and routing protocols!) which support them. 8-/

Do you mean netmasks like "255.128.255.0"?  Those are a real
abomination and probably not supported, and I don't know of anything
that would require them.

Or do you mean netmasks such as 1.1.1.1/19?  I don't know of any credible 
networking devices which wouldn't support them.  If so, please come out of 
the cave.

-- 
Pekka Savola                 "You each name yourselves king, yet the
Netcore Oy                    kingdom bleeds."
Systems. Networks. Security. -- George R.R. Martin: A Clash of Kings

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-08 13:10                                     ` Florian Weimer
@ 2003-06-08 23:49                                       ` Simon Kirby
  2003-06-08 23:55                                         ` CIT/Paul
  2003-06-09  5:38                                         ` David S. Miller
  2003-06-10  3:05                                       ` Steven Blake
  1 sibling, 2 replies; 227+ messages in thread
From: Simon Kirby @ 2003-06-08 23:49 UTC (permalink / raw)
  To: Florian Weimer; +Cc: netdev, linux-net

On Sun, Jun 08, 2003 at 03:10:25PM +0200, Florian Weimer wrote:

> Further parameters which could be tweaked is the kind of adjacency
> information (where to store the L2 information, whether to include the
> prefix length in the adjacency record etc.).

What is the problem with the current approach?  Does the overhead come
from having to iterate through the hashes for each prefix?

Simon-

[        Simon Kirby        ][        Network Operations        ]
[     sim@netnation.com     ][   NetNation Communications Inc.  ]
[  Opinions expressed are not necessarily those of my employer. ]

^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-08 23:49                                       ` Simon Kirby
@ 2003-06-08 23:55                                         ` CIT/Paul
  2003-06-09  3:15                                           ` Jamal Hadi
                                                             ` (2 more replies)
  2003-06-09  5:38                                         ` David S. Miller
  1 sibling, 3 replies; 227+ messages in thread
From: CIT/Paul @ 2003-06-08 23:55 UTC (permalink / raw)
  To: 'Simon Kirby', 'Florian Weimer'; +Cc: netdev, linux-net

The problem with the route cache as it stands is that it adds every new
packet that isn't in the route cache to the cache.  Say you have a
denial of service attack going on, OR you just have millions of hosts
going through the router (if you were an ISP).  Anything with seemingly
random source IPs (something like juno-z.101f.c will generate the worst
case scenario for forwarding packets) will cause the cache to constantly
add new entries at pretty much the rate of the attack.  This can stifle
just about any Linux router with a measly 10 megabits/second of traffic
unless the router is tuned up to a large degree (NAPI, certain NICs,
route cache timings, etc.), and even then it can still be destroyed, no
matter what the CPU is, with less than 100,000 packets per second and in
most cases less than 30k.  That's why it's just not acceptable for
companies using it as a replacement for, say, a Cisco 7200 VXR series
(NPE-300/400, NSE-1, etc.) which can do 300K+ packets per second of
routing (and yes, it can even route juno-z.101f.c at 300kpps, I have
tested it).  Linux has no problem doing 300kpps from a single source to
a single destination provided you have NAPI or ITR or something limiting
the interrupts.  The overhead is the route cache and the related systems
that use it, and also netfilter is very slow :/  One of these days they
will fix it.....  If anyone has any ideas or needs a test-bed to try out
code on, or would like me to test some of their code, I would be happy
to test it on our development platforms (single and dual processor with
Intel e1000 82545/6 and above, also e100 and tulip).

Thanks for your time

P.S. To answer your iteration question: it does not seem to be such an
overhead on the CPU even if the route cache is 600,000 entries in size.
I have tested this, and while there is a definite increase in CPU it
comes nothing close to the code that has to add every new arriving
packet to the list.  IMHO the best way to do this would be like CEF
with adjacency lists, and not have it add every new packet that comes
along.

Paul xerox@foonet.net http://www.httpd.net


^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-08 23:55                                         ` CIT/Paul
@ 2003-06-09  3:15                                           ` Jamal Hadi
  2003-06-09  5:27                                             ` CIT/Paul
                                                               ` (2 more replies)
  2003-06-09  5:44                                           ` David S. Miller
  2003-06-09  6:47                                           ` Simon Kirby
  2 siblings, 3 replies; 227+ messages in thread
From: Jamal Hadi @ 2003-06-09  3:15 UTC (permalink / raw)
  To: CIT/Paul
  Cc: 'Simon Kirby', 'Florian Weimer', netdev, linux-net



On Sun, 8 Jun 2003, CIT/Paul wrote:

> The problem with the route cache as it stands is that it adds every new
> packet that isn't in the route cache to the cache, say you have
> A denial of service attack going on, OR you just have millions of hosts
> going through the router (if you were an ISP).  Anything with seeminly
> Random source ips (something like juno-z.101f.c will generate worst case
> scenario for forwarding packets) will cause the cache to constantly
> Add new entries at pretty much the rate of the attack.. This can stifle
> just about any linux router with a measly 10 megabits/second of traffic
> unless

foo, have you tried the latest patches posted recently?
Get the latest kernel 2.5.x and try it out.
BTW, I don't think it is true that you can die with 10mbps.  I was
reading some emails where someone said it was a few hundred pps that
would kill the Linux system (theory mixed with nonsense ;->)

> The router is tuned up to a large degree (NAPI, certain nics, route
> cache timings, etc.) and even then it can still be destroyed no matter
> what
> The cpu is with less than 100,000 packets per second and in mosts cases
> less than 30k..

BTW, that's way above 10Mbps.

> That's why it's just no acceptable for companies using
> it as a replacement for say a cisco 7200 VXR series (npe300,400 nsf-1,
> etc.) which can do 300K+ packet per second of routing (and yes it can
> even route juno-z.101f.c at 300kpps, I have tested it).   Linux has no
> problem doing 300kpps from a single source to a single destination
> provided you have NAPI or ITR or something limiting the interrupts.. The
> overhead is the route cache and the related systems that use it and also
> netfilter is very slow :/  One of these days they will fix it..... If
> anyone has any ideas or needs a test-bed to try out code on or would
> like me to test some of their code I would be happy to test it on our
> development platforms (single and dual processor with intel e1000
> 82545/6 and above, also e100 and tulip).
>

I think Robert has some numbers with the new patches on setups similar
to yours.
Why don't you also compare the cost of the Cisco NPE-x devices with
that of Linux PCs with e1000s while you are at it? ;->
I am sure there are people who would like to sell you Linux devices
at half the Cisco prices doing millions of PPS via hardware assists.
Support these Linux-supporting companies instead ;->

The more I think about it, the more I think CEF is a lame escape from
route caches.  What we need is multi-tries in the slow path and
perhaps a binary tree on the hash collision buckets of the dst cache
(instead of a linked list).
You can avoid the packet-driven cache generation event by being a
little creative if it gets overwhelming.  Fix zebra to resolve each
BGP nexthop fully at periodic intervals.
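
Something like this for the buckets, to make the point concrete (an
illustrative sketch only, keyed on a simplified src/dst pair; a real
version would want a balanced tree such as the kernel rbtree so an
attacker can't degenerate one bucket into a long path):

struct rt_tree_node {
	struct rt_tree_node	*left, *right;
	unsigned int		 saddr, daddr;	/* simplified flow key */
	/* ... the cached route itself ... */
};

static struct rt_tree_node *rt_bucket_find(struct rt_tree_node *root,
					   unsigned int saddr,
					   unsigned int daddr)
{
	while (root) {
		if (saddr == root->saddr && daddr == root->daddr)
			return root;		/* hit */
		if (saddr < root->saddr ||
		    (saddr == root->saddr && daddr < root->daddr))
			root = root->left;
		else
			root = root->right;
	}
	return NULL;		/* miss: fall through to the slow path */
}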

In any case who said forwarding by itself was sexy anymore?

cheers,
jamal

^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-09  3:15                                           ` Jamal Hadi
@ 2003-06-09  5:27                                             ` CIT/Paul
  2003-06-09  5:58                                               ` David S. Miller
  2003-06-09  6:25                                             ` David S. Miller
  2003-06-09 13:04                                             ` Ralph Doncaster
  2 siblings, 1 reply; 227+ messages in thread
From: CIT/Paul @ 2003-06-09  5:27 UTC (permalink / raw)
  To: 'Jamal Hadi'
  Cc: 'Simon Kirby', 'Florian Weimer', netdev, linux-net

Ahah Jamal!!  Yes, I have tried.  It does absolutely nothing for the
constant randomness of packets.  It increases the overall distribution
of the hash in the cache, but it does nothing for the addition of new
packets.  Try forwarding packets generated by juno-z.101f.c and it
adds EVERY packet to the route cache.  Every one.  And at 30,000 pps
it destroys the cache, because every single packet coming in is NOT in
the route cache because it's random IPs.  Nothing you can do about
that except make the cache and everything related to it wickedly
faster, OR remove the per-packet additions to the cache (I'm not even
sure why this is necessary anyway.  Who would want to add every single
src/dst flow to a cache?  That's what conntrack does, and we all know
how much you despise that heheheh).
And yes, you can die with 10mbps...  Try putting in some netfilter
rules and some basic traffic, and then hit it with 10mbps of juno-z
and see what happens to your CPU.  Granted, if there is a Linux router
doing ABSOLUTELY NOTHING you might be able to hit 50kpps of juno with
dual P3 CPUs w/ 512k cache each and tricked-out settings for the hash
and route cache, but you will also drop some packets along the way.
Still, this is not acceptable yet :>
Point me at some decent-cost Linux hardware-assist platforms.  IMHO
the only thing that needs hardware assist is the darn route cache (in
its entirety).
BTW, juno-z can send 12,000 packets per second or more and it's still
10mbps :>

If anyone has any ideas please feel free to e-mail me directly :>


Paul xerox@foonet.net http://www.httpd.net


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-08 23:49                                       ` Simon Kirby
  2003-06-08 23:55                                         ` CIT/Paul
@ 2003-06-09  5:38                                         ` David S. Miller
  1 sibling, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-09  5:38 UTC (permalink / raw)
  To: sim; +Cc: fw, netdev, linux-net

   From: Simon Kirby <sim@netnation.com>
   Date: Sun, 8 Jun 2003 16:49:26 -0700

   On Sun, Jun 08, 2003 at 03:10:25PM +0200, Florian Weimer wrote:
   
   > Further parameters which could be tweaked is the kind of adjacency
   > information (where to store the L2 information, whether to include the
   > prefix length in the adjacency record etc.).
   
   What is the problem with the current approach?  Does the overhead come
   from having to iterate through the hashes for each prefix?

It comes from doing the slow path, which actually had a bug
(wouldn't grow the hash tables past a certain point).

I bet most of Florian's performance problems go away if he
runs with the fib_hash fix that I put into the tree.

In fact, the current slow path is _OPTIMAL_ for any sane routing
table.  The lookups are exactly O(n_prefixes), where n_prefixes
is the number of unique subnet prefixes you've added to your routing
table.

This is precisely the same complexity as you'd get with a trie-based
approach with guaranteed depth not exceeding 32.

I think most people are unaware of how the slow path we have actually
works.
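
Roughly (this is an illustration of the shape of the thing, not the
fib_hash.c source): there is one hash table per prefix length present
in the routing table, and lookup walks them from longest prefix to
shortest, so a miss costs one hash probe per distinct prefix length in
use, not one per route:

#define FZ_HASH_SIZE 256

struct fib_node_sketch {
	struct fib_node_sketch	*next;
	unsigned int		 key;		/* destination & zone mask */
	/* ... route attributes ... */
};

struct fib_zone_sketch {
	struct fib_zone_sketch	*next_shorter;	/* next populated prefix length */
	int			 prefix_len;
	struct fib_node_sketch	*hash[FZ_HASH_SIZE];
};

static struct fib_node_sketch *
fib_lookup_sketch(struct fib_zone_sketch *longest, unsigned int dst)
{
	struct fib_zone_sketch *fz;

	for (fz = longest; fz; fz = fz->next_shorter) {
		unsigned int mask = fz->prefix_len ?
			~0u << (32 - fz->prefix_len) : 0;
		unsigned int key = dst & mask;
		struct fib_node_sketch *f;

		/* trivial hash for the sketch; the real code hashes the key */
		for (f = fz->hash[(key >> 24) % FZ_HASH_SIZE]; f; f = f->next)
			if (f->key == key)
				return f;	/* longest match wins */
	}
	return NULL;
}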

The place I see bugs is in the routing cache GC operation: it can't
keep up with how fast we can create new routing cache entries, and
this is merely because it isn't tuned, not because it is incapable of
keeping equilibrium properly.

This is why I really wish Florian would explore this area instead
of ripping the whole thing apart :-)

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-08 23:55                                         ` CIT/Paul
  2003-06-09  3:15                                           ` Jamal Hadi
@ 2003-06-09  5:44                                           ` David S. Miller
  2003-06-09  5:51                                             ` CIT/Paul
  2003-06-09  6:47                                           ` Simon Kirby
  2 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-09  5:44 UTC (permalink / raw)
  To: xerox; +Cc: sim, fw, netdev, linux-net

   From: "CIT/Paul" <xerox@foonet.net>
   Date: Sun, 8 Jun 2003 19:55:58 -0400

   The problem with the route cache as it stands is that it adds every new
   packet that isn't in the route cache to the cache, say you have 
   A denial of service attack going on, OR you just have millions of hosts
   going through the router (if you were an ISP).

We now perform rather acceptably in such scenarios.   Robert Olsson
has demonstrated that even if the attacker could fill up your
entire bandwidth with random source address packets, we'd still
provide 50kpps routing speed.

And this can be made much higher because the performance limiter
is the routing cache GC which isn't tuned properly.  It can't keep
up because it doesn't try to purge the right number of entries each
pass.

All the performance problems I've seen have been algorithmic or
outright bugs.  Bad hash functions and limits in how big the FIB
hash tables would grow.  And what's left is fixing GC.

There is nothing AT ALL fundamental about a routing cache that
precludes it from behaving sanely in the presence of a random source
address DoS load.  Absolutely NOTHING.

   This can stifle just about any linux router with a measly 10
   megabits/second of traffic unless

Not true, that happens because of BUGs.  Not because routing caches
cannot behave sanely in such situations.

   The router is tuned up to a large degree (NAPI, certain nics, route
   cache timings, etc.) and even then it can still be destroyed no matter
   what

And today, this is because of BUGs in how the GC works.  You can
design the GC process so that it does the right thing and recycles
only the DoS entries (those being very non-localized).

You should interact with Robert Olsson who has been doing tests on the
effect of gigabit rate full-on DoS runs where every packet creates a
new routing cache entry.

Franks a lot,
David S. Miller
davem@redhat.com

^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-09  5:44                                           ` David S. Miller
@ 2003-06-09  5:51                                             ` CIT/Paul
  2003-06-09  6:03                                               ` David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: CIT/Paul @ 2003-06-09  5:51 UTC (permalink / raw)
  To: 'David S. Miller'; +Cc: sim, fw, netdev, linux-net

I'd love to test this out.. If it could do full gigabit line rate with
random ips that would be soooooooo nice :>
We wouldn't have to have so many routers any more!! :)


Paul xerox@foonet.net http://www.httpd.net


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09  5:27                                             ` CIT/Paul
@ 2003-06-09  5:58                                               ` David S. Miller
  2003-06-09  6:28                                                 ` CIT/Paul
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-09  5:58 UTC (permalink / raw)
  To: xerox; +Cc: hadi, sim, fw, netdev, linux-net

   From: "CIT/Paul" <xerox@foonet.net>
   Date: Mon, 9 Jun 2003 01:27:48 -0400

   Ahah Jamal!! Yes I have tried.. It does absoutely nothing for the
   constant randomness of packets.
   It increases the overall distribution of the hash in the cache but it
   does nothing for the addition of new packets..
   Try fowarding packets generated by juno-z.101f.c and it adds EVERY
   packet to the route cache.. Every one. And at 30,000 pps
   It destroys the cache because every single packet coming in is NOT in
   the route cache because it's random ips.

So you make packets that do things like this GC the oldest
(LRU) routing cache entry.

This isn't rocket science, and well behaved flows will still get all
the benefits of the routing cache.

The only person penalized will be the attacker since his routing
cache entries will purge out quickly and as a response to HIS traffic.

   Nothing you can do

No, there are many things we can do.

Prove to me that routing caches are unable to behave acceptably in
random source address DoS situations.

   (I'm not Even sure why this is necessary anyway.. Who would want to
   add every single src/dst flow to a cache?

Because 99% of traffic is well-behaved flows, trains of packets.
Even the most loaded core routers see flow lifetimes of at least
8 or 9 packets.

Even if the flows lasted 3 packets, the input route lookup work
saved (source address validation in particular, which requires
access to a centralized global table and thus does not scale well
on SMP) is entirely worth it.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09  5:51                                             ` CIT/Paul
@ 2003-06-09  6:03                                               ` David S. Miller
  2003-06-09  6:52                                                 ` Simon Kirby
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-09  6:03 UTC (permalink / raw)
  To: xerox; +Cc: sim, fw, netdev, linux-net

   From: "CIT/Paul" <xerox@foonet.net>
   Date: Mon, 9 Jun 2003 01:51:45 -0400

   I'd love to test this out.. If it could do full gigabit line rate with
   random ips that would be soooooooo nice :>

It isn't impossible with the current design, that I am
quite sure of.

Here is a simple idea: make the routing cache miss case steal
an entry sitting at the end of the hash chain this new one will
map to.  It only steals entries which have not been recently used.

The big problem area on SMP is fib_validate_source.  I'm sure some
clear thinking can wipe that off the profiles too.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09  3:15                                           ` Jamal Hadi
  2003-06-09  5:27                                             ` CIT/Paul
@ 2003-06-09  6:25                                             ` David S. Miller
  2003-06-09  6:59                                               ` Simon Kirby
  2003-06-09 13:04                                             ` Ralph Doncaster
  2 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-09  6:25 UTC (permalink / raw)
  To: hadi; +Cc: xerox, sim, fw, netdev, linux-net

   From: Jamal Hadi <hadi@shell.cyberus.ca>
   Date: Sun, 8 Jun 2003 23:15:46 -0400 (EDT)
   
   The more i think about it the more i think CEF is a lame escape
   from route caches.

It is one perspective :-)

   What we need is multi-tries at the slow path and
   perhaps a binary tree on hash collisions buckets of the dst cache
   (instead of a linked list).

I do not believe that slow path is slow.  In fact after I fixed
hash table growth in fib_hash.c Simon showed us clearly how DoS
performance was _NOT_ tied to the number of routes loaded into
the kernel.

What is slow are things like fib_validate_source() on SMP and the GC
(and some other things, I need to study Simon's profiles more deeply).
The GC is apparently really badly behaved now during DoS-like traffic.

My main current quick idea is to make rt_intern_hash() attempt
to flush out entries in the same hash chain instead of allocating
new entries.

I also question the setting of ip_rt_max_size in relation to the
number of hash chains (it's set to n_hashchains * 16 currently,
that sounds wrong, maybe something more like n_hashchains * 2 or
even n_hashchains * 3).

I'll try to cook up a patch to test.  We might even be able to
kill off route cache GC entirely if this scheme works well.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09  6:28                                                 ` CIT/Paul
@ 2003-06-09  6:28                                                   ` David S. Miller
  2003-06-09 16:23                                                     ` Stephen Hemminger
  2003-06-09  7:13                                                   ` Simon Kirby
  1 sibling, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-09  6:28 UTC (permalink / raw)
  To: xerox; +Cc: hadi, sim, fw, netdev, linux-net, Robert.Olsson

   From: "CIT/Paul" <xerox@foonet.net>
   Date: Mon, 9 Jun 2003 02:28:30 -0400

   OK so let's try this.. If you can show me a Linux router that can route
   100mbps or more of a juno-z.101f.c attack without dropping packets I will
   be thoroughly impressed  :)
   
   I am willing to test out any code/patches and settings that you can
   think of and post some results..
   
Ok, Robert are you willing to help too? :-)

^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-09  5:58                                               ` David S. Miller
@ 2003-06-09  6:28                                                 ` CIT/Paul
  2003-06-09  6:28                                                   ` David S. Miller
  2003-06-09  7:13                                                   ` Simon Kirby
  0 siblings, 2 replies; 227+ messages in thread
From: CIT/Paul @ 2003-06-09  6:28 UTC (permalink / raw)
  To: 'David S. Miller'; +Cc: hadi, sim, fw, netdev, linux-net

OK, so let's try this: if you can show me a Linux router that can route
100mbps or more of a juno-z.101f.c attack without dropping packets, I
will be thoroughly impressed  :)

I am willing to test out any code/patches and settings that you can
think of and post some results..



Paul xerox@foonet.net http://www.httpd.net


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-08 23:55                                         ` CIT/Paul
  2003-06-09  3:15                                           ` Jamal Hadi
  2003-06-09  5:44                                           ` David S. Miller
@ 2003-06-09  6:47                                           ` Simon Kirby
  2003-06-09  6:49                                             ` David S. Miller
  2003-06-09 13:28                                             ` Ralph Doncaster
  2 siblings, 2 replies; 227+ messages in thread
From: Simon Kirby @ 2003-06-09  6:47 UTC (permalink / raw)
  To: CIT/Paul; +Cc: 'Florian Weimer', netdev, linux-net

On Sun, Jun 08, 2003 at 07:55:58PM -0400, CIT/Paul wrote:

> A denial of service attack going on, OR you just have millions of hosts
> going through the router (if you were an ISP).  Anything with seeminly
> Random source ips (something like juno-z.101f.c will generate worst case
> scenario for forwarding packets) will cause the cache to constantly
> Add new entries at pretty much the rate of the attack.. This can stifle
> just about any linux router with a measly 10 megabits/second of traffic
> unless
> The router is tuned up to a large degree (NAPI, certain nics, route
> cache timings, etc.) and even then it can still be destroyed no matter
> what
> The cpu is with less than 100,000 packets per second and in mosts cases
> less than 30k.. That's why it's just no acceptable for companies using
> it as a replacement for say a cisco 7200 VXR series (npe300,400 nsf-1,
> etc.) which can do 300K+ packet per second of routing (and yes it can
> even route juno-z.101f.c at 300kpps, I have tested it).   Linux has no
> problem doing 300kpps from a single source to a single destination
> provided you have NAPI or ITR or something limiting the interrupts.. The
> overhead is the route cache and the related systems that use it and also
> netfilter is very slow :/  One of these days they will fix it..... If

Whoa, wait a second.

You got a 7200 VXR to do 300kpps?  I would have liked to see that.
We couldn't get our 7206 VXR routers to do anything more than about 12
Mbit/second of small packets, which I believe is about 40,000 packets
per second.  This is with CEF disabled, because it ended up duplicating
packets and doing some other strange things with CEF enabled.

Also, I remember trying with a bucketload of netfilter rules and finding
that the performance difference was hardly noticeable.

Linux can route small packets with random src/dst at much faster than 10
Mbit/sec.  It depends on the hardware, as you say, but it shouldn't ever
be that slow on reasonable hardware.

I remember that back in 1998, even the 2.0 kernel (before the route
cache existed) on a Celeron 300A with eepro100 cards (eepro100 driver,
no interrupt coalescing, definitely no NAPI) was capable of routing at
least 20 Mbit/second of SYN packets from random sources.  In fact, I remember
it happily choking some old 3Com switches we had at the time.

I recently saw 90 Mbit/second of additional traffic (small packets with
random sources) going through our routers (now single Athlon 1800MP (MP
for APIC), tg3, NAPI, BGP routing tables), and they didn't seem to care. 
It's definitely not yet perfect, but it's not bad.  The hashing fixes for
large routing tables which Dave M. recently posted have made the situation
much better -- it was very broken before.  What did your routing table
look like when you were doing tests?

I have fiddled with the route cache garbage collection parameters a bit,
but I haven't really been able to reduce the CPU usage by much at all. 
Really, though, shouldn't the route cache overhead be fairly small in
comparison to everything else involved in forwarding?

Simon-

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09  6:47                                           ` Simon Kirby
@ 2003-06-09  6:49                                             ` David S. Miller
  2003-06-09 13:28                                             ` Ralph Doncaster
  1 sibling, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-09  6:49 UTC (permalink / raw)
  To: sim; +Cc: xerox, fw, netdev, linux-net

   From: Simon Kirby <sim@netnation.com>
   Date: Sun, 8 Jun 2003 23:47:19 -0700
   
   Really, though, shouldn't the route cache overhead be fairly small in
   comparison to everything else involved in forwarding?

If GC is just doing dumb things, it is possible.

These costs can also hide outside the rtcache itself, in the form of
cache misses and displacement caused by rtcache objects, showing up as
higher costs elsewhere.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09  6:03                                               ` David S. Miller
@ 2003-06-09  6:52                                                 ` Simon Kirby
  2003-06-09  6:56                                                   ` David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: Simon Kirby @ 2003-06-09  6:52 UTC (permalink / raw)
  To: David S. Miller; +Cc: xerox, fw, netdev, linux-net

On Sun, Jun 08, 2003 at 11:03:32PM -0700, David S. Miller wrote:

>    I'd love to test this out.. If it could do full gigabit line rate with
>    random ips that would be soooooooo nice :>

Agreed. :)

> It isn't impossible with the current design, that I am
> quire sure of.
> 
> Here is a simple idea, make the routing cache miss case steal
> an entry sitting at the end of the hash chain this new one will
> map to.  It only steals entries which have not been recently used.

I just asked whether this was possible in a previous email, but you must
have missed it.  I am seeing a lot of memory management stuff in
profiles, so I think recycling routing cache entries (if only when the
table is full and the garbage collector would otherwise need to run)
would be very helpful.

Is it possible to get a good guess of what cache entry to recycle without
walking for a while or without some kind of LRU?

> The big problem area on SMP is fib_validate_source.  I'm sure some
> clear thinking can wipe that off the profiles too.

Not running the important stuff with SMP yet, so I don't care about this
at the moment. O:)

Simon-

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09  6:52                                                 ` Simon Kirby
@ 2003-06-09  6:56                                                   ` David S. Miller
  2003-06-09  7:36                                                     ` Simon Kirby
  2003-06-09  8:18                                                     ` Simon Kirby
  0 siblings, 2 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-09  6:56 UTC (permalink / raw)
  To: sim; +Cc: xerox, fw, netdev, linux-net

   From: Simon Kirby <sim@netnation.com>
   Date: Sun, 8 Jun 2003 23:52:11 -0700

   On Sun, Jun 08, 2003 at 11:03:32PM -0700, David S. Miller wrote:
   
   > Here is a simple idea, make the routing cache miss case steal
   > an entry sitting at the end of the hash chain this new one will
   > map to.  It only steals entries which have not been recently used.
   
   I just asked whether this was possible in a previous email, but you must
   have missed it.  I am seeing a lot of memory management stuff in
   profiles, so I think recycling routing cache entries (if only when the
   table is full and the garbage collector would otherwise need to run)
   would be very helpful.

Yes, indeed.
   
   Is it possible to get a good guess of what cache entry to recycle without
   walking for a while or without some kind of LRU?

This is what my (and therefore your) suggested scheme is trying to
do.

We have to walk the entire destination hash chain _ANYWAYS_ to verify
that a matching entry has not been put into the cache while we were
procuring the new one.  During this walk we can also choose a
candidate rtcache entry to free.

Something like the patch at the end of this email, doesn't compile
it's just a work in progress.  The trick is picking TIMEOUT1 and
TIMEOUT2 :)

Another point is that the default ip_rt_gc_min_interval is
absolutely horrible for DoS like attacks.  When DoS traffic
can fill the rtcache multiple times per second, using a GC
interval of 5 seconds is the worst possible choice. :)

When I see things like this, I can only come to the conclusion
that the tuning Alexey originally did when coding up the rtcache
merely needs to be scaled up to modern day packet rates.

--- net/ipv4/route.c.~1~	Sun Jun  8 23:28:00 2003
+++ net/ipv4/route.c	Sun Jun  8 23:45:47 2003
@@ -717,14 +717,15 @@
 
 static int rt_intern_hash(unsigned hash, struct rtable *rt, struct rtable **rp)
 {
-	struct rtable	*rth, **rthp;
-	unsigned long	now = jiffies;
+	struct rtable	*rth, **rthp, *cand, **candp;
+	unsigned long	now = jiffies, cand_use = now;
 	int attempts = !in_softirq();
 
 restart:
 	rthp = &rt_hash_table[hash].chain;
 
 	spin_lock_bh(&rt_hash_table[hash].lock);
+	cand = NULL;
 	while ((rth = *rthp) != NULL) {
 		if (compare_keys(&rth->fl, &rt->fl)) {
 			/* Put it first */
@@ -753,7 +754,21 @@
 			return 0;
 		}
 
+		if (rt_may_expire(rth, TIMEOUT1, TIMEOUT2)) {
+			unsigned long this_use = rth->u.dst.lastuse;
+
+			if (time_before_eq(this_use, cand_use)) {
+				cand = rth;
+				candp = rthp;
+				cand_use = this_use;
+			}
+		}
 		rthp = &rth->u.rt_next;
+	}
+
+	if (cand) {
+		*candp = cand->u.rt_next;
+		rt_free(cand);
 	}
 
 	/* Try to bind route to arp only if it is output

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09  6:25                                             ` David S. Miller
@ 2003-06-09  6:59                                               ` Simon Kirby
  2003-06-09  7:03                                                 ` David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: Simon Kirby @ 2003-06-09  6:59 UTC (permalink / raw)
  To: David S. Miller; +Cc: hadi, xerox, fw, netdev, linux-net

On Sun, Jun 08, 2003 at 11:25:37PM -0700, David S. Miller wrote:

> I do not believe that slow path is slow.  In fact after I fixed
> hash table growth in fib_hash.c Simon showed us clearly how DoS
> performance was _NOT_ tied to the number of routes loaded into
> the kernel.

Not anymore. :)  Btw, that patch seems to be stable here.  Will we be
seeing it sneak into 2.4?

> My main current quick idea is to make rt_intern_hash() attempt
> to flush out entries in the same hash chain instead of allocating
> new entries.
> 
> I also question the setting of ip_rt_max_size in relation to the
> number of hash chains (it's set to n_hashchains * 16 currently,
> that sounds wrong, maybe something more like n_hashchains * 2 or
> even n_hashchains * 3).

The route cache on our routers here grows to several thousand entries
most of the time because of the quantity of traffic we route, and then
all gets happily blown away when the next BGP table change comes along,
which seems to happen about 10-20 times per minute (!).  It would
probably be beneficial for us to reduce the amount of work required when
blowing it away and keep it as small as possible.

> I'll try to cook up a patch to test.  We might even be able to

Woohoo!

> kill of route cache GC entriely if this scheme works well.

I asked Alexey about this before and he mentioned it was there because it
made a big difference in processing latency to postpone cleanup to a GC
run.  It should be possible to do recycling only when the table is full
(when the box is getting smashed).  This way latencies would be lowest in
the common case and it would recycle and not have spurts of GC latency in
the DoS case.

Simon-

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09  6:59                                               ` Simon Kirby
@ 2003-06-09  7:03                                                 ` David S. Miller
  0 siblings, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-09  7:03 UTC (permalink / raw)
  To: sim; +Cc: hadi, xerox, fw, netdev, linux-net

   From: Simon Kirby <sim@netnation.com>
   Date: Sun, 8 Jun 2003 23:59:55 -0700

   On Sun, Jun 08, 2003 at 11:25:37PM -0700, David S. Miller wrote:
   
   > I do not believe that slow path is slow.  In fact after I fixed
   > hash table growth in fib_hash.c Simon showed us clearly how DoS
   > performance was _NOT_ tied to the number of routes loaded into
   > the kernel.
   
   Not anymore. :)  Btw, that patch seems to be stable here.  Will we be
   seeing it sneak into 2.4?
   
Yes, 2.4.22-pre1 will get it or somewhere thereabouts.

   > I also question the setting of ip_rt_max_size in relation to the
   > number of hash chains (it's set to n_hashchains * 16 currently,
   > that sounds wrong, maybe something more like n_hashchains * 2 or
   > even n_hashchains * 3).
   
   The route cache on our routers here grows to several thousand entries
   most of the time because of the quantity of traffic we route, and then
   all gets happily blown away when the next BGP table change comes along,
   which seems to happen about 10-20 times per miunte (!).  It would
   probably be beneficial for us to reduce the amount of work required when
   blowing it away and keep it as small as possible.

This is simple, by using a generation count.  When route lookup
sees a matching entry with a stale generation count, we pass
this entry as-is into ip_route_{input,output}_slow() and use it
instead of allocating new entry.

It is the same trick as used by the flow cache.
   
I'll code this up as well.
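
Sketch of what I mean (illustration only, not actual kernel code):

/* A routing table change bumps a global generation number instead of
 * flushing the whole cache; lookups treat entries built under an old
 * generation as recyclable and refill them via the slow path. */
static unsigned int rt_generation;	/* would be an atomic_t in the kernel */

static inline void rt_cache_invalidate(void)	/* call on any FIB change */
{
	rt_generation++;
}

struct rt_entry_sketch {
	unsigned int	genid;		/* generation the entry was built under */
	/* ... the usual dst/route fields ... */
};

static inline int rt_entry_is_stale(const struct rt_entry_sketch *rt)
{
	return rt->genid != rt_generation;
}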

   > kill of route cache GC entriely if this scheme works well.
   
   I asked Alexey about this before and he mentioned it was there because it
   made a big difference in processing latency to postpone cleanup to a GC
   run.

The problem is that GC cannot currently keep up with a DoS-like traffic
pattern.  As a result, routing latency is not smooth at all: you get
spikes because each GC run goes on for up to an entire jiffy, since it
has so much work to do.  Meanwhile, during this expensive GC
processing, packet processing is frozen on a UP system.

net/core/flow.c:flow_cache_lookup() is instructive, it implements
several of these ideas being discussed today.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09  6:28                                                 ` CIT/Paul
  2003-06-09  6:28                                                   ` David S. Miller
@ 2003-06-09  7:13                                                   ` Simon Kirby
  2003-06-09  8:10                                                     ` CIT/Paul
  2003-06-09  8:56                                                     ` David S. Miller
  1 sibling, 2 replies; 227+ messages in thread
From: Simon Kirby @ 2003-06-09  7:13 UTC (permalink / raw)
  To: CIT/Paul; +Cc: 'David S. Miller', hadi, fw, netdev, linux-net

On Mon, Jun 09, 2003 at 02:28:30AM -0400, CIT/Paul wrote:

> OK so let's try this.. If you can show me a linux router can can route
> 100mbps or more of juno-z.101f.c attack without dropping packets I will
> be thoroughly impressed  :)
> 
> I am willing to test out any code/patches and settings that you can
> think of and post some results..

I'll see if I can set up a test bed this week.  I think we should already
be able to do close to this, but I'll let the numbers do the
talking. :)

In the tests I've been doing so far, I've been dropping responses (in the
INPUT chain), so I haven't been testing the forwarding through of packets
(though it is testing the routing input).  I'll see if I can set up a
router, target, and DoS box.

I haven't been able to get juno-z.101f.c to saturate 100 Mbit/sec
outgoing, but I've only tried it on eepro100 boxes.  Has anybody got it
to send more?  Mmm, need more tg3 cards...

Simon-

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09  6:56                                                   ` David S. Miller
@ 2003-06-09  7:36                                                     ` Simon Kirby
  2003-06-09  8:18                                                     ` Simon Kirby
  1 sibling, 0 replies; 227+ messages in thread
From: Simon Kirby @ 2003-06-09  7:36 UTC (permalink / raw)
  To: David S. Miller; +Cc: xerox, fw, netdev, linux-net

On Sun, Jun 08, 2003 at 11:56:22PM -0700, David S. Miller wrote:

> We have to walk the entire destination hash chain _ANYWAYS_ to verify
> that a matching entry has not been put into the cache while we were
> procuring the new one.  During this walk we can also choose a
> candidate rtcache entry to free.

Ah, neat.  I should try reading this stuff. :)

> Something like the patch at the end of this email, doesn't compile
> it's just a work in progress.  The trick is picking TIMEOUT1 and
> TIMEOUT2 :)
> 
> Another point is that the default ip_rt_gc_min_interval is
> absolutely horrible for DoS like attacks.  When DoS traffic
> can fill the rtcache multiple times per second, using a GC
> interval of 5 seconds is the worst possible choice. :)

Yes, I've reduced the gc_min_interval to 1, and it has been that way for
some time.  BTW, you may be interested in this old email from Alexey:

http://www.tux.org/hypermail/linux-kernel/1999week05/1113.html

(This was back when the GC was limited so much that legitimate traffic
was overflowing the table.  DoS attacks must have been really effective
then. :))

Simon-

[        Simon Kirby        ][        Network Operations        ]
[     sim@netnation.com     ][   NetNation Communications Inc.  ]
[  Opinions expressed are not necessarily those of my employer. ]

^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-09  7:13                                                   ` Simon Kirby
@ 2003-06-09  8:10                                                     ` CIT/Paul
  2003-06-09  8:27                                                       ` Simon Kirby
  2003-06-09 11:38                                                       ` Jamal Hadi
  2003-06-09  8:56                                                     ` David S. Miller
  1 sibling, 2 replies; 227+ messages in thread
From: CIT/Paul @ 2003-06-09  8:10 UTC (permalink / raw)
  To: 'Simon Kirby'
  Cc: 'David S. Miller', hadi, fw, netdev, linux-net

I've got juno-z.101f.c to send 500,000 pps at 300+mbit on our dual p3
1.26 ghz routers.. I can't even send 50mbit of this through one of my
routers without it using 100% of both cpus because of the route cache..
It goes up to 500,000 entries if I let it and it adds 80,000 new entries
per second and they are all cache misses.. I'd be glad to show you the
setup sometime :) I showed it to jamal and we tested some stuff.

Paul xerox@foonet.net http://www.httpd.net


-----Original Message-----
From: Simon Kirby [mailto:sim@netnation.com] 
Sent: Monday, June 09, 2003 3:14 AM
To: CIT/Paul
Cc: 'David S. Miller'; hadi@shell.cyberus.ca; fw@deneb.enyo.de;
netdev@oss.sgi.com; linux-net@vger.kernel.org
Subject: Re: Route cache performance under stress


On Mon, Jun 09, 2003 at 02:28:30AM -0400, CIT/Paul wrote:

> OK so let's try this.. If you can show me a linux router that can route

> 100mbps or more of juno-z.101f.c attack without dropping packets I 
> will be thoroughly impressed  :)
> 
> I am willing to test out any code/patches and settings that you can 
> think of and post some results..

I'll see if I can set up a test bed this week.  I think we should
already be able to do close to this, but I'll let the numbers do
the talking. :)

In the tests I've been doing so far, I've been dropping responses (in
the INPUT chain), so I haven't been testing the forwarding through of
packets (though it is testing the routing input).  I'll see if I can set
up a router, target, and DoS box.

I haven't been able to get juno-z.101f.c to saturate 100 Mbit/sec
outgoing, but I've only tried it on eepro100 boxes.  Has anybody got it
to send more?  Mmm, need more tg3 cards...

Simon-


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09  6:56                                                   ` David S. Miller
  2003-06-09  7:36                                                     ` Simon Kirby
@ 2003-06-09  8:18                                                     ` Simon Kirby
  2003-06-09  8:22                                                       ` David S. Miller
                                                                         ` (2 more replies)
  1 sibling, 3 replies; 227+ messages in thread
From: Simon Kirby @ 2003-06-09  8:18 UTC (permalink / raw)
  To: David S. Miller; +Cc: xerox, fw, netdev, linux-net

On Sun, Jun 08, 2003 at 11:56:22PM -0700, David S. Miller wrote:

> +	if (cand) {
> +		*candp = cand->u.rt_next;
> +		rt_free(cand);
>  	}

Hmm...It looks like this is still freeing the entry.. Is it possible to
recycle the dst without reallocating it?

This is the end of the time-sorted profile output of the test box
saturated by incoming juno packets (firewalled in INPUT chain to avoid
responses to spoofed src IPs), NAPI 100% of the time, tg3:

   158 tg3_poll                                   0.5197
  1630 ip_rcv_finish                              2.8348
   142 ipv4_dst_destroy                           2.9583
   429 fib_rules_policy                           3.8304
  8959 ip_route_input_slow                        3.8885
  2438 ip_rcv                                     4.3536
  2504 alloc_skb                                  5.2167
  1991 __kfree_skb                                5.4103
  2279 netif_receive_skb                          5.6975
   929 skb_release_data                           6.4514
   669 ip_local_deliver                           6.9688
  1175 __constant_c_and_count_memset              7.3438
  2367 tcp_match                                  7.3969
   124 kmem_cache_alloc                           7.7500
  4535 fib_validate_source                        8.0982
   598 __fib_res_prefsrc                          9.3438
  8896 rt_garbage_collect                         9.4237
  3582 inet_select_addr                           9.7337
  1747 kfree                                      9.9261
   717 ipt_hook                                  11.2031
   938 kmalloc                                   11.7250
  1747 jhash_3words                              12.1319
  6879 nf_hook_slow                              12.6452
  2439 eth_type_trans                            12.7031
  1695 kfree_skbmem                              13.2422
  2358 nf_iterate                                13.3977
   872 rt_hash_code                              13.6250
  2933 fib_semantic_match                        14.1010
 16553 ipt_do_table                              14.9937
 15339 tg3_rx                                    16.2489
  2482 tg3_recycle_rx                            17.2361
  5967 __kmem_cache_alloc                        18.6469
  1237 ipt_route_hook                            19.3281
  3120 do_gettimeofday                           21.6667
  8299 ip_packet_match                           24.6994
  8031 fib_lookup                                25.0969
  1877 fib_rule_put                              29.3281
  6088 dst_destroy                               34.5909
 26833 rt_intern_hash                            34.9388
 10666 kmem_cache_free                           66.6625
 20193 fn_hash_lookup                            70.1146
 10516 dst_alloc                                 73.0278
 64803 ip_route_input                           150.0069

This is with a routing table of 300,000 entries (though only one prefix)
and with your hash fix patch.  ip_route_input is still highest, but
dst_alloc is an obvious second.  ip_route_input is actually always the
highest (excluding the IRQ handling stuff), and doesn't seem to change at
all based on routing table size.

	http://blue.netnation.com/sim/ref/

Simon-

[        Simon Kirby        ][        Network Operations        ]
[     sim@netnation.com     ][   NetNation Communications Inc.  ]
[  Opinions expressed are not necessarily those of my employer. ]

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09  8:18                                                     ` Simon Kirby
@ 2003-06-09  8:22                                                       ` David S. Miller
  2003-06-09  8:31                                                         ` Simon Kirby
  2003-06-09  9:01                                                       ` David S. Miller
  2003-06-09 14:14                                                       ` David S. Miller
  2 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-09  8:22 UTC (permalink / raw)
  To: sim; +Cc: xerox, fw, netdev, linux-net

   From: Simon Kirby <sim@netnation.com>
   Date: Mon, 9 Jun 2003 01:18:03 -0700

   On Sun, Jun 08, 2003 at 11:56:22PM -0700, David S. Miller wrote:
   
   > +	if (cand) {
   > +		*candp = cand->u.rt_next;
   > +		rt_free(cand);
   >  	}
   
   Hmm...It looks like this is still freeing the entry.. Is it possible to
   recycle the dst without reallocating it?
   
Yes, can you test the patch I just sent you?  We can modify that
to recycle easily instead of freeing.

Well... one problem is that in 2.5.x we have to kill off entries using
RCU so such recycling may not be so easy there.

   This is with a routing table of 300,000 entries (though only one prefix)
   and with your hash fix patch.  ip_route_input is still highest, but
   dst_alloc is an obvious second.  ip_route_input is actually always the
   highest (excluding the IRQ handling stuff), and doesn't seem to change at
   all based on routing table size.

We spend a decent amount of time mucking with fib rules, turning
off multiple-tables support would kill that, although I suspect
you're actually using that :)

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09  8:10                                                     ` CIT/Paul
@ 2003-06-09  8:27                                                       ` Simon Kirby
  2003-06-09 19:38                                                         ` CIT/Paul
  2003-06-09 11:38                                                       ` Jamal Hadi
  1 sibling, 1 reply; 227+ messages in thread
From: Simon Kirby @ 2003-06-09  8:27 UTC (permalink / raw)
  To: CIT/Paul; +Cc: 'David S. Miller', hadi, fw, netdev, linux-net

On Mon, Jun 09, 2003 at 04:10:55AM -0400, CIT/Paul wrote:

> I've got juno-z.101f.c to send 500,000 pps at 300+mbit on our dual p3
> 1.26 ghz routers.. I can't even send 50mbit of this through one of my
> routers without it using 100% of both cpus because of the route cache..
> It goes up to 500,000 entries if I let it and it adds 80,000 new entries
> per second and they are all cache misses.. I'd be glad to show you the
> setup sometime :) I showed it to jamal and we tested some stuff.

Hmm.. We're running on single 1800MP Athlons here.  Have you had a chance
to profile it? 

- add "profile=1" to the kernel command line
- reboot
- run juno-z.101f.c from remote box
- run "readprofile -r" on the router
- twiddle fingers for a while
- run "readprofile -n -m your_System.map > foo"
- stop juno :)
- run "sort -n +2 < foo > readprofile.time_sorted"

I'm interested to see if your profile results line up to what I'm seeing
here on UP (though I have the kernel compiled SMP...Oops).

Wait a second... 500,000 entries in the route cache?  WTF?  What is your
max_size set to?  That will massively overfill the hash bucket and
definitely take up way too much CPU.  It shouldn't be able to get there
at all unless you have raised max_size.  Here I have:

echo 4 > gc_elasticity          # Higher is weaker, 0 will nuke all [dfl: 8]
echo 1 > gc_interval            # Garbage collection interval (seconds) [dfl: 60]
echo 1 > gc_min_interval        # Garbage collection min interval (seconds) [dfl: 5]
echo 90 > gc_timeout            # Entry lifetime (seconds) [dfl: 300]

[sroot@r1:/proc/sys/net/ipv4/route]# grep . *
...
gc_elasticity:4
gc_interval:1
gc_min_interval:1
gc_thresh:4096
gc_timeout:90
max_delay:10
max_size:65536

Simon-

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09  8:22                                                       ` David S. Miller
@ 2003-06-09  8:31                                                         ` Simon Kirby
  0 siblings, 0 replies; 227+ messages in thread
From: Simon Kirby @ 2003-06-09  8:31 UTC (permalink / raw)
  To: David S. Miller; +Cc: xerox, fw, netdev, linux-net

On Mon, Jun 09, 2003 at 01:22:02AM -0700, David S. Miller wrote:

>    Hmm...It looks like this is still freeing the entry.. Is it possible to
>    recycle the dst without reallocating it?
>    
> Yes, can you test the patch I just sent you?  We can modify that
> to recycle easily instead of freeing.

Cool.  I'll see if I can set something up to try that at work tomorrow. 
Insufficient hardware here at home.

> We spend a decent amount of time mucking with fib rules, turning
> off multiple-tables support would kill that, although I suspect
> you're actually using that :)

We use it occasionally for various things.  I'll try profiling with it
turned off to see how much of an impact it has.

Simon-

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09  7:13                                                   ` Simon Kirby
  2003-06-09  8:10                                                     ` CIT/Paul
@ 2003-06-09  8:56                                                     ` David S. Miller
  2003-06-09 22:39                                                       ` Robert Olsson
  1 sibling, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-09  8:56 UTC (permalink / raw)
  To: sim; +Cc: xerox, hadi, fw, netdev, linux-net, Robert.Olsson, kuznet

   From: Simon Kirby <sim@netnation.com>
   Date: Mon, 9 Jun 2003 00:13:30 -0700

   On Mon, Jun 09, 2003 at 02:28:30AM -0400, CIT/Paul wrote:
   
   > I am willing to test out any code/patches and settings that you can
   > think of and post some results..
   
   I'll see if I can set up a test bed this week.  I think we should already
   be able to do close to this, but I'll let the numbers do the
   talking. :)

BTW, ignoring juno, Robert Olsson has some pktgen hacks that allow
it to generate new-dst-per-packet DoS-like traffic.  It's much
more effective than Juno-z.

Robert, could you show these guys your hacks to do that?

Next, here is an interesting first pass patch to try.  Once we hit
gc_thresh, at every new DST allocation we try to shrink the destination
hash chain.  It ought to be very effective in the presence of poorly
behaved traffic such as random-src-address DoS.

The patch is against 2.5.x current...

The next task is to try and handle rt_cache_flush more cheaply, given
Simon's mention that he gets from 10 to 20 BGP updates per minute.
Another idea for this dilemma is maybe to see if Zebra can batch things
a little bit... but that kind of solution might not be possible since I
don't know how that stuff works.

--- net/ipv4/route.c.~1~	Sun Jun  8 23:28:00 2003
+++ net/ipv4/route.c	Mon Jun  9 01:09:45 2003
@@ -882,6 +882,42 @@ static void rt_del(unsigned hash, struct
 	spin_unlock_bh(&rt_hash_table[hash].lock);
 }
 
+static void __rt_hash_shrink(unsigned int hash)
+{
+	struct rtable *rth, **rthp;
+	struct rtable *cand, **candp;
+	unsigned int min_use = ~(unsigned int) 0;
+
+	spin_lock_bh(&rt_hash_table[hash].lock);
+	cand = NULL;
+	candp = NULL;
+	rthp = &rt_hash_table[hash].chain;
+	while ((rth = *rthp) != NULL) {
+		if (!atomic_read(&rth->u.dst.__refcnt) &&
+		    ((unsigned int) rth->u.dst.__use) < min_use) {
+			cand = rth;
+			candp = rthp;
+			min_use = rth->u.dst.__use;
+		}
+		rthp = &rth->u.rt_next;
+	}
+	if (cand) {
+		*candp = cand->u.rt_next;
+		rt_free(cand);
+	}
+
+	spin_unlock_bh(&rt_hash_table[hash].lock);
+}
+
+static inline struct rtable *ip_rt_dst_alloc(unsigned int hash)
+{
+	if (atomic_read(&ipv4_dst_ops.entries) >
+	    ipv4_dst_ops.gc_thresh)
+		__rt_hash_shrink(hash);
+
+	return dst_alloc(&ipv4_dst_ops);
+}
+
 void ip_rt_redirect(u32 old_gw, u32 daddr, u32 new_gw,
 		    u32 saddr, u8 tos, struct net_device *dev)
 {
@@ -912,9 +948,10 @@ void ip_rt_redirect(u32 old_gw, u32 dadd
 
 	for (i = 0; i < 2; i++) {
 		for (k = 0; k < 2; k++) {
-			unsigned hash = rt_hash_code(daddr,
-						     skeys[i] ^ (ikeys[k] << 5),
-						     tos);
+			unsigned int hash = rt_hash_code(daddr,
+							 skeys[i] ^
+							 (ikeys[k] << 5),
+							 tos);
 
 			rthp=&rt_hash_table[hash].chain;
 
@@ -942,7 +979,7 @@ void ip_rt_redirect(u32 old_gw, u32 dadd
 				dst_hold(&rth->u.dst);
 				rcu_read_unlock();
 
-				rt = dst_alloc(&ipv4_dst_ops);
+				rt = ip_rt_dst_alloc(hash);
 				if (rt == NULL) {
 					ip_rt_put(rth);
 					in_dev_put(in_dev);
@@ -1352,7 +1389,7 @@ static void rt_set_nexthop(struct rtable
 static int ip_route_input_mc(struct sk_buff *skb, u32 daddr, u32 saddr,
 				u8 tos, struct net_device *dev, int our)
 {
-	unsigned hash;
+	unsigned int hash;
 	struct rtable *rth;
 	u32 spec_dst;
 	struct in_device *in_dev = in_dev_get(dev);
@@ -1375,7 +1412,9 @@ static int ip_route_input_mc(struct sk_b
 					dev, &spec_dst, &itag) < 0)
 		goto e_inval;
 
-	rth = dst_alloc(&ipv4_dst_ops);
+	hash = rt_hash_code(daddr, saddr ^ (dev->ifindex << 5), tos);
+
+	rth = ip_rt_dst_alloc(hash);
 	if (!rth)
 		goto e_nobufs;
 
@@ -1421,7 +1460,6 @@ static int ip_route_input_mc(struct sk_b
 	RT_CACHE_STAT_INC(in_slow_mc);
 
 	in_dev_put(in_dev);
-	hash = rt_hash_code(daddr, saddr ^ (dev->ifindex << 5), tos);
 	return rt_intern_hash(hash, rth, (struct rtable**) &skb->dst);
 
 e_nobufs:
@@ -1584,7 +1622,7 @@ int ip_route_input_slow(struct sk_buff *
 			goto e_inval;
 	}
 
-	rth = dst_alloc(&ipv4_dst_ops);
+	rth = ip_rt_dst_alloc(hash);
 	if (!rth)
 		goto e_nobufs;
 
@@ -1663,7 +1701,7 @@ brd_input:
 	RT_CACHE_STAT_INC(in_brd);
 
 local_input:
-	rth = dst_alloc(&ipv4_dst_ops);
+	rth = ip_rt_dst_alloc(hash);
 	if (!rth)
 		goto e_nobufs;
 
@@ -2048,7 +2086,10 @@ make_route:
 		}
 	}
 
-	rth = dst_alloc(&ipv4_dst_ops);
+	hash = rt_hash_code(oldflp->fl4_dst,
+			    oldflp->fl4_src ^ (oldflp->oif << 5), tos);
+
+	rth = ip_rt_dst_alloc(hash);
 	if (!rth)
 		goto e_nobufs;
 
@@ -2107,7 +2148,6 @@ make_route:
 
 	rth->rt_flags = flags;
 
-	hash = rt_hash_code(oldflp->fl4_dst, oldflp->fl4_src ^ (oldflp->oif << 5), tos);
 	err = rt_intern_hash(hash, rth, rp);
 done:
 	if (free_res)


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09  8:18                                                     ` Simon Kirby
  2003-06-09  8:22                                                       ` David S. Miller
@ 2003-06-09  9:01                                                       ` David S. Miller
  2003-06-09  9:47                                                         ` Andi Kleen
  2003-06-09 14:14                                                       ` David S. Miller
  2 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-09  9:01 UTC (permalink / raw)
  To: sim; +Cc: xerox, fw, netdev, linux-net, kuznet, Robert.Olsson

   From: Simon Kirby <sim@netnation.com>
   Date: Mon, 9 Jun 2003 01:18:03 -0700

    10516 dst_alloc                                 73.0278

Gross, we effectively initialize a new dst multiple times :(
In fact, we modify the same cache lines at least 3 times.

There's a lot more we can do in this area.  But this patch below kills
some of it.  Again, patch is against 2.5.x-current.

Actually, it is a relatively good sign; it means this is a relatively
unexplored area of the networking code :-)))

--- net/core/dst.c.~1~	Mon Jun  9 01:47:26 2003
+++ net/core/dst.c	Mon Jun  9 01:53:41 2003
@@ -122,13 +122,31 @@ void * dst_alloc(struct dst_ops * ops)
 	dst = kmem_cache_alloc(ops->kmem_cachep, SLAB_ATOMIC);
 	if (!dst)
 		return NULL;
-	memset(dst, 0, ops->entry_size);
+	dst->next = NULL;
 	atomic_set(&dst->__refcnt, 0);
-	dst->ops = ops;
+	dst->__use = 0;
+	dst->child = NULL;
+	dst->dev = NULL;
+	dst->obsolete = 0;
+	dst->flags = 0;
 	dst->lastuse = jiffies;
+	dst->expires = 0;
+	dst->header_len = 0;
+	dst->trailer_len = 0;
+	memset(dst->metrics, 0, sizeof(dst->metrics));
 	dst->path = dst;
+	dst->rate_last = 0;
+	dst->rate_tokens = 0;
+	dst->error = 0;
+	dst->neighbour = NULL;
+	dst->hh = NULL;
+	dst->xfrm = NULL;
 	dst->input = dst_discard;
 	dst->output = dst_blackhole;
+	dst->ops = ops;
+	INIT_RCU_HEAD(&dst->rcu_head);
+	memset(dst->info, 0,
+	       ops->entry_size - offsetof(struct dst_entry, info));
 #if RT_CACHE_DEBUG >= 2 
 	atomic_inc(&dst_total);
 #endif

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09  9:01                                                       ` David S. Miller
@ 2003-06-09  9:47                                                         ` Andi Kleen
  2003-06-09 10:03                                                           ` David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: Andi Kleen @ 2003-06-09  9:47 UTC (permalink / raw)
  To: David S. Miller; +Cc: sim, xerox, fw, netdev, linux-net, kuznet, Robert.Olsson

On Mon, Jun 09, 2003 at 02:01:16AM -0700, David S. Miller wrote:
>    From: Simon Kirby <sim@netnation.com>
>    Date: Mon, 9 Jun 2003 01:18:03 -0700
> 
>     10516 dst_alloc                                 73.0278
> 
> Gross, we effectively initialize a new dst multiple times :(
> In fact, we modify the same cache lines at least 3 times.
> 
> There's a lot more we can do in this area.  But this patch below kills
> some of it.  Again, patch is against 2.5.x-current.
> 
> Actually, it is a relatively good sign, it means this is a relatively
> unexplored area of the networking :-)))
> 
> --- net/core/dst.c.~1~	Mon Jun  9 01:47:26 2003
> +++ net/core/dst.c	Mon Jun  9 01:53:41 2003
> @@ -122,13 +122,31 @@ void * dst_alloc(struct dst_ops * ops)
>  	dst = kmem_cache_alloc(ops->kmem_cachep, SLAB_ATOMIC);
>  	if (!dst)
>  		return NULL;
> -	memset(dst, 0, ops->entry_size);
> +	dst->next = NULL;
>  	atomic_set(&dst->__refcnt, 0);
> -	dst->ops = ops;
> +	dst->__use = 0;
> +	dst->child = NULL;
> +	dst->dev = NULL;
> +	dst->obsolete = 0;
> +	dst->flags = 0;
>  	dst->lastuse = jiffies;
> +	dst->expires = 0;
> +	dst->header_len = 0;
> +	dst->trailer_len = 0;
> +	memset(dst->metrics, 0, sizeof(dst->metrics));

gcc will generate a lot better code for the memsets if you can tell
it somehow that they are long-aligned and a multiple of 8 bytes,
e.g. redeclare them as long instead of char.  If it cannot figure out
the alignment it often (at least on x86) calls the external memset
function.

>  	dst->path = dst;
> +	dst->rate_last = 0;
> +	dst->rate_tokens = 0;
> +	dst->error = 0;
> +	dst->neighbour = NULL;
> +	dst->hh = NULL;
> +	dst->xfrm = NULL;
>  	dst->input = dst_discard;
>  	dst->output = dst_blackhole;
> +	dst->ops = ops;
> +	INIT_RCU_HEAD(&dst->rcu_head);
> +	memset(dst->info, 0,
> +	       ops->entry_size - offsetof(struct dst_entry, info));

Same here.
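
(A minimal user-space sketch of that suggestion, with made-up struct and
field names rather than the real dst_entry layout: once the zeroed region
is declared as an array of long, the compiler can see both the alignment
and that the size is a multiple of 8, so it can expand the memset inline
instead of calling the out-of-line memset().)

#include <stddef.h>
#include <string.h>

/* Hypothetical layout, for illustration only: the zero-fill region is
 * declared as longs, so gcc knows it is long-aligned and a multiple of
 * 8 bytes. */
struct fake_entry {
	void		*next;		/* set explicitly below */
	unsigned long	zero_area[6];	/* 48 bytes on a 64-bit box */
};

static void fake_entry_init(struct fake_entry *e)
{
	e->next = NULL;
	memset(e->zero_area, 0, sizeof(e->zero_area));
}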

-Andi

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09  9:47                                                         ` Andi Kleen
@ 2003-06-09 10:03                                                           ` David S. Miller
  2003-06-09 10:13                                                             ` Andi Kleen
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-09 10:03 UTC (permalink / raw)
  To: ak; +Cc: sim, xerox, fw, netdev, linux-net, kuznet, Robert.Olsson

   From: Andi Kleen <ak@suse.de>
   Date: Mon, 9 Jun 2003 11:47:34 +0200
   
   gcc will generate a lot better code for the memsets if you can tell
   it somehow they are long aligned and a multiple of 8 bytes.

True, but the real bug is that we're initializing any of this
crap here at all.  Right now we write over the same cachelines
3 or so times.  It should really just happen once.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09 10:03                                                           ` David S. Miller
@ 2003-06-09 10:13                                                             ` Andi Kleen
  2003-06-09 10:13                                                               ` David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: Andi Kleen @ 2003-06-09 10:13 UTC (permalink / raw)
  To: David S. Miller
  Cc: ak, sim, xerox, fw, netdev, linux-net, kuznet, Robert.Olsson

On Mon, Jun 09, 2003 at 03:03:34AM -0700, David S. Miller wrote:
>    From: Andi Kleen <ak@suse.de>
>    Date: Mon, 9 Jun 2003 11:47:34 +0200
>    
>    gcc will generate a lot better code for the memsets if you can tell
>    it somehow they are long aligned and a multiple of 8 bytes.
> 
> True, but the real bug is that we're initializing any of this
> crap here at all.  Right now we write over the same cachelines
> 3 or so times.  It should really just happen once.

It's unlikely to be the reason for the profile hit on a modern x86.
They are all really fast at reading/writing L1. 

More likely it is the cache miss for fetching the lines initially.
Perhaps it is cache thrashing the dst_entry heads. Adding a strategic
prefetch somewhere early may help a lot.

-Andi

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09 10:13                                                             ` Andi Kleen
@ 2003-06-09 10:13                                                               ` David S. Miller
  2003-06-09 10:40                                                                 ` YOSHIFUJI Hideaki / 吉藤英明
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-09 10:13 UTC (permalink / raw)
  To: ak; +Cc: sim, xerox, fw, netdev, linux-net, kuznet, Robert.Olsson

   From: Andi Kleen <ak@suse.de>
   Date: Mon, 9 Jun 2003 12:13:02 +0200

   On Mon, Jun 09, 2003 at 03:03:34AM -0700, David S. Miller wrote:
   > True, but the real bug is that we're initializing any of this
   > crap here at all.  Right now we write over the same cachelines
   > 3 or so times.  It should really just happen once.
   
   It's unlikely to be the reason for the profile hit on a modern x86.
   They are all really fast at reading/writing L1. 
   
It's store buffer compression that's being messed up.  I've seen this
on just about any processor.

This is also why the net/core/skbuff.c initialization hacks are so
effective as well.

Trust me, this has every symptom of excess store buffer traffic :)

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09 10:40                                                                 ` YOSHIFUJI Hideaki / 吉藤英明
@ 2003-06-09 10:40                                                                   ` David S. Miller
  0 siblings, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-09 10:40 UTC (permalink / raw)
  To: yoshfuji; +Cc: ak, sim, xerox, fw, netdev, linux-net, kuznet, Robert.Olsson

   From: YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@wide.ad.jp>
   Date: Mon, 09 Jun 2003 19:40:46 +0900 (JST)
   
   Ok, how about this?

The memset_tail thing is unnecessary; it's better to put the
non-zero objects at the beginning, then you can do:

	 memset(dst->${FIRST_ZERO_MEMBER}, 0,
	        ops->entry_size -
		offsetof(struct dst_entry, ${FIRST_ZERO_MEMBER}));

But even _THIS_ is stupid.  All this initialization really should
move to caller.  We can provide a "dst_init()" helper for protocols
that don't want to bother optimizing this.
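
(A minimal user-space sketch of that layout trick, with made-up names
rather than the actual dst_entry members: the fields that get explicit
values sit at the front, and everything from the first always-zero member
onward is cleared with a single offsetof()-based memset.)

#include <stddef.h>
#include <string.h>

/* Hypothetical layout, for illustration only. */
struct fake_dst {
	/* members that get explicit values come first */
	unsigned long	lastuse;
	void		*path;
	const void	*ops;
	/* everything from here on starts out as zero */
	int		error;
	void		*neighbour;
	void		*hh;
	char		info[0];	/* protocol-private tail */
};

static void fake_dst_init(struct fake_dst *d, size_t entry_size,
			  unsigned long now, const void *ops)
{
	d->lastuse = now;
	d->path    = d;
	d->ops     = ops;
	/* one memset covers the zero members plus the info[] tail */
	memset(&d->error, 0, entry_size - offsetof(struct fake_dst, error));
}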

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09 10:13                                                               ` David S. Miller
@ 2003-06-09 10:40                                                                 ` YOSHIFUJI Hideaki / 吉藤英明
  2003-06-09 10:40                                                                   ` David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2003-06-09 10:40 UTC (permalink / raw)
  To: davem; +Cc: ak, sim, xerox, fw, netdev, linux-net, kuznet, Robert.Olsson

In article <20030609.031341.77044985.davem@redhat.com> (at Mon, 09 Jun 2003 03:13:41 -0700 (PDT)), "David S. Miller" <davem@redhat.com> says:

>    It's unlikely to be the reason for the profile hit on a modern x86.
>    They are all really fast at reading/writing L1. 
:
> This is also why the net/core/skbuff.c initialization hacks are so
> effective as well.
> 
> Trust me, this has every symptom of excess store buffer traffic :)

Ok, how about this?

Index: linux25/include/net/dst.h
===================================================================
RCS file: /cvsroot/usagi/usagi/kernel/linux25/include/net/dst.h,v
retrieving revision 1.7
diff -u -r1.7 dst.h
--- linux25/include/net/dst.h	20 Apr 2003 14:55:48 -0000	1.7
+++ linux25/include/net/dst.h	9 Jun 2003 10:26:30 -0000
@@ -38,7 +38,7 @@
 struct dst_entry
 {
 	struct dst_entry        *next;
-	atomic_t		__refcnt;	/* client references	*/
+	
 	int			__use;
 	struct dst_entry	*child;
 	struct net_device       *dev;
@@ -48,14 +48,12 @@
 #define DST_NOXFRM		2
 #define DST_NOPOLICY		4
 #define DST_NOHASH		8
-	unsigned long		lastuse;
 	unsigned long		expires;
 
 	unsigned short		header_len;	/* more space at head required */
 	unsigned short		trailer_len;	/* space to reserve at tail */
 
 	u32			metrics[RTAX_MAX];
-	struct dst_entry	*path;
 
 	unsigned long		rate_last;	/* rate limiting for ICMP */
 	unsigned long		rate_tokens;
@@ -66,16 +64,24 @@
 	struct hh_cache		*hh;
 	struct xfrm_state	*xfrm;
 
-	int			(*input)(struct sk_buff*);
-	int			(*output)(struct sk_buff*);
-
 #ifdef CONFIG_NET_CLS_ROUTE
 	__u32			tclassid;
 #endif
 
-	struct  dst_ops	        *ops;
 	struct rcu_head		rcu_head;
-		
+
+	/* These elements should be at the end of dst_entry{}; 
+	 * see net/core/dst.c:dst_alloc() -- yoshfuji */
+	u32			__dst_memset_tail[0];
+
+	atomic_t		__refcnt;	/* client references	*/
+	unsigned long		lastuse;
+
+	struct dst_entry	*path;
+	int			(*input)(struct sk_buff*);
+	int			(*output)(struct sk_buff*);
+	struct dst_ops	        *ops;
+
 	char			info[0];
 };
 
Index: linux25/net/core/dst.c
===================================================================
RCS file: /cvsroot/usagi/usagi/kernel/linux25/net/core/dst.c,v
retrieving revision 1.1.1.9
diff -u -r1.1.1.9 dst.c
--- linux25/net/core/dst.c	27 May 2003 02:59:54 -0000	1.1.1.9
+++ linux25/net/core/dst.c	9 Jun 2003 10:26:30 -0000
@@ -122,13 +122,16 @@
 	dst = kmem_cache_alloc(ops->kmem_cachep, SLAB_ATOMIC);
 	if (!dst)
 		return NULL;
-	memset(dst, 0, ops->entry_size);
+	memset(dst, 0, offsetof(struct dst_entry, __dst_memset_tail));
 	atomic_set(&dst->__refcnt, 0);
-	dst->ops = ops;
 	dst->lastuse = jiffies;
 	dst->path = dst;
 	dst->input = dst_discard;
 	dst->output = dst_blackhole;
+	dst->ops = ops;
+	if (ops->entry_size > offsetof(struct dst_entry, info))
+		memset(&dst->info, 0, ops->entry_size - offsetof(struct dst_entry, info));
+
 #if RT_CACHE_DEBUG >= 2 
 	atomic_inc(&dst_total);
 #endif

-- 
Hideaki YOSHIFUJI @ USAGI Project <yoshfuji@linux-ipv6.org>
GPG FP: 9022 65EB 1ECF 3AD1 0BDF  80D8 4807 F894 E062 0EEA

^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-09  8:10                                                     ` CIT/Paul
  2003-06-09  8:27                                                       ` Simon Kirby
@ 2003-06-09 11:38                                                       ` Jamal Hadi
  2003-06-09 11:55                                                         ` David S. Miller
  1 sibling, 1 reply; 227+ messages in thread
From: Jamal Hadi @ 2003-06-09 11:38 UTC (permalink / raw)
  To: CIT/Paul
  Cc: 'Simon Kirby', 'David S. Miller', fw, netdev, linux-net



On Mon, 9 Jun 2003, CIT/Paul wrote:

> I've got juno-z.101f.c to send 500,000 pps at 300+mbit on our dual p3
> 1.26 ghz routers.. I can't even send 50mbit of this through one of my
> routers without it using 100% of both cpus because of the route cache..
> It goes up to 500,000 entries if I let it and it adds 80,000 new entries
> per second and they are all cache misses.. I'd be glad to show you the
> setup sometime :) I showed it to jamal and we tested some stuff.
>

Yes, you have a nice setup and that's why you should test all the patches
DaveM is posting. Dave, Paul is running in a real ISP environment; I think
he is very valuable in helping to test these patches and collect
any stats that might be needed. Now watch him disappear ;->

BTW, re: BGP, someone should fix zebra to do batching if it doesn't do it
already (I saw that in one of the emails). In addition, ARP all
the nexthops right before installing the entries in the FIB. Repeat the
ARP every X timeout. Nexthops failing ARP should be removed.
That should give you something close to what I think CEF was designed for,
i.e. when the packets get to us, part of the route is resolved already.

Additional thought, Dave: I think prefetching the rth would help in 2.5,
at least when you have lotsa collisions.
Call prefetch(nextrth) right after smp_read_barrier_depends() everywhere
in route.c.

cheers,
jamal

PS:- this is one of those fun times i wish i had a setup and time ;->


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09 11:38                                                       ` Jamal Hadi
@ 2003-06-09 11:55                                                         ` David S. Miller
  2003-06-09 12:18                                                           ` Jamal Hadi
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-09 11:55 UTC (permalink / raw)
  To: hadi; +Cc: xerox, sim, fw, netdev, linux-net

   From: Jamal Hadi <hadi@shell.cyberus.ca>
   Date: Mon, 9 Jun 2003 07:38:44 -0400 (EDT)

   Yes, you have a nice setup and that's why you should test all the patches
   DaveM is posting. Dave, Paul is running in a real ISP environment; I think
   he is very valuable in helping to test these patches and collect
   any stats that might be needed. Now watch him disappear ;->
   
If he doesn't test my patches he isn't very useful,
so we'll see :-)

   Additional thought, Dave: I think prefetching the rth would help in 2.5,
   at least when you have lotsa collisions.
   Call prefetch(nextrth) right after smp_read_barrier_depends() everywhere
   in route.c.
   
You're going to prefetch "nextrth" when the first thing we're
going to access is "&nextrth->fl"? :-)

It only makes sense to prefetch the 'fl' member of the first hash
chain entry and that's what I've done in my tree.  This points out
that it would make sense to put the struct flowi up into the dst
entry.
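
(A minimal user-space sketch of that lookup pattern, with made-up types
rather than the real rtable/flowi structures: the flow key of the first
chain entry is prefetched as soon as the hash is known, so the likely
cache miss overlaps with the work done before the compare loop. As in the
kernel patch, a prefetch of a bogus address does not fault, so an empty
chain needs no special case.)

#include <stddef.h>

struct flow_key {
	unsigned int	dst;
	unsigned int	src;
	unsigned char	tos;
};

struct cache_entry {
	struct cache_entry	*next;
	struct flow_key		fl;	/* compared first on lookup */
};

static struct cache_entry *chain_heads[1024];	/* hypothetical hash table */

static struct cache_entry *cache_lookup(unsigned int hash,
					const struct flow_key *key)
{
	struct cache_entry *e = chain_heads[hash & 1023];

	/* start pulling the head entry's flow key into cache right away */
	__builtin_prefetch(&e->fl);

	for (; e != NULL; e = e->next) {
		if (e->fl.dst == key->dst &&
		    e->fl.src == key->src &&
		    e->fl.tos == key->tos)
			return e;
	}
	return NULL;
}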

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09 11:55                                                         ` David S. Miller
@ 2003-06-09 12:18                                                           ` Jamal Hadi
  2003-06-09 12:32                                                             ` David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: Jamal Hadi @ 2003-06-09 12:18 UTC (permalink / raw)
  To: David S. Miller; +Cc: xerox, sim, fw, netdev, linux-net



On Mon, 9 Jun 2003, David S. Miller wrote:

>    From: Jamal Hadi <hadi@shell.cyberus.ca>
>    Date: Mon, 9 Jun 2003 07:38:44 -0400 (EDT)
>
>    Yes, you have a nice setup and thats why you should test all the patches
>    DaveM is posting. Dave, Paul is running in a real ISP environment i think
>    he is very valuable in helping to test these patches and collect
>    any says that might be needed. Now watch him disapear ;->
>
> If he doesn't test my patches he isn't very useful,
> so we'll see :-)

Ok foo, the pressure is on you now ;->
You wanna see things fixed then run the damn tests or stop bitching ;->

> You're going to prefetch "nextrth" when the first thing we're
> going to access is "&nextrth->fl"? :-)
>
> It only makes sense to prefetch the 'fl' member of the first hash
> chain entry and that's what I've done in my tree.  This points out
> that it would make sense to put the struct flowi up into the dst
> entry.

Yes, moving the flowi up makes more sense. I found in my tests with an
ethernet driver that prefetching the _next_ dma descriptor gave better
numbers than prefetching the current one, but I didn't spend too much
time on it. I am going to revisit this. Good thought on rearranging the
structure; it may help with the descriptors as well.

cheers,
jamal

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09 12:18                                                           ` Jamal Hadi
@ 2003-06-09 12:32                                                             ` David S. Miller
  2003-06-09 13:22                                                               ` Jamal Hadi
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-09 12:32 UTC (permalink / raw)
  To: hadi; +Cc: xerox, sim, fw, netdev, linux-net

   From: Jamal Hadi <hadi@shell.cyberus.ca>
   Date: Mon, 9 Jun 2003 08:18:50 -0400 (EDT)
   
   I found in my tests with a ethernet driver that prefetching the
   _next_ dma descriptor gave better numbers than prefetching the
   current one but i didnt spend too much time.

Two issues:

1) We have some cycles to borrow for the head entry, so we can do the
   prefetch right before rcu_read_lock().

2) Ideally, hash chains will not exceed 1 entry (2 at the max).

Just some thinking...

^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-09  3:15                                           ` Jamal Hadi
  2003-06-09  5:27                                             ` CIT/Paul
  2003-06-09  6:25                                             ` David S. Miller
@ 2003-06-09 13:04                                             ` Ralph Doncaster
  2003-06-09 13:26                                               ` Jamal Hadi
  2 siblings, 1 reply; 227+ messages in thread
From: Ralph Doncaster @ 2003-06-09 13:04 UTC (permalink / raw)
  To: Jamal Hadi
  Cc: CIT/Paul, 'Simon Kirby', 'Florian Weimer',
	netdev, linux-net

On Sun, 8 Jun 2003, Jamal Hadi wrote:

> I am sure there are people who will like to sell you linux devices
> at half the cisco prices doing Millions of PPS via hardware assists.
> Support these linux supporting companies instead ;->

Are you serious?
Who is making these boxes?

-Ralph


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09 12:32                                                             ` David S. Miller
@ 2003-06-09 13:22                                                               ` Jamal Hadi
  2003-06-09 13:22                                                                 ` David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: Jamal Hadi @ 2003-06-09 13:22 UTC (permalink / raw)
  To: David S. Miller; +Cc: xerox, sim, fw, netdev, linux-net



On Mon, 9 Jun 2003, David S. Miller wrote:

>    From: Jamal Hadi <hadi@shell.cyberus.ca>
>    Date: Mon, 9 Jun 2003 08:18:50 -0400 (EDT)
>
>    I found in my tests with a ethernet driver that prefetching the
>    _next_ dma descriptor gave better numbers than prefetching the
>    current one but i didnt spend too much time.
>
> Two issues:
>
> 1) We have some cycles to borrow for head entry, we can make
>    prefetch right before rcu_read_lock()
>
> 2) Ideally, hash chains will not exceed 1 (2 at the max)
>    entries.
>

I don't think you'll see much benefit with 1 or 2 entries. I was thinking
more along the lines of people with over 100K entries total.
Let me run with this and get back to you.

cheers,
jamal


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09 13:22                                                               ` Jamal Hadi
@ 2003-06-09 13:22                                                                 ` David S. Miller
  0 siblings, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-09 13:22 UTC (permalink / raw)
  To: hadi; +Cc: xerox, sim, fw, netdev, linux-net

   From: Jamal Hadi <hadi@shell.cyberus.ca>
   Date: Mon, 9 Jun 2003 09:22:11 -0400 (EDT)
   
   I don't think you'll see much benefit with 1 or 2 entries. I was thinking
   more along the lines of people with over 100K entries total.

You simply don't want the chains to get that long.  In my experience,
even with prefetching tricks, past 2 or 3 entry deep hash chains you
run into serious problems.

TCP has the same issue BTW, in fact DoS-like behavior is the common
thing there.  Every time you create a new TCP connection on a server
it's exactly like a routing cache miss.

   Let me run with this and get back to you.

Ok.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-09 13:04                                             ` Ralph Doncaster
@ 2003-06-09 13:26                                               ` Jamal Hadi
  0 siblings, 0 replies; 227+ messages in thread
From: Jamal Hadi @ 2003-06-09 13:26 UTC (permalink / raw)
  To: ralph+d
  Cc: CIT/Paul, 'Simon Kirby', 'Florian Weimer',
	netdev, linux-net, sales



On Mon, 9 Jun 2003, Ralph Doncaster wrote:

> On Sun, 8 Jun 2003, Jamal Hadi wrote:
>
> > I am sure there are people who will like to sell you linux devices
> > at half the cisco prices doing Millions of PPS via hardware assists.
> > Support these linux supporting companies instead ;->
>
> Are you serious?

indeed.

> Who is making these boxes?

http://www.znyx.com/products/hardware/
Take one of these devices, plug it into a 1U box and run.
And I have absolutely no relationship with them at all ;->

cheers,
jamal

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09  6:47                                           ` Simon Kirby
  2003-06-09  6:49                                             ` David S. Miller
@ 2003-06-09 13:28                                             ` Ralph Doncaster
  2003-06-09 16:30                                               ` Simon Kirby
  1 sibling, 1 reply; 227+ messages in thread
From: Ralph Doncaster @ 2003-06-09 13:28 UTC (permalink / raw)
  To: Simon Kirby; +Cc: CIT/Paul, 'Florian Weimer', netdev, linux-net

On Sun, 8 Jun 2003, Simon Kirby wrote:

> You got a 7200 VXR to do 300kpps?  I would have liked to see that.
> We couldn't get our 7206 VXR routers to do anything more than about 12
> Mbit/second of small packets, which I believe is about 40,000 packets
> per second.  This is with CEF disabled, because it ended up duplicating
> packets and doing some other strange things with CEF enabled.

The trick is finding the good IOS revs.  12.0(7)T and 12.2(11)T have been
good ones for me.  Finding other ISPs running ciscos to exchange tips and
ideas has been much easier than finding folks running linux.  A sure-fire
way to get flamed is to post to NANOG asking what's the best Linux router
setup!

For most ISPs it's better to spend $20K on a 7206VXR/NPE-G1 than to spend
days trying to figure out what kernel + patch set, NIC, and motherboard
combination will squeeze the best performance out of a PC router.  And
once you've done that you still have zebra quirks to worry about...

-Ralph

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09  8:18                                                     ` Simon Kirby
  2003-06-09  8:22                                                       ` David S. Miller
  2003-06-09  9:01                                                       ` David S. Miller
@ 2003-06-09 14:14                                                       ` David S. Miller
  2 siblings, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-09 14:14 UTC (permalink / raw)
  To: sim; +Cc: xerox, fw, netdev, hadi, Robert.Olsson, kuznet


Ok Simon/Robert/Mr.Foo :), give this a try, it's my final installment
for the evening :-)

If this shows improvement, we can make even larger strides
by moving the struct flowi up into struct dst_entry.

--- net/core/dst.c.~1~	Mon Jun  9 01:47:26 2003
+++ net/core/dst.c	Mon Jun  9 03:13:56 2003
@@ -122,13 +122,34 @@ void * dst_alloc(struct dst_ops * ops)
 	dst = kmem_cache_alloc(ops->kmem_cachep, SLAB_ATOMIC);
 	if (!dst)
 		return NULL;
-	memset(dst, 0, ops->entry_size);
+	dst->next = NULL;
 	atomic_set(&dst->__refcnt, 0);
-	dst->ops = ops;
+	dst->__use = 0;
+	dst->child = NULL;
+	dst->dev = NULL;
+	dst->obsolete = 0;
+	dst->flags = 0;
 	dst->lastuse = jiffies;
+	dst->expires = 0;
+	dst->header_len = 0;
+	dst->trailer_len = 0;
+	memset(dst->metrics, 0, sizeof(dst->metrics));
 	dst->path = dst;
+	dst->rate_last = 0;
+	dst->rate_tokens = 0;
+	dst->error = 0;
+	dst->neighbour = NULL;
+	dst->hh = NULL;
+	dst->xfrm = NULL;
 	dst->input = dst_discard;
 	dst->output = dst_blackhole;
+#ifdef CONFIG_NET_CLS_ROUTE
+	dst->tclassid = 0;
+#endif
+	dst->ops = ops;
+	INIT_RCU_HEAD(&dst->rcu_head);
+	memset(dst->info, 0,
+	       ops->entry_size - offsetof(struct dst_entry, info));
 #if RT_CACHE_DEBUG >= 2 
 	atomic_inc(&dst_total);
 #endif
--- net/ipv4/route.c.~1~	Sun Jun  8 23:28:00 2003
+++ net/ipv4/route.c	Mon Jun  9 06:49:15 2003
@@ -88,6 +88,7 @@
 #include <linux/random.h>
 #include <linux/jhash.h>
 #include <linux/rcupdate.h>
+#include <linux/prefetch.h>
 #include <net/protocol.h>
 #include <net/ip.h>
 #include <net/route.h>
@@ -882,6 +883,60 @@ static void rt_del(unsigned hash, struct
 	spin_unlock_bh(&rt_hash_table[hash].lock);
 }
 
+static void __rt_hash_shrink(unsigned int hash)
+{
+	struct rtable *rth, **rthp;
+	struct rtable *cand, **candp;
+	unsigned int min_use = ~(unsigned int) 0;
+
+	spin_lock_bh(&rt_hash_table[hash].lock);
+	cand = NULL;
+	candp = NULL;
+	rthp = &rt_hash_table[hash].chain;
+	while ((rth = *rthp) != NULL) {
+		if (!atomic_read(&rth->u.dst.__refcnt) &&
+		    ((unsigned int) rth->u.dst.__use) < min_use) {
+			cand = rth;
+			candp = rthp;
+			min_use = rth->u.dst.__use;
+		}
+		rthp = &rth->u.rt_next;
+	}
+	if (cand) {
+		*candp = cand->u.rt_next;
+		rt_free(cand);
+	}
+
+	spin_unlock_bh(&rt_hash_table[hash].lock);
+}
+
+static inline struct rtable *ip_rt_dst_alloc(unsigned int hash)
+{
+	if (atomic_read(&ipv4_dst_ops.entries) >
+	    ipv4_dst_ops.gc_thresh)
+		__rt_hash_shrink(hash);
+
+	return dst_alloc(&ipv4_dst_ops);
+}
+
+static void ip_rt_copy(struct rtable *rt, struct rtable *old)
+{
+	memcpy(rt, old, sizeof(*rt));
+
+	INIT_RCU_HEAD(&rt->u.dst.rcu_head);
+	rt->u.dst.__use		= 1;
+	atomic_set(&rt->u.dst.__refcnt, 1);
+	rt->u.dst.child		= NULL;
+	if (rt->u.dst.dev)
+		dev_hold(rt->u.dst.dev);
+	rt->u.dst.obsolete	= 0;
+	rt->u.dst.lastuse	= jiffies;
+	rt->u.dst.path		= &rt->u.dst;
+	rt->u.dst.neighbour	= NULL;
+	rt->u.dst.hh		= NULL;
+	rt->u.dst.xfrm		= NULL;
+}
+
 void ip_rt_redirect(u32 old_gw, u32 daddr, u32 new_gw,
 		    u32 saddr, u8 tos, struct net_device *dev)
 {
@@ -912,9 +967,10 @@ void ip_rt_redirect(u32 old_gw, u32 dadd
 
 	for (i = 0; i < 2; i++) {
 		for (k = 0; k < 2; k++) {
-			unsigned hash = rt_hash_code(daddr,
-						     skeys[i] ^ (ikeys[k] << 5),
-						     tos);
+			unsigned int hash = rt_hash_code(daddr,
+							 skeys[i] ^
+							 (ikeys[k] << 5),
+							 tos);
 
 			rthp=&rt_hash_table[hash].chain;
 
@@ -942,7 +998,7 @@ void ip_rt_redirect(u32 old_gw, u32 dadd
 				dst_hold(&rth->u.dst);
 				rcu_read_unlock();
 
-				rt = dst_alloc(&ipv4_dst_ops);
+				rt = ip_rt_dst_alloc(hash);
 				if (rt == NULL) {
 					ip_rt_put(rth);
 					in_dev_put(in_dev);
@@ -950,19 +1006,7 @@ void ip_rt_redirect(u32 old_gw, u32 dadd
 				}
 
 				/* Copy all the information. */
-				*rt = *rth;
- 				INIT_RCU_HEAD(&rt->u.dst.rcu_head);
-				rt->u.dst.__use		= 1;
-				atomic_set(&rt->u.dst.__refcnt, 1);
-				rt->u.dst.child		= NULL;
-				if (rt->u.dst.dev)
-					dev_hold(rt->u.dst.dev);
-				rt->u.dst.obsolete	= 0;
-				rt->u.dst.lastuse	= jiffies;
-				rt->u.dst.path		= &rt->u.dst;
-				rt->u.dst.neighbour	= NULL;
-				rt->u.dst.hh		= NULL;
-				rt->u.dst.xfrm		= NULL;
+				ip_rt_copy(rt, rth);
 
 				rt->rt_flags		|= RTCF_REDIRECTED;
 
@@ -1352,7 +1396,7 @@ static void rt_set_nexthop(struct rtable
 static int ip_route_input_mc(struct sk_buff *skb, u32 daddr, u32 saddr,
 				u8 tos, struct net_device *dev, int our)
 {
-	unsigned hash;
+	unsigned int hash;
 	struct rtable *rth;
 	u32 spec_dst;
 	struct in_device *in_dev = in_dev_get(dev);
@@ -1375,7 +1419,9 @@ static int ip_route_input_mc(struct sk_b
 					dev, &spec_dst, &itag) < 0)
 		goto e_inval;
 
-	rth = dst_alloc(&ipv4_dst_ops);
+	hash = rt_hash_code(daddr, saddr ^ (dev->ifindex << 5), tos);
+
+	rth = ip_rt_dst_alloc(hash);
 	if (!rth)
 		goto e_nobufs;
 
@@ -1421,7 +1467,6 @@ static int ip_route_input_mc(struct sk_b
 	RT_CACHE_STAT_INC(in_slow_mc);
 
 	in_dev_put(in_dev);
-	hash = rt_hash_code(daddr, saddr ^ (dev->ifindex << 5), tos);
 	return rt_intern_hash(hash, rth, (struct rtable**) &skb->dst);
 
 e_nobufs:
@@ -1584,45 +1629,42 @@ int ip_route_input_slow(struct sk_buff *
 			goto e_inval;
 	}
 
-	rth = dst_alloc(&ipv4_dst_ops);
+	rth = ip_rt_dst_alloc(hash);
 	if (!rth)
 		goto e_nobufs;
 
 	atomic_set(&rth->u.dst.__refcnt, 1);
-	rth->u.dst.flags= DST_HOST;
-	if (in_dev->cnf.no_policy)
-		rth->u.dst.flags |= DST_NOPOLICY;
-	if (in_dev->cnf.no_xfrm)
-		rth->u.dst.flags |= DST_NOXFRM;
-	rth->fl.fl4_dst	= daddr;
+	rth->u.dst.dev	= out_dev->dev;
+	dev_hold(out_dev->dev);
+	rth->u.dst.flags= (DST_HOST |
+			   (in_dev->cnf.no_policy ? DST_NOPOLICY : 0) |
+			   (in_dev->cnf.no_xfrm ? DST_NOXFRM : 0));
+	rth->u.dst.input = ip_forward;
+	rth->u.dst.output = ip_output;
+
+	rth->rt_flags	= flags;
+	rth->rt_src	= saddr;
 	rth->rt_dst	= daddr;
-	rth->fl.fl4_tos	= tos;
+	rth->rt_iif 	= dev->ifindex;
+	rth->rt_gateway	= daddr;
+
+	rth->fl.iif	= dev->ifindex;
+	rth->fl.fl4_dst	= daddr;
+	rth->fl.fl4_src	= saddr;
 #ifdef CONFIG_IP_ROUTE_FWMARK
 	rth->fl.fl4_fwmark= skb->nfmark;
 #endif
-	rth->fl.fl4_src	= saddr;
-	rth->rt_src	= saddr;
-	rth->rt_gateway	= daddr;
+	rth->fl.fl4_tos	= tos;
+	rth->rt_spec_dst= spec_dst;
 #ifdef CONFIG_IP_ROUTE_NAT
 	rth->rt_src_map	= fl.fl4_src;
 	rth->rt_dst_map	= fl.fl4_dst;
-	if (flags&RTCF_DNAT)
+	if (flags & RTCF_DNAT)
 		rth->rt_gateway	= fl.fl4_dst;
 #endif
-	rth->rt_iif 	=
-	rth->fl.iif	= dev->ifindex;
-	rth->u.dst.dev	= out_dev->dev;
-	dev_hold(rth->u.dst.dev);
-	rth->fl.oif 	= 0;
-	rth->rt_spec_dst= spec_dst;
-
-	rth->u.dst.input = ip_forward;
-	rth->u.dst.output = ip_output;
 
 	rt_set_nexthop(rth, &res, itag);
 
-	rth->rt_flags = flags;
-
 #ifdef CONFIG_NET_FASTROUTE
 	if (netdev_fastroute && !(flags&(RTCF_NAT|RTCF_MASQ|RTCF_DOREDIRECT))) {
 		struct net_device *odev = rth->u.dst.dev;
@@ -1663,45 +1705,45 @@ brd_input:
 	RT_CACHE_STAT_INC(in_brd);
 
 local_input:
-	rth = dst_alloc(&ipv4_dst_ops);
+	rth = ip_rt_dst_alloc(hash);
 	if (!rth)
 		goto e_nobufs;
 
+	atomic_set(&rth->u.dst.__refcnt, 1);
+	rth->u.dst.dev	= &loopback_dev;
+	dev_hold(&loopback_dev);
+	rth->u.dst.flags= (DST_HOST |
+			   (in_dev->cnf.no_policy ? DST_NOPOLICY : 0));
+	rth->u.dst.input= ip_local_deliver;
 	rth->u.dst.output= ip_rt_bug;
+#ifdef CONFIG_NET_CLS_ROUTE
+	rth->u.dst.tclassid = itag;
+#endif
 
-	atomic_set(&rth->u.dst.__refcnt, 1);
-	rth->u.dst.flags= DST_HOST;
-	if (in_dev->cnf.no_policy)
-		rth->u.dst.flags |= DST_NOPOLICY;
-	rth->fl.fl4_dst	= daddr;
+	rth->rt_flags 	= flags|RTCF_LOCAL;
+	rth->rt_type	= res.type;
+	rth->rt_src	= saddr;
 	rth->rt_dst	= daddr;
-	rth->fl.fl4_tos	= tos;
+	rth->rt_iif	= dev->ifindex;
+	rth->rt_gateway	= daddr;
+
+	rth->fl.iif	= dev->ifindex;
+	rth->fl.fl4_dst	= daddr;
+	rth->fl.fl4_src	= saddr;
 #ifdef CONFIG_IP_ROUTE_FWMARK
 	rth->fl.fl4_fwmark= skb->nfmark;
 #endif
-	rth->fl.fl4_src	= saddr;
-	rth->rt_src	= saddr;
+	rth->fl.fl4_tos	= tos;
+	rth->rt_spec_dst= spec_dst;
 #ifdef CONFIG_IP_ROUTE_NAT
 	rth->rt_dst_map	= fl.fl4_dst;
 	rth->rt_src_map	= fl.fl4_src;
 #endif
-#ifdef CONFIG_NET_CLS_ROUTE
-	rth->u.dst.tclassid = itag;
-#endif
-	rth->rt_iif	=
-	rth->fl.iif	= dev->ifindex;
-	rth->u.dst.dev	= &loopback_dev;
-	dev_hold(rth->u.dst.dev);
-	rth->rt_gateway	= daddr;
-	rth->rt_spec_dst= spec_dst;
-	rth->u.dst.input= ip_local_deliver;
-	rth->rt_flags 	= flags|RTCF_LOCAL;
 	if (res.type == RTN_UNREACHABLE) {
 		rth->u.dst.input= ip_error;
 		rth->u.dst.error= -err;
 		rth->rt_flags 	&= ~RTCF_LOCAL;
 	}
-	rth->rt_type	= res.type;
 	goto intern;
 
 no_route:
@@ -1767,6 +1809,8 @@ int ip_route_input(struct sk_buff *skb, 
 	tos &= IPTOS_RT_MASK;
 	hash = rt_hash_code(daddr, saddr ^ (iif << 5), tos);
 
+	prefetch(&rt_hash_table[hash].chain->fl);
+
 	rcu_read_lock();
 	for (rth = rt_hash_table[hash].chain; rth; rth = rth->u.rt_next) {
 		smp_read_barrier_depends();
@@ -2048,7 +2092,10 @@ make_route:
 		}
 	}
 
-	rth = dst_alloc(&ipv4_dst_ops);
+	hash = rt_hash_code(oldflp->fl4_dst,
+			    oldflp->fl4_src ^ (oldflp->oif << 5), tos);
+
+	rth = ip_rt_dst_alloc(hash);
 	if (!rth)
 		goto e_nobufs;
 
@@ -2104,10 +2151,6 @@ make_route:
 
 	rt_set_nexthop(rth, &res, 0);
 	
-
-	rth->rt_flags = flags;
-
-	hash = rt_hash_code(oldflp->fl4_dst, oldflp->fl4_src ^ (oldflp->oif << 5), tos);
 	err = rt_intern_hash(hash, rth, rp);
 done:
 	if (free_res)
@@ -2132,6 +2175,8 @@ int __ip_route_output_key(struct rtable 
 	struct rtable *rth;
 
 	hash = rt_hash_code(flp->fl4_dst, flp->fl4_src ^ (flp->oif << 5), flp->fl4_tos);
+
+	prefetch(&rt_hash_table[hash].chain->fl);
 
 	rcu_read_lock();
 	for (rth = rt_hash_table[hash].chain; rth; rth = rth->u.rt_next) {

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09  6:28                                                   ` David S. Miller
@ 2003-06-09 16:23                                                     ` Stephen Hemminger
  2003-06-09 16:37                                                       ` David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: Stephen Hemminger @ 2003-06-09 16:23 UTC (permalink / raw)
  To: David S. Miller; +Cc: xerox, hadi, sim, fw, netdev, linux-net, Robert.Olsson

Has anyone looked into using Judy arrays to speed up the route cache?
HP has opened it up (see http://sourceforge.net/projects/judy ) and it should
have better scaling for these types of attacks.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09 13:28                                             ` Ralph Doncaster
@ 2003-06-09 16:30                                               ` Simon Kirby
  2003-06-17 20:58                                                 ` Florian Weimer
  0 siblings, 1 reply; 227+ messages in thread
From: Simon Kirby @ 2003-06-09 16:30 UTC (permalink / raw)
  To: ralph+d; +Cc: netdev, linux-net

On Mon, Jun 09, 2003 at 09:28:59AM -0400, Ralph Doncaster wrote:

> The trick is finding the good IOS revs.  12.0(7)T and 12.2(11)T have been
> good ones for me.  Finding other ISPs running ciscos to exchange tips and
> ideas has been much easier than finding folks running linux.  A sure-fire
> way to get flamed is to post to NANOG asking what's the best Linux router
> setup!
> 
> For most ISPs it's better to spend $20K on a 7206VXR/NPE-G1 than to spend
> days trying to figure out what kernel + patch set, NIC, and motherboard
> combination will squeeze the best performance out of a PC router.  And
> once you've done that you still have zebra quirks to worry about...

I beg to differ.  We had much more pain trying to get those things to
work properly than putting together two boxes that have been up now for
almost a year without incident.  Running Zebra, keepalived, etc., without
any problems at all.  What Zebra quirks?  There has not yet been one
crash or failure, which is much better than we could say for the 7206s.

And I wouldn't exactly call it difficult to "squeeze" performance out of
a PC when the 7206 VXRs have a 200 MHz processor.

The main reason we switched is when we realized we could set up a
powerful Linux box full of gigabit NICs for less than the price of one
gigabit interface.  At the time we purchased the NICs (3C996B-T) for less
than $150 CDN each, and they're probably cheaper now.

Simon-

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09 16:23                                                     ` Stephen Hemminger
@ 2003-06-09 16:37                                                       ` David S. Miller
  0 siblings, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-09 16:37 UTC (permalink / raw)
  To: shemminger; +Cc: xerox, hadi, sim, fw, netdev, linux-net, Robert.Olsson

   From: Stephen Hemminger <shemminger@osdl.org>
   Date: Mon, 9 Jun 2003 09:23:27 -0700

   Has anyone looked into using Judy arrays to speed up the route
   cache?  HP has opened it up (see
   http://sourceforge.net/projects/judy ) and it should have better
   scaling for these types of attacks.
   
Like all such seemingly promising schemes, insert/retrieve are
optimized at the expense of delete.

I normally don't even look at such algorithms anymore; they are all
amazing if you only build tables and look things up in them,
but they are unusable when O(1) insert/delete/lookup is absolutely
required.
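
(For contrast, a minimal sketch with made-up types of the kind of delete
the route cache relies on: unlinking an entry from a hash chain with the
same pointer-to-pointer walk used in the patches above costs one pass
over the chain, i.e. effectively O(1) as long as chains stay short, and
it needs no rebalancing of the surrounding structure.)

struct entry {
	struct entry	*next;
	unsigned int	key;
};

/* Unlink the first entry matching 'key' from a single hash chain and
 * hand it back to the caller for freeing; returns NULL if not found. */
static struct entry *chain_del(struct entry **head, unsigned int key)
{
	struct entry **pp, *e;

	for (pp = head; (e = *pp) != NULL; pp = &e->next) {
		if (e->key == key) {
			*pp = e->next;	/* splice the entry out */
			e->next = NULL;
			return e;
		}
	}
	return NULL;
}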

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-02 10:58                                               ` Robert Olsson
  2003-06-02 15:18                                                 ` Simon Kirby
@ 2003-06-09 17:19                                                 ` David S. Miller
  1 sibling, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-09 17:19 UTC (permalink / raw)
  To: Robert.Olsson; +Cc: sim, netdev, linux-net, kuznet

   From: Robert Olsson <Robert.Olsson@data.slu.se>
   Date: Mon, 2 Jun 2003 12:58:31 +0200
   
    And later GC has to remove all entries with spin_lock_bh held (no
    packet processing runs). I see packet drops exactly when GC
    runs. Tuning GC might help but it's something to observe.

Please note: in 2.5.x, holding this lock on one CPU does not
prevent packet processing (even for routes on the same hash
chain) on another CPU, because we use RCU there.
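
For anyone not following the 2.5 change, the read side being described
looks roughly like this (a sketch with made-up names, not the actual
route.c code, with reference counting and memory barriers omitted):
lookups walk the hash chain under rcu_read_lock() instead of taking the
per-bucket spinlock, so a GC pass holding the lock on one CPU no longer
stalls lookups on the others.

/* Sketch only: an RCU-protected read of a hash chain.  The writer
 * (GC) still takes the bucket lock; readers do not.  A real version
 * must also take a reference on the entry before returning it and
 * use the appropriate barriers when following ->next. */
#include <linux/rcupdate.h>

struct rc_entry {
        struct rc_entry *next;
        unsigned int     key;
};

static struct rc_entry *rc_lookup(struct rc_entry *head, unsigned int key)
{
        struct rc_entry *e;

        rcu_read_lock();                /* no spinlock on the read side */
        for (e = head; e != NULL; e = e->next)
                if (e->key == key)
                        break;
        rcu_read_unlock();
        return e;
}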

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-02 16:36                                                   ` Robert Olsson
  2003-06-02 18:05                                                     ` Simon Kirby
@ 2003-06-09 17:21                                                     ` David S. Miller
  1 sibling, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-09 17:21 UTC (permalink / raw)
  To: Robert.Olsson; +Cc: sim, netdev, linux-net, kuznet

   From: Robert Olsson <Robert.Olsson@data.slu.se>
   Date: Mon, 2 Jun 2003 18:36:37 +0200
   
   Simon Kirby writes:
   
    > Is it possible to have a dst LRU or a simpler approximation of such and
    > recycle dst entries rather than deallocating/reallocating them?  This
    > would relieve a lot of work from the garbage collector and avoid the
    > periodic large garbage collection latency.  It could be tuned to only
    > occur in an attack (I remember Alexey saying that the deferred garbage
    > collection was implemented to reduce latency in normal operation).
   
    I don't see how this can be done. Others may?

Full recycle is very doable in 2.4.x; in 2.5.x it is an enormously
hard problem because we use RCU there (readers run completely
without locks).

^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-09  8:27                                                       ` Simon Kirby
@ 2003-06-09 19:38                                                         ` CIT/Paul
  2003-06-09 21:30                                                           ` David S. Miller
  2003-06-09 22:19                                                           ` Simon Kirby
  0 siblings, 2 replies; 227+ messages in thread
From: CIT/Paul @ 2003-06-09 19:38 UTC (permalink / raw)
  To: 'Simon Kirby'
  Cc: 'David S. Miller', hadi, fw, netdev, linux-net

gc_elasticity:1
gc_interval:600
gc_min_interval:1
gc_thresh:60000
gc_timeout:15
max_delay:10
max_size:512000
min_adv_mss:256
min_delay:5
min_pmtu:552
mtu_expires:600
redirect_load:2
redirect_number:9
redirect_silence:2048
secret_interval:60


I've tried other settings: secret_interval 1, which seems to 'flush' the
cache every second, or 60 seconds as I have it here.
If I have secret_interval set to 1 the GC never runs, because the cache
never gets above my gc_thresh.  I've also tried this with
gc_thresh 2000 and more aggressive settings (timeout 5, interval 10),
and with max_size 16000, but juno pegs the route cache
and I get massive amounts of dst_cache_overflow messages.
This is 'normal' traffic on the router (using the rtstat program):

./rts -i 1
 size   IN: hit     tot    mc no_rt bcast madst masrc  OUT: hit     tot
mc GC: tot ignored goal_miss ovrf
59272     26954    1826     0     0     0     0     0         6       0
0       0       0         0    0
35188     24721    4901     0     0     0     1     0         7       4
0       1       0         0    0
40170     23820    4978     0     0     0     1     0         6       5
0       0       0         0    0
43674     24630    3503     0     0     0     0     0         6       2
0       0       0         0    0
46857     24889    3184     0     0     0     1     0         5       0
0       0       0         0    0
  809     26110    3394     0     0     0     1     2         8       6
0       0       0         0    0
13898     14370   13322     0     0     0     0     1         2       6
2       0       0         0    0
22309     19823    8408     0     0     0     1     0         3       3
0       0       0         0    0
27857     21905    5543     0     0     0     1     0         3       5
0       0       0         0    0
32572     23521    4712     0     0     0     0     0         3       3
0       0       0         0    0
35863     25057    3287     0     0     0     1     0         5       4
0       0       0         0    0
39431     25769    3567     0     0     0     1     0         5       4
0       0       0         0    0
43114     25681    3686     0     0     0     0     0         3       1
0       0       0         0    0
46143     26140    3031     0     0     0     1     0         5       1
0       0       0         0    0
49158     28385    3015     0     0     0     1     0         8       4
0       0       0         0    0
52053     29708    2896     0     0     0     0     0         3       1
0       0       0         0    0

You can see where the secret interval hits and the size goes back down
to nothing.  You can also see the garbage collection
kicking in at a high pace in the first 2 lines: when it hits the thresh it
knocks it down hard and doesn't even show in the GC stats on the right.
This seems to be a good compromise for now.

Check what happens when I load up juno..
52253     26156    2933     0     0     0     1     0         5       3
0       0       0         0    0
54313     25429    2061     0     0     0     0     0         4       0
0       0       0         0    0
56467     25754    2153     0     0     0     0     0         9       1
0       0       0         0    0
39980     28341   12497     0     0     0     3     0         8       2
0       1       0         0    0
62200     21112   63032     0     0     0     0     0         0       5
2   15124   15123         0    0
41754     12345   52282     0     0     0     2     0         2       7
1   18776   18774         0    0
49139      8263   42314     0     0     0     0     0         0       1
1   12191   12190         0    0
55385      8256   42518     0     0     0     0     0         2       4
0   14904   14903         0    0
57413      7308   38594     0     0     0     3     0         1       3
1   15545   15544         0    0
59604      7254   38850     0     0     0     0     0         0       1
1   15703   15702         0    0
66136      7835   43335     0     0     0     0     0         0       7
1   22191   22190         0    0
66570      7095   37265     0     0     0     2     0         0       1
1   16560   16559         0    0
69269      6786   39392     0     0     0     0     0         0       1
1   18516   18515         0    0
72988      7749   40492     0     0     0     0     0         0       7
1   19735   19734         0    0
 size   IN: hit     tot    mc no_rt bcast madst masrc  OUT: hit     tot
mc GC: tot ignored goal_miss ovrf
46461      8398   47142     0     0     0     1     0         0       1
1   19312   19310         0    0
58837      8325   49347     0     0     0     1     0         1       4
0   16369   16368         0    0
59948      6691   38094     0     0     0     1     0         1       5
2   16392   16391         0    0
63364      7442   40269     0     0     0     0     0         0       1
0   19528   19527         0    0
64597      7110   38179     0     0     0     0     0         1       5
1   17534   17533         0    0
68520      7306   40842     0     0     0     0     0         0       5
2   20153   20152         0    0
70807      6840   39230     0     0     0     1     0         0       1
0   18631   18630         0    0
73130      6534   39318     0     0     0     1     0         2       3
0   18805   18804         0    0
75149      7038   39141     0     0     0     0     0         0       4
1   18719   18718         0    0
75775      6486   37682     0     0     0     1     0         0       1
1   17183   17182         0    0
53219      8605   51413     0     0     0     2     0         0       9
1   17099   17097         0    0
57124      6977   40914     0     0     0     0     0         1       1
1   16465   16464         0    0
62052      7460   41935     0     0     0     2     0         1       1
1   18499   18498         0    0

Ok you see this happening but during this the router is almost
unusable..   
PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
    3 root      20  -1     0    0     0 RW<  48.5  0.0  34:04
ksoftirqd_CPU0
    4 root      20  -1     0    0     0 RW<  46.7  0.0  34:14
ksoftirqd_CPU1

And this is only about 15 Mbps of juno-z; since this is a production
router I don't want to push it any harder than that, so we don't drop any
'real' packets, but it still gives a good real-world test.  Both CPUs are
slammed at 100% by the ksoftirqds.  This is using e1000 with interrupts
limited to ~4000/second (ITR), no NAPI.  NAPI messes it up big time and
drops more packets than without :>

I'll load this all up on the test router, do the profiling, and test
Dave's patches later when I get a chance.


Paul xerox@foonet.net http://www.httpd.net


-----Original Message-----
From: Simon Kirby [mailto:sim@netnation.com] 
Sent: Monday, June 09, 2003 4:27 AM
To: CIT/Paul
Cc: 'David S. Miller'; hadi@shell.cyberus.ca; fw@deneb.enyo.de;
netdev@oss.sgi.com; linux-net@vger.kernel.org
Subject: Re: Route cache performance under stress


On Mon, Jun 09, 2003 at 04:10:55AM -0400, CIT/Paul wrote:

> I've got juno-z.101f.c to send 500,000 pps at 300+mbit on our dual p3 
> 1.26 ghz routers.. I can't even send 50mbit of this though one of my 
> routers Without it using 100% of both cpus because of the route 
> cache.. It goes up to 500,000 entries if I let it and it adds 80,000 
> new entries per second and they are all cache misses.. I'd be glad to 
> show you the setup sometime :) I showed it to jamal and we tested some

> stuff.

Hmm.. We're running on single 1800MP Athlons here.  Have you had a
chance to profile it? 

- add "profile=1" to the kernel command line
- reboot
- run juno-z.101f.c from remote box
- run "readprofile -r" on the router
- twiddle fingers for a while
- run "readprofile -n -m your_System.map > foo"
- stop juno :)
- run "sort -n +2 < foo > readprofile.time_sorted"

I'm interested to see if your profile results line up to what I'm seeing
here on UP (though I have the kernel compiled SMP...Oops).

Wait a second... 500,000 entries in the route cache?  WTF?  What is your
max_size set to?  That will massively overfill the hash bucket and
definitely take up way too much CPU.  It shouldn't be able to get there
at all unless you have raised max_size.  Here I have:

echo 4 > gc_elasticity          # Higher is weaker, 0 will nuke all
[dfl: 8]
echo 1 > gc_interval            # Garbage collection interval (seconds)
[dfl: 60]
echo 1 > gc_min_interval        # Garbage collection min interval
(seconds) [dfl: 5]
echo 90 > gc_timeout            # Entry lifetime (seconds) [dfl: 300]

[sroot@r1:/proc/sys/net/ipv4/route]# grep . *
...
gc_elasticity:4
gc_interval:1
gc_min_interval:1
gc_thresh:4096
gc_timeout:90
max_delay:10
max_size:65536

Simon-


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09 19:38                                                         ` CIT/Paul
@ 2003-06-09 21:30                                                           ` David S. Miller
  2003-06-09 22:19                                                           ` Simon Kirby
  1 sibling, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-09 21:30 UTC (permalink / raw)
  To: xerox; +Cc: sim, hadi, fw, netdev, linux-net

   From: "CIT/Paul" <xerox@foonet.net>
   Date: Mon, 9 Jun 2003 15:38:30 -0400

   I've tried other settings, secret-interval 1 which seems to 'flush' the
   cache every second or 60 seconds as I have it here..
   If I have secret interval set to 1 the GC never runs because the cache
   never gets > my gc thresh..

Set secret interval to infinity.  Even the default setting of 10
minutes is overly anal.  It's only picking a new random secret for the
hash so that algorithmic attacks are less likely even if the attacker
finds a method by which to determine the secret key on your system.  It
is impossible for an attacker to do this as far as I am aware.

   Also tried with max_size 16000 but juno pegs the route cache

What do you mean, specifically, by "pegs"?

   This seems to be a good compromise for now.. 
   
Setting the secret interval smaller than its default serves no
purpose.  I would recommend increasing it instead.

   Ok you see this happening but during this the router is almost
   unusable..   
   PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
       3 root      20  -1     0    0     0 RW<  48.5  0.0  34:04
   ksoftirqd_CPU0
       4 root      20  -1     0    0     0 RW<  46.7  0.0  34:14
   ksoftirqd_CPU1
   
   Both cpus are slammed at 100% by the ksoftirqds.

ksoftirqd kicks in WAY too early, try my patch below.

   This is using e1000 with interrupts limited to ~ 4000/second (ITR),
   no NAPI.. NAPI messes it up big time and drops more packets than
   without :>

Something is very wrong; NAPI can only give your system more CPU time
by which to do packet processing.  Some good kernel profiles would be
nice too.
   
Anyways, here is the patch to make ksoftirqd not kick in so quickly;
it's based upon a 2.4.x patch from Ingo Molnar:

--- kernel/softirq.c.~1~	Mon Jun  9 14:28:02 2003
+++ kernel/softirq.c	Mon Jun  9 14:29:28 2003
@@ -52,11 +52,22 @@
 		wake_up_process(tsk);
 }
 
+/*
+ * We restart softirq processing MAX_SOFTIRQ_RESTART times,
+ * and we fall back to softirqd after that.
+ *
+ * This number has been established via experimentation.
+ * The two things to balance is latency against fairness -
+ * we want to handle softirqs as soon as possible, but they
+ * should not be able to lock up the box.
+ */
+#define MAX_SOFTIRQ_RESTART 10
+
 asmlinkage void do_softirq(void)
 {
+	int max_restart = MAX_SOFTIRQ_RESTART;
 	__u32 pending;
 	unsigned long flags;
-	__u32 mask;
 
 	if (in_interrupt())
 		return;
@@ -68,7 +79,6 @@
 	if (pending) {
 		struct softirq_action *h;
 
-		mask = ~pending;
 		local_bh_disable();
 restart:
 		/* Reset the pending bitmask before enabling irqs */
@@ -88,10 +98,8 @@
 		local_irq_disable();
 
 		pending = local_softirq_pending();
-		if (pending & mask) {
-			mask &= ~pending;
+		if (pending && --max_restart)
 			goto restart;
-		}
 		if (pending)
 			wakeup_softirqd(smp_processor_id());
 		__local_bh_enable();

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09 19:38                                                         ` CIT/Paul
  2003-06-09 21:30                                                           ` David S. Miller
@ 2003-06-09 22:19                                                           ` Simon Kirby
  2003-06-09 22:54                                                             ` Robert Olsson
                                                                               ` (2 more replies)
  1 sibling, 3 replies; 227+ messages in thread
From: Simon Kirby @ 2003-06-09 22:19 UTC (permalink / raw)
  To: CIT/Paul; +Cc: 'David S. Miller', hadi, fw, netdev, linux-net

On Mon, Jun 09, 2003 at 03:38:30PM -0400, CIT/Paul wrote:

> gc_elasticity:1
> gc_interval:600
> gc_min_interval:1
> gc_thresh:60000
> gc_timeout:15
> max_delay:10
> max_size:512000

^^^ EEP, no!  Even the default of 65536 is too big.  No wonder you have
no CPU left.  This should never be bigger than 65536 (unless the hash is
increased), but even then it should be set smaller and the GC interval
should be fixed.  With a table that large, it's going to be walking the
buckets all of the time.

> I've tried other settings, secret-interval 1 which seems to 'flush' the
> cache every second or 60 seconds as I have it here..

That's only for permuting the hash table to avoid remote hash exploits.
Ideally, you don't want anything clearing the route cache except for the
regular garbage collection (where the gc_elasticity controls how much of
it gets nuked).
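
To spell out what the secret buys (a user-space sketch with an invented
mix function, not the kernel's actual hash): the bucket a given
source/destination pair lands in depends on a random secret, so an
attacker cannot precompute a set of addresses that all collide into one
chain, and rekeying simply scatters everything into different buckets.

/* Sketch only: mixing a random secret into the hash moves every flow
 * to a different bucket, so precomputed collisions stop working after
 * a rekey.  The mix function here is invented for illustration. */
#include <stdio.h>

static unsigned int toy_hash(unsigned int saddr, unsigned int daddr,
                             unsigned int secret)
{
        unsigned int h = saddr ^ daddr ^ secret;

        h ^= h >> 16;
        h *= 0x45d9f3bU;        /* arbitrary odd constant */
        h ^= h >> 13;
        return h & 0xffff;      /* assume 65536 buckets */
}

int main(void)
{
        unsigned int saddr = 0x0a000001, daddr = 0xc0a80001;

        printf("secret 0x1234: bucket %u\n", toy_hash(saddr, daddr, 0x1234));
        printf("secret 0xbeef: bucket %u\n", toy_hash(saddr, daddr, 0xbeef));
        return 0;
}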

> If I have secret interval set to 1 the GC never runs because the cache
> never gets > my gc thresh..  I've also tried this with
> Gc_thresh 2000 and more aggressive settings (timeout 5, interval 10)..
> Also tried with max_size 16000 but juno pegs the route cache
> And I get massive amounts of dst_cache_overflow messages .. 

Try setting gc_min_interval to 0 and gc_elasticity to 4 (so that it
doesn't entirely nuke it all the time, but so that it runs fairly often
and prunes quite a bit).  gc_min_interval:0 will actually make it clear
as it allocates, if I remember correctly.

> This is 'normal' traffic on the router (using the rtstat program)

> 
> ./rts -i 1
>  size   IN: hit     tot    mc no_rt bcast madst masrc  OUT: hit     tot
> mc GC: tot ignored goal_miss ovrf
> 59272     26954    1826     0     0     0     0     0         6       0
> 0       0       0         0    0

Yes, your route cache is way too large for the hash.

Ours looks like this:

[sroot@r2:/root]# rtstat -i 1
 size   IN: hit     tot    mc no_rt bcast madst masrc  OUT: hit     tot     mc
870721946     16394    1013     8     4     4     0     0        38      12      0
870722937     16278    1007     8     0    10     0     0        32       6      0
870723935     16362     999     5     0     6     0     0        34       8      0
870725083     16483    1158     1     0     0     0     2        26       6      0
870726047     16634     974     0     0     4     0     0        42       0      0
870726168     14315    2338    13    10     8     0     0        34      44      2
870726168     14683    1383     0     8     2     0     0        30      12      2
870726864     16172    1155     0     6     2     0     0        28       4      0
870728079     17842    1234     0     0     0     0     0        28      12      0
870729106     17545    1036     2     0     2     0     0        30       6      0

...Hmm, the size is a bit off there.  I'm not sure what that's all about. 
Did you have to hack on rtstat.c at all?  Alternative:

[sroot@r2:/root]# while (1)
[sroot@r2:(while)]# sleep 1
[sroot@r2:(while)]# ip -o route show cache | wc -l
[sroot@r2:(while)]# end
   8064
   8706
   9299
   9939
  10277
  10857
  11426
  11731
  12328
  12796
  13096
  13623
   1139
   2712
   4233
    561
   2468
   3948
   5075
   5459
   6114
   6768
   7502
   7815
   8303
   8969
   9602
  10090
  10566
  11194
  11765
  11987
  12678
  12920
  13563
  14136
  14693
   2336
   3652
   4814
   5954
   6449
   6741
   7412
   8036

....Hmm, even that is growing a bit large.  Pfft.  I guess we were doing
less traffic last time I checked this. :)

Maybe you have a bit more traffic than us in normal operation and it's
growing faster because of that.  Still, with a gc_elasticity of 1 it
should be clearing it out very quickly.

...Though I just tried that, and it's not.  In fact, the gc_elasticity
doesn't seem to be making much of a difference at all.  The only thing
that seems to really change it is if I set gc_min_interval to 0:

[sroot@r2:/proc/sys/net/ipv4/route]# echo 0 > gc_min_interval
[sroot@r2:/proc/sys/net/ipv4/route]# while ( 1 )
[sroot@r2:(while)]# sleep 1
[sroot@r2:(while)]# ip -o route show cache | wc -l
[sroot@r2:(while)]# end
   9674
   9547
   9678
   9525
   9625
   9544
   9385
    497
   2579
   3820
   4083
   4099
   4068
   4054
   4089
   4095
   4137
   4072
   4071
   4137
   2141
   3414
   4044
   2487
   3759
   4047
   4085
   4092
   4156
   4089
   4008
    475
   2497
   3729
   4146
   4085
   4116

It seems to regulate it after it gets cleared the first time.  If I set
gc_elasticity to 1 it seems to bounce around a lot more -- 4 is much
smoother.  It didn't seem to make a difference with gc_min_interval set
to 1, though... hmmm.  We've been running normally with gc_min_interval
set to 1, but it looks like the BGP table updates have kept the cache
from growing too large.

> Check what happens when I load up juno..

Yeah... Juno's just going to hit it harder and show the problems with it
having to walk through such large hash buckets.  How big is your routing
table on this box?  Is it running BGP?

> slammed at 100% by the ksoftirqds.  This is using e1000 with interrups
> limited to ~ 4000/second (ITR), no NAPI.. NAPI messes it up big time and
> drops more packets than without :>

Hmm, that's weird.  It works quite well here on a single CPU box with
tg3 cards.

Simon-

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09  8:56                                                     ` David S. Miller
@ 2003-06-09 22:39                                                       ` Robert Olsson
  0 siblings, 0 replies; 227+ messages in thread
From: Robert Olsson @ 2003-06-09 22:39 UTC (permalink / raw)
  To: David S. Miller
  Cc: sim, xerox, hadi, fw, netdev, linux-net, Robert.Olsson, kuznet


David S. Miller writes:

 > BTW, ignoring juno, Robert Olsson has some pktgen hacks that allow
 > that to generate new-dst-per-packet DoS like traffic.  It's much
 > more effective than Juno-z
 > 
 > Robert, could you show these guys your hacks to do that?

Sure.
What a discussion... Well I'm happy for the past lazy days. 


I've included some references in the experiment from last week, and it should
be interesting for people in this discussion.

Summary:
Forwarding experiment with different rates of new incoming destinations/sec. 
Ranging from DoS attack to single destination flow. With full 123k routes.

http://robur.slu.se/Linux/net-development/experiments/router-flow-test.html


Your latest patch looks interesting... good thinking. Operations and tuning
would be simpler. Hope to have time for a test tomorrow. Testing is still
very manual work.

Cheers.

						--ro

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09 22:19                                                           ` Simon Kirby
@ 2003-06-09 22:54                                                             ` Robert Olsson
  2003-06-13  6:21                                                               ` David S. Miller
  2003-06-09 22:56                                                             ` CIT/Paul
  2003-06-10  0:56                                                             ` Ralph Doncaster
  2 siblings, 1 reply; 227+ messages in thread
From: Robert Olsson @ 2003-06-09 22:54 UTC (permalink / raw)
  To: Simon Kirby
  Cc: CIT/Paul, 'David S. Miller', hadi, fw, netdev, linux-net


Simon Kirby writes:

 > [sroot@r2:/root]# rtstat -i 1
 >  size   IN: hit     tot    mc no_rt bcast madst masrc  OUT: hit     tot     mc
 > 870721946     16394    1013     8     4     4     0     0        38      12      0
 > 870722937     16278    1007     8     0    10     0     0        32       6      0

 > ...Hmm, the size is a bit off there.  I'm not sure what that's all about. 

 Seems you have an older version of rtstat. There are stats for the GC process there 
 too.

 You can get recent rtstat from:
 robur.slu.se:/pub/Linux/net-development/rt_cache_stat/rtstat.c

 
 I'm about to propose some stats even for hash spinning.... 

 
--- linux/include/net/route.h.orig      2003-03-24 22:59:53.000000000 +0100
+++ linux/include/net/route.h   2003-05-16 11:04:07.000000000 +0200
@@ -102,6 +102,8 @@
         unsigned int gc_ignored;
         unsigned int gc_goal_miss;
         unsigned int gc_dst_overflow;
+        unsigned int in_hlist_search;
+        unsigned int out_hlist_search;
 };
 
 extern struct rt_cache_stat *rt_cache_stat;
--- linux/net/ipv4/route.c.orig 2003-03-24 23:01:48.000000000 +0100
+++ linux/net/ipv4/route.c      2003-05-16 11:18:54.000000000 +0200
@@ -321,7 +321,7 @@
        for (i = 0; i < NR_CPUS; i++) {
                if (!cpu_possible(i))
                        continue;
-               len += sprintf(buffer+len, "%08x  %08x %08x %08x %08x %08x %08x %08x  %08x %08x %08x %08x %08x %08x %08x \n",
+               len += sprintf(buffer+len, "%08x  %08x %08x %08x %08x %08x %08x %08x  %08x %08x %08x %08x %08x %08x %08x %08x %08x \n",
                               dst_entries,                    
                               per_cpu_ptr(rt_cache_stat, i)->in_hit,
                               per_cpu_ptr(rt_cache_stat, i)->in_slow_tot,
@@ -338,7 +338,9 @@
                               per_cpu_ptr(rt_cache_stat, i)->gc_total,
                               per_cpu_ptr(rt_cache_stat, i)->gc_ignored,
                               per_cpu_ptr(rt_cache_stat, i)->gc_goal_miss,
-                              per_cpu_ptr(rt_cache_stat, i)->gc_dst_overflow
+                              per_cpu_ptr(rt_cache_stat, i)->gc_dst_overflow,
+                              per_cpu_ptr(rt_cache_stat, i)->in_hlist_search,
+                              per_cpu_ptr(rt_cache_stat, i)->out_hlist_search
 
                        );
        }
@@ -1771,6 +1773,7 @@
                        skb->dst = (struct dst_entry*)rth;
                        return 0;
                }
+               RT_CACHE_STAT_INC(in_hlist_search);
        }
        rcu_read_unlock();
 
@@ -2137,6 +2140,7 @@
                        *rp = rth;
                        return 0;
                }
+               RT_CACHE_STAT_INC(out_hlist_search);
        }
        rcu_read_unlock();


Cheers.
						--ro

^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-09 22:19                                                           ` Simon Kirby
  2003-06-09 22:54                                                             ` Robert Olsson
@ 2003-06-09 22:56                                                             ` CIT/Paul
  2003-06-09 23:05                                                               ` David S. Miller
  2003-06-10  0:03                                                               ` Jamal Hadi
  2003-06-10  0:56                                                             ` Ralph Doncaster
  2 siblings, 2 replies; 227+ messages in thread
From: CIT/Paul @ 2003-06-09 22:56 UTC (permalink / raw)
  To: 'Simon Kirby'
  Cc: 'David S. Miller', hadi, fw, netdev, linux-net

NAPI despises SMP.  Any SMP box we run NAPI on has major packet loss
under high load, so I find that the e1000 ITR works just as well
and there is no reason for NAPI at this point.

I will try your settings :)
net.ipv4.route.secret_interval = 600
net.ipv4.route.min_adv_mss = 256
net.ipv4.route.min_pmtu = 552
net.ipv4.route.mtu_expires = 600
net.ipv4.route.gc_elasticity = 4
net.ipv4.route.error_burst = 500
net.ipv4.route.error_cost = 100
net.ipv4.route.redirect_silence = 2048
net.ipv4.route.redirect_number = 9
net.ipv4.route.redirect_load = 2
net.ipv4.route.gc_interval = 600
net.ipv4.route.gc_timeout = 15
net.ipv4.route.gc_min_interval = 0
net.ipv4.route.max_size = 32768
net.ipv4.route.gc_thresh = 2000
net.ipv4.route.max_delay = 10
net.ipv4.route.min_delay = 5

Current settings....
Rtstat output:
size   IN: hit     tot    mc no_rt bcast madst masrc  OUT: hit     tot
mc GC: tot ignored goal_miss ovrf
 2010      9014   14039     0     0     0     0     0         0       6
2   14038       0        49    0
 2008      8675   13999     0     0     0     1     0         1       5
2   13992       0        56    0
 2002      8529   16484     0     0     0     1     0         0       7
2   16483       0        43    0
 2009      8549   15304     0     0     0     0     0         1      10
2   15303       0        55    0
 2007      8491   16118     0     0     0     0     0         0      10
2   16117       0        50    0
 2024      8219   18306     0     0     0     1     0         0       7
2   18309       0        14    0
 2005      8586   15536     0     0     0     0     0         0       9
2   15536       0        42    0
 2007      8804   15797     0     0     0     0     0         0       7
2   15796       0        42    0
 2012      8535   16519     0     0     0     1     0         0       7
2   16518       0        28    0
 2004      8348   15709     0     0     0     0     1         0       8
2   15707       0        42    0
...
 2043      8600   18278     0     0     0     0     0         0      12
2   18285       0        15    0
 2030      8631   17731     0     0     0     1     0         0       9
2   17737       0         7    0
 2002      8489   14653     0     0     0     1     0         2       5
2   14650       0        35    0
 2015      8147   15004     0     0     0     0     0         0       9
2   15003       0        57    0
 2015      8352   17303     0     0     0     2     0         0       8
2   17308       0         7    0
 2025      8451   16768     0     0     0     0     0         0       6
2   16768       0        35    0
 2013      8531   16464     0     0     0     0     0         0      13
2   16476       0         7    0
 2013      8117   15202     0     0     0     1     1         0       7
2   15198       0        35    0
 size   IN: hit     tot    mc no_rt bcast madst masrc  OUT: hit     tot
mc GC: tot ignored goal_miss ovrf
 2019      7913   15054     0     0     0     1     0         0       9
2   15057       0        42    0
 2008      8258   16019     0     0     0     0     0         1       9
2   16020       0        43    0
 2025      8211   17897     0     0     0     1     0         0       5
2   17902       0         0    0

CPU NORMAL:
CPU0 states: 36.0% user, 29.0% system,  0.0% nice, 33.0% idle
CPU1 states: 18.0% user, 61.0% system,  0.0% nice, 19.0% idle
CPU0 states: 21.0% user, 44.0% system,  0.0% nice, 35.0% idle
CPU1 states: 18.0% user, 47.0% system,  0.0% nice, 35.0% idle
    3 root      10  -1     0    0     0 SW<   0.0  0.0  35:29
ksoftirqd_CPU0
    4 root      10  -1     0    0     0 SW<   0.0  0.0  35:35
ksoftirqd_CPU1
 

Rtstat under light juno:

 2315      7955   51691     0     0     0     1     1         1       5
1   51695       0         0    0
 2336      6620   47387     0     0     0     1     0         1       5
1   47393       0         0    0
 2371      5630   49726     0     0     0     0     0         1      12
2   49737       0         0    0
 2372      5420   53458     0     0     0     1     0         0       2
1   53460       0         0    0
 2369      4891   48983     0     0     0     0     0         1       5
2   48988       0         0    0
 2389      4529   50525     0     0     0     0     1         1       8
1   50532       0         0    0
 2334      4645   49092     0     0     1     1     0         0       1
1   49093       0         0    0
 2358      5033   48971     0     0     0     1     0         1       6
2   48977       0         0    0
 2366      4864   51411     0     0     0     2     0         1       8
1   51419       0         0    0
 2370      5035   49444     0     0     0     0     0         0       4
2   49448       0         0    0
 size   IN: hit     tot    mc no_rt bcast madst masrc  OUT: hit     tot
mc GC: tot ignored goal_miss ovrf
 2391      5328   49098     0     0     0     1     0         3      12
3   49110       0         0    0
 2363      5586   50687     0     0     0     2     0         0       7
1   50693       0         0    0
 2361      4571   49243     0     0     0     0     0         0       2
1   49243       0         0    0
 2356      5758   56664     0     0     1     1     0         1       5
1   56666       0         0    0
 2375      5581   62098     0     0     0     2     0         0       8
2   62103       0         0    0
 2393      3895   50762     0     0     0     1     0         0       5
0   50764       0         0    0
 2335      4066   56659     0     0     0     1     0         0      10
2   56667       0         0    0
 2315      3607   49990     0     0     0     1     0         0       4
1   49992       0         0    0
 2339      4369   54149     0     0     0     1     0         0       7
1   54153       0         0    0


CPU under JUNO:
CPU0 states:  0.0% user, 99.3% system,  0.2% nice,  0.0% idle
CPU1 states:  0.2% user, 99.3% system,  0.1% nice,  0.0% idle

    4 root      14  -1     0    0     0 SW<  21.0  0.0  35:33
ksoftirqd_CPU1
    3 root      15  -1     0    0     0 SW<  20.1  0.0  35:27
ksoftirqd_CPU0
  

This is 10 Mbit of juno... or around 9.6 or so.

RTS normal with 8000 thresh:
 size   IN: hit     tot    mc no_rt bcast madst masrc  OUT: hit     tot
mc GC: tot ignored goal_miss ovrf
 8003     11474    9076     0     0     0     2     0         0       4
2    9071       0        10    0
 8010     11425    9205     0     0     0     0     0         0       7
2    9203       0        14    0
 8006     11393   12516     0     0     0     1     0         4       5
0   12509       0        20    0
 8005     12082    9188     0     0     0     2     0         0       5
2    9184       0        14    0
 8004     11447    8893     0     0     0     0     0         0       8
2    8890       0        12    0
 8004     12346    8898     0     0     0     1     0         2       5
2    8891       0        10    0
 8003     11557    8944     0     0     0     2     0         1       7
1    8942       0        14    0
 8004     12812    9890     0     0     0     0     0         1       5
1    9878       0        16    0
 8004     12166   11363     0     0     0     1     0         2       3
2   11349       0        23    0
 8012     11933    8881     0     0     0     2     0         0       6
2    8874       0        15    0
 8003     11938    9024     0     0     0     0     0         1       5
1    9017       0        12    0
 8003     12107    8682     0     0     0     1     0         2       3
2    8674       0        13    0
 8008     11328    8945     0     0     0     1     0         2       6
1    8942       0        10    0


CPU:
CPU0 states:  0.0% user, 50.0% system,  0.0% nice, 49.0% idle
CPU1 states:  1.0% user, 57.0% system,  0.0% nice, 40.0% idle

CPU0 states:  0.0% user, 27.0% system,  0.0% nice, 72.0% idle
CPU1 states:  0.0% user, 41.0% system,  0.0% nice, 58.0% idle

  3 root      12  -1     0    0     0 SW<   0.0  0.0  35:29
ksoftirqd_CPU0
    4 root       9  -1     0    0     0 SW<   0.0  0.0  35:35
ksoftirqd_CPU1
  


I've mucked with tons of settings.  I've even had the route cache up
to over 600,000 entries and the CPU still had room left for more.
It can't possibly be the size of the cache; it simply has to be the
constant creation and teardown of entries.  I can't hit anywhere NEAR
100 kpps on this router with the amount of load on it.

The routing table:

ip ro ls | wc
    516    2598   21032

It doesn't have too much in it.  It's running BGP but I'm not taking the
full routes right now; we will later, though.

There are some ip rules, and also some netfilter rules:

 iptables-save | wc
   1154    7658   46126

Of course there aren't 1154 entries, because some of that is the chains
and such, but there are a lot of rules in netfilter.  Everything
seems to slow it down :/ especially the mangle table.  If I add 1000
entries to the mangle table in netfilter it uses massive CPU.
Netfilter seems to be a hog.
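
As a rough back-of-the-envelope check (assuming every packet traverses
the whole chain of rules): at, say, 50 kpps, 1000 linearly evaluated
mangle rules works out to roughly 1000 * 50,000 = 50 million rule
comparisons per second, before the routing decision is even made, so a
big mangle table eating CPU is exactly what you'd expect.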


Like I said, I've tested this with NO netfilter and nothing else on a
test box except the kernel and an e1000 with ITR set to ~4000, with all
sorts of changes to the settings, and I still can't hit 100 kpps routing
with juno-z.


Paul xerox@foonet.net http://www.httpd.net


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09 22:56                                                             ` CIT/Paul
@ 2003-06-09 23:05                                                               ` David S. Miller
  2003-06-10 13:41                                                                 ` Robert Olsson
  2003-06-10  0:03                                                               ` Jamal Hadi
  1 sibling, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-09 23:05 UTC (permalink / raw)
  To: xerox; +Cc: sim, hadi, fw, netdev, linux-net

   From: "CIT/Paul" <xerox@foonet.net>
   Date: Mon, 9 Jun 2003 18:56:18 -0400

   And there is no reason for NAPI at this point. 
   
Intel's ITR gives you high latency; NAPI is far superior
to any hardware-based interrupt mitigation scheme whatsoever.

You have some system specific problem with NAPI and we need
to analyze that.

   I've mucked with TONNnss of settings.. I've even had the route-cache up
   to over 600,000 entries and the CPU still has room left for more..
   It can't possibly be the size of the cache,

You are letting your hash chains reach the size of "max_size" divided
by the number of hash chains.

This means that every packet into your machine has to walk a
hash chain of that length.
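
To put a number on it (assuming a hash of 65536 buckets here, which may
not match this particular machine): max_size = 512000 over 65536 chains
is an average of about 8 entries per chain, versus roughly 1 per chain
at max_size = 65536, and every lookup and every GC pass pays for that
chain length.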

You can keep doing some shaman's dance saying that the size you
have chosen doesn't matter, but the people who have written
this code and work with it every day know that it does.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-09 22:56                                                             ` CIT/Paul
  2003-06-09 23:05                                                               ` David S. Miller
@ 2003-06-10  0:03                                                               ` Jamal Hadi
  2003-06-10  0:32                                                                 ` Ralph Doncaster
  1 sibling, 1 reply; 227+ messages in thread
From: Jamal Hadi @ 2003-06-10  0:03 UTC (permalink / raw)
  To: CIT/Paul
  Cc: 'Simon Kirby', 'David S. Miller', fw, netdev, linux-net



On Mon, 9 Jun 2003, CIT/Paul wrote:

> NAPI despises SMP.. Any SMP box we run NAPI on has major packet loss
> under high load.. So I find that the e1000 ITR works just as well
> And there is no reason for NAPI at this point.
>

Foo, you on cheap crack again?
Please just try the tests as described if you want to help.  It doesn't help
anyone when you wildly wave your hands like that.

Why don't we take you offline - give me access to your machine; I have a
couple of hours to kill.

cheers,
jamal


^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-10  0:03                                                               ` Jamal Hadi
@ 2003-06-10  0:32                                                                 ` Ralph Doncaster
  2003-06-10  1:15                                                                   ` Jamal Hadi
                                                                                     ` (2 more replies)
  0 siblings, 3 replies; 227+ messages in thread
From: Ralph Doncaster @ 2003-06-10  0:32 UTC (permalink / raw)
  To: Jamal Hadi
  Cc: CIT/Paul, 'Simon Kirby', 'David S. Miller',
	fw, netdev, linux-net

On Mon, 9 Jun 2003, Jamal Hadi wrote:

> On Mon, 9 Jun 2003, CIT/Paul wrote:
>
> > NAPI despises SMP.. Any SMP box we run NAPI on has major packet loss
> > under high load.. So I find that the e1000 ITR works just as well
> > And there is no reason for NAPI at this point.
> >
>
> Foo, you on cheap crack again?
> Please just try the tests as described if you want to help. It doesnt help
> anyone when you wildly wave your hands like that.

From personal experience, after trying numerous things for over a year one
can get very frustrated.  Although your contribution has been useful, you
are also guilty of wildly waving your hands around too.  Many moons ago
when I lamented that my 2.2.19 kernel, 750Mhz duron, 3c59x core router
performance sucked you told me NAPI would solve the performance problems.
It didn't.  And Rob's latest numbers seem to show that even with the
latest and greatest patches 148kpps is still a dream.  It's good to see
that people are finally doing tests to simulate real-world routing
(instead of just pretending the problem doesn't exist because they were
able to get 148kpps in some contrived test).

Here's my CPU graphs for the box; it's only doing routing and firewalling
isn't even built into the kernel (2.4.20 with 3c59x NAPI patches)
http://66.11.168.198/mrtg/tbgp/tbgp_usrsys.html

eth1 and eth2 are both sending and receiving ~30mbps of traffic (at
8-10kpps in and out on each interface).

The other variable that I haven't seen people discuss but have anecdotal
evidence will measurably impact performance is the motherboard used
(chipset and chipset configuration/timing).

Lastly from the software side Linux doesn't seem to have anything like
BSD's parameter to control user/system CPU sharing.  Once my CPU load
reaches 70-80%, I'd rather have some dropped packets than let the CPU hit
100% and end up with my BGP sessions dropping.

-Ralph

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09 22:19                                                           ` Simon Kirby
  2003-06-09 22:54                                                             ` Robert Olsson
  2003-06-09 22:56                                                             ` CIT/Paul
@ 2003-06-10  0:56                                                             ` Ralph Doncaster
  2 siblings, 0 replies; 227+ messages in thread
From: Ralph Doncaster @ 2003-06-10  0:56 UTC (permalink / raw)
  To: Simon Kirby
  Cc: CIT/Paul, 'David S. Miller', hadi, fw, netdev, linux-net

On Mon, 9 Jun 2003, Simon Kirby wrote:

> [sroot@r2:/root]# while (1)
> [sroot@r2:(while)]# sleep 1
> [sroot@r2:(while)]# ip -o route show cache | wc -l
> [sroot@r2:(while)]# end

I considered doing the same test on my box, but I don't have enough juice
left to do it every second:
root@tor-router# time ip -o route show cache | wc -l
  15023
real    0m1.563s
user    0m0.380s
sys     0m1.180s

So instead...

root@tor-router# while (true); do sleep 5; ip -o route show cache | wc -l;
done
  12630
  15659
  17951
  20733
   8875
   9282
  11913
   4216
   9437
  11973
  14503
  17088

-Ralph


^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-10  0:32                                                                 ` Ralph Doncaster
@ 2003-06-10  1:15                                                                   ` Jamal Hadi
  2003-06-10  2:45                                                                     ` Ralph Doncaster
                                                                                       ` (2 more replies)
  2003-06-10  1:53                                                                   ` Simon Kirby
  2003-06-10 15:49                                                                   ` David S. Miller
  2 siblings, 3 replies; 227+ messages in thread
From: Jamal Hadi @ 2003-06-10  1:15 UTC (permalink / raw)
  To: ralph+d
  Cc: CIT/Paul, 'Simon Kirby', 'David S. Miller',
	fw, netdev, linux-net



On Mon, 9 Jun 2003, Ralph Doncaster wrote:

> From personal experience, after trying numerous things for over a year one
> can get very frustrated.  Although your contribution has been useful, you
> are also guilty of wildly waving your hands around too.  Many moons ago
> when I lamented that my 2.2.19 kernel, 750Mhz duron, 3c59x core router
> performance sucked you told me NAPI would solve the performance problems.
> It didn't.  And Rob's latest numbers seem to show that even with the
> latest and greatest patches 148kpps is still a dream.  It's good to see
> that people are finally doing tests to simulate real-world routing
> (instead of just pretending the problem doesn't exist because they were
> able to get 148kpps in some contrived test).
>

I am not sure that foo's tests are not contrived ;->
The man just hammers away at his routers with DoS tools ;->
I feel like a shrink calming him down to stop doing that. hehe.

I am actually not against using the DoS tools, because they test the worst
case.
However, to solve a problem you need first to isolate it and
methodically squash the cockroaches.  For example, in 2.2.x you
wouldn't even see the problems that we have today because we had bigger
problems, namely interrupt issues.  NAPI resolves that.  When I told you
that, I was basing it on facts.
We are now exposed to dst cache problems.  Dave's patches isolate and
resolve what's causing all this noise.  First it was the cache distribution,
which is now resolved.  Next it is garbage collection, which it seems to
me is being resolved.  When someone working as hard as Dave is putting
out these fires, we need to help him.  If he tells foo to run a specific
test then that's what he should run ...  I don't think we should just add
Cisco's CEF just because someone thinks it works better.  We need to
systematically isolate and fix.
For example, just turning on netfilter is polluting the results.

Problem is, people disappear real quick when asked to run tests that
could validate certain concepts.  I wish everyone would emulate S. Kirby:
he actually gives good info.

> Here's my CPU graphs for the box; it's only doing routing and firewalling
> isn't even built into the kernel (2.4.20 with 3c59x NAPI patches)
> http://66.11.168.198/mrtg/tbgp/tbgp_usrsys.html
>
> eth1 and eth2 are both sending and receiving ~30mbps of traffic (at
> 8-10kpps in and out on each interface).
>

Is this still the 750 MHz Duron?  Are you running zebra?  Did you
check out some of the ideas I talked about earlier?

> The other variable that I haven't seen people discuss but have anecdotal
> evidence will measurably impact performance is the motherboard used
> (chipset and chipset configuration/timing).
>

Robert has a good collection of what is good hardware.  I am so outdated
I don't keep track anymore.  My fastest machine is still an Asus dual
450 MHz.

> Lastly from the software side Linux doesn't seem to have anything like
> BSD's parameter to control user/system CPU sharing.  Once my CPU load
> reaches 70-80%, I'd rather have some dropped packets than let the CPU hit
> 100% and end up with my BGP sessions drop.
>

Well, here's a good example: with NAPI, have your sessions been dropped?
Have you tried a different NIC?  Not sure how well the 3com is maintained,
for example.
Try a tulip or tg3 or e1000 or the dlink gige.

cheers,
jamal

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10  0:32                                                                 ` Ralph Doncaster
  2003-06-10  1:15                                                                   ` Jamal Hadi
@ 2003-06-10  1:53                                                                   ` Simon Kirby
  2003-06-10  3:18                                                                     ` Ralph Doncaster
  2003-06-10 15:56                                                                     ` David S. Miller
  2003-06-10 15:49                                                                   ` David S. Miller
  2 siblings, 2 replies; 227+ messages in thread
From: Simon Kirby @ 2003-06-10  1:53 UTC (permalink / raw)
  To: ralph+d
  Cc: Jamal Hadi, CIT/Paul, 'David S. Miller', fw, netdev, linux-net

On Mon, Jun 09, 2003 at 08:32:48PM -0400, Ralph Doncaster wrote:

> Here's my CPU graphs for the box; it's only doing routing and firewalling
> isn't even built into the kernel (2.4.20 with 3c59x NAPI patches)
> http://66.11.168.198/mrtg/tbgp/tbgp_usrsys.html
> 
> eth1 and eth2 are both sending and receiving ~30mbps of traffic (at
> 8-10kpps in and out on each interface).

Interesting!  Your CPU use is quite a bit higher than ours.  It looks
like we have fairly similar network configurations.  We're advertising a
/24 and a /20 of which about 60% of the IPs are in use.  Each router
forwards about 60 Mbit/second (16 kpps) during the day, and the CPU load
is usually around 18-25%.  This is with a single CPU, though I
accidentally compiled the kernel SMP.

I had forgotten to add CPU utilization to the cricket graphs, so I'll
have a better idea from now on, but I've never seen it above 30% (from
"vmstat 1") except in attack cases.  The difference is probably just the
fact that this is running on slightly faster hardware (single Athlon
1800MP, Tyan Tiger MPX board).

> Lastly from the software side Linux doesn't seem to have anything like
> BSD's parameter to control user/system CPU sharing.  Once my CPU load
> reaches 70-80%, I'd rather have some dropped packets than let the CPU hit
> 100% and end up with my BGP sessions drop.

Hmm.  I found that once NAPI was happening, userspace seemed to get a
fairly decent amount of time.  I'm not exactly sure what the settings
are, but I was able to run things through SSH quite easily (not without
noticeable slowness, though).  Actually, the slowness appeared to be
mostly the result of incoming packet drops ("vmstat 1" output where it
was _sending_ data and getting the ACKs some time later was perfectly
smooth).

We just set up a dual Opteron box today with dual onboard Tigon3s, so
I'll see if I can do some profiling.  I hooked it via crossover to
a Xeon 2.4 GHz box with onboard e1000, so I should be able to do some
remote profiling tonight.

Simon-

^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-10  1:15                                                                   ` Jamal Hadi
@ 2003-06-10  2:45                                                                     ` Ralph Doncaster
  2003-06-10  3:23                                                                       ` Ben Greear
                                                                                         ` (2 more replies)
  2003-06-10 15:53                                                                     ` Route cache performance under stress David S. Miller
  2003-06-11 17:52                                                                     ` Route cache performance under stress Robert Olsson
  2 siblings, 3 replies; 227+ messages in thread
From: Ralph Doncaster @ 2003-06-10  2:45 UTC (permalink / raw)
  To: Jamal Hadi
  Cc: CIT/Paul, 'Simon Kirby', 'David S. Miller',
	fw, netdev, linux-net

On Mon, 9 Jun 2003, Jamal Hadi wrote:

> Problem is people disappear real quick when asked to run tests that
> could validate certain concepts. I wish everyone would emulate S Kirby
> he actually gives good info.

The test results Rob posted today show that the testing can be done in a
lab environment.  Most of the people I know that would actually see 50kpps
in the real world don't have the time to apply various patches and run a
bunch of tests; pretending the problem doesn't exist because someone doesn't
run tests to prove it is a poor excuse.

> > Here's my CPU graphs for the box; it's only doing routing and firewalling
> > isn't even built into the kernel (2.4.20 with 3c59x NAPI patches)
> > http://66.11.168.198/mrtg/tbgp/tbgp_usrsys.html
> >
> > eth1 and eth2 are both sending and receiving ~30mbps of traffic (at
> > 8-10kpps in and out on each interface).
>
> Is this still the duron 750Mhz? Are you running zebra? Did you
> check out some of the ideas i talked about earlier?

Yup, still a duron 750 on an Asus mobo (Via chipset).  Running Zebra
0.93b.  If the ideas you're referring to are changing the zebra source to
arp the next-hops, then no, I haven't tried it (and am not likely to any
time soon).

> Robert has a good collection for what is good hardware. I am so outdated
> i dont keep track anymore. My fastest machine is still an ASuse dual
> 450Mhz.

There's still more dead-end suggestions than good ones (i.e. the
O'Reilley high performance routing book).

> > Lastly from the software side Linux doesn't seem to have anything like
> > BSD's parameter to control user/system CPU sharing.  Once my CPU load
> > reaches 70-80%, I'd rather have some dropped packets than let the CPU hit
> > 100% and end up with my BGP sessions drop.
> >
>
> Well, heres a good example: With NAPI, have your sessions been dropped?
Yup, twice in the last 2 weeks.

> Have you tried a different NIC? Not sure how well the 3com is maintained
> for example.
> Try a tulip or tg3 or e1000 or the dlink gige.

Initially I was looking for tulip cards but almost nobody is producing
them any more.  Almost a year ago I came across the following list, which
is why I went with the 3com (at the time it indicated rx/tx irqmit for the
3com, until I emailed the author that I found out it was tx only)
http://www.fefe.de/linuxeth/

I had joined the vortex list last fall looking for some tips and that
didn't help much (other than telling me that the 3com wasn't the best
choice).  I've since bought a couple tg3 and a bunch of e1000 cards that
I'm planning to put into production.

Rob's test results seem to show that even if I replace my 3c905cx cards
with e1000's I'll still get killed with a 50kpps synflood with my current
CPU.  Upgrading to dual 2Ghz CPUs is not a preferred solution since I
can't do that in a 1U rack-mount box.  Yeah, I could probably do it with
water cooling, but that's not an option in a telco hotel like 151 Front
St. (Toronto).

A couple weeks ago I got one of my techs to test freeBSD/polling with full
routing tables on a 1Ghz celeron and 2 e1000 cards.  His testing seems to
suggest it will handle a 50kpps synflood DOS.  It would be nice if Linux
could do the same.

Despite the BSD bashing (to be expected on a Linux list, I guess), I will
be using BSD as well as Linux for core routing.  The plan is 1 linux
router and 1 bsd router each running zebra, connected to separate upstream
transit providers, running ibgp between them, and both advertising a
default route into OSPF.  Then if I get hit with a DOS that kills Linux,
the BSD box will have a much better chance of staying up than if I just
used a second Linux box for redundancy.  If the BSD boxes turn out to have
twice the performance of the linux boxes, it may be better for me to dump
linux for routing altogether. :-(

-Ralph

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-08 13:10                                     ` Florian Weimer
  2003-06-08 23:49                                       ` Simon Kirby
@ 2003-06-10  3:05                                       ` Steven Blake
  2003-06-12  6:31                                         ` David S. Miller
  1 sibling, 1 reply; 227+ messages in thread
From: Steven Blake @ 2003-06-10  3:05 UTC (permalink / raw)
  To: Florian Weimer; +Cc: netdev, linux-net

On Sun, 2003-06-08 at 09:10, Florian Weimer wrote:

> "David S. Miller" <davem@redhat.com> writes:
> 
> > Although, I hope it's not "too similar" to what CEF does because
> > undoubtedly Cisco has a bazillion patents in this area.
> 
> Most things in this area are patented, and the patents are extremely
> fuzzy (e.g. policy-based routing with hierarchical sequence of
> decisions has been patented countless times). 8-(
> 
> > This is actually an argument for coming up with out own algorithms
> > without any knowledge of what CEF does or might do. :(
> 
> The branchless variant is not described in the IOS book, and I can't
> tell if Cisco routers use it.  If this idea is really novel, we are in
> pretty good shape because we no longer use trees, tries or whatever,
> but a DFA. 8-)

Based on my quick reading of your code sample, I think you have just
reinvented multibit trees; in your case with a fixed stride of 8 bits. 

> Further parameters which could be tweaked is the kind of adjacency
> information (where to store the L2 information, whether to include the
> prefix length in the adjacency record etc.).

If you are curious, or just have a lot of time on your hands, you might
find the following set of references useful:

http://www.petri-meat.com/slblake/networking/refs/lpm_pkt-class/

IMHO, the best LPM algorithm (in terms of balancing lookup speed vs.
memory consumption vs. update rate) is CRT, described in the first paper
[ASIK].  It is patented, but there is hope that it might get released
under GPL in the near future.


Regards,

// Steve


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10  1:53                                                                   ` Simon Kirby
@ 2003-06-10  3:18                                                                     ` Ralph Doncaster
  2003-06-10 16:06                                                                       ` David S. Miller
  2003-06-10 15:56                                                                     ` David S. Miller
  1 sibling, 1 reply; 227+ messages in thread
From: Ralph Doncaster @ 2003-06-10  3:18 UTC (permalink / raw)
  To: Simon Kirby
  Cc: Jamal Hadi, CIT/Paul, 'David S. Miller', fw, netdev, linux-net

On Mon, 9 Jun 2003, Simon Kirby wrote:

> "vmstat 1") except in attack cases.  The difference is probably just the
> fact that this is running on slightly faster hardware (single Athlon
> 1800MP, Tyan Tiger MPX board).

What happened to Linux users being able to brag about how much they could
do with CPUs that were useless for running Windows?  On a 1Ghz CPU you've
got almost 7,000 cycles to route a packet in order to handle 148kpps.  I
can't see why the slow path should be more than 2,000 cycles.

I know some people's attitude is don't talk if you're not going to write
the code.  If I had the time I would; from my earliest days of programming
I've been optimizing performance to the maximum.  I can still remember
using page 0 on my c64 to store an 8-bit register in 3 cycles instead of
four...

So to put a stake in the ground, I'd like to see a 1Ghz celeron with e1000
cards handle 148kpps of DOS traffic at <50% CPU utilization (with full
routing tables & no firewalling).  If that's not a reasonable expectation,
someone please let me know.  Even if my time was only worth $500/day, in
the past year and a half I spent enough time working on Linux routers to
buy a Cisco NPE-G1. :-(

-Ralph

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10  2:45                                                                     ` Ralph Doncaster
@ 2003-06-10  3:23                                                                       ` Ben Greear
  2003-06-10  3:41                                                                         ` Ralph Doncaster
  2003-06-10 18:10                                                                         ` Ralph Doncaster
  2003-06-10  4:34                                                                       ` Simon Kirby
  2003-06-10 10:53                                                                       ` Jamal Hadi
  2 siblings, 2 replies; 227+ messages in thread
From: Ben Greear @ 2003-06-10  3:23 UTC (permalink / raw)
  To: ralph+d; +Cc: 'netdev@oss.sgi.com'

Ralph Doncaster wrote:

> Initially I was looking for tulip cards but almost nobody is producing
> them any more.  Almost a year ago I came across the following list, which
> is why I went with the 3com (at the time it indicated rx/tx irqmit for the
> 3com, until I emailed the author that I found out it was tx only)
> http://www.fefe.de/linuxeth/

If you want 4-port tulip NICs, I've had decent luck with the Phobox p430tx
NICs ($350 or so per NIC, so not cheap).  That said, the e1000s are definately
better as far as my own testing has been concerned.  (I'm doing packet pushing
& reception, no significant routing, though).

One waring about e1000's, make sure you have active airflow across the NICs
if you put two together.  Otherwise, buy a dual port NIC...it has a single
chip and you will have less cooling issues.

Ben


> 
> I had joined the vortex list last fall looking for some tips and that
> didn't help much (other than telling me that the 3com wasn't the best
> choice).  I've since bought a couple tg3 and a bunch of e1000 cards that
> I'm planning to put into production.
> 
> Rob's test results seem to show that even if I replace my 3c905cx cards
> with e1000's I'll still get killed with a 50kpps synflood with my current
> CPU.  Upgrading to dual 2Ghz CPUs is not a preferred solution since I
> can't do that in a 1U rack-mount box.  Yeah, I could probably do it with
> water cooling, but that's not an option in a telco hotel like 151 Front
> St. (Toronto).
> 
> A couple weeks ago I got one of my techs to test freeBSD/polling with full
> routing tables on a 1Ghz celeron and 2 e1000 cards.  His testing seems to
> suggest it will handle a 50kpps synflood DOS.  It would be nice if Linux
> could do the same.
> 
> Despite the BSD bashing (to be expected on a Linux list, I guess), I will
> be using BSD as well as Linux for core routing.  The plan is 1 linux
> router and 1 bsd router each running zebra, connected to separate upstream
> transit providers, running ibgp between them, and both advertising a
> default route into OSPF.  Then if I get hit with a DOS that kills Linux,
> the BSD box will have a much better chance of staying up than if I just
> used a second Linux box for redundancy.  If the BSD boxes turn out to have
> twice the performance of the linux boxes, it may be better for me to dump
> linux for routing altogether. :-(
> 
> -Ralph
> 


-- 
Ben Greear <greearb@candelatech.com>       <Ben_Greear AT excite.com>
President of Candela Technologies Inc      http://www.candelatech.com
ScryMUD:  http://scry.wanfear.com     http://scry.wanfear.com/~greear

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10  3:23                                                                       ` Ben Greear
@ 2003-06-10  3:41                                                                         ` Ralph Doncaster
  2003-06-10 18:10                                                                         ` Ralph Doncaster
  1 sibling, 0 replies; 227+ messages in thread
From: Ralph Doncaster @ 2003-06-10  3:41 UTC (permalink / raw)
  To: Ben Greear; +Cc: 'netdev@oss.sgi.com'

On Mon, 9 Jun 2003, Ben Greear wrote:

> One waring about e1000's, make sure you have active airflow across the NICs
> if you put two together.  Otherwise, buy a dual port NIC...it has a single
> chip and you will have less cooling issues.

I liked how easy the e1000's are to come by; even more so than the 3com
cards.  Intel seems to be grabbing market share by agressive pricing
(bought 4 last week for C$50 ea), so almost every computer equipment
distributor carries the intel cards.

Since I already have the single-port cards, I guess I'll install them with
a couple empty PCI slots between them to help with the cooling.

-Ralph

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10  2:45                                                                     ` Ralph Doncaster
  2003-06-10  3:23                                                                       ` Ben Greear
@ 2003-06-10  4:34                                                                       ` Simon Kirby
  2003-06-10 11:01                                                                         ` Jamal Hadi
                                                                                           ` (2 more replies)
  2003-06-10 10:53                                                                       ` Jamal Hadi
  2 siblings, 3 replies; 227+ messages in thread
From: Simon Kirby @ 2003-06-10  4:34 UTC (permalink / raw)
  To: ralph+d; +Cc: netdev, linux-net

On Mon, Jun 09, 2003 at 11:18:45PM -0400, Ralph Doncaster wrote:

> What happened to Linux users being able to brag about how much they could
> do with CPUs that were useless for running Windows?  On a 1Ghz CPU you've
> got almost 7,000 cycles to route a packet in order to handle 148kpps.  I
> can't see why the slow path should be more than 2,000 cycles.

We're still here.  I want the code to be fast and efficient as much as
you do.  I'd be willing to bet that a lot of this will get fixed now,
though.  Broken parts of the code only get fixed if enough people whine
or especially if somebody decides to actually fix it.  My guess is that
the "using Linux as an Internet router with more than 10 Mbit/sec of
bandwidth" user base is relatively small.

> I know some people's attitude is don't talk if you're not going to write
> the code.  If I had the time I would; from my earliest days of programming
> I've been optimizing performance to the maximum.  I can still remember
> using page 0 on my c64 to store an 8-bit register in 3 cycles instead of
> four...

I wrote an entire game in TASM once. :)

> So to put a stake in the ground, I'd like to see a 1Ghz celeron with e1000
> cards handle 148kpps of DOS traffic at <50% CPU utilization (with full
> routing tables & no firewalling).

Sounds reasonable.  The routing table size issue has now been eliminated,
so that should make no difference to the equation.

> If that's not a reasonable expectation, someone please let me know. 
> Even if my time was only worth $500/day, in the past year and a half I
> spent enough time working on Linux routers to buy a Cisco NPE-G1. :-(

But in the end you'll end up with a system that you'll know the inner
workings of and that will be open source, maintainable, scalable, easy to
replicate, and easy to upgrade.  And it'll have tcpdump, damn it. :)

On Mon, Jun 09, 2003 at 10:45:29PM -0400, Ralph Doncaster wrote:

> A couple weeks ago I got one of my techs to test freeBSD/polling with full
> routing tables on a 1Ghz celeron and 2 e1000 cards.  His testing seems to
> suggest it will handle a 50kpps synflood DOS.  It would be nice if Linux
> could do the same.

I was going to ask before, and it's probably not even possible anymore,
but have you tried on a 2.0 kernel before?  2.0 kernels probably have a
lot of other problems and don't support the new hardware, but it would be
interesting to see how it scales to many srcs/dsts before the route cache
was integrated.  It probably scales a lot more like FreeBSD does.  You'd
probably have to use eepro100s or something, though.

> Despite the BSD bashing (to be expected on a Linux list, I guess), I will
> be using BSD as well as Linux for core routing.  The plan is 1 linux
> router and 1 bsd router each running zebra, connected to separate upstream
> transit providers, running ibgp between them, and both advertising a
> default route into OSPF.  Then if I get hit with a DOS that kills Linux,
> the BSD box will have a much better chance of staying up than if I just
> used a second Linux box for redundancy.

Good idea.  Others have also suggested using Zebra on one and another of
the BGP routing daemons on another to avoid routing-daemon-specific DoS
issues (or accidental remote crash bugs).

Anyway, the performance issues should be fixable.  It is going to take
some work, but there seem to be some interested people.  I'm going to try
to set up something that will allow for easy comparisons of patches so
that we can measure progress, and perhaps reach an eventual goal.

Simon-

^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-10  2:45                                                                     ` Ralph Doncaster
  2003-06-10  3:23                                                                       ` Ben Greear
  2003-06-10  4:34                                                                       ` Simon Kirby
@ 2003-06-10 10:53                                                                       ` Jamal Hadi
  2003-06-10 11:41                                                                         ` chas williams
                                                                                           ` (3 more replies)
  2 siblings, 4 replies; 227+ messages in thread
From: Jamal Hadi @ 2003-06-10 10:53 UTC (permalink / raw)
  To: ralph+d
  Cc: CIT/Paul, 'Simon Kirby', 'David S. Miller',
	fw, netdev, linux-net



On Mon, 9 Jun 2003, Ralph Doncaster wrote:

> On Mon, 9 Jun 2003, Jamal Hadi wrote:
>
> The test results Rob posted today show that the testing can be done in a
> lab environment.

I thought you were saying those were _not_ real world traffic patterns.
Robert is just doing a worst case scenario testing. What would be useful
is we actually test on real environments or maybe even collect real
world traffic patterns and run them in the lab.
Typically, real world is less intense than the lab. Ex: noone sends
100Mbps at 64 byte packet size. Typical packet is around 500 bytes
average. If linux can handle that forwarding capacity, it should easily
be doing close to Gige real world capacity.
Have you seen how the big boys advertise? when tuning specs they talk
about bits/sec. Juniper just announced a blade at supercom that can do
firewalling at 500Mbps.

> Most of the people I know that would actually see 50kpps
> in the real world don't have the time to apply various patches and run a

Now thats one big dilema, isnt it? Do you think i have time? Let me
assure you that I dont get paid by anybody to do any of this stuff.
Infact i havent been paid to do any of this stuff since 1994. Thats a lot
of man hours in corporate speak.
The point i am making is as a community we gotta put the hours together;
the coder, the user etc. As someone who is not maintaining anything
(lucky bastard that i am, my name is not even in the credits file - by
choice) so i have the luxury to  disappear once in a while. Imagine Davems
reaction to a message like the above.

> bunch of tests; pretending the problem doesn't exist when someone doesn't
> run tests to prove is a poor excuse.
>

I think you _may_ be right theres a problem. However, as a defensive
mechanism it is easier to tell someone to go away and come back with
solid data. For example, you CPU graphs are very strange: Theres a few
hundred variables that may be involved.
I have spent many hours investigating peoples problems sshing to their
machines only to find out they didnt follow instructions. After the
10th person doing the same thing, what do you expect my reaction to be?
Please see the view from this side as well because it is almost
a thankless task.

> Yup, still a duron 750 on an Asus mobo (Via chipset).  Running Zebra
> 0.93b.  If the ideas you're referring to are changing the zebra source to
> arp the next-nops, then no, I haven't tried it (and am not likely to any
> time soon).
>

I think you may be suffering from the "too low" traffic NAPI syndrome.
Under low traffic (1-2 Mbps) on lower end machines NAPI will consume
more CPU because of an extra PCI operation per packet that is performed.
As for the zebra thing, if you post my message to the Zebra list i am sure
someone will be excited enough to do it. I need a few hours to do it
but like you i dont have much time.

> > Robert has a good collection for what is good hardware. I am so outdated
> > i dont keep track anymore. My fastest machine is still an ASuse dual
> > 450Mhz.
>
> There's still more dead-end suggestions than good ones (i.e. the
> O'Reilley high performance routing book).
>

URL?

> > Well, heres a good example: With NAPI, have your sessions been dropped?
> Yup, twice in the last 2 weeks.
>

I have seen NAPI slow down throughput because of an intensive user space
app.

> > Have you tried a different NIC? Not sure how well the 3com is maintained
> > for example.
> > Try a tulip or tg3 or e1000 or the dlink gige.
>
> Initially I was looking for tulip cards but almost nobody is producing
> them any more.  Almost a year ago I came across the following list, which

Thats not true. You could buy them off znyx. Yes, intel has EOLed the
chips so i dont think Znyx will be doing this for much longer.
Get yourself the giges instead.

> is why I went with the 3com (at the time it indicated rx/tx irqmit for the
> 3com, until I emailed the author that I found out it was tx only)
> http://www.fefe.de/linuxeth/
>
> I had joined the vortex list last fall looking for some tips and that
> didn't help much (other than telling me that the 3com wasn't the best
> choice).  I've since bought a couple tg3 and a bunch of e1000 cards that
> I'm planning to put into production.
>

yes, move to the giges then lets talk again. I think your main problem is
that 3com NAPI is not well supported. Lennert disappeared right after he
released the patch and noone else has the interest of maintaining it.

> Rob's test results seem to show that even if I replace my 3c905cx cards
> with e1000's I'll still get killed with a 50kpps synflood with my current
> CPU.  Upgrading to dual 2Ghz CPUs is not a preferred solution since I
> can't do that in a 1U rack-mount box.  Yeah, I could probably do it with
> water cooling, but that's not an option in a telco hotel like 151 Front
> St. (Toronto).
>

where are you getting the 50Kpps data from? I see him talkking of
input rate of no less than 200Kpps.

> A couple weeks ago I got one of my techs to test freeBSD/polling with full
> routing tables on a 1Ghz celeron and 2 e1000 cards.  His testing seems to
> suggest it will handle a 50kpps synflood DOS.  It would be nice if Linux
> could do the same.
>
> Despite the BSD bashing (to be expected on a Linux list, I guess), I will
> be using BSD as well as Linux for core routing.  The plan is 1 linux
> router and 1 bsd router each running zebra, connected to separate upstream
> transit providers, running ibgp between them, and both advertising a
> default route into OSPF.  Then if I get hit with a DOS that kills Linux,
> the BSD box will have a much better chance of staying up than if I just
> used a second Linux box for redundancy.  If the BSD boxes turn out to have
> twice the performance of the linux boxes, it may be better for me to dump
> linux for routing altogether. :-(
>

This is why you dont get very positivre reaction. You use religious
scripture and you expect that people will help prove you are wrong.
Let the person who showed that BSD can do better publish the data.
If they are in town, let me know because i am willing to walk to
meet the challenge.
Maybe we'll learn something.

cheers,
jamal

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10  4:34                                                                       ` Simon Kirby
@ 2003-06-10 11:01                                                                         ` Jamal Hadi
  2003-06-10 11:28                                                                         ` Jamal Hadi
  2003-06-10 16:10                                                                         ` David S. Miller
  2 siblings, 0 replies; 227+ messages in thread
From: Jamal Hadi @ 2003-06-10 11:01 UTC (permalink / raw)
  To: Simon Kirby; +Cc: ralph+d, netdev, linux-net



On Mon, 9 Jun 2003, Simon Kirby wrote:

> Anyway, the performance issues should be fixable.  It is going to take
> some work, but there seem to be some interested people.  I'm going to try
> to set up something that will allow for easy comparisons of patches so
> that we can measure progress, and perhaps reach an eventual goal.
>


Now heres the right spirit.

cheers,
jamal

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10  4:34                                                                       ` Simon Kirby
  2003-06-10 11:01                                                                         ` Jamal Hadi
@ 2003-06-10 11:28                                                                         ` Jamal Hadi
  2003-06-10 13:18                                                                           ` Ralph Doncaster
  2003-06-10 16:10                                                                         ` David S. Miller
  2 siblings, 1 reply; 227+ messages in thread
From: Jamal Hadi @ 2003-06-10 11:28 UTC (permalink / raw)
  To: Simon Kirby; +Cc: ralph+d, netdev, linux-net



On Mon, 9 Jun 2003, Simon Kirby wrote:

> I was going to ask before, and it's probably not even possible anymore,
> but have you tried on a 2.0 kernel before?  2.0 kernels probably have a
> lot of other problems and don't support the new hardware, but it would be
> interesting to see how it scales to many srcs/dsts before the route cache
> was integrated.  It probably scales a lot more like FreeBSD does.  You'd
> probably have to use eepro100s or something, though.
>

As a side note, note that stateless forwarding like BSD patricie tries
is no longer sufficient. Its no longer just looking up a nexthop, dec ttl,
recompute csum that we are optimizing for.
The dst cache/flowi is the way to go, so theres no going back;-> - we just
gotta make what we have work better.

cheers,
jamal

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 10:53                                                                       ` Jamal Hadi
@ 2003-06-10 11:41                                                                         ` chas williams
  2003-06-10 16:27                                                                           ` David S. Miller
  2003-06-10 11:41                                                                         ` Pekka Savola
                                                                                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 227+ messages in thread
From: chas williams @ 2003-06-10 11:41 UTC (permalink / raw)
  To: Jamal Hadi
  Cc: ralph+d, CIT/Paul, 'Simon Kirby',
	'David S. Miller',
	fw, netdev, linux-net

In message <20030610061010.Y36963@shell.cyberus.ca>,Jamal Hadi writes:
>is we actually test on real environments or maybe even collect real
>world traffic patterns and run them in the lab.
>Typically, real world is less intense than the lab. Ex: noone sends
>100Mbps at 64 byte packet size. Typical packet is around 500 bytes
>average. If linux can handle that forwarding capacity, it should easily

i was curious at one point and collected a some packet size stats on
our border router.  while the average packet size is close to 500,
the bulk (by count) of the traffic seems to be in the 64-95 byte range.
(the length here is the link level size as given by tcpdump -e)

# 27100000 packets  average length = 747
0-31 5271
32-63 0
64-95 12143442
96-127 934314
128-159 202984
160-191 98772
192-223 49279
224-255 37826
256-287 28276
288-319 41675
320-351 42359
352-383 93709
384-415 24557
416-447 73969
448-479 25100
480-511 23210
512-543 86515
544-575 77779
576-607 146066
608-639 23967
640-671 23005
672-703 87471
704-735 13154
736-767 8818
768-799 20850
800-831 7678
832-863 7379
864-895 7920
896-927 5789
928-959 48122
960-991 35512
992-1023 26081
1024-1055 63541
1056-1087 23673
1088-1119 8397
1120-1151 5780
1152-1183 5133
1184-1215 8820
1216-1247 40251
1248-1279 6295
1280-1311 11420
1312-1343 31610
1344-1375 21802
1376-1407 22442
1408-1439 4932071
1440-1471 594385
1472-1503 439460
1504-1535 6434071

^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-10 10:53                                                                       ` Jamal Hadi
  2003-06-10 11:41                                                                         ` chas williams
@ 2003-06-10 11:41                                                                         ` Pekka Savola
  2003-06-10 11:58                                                                           ` John S. Denker
  2003-06-10 12:07                                                                           ` Jamal Hadi
  2003-06-10 13:10                                                                         ` Ralph Doncaster
  2003-06-10 18:41                                                                         ` Florian Weimer
  3 siblings, 2 replies; 227+ messages in thread
From: Pekka Savola @ 2003-06-10 11:41 UTC (permalink / raw)
  To: Jamal Hadi
  Cc: ralph+d, CIT/Paul, 'Simon Kirby',
	'David S. Miller',
	fw, netdev, linux-net

On Tue, 10 Jun 2003, Jamal Hadi wrote:
> Typically, real world is less intense than the lab. Ex: noone sends
> 100Mbps at 64 byte packet size.

Some attackers do, and if your box dies because of that.. well, you don't 
like it and your managers certainly don't :-)

> Typical packet is around 500 bytes
> average. 

Not sure that's really the case.  I have the impression the traffic is 
basically something like:
 - close to 1500 bytes (data transfers)
 - between 40-100 bytes (TCP acks, simple UDP requests, etc.)
 - something in between

> If linux can handle that forwarding capacity, it should easily
> be doing close to Gige real world capacity.

Yes, but not the worst case capacity you really have to plan for :-(

> Have you seen how the big boys advertise? when tuning specs they talk
> about bits/sec. Juniper just announced a blade at supercom that can do
> firewalling at 500Mbps.

May be for some, but they *DO* give their pps figures also; many operators
do, in fact, *explicitly* check the pps figures especially when there are
some slower-path features in use (ACL's, IPv6, multicast, RPF, etc.):  
that's much more important than the optimal figures which are great for 
advertising material and press releases :-).

-- 
Pekka Savola                 "You each name yourselves king, yet the
Netcore Oy                    kingdom bleeds."
Systems. Networks. Security. -- George R.R. Martin: A Clash of Kings


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 11:41                                                                         ` Pekka Savola
@ 2003-06-10 11:58                                                                           ` John S. Denker
  2003-06-10 12:12                                                                             ` Jamal Hadi
  2003-06-10 12:07                                                                           ` Jamal Hadi
  1 sibling, 1 reply; 227+ messages in thread
From: John S. Denker @ 2003-06-10 11:58 UTC (permalink / raw)
  To: Pekka Savola
  Cc: Jamal Hadi, ralph+d, CIT/Paul, 'Simon Kirby',
	'David S. Miller',
	fw, netdev, linux-net

On 06/10/2003 07:41 AM, Pekka Savola wrote:
> 
>>Typical packet is around 500 bytes
>>average. 
> 
> Not sure that's really the case.  I have the impression the traffic is 
> basically something like:
>  - close to 1500 bytes (data transfers)
>  - between 40-100 bytes (TCP acks, simple UDP requests, etc.)
>  - something in between

It helps to take a more sophisticated view of things.
In typical networks:
Most of the packet-count is to be found in small packets.
Most of the byte-count is to be found in large packets.

Some things (e.g. routing) depend mainly on the packet-count.
Other things (e.g. encryption, layer-1 hardware requirements,
memory bandwidth usage, ISP contracts) are sensitive to the
byte-count.

We shouldn't optimize one at the expense of the other.


^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-10 11:41                                                                         ` Pekka Savola
  2003-06-10 11:58                                                                           ` John S. Denker
@ 2003-06-10 12:07                                                                           ` Jamal Hadi
  2003-06-10 15:29                                                                             ` Ralph Doncaster
  1 sibling, 1 reply; 227+ messages in thread
From: Jamal Hadi @ 2003-06-10 12:07 UTC (permalink / raw)
  To: Pekka Savola
  Cc: ralph+d, CIT/Paul, 'Simon Kirby',
	'David S. Miller',
	fw, netdev, linux-net



On Tue, 10 Jun 2003, Pekka Savola wrote:

> On Tue, 10 Jun 2003, Jamal Hadi wrote:
> > Typically, real world is less intense than the lab. Ex: noone sends
> > 100Mbps at 64 byte packet size.
>
> Some attackers do, and if your box dies because of that.. well, you don't
> like it and your managers certainly don't :-)
>

Assuming the attacker has a 100mbps link to you, yes ;->
I am not trying to say we should ignore it; infact all our tests
have been worst case scenarios.

> > Typical packet is around 500 bytes
> > average.
>
> Not sure that's really the case.  I have the impression the traffic is
> basically something like:
>  - close to 1500 bytes (data transfers)
>  - between 40-100 bytes (TCP acks, simple UDP requests, etc.)
>  - something in between
>

Its is typically trimodal (the ACKs, something in the 500 bytes and the
1500 byte end). The 500 average is derived from staring at cdf graphs:

slightly dated more thorough:
http://www.nlanr.net/NA/Learn/packetsizes.html

Frequent collections by sprint:
http://ipmon.sprint.com/packstat/packet.php?030407

so 500 bytes does sound reasonable.
Theres a lot of papers that have been written on this subject.

> > If linux can handle that forwarding capacity, it should easily
> > be doing close to Gige real world capacity.
>
> Yes, but not the worst case capacity you really have to plan for :-(
>

agreed.

> > Have you seen how the big boys advertise? when tuning specs they talk
> > about bits/sec. Juniper just announced a blade at supercom that can do
> > firewalling at 500Mbps.
>
> May be for some, but they *DO* give their pps figures also; many operators
> do, in fact, *explicitly* check the pps figures especially when there are
> some slower-path features in use (ACL's, IPv6, multicast, RPF, etc.):
> that's much more important than the optimal figures which are great for
> advertising material and press releases :-).
>

The announce in question i saw in some post supercom2003. I kept looking
for conditions that apply to get that 500mbops but couldnt find any.
A lot of people fall for the big brand name, so granted some people will
check, quiet a few dont have that expertise and will buy because iut reads
"juniper".

cheers,
jamal

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 11:58                                                                           ` John S. Denker
@ 2003-06-10 12:12                                                                             ` Jamal Hadi
  2003-06-10 16:33                                                                               ` David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: Jamal Hadi @ 2003-06-10 12:12 UTC (permalink / raw)
  To: John S. Denker
  Cc: Pekka Savola, ralph+d, CIT/Paul, 'Simon Kirby',
	'David S. Miller',
	fw, netdev, linux-net



On Tue, 10 Jun 2003, John S. Denker wrote:

> On 06/10/2003 07:41 AM, Pekka Savola wrote:
> >
> >>Typical packet is around 500 bytes
> >>average.
> >
> > Not sure that's really the case.  I have the impression the traffic is
> > basically something like:
> >  - close to 1500 bytes (data transfers)
> >  - between 40-100 bytes (TCP acks, simple UDP requests, etc.)
> >  - something in between
>
> It helps to take a more sophisticated view of things.
> In typical networks:
> Most of the packet-count is to be found in small packets.
> Most of the byte-count is to be found in large packets.
>
> Some things (e.g. routing) depend mainly on the packet-count.
> Other things (e.g. encryption, layer-1 hardware requirements,
> memory bandwidth usage, ISP contracts) are sensitive to the
> byte-count.
>
> We shouldn't optimize one at the expense of the other.

You bring a good point.
Theres another dimension actually: mostly driven by BSD mbuff style
packet allocation; some tests show that some vendors are optimized
for certain packet sizes, Linux skbuffs dont have this problem.
We dont optimize for packet sizes given the linear nature of skbuffs.
Donalds ether drivers tend to amortize some of the costs by reallocating
skbs when the packet <= 100 bytes, but this is no longer valid with
skb recycling and the magazine layer appearing in the slab.

cheers,
jamal

^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-10 10:53                                                                       ` Jamal Hadi
  2003-06-10 11:41                                                                         ` chas williams
  2003-06-10 11:41                                                                         ` Pekka Savola
@ 2003-06-10 13:10                                                                         ` Ralph Doncaster
  2003-06-10 13:36                                                                           ` Jamal Hadi
                                                                                             ` (2 more replies)
  2003-06-10 18:41                                                                         ` Florian Weimer
  3 siblings, 3 replies; 227+ messages in thread
From: Ralph Doncaster @ 2003-06-10 13:10 UTC (permalink / raw)
  To: Jamal Hadi
  Cc: CIT/Paul, 'Simon Kirby', 'David S. Miller',
	fw, netdev, linux-net

On Tue, 10 Jun 2003, Jamal Hadi wrote:

> I thought you were saying those were _not_ real world traffic patterns.

I'm saying the tests that you and Rob did in the past did not reflect
real-world use of Linux as a core router (i.e. small routing table and not
many different traffic flows).  The tests he posted yesterday are a big
step forward.

> Typically, real world is less intense than the lab. Ex: noone sends
> 100Mbps at 64 byte packet size. Typical packet is around 500 bytes
> average. If linux can handle that forwarding capacity, it should easily
> be doing close to Gige real world capacity.

No, it needs to work in the worst case.  If some script kiddie can peg my
CPU with a synflood then there's still a problem.

> > Most of the people I know that would actually see 50kpps
> > in the real world don't have the time to apply various patches and run a
>
> Now thats one big dilema, isnt it? Do you think i have time? Let me
> assure you that I dont get paid by anybody to do any of this stuff.

Sure I realize that.  The problem I've seen occur is that Linux developers
with big egos say "linux can route as well as a cisco 3640", or "linux
routing is beats BSD any day".  Then guys like me decide to give it a try,
not realizing we're walking into a tarpit.  If I had been told in the
first place that running linux as a high-throughput router in a service
provider environment was an unknown, things would have been different.

> I have spent many hours investigating peoples problems sshing to their
> machines only to find out they didnt follow instructions. After the
> 10th person doing the same thing, what do you expect my reaction to be?

Take 15 minutes and write a web page with the magic settings required to
make things work.

> > Yup, still a duron 750 on an Asus mobo (Via chipset).  Running Zebra
> > 0.93b.  If the ideas you're referring to are changing the zebra source to
> > arp the next-nops, then no, I haven't tried it (and am not likely to any
> > time soon).
> >
>
> I think you may be suffering from the "too low" traffic NAPI syndrome.
> Under low traffic (1-2 Mbps) on lower end machines NAPI will consume
> more CPU because of an extra PCI operation per packet that is performed.

No, as I said I'm moving ~30mbps and ~10kpps in and out of 2 3c905cx
cards.

> As for the zebra thing, if you post my message to the Zebra list i am sure
> someone will be excited enough to do it. I need a few hours to do it
> but like you i dont have much time.

The last time I looked at the zebra list things seemed pretty dead.  Most
of the new work is now happening on the commercial zebra development.

> > > Well, heres a good example: With NAPI, have your sessions been dropped?
> > Yup, twice in the last 2 weeks.
> >
>
> I have seen NAPI slow down throughput because of an intensive user space
> app.

This is a router with just zebra (zebra, ospfd, bgpd) running.

> > I had joined the vortex list last fall looking for some tips and that
> > didn't help much (other than telling me that the 3com wasn't the best
> > choice).  I've since bought a couple tg3 and a bunch of e1000 cards that
> > I'm planning to put into production.
>
> yes, move to the giges then lets talk again. I think your main problem is
> that 3com NAPI is not well supported. Lennert disappeared right after he
> released the patch and noone else has the interest of maintaining it.

Yes, and it would be nice if you mentioned in your NAPI docs that people
should use a tulip, tg3, or e1000 if they want it to work well.  In making
your sales pitches for NAPI you made it sound like any high-performance
card should do fine (i.e. anything but a Realtek).

> > Rob's test results seem to show that even if I replace my 3c905cx cards
> > with e1000's I'll still get killed with a 50kpps synflood with my current
> > CPU.

>
> where are you getting the 50Kpps data from? I see him talkking of
> input rate of no less than 200Kpps.

On his first graph, for 50k new incoming dst/sec throughput looks to be
~175kpps.  And he's running a 1.8Ghz Xenon vs my 750Mhz Duron.

> > used a second Linux box for redundancy.  If the BSD boxes turn out to have
> > twice the performance of the linux boxes, it may be better for me to dump
> > linux for routing altogether. :-(
> >
>
> This is why you dont get very positivre reaction. You use religious
> scripture and you expect that people will help prove you are wrong.

You don't seem to get it.  There's at least a dozen things more important
to me than seeing Linux routing performance compete with Cisco and BSD.
I'm annoyed that people like you have told me linux is up to the task, and
then when it's not I'm left SOL.  I thought I was talking to competent
techies, but now I see most of the techies were also Linux evangelists.

Now that people like Rob and Dave are taking a hard look at it I think
it's worth my while to ante up for a couple more rounds.  I still fell
like a sucker that should have walked away from the table a long time ago
though.

Jim Mercer and Marc Ackley at 151.net/tht.net told me they tried
Linux/Zebra and gave up (and went with 7206vxr routers).  And they're very
pro-unix (still do all their netflow collection and billing on Unix).
They're not likely to go back and give Linux another try.  If the linux
evangelists had just said Linux would be ready for core routing in a year
(or whatever) instead, I think network operators would look at it more
seriously rather than they joke that they see it as now.

-Ralph

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 11:28                                                                         ` Jamal Hadi
@ 2003-06-10 13:18                                                                           ` Ralph Doncaster
  0 siblings, 0 replies; 227+ messages in thread
From: Ralph Doncaster @ 2003-06-10 13:18 UTC (permalink / raw)
  To: Jamal Hadi; +Cc: Simon Kirby, netdev, linux-net

On Tue, 10 Jun 2003, Jamal Hadi wrote:

> As a side note, note that stateless forwarding like BSD patricie tries
> is no longer sufficient. Its no longer just looking up a nexthop, dec ttl,
> recompute csum that we are optimizing for.

It would certainly be sufficient for core routing.  If I can have flow
manipulation at no extra cost, I'll take it.  If it's going to double the
horsepower requirements, I don't want it.

-Ralph


^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-10 13:10                                                                         ` Ralph Doncaster
@ 2003-06-10 13:36                                                                           ` Jamal Hadi
  2003-06-10 14:03                                                                             ` Ralph Doncaster
  2003-06-10 16:38                                                                           ` David S. Miller
  2003-06-10 16:39                                                                           ` David S. Miller
  2 siblings, 1 reply; 227+ messages in thread
From: Jamal Hadi @ 2003-06-10 13:36 UTC (permalink / raw)
  To: ralph+d
  Cc: CIT/Paul, 'Simon Kirby', 'David S. Miller',
	fw, netdev, linux-net



On Tue, 10 Jun 2003, Ralph Doncaster wrote:

> On Tue, 10 Jun 2003, Jamal Hadi wrote:
>
> > I thought you were saying those were _not_ real world traffic patterns.
>
> I'm saying the tests that you and Rob did in the past did not reflect
> real-world use of Linux as a core router (i.e. small routing table and not
> many different traffic flows).  The tests he posted yesterday are a big
> step forward.
>

I think at a minimal define what "real world" means.
Is it 100 flows/sec at 20Kpps? what is it?

> > Typically, real world is less intense than the lab. Ex: noone sends
> > 100Mbps at 64 byte packet size. Typical packet is around 500 bytes
> > average. If linux can handle that forwarding capacity, it should easily
> > be doing close to Gige real world capacity.
>
> No, it needs to work in the worst case.  If some script kiddie can peg my
> CPU with a synflood then there's still a problem.
>

Lets work on defining "real world". Factor in the script kiddie.

> Sure I realize that.  The problem I've seen occur is that Linux developers
> with big egos say "linux can route as well as a cisco 3640", or "linux
> routing is beats BSD any day".  Then guys like me decide to give it a try,
> not realizing we're walking into a tarpit.  If I had been told in the
> first place that running linux as a high-throughput router in a service
> provider environment was an unknown, things would have been different.
>

Heres where the problem is:
If you interact at this low level then you oughta produce low level
input. Provide people with data to help. Otherwise its a high maintanance
task.

> > I have spent many hours investigating peoples problems sshing to their
> > machines only to find out they didnt follow instructions. After the
> > 10th person doing the same thing, what do you expect my reaction to be?
>
> Take 15 minutes and write a web page with the magic settings required to
> make things work.
>

I have many times. I still do. It is also a thankless task.

> > I think you may be suffering from the "too low" traffic NAPI syndrome.
> > Under low traffic (1-2 Mbps) on lower end machines NAPI will consume
> > more CPU because of an extra PCI operation per packet that is performed.
>
> No, as I said I'm moving ~30mbps and ~10kpps in and out of 2 3c905cx
> cards.
>

Change your NICs. I dont know what else to suggest.

> > As for the zebra thing, if you post my message to the Zebra list i am sure
> > someone will be excited enough to do it. I need a few hours to do it
> > but like you i dont have much time.
>
> The last time I looked at the zebra list things seemed pretty dead.  Most
> of the new work is now happening on the commercial zebra development.
>

Maybe its time to fork Zebra into something that has the same momentum it
had in the earlier days.

> > yes, move to the giges then lets talk again. I think your main problem is
> > that 3com NAPI is not well supported. Lennert disappeared right after he
> > released the patch and noone else has the interest of maintaining it.
>
> Yes, and it would be nice if you mentioned in your NAPI docs that people
> should use a tulip, tg3, or e1000 if they want it to work well.  In making
> your sales pitches for NAPI you made it sound like any high-performance
> card should do fine (i.e. anything but a Realtek).
>

Theres a URL which points people to where the various NICS supported are.

> On his first graph, for 50k new incoming dst/sec throughput looks to be
> ~175kpps.  And he's running a 1.8Ghz Xenon vs my 750Mhz Duron.
>

i think what would be interesting is to show CPU utilization as well.

> > This is why you dont get very positivre reaction. You use religious
> > scripture and you expect that people will help prove you are wrong.
>
> You don't seem to get it.  There's at least a dozen things more important
> to me than seeing Linux routing performance compete with Cisco and BSD.

Again, if you wanna complain about it at the level you are i think its
only fair you help. I actually dont care about CISCO or BSD. We dont win
because someone else looses. We simply want to be the best.
If you tell me BSD works better, i told you i will walk all the way
downtown in the hope i'll find somethuing we can improve on.

> I'm annoyed that people like you have told me linux is up to the task, and
> then when it's not I'm left SOL.  I thought I was talking to competent
> techies, but now I see most of the techies were also Linux evangelists.
>
> Now that people like Rob and Dave are taking a hard look at it I think
> it's worth my while to ante up for a couple more rounds.  I still fell
> like a sucker that should have walked away from the table a long time ago
> though.
>

I think your setup maybe the question. Like i said theres probably a
hunderd variables involved. It is up to you to isolate things.
Yes, theres a support line in open source, but it is rewarded more
when people show some effort.

> Jim Mercer and Marc Ackley at 151.net/tht.net told me they tried
> Linux/Zebra and gave up (and went with 7206vxr routers).  And they're very
> pro-unix (still do all their netflow collection and billing on Unix).
> They're not likely to go back and give Linux another try.  If the linux
> evangelists had just said Linux would be ready for core routing in a year
> (or whatever) instead, I think network operators would look at it more
> seriously rather than they joke that they see it as now.
>

Theres a lot of BSD bigots in a lot of ISPS and IETF. It's human nature
to be comfortable with what they know best. Most of the people i have
met that put Linux down or consider it a joke come from the old
BSD camp. Its their loss and i dismiss anything they have to say.
Lets work on facts. What is it that we can do to improve Linux?
Provide data. If you want to compare against BSD, what is it that
_ you have facts on_ and not heard from other people that BSD does better?

cheers,
jamal

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09 23:05                                                               ` David S. Miller
@ 2003-06-10 13:41                                                                 ` Robert Olsson
  0 siblings, 0 replies; 227+ messages in thread
From: Robert Olsson @ 2003-06-10 13:41 UTC (permalink / raw)
  To: David S. Miller; +Cc: hadi, sim, xerox, fw, netdev, linux-net


First run...
 
Worst senario 1 dst/pkt w. 64 byte pkts. 2*10 Million packets injected.
eth0, eth2. Input rate 2*190 kpps clone_skb=1. Routing table of 123946 
routes. UP. NAPI gives fairmess between both DoS attackers. :-) But more 
testing to be done. 


   plain   w. DaveM patch
----------------------------------
      72        114 kpps throughput
30271883   12246290 hash misses (second last in my rt_cache_stat)


58% better... and it can be further improved. 



Iface   MTU Met  RX-OK RX-ERR RX-DRP RX-OVR  TX-OK TX-ERR TX-DRP TX-OVR Flags
eth0   1500   0 1964858 9793618 9793618 8035147     16      0      0      0 BRU
eth1   1500   0     19      0      0      0 1887577      0      0      0 BRU
eth2   1500   0 1964698 9793419 9793419 8035305      3      0      0      0 BRU
eth3   1500   0      1      0      0      0 1886904      0      0      0 BRU
/proc/net/rt_cache_stat
000004ba  00000e27 003be7ba 00000000 00000000 00000000 00000000 00000000  00000001 00000001 00000000 003869c1 00360b4d 00025dcb 00025dca 01cde98b 00000000 


With DaveM hash-list limit patch.

Input rate 2*190 kpps clone_skb=1
Iface   MTU Met  RX-OK RX-ERR RX-DRP RX-OVR  TX-OK TX-ERR TX-DRP TX-OVR Flags
eth0   1500   0 2990462 9680257 9680257 7009542     12      0      0      0 BRU
eth1   1500   0     12      0      0      0 2990467      0      0      0 BRU
eth2   1500   0 2990460 9673421 9673421 7009544      4      0      0      0 BRU
eth3   1500   0      1      0      0      0 2990459      0      0      0 BRU
/proc/net/rt_cache_stat
00000000  00000607 005b3cfb 00000000 00000000 00000000 00000000 00000000  00000000 00000002 00000000 005b2cfa 005b2ced 00000008 00000000 00badd12 00000003 


Cheers.
							--ro

^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-10 13:36                                                                           ` Jamal Hadi
@ 2003-06-10 14:03                                                                             ` Ralph Doncaster
  0 siblings, 0 replies; 227+ messages in thread
From: Ralph Doncaster @ 2003-06-10 14:03 UTC (permalink / raw)
  To: Jamal Hadi
  Cc: CIT/Paul, 'Simon Kirby', 'David S. Miller',
	fw, netdev, linux-net

On Tue, 10 Jun 2003, Jamal Hadi wrote:

> > No, it needs to work in the worst case.  If some script kiddie can peg my
> > CPU with a synflood then there's still a problem.
> >
>
> Lets work on defining "real world". Factor in the script kiddie.

"real world" is the worst-case DOS tool available.  Synflood tools like
juno seem to fit that category.  If you think juno is not a good
real-world test, then keep pissing people off and you'll find out how real
it is. ;-)

> > > I have spent many hours investigating peoples problems sshing to their
> > > machines only to find out they didnt follow instructions. After the
> > > 10th person doing the same thing, what do you expect my reaction to be?
> >
> > Take 15 minutes and write a web page with the magic settings required to
> > make things work.
> >
>
> I have many times. I still do. It is also a thankless task.

URL?  I've looked at almost everything on your web page since you were
involved in the pppoe client software.  I haven't seen anything that says
how to sprinkle the pixie dust so my router works well.

> > No, as I said I'm moving ~30mbps and ~10kpps in and out of 2 3c905cx
> > cards.
> >
>
> Change your NICs. I dont know what else to suggest.

Yup.  It just takes a bit of time and planning when the box is deployed in
a POP 400km away...

> > The last time I looked at the zebra list things seemed pretty dead.  Most
> > of the new work is now happening on the commercial zebra development.
> >
>
> Maybe its time to fork Zebra into something that has the same momentum it
> had in the earlier days.

Hmmm... maybe we can both bug MCR to try your suggested changes...

> > You don't seem to get it.  There's at least a dozen things more important
> > to me than seeing Linux routing performance compete with Cisco and BSD.
>
> Again, if you wanna complain about it at the level you are i think its
> only fair you help. I actually dont care about CISCO or BSD. We dont win
> because someone else looses. We simply want to be the best.

You can want to be the best, but I don't think it's fair to sucker people
into using Linux as a core router with false claims.

> > Now that people like Rob and Dave are taking a hard look at it I think
> > it's worth my while to ante up for a couple more rounds.  I still fell
> > like a sucker that should have walked away from the table a long time ago
> > though.
> >
>
> I think your setup maybe the question. Like i said theres probably a
> hunderd variables involved. It is up to you to isolate things.
> Yes, theres a support line in open source, but it is rewarded more
> when people show some effort.

Fuck, if you think I haven't put any effort into it already then there's
no point in even trying any more.

> to be comfortable with what they know best. Most of the people i have
> met that put Linux down or consider it a joke come from the old
> BSD camp. Its their loss and i dismiss anything they have to say.

In my case I would have been better off to dismiss your advice a year ago.
How does that help the Linux cause?

-Ralph


^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-10 12:07                                                                           ` Jamal Hadi
@ 2003-06-10 15:29                                                                             ` Ralph Doncaster
  2003-06-11 19:48                                                                               ` Florian Weimer
  0 siblings, 1 reply; 227+ messages in thread
From: Ralph Doncaster @ 2003-06-10 15:29 UTC (permalink / raw)
  To: Jamal Hadi
  Cc: Pekka Savola, CIT/Paul, 'Simon Kirby',
	'David S. Miller',
	fw, netdev, linux-net

On Tue, 10 Jun 2003, Jamal Hadi wrote:

> Assuming the attacker has a 100mbps link to you, yes ;->

A script kiddie 0wning a box with a FE connection is nothing.  During what
was probably the worst DoS I got hit with, one of my upstream providers
said they were seeing about 600 Mbps of traffic related to the attack.

-Ralph

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10  0:32                                                                 ` Ralph Doncaster
  2003-06-10  1:15                                                                   ` Jamal Hadi
  2003-06-10  1:53                                                                   ` Simon Kirby
@ 2003-06-10 15:49                                                                   ` David S. Miller
  2003-06-10 17:33                                                                     ` Ralph Doncaster
  2 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-10 15:49 UTC (permalink / raw)
  To: ralph+d, ralph; +Cc: hadi, xerox, sim, fw, netdev, linux-net

   From: Ralph Doncaster <ralph@istop.com>
   Date: Mon, 9 Jun 2003 20:32:48 -0400 (EDT)
   
   Lastly from the software side Linux doesn't seem to have anything like
   BSD's parameter to control user/system CPU sharing.  Once my CPU load
   reaches 70-80%, I'd rather have some dropped packets than let the CPU hit
   100% and end up with my BGP sessions drop.

When packet (more specifically, software interrupt) processing
reaches a certain level, we offload the work into process context.
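
Roughly, the shape of it is something like this (a simplified sketch, not
the real kernel/softirq.c code; the helpers here are trivial stand-ins for
the real per-CPU pending mask and ksoftirqd thread):

/* softirq work is bounded per invocation; anything beyond the bound is
 * handed to the per-CPU ksoftirqd thread, which runs in process context
 * and therefore competes with user space instead of starving it */
#define MAX_SOFTIRQ_RESTART	10

static int  pending_work;
static int  softirq_work_pending(void) { return pending_work; }
static void run_pending_softirqs(void) { if (pending_work) pending_work--; }
static void wake_ksoftirqd(void)       { /* wake the per-CPU thread */ }

static void do_softirq_sketch(void)
{
	int restart = MAX_SOFTIRQ_RESTART;

	while (softirq_work_pending()) {
		run_pending_softirqs();	/* NET_RX, NET_TX, timers, ... */
		if (!--restart) {
			/* too much work for interrupt context: punt the rest */
			wake_ksoftirqd();
			break;
		}
	}
}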

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10  1:15                                                                   ` Jamal Hadi
  2003-06-10  2:45                                                                     ` Ralph Doncaster
@ 2003-06-10 15:53                                                                     ` David S. Miller
  2003-06-10 16:15                                                                       ` 3c59x (was Route cache performance under stress) Bogdan Costescu
  2003-06-11 17:52                                                                     ` Route cache performance under stress Robert Olsson
  2 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-10 15:53 UTC (permalink / raw)
  To: hadi; +Cc: ralph+d, xerox, sim, fw, netdev, linux-net

   From: Jamal Hadi <hadi@shell.cyberus.ca>
   Date: Mon, 9 Jun 2003 21:15:18 -0400 (EDT)
   
   Have you tried a different NIC? Not sure how well the 3com is maintained
   for example.

Actually, the main issue with 3c59x is that it still
uses PIO accesses.  This basically makes it useless
for routing or anything wanting serious latency.

Andrew Morton knows this, but he is such a good maintainer
that he doesn't want to change over to MEM I/O accesses
for fear of breaking something.

It's actually a simple change to make if someone wants to
spend a few cycles on it, then you can see what kind of
performance you'll get with that.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10  1:53                                                                   ` Simon Kirby
  2003-06-10  3:18                                                                     ` Ralph Doncaster
@ 2003-06-10 15:56                                                                     ` David S. Miller
  2003-06-10 16:45                                                                       ` 3c59x (was Route cache performance under stress) Bogdan Costescu
  2003-06-10 17:19                                                                       ` Route cache performance under stress Ralph Doncaster
  1 sibling, 2 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-10 15:56 UTC (permalink / raw)
  To: sim; +Cc: ralph+d, hadi, xerox, fw, netdev, linux-net

   From: Simon Kirby <sim@netnation.com>
   Date: Mon, 9 Jun 2003 18:53:12 -0700

   Your CPU use is quite a bit higher than ours.

Yeah, but his faster cpu is all being burnt to a crisp
doing PIO accesses to the 3c59x card.

   I found that once NAPI was happening, userspace seemed to get a
   fairly decent amount of time.

Unfortunately, NAPI won't help him with the current way the 3c59x
driver works.  It needs to provide a way to use MEM I/O before NAPI
would start to be of use to him.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10  3:18                                                                     ` Ralph Doncaster
@ 2003-06-10 16:06                                                                       ` David S. Miller
  0 siblings, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-10 16:06 UTC (permalink / raw)
  To: ralph+d, ralph; +Cc: sim, hadi, xerox, fw, netdev, linux-net

   From: Ralph Doncaster <ralph@istop.com>
   Date: Mon, 9 Jun 2003 23:18:45 -0400 (EDT)
   
   Even if my time was only worth $500/day, in the past year and a
   half I spent enough time working on Linux routers to buy a Cisco
   NPE-G1. :-(
   
Slapping different machines together and mucking with zebra
config files is not going to fix the kind of issues you are
talking about.  It is pure wasted effort.

Someone needs to apply brains to the code and improve the algorithms
and schemes we use.  So far I see approximately 1 person doing
something for every 1,000 guys complaining.  So shut your yap and
open up an editor and some algorithms books and papers. :)

If you stop using Linux right now, I won't cry, nor will I lose sleep
tonight; I've never felt threatened by such things, so I wouldn't
advise using them to coerce me into somehow "working harder". :)

See, I know the reasonable people will stick around and back me up as
I continue to improve the code.  Because I'm actually doing something
about the problems.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10  4:34                                                                       ` Simon Kirby
  2003-06-10 11:01                                                                         ` Jamal Hadi
  2003-06-10 11:28                                                                         ` Jamal Hadi
@ 2003-06-10 16:10                                                                         ` David S. Miller
  2 siblings, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-10 16:10 UTC (permalink / raw)
  To: sim; +Cc: ralph+d, netdev, linux-net

   From: Simon Kirby <sim@netnation.com>
   Date: Mon, 9 Jun 2003 21:34:53 -0700

   Broken parts of the code only get fixed if enough people whine

This isn't how I operate...

   or especially if somebody decides to actually fix it.

This is.  I hack on something because I want to and it seems
interesting to me at the moment.  Not because someone is shitting
their pants in public about it. :)

So, for future reference, you'll get more using honey than vinegar
from me :)
   
Franks a lot,
David S. Miller
davem@redhat.com

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: 3c59x (was Route cache performance under stress)
  2003-06-10 15:53                                                                     ` Route cache performance under stress David S. Miller
@ 2003-06-10 16:15                                                                       ` Bogdan Costescu
  2003-06-10 16:20                                                                         ` Andi Kleen
  0 siblings, 1 reply; 227+ messages in thread
From: Bogdan Costescu @ 2003-06-10 16:15 UTC (permalink / raw)
  To: David S. Miller; +Cc: hadi, ralph+d, xerox, sim, fw, netdev, linux-net

On Tue, 10 Jun 2003, David S. Miller wrote:

> Acutally, the main issue with 3c59x is that it still
> uses PIO accesses. This basically makes it useless
> for routing or anything wanting serious latency.

I did try about 2 years ago and converted the driver to MMIO.  I wasn't 
able to see _any_ kind of improvement, and I was using it in parallel 
computation, where latency counts.  I have to say, though, that I wasn't 
interested at that time in obtaining profiles and such, because only the 
end-user performance was important.

> Andrew Morton knows this,

... and knows about my MMIO trial too (mentioned also on vortex-list)...

> but he is such a good maintainer that he doesn't want to change over the
> MEM I/O accesses for fear of breaking something.

Given that the 3c59x driver supports several generations of cards, most of 
them EOL-ed years ago, it's pretty hard to make such a change.  If a new 
driver were forked that serviced only the latest generations (Cyclone 
= 905B and Tornado = 905C(X)), switching to MMIO would probably make sense 
along with lots of other small changes (large MTU/VLAN, polling 
descriptors, MII-only media selection etc.) and maybe have NAPI in the mix 
as well...

> It's actually a simple change to make if someone wants to
> spend a few cycles on it,

Not if you include testing in those cycles :-)

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: 3c59x (was Route cache performance under stress)
  2003-06-10 16:15                                                                       ` 3c59x (was Route cache performance under stress) Bogdan Costescu
@ 2003-06-10 16:20                                                                         ` Andi Kleen
  2003-06-10 16:23                                                                           ` Jeff Garzik
  0 siblings, 1 reply; 227+ messages in thread
From: Andi Kleen @ 2003-06-10 16:20 UTC (permalink / raw)
  To: Bogdan Costescu
  Cc: David S. Miller, hadi, ralph+d, xerox, sim, fw, netdev, linux-net

> 
> > but he is such a good maintainer that he doesn't want to change over the
> > MEM I/O accesses for fear of breaking something.
> 
> Given that the 3c59x driver supports several generations of cards most of 
> them being EOL-ed years ago, it's pretty hard to do such change. If a new 
> driver would be forked that serviced only the latest generations (Cyclone 
> = 905B and Tornado = 905C(X)), switching to MMIO would probably make sense 
> along with lots of others small changes (large MTU/VLAN, polling 
> descriptors, MII-only media selection etc.) and maybe have NAPI in the mix 
> as well...

Can't you just wrap it in a few macros and offer a config for those
who want the best performance and a runtime test for the others?
Then switch between PIO and MMIO dynamically.

Even a runtime test should be pretty painless these days; the CPU can
normally execute hundreds or even thousands of tests in the time it
takes to wait for an MMIO or even a PIO access.
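
Something like this minimal sketch is what I have in mind (illustrative
only, not actual 3c59x code; the struct and field names are invented for
the example):

#include <linux/types.h>
#include <asm/io.h>		/* inw/outw, readw/writew */

struct vortex_io_example {	/* invented, not a real 3c59x structure */
	void *mmio;		/* ioremap()ed BAR, when usable */
	unsigned long pio_base;	/* legacy I/O port base */
	int force_pio;		/* set once at probe (whitelist / module param) */
};

static inline u16 vio_inw(struct vortex_io_example *io, int reg)
{
	/* one well-predicted branch; the bus access dominates either way */
	if (io->force_pio)
		return inw(io->pio_base + reg);
	return readw((char *)io->mmio + reg);
}

static inline void vio_outw(struct vortex_io_example *io, int reg, u16 val)
{
	if (io->force_pio)
		outw(val, io->pio_base + reg);
	else
		writew(val, (char *)io->mmio + reg);
}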



> 
> > It's actually a simple change to make if someone wants to
> > spend a few cycles on it,
> 
> Not if you include testing in those cycles :-)

Just make it a whitelist + a force module param.

 
-Andi (who has a 3c980 and could do it, but already has too much on his
todo list..) 

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: 3c59x (was Route cache performance under stress)
  2003-06-10 16:20                                                                         ` Andi Kleen
@ 2003-06-10 16:23                                                                           ` Jeff Garzik
  2003-06-10 17:02                                                                             ` 3c59x David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: Jeff Garzik @ 2003-06-10 16:23 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Bogdan Costescu, David S. Miller, hadi, ralph+d, xerox, sim, fw,
	netdev, linux-net

On Tue, Jun 10, 2003 at 06:20:29PM +0200, Andi Kleen wrote:
> Can't you just wrap it in a few macros and offer a config for those
> who want the best performance and a runtime test for the others?
> Then switch between PIO and mmio dynamically.
> 
> Even runtime test should be pretty painless these days, the CPU normally
> can execute hundreds or even thousands of tests in the time it takes to 
> wait for an mmio or even PIO.

I prefer a compile-time test.  But yes, this is what several other
net drivers do:  offer a config option for MMIO (or PIO), and the
default is MMIO unless that is known to be unsafe on certain boards
(which, unfortunately, it is).

	Jeff

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 11:41                                                                         ` chas williams
@ 2003-06-10 16:27                                                                           ` David S. Miller
  2003-06-10 16:57                                                                             ` chas williams
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-10 16:27 UTC (permalink / raw)
  To: chas; +Cc: hadi, ralph+d, xerox, sim, fw, netdev, linux-net

   From: chas williams <chas@cmf.nrl.navy.mil>
   Date: Tue, 10 Jun 2003 07:41:01 -0400

   the bulk (by count) of the traffic seems to be in the 64-95 byte range.

Ok, time to deploy ATM everywhere to replace our IP routers :)
Sorry Chas, I couldn't resist... :)

Regardless, there are some sites on the net that publish things
like BGP tables and traffic samples that people can use to do
performance testing on new algorithms.  I've read about it in
papers by Vern Paxson (he used it to do his Bro thing) and others.

I don't have a reference handy, anyone?  I think it's called the
IPMA project...

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 12:12                                                                             ` Jamal Hadi
@ 2003-06-10 16:33                                                                               ` David S. Miller
  0 siblings, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-10 16:33 UTC (permalink / raw)
  To: hadi; +Cc: jsd, pekkas, ralph+d, xerox, sim, fw, netdev, linux-net

   From: Jamal Hadi <hadi@shell.cyberus.ca>
   Date: Tue, 10 Jun 2003 08:12:58 -0400 (EDT)

   Theres another dimension actually: mostly driven by BSD mbuff style
   packet allocation; some tests show that some vendors are optimized
   for certain packet sizes, Linux skbuffs dont have this problem.

Well, the most amusing part for me is that if you read all the
papers on TCP congestion algorithms you'd think that routers
dropped based upon packet sizes since the majority work on
multiple of MSS this and multiple of MSS that. :)

Routers drop packets, period.  They do so using a variety of selection
schemes (RED, CBQ, actually just egrep net/sched/sch_*.c :) but your
contribution to the router's work is measured in terms of packets and
time when you come right down to it.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 13:10                                                                         ` Ralph Doncaster
  2003-06-10 13:36                                                                           ` Jamal Hadi
@ 2003-06-10 16:38                                                                           ` David S. Miller
  2003-06-10 16:39                                                                           ` David S. Miller
  2 siblings, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-10 16:38 UTC (permalink / raw)
  To: ralph+d, ralph; +Cc: hadi, xerox, sim, fw, netdev, linux-net

   From: Ralph Doncaster <ralph@istop.com>
   Date: Tue, 10 Jun 2003 09:10:43 -0400 (EDT)

   No, as I said I'm moving ~30mbps and ~10kpps in and out of 2 3c905cx
   cards.

This is because the driver still uses PIO, I am rather sure of
this.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 13:10                                                                         ` Ralph Doncaster
  2003-06-10 13:36                                                                           ` Jamal Hadi
  2003-06-10 16:38                                                                           ` David S. Miller
@ 2003-06-10 16:39                                                                           ` David S. Miller
  2 siblings, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-10 16:39 UTC (permalink / raw)
  To: ralph+d, ralph; +Cc: hadi, xerox, sim, fw, netdev, linux-net

   From: Ralph Doncaster <ralph@istop.com>
   Date: Tue, 10 Jun 2003 09:10:43 -0400 (EDT)
   
   Yes, and it would be nice if you mentioned in your NAPI docs that
   people should use a tulip, tg3, or e1000 if they want it to work
   well.  In making your sales pitches for NAPI you made it sound like
   any high-performance card should do fine (i.e. anything but a Realtek).

The problems the 3c59x has have nothing to do with NAPI vs.
non-NAPI.  Your routing rate is limited by how much time
a PIO access to the PCI device takes :)

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: 3c59x (was Route cache performance under stress)
  2003-06-10 15:56                                                                     ` David S. Miller
@ 2003-06-10 16:45                                                                       ` Bogdan Costescu
  2003-06-10 16:49                                                                         ` Andi Kleen
  2003-06-10 17:12                                                                         ` 3c59x David S. Miller
  2003-06-10 17:19                                                                       ` Route cache performance under stress Ralph Doncaster
  1 sibling, 2 replies; 227+ messages in thread
From: Bogdan Costescu @ 2003-06-10 16:45 UTC (permalink / raw)
  To: David S. Miller; +Cc: sim, ralph+d, hadi, xerox, fw, netdev, linux-net

On Tue, 10 Jun 2003, David S. Miller wrote:

> Unfortunately, NAPI won't help him with the current way the 3c59x
> driver works.  It needs to provide a way to use MEM I/O before NAPI
> would start to be of use to him.

I don't really want to sound like I'm defending the 3c59x driver, but...
The 3c90x driver released by 3Com uses some mechanism "similar" to NAPI 
which is based on the on-board timer; these timer interrupts are scheduled 
dynamically. With this driver I would typically get TCP bandwidth figures 
4-5 Mbps lower than those obtained with 3c59x and a noticeable difference in 
the parallel jobs timing (using MPI over TCP). I'm not saying that NAPI 
will perform the same way, just that there might also be hardware limits 
somewhere...

But the real question is: does it make sense to spend time now trying 
to improve a driver in the hope of only a marginal speed increase?  After 
using these cards and the 3c59x driver with very good results for the past 
4 years, I'm looking for GigE replacements.  Shouldn't anybody concerned 
with performance do the same?  Does it make sense to pair a very fast CPU 
and memory with a 33 MHz/32-bit PCI bus?

And another important question: how much improvement can be gained from
the driver?  Folks that do parallel computation over TCP over Ethernet
know very well that the software in the kernel is the bottleneck (extra
copies, TCP, IRQ management, etc.).  Packages that throw away TCP and use
another communication protocol can typically achieve much better ping-pong
times (they do have some other problems, though), which shows that the
hardware and NIC driver are capable enough.  So until I see a profile
showing that the CPU is spending most of its time in the driver, I won't
be convinced that these changes are needed....

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: 3c59x (was Route cache performance under stress)
  2003-06-10 16:45                                                                       ` 3c59x (was Route cache performance under stress) Bogdan Costescu
@ 2003-06-10 16:49                                                                         ` Andi Kleen
  2003-06-11  9:54                                                                           ` Robert Olsson
  2003-06-10 17:12                                                                         ` 3c59x David S. Miller
  1 sibling, 1 reply; 227+ messages in thread
From: Andi Kleen @ 2003-06-10 16:49 UTC (permalink / raw)
  To: Bogdan Costescu
  Cc: David S. Miller, sim, ralph+d, hadi, xerox, fw, netdev, linux-net

> 
> And another important question: how much improvement can be gained from
> the driver ? Folks that do parallel computation over TCP over Ethernet

You can play some tricks with the driver to make eth_type_trans disappear
from the profiles. This usually helps a lot because it avoids one
full "fetch from cache cold memory" roundtrip per packet, which is slow on
any CPU.
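
For example, the RX path can kick off a prefetch of the Ethernet header as
soon as it sees the descriptor, do its other per-packet bookkeeping while
the cache line is in flight, and only then call eth_type_trans().  Just a
sketch of the kind of trick I mean, not taken from any particular driver:

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/etherdevice.h>
#include <linux/prefetch.h>

static void rx_one_packet_sketch(struct net_device *dev,
				 struct sk_buff *skb, int len)
{
	prefetch(skb->data);	/* header sits in cache-cold DMA'd memory */

	skb_put(skb, len);	/* bookkeeping that doesn't touch the data */

	/* by now the header is (hopefully) warm in the cache */
	skb->protocol = eth_type_trans(skb, dev);
	netif_rx(skb);
}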

-Andi


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 16:27                                                                           ` David S. Miller
@ 2003-06-10 16:57                                                                             ` chas williams
  0 siblings, 0 replies; 227+ messages in thread
From: chas williams @ 2003-06-10 16:57 UTC (permalink / raw)
  To: David S. Miller; +Cc: hadi, ralph+d, xerox, sim, fw, netdev, linux-net

In message <20030610.092748.115929981.davem@redhat.com>, "David S. Miller" writes:
>Ok, time to deploy ATM everywhere to replace our IP routers :)
>Sorry Chas, I couldn't resist... :)

i see a lot of crying about the 'atm tax' but it seems to me that the 'ip tax'
is typically much steeper (except when you graph packet_count*packet_size,
then you will see that the bulk of the data is carried by larger packets
where the tax isn't as high).  so for some applications, like voice, atm might
actually be a winner as far as the tax goes (as long as you aren't doing
voice over ip over atm).

honestly i needed real numbers to tune the atm driver on our linux-router.
i have two recv buffer pools--small and large (duh).  i needed an idea of what
to use for the small value.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: 3c59x
  2003-06-10 16:23                                                                           ` Jeff Garzik
@ 2003-06-10 17:02                                                                             ` David S. Miller
  2003-06-10 17:16                                                                               ` 3c59x Jeff Garzik
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-10 17:02 UTC (permalink / raw)
  To: jgarzik
  Cc: ak, bogdan.costescu, hadi, ralph+d, xerox, sim, fw, netdev, linux-net

   From: Jeff Garzik <jgarzik@pobox.com>
   Date: Tue, 10 Jun 2003 12:23:42 -0400
   
   I prefer a compile-time test.

This means end users don't see the benefit, so I definitely
prefer Andi's idea.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: 3c59x
  2003-06-10 16:45                                                                       ` 3c59x (was Route cache performance under stress) Bogdan Costescu
  2003-06-10 16:49                                                                         ` Andi Kleen
@ 2003-06-10 17:12                                                                         ` David S. Miller
  1 sibling, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-10 17:12 UTC (permalink / raw)
  To: bogdan.costescu; +Cc: sim, ralph+d, hadi, xerox, fw, netdev, linux-net

   From: Bogdan Costescu <bogdan.costescu@iwr.uni-heidelberg.de>
   Date: Tue, 10 Jun 2003 18:45:03 +0200 (CEST)

   With this driver I would typically get TCP bandwidth figures 
   4-5 Mbps lower than those obtained with 3c59x and noticable difference in 
   the parallel jobs timing (using MPI over TCP). I'm not saying that NAPI 
   will perform the same way, just that there might be also hardware limits 
   somewhere...
   
I think it won't; hardware interrupt mitigation schemes have lots of
problems that NAPI is more apt to deal with.

   But the real question is: does it make sense to spend time now in
   trying to improve a driver with hope for only a marginal speed
   increase ?

People who have the cards care, and I think PIO-->MMIO is more
than marginal.

Your attempt to get "latency" was ill-founded :)  Your limits
have to do with the wire speed, not all the cpu cycles being eaten
by PIO accesses.

On a DoS'd router, it's another situation altogether.

   And another important question: how much improvement can be gained from
   the driver ? Folks that do parallel computation over TCP over Ethernet
   know very well that the software in the kernel is the bottleneck (extra
   copies, TCP, IRQ management, etc).

Your limitations in parallel computation have to do with how TCP
behaves more than how TCP is implemented.

For starters try:

	echo "1" >/proc/sys/net/ipv4/tcp_low_latency

That's the kind of thing that will help parallel computation
folks, not driver hacks.


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: 3c59x
  2003-06-10 17:16                                                                               ` 3c59x Jeff Garzik
@ 2003-06-10 17:14                                                                                 ` David S. Miller
  2003-06-10 17:25                                                                                   ` 3c59x Jeff Garzik
  2003-06-10 17:18                                                                                 ` 3c59x Andi Kleen
  2003-06-10 17:29                                                                                 ` 3c59x chas williams
  2 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-10 17:14 UTC (permalink / raw)
  To: jgarzik
  Cc: ak, bogdan.costescu, hadi, ralph+d, xerox, sim, fw, netdev, linux-net

   From: Jeff Garzik <jgarzik@pobox.com>
   Date: Tue, 10 Jun 2003 13:16:17 -0400

   On Tue, Jun 10, 2003 at 10:02:09AM -0700, David S. Miller wrote:
   > This means end users don't see the benefit, so I definitely
   > prefer Andi's idea.
   
   Making every IO a conditional branch?  Ug.
   
A PIO costs hundreds if not thousands of instructions!
Come on Jeff, get real :-)

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: 3c59x
  2003-06-10 17:02                                                                             ` 3c59x David S. Miller
@ 2003-06-10 17:16                                                                               ` Jeff Garzik
  2003-06-10 17:14                                                                                 ` 3c59x David S. Miller
                                                                                                   ` (2 more replies)
  0 siblings, 3 replies; 227+ messages in thread
From: Jeff Garzik @ 2003-06-10 17:16 UTC (permalink / raw)
  To: David S. Miller
  Cc: ak, bogdan.costescu, hadi, ralph+d, xerox, sim, fw, netdev, linux-net

On Tue, Jun 10, 2003 at 10:02:09AM -0700, David S. Miller wrote:
>    From: Jeff Garzik <jgarzik@pobox.com>
>    Date: Tue, 10 Jun 2003 12:23:42 -0400
>    
>    I prefer a compile-time test.
> 
> This means end users don't see the benefit, so I definitely
> prefer Andi's idea.

Making every IO a conditional branch?  Ug.

	Jeff




^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: 3c59x
  2003-06-10 17:16                                                                               ` 3c59x Jeff Garzik
  2003-06-10 17:14                                                                                 ` 3c59x David S. Miller
@ 2003-06-10 17:18                                                                                 ` Andi Kleen
  2003-06-10 17:29                                                                                 ` 3c59x chas williams
  2 siblings, 0 replies; 227+ messages in thread
From: Andi Kleen @ 2003-06-10 17:18 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: David S. Miller, ak, bogdan.costescu, hadi, ralph+d, xerox, sim,
	fw, netdev, linux-net

On Tue, Jun 10, 2003 at 01:16:17PM -0400, Jeff Garzik wrote:
> On Tue, Jun 10, 2003 at 10:02:09AM -0700, David S. Miller wrote:
> >    From: Jeff Garzik <jgarzik@pobox.com>
> >    Date: Tue, 10 Jun 2003 12:23:42 -0400
> >    
> >    I prefer a compile-time test.
> > 
> > This means end users don't see the benefit, so I definitely
> > prefer Andi's idea.
> 
> Making every IO a conditional branch?  Ug.

An IO takes hundreds or even thousands of cycles. The test and branch
is completely lost in the noise. I bet you won't be able to measure
a difference on any modern CPU.

-Andi


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 15:56                                                                     ` David S. Miller
  2003-06-10 16:45                                                                       ` 3c59x (was Route cache performance under stress) Bogdan Costescu
@ 2003-06-10 17:19                                                                       ` Ralph Doncaster
  1 sibling, 0 replies; 227+ messages in thread
From: Ralph Doncaster @ 2003-06-10 17:19 UTC (permalink / raw)
  To: David S. Miller; +Cc: sim, hadi, xerox, fw, netdev, linux-net

On Tue, 10 Jun 2003, David S. Miller wrote:

>    From: Simon Kirby <sim@netnation.com>
>    Date: Mon, 9 Jun 2003 18:53:12 -0700
>
>    Your CPU use is quite a bit higher than ours.
>
> Yeah, but his faster cpu is all being burnt to a crisp
> doing PIO accesses to the 3c59x card.
>
>    I found that once NAPI was happening, userspace seemed to get a
>    fairly decent amount of time.
>
> Unfortunately, NAPI won't help him with the current way the 3c59x
> driver works.  It needs to provide a way to use MEM I/O before NAPI
> would start to be of use to him.

Well, I've already decided to retire the 3c905cx cards and drop in a
couple of the Pro/1000 cards I recently bought.  Considering the Intel
GigE cards cost me ~$50 now and the 3Coms are ~$45, I'd say anyone willing
to update 3c59x.c has misplaced priorities or too much time on their
hands...

-Ralph


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: 3c59x
  2003-06-10 17:14                                                                                 ` 3c59x David S. Miller
@ 2003-06-10 17:25                                                                                   ` Jeff Garzik
  2003-06-10 17:30                                                                                     ` 3c59x David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: Jeff Garzik @ 2003-06-10 17:25 UTC (permalink / raw)
  To: David S. Miller
  Cc: ak, bogdan.costescu, hadi, ralph+d, xerox, sim, fw, netdev, linux-net

On Tue, Jun 10, 2003 at 10:14:52AM -0700, David S. Miller wrote:
>    From: Jeff Garzik <jgarzik@pobox.com>
>    Date: Tue, 10 Jun 2003 13:16:17 -0400
> 
>    On Tue, Jun 10, 2003 at 10:02:09AM -0700, David S. Miller wrote:
>    > This means end users don't see the benefit, so I definitely
>    > prefer Andi's idea.
>    
>    Making every IO a conditional branch?  Ug.
>    
> A PIO costs hundreds if not thousands of instructions!
> Come on Jeff, get real :-)

MMIO isn't as bad as PIO, and the 'runtime choice' setup implies
slowing down the MMIO path, too.

	Jeff

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: 3c59x
  2003-06-10 17:16                                                                               ` 3c59x Jeff Garzik
  2003-06-10 17:14                                                                                 ` 3c59x David S. Miller
  2003-06-10 17:18                                                                                 ` 3c59x Andi Kleen
@ 2003-06-10 17:29                                                                                 ` chas williams
  2003-06-10 17:31                                                                                   ` 3c59x David S. Miller
  2 siblings, 1 reply; 227+ messages in thread
From: chas williams @ 2003-06-10 17:29 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: David S. Miller, ak, bogdan.costescu, hadi, ralph+d, xerox, sim,
	fw, netdev, linux-net

In message <20030610171617.GC1959@gtf.org>,Jeff Garzik writes:
>Making every IO a conditional branch?  Ug.

you could just test once during driver init and set up an indirection
to the appropriate function.  it's a little better than a test and branch.
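
something like this, say (just a sketch, all the names are made up):

#include <linux/types.h>
#include <asm/io.h>

/* resolve PIO vs. MMIO once at probe time into a pair of function
 * pointers, so the hot path itself has no per-access test */
struct nic_io_ops_example {
	u16  (*read16)(unsigned long base, int reg);
	void (*write16)(unsigned long base, int reg, u16 val);
};

static u16 pio_read16(unsigned long base, int reg)
{
	return inw(base + reg);
}

static void pio_write16(unsigned long base, int reg, u16 val)
{
	outw(val, base + reg);
}

static u16 mmio_read16(unsigned long base, int reg)
{
	return readw((void *)(base + reg));
}

static void mmio_write16(unsigned long base, int reg, u16 val)
{
	writew(val, (void *)(base + reg));
}

static const struct nic_io_ops_example pio_ops  = { pio_read16,  pio_write16  };
static const struct nic_io_ops_example mmio_ops = { mmio_read16, mmio_write16 };

/* chosen once at probe time, e.g.:  priv->io = use_mmio ? &mmio_ops : &pio_ops; */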

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: 3c59x
  2003-06-10 17:25                                                                                   ` 3c59x Jeff Garzik
@ 2003-06-10 17:30                                                                                     ` David S. Miller
  2003-06-10 19:20                                                                                       ` 3c59x Jeff Garzik
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-10 17:30 UTC (permalink / raw)
  To: jgarzik
  Cc: ak, bogdan.costescu, hadi, ralph+d, xerox, sim, fw, netdev, linux-net

   From: Jeff Garzik <jgarzik@pobox.com>
   Date: Tue, 10 Jun 2003 13:25:56 -0400
   
   MMIO isn't as bad as PIO, and the 'runtime choice' setup implies
   slowing down the MMIO path, too.

No end user will see the change then; no distribution vendor worth
their salt will ship with MMIO enabled.

Right now we get only PIO and everybody suffers.  Runtime MMIO
selection is a net win for everyone.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: 3c59x
  2003-06-10 17:29                                                                                 ` 3c59x chas williams
@ 2003-06-10 17:31                                                                                   ` David S. Miller
  2003-06-10 17:39                                                                                     ` 3c59x chas williams
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-10 17:31 UTC (permalink / raw)
  To: chas
  Cc: jgarzik, ak, bogdan.costescu, hadi, ralph+d, xerox, sim, fw,
	netdev, linux-net

   From: chas williams <chas@cmf.nrl.navy.mil>
   Date: Tue, 10 Jun 2003 13:29:54 -0400

   In message <20030610171617.GC1959@gtf.org>,Jeff Garzik writes:
   >Making every IO a conditional branch?  Ug.
   
   you could just test once during driver init and setup an indirection
   to the appropriate function.  its a little better than test and branch.
   
Function calls are actually more expensive; you eat an entry in
the cpu's return address cache.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 17:33                                                                     ` Ralph Doncaster
@ 2003-06-10 17:32                                                                       ` David S. Miller
  2003-06-10 18:34                                                                         ` Robert Olsson
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-10 17:32 UTC (permalink / raw)
  To: ralph+d, ralph; +Cc: hadi, xerox, sim, fw, netdev, linux-net

   From: Ralph Doncaster <ralph@istop.com>
   Date: Tue, 10 Jun 2003 13:33:01 -0400 (EDT)

   On Tue, 10 Jun 2003, David S. Miller wrote:
   
   > When packet (more specifically, software interrupt) processing
   > reaches a certain level, we offload the work into process context.
   
   That sounds good.

Adjust the nice value of the ksoftirqd tasks; that's the
only thing available.

But your problem has to do with all the PIO accesses; they
absolutely kill the machine.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 15:49                                                                   ` David S. Miller
@ 2003-06-10 17:33                                                                     ` Ralph Doncaster
  2003-06-10 17:32                                                                       ` David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: Ralph Doncaster @ 2003-06-10 17:33 UTC (permalink / raw)
  To: David S. Miller; +Cc: ralph+d, hadi, xerox, sim, fw, netdev, linux-net

On Tue, 10 Jun 2003, David S. Miller wrote:

>    From: Ralph Doncaster <ralph@istop.com>
>    Date: Mon, 9 Jun 2003 20:32:48 -0400 (EDT)
>
>    Lastly from the software side Linux doesn't seem to have anything like
>    BSD's parameter to control user/system CPU sharing.  Once my CPU load
>    reaches 70-80%, I'd rather have some dropped packets than let the CPU hit
>    100% and end up with my BGP sessions drop.
>
> When packet (more specifically, software interrupt) processing
> reaches a certain level, we offload the work into process context.

That sounds good.  Is there a sysctl I can use to define "certain level"?

-Ralph


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: 3c59x
  2003-06-10 17:31                                                                                   ` 3c59x David S. Miller
@ 2003-06-10 17:39                                                                                     ` chas williams
  2003-06-10 17:43                                                                                       ` 3c59x David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: chas williams @ 2003-06-10 17:39 UTC (permalink / raw)
  To: David S. Miller
  Cc: jgarzik, ak, bogdan.costescu, hadi, ralph+d, xerox, sim, fw,
	netdev, linux-net

>Function calls are actually more expensive, you eat an entry in
>the cpu's return address cache.

i was thinking you could do it higher up, like around the hard_start_xmit
level.  this would create a bit of replicated code, but you could abuse
the preprocessor a bit i imagine.
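
something like this maybe (all made-up names, just to show the shape of it;
nothing here is real 3c59x code):

#include <linux/types.h>
#include <asm/io.h>

#define EX_TX_KICK	0x0e	/* invented register offset */

struct ex_priv {
	unsigned long pio_base;
	void *mmio_base;
};

static inline void ex_outw_pio(struct ex_priv *p, int reg, u16 v)
{
	outw(v, p->pio_base + reg);
}

static inline void ex_outw_mmio(struct ex_priv *p, int reg, u16 v)
{
	writew(v, (char *)p->mmio_base + reg);
}

/* stamp out a PIO and an MMIO copy of the hot routine from one template;
 * the driver would install the right one once at probe time */
#define DEFINE_TX_KICK(fname, wr16)					\
static void fname(struct ex_priv *p)					\
{									\
	/* tell the NIC a new Tx descriptor is ready */			\
	wr16(p, EX_TX_KICK, 1);						\
}

DEFINE_TX_KICK(ex_tx_kick_pio,  ex_outw_pio)
DEFINE_TX_KICK(ex_tx_kick_mmio, ex_outw_mmio)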

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: 3c59x
  2003-06-10 17:39                                                                                     ` 3c59x chas williams
@ 2003-06-10 17:43                                                                                       ` David S. Miller
  0 siblings, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-10 17:43 UTC (permalink / raw)
  To: chas
  Cc: jgarzik, ak, bogdan.costescu, hadi, ralph+d, xerox, sim, fw,
	netdev, linux-net

   From: chas williams <chas@cmf.nrl.navy.mil>
   Date: Tue, 10 Jun 2003 13:39:15 -0400
   
   i was thinking you could do it higher up, like around the hard_start_xmit
   level.  this would create a bit of replicated code, but you could abuse
   the preprocessor a bit i imagine.

3c59x already does this, so now we'd have 4 different copies.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10  3:23                                                                       ` Ben Greear
  2003-06-10  3:41                                                                         ` Ralph Doncaster
@ 2003-06-10 18:10                                                                         ` Ralph Doncaster
  2003-06-10 18:21                                                                           ` Ben Greear
  1 sibling, 1 reply; 227+ messages in thread
From: Ralph Doncaster @ 2003-06-10 18:10 UTC (permalink / raw)
  To: Ben Greear; +Cc: 'netdev@oss.sgi.com'

On Mon, 9 Jun 2003, Ben Greear wrote:

> One waring about e1000's, make sure you have active airflow across the NICs
> if you put two together.  Otherwise, buy a dual port NIC...it has a single
> chip and you will have less cooling issues.

I just took a closer look at my e1000's.  They've got a small RC82540EM
BGA chip on them, manufactured in the 25th week of '02.  If these things do
get hot enough to cause problems, why wouldn't Intel have them manufactured
with heatsinks attached?

-Ralph

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 18:10                                                                         ` Ralph Doncaster
@ 2003-06-10 18:21                                                                           ` Ben Greear
  0 siblings, 0 replies; 227+ messages in thread
From: Ben Greear @ 2003-06-10 18:21 UTC (permalink / raw)
  To: ralph+d; +Cc: 'netdev@oss.sgi.com'

Ralph Doncaster wrote:
> On Mon, 9 Jun 2003, Ben Greear wrote:
> 
> 
>>One waring about e1000's, make sure you have active airflow across the NICs
>>if you put two together.  Otherwise, buy a dual port NIC...it has a single
>>chip and you will have less cooling issues.
> 
> 
> I just took a closer look at my e1000's.  They've got a small RC82540EM
> bga chip on them, manufactured 25th week of '02. If these things do get
> hot enough to cause problems why wouldn't Intel have them manufactured
> with heatsinks attached?

Dunno...  I wish they had.  My machine had fairly bad cooling (2U, open case). However, when
I put a fan on them, no reboots, whereas before I could crash the machine
with nasty memory corruption after about 1 hour of sustained > 100Mbps
bi-directional traffic.

The temp probe I used showed them to be at about their operating
max, though I forget what that was now...

Maybe your chipset or cooling is better and you won't hit it... but if you do
see crashes, try a fan :)

Ben

> 
> -Ralph
> 


-- 
Ben Greear <greearb@candelatech.com>       <Ben_Greear AT excite.com>
President of Candela Technologies Inc      http://www.candelatech.com
ScryMUD:  http://scry.wanfear.com     http://scry.wanfear.com/~greear

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 17:32                                                                       ` David S. Miller
@ 2003-06-10 18:34                                                                         ` Robert Olsson
  2003-06-10 18:57                                                                           ` David S. Miller
  2003-06-12  6:45                                                                           ` David S. Miller
  0 siblings, 2 replies; 227+ messages in thread
From: Robert Olsson @ 2003-06-10 18:34 UTC (permalink / raw)
  To: David S. Miller; +Cc: ralph+d, ralph, hadi, xerox, sim, fw, netdev, linux-net


Dave!

I ripped out the route hash just to test the slow path.  Seems like your 
patch was very good, as we see the same performance w/o the dst hash: ~114 kpps.

So my test system drops from 450 kpps to 114 kpps when every incoming 
interface carries 100% traffic with 1 dst/pkt, which is a very unlikely 
scenario I would say.  It is not that bad....

Conclusions:
* Your patch is good. (I played with some variants)
* We need to focus on the slow path if we feel like improving the 1 dst/pkt 
  scenario.


Input rate 2*190 kpps clone_skb=1

Iface   MTU Met  RX-OK RX-ERR RX-DRP RX-OVR  TX-OK TX-ERR TX-DRP TX-OVR Flags
eth0   1500   0 3001546 9684614 9684614 6998518     53      0      0      0 BRU
eth1   1500   0     13      0      0      0 3001497      0      0      0 BRU
eth2   1500   0 3001114 9678333 9678333 6998889      3      0      0      0 BRU
eth3   1500   0      2      0      0      0 3001115      0      0      0 BRU

rt_cache_stat
00009146  00000000 005b97ed 00000000 00000000 00000000 00000000 00000000  00000004 00000006 00000000 005b107e 005b1071 00000006 00000000 00000000 00000001 

--- net/ipv4/route.c.030610.2	2003-06-10 18:55:32.000000000 +0200
+++ net/ipv4/route.c	2003-06-10 19:09:23.000000000 +0200
@@ -722,44 +722,10 @@
 
 static int rt_intern_hash(unsigned hash, struct rtable *rt, struct rtable **rp)
 {
-	struct rtable	*rth, **rthp;
-	unsigned long	now = jiffies;
 	int attempts = !in_softirq();
 
 restart:
-	rthp = &rt_hash_table[hash].chain;
-
 	spin_lock_bh(&rt_hash_table[hash].lock);
-	while ((rth = *rthp) != NULL) {
-		if (compare_keys(&rth->fl, &rt->fl)) {
-			/* Put it first */
-			*rthp = rth->u.rt_next;
-			/*
-			 * Since lookup is lockfree, the deletion
-			 * must be visible to another weakly ordered CPU before
-			 * the insertion at the start of the hash chain.
-			 */
-			smp_wmb();
-			rth->u.rt_next = rt_hash_table[hash].chain;
-			/*
-			 * Since lookup is lockfree, the update writes
-			 * must be ordered for consistency on SMP.
-			 */
-			smp_wmb();
-			rt_hash_table[hash].chain = rth;
-
-			rth->u.dst.__use++;
-			dst_hold(&rth->u.dst);
-			rth->u.dst.lastuse = now;
-			spin_unlock_bh(&rt_hash_table[hash].lock);
-
-			rt_drop(rt);
-			*rp = rth;
-			return 0;
-		}
-
-		rthp = &rth->u.rt_next;
-	}
 
 	/* Try to bind route to arp only if it is output
 	   route or unicast forwarding path.
@@ -916,10 +882,7 @@
 
 static inline struct rtable *ip_rt_dst_alloc(unsigned int hash)
 {
-	if (atomic_read(&ipv4_dst_ops.entries) >
-	    ipv4_dst_ops.gc_thresh)
-		__rt_hash_shrink(hash);
-
+	__rt_hash_shrink(hash);
 	return dst_alloc(&ipv4_dst_ops);
 }
 
@@ -1801,37 +1764,6 @@
 int ip_route_input(struct sk_buff *skb, u32 daddr, u32 saddr,
 		   u8 tos, struct net_device *dev)
 {
-	struct rtable * rth;
-	unsigned	hash;
-	int iif = dev->ifindex;
-
-	tos &= IPTOS_RT_MASK;
-	hash = rt_hash_code(daddr, saddr ^ (iif << 5), tos);
-
-	prefetch(&rt_hash_table[hash].chain->fl);
-
-	rcu_read_lock();
-	for (rth = rt_hash_table[hash].chain; rth; rth = rth->u.rt_next) {
-		smp_read_barrier_depends();
-		if (rth->fl.fl4_dst == daddr &&
-		    rth->fl.fl4_src == saddr &&
-		    rth->fl.iif == iif &&
-		    rth->fl.oif == 0 &&
-#ifdef CONFIG_IP_ROUTE_FWMARK
-		    rth->fl.fl4_fwmark == skb->nfmark &&
-#endif
-		    rth->fl.fl4_tos == tos) {
-			rth->u.dst.lastuse = jiffies;
-			dst_hold(&rth->u.dst);
-			rth->u.dst.__use++;
-			RT_CACHE_STAT_INC(in_hit);
-			rcu_read_unlock();
-			skb->dst = (struct dst_entry*)rth;
-			return 0;
-		}
-		RT_CACHE_STAT_INC(in_hlist_search);
-	}
-
 	rcu_read_unlock();
 
 	/* Multicast recognition logic is moved from route cache to here.


Cheers.
						--ro

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 10:53                                                                       ` Jamal Hadi
                                                                                           ` (2 preceding siblings ...)
  2003-06-10 13:10                                                                         ` Ralph Doncaster
@ 2003-06-10 18:41                                                                         ` Florian Weimer
  2003-06-11 11:47                                                                           ` Was (Re: " Jamal Hadi
  3 siblings, 1 reply; 227+ messages in thread
From: Florian Weimer @ 2003-06-10 18:41 UTC (permalink / raw)
  To: Jamal Hadi
  Cc: ralph+d, CIT/Paul, 'Simon Kirby',
	'David S. Miller',
	netdev, linux-net

Jamal Hadi <hadi@shell.cyberus.ca> writes:

> Typically, real world is less intense than the lab. Ex: noone sends
> 100Mbps at 64 byte packet size.

Unfortunately, compromised hosts do send such traffic, and DoS victims
receive it. 8-(

You don't want your core routers to break down just because a couple
of the 150,000 hosts in your regional network have been compromised
(think of Slammer) or you are running an IRC server.

> Have you seen how the big boys advertise?

Typical GSR linecards for OC-48 are specified to handle 2 Mpps, but
the switch fabric is reportedly somewhat inert and the router might
choke before that if there are too many linecards involved (I haven't
observed this personally; this is just chatter from someone who works
daily with those beasts).  A couple of hundred kpps aren't a problem
for those routers, though, and neither is 300 Mbit (or was it 400?) of
Slammer traffic (with random destination addresses).

In general, the forwarding performance is nowadays specified in pps
and even flows per second if you look carefully at the data sheets.
Most vendors have learnt that people want routers with comforting
worst-case behavior.  However, you have to read carefully, e.g. a
Catalyst 6500 with Supervisor Engine 1 (instead of 2) can only create
650,000 flows per second, even if it has a much, much higher peak IP
forwarding rate.

(The times of routers which died when confronted with a rapid ICMP
sweep across a /16 are gone for good, I hope.)

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 18:34                                                                         ` Robert Olsson
@ 2003-06-10 18:57                                                                           ` David S. Miller
  2003-06-10 19:53                                                                             ` Robert Olsson
                                                                                               ` (3 more replies)
  2003-06-12  6:45                                                                           ` David S. Miller
  1 sibling, 4 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-10 18:57 UTC (permalink / raw)
  To: Robert.Olsson; +Cc: ralph+d, ralph, hadi, xerox, sim, fw, netdev, linux-net

   From: Robert Olsson <Robert.Olsson@data.slu.se>
   Date: Tue, 10 Jun 2003 20:34:50 +0200
   
   I ripped out the route hash just to test the slow path. Seems like your 
   patch was very good as we see the same performance w/o dst hash ~114 kpps.

How did you "rip it out"?  Just never look into the routing
cache hash and never add entries there?  If so, then yes it is
excellent simulation for pure slow path.

This is not purely an algorithmic problem.  The highest cost thing we
do in the slow path of input route processing is source validation.
This requires real brains to eliminate.

Actually, that's a good idea: if someone is brave, just rip out
fib_validate_source (just don't call it; it should work for valid
traffic) and see what happens :)

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: 3c59x
  2003-06-10 17:30                                                                                     ` 3c59x David S. Miller
@ 2003-06-10 19:20                                                                                       ` Jeff Garzik
  2003-06-10 19:21                                                                                         ` 3c59x David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: Jeff Garzik @ 2003-06-10 19:20 UTC (permalink / raw)
  To: David S. Miller
  Cc: ak, bogdan.costescu, hadi, ralph+d, xerox, sim, fw, netdev, linux-net

On Tue, Jun 10, 2003 at 10:30:30AM -0700, David S. Miller wrote:
> No end user will see the change then, no distribution vendor worth
> their salt will ship with MMIO enabled.
> 
> Right now we get only PIO and everybody suffers.  Runtime MMIO
> selection is a net win for everyone.

For 3c59x and select other drivers, I agree 100%

If we are making a general rule, I do not agree...

	Jeff

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: 3c59x
  2003-06-10 19:20                                                                                       ` 3c59x Jeff Garzik
@ 2003-06-10 19:21                                                                                         ` David S. Miller
  0 siblings, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-10 19:21 UTC (permalink / raw)
  To: jgarzik
  Cc: ak, bogdan.costescu, hadi, ralph+d, xerox, sim, fw, netdev, linux-net

   From: Jeff Garzik <jgarzik@pobox.com>
   Date: Tue, 10 Jun 2003 15:20:52 -0400

   On Tue, Jun 10, 2003 at 10:30:30AM -0700, David S. Miller wrote:
   > Right now we get only PIO and everybody suffers.  Runtime MMIO
   > selection is a net win for everyone.
   
   For 3c59x and select other drivers, I agree 100%
   
   If we are making a general rule, I do not agree...

I think we agree then.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 18:57                                                                           ` David S. Miller
@ 2003-06-10 19:53                                                                             ` Robert Olsson
  2003-06-10 21:36                                                                             ` CIT/Paul
                                                                                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 227+ messages in thread
From: Robert Olsson @ 2003-06-10 19:53 UTC (permalink / raw)
  To: David S. Miller
  Cc: Robert.Olsson, ralph+d, ralph, hadi, xerox, sim, fw, netdev, linux-net


David S. Miller writes:

 > How did you "rip it out"?  Just never look into the routing
 > cache hash and never add entries there?  If so, then yes it is
 > excellent simulation for pure slow path.

 Look at the patch... 
 hash lookup is bypassed -> always slow path. no lookup in rt_intern_hash,
 but we keep the entry in the hash, and ip_rt_dst_alloc is changed to run
 __rt_hash_shrink for each call.

 > This is not purely an algorithmic problem.  The highest cost thing we
 > do in the slow path of input route processing is source validation.
 > This requires real brains to eliminate.
 > 
 > Actually, that's a good idea, if someone if brave just rip out
 > fib_validate_source (just don't call it, should work for valid
 > traffic) and see what happens :)

 It should be easy to do... 

 Cheers.
						--ro

^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-10 18:57                                                                           ` David S. Miller
  2003-06-10 19:53                                                                             ` Robert Olsson
@ 2003-06-10 21:36                                                                             ` CIT/Paul
  2003-06-10 21:39                                                                             ` Ralph Doncaster
  2003-06-11 17:40                                                                             ` Robert Olsson
  3 siblings, 0 replies; 227+ messages in thread
From: CIT/Paul @ 2003-06-10 21:36 UTC (permalink / raw)
  To: 'David S. Miller', Robert.Olsson
  Cc: ralph+d, ralph, hadi, sim, fw, netdev, linux-net

Why do you need source validation if we are going to use it for a core
router :)
Is there anything else in there that may or may not be necessary
depending on the circumstances that we are using the router for?


Paul xerox@foonet.net http://www.httpd.net


-----Original Message-----
From: David S. Miller [mailto:davem@redhat.com] 
Sent: Tuesday, June 10, 2003 2:58 PM
To: Robert.Olsson@data.slu.se
Cc: ralph+d@istop.com; ralph@istop.com; hadi@shell.cyberus.ca;
xerox@foonet.net; sim@netnation.com; fw@deneb.enyo.de;
netdev@oss.sgi.com; linux-net@vger.kernel.org
Subject: Re: Route cache performance under stress


   From: Robert Olsson <Robert.Olsson@data.slu.se>
   Date: Tue, 10 Jun 2003 20:34:50 +0200
   
   I ripped out the route hash just to test the slow path. Seems like
your 
   patch was very good as we see the same performance w/o dst hash ~114
kpps.

How did you "rip it out"?  Just never look into the routing cache hash
and never add entries there?  If so, then yes it is excellent simulation
for pure slow path.

This is not purely an algorithmic problem.  The highest cost thing we do
in the slow path of input route processing is source validation. This
requires real brains to eliminate.

Actually, that's a good idea, if someone is brave just rip out
fib_validate_source (just don't call it, should work for valid
traffic) and see what happens :)

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 18:57                                                                           ` David S. Miller
  2003-06-10 19:53                                                                             ` Robert Olsson
  2003-06-10 21:36                                                                             ` CIT/Paul
@ 2003-06-10 21:39                                                                             ` Ralph Doncaster
  2003-06-10 22:20                                                                               ` David S. Miller
  2003-06-11 17:40                                                                             ` Robert Olsson
  3 siblings, 1 reply; 227+ messages in thread
From: Ralph Doncaster @ 2003-06-10 21:39 UTC (permalink / raw)
  To: David S. Miller; +Cc: Robert.Olsson, hadi, xerox, sim, fw, netdev, linux-net

On Tue, 10 Jun 2003, David S. Miller wrote:

> Actually, that's a good idea, if someone is brave just rip out
> fib_validate_source (just don't call it, should work for valid
> traffic) and see what happens :)

Looking at Simon's profile numbers, these seem to contribute a lot more
than fib_validate_source:
  1237 ipt_route_hook                            19.3281
  3120 do_gettimeofday                           21.6667
  8299 ip_packet_match                           24.6994
  8031 fib_lookup                                25.0969
  1877 fib_rule_put                              29.3281

What's the do_gettimeofday for?  Is that just a bogus one that shows up
from kernel profiling?  (I recall the profiling tool quantify had a
similar problem of showing gettimeofday calls that it was doing on its
own.)

I traced back the fib_lookup to ip_forward_finish, which seems to only
call ip_forward_options when there's IP options (go figure!), which would
make sense for a SYN packet (juno).  What I can't think of is why we'd
want to have special routing considerations for TCP SYN packets (and other
IP packets with options set).

-Ralph

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 21:39                                                                             ` Ralph Doncaster
@ 2003-06-10 22:20                                                                               ` David S. Miller
  2003-06-10 23:58                                                                                 ` Ralph Doncaster
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-10 22:20 UTC (permalink / raw)
  To: ralph+d, ralph; +Cc: Robert.Olsson, hadi, xerox, sim, fw, netdev, linux-net

   From: Ralph Doncaster <ralph@istop.com>
   Date: Tue, 10 Jun 2003 17:39:44 -0400 (EDT)
   
   Looking at Simon's profile numbers, these seem to contribute a lot more
   than fib_validate_source:

Ignore all the fib_rule*() and associated overhead, Simon
is going to turn off policy routing support so that stuff
drops out of the profiles.

The fib_lookup() will decrease significantly from the profiles
if fib_validate_source() is deleted, and this is what I want
confirmed from such an experiment.

     1237 ipt_route_hook                            19.3281
     3120 do_gettimeofday                           21.6667
     8299 ip_packet_match                           24.6994
     8031 fib_lookup                                25.0969
     1877 fib_rule_put                              29.3281
   
   What's the do_gettimeofday for?

Every packet records a timestamp.

   I traced back the fib_lookup to ip_forward_finish, which seems to only
   call ip_forward_options when there's IP options (go figure!), which would
   make sense for a SYN packet (juno).

You're thinking TCP options, not IP options.

   What I can't think of is why we'd want to have special routing
   considerations for TCP SYN packets (and other IP packets with
   options set).

ip_forward_options() has to update things like record-route
IP options, which record hop-by-hop information for diagnostic
tools like traceroute.


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 23:58                                                                                 ` Ralph Doncaster
@ 2003-06-10 23:57                                                                                   ` David S. Miller
  2003-06-11  0:41                                                                                     ` Ralph Doncaster
  2003-06-11  0:51                                                                                   ` Ben Greear
  1 sibling, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-10 23:57 UTC (permalink / raw)
  To: ralph+d, ralph; +Cc: Robert.Olsson, hadi, xerox, sim, fw, netdev, linux-net

   From: Ralph Doncaster <ralph@istop.com>
   Date: Tue, 10 Jun 2003 19:58:47 -0400 (EDT)

   On Tue, 10 Jun 2003, David S. Miller wrote:
   
   > Every packet records a timestamp.
   
   I'm not aware of anything in IP routing that requires a timestamp for
   every packet.  To me it sounds like we could rip that out too.

Guess you never run tcpdump nor use packet schedulers.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 22:20                                                                               ` David S. Miller
@ 2003-06-10 23:58                                                                                 ` Ralph Doncaster
  2003-06-10 23:57                                                                                   ` David S. Miller
  2003-06-11  0:51                                                                                   ` Ben Greear
  0 siblings, 2 replies; 227+ messages in thread
From: Ralph Doncaster @ 2003-06-10 23:58 UTC (permalink / raw)
  To: David S. Miller; +Cc: Robert.Olsson, hadi, xerox, sim, fw, netdev, linux-net

On Tue, 10 Jun 2003, David S. Miller wrote:

>    From: Ralph Doncaster <ralph@istop.com>
>    What's the do_gettimeofday for?
>
> Every packet records a timestamp.

I'm not aware of anything in IP routing that requires a timestamp for
every packet.  To me it sounds like we could rip that out too.

-Ralph


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 23:57                                                                                   ` David S. Miller
@ 2003-06-11  0:41                                                                                     ` Ralph Doncaster
  2003-06-11  0:58                                                                                       ` David S. Miller
  2003-06-11  0:58                                                                                       ` David S. Miller
  0 siblings, 2 replies; 227+ messages in thread
From: Ralph Doncaster @ 2003-06-11  0:41 UTC (permalink / raw)
  To: David S. Miller; +Cc: Robert.Olsson, hadi, xerox, sim, fw, netdev, linux-net

On Tue, 10 Jun 2003, David S. Miller wrote:

>    From: Ralph Doncaster <ralph@istop.com>
>    Date: Tue, 10 Jun 2003 19:58:47 -0400 (EDT)
>
>    On Tue, 10 Jun 2003, David S. Miller wrote:
>
>    > Every packet records a timestamp.
>
>    I'm not aware of anything in IP routing that requires a timestamp for
>    every packet.  To me it sounds like we could rip that out too.
>
> Guess you never run tcpdump nor use packet schedulers.

So because some (in the case of a core router almost none) of the packets
will need a timestamp, you do it for every single one of them?

This sounded so unbelievable to me that I took a quick look at the code to
see what I'd have to do to get rid of it.  It appears that gettimeofday is
not called for every packet; just for ICMP timestamp requests and for IP
options (ip_options_build and ip_options_compile).

-Ralph

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 23:58                                                                                 ` Ralph Doncaster
  2003-06-10 23:57                                                                                   ` David S. Miller
@ 2003-06-11  0:51                                                                                   ` Ben Greear
  2003-06-11  1:01                                                                                     ` David S. Miller
  1 sibling, 1 reply; 227+ messages in thread
From: Ben Greear @ 2003-06-11  0:51 UTC (permalink / raw)
  To: ralph+d
  Cc: David S. Miller, Robert.Olsson, hadi, xerox, sim, fw, netdev, linux-net

Ralph Doncaster wrote:
> On Tue, 10 Jun 2003, David S. Miller wrote:
> 
> 
>>   From: Ralph Doncaster <ralph@istop.com>
>>   What's the do_gettimeofday for?
>>
>>Every packet records a timestamp.
> 
> 
> I'm not aware of anything in IP routing that requires a timestamp for
> every packet.  To me it sounds like we could rip that out too.
> 
> -Ralph
> 

Maybe as a configurable option, since it would make tcpdump less useful.
Seems like we could kludge it up so that we used the TSC (or whatever that
really fast hardware clock is) to provide some relative stamp that could be
converted to a time_val later?

It does seem a bit wasteful to do the gettimeofday when most of the time
the result is ignored.

(Or, are there things other than tcpdump that need the gettimeofday stamp?)

Ben

-- 
Ben Greear <greearb@candelatech.com>       <Ben_Greear AT excite.com>
President of Candela Technologies Inc      http://www.candelatech.com
ScryMUD:  http://scry.wanfear.com     http://scry.wanfear.com/~greear

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-11  0:41                                                                                     ` Ralph Doncaster
@ 2003-06-11  0:58                                                                                       ` David S. Miller
  2003-06-11  0:58                                                                                       ` David S. Miller
  1 sibling, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-11  0:58 UTC (permalink / raw)
  To: ralph+d, ralph; +Cc: Robert.Olsson, hadi, xerox, sim, fw, netdev, linux-net

   From: Ralph Doncaster <ralph@istop.com>
   Date: Tue, 10 Jun 2003 20:41:13 -0400 (EDT)

   On Tue, 10 Jun 2003, David S. Miller wrote:
   
   > Guess you never run tcpdump nor use packet schedulers.
   
   So because some (in the case of a core router almost none) of the packets
   will need a timestamp, you do it for every single one of them?

In order to be accurate, we must obtain the timestamp
exactly when we receive the packet.

But until we know that the packet is for us or not (which
requires a route lookup), we don't know if we actually need
the timestamp or not.

This is not some arbitrary thing, this is how you have to
implement this.  It's not like we said "screw everyone,
let's get a timestamp all the time whether we need it or
not." :-)

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-11  0:41                                                                                     ` Ralph Doncaster
  2003-06-11  0:58                                                                                       ` David S. Miller
@ 2003-06-11  0:58                                                                                       ` David S. Miller
  1 sibling, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-11  0:58 UTC (permalink / raw)
  To: ralph+d, ralph; +Cc: Robert.Olsson, hadi, xerox, sim, fw, netdev, linux-net

   From: Ralph Doncaster <ralph@istop.com>
   Date: Tue, 10 Jun 2003 20:41:13 -0400 (EDT)
   
   This sounded so unbelievable to me that I took a quick look at the code to
   see what I'd have to do to get rid of it.  It appears that gettimeofday is
   not called for every packet; just for ICMP timestamp requests and for IP
   options (ip_options_build and ip_options_compile).

Stop looking in the IP code.  Look at where we get the packet
from the device, which is one layer up.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-11  0:51                                                                                   ` Ben Greear
@ 2003-06-11  1:01                                                                                     ` David S. Miller
  2003-06-11  1:15                                                                                       ` Ben Greear
                                                                                                         ` (2 more replies)
  0 siblings, 3 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-11  1:01 UTC (permalink / raw)
  To: greearb; +Cc: ralph+d, Robert.Olsson, hadi, xerox, sim, fw, netdev, linux-net

   From: Ben Greear <greearb@candelatech.com>
   Date: Tue, 10 Jun 2003 17:51:57 -0700
   
   Maybe as a configurable option, since it would make tcpdump less useful.
   Seems like we could kludge it up so that we used the TSC (or whatever that
   really fast hardware clock is) to provide some relative stamp that could be
   converted to a time_val later?

I have a strange feeling that Ralph's system isn't using
TSC and that's why it shows up so high on the profiles :-)
TSC do_gettimeofday() is REALLY cheap (TSC read plus a multiply which
x86 does in like 5 cycles).

Yes, this idea has been tossed around before.  But what's funny
is that on the bigger boxes, you don't use TSC because amongst
the different nodes of the machine they are skewed, so you have
to use the ACPI timer or something like that for timestamping.
   
   It does seem a bit wasteful to do the gettimeofday when most of the time
   the result is ignored.
   
   (Or, are there things other than tcpdump that need the gettimeofday stamp?)

SO_RECVSTAMP, any socket on the machine can ask for this.
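
A minimal userspace sketch of that ask (assuming the option shows up as
SO_TIMESTAMP/SCM_TIMESTAMP in the exported headers), which is why the
stamp has to be taken for every packet, not just the interesting ones:

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/time.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0), on = 1;
	struct sockaddr_in sin = { 0 };
	char data[2048], cbuf[CMSG_SPACE(sizeof(struct timeval))];
	struct iovec iov = { data, sizeof(data) };
	struct msghdr msg = { 0 };
	struct cmsghdr *cm;

	sin.sin_family = AF_INET;
	sin.sin_port = htons(9999);

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cbuf;
	msg.msg_controllen = sizeof(cbuf);

	setsockopt(fd, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof(on));
	bind(fd, (struct sockaddr *) &sin, sizeof(sin));
	recvmsg(fd, &msg, 0);	/* kernel attaches the RX timestamp here */

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm))
		if (cm->cmsg_level == SOL_SOCKET &&
		    cm->cmsg_type == SCM_TIMESTAMP) {
			struct timeval tv;
			memcpy(&tv, CMSG_DATA(cm), sizeof(tv));
			printf("rx stamp: %ld.%06ld\n",
			       (long) tv.tv_sec, (long) tv.tv_usec);
		}
	return 0;
}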

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-11  1:01                                                                                     ` David S. Miller
@ 2003-06-11  1:15                                                                                       ` Ben Greear
  2003-06-11  1:22                                                                                         ` David S. Miller
  2003-06-11  1:17                                                                                       ` Ralph Doncaster
  2003-06-11  7:25                                                                                       ` Andi Kleen
  2 siblings, 1 reply; 227+ messages in thread
From: Ben Greear @ 2003-06-11  1:15 UTC (permalink / raw)
  To: David S. Miller, 'netdev@oss.sgi.com'

David S. Miller wrote:
>    From: Ben Greear <greearb@candelatech.com>
>    Date: Tue, 10 Jun 2003 17:51:57 -0700
>    
>    Maybe as a configurable option, since it would make tcpdump less useful.
>    Seems like we could kludge it up so that we used the TSC (or whatever that
>    really fast hardware clock is) to provide some relative stamp that could be
>    converted to a time_val later?
> 
> I have a strange feeling that Ralph's system isn't using
> TSC and that's why it shows up so high on the profiles :-)
> TSC do_gettimeofday() is REALLY cheap (TSC read plus a multiply which
> x86 does in like 5 cycles).
> 
> Yes, this idea has been tossed around before.  But what's funny
> is that on the bigger boxes, you don't use TSC because amongst
> the different nodes of the machine they are skewed, so you have
> to use the ACPI timer or something like that for timestamping.

What determines whether or not we use the "TSC do_gettimeofday"?  Does
it automagically happen when you compile for P-III or something like
that?

And how big of a "bigger box" are you talking about...regular old SMP, or NUMA?

>    
>    It does seem a bit wasteful to do the gettimeofday when most of the time
>    the result is ignored.
>    
>    (Or, are there things other than tcpdump that need the gettimeofday stamp?)
> 
> SO_RECVSTAMP, any socket on the machine can ask for this.

Do we know when we are being asked for this value?  Ie, could we do the
TSC/ACPI timer -> time_val conversion here?  If TSC means cheap gettimeofday
anyway, then my last question is moot.

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>       <Ben_Greear AT excite.com>
President of Candela Technologies Inc      http://www.candelatech.com
ScryMUD:  http://scry.wanfear.com     http://scry.wanfear.com/~greear

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-11  1:01                                                                                     ` David S. Miller
  2003-06-11  1:15                                                                                       ` Ben Greear
@ 2003-06-11  1:17                                                                                       ` Ralph Doncaster
  2003-06-11  1:23                                                                                         ` David S. Miller
  2003-06-11  7:25                                                                                       ` Andi Kleen
  2 siblings, 1 reply; 227+ messages in thread
From: Ralph Doncaster @ 2003-06-11  1:17 UTC (permalink / raw)
  To: David S. Miller
  Cc: greearb, Robert.Olsson, hadi, xerox, sim, fw, netdev, linux-net

On Tue, 10 Jun 2003, David S. Miller wrote:

> TSC do_gettimeofday() is REALLY cheap (TSC read plus a multiply which
> x86 does in like 5 cycles).

Aren't the read_lock_irqsave and restore expensive?

        read_lock_irqsave(&xtime_lock, flags);
        usec = do_gettimeoffset();
        {
                unsigned long lost = jiffies - wall_jiffies;
                if (lost)
                        usec += lost * (1000000 / HZ);
        }
        sec = xtime.tv_sec;
        usec += xtime.tv_usec;
        read_unlock_irqrestore(&xtime_lock, flags);


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-11  1:15                                                                                       ` Ben Greear
@ 2003-06-11  1:22                                                                                         ` David S. Miller
  2003-06-11  1:51                                                                                           ` Ben Greear
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-11  1:22 UTC (permalink / raw)
  To: greearb; +Cc: netdev

   From: Ben Greear <greearb@candelatech.com>
   Date: Tue, 10 Jun 2003 18:15:36 -0700
   
   What determines whether or not we use the "TSC do_gettimeofday"?  Does
   it automagically happen when you compile for P-III or something like
   that?
   
The 2.5.x kernel has x86 platform drivers that decide this.

   And how big of a "bigger box" are you talking about...regular old
   SMP, or NUMA?

Many laptops cannot even use TSC reliably because of power management
etc. issues.

   > SO_RECVSTAMP, any socket on the machine can ask for this.
   
   Do we know when we are being asked for this value?

We have to take the timestamp at netif_receive_skb() for it to
be accurate.

We don't even know if this packet is for this host until a long
time later, let alone whether any local sockets want SO_RECVSTAMP
or whether any IP options want timestamp or whether tcpdump is
listening etc.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-11  1:17                                                                                       ` Ralph Doncaster
@ 2003-06-11  1:23                                                                                         ` David S. Miller
  2003-06-11  7:28                                                                                           ` Andi Kleen
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-11  1:23 UTC (permalink / raw)
  To: ralph+d, ralph
  Cc: greearb, Robert.Olsson, hadi, xerox, sim, fw, netdev, linux-net

   From: Ralph Doncaster <ralph@istop.com>
   Date: Tue, 10 Jun 2003 21:17:28 -0400 (EDT)

   Aren't the read_lock_irqsave and restore expensive?

If x86 has an inefficient implementation, well... :-)

This can be done without locks, nobody has done the x86 implementation
of that, that's all.  I think the x86_64 folks did a lockless version,
I know I did for sparc64 :)


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-11  1:22                                                                                         ` David S. Miller
@ 2003-06-11  1:51                                                                                           ` Ben Greear
  2003-06-11  3:33                                                                                             ` David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: Ben Greear @ 2003-06-11  1:51 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

David S. Miller wrote:

>    Do we know when we are being asked for this value?
> 
> We have to take the timestamp at netif_receive_skb() for it to
> be accurate.
> 
> We don't even know if this packet is for this host until a long
> time later, let alone whether any local sockets want SO_RECVSTAMP
> or whether any IP options want timestamp or whether tcpdump is
> listening etc.

Yes, I understand why we want a time-stamp very early...but if
we can get _some_ sort of time stamp very cheap (TSC, for example)
then we can potentially defer the more expensive conversion of
this stamp into the equivalent of whatever do_gettimeofday will
give us.

We could set an 'is-timestamp-converted-already' flag on the skb and have
a macro that gets the timestamp.  This macro can do the conversion
as needed and return the value to calling code.  For platforms that do
not support TSC or its equivalent, we can just use gettimeofday for the
original stamp and set the flag.
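
A minimal sketch of the shape I mean (field and helper names are made
up, the calibration constant is a placeholder, and the real conversion
would of course have to be per-arch):

#include <linux/time.h>		/* struct timeval, do_gettimeofday()   */
#include <asm/timex.h>		/* get_cycles(), cycles_t              */

static unsigned long assumed_cpu_khz = 1800000;	/* placeholder calibration */

struct deferred_stamp {
	cycles_t	raw;		/* cheap counter read taken at RX     */
	struct timeval	tv;		/* filled in lazily, on first request */
	int		converted;
};

static inline void deferred_stamp_rx(struct deferred_stamp *st)
{
	st->raw = get_cycles();		/* the only per-packet cost */
	st->converted = 0;
}

static inline struct timeval *deferred_stamp_read(struct deferred_stamp *st)
{
	if (!st->converted) {
		unsigned long long delta = get_cycles() - st->raw;
		long age_us = (long) ((delta * 1000ULL) / assumed_cpu_khz);

		do_gettimeofday(&st->tv);	/* expensive, but now rare */
		/* wind "now" back to when the packet actually arrived */
		st->tv.tv_sec  -= age_us / 1000000;
		st->tv.tv_usec -= age_us % 1000000;
		if (st->tv.tv_usec < 0) {
			st->tv.tv_usec += 1000000;
			st->tv.tv_sec--;
		}
		st->converted = 1;
	}
	return &st->tv;
}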


-- 
Ben Greear <greearb@candelatech.com>       <Ben_Greear AT excite.com>
President of Candela Technologies Inc      http://www.candelatech.com
ScryMUD:  http://scry.wanfear.com     http://scry.wanfear.com/~greear

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-11  1:51                                                                                           ` Ben Greear
@ 2003-06-11  3:33                                                                                             ` David S. Miller
  2003-06-11 11:54                                                                                               ` gettime: Was (Re: " Jamal Hadi
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-11  3:33 UTC (permalink / raw)
  To: greearb; +Cc: netdev

   From: Ben Greear <greearb@candelatech.com>
   
   Yes, I understand why we want a time-stamp very early...but if
   we can get _some_ sort of time stamp very cheap (TSC, for example)
   then we can potentially defer the more expensive conversion of
   this stamp into the equivalent of whatever do_gettimeofday will
   give us.

I fully understand your idea, I've talked about it with Alexey many
times.  Someone just has to implement it.

pkt_sched.h is probably the place to play, maybe make an
asm/pkt_sched.h header.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-11  1:01                                                                                     ` David S. Miller
  2003-06-11  1:15                                                                                       ` Ben Greear
  2003-06-11  1:17                                                                                       ` Ralph Doncaster
@ 2003-06-11  7:25                                                                                       ` Andi Kleen
  2 siblings, 0 replies; 227+ messages in thread
From: Andi Kleen @ 2003-06-11  7:25 UTC (permalink / raw)
  To: David S. Miller
  Cc: greearb, ralph+d, Robert.Olsson, hadi, xerox, sim, fw, netdev, linux-net

> I have a strange feeling that Ralph's system isn't using
> TSC and that's why it shows up so high on the profiles :-)
> TSC do_gettimeofday() is REALLY cheap (TSC read plus a multiply which
> x86 does in like 5 cycles).

On a P4 rdtsc takes 90+ cycles (probably because it's flushing the complete
pipeline). Of course it's still relatively fast if you run that at 3 GHz,
but on slower P4s it may hurt.

On Athlons/Hammers it is quite fast, but at least on Hammer it needs
a pipeline flush again for accuracy (otherwise the CPU can speculate
it around).

One bigger cost is normally the rw lock or the two memory barriers for
the seqlock (on 2.5). On a UP compiled kernel it should not be a problem
though.

-Andi

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-11  1:23                                                                                         ` David S. Miller
@ 2003-06-11  7:28                                                                                           ` Andi Kleen
  0 siblings, 0 replies; 227+ messages in thread
From: Andi Kleen @ 2003-06-11  7:28 UTC (permalink / raw)
  To: David S. Miller
  Cc: ralph+d, ralph, greearb, Robert.Olsson, hadi, xerox, sim, fw,
	netdev, linux-net

On Tue, Jun 10, 2003 at 06:23:38PM -0700, David S. Miller wrote:
>    From: Ralph Doncaster <ralph@istop.com>
>    Date: Tue, 10 Jun 2003 21:17:28 -0400 (EDT)
> 
>    Aren't the read_lock_irqsave and restore expensive?
> 
> If x86 has an inefficient implementation, well... :-)

sti/cli is normally fast on x86, a bit slower on P3 core (a few cycles or so).
read_lock_irqsave does a pushfl though, that's rather slow on P4,
but still not that bad. read_lock_irq would be faster, but too risky 
here.

> 
> This can be done without locks, nobody has done the x86 implementation
> of that that's all.  I think the x86_64 folks did a lockless version,
> I know I did for sparc64 :)

2.5 i386 gettimeofday is lockless. But on UP it should not make any difference
anyway.
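
For reference, the lockless version is just a sequence counter around
the read.  A sketch of the pattern (illustration only, not the real
arch code; my_time_lock/my_wall_time are made-up names):

#include <linux/seqlock.h>
#include <linux/time.h>

static seqlock_t my_time_lock = SEQLOCK_UNLOCKED;
static struct timeval my_wall_time;	/* updated from the timer tick */

static void sample_time(struct timeval *tv)
{
	unsigned long seq;

	do {
		seq = read_seqbegin(&my_time_lock);
		*tv = my_wall_time;	/* plus a fine-grained offset in reality */
	} while (read_seqretry(&my_time_lock, seq));	/* retry if a writer ran */
}

static void tick_update(const struct timeval *now)
{
	write_seqlock_irq(&my_time_lock);	/* writer: the timer tick path */
	my_wall_time = *now;
	write_sequnlock_irq(&my_time_lock);
}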

-Andi

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: 3c59x (was Route cache performance under stress)
  2003-06-10 16:49                                                                         ` Andi Kleen
@ 2003-06-11  9:54                                                                           ` Robert Olsson
  2003-06-11 10:05                                                                             ` Andi Kleen
  0 siblings, 1 reply; 227+ messages in thread
From: Robert Olsson @ 2003-06-11  9:54 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Bogdan Costescu, David S. Miller, sim, ralph+d, hadi, xerox, fw,
	netdev, linux-net



Andi Kleen writes:

 > You can play some tricks with the driver to make eth_type_trans disappear
 > from the profiles. This usually helps a lot because it avoids one
 > full "fetch from cache cold memory" roundtrip per packet, which is slow on
 > any CPU.


 Andi!
 Interesting. Can we get into details?

 Cheers.

						--ro


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: 3c59x (was Route cache performance under stress)
  2003-06-11  9:54                                                                           ` Robert Olsson
@ 2003-06-11 10:05                                                                             ` Andi Kleen
  2003-06-11 10:38                                                                               ` Robert Olsson
  2003-06-11 12:08                                                                               ` Jamal Hadi
  0 siblings, 2 replies; 227+ messages in thread
From: Andi Kleen @ 2003-06-11 10:05 UTC (permalink / raw)
  To: Robert Olsson
  Cc: Andi Kleen, Bogdan Costescu, David S. Miller, sim, ralph+d, hadi,
	xerox, fw, netdev, linux-net

On Wed, Jun 11, 2003 at 11:54:34AM +0200, Robert Olsson wrote:
> 
> 
> Andi Kleen writes:
> 
>  > You can play some tricks with the driver to make eth_type_trans disappear
>  > from the profiles. This usually helps a lot because it avoids one
>  > full "fetch from cache cold memory" roundtrip per packet, which is slow on
>  > any CPU.
> 
> 
>  Andi!
>  Interesting. Can we get into details?

eth_type_trans checks the ethernet protocol ID and sets the broadcast/multicast/
unicast L2 type.

Some NICs have bits in the RX descriptor for most of them. They have a 
"packet is TCP or UDP or IP" bit and also a bit for unicast or sometimes
even multicast/broadcast. So when you have the RX descriptor you 
can just derive these values from there and put them into the skb
without calling eth_type_trans or looking at the cache cold header.

Then you do a prefetch on the header. When the packet reaches the 
network stack later the header has already reached cache  and it can be
processed without a memory round trip latency.

Caveats: 
On some cards it doesn't work for all packets or can be only done 
if you don't have any multicast addresses hashed (that's the case
for the e1000 if I read the header bits correctly). The lxt1001 
(old EOLed card) can do it for all packet types.

Often prefetch size is limited so you should not prefetch more
than what you can store until the packet reaches the stack.
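
Roughly like this in the RX path, with invented descriptor bit names
(which bits exist, and how trustworthy they are, is per-NIC as described
above; the real thing would also record the MAC header pointer):

#include <linux/types.h>
#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <linux/prefetch.h>

struct fake_rx_desc {			/* stand-in for the NIC's RX descriptor */
	u16 status;
#define RXD_IS_IP	0x0001		/* "packet is IP/TCP/UDP"              */
#define RXD_IS_BCAST	0x0002
#define RXD_IS_MCAST	0x0004
};

static void rx_classify(struct sk_buff *skb, struct net_device *dev,
			struct fake_rx_desc *desc)
{
	if (desc->status & RXD_IS_IP) {
		/* Everything eth_type_trans() would have told us, without
		 * touching the cache cold MAC header. */
		skb->dev = dev;
		skb->protocol = htons(ETH_P_IP);
		if (desc->status & RXD_IS_BCAST)
			skb->pkt_type = PACKET_BROADCAST;
		else if (desc->status & RXD_IS_MCAST)
			skb->pkt_type = PACKET_MULTICAST;
		else
			skb->pkt_type = PACKET_HOST;
		skb_pull(skb, ETH_HLEN);	/* step over the MAC header */
		prefetch(skb->data);		/* start pulling the IP header in now */
	} else {
		/* anything the descriptor can't classify goes the old way */
		skb->protocol = eth_type_trans(skb, dev);
	}
}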

-Andi

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: 3c59x (was Route cache performance under stress)
  2003-06-11 10:05                                                                             ` Andi Kleen
@ 2003-06-11 10:38                                                                               ` Robert Olsson
  2003-06-11 12:08                                                                               ` Jamal Hadi
  1 sibling, 0 replies; 227+ messages in thread
From: Robert Olsson @ 2003-06-11 10:38 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Robert Olsson, Bogdan Costescu, David S. Miller, sim, ralph+d,
	hadi, xerox, fw, netdev, linux-net



Andi Kleen writes:

 > eth_type_trans checks the ethernet protocol ID and sets the broadcast/multicast/
 > unicast L2 type.
 > 
 > Some NICs have bits in the RX descriptor for most of them. They have a 
 > "packet is TCP or UDP or IP" bit and also a bit for unicast or sometimes
 > even multicast/broadcast. So when you have the RX descriptor you 
 > can just derive these values from there and put them into the skb
 > without calling eth_type_trans or looking at the cache cold header.
 > 
 > Then you do a prefetch on the header. When the packet reaches the 
 > network stack later the header has already reached cache  and it can be
 > processed without a memory round trip latency.
 > 
 > Caveats: 
 > On some cards it doesn't work for all packets or can be only done 
 > if you don't have any multicast addresses hashed (that's the case
 > for the e1000 if I read the header bits correctly). The lxt1001 
 > (old EOLed card) can do it for all packet types.

 Thanks!
 
 Yes. I'd like to give this a try when I get a chance. It should be something
 for the driver authors. Any patch handy? e1000?


 Cheers.
						--ro

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Was (Re: Route cache performance under stress
  2003-06-10 18:41                                                                         ` Florian Weimer
@ 2003-06-11 11:47                                                                           ` Jamal Hadi
  2003-06-11 18:41                                                                             ` Real World Routers 8-) Florian Weimer
  0 siblings, 1 reply; 227+ messages in thread
From: Jamal Hadi @ 2003-06-11 11:47 UTC (permalink / raw)
  To: Florian Weimer
  Cc: ralph+d, CIT/Paul, 'Simon Kirby',
	'David S. Miller',
	netdev, linux-net



On Tue, 10 Jun 2003, Florian Weimer wrote:

> In general, the forwarding performance is nowadays specified in pps
> and even flows per second if you look carefully at the data sheets.

Ok, this is interesting. I have never seen the flows per second
used for simple L3 forwarding. I have seen them being used for NAT or
firewalling.
Looking at the Sprint traffic patterns, I think flows/sec is a
meaningful metric.

> Most vendors have learnt that people want routers with comforting
> worst-case behavior.  However, you have to read carefully, e.g. a
> Catalyst 6500 with Supervisor Engine 1 (instead of 2) can only create
> 650,000 flows per second, even if it has a much, much higher peak IP
> forwarding rate.
>

So 2Mpps of 650Kflows/sec ?

> (The times of routers which died when confronted with a rapid ICMP
> sweep across a /16 are gone for good, I hope.)

We should be able to punish specific misbehaving flows. Do you know
if any routers are implementing proper DOS tracebacks to allow for
inserting drop filters?

cheers,
jamal

^ permalink raw reply	[flat|nested] 227+ messages in thread

* gettime: Was (Re: Route cache performance under stress
  2003-06-11  3:33                                                                                             ` David S. Miller
@ 2003-06-11 11:54                                                                                               ` Jamal Hadi
  2003-06-11 12:08                                                                                                 ` Andi Kleen
                                                                                                                   ` (2 more replies)
  0 siblings, 3 replies; 227+ messages in thread
From: Jamal Hadi @ 2003-06-11 11:54 UTC (permalink / raw)
  To: David S. Miller; +Cc: greearb, netdev


Ok, time to go into another separate thread ;->

Sounds like a good idea.

if (skbneedstimestamp)
	do_gettimeofday(&skb->stamp);
else
	defertimestamp()

For defertimestamp() would it be feasible that you store only the
jiffies value in the skb then get timeofday later and somehow
compensate for the difference? Seems very doable to me.

Question is when do you decide skbneedstimestamp?
Is it when the device is in promiscuous mode, or do you do it in IP or ICMP etc?

cheers,
jamal


On Tue, 10 Jun 2003, David S. Miller wrote:

>    From: Ben Greear <greearb@candelatech.com>
>
>    Yes, I understand why we want a time-stamp very early...but if
>    we can get _some_ sort of time stamp very cheap (TSC, for example)
>    then we can potentially defer the more expensive conversion of
>    this stamp into the equivalent of whatever do_gettimeofday will
>    give us.
>
> I fully understand your idea, I've talked about it with Alexey many
> times.  Someone just has to implement it.
>
> pkt_sched.h is probably the place to play, maybe make an
> asm/pkt_sched.h header.
>
>
>

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: 3c59x (was Route cache performance under stress)
  2003-06-11 10:05                                                                             ` Andi Kleen
  2003-06-11 10:38                                                                               ` Robert Olsson
@ 2003-06-11 12:08                                                                               ` Jamal Hadi
  1 sibling, 0 replies; 227+ messages in thread
From: Jamal Hadi @ 2003-06-11 12:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Robert Olsson, Bogdan Costescu, David S. Miller, sim, ralph+d,
	xerox, fw, netdev, linux-net



On Wed, 11 Jun 2003, Andi Kleen wrote:

> eth_type_trans checks the ethernet protocol ID and sets the broadcast/multicast/
> unicast L2 type.
>
> Some NICs have bits in the RX descriptor for most of them. They have a
> "packet is TCP or UDP or IP" bit and also a bit for unicast or sometimes
> even multicast/broadcast. So when you have the RX descriptor you
> can just derive these values from there and put them into the skb
> without calling eth_type_trans or looking at the cache cold header.
>
> Then you do a prefetch on the header. When the packet reaches the
> network stack later the header has already reached cache  and it can be
> processed without a memory round trip latency.
>

I have done prefetching experiments with a NAPIezed sb1250.c driver on
MIPS. I never got rid of eth_type_trans(), just prefetched skb->data
a few lines before calling it. I did see eth_type_trans() almost
disappear from the profile (it dropped way too low to be important).
Andi's idea is even more interesting.

I did see, I think, about 10Kpps more in throughput.
Robert, this means our biggest bottleneck right now is cache misses.
The MIPS processor I am playing with is SMP and has a large shared L2
cache. What I am observing is that this is quite useful for SMP.
I am limited by how much traffic I can generate right now to test it
more. I can do 295Kpps L3 easy.  This board is an excuse for you to
come down to Ottawa in July ;->


> Caveats:
> On some cards it doesn't work for all packets or can be only done
> if you don't have any multicast addresses hashed (that's the case
> for the e1000 if I read the header bits correctly). The lxt1001
> (old EOLed card) can do it for all packet types.
>

So can the sb1250. I'll try this out.

> Often prefetch size is limited so you should not prefetch more
> than what you can store until the packet reaches the stack.
>

Good point. So is there a systematic way to find out the effects
of the prefetch size, or do you just have to keep trying until you get
it right?

cheers,
jamal

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: gettime: Was (Re: Route cache performance under stress
  2003-06-11 11:54                                                                                               ` gettime: Was (Re: " Jamal Hadi
@ 2003-06-11 12:08                                                                                                 ` Andi Kleen
  2003-06-12  3:30                                                                                                   ` David S. Miller
  2003-06-11 15:57                                                                                                 ` Ben Greear
  2003-06-12  3:29                                                                                                 ` David S. Miller
  2 siblings, 1 reply; 227+ messages in thread
From: Andi Kleen @ 2003-06-11 12:08 UTC (permalink / raw)
  To: Jamal Hadi; +Cc: David S. Miller, greearb, netdev

On Wed, Jun 11, 2003 at 07:54:53AM -0400, Jamal Hadi wrote:
> 
> Ok, time to go into another separate thread ;->
> 
> Sounds like a good idea.
> 
> if (skbneedstimestamp)
> 	do_gettimeofday(&skb->stamp);
> else
> 	defertimestamp()

Another way is to just store jiffies (= 10 or 1ms accuracy) 

This should be nearly zero cost and accurate enough at least for TCP.

-Andi

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: gettime: Was (Re: Route cache performance under stress
  2003-06-11 11:54                                                                                               ` gettime: Was (Re: " Jamal Hadi
  2003-06-11 12:08                                                                                                 ` Andi Kleen
@ 2003-06-11 15:57                                                                                                 ` Ben Greear
  2003-06-12  3:29                                                                                                 ` David S. Miller
  2 siblings, 0 replies; 227+ messages in thread
From: Ben Greear @ 2003-06-11 15:57 UTC (permalink / raw)
  To: Jamal Hadi; +Cc: David S. Miller, netdev

Jamal Hadi wrote:
> Ok, time to go into another separate thread ;->
> 
> Sounds like a good idea.
> 
> if (skbneedstimestamp)
> 	do_gettimeofday(&skb->stamp);
> else
> 	defertimestamp()
> 
> For defertimestamp() would it be feasible that you store only the
> jiffies value in the skb then get timeofday later and somehow
> compensate for the difference? Seems very doable to me.
> 
> Question is when do you decide skbneedstimestamp?
> Is it when the device is in promiscous mode or do it in ip or icmp etc?
> 
> cheers,
> jamal

Jiffies is not nearly precise enough.  You need something with
usec precision at least.

If we make a macro to read the value (converting as needed), and just
change all the readers to use that macro, then we don't have to make any
interesting decisions in the networking core.

Ben


-- 
Ben Greear <greearb@candelatech.com>       <Ben_Greear AT excite.com>
President of Candela Technologies Inc      http://www.candelatech.com
ScryMUD:  http://scry.wanfear.com     http://scry.wanfear.com/~greear

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 18:57                                                                           ` David S. Miller
                                                                                               ` (2 preceding siblings ...)
  2003-06-10 21:39                                                                             ` Ralph Doncaster
@ 2003-06-11 17:40                                                                             ` Robert Olsson
  2003-06-13  5:38                                                                               ` David S. Miller
  3 siblings, 1 reply; 227+ messages in thread
From: Robert Olsson @ 2003-06-11 17:40 UTC (permalink / raw)
  To: David S. Miller
  Cc: Robert.Olsson, ralph+d, ralph, hadi, xerox, sim, fw, netdev, linux-net


David S. Miller writes:

 > Actually, that's a good idea, if someone if brave just rip out
 > fib_validate_source (just don't call it, should work for valid
 > traffic) and see what happens :)


Just about 9% better, a bit of a surprise...

Still 1 dst/pkt. Input rate 2*189 kpps. All slow path with fib_validate_source
removed. Now 121 kpps. (114 kpps before)

Iface   MTU Met  RX-OK RX-ERR RX-DRP RX-OVR  TX-OK TX-ERR TX-DRP TX-OVR Flags
eth0   1500   0 3212017 9661983 9661983 6787987      8      0      0      0 BRU
eth1   1500   0      9      0      0      0 3212020      0      0      0 BRU
eth2   1500   0 3212714 9656726 9656726 6787290      4      0      0      0 BRU
eth3   1500   0      1      0      0      0 3212713      0      0      0 BRU
rt_cache_stat
00008b63  00000000 0062089f 00000000 00000000 00000000 00000000 00000000  00000000 00000001 00000000 00617a8b 00617a7f 00000005 00000000 00000000 00000002 


So I added fib_validate_source again and profiled the 1 dst/pkt case. So this is
just a profile of the slow path with some different performance counters. I
guess the first is most interesting.
 
Cpu type: P4 / Xeon
Cpu speed was (MHz estimation) : 1799.55
Counter 0 counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (count cycles when processor is active) count 180000
vma      samples  %-age       symbol name
c023c038 107340   33.143      fn_hash_lookup
c013154c 17399    5.37223     free_block
c0211364 16502    5.09527     __rt_hash_shrink
c01316e4 12854    3.96889     kmem_cache_alloc
c01b86dc 11719    3.61844     e1000_clean_rx_irq
c02033a0 11557    3.56842     alloc_skb
c0212330 11378    3.51315     ip_route_input_slow
c020cc98 9765     3.01511     eth_type_trans
c0208860 7986     2.46581     dst_alloc
c0216d98 7733     2.38769     ip_output
c021200c 6940     2.14284     rt_set_nexthop
c0213a9c 6331     1.9548      dst_free
c0126998 6272     1.93659     rcu_do_batch
c02035cc 6164     1.90324     skb_release_data
c02036c4 6068     1.8736      __kfree_skb
c01b8558 5532     1.7081      e1000_clean_tx_irq
c01b7678 4970     1.53457     e1000_xmit_frame
c020905c 4965     1.53303     neigh_lookup
c013179c 4819     1.48795     kmem_cache_free
c01317e0 4441     1.37123     kfree
c020cb30 4002     1.23568     eth_header
c0131728 3522     1.08748     kmalloc
c0131384 3434     1.06031     cache_alloc_refill
c023a5fc 3392     1.04734     fib_validate_source
c023d814 2989     0.922904    fib_lookup
c0113368 2190     0.676199    mark_offset_tsc

Cpu type: P4 / Xeon
Cpu speed was (MHz estimation) : 1799.55
Counter 7 counted MISPRED_BRANCH_RETIRED events (retired mispredicted branches) with a unit mask of 0x01 (retired instruction is non-bogus) count 18000
vma      samples  %-age       symbol name
c023c038 5246     85.0933     fn_hash_lookup
c020905c 194      3.1468      neigh_lookup
c0131384 99       1.60584     cache_alloc_refill
c02036c4 66       1.07056     __kfree_skb
c020ce70 51       0.827251    qdisc_restart
c02033a0 51       0.827251    alloc_skb
c0211364 44       0.713706    __rt_hash_shrink
c01b86dc 32       0.519059    e1000_clean_rx_irq
c023d814 28       0.454177    fib_lookup
c0213a9c 25       0.405515    dst_free
c0210ce8 25       0.405515    rt_garbage_collect
c020ef04 23       0.373074    pfifo_dequeue
c01b8558 20       0.324412    e1000_clean_tx_irq
c0206dcc 19       0.308191    netif_receive_skb
c0206880 18       0.291971    dev_queue_xmit
c01b8ab0 18       0.291971    e1000_alloc_rx_buffers
c02155e0 17       0.27575     ip_forward
c021200c 15       0.243309    rt_set_nexthop
c020cc98 13       0.210868    eth_type_trans
c01b7678 13       0.210868    e1000_xmit_frame
c0212330 12       0.194647    ip_route_input_slow
c0131728 12       0.194647    kmalloc
c010f3d0 12       0.194647    do_gettimeofday
c020a12c 9        0.145985    neigh_resolve_output
c010c350 9        0.145985    do_IRQ
c0216d98 8        0.129765    ip_output

Cpu type: P4 / Xeon
Cpu speed was (MHz estimation) : 1799.55
Counter 0 counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x100 (Not set) count 18000
vma      samples  %-age       symbol name
c023c038 2361     31.3047     fn_hash_lookup
c013154c 686      9.09573     free_block
c0211364 507      6.72235     __rt_hash_shrink
c0208860 502      6.65606     dst_alloc
c01b86dc 433      5.74118     e1000_clean_rx_irq
c0213a9c 393      5.21082     dst_free
c0126998 378      5.01193     rcu_do_batch
c020cc98 262      3.47388     eth_type_trans
c02036c4 237      3.1424      __kfree_skb
c0126970 234      3.10263     call_rcu
c01b8558 212      2.81093     e1000_clean_tx_irq
c0216d98 208      2.75789     ip_output
c02035cc 202      2.67833     skb_release_data
c01b7678 189      2.50597     e1000_xmit_frame
c01b8ab0 141      1.86953     e1000_alloc_rx_buffers
c02033a0 118      1.56457     alloc_skb
c0131384 73       0.967913    cache_alloc_refill
c020ce70 46       0.609918    qdisc_restart
c0212330 36       0.477327    ip_route_input_slow
c01317e0 33       0.43755     kfree
c0206880 28       0.371254    dev_queue_xmit
c0210ce8 26       0.344736    rt_garbage_collect
c020ef04 17       0.225404    pfifo_dequeue
c02109d4 16       0.212145    rt_may_expire
c01316e4 16       0.212145    kmem_cache_alloc
c02155e0 12       0.159109    ip_forward

Cpu type: P4 / Xeon
Cpu speed was (MHz estimation) : 1799.55
Counter 7 counted MACHINE_CLEAR events (cycles with entire machine pipeline cleared) with a unit mask of 0x01 (count a portion of cycles the machine is cleared for any cause) count 18000
vma      samples  %-age       symbol name
c010a738 326      55.4422     irq_entries_start
c010afd8 128      21.7687     apic_timer_interrupt
c023c038 45       7.65306     fn_hash_lookup
c013154c 9        1.53061     free_block
c010b208 9        1.53061     page_fault
c01b86dc 8        1.36054     e1000_clean_rx_irq
c0131384 8        1.36054     cache_alloc_refill
c0208860 7        1.19048     dst_alloc
c0213a9c 6        1.02041     dst_free
c0126970 6        1.02041     call_rcu
c0216d98 5        0.85034     ip_output
c0126998 5        0.85034     rcu_do_batch
c0211364 4        0.680272    __rt_hash_shrink
c020cc98 4        0.680272    eth_type_trans
c02036c4 4        0.680272    __kfree_skb
c02035cc 3        0.510204    skb_release_data
c02033a0 3        0.510204    alloc_skb
c01b7678 3        0.510204    e1000_xmit_frame
c01b8ab0 2        0.340136    e1000_alloc_rx_buffers
c01b8558 2        0.340136    e1000_clean_tx_irq
c020ce70 1        0.170068    qdisc_restart
c02f940c 0        0           ipsec_pfkey_init
c02f93cc 0        0           packet_init
c02f9354 0        0           af_unix_init
c02f9320 0        0           xfrm4_input_init
c02f9304 0        0           xfrm4_state_init


Cheers.
						--ro

^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-10  1:15                                                                   ` Jamal Hadi
  2003-06-10  2:45                                                                     ` Ralph Doncaster
  2003-06-10 15:53                                                                     ` Route cache performance under stress David S. Miller
@ 2003-06-11 17:52                                                                     ` Robert Olsson
  2 siblings, 0 replies; 227+ messages in thread
From: Robert Olsson @ 2003-06-11 17:52 UTC (permalink / raw)
  To: Jamal Hadi
  Cc: ralph+d, CIT/Paul, 'Simon Kirby',
	'David S. Miller',
	fw, netdev, linux-net


Jamal Hadi writes:

 > Robert has a good collection for what is good hardware. I am so outdated
 > I don't keep track anymore. My fastest machine is still an ASUS dual
 > 450MHz.

 Well giving HW recommendations is very risky... :-)

 Anyway what we currently use:
 ftp://robur.slu.se/pub/Linux/bifrost/hardware.txt

 Cheers.
						--ro

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Real World Routers 8-)
  2003-06-11 11:47                                                                           ` Was (Re: " Jamal Hadi
@ 2003-06-11 18:41                                                                             ` Florian Weimer
  0 siblings, 0 replies; 227+ messages in thread
From: Florian Weimer @ 2003-06-11 18:41 UTC (permalink / raw)
  To: Jamal Hadi
  Cc: ralph+d, CIT/Paul, 'Simon Kirby',
	'David S. Miller',
	netdev, linux-net

Jamal Hadi <hadi@shell.cyberus.ca> writes:

> Ok, this is interesting. I have never seen the flows per second
> used for simple L3 forwarding. I have seen them being used for NAT or
> firewalling.

Some vendors still sell flow-based routers, and you should be able to
get these numbers if the vendor doesn't try to scam you.

> Looking at the Sprint traffic patterns, I think flows/sec is a
> meaningful metric.

It's important to look at this number when buying a router, but I
still think that stateless IP forwarding is the way to go even if you
haven't got specialized hardware (TCAM).

>> Most vendors have learnt that people want routers with comforting
>> worst-case behavior.  However, you have to read carefully, e.g. a
>> Catalyst 6500 with Supervisor Engine 1 (instead of 2) can only create
>> 650,000 flows per second, even if it has a much, much higher peak IP
>> forwarding rate.
>>
>
> So 2Mpps of 650Kflows/sec ?

Exactly.  (You can use a different Supervisor Engine and get stateless
IP switching at 2 Mpps, at least according to the data sheets.)

> We should be able to punish specific misbehaving flows.

This is quite difficult because misbehaving flows often consist of a
single packet.  Managing state for such flows is a waste, but you
can hardly know this when you have to decide whether you want to create
a new flow or not.

If you want to punish per-interface flows, forget it.  Most routers
are not sufficiently multi-homed to make a difference, and attacks
often hit routers on multiple interfaces.

> Do you know if any routers are implementing proper DOS tracebacks to
> allow for inserting drop filters?

You mean IP Pushback?  I haven't seen it on production routers, and
I'm pretty sure that no one uses it yet.

Flow-based traffic monitoring is available on most routers nowadays
(often sampled, though), even on routers that perform stateless IP
forwarding.

Anyway, just dropping packets locally doesn't help you *that* much,
you need cooperation of your upstream (and automated cooperation à la
IP Pushback is still far, far away, I presume).

^ permalink raw reply	[flat|nested] 227+ messages in thread

* RE: Route cache performance under stress
  2003-06-11 19:48                                                                               ` Florian Weimer
@ 2003-06-11 19:40                                                                                 ` CIT/Paul
  2003-06-11 21:09                                                                                 ` Florian Weimer
  1 sibling, 0 replies; 227+ messages in thread
From: CIT/Paul @ 2003-06-11 19:40 UTC (permalink / raw)
  To: 'Florian Weimer', ralph+d
  Cc: 'Jamal Hadi', 'Pekka Savola',
	'Simon Kirby', 'David S. Miller',
	netdev, linux-net

Wait until you see a DoS attack at 2 million pps with random source IPs
and ports and dst ports and TCP flags and the only consistent thing
about the entire attack is the destination IP :>  can we say.. Null
route quick!!

Paul xerox@foonet.net http://www.httpd.net


-----Original Message-----
From: Florian Weimer [mailto:fw@deneb.enyo.de] 
Sent: Wednesday, June 11, 2003 3:48 PM
To: ralph+d@istop.com
Cc: Jamal Hadi; Pekka Savola; CIT/Paul; 'Simon Kirby'; 'David S.
Miller'; netdev@oss.sgi.com; linux-net@vger.kernel.org
Subject: Re: Route cache performance under stress


Ralph Doncaster <ralph@istop.com> writes:

>> Assuming the attacker has a 100mbps link to you, yes ;->
>
> A script kiddie 0wning a box with a FE connection is nothing.  During 
> what was probably the worst DOS I got hit with, one of my upstream 
> providers said they were seeing about 600mbps of traffic related to 
> the attack.

Yes, these numbers keep growing.  By today's standards, 6000 Mbps
shouldn't be too surprising. 8-(

One of the servers I keep running was recently flooded with 1500-byte
UDP packets, Fast Ethernet line rate.  It definitely happens if your
pipes are fat enough.


^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 15:29                                                                             ` Ralph Doncaster
@ 2003-06-11 19:48                                                                               ` Florian Weimer
  2003-06-11 19:40                                                                                 ` CIT/Paul
  2003-06-11 21:09                                                                                 ` Florian Weimer
  0 siblings, 2 replies; 227+ messages in thread
From: Florian Weimer @ 2003-06-11 19:48 UTC (permalink / raw)
  To: ralph+d
  Cc: Jamal Hadi, Pekka Savola, CIT/Paul, 'Simon Kirby',
	'David S. Miller',
	netdev, linux-net

Ralph Doncaster <ralph@istop.com> writes:

>> Assuming the attacker has a 100mbps link to you, yes ;->
>
> A script kiddie 0wning a box with a FE connection is nothing.  During what
> was probably the worst DOS I got hit with, one of my upstream providers
> said they were seeing about 600mbps of traffic related to the attack.

Yes, these numbers keep growing.  By today's standards, 6000 Mbps
shouldn't be too surprising. 8-(

One of the servers I keep running was recently flooded with 1500-byte
UDP packets, Fast Ethernet line rate.  It definitely happens if your
pipes are fat enough.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-11 19:48                                                                               ` Florian Weimer
  2003-06-11 19:40                                                                                 ` CIT/Paul
@ 2003-06-11 21:09                                                                                 ` Florian Weimer
  1 sibling, 0 replies; 227+ messages in thread
From: Florian Weimer @ 2003-06-11 21:09 UTC (permalink / raw)
  To: netdev, linux-net

Florian Weimer <fw@deneb.enyo.de> writes:

> Yes, these numbers keep growing.  By today's standards, 6000 Mbps

Oops, that's one "0" too many.  6 Gbps is definitely still surprising.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: gettime: Was (Re: Route cache performance under stress
  2003-06-11 11:54                                                                                               ` gettime: Was (Re: " Jamal Hadi
  2003-06-11 12:08                                                                                                 ` Andi Kleen
  2003-06-11 15:57                                                                                                 ` Ben Greear
@ 2003-06-12  3:29                                                                                                 ` David S. Miller
  2 siblings, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-12  3:29 UTC (permalink / raw)
  To: hadi; +Cc: greearb, netdev

   From: Jamal Hadi <hadi@shell.cyberus.ca>
   Date: Wed, 11 Jun 2003 07:54:53 -0400 (EDT)

   Sounds like a good idea.
   
   if (skbneedstimestamp)
   	do_gettimeofday(&skb->stamp);
   else
   	defertimestamp()
   
Damn, read the thread Jamal :(  This is not possible at all.

We do not know the value of 'skbneedstimestamp' until much later,
but we MUST make the timestamp now in order for it to be accurate.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: gettime: Was (Re: Route cache performance under stress
  2003-06-11 12:08                                                                                                 ` Andi Kleen
@ 2003-06-12  3:30                                                                                                   ` David S. Miller
  2003-06-12  6:32                                                                                                     ` Ben Greear
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-12  3:30 UTC (permalink / raw)
  To: ak; +Cc: hadi, greearb, netdev

   From: Andi Kleen <ak@suse.de>
   Date: Wed, 11 Jun 2003 14:08:03 +0200

   Another way is to just store jiffies (= 10 or 1ms accuracy) 
   
   This should be nearly zero cost and accurate enough at least for TCP.
   
TCP doesn't use it Andi.  SO_RECVSTAMP etc. uses it and that
MUST be accurate.

People, start approaching this from an actually implementable
angle, not ones that have no basis in reality :)
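
For reference, the kind of consumer that needs this precision is user
space reading the stamp back, e.g. via the SIOCGSTAMP ioctl.  A rough
user-space sketch (error handling omitted, not kernel code):

#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/time.h>
#include <linux/sockios.h>      /* SIOCGSTAMP */

/* Print the kernel's receive timestamp of the last packet on 'sock'. */
static void print_rx_stamp(int sock)
{
    struct timeval tv;

    if (ioctl(sock, SIOCGSTAMP, &tv) == 0)
        printf("last packet received at %ld.%06ld\n",
               (long)tv.tv_sec, (long)tv.tv_usec);
}

With only jiffies resolution, tv_usec above would move in 1-10 ms steps.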

   

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10  3:05                                       ` Steven Blake
@ 2003-06-12  6:31                                         ` David S. Miller
  0 siblings, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-12  6:31 UTC (permalink / raw)
  To: slblake; +Cc: fw, netdev, linux-net

   From: Steven Blake <slblake@petri-meat.com>
   Date: 09 Jun 2003 23:05:47 -0400

   http://www.petri-meat.com/slblake/networking/refs/lpm_pkt-class/
   
Interesting link, thanks for mentioning it.

   IMHO, the best LPM algorithm (in terms of balancing lookup speed vs.
   memory consumption vs. update rate) is CRT, described in the first paper
   [ASIK].  It is patented, but there is hope that it might get released
   under GPL in the near future.

It would be nice if this actually were a "paper", but it's
a patent entry, and such things are always so cryptic.  Is there
a real paper on this scheme?

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: gettime: Was (Re: Route cache performance under stress
  2003-06-12  3:30                                                                                                   ` David S. Miller
@ 2003-06-12  6:32                                                                                                     ` Ben Greear
  2003-06-12  8:46                                                                                                       ` David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: Ben Greear @ 2003-06-12  6:32 UTC (permalink / raw)
  To: David S. Miller; +Cc: ak, hadi, netdev

David S. Miller wrote:
>    From: Andi Kleen <ak@suse.de>
>    Date: Wed, 11 Jun 2003 14:08:03 +0200
> 
>    Another way is to just store jiffies (= 10 or 1ms accuracy) 
>    
>    This should be nearly zero cost and accurate enough at least for TCP.
>    
> TCP doesn't use it Andi.  SO_RECVSTAMP etc. uses it and that
> MUST be accurate.
> 
> People, start approaching this from an actually implementable
> angle, not ones that have no basis in reality :)

I think we need a generic method to get something like the
TSC, i.e. very fast, very precise.

Then, we need a way to turn this into the time-of-day.

After that, we can calculate time-of-day in a lazy manner.

Something like:

/* In driver or as early as possible */
skb->rx_stamp = getCurTSC();
skb->flags |= (RX_STAMP_IS_NOT_YET_CONVERTED);
....

/* somebody wants to know what time of day rx-stamp was */
if (skb->flags & (RX_STAMP_IS_NOT_YET_CONVERTED)) {
   skb->rx_stamp = do_gettimeofday() - ((getCurTSC() - skb->rx_stamp) *
                                        (magic-conversion-to-timeval-units));
   skb->flags &= ~(RX_STAMP_IS_NOT_YET_CONVERTED);
}
/* rx_stamp is now relative to time-of-day */


But Dave mentioned the TSC is not always good to use, and it won't work at all
on older CPUs, so the getCurTSC() thing probably needs to be a macro...
Seems like this macro would be useful in lots of places... pktgen, for
instance :)
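
In plain C the lazy part might look roughly like this (untested sketch;
the struct, the flag and the cycles-per-usec calibration value are all
stand-ins, not existing kernel interfaces):

#include <stdint.h>
#include <sys/time.h>

#define RX_STAMP_RAW 0x1        /* stamp still in raw TSC units */

struct pkt {
    uint64_t       rx_tsc;      /* cycle count taken in the driver */
    unsigned int   flags;
    struct timeval stamp;       /* filled in lazily */
};

/* Hot path: just read the cycle counter, never call gettimeofday(). */
static inline void pkt_stamp_rx(struct pkt *p, uint64_t now_tsc)
{
    p->rx_tsc = now_tsc;
    p->flags |= RX_STAMP_RAW;
}

/* Cold path: convert only when somebody actually asks for the time. */
static void pkt_resolve_stamp(struct pkt *p, uint64_t now_tsc,
                              uint64_t cycles_per_usec)
{
    if (p->flags & RX_STAMP_RAW) {
        struct timeval now;
        long age_us = (long)((now_tsc - p->rx_tsc) / cycles_per_usec);

        gettimeofday(&now, NULL);
        p->stamp.tv_sec  = now.tv_sec  - age_us / 1000000;
        p->stamp.tv_usec = now.tv_usec - age_us % 1000000;
        if (p->stamp.tv_usec < 0) {
            p->stamp.tv_sec--;
            p->stamp.tv_usec += 1000000;
        }
        p->flags &= ~RX_STAMP_RAW;
    }
}

The receive path then never touches gettimeofday() at all; only the
consumers that actually read the stamp pay for the conversion.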

Ben



-- 
Ben Greear <greearb@candelatech.com>       <Ben_Greear AT excite.com>
President of Candela Technologies Inc      http://www.candelatech.com
ScryMUD:  http://scry.wanfear.com     http://scry.wanfear.com/~greear

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-10 18:34                                                                         ` Robert Olsson
  2003-06-10 18:57                                                                           ` David S. Miller
@ 2003-06-12  6:45                                                                           ` David S. Miller
  2003-06-12 13:56                                                                             ` Robert Olsson
  1 sibling, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-12  6:45 UTC (permalink / raw)
  To: Robert.Olsson; +Cc: ralph+d, ralph, hadi, xerox, sim, fw, netdev, linux-net

   From: Robert Olsson <Robert.Olsson@data.slu.se>
   Date: Tue, 10 Jun 2003 20:34:50 +0200

   I ripped out the route hash just to test the slow path.

I want to point out an error in such simulations.

It doesn't eliminate one of the most expensive parts of the routing
cache, the 'dst' management.  All of that still happens even after
your patch.

A better simulation of a "pure slowpath" would be to move the DST
entry into the fib entries themselves.

That is a lot more work, but it would validate the various ideas and
claims being made.  For example, it would say for sure whether
eliminating the routing cache is a win or not for DoS traffic.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: gettime: Was (Re: Route cache performance under stress
  2003-06-12  6:32                                                                                                     ` Ben Greear
@ 2003-06-12  8:46                                                                                                       ` David S. Miller
  0 siblings, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-12  8:46 UTC (permalink / raw)
  To: greearb; +Cc: ak, hadi, netdev

   From: Ben Greear <greearb@candelatech.com>
   Date: Wed, 11 Jun 2003 23:32:41 -0700

   skb->rx_stamp = getCurTSC();

Thanks for mentioning this idea for the 10th time in
the past 2 days :-)

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-12  6:45                                                                           ` David S. Miller
@ 2003-06-12 13:56                                                                             ` Robert Olsson
  2003-06-12 21:35                                                                               ` David S. Miller
  0 siblings, 1 reply; 227+ messages in thread
From: Robert Olsson @ 2003-06-12 13:56 UTC (permalink / raw)
  To: David S. Miller
  Cc: Robert.Olsson, ralph+d, ralph, hadi, xerox, sim, fw, netdev, linux-net


David S. Miller writes:

 > I want to point out an error in such simulations.
 > 
 > It doesn't eliminate one of the most expensive parts of the routing
 > cache, the 'dst' management.  All of that still happens even after
 > your patch.
 > 
 > A better simulation of a "pure slowpath" would be to move the DST
 > entry into the fib entries themselves.
 > 
 > That is a lot more work, but it would validate the various ideas and
 > claims being made.  For example, it would say for sure whether
 > eliminating the routing cache is a win or not for DoS traffic.

 Well, it's true. But do we need to? From the profile we actually see the
 'dst' management cost. Not much, I would say, and we will still need some
 administration, i.e. refcounting, even if we remove the route hash.
 
vma      samples  %-age       symbol name
c023c038 107340   33.143      fn_hash_lookup
c013154c 17399    5.37223     free_block
c0211364 16502    5.09527     __rt_hash_shrink
c01316e4 12854    3.96889     kmem_cache_alloc
c01b86dc 11719    3.61844     e1000_clean_rx_irq
c02033a0 11557    3.56842     alloc_skb
c0212330 11378    3.51315     ip_route_input_slow
c020cc98 9765     3.01511     eth_type_trans
c0208860 7986     2.46581     dst_alloc
c0216d98 7733     2.38769     ip_output
c021200c 6940     2.14284     rt_set_nexthop
c0213a9c 6331     1.9548      dst_free
c0126998 6272     1.93659     rcu_do_batch
c02035cc 6164     1.90324     skb_release_data
c02036c4 6068     1.8736      __kfree_skb
c01b8558 5532     1.7081      e1000_clean_tx_irq
c01b7678 4970     1.53457     e1000_xmit_frame

 From what I understand now, removing the route hash is not a good idea. It
 seems we can control the hash pretty well, even under very extreme
 conditions.

 I think the people who are suggesting this think that we can achieve the
 same performance without it. I don't think we can. So the question is:
 should we tune routing to do 120 kpps regardless of input, or have a
 performance span of 112-420 kpps (numbers from my tests), where most of
 the time we are close to the higher limit?

 Anyway, fib_lookup seems to be something to look into regardless of this
 question.

 Cheers.

						--ro
 

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-12 13:56                                                                             ` Robert Olsson
@ 2003-06-12 21:35                                                                               ` David S. Miller
  2003-06-13 10:50                                                                                 ` Robert Olsson
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-12 21:35 UTC (permalink / raw)
  To: Robert.Olsson; +Cc: ralph+d, ralph, hadi, xerox, sim, fw, netdev, linux-net

   From: Robert Olsson <Robert.Olsson@data.slu.se>
   Date: Thu, 12 Jun 2003 15:56:47 +0200

   David S. Miller writes:
   
    > That is a lot more work, but it would validate the various ideas and
    > claims being made.  For example, it would say for sure whether
    > eliminating the routing cache is a win or not for DoS traffic.
   
    Well, it's true. But do we need to? From the profile we actually see the
    'dst' management cost. Not much, I would say, and we will still need some
    administration, i.e. refcounting, even if we remove the route hash.

But Robert, do you know "why" the dst management doesn't show up in
your profiles when you rip out the rtcache?

It's because the total number of DST entries is so small that they all
fit in the cpu cache.  When the rtcache is enabled and we thus have up
to "max_size" DST entries in flight at all times, the dst management
routines show up very clearly because they have a high probability of
missing the cpu cache.

In particular, have a good look at Simon's profiles.  dst_alloc() is
quite near the top there.

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-11 17:40                                                                             ` Robert Olsson
@ 2003-06-13  5:38                                                                               ` David S. Miller
  2003-06-13 10:22                                                                                 ` Robert Olsson
  2003-06-13 17:15                                                                                 ` Robert Olsson
  0 siblings, 2 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-13  5:38 UTC (permalink / raw)
  To: Robert.Olsson
  Cc: ralph+d, ralph, hadi, xerox, sim, fw, netdev, linux-net, kuznet

   From: Robert Olsson <Robert.Olsson@data.slu.se>
   Date: Wed, 11 Jun 2003 19:40:47 +0200

   vma      samples  %-age       symbol name
   c023c038 107340   33.143      fn_hash_lookup

Ok, let's optimize our datastructures for how we actually
use them :-)  Also, fn_zone shrunk by 8 bytes.

Try this:

--- ./include/net/ip_fib.h.~1~	Thu Jun 12 22:18:33 2003
+++ ./include/net/ip_fib.h	Thu Jun 12 22:19:57 2003
@@ -89,13 +89,12 @@ struct fib_info
 struct fib_rule;
 #endif
 
-struct fib_result
-{
-	unsigned char	prefixlen;
-	unsigned char	nh_sel;
+struct fib_result {
+	struct fib_info *fi;
 	unsigned char	type;
 	unsigned char	scope;
-	struct fib_info *fi;
+	unsigned char	prefixlen;
+	unsigned char	nh_sel;
 #ifdef CONFIG_IP_MULTIPLE_TABLES
 	struct fib_rule	*r;
 #endif
--- ./net/ipv4/fib_hash.c.~1~	Thu Jun 12 21:47:11 2003
+++ ./net/ipv4/fib_hash.c	Thu Jun 12 22:08:27 2003
@@ -65,16 +65,15 @@ typedef struct {
 	u32	datum;
 } fn_hash_idx_t;
 
-struct fib_node
-{
-	struct fib_node		*fn_next;
-	struct fib_info		*fn_info;
-#define FIB_INFO(f)	((f)->fn_info)
+struct fib_node {
 	fn_key_t		fn_key;
 	u8			fn_tos;
-	u8			fn_type;
-	u8			fn_scope;
 	u8			fn_state;
+	u8			fn_scope;
+	u8			fn_type;
+	struct fib_node		*fn_next;
+	struct fib_info		*fn_info;
+#define FIB_INFO(f)	((f)->fn_info)
 };
 
 #define FN_S_ZOMBIE	1
@@ -82,29 +81,19 @@ struct fib_node
 
 static int fib_hash_zombies;
 
-struct fn_zone
-{
-	struct fn_zone	*fz_next;	/* Next not empty zone	*/
-	struct fib_node	**fz_hash;	/* Hash table pointer	*/
-	int		fz_nent;	/* Number of entries	*/
-
+struct fn_zone {
 	int		fz_divisor;	/* Hash divisor		*/
-	u32		fz_hashmask;	/* (fz_divisor - 1)	*/
-#define FZ_HASHMASK(fz)	((fz)->fz_hashmask)
-
+#define FZ_HASHMASK(fz)	((fz)->fz_divisor - 1)
 	int		fz_order;	/* Zone order		*/
-	u32		fz_mask;
-#define FZ_MASK(fz)	((fz)->fz_mask)
+#define FZ_MASK(fz)	(inet_make_mask((fz)->fz_order))
+	struct fib_node	**fz_hash;	/* Hash table pointer	*/
+	struct fn_zone	*fz_next;	/* Next not empty zone	*/
+	int		fz_nent;	/* Number of entries	*/
 };
 
-/* NOTE. On fast computers evaluation of fz_hashmask and fz_mask
-   can be cheaper than memory lookup, so that FZ_* macros are used.
- */
-
-struct fn_hash
-{
-	struct fn_zone	*fn_zones[33];
+struct fn_hash {
 	struct fn_zone	*fn_zone_list;
+	struct fn_zone	*fn_zones[33];
 };
 
 static __inline__ fn_hash_idx_t fn_hash(fn_key_t key, struct fn_zone *fz)
@@ -197,7 +186,6 @@ static void fn_rehash_zone(struct fn_zon
 {
 	struct fib_node **ht, **old_ht;
 	int old_divisor, new_divisor;
-	u32 new_hashmask;
 		
 	old_divisor = fz->fz_divisor;
 
@@ -217,8 +205,6 @@ static void fn_rehash_zone(struct fn_zon
 		break;
 	}
 
-	new_hashmask = (new_divisor - 1);
-
 #if RT_CACHE_DEBUG >= 2
 	printk("fn_rehash_zone: hash for zone %d grows from %d\n", fz->fz_order, old_divisor);
 #endif
@@ -231,7 +217,6 @@ static void fn_rehash_zone(struct fn_zon
 		write_lock_bh(&fib_hash_lock);
 		old_ht = fz->fz_hash;
 		fz->fz_hash = ht;
-		fz->fz_hashmask = new_hashmask;
 		fz->fz_divisor = new_divisor;
 		fn_rebuild_zone(fz, old_ht, old_divisor);
 		write_unlock_bh(&fib_hash_lock);
@@ -261,7 +246,6 @@ fn_new_zone(struct fn_hash *table, int z
 	} else {
 		fz->fz_divisor = 1;
 	}
-	fz->fz_hashmask = (fz->fz_divisor - 1);
 	fz->fz_hash = fz_hash_alloc(fz->fz_divisor);
 	if (!fz->fz_hash) {
 		kfree(fz);
@@ -269,7 +253,6 @@ fn_new_zone(struct fn_hash *table, int z
 	}
 	memset(fz->fz_hash, 0, fz->fz_divisor*sizeof(struct fib_node*));
 	fz->fz_order = z;
-	fz->fz_mask = inet_make_mask(z);
 
 	/* Find the first not empty zone with more specific mask */
 	for (i=z+1; i<=32; i++)
@@ -312,10 +295,15 @@ fn_hash_lookup(struct fib_table *tb, con
 			if (f->fn_tos && f->fn_tos != flp->fl4_tos)
 				continue;
 #endif
-			f->fn_state |= FN_S_ACCESSED;
+			{
+				u8 state = f->fn_state;
 
-			if (f->fn_state&FN_S_ZOMBIE)
-				continue;
+				if (!(state & FN_S_ACCESSED))
+					f->fn_state = state | FN_S_ACCESSED;
+
+				if (state & FN_S_ZOMBIE)
+					continue;
+			}
 			if (f->fn_scope < flp->fl4_scope)
 				continue;
 

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09 22:54                                                             ` Robert Olsson
@ 2003-06-13  6:21                                                               ` David S. Miller
  2003-06-13 10:40                                                                 ` Robert Olsson
  0 siblings, 1 reply; 227+ messages in thread
From: David S. Miller @ 2003-06-13  6:21 UTC (permalink / raw)
  To: Robert.Olsson; +Cc: sim, xerox, hadi, fw, netdev, linux-net

   From: Robert Olsson <Robert.Olsson@data.slu.se>
   Date: Tue, 10 Jun 2003 00:54:32 +0200
    
    I'm about to propose some stats even for hash spinning.... 

Do you mind if I apply this?  It looks fine.

Next, we should put similar metrics into fib_hash.c

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-13  5:38                                                                               ` David S. Miller
@ 2003-06-13 10:22                                                                                 ` Robert Olsson
  2003-06-13 17:15                                                                                 ` Robert Olsson
  1 sibling, 0 replies; 227+ messages in thread
From: Robert Olsson @ 2003-06-13 10:22 UTC (permalink / raw)
  To: David S. Miller
  Cc: Robert.Olsson, ralph+d, ralph, hadi, xerox, sim, fw, netdev,
	linux-net, kuznet


David S. Miller writes:
 >    From: Robert Olsson <Robert.Olsson@data.slu.se>
 >    Date: Wed, 11 Jun 2003 19:40:47 +0200
 > 
 >    vma      samples  %-age       symbol name
 >    c023c038 107340   33.143      fn_hash_lookup
 > 
 > Ok, let's optimize our datastructures for how we actually
 > use them :-)  Also, fn_zone shrunk by 8 bytes.
 > 
 > Try this:

 I'll pass the university lab on my way home later today and hope to give
 it a try.

 Cheers.
					--ro

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-13  6:21                                                               ` David S. Miller
@ 2003-06-13 10:40                                                                 ` Robert Olsson
  2003-06-15  6:36                                                                   ` David S. Miller
  2003-06-17 17:03                                                                   ` Robert Olsson
  0 siblings, 2 replies; 227+ messages in thread
From: Robert Olsson @ 2003-06-13 10:40 UTC (permalink / raw)
  To: David S. Miller; +Cc: Robert.Olsson, sim, xerox, hadi, fw, netdev, linux-net


David S. Miller writes:
 >    From: Robert Olsson <Robert.Olsson@data.slu.se>
 >    Date: Tue, 10 Jun 2003 00:54:32 +0200
 >     
 >     I'm about to propose some stats even for hash spinning.... 
 > 
 > Do you mind if I apply this?  It looks fine.

 No, please do. There is an updated rtstat already.

 > Next, we should put similar metrics into fib_hash.c

 Yes.

 Also "candidate" selection in __rt_hash_shrink can be done in 
 rt_intern_hash. We avoid an the extra spinning over the hash chain. 
 Eventually we can save here and have the candidate always ready.
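
 Roughly the idea, as a generic sketch (not the actual route.c code):
 remember the best eviction candidate while the insert path is already
 walking the chain, so the shrinker never needs a second pass.

#include <stddef.h>

struct entry {
    struct entry  *next;
    unsigned long  last_use;    /* older value == better victim */
    /* key/value fields would go here */
};

/*
 * Insert 'new' at the head of the chain.  Since the insert path walks
 * the chain anyway, pick the oldest entry as the eviction candidate in
 * the same pass and unlink it if the chain is already at max_len.
 * Returns the evicted entry (if any) so the caller can free it.
 */
static struct entry *chain_insert(struct entry **head, struct entry *new,
                                  int max_len)
{
    struct entry **p, **victim = NULL;
    struct entry *evicted = NULL;
    int len = 0;

    for (p = head; *p; p = &(*p)->next) {
        if (victim == NULL || (*p)->last_use < (*victim)->last_use)
            victim = p;
        len++;
    }
    if (len >= max_len && victim != NULL) {
        evicted = *victim;
        *victim = evicted->next;    /* unlink the candidate */
    }
    new->next = *head;
    *head = new;
    return evicted;
}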

 Cheers.

					--ro

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-12 21:35                                                                               ` David S. Miller
@ 2003-06-13 10:50                                                                                 ` Robert Olsson
  0 siblings, 0 replies; 227+ messages in thread
From: Robert Olsson @ 2003-06-13 10:50 UTC (permalink / raw)
  To: David S. Miller
  Cc: Robert.Olsson, ralph+d, ralph, hadi, xerox, sim, fw, netdev, linux-net


David S. Miller writes:

 > But Robert, do you know "why" the dst management doesn't show up in
 > your profiles when you rip-out the rtcache?
 > 
 > It's because to total number of DST entries is so small that they all
 > fit in the cpu cache.  When the rtcache is enabled and we thus have up
 > to "max_size" DST entries in flight at all times, the dst management
 > routines show up very clearly because they have a high probability of
 > missing the cpu cache.
 > 
 > In particular, have a good look at Simon's profiles.  dst_alloc() is
 > quite near the top there.

 Yes, and that was the intention: to get pretty close to pure slowpath.
 As a result, I/we now appreciate the hash better...

 Cheers.

						--ro

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-13  5:38                                                                               ` David S. Miller
  2003-06-13 10:22                                                                                 ` Robert Olsson
@ 2003-06-13 17:15                                                                                 ` Robert Olsson
  1 sibling, 0 replies; 227+ messages in thread
From: Robert Olsson @ 2003-06-13 17:15 UTC (permalink / raw)
  To: David S. Miller
  Cc: Robert.Olsson, ralph+d, ralph, hadi, xerox, sim, fw, netdev,
	linux-net, kuznet


David S. Miller writes:
 >    From: Robert Olsson <Robert.Olsson@data.slu.se>
 >    Date: Wed, 11 Jun 2003 19:40:47 +0200
 > 
 >    vma      samples  %-age       symbol name
 >    c023c038 107340   33.143      fn_hash_lookup
 > 
 > Ok, let's optimize our datastructures for how we actually
 > use them :-)  Also, fn_zone shrunk by 8 bytes.

A few percent less on our XEON/UP here.

Input rate 2*190 kpps clone_skb=1. dst hash code in.

Without patch

Iface   MTU Met  RX-OK RX-ERR RX-DRP RX-OVR  TX-OK TX-ERR TX-DRP TX-OVR Flags
eth0   1500   0 2988282 9680382 9680382 7011722     11      0      0      0 BRU
eth1   1500   0     15      0      0      0 2988291      0      0      0 BRU
eth2   1500   0 2988336 9681671 9681671 7011667      3      0      0      0 BRU
eth3   1500   0      2      0      0      0 2988337      0      0      0 BRU

00002100  000005e8 005b2c4a 00000000 00000001 00000000 00000000 00000000  00000000 00000002 00000000 005b1c49 005b1c3c 00000007 00000000 00bd877c 00000002 

With patch

Iface   MTU Met  RX-OK RX-ERR RX-DRP RX-OVR  TX-OK TX-ERR TX-DRP TX-OVR Flags
eth0   1500   0 2877562 9698244 9698244 7122444     11      0      0      0 BRU
eth1   1500   0     13      0      0      0 2819350      0      0      0 BRU
eth2   1500   0 2877512 9693477 9693477 7122492      4      0      0      0 BRU
eth3   1500   0      1      0      0      0 2877511      0      0      0 BRU

00001ec7  0000058e 0057cb3c 00000000 00000000 00000000 00000000 00000000  00000000 00000000 00000000 0057bb37 0057bb2c 00000008 00000000 00b0c4a6 00000000 


Time for a beer now... Still, a lot of progress this week. In practice we
never see "dst cache overflow" anymore.

Cheers. 

						--ro

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-13 10:40                                                                 ` Robert Olsson
@ 2003-06-15  6:36                                                                   ` David S. Miller
  2003-06-17 17:03                                                                   ` Robert Olsson
  1 sibling, 0 replies; 227+ messages in thread
From: David S. Miller @ 2003-06-15  6:36 UTC (permalink / raw)
  To: Robert.Olsson; +Cc: sim, xerox, hadi, fw, netdev, linux-net

   From: Robert Olsson <Robert.Olsson@data.slu.se>
   Date: Fri, 13 Jun 2003 12:40:23 +0200

   
   David S. Miller writes:
    >    From: Robert Olsson <Robert.Olsson@data.slu.se>
    >    Date: Tue, 10 Jun 2003 00:54:32 +0200
    >     
    >     I'm about to propose some stats even for hash spinning.... 
    > 
    > Do you mind if I apply this?  It looks fine.
   
    No please do. There is an updated rtstat already.
   
I am, but I have to do this by hand.  It seems your email client has
caught the disease that turns all tabs into spaces :(

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-13 10:40                                                                 ` Robert Olsson
  2003-06-15  6:36                                                                   ` David S. Miller
@ 2003-06-17 17:03                                                                   ` Robert Olsson
  1 sibling, 0 replies; 227+ messages in thread
From: Robert Olsson @ 2003-06-17 17:03 UTC (permalink / raw)
  To: David S. Miller; +Cc: Robert Olsson, sim, xerox, hadi, fw, netdev, linux-net


 > David S. Miller writes:
 > Next, we should put similar metrics into fib_hash.c

 A starting point...

 Kernel hack enclosed; the companion app is available from:
 ftp://robur.slu.se/pub/Linux/net-development/fibstat
 
 Just some hash metrics so far. The output below is from our DoS tests:

lookup_total   == hash lookups/sec
zone_search    == zone searches/sec
chain_search   == chain searches/sec

lookup_total  zone_search chain_search
           0            0            0
           0            0            0
           0            0            0
      475084      4513198      2454249
      861704      8186188      4450394
      867935      8245366      4480320
      863319      8201514      4458924
      864056      8208532      4463344
      863788      8205986      4461238
      861772      8186834      4449507


--- include/net/ip_fib.h.030617	2003-06-17 15:03:57.000000000 +0200
+++ include/net/ip_fib.h	2003-06-17 16:07:00.000000000 +0200
@@ -135,6 +135,21 @@
 	unsigned char	tb_data[0];
 };
 
+struct fib_stat 
+{
+        unsigned int lookup_total;
+        unsigned int zone_search;
+        unsigned int chain_search;
+};
+
+extern struct fib_stat *fib_stat;
+#define FIB_STAT_INC(field)                                          \
+                (per_cpu_ptr(fib_stat, smp_processor_id())->field++)
+
+
+extern int __init fib_stat_init(void);
+
+
 #ifndef CONFIG_IP_MULTIPLE_TABLES
 
 extern struct fib_table *ip_fib_local_table;
--- net/ipv4/fib_hash.c.030617	2003-06-15 23:02:21.000000000 +0200
+++ net/ipv4/fib_hash.c	2003-06-17 16:01:45.000000000 +0200
@@ -13,6 +13,11 @@
  *		modify it under the terms of the GNU General Public License
  *		as published by the Free Software Foundation; either version
  *		2 of the License, or (at your option) any later version.
+ *
+ *
+ * Fixes:
+ * Robert Olsson		:	Added statistics
+ *
  */
 
 #include <linux/config.h>
@@ -107,6 +112,10 @@
 	struct fn_zone	*fn_zone_list;
 };
 
+
+struct fib_stat *fib_stat;
+
+
 static __inline__ fn_hash_idx_t fn_hash(fn_key_t key, struct fn_zone *fz)
 {
 	u32 h = ntohl(key.datum)>>(32 - fz->fz_order);
@@ -307,12 +316,19 @@
 	struct fn_zone *fz;
 	struct fn_hash *t = (struct fn_hash*)tb->tb_data;
 
+	FIB_STAT_INC(lookup_total);
+
 	read_lock(&fib_hash_lock);
 	for (fz = t->fn_zone_list; fz; fz = fz->fz_next) {
 		struct fib_node *f;
 		fn_key_t k = fz_key(flp->fl4_dst, fz);
 
+		FIB_STAT_INC(zone_search);
+
 		for (f = fz_chain(k, fz); f; f = f->fn_next) {
+
+			FIB_STAT_INC(chain_search);
+
 			if (!fn_key_eq(k, f->fn_key)) {
 				if (fn_key_leq(k, f->fn_key))
 					break;
@@ -1108,6 +1124,54 @@
 	.release	= ip_seq_release,
 };
 
+static int fib_stat_get_info(char *buffer, char **start, off_t offset, int length)
+{
+	int i;
+	int len = 0;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		if (!cpu_possible(i))
+			continue;
+		len += sprintf(buffer+len, "%08x %08x %08x \n",
+			       per_cpu_ptr(fib_stat, i)->lookup_total,
+			       per_cpu_ptr(fib_stat, i)->zone_search,
+			       per_cpu_ptr(fib_stat, i)->chain_search
+
+			);
+	}
+	len -= offset;
+
+	if (len > length)
+		len = length;
+	if (len < 0)
+		len = 0;
+
+	*start = buffer + offset;
+  	return len;
+}
+
+int __init fib_stat_init(void)
+{
+	int i, rc = 0;
+
+	fib_stat = kmalloc_percpu(sizeof (struct fib_stat),
+				  GFP_KERNEL);
+	if (!fib_stat) {
+		rc = -ENOMEM;
+		goto out;
+	}
+
+	for (i = 0; i < NR_CPUS; i++) {
+		if (cpu_possible(i)) {
+			memset(per_cpu_ptr(fib_stat, i), 0,
+			       sizeof (struct fib_stat));
+		}
+	}
+	
+ out:
+	return rc;
+}
+
 int __init fib_proc_init(void)
 {
 	struct proc_dir_entry *p;
@@ -1116,13 +1180,27 @@
 	p = create_proc_entry("route", S_IRUGO, proc_net);
 	if (p)
 		p->proc_fops = &fib_seq_fops;
-	else
+	else {
+		rc = -ENOMEM;
+		goto out;
+	}
+
+
+
+        p = proc_net_create ("fib_stat", 0, fib_stat_get_info);
+
+	if (!p) {
 		rc = -ENOMEM;
+		remove_proc_entry("route", proc_net);
+	}
+
+ out:
 	return rc;
 }
 
 void __init fib_proc_exit(void)
 {
 	remove_proc_entry("route", proc_net);
+	remove_proc_entry("fib_stat", proc_net);
 }
 #endif /* CONFIG_PROC_FS */
--- net/ipv4/route.c.030617	2003-06-16 16:56:34.000000000 +0200
+++ net/ipv4/route.c	2003-06-17 16:02:41.000000000 +0200
@@ -2754,7 +2754,8 @@
 	rt_cache_stat = kmalloc_percpu(sizeof (struct rt_cache_stat),
 					GFP_KERNEL);
 	if (!rt_cache_stat) 
-		goto out_enomem1;
+		goto out_enomem0;
+
 	for (i = 0; i < NR_CPUS; i++) {
 		if (cpu_possible(i)) {
 			memset(per_cpu_ptr(rt_cache_stat, i), 0,
@@ -2765,6 +2766,9 @@
 	devinet_init();
 	ip_fib_init();
 
+	if(fib_stat_init()) 
+		goto out_enomem1;
+
 	init_timer(&rt_flush_timer);
 	rt_flush_timer.function = rt_run_flush;
 	init_timer(&rt_periodic_timer);
@@ -2785,7 +2789,7 @@
 
 #ifdef CONFIG_PROC_FS
 	if (rt_cache_proc_init())
-		goto out_enomem;
+		goto out_enomem2;
 	proc_net_create ("rt_cache_stat", 0, rt_cache_stat_get_info);
 #ifdef CONFIG_NET_CLS_ROUTE
 	create_proc_read_entry("net/rt_acct", 0, 0, ip_rt_acct_read, NULL);
@@ -2795,9 +2799,12 @@
 	xfrm4_init();
 out:
 	return rc;
-out_enomem:
-	kfree_percpu(rt_cache_stat);
+
+out_enomem2:
+	kfree_percpu(fib_stat);
 out_enomem1:
+	kfree_percpu(rt_cache_stat);
+out_enomem0:
 	rc = -ENOMEM;
 	goto out;
 }


Cheers.
					--ro

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
  2003-06-09 16:30                                               ` Simon Kirby
@ 2003-06-17 20:58                                                 ` Florian Weimer
  0 siblings, 0 replies; 227+ messages in thread
From: Florian Weimer @ 2003-06-17 20:58 UTC (permalink / raw)
  To: Simon Kirby; +Cc: ralph+d, netdev, linux-net

Simon Kirby <sim@netnation.com> writes:

> What Zebra quirks?

Zebra doesn't send BGP keepalives while updating the kernel's view of
the routing table.  If a configuration change results in massive routing
table updates (e.g. a changed LOCAL_PREF), it's quite likely that your
BGP peering sessions will terminate because of a timeout.

Other "quirks" are just things that don't work as they should (mostly
Cisco incompatibilities, sometimes genuine bugs in route-map support
etc.).

It's not dramatic in most cases, but like any complex technology, it
takes some time to get used to.

(Disclaimer: I'm not a Zebra user. 8-)

> And I wouldn't exactly call it difficult to "squeeze" performance out of
> a PC when the 7206 VXRs have a 200 MHz processor.

You missed the NPE-G1 part.

cisco 7204VXR (NPE-G1) processor (revision A) with 245760K/16384K bytes of memory.
SB-1 CPU at 700Mhz, Implementation 1, Rev 0.2, 512KB L2 Cache

Probably still slow by x86 standards, and with a rather small cache,
but it's sufficient for a few kpps, I guess...

^ permalink raw reply	[flat|nested] 227+ messages in thread

* Re: Route cache performance under stress
@ 2003-04-08  6:14 Scott A Crosby
  0 siblings, 0 replies; 227+ messages in thread
From: Scott A Crosby @ 2003-04-08  6:14 UTC (permalink / raw)
  To: linux-kernel

Please CC me on any replies:


The suggested code here is problematic.

   RND1 = random_generated_at_start_time() ;
   RND2 = random_generated_at_start_time() ;
   /* RND2 may be 0 or equal to RND1, all cases seem OK */
   x = (RND1 - saddr) ^ (RND1 - daddr) ^ (RND2 + saddr + daddr);
   reduce(x)

For instance, if the table is assumed to have size N, bucket
collisions can be generated by:

   saddr=daddr= k*N  for all k.
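
A tiny user-space demonstration of that collision family (the RND
values, the table size and the mask-based reduce() are assumptions
standing in for the real code):

#include <stdio.h>
#include <stdint.h>

#define N 4096u                           /* assumed table size, power of two */

static const uint32_t RND1 = 0x3ad1f2c5;  /* stand-in boot-time randoms */
static const uint32_t RND2 = 0x9e3779b9;

static uint32_t bucket(uint32_t saddr, uint32_t daddr)
{
    uint32_t x = (RND1 - saddr) ^ (RND1 - daddr) ^ (RND2 + saddr + daddr);
    return x & (N - 1);                   /* reduce(x) */
}

int main(void)
{
    uint32_t k;

    /* saddr == daddr == k*N: every pair lands in the same bucket */
    for (k = 1; k <= 8; k++)
        printf("k=%u  bucket=%u\n", k, bucket(k * N, k * N));
    return 0;
}

Every iteration prints the same bucket, regardless of what RND1 and
RND2 happen to be.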

Or, a different attack: if I assume that reduce(x) determines the
bucket by keeping only, say, the lowest 12 bits, then:

   saddr=0xXXXXXAAA
   daddr=0xYYYYYBBB

Where the X's and Y's can be anything, and AAA, BBB are arbitrarily chosen.

Now, let's look at the various terms:
 (RND1 - saddr)         = 0xUUUUUCCC
 (RND1 - daddr)         = 0xUUUUUDDD
 (RND2 + saddr + daddr) = 0xUUUUUEEE

The U's are all unknown, but CCC, DDD, and EEE (the only parts we
care about) are constant. Thus, the lowest 12 bits of x will be
constant. If those are the only bits that are used, then the attacker
has complete freedom to forge the highest 20 bits of saddr and daddr.

With that function, you'd probably be better off taking the bucket from
the high-order bits instead. At least there's a chance of a carry from
the U's propagating into the bits you keep.

I'm rusty with statistical analysis of cryptographic algorithms, but I
suspect demons may be lurking down that avenue too.


What might work better is to have a good universal hash function, h_k,
and then do:

   h_k(saddr) - h_k(daddr)

Perhaps the simplest is:

  h_k(x) = x * k (mod P)

where P is a prime, and 0 <= k < P is a random variable determined
at bootup.
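
For illustration only, a rough C sketch of that scheme (the prime, the
helper names and the final reduction to a bucket are assumptions here,
not what the kernel actually does):

#include <stdint.h>

#define HASH_PRIME 0x7fffffffULL    /* 2^31 - 1, a Mersenne prime */

static uint32_t hash_k;             /* drawn once at boot from a random source */

static uint32_t h_k(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * hash_k) % HASH_PRIME);
}

/* Bucket for a (saddr, daddr) pair, following h_k(saddr) - h_k(daddr). */
static uint32_t rt_bucket(uint32_t saddr, uint32_t daddr, uint32_t mask)
{
    return (h_k(saddr) - h_k(daddr)) & mask;
}

The multiply and the reduction modulo a prime mix all of the input bits
into the bucket, so fixing a few address bits no longer pins the bucket.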

Scott



^ permalink raw reply	[flat|nested] 227+ messages in thread

end of thread

Thread overview: 227+ messages
2003-04-05 16:37 Route cache performance under stress Florian Weimer
2003-04-05 18:17 ` Martin Josefsson
2003-04-05 18:34 ` Willy Tarreau
2003-05-16 22:24 ` Simon Kirby
2003-05-16 23:16   ` Florian Weimer
2003-05-19 19:10     ` Simon Kirby
2003-05-17  2:35   ` David S. Miller
2003-05-17  7:31     ` Florian Weimer
2003-05-17 22:09       ` David S. Miller
2003-05-18  9:21         ` Florian Weimer
2003-05-18  9:31           ` David S. Miller
2003-05-19 17:36             ` Jamal Hadi
2003-05-19 19:18               ` Ralph Doncaster
2003-05-19 22:37                 ` Jamal Hadi
2003-05-20  1:10                   ` Simon Kirby
2003-05-20  1:14                     ` David S. Miller
2003-05-20  1:23                       ` Jamal Hadi
2003-05-20  1:24                         ` David S. Miller
2003-05-20  2:13                           ` Jamal Hadi
2003-05-20  5:01                             ` Pekka Savola
2003-05-20 11:47                               ` Jamal Hadi
2003-05-20 11:55                                 ` Pekka Savola
2003-05-20  6:46                             ` David S. Miller
2003-05-20 12:04                               ` Jamal Hadi
2003-05-21  0:36                                 ` David S. Miller
2003-05-21 13:03                                   ` Jamal Hadi
2003-05-23  5:42                                     ` David S. Miller
2003-05-22  8:40                                   ` Simon Kirby
2003-05-22  8:58                                     ` David S. Miller
2003-05-22 10:40                                       ` David S. Miller
2003-05-22 11:15                                         ` Martin Josefsson
2003-05-23  1:00                                           ` David S. Miller
2003-05-23  1:01                                           ` David S. Miller
2003-05-23  8:21                                             ` Andi Kleen
2003-05-23  8:22                                               ` David S. Miller
2003-05-23  9:03                                                 ` Andi Kleen
2003-05-23  9:59                                                   ` David S. Miller
2003-05-24  0:41                                           ` Andrew Morton
2003-05-26  2:29                                             ` David S. Miller
2003-05-22 11:44                                         ` Simon Kirby
2003-05-22 13:03                                           ` Martin Josefsson
2003-05-23  0:55                                             ` David S. Miller
2003-05-22 22:33                                           ` David S. Miller
2003-05-29 20:51                                             ` Simon Kirby
2003-06-02 10:58                                               ` Robert Olsson
2003-06-02 15:18                                                 ` Simon Kirby
2003-06-02 16:36                                                   ` Robert Olsson
2003-06-02 18:05                                                     ` Simon Kirby
2003-06-09 17:21                                                     ` David S. Miller
2003-06-09 17:19                                                 ` David S. Miller
2003-05-23  0:59                                           ` David S. Miller
2003-05-26  7:18                         ` Florian Weimer
2003-05-26  7:29                           ` David S. Miller
2003-05-26  9:34                             ` Florian Weimer
2003-05-27  6:32                               ` David S. Miller
2003-06-08 11:39                                 ` Florian Weimer
2003-06-08 12:05                                   ` David S. Miller
2003-06-08 13:10                                     ` Florian Weimer
2003-06-08 23:49                                       ` Simon Kirby
2003-06-08 23:55                                         ` CIT/Paul
2003-06-09  3:15                                           ` Jamal Hadi
2003-06-09  5:27                                             ` CIT/Paul
2003-06-09  5:58                                               ` David S. Miller
2003-06-09  6:28                                                 ` CIT/Paul
2003-06-09  6:28                                                   ` David S. Miller
2003-06-09 16:23                                                     ` Stephen Hemminger
2003-06-09 16:37                                                       ` David S. Miller
2003-06-09  7:13                                                   ` Simon Kirby
2003-06-09  8:10                                                     ` CIT/Paul
2003-06-09  8:27                                                       ` Simon Kirby
2003-06-09 19:38                                                         ` CIT/Paul
2003-06-09 21:30                                                           ` David S. Miller
2003-06-09 22:19                                                           ` Simon Kirby
2003-06-09 22:54                                                             ` Robert Olsson
2003-06-13  6:21                                                               ` David S. Miller
2003-06-13 10:40                                                                 ` Robert Olsson
2003-06-15  6:36                                                                   ` David S. Miller
2003-06-17 17:03                                                                   ` Robert Olsson
2003-06-09 22:56                                                             ` CIT/Paul
2003-06-09 23:05                                                               ` David S. Miller
2003-06-10 13:41                                                                 ` Robert Olsson
2003-06-10  0:03                                                               ` Jamal Hadi
2003-06-10  0:32                                                                 ` Ralph Doncaster
2003-06-10  1:15                                                                   ` Jamal Hadi
2003-06-10  2:45                                                                     ` Ralph Doncaster
2003-06-10  3:23                                                                       ` Ben Greear
2003-06-10  3:41                                                                         ` Ralph Doncaster
2003-06-10 18:10                                                                         ` Ralph Doncaster
2003-06-10 18:21                                                                           ` Ben Greear
2003-06-10  4:34                                                                       ` Simon Kirby
2003-06-10 11:01                                                                         ` Jamal Hadi
2003-06-10 11:28                                                                         ` Jamal Hadi
2003-06-10 13:18                                                                           ` Ralph Doncaster
2003-06-10 16:10                                                                         ` David S. Miller
2003-06-10 10:53                                                                       ` Jamal Hadi
2003-06-10 11:41                                                                         ` chas williams
2003-06-10 16:27                                                                           ` David S. Miller
2003-06-10 16:57                                                                             ` chas williams
2003-06-10 11:41                                                                         ` Pekka Savola
2003-06-10 11:58                                                                           ` John S. Denker
2003-06-10 12:12                                                                             ` Jamal Hadi
2003-06-10 16:33                                                                               ` David S. Miller
2003-06-10 12:07                                                                           ` Jamal Hadi
2003-06-10 15:29                                                                             ` Ralph Doncaster
2003-06-11 19:48                                                                               ` Florian Weimer
2003-06-11 19:40                                                                                 ` CIT/Paul
2003-06-11 21:09                                                                                 ` Florian Weimer
2003-06-10 13:10                                                                         ` Ralph Doncaster
2003-06-10 13:36                                                                           ` Jamal Hadi
2003-06-10 14:03                                                                             ` Ralph Doncaster
2003-06-10 16:38                                                                           ` David S. Miller
2003-06-10 16:39                                                                           ` David S. Miller
2003-06-10 18:41                                                                         ` Florian Weimer
2003-06-11 11:47                                                                           ` Was (Re: " Jamal Hadi
2003-06-11 18:41                                                                             ` Real World Routers 8-) Florian Weimer
2003-06-10 15:53                                                                     ` Route cache performance under stress David S. Miller
2003-06-10 16:15                                                                       ` 3c59x (was Route cache performance under stress) Bogdan Costescu
2003-06-10 16:20                                                                         ` Andi Kleen
2003-06-10 16:23                                                                           ` Jeff Garzik
2003-06-10 17:02                                                                             ` 3c59x David S. Miller
2003-06-10 17:16                                                                               ` 3c59x Jeff Garzik
2003-06-10 17:14                                                                                 ` 3c59x David S. Miller
2003-06-10 17:25                                                                                   ` 3c59x Jeff Garzik
2003-06-10 17:30                                                                                     ` 3c59x David S. Miller
2003-06-10 19:20                                                                                       ` 3c59x Jeff Garzik
2003-06-10 19:21                                                                                         ` 3c59x David S. Miller
2003-06-10 17:18                                                                                 ` 3c59x Andi Kleen
2003-06-10 17:29                                                                                 ` 3c59x chas williams
2003-06-10 17:31                                                                                   ` 3c59x David S. Miller
2003-06-10 17:39                                                                                     ` 3c59x chas williams
2003-06-10 17:43                                                                                       ` 3c59x David S. Miller
2003-06-11 17:52                                                                     ` Route cache performance under stress Robert Olsson
2003-06-10  1:53                                                                   ` Simon Kirby
2003-06-10  3:18                                                                     ` Ralph Doncaster
2003-06-10 16:06                                                                       ` David S. Miller
2003-06-10 15:56                                                                     ` David S. Miller
2003-06-10 16:45                                                                       ` 3c59x (was Route cache performance under stress) Bogdan Costescu
2003-06-10 16:49                                                                         ` Andi Kleen
2003-06-11  9:54                                                                           ` Robert Olsson
2003-06-11 10:05                                                                             ` Andi Kleen
2003-06-11 10:38                                                                               ` Robert Olsson
2003-06-11 12:08                                                                               ` Jamal Hadi
2003-06-10 17:12                                                                         ` 3c59x David S. Miller
2003-06-10 17:19                                                                       ` Route cache performance under stress Ralph Doncaster
2003-06-10 15:49                                                                   ` David S. Miller
2003-06-10 17:33                                                                     ` Ralph Doncaster
2003-06-10 17:32                                                                       ` David S. Miller
2003-06-10 18:34                                                                         ` Robert Olsson
2003-06-10 18:57                                                                           ` David S. Miller
2003-06-10 19:53                                                                             ` Robert Olsson
2003-06-10 21:36                                                                             ` CIT/Paul
2003-06-10 21:39                                                                             ` Ralph Doncaster
2003-06-10 22:20                                                                               ` David S. Miller
2003-06-10 23:58                                                                                 ` Ralph Doncaster
2003-06-10 23:57                                                                                   ` David S. Miller
2003-06-11  0:41                                                                                     ` Ralph Doncaster
2003-06-11  0:58                                                                                       ` David S. Miller
2003-06-11  0:58                                                                                       ` David S. Miller
2003-06-11  0:51                                                                                   ` Ben Greear
2003-06-11  1:01                                                                                     ` David S. Miller
2003-06-11  1:15                                                                                       ` Ben Greear
2003-06-11  1:22                                                                                         ` David S. Miller
2003-06-11  1:51                                                                                           ` Ben Greear
2003-06-11  3:33                                                                                             ` David S. Miller
2003-06-11 11:54                                                                                               ` gettime: Was (Re: " Jamal Hadi
2003-06-11 12:08                                                                                                 ` Andi Kleen
2003-06-12  3:30                                                                                                   ` David S. Miller
2003-06-12  6:32                                                                                                     ` Ben Greear
2003-06-12  8:46                                                                                                       ` David S. Miller
2003-06-11 15:57                                                                                                 ` Ben Greear
2003-06-12  3:29                                                                                                 ` David S. Miller
2003-06-11  1:17                                                                                       ` Ralph Doncaster
2003-06-11  1:23                                                                                         ` David S. Miller
2003-06-11  7:28                                                                                           ` Andi Kleen
2003-06-11  7:25                                                                                       ` Andi Kleen
2003-06-11 17:40                                                                             ` Robert Olsson
2003-06-13  5:38                                                                               ` David S. Miller
2003-06-13 10:22                                                                                 ` Robert Olsson
2003-06-13 17:15                                                                                 ` Robert Olsson
2003-06-12  6:45                                                                           ` David S. Miller
2003-06-12 13:56                                                                             ` Robert Olsson
2003-06-12 21:35                                                                               ` David S. Miller
2003-06-13 10:50                                                                                 ` Robert Olsson
2003-06-10  0:56                                                             ` Ralph Doncaster
2003-06-09 11:38                                                       ` Jamal Hadi
2003-06-09 11:55                                                         ` David S. Miller
2003-06-09 12:18                                                           ` Jamal Hadi
2003-06-09 12:32                                                             ` David S. Miller
2003-06-09 13:22                                                               ` Jamal Hadi
2003-06-09 13:22                                                                 ` David S. Miller
2003-06-09  8:56                                                     ` David S. Miller
2003-06-09 22:39                                                       ` Robert Olsson
2003-06-09  6:25                                             ` David S. Miller
2003-06-09  6:59                                               ` Simon Kirby
2003-06-09  7:03                                                 ` David S. Miller
2003-06-09 13:04                                             ` Ralph Doncaster
2003-06-09 13:26                                               ` Jamal Hadi
2003-06-09  5:44                                           ` David S. Miller
2003-06-09  5:51                                             ` CIT/Paul
2003-06-09  6:03                                               ` David S. Miller
2003-06-09  6:52                                                 ` Simon Kirby
2003-06-09  6:56                                                   ` David S. Miller
2003-06-09  7:36                                                     ` Simon Kirby
2003-06-09  8:18                                                     ` Simon Kirby
2003-06-09  8:22                                                       ` David S. Miller
2003-06-09  8:31                                                         ` Simon Kirby
2003-06-09  9:01                                                       ` David S. Miller
2003-06-09  9:47                                                         ` Andi Kleen
2003-06-09 10:03                                                           ` David S. Miller
2003-06-09 10:13                                                             ` Andi Kleen
2003-06-09 10:13                                                               ` David S. Miller
2003-06-09 10:40                                                                 ` YOSHIFUJI Hideaki / 吉藤英明
2003-06-09 10:40                                                                   ` David S. Miller
2003-06-09 14:14                                                       ` David S. Miller
2003-06-09  6:47                                           ` Simon Kirby
2003-06-09  6:49                                             ` David S. Miller
2003-06-09 13:28                                             ` Ralph Doncaster
2003-06-09 16:30                                               ` Simon Kirby
2003-06-17 20:58                                                 ` Florian Weimer
2003-06-09  5:38                                         ` David S. Miller
2003-06-10  3:05                                       ` Steven Blake
2003-06-12  6:31                                         ` David S. Miller
2003-06-08 17:58                                   ` Pekka Savola
2003-05-21  0:09                       ` Simon Kirby
2003-05-21  0:13                         ` David S. Miller
2003-05-26  9:29                           ` Florian Weimer
2003-04-08  6:14 Scott A Crosby
