Re: [ANNOUNCE] NF-HIPAC: High Performance Packet Classification

From: Roberto Nibali <ratz@drugphish.ch>
To: Andi Kleen <ak@suse.de>
Cc: "David S. Miller" <davem@redhat.com>,
	niv@us.ibm.com, linux-kernel@vger.kernel.org,
	jamal <hadi@cyberus.ca>, netdev <netdev@oss.sgi.com>
Subject: Re: [ANNOUNCE] NF-HIPAC: High Performance Packet Classification
Date: Thu, 26 Sep 2002 22:49:14 +0200
Message-ID: <3D9372CA.7080203@drugphish.ch>
In-Reply-To: 20020926140430.E14485@wotan.suse.de

> For iptables/ipchain you need to write hierarchical/port range rules 
> in this case and try to terminate searches early.

We're still trying to find the correct mathematical functions to do 
this. Trust me, it is not so easy: mapping the port matrix and the 
network flow through many stacked packet filters and firewalls 
generates a rather complex graph (partly a bigraph, LVS-DR for example) 
with complex structures (redundancy and parallelisation). We cannot 
simply sit down and write a fw-script for our packet filters; the 
fw-script is generated by a meta-fw layer that knows about the 
surrounding network nodes.

> But yes, we also found that the L2 cache is limiting here
> (ip_conntrack has the same problem) 

I think this weekend I will extend my tests to also measure some CPU 
performance counters with oprofile, such as DATA_READ_MISS, CODE CACHE 
MISS and NONCACHEABLE_MEMORY_READS.

> At least  that is easily fixed. Just increase the LOG_BUF_LEN parameter
> in kernel/printk.c

Tests showed that this only helps in peak situations. I think we should 
simply forget about printk().

> Alternatively don't use slow printk, but nfnetlink to report bad packets
> and print from user space. That should scale much better.

Yes, and there are a few things that my colleague found out during his 
tests (actually pretty straightforward things):

1. A big log buffer is only useful for riding out peaks.
2. A big log buffer doesn't help at all when the CPU load is high.
3. The smaller the message, the better (so binary logging is an
    advantage).
4. Logging via printk() is extremely expensive because of the
    conversions and whatnot. A rough estimate is 12500 clock cycles
    per log entry generated by printk(). On a PIII/450 a log entry
    therefore takes about 0.000028s, which leads to the following
    observation: at 36000pps, all of which should be logged, the
    system ends up at 100% CPU load and 0% idle.
5. The kernel should log a binary stream, and so should the daemon that
    fetches the data. If you want to convert the binary data to a
    human-readable format, start a process with low priority or do it
    on demand.
6. Ideally the log daemon should be preemptible so that it gets a
    defined time slice to do its job.
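
As a rough check of the estimate in point 4, the arithmetic can be sketched in a few lines of Python. The function name printk_budget is made up for this illustration, and the sketch assumes the whole CPU is spent on nothing but logging, which is of course a simplification:

```python
def printk_budget(cycles_per_entry, cpu_hz):
    """Return (seconds per log entry, log entries per second at 100% CPU).

    Hypothetical helper: assumes every clock cycle goes into logging.
    """
    seconds_per_entry = cycles_per_entry / cpu_hz
    max_entries_per_s = cpu_hz / cycles_per_entry
    return seconds_per_entry, max_entries_per_s


# 12500 cycles per printk() entry on a 450 MHz PIII, as estimated above.
secs, max_pps = printk_budget(12_500, 450e6)
print(f"{secs:.6f} s per entry, about {max_pps:.0f} entries/s")
# prints: 0.000028 s per entry, about 36000 entries/s
```

So the 36000pps figure is simply the rate at which the per-entry cost alone eats every available cycle.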

Some results from tests conducted by a coworker of mine (Achim Gsell):

Max pkt rate the system can log without losing more than 1% of the messages:
----------------------------------------------------------------------------

kernel:		Linux 2.4.19-gentoo-r7 (low latency scheduling)

daemon:		syslog-ng (nice 0), logbufsiz=16k, pkts=10*10000, CPU=PIII/450
packet-len:	64		256		512		1024

		2873pkt/s	3332pkt/s	3124pkt/s	3067pkt/s
		1.4 Mb/s	6.6Mb/s		12.2Mb/s	23.9Mb/s

daemon:		syslog-ng (nice 0), logbufsiz=16k, pkts=10*10000, CPU=PIVM/1.7
packet-len:	64		256		512		1024

		7808pkt/s	7807pkt/s	7806pkt/s	-
		3.8 Mb/s	15.2Mb/s	30.5Mb/s	-

----------------------------------------------------------------------------------------------------------

daemon:		cat /proc/kmsg > kernlog, logbufsiz=16k, pkts=10*10000, CPU=PIII/450
packet-len:	64		256		512		1024

		4300pkt/s	-		-		3076pkt/s
		2.1 Mb/s	-		-		24.0Mb/s

daemon:		ulogd (nlbufsize=4k, qthreshold=1), pkts=10*10000, CPU=PIII/450
packet-len:	64		256		512		1024

		4097pkt/s	-		-		4097pkt/s
		2.0 Mb/s	-		-		32  Mb/s

daemon:		ulogd (nlbufsize=2^17 - 1, qthreshold=1), pkts=10*10000, CPU=PIII/450
packet-len:	64		256		512		1024

		6576pkt/s	-		-		5000pkt/s
		3.2 Mb/s	-		-		38  Mb/s

daemon:		ulogd (nlbufsize=64k, qthreshold=1), pkts=1*10000, CPU=PIII/450
packet-len:	64		256		512		1024

		-		-		-		-
		-		-		-		4.0 Mb/s

daemon:		ulogd (nlbufsize=2^17 - 1, qthreshold=50), pkts=10*10000, CPU=PIII/450
packet-len:	64		256		512		1024

		6170pkt/s	-		-		5000pkt/s
		3.0 Mb/s	-		-		38  Mb/s
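
For reading the tables: most of the Mb/s figures appear to follow from the packet rate and packet length if Mb/s is taken as 2^20 bits per second. That interpretation is an assumption inferred from the numbers, not something stated with the measurements, and a few entries (for example 6.6Mb/s and 38 Mb/s) differ slightly from what it gives. A minimal sketch, with mbit_per_s as a made-up helper name:

```python
def mbit_per_s(pkts_per_s, packet_len_bytes):
    """Throughput in Mb/s, assuming 1 Mb = 2**20 bits (an assumption)."""
    return pkts_per_s * packet_len_bytes * 8 / 2**20


# A few (rate, packet length) pairs taken from the tables above.
for pps, length in [(2873, 64), (3124, 512), (4097, 1024)]:
    print(f"{pps} pkt/s at {length} bytes -> {mbit_per_s(pps, length):.1f} Mb/s")
# prints 1.4, 12.2 and 32.0 Mb/s respectively
```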

Best regards,
Roberto Nibali, ratz
-- 
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq'|dc