* [PATCH] Adds sys_set_mempolicy() in include/linux/syscalls.h , Re: [PATCH, netfilter] NUMA aware ipv4/netfilter/ip_tables.c
From: Eric Dumazet @ 2005-09-20 21:45 UTC (permalink / raw)
  To: Andi Kleen, linux-kernel; +Cc: netfilter-devel, netdev

Andi Kleen wrote:
> On Tuesday 20 September 2005 11:47, Eric Dumazet wrote:
> I would prefer if random code didn't mess with mempolicy internals
> like this. Better just call sys_set_mempolicy() 
> 
> -Andi


OK, but this prior patch seems necessary:

- Adds sys_set_mempolicy() in include/linux/syscalls.h

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>



* [PATCH] Adds sys_set_mempolicy() in include/linux/syscalls.h
From: Eric Dumazet @ 2005-09-20 21:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andi Kleen, linux-kernel, netfilter-devel, netdev


Andi Kleen wrote:

 > On Tuesday 20 September 2005 11:47, Eric Dumazet wrote:
 > I would prefer if random code didn't mess with mempolicy internals
 > like this. Better just call sys_set_mempolicy()
 > -Andi


OK, but this prior patch seems necessary:

- Adds sys_set_mempolicy() in include/linux/syscalls.h

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>




[-- Attachment #2: patch_add_sys_set_mempolicy --]

--- linux-2.6/include/linux/syscalls.h	2005-09-06 01:20:21.000000000 +0200
+++ linux-2.6-ed/include/linux/syscalls.h	2005-09-20 23:43:07.000000000 +0200
@@ -508,5 +508,7 @@
 
 asmlinkage long sys_ioprio_set(int which, int who, int ioprio);
 asmlinkage long sys_ioprio_get(int which, int who);
+asmlinkage long sys_set_mempolicy(int mode, unsigned long __user *nmask,
+					unsigned long maxnode);
 
 #endif


* [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
From: Eric Dumazet @ 2005-09-21 21:24 UTC (permalink / raw)
  To: linux-kernel, netfilter-devel, netdev; +Cc: Andi Kleen

Hi

I have reworked net/ipv4/ip_tables.c to boost its performance, and am posting
three patches.

In oprofile results, ipt_do_table() used to be in first position.
It is now in 6th position, using a third of the CPU time it used before.
(Tests were done on a dual-Xeon i386 and a dual-Opteron x86_64.)

Short description:

1) No more central rwlock protecting each table (filter, nat, mangle, raw),
    but one lock per CPU. This avoids cache-line ping-pong on every packet.

2) Loop unrolling and various code optimizations to reduce branch
    mispredictions.

3) NUMA-aware allocation of memory (this part was posted earlier, but needed
    some polishing).


Long description:

1) No single rwlock_t protecting the 'curtain' anymore

One major bottleneck on SMP machines is that a single rwlock is taken on
entry to and exit from ipt_do_table(). Those two atomic operations are the
killer: even though multiple readers can work on the same table concurrently
(using read_lock_bh()), the cache line containing the rwlock must still be
acquired exclusively by each CPU on entry/exit of ipt_do_table().

As the existing code already uses separate copies of the data for each CPU,
it is easy to convert the central rwlock into separate rwlocks, allocated
per-CPU (and NUMA-aware).

When a CPU enters ipt_do_table(), it locks its local copy, touching a cache
line that no other CPU uses. Further operations are done on the 'local' copy
of the data.

When all counters must be summed, we can write_lock one per-CPU part at a
time, instead of locking all the parts at once, reducing lock contention.
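
To illustrate, here is a minimal sketch of the locking scheme (hypothetical
names, simplified from the actual code in patch 1/3):

/* one rwlock per cpu instead of one per table */
rwlock_t *lock_p = alloc_percpu(rwlock_t);
int cpu;

/* hot path : each cpu takes only its own lock, so the cache line
 * holding the lock never bounces between cpus */
local_bh_disable();
read_lock(per_cpu_ptr(lock_p, smp_processor_id()));
/* ... match the packet against this cpu's copy of the rules ... */
read_unlock(per_cpu_ptr(lock_p, smp_processor_id()));
local_bh_enable();

/* slow path (counter summation) : write_lock one part at a time */
for_each_cpu(cpu) {
	write_lock_bh(per_cpu_ptr(lock_p, cpu));
	/* ... fold this cpu's counters into the totals ... */
	write_unlock_bh(per_cpu_ptr(lock_p, cpu));
}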

2) Loop unrolling

It seems that with current compilers and CFLAGS, the code generated for
ip_packet_match() is very poor, using lots of mispredicted conditional
branches. With these patches, the generated code on i386 and x86_64 is
much better.
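
The key part is comparing the IFNAMSIZ-byte interface names word by word
with a mask, the loop unrolled by hand so that gcc emits no conditional
branches. A rough sketch of the idea (a hypothetical helper; the real code
is the compare_if_lstrings() macro in patch 2/3):

/* non-zero result means mismatch; IFNAMSIZ is 16 bytes, so this is
 * two word compares on 64bit, four on 32bit */
static inline unsigned long ifname_mismatch(const unsigned long *dev,
					    const unsigned long *want,
					    const unsigned long *mask)
{
	unsigned long ret;

	ret  = (dev[0] ^ want[0]) & mask[0];
	ret |= (dev[1] ^ want[1]) & mask[1];
#if BITS_PER_LONG == 32
	ret |= (dev[2] ^ want[2]) & mask[2];
	ret |= (dev[3] ^ want[3]) & mask[3];
#endif
	return ret;
}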

3) NUMA allocation.

Part of the performance problem we have with netfilter is that its memory
allocation is not NUMA-aware, only SMP-aware (i.e. each CPU normally touches
separate cache lines).

Even with small iptables rule sets, the cost of this misplacement can be
high on common workloads.

Instead of using one vmalloc() area (located on the node of the iptables
process), we now allocate one area per possible CPU, using the MPOL_PREFERRED
NUMA policy so that memory is allocated on the CPU's node if possible.

If the ipt_table is small enough (less than one page), we use kmalloc_node()
instead of vmalloc(), to use less memory and fewer TLB entries in small
setups.
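
Condensed, the per-cpu allocation strategy is (simplified from
alloc_table_info() in patch 3/3, error handling omitted):

for_each_cpu(cpu) {
	if (size <= PAGE_SIZE)
		/* small table : node-local slab memory */
		newinfo->entries[cpu] = kmalloc_node(size, GFP_KERNEL,
						     cpu_to_node(cpu));
	else
		/* large table : vmalloc() under a temporary MPOL_PREFERRED
		 * policy pointing at cpu_to_node(cpu) */
		newinfo->entries[cpu] = vmalloc(size);
}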

Note 1: I also optimized get_counters(): SET_COUNTER() on the first CPU
avoids a memset(), and on SMP the other CPUs use ADD_COUNTER().

Note 2: This patch series depends on another patch that declares
sys_set_mempolicy() in include/linux/syscalls.h
( http://marc.theaimsgroup.com/?l=linux-kernel&m=112725288622984&w=2 )


Thank you
Eric Dumazet


* [PATCH 1/3] netfilter : 3 patches to boost ip_tables performance
From: Eric Dumazet @ 2005-09-21 21:29 UTC (permalink / raw)
  To: linux-kernel, netfilter-devel, netdev; +Cc: Andi Kleen


Patch 1/3

1) No single rwlock_t protecting the 'curtain' anymore

One major bottleneck on SMP machines is that a single rwlock is taken on
entry to and exit from ipt_do_table(). Those two atomic operations are the
killer: even though multiple readers can work on the same table concurrently
(using read_lock_bh()), the cache line containing the rwlock must still be
acquired exclusively by each CPU on entry/exit of ipt_do_table().

As the existing code already uses separate copies of the data for each CPU,
it is easy to convert the central rwlock into separate rwlocks, allocated
per-CPU (and NUMA-aware).

When a CPU enters ipt_do_table(), it locks its local copy, touching a cache
line that no other CPU uses. Further operations are done on the 'local' copy
of the data.

When all counters must be summed, we can write_lock one per-CPU part at a
time, instead of locking all the parts at once, reducing lock contention.

Note 1: I also optimized get_counters(): SET_COUNTER() on the first CPU
avoids a memset(), and on SMP the other CPUs use ADD_COUNTER().

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

[-- Attachment #2: patch_ip_tables_numa.1 --]

--- linux-2.6/include/linux/netfilter_ipv4/ip_tables.h	2005-09-13 05:12:09.000000000 +0200
+++ linux-2.6-ed/include/linux/netfilter_ipv4/ip_tables.h	2005-09-21 17:42:07.000000000 +0200
@@ -445,7 +445,11 @@
 	unsigned int valid_hooks;
 
 	/* Lock for the curtain */
+#ifdef CONFIG_SMP
+	rwlock_t *lock_p; /* one lock per cpu */
+#else
 	rwlock_t lock;
+#endif
 
 	/* Man behind the curtain... */
 	struct ipt_table_info *private;
--- linux-2.6.14-rc1.p1/net/ipv4/netfilter/ip_nat_rule.c	2005-09-13 05:12:09.000000000 +0200
+++ linux-2.6.14-rc1/net/ipv4/netfilter/ip_nat_rule.c	2005-09-21 17:44:01.000000000 +0200
@@ -93,7 +93,6 @@
 static struct ipt_table nat_table = {
 	.name		= "nat",
 	.valid_hooks	= NAT_VALID_HOOKS,
-	.lock		= RW_LOCK_UNLOCKED,
 	.me		= THIS_MODULE,
 };
 
--- linux-2.6.14-rc1.p1/net/ipv4/netfilter/iptable_filter.c	2005-09-13 05:12:09.000000000 +0200
+++ linux-2.6.14-rc1/net/ipv4/netfilter/iptable_filter.c	2005-09-21 17:45:20.000000000 +0200
@@ -77,7 +77,6 @@
 static struct ipt_table packet_filter = {
 	.name		= "filter",
 	.valid_hooks	= FILTER_VALID_HOOKS,
-	.lock		= RW_LOCK_UNLOCKED,
 	.me		= THIS_MODULE
 };
 
--- linux-2.6.14-rc1.p1/net/ipv4/netfilter/iptable_mangle.c	2005-09-13 05:12:09.000000000 +0200
+++ linux-2.6.14-rc1/net/ipv4/netfilter/iptable_mangle.c	2005-09-21 17:45:20.000000000 +0200
@@ -107,7 +107,6 @@
 static struct ipt_table packet_mangler = {
 	.name		= "mangle",
 	.valid_hooks	= MANGLE_VALID_HOOKS,
-	.lock		= RW_LOCK_UNLOCKED,
 	.me		= THIS_MODULE,
 };
 
--- linux-2.6.14-rc1.p1/net/ipv4/netfilter/iptable_raw.c	2005-09-13 05:12:09.000000000 +0200
+++ linux-2.6.14-rc1/net/ipv4/netfilter/iptable_raw.c	2005-09-21 17:45:20.000000000 +0200
@@ -82,7 +82,6 @@
 static struct ipt_table packet_raw = { 
 	.name = "raw", 
 	.valid_hooks =  RAW_VALID_HOOKS, 
-	.lock = RW_LOCK_UNLOCKED, 
 	.me = THIS_MODULE
 };
 
--- linux-2.6.14-rc1.p1/net/ipv4/netfilter/ip_tables.c	2005-09-19 19:56:12.000000000 +0200
+++ linux-2.6.14-rc1/net/ipv4/netfilter/ip_tables.c	2005-09-21 23:49:13.000000000 +0200
@@ -110,6 +110,7 @@
 static LIST_HEAD(ipt_target);
 static LIST_HEAD(ipt_match);
 static LIST_HEAD(ipt_tables);
+#define SET_COUNTER(c,b,p) do { (c).bcnt = (b); (c).pcnt = (p); } while(0)
 #define ADD_COUNTER(c,b,p) do { (c).bcnt += (b); (c).pcnt += (p); } while(0)
 
 #ifdef CONFIG_SMP
@@ -273,6 +274,9 @@
 	const char *indev, *outdev;
 	void *table_base;
 	struct ipt_entry *e, *back;
+#ifdef CONFIG_SMP
+	rwlock_t *lock_p;
+#endif
 
 	/* Initialization */
 	ip = (*pskb)->nh.iph;
@@ -287,7 +291,18 @@
 	 * match it. */
 	offset = ntohs(ip->frag_off) & IP_OFFSET;
 
+#ifdef CONFIG_SMP
+	/*
+	 * We expand read_lock_bh() here because we need to get
+	 * the address of this cpu rwlock_t
+	 */
+	local_bh_disable();
+	preempt_disable();
+	lock_p = per_cpu_ptr(table->lock_p, smp_processor_id());
+	_raw_read_lock(lock_p);
+#else
 	read_lock_bh(&table->lock);
+#endif
 	IP_NF_ASSERT(table->valid_hooks & (1 << hook));
 	table_base = (void *)table->private->entries
 		+ TABLE_OFFSET(table->private, smp_processor_id());
@@ -396,7 +411,11 @@
 #ifdef CONFIG_NETFILTER_DEBUG
 	((struct ipt_entry *)table_base)->comefrom = 0xdead57ac;
 #endif
+#ifdef CONFIG_SMP
+	read_unlock_bh(lock_p);
+#else
 	read_unlock_bh(&table->lock);
+#endif
 
 #ifdef DEBUG_ALLOW_ALL
 	return NF_ACCEPT;
@@ -930,6 +949,30 @@
 	return ret;
 }
 
+static void ipt_table_global_lock(struct ipt_table *table)
+{
+#ifdef CONFIG_SMP
+	int cpu;
+	for_each_cpu(cpu) {
+		write_lock_bh(per_cpu_ptr(table->lock_p, cpu));
+	}
+#else
+	write_lock_bh(&table->lock);
+#endif
+}
+
+static void ipt_table_global_unlock(struct ipt_table *table)
+{
+#ifdef CONFIG_SMP
+	int cpu;
+	for_each_cpu(cpu) {
+		write_unlock_bh(per_cpu_ptr(table->lock_p, cpu));
+	}
+#else
+	write_unlock_bh(&table->lock);
+#endif
+}
+
 static struct ipt_table_info *
 replace_table(struct ipt_table *table,
 	      unsigned int num_counters,
@@ -954,24 +997,25 @@
 #endif
 
 	/* Do the substitution. */
-	write_lock_bh(&table->lock);
+	ipt_table_global_lock(table);
 	/* Check inside lock: is the old number correct? */
 	if (num_counters != table->private->number) {
 		duprintf("num_counters != table->private->number (%u/%u)\n",
 			 num_counters, table->private->number);
-		write_unlock_bh(&table->lock);
+		ipt_table_global_unlock(table);
 		*error = -EAGAIN;
 		return NULL;
 	}
 	oldinfo = table->private;
 	table->private = newinfo;
 	newinfo->initial_entries = oldinfo->initial_entries;
-	write_unlock_bh(&table->lock);
+	ipt_table_global_unlock(table);
 
 	return oldinfo;
 }
 
 /* Gets counters. */
+#ifdef CONFIG_SMP
 static inline int
 add_entry_to_counter(const struct ipt_entry *e,
 		     struct ipt_counters total[],
@@ -982,22 +1026,71 @@
 	(*i)++;
 	return 0;
 }
+#endif
+static inline int
+set_entry_to_counter(const struct ipt_entry *e,
+		     struct ipt_counters total[],
+		     unsigned int *i)
+{
+	SET_COUNTER(total[*i], e->counters.bcnt, e->counters.pcnt);
+
+	(*i)++;
+	return 0;
+}
 
+/*
+ * if table is NULL, no locking should be done
+ */
 static void
-get_counters(const struct ipt_table_info *t,
+get_counters(struct ipt_table *table,
+		const struct ipt_table_info *t,
 	     struct ipt_counters counters[])
 {
-	unsigned int cpu;
 	unsigned int i;
+#ifdef CONFIG_SMP
+	unsigned int cpu;
+	unsigned int curcpu;
+
+	curcpu = get_cpu();
+
+	if (table)
+		write_lock_bh(per_cpu_ptr(table->lock_p, curcpu));
+	i = 0;
+	IPT_ENTRY_ITERATE(t->entries + TABLE_OFFSET(t, curcpu),
+			  t->size,
+			  set_entry_to_counter,
+			  counters,
+			  &i);
+	if (table)
+		write_unlock_bh(per_cpu_ptr(table->lock_p, curcpu));
 
-	for (cpu = 0; cpu < num_possible_cpus(); cpu++) {
+	for_each_cpu(cpu) {
+		if (cpu == curcpu)
+			continue;
+		if (table)
+			write_lock_bh(per_cpu_ptr(table->lock_p, cpu));
 		i = 0;
 		IPT_ENTRY_ITERATE(t->entries + TABLE_OFFSET(t, cpu),
 				  t->size,
 				  add_entry_to_counter,
 				  counters,
 				  &i);
+		if (table)
+			write_unlock_bh(per_cpu_ptr(table->lock_p, cpu));
 	}
+	put_cpu();
+#else
+	if (table)
+		write_lock_bh(&table->lock);
+	i = 0;
+	IPT_ENTRY_ITERATE(t->entries,
+			  t->size,
+			  set_entry_to_counter,
+			  counters,
+			  &i);
+	if (table)
+		write_unlock_bh(&table->lock);
+#endif
 }
 
 static int
@@ -1020,10 +1113,7 @@
 		return -ENOMEM;
 
 	/* First, sum counters... */
-	memset(counters, 0, countersize);
-	write_lock_bh(&table->lock);
-	get_counters(table->private, counters);
-	write_unlock_bh(&table->lock);
+	get_counters(table, table->private, counters);
 
 	/* ... then copy entire thing from CPU 0... */
 	if (copy_to_user(userptr, table->private->entries, total_size) != 0) {
@@ -1183,7 +1273,7 @@
 		module_put(t->me);
 
 	/* Get the old counters. */
-	get_counters(oldinfo, counters);
+	get_counters(NULL, oldinfo, counters);
 	/* Decrease module usage counts and free resource */
 	IPT_ENTRY_ITERATE(oldinfo->entries, oldinfo->size, cleanup_entry,NULL);
 	vfree(oldinfo);
@@ -1235,6 +1325,9 @@
 	struct ipt_counters_info tmp, *paddc;
 	struct ipt_table *t;
 	int ret = 0;
+#ifdef CONFIG_SMP
+	rwlock_t *lockp;
+#endif
 
 	if (copy_from_user(&tmp, user, sizeof(tmp)) != 0)
 		return -EFAULT;
@@ -1257,7 +1350,12 @@
 		goto free;
 	}
 
+#ifdef CONFIG_SMP
+	lockp = per_cpu_ptr(t->lock_p, get_cpu());
+	write_lock_bh(lockp);
+#else
 	write_lock_bh(&t->lock);
+#endif
 	if (t->private->number != paddc->num_counters) {
 		ret = -EINVAL;
 		goto unlock_up_free;
@@ -1270,7 +1368,12 @@
 			  paddc->counters,
 			  &i);
  unlock_up_free:
+#ifdef CONFIG_SMP
+	write_unlock_bh(lockp);
+	put_cpu();
+#else
 	write_unlock_bh(&t->lock);
+#endif
 	up(&ipt_mutex);
 	module_put(t->me);
  free:
@@ -1456,7 +1559,17 @@
 	struct ipt_table_info *newinfo;
 	static struct ipt_table_info bootstrap
 		= { 0, 0, 0, { 0 }, { 0 }, { } };
-
+#ifdef CONFIG_SMP
+	int cpu;
+	if (!table->lock_p) {
+		if ((table->lock_p = alloc_percpu(rwlock_t)) == NULL)
+			return -ENOMEM;
+	}
+	for_each_cpu(cpu)
+		rwlock_init(per_cpu_ptr(table->lock_p, cpu));
+#else
+	rwlock_init(&table->lock);
+#endif
 	newinfo = vmalloc(sizeof(struct ipt_table_info)
 			  + SMP_ALIGN(repl->size) * num_possible_cpus());
 	if (!newinfo)
@@ -1497,7 +1610,6 @@
 	/* save number of initial entries */
 	table->private->initial_entries = table->private->number;
 
-	rwlock_init(&table->lock);
 	list_prepend(&ipt_tables, table);
 
  unlock:
@@ -1519,6 +1631,10 @@
 	IPT_ENTRY_ITERATE(table->private->entries, table->private->size,
 			  cleanup_entry, NULL);
 	vfree(table->private);
+#ifdef CONFIG_SMP
+	free_percpu(table->lock_p);
+	table->lock_p = NULL;
+#endif
 }
 
 /* Returns 1 if the port is matched by the range, 0 otherwise */


* [PATCH 2/3] netfilter : 3 patches to boost ip_tables performance
From: Eric Dumazet @ 2005-09-21 21:32 UTC (permalink / raw)
  To: linux-kernel, netfilter-devel, netdev; +Cc: Andi Kleen


Patch 2/3 (please apply after Patch 1/3)

2) Loop unrolling

It seems that with current compilers and CFLAGS, the code generated for
ip_packet_match() is very poor, using lots of mispredicted conditional
branches. With these patches, the generated code on i386 and x86_64 is
much better.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>



[-- Attachment #2: patch_ip_tables_numa.2 --]

--- linux-2.6/net/ipv4/netfilter/ip_tables.c	2005-09-21 23:55:30.000000000 +0200
+++ linux-2.6-ed/net/ipv4/netfilter/ip_tables.c	2005-09-22 00:39:29.000000000 +0200
@@ -125,6 +125,27 @@
 #define up(x) do { printk("UP:%u:" #x "\n", __LINE__); up(x); } while(0)
 #endif
 
+/*
+ * Special macro to compare IFNAMSIZ 'strings', with mask.
+ * Best results if fed 3 args pointing to 'unsigned long' types
+ * Loop is manually unrolled for performance reasons.
+ * gcc generates code without branches, IFNAMSIZ being a constant
+ */
+#define compare_if_lstrings(ret, devname, expect, expect_mask)			\
+	ret = ((devname)[0] ^ (expect)[0]) & (expect_mask)[0];			\
+	if (IFNAMSIZ > sizeof(*devname))					\
+		ret |= ((devname)[1] ^ (expect)[1]) & (expect_mask)[1];		\
+	if (IFNAMSIZ > 2 * sizeof(*devname))					\
+		ret |= ((devname)[2] ^ (expect)[2]) & (expect_mask)[2];		\
+	if (IFNAMSIZ > 3 * sizeof(*devname))					\
+		ret |= ((devname)[3] ^ (expect)[3]) & (expect_mask)[3];		\
+	/* just in case IFNAMSIZ is enlarged */					\
+	if (IFNAMSIZ > 4 * sizeof(*devname)) {					\
+		int i;								\
+		for (i = 4 ; (i < IFNAMSIZ/sizeof(*devname)); i++)		\
+			ret |= ((devname)[i] ^ (expect)[i]) & (expect_mask)[i];	\
+	}
+
 /* Returns whether matches rule or not. */
 static inline int
 ip_packet_match(const struct iphdr *ip,
@@ -133,16 +154,22 @@
 		const struct ipt_ip *ipinfo,
 		int isfrag)
 {
-	size_t i;
+	int bool1, bool2;
 	unsigned long ret;
 
 #define FWINV(bool,invflg) ((bool) ^ !!(ipinfo->invflags & invflg))
 
-	if (FWINV((ip->saddr&ipinfo->smsk.s_addr) != ipinfo->src.s_addr,
-		  IPT_INV_SRCIP)
-	    || FWINV((ip->daddr&ipinfo->dmsk.s_addr) != ipinfo->dst.s_addr,
-		     IPT_INV_DSTIP)) {
-		dprintf("Source or dest mismatch.\n");
+	bool1 = ((ip->saddr&ipinfo->smsk.s_addr) != ipinfo->src.s_addr);
+	bool1 ^= !!(ipinfo->invflags & IPT_INV_SRCIP);
+
+	bool2 = ((ip->daddr&ipinfo->dmsk.s_addr) != ipinfo->dst.s_addr);
+	bool2 ^= !!(ipinfo->invflags & IPT_INV_DSTIP);
+
+	if ((bool1 | bool2) != 0) {
+		if (bool1)
+			dprintf("Source%s mismatch.\n", bool2 ? " and Dest":"");
+		else
+			dprintf("Dest mismatch.\n");
 
 		dprintf("SRC: %u.%u.%u.%u. Mask: %u.%u.%u.%u. Target: %u.%u.%u.%u.%s\n",
 			NIPQUAD(ip->saddr),
@@ -157,27 +184,26 @@
 		return 0;
 	}
 
-	/* Look for ifname matches; this should unroll nicely. */
-	for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) {
-		ret |= (((const unsigned long *)indev)[i]
-			^ ((const unsigned long *)ipinfo->iniface)[i])
-			& ((const unsigned long *)ipinfo->iniface_mask)[i];
-	}
+	compare_if_lstrings(ret,
+		(const unsigned long *)indev,
+		(const unsigned long *)ipinfo->iniface,
+		(const unsigned long *)ipinfo->iniface_mask);
 
-	if (FWINV(ret != 0, IPT_INV_VIA_IN)) {
+	bool1 = FWINV(ret != 0, IPT_INV_VIA_IN);
+	if (bool1) {
 		dprintf("VIA in mismatch (%s vs %s).%s\n",
 			indev, ipinfo->iniface,
 			ipinfo->invflags&IPT_INV_VIA_IN ?" (INV)":"");
 		return 0;
 	}
 
-	for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) {
-		ret |= (((const unsigned long *)outdev)[i]
-			^ ((const unsigned long *)ipinfo->outiface)[i])
-			& ((const unsigned long *)ipinfo->outiface_mask)[i];
-	}
+	compare_if_lstrings(ret,
+		(const unsigned long *)outdev,
+		(const unsigned long *)ipinfo->outiface,
+		(const unsigned long *)ipinfo->outiface_mask);
 
-	if (FWINV(ret != 0, IPT_INV_VIA_OUT)) {
+	bool1 = FWINV(ret != 0, IPT_INV_VIA_OUT);
+	if (bool1) {
 		dprintf("VIA out mismatch (%s vs %s).%s\n",
 			outdev, ipinfo->outiface,
 			ipinfo->invflags&IPT_INV_VIA_OUT ?" (INV)":"");
@@ -185,8 +211,9 @@
 	}
 
 	/* Check specific protocol */
-	if (ipinfo->proto
-	    && FWINV(ip->protocol != ipinfo->proto, IPT_INV_PROTO)) {
+	bool1 = FWINV(ip->protocol != ipinfo->proto, IPT_INV_PROTO) ;
+	bool1 &= (ipinfo->proto != 0);
+	if (bool1) {
 		dprintf("Packet protocol %hi does not match %hi.%s\n",
 			ip->protocol, ipinfo->proto,
 			ipinfo->invflags&IPT_INV_PROTO ? " (INV)":"");


* [PATCH 3/3] netfilter : 3 patches to boost ip_tables performance
From: Eric Dumazet @ 2005-09-21 21:37 UTC (permalink / raw)
  To: linux-kernel, netfilter-devel, netdev; +Cc: Andi Kleen


Patch 3/3 (please apply after Patch 2/3)

3) NUMA allocation.

Part of the performance problem we have with netfilter is that its memory
allocation is not NUMA-aware, only SMP-aware (i.e. each CPU normally touches
separate cache lines).

Even with small iptables rule sets, the cost of this misplacement can be
high on common workloads.

Instead of using one vmalloc() area (located on the node of the iptables
process), we now allocate one area per possible CPU, using the MPOL_PREFERRED
NUMA policy so that memory is allocated on the CPU's node if possible.

If the ipt_table is small enough (less than one page), we use kmalloc_node()
instead of vmalloc(), to use less memory and fewer TLB entries in small
setups.

Please note that this patch doesn't change the number of allocated bytes,
only the location of the allocated zones.

Note 2: This patch depends on another patch that declares sys_set_mempolicy()
in include/linux/syscalls.h
( http://marc.theaimsgroup.com/?l=linux-kernel&m=112725288622984&w=2 )

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>


[-- Attachment #2: patch_ip_tables_numa.3 --]

--- linux-2.6/net/ipv4/netfilter/ip_tables.c	2005-09-22 00:44:34.000000000 +0200
+++ linux-2.6-ed/net/ipv4/netfilter/ip_tables.c	2005-09-22 00:57:15.000000000 +0200
@@ -17,6 +17,8 @@
 #include <linux/skbuff.h>
 #include <linux/kmod.h>
 #include <linux/vmalloc.h>
+#include <linux/mempolicy.h>
+#include <linux/syscalls.h>
 #include <linux/netdevice.h>
 #include <linux/module.h>
 #include <linux/tcp.h>
@@ -82,11 +84,6 @@
    context stops packets coming through and allows user context to read
    the counters or update the rules.
 
-   To be cache friendly on SMP, we arrange them like so:
-   [ n-entries ]
-   ... cache-align padding ...
-   [ n-entries ]
-
    Hence the start of any table is given by get_table() below.  */
 
 /* The table itself */
@@ -104,7 +101,7 @@
 	unsigned int underflow[NF_IP_NUMHOOKS];
 
 	/* ipt_entry tables: one per CPU */
-	char entries[0] ____cacheline_aligned;
+	void *entries[NR_CPUS];
 };
 
 static LIST_HEAD(ipt_target);
@@ -113,12 +110,6 @@
 #define SET_COUNTER(c,b,p) do { (c).bcnt = (b); (c).pcnt = (p); } while(0)
 #define ADD_COUNTER(c,b,p) do { (c).bcnt += (b); (c).pcnt += (p); } while(0)
 
-#ifdef CONFIG_SMP
-#define TABLE_OFFSET(t,p) (SMP_ALIGN((t)->size)*(p))
-#else
-#define TABLE_OFFSET(t,p) 0
-#endif
-
 #if 0
 #define down(x) do { printk("DOWN:%u:" #x "\n", __LINE__); down(x); } while(0)
 #define down_interruptible(x) ({ int __r; printk("DOWNi:%u:" #x "\n", __LINE__); __r = down_interruptible(x); if (__r != 0) printk("ABORT-DOWNi:%u\n", __LINE__); __r; })
@@ -331,8 +322,7 @@
 	read_lock_bh(&table->lock);
 #endif
 	IP_NF_ASSERT(table->valid_hooks & (1 << hook));
-	table_base = (void *)table->private->entries
-		+ TABLE_OFFSET(table->private, smp_processor_id());
+	table_base = (void *)table->private->entries[smp_processor_id()];
 	e = get_entry(table_base, table->private->hook_entry[hook]);
 
 #ifdef CONFIG_NETFILTER_DEBUG
@@ -608,7 +598,7 @@
 /* Figures out from what hook each rule can be called: returns 0 if
    there are loops.  Puts hook bitmask in comefrom. */
 static int
-mark_source_chains(struct ipt_table_info *newinfo, unsigned int valid_hooks)
+mark_source_chains(struct ipt_table_info *newinfo, unsigned int valid_hooks, void *entry0)
 {
 	unsigned int hook;
 
@@ -617,7 +607,7 @@
 	for (hook = 0; hook < NF_IP_NUMHOOKS; hook++) {
 		unsigned int pos = newinfo->hook_entry[hook];
 		struct ipt_entry *e
-			= (struct ipt_entry *)(newinfo->entries + pos);
+			= (struct ipt_entry *)(entry0 + pos);
 
 		if (!(valid_hooks & (1 << hook)))
 			continue;
@@ -667,13 +657,13 @@
 						goto next;
 
 					e = (struct ipt_entry *)
-						(newinfo->entries + pos);
+						(entry0 + pos);
 				} while (oldpos == pos + e->next_offset);
 
 				/* Move along one */
 				size = e->next_offset;
 				e = (struct ipt_entry *)
-					(newinfo->entries + pos + size);
+					(entry0 + pos + size);
 				e->counters.pcnt = pos;
 				pos += size;
 			} else {
@@ -690,7 +680,7 @@
 					newpos = pos + e->next_offset;
 				}
 				e = (struct ipt_entry *)
-					(newinfo->entries + newpos);
+					(entry0 + newpos);
 				e->counters.pcnt = pos;
 				pos = newpos;
 			}
@@ -900,6 +890,7 @@
 translate_table(const char *name,
 		unsigned int valid_hooks,
 		struct ipt_table_info *newinfo,
+		void *entry0,
 		unsigned int size,
 		unsigned int number,
 		const unsigned int *hook_entries,
@@ -920,11 +911,11 @@
 	duprintf("translate_table: size %u\n", newinfo->size);
 	i = 0;
 	/* Walk through entries, checking offsets. */
-	ret = IPT_ENTRY_ITERATE(newinfo->entries, newinfo->size,
+	ret = IPT_ENTRY_ITERATE(entry0, newinfo->size,
 				check_entry_size_and_hooks,
 				newinfo,
-				newinfo->entries,
-				newinfo->entries + size,
+				entry0,
+				entry0 + size,
 				hook_entries, underflows, &i);
 	if (ret != 0)
 		return ret;
@@ -952,25 +943,24 @@
 		}
 	}
 
-	if (!mark_source_chains(newinfo, valid_hooks))
+	if (!mark_source_chains(newinfo, valid_hooks, entry0))
 		return -ELOOP;
 
 	/* Finally, each sanity check must pass */
 	i = 0;
-	ret = IPT_ENTRY_ITERATE(newinfo->entries, newinfo->size,
+	ret = IPT_ENTRY_ITERATE(entry0, newinfo->size,
 				check_entry, name, size, &i);
 
 	if (ret != 0) {
-		IPT_ENTRY_ITERATE(newinfo->entries, newinfo->size,
+		IPT_ENTRY_ITERATE(entry0, newinfo->size,
 				  cleanup_entry, &i);
 		return ret;
 	}
 
 	/* And one copy for every other CPU */
-	for (i = 1; i < num_possible_cpus(); i++) {
-		memcpy(newinfo->entries + SMP_ALIGN(newinfo->size)*i,
-		       newinfo->entries,
-		       SMP_ALIGN(newinfo->size));
+	for_each_cpu(i) {
+		if (newinfo->entries[i] && newinfo->entries[i] != entry0)
+			memcpy(newinfo->entries[i], entry0, newinfo->size);
 	}
 
 	return ret;
@@ -1010,15 +1000,12 @@
 
 #ifdef CONFIG_NETFILTER_DEBUG
 	{
-		struct ipt_entry *table_base;
-		unsigned int i;
+		int cpu;
 
-		for (i = 0; i < num_possible_cpus(); i++) {
-			table_base =
-				(void *)newinfo->entries
-				+ TABLE_OFFSET(newinfo, i);
-
-			table_base->comefrom = 0xdead57ac;
+		for_each_cpu(cpu) {
+			struct ipt_entry *table_base = newinfo->entries[cpu];
+			if (table_base)
+				table_base->comefrom = 0xdead57ac;
 		}
 	}
 #endif
@@ -1083,7 +1070,7 @@
 	if (table)
 		write_lock_bh(per_cpu_ptr(table->lock_p, curcpu));
 	i = 0;
-	IPT_ENTRY_ITERATE(t->entries + TABLE_OFFSET(t, curcpu),
+	IPT_ENTRY_ITERATE(t->entries[curcpu],
 			  t->size,
 			  set_entry_to_counter,
 			  counters,
@@ -1097,7 +1084,7 @@
 		if (table)
 			write_lock_bh(per_cpu_ptr(table->lock_p, cpu));
 		i = 0;
-		IPT_ENTRY_ITERATE(t->entries + TABLE_OFFSET(t, cpu),
+		IPT_ENTRY_ITERATE(t->entries[cpu],
 				  t->size,
 				  add_entry_to_counter,
 				  counters,
@@ -1110,7 +1097,7 @@
 	if (table)
 		write_lock_bh(&table->lock);
 	i = 0;
-	IPT_ENTRY_ITERATE(t->entries,
+	IPT_ENTRY_ITERATE(t->entries[0],
 			  t->size,
 			  set_entry_to_counter,
 			  counters,
@@ -1129,6 +1116,7 @@
 	struct ipt_entry *e;
 	struct ipt_counters *counters;
 	int ret = 0;
+	void *loc_cpu_entry;
 
 	/* We need atomic snapshot of counters: rest doesn't change
 	   (other than comefrom, which userspace doesn't care
@@ -1142,8 +1130,12 @@
 	/* First, sum counters... */
 	get_counters(table, table->private, counters);
 
-	/* ... then copy entire thing from CPU 0... */
-	if (copy_to_user(userptr, table->private->entries, total_size) != 0) {
+	/*
+	 * choose the copy that is on our node/cpu,
+	 */
+	loc_cpu_entry = table->private->entries[get_cpu()];
+	/* ... then copy entire thing ... */
+	if (copy_to_user(userptr, loc_cpu_entry, total_size) != 0) {
 		ret = -EFAULT;
 		goto free_counters;
 	}
@@ -1155,7 +1147,7 @@
 		struct ipt_entry_match *m;
 		struct ipt_entry_target *t;
 
-		e = (struct ipt_entry *)(table->private->entries + off);
+		e = (struct ipt_entry *)(loc_cpu_entry + off);
 		if (copy_to_user(userptr + off
 				 + offsetof(struct ipt_entry, counters),
 				 &counters[num],
@@ -1192,6 +1184,7 @@
 	}
 
  free_counters:
+	put_cpu();
 	vfree(counters);
 	return ret;
 }
@@ -1224,6 +1217,60 @@
 	return ret;
 }
 
+static void free_table_info(struct ipt_table_info *info)
+{
+	int cpu;
+	for_each_cpu(cpu) {
+		if (info->size <= PAGE_SIZE)
+			kfree(info->entries[cpu]);
+		else
+			vfree(info->entries[cpu]);
+	}
+	kfree(info);
+}
+
+static struct ipt_table_info *alloc_table_info(unsigned int size)
+{
+struct ipt_table_info *newinfo;
+int cpu;
+	newinfo = kzalloc(sizeof(struct ipt_table_info), GFP_KERNEL);
+	if (!newinfo)
+		return NULL;
+	newinfo->size = size;
+	for_each_cpu(cpu) {
+		if (size <= PAGE_SIZE) {
+			newinfo->entries[cpu] = kmalloc_node(size,
+				GFP_KERNEL,
+				cpu_to_node(cpu));
+		} else {
+#ifdef CONFIG_NUMA
+			struct mempolicy *oldpol;
+			mm_segment_t oldfs = get_fs();
+			DECLARE_BITMAP(mynode, MAX_NUMNODES);
+
+			oldpol = current->mempolicy;
+			mpol_get(oldpol);
+			bitmap_zero(mynode, MAX_NUMNODES);
+			set_bit(cpu_to_node(cpu), mynode);
+			set_fs(KERNEL_DS);
+			sys_set_mempolicy(MPOL_PREFERRED, mynode, MAX_NUMNODES);
+			set_fs(oldfs);
+#endif
+			newinfo->entries[cpu] = vmalloc(size);
+#ifdef CONFIG_NUMA
+			mpol_free(current->mempolicy);
+			current->mempolicy = oldpol;
+#endif
+		}
+		if (newinfo->entries[cpu] == 0) {
+			free_table_info(newinfo);
+			return NULL;
+		}
+	}
+	return newinfo;
+}
+
+
 static int
 do_replace(void __user *user, unsigned int len)
 {
@@ -1232,6 +1279,7 @@
 	struct ipt_table *t;
 	struct ipt_table_info *newinfo, *oldinfo;
 	struct ipt_counters *counters;
+	void *loc_cpu_entry, *loc_cpu_old_entry;
 
 	if (copy_from_user(&tmp, user, sizeof(tmp)) != 0)
 		return -EFAULT;
@@ -1244,12 +1292,14 @@
 	if ((SMP_ALIGN(tmp.size) >> PAGE_SHIFT) + 2 > num_physpages)
 		return -ENOMEM;
 
-	newinfo = vmalloc(sizeof(struct ipt_table_info)
-			  + SMP_ALIGN(tmp.size) * num_possible_cpus());
+	newinfo = alloc_table_info(tmp.size);
 	if (!newinfo)
 		return -ENOMEM;
-
-	if (copy_from_user(newinfo->entries, user + sizeof(tmp),
+	/*
+	 * choose the copy that is on our node/cpu
+	 */
+	loc_cpu_entry = newinfo->entries[get_cpu()];
+	if (copy_from_user(loc_cpu_entry, user + sizeof(tmp),
 			   tmp.size) != 0) {
 		ret = -EFAULT;
 		goto free_newinfo;
@@ -1260,10 +1310,9 @@
 		ret = -ENOMEM;
 		goto free_newinfo;
 	}
-	memset(counters, 0, tmp.num_counters * sizeof(struct ipt_counters));
 
 	ret = translate_table(tmp.name, tmp.valid_hooks,
-			      newinfo, tmp.size, tmp.num_entries,
+			      newinfo, loc_cpu_entry, tmp.size, tmp.num_entries,
 			      tmp.hook_entry, tmp.underflow);
 	if (ret != 0)
 		goto free_newinfo_counters;
@@ -1302,8 +1351,10 @@
 	/* Get the old counters. */
 	get_counters(NULL, oldinfo, counters);
 	/* Decrease module usage counts and free resource */
-	IPT_ENTRY_ITERATE(oldinfo->entries, oldinfo->size, cleanup_entry,NULL);
-	vfree(oldinfo);
+	loc_cpu_old_entry = oldinfo->entries[smp_processor_id()];
+	IPT_ENTRY_ITERATE(loc_cpu_old_entry, oldinfo->size, cleanup_entry,NULL);
+	put_cpu();
+	free_table_info(oldinfo);
 	if (copy_to_user(tmp.counters, counters,
 			 sizeof(struct ipt_counters) * tmp.num_counters) != 0)
 		ret = -EFAULT;
@@ -1315,11 +1366,12 @@
 	module_put(t->me);
 	up(&ipt_mutex);
  free_newinfo_counters_untrans:
-	IPT_ENTRY_ITERATE(newinfo->entries, newinfo->size, cleanup_entry,NULL);
+	IPT_ENTRY_ITERATE(loc_cpu_entry, newinfo->size, cleanup_entry,NULL);
  free_newinfo_counters:
 	vfree(counters);
  free_newinfo:
-	vfree(newinfo);
+	put_cpu();
+	free_table_info(newinfo);
 	return ret;
 }
 
@@ -1352,6 +1404,7 @@
 	struct ipt_counters_info tmp, *paddc;
 	struct ipt_table *t;
 	int ret = 0;
+	void *loc_cpu_entry;
 #ifdef CONFIG_SMP
 	rwlock_t *lockp;
 #endif
@@ -1389,7 +1442,11 @@
 	}
 
 	i = 0;
-	IPT_ENTRY_ITERATE(t->private->entries,
+	/*
+	 * choose the copy that is on our node,
+	 */
+	loc_cpu_entry = t->private->entries[smp_processor_id()];
+	IPT_ENTRY_ITERATE(loc_cpu_entry,
 			  t->private->size,
 			  add_counter_to_entry,
 			  paddc->counters,
@@ -1584,8 +1641,15 @@
 {
 	int ret;
 	struct ipt_table_info *newinfo;
-	static struct ipt_table_info bootstrap
-		= { 0, 0, 0, { 0 }, { 0 }, { } };
+	static struct ipt_table_info bootstrap = {
+		.size = 0,
+		.number = 0,
+		.initial_entries = 0,
+		.hook_entry = { 0 },
+		.underflow = { 0 },
+		.entries = {NULL }
+		};
+	void *loc_cpu_entry;
 #ifdef CONFIG_SMP
 	int cpu;
 	if (!table->lock_p) {
@@ -1597,26 +1661,30 @@
 #else
 	rwlock_init(&table->lock);
 #endif
-	newinfo = vmalloc(sizeof(struct ipt_table_info)
-			  + SMP_ALIGN(repl->size) * num_possible_cpus());
+
+	newinfo = alloc_table_info(repl->size);
 	if (!newinfo)
 		return -ENOMEM;
-
-	memcpy(newinfo->entries, repl->entries, repl->size);
+	/*
+	 * choose the copy that is on our node/cpu
+	 */
+	loc_cpu_entry = newinfo->entries[get_cpu()];
+	memcpy(loc_cpu_entry, repl->entries, repl->size);
 
 	ret = translate_table(table->name, table->valid_hooks,
-			      newinfo, repl->size,
+			      newinfo, loc_cpu_entry, repl->size,
 			      repl->num_entries,
 			      repl->hook_entry,
 			      repl->underflow);
+	put_cpu();
 	if (ret != 0) {
-		vfree(newinfo);
+		free_table_info(newinfo);
 		return ret;
 	}
 
 	ret = down_interruptible(&ipt_mutex);
 	if (ret != 0) {
-		vfree(newinfo);
+		free_table_info(newinfo);
 		return ret;
 	}
 
@@ -1644,20 +1712,25 @@
 	return ret;
 
  free_unlock:
-	vfree(newinfo);
+	free_table_info(newinfo);
 	goto unlock;
 }
 
 void ipt_unregister_table(struct ipt_table *table)
 {
+	void *loc_cpu_entry;
 	down(&ipt_mutex);
 	LIST_DELETE(&ipt_tables, table);
 	up(&ipt_mutex);
 
-	/* Decrease module usage counts and free resources */
-	IPT_ENTRY_ITERATE(table->private->entries, table->private->size,
+	/* Decrease module usage counts and free resources
+	 * choose the copy that is on our node/cpu
+	 */
+	loc_cpu_entry = table->private->entries[get_cpu()];
+	IPT_ENTRY_ITERATE(loc_cpu_entry, table->private->size,
 			  cleanup_entry, NULL);
-	vfree(table->private);
+	put_cpu();
+	free_table_info(table->private);
 #ifdef CONFIG_SMP
 	free_percpu(table->lock_p);
 	table->lock_p = NULL;


* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
From: Christoph Lameter @ 2005-09-21 22:43 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, netfilter-devel, netdev, Andi Kleen

On Wed, 21 Sep 2005, Eric Dumazet wrote:

> Instead of using one vmalloc() area (located on the node of the iptables
> process), we now allocate one area per possible CPU, using the MPOL_PREFERRED
> NUMA policy so that memory is allocated on the CPU's node if possible.
> 
> If the ipt_table is small enough (less than one page), we use kmalloc_node()
> instead of vmalloc(), to use less memory and fewer TLB entries in small
> setups.

Maybe we better introduce vmalloc_node() instead of improvising this for 
several subsystems? The e1000 driver has similar issues.


* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
From: David S. Miller @ 2005-09-22  0:34 UTC (permalink / raw)
  To: clameter; +Cc: dada1, linux-kernel, netfilter-devel, netdev, ak

From: Christoph Lameter <clameter@engr.sgi.com>
Date: Wed, 21 Sep 2005 15:43:29 -0700 (PDT)

> Maybe we better introduce vmalloc_node() instead of improvising this for 
> several subsystems? The e1000 driver has similar issues.

I agree.


* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
From: Christoph Lameter @ 2005-09-22  1:44 UTC (permalink / raw)
  To: David S. Miller; +Cc: dada1, linux-kernel, netfilter-devel, netdev, ak

On Wed, 21 Sep 2005, David S. Miller wrote:

> From: Christoph Lameter <clameter@engr.sgi.com>
> Date: Wed, 21 Sep 2005 15:43:29 -0700 (PDT)
> 
> > Maybe we better introduce vmalloc_node() instead of improvising this for 
> > several subsystems? The e1000 driver has similar issues.
> 
> I agree.

I did an implementation in June.

See http://marc.theaimsgroup.com/?l=linux-mm&m=111766643127530&w=2

Not sure if this will fit the bill. Never really tested it.



* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
From: James Morris @ 2005-09-22  4:18 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, netfilter-devel, netdev, Andi Kleen

On Wed, 21 Sep 2005, Eric Dumazet wrote:

> I have reworked net/ipv4/ip_tables.c to boost its performance, and am
> posting three patches.

Do you have any performance measurements?


- James
-- 
James Morris
<jmorris@namei.org>


* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
From: Eric Dumazet @ 2005-09-22  5:07 UTC (permalink / raw)
  To: James Morris; +Cc: linux-kernel, netfilter-devel, netdev, Andi Kleen

James Morris wrote:
> 
> Do you have any performance measurements?

Yes, as I said in the first mail:

 >In oprofile results, ipt_do_table() used to be in first position.
 >It is now in 6th position, using a third of the CPU time it used before.
 >(Tests were done on a dual-Xeon i386 and a dual-Opteron x86_64.)

On the dual-Opteron machine, with 40,000 packets received per second and
35,000 sent per second, the numbers were: 12.8% before the patches, 4.4%
after the patches.

I don't have separate performance measurements for each patch.

Considering that I inlined the read_lock_bh() call (which was not displayed
in oprofile results, probably because of the special .spinlock.text section)
and that this should have increased the profile of ipt_do_table(), that is a
lot of CPU cycles and memory bandwidth now available for other jobs.

Eric



* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
From: Eric Dumazet @ 2005-09-22 12:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David S. Miller, linux-kernel, netfilter-devel, netdev, ak


Christoph Lameter wrote:
> On Wed, 21 Sep 2005, David S. Miller wrote:
> 
> 
>>From: Christoph Lameter <clameter@engr.sgi.com>
>>Date: Wed, 21 Sep 2005 15:43:29 -0700 (PDT)
>>
>>
>>>Maybe we better introduce vmalloc_node() instead of improvising this for 
>>>several subsystems? The e1000 driver has similar issues.
>>
>>I agree.
> 
> 
> I did an implementation in June.
> 
> See http://marc.theaimsgroup.com/?l=linux-mm&m=111766643127530&w=2
> 
> Not sure if this will fit the bill. Never really tested it.


Maybe this simpler patch has a better chance of being accepted?

Thank you

[NUMA]

- Adds a vmalloc_node() function: a simple wrapper around vmalloc() to
allocate memory from a preferred node.

This NUMA-aware variant will be used by ip_tables and various network drivers.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

[-- Attachment #2: vmalloc_node --]

--- linux-2.6.14-rc2/include/linux/vmalloc.h	2005-09-20 05:00:41.000000000 +0200
+++ linux-2.6.14-rc2-ed/include/linux/vmalloc.h	2005-09-22 11:34:55.000000000 +0200
@@ -1,6 +1,7 @@
 #ifndef _LINUX_VMALLOC_H
 #define _LINUX_VMALLOC_H
 
+#include <linux/config.h>	/* vmalloc_node() needs CONFIG_ options */
 #include <linux/spinlock.h>
 #include <asm/page.h>		/* pgprot_t */
 
@@ -32,6 +33,14 @@
  *	Highlevel APIs for driver use
  */
 extern void *vmalloc(unsigned long size);
+#ifdef CONFIG_NUMA
+extern void *vmalloc_node(unsigned long size, int node);
+#else
+static inline void *vmalloc_node(unsigned long size, int node)
+{
+	return vmalloc(size);
+}
+#endif
 extern void *vmalloc_exec(unsigned long size);
 extern void *vmalloc_32(unsigned long size);
 extern void *__vmalloc(unsigned long size, unsigned int __nocast gfp_mask, pgprot_t prot);
--- linux-2.6.14-rc2/mm/vmalloc.c	2005-09-20 05:00:41.000000000 +0200
+++ linux-2.6.14-rc2-ed/mm/vmalloc.c	2005-09-22 11:55:19.000000000 +0200
@@ -19,6 +19,9 @@
 #include <asm/uaccess.h>
 #include <asm/tlbflush.h>
 
+#ifdef CONFIG_NUMA
+#include <linux/mempolicy.h>
+#endif
 
 DEFINE_RWLOCK(vmlist_lock);
 struct vm_struct *vmlist;
@@ -471,7 +474,7 @@
  *	Allocate enough pages to cover @size from the page level
  *	allocator and map them into contiguous kernel virtual space.
  *
- *	For tight cotrol over page level allocator and protection flags
+ *	For tight control over page level allocator and protection flags
  *	use __vmalloc() instead.
  */
 void *vmalloc(unsigned long size)
@@ -481,6 +484,40 @@
 
 EXPORT_SYMBOL(vmalloc);
 
+#ifdef CONFIG_NUMA
+/**
+ * vmalloc_node - allocate virtually contiguous memory
+ *
+ *	@size:		allocation size
+ *	@node:		preferred node
+ *
+ * This vmalloc variant tries to allocate memory from a preferred node.
+ */
+void *vmalloc_node(unsigned long size, int node)
+{
+	void *result;
+	struct mempolicy *oldpol = current->mempolicy;
+	mm_segment_t oldfs = get_fs();
+	DECLARE_BITMAP(prefnode, MAX_NUMNODES);
+
+	mpol_get(oldpol);
+	bitmap_zero(prefnode, MAX_NUMNODES);
+	set_bit(node, prefnode);
+
+	set_fs(KERNEL_DS);
+	sys_set_mempolicy(MPOL_PREFERRED, prefnode, MAX_NUMNODES);
+	set_fs(oldfs);
+
+	result = vmalloc(size);
+
+	mpol_free(current->mempolicy);
+	current->mempolicy = oldpol;
+	return result;
+}
+
+EXPORT_SYMBOL(vmalloc_node);
+#endif
+
 #ifndef PAGE_KERNEL_EXEC
 # define PAGE_KERNEL_EXEC PAGE_KERNEL
 #endif


* Re: [PATCH 2/3] netfilter : 3 patches to boost ip_tables performance
From: Harald Welte @ 2005-09-22 12:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, netfilter-devel, netdev, Andi Kleen


On Wed, Sep 21, 2005 at 11:32:24PM +0200, Eric Dumazet wrote:
> Patch 2/3 (please apply after Patch 1/3)
> 
> 2) Loop unrolling

First of all, thanks for your performance analysis and patches.  I'm
very inclined to merge them.

> It seems that with current compilers and CFLAGS, the code generated for
> ip_packet_match() is very poor, using lots of mispredicted conditional
> branches. With these patches, the generated code on i386 and x86_64 is
> much better.

This only describes your "compare_if_lstrings()" changes, and I'm happy
to merge them.

However, you also removed the use of the FWINV macro and replaced it with
explicit code (including the bool1/bool2 variables, which are not really
named intuitively).  Why was this necessary?

-- 
- Harald Welte <laforge@netfilter.org>                 http://netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie



* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
From: Christoph Hellwig @ 2005-09-22 12:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, David S. Miller, linux-kernel,
	netfilter-devel, netdev, ak

On Thu, Sep 22, 2005 at 02:11:26PM +0200, Eric Dumazet wrote:
> +#ifdef CONFIG_NUMA
> +/**
> + * vmalloc_node - allocate virtually contiguous memory
> + *
> + *	@size:		allocation size
> + *	@node:		preferred node
> + *
> + * This vmalloc variant tries to allocate memory from a preferred node.
> + */
> +void *vmalloc_node(unsigned long size, int node)
> +{
> +	void *result;
> +	struct mempolicy *oldpol = current->mempolicy;
> +	mm_segment_t oldfs = get_fs();
> +	DECLARE_BITMAP(prefnode, MAX_NUMNODES);
> +
> +	mpol_get(oldpol);
> +	bitmap_zero(prefnode, MAX_NUMNODES);
> +	set_bit(node, prefnode);
> +
> +	set_fs(KERNEL_DS);
> +	sys_set_mempolicy(MPOL_PREFERRED, prefnode, MAX_NUMNODES);
> +	set_fs(oldfs);
> +
> +	result = vmalloc(size);
> +
> +	mpol_free(current->mempolicy);
> +	current->mempolicy = oldpol;
> +	return result;
> +}

No way, sorry.  If you want a vmalloc node do it right.



* Re: [PATCH 3/3] netfilter : 3 patches to boost ip_tables performance
From: Harald Welte @ 2005-09-22 12:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, netfilter-devel, netdev, Andi Kleen


On Wed, Sep 21, 2005 at 11:37:20PM +0200, Eric Dumazet wrote:
> Patch 3/3 (please apply after Patch 2/3)

Thanks for your patches, again.

> 3) NUMA allocation.

I'll hold off on applying this patch until a decision has been made on
merging something like vmalloc_node() into mainline.

-- 
- Harald Welte <laforge@netfilter.org>                 http://netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie



* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
From: Andi Kleen @ 2005-09-22 12:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Eric Dumazet, Christoph Lameter, David S. Miller, linux-kernel,
	netfilter-devel, netdev

On Thursday 22 September 2005 14:49, Christoph Hellwig wrote:
> On Thu, Sep 22, 2005 at 02:11:26PM +0200, Eric Dumazet wrote:
> > +#ifdef CONFIG_NUMA
> > +/**
> > + * vmalloc_node - allocate virtually contiguous memory
> > + *
> > + *	@size:		allocation size
> > + *	@node:		preferred node
> > + *
> > > + * This vmalloc variant tries to allocate memory from a preferred node.
> > + */
> > +void *vmalloc_node(unsigned long size, int node)
> > +{
> > +	void *result;
> > +	struct mempolicy *oldpol = current->mempolicy;
> > +	mm_segment_t oldfs = get_fs();
> > +	DECLARE_BITMAP(prefnode, MAX_NUMNODES);
> > +
> > +	mpol_get(oldpol);
> > +	bitmap_zero(prefnode, MAX_NUMNODES);
> > +	set_bit(node, prefnode);
> > +
> > +	set_fs(KERNEL_DS);
> > +	sys_set_mempolicy(MPOL_PREFERRED, prefnode, MAX_NUMNODES);
> > +	set_fs(oldfs);
> > +
> > +	result = vmalloc(size);
> > +
> > +	mpol_free(current->mempolicy);
> > +	current->mempolicy = oldpol;
> > +	return result;
> > +}
>
> No way, sorry.  If you want a vmalloc node do it right.

The implementation looks fine to me, so I think it's already right.

-Andi


* Re: [PATCH 1/3] netfilter : 3 patches to boost ip_tables performance
From: Harald Welte @ 2005-09-22 12:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, netfilter-devel, netdev, Andi Kleen


On Wed, Sep 21, 2005 at 11:29:13PM +0200, Eric Dumazet wrote:
> Patch 1/3
> 
> 1) No single rwlock_t protecting the 'curtain' anymore

I have no problem with this change "per se", but with the
implementation.

As of now, we live without any ugly #ifdef CONFIG_SMP / #endif sections
in the code, and if possible I would like to continue this good tradition.

Take the get_counters() function, for example: wouldn't all the SMP-specific
code (for_each_cpu(), ...) be #defined to nothing anyway?

And if we really need the #ifdefs, I would appreciate it if those sections
were as small as possible.  In get_counters() the section can definitely be
smaller, rather than basically having the whole function body duplicated for
the SMP and non-SMP cases.

Also, how much would we lose in runtime performance if we were using a
"rwlock_t *" even in the UP case?  I mean, it's just one more pointer
dereference of something that is expected to be in cache anyway, isn't
it?  This gets rid of another huge set of #ifdefs that make the code
unreadable and prone to errors being introduced later on.
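
Something like this is what I have in mind (an untested sketch):

/* Keep one definition for both cases, with the lock always behind a
 * pointer.  On UP, alloc_percpu() allocates a single rwlock_t and
 * per_cpu_ptr() is a plain dereference, so the same code compiles
 * everywhere. */
static inline rwlock_t *ipt_table_lock(struct ipt_table *t, int cpu)
{
	return per_cpu_ptr(t->lock_p, cpu);
}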

-- 
- Harald Welte <laforge@netfilter.org>                 http://netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie



* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
From: Christoph Hellwig @ 2005-09-22 12:58 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Hellwig, Eric Dumazet, Christoph Lameter,
	David S. Miller, linux-kernel, netfilter-devel, netdev

On Thu, Sep 22, 2005 at 02:54:22PM +0200, Andi Kleen wrote:
> > > +void *vmalloc_node(unsigned long size, int node)
> > > +{
> > > +	void *result;
> > > +	struct mempolicy *oldpol = current->mempolicy;
> > > +	mm_segment_t oldfs = get_fs();
> > > +	DECLARE_BITMAP(prefnode, MAX_NUMNODES);
> > > +
> > > +	mpol_get(oldpol);
> > > +	bitmap_zero(prefnode, MAX_NUMNODES);
> > > +	set_bit(node, prefnode);
> > > +
> > > +	set_fs(KERNEL_DS);
> > > +	sys_set_mempolicy(MPOL_PREFERRED, prefnode, MAX_NUMNODES);
> > > +	set_fs(oldfs);
> > > +
> > > +	result = vmalloc(size);
> > > +
> > > +	mpol_free(current->mempolicy);
> > > +	current->mempolicy = oldpol;
> > > +	return result;
> > > +}
> >
> > No way, sorry.  If you want a vmalloc node do it right.
> 
> The implementation looks fine to me, so I think it's already right.

Umm, no - adding set_fs/get_fs mess for things like that is not right.
If we want to go down the mempolicy-based route, we need to add a proper
kernel entry point for setting a mempolicy.
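
Something along these lines, say (a purely hypothetical sketch;
kern_set_mempolicy() does not exist today):

/* takes the nodemask directly from kernel space, so callers no longer
 * need the set_fs(KERNEL_DS) dance around the syscall */
int kern_set_mempolicy(int mode, const unsigned long *nodes,
		       unsigned long maxnode);

/* vmalloc_node() would then simply do: */
kern_set_mempolicy(MPOL_PREFERRED, prefnode, MAX_NUMNODES);
result = vmalloc(size);
/* ... restore the previous policy ... */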


* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
From: Andi Kleen @ 2005-09-22 13:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, netfilter-devel, netdev


> 1) No more central rwlock protecting each table (filter, nat, mangle, raw),
>     but one lock per CPU. This avoids cache-line ping-pong on every packet.

Another useful change would be to not take the lock when there are no
rules. Currently just loading iptables has a large overhead.

-Andi


* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
From: Andi Kleen @ 2005-09-22 13:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Eric Dumazet, Christoph Lameter, David S. Miller, linux-kernel,
	netfilter-devel, netdev

On Thursday 22 September 2005 14:58, Christoph Hellwig wrote:

> Umm, no - adding set_fs/get_fs mess for things like that is not right.

I think it's fine. We're using it for various other interfaces too. In fact,
sys_set_mempolicy() is already used elsewhere in the kernel.

-Andi


* Re: [PATCH 2/3] netfilter : 3 patches to boost ip_tables performance
From: Eric Dumazet @ 2005-09-22 13:05 UTC (permalink / raw)
  To: Harald Welte; +Cc: netdev, netfilter-devel, linux-kernel, Andi Kleen

Harald Welte a écrit :
> On Wed, Sep 21, 2005 at 11:32:24PM +0200, Eric Dumazet wrote:
> 
>>Patch 2/3 (please apply after Patch 1/3)
>>
>>2) Loop unrolling
> 
> 
> First of all, thanks for your performance analysis and patches.  I'm
> very inclined to merge them.
> 
> 
>>It seems that with current compilers and CFLAGS, the code from
>>ip_packet_match() is very bad, using a lot of mispredicted conditional
>>branches. I made some patches, and the generated code on i386 and x86_64
>>is much better.
> 
> 
> This only describes your "compare_if_lstrings()" changes, and I'm happy
> to merge them.
> 
> However, you also removed the use of the FWINV macro and replaced it by
> explicit code (including the bool1/bool2 variables, which are not really
> named intuitively).  Why was this necessary?
> 

It was necessary to get the best code with gcc-3.4.4 on i386 and gcc-4.0.1 on
x86_64.

For example :

bool1 = FWINV(ret != 0, IPT_INV_VIA_OUT);
if (bool1) {

gives a better code than :

if (FWINV(ret != 0, IPT_INV_VIA_OUT)) {

(one less conditional branch)

Don't ask me why; it is shocking but true :(

Eric


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/3] netfilter : 3 patches to boost ip_tables performance
  2005-09-22 12:57             ` Harald Welte
@ 2005-09-22 13:17               ` Eric Dumazet
  0 siblings, 0 replies; 50+ messages in thread
From: Eric Dumazet @ 2005-09-22 13:17 UTC (permalink / raw)
  To: Harald Welte; +Cc: linux-kernel, netfilter-devel, netdev, Andi Kleen

Harald Welte a écrit :
> On Wed, Sep 21, 2005 at 11:29:13PM +0200, Eric Dumazet wrote:
> 
>>Patch 1/3
>>
>>1) No more one rwlock_t protecting the 'curtain'
> 
> 
> I have no problem with this change "per se", but with the
> implementation.
> 
> As of now, we live without any ugly #ifdef CONFIG_SMP / #endif sections
> in the code - and if possible, I would continue this good tradition.
> 
> For example the get_counters() function.  Wouldn't all the SMP-specific
> code (for_each_cpu(), ...) be #defined to nothing anyway?

Well... not exactly, but you are right: only the first loop (SET_COUNTER) will
really do something. The if (cpu == curcpu) test will always be true, but the
compiler won't know that; cpu and curcpu are still plain C variables.


> 
> And if we really need the #ifdefs, I would appreciate it if those
> sections were as small as possible.  In get_counters() the section can
> definitely be smaller, rather than basically having the whole function
> body separate for the SMP and non-SMP cases.

get_counters() is not critical, so I agree with you: we can stick with the
general version (not the UP-optimized one).

> 
> Also, how much would we loose in runtime performance if we were using a
> "rwlock_t *" even in the UP case?.  I mean, it's just one more pointer
> dereference of something that is expected to be in cache anyway, isn't
> it?  This gets rid of another huge set of #ifdefs that make the code
> unreadable and prone to errors being introduced later on.
> 

Well, in the UP case, the rwlock_t is defined away to nothing.

I was inspired by another use of percpu data in include/linux/genhd.h
#ifdef  CONFIG_SMP
     struct disk_stats *dkstats;
#else
     struct disk_stats dkstats;
#endif

But if you dislike this, we can use pointer for all cases.
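
For comparison, the two variants look roughly like this (a sketch, field
names illustrative):

/* variant 1: #ifdef, as in genhd.h -- UP pays neither the pointer
 * nor the extra dereference */
#ifdef CONFIG_SMP
	rwlock_t *lock;			/* one rwlock_t per CPU */
#else
	rwlock_t lock;
#endif

/* variant 2: a pointer in all cases -- no #ifdef, at the cost of one
 * extra dereference on UP, on a cache line that is hot anyway */
	rwlock_t *locks;		/* NR_CPUS entries; one on UP */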

Eric

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
  2005-09-22 13:03             ` Andi Kleen
@ 2005-09-22 13:30               ` Eric Dumazet
  2005-09-23 17:09               ` Harald Welte
  1 sibling, 0 replies; 50+ messages in thread
From: Eric Dumazet @ 2005-09-22 13:30 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, netfilter-devel, netdev

Andi Kleen a écrit :
>>1) No more central rwlock protecting each table (filter, nat, mangle, raw),
>>    but one lock per CPU. It avoids cache line ping pongs for each packet.
> 
> 
> Another useful change would be to not take the lock when there are no
> rules. Currently just loading iptables has a large overhead.
> 

Unfortunately there are always rules once iptables is loaded, at least
for the "packet_filter" table.

Eric

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
  2005-09-22 13:05                           ` Andi Kleen
@ 2005-09-22 15:37                             ` Christoph Lameter
  2005-09-22 15:50                               ` Eric Dumazet
  0 siblings, 1 reply; 50+ messages in thread
From: Christoph Lameter @ 2005-09-22 15:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Hellwig, Eric Dumazet, David S. Miller, linux-kernel,
	netfilter-devel, netdev

On Thu, 22 Sep 2005, Andi Kleen wrote:

> On Thursday 22 September 2005 14:58, Christoph Hellwig wrote:
> 
> > Umm, no - adding set_fs/get_fs mess for things like that is not right.
> 
> I think it's fine. We're using it for various other interfaces too. In fact
> sys_set_mempolicy is already used elsewhere in the kernel too.

It should really be do_set_mempolicy instead, to be cleaner. I have a patch
here that fixes the policy layer.

But still, I agree with Christoph that a real vmalloc_node is better. There
will be no fussing around with memory policies, and it is certainly
better performance-wise.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
  2005-09-22 15:37                             ` Christoph Lameter
@ 2005-09-22 15:50                               ` Eric Dumazet
  2005-09-22 15:55                                 ` Christoph Lameter
  2005-09-23 17:11                                 ` Harald Welte
  0 siblings, 2 replies; 50+ messages in thread
From: Eric Dumazet @ 2005-09-22 15:50 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Christoph Hellwig, David S. Miller, linux-kernel,
	netfilter-devel, netdev

Christoph Lameter a écrit :
> 
> It should really be do_set_mempolicy instead to be cleaner. I got a patch 
> here that fixes the policy layer.
> 
> But still I agree with Christoph that a real vmalloc_node is better. There 
> will be no fuzzing around with memory policies etc and its certainly 
> better performance wise.

vmalloc_node() should be seldom used: at driver init, or when a new ip_tables
is loaded. If it turns out to be a performance problem, then we can optimize it.
Why should we spend days of work on a function that is yet to be used?

Eric


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
  2005-09-22 15:50                               ` Eric Dumazet
@ 2005-09-22 15:55                                 ` Christoph Lameter
  2005-09-23 17:11                                 ` Harald Welte
  1 sibling, 0 replies; 50+ messages in thread
From: Christoph Lameter @ 2005-09-22 15:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andi Kleen, Christoph Hellwig, David S. Miller, linux-kernel,
	netfilter-devel, netdev

On Thu, 22 Sep 2005, Eric Dumazet wrote:

> vmalloc_node() should be seldom used, at driver init, or when a new ip_tables
> is loaded. If it happens to be a performance problem, then we can optimize it.

All right. However, there are a couple of uses of vmalloc_node that I can
see right now, and they will likely increase in the future.

> Why should we spend days of work for a function that is yet to be used ?

I already did a vmalloc_node patch, so there is no need to spend days of work.
The patch can wait until this becomes performance-critical.



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/3] netfilter : 3 patches to boost ip_tables performance
  2005-09-22 13:05               ` Eric Dumazet
@ 2005-09-23  4:02                 ` Willy Tarreau
  2005-09-23  5:14                   ` Eric Dumazet
  2005-09-23 14:00                   ` Tim Mattox
  0 siblings, 2 replies; 50+ messages in thread
From: Willy Tarreau @ 2005-09-23  4:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Harald Welte, netdev, netfilter-devel, linux-kernel, Andi Kleen

On Thu, Sep 22, 2005 at 03:05:50PM +0200, Eric Dumazet wrote:
(...) 
> It was necessary to get the best code with gcc-3.4.4 on i386 and 
> gcc-4.0.1 on x86_64
> 
> For example :
> 
> bool1 = FWINV(ret != 0, IPT_INV_VIA_OUT);
> if (bool1) {
> 
> gives a better code than :
> 
> if (FWINV(ret != 0, IPT_INV_VIA_OUT)) {
> 
> (one less conditional branch)
> 
> Dont ask me why, it is shocking but true :(

I also noticed many times that gcc's optimization of "if (complex condition)"
is rather poor, and it's often better to put the condition in a variable first.
I even remember that with an intermediate variable, gcc can often generate a
CMOV instruction on processors which support it, while it produces conditional
tests and jumps without the variable. Generally speaking, if you want fast
code, you have to write it as a long sequence of small instructions, just as
if you were writing assembly. As you said, shocking but true.
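
A trivial illustration of the difference (illustrative variables, not
taken from the patch):

	/* short-circuit form: gcc emits one conditional jump per term */
	if (a != x || b != y)
		reject();

	/* flattened form: both comparisons are evaluated unconditionally
	 * (SETcc), combined with a plain OR, and tested only once -- a
	 * single conditional jump, or a CMOV where the result feeds a
	 * value instead of a branch */
	t1 = (a != x);
	t2 = (b != y);
	if (t1 | t2)
		reject();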

BTW, cheers for your optimizations !

Regards,
Willy


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/3] netfilter : 3 patches to boost ip_tables performance
  2005-09-23  4:02                 ` Willy Tarreau
@ 2005-09-23  5:14                   ` Eric Dumazet
  2005-09-23 11:33                     ` Willy Tarreau
  2005-09-23 14:00                   ` Tim Mattox
  1 sibling, 1 reply; 50+ messages in thread
From: Eric Dumazet @ 2005-09-23  5:14 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Harald Welte, netdev, netfilter-devel, linux-kernel, Andi Kleen

Willy Tarreau a écrit :
> On Thu, Sep 22, 2005 at 03:05:50PM +0200, Eric Dumazet wrote:
> (...) 
> 
>>It was necessary to get the best code with gcc-3.4.4 on i386 and 
>>gcc-4.0.1 on x86_64
>>
>>For example :
>>
>>bool1 = FWINV(ret != 0, IPT_INV_VIA_OUT);
>>if (bool1) {
>>
>>gives a better code than :
>>
>>if (FWINV(ret != 0, IPT_INV_VIA_OUT)) {
>>
>>(one less conditional branch)
>>
>>Dont ask me why, it is shocking but true :(
> 
> 
> I also noticed many times that gcc's optimization of "if (complex condition)"
> is rather poor and it's often better to put it in a variable before. I even
> remember that if you use an intermediate variable, it can often generate a
> CMOV instruction on processors which support it, while it produces cond tests
> and jumps without the variable. Generally speaking, if you want fast code,
> you have to write it as a long sequence of small instructions, just as if
> you were writing assembly. As you said, shocking but true.

Even without CMOV support, the suggested patch helps :

Here is the code generated with gcc-3.4.4  on a pentium4 (i686) for :

/********************/
bool1 = ((ip->saddr&ipinfo->smsk.s_addr) != ipinfo->src.s_addr);
bool1 ^= !!(ipinfo->invflags & IPT_INV_SRCIP);

bool2 = ((ip->daddr&ipinfo->dmsk.s_addr) != ipinfo->dst.s_addr);
bool2 ^= !!(ipinfo->invflags & IPT_INV_DSTIP);

if ((bool1 | bool2) != 0) {

/********************/
cb:       0f b6 56 53          movzbl 0x53(%esi),%edx
cf:       8b 46 08             mov    0x8(%esi),%eax #ip->saddr
d2:       23 47 0c             and    0xc(%edi),%eax #ipinfo->smsk.s_addr
d5:       0f b6 da             movzbl %dl,%ebx
d8:       3b 06                cmp    (%esi),%eax #ipinfo->src.s_addr
da:       88 55 cf             mov    %dl,0xffffffcf(%ebp)
dd:       89 da                mov    %ebx,%edx
df:       0f 95 c0             setne  %al
e2:       c1 ea 03             shr    $0x3,%edx
e5:       31 c2                xor    %eax,%edx
e7:       8b 46 0c             mov    0xc(%esi),%eax #ip->daddr&ipinfo
ea:       23 47 10             and    0x10(%edi),%eax #ipinfo->dmsk.s_addr
ed:       3b 46 04             cmp    0x4(%esi),%eax #ipinfo->dst.s_addr
f0:       89 d8                mov    %ebx,%eax
f2:       0f 95 c1             setne  %cl
f5:       c1 e8 04             shr    $0x4,%eax
f8:       31 c8                xor    %ecx,%eax
fa:       09 d0                or     %edx,%eax
fc:       a8 01                test   $0x1,%al
fe:       0f 85 95 00 00 00    jne    dest // only one conditional branch

As you can see, the whole sequence is rather good: only one conditional branch.
(There are no CMOV instructions, so even on an i486 the code should be
roughly the same.)

Now here is the code generated for the original code :
/********************/
	if (FWINV((ip->saddr&ipinfo->smsk.s_addr) != ipinfo->src.s_addr,
			IPT_INV_SRCIP)
	  || FWINV((ip->daddr&ipinfo->dmsk.s_addr) != ipinfo->dst.s_addr,
			IPT_INV_DSTIP)) {
/********************/
       cb:       0f b6 4e 53             movzbl 0x53(%esi),%ecx
       cf:       f6 c1 08                test   $0x8,%cl
       d2:       0f 84 af 01 00 00       je     287 <ipt_do_table+0x25d>
       d8:       8b 46 08                mov    0x8(%esi),%eax
       db:       23 47 0c                and    0xc(%edi),%eax
       de:       3b 06                   cmp    (%esi),%eax
       e0:       0f 84 b0 01 00 00       je     296 <ipt_do_table+0x26c>
       e6:       f6 c1 10                test   $0x10,%cl
       e9:       0f 84 d4 01 00 00       je     2c3 <ipt_do_table+0x299>
       ef:       8b 46 0c                mov    0xc(%esi),%eax
       f2:       23 47 10                and    0x10(%edi),%eax
       f5:       3b 46 04                cmp    0x4(%esi),%eax
       f8:       0f 84 98 01 00 00       je     296 <ipt_do_table+0x26c>

...

      287:       8b 46 08                mov    0x8(%esi),%eax
      28a:       23 47 0c                and    0xc(%edi),%eax
      28d:       3b 06                   cmp    (%esi),%eax
      28f:       2e 0f 84 50 fe ff ff    je,pn  e6 <ipt_do_table+0xbc>
      296:       0f b7 46 5a             movzwl 0x5a(%esi),%eax
      29a:       01 c6                   add    %eax,%esi
      29c:       8b 4d f0                mov    0xfffffff0(%ebp),%ecx
      29f:       85 c9                   test   %ecx,%ecx
      2a1:       0f 84 24 fe ff ff       je     cb <ipt_do_table+0xa1>

...

      2c3:       8b 46 0c                mov    0xc(%esi),%eax
      2c6:       23 47 10                and    0x10(%edi),%eax
      2c9:       3b 46 04                cmp    0x4(%esi),%eax
      2cc:       75 c8                   jne    296 <ipt_do_table+0x26c>
      2ce:       e9 2b fe ff ff          jmp    fe <ipt_do_table+0xd4>


/******************/
	As you can see, that is a lot of conditional branches, which cannot be
predicted correctly by the CPU unless consecutive iptables rules follow the
same path.


Eric

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/3] netfilter : 3 patches to boost ip_tables performance
  2005-09-23  5:14                   ` Eric Dumazet
@ 2005-09-23 11:33                     ` Willy Tarreau
  0 siblings, 0 replies; 50+ messages in thread
From: Willy Tarreau @ 2005-09-23 11:33 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Willy Tarreau, Harald Welte, netdev, netfilter-devel,
	linux-kernel, Andi Kleen

On Fri, Sep 23, 2005 at 07:14:24AM +0200, Eric Dumazet wrote:
 
> Even without CMOV support, the suggested patch helps :
> 
> Here is the code generated with gcc-3.4.4  on a pentium4 (i686) for :
> 
> /********************/
> bool1 = ((ip->saddr&ipinfo->smsk.s_addr) != ipinfo->src.s_addr);
> bool1 ^= !!(ipinfo->invflags & IPT_INV_SRCIP);
> 
> bool2 = ((ip->daddr&ipinfo->dmsk.s_addr) != ipinfo->dst.s_addr);
> bool2 ^= !!(ipinfo->invflags & IPT_INV_DSTIP);
> 
> if ((bool1 | bool2) != 0) {
> 
> /********************/

(...)

I totally agree with your demonstration. It would be interesting to compare
the same code on an architecture with more registers (e.g. sparc). One of
the reasons for the bad optimization of such constructs on x86 seems to be
the lack of registers for all the variables and intermediate results.
When you write the code as above, you show the workflow to the compiler
(and I often use the same technique).

Regards,
Willy


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/3] netfilter : 3 patches to boost ip_tables performance
  2005-09-23  4:02                 ` Willy Tarreau
  2005-09-23  5:14                   ` Eric Dumazet
@ 2005-09-23 14:00                   ` Tim Mattox
  1 sibling, 0 replies; 50+ messages in thread
From: Tim Mattox @ 2005-09-23 14:00 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Eric Dumazet, Harald Welte, netdev, netfilter-devel,
	linux-kernel, Andi Kleen

AFAIK, the usual reason gcc can generate better code when you write
   bool = complex conditional
   if (bool) ...
is because of C's short-circuit conditional evaluation rules for || and &&.  By
moving the complex expression to an assignment statement, you are telling
gcc that it is okay and safe to evaluate every part of the expression.
If gcc was smart about being able to prove that all parts of the complex
expression were safe (no possible null pointer dereferences) and had no
side effects, then it could generate the same code with
  if (complex conditional) ...

However, there are many situations where even a magical compiler
couldn't prove that there were no possible side effects, etc., and would
have to generate multiple conditional branches to properly meet the
short-circuit conditional evaluation rules for && and ||.
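
For example (illustrative only):

	/* the branch is mandatory here: evaluating p->len with p == NULL
	 * would fault, so the && must short-circuit */
	if (p != NULL && p->len > 0)
		consume(p);

	/* both terms below are safe and side-effect free, so the
	 * programmer may legally flatten them by hand -- which is what
	 * yields the linear code discussed in this thread */
	ok = (flags & F_SRC) != 0;
	ok &= (flags & F_DST) != 0;
	if (ok)
		consume(p);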

That said, in the specific cases in this thread, which use FWINV without
|| and && operators, an optimizing compiler "should" be smart enough
to generate more linear code for today's heavily pipelined CPUs.
For now, I guess it's still the duty of the programmer to use coding
style to force the compiler to generate more linear machine code.

On 9/23/05, Willy Tarreau <willy@w.ods.org> wrote:
> On Thu, Sep 22, 2005 at 03:05:50PM +0200, Eric Dumazet wrote:
> (...)
> > It was necessary to get the best code with gcc-3.4.4 on i386 and
> > gcc-4.0.1 on x86_64
> >
> > For example :
> >
> > bool1 = FWINV(ret != 0, IPT_INV_VIA_OUT);
> > if (bool1) {
> >
> > gives a better code than :
> >
> > if (FWINV(ret != 0, IPT_INV_VIA_OUT)) {
> >
> > (one less conditional branch)
> >
> > Dont ask me why, it is shocking but true :(
>
> I also noticed many times that gcc's optimization of "if (complex condition)"
> is rather poor and it's often better to put it in a variable before. I even
> remember that if you use an intermediate variable, it can often generate a
> CMOV instruction on processors which support it, while it produces cond tests
> and jumps without the variable. Generally speaking, if you want fast code,
> you have to write it as a long sequence of small instructions, just as if
> you were writing assembly. As you said, shocking but true.
>
> BTW, cheers for your optimizations !
>
> Regards,
> Willy
>
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

--
Tim Mattox - tmattox@gmail.com
  http://homepage.mac.com/tmattox/
    I'm a bright... http://www.the-brights.net/

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
  2005-09-22 13:03             ` Andi Kleen
  2005-09-22 13:30               ` Eric Dumazet
@ 2005-09-23 17:09               ` Harald Welte
  2005-09-27 16:23                 ` Andi Kleen
  1 sibling, 1 reply; 50+ messages in thread
From: Harald Welte @ 2005-09-23 17:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Eric Dumazet, linux-kernel, netfilter-devel, netdev

[-- Attachment #1: Type: text/plain, Size: 1395 bytes --]

On Thu, Sep 22, 2005 at 03:03:21PM +0200, Andi Kleen wrote:
> 
> > 1) No more central rwlock protecting each table (filter, nat, mangle, raw),
> >     but one lock per CPU. It avoids cache line ping pongs for each packet.
> 
> Another useful change would be to not take the lock when there are no
> rules. Currently just loading iptables has a large overhead.

This is partially due to the netfilter hooks that are registered (so we
always take nf_hook_slow() in the NF_HOOK() macro).

The default policies inside an iptables chain are internally implemented
as rules.  Thus, policies, being built-in rules, have packet/byte counters.

Therefore, without making a semantic change, we cannot do any of the
following optimizations:

1) not take a lock when the chain is empty
2) not register at the netfilter hook when the chain is empty.

This is well-known, but I don't think we can change the semantics for
the user during a stable kernel series.  That's one point where not
having 2.7.x really hurts.

-- 
- Harald Welte <laforge@netfilter.org>                 http://netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
  2005-09-22 15:50                               ` Eric Dumazet
  2005-09-22 15:55                                 ` Christoph Lameter
@ 2005-09-23 17:11                                 ` Harald Welte
  2005-09-23 17:44                                   ` Christoph Lameter
  2005-09-23 17:47                                   ` [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance Eric Dumazet
  1 sibling, 2 replies; 50+ messages in thread
From: Harald Welte @ 2005-09-23 17:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, Andi Kleen, Christoph Hellwig,
	David S. Miller, linux-kernel, netfilter-devel, netdev

[-- Attachment #1: Type: text/plain, Size: 1327 bytes --]

On Thu, Sep 22, 2005 at 05:50:49PM +0200, Eric Dumazet wrote:
> Christoph Lameter a écrit :
> >It should really be do_set_mempolicy instead to be cleaner. I got a patch here that fixes the 
> >policy layer.
> >But still I agree with Christoph that a real vmalloc_node is better. There will be no fuzzing 
> >around with memory policies etc and its certainly better performance wise.
> 
> vmalloc_node() should be seldom used, at driver init, or when a new
> ip_tables is loaded. If it happens to be a performance problem, then
> we can optimize it.  Why should we spend days of work for a function
> that is yet to be used ?

I see a contradiction in your sentence: "a new ip_tables is loaded"
every time a user changes a single rule.  There are numerous setups that
dynamically change the ruleset (e.g. at interface up/down time; or
think of your typical wlan hotspot, where once a user is authorized,
he'll get different rules).

-- 
- Harald Welte <laforge@netfilter.org>                 http://netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
  2005-09-23 17:11                                 ` Harald Welte
@ 2005-09-23 17:44                                   ` Christoph Lameter
  2005-09-23 18:04                                     ` Dave Hansen
  2005-09-23 17:47                                   ` [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance Eric Dumazet
  1 sibling, 1 reply; 50+ messages in thread
From: Christoph Lameter @ 2005-09-23 17:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Harald Welte, Andi Kleen, Christoph Hellwig, David S. Miller,
	linux-kernel, netfilter-devel, netdev

Here is an updated version of the vmalloc_node patch:

This patch adds

vmalloc_node(size, node)	-> Allocate necessary memory on the specified node

and

get_vm_area_node(size, flags, node)

and the other functions that it depends on.

Index: linux-2.6.14-rc2/include/linux/vmalloc.h
===================================================================
--- linux-2.6.14-rc2.orig/include/linux/vmalloc.h	2005-09-19 20:00:41.000000000 -0700
+++ linux-2.6.14-rc2/include/linux/vmalloc.h	2005-09-23 10:28:37.000000000 -0700
@@ -32,22 +32,35 @@ struct vm_struct {
  *	Highlevel APIs for driver use
  */
 extern void *vmalloc(unsigned long size);
+extern void *vmalloc_node(unsigned long size, int node);
 extern void *vmalloc_exec(unsigned long size);
 extern void *vmalloc_32(unsigned long size);
-extern void *__vmalloc(unsigned long size, unsigned int __nocast gfp_mask, pgprot_t prot);
-extern void *__vmalloc_area(struct vm_struct *area, unsigned int __nocast gfp_mask, pgprot_t prot);
-extern void vfree(void *addr);
+extern void *__vmalloc(unsigned long size, unsigned int __nocast gfp_mask,
+				pgprot_t prot, int node);
+extern void *__vmalloc_area(struct vm_struct *area, unsigned int __nocast gfp_mask,
+				pgprot_t prot, int node);
 
+extern void vfree(void *addr);
 extern void *vmap(struct page **pages, unsigned int count,
 			unsigned long flags, pgprot_t prot);
 extern void vunmap(void *addr);
- 
-/*
- *	Lowlevel-APIs (not for driver use!)
+
+/**
+ *      get_vm_area  -  reserve a contiguous kernel virtual area
+ *
+ *      @size:          size of the area
+ *      @flags:         %VM_IOREMAP for I/O mappings or VM_ALLOC
+ *
+ *      Search an area of @size in the kernel virtual mapping area,
+ *      and reserve it for our purposes.  Returns the area descriptor
+ *      on success or %NULL on failure.
  */
-extern struct vm_struct *get_vm_area(unsigned long size, unsigned long flags);
 extern struct vm_struct *__get_vm_area(unsigned long size, unsigned long flags,
-					unsigned long start, unsigned long end);
+					unsigned long start, unsigned long end, int node);
+#define get_vm_area(__size, __flags) __get_vm_area((__size), (__flags), VMALLOC_START, \
+				 VMALLOC_END, -1)
+#define get_vm_area_node(__size, __flags, __node) __get_vm_area((__size), (__flags), \
+				VMALLOC_START, VMALLOC_END, __node)
 extern struct vm_struct *remove_vm_area(void *addr);
 extern struct vm_struct *__remove_vm_area(void *addr);
 extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
Index: linux-2.6.14-rc2/mm/vmalloc.c
===================================================================
--- linux-2.6.14-rc2.orig/mm/vmalloc.c	2005-09-19 20:00:41.000000000 -0700
+++ linux-2.6.14-rc2/mm/vmalloc.c	2005-09-23 10:43:07.000000000 -0700
@@ -5,6 +5,7 @@
  *  Support of BIGMEM added by Gerhard Wichert, Siemens AG, July 1999
  *  SMP-safe vmalloc/vfree/ioremap, Tigran Aivazian <tigran@veritas.com>, May 2000
  *  Major rework to support vmap/vunmap, Christoph Hellwig, SGI, August 2002
+ *  Numa awareness, Christoph Lameter, SGI, June 2005
  */
 
 #include <linux/mm.h>
@@ -159,7 +160,7 @@ int map_vm_area(struct vm_struct *area, 
 }
 
 struct vm_struct *__get_vm_area(unsigned long size, unsigned long flags,
-				unsigned long start, unsigned long end)
+				unsigned long start, unsigned long end, int node)
 {
 	struct vm_struct **p, *tmp, *area;
 	unsigned long align = 1;
@@ -178,7 +179,7 @@ struct vm_struct *__get_vm_area(unsigned
 	addr = ALIGN(start, align);
 	size = PAGE_ALIGN(size);
 
-	area = kmalloc(sizeof(*area), GFP_KERNEL);
+	area = kmalloc_node(sizeof(*area), GFP_KERNEL, node);
 	if (unlikely(!area))
 		return NULL;
 
@@ -231,21 +232,6 @@ out:
 	return NULL;
 }
 
-/**
- *	get_vm_area  -  reserve a contingous kernel virtual area
- *
- *	@size:		size of the area
- *	@flags:		%VM_IOREMAP for I/O mappings or VM_ALLOC
- *
- *	Search an area of @size in the kernel virtual mapping area,
- *	and reserved it for out purposes.  Returns the area descriptor
- *	on success or %NULL on failure.
- */
-struct vm_struct *get_vm_area(unsigned long size, unsigned long flags)
-{
-	return __get_vm_area(size, flags, VMALLOC_START, VMALLOC_END);
-}
-
 /* Caller must hold vmlist_lock */
 struct vm_struct *__remove_vm_area(void *addr)
 {
@@ -395,7 +381,8 @@ void *vmap(struct page **pages, unsigned
 
 EXPORT_SYMBOL(vmap);
 
-void *__vmalloc_area(struct vm_struct *area, unsigned int __nocast gfp_mask, pgprot_t prot)
+void *__vmalloc_area(struct vm_struct *area, unsigned int __nocast gfp_mask,
+			pgprot_t prot, int node)
 {
 	struct page **pages;
 	unsigned int nr_pages, array_size, i;
@@ -406,9 +393,9 @@ void *__vmalloc_area(struct vm_struct *a
 	area->nr_pages = nr_pages;
 	/* Please note that the recursion is strictly bounded. */
 	if (array_size > PAGE_SIZE)
-		pages = __vmalloc(array_size, gfp_mask, PAGE_KERNEL);
+		pages = __vmalloc(array_size, gfp_mask, PAGE_KERNEL, node);
 	else
-		pages = kmalloc(array_size, (gfp_mask & ~__GFP_HIGHMEM));
+		pages = kmalloc_node(array_size, (gfp_mask & ~__GFP_HIGHMEM), node);
 	area->pages = pages;
 	if (!area->pages) {
 		remove_vm_area(area->addr);
@@ -418,7 +405,10 @@ void *__vmalloc_area(struct vm_struct *a
 	memset(area->pages, 0, array_size);
 
 	for (i = 0; i < area->nr_pages; i++) {
-		area->pages[i] = alloc_page(gfp_mask);
+		if (node < 0)
+			area->pages[i] = alloc_page(gfp_mask);
+		else
+			area->pages[i] = alloc_pages_node(node, gfp_mask, 0);
 		if (unlikely(!area->pages[i])) {
 			/* Successfully allocated i pages, free them in __vunmap() */
 			area->nr_pages = i;
@@ -446,7 +436,7 @@ fail:
  *	allocator with @gfp_mask flags.  Map them into contiguous
  *	kernel virtual space, using a pagetable protection of @prot.
  */
-void *__vmalloc(unsigned long size, unsigned int __nocast gfp_mask, pgprot_t prot)
+void *__vmalloc(unsigned long size, unsigned int __nocast gfp_mask, pgprot_t prot, int node)
 {
 	struct vm_struct *area;
 
@@ -454,13 +444,12 @@ void *__vmalloc(unsigned long size, unsi
 	if (!size || (size >> PAGE_SHIFT) > num_physpages)
 		return NULL;
 
-	area = get_vm_area(size, VM_ALLOC);
+	area = get_vm_area_node(size, VM_ALLOC, node);
 	if (!area)
 		return NULL;
 
-	return __vmalloc_area(area, gfp_mask, prot);
+	return __vmalloc_area(area, gfp_mask, prot, node);
 }
-
 EXPORT_SYMBOL(__vmalloc);
 
 /**
@@ -476,11 +465,30 @@ EXPORT_SYMBOL(__vmalloc);
  */
 void *vmalloc(unsigned long size)
 {
-       return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);
+       return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL, -1);
 }
 
 EXPORT_SYMBOL(vmalloc);
 
+/**
+ *	vmalloc_node  -  allocate memory on a specific node
+ *
+ *	@size:		allocation size
+ *	@node:		numa node
+ *
+ *	Allocate enough pages to cover @size from the page level
+ *	allocator and map them into contiguous kernel virtual space.
+ *
+ *	For tight control over page level allocator and protection flags
+ *	use __vmalloc() instead.
+ */
+void *vmalloc_node(unsigned long size, int node)
+{
+       return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL, node);
+}
+
+EXPORT_SYMBOL(vmalloc_node);
+
 #ifndef PAGE_KERNEL_EXEC
 # define PAGE_KERNEL_EXEC PAGE_KERNEL
 #endif
@@ -500,7 +508,7 @@ EXPORT_SYMBOL(vmalloc);
 
 void *vmalloc_exec(unsigned long size)
 {
-	return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC);
+	return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC, -1);
 }
 
 /**
@@ -513,7 +521,7 @@ void *vmalloc_exec(unsigned long size)
  */
 void *vmalloc_32(unsigned long size)
 {
-	return __vmalloc(size, GFP_KERNEL, PAGE_KERNEL);
+	return __vmalloc(size, GFP_KERNEL, PAGE_KERNEL, -1);
 }
 
 EXPORT_SYMBOL(vmalloc_32);
Index: linux-2.6.14-rc2/fs/xfs/linux-2.6/kmem.c
===================================================================
--- linux-2.6.14-rc2.orig/fs/xfs/linux-2.6/kmem.c	2005-09-19 20:00:41.000000000 -0700
+++ linux-2.6.14-rc2/fs/xfs/linux-2.6/kmem.c	2005-09-23 10:17:20.000000000 -0700
@@ -55,7 +55,7 @@ kmem_alloc(size_t size, unsigned int __n
 		if (size < MAX_SLAB_SIZE || retries > MAX_VMALLOCS)
 			ptr = kmalloc(size, lflags);
 		else
-			ptr = __vmalloc(size, lflags, PAGE_KERNEL);
+			ptr = __vmalloc(size, lflags, PAGE_KERNEL, -1);
 		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
 			return ptr;
 		if (!(++retries % 100))
Index: linux-2.6.14-rc2/mm/page_alloc.c
===================================================================
--- linux-2.6.14-rc2.orig/mm/page_alloc.c	2005-09-19 20:00:41.000000000 -0700
+++ linux-2.6.14-rc2/mm/page_alloc.c	2005-09-23 10:17:20.000000000 -0700
@@ -2542,7 +2542,7 @@ void *__init alloc_large_system_hash(con
 		if (flags & HASH_EARLY)
 			table = alloc_bootmem(size);
 		else if (hashdist)
-			table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL);
+			table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL, -1);
 		else {
 			unsigned long order;
 			for (order = 0; ((1UL << order) << PAGE_SHIFT) < size; order++)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
  2005-09-23 17:11                                 ` Harald Welte
  2005-09-23 17:44                                   ` Christoph Lameter
@ 2005-09-23 17:47                                   ` Eric Dumazet
  2005-09-23 18:00                                     ` Kyle Moffett
  1 sibling, 1 reply; 50+ messages in thread
From: Eric Dumazet @ 2005-09-23 17:47 UTC (permalink / raw)
  To: Harald Welte
  Cc: Christoph Lameter, Andi Kleen, Christoph Hellwig,
	David S. Miller, linux-kernel, netfilter-devel, netdev

Harald Welte a écrit :
> On Thu, Sep 22, 2005 at 05:50:49PM +0200, Eric Dumazet wrote:
> 
>>Christoph Lameter a écrit :
>>
>>>It should really be do_set_mempolicy instead to be cleaner. I got a patch here that fixes the 
>>>policy layer.
>>>But still I agree with Christoph that a real vmalloc_node is better. There will be no fuzzing 
>>>around with memory policies etc and its certainly better performance wise.
>>
>>vmalloc_node() should be seldom used, at driver init, or when a new
>>ip_tables is loaded. If it happens to be a performance problem, then
>>we can optimize it.  Why should we spend days of work for a function
>>that is yet to be used ?
> 
> 
> I see a contradiction in your sentence.  "a new ip_tables is loaded"
> every time a user changes a single rule.  There are numerous setups that
> dynamically change the ruleset (e.g. at interface up/down point, or even
> think of your typical wlan hotspot, where once a user is authorized,
> he'll get different rules.
> 

But a user changing a single rule usually calls (fork()/exec()) a program
called iptables. The underlying cost of all this, plus copying the rules to
user space so that iptables can change them and reload them into the kernel,
is far larger than a hypothetical vmalloc_node() performance problem.

Eric

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
  2005-09-23 17:47                                   ` [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance Eric Dumazet
@ 2005-09-23 18:00                                     ` Kyle Moffett
  0 siblings, 0 replies; 50+ messages in thread
From: Kyle Moffett @ 2005-09-23 18:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Harald Welte, Christoph Lameter, Andi Kleen, Christoph Hellwig,
	David S. Miller, linux-kernel, netfilter-devel, netdev

On Sep 23, 2005, at 13:47:53, Eric Dumazet wrote:
> Harald Welte a écrit :
>> I see a contradiction in your sentence.  "a new ip_tables is  
>> loaded" every time a user changes a single rule.  There are  
>> numerous setups that dynamically change the ruleset (e.g. at  
>> interface up/down point, or even think of your typical wlan  
>> hotspot, where once a user is authorized, he'll get different rules.
>
> But a user changing a single rule usually calls (fork()/exec()) a  
> program called iptables. The  underlying cost of all this, plus  
> copying the rules to user space, so that iptables change them and  
> reload them in the kernel is far more important than an  
> hypothetical vmalloc_node() performance problem.

Yeah, if you're really worried about the cost of iptables  
manipulations, you should probably write your own happy little C  
program to atomically load, update, and store the rules.  Even then,  
the cost of copying the whole ruleset to userspace for modification  
is probably greater than that of memory allocation issues, especially  
if the ruleset is large enough that memory allocation issues cause  
problems :-D
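
For reference, such a program talks to ip_tables through the same
getsockopt()/setsockopt() interface the iptables binary uses; a rough
sketch (error handling and the ipt_replace bookkeeping omitted -- see
libiptc for the real thing):

#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/netfilter_ipv4/ip_tables.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);
	struct ipt_getinfo info;
	struct ipt_get_entries *entries;
	socklen_t len = sizeof(info);

	/* 1. load: fetch the table geometry, then the rule blob */
	memset(&info, 0, sizeof(info));
	strcpy(info.name, "filter");
	getsockopt(fd, IPPROTO_IP, IPT_SO_GET_INFO, &info, &len);

	entries = malloc(sizeof(*entries) + info.size);
	strcpy(entries->name, "filter");
	entries->size = info.size;
	len = sizeof(*entries) + info.size;
	getsockopt(fd, IPPROTO_IP, IPT_SO_GET_ENTRIES, entries, &len);

	/* 2. update: edit entries->entrytable in place here */

	/* 3. store: wrap the modified blob in a struct ipt_replace and
	 * hand it back in one atomic IPT_SO_SET_REPLACE setsockopt() */
	return 0;
}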

Cheers,
Kyle Moffett

-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCM/CS/IT/U d- s++: a18 C++++>$ UB/L/X/*++++(+)>$ P+++(++++)>$ L++++(+ 
++) E W++(+) N+++(++) o? K? w--- O? M++ V? PS+() PE+(-) Y+ PGP+++ t+(+ 
++) 5 X R? tv-(--) b++++(++) DI+ D+ G e->++++$ h!*()>++$ r  !y?(-)
------END GEEK CODE BLOCK------



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
  2005-09-23 17:44                                   ` Christoph Lameter
@ 2005-09-23 18:04                                     ` Dave Hansen
  2005-09-26 17:58                                       ` vmalloc_node Christoph Lameter
  0 siblings, 1 reply; 50+ messages in thread
From: Dave Hansen @ 2005-09-23 18:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Eric Dumazet, Harald Welte, Andi Kleen, Christoph Hellwig,
	David S. Miller, Linux Kernel Mailing List, netfilter-devel,
	netdev

On Fri, 2005-09-23 at 10:44 -0700, Christoph Lameter wrote:
>         for (i = 0; i < area->nr_pages; i++) {
> -               area->pages[i] = alloc_page(gfp_mask);
> +               if (node < 0)
> +                       area->pages[i] = alloc_page(gfp_mask);
> +               else
> +                       area->pages[i] = alloc_pages_node(node, gfp_mask, 0);
>                 if (unlikely(!area->pages[i])) {
>                         /* Successfully allocated i pages, free them in __vunmap() */
>                         area->nr_pages = i;
...
>  void *vmalloc_exec(unsigned long size)
>  {
> -       return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC);
> +       return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC, -1);
>  }

Instead of hard-coding all of those -1's for the node to specify a
default allocation, and changing all of those callers, why not:

void *__vmalloc_node(unsigned long size, unsigned int __nocast gfp_mask,
		     pgprot_t prot, int node)
{
	/* ... existing vmalloc code here ... */
}

void *__vmalloc(unsigned long size, unsigned int __nocast gfp_mask,
		pgprot_t prot)
{
	return __vmalloc_node(size, gfp_mask, prot, -1);
}

A named macro is probably better than -1, but if it is only used in one
place, it is hard to complain.

-- Dave


^ permalink raw reply	[flat|nested] 50+ messages in thread

* vmalloc_node
  2005-09-23 18:04                                     ` Dave Hansen
@ 2005-09-26 17:58                                       ` Christoph Lameter
  2005-09-26 18:10                                         ` vmalloc_node Dave Hansen
  0 siblings, 1 reply; 50+ messages in thread
From: Christoph Lameter @ 2005-09-26 17:58 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Eric Dumazet, Harald Welte, Andi Kleen, akpm, Christoph Hellwig,
	David S. Miller, Linux Kernel Mailing List

On Fri, 23 Sep 2005, Dave Hansen wrote:

> Instead of hard-coding all of those -1's for the node to specify a
> default allocation, and changing all of those callers, why not:

Done.

> 	__vmalloc_node(size, gfp_mask, prot, -1);
> A named macro is probably better than -1, but if it is only used in one
> place, it is hard to complain.

-1 is used consistently in the *_node functions to indicate that the node 
is not specified. Should I replace -1 throughout the kernel with a 
constant?

Here is the updated vmalloc_node patch:

--

This patch adds

vmalloc_node(size, node)	-> Allocate necessary memory on the specified node

and

get_vm_area_node(size, flags, node)

and the other functions that it depends on.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.14-rc2/include/linux/vmalloc.h
===================================================================
--- linux-2.6.14-rc2.orig/include/linux/vmalloc.h	2005-09-19 20:00:41.000000000 -0700
+++ linux-2.6.14-rc2/include/linux/vmalloc.h	2005-09-26 10:40:57.000000000 -0700
@@ -32,22 +32,37 @@ struct vm_struct {
  *	Highlevel APIs for driver use
  */
 extern void *vmalloc(unsigned long size);
+extern void *vmalloc_node(unsigned long size, int node);
 extern void *vmalloc_exec(unsigned long size);
 extern void *vmalloc_32(unsigned long size);
-extern void *__vmalloc(unsigned long size, unsigned int __nocast gfp_mask, pgprot_t prot);
-extern void *__vmalloc_area(struct vm_struct *area, unsigned int __nocast gfp_mask, pgprot_t prot);
-extern void vfree(void *addr);
+extern void *__vmalloc(unsigned long size, unsigned int __nocast gfp_mask,
+				pgprot_t prot);
+extern void *__vmalloc_node(unsigned long size, unsigned int __nocast gfp_mask,
+				pgprot_t prot, int node);
+extern void *__vmalloc_area(struct vm_struct *area, unsigned int __nocast gfp_mask,
+				pgprot_t prot, int node);
 
+extern void vfree(void *addr);
 extern void *vmap(struct page **pages, unsigned int count,
 			unsigned long flags, pgprot_t prot);
 extern void vunmap(void *addr);
- 
-/*
- *	Lowlevel-APIs (not for driver use!)
+
+/**
+ *      get_vm_area  -  reserve a contiguous kernel virtual area
+ *
+ *      @size:          size of the area
+ *      @flags:         %VM_IOREMAP for I/O mappings or VM_ALLOC
+ *
+ *      Search an area of @size in the kernel virtual mapping area,
+ *      and reserve it for our purposes.  Returns the area descriptor
+ *      on success or %NULL on failure.
  */
-extern struct vm_struct *get_vm_area(unsigned long size, unsigned long flags);
 extern struct vm_struct *__get_vm_area(unsigned long size, unsigned long flags,
-					unsigned long start, unsigned long end);
+					unsigned long start, unsigned long end, int node);
+#define get_vm_area(__size, __flags) __get_vm_area((__size), (__flags), VMALLOC_START, \
+				 VMALLOC_END, -1)
+#define get_vm_area_node(__size, __flags, __node) __get_vm_area((__size), (__flags), \
+				VMALLOC_START, VMALLOC_END, __node)
 extern struct vm_struct *remove_vm_area(void *addr);
 extern struct vm_struct *__remove_vm_area(void *addr);
 extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
Index: linux-2.6.14-rc2/mm/vmalloc.c
===================================================================
--- linux-2.6.14-rc2.orig/mm/vmalloc.c	2005-09-19 20:00:41.000000000 -0700
+++ linux-2.6.14-rc2/mm/vmalloc.c	2005-09-26 10:46:14.000000000 -0700
@@ -5,6 +5,7 @@
  *  Support of BIGMEM added by Gerhard Wichert, Siemens AG, July 1999
  *  SMP-safe vmalloc/vfree/ioremap, Tigran Aivazian <tigran@veritas.com>, May 2000
  *  Major rework to support vmap/vunmap, Christoph Hellwig, SGI, August 2002
+ *  Numa awareness, Christoph Lameter, SGI, June 2005
  */
 
 #include <linux/mm.h>
@@ -159,7 +160,7 @@ int map_vm_area(struct vm_struct *area, 
 }
 
 struct vm_struct *__get_vm_area(unsigned long size, unsigned long flags,
-				unsigned long start, unsigned long end)
+				unsigned long start, unsigned long end, int node)
 {
 	struct vm_struct **p, *tmp, *area;
 	unsigned long align = 1;
@@ -178,7 +179,7 @@ struct vm_struct *__get_vm_area(unsigned
 	addr = ALIGN(start, align);
 	size = PAGE_ALIGN(size);
 
-	area = kmalloc(sizeof(*area), GFP_KERNEL);
+	area = kmalloc_node(sizeof(*area), GFP_KERNEL, node);
 	if (unlikely(!area))
 		return NULL;
 
@@ -231,21 +232,6 @@ out:
 	return NULL;
 }
 
-/**
- *	get_vm_area  -  reserve a contingous kernel virtual area
- *
- *	@size:		size of the area
- *	@flags:		%VM_IOREMAP for I/O mappings or VM_ALLOC
- *
- *	Search an area of @size in the kernel virtual mapping area,
- *	and reserved it for out purposes.  Returns the area descriptor
- *	on success or %NULL on failure.
- */
-struct vm_struct *get_vm_area(unsigned long size, unsigned long flags)
-{
-	return __get_vm_area(size, flags, VMALLOC_START, VMALLOC_END);
-}
-
 /* Caller must hold vmlist_lock */
 struct vm_struct *__remove_vm_area(void *addr)
 {
@@ -395,7 +381,8 @@ void *vmap(struct page **pages, unsigned
 
 EXPORT_SYMBOL(vmap);
 
-void *__vmalloc_area(struct vm_struct *area, unsigned int __nocast gfp_mask, pgprot_t prot)
+void *__vmalloc_area(struct vm_struct *area, unsigned int __nocast gfp_mask,
+			pgprot_t prot, int node)
 {
 	struct page **pages;
 	unsigned int nr_pages, array_size, i;
@@ -406,9 +393,9 @@ void *__vmalloc_area(struct vm_struct *a
 	area->nr_pages = nr_pages;
 	/* Please note that the recursion is strictly bounded. */
 	if (array_size > PAGE_SIZE)
-		pages = __vmalloc(array_size, gfp_mask, PAGE_KERNEL);
+		pages = __vmalloc_node(array_size, gfp_mask, PAGE_KERNEL, node);
 	else
-		pages = kmalloc(array_size, (gfp_mask & ~__GFP_HIGHMEM));
+		pages = kmalloc_node(array_size, (gfp_mask & ~__GFP_HIGHMEM), node);
 	area->pages = pages;
 	if (!area->pages) {
 		remove_vm_area(area->addr);
@@ -418,7 +405,10 @@ void *__vmalloc_area(struct vm_struct *a
 	memset(area->pages, 0, array_size);
 
 	for (i = 0; i < area->nr_pages; i++) {
-		area->pages[i] = alloc_page(gfp_mask);
+		if (node < 0)
+			area->pages[i] = alloc_page(gfp_mask);
+		else
+			area->pages[i] = alloc_pages_node(node, gfp_mask, 0);
 		if (unlikely(!area->pages[i])) {
 			/* Successfully allocated i pages, free them in __vunmap() */
 			area->nr_pages = i;
@@ -436,17 +426,18 @@ fail:
 }
 
 /**
- *	__vmalloc  -  allocate virtually contiguous memory
+ *	__vmalloc_node  -  allocate virtually contiguous memory
  *
  *	@size:		allocation size
  *	@gfp_mask:	flags for the page level allocator
  *	@prot:		protection mask for the allocated pages
+ *	@node:		node to use for allocation or -1
  *
  *	Allocate enough pages to cover @size from the page level
  *	allocator with @gfp_mask flags.  Map them into contiguous
  *	kernel virtual space, using a pagetable protection of @prot.
  */
-void *__vmalloc(unsigned long size, unsigned int __nocast gfp_mask, pgprot_t prot)
+void *__vmalloc_node(unsigned long size, unsigned int __nocast gfp_mask, pgprot_t prot, int node)
 {
 	struct vm_struct *area;
 
@@ -454,13 +445,18 @@ void *__vmalloc(unsigned long size, unsi
 	if (!size || (size >> PAGE_SHIFT) > num_physpages)
 		return NULL;
 
-	area = get_vm_area(size, VM_ALLOC);
+	area = get_vm_area_node(size, VM_ALLOC, node);
 	if (!area)
 		return NULL;
 
-	return __vmalloc_area(area, gfp_mask, prot);
+	return __vmalloc_area(area, gfp_mask, prot, node);
 }
+EXPORT_SYMBOL(__vmalloc_node);
 
+void *__vmalloc(unsigned long size, unsigned int __nocast gfp_mask, pgprot_t prot)
+{
+	return __vmalloc_node(size, gfp_mask, prot, -1);
+}
 EXPORT_SYMBOL(__vmalloc);
 
 /**
@@ -481,6 +477,25 @@ void *vmalloc(unsigned long size)
 
 EXPORT_SYMBOL(vmalloc);
 
+/**
+ *	vmalloc_node  -  allocate memory on a specific node
+ *
+ *	@size:		allocation size
+ *	@node:		numa node
+ *
+ *	Allocate enough pages to cover @size from the page level
+ *	allocator and map them into contiguous kernel virtual space.
+ *
+ *	For tight control over page level allocator and protection flags
+ *	use __vmalloc() instead.
+ */
+void *vmalloc_node(unsigned long size, int node)
+{
+       return __vmalloc_node(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL, node);
+}
+
+EXPORT_SYMBOL(vmalloc_node);
+
 #ifndef PAGE_KERNEL_EXEC
 # define PAGE_KERNEL_EXEC PAGE_KERNEL
 #endif
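
As a usage sketch on the netfilter side (hypothetical field names; this
is the pattern the ip_tables patches in this thread would use):

	int cpu;

	/* allocate each CPU's private copy of the ruleset on that
	 * CPU's own node */
	for_each_cpu(cpu) {
		newinfo->entries[cpu] = vmalloc_node(size, cpu_to_node(cpu));
		if (!newinfo->entries[cpu])
			goto cleanup;	/* free the copies made so far */
	}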

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: vmalloc_node
  2005-09-26 17:58                                       ` vmalloc_node Christoph Lameter
@ 2005-09-26 18:10                                         ` Dave Hansen
  0 siblings, 0 replies; 50+ messages in thread
From: Dave Hansen @ 2005-09-26 18:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Eric Dumazet, Harald Welte, Andi Kleen, Andrew Morton,
	Christoph Hellwig, David S. Miller, Linux Kernel Mailing List

On Mon, 2005-09-26 at 10:58 -0700, Christoph Lameter wrote:
> On Fri, 23 Sep 2005, Dave Hansen wrote:
> > Instead of hard-coding all of those -1's for the node to specify a
> > default allocation, and changing all of those callers, why not:
> 
> Done.

That looks much nicer.  Thanks!

> > 	__vmalloc_node(size, gfp_mask, prot, -1);
> > A named macro is probably better than -1, but if it is only used in one
> > place, it is hard to complain.
> 
> -1 is used consistently in the *_node functions to indicate that the node 
> is not specified. Should I replace -1 throughout the kernel with a 
> constant?

I certainly wouldn't mind.  Giving it a name like NODE_ANY or
NODE_UNSPECIFIED would keep anyone from having to go dig into
the allocator functions to work out what it actually does.
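
e.g. (the name is only a suggestion; no such constant exists yet):

#define NODE_UNSPECIFIED	(-1)	/* "no preferred node" for *_node() APIs */

void *__vmalloc(unsigned long size, unsigned int __nocast gfp_mask, pgprot_t prot)
{
	return __vmalloc_node(size, gfp_mask, prot, NODE_UNSPECIFIED);
}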

-- Dave


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
  2005-09-23 17:09               ` Harald Welte
@ 2005-09-27 16:23                 ` Andi Kleen
  2005-09-28  0:25                   ` Henrik Nordstrom
  0 siblings, 1 reply; 50+ messages in thread
From: Andi Kleen @ 2005-09-27 16:23 UTC (permalink / raw)
  To: Harald Welte; +Cc: Eric Dumazet, linux-kernel, netfilter-devel, netdev

On Friday 23 September 2005 19:09, Harald Welte wrote:
> On Thu, Sep 22, 2005 at 03:03:21PM +0200, Andi Kleen wrote:
> > > 1) No more central rwlock protecting each table (filter, nat, mangle,
> > > raw), but one lock per CPU. It avoids cache line ping pongs for each
> > > packet.
> >
> > Another useful change would be to not take the lock when there are no
> > rules. Currently just loading iptables has a large overhead.
>
> This is partially due to the netfilter hooks that are registered (so we
> always take nf_hook_slow() in the NF_HOOK() macro).

I'm not sure it's that: nf_hook_slow() uses RCU, so it should be quite
fast.

> The default policies inside an iptables chain are internally implemented
> as a rule.  Thus, policies as built-in rules have packet/byte counters.

That could be special-cased and done locklessly, with the counting
done per CPU.
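
i.e. roughly (a sketch only):

/* one counter pair per CPU: each writer touches only its own cache
 * line, no lock anywhere; a reader sums over all CPUs */
static DEFINE_PER_CPU(struct ipt_counters, policy_counters);

/* called from the hook, i.e. in softirq context with preemption off */
static inline void count_policy_packet(unsigned int pktlen)
{
	struct ipt_counters *c = &__get_cpu_var(policy_counters);

	c->pcnt++;
	c->bcnt += pktlen;
}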

-Andi

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
  2005-09-27 16:23                 ` Andi Kleen
@ 2005-09-28  0:25                   ` Henrik Nordstrom
  2005-09-28  8:32                     ` Harald Welte
  0 siblings, 1 reply; 50+ messages in thread
From: Henrik Nordstrom @ 2005-09-28  0:25 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Harald Welte, netdev, netfilter-devel, linux-kernel

On Tue, 27 Sep 2005, Andi Kleen wrote:

> That could be special cased and done lockless, with the counting
> done per CPU.

It's also not very hard for iptables, when verifying the table, to conclude
that there aren't any "real" rules for a certain hook (only the policy
ACCEPT rule is found) and then drop that hook registration. That allows
you to have as many ip_tables modules as you like in the kernel, while only
using the hooks where you have rules.  The drawback is that you lose the
packet counters on the policy.

Exception: iptable_nat. It needs the hooks for other purposes as well, not
just the table, so there the hooks cannot be deactivated when there are no
rules.

Regards
Henrik

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
  2005-09-28  0:25                   ` Henrik Nordstrom
@ 2005-09-28  8:32                     ` Harald Welte
  2005-09-28  8:37                       ` Andi Kleen
  2005-09-28 10:34                       ` Henrik Nordstrom
  0 siblings, 2 replies; 50+ messages in thread
From: Harald Welte @ 2005-09-28  8:32 UTC (permalink / raw)
  To: Henrik Nordstrom; +Cc: Andi Kleen, netdev, netfilter-devel, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2604 bytes --]

On Wed, Sep 28, 2005 at 02:25:16AM +0200, Henrik Nordstrom wrote:

> >That could be special cased and done lockless, with the counting
> >done per CPU.
> 
> It's also not very hard for iptables when verifying the table to
> conclude that there really isn't any "real" rules for a certain hook

The detection should be quite straightforward, yes.

> Allowing you to have as many ip tables modules you like in the kernel,
> but only using the hooks where you have rules.  

I totally agree that, from a current perspective, the concept of
just loading a module (one that has a usage count of 0) having a severe
impact on system performance is just wrong.  But then, users have been
used to the current behaviour for almost five years now.  Every decent
script and/or piece of documentation should by now mention that loading a
module implicitly registers it at the hooks, and that you should only load
those modules that you really need.

In an ideal world, you would load iptables, and userspace would
configure a couple of chains, and then explicitly attach those chains to
the netfilter hooks. 
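
Attaching a chain would then be an ordinary, on-demand nf_hook_ops
registration - roughly (a sketch, reusing the current hook prototype and
iptable_filter's packet_filter table):

static unsigned int chain_hook(unsigned int hooknum,
			       struct sk_buff **pskb,
			       const struct net_device *in,
			       const struct net_device *out,
			       int (*okfn)(struct sk_buff *))
{
	return ipt_do_table(pskb, hooknum, in, out, &packet_filter, NULL);
}

static struct nf_hook_ops chain_ops = {
	.hook		= chain_hook,
	.owner		= THIS_MODULE,
	.pf		= PF_INET,
	.hooknum	= NF_IP_LOCAL_IN,
	.priority	= NF_IP_PRI_FILTER,
};

/* done only once userspace actually attaches a chain to this hook,
 * instead of unconditionally at module load time */
nf_register_hook(&chain_ops);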

Therefore: Let's do this right next time, but live with that fact for
now.

> Drawback is that you loose the packet counters on the policy.

That's the big problem.  People rely on that fact, because they're used
to it.  If we introduce some new tool, or something that visibly works
differently, I don't have a problem with that.  But silently changing the
semantics with a kernel upgrade is not something that I'm willing to do on
purpose.

Just imagine all those poor sysadmins who know nothing about current
kernel development, and who upgrade their kernel because their
distributor provides a new one - suddenly their accounting (which might
be relevant for their business) doesn't work anymore :(

> Exception: iptable_nat. Needs the hooks for other purposes as well,
> not just the iptable so here the hooks can not be deactivated when
> there is no rules.

Yes, although we could do the ip_nat / iptable_nat split (introduced in
2.6.14) in a way that would make ip_nat.ko register those hooks that are
implicitly needed, and iptable_nat only the user-visible,
chain-related hooks.

-- 
- Harald Welte <laforge@netfilter.org>                 http://netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
  2005-09-28  8:32                     ` Harald Welte
@ 2005-09-28  8:37                       ` Andi Kleen
  2005-10-04 17:01                         ` Patrick McHardy
  2005-09-28 10:34                       ` Henrik Nordstrom
  1 sibling, 1 reply; 50+ messages in thread
From: Andi Kleen @ 2005-09-28  8:37 UTC (permalink / raw)
  To: Harald Welte; +Cc: Henrik Nordstrom, netdev, netfilter-devel, linux-kernel

On Wednesday 28 September 2005 10:32, Harald Welte wrote:

> I totally agree, that from a current perspective, I think the concept of
> just loading a module (that has usage count 0) having severe impact on
> system performance is just wrong.  But then, users are used to the
> current behaviour for almost five years now. 

That doesn't mean it cannot be improved - and I think it should.

In a sense it's even getting worse: for example, losing the CONFIG
option to disable local conntrack (Patrick removed it some time ago
without even a comment on why he did it) has a really bad impact in some cases.

> Therefore: Let's do this right next time, but live with that fact for
> now.

Even with a "quite straightforward" (quoting you) fix?

> Just imagine all those poor sysadmins who know nothing about current
> kernel development, and who upgrade their kernel because their
> distributor provides a new one - suddenly their accounting (which might
> be relevant for their business) doesn't work anymore :(

Accounting with per CPU counters can be done fully lockless.

-Andi

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
  2005-09-28  8:32                     ` Harald Welte
  2005-09-28  8:37                       ` Andi Kleen
@ 2005-09-28 10:34                       ` Henrik Nordstrom
  1 sibling, 0 replies; 50+ messages in thread
From: Henrik Nordstrom @ 2005-09-28 10:34 UTC (permalink / raw)
  To: Harald Welte; +Cc: Andi Kleen, netdev, netfilter-devel, linux-kernel

On Wed, 28 Sep 2005, Harald Welte wrote:

> Just imagine all those poor sysadmins who know nothing about current
> kernel development, and who upgrade their kernel because their
> distributor provides a new one - suddenly their accounting (which might
> be relevant for their business) doesn't work anymore :(

Do you seriously expect anyone to be using the counters on the policy rule in
otherwise completely empty rulesets? These counters are also available
elsewhere (route and interface statistics), and more intuitively so.
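
For instance, the per-interface byte and packet counters can be read
straight from /proc/net/dev; a minimal reader (error handling kept to a
minimum) could look like this:

#include <stdio.h>

int main(void)
{
	char line[512], name[32];
	unsigned long long rx_bytes, rx_packets;
	FILE *f = fopen("/proc/net/dev", "r");

	if (!f) {
		perror("/proc/net/dev");
		return 1;
	}
	/* skip the two header lines */
	fgets(line, sizeof(line), f);
	fgets(line, sizeof(line), f);
	/* the first two columns after the name are rx bytes and rx packets */
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, " %31[^:]: %llu %llu",
			   name, &rx_bytes, &rx_packets) == 3)
			printf("%-8s %llu rx bytes, %llu rx packets\n",
			       name, rx_bytes, rx_packets);
	}
	fclose(f);
	return 0;
}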

> Yes, even though we could make the ip_nat / iptable_nat split (which was
> introduced in 2.6.14) in a way that would have ip_nat.ko register those
> hooks that are implicitly needed, and iptable_nat register only the
> user-visible chain-related hooks.

That adds additional overhead, so this approach is a bit
counter-productive, I would say.

Regards
Henrik

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
  2005-09-28  8:37                       ` Andi Kleen
@ 2005-10-04 17:01                         ` Patrick McHardy
  2005-10-05 16:53                           ` Andi Kleen
  0 siblings, 1 reply; 50+ messages in thread
From: Patrick McHardy @ 2005-10-04 17:01 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Harald Welte, netdev, netfilter-devel, linux-kernel, Henrik Nordstrom

Andi Kleen wrote:
> In a sense it's even getting worse: for example, losing the CONFIG
> option to disable local conntrack (Patrick disabled it some time ago
> without even a comment on why he did it) has a really bad impact in some cases.

It was necessary to correctly handle locally generated ICMP errors.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
  2005-10-04 17:01                         ` Patrick McHardy
@ 2005-10-05 16:53                           ` Andi Kleen
  2005-10-07  2:38                             ` Harald Welte
  0 siblings, 1 reply; 50+ messages in thread
From: Andi Kleen @ 2005-10-05 16:53 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Harald Welte, netdev, netfilter-devel, linux-kernel, Henrik Nordstrom

On Tuesday 04 October 2005 19:01, Patrick McHardy wrote:
> Andi Kleen wrote:
> > In a sense it's even getting worse: for example, losing the CONFIG
> > option to disable local conntrack (Patrick disabled it some time ago
> > without even a comment on why he did it) has a really bad impact in some
> > cases.
>
> It was necessary to correctly handle locally generated ICMP errors.

Well, then you most likely wrecked local performance when it's enabled.

-Andi


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
  2005-10-07  2:38                             ` Harald Welte
@ 2005-10-06 17:59                               ` Andi Kleen
  2005-10-07 17:08                                 ` Patrick McHardy
  0 siblings, 1 reply; 50+ messages in thread
From: Andi Kleen @ 2005-10-06 17:59 UTC (permalink / raw)
  To: Harald Welte, Andi Kleen, Patrick McHardy, netdev,
	netfilter-devel, linux-kernel, Henrik Nordstrom

On Fri, Oct 07, 2005 at 04:38:02AM +0200, Harald Welte wrote:
> On Wed, Oct 05, 2005 at 06:53:31PM +0200, Andi Kleen wrote:
> > On Tuesday 04 October 2005 19:01, Patrick McHardy wrote:
> > > Andi Kleen wrote:
> > > > In a sense it's even getting worse: for example, losing the CONFIG
> > > > option to disable local conntrack (Patrick disabled it some time ago
> > > > without even a comment on why he did it) has a really bad impact in some
> > > > cases.
> > >
> > > It was necessary to correctly handle locally generated ICMP errors.
> > 
> > Well, then you most likely wrecked local performance when it's enabled.
> 
> so you would favour a system that incorrectly deals with ICMP errors but
> has higher performance?

I would favour a system where development doesn't lose sight of performance.
Perhaps there are other ways to fix this problem without impacting
performance unduly? Can you describe the problem in detail?

-Andi

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
  2005-10-05 16:53                           ` Andi Kleen
@ 2005-10-07  2:38                             ` Harald Welte
  2005-10-06 17:59                               ` Andi Kleen
  0 siblings, 1 reply; 50+ messages in thread
From: Harald Welte @ 2005-10-07  2:38 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Patrick McHardy, netdev, netfilter-devel, linux-kernel, Henrik Nordstrom

[-- Attachment #1: Type: text/plain, Size: 1021 bytes --]

On Wed, Oct 05, 2005 at 06:53:31PM +0200, Andi Kleen wrote:
> On Tuesday 04 October 2005 19:01, Patrick McHardy wrote:
> > Andi Kleen wrote:
> > > In a sense it's even getting worse: for example, losing the CONFIG
> > > option to disable local conntrack (Patrick disabled it some time ago
> > > without even a comment on why he did it) has a really bad impact in some
> > > cases.
> >
> > It was necessary to correctly handle locally generated ICMP errors.
> 
> Well, then you most likely wrecked local performance when it's enabled.

so you would favour a system that incorrectly deals with ICMP errors but
has higher performance?

-- 
- Harald Welte <laforge@netfilter.org>                 http://netfilter.org/
============================================================================
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
  2005-10-06 17:59                               ` Andi Kleen
@ 2005-10-07 17:08                                 ` Patrick McHardy
  2005-10-07 17:21                                   ` Andi Kleen
  0 siblings, 1 reply; 50+ messages in thread
From: Patrick McHardy @ 2005-10-07 17:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Harald Welte, netdev, netfilter-devel, linux-kernel, Henrik Nordstrom

Andi Kleen wrote:
> On Fri, Oct 07, 2005 at 04:38:02AM +0200, Harald Welte wrote:
> 
>>On Wed, Oct 05, 2005 at 06:53:31PM +0200, Andi Kleen wrote:
>>
>>>Well, then you most likely wrecked local performance when it's enabled.

There are lots of other hooks, and conntrack/NAT already have
quite a large negative influence on performance. Do you have numbers
showing that enabling this actually causes more than a slight
decrease in performance? Besides, most distributors enable all
these options anyway, so it only makes a difference for a small
group of users.

>>so you would favour a system that incorrectly deals with ICMP errors but
>>has higher performance?
> 
> I would favour a system where development doesn't lose sight of performance.

I don't think we do.

> Perhaps there are other ways to fix this problem without impacting
> performance unduly? Can you describe the problem in detail?

When an ICMP error is sent by the firewall itself, the inner
packet needs to be restored to its original state. That means
both DNAT and SNAT, which might have been applied, need to be
reversed. DNAT is reversed at places where we usually do
SNAT (POST_ROUTING); SNAT is reversed where DNAT is usually
done (PRE_ROUTING/LOCAL_OUT). Since locally generated packets
never go through PRE_ROUTING, it is done in LOCAL_OUT, which
required enabling NAT in LOCAL_OUT unconditionally. It might
be possible to move this to some different hook; I didn't
investigate it.
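
In pseudo-C, the reversal on the embedded header boils down to
something like this (structures and names invented for illustration;
the real code lives in the NAT hooks and also has to fix up the IP and
ICMP checksums):

/* Illustration only: restore the pre-NAT addresses in the IP header
 * embedded in an outgoing ICMP error.  Not kernel code. */
#include <stdint.h>
#include <stdio.h>

struct inner_ip {	/* stand-in for the embedded struct iphdr */
	uint32_t saddr;
	uint32_t daddr;
};

struct nat_tuple {	/* pre-NAT addresses, known from conntrack */
	uint32_t orig_saddr;
	uint32_t orig_daddr;
};

/* Undo SNAT/DNAT on the packet quoted inside the ICMP payload, so the
 * recipient can match it against what it originally sent. */
static void un_nat_inner(struct inner_ip *ip, const struct nat_tuple *t)
{
	ip->saddr = t->orig_saddr;
	ip->daddr = t->orig_daddr;
}

int main(void)
{
	struct inner_ip ip = { 0x0a000001, 0xc0a80001 };	/* NATed */
	struct nat_tuple t = { 0xac100001, 0x08080808 };	/* original */

	un_nat_inner(&ip, &t);
	printf("restored: %08x -> %08x\n",
	       (unsigned)ip.saddr, (unsigned)ip.daddr);
	return 0;
}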

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
  2005-10-07 17:08                                 ` Patrick McHardy
@ 2005-10-07 17:21                                   ` Andi Kleen
  2005-10-07 17:50                                     ` Patrick McHardy
  0 siblings, 1 reply; 50+ messages in thread
From: Andi Kleen @ 2005-10-07 17:21 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Harald Welte, netdev, netfilter-devel, linux-kernel, Henrik Nordstrom

On Friday 07 October 2005 19:08, Patrick McHardy wrote:

> There are lots of other hooks and conntrack/NAT already have a
> quite large negative influence on performance. Do you have numbers
> that show that enabling this actually causes more than a slight
> decrease in performance? Besides, most distributors enable all
> these options anyway, so it only makes a difference for a small
> group of users.

I don't know about other distributions, but SUSE at some point
found that some web benchmarks dramatically improved in the default
configuration when local conntrack was off. It has been off ever since.


> > Perhaps there are other ways to fix this problem without impacting
> > performance unduly? Can you describe the problem in detail?
>
> When an ICMP error is send by the firewall itself, the inner
> packet needs to be restored to its original state. That means
> both DNAT and SNAT which might have been applied need to be
> reversed. DNAT is reversed at places where we usually do
> SNAT (POST_ROUTING), SNAT is reversed where usually DNAT is
> done (PRE_ROUTING/LOCAL_OUT). Since locally generated packets
> never go through PRE_ROUTING, it is done in LOCAL_OUT, which
> required enabling NAT in LOCAL_OUT unconditionally. It might
> be possible to move this to some different hook, I didn't
> investigate it.

This sounds wrong anyway. You shouldn't be touching conntrack state for ICMPs
generated by routers, because they can be temporary errors (e.g. during a
routing flap when the route moves). The only safe way to handle this is to wait
for the timeout, which doesn't need local handling. And the firewall cannot be
an endhost here.

-Andi


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance
  2005-10-07 17:21                                   ` Andi Kleen
@ 2005-10-07 17:50                                     ` Patrick McHardy
  0 siblings, 0 replies; 50+ messages in thread
From: Patrick McHardy @ 2005-10-07 17:50 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Harald Welte, netdev, netfilter-devel, linux-kernel, Henrik Nordstrom

Andi Kleen wrote:
> On Friday 07 October 2005 19:08, Patrick McHardy wrote:
> 
> I don't know about other distributions, but SUSE at some point
> found that some web benchmarks dramatically improved in the default
> configuration when local conntrack was off. It has been off ever since.

Interesting ..

>>When an ICMP error is sent by the firewall itself, the inner
>>packet needs to be restored to its original state. That means
>>both DNAT and SNAT, which might have been applied, need to be
>>reversed. DNAT is reversed at places where we usually do
>>SNAT (POST_ROUTING); SNAT is reversed where DNAT is usually
>>done (PRE_ROUTING/LOCAL_OUT). Since locally generated packets
>>never go through PRE_ROUTING, it is done in LOCAL_OUT, which
>>required enabling NAT in LOCAL_OUT unconditionally. It might
>>be possible to move this to some different hook; I didn't
>>investigate it.
> 
> This sounds wrong anyway. You shouldn't be touching conntrack state for ICMPs
> generated by routers, because they can be temporary errors (e.g. during a
> routing flap when the route moves). The only safe way to handle this is to wait
> for the timeout, which doesn't need local handling. And the firewall cannot be
> an endhost here.

You misunderstood me. It's not about conntrack state, it's about
the inner packet contained in the ICMP message. If it was
NATed by the box itself before the ICMP message was generated,
it needs to be restored to its original state; otherwise the
recipient will ignore it.

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2005-10-07 18:00 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <432EF0C5.5090908@cosmosbay.com>
     [not found] ` <200509191948.55333.ak@suse.de>
     [not found]   ` <432FDAC5.3040801@cosmosbay.com>
     [not found]     ` <200509201830.20689.ak@suse.de>
2005-09-20 21:45       ` [PATCH] Adds sys_set_mempolicy() in include/linux/syscalls.h , Re: [PATCH, netfilter] NUMA aware ipv4/netfilter/ip_tables.c Eric Dumazet
2005-09-20 21:46         ` [PATCH] Adds sys_set_mempolicy() in include/linux/syscalls.h Eric Dumazet
2005-09-21 21:24           ` [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance Eric Dumazet
2005-09-21 22:43             ` Christoph Lameter
2005-09-22  0:34               ` David S. Miller
2005-09-22  1:44                 ` Christoph Lameter
2005-09-22 12:11                   ` Eric Dumazet
2005-09-22 12:49                     ` Christoph Hellwig
2005-09-22 12:54                       ` Andi Kleen
2005-09-22 12:58                         ` Christoph Hellwig
2005-09-22 13:05                           ` Andi Kleen
2005-09-22 15:37                             ` Christoph Lameter
2005-09-22 15:50                               ` Eric Dumazet
2005-09-22 15:55                                 ` Christoph Lameter
2005-09-23 17:11                                 ` Harald Welte
2005-09-23 17:44                                   ` Christoph Lameter
2005-09-23 18:04                                     ` Dave Hansen
2005-09-26 17:58                                       ` vmalloc_node Christoph Lameter
2005-09-26 18:10                                         ` vmalloc_node Dave Hansen
2005-09-23 17:47                                   ` [PATCH 0/3] netfilter : 3 patches to boost ip_tables performance Eric Dumazet
2005-09-23 18:00                                     ` Kyle Moffett
2005-09-22  4:18             ` James Morris
2005-09-22  5:07               ` Eric Dumazet
2005-09-22 13:03             ` Andi Kleen
2005-09-22 13:30               ` Eric Dumazet
2005-09-23 17:09               ` Harald Welte
2005-09-27 16:23                 ` Andi Kleen
2005-09-28  0:25                   ` Henrik Nordstrom
2005-09-28  8:32                     ` Harald Welte
2005-09-28  8:37                       ` Andi Kleen
2005-10-04 17:01                         ` Patrick McHardy
2005-10-05 16:53                           ` Andi Kleen
2005-10-07  2:38                             ` Harald Welte
2005-10-06 17:59                               ` Andi Kleen
2005-10-07 17:08                                 ` Patrick McHardy
2005-10-07 17:21                                   ` Andi Kleen
2005-10-07 17:50                                     ` Patrick McHardy
2005-09-28 10:34                       ` Henrik Nordstrom
2005-09-21 21:29           ` [PATCH 1/3] " Eric Dumazet
2005-09-22 12:57             ` Harald Welte
2005-09-22 13:17               ` Eric Dumazet
2005-09-21 21:32           ` [PATCH 2/3] " Eric Dumazet
2005-09-22 12:48             ` Harald Welte
2005-09-22 13:05               ` Eric Dumazet
2005-09-23  4:02                 ` Willy Tarreau
2005-09-23  5:14                   ` Eric Dumazet
2005-09-23 11:33                     ` Willy Tarreau
2005-09-23 14:00                   ` Tim Mattox
2005-09-21 21:37           ` [PATCH 3/3] " Eric Dumazet
2005-09-22 12:50             ` Harald Welte

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).