* [PATCH + RFC] neighbour/ARP cache scalability
@ 2004-09-20 22:51 Harald Welte
From: Harald Welte @ 2004-09-20 22:51 UTC (permalink / raw)
  To: Linux Netdev List

Hi!

I've recently had some problems with the neighbour cache on systems
which have large layer-2 networks (such as 10/12 ethernet interfaces,
most of them attached to networks with a /16 netmask).  Yes, I know,
networks of that size are a bad design anyway.  But at the same time
there are people using such strange setups ...

The current garbage collection mechanism's ultimate goal is to prevent
ARP floods.  It therefore never expires INCOMPLETE entries from the
neighbour cache.  I've seen systems in a state that I believe to be
complete starvation of neighbour cache entries, while gc is unable
to evict anything => forcing one to increase gc_thresh2/gc_thresh3 to
large values.
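
To illustrate the starvation: the allocation path that enforces the gc
thresholds looks roughly like this (paraphrased from net/core/neighbour.c
from memory, so take the exact field names with a grain of salt):

	/* sketch of neigh_alloc(): once we sit at gc_thresh3 and
	 * neigh_forced_gc() cannot evict anything (e.g. everything is
	 * INCOMPLETE and younger than retrans_time), the allocation
	 * simply fails and the table is starved */
	unsigned long now = jiffies;

	if (atomic_read(&tbl->entries) > tbl->gc_thresh3 ||
	    (atomic_read(&tbl->entries) > tbl->gc_thresh2 &&
	     time_after(now, tbl->last_flush + 5 * HZ))) {
		if (!neigh_forced_gc(tbl) &&
		    atomic_read(&tbl->entries) > tbl->gc_thresh3)
			return NULL;
	}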

As a result, on systems with large networks (think of even less than 16
bit netmasks!) the neighbour table can grow quite significantly,
especially if some jerk runs pktgen with random destinations, and we
start expiring 'real' neighbours in favour of incomplete ones.  

This large number of ncache entries is distributed over only 32 buckets,
which is quite unsatisfactory, especially if we consider that the
distribution within the hash has been reported to be quite sub-optimal [1].

Tim Gardner has already sent two patches for 2.4.x addressing this issue
some time ago [1], [2] - but neither of them was integrated.

I've now ported [2] to 2.6.9-rc2-bk6, and dropped [1]. Furthermore, I've
added:
	- neigh_hash_bits boot parameter (if not specified, auto-sizing
	  hash based on system ram; see the example below)
	- jenkins hash for ARP hash function, like Dave indicated [3]
	- make number of buckets readable via /proc/
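
To give a feel for the auto-sizing: it ends up with roughly one hash
bucket per MB of RAM, rounded down to a power of two.  The sizing loop
from the patch below, worked through for a 512 MB machine:

	ram_mb = 512;			/* 512 MB of RAM	    */
	goal = 31;
	while ((1 << goal) > ram_mb)
		goal--;			/* terminates with goal = 9 */
	/* => hash table of 1 << 9 = 512 buckets		    */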

The patch is not yet intended for mainline inclusion, since I want to
have feedback on

1) should I try to use jhash for ipv6, too?

2) Dave has indicated that there should be an upper limit for hash
   buckets.  What is considered a reasonable upper bound, even for very
   large systems?

3) What do you think about removing gc_interval in favour of the dynamic
   interval calculation, combined with the 'one bucket per gc run'
   approach that Tim selected?

4) Shouldn't we set gc_thresh3 automatically, based on the netmask of
   the interfaces that are IFF_RUNNING?

5) What is the proposed solution for IPv6, when you have /48 or /64 bit
   prefixes and systems become even more vulnerable to neighbour cache
   attacks?  

6) Since everybody agrees the current scheme of bucket-sizing based on
   memory only is a bad idea, do we want to unify this bucket sizing
   code, so people have one single knob (boot option) which they can use
   to give the system some metrics on how large it should scale all the
   network hash tables?

[1] http://oss.sgi.com/projects/netdev/archive/2004-03/msg00265.html
[2] http://oss.sgi.com/projects/netdev/archive/2004-03/msg00250.html
[3] http://oss.sgi.com/projects/netdev/archive/2004-03/msg00279.html


diff -Nru --exclude-from=/sunbeam/home/laforge/scripts/dontdiff linux-2.6.9-rc2-bk6/net/atm/clip.c linux-2.6.9-rc2-bk6-ncache/net/atm/clip.c
--- linux-2.6.9-rc2-bk6/net/atm/clip.c	2004-09-21 00:23:16.000000000 +0200
+++ linux-2.6.9-rc2-bk6-ncache/net/atm/clip.c	2004-09-20 20:25:48.000000000 +0200
@@ -130,7 +130,7 @@
 
 	/*DPRINTK("idle_timer_check\n");*/
 	write_lock(&clip_tbl.lock);
-	for (i = 0; i <= NEIGH_HASHMASK; i++) {
+	for (i = 0; i < clip_tbl.num_hash_buckets; i++) {
 		struct neighbour **np;
 
 		for (np = &clip_tbl.hash_buckets[i]; *np;) {
@@ -341,6 +341,7 @@
 	return 0;
 }
 
+static struct neigh_table clip_tbl;
 static u32 clip_hash(const void *pkey, const struct net_device *dev)
 {
 	u32 hash_val;
@@ -349,7 +350,7 @@
 	hash_val ^= (hash_val>>16);
 	hash_val ^= hash_val>>8;
 	hash_val ^= hash_val>>3;
-	hash_val = (hash_val^dev->ifindex)&NEIGH_HASHMASK;
+	hash_val = (hash_val^dev->ifindex)&(clip_tbl.num_hash_buckets-1);
 
 	return hash_val;
 }
@@ -894,7 +895,7 @@
 {
 	void *v = NULL;
 
-	for (; state->bucket <= NEIGH_HASHMASK; state->bucket++) {
+	for (; state->bucket < clip_tbl.num_hash_buckets; state->bucket++) {
 		for (; state->n; state->n = state->n->next) {
 			v = arp_vcc_walk(state, NEIGH2ENTRY(state->n), &l);
 			if (v)
diff -Nru --exclude-from=/sunbeam/home/laforge/scripts/dontdiff linux-2.6.9-rc2-bk6/net/core/neighbour.c linux-2.6.9-rc2-bk6-ncache/net/core/neighbour.c
--- linux-2.6.9-rc2-bk6/net/core/neighbour.c	2004-09-21 00:23:16.000000000 +0200
+++ linux-2.6.9-rc2-bk6-ncache/net/core/neighbour.c	2004-09-20 22:13:56.000000000 +0200
@@ -12,6 +12,10 @@
  *
  *	Fixes:
  *	Vitaly E. Lavrov	releasing NULL neighbor in neigh_add.
+ *
+ *	Changes:
+ *	2004-09-17: Make cache size dynamic + configurable 
+ *	  (Tim Gardner <timg@tpi.com> & Harald Welte <laforge@netfilter.org>)
  */
 
 #include <linux/config.h>
@@ -47,6 +51,9 @@
 #define NEIGH_PRINTK2 NEIGH_PRINTK
 #endif
 
+static unsigned int neigh_hash_bits;
+module_param(neigh_hash_bits, int, 0400);
+
 static void neigh_timer_handler(unsigned long arg);
 #ifdef CONFIG_ARPD
 static void neigh_app_notify(struct neighbour *n);
@@ -113,7 +120,7 @@
 	int shrunk = 0;
 	int i;
 
-	for (i = 0; i <= NEIGH_HASHMASK; i++) {
+	for (i = 0; i < tbl->num_hash_buckets; i++) {
 		struct neighbour *n, **np;
 
 		np = &tbl->hash_buckets[i];
@@ -177,7 +184,7 @@
 
 	write_lock_bh(&tbl->lock);
 
-	for (i=0; i <= NEIGH_HASHMASK; i++) {
+	for (i=0; i < tbl->num_hash_buckets; i++) {
 		struct neighbour *n, **np;
 
 		np = &tbl->hash_buckets[i];
@@ -204,7 +211,7 @@
 
 	write_lock_bh(&tbl->lock);
 
-	for (i = 0; i <= NEIGH_HASHMASK; i++) {
+	for (i = 0; i < tbl->num_hash_buckets; i++) {
 		struct neighbour *n, **np = &tbl->hash_buckets[i];
 
 		while ((n = *np) != NULL) {
@@ -545,9 +552,8 @@
 static void neigh_periodic_timer(unsigned long arg)
 {
 	struct neigh_table *tbl = (struct neigh_table *)arg;
+	struct neighbour *n, **np;
 	unsigned long now = jiffies;
-	int i;
-
 
 	write_lock(&tbl->lock);
 
@@ -563,41 +569,44 @@
 				neigh_rand_reach_time(p->base_reachable_time);
 	}
 
-	for (i = 0; i <= NEIGH_HASHMASK; i++) {
-		struct neighbour *n, **np;
+	tbl->curr_hash_bucket &= (tbl->num_hash_buckets-1);
+	np = &tbl->hash_buckets[tbl->curr_hash_bucket++];
 
-		np = &tbl->hash_buckets[i];
-		while ((n = *np) != NULL) {
-			unsigned state;
+	while ((n = *np) != NULL) {
+		unsigned state;
 
-			write_lock(&n->lock);
+		write_lock(&n->lock);
 
-			state = n->nud_state;
-			if (state & (NUD_PERMANENT | NUD_IN_TIMER)) {
-				write_unlock(&n->lock);
-				goto next_elt;
-			}
+		state = n->nud_state;
+		if (state & (NUD_PERMANENT | NUD_IN_TIMER)) {
+			write_unlock(&n->lock);
+			goto next_elt;
+		}
 
-			if (time_before(n->used, n->confirmed))
-				n->used = n->confirmed;
+		if (time_before(n->used, n->confirmed))
+			n->used = n->confirmed;
 
-			if (atomic_read(&n->refcnt) == 1 &&
-			    (state == NUD_FAILED ||
-			     time_after(now, n->used + n->parms->gc_staletime))) {
-				*np = n->next;
-				n->dead = 1;
-				write_unlock(&n->lock);
-				neigh_release(n);
-				continue;
-			}
+		if (atomic_read(&n->refcnt) == 1 &&
+		    (state == NUD_FAILED ||
+		     time_after(now, n->used + n->parms->gc_staletime))) {
+			*np = n->next;
+			n->dead = 1;
 			write_unlock(&n->lock);
+			neigh_release(n);
+			continue;
+		}
+		write_unlock(&n->lock);
 
 next_elt:
-			np = &n->next;
-		}
+		np = &n->next;
 	}
-
-	mod_timer(&tbl->gc_timer, now + tbl->gc_interval);
+ 	/* Cycle through all hash buckets every base_reachable_time/2 ticks.
+ 	 * ARP entry timeouts range from 1/2 base_reachable_time to 3/2
+ 	 * base_reachable_time. */
+ 
+ 	mod_timer(&tbl->gc_timer, now + 
+		 ((tbl->parms.base_reachable_time>>1)/(tbl->num_hash_buckets)));
+ 
 	write_unlock(&tbl->lock);
 }
 
@@ -1205,6 +1214,37 @@
 void neigh_table_init(struct neigh_table *tbl)
 {
 	unsigned long now = jiffies;
+	unsigned int goal = neigh_hash_bits;
+
+	/* Allocate a power of 2 hash buckets for each power of 2 MB of RAM. */
+	if (!goal) {
+		unsigned int ram_mb = (num_physpages*PAGE_SIZE) / (1024*1024);
+		goal = 31;
+		while ((1<<goal) > ram_mb)
+		{
+			goal--;
+		}
+	}
+
+	tbl->hash_buckets = NULL;
+	while (goal && (!tbl->hash_buckets)) {
+		tbl->num_hash_buckets = (1<<goal);
+		tbl->hash_buckets = kmalloc(sizeof(struct neighbour *)
+					    *tbl->num_hash_buckets, GFP_ATOMIC);
+		goal--;
+	}
+
+	if (tbl->hash_buckets == NULL)
+		panic("%s: Could not allocate memory for hash buckets "
+		      "of table %s.\n", __FUNCTION__, tbl->id);
+	memset(tbl->hash_buckets, 0,
+	       sizeof(struct neighbour *)*tbl->num_hash_buckets);
+
+	if (neigh_hash_bits && 
+	    (tbl->num_hash_buckets != (1<<neigh_hash_bits)))
+		printk(KERN_WARNING "%s: Couldn't allocate %u hash buckets, "
+		       "did %u instead.\n", __FUNCTION__,
+			1<<neigh_hash_bits, tbl->num_hash_buckets);
 
 	atomic_set(&tbl->parms.refcnt, 1);
 	INIT_RCU_HEAD(&tbl->parms.rcu_head);
@@ -1224,8 +1264,7 @@
 	init_timer(&tbl->gc_timer);
 	tbl->gc_timer.data     = (unsigned long)tbl;
 	tbl->gc_timer.function = neigh_periodic_timer;
-	tbl->gc_timer.expires  = now + tbl->gc_interval +
-				 tbl->parms.reachable_time;
+	tbl->gc_timer.expires  = now + 1;
 	add_timer(&tbl->gc_timer);
 
 	init_timer(&tbl->proxy_timer);
@@ -1439,7 +1478,7 @@
 	int rc, h, s_h = cb->args[1];
 	int idx, s_idx = idx = cb->args[2];
 
-	for (h = 0; h <= NEIGH_HASHMASK; h++) {
+	for (h = 0; h < tbl->num_hash_buckets; h++) {
 		if (h < s_h)
 			continue;
 		if (h > s_h)
@@ -1628,14 +1667,6 @@
 			.proc_handler	= &proc_dointvec_userhz_jiffies,
 		},
 		{
-			.ctl_name	= NET_NEIGH_GC_INTERVAL,
-			.procname	= "gc_interval",
-			.maxlen		= sizeof(int),
-			.mode		= 0644,
-			.proc_handler	= &proc_dointvec_jiffies,
-			.strategy	= &sysctl_jiffies,
-		},
-		{
 			.ctl_name	= NET_NEIGH_GC_THRESH1,
 			.procname	= "gc_thresh1",
 			.maxlen		= sizeof(int),
@@ -1656,6 +1687,13 @@
 			.mode		= 0644,
 			.proc_handler	= &proc_dointvec,
 		},
+		{
+			.ctl_name	= NET_NEIGH_NUM_HASH_BUCKETS,
+			.procname	= "hash_buckets",
+			.maxlen		= sizeof(int),
+			.mode		= 0400,
+			.proc_handler	= &proc_dointvec,
+		},
 	},
 	.neigh_dev = {
 		{
@@ -1719,10 +1757,13 @@
 		t->neigh_dev[0].ctl_name = dev->ifindex;
 		memset(&t->neigh_vars[12], 0, sizeof(ctl_table));
 	} else {
+		/* If you listen carefully, you can actually hear this
+		 * code suck */
 		t->neigh_vars[12].data = (int *)(p + 1);
 		t->neigh_vars[13].data = (int *)(p + 1) + 1;
 		t->neigh_vars[14].data = (int *)(p + 1) + 2;
 		t->neigh_vars[15].data = (int *)(p + 1) + 3;
+		t->neigh_vars[16].data = (int *)(p + 1) + 4;
 	}
 
 	dev_name = net_sysctl_strdup(dev_name_source);
diff -Nru --exclude-from=/sunbeam/home/laforge/scripts/dontdiff linux-2.6.9-rc2-bk6/net/decnet/dn_neigh.c linux-2.6.9-rc2-bk6-ncache/net/decnet/dn_neigh.c
--- linux-2.6.9-rc2-bk6/net/decnet/dn_neigh.c	2004-09-13 07:32:55.000000000 +0200
+++ linux-2.6.9-rc2-bk6-ncache/net/decnet/dn_neigh.c	2004-09-20 20:25:48.000000000 +0200
@@ -128,7 +128,7 @@
 	hash_val ^= (hash_val >> 10);
 	hash_val ^= (hash_val >> 3);
 
-	return hash_val & NEIGH_HASHMASK;
+	return hash_val & (dn_neigh_table.num_hash_buckets-1);
 }
 
 static int dn_neigh_construct(struct neighbour *neigh)
@@ -526,7 +526,7 @@
 
 	read_lock_bh(&tbl->lock);
 
-	for(i = 0; i < NEIGH_HASHMASK; i++) {
+	for(i = 0; i < tbl->num_hash_buckets; i++) {
 		for(neigh = tbl->hash_buckets[i]; neigh != NULL; neigh = neigh->next) {
 			if (neigh->dev != dev)
 				continue;
@@ -567,7 +567,7 @@
 	struct neighbour *n = NULL;
 
 	for(state->bucket = 0;
-	    state->bucket <= NEIGH_HASHMASK;
+	    state->bucket < dn_neigh_table.num_hash_buckets;
 	    ++state->bucket) {
 		n = dn_neigh_table.hash_buckets[state->bucket];
 		if (n)
@@ -586,7 +586,7 @@
 try_again:
 	if (n)
 		goto out;
-	if (++state->bucket > NEIGH_HASHMASK)
+	if (++state->bucket >= dn_neigh_table.num_hash_buckets)
 		goto out;
 	n = dn_neigh_table.hash_buckets[state->bucket];
 	goto try_again;
diff -Nru --exclude-from=/sunbeam/home/laforge/scripts/dontdiff linux-2.6.9-rc2-bk6/net/ipv4/arp.c linux-2.6.9-rc2-bk6-ncache/net/ipv4/arp.c
--- linux-2.6.9-rc2-bk6/net/ipv4/arp.c	2004-09-21 00:23:16.000000000 +0200
+++ linux-2.6.9-rc2-bk6-ncache/net/ipv4/arp.c	2004-09-20 20:31:21.000000000 +0200
@@ -71,6 +71,7 @@
  *					arp_xmit so intermediate drivers like
  *					bonding can change the skb before
  *					sending (e.g. insert 8021q tag).
+ *		Harald Welte    :	convert to make use of jenkins hash
  */
 
 #include <linux/module.h>
@@ -97,6 +98,7 @@
 #include <linux/init.h>
 #include <linux/net.h>
 #include <linux/rcupdate.h>
+#include <linux/jhash.h>
 #ifdef CONFIG_SYSCTL
 #include <linux/sysctl.h>
 #endif
@@ -124,6 +126,9 @@
 
 #include <linux/netfilter_arp.h>
 
+static u32 arp_hash_rnd;
+static int arp_hash_rnd_initted;
+
 /*
  *	Interface to generic neighbour cache.
  */
@@ -194,7 +199,6 @@
 		.proxy_qlen =		64,
 		.locktime =		1 * HZ,
 	},
-	.gc_interval =	30 * HZ,
 	.gc_thresh1 =	128,
 	.gc_thresh2 =	512,
 	.gc_thresh3 =	1024,
@@ -223,15 +227,13 @@
 
 static u32 arp_hash(const void *pkey, const struct net_device *dev)
 {
-	u32 hash_val;
-
-	hash_val = *(u32*)pkey;
-	hash_val ^= (hash_val>>16);
-	hash_val ^= hash_val>>8;
-	hash_val ^= hash_val>>3;
-	hash_val = (hash_val^dev->ifindex)&NEIGH_HASHMASK;
+	if (unlikely(!arp_hash_rnd_initted)) {
+		get_random_bytes(&arp_hash_rnd, 4);
+		arp_hash_rnd_initted = 1;
+	}
 
-	return hash_val;
+	return (jhash_2words(*(u32 *)pkey, dev->ifindex,
+			     arp_hash_rnd) & (arp_tbl.num_hash_buckets - 1));
 }
 
 static int arp_constructor(struct neighbour *neigh)
@@ -1281,7 +1283,7 @@
 	state->is_pneigh = 0;
 
 	for (state->bucket = 0;
-	     state->bucket <= NEIGH_HASHMASK;
+	     state->bucket < arp_tbl.num_hash_buckets;
 	     ++state->bucket) {
 		n = arp_tbl.hash_buckets[state->bucket];
 		while (n && !(n->nud_state & ~NUD_NOARP))
@@ -1307,7 +1309,7 @@
 
 	if (n)
 		goto out;
-	if (++state->bucket > NEIGH_HASHMASK)
+	if (++state->bucket >= arp_tbl.num_hash_buckets)
 		goto out;
 	n = arp_tbl.hash_buckets[state->bucket];
 	goto try_again;
diff -Nru --exclude-from=/sunbeam/home/laforge/scripts/dontdiff linux-2.6.9-rc2-bk6/net/ipv6/ndisc.c linux-2.6.9-rc2-bk6-ncache/net/ipv6/ndisc.c
--- linux-2.6.9-rc2-bk6/net/ipv6/ndisc.c	2004-09-21 00:23:16.000000000 +0200
+++ linux-2.6.9-rc2-bk6-ncache/net/ipv6/ndisc.c	2004-09-20 20:25:48.000000000 +0200
@@ -147,7 +147,6 @@
 		.proxy_delay =		(8 * HZ) / 10,
 		.proxy_qlen =		64,
 	},
-	.gc_interval =	  30 * HZ,
 	.gc_thresh1 =	 128,
 	.gc_thresh2 =	 512,
 	.gc_thresh3 =	1024,
@@ -276,7 +275,7 @@
 	hash_val ^= (hash_val>>16);
 	hash_val ^= hash_val>>8;
 	hash_val ^= hash_val>>3;
-	hash_val = (hash_val^dev->ifindex)&NEIGH_HASHMASK;
+	hash_val = (hash_val^dev->ifindex)&(nd_tbl.num_hash_buckets-1);
 
 	return hash_val;
 }
diff -Nru --exclude-from=/sunbeam/home/laforge/scripts/dontdiff linux-2.6.9-rc2-bk6/include/net/neighbour.h linux-2.6.9-rc2-bk6-ncache/include/net/neighbour.h
--- linux-2.6.9-rc2-bk6/include/net/neighbour.h	2004-09-21 00:23:16.000000000 +0200
+++ linux-2.6.9-rc2-bk6-ncache/include/net/neighbour.h	2004-09-20 22:12:50.000000000 +0200
@@ -139,7 +139,6 @@
 	u8			key[0];
 };
 
-#define NEIGH_HASHMASK		0x1F
 #define PNEIGH_HASHMASK		0xF
 
 /*
@@ -160,11 +159,12 @@
 	void			(*proxy_redo)(struct sk_buff *skb);
 	char			*id;
 	struct neigh_parms	parms;
-	/* HACK. gc_* shoul follow parms without a gap! */
-	int			gc_interval;
+	/* UGLY HACK. gc_* and num_hash_buckets must follow parms without a
+	 * gap! */
 	int			gc_thresh1;
 	int			gc_thresh2;
 	int			gc_thresh3;
+	int			num_hash_buckets;
 	unsigned long		last_flush;
 	struct timer_list 	gc_timer;
 	struct timer_list 	proxy_timer;
@@ -175,7 +175,8 @@
 	struct neigh_parms	*parms_list;
 	kmem_cache_t		*kmem_cachep;
 	struct neigh_statistics	stats;
-	struct neighbour	*hash_buckets[NEIGH_HASHMASK+1];
+	struct neighbour	**hash_buckets;
+	int			curr_hash_bucket; /* For GC */
 	struct pneigh_entry	*phash_buckets[PNEIGH_HASHMASK+1];
 };
 
diff -Nru --exclude-from=/sunbeam/home/laforge/scripts/dontdiff linux-2.6.9-rc2-bk6/include/linux/sysctl.h linux-2.6.9-rc2-bk6-ncache/include/linux/sysctl.h
--- linux-2.6.9-rc2-bk6/include/linux/sysctl.h	2004-09-13 07:32:48.000000000 +0200
+++ linux-2.6.9-rc2-bk6-ncache/include/linux/sysctl.h	2004-09-20 22:14:47.000000000 +0200
@@ -494,7 +494,8 @@
 	NET_NEIGH_GC_INTERVAL=13,
 	NET_NEIGH_GC_THRESH1=14,
 	NET_NEIGH_GC_THRESH2=15,
-	NET_NEIGH_GC_THRESH3=16
+	NET_NEIGH_GC_THRESH3=16,
+	NET_NEIGH_NUM_HASH_BUCKETS=17
 };
 
 /* /proc/sys/net/ipx */


-- 
- Harald Welte <laforge@gnumonks.org>               http://www.gnumonks.org/
============================================================================
Programming is like sex: One mistake and you have to support it your lifetime

* Re: [PATCH + RFC] neighbour/ARP cache scalability
From: David S. Miller @ 2004-09-21  6:29 UTC (permalink / raw)
  To: Harald Welte; +Cc: netdev


Thanks Harald.

I have some ideas about how we might refine your approach,
and I'll work on that and report back to you tomorrow.

* Re: [PATCH + RFC] neighbour/ARP cache scalability
From: Andre Tomt @ 2004-09-21  7:04 UTC (permalink / raw)
  To: Harald Welte; +Cc: Linux Netdev List

Harald Welte wrote:
> As a result, on systems with large networks (think of even less than 16
> bit netmasks!) the neighbour table can grow quite significantly,
> especially if some jerk runs pktgen with random destinations, and we
> start expiring 'real' neighbours in favour of incomplete ones.  

Or just regular worms/viruses traversing the Internet. Usually in setups 
I've seen, the arp cache overflows within seconds even on "smaller" 
prefixes like /21. The networks did get redesigned properly as a result 
of this, however.

* [PATCH + RFC] neighbour/ARP cache scalability
From: Robert Olsson @ 2004-09-21  7:37 UTC (permalink / raw)
  To: Harald Welte; +Cc: Linux Netdev List


Harald Welte writes:

 > 5) What is the proposed solution for IPv6, when you have /48 or /64 bit
 >    prefixes and systems become even more vulnerable to neighbour cache
 >    attacks?  

 Yep. I was asking the IPv6 designers about this. A hell of a lot of
 filters, or what?

 
						--ro

* Re: [PATCH + RFC] neighbour/ARP cache scalability
From: Andi Kleen @ 2004-09-21  8:58 UTC (permalink / raw)
  To: Harald Welte; +Cc: Linux Netdev List

> 6) Since everybody agrees the current scheme of bucket-sizing based on
>    memory only is a bad idea, do we want to unify this bucket sizing
>    code, so people have one single knob (boot option) which they can use
>    to give the system some metrics on how large it should scale all the
>    network hash tables?

I think this is the right way to go. In fact I would go one step
further and add a single knob for all hash tables.
This requires putting all the cut'n'pasted hash table setup
code into a single utility function.
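
Something like this, as a rough sketch (all names made up):

	/* common helper: size any networking hash table from available
	 * RAM; 'scale_adjust' would come from a single boot option and
	 * shift all tables up or down together */
	static unsigned int net_hash_buckets(int max_bits, int scale_adjust)
	{
		unsigned long ram_mb = num_physpages >> (20 - PAGE_SHIFT);
		int bits = 0;

		while (bits < max_bits && (1UL << (bits + 1)) <= ram_mb)
			bits++;
		bits += scale_adjust;
		if (bits < 0)
			bits = 0;
		if (bits > max_bits)
			bits = max_bits;
		return 1U << bits;
	}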

-Andi

* Re: [PATCH + RFC] neighbour/ARP cache scalability
From: Pekka Savola @ 2004-09-21 11:19 UTC (permalink / raw)
  To: Harald Welte; +Cc: Linux Netdev List

On Tue, 21 Sep 2004, Harald Welte wrote:
> 1) should I try to use jhash for ipv6, too?
> 
> 2) Dave has indicated that there should be an upper limit for hash
>    buckets.  What is considered a reasonable upper bound, even for very
>    large systems?
[...]
> 5) What is the proposed solution for IPv6, when you have /48 or /64 bit
>    prefixes and systems become even more vulnerable to neighbour cache
>    attacks?  

The situation with IPv6 is not much different than with IPv4.

It's better in the sense that nobody will be portscanning the whole 
address space or subnets as a means to look for nodes.  So, the 
viruses, worms, exploits etc. will need to use other techniques, so 
the practical need is lower. [and those nodes which try and fall over 
from resource exhaustion .. well, they deserve it ;-)]

It's worse in the sense that there is more space in each subnet for
doing aggressive probing -- but this may not be a big issue with a
good algorithm and a threshold.

In short, I don't think there needs to be anything special for IPv6.  
Just the same mechanisms as for IPv4 -- at some threshold, start 
garbage collecting more aggressively, using a "least-recently-used" 
algorithm (or the like). 

To constrain remote resource exhaustion exploits, please make sure 
that the algorithm is sufficient to also deal with the threat 
described in RFC 3756, section 4.3.2.

-- 
Pekka Savola                 "You each name yourselves king, yet the
Netcore Oy                    kingdom bleeds."
Systems. Networks. Security. -- George R.R. Martin: A Clash of Kings

* Re: [PATCH + RFC] neighbour/ARP cache scalability
From: Harald Welte @ 2004-09-21 13:49 UTC (permalink / raw)
  To: Pekka Savola; +Cc: Linux Netdev List

On Tue, Sep 21, 2004 at 02:19:52PM +0300, Pekka Savola wrote:
> On Tue, 21 Sep 2004, Harald Welte wrote:
> > 1) should I try to use jhash for ipv6, too?
> > 
> > 2) Dave has indicated that there should be an upper limit for hash
> >    buckets.  What is considered a reasonable upper bound, even for very
> >    large systems?
> [...]
> > 5) What is the proposed solution for IPv6, when you have /48 or /64 bit
> >    prefixes and systems become even more vulnerable to neighbour cache
> >    attacks?  
> 
> The situation with IPv6 is not much different than with IPv4.

I disagree (see below).

> It's better in the sense that nobody will be portscanning the whole 
> address space or subnets as a means to look for nodes.  

I agree, but people will do it as a means to DoS the routers...

> So, the viruses, worms, exploits etc. will need to use other
> techniques, so the practical need is lower.

Just because worms cannot use this mechanism anymore doesn't mean that
it will not happen, initiated manually by somebody who wants to DoS your
routers.

> It's worse in the sense that there is more space in each subnet for
> doing aggressive probing -- but this may not be a big issue with a
> good algorithm and a threshold.

So what is that 'good algorithm'.  The current Linux algorithm is from
my point of view (only tested with ipv4) not very good when it comes to
high numbers of neighbours.

> In short, I don't think there needs to be anything special for IPv6.  
> Just the same mechanisms as for IPv4 -- at some threshold, start 
> garbage collecting more aggressively, using a "least-recently-used" 
> algorithm (or the like). 

Yes, but let's assume somebody floods you with 100MBit wirespeed, that's
148kpps, meaning you will have to have a limit of at least 148,800 plus
the number of 'real' hosts directly attached to your system in order to
cope with this.  Otherwise you will end up having all your neighbour
cache entries filled with INCOMPLETE entries, whose retrans_time of 1HZ
is not reached yet.

To do some quick calculations, this would require some 23.8 MByte RAM on
a 32bit platform(!)
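
(That figure assumes something like 160 bytes per struct neighbour on
a 32bit build: 148,800 entries * 160 bytes ~= 23.8 MByte.  The exact
per-entry size depends on the architecture and the key length.)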

Now what if you have multiple interfaces, or you start thinking about
gigabit ethernet...

> To constrain remote resource exhaustion exploits, please make sure 
> that the algorithm is sufficient to also deal with the threat 
> described in RFC 3756, section 4.3.2.

Isn't that exactly what we're talking about?
To quote from that RFC:

  In a way, this problem is fairly similar to the TCP SYN flooding
  problem.  For example, rate limiting Neighbor Solicitations,
  restricting the amount of state reserved for unresolved
  solicitations, and clever cache management may be applied.

So they encourage limiting the number of unresolved solicitations.  We
don't do that at this point, and allow all of the neighbour cache to be
filled with them...

> Pekka Savola                 "You each name yourselves king, yet the

-- 
- Harald Welte <laforge@gnumonks.org>               http://www.gnumonks.org/
============================================================================
Programming is like sex: One mistake and you have to support it your lifetime

* Re: [PATCH + RFC] neighbour/ARP cache scalability
From: Pekka Savola @ 2004-09-21 14:10 UTC (permalink / raw)
  To: Harald Welte; +Cc: Linux Netdev List

On Tue, 21 Sep 2004, Harald Welte wrote:
> > The situation with IPv6 is not much different than with IPv4.
> 
> I disagree (see below).

OK, I agree that there are more significant differences wrt. router
DoS attacks because the number of routers which have /64 subnets,
vulnerable to an attack, is probably larger.

> > It's better in the sense that nobody will be portscanning the whole 
> > address space or subnets as a means to look for nodes.  
> 
> I agree, but people will do it as a means to DoS the routers...

OK, if we talk about this solely from the perspective of router DoS 
attack, then we have slightly different constraints than if we looked 
only at the "host" perspective, or both from the host and router 
perspective.

> > It's worse in the sense that there is more space in each subnet for
> > doing aggressive probing -- but this may not be a big issue with a
> > good algorithm and a threshold.
> 
> So what is that 'good algorithm'.  The current Linux algorithm is from
> my point of view (only tested with ipv4) not very good when it comes to
> high numbers of neighbours.

True.
 
> > In short, I don't think there needs to be anything special for IPv6.  
> > Just the same mechanisms as for IPv4 -- at some threshold, start 
> > garbage collecting more aggressively, using a "least-recently-used" 
> > algorithm (or the like). 
> 
> Yes, but let's assume somebody floods you with 100MBit wirespeed, that's
> 148kpps, meaning you will have to have a limit of at least 148,800 plus
> the number of 'real' hosts directly attached to your system in order to
> cope with this.  Otherwise you will end up having all your neighbour
> cache entries filled with INCOMPLETE entries, whose retrans_time of 1HZ
> is not reached yet.
> 
> To do some quick calculations, this would require some 23.8 MByte RAM on
> a 32bit platform(!)
> 
> Now what if you have multiple interfaces, or you start thinking about
> gigabit ethernet...

I may be in the minority, but I don't think 24 MB is a big deal. :)
Remember that the same box will need to be able to sustain 150 kpps in
any case -- and the low-end boxes probably can't do it.

If you get DoS'ed with that kind of flood, you're pretty much out of
luck, unless someone comes up with a nice algorithm to mitigate this
problem.  I don't know if such an algorithm has been found yet. 

One possibility might be keeping track of the ingress interface, and
restricting ARP/ND messages to N (e.g., 100 or 1000/sec by default)  
per ingress interface.  That would allow "internal" ND lookups to work
even under an external attack.  
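
As a very rough sketch of what I mean (nothing of this exists in the
kernel; the names are made up):

	/* per-ingress-interface budget for creating unresolved state
	 * and emitting solicitations */
	struct nd_ratelimit {
		unsigned long	window_start;	/* jiffies at window start */
		unsigned int	budget;		/* probes left this window */
	};

	static int nd_probe_allowed(struct nd_ratelimit *rl,
				    unsigned int per_second)
	{
		if (time_after(jiffies, rl->window_start + HZ)) {
			rl->window_start = jiffies;	/* new 1s window */
			rl->budget = per_second;
		}
		if (!rl->budget)
			return 0;	/* over budget: ignore or delay */
		rl->budget--;
		return 1;
	}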

A more complex alternative might be purposefully delaying [at least on
untrusted interfaces] ARP/ND requests [e.g., by 1 or 2 seconds] which
request an address which does not already have any ARP/ND state at the
router, then check out how many new messages have arrived during that
delay period (and do some rate-limiting magic based on that). Read:
store the "known valid" ND state for as long as possible, and when you
get hit by the flood, you could de-prefer or ignore those requests
which pertain to addresses which haven't communicated with the router
in the last X [timevalue].

Another variation of the above might be two algorithms: just do every
lookup as normal until a "potential attack threshold" (e.g., 1000
entries, excluding those packets allowed by the restriction below).  
Then, restrict ARP/ND requests to X pps (e.g., 100 pps) which do not
relate to an address that already exists in the cache. This should
keep ND/ARP operational under an attack for the legitimate hosts,
while allowing some amount of ND traffic.

There are probably other ways of mitigating the problem in a way that
it does the least feasible amount of damage to the by-standers.

One might look at what (if anything) others do under these 
circumstances, e.g., BSD's.
 
> > To constrain remote resource exhaustion exploits, please make sure 
> > that the algorithm is sufficient to also deal with the threat 
> > described in RFC 3756, section 4.3.2.
> 
> Isn't that exactly what we're talking about?
> To quote from that RFC:
> 
>   In a way, this problem is fairly similar to the TCP SYN flooding
>   problem.  For example, rate limiting Neighbor Solicitations,
>   restricting the amount of state reserved for unresolved
>   solicitations, and clever cache management may be applied.
> 
> So they encourage limiting the number of unresolved solicitations.  We
> don't do that at this point, and allow all of the neighbour cache to be
> filled with them...

Agreed, if we are restricting to this particular problem.

-- 
Pekka Savola                 "You each name yourselves king, yet the
Netcore Oy                    kingdom bleeds."
Systems. Networks. Security. -- George R.R. Martin: A Clash of Kings

* Re: [PATCH + RFC] neighbour/ARP cache scalability
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2004-09-21 15:14 UTC (permalink / raw)
  To: laforge; +Cc: pekkas, netdev, yoshfuji

In article <20040921134918.GK1307@sunbeam.de.gnumonks.org> (at Tue, 21 Sep 2004 15:49:18 +0200), Harald Welte <laforge@gnumonks.org> says:

> > The situation with IPv6 is not much different than with IPv4.
> 
> I disagree (see below).

agreed (to harald).

> > It's worse in the sense that there is more space in each subnet for
> > doing aggressive probing -- but this may not be a big issue with a
> > good algorithm and a threshold.
> 
> So what is that 'good algorithm'.  The current Linux algorithm is from
> my point of view (only tested with ipv4) not very good when it comes to
> high numbers of neighbours.

Well, of course, we should limit the number of total entries.

If we have enough memory for usual use,
we should not purge the "probably reachable"
(REACHABLE, STALE, DELAY, and PROBE) neighbours.
Probably, we should split neighbour entries into two parts:
 - INCOMPLETE
 - REACHABLE, STALE, DELAY and PROBE
Which means, we should NOT purge "known" entries to make room for unknown ones.

If we don't have enough memory for usual use, hmm???
Probably (weighted) LFU.

--yoshfuji

* Re: [PATCH + RFC] neighbour/ARP cache scalability
From: Robert Olsson @ 2004-09-21 15:36 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki / 吉藤英明
  Cc: laforge, pekkas, netdev


YOSHIFUJI Hideaki / 吉藤英明 writes:

 > > > The situation with IPv6 is not much different than with IPv4.
 > > 
 > > I disagree (see below).
 > 
 > agreed (to harald).

 Hello!

 Remember we discussed this at OLS? You were asking for my solution. I said 
 ask at IETF as you were going there. I guess it was not discussed?

 Cheers.
					--ro

* Re: [PATCH + RFC] neighbour/ARP cache scalability
From: Pekka Savola @ 2004-09-21 15:58 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki / 吉藤英明; +Cc: laforge, netdev

On Wed, 22 Sep 2004, YOSHIFUJI Hideaki / 吉藤英明 wrote:
> > > It's worse in the sense that there is more space in each subnet for
> > > doing aggressive probing -- but this may not be a big issue with a
> > > good algorithm and a threshold.
> > 
> > So what is that 'good algorithm'.  The current Linux algorithm is from
> > my point of view (only tested with ipv4) not very good when it comes to
> > high numbers of neighbours.
> 
> Well, of course, we should limit the number of total entries.
> 
> If we have enough memory for usual use,
> we should not purge the "probably reachable"
> (REACHABLE, STALE, DELAY, and PROBE) neighbours.
> Probably, we should split neighbour entries into two parts:
>  - INCOMPLETE
>  - REACHABLE, STALE, DELAY and PROBE
> Which means, we should NOT purge "known" entries to make room for unknown ones.

This still doesn't take a stance on rate-limiting the ND/ARP packets,
for the case where there is still enough memory but some kind of attack is
clearly underway.  Should it still be done?  Consider 100Kpps of
router-generated ARP/ND probes -- not good!

-- 
Pekka Savola                 "You each name yourselves king, yet the
Netcore Oy                    kingdom bleeds."
Systems. Networks. Security. -- George R.R. Martin: A Clash of Kings

* Re: [PATCH + RFC] neighbour/ARP cache scalability
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2004-09-21 15:59 UTC (permalink / raw)
  To: Robert.Olsson; +Cc: laforge, pekkas, netdev, yoshfuji

In article <16720.19049.100640.24977@robur.slu.se> (at Tue, 21 Sep 2004 17:36:09 +0200), Robert Olsson <Robert.Olsson@data.slu.se> says:

 >  Remember we discussed this at OLS? You were asking for my solution. I said
 >  ask at IETF as you were going there. I guess it was not discussed?

Yes, I remember.

I talked with KAME (*BSD) developers.
I don't think that the BSDs' behavior is good enough against the threats.
So, I think we should find a solution by ourselves.

--yoshfuji

* Re: [PATCH + RFC] neighbour/ARP cache scalability
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2004-09-21 16:04 UTC (permalink / raw)
  To: pekkas; +Cc: laforge, netdev, yoshfuji

In article <Pine.LNX.4.44.0409211856260.9906-100000@netcore.fi> (at Tue, 21 Sep 2004 18:58:05 +0300 (EEST)), Pekka Savola <pekkas@netcore.fi> says:

> This still doesn't take a stance on rate-limiting the ND/ARP packets,
> for the case where there is still enough memory but some kind of attack is
> clearly underway.  Should it still be done?  Consider 100Kpps of
> router-generated ARP/ND probes -- not good!

Right. We need to do this, of course. Probably, per-ingress interface.
(I mean, incoming interface which invokes NS.)

Note: I think a similar idea (limiting per interface) arose during a chat 
      with Robert, Harald et al. at OLS.

-- 
Hideaki YOSHIFUJI @ USAGI Project <yoshfuji@linux-ipv6.org>
GPG FP: 9022 65EB 1ECF 3AD1 0BDF  80D8 4807 F894 E062 0EEA

* Re: [PATCH + RFC] neighbour/ARP cache scalability
From: Tim Gardner @ 2004-09-21 16:39 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki / 吉藤英明
  Cc: pekkas, laforge, netdev

On Tue, 2004-09-21 at 10:04, YOSHIFUJI Hideaki / 吉藤英明 wrote:
> In article <Pine.LNX.4.44.0409211856260.9906-100000@netcore.fi> (at Tue, 21 Sep 2004 18:58:05 +0300 (EEST)), Pekka Savola <pekkas@netcore.fi> says:
> 
> > This still doesn't take a stance on rate-limiting the ND/ARP packets,
> > for the case where there is still enough memory but some kind of attack is
> > clearly underway.  Should it still be done?  Consider 100Kpps of
> > router-generated ARP/ND probes -- not good!
> 

Detecting an attack would require some kind of heuristic in the core
router code. I believe that logic is better suited for an iptables
filter. Why burden well-guarded machines that are unlikely to experience
this kind of attack? I think the only thing NUD should do is limit the
absolute number of NUD entries that it can create. Give it a sysctl knob
for large networks, but make the default something reasonable (like
2K).  

I've developed a variant of the Port Scan Detector (PSD) iptables filter
that combats this very problem. It only allows so many destination
IP/Port pairs from a given address to be opened over time. This limits
the rate at which connections can be opened as well as the absolute
number. For example, on my edge routers I set the policy that no single
IP source address can create more then 64 connections within a 30 second
sliding window. This has made a huge impact on the ARP storms that our
network used to experience.
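
The bookkeeping behind it is roughly the following (much simplified;
the real match keeps the per-source history in a hash table):

	/* sliding-window check, kept per source IP: allow a new
	 * destination IP/port pair only if fewer than 'limit' pairs
	 * were opened within the last 'window' jiffies */
	struct src_history {
		unsigned long	stamps[64];	/* one per opened pair */
		unsigned int	next;		/* ring buffer index   */
	};

	static int pair_allowed(struct src_history *h, unsigned int limit,
				unsigned long window)
	{
		unsigned long now = jiffies;
		unsigned int i, recent = 0;

		for (i = 0; i < limit && i < 64; i++)
			if (time_after(h->stamps[i] + window, now))
				recent++;
		if (recent >= limit)
			return 0;	/* e.g. 64 pairs in 30s: drop */
		h->stamps[h->next++ % 64] = now;
		return 1;
	}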

rtg

-- 
timg@tpi.com http://www.tpi.com
406-443-5357(MT) 503-601-0234(OR)

* Re: [PATCH + RFC] neighbour/ARP cache scalability
From: Andi Kleen @ 2004-09-21 17:31 UTC (permalink / raw)
  To: Tim Gardner; +Cc: YOSHIFUJI Hideaki / 吉藤英明, pekkas, laforge, netdev

> I've developed a variant of the Port Scan Detector (PSD) iptables filter
> that combats this very problem. It only allows so many destination
> IP/Port pairs from a given address to be opened over time. This limits
> the rate at which connections can be opened as well as the absolute
> number. For example, on my edge routers I set the policy that no single
> IP source address can create more than 64 connections within a 30 second
> sliding window. This has made a huge impact on the ARP storms that our
> network used to experience.

But also allows an easy DOS. Someone just has to spoof a lot of connection
attempts with the source address of your primary name server or 
some other important service.

-Andi

* Re: [PATCH + RFC] neighbour/ARP cache scalability
From: Tim Gardner @ 2004-09-21 17:58 UTC (permalink / raw)
  To: Andi Kleen; +Cc: YOSHIFUJI Hideaki / 吉藤英明, pekkas, laforge, netdev

On Tue, 2004-09-21 at 11:31, Andi Kleen wrote:

> But also allows an easy DOS. Someone just has to spoof a lot of connection
> attempts with the source address of your primary name server or 
> some other important service.
> 

That is what other iptables rules and filters are for. I get thousands
of source address spoofs from my Internet connection every day. Network
security is a layered approach.

rtg
-- 
timg@tpi.com http://www.tpi.com
406-443-5357(MT) 503-601-0234(OR)

* Re: [PATCH + RFC] neighbour/ARP cache scalability
From: Andi Kleen @ 2004-09-21 18:15 UTC (permalink / raw)
  To: Tim Gardner
  Cc: Andi Kleen, YOSHIFUJI Hideaki / 吉藤英明, pekkas, laforge, netdev

On Tue, Sep 21, 2004 at 11:58:27AM -0600, Tim Gardner wrote:
> On Tue, 2004-09-21 at 11:31, Andi Kleen wrote:
> 
> > But also allows an easy DOS. Someone just has to spoof a lot of connection
> > attempts with the source address of your primary name server or 
> > some other important service.
> > 
> 
> That is what other iptables rules and filters are for. I get thousands
> of source address spoofs from my Internet connection every day. Network
> security is a layered approach.

I don't think you can eliminate the problem with more filters.
Even when you can eliminate spoofing for some services you use,
you cannot eliminate it for all possible services your users
use (unless you get rid of spoofing in the whole internet, or you never
talk to any services outside your network).

And certainly the solution wouldn't work as a Linux default.

-Andi

* Re: [PATCH + RFC] neighbour/ARP cache scalability
From: Harald Welte @ 2004-09-21 20:34 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Tim Gardner, YOSHIFUJI Hideaki / 吉藤英明, pekkas, netdev

On Tue, Sep 21, 2004 at 08:15:25PM +0200, Andi Kleen wrote:
> On Tue, Sep 21, 2004 at 11:58:27AM -0600, Tim Gardner wrote:
> > On Tue, 2004-09-21 at 11:31, Andi Kleen wrote:
> > 
> > > But also allows an easy DOS. Someone just has to spoof a lot of connection
> > > attempts with the source address of your primary name server or 
> > > some other important service.
> > > 
> > 
> > That is what other iptables rules and filters are for. I get thousands
> > of source address spoofs from my Internet connection every day. Network
> > security is a layered approach.
> 
> I don't think you can eliminate the problem with more filters.

I agree with Andi, and I think we're just being lazy if we say 'well,
the neighbour cache has this problem, but the solution has to be
manually implemented by the administrator'.

Also, we cannot put complex heuristics code in place, unless we can
prove that it doesn't in turn provide new possibilities for DoS.

My personal (simplistic) favourite is still a simple threshold (absolute
value / percentage) for incomplete neighbour entries.  This way we make
sure that 'real' (fully resolved) entries cannot be starved in favour
of incomplete ones.
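
In pseudo-kernel-C, something like this in the allocation path (the
incomplete_entries counter is new bookkeeping that doesn't exist today):

	/* keep a reserve for resolved entries: refuse to create yet
	 * another INCOMPLETE entry beyond 80% of gc_thresh3 */
	if (atomic_read(&tbl->incomplete_entries) >
	    tbl->gc_thresh3 * 4 / 5)
		return NULL;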

> -Andi

-- 
- Harald Welte <laforge@gnumonks.org>               http://www.gnumonks.org/
============================================================================
Programming is like sex: One mistake and you have to support it your lifetime

* Re: [PATCH + RFC] neighbour/ARP cache scalability
From: Tim Gardner @ 2004-09-21 20:58 UTC (permalink / raw)
  To: Harald Welte; +Cc: Andi Kleen, YOSHIFUJI Hideaki / 吉藤英明, pekkas, netdev


> My personal (simplistic) favourite is still a simple threshold (absolute
> value / percentage) for incomplete neighbour entries.  This way we make
> sure that 'real' (fully resolved) entries cannot be starved in favour
> of incomplete ones.
> 
> > -Andi

It's not like NUD doesn't already have an overflow policy.
neigh_forced_gc() performs a cleanup on NUD_INCOMPLETE entries when
gc_thresh3 is exceeded.

rtg
-- 
timg@tpi.com http://www.tpi.com
406-443-5357(MT) 503-601-0234(OR)

* Re: [PATCH + RFC] neighbour/ARP cache scalability
From: Harald Welte @ 2004-09-22  1:14 UTC (permalink / raw)
  To: Tim Gardner; +Cc: Andi Kleen, YOSHIFUJI Hideaki / 吉藤英明, pekkas, netdev

On Tue, Sep 21, 2004 at 02:58:37PM -0600, Tim Gardner wrote:
> 
> > My personal (simplistic) favourite is still a simple threshold (absolute
> > value / percentage) for incomplete neighbour entries.  This way we make
> > sure that 'real' (fully resolved) entries cannot be starved in favour
> > of incomplete ones.
> > 
> > > -Andi
> 
> It's not like NUD doesn't already have an overflow policy.
> neigh_forced_gc() performs a cleanup on NUD_INCOMPLETE entries when
> gc_thresh3 is exceeded.

No, that's not what it does.  neigh_forced_gc() explicitly only elects
INCOMPLETE entries that have existed for at least n->parms->retrans_time,
in order to avoid flooding the network with too many arp/neighbour
requests (since we could delete an incomplete one before the reply
arrives, and could do this for quite some time over and over again).
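
The relevant test in neigh_forced_gc() is roughly this (quoting from
memory, so treat it as a paraphrase):

	/* discard only if unreferenced, not PERMANENT, and -- when
	 * INCOMPLETE -- already older than retrans_time */
	if (atomic_read(&n->refcnt) == 1 &&
	    !(n->nud_state & NUD_PERMANENT) &&
	    (n->nud_state != NUD_INCOMPLETE ||
	     time_after(jiffies, n->used + n->parms->retrans_time))) {
		*np = n->next;		/* unlink and release */
		n->dead = 1;
		write_unlock(&n->lock);
		neigh_release(n);
		continue;
	}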

That's been my point all over this discussion...

> rtg
> timg@tpi.com http://www.tpi.com
> 406-443-5357(MT) 503-601-0234(OR)

-- 
- Harald Welte <laforge@gnumonks.org>               http://www.gnumonks.org/
============================================================================
Programming is like sex: One mistake and you have to support it your lifetime

* Re: [PATCH + RFC] neighbour/ARP cache scalability
From: David S. Miller @ 2004-09-22  3:31 UTC (permalink / raw)
  To: Harald Welte; +Cc: netdev


Ok Harald, I did some snooping around and here is what
I've come up with.

1) We have 5 or 6 implementations of "walk neighbour hash
   table in sequence", that's ridiculous and is the only
   reason that hashtable layout or number of entries is
   even known by code outside of net/core/neighbour.c

   In fact, many cases get it wrong :(  For example, the
   ARP seq_file stuff locks the normal hash table correctly
   but does zero locking when traversing the pneigh hashes.
   Oops.  Another reason to put this logic in a common spot.

   So I think the first thing to do is to write table
   walker functions generically inside of net/core/neighbour.c
   (a rough sketch follows after point 3 below).

   This is the first step.

1.5) tbl->ops->hash() should return the raw hash, the caller
     will do the necessary masking.

     At this point, there is no reason to declare
     {P,}NEIGH_HASHMASK in net/neighbour.h

2) We should size these hash table dynamically, growing them
   at neigh_create() time.

   This makes the most sense, and the scheme is similar to how
   net/ipv4/fib_hash.c works, for example.

3) If we have a sysctl for this stuff at all, it will be for
   the limit of the hash table growth, but I don't think that
   is necessary given #2
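
For (1), I'm thinking of something along these lines (only a sketch,
reusing the field names from your patch; the real patches will differ):

	/* generic walker, private to net/core/neighbour.c -- callers
	 * pass a callback instead of poking at hash_buckets directly */
	void neigh_for_each(struct neigh_table *tbl,
			    void (*cb)(struct neighbour *, void *),
			    void *cookie)
	{
		int i;

		read_lock_bh(&tbl->lock);
		for (i = 0; i < tbl->num_hash_buckets; i++) {
			struct neighbour *n;

			for (n = tbl->hash_buckets[i]; n; n = n->next)
				cb(n, cookie);
		}
		read_unlock_bh(&tbl->lock);
	}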

I like the jenkins hash change and yes I think it should be applied
elsewhere too.

I'll work on the set of patches implementing the above tomorrow
and will send it to the list for review and commentary.

* Re: [PATCH + RFC] neighbour/ARP cache scalability
From: Harald Welte @ 2004-09-22 11:14 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

On Tue, Sep 21, 2004 at 08:31:18PM -0700, David S. Miller wrote:
 
> Ok Harald, I did some snooping around and here is what
> I've come up with.

Thanks for digging into this issue.

> 1) We have 5 or 6 implementations of "walk neighbour hash
>    table in sequence", that's rediculious and is the only
>    reason that hashtable layout or number of entries is
>    even known by code outside of net/core/neighbour.c
> [...]
> 1.5) tbl->ops->hash() should return the raw hash, the caller
>      will do the necessary masking.
> [...]
> 
> 2) We should size these hash table dynamically, growing them
>    at neigh_create() time.

great, that sounds like a scalable approach.  I'm looking forward to
reading through the code :)

> 3) If we have a sysctl for this stuff at all, it will be for
>    the limit of the hash table growth, but I don't think that
>    is necessary given #2

But we will still have the existing sysctls for the garbage collector, I
guess (like gc_thresh*).  

And I still want to address the "all complete entries flushed due to
lots of bogus incomplete entries" issue somehow.  As stated before, I
have seen this happen on live systems.  

Do you agree that this is an existing problem?

I think just having a certain 'reserve' for complete entries (or, let's
say, an 80% limit for incomplete ones) should address this issue.

btw: I guess you're implementing this as a 2.6.x-only patch.  Once you're
finished, I'll have to do a 2.4.x backport anyway [for business reasons
*g*].  So don't bother doing that yourself; I'll post the patch here (which
doesn't imply that I want to have it merged into mainline).

-- 
- Harald Welte <laforge@gnumonks.org>               http://www.gnumonks.org/
============================================================================
Programming is like sex: One mistake and you have to support it your lifetime

* Re: [PATCH + RFC] neighbour/ARP cache scalability
From: David S. Miller @ 2004-09-24  5:47 UTC (permalink / raw)
  To: David S. Miller; +Cc: laforge, netdev

On Tue, 21 Sep 2004 20:31:18 -0700
"David S. Miller" <davem@davemloft.net> wrote:

> I'll work on the set of patches implementing the above tomorrow
> and will send it to the list for review and commentary.

Ok, coming in followup emails.

The hardest part was abstracting the generic neigh seq_file stuff
such that net/atm/clip.c could still fit in.

* Re: [PATCH + RFC] neighbour/ARP cache scalability
From: David S. Miller @ 2004-09-24  5:53 UTC (permalink / raw)
  To: Harald Welte; +Cc: netdev

On Wed, 22 Sep 2004 13:14:47 +0200
Harald Welte <laforge@gnumonks.org> wrote:

> And I still want to address the "all complete entries flushed due to
> lots of bogus incomplete entries" issue somehow.  As stated before, I
> have seen this happen on live systems.  
> 
> Do you agree that this is an existing problem?

I totally agree.  Please scan my 6 patches, let's get that
all fleshed out, then we can move on to this issue.

Feel free to backport my 2.6.x neigh patches so far and
send that work to me, I'll happily integrate them into
my 2.4.x net tree.

* Re: [PATCH + RFC] neighbour/ARP cache scalability
From: Glen Turner @ 2004-09-24  7:14 UTC (permalink / raw)
  To: Harald Welte; +Cc: David S. Miller, netdev

On Wed, 2004-09-22 at 20:44, Harald Welte wrote:

> And I still want to address the "all complete entries flushed due to
> lots of bogus incomplete entries" issue somehow.  As stated before, I
> have seen this happen on live systems.  
> 
> Do you agree that this is an existing problem?
> 
> I think just having a certain 'reserve' for complete entries (or, let's
> say, an 80% limit for incomplete ones) should address this issue.

Or two other ideas:

1)
Recognising that there are two classes of incomplete entries: those that
are recently-issued and those that are very unlikely to complete (as
you've allowed enough time for a handful of serialisation delays,
latencies and answering-host turn-around, say 0.7s).

You can pretty safely and aggressively flush the "unlikely to complete"
class of incomplete entries; the older the entry, the more aggressively
(a sketch of such a test follows below).

Completed entries should be flushed before the first of the "recently
issued" incomplete entries, to stop repeated ARPing.

This gets less useful as your address allocation grows and scanning
rates rise, but is much better than the current algorithm or a
straightforward least-recently-used algorithm.
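
A sketch of the test for this first idea (the constant and the helper
are of course made up):

	/* an INCOMPLETE entry older than a few serialisation delays,
	 * latencies and host turn-arounds is unlikely to ever complete */
	#define NEIGH_COMPLETE_HORIZON	((7 * HZ) / 10)	/* ~0.7s */

	static inline int neigh_unlikely_to_complete(const struct neighbour *n)
	{
		return n->nud_state == NUD_INCOMPLETE &&
		       time_after(jiffies, n->used + NEIGH_COMPLETE_HORIZON);
	}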

2)
Record the source interface of the traffic which is causing this ARP.
When flushing the cache, apply a penalty to entries whose source
interface has a large number of incomplete ARPs.

The effect of this is that scanning from an Internet-facing interface
won't succeed in pushing large numbers of entries generated from
local-to-local (eg, local file, print, intranet) traffic out of the
table.

A combination of (1) and (2) should be pretty solid. If it falls apart
under extreme scanning then hard-coding the ARP info for
externally-facing servers (such as web servers) is needed.  Currently
hard-coding all machines would be needed in the extreme scanning case.

Hope this helps,
Glen

-- 
Glen Turner          Tel: (08) 8303 3936 or +61 8 8303 3936
Australia's Academic & Research Network   www.aarnet.edu.au
