* Extensible hashing and RCU
@ 2007-02-04  7:41 linux
  2007-02-05 18:02 ` akepner
  2007-02-05 18:41 ` [RFC/TOY]Extensible " akepner
  0 siblings, 2 replies; 102+ messages in thread
From: linux @ 2007-02-04  7:41 UTC (permalink / raw)
  To: davem, netdev; +Cc: linux

I noticed a mention in an LCA talk that apparently extensible hashing
with RCU access is an unsolved problem.  Here's an idea for solving it.

I'm assuming the table is a power of 2 in size with open chaining
for collisions.  When the chains get too long, the table is doubled.
When the chains get too short, the table size is halved.

- Compute a sufficiently large (32-bit?) hash value for each entry.
  "Sufficiently large" is large enough for the largest possible hash table.

- The hash value is stored with each entry.  (Not strictly
  necessary if the update rate is sufficiently low.)

- The table is indexed on the *high* bits of the hash value.
  As it grows, additional bits are appended to the hash value.

- Each chain is stored in sorted order by hash value.
  (This is why storing the hash value is an efficiency win.)

To double the size of a hash table:
- Allocate new, larger, array of head pointers.
- The even slots are copied from the smaller hash table.
- The odd slots are initialized to point into the middle of
  the hash chains pointed to by the corresponding even slots.  However,
  the even chains are NOT terminated yet; a search through one of
  them will go through the full chain length.
- The new table is declared open for business.
- Wait for RCU quiescent period to elapse, so there are no more readers
  of the old table.
- NOW truncate the even chains by setting the next pointers to NULL.
- Deallocate and free the old array of head pointers.

Likewise, to halve the size: copy the even heads to a smaller table,
link the odd chains onto the tails of the even chains, and declare
the new table open for business.  When an RCU quiescent period has
elapsed, you can delete the old table.

The insight is that RCU makes taking stuff out of a linked list
very delicate, and moving it while preserving access is
basically impossible.  But you can append extraneous junk to the
end of a hash chain harmlessly enough and share the structure.

Thus, there is a period of overlap when both the old and the new hash
tables are valid and functional.

Indeed, after each of the above steps, you can actually allow new
insertions into the hash table while waiting for the RCU quiescent period.

If the insertion is at the head of chain, it won't be seen by readers
of the old table, but that's harmless.

The trickiest case I can think of is the deletion of a table
entry at the head of an odd chain while an expansion is pending.
When scanning the even chain afterwards to find where to truncate it,
you can't compare node->next to the odd chain head; you have to
look at the now deleted node's hash code and see that it exceeds
the threshold for the even chain.  (Equivalently, you can test to
see if the appropriate bit of the hash code is set.)

So that hash chain walking has to be done BEFORE the node is actually
deleted.  This requires an ordering guarantee on RCU callbacks, either
a priority system or FIFO.  call_rcu looks like it uses FIFO order,
but it's per-CPU lists.

Ah!  It's worse than that.  Even after the first RCU quiescent period,
there still could be a walker of the even chain holding a pointer to
the newly-deleted odd chain head.  Thus, it can't actually be reclaimed
until a *second* RCU quiescent period has elapsed.

The first RCU period is to get rid of anyone who needs the link, then you
remove it, then you need to wait until there's nobody who's still using it.

Still, it's probably not too horrible.


(You could index the hash table on the low-order bits, but then you
need to keep the chains sorted by bit-reversed hash value, which is
probably more of a pain.  Still pretty easy, though.  To compare x and
y bit-reversed, just let mask = x^y; mask ^= mask-1; compare (x&mask)
to (y&mask).)

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-04  7:41 Extensible hashing and RCU linux
@ 2007-02-05 18:02 ` akepner
  2007-02-17 13:13   ` Evgeniy Polyakov
  2007-02-05 18:41 ` [RFC/TOY]Extensible " akepner
  1 sibling, 1 reply; 102+ messages in thread
From: akepner @ 2007-02-05 18:02 UTC (permalink / raw)
  To: linux; +Cc: davem, netdev

On Sat, 4 Feb 2007 linux@horizon.com wrote:

> I noticed a mention in an LCA talk that apparently extensible hashing
> with RCU access is an unsolved problem.  Here's an idea for solving it.
> ....

Yes, I have been playing around with the same idea for
doing dynamic resizing of the TCP hashtable.

Did a prototype "toy" implementation, and I have a
"half-finished" patch which resizes the TCP hashtable
at runtime. Hmmm, your mail may be the impetus to get
me to finally finish this thing....

-- 
Arthur



* Re: [RFC/TOY]Extensible hashing and RCU
  2007-02-04  7:41 Extensible hashing and RCU linux
  2007-02-05 18:02 ` akepner
@ 2007-02-05 18:41 ` akepner
  2007-02-06 19:09   ` linux
  1 sibling, 1 reply; 102+ messages in thread
From: akepner @ 2007-02-05 18:41 UTC (permalink / raw)
  To: linux; +Cc: davem, netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 624 bytes --]

On Sat, 4 Feb 2007 linux@horizon.com wrote:

> I noticed a mention in an LCA talk that apparently extensible hashing
> with RCU access is an unsolved problem.  Here's an idea for solving it.
>
> I'm assuming the table is a power of 2 in size with open chaining
> for collisions.  When the chains get too long, the table is doubled.
> When the chains get too short, the table size is halved.
> .....

For purposes of discussion, I've attached a "toy" implementation
for doing dynamic resizing of a hashtable. It is useless, except
as a proof of concept.

I think this is very similar to what you are describing, no?

-- 
Arthur

[-- Attachment #2: resizable hashtable toy implementation --]
[-- Type: TEXT/PLAIN, Size: 18987 bytes --]


#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/vmalloc.h>
#include <linux/seqlock.h>
#include <linux/rcupdate.h>
#include <linux/netdevice.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>

#include <asm/uaccess.h>

#define _NETHASH_BUFSZ_ 16
#define MODULE_NAME "nethash"

#define _DEBUG_IT_
#ifdef  _DEBUG_IT_
#define nprintk(x...) printk(KERN_ALERT x);
#else  /* !_DEBUG_IT_ */
#define nprintk(x...)
#endif /* _DEBUG_IT_ */

static struct proc_dir_entry *nh_proc_dir;

enum nh_type { NH_ENTRY, NH_HEAD };


struct nh_item {
	/* the list head must be first followed by nh_type */
	struct list_head	nh_list;
	enum nh_type		nh_type;
};

struct nh_entry {
	/* the list head must be first followed by nh_type */
	struct list_head	nh_list;
	enum nh_type		nh_type;
	unsigned long		data;
	struct rcu_head		rcu_head;
};

struct nh_head {
	/* the list head must be first followed by nh_type */
	struct list_head	list;
	enum nh_type		nh_type;
	spinlock_t		lock;
};

struct nh {
	unsigned long	nentries;
	struct nh_head*	hash;
	struct rcu_head	rcu_head;
};


static struct nh * __nh;

static DEFINE_SEQLOCK(nethash_seq);
static DEFINE_MUTEX(nethash_resize_mutex);

extern void * system_hash_alloc(unsigned long sz);
extern void system_hash_free(void * v, unsigned long sz);

/* XXX nh_dump() is for debug only 
 * it must be called under rcu_read_lock() */
static int nh_dump(struct nh * nh, const char * debug_str);

static struct nh * get_nh(void);

static unsigned long nh_hashval(unsigned long data)
{
	return data;
}

static void nh_entry_free(struct rcu_head * head)
{
	struct nh_entry * entry = container_of(head, struct nh_entry, rcu_head);
	nprintk("%s: freeing entry with data %lu\n", __FUNCTION__, entry->data);
	kfree(entry);
}

static void nh_free(struct rcu_head * head)
{
	struct nh * nh = container_of(head, struct nh, rcu_head);
	unsigned long nentries = nh->nentries;
	unsigned long size = sizeof (struct nh_head) * nentries;
	nprintk("%s: freeing nh of size %lu\n", __FUNCTION__, size);
	system_hash_free((void *)nh->hash, size);
	nprintk("%s: freeing nh\n", __FUNCTION__);
	kfree(nh);
}

/* called with head's spin_lock held */
static int __nh_insert(struct nh_entry * entry, 
		       struct nh_head * head,
		       unsigned long nentries)
{
	struct list_head * prev; /* insert entry after prev */
	struct list_head * list;

	nprintk("%s: linking entry with data %lu\n", __FUNCTION__, 
		entry->data);

	/* find the appropriate spot to place entry */
	prev = &head->list;
	if ( nh_hashval(entry->data) & nentries ) {
		list_for_each_rcu(list, &head->list) {
			struct nh_entry * tmp;
			struct nh_item * item = (struct nh_item *)list;
			if (item->nh_type != NH_ENTRY)
				continue;
			tmp = list_entry(list, struct nh_entry, nh_list);
			prev = &tmp->nh_list;
			if (nh_hashval(tmp->data) & nentries) {
				/* put entry after nh */
				break;
			}
		}
	}
	list_add_rcu(&entry->nh_list, prev);

	return 0;

}

/* called with head's lock held */
static void __nh_sort_chain(struct nh_head * head, unsigned long nentries)
{
	struct list_head tmp, *list = &head->list;
	struct nh_entry* entry;

	INIT_LIST_HEAD(&tmp);
	list_splice_init(list, &tmp);
	while ( !list_empty(&tmp) ) {
		struct list_head * first = tmp.next;

		list_del(first);
		entry = (struct nh_entry*)first;

		__nh_insert(entry, head, nentries);
	}

}

static void nh_fixup_grow(struct rcu_head * head)
{
	struct nh * nh;
	struct nh * oh = container_of(head, struct nh, rcu_head);
	unsigned long i, oentries = oh->nentries;

	rcu_read_lock();
	nh = get_nh();

	/* this is an rcu_callback - only the "new" nh is 
	 * visible elsewhere now, so make sure to access 
	 * the hashtable using the proper (new) locks,... */
	for (i = 0; i < oentries; i++) {
		struct nh_head * nhead1 = &nh->hash[i];
		struct nh_head * nhead2 = &nh->hash[i + oentries];
		struct list_head *olist, *tail1, *tail2;

		/* grab locks in order */
		spin_lock(&nhead1->lock);
		spin_lock(&nhead2->lock);

		olist = &oh->hash[i].list;

		list_del(olist); /* don't need rcu variant here */

		tail1 = nhead2->list.prev;
		tail2 = nhead1->list.prev;

		nhead1->list.prev = tail1;
		nhead2->list.prev = tail2;

		tail1->next = &nhead1->list;
		tail2->next = &nhead2->list;

		__nh_sort_chain(nhead1, oentries << 1);
		__nh_sort_chain(nhead2, oentries << 1);

		spin_unlock(&nhead2->lock);
		spin_unlock(&nhead1->lock);
	}
	rcu_read_unlock();
	nh_free(head);
}


static void nh_fixup_shrink(struct rcu_head * head)
{
	struct nh * nh;
	struct nh * oh = container_of(head, struct nh, rcu_head);
	unsigned long i, nentries;

	rcu_read_lock();
	nh = get_nh();

	nentries = nh->nentries;

	/* this is an rcu_callback - only the "new" nh is 
	 * visible elsewhere now, so make sure to access 
	 * the hashtable using the proper (new) locks,... */
	for (i = 0; i < nh->nentries; i++) {
		struct nh_head * nhead = &nh->hash[i];
		struct list_head * olist1 = &oh->hash[i].list;
		struct list_head * olist2 = &oh->hash[i + nentries].list;
		spin_lock(&nhead->lock);
		list_del(olist1); /* don't need RCU variants here */
		list_del(olist2);
		__nh_sort_chain(nhead, nentries);
		spin_unlock(&nhead->lock);
	}
	rcu_read_unlock();
	nh_free(head);
}


/* called under rcu_read_lock */
static struct nh * get_nh(void)
{
	unsigned long seq;
	struct nh * nh = NULL;

	do {
		seq = read_seqbegin(&nethash_seq);
		nh = __nh;
	} while (read_seqretry(&nethash_seq, seq));

	return nh;
}

/* use returned value only under rcu_read_lock */
static struct nh * nh_replace(struct nh * new)
{
	struct nh * old = NULL;

	write_seqlock(&nethash_seq);
	old = __nh;
	__nh = new;
	write_sequnlock(&nethash_seq);
	return old;
}



static void nh_destroy(void)
{
	unsigned long i;
	struct nh * nh;

	rcu_read_lock(); 

	nh = nh_replace(NULL);

	if (!nh)
		goto out_unlock;

	for (i = 0; i < nh->nentries; i++) {
		struct nh_head * head = &nh->hash[i];
		struct list_head *list;

		spin_lock(&head->lock);
		list_for_each_rcu(list, &head->list) {
			struct nh_item * item = (struct nh_item *)list;
			struct nh_entry *entry;
			if (item->nh_type != NH_ENTRY)
				continue;
			entry = list_entry(list, struct nh_entry, nh_list);
			list_del_rcu(&entry->nh_list);
			nprintk("%s: delinking entry with data %lu\n", 
				__FUNCTION__, entry->data);
			call_rcu(&entry->rcu_head, nh_entry_free);
		}
		spin_unlock(&head->lock);
	}
	
	call_rcu(&nh->rcu_head, nh_free);

out_unlock:
	rcu_read_unlock();
}

static void nh_init(struct nh* nh, 
		    struct nh_head * hash,
		    unsigned long nentries)
{
	unsigned long i;

	for (i = 0; i < nentries; i++) {
		INIT_LIST_HEAD(&hash[i].list);
		hash[i].nh_type = NH_HEAD;
		spin_lock_init(&hash[i].lock);
	}

	nh->hash = hash;
	nh->nentries = nentries;
}

static struct nh * nh_alloc(unsigned long nentries)
{

	struct nh * nh;
	unsigned long sz;
	struct nh_head * hash;
	/* nentries must be a power of 2 */
	nentries = roundup_pow_of_two(nentries);
	sz = nentries * sizeof(struct nh_head);

	nprintk("%s: creating hash of %lu entries, size %lu bytes\n", 
		__FUNCTION__, nentries, sz);

	nh = kmalloc(sizeof(struct nh), GFP_KERNEL);
	if (!nh) {
		nprintk("%s: failed to allocate hash of size %lu\n", 
			__FUNCTION__, sz);
		return NULL;
	}

	hash = system_hash_alloc(sz);
	if (!hash) {
		nprintk("%s: failed to allocate hash of size %lu\n", 
			__FUNCTION__, sz);
		kfree(nh);
		return NULL;
	}

	nh_init(nh, hash, nentries);

	return nh;
}

static int nh_create(unsigned long nentries)
{
	struct nh * nh;

	nh_destroy();

	nh = nh_alloc(nentries);

	if (!nh)
		return -ENOMEM;

	nh_replace(nh);

	return 0;
}

static unsigned long nh_bucket(unsigned long hashval, unsigned long nentries)
{
	/* nentries is a power of 2 */
	return (hashval & (nentries - 1));
}

static int nh_add(unsigned long data) 
{
	struct nh_entry * entry;
	struct nh * nh = NULL;
	int ret = 0;

	entry = kmalloc(sizeof(struct nh_entry), GFP_KERNEL);
	if (!entry) return -ENOMEM;

	entry->data = data;
	entry->nh_type = NH_ENTRY;

	rcu_read_lock();

	nh = get_nh();

	if (nh) {
		struct nh_head *hash = nh->hash;
		unsigned long nentries = nh->nentries;
		unsigned long hashval = nh_hashval(data);
		unsigned long bucket  = nh_bucket(hashval, nentries);
		struct nh_head * head = &hash[bucket];

		spin_lock(&head->lock);
		ret = __nh_insert(entry, head, nentries);
		spin_unlock(&head->lock);
	}

	rcu_read_unlock();

	return ret;
}

static void nh_migrate_grow(struct nh * old, struct nh * new)
{
	unsigned long i, oentries  = old->nentries;

	/* this is called prior to calling replace_nh(), so 
	 * only the "old" nh is visible elsewhere - only 
	 * access the hashtable under the appropriate "old" 
	 * locks */
	for ( i = 0; i < oentries; i++ ) {

		struct nh_head * ohead     = &old->hash[i];
		struct list_head  * olist  = &ohead->list;
		struct list_head  * nlist1 = &new->hash[i].list;
		struct list_head  * nlist2 = &new->hash[i + oentries].list;
		struct nh_entry *entry;
		struct list_head * list, * prev = NULL;

		nprintk("%s: migrating bucket %lu\n", __FUNCTION__, i);

		spin_lock(&ohead->lock);

		list_add_rcu(nlist1, olist);

		/* find where to split chain */
		list_for_each_rcu(list, olist) {
			struct nh_item * item = (struct nh_item *)list;
			if (item->nh_type != NH_ENTRY) {
				prev = list;
				continue;
			}
			entry = list_entry(list, struct nh_entry, nh_list);
			if (nh_hashval(entry->data) & oentries) {
				break;
			}
			prev = list;
		}

		BUG_ON(prev == NULL);

		list_add_rcu(nlist2, prev);

		spin_unlock(&ohead->lock);

	}
}

static void nh_migrate_shrink(struct nh * old, struct nh * new)
{
	unsigned long i, oentries = old->nentries;

	/* this is called prior to calling replace_nh(), so 
	 * only the "old" nh is visible elsewhere - only 
	 * access the hashtable under the appropriate "old" 
	 * locks */
	for ( i = 0; i < oentries >> 1; i++ ) {

		struct list_head * nlist  = &new->hash[i].list;
		struct nh_head * ohead1   = &old->hash[i];
		struct nh_head * ohead2   = &old->hash[i + (oentries>>1)];
		struct list_head * olist1, * olist2, tmp;

		nprintk("%s: migrating bucket %lu\n", __FUNCTION__, i);

		/* acquire locks in order */
		spin_lock(&ohead1->lock);
		spin_lock(&ohead2->lock);

		olist1 = &ohead1->list;
		olist2 = &ohead2->list;

		INIT_LIST_HEAD(&tmp);
		list_add_tail_rcu(&tmp, olist2);
		list_splice_init(&tmp, olist1->prev);

		list_add_tail_rcu(nlist, olist1);

		spin_unlock(&ohead2->lock);
		spin_unlock(&ohead1->lock);

	}

}

unsigned long nh_nentries_grow(void) {
	unsigned long nentries = 0;
	/* we're called under rcu_read_lock */
	struct nh * nh = get_nh();

	if (nh) 
		nentries = nh->nentries << 1;
	
	return nentries;
}

unsigned long nh_nentries_shrink(void) {
	unsigned long nentries = 0;
	/* we're called under rcu_read_lock */
	struct nh * nh = get_nh();

	if (nh) 
		nentries = nh->nentries >> 1;
	
	return nentries;
}

enum { NH_GROW_OPS, NH_SHRINK_OPS, NH_MAX_OPS };
/* all of the nh_resize_ops must be called under the 
 * nethash_resize_mutex and rcu_read_lock
 */
struct nh_resize_ops {
	unsigned long (*nh_nentries_op) (void);
	void (*nh_migrate_op) (struct nh *old, struct nh *new);
	void (*nh_fixup_op) (struct rcu_head *head);
} nh_resize_ops [NH_MAX_OPS] = {
	{
		nh_nentries_grow,
		nh_migrate_grow,
		nh_fixup_grow,
	},
	{
		nh_nentries_shrink,
		nh_migrate_shrink,
		nh_fixup_shrink,
	},
};

static int nh_resize(long dir)
{
	struct nh *nh, *new_nh;
	unsigned long new_nentries;
	int err = 0;

	struct nh_resize_ops * nh_ops;

	if (dir > 0)
		nh_ops = &nh_resize_ops[NH_GROW_OPS];
	else if (dir < 0)
		nh_ops = &nh_resize_ops[NH_SHRINK_OPS];
	else
		return -EINVAL;

	mutex_lock(&nethash_resize_mutex);
	rcu_read_lock();

	nh = get_nh();

	if (!nh) 
		goto out_unlock;

	new_nentries = nh_ops->nh_nentries_op();

	nprintk("%s: resizing nh to %lu buckets\n", __FUNCTION__, 
		new_nentries);

	new_nh = nh_alloc(new_nentries);

	if (!new_nh) {
		err = -ENOMEM;
		goto out_unlock;
	}

	nh_ops->nh_migrate_op(nh, new_nh);

	nh_dump(nh, "old nh [post migration]");

	nh_dump(new_nh, "new nh [newborn]");

	nh = nh_replace(new_nh);
	/* now callers of get_nh() will get the updated 
	 * version, but references to the old nh can still 
	 * be in use. It must be possible for users to see 
	 * all elements of the table using either the old 
	 * or new nh */
	
	call_rcu(&nh->rcu_head, nh_ops->nh_fixup_op);

	rcu_read_unlock();

	synchronize_net();
	/* by holding the nethash_resize_mutex until after 
	 * synchronize_net() we are sure that there are at 
	 * most 2 nh structures that we need to be 
	 * concerned with */

	mutex_unlock(&nethash_resize_mutex);

	rcu_read_lock();
	nh_dump(new_nh, "new nh [done]");
	rcu_read_unlock();

	return err;

out_unlock:
	rcu_read_unlock();
	mutex_unlock(&nethash_resize_mutex);
	return err;

}

static int nh_getalong(const char __user *buf, size_t count, long *res)
{
	char kbuf[_NETHASH_BUFSZ_+1];

	if (count > _NETHASH_BUFSZ_)
		return -EINVAL;
	if (copy_from_user(&kbuf, buf, count))
		return -EFAULT;
	kbuf[_NETHASH_BUFSZ_] = '\0';

	if (sscanf(kbuf, "%ld", res) != 1)
		return -EINVAL;

	return 0;
}

ssize_t nh_create_write(struct file *file, const char __user *buf,
			size_t count, loff_t *ppos)
{
	long res;
	int ret;

	if ((ret = nh_getalong(buf, count, &res)) != 0)
		return ret;

	if(res <= 0)
		return -EINVAL;

	if ((ret = nh_create((unsigned long)res)) != 0)
		return ret;

	return count;
}


static struct file_operations nh_create_fops = {
	.write	= nh_create_write,
};


ssize_t nh_add_write(struct file *file, const char __user *buf,
	 	     size_t count, loff_t *ppos)
{
	long res;
	int ret;

	if ((ret = nh_getalong(buf, count, &res)) != 0)
		return ret;

	if(res <= 0)
		return -EINVAL;

	if ((ret = nh_add((unsigned long)res)) != 0)
		return ret;

	return count;
}

static struct file_operations nh_add_fops = {
	.write	= nh_add_write,
};


ssize_t nh_resize_write(struct file *file, const char __user *buf,
			size_t count, loff_t *ppos)
{
	long res;
	int ret;

	if ((ret = nh_getalong(buf, count, &res)) != 0)
		return ret;

	if ((ret = nh_resize(res)) != 0)
		return ret;

	return count;
}



static struct file_operations nh_resize_fops = {
	.write	= nh_resize_write,
};

static void * d_start (struct seq_file *m, loff_t *pos)
{
	struct nh * nh;

	rcu_read_lock();

	nh = get_nh();

	if (nh) {
		struct nh_head *hash = nh->hash;
		unsigned long nentries = nh->nentries;

		if (nentries > *pos) {
			seq_printf(m, "%lu: ", (unsigned long)*pos);
			return (&hash[*pos]);
		}
	}
	
	return NULL;
}

static void * d_next (struct seq_file *m, void *v, loff_t *pos)
{
	struct nh * nh;

	(*pos)++;

	/* rcu_read_lock is still held */
	nh = get_nh();

	if (nh) {
		struct nh_head *hash = nh->hash;
		unsigned long nentries = nh->nentries;

		if (nentries > *pos) {
			seq_printf(m, "%lu: ", (unsigned long)*pos);
			return (&hash[*pos]);
		}
	}
	return NULL;
}

static void d_stop (struct seq_file *m, void *v)
{
	rcu_read_unlock();
}

static int d_show (struct seq_file *m, void *v)
{
	struct nh_head *head = v;
	struct nh_entry *entry;
	struct list_head * list;

	rcu_read_lock();
	list_for_each_rcu(list, &head->list) {
		struct nh_item * item = (struct nh_item *)list;
		switch (item->nh_type) {
		case NH_ENTRY:
			entry = list_entry(list, struct nh_entry, nh_list);
			seq_printf(m, "%lu", entry->data);
			seq_printf(m, "%s", " -> ");
			break;
		case NH_HEAD:
			seq_printf(m, "%s", " .. ");
			break;
		default:
			seq_printf(m, "%s", "bad nh_type!\n");
		}
	}
	rcu_read_unlock();
	seq_printf(m, "%s", "\n");

	return 0;
}

struct seq_operations nh_dump_op = {
	.start	=	d_start,
	.next	=	d_next,
	.stop	=	d_stop,
	.show	=	d_show,
};

static int nh_dump_open(struct inode *inode, struct file *file)
{
	return seq_open(file, &nh_dump_op);
}

static struct file_operations proc_nh_dump_operations = {
	.open		= nh_dump_open,
	.read		= seq_read,
	.llseek		= seq_lseek,
	.release	= seq_release,
};

/* XXX nh_dump() is for debug only 
 * it must be called under rcu_read_lock() */
static int nh_dump(struct nh * nh, const char * debug_str)
{
	unsigned long i;
	struct nh_head * head;
	struct nh_entry *entry;
	struct list_head * list;

	nprintk("%s: %s\n", __FUNCTION__, debug_str);

	for (i = 0; i < nh->nentries; i++ ) {
		head = &nh->hash[i];
		
		nprintk("[%lu]: (%p)", i, head);
		list_for_each_rcu(list, &head->list) {
			struct nh_item * item = (struct nh_item *)list;
			nprintk(" %p ", list);
			switch (item->nh_type) {
			case NH_ENTRY:
				entry = list_entry(list, struct nh_entry, 
							nh_list);
				nprintk("%lu -> ", entry->data);
				break;
			case NH_HEAD:
				nprintk(" .. ");
				break;
			default:
				nprintk("bad nh_type!\n");
			}
		}
		nprintk("\n");

	}

	return 0;
}


static int nethash_module_init(void)
{
	struct proc_dir_entry *entry;

	nh_proc_dir = proc_mkdir(MODULE_NAME, NULL);

	if (!nh_proc_dir) return -ENODEV;

	/* first the entry to create the hashtable */
	entry = create_proc_entry("create", 0644, nh_proc_dir);
	if (!entry) return -ENODEV;
	entry->proc_fops = &nh_create_fops;

	/* and the entry to add an element to the hashtable */
	entry = create_proc_entry("add", 0644, nh_proc_dir);
	if (!entry) return -ENODEV;
	entry->proc_fops = &nh_add_fops;

	/* the entry to dump the hashtable */
	entry = create_proc_entry("dump", 0444, nh_proc_dir);
	if (!entry) return -ENODEV;
	entry->proc_fops = &proc_nh_dump_operations;

	/* and to resize the hashtable */
	entry = create_proc_entry("resize", 0644, nh_proc_dir);
	if (!entry) return -ENODEV;
	entry->proc_fops = &nh_resize_fops;

	return 0;	
}

static void nethash_module_exit(void)
{
	nh_destroy();
	rcu_barrier(); /* don't unload until rcu callbacks are done 
			* XXX is this really safe? */

	if (nh_proc_dir) {
		remove_proc_entry("resize", nh_proc_dir);
		remove_proc_entry("dump", nh_proc_dir);
		remove_proc_entry("add", nh_proc_dir);
		remove_proc_entry("create", nh_proc_dir);
	}
	remove_proc_entry(MODULE_NAME, NULL);
	printk("%s removed\n", MODULE_NAME);
}

module_init(nethash_module_init);
module_exit(nethash_module_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("akepner@sgi.com");



* Re: [RFC/TOY]Extensible hashing and RCU
  2007-02-05 18:41 ` [RFC/TOY]Extensible " akepner
@ 2007-02-06 19:09   ` linux
  0 siblings, 0 replies; 102+ messages in thread
From: linux @ 2007-02-06 19:09 UTC (permalink / raw)
  To: akepner, linux; +Cc: davem, netdev

> For purposes of discussion, I've attached a "toy" implementation
> for doing dynamic resizing of a hashtable. It is useless, except
> as a proof of concept.
> 
> I think this is very similar to what you are describing, no?

Er... not quite; that has a lot more overhead than what I was
thinking about.

I have used the trick of distinguishable sentinel values in a
doubly-linked list to maintain read cursors while it's being updated,
but I don't think that's necessary here.

(You can also encode the nh_type flag in the lsbit of the pointer
if you're being sneaky.  That will attract curses from the
memleak detector, though.)

In particular, I was imagining a singly linked list.  To delete
an item, use the hash to find the head pointer and walk it
to find the pointer to be fixed up.  Since the chains are short
anyway, this is entirely reasonable.

Less fundamental comments include:

1) Is the seqlock in get_nh() and nh_replace() really required?
   Is there any architecture that doesn't have atomic pointer stores?
   If you wanted to store the table size in a fixed location as
   well, I could see the need...

2) I think the whole __nh_sort_chain business will royally confuse
   anyone walking the chain while it happens.  This is exactly
   what I was working to avoid.  The partial sorting in __nh_insert
   isn't good enough.

   Instead, try:

/* Return true if bitrev(x) > bitrev(y) */
static bool bitrev_gt(unsigned long x, unsigned long y)
{
	/* Identify the bits that differ between x and y */
	unsigned long mask = x ^ y;	/* Find the bits that differ */
	mask ^= mask-1;	/* Find lsbit of difference (and all lower bits) */
	return (x & mask) > (y & mask);
}

static void __nh_insert(struct nh_entry *entry, struct nh_head *head)
{
	struct list_head *p, *n;
	unsigned long const hashval = nh_hashval(entry->data);

	/*
	 * Insert the new entry just before the first element of the list
	 * that its hash value is not greater than (bit-reversed).
	 */
	p = &head->list;
	list_for_each_rcu(n, &head->list) {
		struct nh_entry *t = container_of(n, struct nh_entry, nh_list);
		if (t->nh_type == NH_ENTRY &&
		    !bitrev_gt(hashval, nh_hashval(t->data)))
			break;
		p = n;
	}
	__list_add_rcu(&entry->nh_list, p, n);
}

static int nh_add(unsigned long data)
{
	struct nh_entry *entry = kmalloc(sizeof *entry, GFP_KERNEL);
	struct nh *nh;

	if (!entry) return -ENOMEM;

	entry->nh_type = NH_ENTRY;
	entry->data = data;

	rcu_read_lock();

	nh = get_nh();	/* or nh = __nh */

	if (nh) {
		struct nh_head *h = \
			&nh->hash[ nh_bucket(nh_hashval(data), nh->nentries) ];

		spin_lock(&h->lock);
		__nh_insert(entry, h);
		spin_unlock(&h->lock);
	}
	rcu_read_unlock();

	return 0;
}

   Then there's no need for __nh_sort_chain at all.  Alternatively, if the
   upper bits of nh_hashval are as good as the lower bits, just index the
   hash table on them.

3) Code inside a mutex like nh_resize() can use plain list_for_each();
   the _rcu variant is only required if there can be simultaneous
   mutation.

That's a nice module framework.  I'll see if I can write some code
of the sort I was thinking about.

FWIW, I figured out a way around the need to delay deletion for two
RCU intervals.

Suppose that an expansion is pending, and we have just stretched the
table from 16 entries to 32, and the following hash values are
stored.  (Note the bit-reversed order.)

old[3] --\   
new[3] ---+-> 0x03 -> 0x43 -> 0x23 -> 0x63 -> 0x13 -> 0x53 -> 0x33 -> 0x73
                                             /
                        new[3+16]-----------/

After an RCU period, you can throw away the old[] array and NULL-terminate
the new[3] list after 0x63.  But until then, you need to leave the list alone
to accommodate people who are looking for 0x53 via the old head.

The tricky case comes when you delete 0x13.  If you only delete it from the
new[3+16] list, you can't discard it until the RCU quiescent period
after the one which disconnects the 0x63->0x13 link.

The solution is actually very simple: notice when you're
- Deleting the first entry in a list
- While an expansion is pending
- And the list is in the second half of the expanded table
then unlink the entry from BOTH the new head and the old list.
It's a bit more work, and requires some lock-ordering care, but
it lets you queue the node for RCU cleanup the normal way.


* Re: Extensible hashing and RCU
  2007-02-05 18:02 ` akepner
@ 2007-02-17 13:13   ` Evgeniy Polyakov
  2007-02-18 18:46     ` Eric Dumazet
  2007-03-02  8:52     ` Evgeniy Polyakov
  0 siblings, 2 replies; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-17 13:13 UTC (permalink / raw)
  To: akepner; +Cc: linux, davem, netdev

On Mon, Feb 05, 2007 at 10:02:53AM -0800, akepner@sgi.com (akepner@sgi.com) wrote:
> On Sat, 4 Feb 2007 linux@horizon.com wrote:
> 
> >I noticed a mention in an LCA talk that apparently extensible hashing
> >with RCU access is an unsolved problem.  Here's an idea for solving it.
> >....
> 
> Yes, I have been playing around with the same idea for
> doing dynamic resizing of the TCP hashtable.
> 
> Did a prototype "toy" implementation, and I have a
> "half-finished" patch which resizes the TCP hashtable
> at runtime. Hmmm, your mail may be the impetus to get
> me to finally finish this thing....

Why does no one want to use a trie?  For socket-like loads it has
exactly constant search/insert/delete time and scales like hell.

> -- 
> Arthur
> 

-- 
	Evgeniy Polyakov


* Re: Extensible hashing and RCU
  2007-02-17 13:13   ` Evgeniy Polyakov
@ 2007-02-18 18:46     ` Eric Dumazet
  2007-02-18 19:10       ` Evgeniy Polyakov
  2007-02-19 14:04       ` Evgeniy Polyakov
  2007-03-02  8:52     ` Evgeniy Polyakov
  1 sibling, 2 replies; 102+ messages in thread
From: Eric Dumazet @ 2007-02-18 18:46 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: akepner, linux, davem, netdev, bcrl

Evgeniy Polyakov wrote:
> On Mon, Feb 05, 2007 at 10:02:53AM -0800, akepner@sgi.com (akepner@sgi.com) wrote:
>> On Sat, 4 Feb 2007 linux@horizon.com wrote:
>>
>>> I noticed a mention in an LCA talk that apparently extensible hashing
>>> with RCU access is an unsolved problem.  Here's an idea for solving it.
>>> ....
>> Yes, I have been playing around with the same idea for
>> doing dynamic resizing of the TCP hashtable.
>>
>> Did a prototype "toy" implementation, and I have a
>> "half-finished" patch which resizes the TCP hashtable
>> at runtime. Hmmm, your mail may be the impetus to get
>> me to finally finish this thing....
> 
> Why does no one want to use a trie?  For socket-like loads it has
> exactly constant search/insert/delete time and scales like hell.
> 

Because we want to be *very* fast. You cannot beat hash table.

Say you have 1,000,000 TCP connections, with 50,000 incoming packets per
second to *random* streams...
With a 2^20 hashtable, a lookup uses one cache line (the hash head pointer)
plus one cache line to get the socket (you need it to access its refcounter).

Several attempts were done in the past to add RCU to ehash table (last done by 
  Benjamin LaHaise last March). I believe this was delayed a bit, because 
David would like to be able to resize the hash table...

I am not really interested in hash resizing, because an admin can size it at 
boot time. But RCU is definitely *wanted*

Note : It would be good to also use RCU for UDP, because the current rwlock 
protecting udp_hash[] is a scalability problem.



* Re: Extensible hashing and RCU
  2007-02-18 18:46     ` Eric Dumazet
@ 2007-02-18 19:10       ` Evgeniy Polyakov
  2007-02-18 20:21         ` Eric Dumazet
  2007-02-19 14:04       ` Evgeniy Polyakov
  1 sibling, 1 reply; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-18 19:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: akepner, linux, davem, netdev, bcrl

On Sun, Feb 18, 2007 at 07:46:22PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> >Why does no one want to use a trie - for socket-like loads it has
> >exactly constant search/insert/delete time and scales as hell.
> >
> 
> Because we want to be *very* fast. You cannot beat a hash table.
> 
> Say you have 1.000.000 tcp connections, with 50.000 incoming packets per 
> second to *random* streams...

What is really good in a trie is that you may have up to 2^32 connections
without _any_ difference in lookup performance of random streams.

> With a 2^20 hashtable, a lookup uses one cache line (the hash head pointer) 
> plus one cache line to get the socket (you need it to access its refcounter)
> 
> Several attempts were done in the past to add RCU to ehash table (last done 
> by Benjamin LaHaise last March). I believe this was delayed a bit, because 
> David would like to be able to resize the hash table...

This is a theory.
Practice includes the cost of hashing, locking, and list traversal
(each pointer is in its own cache line btw, which must be fetched), plus
the same for time-wait sockets (if we are unlucky).

No need to talk about the price of a cache miss when there might be more
serious problems - for example the length of the linked list to traverse each
time a new packet is received.

For example, lookup time in a trie with 1.6 million random 3-dimensional
32bit (saddr/daddr/ports) entries is about 1 microsecond on an amd athlon64
3500 cpu (the test was run in a userspace emulator though).
Data obtained from netchannels research:
http://tservice.net.ru/~s0mbre/blog/2006/12/02#2006_12_02

I think I need to set up a similar emulator for a hash table of the above
size to check hash/list goodness.

> I am not really interested in hash resizing, because an admin can size it 
> at boot time. But RCU is definitely *wanted*

The trie in that regard is a true winner - its RCU'fication is trivial.

> Note : It would be good to also use RCU for UDP, because the current rwlock 
> protecting udp_hash[] is a scalability problem.

-- 
	Evgeniy Polyakov


* Re: Extensible hashing and RCU
  2007-02-18 19:10       ` Evgeniy Polyakov
@ 2007-02-18 20:21         ` Eric Dumazet
  2007-02-18 21:23           ` Michael K. Edwards
                             ` (2 more replies)
  0 siblings, 3 replies; 102+ messages in thread
From: Eric Dumazet @ 2007-02-18 20:21 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: akepner, linux, davem, netdev, bcrl

Evgeniy Polyakov a écrit :
> On Sun, Feb 18, 2007 at 07:46:22PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
>>> Why does no one want to use a trie - for socket-like loads it has
>>> exactly constant search/insert/delete time and scales as hell.
>>>
>> Because we want to be *very* fast. You cannot beat a hash table.
>>
>> Say you have 1.000.000 tcp connections, with 50.000 incoming packets per 
>> second to *random* streams...
> 
> What is really good in a trie is that you may have up to 2^32 connections
> without _any_ difference in lookup performance of random streams.

So are you speaking of one memory cache miss per lookup ?
If not, you lose.

> 
>> With a 2^20 hashtable, a lookup uses one cache line (the hash head pointer) 
>> plus one cache line to get the socket (you need it to access its refcounter)
>>
>> Several attempts were done in the past to add RCU to ehash table (last done 
>> by Benjamin LaHaise last March). I believe this was delayed a bit, because 
>> David would like to be able to resize the hash table...
> 
> This is a theory.

Not theory, but actual practice, on a real machine.

# cat /proc/net/sockstat
sockets: used 918944
TCP: inuse 925413 orphan 7401 tw 4906 alloc 926292 mem 304759
UDP: inuse 9
RAW: inuse 0
FRAG: inuse 9 memory 18360


> Practice includes the cost of hashing, locking, and list traversal
> (each pointer is in its own cache line btw, which must be fetched), plus
> the same for time-wait sockets (if we are unlucky).
> 
> No need to talk about the price of a cache miss when there might be more
> serious problems - for example the length of the linked list to traverse each
> time a new packet is received.
> 
> For example, lookup time in a trie with 1.6 million random 3-dimensional
> 32bit (saddr/daddr/ports) entries is about 1 microsecond on an amd athlon64
> 3500 cpu (the test was run in a userspace emulator though).

1 microsecond ? Are you kidding ? We want no more than 50 ns.

You can check on this dual cpu machine: tcp_v4_rcv() uses 2.29 % of the cpu.

CPU: AMD64 processors, speed 1992.67 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit 
mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
2009510   4.6863  memcpy_c
1668842   3.8918  tg3_start_xmit_dma_bug
1485844   3.4651  tg3_poll
1293558   3.0167  kmem_cache_free
1232862   2.8751  kfree
1131012   2.6376  free_block
1000671   2.3336  ip_route_input
982655    2.2916  tcp_v4_rcv
955554    2.2284  __alloc_skb
863753    2.0143  tcp_ack
863222    2.0131  tcp_recvmsg
834680    1.9465  fget_light
801445    1.8690  lock_sock_nested
793699    1.8510  tcp_sendmsg
764689    1.7833  copy_user_generic_string
743515    1.7339  ip_queue_xmit
712314    1.6612  sock_wfree
650486    1.5170  tcp_rcv_established



* Re: Extensible hashing and RCU
  2007-02-18 20:21         ` Eric Dumazet
@ 2007-02-18 21:23           ` Michael K. Edwards
  2007-02-18 22:04             ` Michael K. Edwards
  2007-02-19 12:04             ` Andi Kleen
  2007-02-19 11:41           ` Evgeniy Polyakov
  2007-02-19 12:02           ` Andi Kleen
  2 siblings, 2 replies; 102+ messages in thread
From: Michael K. Edwards @ 2007-02-18 21:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Evgeniy Polyakov, akepner, linux, davem, netdev, bcrl

A better data structure for RCU, even with a fixed key space, is
probably a splay tree.  Much less vulnerable to cache eviction DDoS
than a hash, because the hot connections get rotated up into non-leaf
layers and get traversed enough to keep them in the LRU set.  But I am
not a data structures guru; a stratified tree or van Emde-Boas
priority queue might be a better choice.  RCU splay tree has lots of
other potential applications inside the kernel, so it would make a fun
project for someone with time to burn.
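For illustration, the operation being proposed is the classic top-down splay of Sleator and Tarjan, shown here with plain int keys standing in for connection tuples (a generic sketch, not kernel code). Note that every lookup restructures the tree on the way down, i.e. reads are also writes:

```c
#include <assert.h>
#include <stdlib.h>

struct snode {
	int key;
	struct snode *left, *right;
};

/* Top-down splay: brings the node with 'key' (or the last node on the
 * search path) to the root.  After Sleator and Tarjan. */
static struct snode *splay(int key, struct snode *t)
{
	struct snode N, *l, *r, *y;

	if (!t)
		return t;
	N.left = N.right = NULL;
	l = r = &N;
	for (;;) {
		if (key < t->key) {
			if (!t->left)
				break;
			if (key < t->left->key) {	/* rotate right */
				y = t->left;
				t->left = y->right;
				y->right = t;
				t = y;
				if (!t->left)
					break;
			}
			r->left = t;			/* link right */
			r = t;
			t = t->left;
		} else if (key > t->key) {
			if (!t->right)
				break;
			if (key > t->right->key) {	/* rotate left */
				y = t->right;
				t->right = y->left;
				y->left = t;
				t = y;
				if (!t->right)
					break;
			}
			l->right = t;			/* link left */
			l = t;
			t = t->right;
		} else
			break;
	}
	l->right = t->left;				/* reassemble */
	r->left = t->right;
	t->left = N.right;
	t->right = N.left;
	return t;					/* new root */
}

static struct snode *splay_insert(int key, struct snode *t)
{
	struct snode *n = malloc(sizeof(*n));

	n->key = key;
	if (!t) {
		n->left = n->right = NULL;
		return n;
	}
	t = splay(key, t);
	if (key < t->key) {
		n->left = t->left;
		n->right = t;
		t->left = NULL;
		return n;
	}
	if (key > t->key) {
		n->right = t->right;
		n->left = t;
		t->right = NULL;
		return n;
	}
	free(n);					/* already present */
	return t;
}
```

Hot keys end up at or near the root, which is what gives the LRU-like behavior; the cost is that even a pure lookup dirties cache lines, which matters for the MP case.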

Cheers,
- Michael


* Re: Extensible hashing and RCU
  2007-02-18 21:23           ` Michael K. Edwards
@ 2007-02-18 22:04             ` Michael K. Edwards
  2007-02-19 12:04             ` Andi Kleen
  1 sibling, 0 replies; 102+ messages in thread
From: Michael K. Edwards @ 2007-02-18 22:04 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Evgeniy Polyakov, akepner, linux, davem, netdev, bcrl

On 2/18/07, Michael K. Edwards <medwards.linux@gmail.com> wrote:
> ...  Much less vulnerable to cache eviction DDoS
> than a hash, because the hot connections get rotated up into non-leaf
> layers and get traversed enough to keep them in the LRU set.

Let me enlarge on this a bit.  I used to work for a company that built
a custom firewall/VPN ASIC with all sorts of special sauce in it,
mostly focused on dealing with DDoS.  Some really smart guys, some
really good technology, I hope they grab the brass ring someday.  On
the scale they were dealing with, there's only one thing to do about
DDoS: bend over and take it.  Provision enough memory bandwidth to
cold-cache every packet, every session lookup, and every
packet-processing-progress structure.  Massively parallelize, spinlock
on on-chip SRAM, tune for the cold-cache case.  If you can't afford to
do that -- and if you haven't designed your own chip, with separate
cache windows for each of these use cases, you can't, because they're
all retrograde loads for an LRU cache -- then a hash is not the right
answer.  The interaction between resizing and RCU is just the canary
in the coal mine.

Cheers,
- Michael


* Re: Extensible hashing and RCU
  2007-02-18 20:21         ` Eric Dumazet
  2007-02-18 21:23           ` Michael K. Edwards
@ 2007-02-19 11:41           ` Evgeniy Polyakov
  2007-02-19 13:38             ` Eric Dumazet
  2007-02-19 12:02           ` Andi Kleen
  2 siblings, 1 reply; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-19 11:41 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: akepner, linux, davem, netdev, bcrl

On Sun, Feb 18, 2007 at 09:21:30PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> Evgeniy Polyakov a écrit :
> >On Sun, Feb 18, 2007 at 07:46:22PM +0100, Eric Dumazet 
> >(dada1@cosmosbay.com) wrote:
> >>>Why does no one want to use a trie - for socket-like loads it has
> >>>exactly constant search/insert/delete time and scales as hell.
> >>>
> >>Because we want to be *very* fast. You cannot beat a hash table.
> >>
> >>Say you have 1.000.000 tcp connections, with 50.000 incoming packets per 
> >>second to *random* streams...
> >
> >What is really good in a trie is that you may have up to 2^32 connections
> >without _any_ difference in lookup performance of random streams.
> 
> So are you speaking of one memory cache miss per lookup ?
> If not, you lose.

With a trie, a big part of it _does_ live in the cache, compared to a hash
table where similar addresses end up in completely different hash entries.

> >>With a 2^20 hashtable, a lookup uses one cache line (the hash head 
> >>pointer) plus one cache line to get the socket (you need it to access its 
> >>refcounter)
> >>
> >>Several attempts were done in the past to add RCU to ehash table (last 
> >>done by Benjamin LaHaise last March). I believe this was delayed a bit, 
> >>because David would like to be able to resize the hash table...
> >
> >This is a theory.
> 
> Not theory, but actual practice, on a real machine.
> 
> # cat /proc/net/sockstat
> sockets: used 918944
> TCP: inuse 925413 orphan 7401 tw 4906 alloc 926292 mem 304759
> UDP: inuse 9
> RAW: inuse 0
> FRAG: inuse 9 memory 18360

Theory is speculation about performance.

A highly cache-optimized bubble sort is still much worse than a
cache-unoptimized binary tree.
 
> >Practice includes the cost of hashing, locking, and list traversal
> >(each pointer is in its own cache line btw, which must be fetched), plus
> >the same for time-wait sockets (if we are unlucky).
> >
> >No need to talk about the price of a cache miss when there might be more
> >serious problems - for example the length of the linked list to traverse each
> >time a new packet is received.
> >
> >For example, lookup time in a trie with 1.6 million random 3-dimensional
> >32bit (saddr/daddr/ports) entries is about 1 microsecond on an amd athlon64
> >3500 cpu (the test was run in a userspace emulator though).
> 
> 1 microsecond ? Are you kidding ? We want no more than 50 ns.

Theory again.

The existing table does not scale that well - I created a (1<<20)/2-entry
table (to cover only the established part) and filled it with 1 million
random entries - search time is about half a microsecond.

Wanna see the code? I copied the Linux hash table magic into userspace and ran
the same inet_hash() and inet_lookup() in a loop. Result above.

Trie is still 2 times worse, but I've just found a bug in my implementation.

-- 
	Evgeniy Polyakov


* Re: Extensible hashing and RCU
  2007-02-18 20:21         ` Eric Dumazet
  2007-02-18 21:23           ` Michael K. Edwards
  2007-02-19 11:41           ` Evgeniy Polyakov
@ 2007-02-19 12:02           ` Andi Kleen
  2007-02-19 12:35             ` Robert Olsson
  2 siblings, 1 reply; 102+ messages in thread
From: Andi Kleen @ 2007-02-19 12:02 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Evgeniy Polyakov, akepner, linux, davem, netdev, bcrl

Eric Dumazet <dada1@cosmosbay.com> writes:
> 
> So are you speaking of one memory cache miss per lookup ?

Actually two: if the trie'ing allows RCUing you would save
the spinlock cache line too. This would increase the 
break-even budget for the trie.

> If not, you lose.

It all depends on whether the higher levels of the trie are small
enough to be kept in cache. Even with two cache misses it might
still break even, but have better scalability.

Another advantage would be to eliminate the need for large memory
blocks, which cause problems too e.g. on NUMA. It certainly would
save quite some memory if the tree levels are allocated on demand
only. However breaking it up might also cost more TLB misses, 
but those could be eliminated by preallocating the tree in
the same way as the hash today. Don't know if it's needed or not.

I guess someone needs to code it up and try it.
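A minimal starting point might be a fixed-stride trie with levels allocated on demand, as suggested above. The 8-bit stride, 32-bit key and all names are arbitrary choices for illustration, not from any posted patch:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Fixed-stride trie over a 32-bit key: 8 bits per level, 4 levels,
 * with lower levels allocated on demand. */
#define STRIDE	8
#define FANOUT	(1u << STRIDE)
#define LEVELS	(32 / STRIDE)

struct tnode {
	void *slot[FANOUT];	/* child node, or value at the last level */
};

static struct tnode trie_root;

static int trie_insert(uint32_t key, void *val)
{
	struct tnode *n = &trie_root;
	int level;

	for (level = 0; level < LEVELS - 1; level++) {
		unsigned int idx = (key >> (32 - STRIDE * (level + 1)))
				   & (FANOUT - 1);

		if (!n->slot[idx]) {	/* allocate this level on demand */
			n->slot[idx] = calloc(1, sizeof(struct tnode));
			if (!n->slot[idx])
				return -1;
		}
		n = n->slot[idx];
	}
	n->slot[key & (FANOUT - 1)] = val;
	return 0;
}

static void *trie_lookup(uint32_t key)
{
	struct tnode *n = &trie_root;
	int level;

	for (level = 0; level < LEVELS - 1; level++) {
		n = n->slot[(key >> (32 - STRIDE * (level + 1)))
			    & (FANOUT - 1)];
		if (!n)
			return NULL;
	}
	return n->slot[key & (FANOUT - 1)];	/* at most LEVELS loads */
}
```

Here each node is 2 KB (256 pointers on a 64-bit machine); a wider root stride flattens the tree further, trading memory for fewer cache misses, which is exactly the tuning question raised above.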

-Andi


* Re: Extensible hashing and RCU
  2007-02-18 21:23           ` Michael K. Edwards
  2007-02-18 22:04             ` Michael K. Edwards
@ 2007-02-19 12:04             ` Andi Kleen
  2007-02-19 19:18               ` Michael K. Edwards
  1 sibling, 1 reply; 102+ messages in thread
From: Andi Kleen @ 2007-02-19 12:04 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Eric Dumazet, Evgeniy Polyakov, akepner, linux, davem, netdev, bcrl

"Michael K. Edwards" <medwards.linux@gmail.com> writes:

> A better data structure for RCU, even with a fixed key space, is
> probably a splay tree.  Much less vulnerable to cache eviction DDoS
> than a hash, because the hot connections get rotated up into non-leaf
> layers and get traversed enough to keep them in the LRU set.

LRU tends to be hell for caches in MP systems, because it writes to
the cache lines too and makes them exclusive and more expensive.

-Andi


* Re: Extensible hashing and RCU
  2007-02-19 12:02           ` Andi Kleen
@ 2007-02-19 12:35             ` Robert Olsson
  0 siblings, 0 replies; 102+ messages in thread
From: Robert Olsson @ 2007-02-19 12:35 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Eric Dumazet, Evgeniy Polyakov, akepner, linux, davem, netdev, bcrl


Andi Kleen writes:

 > > If not, you lose.
 > 
 > It all depends on whether the higher levels of the trie are small
 > enough to be kept in cache. Even with two cache misses it might
 > still break even, but have better scalability.

 Yes, the trick is to keep the root large to allow a very flat tree and few
 cache misses. Stefan Nilsson (author of the LC-trie) and I were able to
 improve the LC-trie quite a bit; we called this trie+hash -> trash.

 The paper discusses trie/hash... (you've seen it)

 http://www.nada.kth.se/~snilsson/public/papers/trash/

 > Another advantage would be to eliminate the need for large memory
 > blocks, which cause problems too e.g. on NUMA. It certainly would
 > save quite some memory if the tree levels are allocated on demand
 > only. However breaking it up might also cost more TLB misses, 
 > but those could be eliminated by preallocating the tree in
 > the same way as the hash today. Don't know if it's needed or not.
 > 
 > I guess someone needs to code it up and try it.
 
 I've implemented trie/trash as a replacement for the dst hash, doing full
 key lookup for ipv4 (unicache) to start with. It is still focused
 on the nasty parts, packet forwarding, as we don't want to break this....
 So the benefits of full flow lookup are not accounted for. I.e. the full flow
 lookup could give the socket at no cost and do some conntrack support
 like Evgeniy did in the netchannels patches.

 Below, some recent comparisons and profiles for the packet forwarding
 
Input 2 * 65k concurrent flows eth0->eth1, eth2->eth3 in forwarding
On separate CPU Opteron 2218 (2.6 GHZ)  net-2.6.21 git. Numbers are very 
approximate but should still be representative. Profiles are 
collected. 

Performance comparison
----------------------
Table below holds: dst-entries in use, lookup hits, slow path and total pps

Flowlen 40

250k 1020 + 21 = 1041 kpps  Vanilla rt_hash=32k
1M    950 + 29 =  979 kpps  Vanilla rt_hash=131k
260k  980 + 24 = 1004 kpps  Unicache

Flowlen 4 (rdos)

290k 560 + 162 = 722 kpps  Vanilla  rt_hash=32k
1M   400 + 165 = 565 kpps  Vanilla  rt_hash=131k
230k 570 + 170 = 740 kpps  Unicache 


unicache flen=4 pkts
--------------------
c02df84f 5257     7.72078     tkey_extract_bits
c023151a 5230     7.68112     e1000_clean_rx_irq
c02df908 3306     4.85541     tkey_equals
c014cf31 3296     4.84072     kfree
c02f8c3b 3067     4.5044      ip_route_input
c02fbdf0 2948     4.32963     ip_forward
c023024e 2809     4.12548     e1000_xmit_frame
c02e06f1 2792     4.10052     trie_lookup
c02fd764 2159     3.17085     ip_output
c032591c 1965     2.88593     fn_trie_lookup
c014cd82 1456     2.13838     kmem_cache_alloc
c02fa941 1337     1.96361     ip_rcv
c014ced0 1334     1.9592      kmem_cache_free
c02e1538 1289     1.89311     unicache_tcp_establish
c02e2d70 1218     1.78884     dev_queue_xmit
c02e31af 1074     1.57735     netif_receive_skb
c02f8484 1053     1.54651     ip_route_input_slow
c02db552 987      1.44957     __alloc_skb
c02e626f 913      1.34089     dst_alloc
c02edaad 828      1.21606     __qdisc_run
c0321ccf 810      1.18962     fib_get_table
c02e14c1 782      1.1485      match_pktgen
c02e6375 766      1.125       dst_destroy
c02e10e8 728      1.06919     unicache_hash_code
c0231242 647      0.950227    e1000_clean_tx_irq
c02f7d23 625      0.917916    ipv4_dst_destroy

unicache flen=40 pkts
---------------------
c023151a 6742     10.3704     e1000_clean_rx_irq
c02df908 4553     7.00332     tkey_equals
c02fbdf0 4455     6.85258     ip_forward
c02e06f1 4067     6.25577     trie_lookup
c02f8c3b 3951     6.07734     ip_route_input
c02df84f 3929     6.0435      tkey_extract_bits
c023024e 3538     5.44207     e1000_xmit_frame
c014cf31 3152     4.84834     kfree
c02fd764 2711     4.17        ip_output
c02e1538 1930     2.96868     unicache_tcp_establish
c02fa941 1696     2.60875     ip_rcv
c02e31af 1466     2.25497     netif_receive_skb
c02e2d70 1412     2.17191     dev_queue_xmit
c014cd82 1397     2.14883     kmem_cache_alloc
c02db552 1394     2.14422     __alloc_skb
c02edaad 1032     1.5874      __qdisc_run
c02ed6b8 957      1.47204     eth_header
c02e15dd 904      1.39051     unicache_garbage_collect_active
c02db94e 861      1.32437     kfree_skb
c0231242 794      1.22131     e1000_clean_tx_irq
c022fd58 778      1.1967      e1000_tx_map
c014ce73 756      1.16286     __kmalloc
c014ced0 740      1.13825     kmem_cache_free
c02e14c1 701      1.07826     match_pktgen
c023002c 621      0.955208    e1000_tx_queue
c02e78fa 519      0.798314    neigh_resolve_output

Vanilla w. flen=4 pkts rt_hash=32k
----------------------------------
c02f6852 15704    22.9102     ip_route_input
c023151a 5324     7.76705     e1000_clean_rx_irq
c02f84a1 4457     6.5022      ip_rcv
c02f9950 3065     4.47145     ip_forward
c023024e 2630     3.83684     e1000_xmit_frame
c0323380 2343     3.41814     fn_trie_lookup
c02fb2c4 2181     3.1818      ip_output
c02f4a3b 1839     2.68287     rt_intern_hash
c02f4480 1762     2.57054     rt_may_expire
c02f6046 1697     2.47571     ip_route_input_slow
c014ced0 1615     2.35608     kmem_cache_free
c014cf31 1522     2.22041     kfree
c02e0bcb 1321     1.92717     netif_receive_skb
c02e078c 1251     1.82505     dev_queue_xmit
c02e3c8b 1089     1.58871     dst_alloc
c02eb0d4 1010     1.47346     eth_header
c02e3d91 973      1.41948     dst_destroy
c02db552 972      1.41803     __alloc_skb
c02eb4c9 883      1.28819     __qdisc_run
c031f733 798      1.16418     fib_get_table
c01c0fa2 749      1.0927      memcmp
c02f59a5 703      1.02559     ipv4_dst_destroy
c014cd82 673      0.981822    kmem_cache_alloc
c0324fb1 670      0.977446    fib_lookup
c02db94e 670      0.977446    kfree_skb
c022fd58 631      0.92055     e1000_tx_map

Vanilla w. flen=40 pkts rt_hash=32k
-----------------------------------
c02f6852 17445    25.507      ip_route_input
c023151a 7657     11.1956     e1000_clean_rx_irq
c02f84a1 7004     10.2408     ip_rcv
c02f9950 4851     7.09283     ip_forward
c023024e 3874     5.66432     e1000_xmit_frame
c02fb2c4 2905     4.24751     ip_output
c014cf31 2065     3.01931     kfree
c02e078c 1874     2.74005     dev_queue_xmit
c02e0bcb 1712     2.50318     netif_receive_skb
c02db552 1352     1.97681     __alloc_skb
c02eb4c9 1204     1.76041     __qdisc_run
c02eb0d4 980      1.4329      eth_header
c02db94e 846      1.23697     kfree_skb
c022fd58 833      1.21796     e1000_tx_map
c014ced0 819      1.19749     kmem_cache_free
c014cd82 806      1.17848     kmem_cache_alloc
c014ce73 761      1.11269     __kmalloc
c0231242 717      1.04835     e1000_clean_tx_irq
c023002c 595      0.869972    e1000_tx_queue
c02e5316 563      0.823184    neigh_resolve_output
c02eb221 498      0.728145    eth_type_trans
c0231ea2 471      0.688667    e1000_alloc_rx_buffers
c0323380 459      0.671121    fn_trie_lookup
c02ef8fd 456      0.666735    qdisc_dequeue_head
c02e06dd 422      0.617022    dev_hard_start_xmit
c02f4074 404      0.590704    rt_hash_code

Vanilla w. flen=4 pkts rt_hash=131k
------------------------------------
c02f6852 9679     17.3758     ip_route_input
c023151a 3895     6.99232     e1000_clean_rx_irq
c02f84a1 2858     5.13069     ip_rcv
c0323380 2507     4.50057     fn_trie_lookup
c023024e 2233     4.00869     e1000_xmit_frame
c02f4480 2075     3.72505     rt_may_expire
c02f9950 1966     3.52937     ip_forward
c014ced0 1659     2.97824     kmem_cache_free
c02f4a3b 1654     2.96927     rt_intern_hash
c02f6046 1523     2.73409     ip_route_input_slow
c02fb2c4 1500     2.6928      ip_output
c014cf31 1375     2.4684      kfree
c02e078c 1098     1.97113     dev_queue_xmit
c02e3c8b 1081     1.94061     dst_alloc
c02e3d91 947      1.70006     dst_destroy
c014cc12 890      1.59773     free_block
c02e0bcb 874      1.56901     netif_receive_skb
c02eb4c9 861      1.54567     __qdisc_run
c02db552 777      1.39487     __alloc_skb
c02f59a5 705      1.26562     ipv4_dst_destroy
c031f733 689      1.2369      fib_get_table
c02ef8fd 669      1.20099     qdisc_dequeue_head
c01c0fa2 665      1.19381     memcmp
c0231242 642      1.15252     e1000_clean_tx_irq
c0324fb1 636      1.14175     fib_lookup
c014c8f1 620      1.11303     slab_put_obj

Vanilla w. flen=40 pkts rt_hash=131k
------------------------------------
c02f6852 9647     16.9913     ip_route_input
c023151a 6879     12.116      e1000_clean_rx_irq
c02f84a1 6102     10.7475     ip_rcv
c02f9950 4248     7.48203     ip_forward
c023024e 3625     6.38474     e1000_xmit_frame
c02fb2c4 2501     4.40503     ip_output
c014cf31 2349     4.13731     kfree
c02e078c 1714     3.01888     dev_queue_xmit
c02e0bcb 1461     2.57327     netif_receive_skb
c02eb4c9 1286     2.26504     __qdisc_run
c02db552 1239     2.18226     __alloc_skb
c0231242 1034     1.82119     e1000_clean_tx_irq
c02ef8fd 936      1.64858     qdisc_dequeue_head
c014ced0 888      1.56404     kmem_cache_free
c014cd82 877      1.54467     kmem_cache_alloc
c022fd58 828      1.45836     e1000_tx_map
c014ce73 789      1.38967     __kmalloc
c02db94e 741      1.30513     kfree_skb
c023002c 622      1.09553     e1000_tx_queue
c02eb221 524      0.922925    eth_type_trans
c0323380 473      0.833098    fn_trie_lookup
c02e039b 449      0.790827    dev_kfree_skb_any
c0231ea2 442      0.778498    e1000_alloc_rx_buffers
c02e06dd 420      0.739749    dev_hard_start_xmit
c0230227 391      0.688671    e1000_maybe_stop_tx
c02f4074 336      0.591799    rt_hash_code


Some comments:

To some extent we compare apples and bananas, but it is still interesting:
unicache does the full key lookup, and we see that the lookup cost itself
is not expensive (trie_lookup), but key handling costs a bit.
tkey_extract_bits is where we make the key. The corresponding cost
for the hash would be when we create the actual hash.

tkey_equals is to verify the lookup; we also do this with the hash.
Maybe in the case of routing it is possible to optimize a bit:
only if the packet is for localhost would we need to do the full
tkey_equals - this could possibly save some cache misses.


Cheers.
					--ro


 
 


* Re: Extensible hashing and RCU
  2007-02-19 11:41           ` Evgeniy Polyakov
@ 2007-02-19 13:38             ` Eric Dumazet
  2007-02-19 13:56               ` Evgeniy Polyakov
  0 siblings, 1 reply; 102+ messages in thread
From: Eric Dumazet @ 2007-02-19 13:38 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: akepner, linux, davem, netdev, bcrl

On Monday 19 February 2007 12:41, Evgeniy Polyakov wrote:

> > 1 microsecond ? Are you kidding ? We want no more than 50 ns.
>
> Theory again.


Theory is nice, but I personally prefer oprofile :)
I base my comments on real facts.
We *want* 50 ns tcp lookups (2 cache line misses, one with reader intent, one 
for exclusive access intent)

>
> The existing table does not scale that well - I created a (1<<20)/2-entry
> table (to cover only the established part) and filled it with 1 million
> random entries - search time is about half a microsecond.

I use exactly 2^20 slots, not 2^19 (see commit
dbca9b2750e3b1ee6f56a616160ccfc12e8b161f, where I changed the layout of the
ehash table so that the two chains (established/timewait) are on the same
cache line. Every cache miss *counts*)
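The layout idea from that commit can be sketched like this (a minimal stand-in for the kernel's hlist head; field names are illustrative, not the actual kernel definitions):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's hlist head (a single pointer). */
struct hlist_node;
struct hlist_head {
	struct hlist_node *first;
};

/* The established and timewait chain heads of one hash slot sit side
 * by side, so the cache line fetched for the established lookup
 * already holds the timewait head. */
struct ehash_bucket {
	struct hlist_head chain;	/* established sockets */
	struct hlist_head twchain;	/* timewait sockets */
};
```

An established-table miss can then fall through to the timewait chain without a second cache line fetch for the bucket.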

http://www.mail-archive.com/netdev@vger.kernel.org/msg31096.html

(Of course, you may have to change MAX_ORDER to 14 or else the hash table hits 
the MAX_ORDER limit)

Search time under 100 ns, for real traffic (kind of random... but not quite)
Most of this time is taken by the rwlock, so expect 50 ns once RCU is finally 
in...

In your tests, please make sure a User process is actually doing real work on 
each CPU, ie evicting cpu caches every ms...

The rule is : On a normal machine, cpu caches contain UserMode data, not 
kernel data. (as a typical machine spends 15% of its cpu time in kernel land, 
and 85% in User land). You can assume kernel text is in cache, but even this 
assumption may be wrong.



* Re: Extensible hashing and RCU
  2007-02-19 13:38             ` Eric Dumazet
@ 2007-02-19 13:56               ` Evgeniy Polyakov
  2007-02-19 14:14                 ` Eric Dumazet
  2007-02-19 22:10                 ` Andi Kleen
  0 siblings, 2 replies; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-19 13:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: akepner, linux, davem, netdev, bcrl

On Mon, Feb 19, 2007 at 02:38:13PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> On Monday 19 February 2007 12:41, Evgeniy Polyakov wrote:
> 
> > > 1 microsecond ? Are you kidding ? We want no more than 50 ns.
> >
> > Theory again.
> 
> 
> Theory is nice, but I personally prefer oprofile :)
> I base my comments on real facts.
> We *want* 50 ns tcp lookups (2 cache line misses, one with reader intent, one 
> for exclusive access intent)

I said in previous mails that your words are theory :)

Current code works 10 times worse than you expect.

> > The existing table does not scale that well - I created a (1<<20)/2-entry
> > table (to cover only the established part) and filled it with 1 million
> > random entries - search time is about half a microsecond.
> 
> I use exactly 2^20 slots, not 2^19 (see commit
> dbca9b2750e3b1ee6f56a616160ccfc12e8b161f, where I changed the layout of the
> ehash table so that the two chains (established/timewait) are on the same
> cache line. Every cache miss *counts*)

Forget about cache misses and cache lines - we have a hash table, only
part of which is used (part for time-wait sockets, part for established
ones).

Anyway, even with 2^20 (i.e. when the whole table is only used for
established sockets) search time is about 360-370 nsec on 3.7 GHz Core
Duo (only one CPU is used) with 2 GB of ram.

> http://www.mail-archive.com/netdev@vger.kernel.org/msg31096.html
> 
> (Of course, you may have to change MAX_ORDER to 14 or else the hash table hits 
> the MAX_ORDER limit)
> 
> Search time under 100 ns, for real traffic (kind of random... but not quite)
> Most of this time is taken by the rwlock, so expect 50 ns once RCU is finally 
> in...

My experiment shows almost 400 nsecs without _any_ locks - they are
removed completely - it is pure hash selection/list traversal time.

> In your tests, please make sure a User process is actually doing real work on 
> each CPU, ie evicting cpu caches every ms...
> 
> The rule is : On a normal machine, cpu caches contain UserMode data, not 
> kernel data. (as a typical machine spends 15% of its cpu time in kernel land, 
> and 85% in User land). You can assume kernel text is in cache, but even this 
> assumption may be wrong.

In my tests _only_ the hash tables are in memory (well, with some bits of
other stuff) - I use exactly the same approach for both the trie and hash
table tests - the table/trie is allocated, filled, and lookup of random
values is performed in a loop. It is done in userspace - I just moved list.h,
inet_hashtable.h and other needed files into a separate project and
compiled them (with locks, atomic operations and other pure kernel stuff
removed). So the actual time is even higher for the hash table - at least it
requires locks, while the trie implementation works with RCU.

-- 
	Evgeniy Polyakov


* Re: Extensible hashing and RCU
  2007-02-18 18:46     ` Eric Dumazet
  2007-02-18 19:10       ` Evgeniy Polyakov
@ 2007-02-19 14:04       ` Evgeniy Polyakov
  1 sibling, 0 replies; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-19 14:04 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: akepner, linux, davem, netdev, bcrl

Actually, for socket code any other binary tree will work perfectly ok -
socket code does not have wildcards (except listening sockets), so it is
possible to combine all values into one search key used in a flat
one-dimensional tree - it scales extremely well and still allows very
fast lookups.

As for cache usage - such trees can be combined with different protocols
to increase cache locality.

The only reason I implemented a trie is that netchannels support
wildcards; that is how netfilter is implemented on top of them.

A tree with lazy deletion (i.e. without deletion at all) can be moved to 
RCU very easily.
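Combining the saddr/daddr/ports values into one flat key might look like the following sketch (the packing and names are illustrative, not taken from the netchannels code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack saddr/daddr/ports into one flat 96-bit key with a total order,
 * usable as-is by any one-dimensional balanced tree. */
struct flat_key {
	uint32_t saddr;
	uint32_t daddr;
	uint32_t ports;		/* sport << 16 | dport */
};

static void make_key(struct flat_key *k, uint32_t saddr, uint32_t daddr,
		     uint16_t sport, uint16_t dport)
{
	k->saddr = saddr;
	k->daddr = daddr;
	k->ports = ((uint32_t)sport << 16) | dport;
}

/* memcmp is safe here: three uint32_t fields, no padding. */
static int key_cmp(const struct flat_key *a, const struct flat_key *b)
{
	return memcmp(a, b, sizeof(*a));
}
```

With such a comparator the tree never needs to know about the individual tuple fields, which is what makes the flat one-dimensional structure possible.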

-- 
	Evgeniy Polyakov


* Re: Extensible hashing and RCU
  2007-02-19 13:56               ` Evgeniy Polyakov
@ 2007-02-19 14:14                 ` Eric Dumazet
  2007-02-19 14:25                   ` Evgeniy Polyakov
  2007-02-19 22:10                 ` Andi Kleen
  1 sibling, 1 reply; 102+ messages in thread
From: Eric Dumazet @ 2007-02-19 14:14 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: akepner, linux, davem, netdev, bcrl

On Monday 19 February 2007 14:56, Evgeniy Polyakov wrote:
> On Mon, Feb 19, 2007 at 02:38:13PM +0100, Eric Dumazet (dada1@cosmosbay.com) 
wrote:
> > On Monday 19 February 2007 12:41, Evgeniy Polyakov wrote:
> > > > 1 microsecond ? Are you kidding ? We want no more than 50 ns.
> > >
> > > Theory again.
> >
> > Theory is nice, but I personally prefer oprofile :)
> > I base my comments on real facts.
> > We *want* 50 ns tcp lookups (2 cache line misses, one with reader intent,
> > one for exclusive access intent)
>
> I said that your words are theory in previous mails :)
>
> Current code works 10 times worse than you expect.
>
> > > Existing table does not scale that well - I created a (1<<20)/2 (to cover
> > > only the established part) entry table and filled it with 1 million
> > > random entries - search time is about half of a microsecond.
> >
> > I use exactly 2^20 slots, not 2^19 (see commit
> > dbca9b2750e3b1ee6f56a616160ccfc12e8b161f , where I changed layout of
> > ehash table so that two chains (established/timewait) are on the same
> > cache line. every cache miss *counts*)
>
> Forget about cache misses and cache lines - we have a hash table, only
> part of which is used (part for time-wait sockets, part for established
> ones).

No, you did not read my mail. Current ehash is not as described by you.

>
> Anyway, even with 2^20 (i.e. when the whole table is only used for
> established sockets) search time is about 360-370 nsec on 3.7 GHz Core
> Duo (only one CPU is used) with 2 GB of ram.

Your tests are user land, so unfortunately they are biased...

(Unless you use hugetlb data ?)


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-19 14:14                 ` Eric Dumazet
@ 2007-02-19 14:25                   ` Evgeniy Polyakov
  2007-02-19 15:14                     ` Eric Dumazet
  0 siblings, 1 reply; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-19 14:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: akepner, linux, davem, netdev, bcrl

On Mon, Feb 19, 2007 at 03:14:02PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> > Forget about cache misses and cache lines - we have a hash table, only
> > part of which is used (part for time-wait sockets, part for established
> > ones).
> 
> No, you did not read my mail. Current ehash is not as described by you.

I did.
And I also said that my tests do not have timewait sockets at all - I
removed sk_for_each and so on, which should effectively double lookup
time on a busy system with lots of sockets created/removed per
timeframe (that is theory from my side already).
Anyway, I ran the same test with the increased table too.

> > Anyway, even with 2^20 (i.e. when the whole table is only used for
> > established sockets) search time is about 360-370 nsec on 3.7 GHz Core
> > Duo (only one CPU is used) with 2 GB of ram.
> 
> Your tests are user land, so unfortunately they are biased...
> 
> (Unless you use hugetlb data ?)

No, I do not. But the same can be applied to the trie test - it is also
performed in userspace and thus suffers from possible swapping/cache
flushing and so on.

And I doubt moving the test into the kernel will suddenly end up with 10
times higher rates.

Anyway, trie test (broken implementation) is two times slower than hash
table (resized already), and it does not include locking issues of the
hash table access and further scalability issues.

I think I need to fix my trie implementation to fully show its
potential, but the original question was why a tree/trie based
implementation is not considered at all, although it promises better
performance and scalability.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-19 14:25                   ` Evgeniy Polyakov
@ 2007-02-19 15:14                     ` Eric Dumazet
  2007-02-19 18:13                       ` Eric Dumazet
  0 siblings, 1 reply; 102+ messages in thread
From: Eric Dumazet @ 2007-02-19 15:14 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: akepner, linux, davem, netdev, bcrl

On Monday 19 February 2007 15:25, Evgeniy Polyakov wrote:
> On Mon, Feb 19, 2007 at 03:14:02PM +0100, Eric Dumazet (dada1@cosmosbay.com) 
wrote:
> > > Forget about cache misses and cache lines - we have a hash table, only
> > > part of which is used (part for time-wait sockets, part for established
> > > ones).
> >
> > No, you did not read my mail. Current ehash is not as described by you.
>
> I did.
> And I also said that my tests do not have timewait sockets at all - I
> removed sk_for_each and so on, which should effectively increase lookup
> time twice on busy system with lots of created/removed sockets per
> timeframe (that is theory from my side already).
> Anyway, I ran the same test with increased table too.
>
> > > Anyway, even with 2^20 (i.e. when the whole table is only used for
> > > established sockets) search time is about 360-370 nsec on 3.7 GHz Core
> > > Duo (only one CPU is used) with 2 GB of ram.
> >
> > Your tests are user land, so unfortunately they are biased...
> >
> > (Unless you use hugetlb data ?)
>
> No I do not. But the same can be applied to trie test - it is also
> performed in userspace and thus suffers from possible swapping/cache
> flushing and so on.
>
> And I doubt moving test into kernel will suddenly end up with 10 times
> increased rates.

At least some architectures pay a high price using vmalloc() instead of 
kmalloc(), and TLB misses mean something for them. Not everybody has the 
latest Intel cpu. Normally, the ehash table is using huge pages.

>
> Anyway, trie test (broken implementation) is two times slower than hash
> table (resized already), and it does not include locking issues of the
> hash table access and further scalability issues.
>

You mix apples and oranges. We already know locking has nothing to do with 
hashing or trie-ing. We *can* put RCU on top of the existing ehash. We can 
also add hash resizing if we really care.


> I think I need to fix my trie implementation to fully show its
> potential, but original question was why tree/trie based implementation
> is not considered at all although it promises better performance and
> scalability.

Because you mix performance and scalability. That's not exactly the same.
Sometimes, high performance means *suboptimal* scalability.

Because O(1) is different from O(log(N)) ?
if N = 2^20, it certainly makes a difference.
Yes, 1% of chains might have a length > 10, but yet 99% of the lookups are 
touching less than 4 cache lines.
With a binary tree, log(2^20) is 20. or maybe not ? If you tell me it's 4, I 
will be very pleased.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-19 15:14                     ` Eric Dumazet
@ 2007-02-19 18:13                       ` Eric Dumazet
  2007-02-19 18:26                         ` Benjamin LaHaise
  2007-02-20  9:25                         ` Evgeniy Polyakov
  0 siblings, 2 replies; 102+ messages in thread
From: Eric Dumazet @ 2007-02-19 18:13 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: akepner, linux, davem, netdev, bcrl

On Monday 19 February 2007 16:14, Eric Dumazet wrote:
>
> Because O(1) is different of O(log(N)) ?
> if N = 2^20, it certainly makes a difference.
> Yes, 1% of chains might have a length > 10, but yet 99% of the lookups are
> touching less than 4 cache lines.
> With a binary tree, log(2^20) is 20. or maybe not ? If you tell me it's 4,
> I will be very pleased.
>

Here is the tcp ehash chain length distribution on a real server :
ehash_addr=0xffff810476000000
ehash_size=1048576
333835 used chains, 3365 used twchains
Distribution of sockets/chain length
[chain length]:number of sockets
[1]:221019 37.4645%
[2]:56590 56.6495%
[3]:21250 67.4556%
[4]:12534 75.9541%
[5]:8677 83.3082%
[6]:5862 89.2701%
[7]:3640 93.5892%
[8]:2219 96.5983%
[9]:1083 98.2505%
[10]:539 99.1642%
[11]:244 99.6191%
[12]:112 99.8469%
[13]:39 99.9329%
[14]:16 99.9708%
[15]:6 99.9861%
[16]:3 99.9942%
[17]:2 100%
total : 589942 sockets

So even with a lazy hash function, 89 % of lookups are satisfied with less 
than 6 compares.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-19 18:13                       ` Eric Dumazet
@ 2007-02-19 18:26                         ` Benjamin LaHaise
  2007-02-19 18:38                           ` Benjamin LaHaise
  2007-02-20  9:25                         ` Evgeniy Polyakov
  1 sibling, 1 reply; 102+ messages in thread
From: Benjamin LaHaise @ 2007-02-19 18:26 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Evgeniy Polyakov, akepner, linux, davem, netdev

On Mon, Feb 19, 2007 at 07:13:07PM +0100, Eric Dumazet wrote:
> So even with a lazy hash function, 89 % of lookups are satisfied with less 
> than 6 compares.

Which sucks, as those are typically going to be cache misses (costing many 
hundreds of cpu cycles).  Hash chains fare very poorly under DoS conditions, 
and must be removed under a heavy load.  Worst case handling is very 
important next to common case.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-19 18:26                         ` Benjamin LaHaise
@ 2007-02-19 18:38                           ` Benjamin LaHaise
  0 siblings, 0 replies; 102+ messages in thread
From: Benjamin LaHaise @ 2007-02-19 18:38 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Evgeniy Polyakov, akepner, linux, davem, netdev

On Mon, Feb 19, 2007 at 01:26:42PM -0500, Benjamin LaHaise wrote:
> On Mon, Feb 19, 2007 at 07:13:07PM +0100, Eric Dumazet wrote:
> > So even with a lazy hash function, 89 % of lookups are satisfied with less 
> > than 6 compares.
> 
> Which sucks, as those are typically going to be cache misses (costing many 
> hundreds of cpu cycles).  Hash chains fare very poorly under DoS conditions, 
> and must be removed under a heavy load.  Worst case handling is very 
> important next to common case.

I should clarify.  Back of the napkin calculations show that there are only 
157 cycles on a 3GHz processor in which to decide what happens to a packet, 
which means 1 cache miss is more than enough.  In theory we can get pretty 
close to line rate with quad core processors, but it definitely needs some 
of the features that newer chipsets have for stuffing packets directly into 
the cache.  I would venture a guess that we also need to intelligently 
partition packets so that we make the most use of available cache resources.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-19 12:04             ` Andi Kleen
@ 2007-02-19 19:18               ` Michael K. Edwards
  0 siblings, 0 replies; 102+ messages in thread
From: Michael K. Edwards @ 2007-02-19 19:18 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Eric Dumazet, Evgeniy Polyakov, akepner, linux, davem, netdev, bcrl

On 19 Feb 2007 13:04:12 +0100, Andi Kleen <andi@firstfloor.org> wrote:
> LRU tends to be hell for caches in MP systems, because it writes to
> the cache lines too and makes them exclusive and more expensive.

That's why you let the hardware worry about LRU.  You don't write to
the upper layers of the splay tree when you don't have to.  It's the
mere traversal of the upper layers that keeps them in cache, causing
the cache hierarchy to mimic the data structure hierarchy.

RCU changes the whole game, of course, because you don't write to the
old copy at all; you have to clone the altered node and all its
ancestors and swap out the root node itself under a spinlock.  Except
you don't use a spinlock; you have a ring buffer of root nodes and
atomically increment the writer index.  That atomically incremented
index is the only thing on which there's any write contention.
(Obviously you need a completion flag on the new root node for the
next writer to poll on, so the sequence is atomic-increment ... copy
and alter from leaf to root ... wmb() ... mark new root complete.)

When you share TCP sessions among CPUs, and packets associated with
the same session may hit softirq in any CPU, you are going to eat a
lot of interconnect bandwidth keeping the sessions coherent.  (The
only way out of this is to partition the tuple space by CPU at the NIC
layer with separate per-core, or perhaps per-cache, receive queues; at
which point the NIC is so smart that you might as well put the DDoS
handling there.)  But at least it's cache coherency protocol bandwidth
and not bandwidth to and from DRAM, which has much nastier latencies.
The only reason the data structure matters _at_all_ is that DDoS
attacks threaten to evict the working set of real sessions out of
cache.  That's why you add new sessions at the leaves and don't rotate
them up until they're hit a second time.

Of course the leaf layer can't be RCU, but it doesn't have to be; it's
just a bucket of tuples.  You need an auxiliary structure to hold the
session handshake trackers for the leaf layer, but you assume that
you're always hitting cold cache when diving into this structure and
ration accesses accordingly.  Maybe you even explicitly evict entries
from cache after sending the SYNACK, so they don't crowd other stuff
out; they go to DRAM and get pulled into the new CPU (and rotated up)
if and when the next packet in the session arrives.  (I'm assuming
T/TCP here, so you can't skimp much on session tracker size during the
handshake.)

Every software firewall I've seen yet falls over under DDoS.  If you
want to change that, you're going to need more than the
back-of-the-napkin calculations that show that session lookup
bandwidth exceeds frame throughput for min-size packets.  You're going
to need to strategize around exploiting the cache hierarchy already
present in your commodity processor to implicitly partition real
traffic from the DDoS storm.  It's not a trivial problem, even in the
mathematician's sense (in which all problems are either trivial or
unsolved).

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-19 13:56               ` Evgeniy Polyakov
  2007-02-19 14:14                 ` Eric Dumazet
@ 2007-02-19 22:10                 ` Andi Kleen
  1 sibling, 0 replies; 102+ messages in thread
From: Andi Kleen @ 2007-02-19 22:10 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Eric Dumazet, akepner, linux, davem, netdev, bcrl

Evgeniy Polyakov <johnpol@2ka.mipt.ru> writes:
> 
> My experiment shows almost 400 nsecs without _any_ locks - they are
> removed completely - it is pure hash selection/list traverse time.

Are you sure you're not measuring TLB misses too? In user space
you likely use 4K pages. The kernel would use 2MB pages.
I would suggest putting the tables into hugetlbfs allocated memory
in your test program.

-Andi

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-19 18:13                       ` Eric Dumazet
  2007-02-19 18:26                         ` Benjamin LaHaise
@ 2007-02-20  9:25                         ` Evgeniy Polyakov
  2007-02-20  9:57                           ` David Miller
                                             ` (2 more replies)
  1 sibling, 3 replies; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-20  9:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: akepner, linux, davem, netdev, bcrl

On Mon, Feb 19, 2007 at 07:13:07PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> On Monday 19 February 2007 16:14, Eric Dumazet wrote:
> >
> > Because O(1) is different of O(log(N)) ?
> > if N = 2^20, it certainly makes a difference.
> > Yes, 1% of chains might have a length > 10, but yet 99% of the lookups are
> > touching less than 4 cache lines.
> > With a binary tree, log(2^20) is 20. or maybe not ? If you tell me it's 4,
> > I will be very pleased.
> >
> 
> Here is the tcp ehash chain length distribution on a real server :
> ehash_addr=0xffff810476000000
> ehash_size=1048576
> 333835 used chains, 3365 used twchains
> Distribution of sockets/chain length
> [chain length]:number of sockets
> [1]:221019 37.4645%
> [2]:56590 56.6495%
> [3]:21250 67.4556%
> [4]:12534 75.9541%
> [5]:8677 83.3082%
> [6]:5862 89.2701%
> [7]:3640 93.5892%
> [8]:2219 96.5983%
> [9]:1083 98.2505%
> [10]:539 99.1642%
> [11]:244 99.6191%
> [12]:112 99.8469%
> [13]:39 99.9329%
> [14]:16 99.9708%
> [15]:6 99.9861%
> [16]:3 99.9942%
> [17]:2 100%
> total : 589942 sockets
>
> So even with a lazy hash function, 89 % of lookups are satisfied with less 
> than 6 compares.

This is only 2^19 sockets.

What you miss is the fact that the upper layers of the tree are always in
the cache, so accessing them costs nothing compared to hitting a new
entry in the hash table every time.

And there is _no_ O(1) - O(1) is just the hash entry selection; then you
need to add the whole chain path, which can be long enough.
For example, for length 9 you have 1000 entries - that is exactly the size
of the tree - the sum of all entries up to and including 2^9.
One third of the table is accessible within 17 lookups (hash chain
length is 1), but taking into account the size of the entry - let's say it
is equal to
32+32+32 - size of the key
32+32+32 - size of the left/right/parent pointers
so 192 bytes per entry - and given that 1/4 of the 1-2 MB cache is
filled with it, we get about 1300 top-level tree entries in the cache, so
about the first _10_ lookups are in the cache _always_ due to the nature
of the tree lookup.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20  9:25                         ` Evgeniy Polyakov
@ 2007-02-20  9:57                           ` David Miller
  2007-02-20 10:22                             ` Evgeniy Polyakov
  2007-02-20 10:04                           ` Eric Dumazet
  2007-02-20 12:11                           ` Evgeniy Polyakov
  2 siblings, 1 reply; 102+ messages in thread
From: David Miller @ 2007-02-20  9:57 UTC (permalink / raw)
  To: johnpol; +Cc: dada1, akepner, linux, netdev, bcrl

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Tue, 20 Feb 2007 12:25:23 +0300

> What you miss is the fact, that upper layers of the tree are always in the
> cache. So its access cost nothing compared to every time new entry in
> the hash table.

It will not be on a real computer that spends reasonable, if not
predominant, time in userspace.

I agree with others here in this thread, in that your test program is
not accurate at all in several aspects as it makes many incorrect
assumptions about the environment in which these lookups run:

1) cache must be assumed cold, ALWAYS
2) large PTE mappings must be used to simulate kernel costs
   and timings accurately

Even on a computer not spending much time in userspace, the cache will
be cold too in most of these paths.  Do you remember the cpu cycle
counter experiment I ran in the tg3 driver a few years back?

That test showed that in NAPI loop, second packet was always processed
some 40 times faster than first one, and it's all because of cache
warming done by first packet.

It is a very essential and sometimes non-intuitive aspect of packet
input processing.  You must code for the cold cache case.  Therefore
you cannot run silly loops in userspace to measure the performance of
a flow demux algorithm, it is a completely pointless exercise in fact
:-)

> And there is _no_ O(1) - O(1) is just a hash entry selection, then you
> need to add the whole chain path, which can be long enough.
> For example for the length 9 you have 1000 entries - it is exactly size
> of the tree - sum of all entries upto and including 2^9.

The hash table should grow to have exactly as many hash chains as there
are active sockets.  That is how we program every dynamically grown
hash table algorithm in the kernel, and we would do the same for TCP
once dynamic growth is possible.

It is an assumption of every hash algorithm.  We chose poor sizes at
boot time on Eric's computer, or there is some limit preventing us from
using a proper size.  That does not make a flaw in the hash approach.

People don't use trees or tries for socket demux in any modern
operating system, there is a reason and it is not because this
approach has not been tried. :-)

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20  9:25                         ` Evgeniy Polyakov
  2007-02-20  9:57                           ` David Miller
@ 2007-02-20 10:04                           ` Eric Dumazet
  2007-02-20 10:12                             ` David Miller
  2007-02-20 10:44                             ` Evgeniy Polyakov
  2007-02-20 12:11                           ` Evgeniy Polyakov
  2 siblings, 2 replies; 102+ messages in thread
From: Eric Dumazet @ 2007-02-20 10:04 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: akepner, linux, davem, netdev, bcrl

On Tuesday 20 February 2007 10:25, Evgeniy Polyakov wrote:
> On Mon, Feb 19, 2007 at 07:13:07PM +0100, Eric Dumazet (dada1@cosmosbay.com) 
wrote:
> > On Monday 19 February 2007 16:14, Eric Dumazet wrote:
> > > Because O(1) is different of O(log(N)) ?
> > > if N = 2^20, it certainly makes a difference.
> > > Yes, 1% of chains might have a length > 10, but yet 99% of the lookups
> > > are touching less than 4 cache lines.
> > > With a binary tree, log(2^20) is 20. or maybe not ? If you tell me it's
> > > 4, I will be very pleased.
> >
> > Here is the tcp ehash chain length distribution on a real server :
> > ehash_addr=0xffff810476000000
> > ehash_size=1048576
> > 333835 used chains, 3365 used twchains
> > Distribution of sockets/chain length
> > [chain length]:number of sockets
> > [1]:221019 37.4645%
> > [2]:56590 56.6495%
> > [3]:21250 67.4556%
> > [4]:12534 75.9541%
> > [5]:8677 83.3082%
> > [6]:5862 89.2701%
> > [7]:3640 93.5892%
> > [8]:2219 96.5983%
> > [9]:1083 98.2505%
> > [10]:539 99.1642%
> > [11]:244 99.6191%
> > [12]:112 99.8469%
> > [13]:39 99.9329%
> > [14]:16 99.9708%
> > [15]:6 99.9861%
> > [16]:3 99.9942%
> > [17]:2 100%
> > total : 589942 sockets
> >
> > So even with a lazy hash function, 89 % of lookups are satisfied with
> > less than 6 compares.
>
> This is only 2^19 sockets.
>
> What you miss is the fact, that upper layers of the tree are always in the
> cache. So its access cost nothing compared to every time new entry in
> the hash table.
>
> And there is _no_ O(1) - O(1) is just a hash entry selection, then you
> need to add the whole chain path, which can be long enough.
> For example for the length 9 you have 1000 entries - it is exactly size
> of the tree - sum of all entries upto and including 2^9.
> One third of the table is accessible within 17 lookups (hash chain
> length is 1), but given into account size of the entry - let's say it
> is equal to
> 32+32+32 - size of the key
> 32+32+32 - size of the left/right/parent pointer
> so 192 bytes per entry - given into acount that 1/4 of the 1-2 MB cache is
> filled with it, we get about 1300 top-level entries in the hash, so
> about _10_ first lookups are in the cache _always_ due to nature of the
> tree lookup.

You totally miss the fact that the 1-2-4 MB cache is not available for you at 
all. It is filled by User accesses. I don't care about DoS. I care about real 
servers, servicing tcp clients. The TCP service/stack should not take more 
than 10% of CPU (cycles and caches). The User application is certainly more 
important because it hosts the real added value.

As soon softirq finished its job, CPU returns to User Space, blowing out your 
top-level entries from the cache. Next ms (tg3 for example, batching XX 
packets per interrupt), softirq handler must repopulate the cache again and 
again.

Is the Linux kernel aimed at constructing firewalls ? Certainly not. The 
Linux kernel is a generalist kernel. Of course your tree algo might be 
better in certain situations, but those situations are not the norm. In 
those situations, we might reserve 50% of the cpu cache to hold the top 
level of the route cache and tcp cache, but provide no power to user 
programs.

With ehash, pressure on cpu caches is small and User application can make 
progress.

Using a Jenkins hash permits a better hash distribution for a little cpu cost.
I will post later a distribution simulation based on the data gathered from 
the same real server.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 10:04                           ` Eric Dumazet
@ 2007-02-20 10:12                             ` David Miller
  2007-02-20 10:30                               ` Evgeniy Polyakov
                                                 ` (2 more replies)
  2007-02-20 10:44                             ` Evgeniy Polyakov
  1 sibling, 3 replies; 102+ messages in thread
From: David Miller @ 2007-02-20 10:12 UTC (permalink / raw)
  To: dada1; +Cc: johnpol, akepner, linux, netdev, bcrl

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Tue, 20 Feb 2007 11:04:15 +0100

> Using a Jenkins hash permits a better hash distribution for a little
> cpu cost.  I will post later a distribution simulation based on the
> data gathered from the same real server.

Actually someone (I think it was Evgeniy in fact) made such
comparisons and found in his studies that not only does the current
ehash xor hash function distribute about as well as jenkins, it's
significantly cheaper to calculate :-)

If you find jenkins is better, great, but I hope it works that
way for many workloads.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20  9:57                           ` David Miller
@ 2007-02-20 10:22                             ` Evgeniy Polyakov
  0 siblings, 0 replies; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-20 10:22 UTC (permalink / raw)
  To: David Miller; +Cc: dada1, akepner, linux, netdev, bcrl

On Tue, Feb 20, 2007 at 01:57:45AM -0800, David Miller (davem@davemloft.net) wrote:
> > What you miss is the fact, that upper layers of the tree are always in the
> > cache. So its access cost nothing compared to every time new entry in
> > the hash table.
> 
> It will not be on a real computer spending reasonable if not
> predominant time in userspace.
> 
> I agree with others here in this thread, in that your test program is
> not accurate at all in several aspects as it makes many incorrect
> assumptions about the environment in which these lookups run:
> 
> 1) cache must be assumed cold, ALWAYS
> 2) large PTE mappings must be used to simulate kernel costs
>    and timings accurately
> 
> Even on computer not spending much time in userspace, cache will
> be cold too in most of these paths.  Do you remember the cpu cycle
> counter experiment I ran in the tg3 driver a few years back?
> 
> That test showed that in NAPI loop, second packet was always processed
> some 40 times faster than first one, and it's all because of cache
> warming done by first packet.
> 
> It is a very essential and sometimes non-intuitive aspect of packet
> input processing.  You must code for the cold cache case.  Therefore
> you cannot run silly loops in userspace to measure the performance of
> a flow demux algorithm, it is a completely pointless exercise in fact
> :-)

It is not correct.
The table for 2^20 entries is about 4 MB; we get half/a third of it in
the cache in my userspace test - and it still performs badly (well, it is
noticeably better than all other approaches (yet), but it is not as good
as I expected it to be).

With my trie test, a _smaller_ part of the trie is in the cache due to
the size of the node, which includes _a lot_ of additional information
used for wildcard processing (needed for the netfilter and/or route
implementation in netchannels).

The above words are _only_ correct for hash tables - yes, in that case the
cache is always cold, since each new flow goes into a completely different
entry. With a tree/trie (and _especially_ on SMP), access to a low-level
entry _requires_ all the entries above it to be in the cache, so until the
load fully flushes it away, there will be a win for the tree implementation.
If the load is such that it flushes the whole cache each time, then we are
going to die anyway, since populating the struct sock and/or tcp
socket is much worse than several tree/trie entries in the lookup.

> > And there is _no_ O(1) - O(1) is just a hash entry selection, then you
> > need to add the whole chain path, which can be long enough.
> > For example for the length 9 you have 1000 entries - it is exactly size
> > of the tree - sum of all entries upto and including 2^9.
> 
> Hash table should grow to exactly number of hash chains as number
> of sockets that are active.  That is how we program every dynamically
> grown hash table algorithm in the kernel, and we would do the same
> for TCP once dynamic growth is possible.

Eric's test shows that only one third of the sockets are in chains of
length 1, so it does not work that way - which is neither bad nor good:
table scaling is not a trivial operation indeed.

> It is assumption of every hash algorithm.  We choose poor sizes at
> boot time on Eric's computer, or there is some limit preventing from
> using a proper size.  It does not make a flaw in the hash approach.
> 
> People don't use trees or tries for socket demux in any modern
> operating system, there is a reason and it is not because this
> approach has not been tried. :-)

Trees are _always_ used in huge route tables - and not just because it is
too ugly to implement wildcards in hash tables :).

I would agree that routers do not perform any real work besides packet
forwarding, so they do have their caches filled with all the needed
information (especially if all of the above is implemented in hardware and
thus is completely different), but it does not mean that existing systems
cannot scale with trees given the existing cache issues.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 10:12                             ` David Miller
@ 2007-02-20 10:30                               ` Evgeniy Polyakov
  2007-02-20 11:10                                 ` Eric Dumazet
  2007-02-20 10:49                               ` Eric Dumazet
  2007-02-20 15:07                               ` Michael K. Edwards
  2 siblings, 1 reply; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-20 10:30 UTC (permalink / raw)
  To: David Miller; +Cc: dada1, akepner, linux, netdev, bcrl

On Tue, Feb 20, 2007 at 02:12:09AM -0800, David Miller (davem@davemloft.net) wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Tue, 20 Feb 2007 11:04:15 +0100
> 
> > Using a Jenkins hash permits a better hash distribution for a little
> > cpu cost.  I will post later a distribution simulation based on the
> > data gathered from the same real server.
> 
> Actually someone (I think it was Evgeniy in fact) made such
> comparisons and found in his studies that not only does the current
> ehash xor hash function distribute about as well as jenkins, it's
> significantly cheaper to calculate :-)

Yep, it happend to be my tests :)
Jenkins hash was slower and had significant artifacts for some usage
cases, ending up with extremely long chain lengths.
One can find more details at
http://tservice.net.ru/~s0mbre/blog/2006/05/14#2006_05_14
http://tservice.net.ru/~s0mbre/blog/2006/06/01#2006_06_01

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 10:04                           ` Eric Dumazet
  2007-02-20 10:12                             ` David Miller
@ 2007-02-20 10:44                             ` Evgeniy Polyakov
  2007-02-20 11:09                               ` Eric Dumazet
  1 sibling, 1 reply; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-20 10:44 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: akepner, linux, davem, netdev, bcrl

On Tue, Feb 20, 2007 at 11:04:15AM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> You totally miss the fact that the 1-2-4 MB cache is not available for you at 
> all. It is filled by User accesses. I don't care about DoS. I care about real 
> servers, servicing tcp clients. The TCP service/stack should not take more 
> than 10% of CPU (cycles and caches). The User application is certainly more 
> important because it hosts the real added value.

A TCP socket is 4k in size; a tree entry can be reduced to 200 bytes?

No one complains about _that_ cache miss, it is considered OK to have, but
a tree cache miss becomes the worst thing ever.
In softirq we touch the socket's state, lock, reference counter, several
pointers, and if we are lucky the whole set of TCP state machine fields - and
most of it stays in cache when the kernel path is done. Userspace then issues
syscalls which must populate it back. Why don't we see it being pulled into
cache each time a syscall is invoked? Because it stays in the cache, along
with the part of the hash table associated with the most recently used hash
entries - which need not be there; part of the tree could be there instead.

> As soon softirq finished its job, CPU returns to User Space, blowing out your 
> top-level entries from the cache. Next ms (tg3 for example, batching XX 
> packets per interrupt), softirq handler must repopulate the cache again and 
> again.
> 
> Is linux kernel aimed to construct firewalls ? Certainly not. Linux kernel is 
> a generalist kernel. Of course your tree algo might be better in certain 
> situations, but those situations are not the norm. In these situations, we 
> might reserve 50% of cpu cache to hold top level of route cache and tcp 
> cache, but provide no power to user programs.
> 
> With ehash, pressure on cpu caches is small and User application can make 
> progress.

With a tree it is exactly the same - but instead of unneeded entries one can
put the head of the tree there.

To prove myself right (or wrong) and make me eat my hat, I will implement
simple rb-tree-based socket lookup code (which does not allow simple
RCUfication though) or a trie (without wildcard support), which allows
trivial RCUfication. Would you care to test patches or provide a testcase
repeatable in my lab (a couple of 2-way core duos, an athlon 3500 and a via
epia)?

> Using a jenkin's hash permits a better hash distribution for a litle cpu cost.
> I will post later a distribution simulation based on the data gathered from 
> the same real server.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 10:12                             ` David Miller
  2007-02-20 10:30                               ` Evgeniy Polyakov
@ 2007-02-20 10:49                               ` Eric Dumazet
  2007-02-20 15:07                               ` Michael K. Edwards
  2 siblings, 0 replies; 102+ messages in thread
From: Eric Dumazet @ 2007-02-20 10:49 UTC (permalink / raw)
  To: David Miller; +Cc: johnpol, akepner, linux, netdev, bcrl

On Tuesday 20 February 2007 11:12, David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Tue, 20 Feb 2007 11:04:15 +0100
>
> > Using a jenkin's hash permits a better hash distribution for a litle
> > cpu cost.  I will post later a distribution simulation based on the
> > data gathered from the same real server.
>
> Actually someone (I think it was Evgeniy in fact) made such
> comparisons and found in his studies that not only does the current
> ehash xor hash function distribute about as well as jenkins, it's
> significantly cheaper to calculate :-)
>
> If you find jenkins is better, great, but I hope it works that
> way for many workloads.

Here are results from a real workload:

ehash_addr=0xffff810476000000
ehash_size=1048576
330241 chains, 2441 twchains
Distribution of sockets/chain length
[chain length]:number of sockets
[0]:718335 0%
[1]:223485 42.1436%
[2]:60655 65.0196%
[3]:22451 77.7207%
[4]:11164 86.1416%
[5]:6376 92.1534%
[6]:3236 95.8148%
[7]:1600 97.9268%
[8]:793 99.1231%
[9]:286 99.6085%
[10]:111 99.8178%
[11]:56 99.934%
[12]:20 99.9793%
[13]:4 99.9891%
[14]:2 99.9943%
[15]:2 100%
total : 530294 sockets


Imagine we double the hash size (2097152). The distribution would be:
[0]:1698209 0%
[1]:309255 58.3177%
[2]:61638 81.5644%
[3]:18593 92.0829%
[4]:6479 96.97%
[5]:2092 98.9425%
[6]:655 99.6836%
[7]:184 99.9265%
[8]:37 99.9823%
[9]:6 99.9925%
[10]:4 100%

If we use the Jenkins hash: jhash_3words(daddr, saddr, (dport<<16)|sport, 0);

[0]:632435 0%
[1]:319916 60.328%
[2]:80465 90.6754%
[3]:13816 98.4914%
[4]:1735 99.8001%
[5]:196 99.9849%
[6]:11 99.9974%
[7]:2 100%


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 10:44                             ` Evgeniy Polyakov
@ 2007-02-20 11:09                               ` Eric Dumazet
  2007-02-20 11:29                                 ` Evgeniy Polyakov
  2007-02-21 12:41                                 ` Andi Kleen
  0 siblings, 2 replies; 102+ messages in thread
From: Eric Dumazet @ 2007-02-20 11:09 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: akepner, linux, davem, netdev, bcrl

On Tuesday 20 February 2007 11:44, Evgeniy Polyakov wrote:
On Tue, Feb 20, 2007 at 11:04:15AM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> > You totally miss the fact that the 1-2-4 MB cache is not available for
> > you at all. It is filled by User accesses. I dont care about DOS. I care
> > about real servers, servicing tcp clients. The TCP service/stack should
> > not take more than 10% of CPU (cycles and caches). The User application
> > is certainly more important because it hosts the real added value.
>
> TCP socket is 4k in size, one tree entry can be reduced to 200 bytes?
>
> No one says about _that_ cache miss, it is considered OK to have, but
> tree cache miss becomes the worst thing ever.
> In softirq we process socket's state, lock, reference counter several
> pointer, and if we are happy - the whole TCP state machine fields - and
> most of it stasy there when kernel is over - userspace issues syscalls
> which must populate it back. Why don't we see that it is moved into
> cache each time syscall is invoked? Because it is in the cache as long
> as part of the hash table assotiated with last recently used hash
> entries, which should not be there, and instead part of the tree can be.

No, I see cache misses everywhere...

This is because my machines are doing real work in user land. They are not lab 
machines. Even if I had cpus with 16-32MB caches, it would be the same, 
because user land wants GBs... 

For example, sock_wfree() uses 1.6612% of cpu because of false sharing of 
sk_flags (dirtied each time SOCK_QUEUE_SHRUNK is set) :(

ffffffff803c2850 <sock_wfree>: /* sock_wfree total: 714241  1.6613 */
  1307  0.0030 :ffffffff803c2850:       push   %rbp
 55056  0.1281 :ffffffff803c2851:       mov    %rsp,%rbp
    94 2.2e-04 :ffffffff803c2854:       push   %rbx
               :ffffffff803c2855:       sub    $0x8,%rsp
  1090  0.0025 :ffffffff803c2859:       mov    0x10(%rdi),%rbx
     3 7.0e-06 :ffffffff803c285d:       mov    0xb8(%rdi),%eax
    38 8.8e-05 :ffffffff803c2863:       lock sub %eax,0x90(%rbx)

/* HOT : access to sk_flags */
 81979  0.1907 :ffffffff803c286a:       mov    0x100(%rbx),%eax
512119  1.1912 :ffffffff803c2870:       test   $0x2,%ah

   262 6.1e-04 :ffffffff803c2873:       jne    ffffffff803c2880 
<sock_wfree+0x30>
   142 3.3e-04 :ffffffff803c2875:       mov    %rbx,%rdi
 14467  0.0336 :ffffffff803c2878:       callq  *0x200(%rbx)
    63 1.5e-04 :ffffffff803c287e:       data16
               :ffffffff803c287f:       nop    
  9046  0.0210 :ffffffff803c2880:       lock decl 0x28(%rbx)
 29792  0.0693 :ffffffff803c2884:       sete   %al
    56 1.3e-04 :ffffffff803c2887:       test   %al,%al
   789  0.0018 :ffffffff803c2889:       je     ffffffff803c2893 <sock_wfree+0x43>
               :ffffffff803c288b:       mov    %rbx,%rdi
   144 3.3e-04 :ffffffff803c288e:       callq  ffffffff803c0f90 <sk_free>
  1685  0.0039 :ffffffff803c2893:       add    $0x8,%rsp
  2462  0.0057 :ffffffff803c2897:       pop    %rbx
   684  0.0016 :ffffffff803c2898:       leaveq 
  2963  0.0069 :ffffffff803c2899:       retq   


This is why tcp lookups should not take more than 1% themselves: other parts 
of the stack *want* to make many cache misses too.

If we want to optimize tcp, we should reorder fields to reduce the number of 
cache lines touched, not change algos. struct sock fields are currently placed 
to reduce holes, while they should be grouped by related fields sharing cache 
lines.




^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 10:30                               ` Evgeniy Polyakov
@ 2007-02-20 11:10                                 ` Eric Dumazet
  2007-02-20 11:23                                   ` Evgeniy Polyakov
  2007-02-20 11:30                                   ` Eric Dumazet
  0 siblings, 2 replies; 102+ messages in thread
From: Eric Dumazet @ 2007-02-20 11:10 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, akepner, linux, netdev, bcrl

On Tuesday 20 February 2007 11:30, Evgeniy Polyakov wrote:
On Tue, Feb 20, 2007 at 02:12:09AM -0800, David Miller (davem@davemloft.net) wrote:
> > From: Eric Dumazet <dada1@cosmosbay.com>
> > Date: Tue, 20 Feb 2007 11:04:15 +0100
> >
> > > Using a jenkin's hash permits a better hash distribution for a litle
> > > cpu cost.  I will post later a distribution simulation based on the
> > > data gathered from the same real server.
> >
> > Actually someone (I think it was Evgeniy in fact) made such
> > comparisons and found in his studies that not only does the current
> > ehash xor hash function distribute about as well as jenkins, it's
> > significantly cheaper to calculate :-)
>
> Yep, it happend to be my tests :)
> Jenkins hash was slower and had significant artifacts for some usage
> cases ended up with extremely long chain length.
> One can find more details at
> http://tservice.net.ru/~s0mbre/blog/2006/05/14#2006_05_14
> http://tservice.net.ru/~s0mbre/blog/2006/06/01#2006_06_01

Please explain why you chose    h = jhash_2words(faddr, laddr, ports);
        h ^= h >> 16;
        h ^= h >> 8;

jhash is very good, no need to try to be smarter, shuffling some bytes... and 
adding artifacts.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 11:10                                 ` Eric Dumazet
@ 2007-02-20 11:23                                   ` Evgeniy Polyakov
  2007-02-20 11:30                                   ` Eric Dumazet
  1 sibling, 0 replies; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-20 11:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, akepner, linux, netdev, bcrl

On Tue, Feb 20, 2007 at 12:10:22PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> Please explain why you chose    h = jhash_2words(faddr, laddr, ports);
>         h ^= h >> 16;
>         h ^= h >> 8;
> 
> jhash is very good, no need to try to be smarter, shufling some bytes... and 
> adding artifacts.

If the distribution is fair, its shift still produces a fair distribution.
In the discussion linked from that test I also produced results without
the shifts, which were exactly the same.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 11:09                               ` Eric Dumazet
@ 2007-02-20 11:29                                 ` Evgeniy Polyakov
  2007-02-20 11:34                                   ` Eric Dumazet
  2007-02-21 12:41                                 ` Andi Kleen
  1 sibling, 1 reply; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-20 11:29 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: akepner, linux, davem, netdev, bcrl

On Tue, Feb 20, 2007 at 12:09:51PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> If we want to optimize tcp, we should reorder fields to reduce number of cache 
> lines, not change algos. struct sock fields are currently placed to reduce 
> holes, while they should be grouped by related fields sharing cache lines.

Taking into account that the network stack with NAPI schedules several
packets to be queued into a socket, and all that happens without any
influence from userspace, the trie/tree wins again in that the
majority of the tree will already be in the cache.

The hash table has its fast access only in theory; practice adds caches,
NAPI and a lot of other stuff. Even a simple test (maybe broken, but it
is equally broken for both trie and hash, even worse for the trie)
shows that the hash table does not behave as well as expected and is
close to the trie.

I'm going back to the drawing board to design a simple trie algo/patch
suitable to replace the hash table selection, so that we can test it
on real-life examples.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 11:10                                 ` Eric Dumazet
  2007-02-20 11:23                                   ` Evgeniy Polyakov
@ 2007-02-20 11:30                                   ` Eric Dumazet
  2007-02-20 11:41                                     ` Evgeniy Polyakov
  1 sibling, 1 reply; 102+ messages in thread
From: Eric Dumazet @ 2007-02-20 11:30 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, akepner, linux, netdev, bcrl

On Tuesday 20 February 2007 12:10, Eric Dumazet wrote:
 >
> > Yep, it happend to be my tests :)
> > Jenkins hash was slower and had significant artifacts for some usage
> > cases ended up with extremely long chain length.
> > One can find more details at
> > http://tservice.net.ru/~s0mbre/blog/2006/05/14#2006_05_14
> > http://tservice.net.ru/~s0mbre/blog/2006/06/01#2006_06_01
>
> Please explain why you chose    h = jhash_2words(faddr, laddr, ports);
>         h ^= h >> 16;
>         h ^= h >> 8;
>
> jhash is very good, no need to try to be smarter, shufling some bytes...
> and adding artifacts.

I checked with my simulator and got no differences with the extra ops, at 
least no artifacts. Maybe this is related to the fact my hash size is 2^20 ?

If we use the Jenkins hash:
[0]:617469 0%
[1]:326671 58.7654%
[2]:86704 89.9601%
[3]:15387 98.264%
[4]:2103 99.7773%
[5]:216 99.9716%
[6]:24 99.9975%
[7]:2 100%

If we use the Jenkins hash (plus Evgeniy Polyakov's shifts):
[0]:617553 0%
[1]:326403 58.7172%
[2]:86902 89.9831%
[3]:15462 98.3275%
[4]:2012 99.7753%
[5]:216 99.9696%
[6]:27 99.9987%
[7]:1 100%

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 11:29                                 ` Evgeniy Polyakov
@ 2007-02-20 11:34                                   ` Eric Dumazet
  2007-02-20 11:45                                     ` Evgeniy Polyakov
  0 siblings, 1 reply; 102+ messages in thread
From: Eric Dumazet @ 2007-02-20 11:34 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: akepner, linux, davem, netdev, bcrl

On Tuesday 20 February 2007 12:29, Evgeniy Polyakov wrote:
> On Tue, Feb 20, 2007 at 12:09:51PM +0100, Eric Dumazet (dada1@cosmosbay.com) 
wrote:
> > If we want to optimize tcp, we should reorder fields to reduce number of
> > cache lines, not change algos. struct sock fields are currently placed to
> > reduce holes, while they should be grouped by related fields sharing
> > cache lines.
>
> Getting into account that network stack with NAPI schedules several
> packets to be queued into socket and all that happens without any
> infuence from userspace, trie/tree wins again in that regard that
> majority of the tree will be in the cache already.

Nope, you receive 100 packets for 100 different flows. Only the code may be in 
cache (except for the first packet), and the very top of the tree...

Forget cache. Forget it.

>
> Hash table has its fast access only in theory, practice adds caches,
> NAPI and a lot of other stuff. Even simple test (maybe broken, but it
> is equally broken for both trie and hash, even worse for trie))
> whows that hash table does not behave as good as expected and close to
> trie.
>
> I'm going back to drawing board to design simple trie algo/patch
> suitable for hash table selection replacement, so that we would test it
> in a real life examples.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 11:30                                   ` Eric Dumazet
@ 2007-02-20 11:41                                     ` Evgeniy Polyakov
  0 siblings, 0 replies; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-20 11:41 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, akepner, linux, netdev, bcrl

On Tue, Feb 20, 2007 at 12:30:18PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> On Tuesday 20 February 2007 12:10, Eric Dumazet wrote:
>  >
> > > Yep, it happend to be my tests :)
> > > Jenkins hash was slower and had significant artifacts for some usage
> > > cases ended up with extremely long chain length.
> > > One can find more details at
> > > http://tservice.net.ru/~s0mbre/blog/2006/05/14#2006_05_14
> > > http://tservice.net.ru/~s0mbre/blog/2006/06/01#2006_06_01
> >
> > Please explain why you chose    h = jhash_2words(faddr, laddr, ports);
> >         h ^= h >> 16;
> >         h ^= h >> 8;
> >
> > jhash is very good, no need to try to be smarter, shufling some bytes...
> > and adding artifacts.
> 
> I checked with my simulator and got no differences with the extra ops, at 
> least no artifacts. Maybe this is related to the fact my hash size is 2^20 ?
> 
> If we use jenkin hash:
> [0]:617469 0%
> [1]:326671 58.7654%
> [2]:86704 89.9601%
> [3]:15387 98.264%
> [4]:2103 99.7773%
> [5]:216 99.9716%
> [6]:24 99.9975%
> [7]:2 100%
> 
> If we use jenkin hash (+plus Evgeniy Polyakov shifts) :
> [0]:617553 0%
> [1]:326403 58.7172%
> [2]:86902 89.9831%
> [3]:15462 98.3275%
> [4]:2012 99.7753%
> [5]:216 99.9696%
> [6]:27 99.9987%
> [7]:1 100%

Yes, they are the same - but artifacts are artifacts in the sense that
they appear only in special sets; as far as I recall, one of the parameters
must be constant (i.e. the address or port pair).

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 11:34                                   ` Eric Dumazet
@ 2007-02-20 11:45                                     ` Evgeniy Polyakov
  0 siblings, 0 replies; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-20 11:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: akepner, linux, davem, netdev, bcrl

On Tue, Feb 20, 2007 at 12:34:59PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> > Getting into account that network stack with NAPI schedules several
> > packets to be queued into socket and all that happens without any
> > infuence from userspace, trie/tree wins again in that regard that
> > majority of the tree will be in the cache already.
> 
> Nope, you receive 100 packets for 100 different flows. Only the code may be in 
> cache (minus for the first packet), and the very top of the tree...
> 
> Forget cache. Forget it.

The first packet populates half of the top of the tree/trie on its way to the
bottom - the next one goes to a completely different location, but it still
needs to go through some of the just-populated entries - it is the
tree/trie's nature - the probability of a hit halves at each
level (for a binary tree; for a trie it is very different) until the first 
'miss', but it is still non-zero.

But generally I agree that talking about caches can only be valid when
all other issues are resolved completely. So, we need a patch.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20  9:25                         ` Evgeniy Polyakov
  2007-02-20  9:57                           ` David Miller
  2007-02-20 10:04                           ` Eric Dumazet
@ 2007-02-20 12:11                           ` Evgeniy Polyakov
  2 siblings, 0 replies; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-20 12:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: akepner, linux, davem, netdev, bcrl

On Tue, Feb 20, 2007 at 12:25:23PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> And there is _no_ O(1) - O(1) is just a hash entry selection, then you
> need to add the whole chain path, which can be long enough.
> For example for the length 9 you have 1000 entries - it is exactly size
> of the tree - sum of all entries upto and including 2^9.
> One third of the table is accessible within 17 lookups (hash chain
> length is 1), but given into account size of the entry - let's say it 
> is equal to
> 32+32+32 - size of the key
> 32+32+32 - size of the left/right/parent pointer
> so 192 bytes per entry - given into acount that 1/4 of the 1-2 MB cache is

I just realized that it is in _BITS_, not in bytes, so a typical trie/tree
entry is about 24 bytes - less than one cache line!

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 10:12                             ` David Miller
  2007-02-20 10:30                               ` Evgeniy Polyakov
  2007-02-20 10:49                               ` Eric Dumazet
@ 2007-02-20 15:07                               ` Michael K. Edwards
  2007-02-20 15:11                                 ` Evgeniy Polyakov
  2 siblings, 1 reply; 102+ messages in thread
From: Michael K. Edwards @ 2007-02-20 15:07 UTC (permalink / raw)
  To: David Miller; +Cc: dada1, johnpol, akepner, linux, netdev, bcrl

On 2/20/07, David Miller <davem@davemloft.net> wrote:
> Actually someone (I think it was Evgeniy in fact) made such
> comparisons and found in his studies that not only does the current
> ehash xor hash function distribute about as well as jenkins, it's
> significantly cheaper to calculate :-)

However, it's vulnerable to hash collision attacks, which the Jenkins
hash (used as a poor man's HMAC with a boot-time random salt) is not.
But if you don't care about DoS ....

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 15:07                               ` Michael K. Edwards
@ 2007-02-20 15:11                                 ` Evgeniy Polyakov
  2007-02-20 15:49                                   ` Michael K. Edwards
  0 siblings, 1 reply; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-20 15:11 UTC (permalink / raw)
  To: Michael K. Edwards; +Cc: David Miller, dada1, akepner, linux, netdev, bcrl

On Tue, Feb 20, 2007 at 07:07:16AM -0800, Michael K. Edwards (medwards.linux@gmail.com) wrote:
> On 2/20/07, David Miller <davem@davemloft.net> wrote:
> >Actually someone (I think it was Evgeniy in fact) made such
> >comparisons and found in his studies that not only does the current
> >ehash xor hash function distribute about as well as jenkins, it's
> >significantly cheaper to calculate :-)
> 
> However, it's vulnerable to hash collision attacks, which the Jenkins
> hash (used as a poor man's HMAC with a boot-time random salt) is not.
> But if you don't care about DoS ....

Jenkins _does_ have them - I showed tests half a year ago and in this
thread too. Actually _any_ hash has them; it is just a matter of time 
to find one.

> Cheers,
> - Michael

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 15:11                                 ` Evgeniy Polyakov
@ 2007-02-20 15:49                                   ` Michael K. Edwards
  2007-02-20 15:59                                     ` Evgeniy Polyakov
                                                       ` (2 more replies)
  0 siblings, 3 replies; 102+ messages in thread
From: Michael K. Edwards @ 2007-02-20 15:49 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, dada1, akepner, linux, netdev, bcrl

On 2/20/07, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> Jenkins _does_ have them, I showed tests half a year ago and in this
> thread too. Actually _any_ hash has them it is just a matter of time
> to find one.

I think you misunderstood me.  If you are trying to DoS me from
outside with a hash collision attack, you are trying to feed me
packets that fall into the same hash bucket.  The Jenkins hash does
not have to be artifact-free, and does not have to be
cryptographically strong.  It just has to do a passable job of mixing
a random salt into the tuple, so you don't know which string of
packets to feed me in order to fill one (or a few) of my buckets.
XORing salt into a folded tuple doesn't help; it just permutes the
buckets.

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 15:49                                   ` Michael K. Edwards
@ 2007-02-20 15:59                                     ` Evgeniy Polyakov
  2007-02-20 16:08                                       ` Eric Dumazet
  2007-02-20 16:04                                     ` Eric Dumazet
  2007-02-22 23:49                                     ` linux
  2 siblings, 1 reply; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-20 15:59 UTC (permalink / raw)
  To: Michael K. Edwards; +Cc: David Miller, dada1, akepner, linux, netdev, bcrl

On Tue, Feb 20, 2007 at 07:49:11AM -0800, Michael K. Edwards (medwards.linux@gmail.com) wrote:
> On 2/20/07, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> >Jenkins _does_ have them, I showed tests half a year ago and in this
> >thread too. Actually _any_ hash has them it is just a matter of time
> >to find one.
> 
> I think you misunderstood me.  If you are trying to DoS me from
> outside with a hash collision attack, you are trying to feed me
> packets that fall into the same hash bucket.  The Jenkins hash does
> not have to be artifact-free, and does not have to be
> cryptographically strong.  It just has to do a passable job of mixing
> a random salt into the tuple, so you don't know which string of
> packets to feed me in order to fill one (or a few) of my buckets.
> XORing salt into a folded tuple doesn't help; it just permutes the
> buckets.

Adding XOR with a constant value does not change the distribution.
A variable salt will end up with different buckets for the same flow.
That is forbidden - this is not the passwd/des situation from decades
ago.

> Cheers,
> - Michael

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 15:49                                   ` Michael K. Edwards
  2007-02-20 15:59                                     ` Evgeniy Polyakov
@ 2007-02-20 16:04                                     ` Eric Dumazet
  2007-02-22 23:49                                     ` linux
  2 siblings, 0 replies; 102+ messages in thread
From: Eric Dumazet @ 2007-02-20 16:04 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Evgeniy Polyakov, David Miller, akepner, linux, netdev, bcrl

On Tuesday 20 February 2007 16:49, Michael K. Edwards wrote:
> On 2/20/07, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > Jenkins _does_ have them, I showed tests half a year ago and in this
> > thread too. Actually _any_ hash has them it is just a matter of time
> > to find one.
>
> I think you misunderstood me.  If you are trying to DoS me from
> outside with a hash collision attack, you are trying to feed me
> packets that fall into the same hash bucket.  The Jenkins hash does
> not have to be artifact-free, and does not have to be
> cryptographically strong.  It just has to do a passable job of mixing
> a random salt into the tuple, so you don't know which string of
> packets to feed me in order to fill one (or a few) of my buckets.
> XORing salt into a folded tuple doesn't help; it just permutes the
> buckets.

Yes. I must say I had an attack like that some years ago on one particular 
server: some tcp ehash chains had a length > 1000. I had to plug the Jenkins 
hash in to stop the attack (thanks to David :), and thanks to oprofile for 
letting me understand what was happening).

The attacker was controlling several thousand zombies and was able to 
choose its src port (knowing its src ip addr) to target *one* particular hash 
bucket on my web server.

Each zombie was opening only one tcp socket, so a firewall could not detect 
them; they had an absolutely normal behavior.

XOR, combined with the 16-bit range of the src port, permits a lot of easy 
guessing for the attacker (since it knows the ehash_size of the target is a 
power of two...)


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 15:59                                     ` Evgeniy Polyakov
@ 2007-02-20 16:08                                       ` Eric Dumazet
  2007-02-20 16:20                                         ` Evgeniy Polyakov
  0 siblings, 1 reply; 102+ messages in thread
From: Eric Dumazet @ 2007-02-20 16:08 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Michael K. Edwards, David Miller, akepner, linux, netdev, bcrl

On Tuesday 20 February 2007 16:59, Evgeniy Polyakov wrote:
On Tue, Feb 20, 2007 at 07:49:11AM -0800, Michael K. Edwards (medwards.linux@gmail.com) wrote:
> > On 2/20/07, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > >Jenkins _does_ have them, I showed tests half a year ago and in this
> > >thread too. Actually _any_ hash has them it is just a matter of time
> > >to find one.
> >
> > I think you misunderstood me.  If you are trying to DoS me from
> > outside with a hash collision attack, you are trying to feed me
> > packets that fall into the same hash bucket.  The Jenkins hash does
> > not have to be artifact-free, and does not have to be
> > cryptographically strong.  It just has to do a passable job of mixing
> > a random salt into the tuple, so you don't know which string of
> > packets to feed me in order to fill one (or a few) of my buckets.
> > XORing salt into a folded tuple doesn't help; it just permutes the
> > buckets.
>
> Adding XOR with constant value does not change distribution.
> Variable salt will end up with differnet buckets for the same flow.
> It is forbidden - it is not the situation created for passwd/des decades
> ago.

Adding a random hint to jhash (a random value picked at boot time, not known 
by the attacker) permits a secure hash table: an attacker cannot build an 
attack to fill one particular hash chain.

See net/ipv4/route.c (function rt_hash_code()) to see how it is used for the 
route cache.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 16:08                                       ` Eric Dumazet
@ 2007-02-20 16:20                                         ` Evgeniy Polyakov
  2007-02-20 16:38                                           ` Eric Dumazet
  0 siblings, 1 reply; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-20 16:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michael K. Edwards, David Miller, akepner, linux, netdev, bcrl

On Tue, Feb 20, 2007 at 05:08:12PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> > Adding XOR with constant value does not change distribution.
> > Variable salt will end up with differnet buckets for the same flow.
> > It is forbidden - it is not the situation created for passwd/des decades
> > ago.
> 
> Adding a random hint to jhash (random value picked at boot time, not known by 
> attacker) permits to have a secure hash table : An attacker cannot build an 
> attack to fill one particular hash chain.
> 
> See net/ipv4/route.c (function rt_hash_code()) to see how its used for route 
> cache.

It is secrecy, not security - an attacker will check the source, find
where the constant per-boot value is added and recalculate the attack vector -
we were all college students once; it would be even more fun to crack.

In that regard the Jenkins hash and the XOR one have _exactly_ the same attack
vector, only Jenkins is a bit more sophisticated. I even think that the
example in rt_hash_code() will end up with heavy problems when one of the
addresses is constant - my tests show a problem exactly in the case of
jhash_2words() with a random third parameter and one of the first two
constant, as in rt_hash_code().

-- 
	Evgeniy Polyakov


* Re: Extensible hashing and RCU
  2007-02-20 16:20                                         ` Evgeniy Polyakov
@ 2007-02-20 16:38                                           ` Eric Dumazet
  2007-02-20 16:59                                             ` Evgeniy Polyakov
  0 siblings, 1 reply; 102+ messages in thread
From: Eric Dumazet @ 2007-02-20 16:38 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Michael K. Edwards, David Miller, akepner, linux, netdev, bcrl

On Tuesday 20 February 2007 17:20, Evgeniy Polyakov wrote:
> On Tue, Feb 20, 2007 at 05:08:12PM +0100, Eric Dumazet (dada1@cosmosbay.com) 
wrote:
> > > Adding XOR with constant value does not change distribution.
> > > Variable salt will end up with different buckets for the same flow.
> > > It is forbidden - it is not the situation created for passwd/des
> > > decades ago.
> >
> > Adding a random hint to jhash (random value picked at boot time, not
> > known by attacker) permits to have a secure hash table : An attacker
> > cannot build an attack to fill one particular hash chain.
> >
> > See net/ipv4/route.c (function rt_hash_code()) to see how its used for
> > route cache.
>
> It is secrecy, not security - attacker will check the source and find
> where constant per-boot value is added and recalculate attack vector -
> we all were college students, it would be even more fun to crack.
>
> In that regard Jenkins hash and XOR one have _exactly_ the same attack
> vector, only Jenkins is a bit more sophisticated. I even think that
> example in rt_hash_code() will end up with heavy problems when one of the
> addresses is constant - my tests show problem exactly in the case of
> jhash_2words() with random third parameter and constant one of the first
> like in rt_hash_code().

Please define heavy problem.

On most hosts, with one NIC, one IP address, most entries in cache have the 
same address (IP address of eth0 or localhost). It just works.

Last time I checked, the 2^21 route cache I am using was correctly filled, 
thanks to jhash.

Again, the random value is 32 bits. If jhash happens to be cracked by your 
students, we just put in md5 or whatever...

You can call it secrecy or whatever; the fact is, it's just working, far better 
than the previous XOR hash function.



* Re: Extensible hashing and RCU
  2007-02-20 16:38                                           ` Eric Dumazet
@ 2007-02-20 16:59                                             ` Evgeniy Polyakov
  2007-02-20 17:05                                               ` Evgeniy Polyakov
  2007-02-20 17:20                                               ` Eric Dumazet
  0 siblings, 2 replies; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-20 16:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michael K. Edwards, David Miller, akepner, linux, netdev, bcrl

On Tue, Feb 20, 2007 at 05:38:19PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> > It is secrecy, not security - attacker will check the source and find
> > where constant per-boot value is added and recalculate attack vector -
> > we all were college students, it would be even more fun to crack.
> >
> > In that regard Jenkins hash and XOR one have _exactly_ the same attack
> > vector, only Jenkins is a bit more sophisticated. I even think that
> > example in rt_hash_code() will end up with heavy problems when one of the
> > addresses is constant - my tests show problem exactly in the case of
> > jhash_2words() with random third parameter and constant one of the first
> > like in rt_hash_code().
> 
> Please define heavy problem.
> 
> On most hosts, with one NIC, one IP address, most entries in cache have the 
> same address (IP address of eth0 or localhost). It just works.
> 
> Last time I checked, the 2^21 route cache I am using was correctly filled, 
> thanks to jhash.
> 
> Again, the random value is 32bits. If jhash happens to be cracked by your 
> students, we just put md5 or whatever in...
>
> You can call it secrecy or whatever, fact is : it's just working, far better 
> than XOR previous hash function.

Hmm, I've just run the following test:
1. created a 2^20 hash table.
2. ran in a loop (100*(2^20) iterations) the following hashes:
 a. xor hash (const_ip, const_ip, random_word)
 b. jhash_3words(const_ip, const_ip, random_word, 123123) - which is
	 exactly jhash_2words(const_ip, const_ip, random_word)
3. hash &= hash_size - 1;
4. table[hash].counter++;
5. 	for (i=0; i<hash_size; ++i)
		results[table[i].counter]++;

And got a pretty artefact for the Jenkins hash, attached in the picture
artefact.png. The distribution.png file contains the distribution for a 2^10
hash table with 100*(2^10) iterations of the above scenario.

I've attached source code and running script.
$ ./run.sh

will produce gnuplot window with shown artefact.

If one changes the comments at the end of the file, run.sh will produce
distribution graphs.

P.S. The Jenkins hash is about two times slower.

-- 
	Evgeniy Polyakov


* Re: Extensible hashing and RCU
  2007-02-20 16:59                                             ` Evgeniy Polyakov
@ 2007-02-20 17:05                                               ` Evgeniy Polyakov
  2007-02-20 17:53                                                 ` Eric Dumazet
  2007-02-20 17:20                                               ` Eric Dumazet
  1 sibling, 1 reply; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-20 17:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michael K. Edwards, David Miller, akepner, linux, netdev, bcrl

[-- Attachment #1: Type: text/plain, Size: 189 bytes --]

On Tue, Feb 20, 2007 at 07:59:07PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> I've attached source code and running script.
> $ ./run.sh

Yep, of course.

-- 
	Evgeniy Polyakov

[-- Attachment #2: test.c --]
[-- Type: text/plain, Size: 2308 bytes --]

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <time.h>

#include <arpa/inet.h>

typedef unsigned int u32;
typedef unsigned short u16;
typedef unsigned char u8;

typedef unsigned int __u32;
typedef unsigned short __u16;
typedef unsigned char __u8;

#include "jhash.h"

struct hash_entry
{
	unsigned long		counter;
};

static inline __u32 num2ip(__u8 a1, __u8 a2, __u8 a3, __u8 a4)
{
	__u32 a = 0;

	a |= a1;
	a <<= 8;
	a |= a2;
	a <<= 8;
	a |= a3;
	a <<= 8;
	a |= a4;

	return a;
}

/* Random byte in [1, 255]. */
static inline __u8 get_random_byte(void)
{
	return 1 + (int) (255.0 * (rand() / (RAND_MAX + 1.0)));
}

/* Random word in [1, 65535]; the original used 65536.0, which can wrap to 0. */
static inline __u16 get_random_word(void)
{
	return 1 + (int) (65535.0 * (rand() / (RAND_MAX + 1.0)));
}

/* Simple XOR fold of the 4-tuple (the old-style hash). */
unsigned int hash_addr1(__u32 faddr, __u16 fport, __u32 laddr, __u16 lport)
{
	unsigned int h = (laddr ^ lport) ^ (faddr ^ fport);
	h ^= h >> 16;
	h ^= h >> 8;
	return h;
}

/* Jenkins hash of the 4-tuple, packing both ports into one word. */
unsigned int hash_addr(__u32 faddr, __u16 fport, __u32 laddr, __u16 lport)
{
	u32 ports;
	unsigned int h;

	ports = lport;
	ports <<= 16;
	ports |= fport;

	h = jhash_3words(faddr, laddr, ports, 123123);
#ifdef HASH_FOLD
	h ^= h >> 16;
	h ^= h >> 8;
#endif
	return h;
}

int main()
{
	struct hash_entry *table;
	__u32 saddr, daddr;
	__u16 sport, dport;
	unsigned int hash, i, *results;
	unsigned int hash_size = (1<<10), iter_num = 100;

	table = malloc(hash_size * sizeof(struct hash_entry));
	if (!table)
		return -ENOMEM;
	
	results = malloc(hash_size * sizeof(unsigned int));
	if (!results)
		return -ENOMEM;

	for (i=0; i<hash_size; ++i) {
		results[i] = 0;
		table[i].counter = 0;
	}

	srand(time(NULL));

	daddr = num2ip(192, 168, 0, 1);
	dport = htons(80);
	saddr = num2ip(192, 168, 0, 2);	/* overridden just below with a random address */
	saddr = num2ip(get_random_byte(), get_random_byte(), get_random_byte(), get_random_byte());
	sport = ntohs(10000);
	
	for (i=0; i<hash_size*iter_num; ++i) {
		//sport++;
		sport = get_random_word();

#ifdef USE_XOR
		hash = hash_addr1(saddr, ntohs(sport), daddr, dport);
#else
		hash = hash_addr(saddr, ntohs(sport), daddr, dport);
#endif
		hash &= (hash_size - 1);

		table[hash].counter++;
	}

	for (i=0; i<hash_size; ++i)
		if (table[i].counter < hash_size)	/* guard against overflowing results[] */
			results[table[i].counter]++;
	
	for (i=1; i<hash_size; ++i)
		printf("%u %u\n", i, results[i]);
		//printf("%u %lu\n", i, table[i].counter);
	
	return 0;
}

[-- Attachment #3: run.sh --]
[-- Type: application/x-sh, Size: 414 bytes --]

[-- Attachment #4: artefact.png --]
[-- Type: image/png, Size: 5071 bytes --]

[-- Attachment #5: distribution.png --]
[-- Type: image/png, Size: 9984 bytes --]


* Re: Extensible hashing and RCU
  2007-02-20 16:59                                             ` Evgeniy Polyakov
  2007-02-20 17:05                                               ` Evgeniy Polyakov
@ 2007-02-20 17:20                                               ` Eric Dumazet
  2007-02-20 17:55                                                 ` Evgeniy Polyakov
  1 sibling, 1 reply; 102+ messages in thread
From: Eric Dumazet @ 2007-02-20 17:20 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Michael K. Edwards, David Miller, akepner, linux, netdev, bcrl

On Tuesday 20 February 2007 17:59, Evgeniy Polyakov wrote:
> On Tue, Feb 20, 2007 at 05:38:19PM +0100, Eric Dumazet (dada1@cosmosbay.com) 
wrote:
> > > It is secrecy, not security - attacker will check the source and find
> > > where constant per-boot value is added and recalculate attack vector -
> > > we all were college students, it would be even more fun to crack.
> > >
> > > In that regard Jenkins hash and XOR one have _exactly_ the same attack
> > > vector, only Jenkins is a bit more sophisticated. I even think that
> > > example in rt_hash_code() will end up with heavy problems when one of
> > > the addresses is constant - my tests show problem exactly in the case
> > > of jhash_2words() with random third parameter and constant one of the
> > > first like in rt_hash_code().
> >
> > Please define heavy problem.
> >
> > On most hosts, with one NIC, one IP address, most entries in cache have
> > the same address (IP address of eth0 or localhost). It just works.
> >
> > Last time I checked, the 2^21 route cache I am using was correctly
> > filled, thanks to jhash.
> >
> > Again, the random value is 32bits. If jhash happens to be cracked by your
> > students, we just put md5 or whatever in...
> >
> > You can call it secrecy or whatever, fact is : it's just working, far
> > better than XOR previous hash function.
>
> Hmm, I've just ran following test:
> 1. created 2^20 hash table.
> 2. ran in loop (100*(2^20) iterations) following hashes:
>  a. xor hash (const_ip, const_ip, random_word)

So what? To attack me you want to send 100*2^20 packets every minute?

That's nonsense... If you really can send so many packets, my pipe is full 
whatever I do with received packets. No algorithm will protect me, even one 
designed by Einstein.

If you look again at the route cache, you will see chain lengths are limited by 
an elasticity factor, usually 8... No need to try to reach 100 entries 
in a chain.

Yes, I can destroy Russia sending 2^10 nuclear weapons on major cities. You 
really should build a bunker right now :)

Now try to build an attack with 100 packets per second... and I will try to be 
smart too.



* Re: Extensible hashing and RCU
  2007-02-20 17:05                                               ` Evgeniy Polyakov
@ 2007-02-20 17:53                                                 ` Eric Dumazet
  2007-02-20 18:00                                                   ` Evgeniy Polyakov
  0 siblings, 1 reply; 102+ messages in thread
From: Eric Dumazet @ 2007-02-20 17:53 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Michael K. Edwards, David Miller, akepner, linux, netdev, bcrl

On Tuesday 20 February 2007 18:05, Evgeniy Polyakov wrote:
> On Tue, Feb 20, 2007 at 07:59:07PM +0300, Evgeniy Polyakov 
(johnpol@2ka.mipt.ru) wrote:
> > I've attached source code and running script.
> > $ ./run.sh
>
> Yep, of course.

Your test program is just bogus. The artefacts come from your 'random' generator.

You just increment a counter, assuming the key you search for is not in the table 
yet.

But obviously, with only a variation of sport (16 bits), you have a maximum of 
65536 values. No need to feed 100*2^20 values, as most of them are dups.

Now if you change your program to do a real lookups with the 2^16 possible 
values of sport you'll see :

jhash function :
1 61578
2 1916
3 42

that is : 61578 chains of length 1
1916 chains of length 2
42 chains of length 3

(for reference, with HASH_FOLD, about the same results:
1 61692
2 1856
3 44)

Pretty good results... for the gain jhash gives us.

Of course, XOR hash gives a 'perfect' 65536 chains of length 1.


* Re: Extensible hashing and RCU
  2007-02-20 17:20                                               ` Eric Dumazet
@ 2007-02-20 17:55                                                 ` Evgeniy Polyakov
  2007-02-20 18:12                                                   ` Evgeniy Polyakov
  0 siblings, 1 reply; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-20 17:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michael K. Edwards, David Miller, akepner, linux, netdev, bcrl

On Tue, Feb 20, 2007 at 06:20:26PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> > Hmm, I've just ran following test:
> > 1. created 2^20 hash table.
> > 2. ran in loop (100*(2^20) iterations) following hashes:
> >  a. xor hash (const_ip, const_ip, random_word)
> 
> So what ? to attack me you want to send 100*2^20 packets every minute ?

:) No, I will specially craft 1000 packets which will hit the same
chain.

> That's nonsense... If you really can send so many packets, my pipe is full 
> whatever I do with received packets. No algorithm will protect me, even one 
> designed by Einstein.

Did you ever read what I wrote?
It is a test which shows that
1. jenkins has problems
2. it is two times slower than xor

How to exploit the problem in the real world is outside the scope of that
research, but it is enough to say that it is broken.

> If you look again at the route cache, you will see chain lengths are limited by 
> an elasticity factor, usually 8... No need to try to reach 100 entries 
> in a chain.
> 
> Yes, I can destroy Russia sending 2^10 nuclear weapons on major cities. You 
> really should build a bunker right now :)

France only has 100 delivery vehicles (about 50 submarines and 50
Mirages) - so no, I will not :)

> Now try to build an attack with 100 packets per second... and I will try to be 
> smart too.

Depending on the end result... Wanna buy me (or suggest) a couple of bottles of 
good, inexpensive French wine? :)

Here is a dump of possible addr/port pairs which end up badly
distributed:

8e363a50:27652 -> c0a80001:20480
8e363a50:35529 -> c0a80001:20480
8e363a50:40919 -> c0a80001:20480
8e363a50:46720 -> c0a80001:20480

they produce the same hash value in the test described above.

-- 
	Evgeniy Polyakov


* Re: Extensible hashing and RCU
  2007-02-20 17:53                                                 ` Eric Dumazet
@ 2007-02-20 18:00                                                   ` Evgeniy Polyakov
  2007-02-20 18:55                                                     ` Eric Dumazet
  0 siblings, 1 reply; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-20 18:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michael K. Edwards, David Miller, akepner, linux, netdev, bcrl

On Tue, Feb 20, 2007 at 06:53:59PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> > > I've attached source code and running script.
> > > $ ./run.sh
> >
> > Yep, of course.
> 
> Your test program is just bogus. artefacts come from your 'random' generator.
> 
> You just increment a counter, assuming the key you search is not in the table 
> yet.

No, that part is commented, but it does not matter.

> But obviously with only a variation of sport (16 bits), you have a maximum of 
> 65536 values. No need to feed 100*2^20 values are most of them are dups.
> 
> Now if you change your program to do a real lookups with the 2^16 possible 
> values of sport you'll see :

No need to change anything - I showed that in some cases jenkins ends
up with a huge problem. If the test is changed, or the sun is turned
off, things will behave well, but as-is they behave very badly.

> jhash function :
> 1 61578
> 2 1916
> 3 42
> 
> that is : 61578 chains of length 1
> 1916 chains of length 2
> 42 chains of length 3
> 
> (for reference, with HASH_FOLD, about same results :
> 1 61692
> 2 1856
> 3 44
> 
> Pretty good results... for the gain jhash gives us.
> 
> Of course, XOR hash gives a 'perfect' 65536 chains of length 1.

As you can see, jhash has problems in the really trivial case of 2^16,
which on a local LAN is a disaster. The only reason the jenkins hash is good
for hashing purposes is that it is more complex than the xor one, and thus
it is harder to find a collision law. That's all.
And it is two times slower.

-- 
	Evgeniy Polyakov


* Re: Extensible hashing and RCU
  2007-02-20 17:55                                                 ` Evgeniy Polyakov
@ 2007-02-20 18:12                                                   ` Evgeniy Polyakov
  2007-02-20 19:13                                                     ` Michael K. Edwards
  0 siblings, 1 reply; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-20 18:12 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michael K. Edwards, David Miller, akepner, linux, netdev, bcrl

On Tue, Feb 20, 2007 at 08:55:50PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> Here is a dump of possible addr/port pairs which end up badly
> distributed:
> 
> 8e363a50:27652 -> c0a80001:20480
> 8e363a50:35529 -> c0a80001:20480
> 8e363a50:40919 -> c0a80001:20480
> 8e363a50:46720 -> c0a80001:20480
> 
> they produce the same hash value in the test described above.

And here are more pairs which produce the same hash value.
Of course, searching for a colliding pair for jhash('jhash is broken')
will require more steps, but it is doable.

That means that if an attacker has full control over one host, it can
create a chain of at most 4 entries in the socket table (if jhash is used). 
If it is udp, the attacker controls the addresses too without 
syn cookies, which in turn means that the list below can grow without 
limit.

8e363a50:22210 -> c0a80001:20480 10403
8e363a50:58377 -> c0a80001:20480 10403
8e363a50:9272 -> c0a80001:20480 10403
8e363a50:4173 -> c0a80001:20480 130f8
8e363a50:44401 -> c0a80001:20480 130f8
8e363a50:53439 -> c0a80001:20480 130f8
8e363a50:44525 -> c0a80001:20480 14391
8e363a50:46858 -> c0a80001:20480 14391
8e363a50:50030 -> c0a80001:20480 14391
8e363a50:40337 -> c0a80001:20480 1c66d
8e363a50:53249 -> c0a80001:20480 1c66d
8e363a50:65307 -> c0a80001:20480 1c66d
8e363a50:10433 -> c0a80001:20480 1fd1b
8e363a50:49548 -> c0a80001:20480 1fd1b
8e363a50:64835 -> c0a80001:20480 1fd1b
8e363a50:14889 -> c0a80001:20480 206ae
8e363a50:29984 -> c0a80001:20480 206ae
8e363a50:44282 -> c0a80001:20480 206ae
8e363a50:27521 -> c0a80001:20480 2a8c8
8e363a50:34493 -> c0a80001:20480 2a8c8
8e363a50:41134 -> c0a80001:20480 2a8c8
8e363a50:50387 -> c0a80001:20480 2c1fc
8e363a50:56740 -> c0a80001:20480 2c1fc
8e363a50:58943 -> c0a80001:20480 2c1fc
8e363a50:23856 -> c0a80001:20480 31ac2
8e363a50:35034 -> c0a80001:20480 31ac2
8e363a50:62638 -> c0a80001:20480 31ac2
8e363a50:15623 -> c0a80001:20480 33b81
8e363a50:24235 -> c0a80001:20480 33b81
8e363a50:38581 -> c0a80001:20480 33b81
8e363a50:23779 -> c0a80001:20480 37e65
8e363a50:42244 -> c0a80001:20480 37e65
8e363a50:6729 -> c0a80001:20480 37e65
8e363a50:11002 -> c0a80001:20480 3d06d
8e363a50:4321 -> c0a80001:20480 3d06d
8e363a50:5255 -> c0a80001:20480 3d06d
8e363a50:19326 -> c0a80001:20480 439c7
8e363a50:6187 -> c0a80001:20480 439c7
8e363a50:61932 -> c0a80001:20480 439c7
8e363a50:36916 -> c0a80001:20480 472ce
8e363a50:39670 -> c0a80001:20480 472ce
8e363a50:50520 -> c0a80001:20480 472ce
8e363a50:14229 -> c0a80001:20480 4e5f2
8e363a50:16897 -> c0a80001:20480 4e5f2
8e363a50:3340 -> c0a80001:20480 4e5f2
8e363a50:12892 -> c0a80001:20480 5d11
8e363a50:3998 -> c0a80001:20480 5d11
8e363a50:50654 -> c0a80001:20480 5d11
8e363a50:37267 -> c0a80001:20480 5e30e
8e363a50:41659 -> c0a80001:20480 5e30e
8e363a50:57118 -> c0a80001:20480 5e30e
8e363a50:27652 -> c0a80001:20480 6a284
8e363a50:35529 -> c0a80001:20480 6a284
8e363a50:40919 -> c0a80001:20480 6a284
8e363a50:46720 -> c0a80001:20480 6a284
8e363a50:1825 -> c0a80001:20480 6af47
8e363a50:3025 -> c0a80001:20480 6af47
8e363a50:49431 -> c0a80001:20480 6af47
8e363a50:17218 -> c0a80001:20480 77300
8e363a50:48400 -> c0a80001:20480 77300
8e363a50:9188 -> c0a80001:20480 77300
8e363a50:48327 -> c0a80001:20480 7cf09
8e363a50:55417 -> c0a80001:20480 7cf09
8e363a50:57221 -> c0a80001:20480 7cf09
8e363a50:10586 -> c0a80001:20480 809af
8e363a50:11371 -> c0a80001:20480 809af
8e363a50:27313 -> c0a80001:20480 809af
8e363a50:34688 -> c0a80001:20480 80bf3
8e363a50:58611 -> c0a80001:20480 80bf3
8e363a50:61056 -> c0a80001:20480 80bf3
8e363a50:10367 -> c0a80001:20480 85eae
8e363a50:3761 -> c0a80001:20480 85eae
8e363a50:57021 -> c0a80001:20480 85eae
8e363a50:10940 -> c0a80001:20480 88c52
8e363a50:26256 -> c0a80001:20480 88c52
8e363a50:7363 -> c0a80001:20480 88c52
8e363a50:10613 -> c0a80001:20480 89d75
8e363a50:54306 -> c0a80001:20480 89d75
8e363a50:59263 -> c0a80001:20480 89d75
8e363a50:16004 -> c0a80001:20480 91821
8e363a50:269 -> c0a80001:20480 91821
8e363a50:38109 -> c0a80001:20480 91821
8e363a50:1073 -> c0a80001:20480 96854
8e363a50:34201 -> c0a80001:20480 96854
8e363a50:58160 -> c0a80001:20480 96854
8e363a50:11353 -> c0a80001:20480 a17c4
8e363a50:37120 -> c0a80001:20480 a17c4
8e363a50:43332 -> c0a80001:20480 a17c4
8e363a50:26356 -> c0a80001:20480 a2e03
8e363a50:46187 -> c0a80001:20480 a2e03
8e363a50:61198 -> c0a80001:20480 a2e03
8e363a50:12881 -> c0a80001:20480 a7466
8e363a50:45272 -> c0a80001:20480 a7466
8e363a50:52661 -> c0a80001:20480 a7466
8e363a50:32863 -> c0a80001:20480 a7eeb
8e363a50:33575 -> c0a80001:20480 a7eeb
8e363a50:9977 -> c0a80001:20480 a7eeb
8e363a50:23136 -> c0a80001:20480 a9e47
8e363a50:41222 -> c0a80001:20480 a9e47
8e363a50:43554 -> c0a80001:20480 a9e47
8e363a50:3248 -> c0a80001:20480 b365
8e363a50:3417 -> c0a80001:20480 b365
8e363a50:61275 -> c0a80001:20480 b365
8e363a50:25606 -> c0a80001:20480 b511e
8e363a50:46638 -> c0a80001:20480 b511e
8e363a50:59262 -> c0a80001:20480 b511e
8e363a50:24384 -> c0a80001:20480 b571d
8e363a50:34078 -> c0a80001:20480 b571d
8e363a50:64346 -> c0a80001:20480 b571d
8e363a50:11934 -> c0a80001:20480 b90b1
8e363a50:32598 -> c0a80001:20480 b90b1
8e363a50:54122 -> c0a80001:20480 b90b1
8e363a50:41677 -> c0a80001:20480 ba2fe
8e363a50:61476 -> c0a80001:20480 ba2fe
8e363a50:65145 -> c0a80001:20480 ba2fe
8e363a50:31764 -> c0a80001:20480 cd942
8e363a50:48000 -> c0a80001:20480 cd942
8e363a50:57653 -> c0a80001:20480 cd942
8e363a50:247 -> c0a80001:20480 db891
8e363a50:28001 -> c0a80001:20480 db891
8e363a50:53241 -> c0a80001:20480 db891
8e363a50:46947 -> c0a80001:20480 e820c
8e363a50:51565 -> c0a80001:20480 e820c
8e363a50:63465 -> c0a80001:20480 e820c
8e363a50:1046 -> c0a80001:20480 ec738
8e363a50:17629 -> c0a80001:20480 ec738
8e363a50:63098 -> c0a80001:20480 ec738
8e363a50:35056 -> c0a80001:20480 f0ae6
8e363a50:42973 -> c0a80001:20480 f0ae6
8e363a50:51422 -> c0a80001:20480 f0ae6
8e363a50:10479 -> c0a80001:20480 fefc9
8e363a50:42078 -> c0a80001:20480 fefc9
8e363a50:45178 -> c0a80001:20480 fefc9

-- 
	Evgeniy Polyakov


* Re: Extensible hashing and RCU
  2007-02-20 18:00                                                   ` Evgeniy Polyakov
@ 2007-02-20 18:55                                                     ` Eric Dumazet
  2007-02-20 19:06                                                       ` Evgeniy Polyakov
  0 siblings, 1 reply; 102+ messages in thread
From: Eric Dumazet @ 2007-02-20 18:55 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Michael K. Edwards, David Miller, akepner, linux, netdev, bcrl

On Tuesday 20 February 2007 19:00, Evgeniy Polyakov wrote:
> As you can see, jhash has problems in a really trivial case of 2^16,
> which in local lan is a disaster. The only reason jenkins hash is good
> > for hashing purposes is that it is more complex than the xor one, and thus
> it is harder to find a collision law. That's all.
> And it is two times slower.

I see no problem at all. An attacker cannot exploit the fact that two (or 
three) different values of sport will hit the same hash bucket.

A hash function may have collisions. It is *designed* like that.

The complexity of the hash function is a tradeoff. A perfect hash would have:
- Perfect distribution
- Hard (or even: impossible) for an attacker to guess
- No CPU cost

There is no perfect hash function... given 96 bits of input.
So what? Hashes are 'badly broken'?
That's just not even funny, Evgeniy.

The 'two times slower' is a simplistic view - or maybe you have an alien CPU, 
or a CPU from the past?

On my oprofile, rt_hash_code() uses 0.24% of cpu (about 50 x86_64 
instructions).

Each time a cache miss happens because your bucket length is (X+1) instead of 
(X), your CPU is stuck while it could have done 150 instructions. Future CPUs 
will do 300 instructions per cache miss, maybe 1000 one day... yes, life is 
hard.

I added to my 'simulator_plugged_on_real_server' the average cost calculation, 
relative to number of cache line per lookup.

ehash_size=2^20
xor hash :
386290 sockets, Avg lookup cost=3.2604 cache lines/lookup
393667 sockets, Avg lookup cost=3.30579 cache lines/lookup
400777 sockets, Avg lookup cost=3.3493 cache lines/lookup
404720 sockets, Avg lookup cost=3.36705 cache lines/lookup
406671 sockets, Avg lookup cost=3.37677 cache lines/lookup
jenkin hash:
386290 sockets, Avg lookup cost=2.36763 cache lines/lookup
393667 sockets, Avg lookup cost=2.37533 cache lines/lookup
400777 sockets, Avg lookup cost=2.38211 cache lines/lookup
404720 sockets, Avg lookup cost=2.38582 cache lines/lookup
406671 sockets, Avg lookup cost=2.38679 cache lines/lookup

(you can see that as the number of sockets increases, the xor hash becomes worse)

So the Jenkins hash function's CPU cost is balanced by the fact that its 
distribution is better. In the end you are faster.




* Re: Extensible hashing and RCU
  2007-02-20 18:55                                                     ` Eric Dumazet
@ 2007-02-20 19:06                                                       ` Evgeniy Polyakov
  2007-02-20 19:17                                                         ` Eric Dumazet
  0 siblings, 1 reply; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-20 19:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michael K. Edwards, David Miller, akepner, linux, netdev, bcrl

On Tue, Feb 20, 2007 at 07:55:15PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> On Tuesday 20 February 2007 19:00, Evgeniy Polyakov wrote:
> > As you can see, jhash has problems in a really trivial case of 2^16,
> > which in local lan is a disaster. The only reason jenkins hash is good
> > for hashing purposes is that it is more complex than the xor one, and thus
> > it is harder to find a collision law. That's all.
> > And it is two times slower.
> 
> I see no problems at all. An attacker can not exploit the fact that two (or 
> three) different values of sport will hit the same hash bucket.
> 
> A hash function may have collisions. This is *designed* like that.
> 
> The complexity of the hash function is a tradeoff. A perfect hash would be :
> - Perfect distribution
> - Hard (or even : impossible) to guess for an attacker.
> - No CPU cost.
> 
> There is no perfect hash function... given 96 bits in input.
> So what ? hashes are 'badly broken' ?
> Thats just not even funny Evgeniy.

Jenkins has a _worse_ distribution than the xor one.
_That_ is bad, not the fact that the hash has collisions.

hash(val) = val >> 16;
is a hash too, and it has an even worse distribution - it is designed
even worse, so we do not use it.

> The 'two times slower' is a simplistic view, or maybe you have an alien CPU, 
> or a CPU from the past ?

It is a Core Duo at 3.7 GHz.
Timings are printed in the test I posted to the list.

> On my oprofile, rt_hash_code() uses 0.24% of cpu (about 50 x86_64 
> instructions)
> 
> Each time a cache miss happens because your bucket length is (X+1) instead of 
> (X), your CPU is stuck while it could have done 150 instructions. Future CPUs 
> will do 300 instructions per cache miss, maybe 1000 one day... yes, life is 
> hard.
> 
> I added to my 'simulator_plugged_on_real_server' the average cost calculation, 
> relative to number of cache line per lookup.
> 
> ehash_size=2^20
> xor hash :
> 386290 sockets, Avg lookup cost=3.2604 cache lines/lookup
> 393667 sockets, Avg lookup cost=3.30579 cache lines/lookup
> 400777 sockets, Avg lookup cost=3.3493 cache lines/lookup
> 404720 sockets, Avg lookup cost=3.36705 cache lines/lookup
> 406671 sockets, Avg lookup cost=3.37677 cache lines/lookup
> jenkin hash:
> 386290 sockets, Avg lookup cost=2.36763 cache lines/lookup
> 393667 sockets, Avg lookup cost=2.37533 cache lines/lookup
> 400777 sockets, Avg lookup cost=2.38211 cache lines/lookup
> 404720 sockets, Avg lookup cost=2.38582 cache lines/lookup
> 406671 sockets, Avg lookup cost=2.38679 cache lines/lookup
> 
> (you can see that when number of sockets increase, the xor hash becomes worst)
> 
> So the jenkin hash function CPU cost is balanced by the fact its distribution 
> is better. In the end you are faster.

A very strange test - it shows that the jenkins distribution for your setup is
better than the xor one, although for truly random data they are roughly
the same, and the jenkins one takes more instructions.

But _you_ have shown that with truly random data over 2^16 ports the jenkins
distribution is _worse_ than xor, with no gain to show for it.

-- 
	Evgeniy Polyakov


* Re: Extensible hashing and RCU
  2007-02-20 18:12                                                   ` Evgeniy Polyakov
@ 2007-02-20 19:13                                                     ` Michael K. Edwards
  2007-02-20 19:44                                                       ` Evgeniy Polyakov
  0 siblings, 1 reply; 102+ messages in thread
From: Michael K. Edwards @ 2007-02-20 19:13 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Eric Dumazet, David Miller, akepner, linux, netdev, bcrl

On 2/20/07, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> And here are another ones which produce the same hash value.
> Of course searching for pair for jhash('jhash is broken')
> will require more steps, but it is doable.
>
> That means that if attacker has a full control over one host, it can
> create a chain of maximum 4 entries in socket table (if jhash is used).
> If it is udp, that means that attacker control addresses too without
> syn cookies, which in turn means that below list can be increased to
> infinite.

Evgeniy, you're not gettin' it.  Have you ever heard of a Poisson
distribution?  That's what you have to aim for in a (bog-standard)
hash table -- a hash function that scrambles the key space until you
have a Poisson distribution in the hash buckets no matter whan pile of
keys the attacker throws at you.  That's a "perfect" hash function in
the statistician's sense (not a "perfect hash" function in the
compiler designer's sense).  Yes there will be collisions, and they
will start happening much earlier than you will think, if you have
never heard of Poisson distributions before; that's called the
"birthday problem" and it is a chestnut of a mathematical puzzle
generally encountered by bright twelve-year-olds in extra-curricular
math classes.  The only sensible way to deal with them is to use a
better data structure -- like a 2-left hash (Google it) or something
tree-ish.

That is not, however, what you're not getting.  What you're not
getting is the use of a "salt" or "secret" or whatever you want to
call it to turn a weak hash into an impenetrable defense against
chosen-plaintext collision attacks.  You can run your PRNG until you
slag your poor botnet's little CPUs searching for a set of tuples that
collide in the same bucket on your test system-under-attack.  But they
won't collide on mine, and they won't collide on Eric's, and they
won't collide on yours the next time it's rebooted.  Because even a
weak hash (in the cryptographer's sense) is good enough to scramble
the key space radically differently with two different salts.  XOR
doesn't cut it -- the salt will permute the buckets but not toss each
one's contents up into the air to land in a bunch of different
buckets.  But Jenkins does cut it, as discussed in the source code
Eric pointed out to you.

Of course you don't change the salt except when resizing the hash
(which is a dumb thing to do anyway, and a sure sign that a hash table
is not the right data structure).  That would kinda defeat the purpose
of having a hash table in the first place.

If you can't assimilate this and articulate it back to us in your own
words instead of repeating "any hash that doesn't totally, utterly
suck slows my meaningless artificial benchmark by 50%", you may not be
the right person to design and implement a new RCU data structure in a
kernel that thousands of us trust to withstand exposure to the open
Internet to the best of a commodity processor's ability.

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 19:06                                                       ` Evgeniy Polyakov
@ 2007-02-20 19:17                                                         ` Eric Dumazet
  2007-02-20 19:36                                                           ` Evgeniy Polyakov
  2007-02-20 19:44                                                           ` Michael K. Edwards
  0 siblings, 2 replies; 102+ messages in thread
From: Eric Dumazet @ 2007-02-20 19:17 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Michael K. Edwards, David Miller, akepner, linux, netdev, bcrl

On Tuesday 20 February 2007 20:06, Evgeniy Polyakov wrote:

> > I added to my 'simulator_plugged_on_real_server' the average cost
> > calculation, relative to number of cache line per lookup.
> >
> > ehash_size=2^20
> > xor hash :
> > 386290 sockets, Avg lookup cost=3.2604 cache lines/lookup
> > 393667 sockets, Avg lookup cost=3.30579 cache lines/lookup
> > 400777 sockets, Avg lookup cost=3.3493 cache lines/lookup
> > 404720 sockets, Avg lookup cost=3.36705 cache lines/lookup
> > 406671 sockets, Avg lookup cost=3.37677 cache lines/lookup
> > jenkin hash:
> > 386290 sockets, Avg lookup cost=2.36763 cache lines/lookup
> > 393667 sockets, Avg lookup cost=2.37533 cache lines/lookup
> > 400777 sockets, Avg lookup cost=2.38211 cache lines/lookup
> > 404720 sockets, Avg lookup cost=2.38582 cache lines/lookup
> > 406671 sockets, Avg lookup cost=2.38679 cache lines/lookup
> >
> > (you can see that when number of sockets increase, the xor hash becomes
> > worst)
> >
> > So the jenkin hash function CPU cost is balanced by the fact its
> > distribution is better. In the end you are faster.
>
> Very strange test - it shows that jenkins distribution for your setup is
> better than xor one, although for the true random data they are roughly
> the same, and jenkins one has more instructions.
>
> But _you_ have shown that with true random data of 2^16 ports jenkins
> distribution is _worse_ than xor without any gain to buy.

I showed your test was bogus. All your claims are just bogus.
I claim your 'true random data' is a joke. rand() in your program is a
pure joke.

Given 48 bits of input, you *can* find a lot of addr/port pairs that
hit one particular cache line if the XOR function is used. With jhash,
without knowing the 32-bit random secret, you *can't*.

Again, you don't take the chain length into account.

If all chains were of length <= 1, then yes, xor would be faster. In
real life, we *know* chain lengths can be larger, especially in DoS
situations.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 19:17                                                         ` Eric Dumazet
@ 2007-02-20 19:36                                                           ` Evgeniy Polyakov
  2007-02-20 19:44                                                           ` Michael K. Edwards
  1 sibling, 0 replies; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-20 19:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michael K. Edwards, David Miller, akepner, linux, netdev, bcrl

On Tue, Feb 20, 2007 at 08:17:31PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> I shown your test was bogus. All your claims are just bogus.
> I claim your 'true random data' is a joke. rand() in your program is a pure 
> joke.

Care to reread your mail about your true random case with hash chain
lengths of 3 and 4? Anyway, I have just shown that the jenkins hash is
simple to crack and that its collisions are easy to find - even if you
put some constant value in there it will be the same. It is math, not
some special speculation about input values.

> Given 48 bits of input, you *can* find a lot of addr/port to hit one 
> particular cache line if XOR function is used. With jhash, without knowing 
> the 32bits random secret, you *cant*.

You seem not to want to understand that it is exactly the same as
searching for a collision law. It is simple, and the results will be
dangerous.

> Again, you dont take into account the chain length.
> 
> If all chains were of length <= 1, then yes, xor would be faster. In real 
> life, we *know* chain length can be larger, especially in DOS situations.

I.e. you propose to add a hash which has a broken case for the same IP
addresses and different ports, compared to the good xor one?
It was shown that hash(const, const, non_const) ends up with a _broken_
distribution compared to the xor hash.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 19:13                                                     ` Michael K. Edwards
@ 2007-02-20 19:44                                                       ` Evgeniy Polyakov
  2007-02-20 20:03                                                         ` Michael K. Edwards
  0 siblings, 1 reply; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-20 19:44 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Eric Dumazet, David Miller, akepner, linux, netdev, bcrl

On Tue, Feb 20, 2007 at 11:13:50AM -0800, Michael K. Edwards (medwards.linux@gmail.com) wrote:
> On 2/20/07, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> >And here are another ones which produce the same hash value.
> >Of course searching for pair for jhash('jhash is broken')
> >will require more steps, but it is doable.
> >
> >That means that if attacker has a full control over one host, it can
> >create a chain of maximum 4 entries in socket table (if jhash is used).
> >If it is udp, that means that attacker control addresses too without
> >syn cookies, which in turn means that below list can be increased to
> >infinite.
> 
> Evgeniy, you're not gettin' it.  Have you ever heard of a Poisson
> distribution?  That's what you have to aim for in a (bog-standard)
> hash table -- a hash function that scrambles the key space until you
> have a Poisson distribution in the hash buckets no matter whan pile of
> keys the attacker throws at you.  That's a "perfect" hash function in
> the statistician's sense (not a "perfect hash" function in the
> compiler designer's sense).  Yes there will be collisions, and they
> will start happening much earlier than you will think, if you have
> never heard of Poisson distributions before; that's called the
> "birthday problem" and it is a chestnut of a mathematical puzzle
> generally encountered by bright twelve-year-olds in extra-curricular
> math classes.  The only sensible way to deal with them is to use a
> better data structure -- like a 2-left hash (Google it) or something
> tree-ish.
> 
> That is not, however, what you're not getting.  What you're not
> getting is the use of a "salt" or "secret" or whatever you want to
> call it to turn a weak hash into an impenetrable defense against
> chosen-plaintext collision attacks.  You can run your PRNG until you
> slag your poor botnet's little CPUs searching for a set of tuples that
> collide in the same bucket on your test system-under-attack.  But they
> won't collide on mine, and they won't collide on Eric's, and they
> won't collide on yours the next time it's rebooted.  Because even a
> weak hash (in the cryptographer's sense) is good enough to scramble
> the key space radically differently with two different salts.  XOR
> doesn't cut it -- the salt will permute the buckets but not toss each
> one's contents up into the air to land in a bunch of different
> buckets.  But Jenkins does cut it, as discussed in the source code
> Eric pointed out to you.
> 
> Of course you don't change the salt except when resizing the hash
> (which is a dumb thing to do anyway, and a sure sign that a hash table
> is not the right data structure).  That would kinda defeat the purpose
> of having a hash table in the first place.
> 
> If you can't assimilate this and articulate it back to us in your own
> words instead of repeating "any hash that doesn't totally, utterly
> suck slows my meaningless artificial benchmark by 50%", you may not be
> the right person to design and implement a new RCU data structure in a
> kernel that thousands of us trust to withstand exposure to the open
> Internet to the best of a commodity processor's ability.

How I love personal insults - it is always fun to read about myself
from people who have never met me :)

I have just shown a problem in the jenkins hash - it is not about how
to find a different input for the same output - it is a _law_ which
allows one to break the hash. You can add some constant, and that law
will just be turned into something different (taking into account what
was written, it will end up as the same law).

Using the jenkins hash is equal to the situation when part of your hash
chains will be 5 times longer than the median square value; with the
XOR one there is no such distribution.

Added somthing into permutations will not endup in different
distribution, since it is permutations which are broken, not its result
(which can be xored with something).

> Cheers,
> - Michael

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 19:17                                                         ` Eric Dumazet
  2007-02-20 19:36                                                           ` Evgeniy Polyakov
@ 2007-02-20 19:44                                                           ` Michael K. Edwards
  1 sibling, 0 replies; 102+ messages in thread
From: Michael K. Edwards @ 2007-02-20 19:44 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Evgeniy Polyakov, David Miller, akepner, linux, netdev, bcrl

All right, Eric, you and me so clevvah, let's embarrass our own selves
designing this thing in public instead of picking on poor Evgeniy.
Which would you rather RCU, a 2-left hash or a splay tree?  2-left
hash gets you excellent occupation fraction before the first real
collision, so you can be utterly stupid about collisions in the second
hash table (just toss all double collisions in an overflow radix
tree).  Splay tree is a lot bigger bite to chew initially, but it gets
you a hot set that's at least sometimes warm in cache, and it's easier
to think about how to RCU it, and it doubles as a priority queue.  You
pick.

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 19:44                                                       ` Evgeniy Polyakov
@ 2007-02-20 20:03                                                         ` Michael K. Edwards
  2007-02-20 20:09                                                           ` Michael K. Edwards
  2007-02-21  8:54                                                           ` Evgeniy Polyakov
  0 siblings, 2 replies; 102+ messages in thread
From: Michael K. Edwards @ 2007-02-20 20:03 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Eric Dumazet, David Miller, akepner, linux, netdev, bcrl

On 2/20/07, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> How I like personal insults - it is always fun to read about myself from
> people who never knew me :)

On this occasion, I did not set out to insult you.  I set out to
suggest an explanation for why cooler and grayer heads than mine are
not falling over themselves to merge your designs into the mainline
kernel.  Did I succeed?

> I just shown a problem in jenkins hash - it is not how to find a
> differnet input for the same output - it is a _law_ which allows to
> break a hash. You will add some constant, and that law will be turned
> into something different (getting into account what was written, it will
> end up with the same law).

Correct.  That's called a "weak hash", and Jenkins is known to be a
thoroughly weak hash.  That's why you never, ever use it without a
salt, and you don't let an attacker inspect the hash output either.

> Using jenkins hash is equal to the situation, when part of you hash
> chains will be 5 times longer than median square value, with XOR one
> there is no such distribution.

Show us the numbers.  Salt properly this time to reduce the artifacts
that come of applying a weak hash to a poor PRNG, and histogram your
results.  If you don't get a Poisson distribution you probably don't
know how to use gnuplot either.  :-)

> Added somthing into permutations will not endup in different
> distribution, since it is permutations which are broken, not its result
> (which can be xored with something).

I can't parse this.  Care to try again?

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 20:03                                                         ` Michael K. Edwards
@ 2007-02-20 20:09                                                           ` Michael K. Edwards
  2007-02-21  8:56                                                             ` Evgeniy Polyakov
  2007-02-21  8:54                                                           ` Evgeniy Polyakov
  1 sibling, 1 reply; 102+ messages in thread
From: Michael K. Edwards @ 2007-02-20 20:09 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Eric Dumazet, David Miller, akepner, linux, netdev, bcrl

On 2/20/07, Michael K. Edwards <medwards.linux@gmail.com> wrote:
> Correct.  That's called a "weak hash", and Jenkins is known to be a
> thoroughly weak hash.  That's why you never, ever use it without a
> salt, and you don't let an attacker inspect the hash output either.

Weak in a cryptographic sense, of course.  Excellent avalanche
behavior, though, which is what you care about in a salted hash.
http://en.wikipedia.org/wiki/Hash_table

I know nothing about data structures and algorithms except what I read
on the Internet.  But you'd be amazed what's on the Internet.

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 20:03                                                         ` Michael K. Edwards
  2007-02-20 20:09                                                           ` Michael K. Edwards
@ 2007-02-21  8:54                                                           ` Evgeniy Polyakov
  2007-02-21  9:15                                                             ` Eric Dumazet
  1 sibling, 1 reply; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-21  8:54 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Eric Dumazet, David Miller, akepner, linux, netdev, bcrl

On Tue, Feb 20, 2007 at 12:03:04PM -0800, Michael K. Edwards (medwards.linux@gmail.com) wrote:
> >I just shown a problem in jenkins hash - it is not how to find a
> >differnet input for the same output - it is a _law_ which allows to
> >break a hash. You will add some constant, and that law will be turned
> >into something different (getting into account what was written, it will
> >end up with the same law).
> 
> Correct.  That's called a "weak hash", and Jenkins is known to be a
> thoroughly weak hash.  That's why you never, ever use it without a
> salt, and you don't let an attacker inspect the hash output either.

Again, where will your salt be?
I'm going to show you that XORing in a constant will not change the
distribution - a fair one stays fair and a bad one stays bad.

> >Using jenkins hash is equal to the situation, when part of you hash
> >chains will be 5 times longer than median square value, with XOR one
> >there is no such distribution.
> 
> Show us the numbers.  Salt properly this time to reduce the artifacts
> that come of applying a weak hash to a poor PRNG, and histogram your
> results.  If you don't get a Poisson distribution you probably don't
> know how to use gnuplot either.  :-)

I have shown those numbers 4 times already - do you read the mails and
links? Did you see the artifact Eric showed with his data?

> >Added somthing into permutations will not endup in different
> >distribution, since it is permutations which are broken, not its result
> >(which can be xored with something).
> 
> I can't parse this.  Care to try again?

Where are you going to add a salt to the jenkins hash to fix its
distribution?

In other words - the jenkins hash is like a simple shift - that is a
hash too, and it has a bad distribution too; how would an added salt
ever help in that scenario?

> Cheers,
> - Michael

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 20:09                                                           ` Michael K. Edwards
@ 2007-02-21  8:56                                                             ` Evgeniy Polyakov
  2007-02-21  9:34                                                               ` David Miller
  0 siblings, 1 reply; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-21  8:56 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Eric Dumazet, David Miller, akepner, linux, netdev, bcrl

On Tue, Feb 20, 2007 at 12:09:59PM -0800, Michael K. Edwards (medwards.linux@gmail.com) wrote:
> On 2/20/07, Michael K. Edwards <medwards.linux@gmail.com> wrote:
> >Correct.  That's called a "weak hash", and Jenkins is known to be a
> >thoroughly weak hash.  That's why you never, ever use it without a
> >salt, and you don't let an attacker inspect the hash output either.
> 
> Weak in a cryptographic sense, of course.  Excellent avalanche
> behavior, though, which is what you care about in a salted hash.
> http://en.wikipedia.org/wiki/Hash_table

I repeat again - add your salt into jenkins hash and I will show you
that it has the same problems.
So, I'm waiting for your patch for jhash_*_words().

> I know nothing about data structures and algorithms except what I read
> on the Internet.  But you'd be amazed what's on the Internet.
> 
> Cheers,
> - Michael

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-21  8:54                                                           ` Evgeniy Polyakov
@ 2007-02-21  9:15                                                             ` Eric Dumazet
  2007-02-21  9:27                                                               ` Evgeniy Polyakov
  0 siblings, 1 reply; 102+ messages in thread
From: Eric Dumazet @ 2007-02-21  9:15 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Michael K. Edwards, David Miller, akepner, linux, netdev, bcrl

On Wednesday 21 February 2007 09:54, Evgeniy Polyakov wrote:
> I shown that numbers 4 times already, do you read mails and links?
> Did you see an artifact Eric showed with his data?
>

I showed that all your thinking is wrong.

Some guys think the MD5 checksum is full of artifacts, since it is
certainly possible to construct two different files having the same
md5sum. Yet lots of people use md5 checksums. In 10 years we will
probably switch to another, stronger hash. It is only a question of the
current state of the art.

Jenkins hash is far better than XOR, at least in the tcp ehash context.

Some hackers already exploited the XOR hash weakness, more than two
years ago. They have never succeeded since I changed to the Jenkins
hash.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-21  9:15                                                             ` Eric Dumazet
@ 2007-02-21  9:27                                                               ` Evgeniy Polyakov
  2007-02-21  9:38                                                                 ` Eric Dumazet
  0 siblings, 1 reply; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-21  9:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michael K. Edwards, David Miller, akepner, linux, netdev, bcrl

On Wed, Feb 21, 2007 at 10:15:11AM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> On Wednesday 21 February 2007 09:54, Evgeniy Polyakov wrote:
> > I shown that numbers 4 times already, do you read mails and links?
> > Did you see an artifact Eric showed with his data?
> >
> 
> I showed all your thinking is wrong.

You keep mixing up distribution fairness and the complexity of
searching for collisions. The jenkins hash is unfair - with the Linux
random generator as the attacker's tool we end up in the case where the
jenkins hash has some chains that are 5 times longer.

Short history:
I showed that jenkins is unfair; you confirmed that with your data.
Another question is the complexity of searching for collisions - you
showed that with xor it is quite easy and with jenkins it is complex,
then I showed that it is not that complex to find collisions in jenkins
too.

> Some guys think MD5 checksum is full of artifacts, since its certainly 
> possible to construct two differents files having the same md5sum.
> Yet, lot of people use md5 checksums. In 10 years, we probably switch to 
> another stronger hash. Its only a question of current state of the art.

It is _NOT_ about having two inputs with the same output.
It is about the distribution.

I say that with random input the output will not be fairly distributed -
and you confirmed it - 2^16 random values ended up with some chains of
length 4, while the hash table size perfectly allowed them all to have
length 1, as was the case with the xor hash.

> Jenkins hash is far better than XOR, at least in the tcp ehash context.

How can it be better than xor when its distribution is unfair?
You get longer hash chains with it compared to the xor one - _you_
showed that data too, if you do not believe mine.

> Some hackers already exploited the XOR hash weak, more than two years ago.
> They never succeeded since I changed to Jenkins hash.

It is a _different_ problem. It has _absolutely_ nothing in common with
the distribution fairness problem found in the jenkins hash; do not mix
the two.

XOR has only one problem - it is quite easy to find a collision - but
it has perfect distribution fairness.

Fix XOR by adding a random permutation, or use sha/md5/whatever that has:
1. fair distribution
2. strength against searching for collisions

The jenkins hash does not have at least the first, and, as I showed, it
is not that complex to break the second (for example, the
hash(const, const, random) case).

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-21  8:56                                                             ` Evgeniy Polyakov
@ 2007-02-21  9:34                                                               ` David Miller
  2007-02-21  9:51                                                                 ` Evgeniy Polyakov
  0 siblings, 1 reply; 102+ messages in thread
From: David Miller @ 2007-02-21  9:34 UTC (permalink / raw)
  To: johnpol; +Cc: medwards.linux, dada1, akepner, linux, netdev, bcrl

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Wed, 21 Feb 2007 11:56:08 +0300

> On Tue, Feb 20, 2007 at 12:09:59PM -0800, Michael K. Edwards (medwards.linux@gmail.com) wrote:
> > On 2/20/07, Michael K. Edwards <medwards.linux@gmail.com> wrote:
> > >Correct.  That's called a "weak hash", and Jenkins is known to be a
> > >thoroughly weak hash.  That's why you never, ever use it without a
> > >salt, and you don't let an attacker inspect the hash output either.
> > 
> > Weak in a cryptographic sense, of course.  Excellent avalanche
> > behavior, though, which is what you care about in a salted hash.
> > http://en.wikipedia.org/wiki/Hash_table
> 
> I repeat again - add your salt into jenkins hash and I will show you
> that it has the same problems.
> So, I'm waiting for your patch for jhash_*_words().

The problem is that whilst XOR, with an arbitrary random input
seed, can easily be forced to use chosen hash chains, with
jenkins this is not the case.

The reason is that, due to jenkins's sophisticated mixing, each
random input produces a unique "pattern" of hash chains even for
the most carefully crafted inputs.

It is not trivial to target matching hash chains even with a known
input seed, and it is impossible with an unknown seed such as the
one we use in the routing cache.

I do not talk about distribution characteristics here, only about
attackability.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-21  9:27                                                               ` Evgeniy Polyakov
@ 2007-02-21  9:38                                                                 ` Eric Dumazet
  2007-02-21  9:57                                                                   ` Evgeniy Polyakov
  0 siblings, 1 reply; 102+ messages in thread
From: Eric Dumazet @ 2007-02-21  9:38 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Michael K. Edwards, David Miller, akepner, linux, netdev, bcrl

On Wednesday 21 February 2007 10:27, Evgeniy Polyakov wrote:
> On Wed, Feb 21, 2007 at 10:15:11AM +0100, Eric Dumazet (dada1@cosmosbay.com) 
wrote:
> > On Wednesday 21 February 2007 09:54, Evgeniy Polyakov wrote:
> > > I shown that numbers 4 times already, do you read mails and links?
> > > Did you see an artifact Eric showed with his data?
> >
> > I showed all your thinking is wrong.
>
> You mix all the time distribution fairness and complexity of searching
> for collisions. Jenkins hash is unfair - having Linux random generator
> as attacker's tool we end up in the case, when jenkins hash has some
> chains with 5 times longer length.
>
> Short history:
> I showed that jenkins is unfair, you confirmed that with your data.
> Another question is about complexity of searching for collisions - you
> showed that with xor it is quite easy and with jenkins it is complex,
> then I showed that it is not that complex to find collisions in jenkins
> too.

Again, here is your 'test':

enter_hash(u32 val)
{
	counter[val & mask]++;
}

for (i = 0; i < 1000; i++)
	enter_hash(CONSTANT1);
for (i = 0; i < 1000; i++)
	enter_hash(CONSTANT2);

Oh my god, I have two chains with 1000 elems each.

Real data are :
1) XOR hash, with a load factor of 0.41

Distribution of sockets/chain length
[chain length]:number of chains  (cumulative % of sockets)
[0]:752255 0%
[1]:208850 47.455%
[2]:54281 72.1225%
[3]:19236 85.235%
[4]:8199 92.6869%
[5]:3487 96.6485%
[6]:1489 98.6785%
[7]:515 99.4976%
[8]:192 99.8466%
[9]:53 99.955%
[10]:14 99.9868%
[11]:3 99.9943%
[12]:1 99.997%
[13]:1 100%
total : 440101 sockets, Avg lookup cost=3.07896 cache lines

2) Jenkins hash, same load factor
[0]:688813 0%
[1]:289874 65.8653%
[2]:60452 93.3372%
[3]:8493 99.1266%
[4]:879 99.9255%
[5]:62 99.9959%
[6]:3 100%
total : 440101 sockets, Avg lookup cost=2.4175 cache lines




^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-21  9:34                                                               ` David Miller
@ 2007-02-21  9:51                                                                 ` Evgeniy Polyakov
  2007-02-21 10:03                                                                   ` David Miller
  0 siblings, 1 reply; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-21  9:51 UTC (permalink / raw)
  To: David Miller; +Cc: medwards.linux, dada1, akepner, linux, netdev, bcrl

On Wed, Feb 21, 2007 at 01:34:40AM -0800, David Miller (davem@davemloft.net) wrote:
> > I repeat again - add your salt into jenkins hash and I will show you
> > that it has the same problems.
> > So, I'm waiting for your patch for jhash_*_words().
> 
> The problem is that whilst XOR, with arbitrary random input
> seed, can be forced to use choosen hash chains easily, with
> jenkins this is not the case.
> 
> The reason is that, due to jenkin's sophisticated mixing, each
> random input produces unique "pattern" of hash chains even for
> the most carefully crafted inputs.
> 
> It is not trivial to target matching hash chains even with known
> input seed, and it is impossible with unknown seed such as that
> which we use in routing cache.

The routing cache is safe in that regard: the problem found seems to
affect only the case when the $c parameter is specially crafted - at
least I cannot reproduce the distribution problem with the first two
parameters changed.

But having a different initial value, i.e.
hash(const, const, random_from_attacker, random_per_boot) instead of
hash(const, const, random_from_attacker, 0), ends up with the same
problem. The attacker will not know which chain is selected, but
changing that parameter only changes the chain's position; the
distribution will still be broken.

> I do not talk about distribution characteristics here, only about
> attackability.

It _seems_ (no full analysis) that the problem is in the nature of the
jenkins hash - no matter how $initval is selected, its third parameter
should _not_ be set to data controlled by the attacker, otherwise the
result is reproduced pretty easily.

The Linux route cache does not change $c (the third parameter), and it
_seems_ that the distribution for random $a and $b is fair, while when
$c is formed from the attacker's data, a random per-boot $initval does
not help.

In that regard only jhash_2words()/jhash_1word() are safe - they have
$c as zero; jhash_3words() with an attacker's $c and a random per-boot
$initval is trivially attackable.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-21  9:38                                                                 ` Eric Dumazet
@ 2007-02-21  9:57                                                                   ` Evgeniy Polyakov
  2007-02-21 21:15                                                                     ` Michael K. Edwards
  0 siblings, 1 reply; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-02-21  9:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michael K. Edwards, David Miller, akepner, linux, netdev, bcrl

On Wed, Feb 21, 2007 at 10:38:22AM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> On Wednesday 21 February 2007 10:27, Evgeniy Polyakov wrote:
> > On Wed, Feb 21, 2007 at 10:15:11AM +0100, Eric Dumazet (dada1@cosmosbay.com) 
> wrote:
> > > On Wednesday 21 February 2007 09:54, Evgeniy Polyakov wrote:
> > > > I shown that numbers 4 times already, do you read mails and links?
> > > > Did you see an artifact Eric showed with his data?
> > >
> > > I showed all your thinking is wrong.
> >
> > You keep mixing up distribution fairness and the complexity of searching
> > for collisions. The jenkins hash is unfair - with the Linux random
> > generator as the attacker's tool we end up in a case where the jenkins
> > hash has some chains 5 times longer than others.
> >
> > Short history:
> > I showed that jenkins is unfair, you confirmed that with your data.
> > Another question is about complexity of searching for collisions - you
> > showed that with xor it is quite easy and with jenkins it is complex,
> > then I showed that it is not that complex to find collisions in jenkins
> > too.
> 
> Again here is your 'test'
> 
> enter_hash(u32 val)
> {
> counter[val & mask]++;
> }
> 
> for (i = 0 ; i < 1000 ; i++)
>   enter_hash(CONSTANT1)
> for (i = 0 ; i < 1000 ; i++)
>   enter_hash(CONSTANT2)
> 
> Oh my god, I have two chains with 1000 elems in it.

Drop me the phone number of your drug dealer - I want that crack too :)

Code is pretty simple:
hash = jhash_3words(const, const, random_of_2^16, const_or_random_per_run);
table[hash & (hash_size - 1)]++;
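As a sketch, the two-line test above can be fleshed out into a standalone harness. The addresses and initval below are arbitrary placeholders, and the jhash here is reproduced from memory of the 2.6.x-era include/linux/jhash.h, so treat it as an approximation rather than the exact kernel code:

```c
#include <stdint.h>

#define JHASH_GOLDEN_RATIO 0x9e3779b9U

/* 2.6.x-era __jhash_mix, reproduced from memory as a sketch. */
#define __jhash_mix(a, b, c) \
do { \
	a -= b; a -= c; a ^= (c >> 13); \
	b -= c; b -= a; b ^= (a << 8);  \
	c -= a; c -= b; c ^= (b >> 13); \
	a -= b; a -= c; a ^= (c >> 12); \
	b -= c; b -= a; b ^= (a << 16); \
	c -= a; c -= b; c ^= (b >> 5);  \
	a -= b; a -= c; a ^= (c >> 3);  \
	b -= c; b -= a; b ^= (a << 10); \
	c -= a; c -= b; c ^= (b >> 15); \
} while (0)

static uint32_t jhash_3words(uint32_t a, uint32_t b, uint32_t c,
			     uint32_t initval)
{
	a += JHASH_GOLDEN_RATIO;
	b += JHASH_GOLDEN_RATIO;
	c += initval;
	__jhash_mix(a, b, c);
	return c;
}

#define HASH_BITS 16
#define HASH_SIZE (1U << HASH_BITS)

static unsigned int table[HASH_SIZE];

/* Sweep the attacker-controlled third word over 2^16 values (the two
 * fixed words stand in for addresses) and report the longest chain. */
static unsigned int fill_and_max_chain(uint32_t initval)
{
	unsigned int i, max = 0, *slot;

	for (i = 0; i < (1U << 16); i++) {
		uint32_t h = jhash_3words(0x0a000001, 0x0a000002, i, initval);

		slot = &table[h & (HASH_SIZE - 1)];
		if (++*slot > max)
			max = *slot;
	}
	return max;
}

static unsigned long table_total(void)
{
	unsigned long i, sum = 0;

	for (i = 0; i < HASH_SIZE; i++)
		sum += table[i];
	return sum;
}
```

Comparing max chain lengths across different initvals is then just a matter of running fill_and_max_chain() with fresh tables.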

> Real data are :
> 1) XOR hash, with a load factor of 0.41
> 
> Distribution of sockets/chain length
> [chain length]:number of sockets
> [0]:752255 0%
> [1]:208850 47.455%
> [2]:54281 72.1225%
> [3]:19236 85.235%
> [4]:8199 92.6869%
> [5]:3487 96.6485%
> [6]:1489 98.6785%
> [7]:515 99.4976%
> [8]:192 99.8466%
> [9]:53 99.955%
> [10]:14 99.9868%
> [11]:3 99.9943%
> [12]:1 99.997%
> [13]:1 100%
> total : 440101 sockets, Avg lookup cost=3.07896 cache lines
> 
> 2) Jenkins hash, same load factor
> [0]:688813 0%
> [1]:289874 65.8653%
> [2]:60452 93.3372%
> [3]:8493 99.1266%
> [4]:879 99.9255%
> [5]:62 99.9959%
> [6]:3 100%
> total : 440101 sockets, Avg lookup cost=2.4175 cache lines

Please read my e-mail about fairness and different usage models I found
to be vulnerable.

Your data only shows that you feel safe with the jenkins hash, while it
has problems too (as does xor).

With the usage case described above jenkins will have entries with 50
elements in the chain - you just have not received such packets yet, or
you use jhash_2words().

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-21  9:51                                                                 ` Evgeniy Polyakov
@ 2007-02-21 10:03                                                                   ` David Miller
  0 siblings, 0 replies; 102+ messages in thread
From: David Miller @ 2007-02-21 10:03 UTC (permalink / raw)
  To: johnpol; +Cc: medwards.linux, dada1, akepner, linux, netdev, bcrl

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Wed, 21 Feb 2007 12:51:09 +0300

> Linux route cache does not change $c (third parameter), and it _seems_
> that distribution for the random $a and $b is fair, while when $c is
> formed over attacker's data, random per-boot $initval does not help.

It is not only initialized per-boot, it is also regenerated
every 10 minutes by default, via rt_secret_rebuild().

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 11:09                               ` Eric Dumazet
  2007-02-20 11:29                                 ` Evgeniy Polyakov
@ 2007-02-21 12:41                                 ` Andi Kleen
  2007-02-21 13:19                                   ` Eric Dumazet
  1 sibling, 1 reply; 102+ messages in thread
From: Andi Kleen @ 2007-02-21 12:41 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Evgeniy Polyakov, akepner, linux, davem, netdev, bcrl

Eric Dumazet <dada1@cosmosbay.com> writes:
> 
> For example, sock_wfree() uses 1.6612 % of cpu because of false sharing of 
> sk_flags (dirtied each time SOCK_QUEUE_SHRUNK is set :(

Might be easily fixable by moving the fields around a bit?

> If we want to optimize tcp, we should reorder fields to reduce number of cache 
> lines, not change algos. struct sock fields are currently placed to reduce 
> holes, while they should be grouped by related fields sharing cache lines.

Regrouping is definitely a good thing; but I'm not sure why you are so
dead set against exploring other data structures. The promise of RCUing
and avoiding the big hash tables seems alluring to me, even if it
only breaks even in the end in terms of cycles.

-Andi

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-21 12:41                                 ` Andi Kleen
@ 2007-02-21 13:19                                   ` Eric Dumazet
  2007-02-21 13:37                                     ` David Miller
  0 siblings, 1 reply; 102+ messages in thread
From: Eric Dumazet @ 2007-02-21 13:19 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Evgeniy Polyakov, akepner, linux, davem, netdev, bcrl

On Wednesday 21 February 2007 13:41, Andi Kleen wrote:
> Eric Dumazet <dada1@cosmosbay.com> writes:
> > For example, sock_wfree() uses 1.6612 % of cpu because of false sharing
> > of sk_flags (dirtied each time SOCK_QUEUE_SHRUNK is set :(
>
> Might be easily fixable by moving the fields around a bit?

For this one, it seems sk_flags is mostly read, but SOCK_QUEUE_SHRUNK is
mostly written. It would make sense to move that bit elsewhere, so that
sk_flags stays cleanly shared between different cpus.

Maybe use a low-order bit in a related pointer? (like the rb_color()
trick). Maybe it is time for a new include/linux/bap.h (bits and
pointer) header :)
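Such a "bits and pointer" header is hypothetical, but the core of the rb_color()-style trick is small: any pointer to an object with at least 2-byte alignment has a free bit 0. A minimal sketch (names invented here):

```c
#include <stdint.h>

/* Hypothetical bap.h helpers: stash one flag in bit 0 of a pointer
 * that is known to be at least 2-byte aligned, as rb_color() does
 * with the rb_parent_color word. */

static inline void *bap_make(void *p, unsigned long flag)
{
	return (void *)((uintptr_t)p | (flag & 1UL));
}

static inline void *bap_ptr(void *p)
{
	return (void *)((uintptr_t)p & ~1UL);	/* strip the tag bit */
}

static inline unsigned long bap_flag(void *p)
{
	return (uintptr_t)p & 1UL;		/* read the tag bit */
}

static int bap_selftest(void)
{
	static int obj;				/* int is >= 4-byte aligned */
	void *tagged = bap_make(&obj, 1);

	return bap_ptr(tagged) == (void *)&obj && bap_flag(tagged) == 1;
}
```

The catch, of course, is that every dereference of the tagged field has to go through bap_ptr().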

>
> > If we want to optimize tcp, we should reorder fields to reduce number of
> > cache lines, not change algos. struct sock fields are currently placed to
> > reduce holes, while they should be grouped by related fields sharing
> > cache lines.
>
> Regrouping is definitely a good thing; but I'm not sure why you are so
> dead set against exploring other data structures. The promise of RCUing
> and avoiding the big hash tables seems alluring to me, even if it
> only breaks even in the end in terms of cycles.

RCU is definitely wanted, and IP routing demonstrated the wins.

rbtree was successfully plugged into epoll instead of initial hash table 
implementation. 

Now, when the rate of lookups/inserts/deletes is high, with totally random
endpoints and the cache *always* cold, 'tree structures' are not welcome
(not cache friendly).


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-21 13:19                                   ` Eric Dumazet
@ 2007-02-21 13:37                                     ` David Miller
  2007-02-21 23:13                                       ` Robert Olsson
  0 siblings, 1 reply; 102+ messages in thread
From: David Miller @ 2007-02-21 13:37 UTC (permalink / raw)
  To: dada1; +Cc: andi, johnpol, akepner, linux, netdev, bcrl

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Wed, 21 Feb 2007 14:19:30 +0100

> Now, when the rate of lookups/inserts/deletes is high, with totally random
> endpoints and the cache *always* cold, 'tree structures' are not welcome
> (not cache friendly)

But what about if tree lookup were free :-)

This is why I consider Robert Olsson's trash work the most promising,
if we stick sockets into his full flow identified routing cache trie
entries, we can eliminate lookup altogether.

Just like how he already uses traffic inspection to kill cache entries
when FIN shutdown sequence is snooped, we can explicitly flush such
entries when socket is closed fully on local system.

Really, possibilities are truly endless with that thing.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-21  9:57                                                                   ` Evgeniy Polyakov
@ 2007-02-21 21:15                                                                     ` Michael K. Edwards
  2007-02-22  9:06                                                                       ` David Miller
  0 siblings, 1 reply; 102+ messages in thread
From: Michael K. Edwards @ 2007-02-21 21:15 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Eric Dumazet, David Miller, akepner, linux, netdev, bcrl

Look, Evgeniy.  Eric and I may be morons but davem is not.  He's
telling you, again and again, that DoS attacks do happen, that to
survive them you need for the distribution of tuples within hash
buckets to vary unpredictably from system to system and boot to boot,
and that XOR hash does not accomplish this.  He has told you that the
Jenkins hash works, for reasons discussed exhaustively in the link I
gave you, which you can follow to statistical analysis that is beyond
your expertise or mine.  He has told you that your performance
analysis is fundamentally flawed.  Do you think you might need to drop
back five and punt?

As for the distribution issues:  who cares?  You fundamentally can't
do any better, for random input drawn from a tuple space much larger
than the number of hash buckets, than a Poisson distribution.  And
that's enough to kill a naive hash table design, because chaining is
going to cost you another cache miss for every link, and you couldn't
afford the first cache miss in the first place.  If you absolutely
insist on a hash table (on the theory that you can't keep the hot
connections warm in cache anyway, which isn't necessarily true if you
use a splay tree), it had better be a 2-left hash with a compact
overflow pool for the rare case of second collision.

I recommend Michael Mitzenmacher's paper to you, or rather to
whoever's reading along with the intention of improving the kernel
without reinventing the square wheel.

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-21 13:37                                     ` David Miller
@ 2007-02-21 23:13                                       ` Robert Olsson
  2007-02-22  6:06                                         ` Eric Dumazet
  2007-02-22 11:41                                         ` Andi Kleen
  0 siblings, 2 replies; 102+ messages in thread
From: Robert Olsson @ 2007-02-21 23:13 UTC (permalink / raw)
  To: David Miller; +Cc: dada1, andi, johnpol, akepner, linux, netdev, bcrl


David Miller writes:

 > But what about if tree lookup were free :-)
 > 
 > This is why I consider Robert Olsson's trash work the most promising,
 > if we stick sockets into his full flow identified routing cache trie
 > entries, we can eliminate lookup altogether.
 > 
 > Just like how he already uses traffic inspection to kill cache entries
 > when FIN shutdown sequence is snooped, we can explicitly flush such
 > entries when socket is closed fully on local system.

 Below are current tree stats from a "production" flow-logging application at a 
 large university (real traffic) using this full flow lookup.

 With 100k flows we typically see an average depth of 1.3 and a max depth of 4.
 Right now there is only a dst entry in the leaf nodes. So yes, anything else
 we can put in a leaf is "free".

trie:
        Aver depth:     1.31
        Max depth:      4
        Leaves:         99930
        Internal nodes: 14925
          1: 13450  2: 1465  3: 9  18: 1
        Pointers: 294976
Null ptrs: 180122
Total size: 7259  kB

Cheers.
						--ro

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-21 23:13                                       ` Robert Olsson
@ 2007-02-22  6:06                                         ` Eric Dumazet
  2007-02-22 11:41                                         ` Andi Kleen
  1 sibling, 0 replies; 102+ messages in thread
From: Eric Dumazet @ 2007-02-22  6:06 UTC (permalink / raw)
  To: Robert Olsson; +Cc: David Miller, andi, johnpol, akepner, linux, netdev, bcrl

Robert Olsson wrote:
> David Miller writes:
> 
>  > But what about if tree lookup were free :-)
>  > 
>  > This is why I consider Robert Olsson's trash work the most promising,
>  > if we stick sockets into his full flow identified routing cache trie
>  > entries, we can eliminate lookup altogether.
>  > 
>  > Just like how he already uses traffic inspection to kill cache entries
>  > when FIN shutdown sequence is snooped, we can explicitly flush such
>  > entries when socket is closed fully on local system.
> 
>  Below is current tree-stats from a "production" flowlogging application at a 
>  large university (real traffic) using this full flow lookup.
> 
>  With 100k flows we typically see an aver depth 1.3 and a max depth 4. Right 
>  now there is only a dst entry in leaf nodes. So yes anything else we can put 
>  in leaf is "free".
> 
> trie:
>         Aver depth:     1.31
>         Max depth:      4
>         Leaves:         99930
>         Internal nodes: 14925
>           1: 13450  2: 1465  3: 9  18: 1
>         Pointers: 294976
> Null ptrs: 180122
> Total size: 7259  kB
>
Hi Robert

This sounds *very* promising.
I read the pdf presentation again and the theory seems OK.

However, you speak of average depth and max depth without mentioning the
corresponding cache lines. Are all your structures under L1_CACHE_BYTES?

Do you have a pointer to the patch so that I can take a look at it and
possibly start to think about it? (ie adding free bits :) )

Thank you
Eric

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-21 21:15                                                                     ` Michael K. Edwards
@ 2007-02-22  9:06                                                                       ` David Miller
  2007-02-22 11:00                                                                         ` Michael K. Edwards
  0 siblings, 1 reply; 102+ messages in thread
From: David Miller @ 2007-02-22  9:06 UTC (permalink / raw)
  To: medwards.linux; +Cc: johnpol, dada1, akepner, linux, netdev, bcrl

From: "Michael K. Edwards" <medwards.linux@gmail.com>
Date: Wed, 21 Feb 2007 13:15:51 -0800

> it had better be a 2-left hash with a compact
> overflow pool for the rare case of second collision.

Just like Evgeniy, you should do your research too :-))))

The gain for multiple-hash schemes, such as 2-left, exists only if you
can parallelize the N lookups in hardware.  This is an assumption
built into all the papers you will read on this subject, including the
one you reference by Mitzenmacher.

If you read any paper on IP route lookup algorithms, prefix
your reading of that paper with something that reads like:

	And by the way we are targeting implementations in
	hardware that can parallelize and pipeline memory
	accesses to fast SRAM.  Do not consider this algorithm
	seriously for running on general purpose processors
	in software.

Probably a few reads of Network Algorithmics would help your
understanding in this area as well.

Tries and similar structures do help in the case of our routing
because we're going from a potential 32-hashtable probe down
to a relatively small number of probes into a trie.  That's what
Robert's LC-Trie and upcoming TRASH work do.

But it won't help the same when going from hashes to a trie,
as would be the case for TCP.

If we can get the sockets into the TRASH entries, as Eric will
experiment with, we'll be able to eliminate the TCP hash lookup
in the fast path entirely.  It is the only reasonable suggestion
I've seen made in this area to date.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-22  9:06                                                                       ` David Miller
@ 2007-02-22 11:00                                                                         ` Michael K. Edwards
  2007-02-22 11:07                                                                           ` David Miller
  0 siblings, 1 reply; 102+ messages in thread
From: Michael K. Edwards @ 2007-02-22 11:00 UTC (permalink / raw)
  To: David Miller; +Cc: johnpol, dada1, akepner, linux, netdev, bcrl

On 2/22/07, David Miller <davem@davemloft.net> wrote:
> From: "Michael K. Edwards" <medwards.linux@gmail.com>
> Date: Wed, 21 Feb 2007 13:15:51 -0800
>
> > it had better be a 2-left hash with a compact
> > overflow pool for the rare case of second collision.
>
> Just like Evgeniy, you should do your research too :-))))

Fair enough.  :-)

> The gain for multiple-hash schemes, such as 2-left, exists only if you
> can parallelize the N lookups in hardware.  This is an assumption
> built into all the papers you will read on this subject, including the
> one you reference by Mitzenmacher.

Er, no.  The principal gain for the 2-left hash is that the only time
you ever look at the right hash is when you have a collision in the
left.  The right hash is therefore substantially less occupied and the
odds of second collision are lower.  If you happen to be able to
prefetch the left and right buckets in parallel, that's gravy.

Roughly speaking, for a single hash, P(0) = (1 - 1/k)^n \approx
e^{-n/k}, where P(0) is the probability that a given bucket has 0
entries, k is the number of buckets, and n is the number of elements.
One entry in the bucket replaces a factor (1-1/k) with a factor n/k,
the second entry replaces another (1-1/k) with (n-1)/2k, and so forth.
It's the good old birthday problem.

Take the simple example of an equal number of items and of one-entry
buckets, and compare the simple and 2-left hashes.  When n/k \approx
1, you have about 37% empty buckets, but when n/k \approx 2 you only
have about 13% empty buckets.  On the other hand, when n/k \approx 1
and the buckets only hold one entry, about 37% of the items overflow;
when n/k \approx 2, about 57% do.  Result: a 2-left hash with exactly
as many 1-item buckets as there are items will only see about 20% of
the items overflow (collide in the left hash and then collide again in
the right hash), versus the 37% that would have overflowed in a single
hash of the same size.

In practical applications you try to make your hash buckets cache-line
sized and fit 2 or 4 items into them so that overflows are farther out
on the tail of the distribution.  The result can be a hash that is
only oversized by a factor of two or so relative to the space occupied
by the filled slots, yet has an amortized cost of less than 1.5 cache
misses and has a fraction of a percent of items that overflow the
right hash.  That fraction of a percent can go into a single overflow
pool (a way oversized hash or an rbtree or something else you've
already coded; it's a rare case) rather than having to chain from each
hash bucket.
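A cache-line-sized multi-entry bucket of the kind described might be laid out like this. This is an illustrative sketch only (field names and sizes invented; assumes 32-bit hash signatures, with 0 reserved to mean "empty"):

```c
#include <stdint.h>

#define SLOTS_PER_BUCKET 4

/* Four (signature, value-index) pairs packed into 32 bytes, so a
 * bucket costs at most one cache line on any common line size. */
struct bucket {
	uint32_t sig[SLOTS_PER_BUCKET];	/* full hash values, 0 = empty */
	uint32_t val[SLOTS_PER_BUCKET];	/* index into an item array */
};

/* One cache miss scans all four slots; returns val or 0 if absent. */
static uint32_t bucket_lookup(const struct bucket *b, uint32_t sig)
{
	int i;

	for (i = 0; i < SLOTS_PER_BUCKET; i++)
		if (b->sig[i] == sig)
			return b->val[i];
	return 0;
}

static int bucket_selftest(void)
{
	struct bucket b = { { 0, 7, 0, 9 }, { 0, 70, 0, 90 } };

	return sizeof(b) == 32 &&
	       bucket_lookup(&b, 7) == 70 &&
	       bucket_lookup(&b, 5) == 0;
}
```

Only items that miss in all slots of both the left and right buckets fall through to the shared overflow pool.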

> If you read any paper on IP route lookup algorithms, prefix
> your reading of that paper with something that reads like:
>
>         And by the way we are targeting implementations in
>         hardware that can parallelize and pipeline memory
>         accesses to fast SRAM.  Do not consider this algorithm
>         seriously for running on general purpose processors
>         in software.

I generally agree with this statement.  However, the 2-left hash is a
very useful data structure for lots of things other than IP route
lookups, especially in memory-constrained systems where you don't want
to oversize the hash table by _too_ much but you still want actual
overflow chaining to be quite rare.  It's much simpler than
next-empty-bucket types of schemes, simple enough that I've coded it
for little DSP cores, and almost as good at minimizing overflows.

> Probably a few read's of Network Algorithmics would help your
> understanding in this area as well.

Hey, I don't think a hash is the right solution in the first place.
If you're on the open Internet you have to assume a DDoS half-open
connection load a thousandfold higher than the number of real traffic
flows.  If you don't have the memory bandwidth and cache architecture
to bend over and take it (and on a commodity processor, you don't, not
unless you're a hell of a lot cleverer about hinting to the cache
controller than I am), you want a data structure that rotates the real
connections up into a frequently traversed set.

Besides, RCUing a hash of any sort sounds like a nightmare, whereas
something like a splay tree is reputed to be very straightforward to
make "persistent" in the functional-programming sense.  I haven't
coded that one myself but surely it's in the current edition of Knuth.
Network Algorithmics is fine as far as it goes, but if you want to
distribute the networking stack across a NUMA box and still not have
it go pear-shaped under sophisticated DDoS, I think you probably have
to think it out for yourself.

In any case, I am not setting myself up as some kind of networking or
data structure guru, just trying to articulate what's wrong with
Evgeniy's analysis and why RCUing a naive hash is a complete waste of
time.

> Tries and similar structures do help in the case of our routing
> because we're going from a potential 32-hashtable probe down
> to a relatively small number of probes into a trie.  That's what
> Robert's LC-Trie and upcoming TRASH work do.
>
> But it won't help the same when going from hashes to a trie,
> as would be the case for TCP.

Well, I arrived in this discussion in the middle, and haven't really
looked at what people are proposing to replace.  Robert's analysis
looks sane, and I certainly haven't invested the research to pick
holes in it.  :-)  Would be nice to know how it fares under massive
half-open connection attacks, though.

> If we can get the sockets into the TRASH entries, as Eric will
> experiment with, we'll be able to eliminate the TCP hash lookup
> in the fast path entirely.  It is the only reasonable suggestion
> I've seen made in this area to date.

Agreed.  You are much better off obtaining all per-flow state with a
single data structure lookup.  It would be kind of nice if state for
other layers like IPSec and NAT could live there too.  By the way, are
these flows unidirectional or bidirectional?  If memory serves, TCP is
going to want a bidirectional session construct, right?  So do you
canonicalize the tuple in some way (lower IP address first?) before
hashing it for lookup?
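One possible answer to the canonicalization question (a sketch of my own, not anything from the kernel: struct and function names are invented) is to order the two endpoints numerically before hashing, so that both directions of a flow produce the same key:

```c
#include <stdint.h>

struct flow_key {
	uint32_t lo_addr, hi_addr;
	uint16_t lo_port, hi_port;
};

/* Canonical form for a bidirectional flow: the numerically lower
 * endpoint goes first; ports break ties between equal addresses. */
static struct flow_key canon_key(uint32_t saddr, uint16_t sport,
				 uint32_t daddr, uint16_t dport)
{
	struct flow_key k;

	if (saddr < daddr || (saddr == daddr && sport <= dport)) {
		k.lo_addr = saddr; k.lo_port = sport;
		k.hi_addr = daddr; k.hi_port = dport;
	} else {
		k.lo_addr = daddr; k.lo_port = dport;
		k.hi_addr = saddr; k.hi_port = sport;
	}
	return k;
}

static int canon_selftest(void)
{
	struct flow_key a = canon_key(0x0a000001, 80, 0xc0a80101, 40000);
	struct flow_key b = canon_key(0xc0a80101, 40000, 0x0a000001, 80);

	return a.lo_addr == b.lo_addr && a.hi_addr == b.hi_addr &&
	       a.lo_port == b.lo_port && a.hi_port == b.hi_port;
}
```

With unidirectional entries, as David describes below for TRASH, no canonicalization is needed at all.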

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-22 11:00                                                                         ` Michael K. Edwards
@ 2007-02-22 11:07                                                                           ` David Miller
  2007-02-22 19:24                                                                             ` Stephen Hemminger
  0 siblings, 1 reply; 102+ messages in thread
From: David Miller @ 2007-02-22 11:07 UTC (permalink / raw)
  To: medwards.linux; +Cc: johnpol, dada1, akepner, linux, netdev, bcrl

From: "Michael K. Edwards" <medwards.linux@gmail.com>
Date: Thu, 22 Feb 2007 03:00:32 -0800

> Besides, RCUing a hash of any sort sounds like a nightmare, whereas
> something like a splay tree is reputed to be very straightforward to
> make "persistent" in the functional-programming sense.  I haven't
> coded that one myself but surely it's in the current edition of Knuth.

Knuth tends to be thorough, yet archaic. :-)

> You are much better off obtaining all per-flow state with a
> single data structure lookup.  It would be kind of nice if state for
> other layers like IPSec and NAT could live there too.

This is the eventual idea.

> By the way, are these flows unidirectional or bidirectional?

The entries created by Robert's TRASH are unidirectional, they have to
be because the path from remote to localhost is different than the
path from localhost to remote :-)

But we don't need to do a socket demux on packet output, we just
need a plain route (+ firewall, IPSEC, etc.) destination cache
entry for that.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-21 23:13                                       ` Robert Olsson
  2007-02-22  6:06                                         ` Eric Dumazet
@ 2007-02-22 11:41                                         ` Andi Kleen
  2007-02-22 11:44                                           ` David Miller
  1 sibling, 1 reply; 102+ messages in thread
From: Andi Kleen @ 2007-02-22 11:41 UTC (permalink / raw)
  To: Robert Olsson
  Cc: David Miller, dada1, andi, johnpol, akepner, linux, netdev, bcrl

>  With 100k flows we typically see an aver depth 1.3 and a max depth 4. Right 

This includes full srcport:dstport?

>  now there is only a dst entry in leaf nodes. So yes anything else we can put 
>  in leaf is "free".

How much would fit there?


-Andi

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-22 11:41                                         ` Andi Kleen
@ 2007-02-22 11:44                                           ` David Miller
  0 siblings, 0 replies; 102+ messages in thread
From: David Miller @ 2007-02-22 11:44 UTC (permalink / raw)
  To: andi; +Cc: Robert.Olsson, dada1, johnpol, akepner, linux, netdev, bcrl

From: Andi Kleen <andi@firstfloor.org>
Date: Thu, 22 Feb 2007 12:41:05 +0100

> >  With 100k flows we typically see an aver depth 1.3 and a max depth 4. Right 
> 
> This includes full srcport:dstport?

Yes.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-22 11:07                                                                           ` David Miller
@ 2007-02-22 19:24                                                                             ` Stephen Hemminger
  0 siblings, 0 replies; 102+ messages in thread
From: Stephen Hemminger @ 2007-02-22 19:24 UTC (permalink / raw)
  To: David Miller; +Cc: medwards.linux, johnpol, dada1, akepner, linux, netdev, bcrl

On Thu, 22 Feb 2007 03:07:55 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> From: "Michael K. Edwards" <medwards.linux@gmail.com>
> Date: Thu, 22 Feb 2007 03:00:32 -0800
> 
> > Besides, RCUing a hash of any sort sounds like a nightmare, whereas
> > something like a splay tree is reputed to be very straightforward to
> > make "persistent" in the functional-programming sense.  I haven't
> > coded that one myself but surely it's in the current edition of Knuth.
> 
> Knuth tends to be thorough, yet archaic. :-)
> 
> > You are much better off obtaining all per-flow state with a
> > single data structure lookup.  It would be kind of nice if state for
> > other layers like IPSec and NAT could live there too.
> 
> This is the eventual idea.
> 
> > By the way, are these flows unidirectional or bidirectional?
> 
> The entries created by Robert's TRASH are unidirectional, they have to
> be because the path from remote to localhost is different than the
> path from localhost to remote :-)
> 
> But we don't need to do a socket demux on packet output, we just
> need a plain route (+ firewall, IPSEC, etc.) destination cache
> entry for that.
> -


There are a number of other lock-less algorithms besides simple RCU
that might be worth exploring. Many of them rely on a working implementation
of compare-and-exchange; that can be a problem, since some architectures
don't have a fast way of doing it.
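The classic compare-exchange building block is the lock-free stack (Treiber-style). A minimal sketch using C11 atomics in place of the kernel's cmpxchg; note that pop as written is only safe when something like RCU delays node reuse, otherwise it has the well-known ABA problem:

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
	struct node *next;
	int val;
};

static _Atomic(struct node *) top;

/* Publish a node: retry until the head we read is still the head. */
static void stack_push(struct node *n)
{
	struct node *old = atomic_load(&top);

	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&top, &old, n));
}

/* Unlink the head node; safe against reuse only under RCU-like rules. */
static struct node *stack_pop(void)
{
	struct node *old = atomic_load(&top);

	while (old && !atomic_compare_exchange_weak(&top, &old, old->next))
		;
	return old;
}

static int stack_selftest(void)
{
	static struct node a = { NULL, 1 }, b = { NULL, 2 };

	stack_push(&a);
	stack_push(&b);
	return stack_pop() == &b && stack_pop() == &a && stack_pop() == NULL;
}
```

On architectures without a cheap compare-exchange, each of these loops degrades into something far more expensive than a spinlock, which is Stephen's caveat.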

-- 
Stephen Hemminger <shemminger@linux-foundation.org>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-20 15:49                                   ` Michael K. Edwards
  2007-02-20 15:59                                     ` Evgeniy Polyakov
  2007-02-20 16:04                                     ` Eric Dumazet
@ 2007-02-22 23:49                                     ` linux
  2007-02-23  2:31                                       ` Michael K. Edwards
  2 siblings, 1 reply; 102+ messages in thread
From: linux @ 2007-02-22 23:49 UTC (permalink / raw)
  To: johnpol, medwards.linux; +Cc: akepner, bcrl, dada1, davem, linux, netdev

> I think you misunderstood me.  If you are trying to DoS me from
> outside with a hash collision attack, you are trying to feed me
> packets that fall into the same hash bucket.  The Jenkins hash does
> not have to be artifact-free, and does not have to be
> cryptographically strong.  It just has to do a passable job of mixing
> a random salt into the tuple, so you don't know which string of
> packets to feed me in order to fill one (or a few) of my buckets.
> XORing salt into a folded tuple doesn't help; it just permutes the
> buckets.

If you want to understand this more formally, read up on "universal
families of hash functions," which is the name cryptologists give to
this concept.

When used according to directions, they are actually *more* secure than
standard cryptographic hashes such as MD5 and SHA.  The key difference
is that *the attacker doesn't get to see the hash output*.

The basic pattern is:
- Here's a family of hash functions, e.g. a salted hash function.
- I pick one at random.  (E.g. choose a salt.)
- Now your challenge is to generate a pair of inputs which will
  collide.
- Note that you never get to see sample input/output pairs of the
  hash function.  All you know is that it's a member of the family.
- It is surprisingly easy to find families of size N such that
  an attacker has on the order of a 1/N chance to construct a collision.
- This remains true even if you assume that the attacker has
  infinite computational power.

This pattern corresponds exactly to an attacker trying to force collisions
in a hash table they can't see.
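The textbook example of such a family is the Carter-Wegman construction h_{a,b}(x) = ((a*x + b) mod p) mod m, with p prime and a, b drawn secretly (per boot, say). This is just the classic construction the pattern above describes, not the kernel's actual scheme:

```c
#include <stdint.h>

#define CW_PRIME 1000003UL	/* any prime larger than the key space */

/* One member of the Carter-Wegman family, selected by (a, b).
 * An attacker who never sees outputs cannot do better than ~1/m
 * at forcing two chosen keys into the same bucket. */
static uint32_t cw_hash(uint32_t x, uint64_t a, uint64_t b, uint32_t m)
{
	return (uint32_t)(((a * x + b) % CW_PRIME) % m);
}
```

Picking a fresh (a, b) corresponds to "I pick one at random" in the pattern above; the 64-bit arithmetic keeps a*x from overflowing for a < p.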


As far as I know, nobody has proved salted jhash to be a truly universal
family, but so many amazingly simple algorithms have been proved universal
that it wouldn't surprise me if it were.

For example, the family of all CRCs computed modulo n-bit primitive
polynomials is a universal family.  If you do know the polynomial, it's
ridiculously easy to build a collision.  If you don't, it's provably
impossible.

(Footnote: the chance isn't exactly 1/N, but also depends on the size
of the input relative to the size of the hash.  With bigger inputs, it's
easier to make them match according to more of the hashes.  Ultimately,
if you have N k-bit CRC polynomials, you can make them all collide with
an N*k-bit input.  But since N is proportional to 2^k, it's easy to make
k big enough that this is impractical.)

The rehash-every-10-minutes detail is theoretically unnecessary,
but does cover the case where a would-be attacker *does* get a chance
to look at a machine, such as by using routing delays to measure the
effectiveness of a collision attempt.


Now, as for flaming about how xor generates more uniform distributions
than jhash - that's to be expected from a weak hash.  By relying on
non-uniform properties of the input (particularly that hosts tend to walk
linearly through the source port space), you can make hash values walk
linearly through your hash table, and get a completely even distribution
rather than a *random* one.

This is great for efficiency, but depends on letting patterns in the hash
input through to the output, which is exactly the property that makes it
vulnerable to a deliberate DoS attempt.
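The pass-through effect is easy to see concretely. With a folded-xor hash (a sketch in the spirit of the old ehash fold, not the exact kernel code; addresses arbitrary), two consecutive source ports land in adjacent buckets, which is exactly the linear walk described above:

```c
#include <stdint.h>

/* Folded-xor hash: xor the tuple words, then fold the halves down.
 * Input structure survives into the bucket index. */
static uint32_t xor_hash(uint32_t saddr, uint32_t daddr,
			 uint16_t sport, uint16_t dport, uint32_t mask)
{
	uint32_t h = saddr ^ daddr ^ ((uint32_t)sport << 16 | dport);

	h ^= h >> 16;
	h ^= h >> 8;
	return h & mask;
}

static int xor_linearity_demo(void)
{
	/* An even port and its successor differ only in bit 0 of sport;
	 * after folding, the two buckets differ only in bit 0 too. */
	uint32_t b0 = xor_hash(0x0a000001, 0xc0a80101, 1024, 80, 0xff);
	uint32_t b1 = xor_hash(0x0a000001, 0xc0a80101, 1025, 80, 0xff);

	return b1 == (b0 ^ 1);
}
```

That adjacency is what makes the measured distribution look "better than random" - and what hands an attacker a map of the table.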

If you want to test your distribution for randomness, go have a look
at Knuth volume 2 (seminumerical algorithms) and see the discussion of
the Kolmogorov-Smirnov test.  Some lumpiness is *expected* in a truly
random distribution.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-22 23:49                                     ` linux
@ 2007-02-23  2:31                                       ` Michael K. Edwards
  0 siblings, 0 replies; 102+ messages in thread
From: Michael K. Edwards @ 2007-02-23  2:31 UTC (permalink / raw)
  To: linux; +Cc: johnpol, akepner, bcrl, dada1, davem, netdev

On 22 Feb 2007 18:49:00 -0500, linux@horizon.com <linux@horizon.com> wrote:
> The rehash-every-10-minutes detail is theoretically unnecessary,
> but does cover the case where a would-be attacker *does* get a chance
> to look at a machine, such as by using routing delays to measure the
> effectiveness of a collision attempt.

AOL -- and that's one of the reasons why RCUing hashes is a nightmare
in practice, if you care about smooth peristalsis.  Again, the
resizing headache is just the canary in the coal mine.

> If you want to test your distribution for randomness, go have a look
> at Knuth volume 2 (seminumerical algorithms) and see the discussion of
> the Kolmogorov-Smirnov test.  Some lumpiness is *expected* in a truly
> random distribution.

Ah, Kolmogorov-Smirnov, the astronomer's friend.  Somewhere on a DAT
in my garage lies a rather nice implementation of K-S tests on
censored data for the Mirella/Mirage Forth image analysis package.  If
you want something industrial-strength, easy to use, and pretty
besides, I recommend SM (http://www.astro.princeton.edu/~rhl/sm/),
which will cost you $500 U.S. for a department-scale license.  SM so
utterly exceeds the capabilities of gnuplot it isn't even funny.  But
then, while you don't always get what you pay for, you rarely fail to
pay for what you get, sooner or later, in one currency or another.

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-02-17 13:13   ` Evgeniy Polyakov
  2007-02-18 18:46     ` Eric Dumazet
@ 2007-03-02  8:52     ` Evgeniy Polyakov
  2007-03-02  9:56       ` Eric Dumazet
  2007-03-13  9:32       ` Evgeniy Polyakov
  1 sibling, 2 replies; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-03-02  8:52 UTC (permalink / raw)
  To: akepner; +Cc: linux, davem, netdev

On Sat, Feb 17, 2007 at 04:13:02PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > >I noticed in an LCA talk mention that apparently extensible hashing
> > >with RCU access is an unsolved problem.  Here's an idea for solving it.
> > >....
> > 
> > Yes, I have been playing around with the same idea for
> > doing dynamic resizing of the TCP hashtable.
> > 
> > Did a prototype "toy" implementation, and I have a
> > "half-finished" patch which resizes the TCP hashtable
> > at runtime. Hmmm, your mail may be the impetus to get
> > me to finally finish this thing....
> 
> Why does no one want to use a trie? For socket-like loads it has
> exactly constant search/insert/delete time and scales like hell.

Ok, I've run an analysis of linked-list and trie traversals and found
that (at least on x86) one optimized list traversal is about 4 (!)
times faster than one bit lookup in a trie traversal (or, really, one
lookup in any binary tree-like structure). That is because a trie
traversal needs more instructions per lookup, plus at least one
additional branch which cannot be predicted.

Tests with rdtsc show that one bit lookup in a trie (in fact, any
lookup in a binary tree structure) is about 3-4 times slower than one
lookup in a linked list.

Since a hash table usually has up to 4 elements in each bucket, a
competing binary tree/trie structure must find an entry in one lookup,
which is essentially impossible with the usual tree/trie implementations.

Things change dramatically when the linked list becomes too long, but
that should not happen with proper resizing of the hash table. Wildcard
support also introduces additional requirements which cannot be easily
satisfied by hash tables.

So I take back my words about a tree/trie implementation instead of a
hash table for socket lookup.

Interested readers can find more details on the tests, asm output and
conclusions at:
http://tservice.net.ru/~s0mbre/blog/2007/03/01#2007_03_01

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-03-02  8:52     ` Evgeniy Polyakov
@ 2007-03-02  9:56       ` Eric Dumazet
  2007-03-02 10:28         ` Evgeniy Polyakov
  2007-03-02 20:45         ` Michael K. Edwards
  2007-03-13  9:32       ` Evgeniy Polyakov
  1 sibling, 2 replies; 102+ messages in thread
From: Eric Dumazet @ 2007-03-02  9:56 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: akepner, linux, davem, netdev

[-- Attachment #1: Type: text/plain, Size: 1686 bytes --]

On Friday 02 March 2007 09:52, Evgeniy Polyakov wrote:

> Ok, I've run an analysis of linked-list and trie traversals and found
> that (at least on x86) one optimized list traversal is about 4 (!)
> times faster than one bit lookup in a trie traversal (or, really, one
> lookup in any binary tree-like structure). That is because a trie
> traversal needs more instructions per lookup, plus at least one
> additional branch which cannot be predicted.
>
> Tests with rdtsc show that one bit lookup in a trie (in fact, any
> lookup in a binary tree structure) is about 3-4 times slower than one
> lookup in a linked list.
>
> Since a hash table usually has up to 4 elements in each bucket, a
> competing binary tree/trie structure must find an entry in one lookup,
> which is essentially impossible with the usual tree/trie implementations.
>
> Things change dramatically when the linked list becomes too long, but
> that should not happen with proper resizing of the hash table. Wildcard
> support also introduces additional requirements which cannot be easily
> satisfied by hash tables.
>
> So I take back my words about a tree/trie implementation instead of a
> hash table for socket lookup.
>
> Interested readers can find more details on the tests, asm output and
> conclusions at:
> http://tservice.net.ru/~s0mbre/blog/2007/03/01#2007_03_01

Thank you for this report. (Still avoiding cache-miss studies, though they
obviously are the limiting factor.)

Anyway, if the data is in cache and you want optimum performance from your
CPU, you may try an algorithm without conditional branches
(well, 4 in this case for the whole 32-bit test):

gcc -O2 -S -march=i686 test1.c


[-- Attachment #2: test1.c --]
[-- Type: text/plain, Size: 476 bytes --]

#include <stdio.h>

struct node {
	struct node *left;
	struct node *right;
	int value;
};
struct node *head;
int v1;

#define PASS2(bit) \
	n2 = n1->left; \
	right = n1->right; \
	if (value & (1<<bit)) \
		n2 = right; \
	n1 = n2->left; \
	right = n2->right; \
	if (value & (2<<bit)) \
		n1 = right;

int main(void)
{
	int j;
	unsigned int value = v1;
	struct node *n1 = head, *n2, *right;

	for (j = 0; j < 4; ++j) {
		PASS2(0)
		PASS2(2)
		PASS2(4)
		PASS2(6)
		value >>= 8;
	}
	printf("result=%p\n", (void *)n1);
	return 0;
}

[-- Attachment #3: test1.s --]
[-- Type: text/plain, Size: 1164 bytes --]

	.file	"test1.c"
	.section	.rodata.str1.1,"aMS",@progbits,1
.LC0:
	.string	"result=%p\n"
	.text
	.p2align 4,,15
.globl main
	.type	main, @function
main:
	leal	4(%esp), %ecx
	andl	$-16, %esp
	pushl	-4(%ecx)
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%ebx
	xorl	%ebx, %ebx
	pushl	%ecx
	subl	$16, %esp
	movl	v1, %ecx
	movl	head, %edx
	.p2align 4,,7
.L2:
	movl	4(%edx), %eax
	testb	$1, %cl
	cmove	(%edx), %eax
	testb	$2, %cl
	movl	4(%eax), %edx
	cmove	(%eax), %edx
	testb	$4, %cl
	movl	4(%edx), %eax
	cmove	(%edx), %eax
	testb	$8, %cl
	movl	4(%eax), %edx
	cmove	(%eax), %edx
	testb	$16, %cl
	movl	4(%edx), %eax
	cmove	(%edx), %eax
	testb	$32, %cl
	movl	4(%eax), %edx
	cmove	(%eax), %edx
	testb	$64, %cl
	movl	4(%edx), %eax
	cmove	(%edx), %eax
	testb	%cl, %cl
	movl	4(%eax), %edx
	cmovns	(%eax), %edx
	addl	$1, %ebx
	cmpl	$4, %ebx
	je	.L19
	shrl	$8, %ecx
	jmp	.L2
	.p2align 4,,7
.L19:
	movl	%edx, 4(%esp)
	movl	$.LC0, (%esp)
	call	printf
	addl	$16, %esp
	popl	%ecx
	popl	%ebx
	popl	%ebp
	leal	-4(%ecx), %esp
	ret
	.size	main, .-main
	.comm	head,4,4
	.comm	v1,4,4
	.ident	"GCC: (GNU) 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)"
	.section	.note.GNU-stack,"",@progbits

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-03-02  9:56       ` Eric Dumazet
@ 2007-03-02 10:28         ` Evgeniy Polyakov
  2007-03-02 20:45         ` Michael K. Edwards
  1 sibling, 0 replies; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-03-02 10:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: akepner, linux, davem, netdev

On Fri, Mar 02, 2007 at 10:56:23AM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> On Friday 02 March 2007 09:52, Evgeniy Polyakov wrote:
> 
> > Ok, I've run an analysis of linked-list and trie traversals and found
> > that (at least on x86) one optimized list traversal is about 4 (!)
> > times faster than one bit lookup in a trie traversal (or, really, one
> > lookup in any binary tree-like structure). That is because a trie
> > traversal needs more instructions per lookup, plus at least one
> > additional branch which cannot be predicted.
> >
> > Tests with rdtsc show that one bit lookup in a trie (in fact, any
> > lookup in a binary tree structure) is about 3-4 times slower than one
> > lookup in a linked list.
> >
> > Since a hash table usually has up to 4 elements in each bucket, a
> > competing binary tree/trie structure must find an entry in one lookup,
> > which is essentially impossible with the usual tree/trie implementations.
> >
> > Things change dramatically when the linked list becomes too long, but
> > that should not happen with proper resizing of the hash table. Wildcard
> > support also introduces additional requirements which cannot be easily
> > satisfied by hash tables.
> >
> > So I take back my words about a tree/trie implementation instead of a
> > hash table for socket lookup.
> >
> > Interested readers can find more details on the tests, asm output and
> > conclusions at:
> > http://tservice.net.ru/~s0mbre/blog/2007/03/01#2007_03_01
> 
> Thank you for this report. (Still avoiding cache-miss studies, though they
> obviously are the limiting factor.)
> 
> Anyway, if the data is in cache and you want optimum performance from your
> CPU, you may try an algorithm without conditional branches
> (well, 4 in this case for the whole 32-bit test):

Tests were always for the no-cache-miss case.
I also ran them in kernel mode (to eliminate TLB flushes on every
reschedule, and to account for the kernel TLB covering 8MB while
userspace covers only 4KB), but the results were essentially the same
(modulo a few percent). I only tested the trie; in my implementation its
memory usage is smaller than a hash table's for 2^20 entries.

> gcc -O2 -S -march=i686 test1.c
> 

> struct node {
> 	struct node *left;
> 	struct node *right;
> 	int value;
> 	};
> struct node *head;
> int v1;
> 
> #define PASS2(bit) \
> 	n2 = n1->left; \
> 	right = n1->right; \
>     if (value & (1<<bit)) \
>         n2 = right; \
> 	n1 = n2->left; \
> 	right = n2->right; \
> 	if (value & (2<<bit)) \
> 		n1 = right;
> 
> main()
> {
> int j;
> unsigned int value = v1;
> struct node *n1 = head, *n2, *right;
> for (j=0; j<4; ++j) {
> 	PASS2(0)
> 	PASS2(2)
> 	PASS2(4)
> 	PASS2(6)
> 	value >>= 8;
> 	}
> printf("result=%p\n", n1);
> }

This one resulted in 10*4 and 2*4 branches per loop.
So 32 branches in total (instead of 64 in the simpler code) and 160
instructions (instead of 128 in the simpler code).
Taking a branch as roughly twice as long to execute as an ordinary
instruction (a strange way to put it, I admit, but I have not read the
x86 processor manuals at all, only ppc32), according to the tests we do
not get any gain for a 32-bit value (32 lookups): 64*2+128 in the old
case, 32*2+160 in the new one.

I also have an advanced trie implementation which caches values in nodes
that have no child entries; it _greatly_ decreases the number of lookups
and the memory usage for smaller sets, but in the long run, with a huge
number of entries in the trie, it does not matter, since only the
lowest layer caches values.
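The value-caching optimization is only described, not shown, in the thread; the sketch below is one plausible reading of it (all names are hypothetical): a childless node stores a key directly, so sparse regions of the trie resolve in one visit instead of one visit per bit.

```c
#include <stdlib.h>

/* Hypothetical value-caching trie node: when a node has no children
 * it may hold a key directly, terminating the lookup early. */
struct tc_node {
	struct tc_node *child[2];
	unsigned int key;       /* valid only when has_key is set */
	int has_key;
};

static struct tc_node *tc_alloc(void)
{
	return calloc(1, sizeof(struct tc_node));
}

/* Returns 1 if key is present. */
static int tc_lookup(struct tc_node *n, unsigned int key)
{
	int bit = 0;

	while (n) {
		if (!n->child[0] && !n->child[1])
			return n->has_key && n->key == key; /* cached leaf */
		n = n->child[(key >> bit++) & 1];
	}
	return 0;
}

static void tc_insert(struct tc_node *n, unsigned int key)
{
	int bit = 0;

	for (;;) {
		if (!n->child[0] && !n->child[1] && !n->has_key) {
			n->key = key;       /* cache in the childless node */
			n->has_key = 1;
			return;
		}
		if (!n->child[0] && !n->child[1]) {
			if (n->key == key)
				return;
			/* push the cached key one level down, go on */
			struct tc_node *c = tc_alloc();
			n->child[(n->key >> bit) & 1] = c;
			c->key = n->key;
			c->has_key = 1;
			n->has_key = 0;
		}
		int b = (key >> bit++) & 1;
		if (!n->child[b])
			n->child[b] = tc_alloc();
		n = n->child[b];
	}
}
```

As the set grows, cached keys get pushed down one level at a time, which matches the observation that with a huge number of entries only the lowest layer ends up caching values.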

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-03-02  9:56       ` Eric Dumazet
  2007-03-02 10:28         ` Evgeniy Polyakov
@ 2007-03-02 20:45         ` Michael K. Edwards
  2007-03-03 10:46           ` Evgeniy Polyakov
  1 sibling, 1 reply; 102+ messages in thread
From: Michael K. Edwards @ 2007-03-02 20:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Evgeniy Polyakov, akepner, linux, davem, netdev

On 3/2/07, Eric Dumazet <dada1@cosmosbay.com> wrote:
> Thank you for this report. (Still avoiding cache-miss studies, though they
> obviously are the limiting factor.)

1)  The entire point of going to a tree-like structure would be to
allow the leaves to age out of cache (or even forcibly evict them)
when the structure bloats (generally under DDoS attack), on the theory
that most of them are bogus and won't be referenced again.  It's not
about the speed of the data structure -- it's about managing its
impact on the rest of the system.

2)  The other entire point of going to a tree-like structure is that
they're drastically simpler to RCU than hashes, and more generally
they don't involve individual atomic operations (RCU reaping passes,
resizing, etc.) that cause big latency hiccups and evict a bunch of
other stuff from cache.

3)  The third entire point of going to a tree-like structure is to
have a richer set of efficient operations, since you can give them a
second "priority"-type index and have "pluck-highest-priority-item",
three-sided search, and bulk delete operations.  These aren't that
much harder to RCU than the basic modify-existing-node operation.
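A concrete and standard way to get the "second priority-type index" described above is a treap; the sketch below is generic textbook code with made-up names, not anything from the kernel. Keys keep BST order while priorities keep heap order, so the highest-priority item is always the root:

```c
#include <stdlib.h>

/* Treap node: BST-ordered by key, heap-ordered by prio. */
struct treap {
	struct treap *left, *right;
	unsigned int key;
	unsigned int prio;
};

static struct treap *rot_left(struct treap *n)
{
	struct treap *r = n->right;

	n->right = r->left;
	r->left = n;
	return r;
}

static struct treap *rot_right(struct treap *n)
{
	struct treap *l = n->left;

	n->left = l->right;
	l->right = n;
	return l;
}

static struct treap *treap_insert(struct treap *n, unsigned int key,
				  unsigned int prio)
{
	if (!n) {
		n = calloc(1, sizeof(*n));
		n->key = key;
		n->prio = prio;
		return n;
	}
	if (key < n->key) {
		n->left = treap_insert(n->left, key, prio);
		if (n->left->prio > n->prio)
			n = rot_right(n);   /* restore heap order */
	} else {
		n->right = treap_insert(n->right, key, prio);
		if (n->right->prio > n->prio)
			n = rot_left(n);
	}
	return n;
}

/* "Pluck-highest-priority-item": rotate the root down to a leaf,
 * free it, and return the new root. */
static struct treap *treap_pluck_max(struct treap *n)
{
	struct treap *r;

	if (!n->left && !n->right) {
		free(n);
		return NULL;
	}
	if (!n->left || (n->right && n->right->prio > n->left->prio)) {
		r = rot_left(n);
		r->left = treap_pluck_max(r->left);
	} else {
		r = rot_right(n);
		r->right = treap_pluck_max(r->right);
	}
	return r;
}
```

Each operation touches one root-to-leaf path of local rotations, which is what makes this family of structures comparatively friendly to RCU-style publication of updated subtrees.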

Now can we give these idiotic micro-benchmarks a rest until Robert's
implementation is tuned and ready for stress-testing?

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-03-02 20:45         ` Michael K. Edwards
@ 2007-03-03 10:46           ` Evgeniy Polyakov
  2007-03-04 10:02             ` Michael K. Edwards
  0 siblings, 1 reply; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-03-03 10:46 UTC (permalink / raw)
  To: Michael K. Edwards; +Cc: Eric Dumazet, akepner, linux, davem, netdev

On Fri, Mar 02, 2007 at 12:45:36PM -0800, Michael K. Edwards (medwards.linux@gmail.com) wrote:
> On 3/2/07, Eric Dumazet <dada1@cosmosbay.com> wrote:
> >Thank you for this report. (Still avoiding cache-miss studies, though they
> >obviously are the limiting factor.)
> 
> 1)  The entire point of going to a tree-like structure would be to
> allow the leaves to age out of cache (or even forcibly evict them)
> when the structure bloats (generally under DDoS attack), on the theory
> that most of them are bogus and won't be referenced again.  It's not
> about the speed of the data structure -- it's about managing its
> impact on the rest of the system.
> 
> 2)  The other entire point of going to a tree-like structure is that
> they're drastically simpler to RCU than hashes, and more generally
> they don't involve individual atomic operations (RCU reaping passes,
> resizing, etc.) that cause big latency hiccups and evict a bunch of
> other stuff from cache.
> 
> 3)  The third entire point of going to a tree-like structure is to
> have a richer set of efficient operations, since you can give them a
> second "priority"-type index and have "pluck-highest-priority-item",
> three-sided search, and bulk delete operations.  These aren't that
> much harder to RCU than the basic modify-existing-node operation.
> 
> Now can we give these idiotic micro-benchmarks a rest until Robert's
> implementation is tuned and ready for stress-testing?

Mmm, you have learnt new words from other threads :)
It is not a benchmark, it is an analysis of the structure's processing.
Everything you have written above is correct (it has been said in this
thread multiple times), but it does not change the fact that tree
traversal is slower than list traversal, so to compete, a tree (or
another algorithm, which would be even more interesting) implementation
must take that into account and be faster under most loads. As is, the
tree/trie does not manage that (it is a property of the algorithm), but
it has another advantage, extremely suitable for the routing cache
(which requires wildcard support and scaling), that you did not mention:
the ability to scale without recreating the structure.

Btw, you could try to implement some of what you have written above to
show its merits, so that they would not be empty words :)

> Cheers,
> - Michael

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-03-03 10:46           ` Evgeniy Polyakov
@ 2007-03-04 10:02             ` Michael K. Edwards
  2007-03-04 20:36               ` David Miller
  2007-03-05 10:00               ` Evgeniy Polyakov
  0 siblings, 2 replies; 102+ messages in thread
From: Michael K. Edwards @ 2007-03-04 10:02 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Eric Dumazet, akepner, linux, davem, netdev

On 3/3/07, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> Btw, you could try to implement some of what you have written above to
> show its merits, so that they would not be empty words :)

Before I implement, I design.  Before I design, I analyze.  Before I
analyze, I prototype.  Before I prototype, I gather requirements.
Before I gather requirements, I think -- and the only way I know how
to think about technical matters is to write down my intuitions and
compare them against the sea of published research on the topic.  I'm
only partway through thinking about RCU and DDoS, especially as this
is on the fringe of my professional expertise and the appropriate
literature is not currently at my fingertips.

The only times that I make exceptions to the above sequence are 1,
when someone is paying me well to do so (usually to retrofit some kind
of sanity onto a pile of crap someone else wrote) and 2, when I really
feel like it.  At present neither exception applies here, although I
may yet get so het up about threadlets that I go into a coding binge
(which may or may not produce an RCU splay tree as a side effect).  I
wouldn't hold my breath if I were you, though; it's the first of what
promises to be a string of fine weekends, and if I binge on anything
this spring it's likely to be gardening.

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-03-04 10:02             ` Michael K. Edwards
@ 2007-03-04 20:36               ` David Miller
  2007-03-05  7:12                 ` Michael K. Edwards
  2007-03-05 10:00               ` Evgeniy Polyakov
  1 sibling, 1 reply; 102+ messages in thread
From: David Miller @ 2007-03-04 20:36 UTC (permalink / raw)
  To: medwards.linux; +Cc: johnpol, dada1, akepner, linux, netdev

From: "Michael K. Edwards" <medwards.linux@gmail.com>
Date: Sun, 4 Mar 2007 02:02:36 -0800

> On 3/3/07, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > Btw, you could try to implement some of what you have written above to
> > show its merits, so that they would not be empty words :)
> 
> Before I implement, I design.  Before I design, I analyze.  Before I
> analyze, I prototype.  Before I prototype, I gather requirements.

How the heck do you ever get to writing ANYTHING if you work that way?

I certainly would never have written one single line of Linux kernel
code if I had to go through that kind of sequence to actually get to
writing code.

And that's definitely not the "Linux way".  You code up ideas as soon
as you come up with one that has a chance of working, and you see what
happens.  Sure, you'll throw a lot away, but at least you will "know"
instead of "think".

You have to try things, "DO" stuff, not just sit around and theorize
and design things and shoot down ideas on every negative minute detail
you can come up with before you type any code in.  That mode of
development doesn't inspire people and get a lot of code written.

I definitely do not think others should use this
design/prototype/analyze/blah/blah way of developing as an example,
instead I think folks should use people like Ingo Molnar as an example
of a good Linux developer.  People like Ingo rewrite the scheduler one
night because of a tiny cool idea, and even if only 1 out of 10 hacks
like that turn out to be useful, his work is invaluable and since he's
actually trying to do things and writing lots of code this inspires
other people.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-03-04 20:36               ` David Miller
@ 2007-03-05  7:12                 ` Michael K. Edwards
  2007-03-05 10:02                   ` Robert Olsson
  0 siblings, 1 reply; 102+ messages in thread
From: Michael K. Edwards @ 2007-03-05  7:12 UTC (permalink / raw)
  To: David Miller; +Cc: johnpol, dada1, akepner, linux, netdev

On 3/4/07, David Miller <davem@davemloft.net> wrote:
> From: "Michael K. Edwards" <medwards.linux@gmail.com>
> > Before I implement, I design.  Before I design, I analyze.  Before I
> > analyze, I prototype.  Before I prototype, I gather requirements.
>
> How the heck do you ever get to writing ANYTHING if you work that way?

Oh, c'mon, the whole design process takes maybe 3 weeks for something
I'm going to spend 3 months implementing.  And half the time the
client mistakes the prototype for the implementation and tries to run
with it -- with the foreseeable consequences.  Smaller projects are a
lot less formal but the principle still holds: every time I've
implemented without actually doing the homework leading up to a
design, I've had cause to regret it.  That goes double for problems
that already have off-the-shelf solutions and only need improvement in
ease of use and robustness in hostile environments.

> I certainly would never have written one single line of Linux kernel
> code if I had to go through that kind of sequence to actually get to
> writing code.

Lots of _code_ gets written as a side effect of each stage of that
sequence.  But none of that code should be mistaken for the
_implementation_.  Not when the implementation is going to ship inside
hundreds of thousands of widgets that people are going to stick in
their ears, or is going to monitor the integrity of an
intercontinental telecoms network.  Most shipping code, including most
code that I have written and large swathes of the Linux kernel, has
never really gone beyond the prototype stage.  There's no shame in
that; the shame would be in refusing to admit it.

> And that's definitely not the "Linux way".  You code up ideas as soon
> as you come up with one that has a chance of working, and you see what
> happens.  Sure, you'll throw a lot away, but at least you will "know"
> instead of "think".

I call that process "prototyping".  I used to do a lot of it; but I
have since found that thinking first is actually more efficient.
There is very little point in prototyping the square wheel again and
again and again.  And especially given that Linux already has plenty
of nice, round wheels, there aren't that many places left where you
can impress the world by replacing a square wheel with a hexagonal
one.  Replacing wheels with maglev tracks requires a real design
phase.

> You have to try things, "DO" stuff, not just sit around and theorize
> and design things and shoot down ideas on every negative minute detail
> you can come up with before you type any code in.  That mode of
> development doesn't inspire people and get a lot of code written.

I'll cop to the "negative" part of this, at least to some degree, but
not the rest.  "Naive hashes DDoS easily" is not a minute detail.
"Hash tables don't provide the operations characteristic of priority
queues" is not a minute detail.  "The benchmarks offered do not
accurately model system impact" is not a minute detail.  If I were not
sufficiently familiar with some of Evgeniy's other contributions to
the kernel to think him capable of responding to these critiques with
a better design, I would not have bothered.

> I definitely do not think others should use this
> design/prototype/analyze/blah/balh way of developing as an example,
> instead I think folks should use people like Ingo Molnar as an example
> of a good Linux developer.  People like Ingo rewrite the scheduler one
> night because of a tiny cool idea, and even if only 1 out of 10 hacks
> like that turn out to be useful, his work is invaluable and since he's
> actually trying to do things and writing lots of code this inspires
> other people.

Ingo Molnar is certainly a good, nay an excellent, Linux developer.
His prototypes are brilliant even when they're under-designed by my
way of thinking.  By all means, readers should go thou and do
likewise.  And when someone presses an alternative design, "show me
the code" is a fair response.

At present, I am not offering code, nor design, nor even a prototype.
But I am starting to figure out which bits of the Linux kernel really
are implementations based on a solid design.  Any prototyping I wind
up doing is going to be bolted firmly onto those implementations, not
floating in the surrounding sea of prototype code; and any analysis I
offer based on that prototype will reflect, to the best of my ability,
an understanding of its weaknesses as well as its strengths.  If that
analysis passes community scrutiny, and if I remain interested in the
project, perhaps I will go through with design and implementation as
well.

This, incidentally, seems very similar to the process that Robert
Olsson and Stefan Nilsson have gone through with their trie/hash
project.  Although I haven't tried it out yet and don't have any basis
for an independent opinion, the data and analysis provided in their
paper are quite convincing.  Any prototyping I might do would probably
build on their work, perhaps adding a more explicit DDoS-survival
strategy based on a priority (heap) index and rotation tactics that
preserve it.  Based on my current understanding that would shade it in
the direction of a splay tree; but I am no expert on data structures
and it may take me some time to think it through completely.  In the
meantime, I am inclined to hold them up as an inspiring example.  :-)

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-03-04 10:02             ` Michael K. Edwards
  2007-03-04 20:36               ` David Miller
@ 2007-03-05 10:00               ` Evgeniy Polyakov
  1 sibling, 0 replies; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-03-05 10:00 UTC (permalink / raw)
  To: Michael K. Edwards; +Cc: Eric Dumazet, akepner, linux, davem, netdev

On Sun, Mar 04, 2007 at 02:02:36AM -0800, Michael K. Edwards (medwards.linux@gmail.com) wrote:
> On 3/3/07, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> >Btw, you could try to implement some of what you have written above to
> >show its merits, so that they would not be empty words :)
> 
> Before I implement, I design.  Before I design, I analyze.  Before I
> analyze, I prototype.  Before I prototype, I gather requirements.
> Before I gather requirements, I think -- and the only way I know how
> to think about technical matters is to write down my intuitions and
> compare them against the sea of published research on the topic.  I'm

You forgot 'before (and whilst) I do all of the above, I write tons of
words on mailing lists about nothing, thus wasting my own and other
people's time'.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-03-05  7:12                 ` Michael K. Edwards
@ 2007-03-05 10:02                   ` Robert Olsson
  0 siblings, 0 replies; 102+ messages in thread
From: Robert Olsson @ 2007-03-05 10:02 UTC (permalink / raw)
  To: Michael K. Edwards; +Cc: David Miller, johnpol, dada1, akepner, linux, netdev


Michael K. Edwards writes:

 > This, incidentally, seems very similar to the process that Robert
 > Olsson and Stefan Nilsson have gone through with their trie/hash
 > project.  Although I haven't tried it out yet and don't have any basis
 > for an independent opinion, the data and analysis provided in their
 > paper are quite convincing. 

 I have some info about this "process" :) I moved fib_trie.c to userland to
 extend it to longer and variable key lengths. The test program was doing
 insert/lookup/remove with random data, with blistering performance (very
 flat trees).

 So, quite happy, I moved this trie back into the kernel and started testing
 with "real data" - IP addresses and rDoS.  The result was disastrous...
 very deep trees and awful network performance.  Random data is very
 different from real data was the lesson learned, again.  I gave up.

 A couple of days later an idea came up.  I remembered the proof from the
 LC-trie paper that the length of the key does not impact tree depth.  So I
 went to Stefan: can't we just fix up the data rather than fiddling with a
 new improved algorithm?  The result was this "hash" header to boost tree
 compression.
 
 Cheers
					--ro

 

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-03-02  8:52     ` Evgeniy Polyakov
  2007-03-02  9:56       ` Eric Dumazet
@ 2007-03-13  9:32       ` Evgeniy Polyakov
  2007-03-13 10:08         ` Eric Dumazet
  1 sibling, 1 reply; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-03-13  9:32 UTC (permalink / raw)
  To: akepner; +Cc: linux, davem, netdev

On Fri, Mar 02, 2007 at 11:52:47AM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> So I get my words about tree/trie implementation instead of hash table 
> for socket lookup back.

Hmm, I was a bit quick to draw a line; there are some possibilities for
faster-than-hash-table lookups using different algorithms...

So, I ask the network developers about a testing environment for socket
lookup benchmarking. What would be the best test case to determine the
performance of the lookup algorithm? Is it enough to replace the algorithm
and locking, create, say, one million connections, and try to run a trivial
web server? (That is what I'm going to test if there is no better
suggestion, but I only have a single-core Athlon 64 with 1GB of RAM as a
test bed and two Core Duo machines as generators; probably I can use one of
them as a test machine too. They have gigabit adapters and are connected
over a gigabit switch.)

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-03-13  9:32       ` Evgeniy Polyakov
@ 2007-03-13 10:08         ` Eric Dumazet
  2007-03-13 10:24           ` Evgeniy Polyakov
  0 siblings, 1 reply; 102+ messages in thread
From: Eric Dumazet @ 2007-03-13 10:08 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: akepner, linux, davem, netdev

On Tuesday 13 March 2007 10:32, Evgeniy Polyakov wrote:
> On Fri, Mar 02, 2007 at 11:52:47AM +0300, Evgeniy Polyakov 
(johnpol@2ka.mipt.ru) wrote:
> So, I ask the network developers about a testing environment for socket
> lookup benchmarking. What would be the best test case to determine the
> performance of the lookup algorithm? Is it enough to replace the algorithm
> and locking, create, say, one million connections, and try to run a trivial
> web server? (That is what I'm going to test if there is no better
> suggestion, but I only have a single-core Athlon 64 with 1GB of RAM as a
> test bed and two Core Duo machines as generators; probably I can use one of
> them as a test machine too. They have gigabit adapters and are connected
> over a gigabit switch.)

One million concurrent sockets on your machines will be tricky :)

$ egrep "(filp|dent|^TCP|sock_inode_cache)" /proc/slabinfo |cut -c1-40
TCP                   12     14   1152  
sock_inode_cache     423    430    384  
dentry_cache       36996  47850    132  
filp                4081   4680    192  

that means at minimum 1860 bytes of LOWMEM per TCP socket on a 32-bit kernel
(2512 bytes on a 64-bit kernel)

I had a bench program but apparently I lost it :(
It was able to open long-lived sockets (one million, given enough memory)
and generated kind-of-random traffic on all of them. Damned.
The 'server' side had to listen on many (>16) ports because of the 65536
source-port limit.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Extensible hashing and RCU
  2007-03-13 10:08         ` Eric Dumazet
@ 2007-03-13 10:24           ` Evgeniy Polyakov
  0 siblings, 0 replies; 102+ messages in thread
From: Evgeniy Polyakov @ 2007-03-13 10:24 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: akepner, linux, davem, netdev

On Tue, Mar 13, 2007 at 11:08:27AM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> On Tuesday 13 March 2007 10:32, Evgeniy Polyakov wrote:
> > On Fri, Mar 02, 2007 at 11:52:47AM +0300, Evgeniy Polyakov 
> (johnpol@2ka.mipt.ru) wrote:
> > So, I'm asking network developers about a testing environment for socket
> > lookup benchmarking. What would be the best test case to determine the
> > performance of the lookup algo? Is it enough to replace the algo and
> > locking, create say one million connections, and run a trivial web server?
> > (That is what I'm going to test if there is no better suggestion, but I
> > only have a single-core Athlon 64 with 1 GB of RAM as a test bed and two
> > Core Duo machines as generators; I can probably use one of them as a test
> > machine too. They have gigabit adapters and are connected over a gigabit
> > switch.)
> 
> One million concurrent sockets on your machines will be tricky :)
> 
> $ egrep "(filp|dent|^TCP|sock_inode_cache)" /proc/slabinfo |cut -c1-40
> TCP                   12     14   1152  
> sock_inode_cache     423    430    384  
> dentry_cache       36996  47850    132  
> filp                4081   4680    192  
> 
> that means at minimum 1860 bytes of LOWMEM per TCP socket on a 32-bit kernel 
> (2512 bytes on a 64-bit kernel).
> 
> I had a bench program, but apparently I lost it :(
> It could open long-lived sockets (one million, given enough memory) and 
> generate more or less random traffic on all of them. Damned.
> The 'server' side had to listen on many (>16) ports because of the 65536 
> source-port limit.

Yep, I was too optimistic about my hardware - given the size of the TCP
socket, it is impossible to even create that many of them with 1 or 2 GB
of RAM.

Well, I can run additional tests in userspace (ideally with hugetlb support,
but given that both the socket hash table and my algo use essentially the
same amount of RAM, it should not matter) with a more precise analysis...

And then just send a patch with a detailed description.

-- 
	Evgeniy Polyakov


Thread overview: 102+ messages
2007-02-04  7:41 Extensible hashing and RCU linux
2007-02-05 18:02 ` akepner
2007-02-17 13:13   ` Evgeniy Polyakov
2007-02-18 18:46     ` Eric Dumazet
2007-02-18 19:10       ` Evgeniy Polyakov
2007-02-18 20:21         ` Eric Dumazet
2007-02-18 21:23           ` Michael K. Edwards
2007-02-18 22:04             ` Michael K. Edwards
2007-02-19 12:04             ` Andi Kleen
2007-02-19 19:18               ` Michael K. Edwards
2007-02-19 11:41           ` Evgeniy Polyakov
2007-02-19 13:38             ` Eric Dumazet
2007-02-19 13:56               ` Evgeniy Polyakov
2007-02-19 14:14                 ` Eric Dumazet
2007-02-19 14:25                   ` Evgeniy Polyakov
2007-02-19 15:14                     ` Eric Dumazet
2007-02-19 18:13                       ` Eric Dumazet
2007-02-19 18:26                         ` Benjamin LaHaise
2007-02-19 18:38                           ` Benjamin LaHaise
2007-02-20  9:25                         ` Evgeniy Polyakov
2007-02-20  9:57                           ` David Miller
2007-02-20 10:22                             ` Evgeniy Polyakov
2007-02-20 10:04                           ` Eric Dumazet
2007-02-20 10:12                             ` David Miller
2007-02-20 10:30                               ` Evgeniy Polyakov
2007-02-20 11:10                                 ` Eric Dumazet
2007-02-20 11:23                                   ` Evgeniy Polyakov
2007-02-20 11:30                                   ` Eric Dumazet
2007-02-20 11:41                                     ` Evgeniy Polyakov
2007-02-20 10:49                               ` Eric Dumazet
2007-02-20 15:07                               ` Michael K. Edwards
2007-02-20 15:11                                 ` Evgeniy Polyakov
2007-02-20 15:49                                   ` Michael K. Edwards
2007-02-20 15:59                                     ` Evgeniy Polyakov
2007-02-20 16:08                                       ` Eric Dumazet
2007-02-20 16:20                                         ` Evgeniy Polyakov
2007-02-20 16:38                                           ` Eric Dumazet
2007-02-20 16:59                                             ` Evgeniy Polyakov
2007-02-20 17:05                                               ` Evgeniy Polyakov
2007-02-20 17:53                                                 ` Eric Dumazet
2007-02-20 18:00                                                   ` Evgeniy Polyakov
2007-02-20 18:55                                                     ` Eric Dumazet
2007-02-20 19:06                                                       ` Evgeniy Polyakov
2007-02-20 19:17                                                         ` Eric Dumazet
2007-02-20 19:36                                                           ` Evgeniy Polyakov
2007-02-20 19:44                                                           ` Michael K. Edwards
2007-02-20 17:20                                               ` Eric Dumazet
2007-02-20 17:55                                                 ` Evgeniy Polyakov
2007-02-20 18:12                                                   ` Evgeniy Polyakov
2007-02-20 19:13                                                     ` Michael K. Edwards
2007-02-20 19:44                                                       ` Evgeniy Polyakov
2007-02-20 20:03                                                         ` Michael K. Edwards
2007-02-20 20:09                                                           ` Michael K. Edwards
2007-02-21  8:56                                                             ` Evgeniy Polyakov
2007-02-21  9:34                                                               ` David Miller
2007-02-21  9:51                                                                 ` Evgeniy Polyakov
2007-02-21 10:03                                                                   ` David Miller
2007-02-21  8:54                                                           ` Evgeniy Polyakov
2007-02-21  9:15                                                             ` Eric Dumazet
2007-02-21  9:27                                                               ` Evgeniy Polyakov
2007-02-21  9:38                                                                 ` Eric Dumazet
2007-02-21  9:57                                                                   ` Evgeniy Polyakov
2007-02-21 21:15                                                                     ` Michael K. Edwards
2007-02-22  9:06                                                                       ` David Miller
2007-02-22 11:00                                                                         ` Michael K. Edwards
2007-02-22 11:07                                                                           ` David Miller
2007-02-22 19:24                                                                             ` Stephen Hemminger
2007-02-20 16:04                                     ` Eric Dumazet
2007-02-22 23:49                                     ` linux
2007-02-23  2:31                                       ` Michael K. Edwards
2007-02-20 10:44                             ` Evgeniy Polyakov
2007-02-20 11:09                               ` Eric Dumazet
2007-02-20 11:29                                 ` Evgeniy Polyakov
2007-02-20 11:34                                   ` Eric Dumazet
2007-02-20 11:45                                     ` Evgeniy Polyakov
2007-02-21 12:41                                 ` Andi Kleen
2007-02-21 13:19                                   ` Eric Dumazet
2007-02-21 13:37                                     ` David Miller
2007-02-21 23:13                                       ` Robert Olsson
2007-02-22  6:06                                         ` Eric Dumazet
2007-02-22 11:41                                         ` Andi Kleen
2007-02-22 11:44                                           ` David Miller
2007-02-20 12:11                           ` Evgeniy Polyakov
2007-02-19 22:10                 ` Andi Kleen
2007-02-19 12:02           ` Andi Kleen
2007-02-19 12:35             ` Robert Olsson
2007-02-19 14:04       ` Evgeniy Polyakov
2007-03-02  8:52     ` Evgeniy Polyakov
2007-03-02  9:56       ` Eric Dumazet
2007-03-02 10:28         ` Evgeniy Polyakov
2007-03-02 20:45         ` Michael K. Edwards
2007-03-03 10:46           ` Evgeniy Polyakov
2007-03-04 10:02             ` Michael K. Edwards
2007-03-04 20:36               ` David Miller
2007-03-05  7:12                 ` Michael K. Edwards
2007-03-05 10:02                   ` Robert Olsson
2007-03-05 10:00               ` Evgeniy Polyakov
2007-03-13  9:32       ` Evgeniy Polyakov
2007-03-13 10:08         ` Eric Dumazet
2007-03-13 10:24           ` Evgeniy Polyakov
2007-02-05 18:41 ` [RFC/TOY]Extensible " akepner
2007-02-06 19:09   ` linux
