Re: [PATCH v4 02/14] add a hashtable implementation that supports O(1) removal

From: Karsten Blees <karsten.blees@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Git List <git@vger.kernel.org>, Thomas Rast <tr@thomasrast.ch>,
	Jens Lehmann <Jens.Lehmann@web.de>
Subject: Re: [PATCH v4 02/14] add a hashtable implementation that supports O(1) removal
Date: Fri, 08 Nov 2013 11:27:26 +0100	[thread overview]
Message-ID: <527CBC8E.6050507@gmail.com> (raw)
In-Reply-To: <xmqq4n7nriuj.fsf@gitster.dls.corp.google.com>

Am 07.11.2013 22:40, schrieb Junio C Hamano:
> Karsten Blees <karsten.blees@gmail.com> writes:
> 
>> +`void hashmap_add(struct hashmap *map, void *entry)`::
>> +
>> +	Adds a hashmap entry. This allows to add duplicate entries (i.e.
>> +	separate values with the same key according to hashmap_cmp_fn).
>> ++
>> +`map` is the hashmap structure.
>> ++
>> +`entry` is the entry to add.
>> +
>> +`void *hashmap_put(struct hashmap *map, void *entry)`::
>> +
>> +	Adds or replaces a hashmap entry.
>> ++
>> +`map` is the hashmap structure.
>> ++
>> +`entry` is the entry to add or update.
>> ++
>> +Returns the previous entry, or NULL if not found (i.e. the entry was added).
> 
> What happens when there were duplicate entries previously added?
> All are replaced?  One of them is randomly chosen and gets replaced?
> 
> With s/replaced/removed/, the same question applies to hashmap_remove().
> 
> I am guessing that "we pick one at random and pretend as if other
> duplicates do not exist" is the answer, 

Exactly, however, you won't have duplicates if you use put exclusively.

> and I do not immediately
> have an objection to that semantics, but whatever the rule is, it
> needs to be spelled out.
> 

I'll clarify this.

>> +`void *hashmap_remove(struct hashmap *map, const void *key, const void *keydata)`::
>> +
>> +	Removes a hashmap entry matching the specified key.
>> ...
>> +Usage example
>> +-------------
>> +
>> +Here's a simple usage example that maps long keys to double values.
>> +[source,c]
>> +------------
>> +struct hashmap map;
>> +
>> +struct long2double {
>> +	struct hashmap_entry ent; /* must be the first member! */
>> +	long key;
>> +	double value;
>> +};
>> +
>> +static int long2double_cmp(const struct long2double *e1, const struct long2double *e2, const void *unused)
>> +{
>> +	return !(e1->key == e2->key);
>> +}
>> +
>> +void long2double_init()
> 
> void long2double_init(void)
> 
>> +{
>> +	hashmap_init(&map, (hashmap_cmp_fn) long2double_cmp, 0);
>> +}
>> +
>> +void long2double_free()
> 
> Likewise.
> 
>> diff --git a/hashmap.c b/hashmap.c
>> new file mode 100644
>> index 0000000..3a9f8c1
>> --- /dev/null
>> +++ b/hashmap.c
>> ...
>> +unsigned int memhash(const void *buf, size_t len)
>> +{
>> +	unsigned int hash = FNV32_BASE;
>> +	unsigned char *ucbuf = (unsigned char*) buf;
> 
> "(unsigned char *)buf"; pointer-asterisk does not stick to type.
> 

Ok, I'll recheck all casts.

>> +	while (len--) {
>> +		unsigned int c = *ucbuf++;
>> +		hash = (hash * FNV32_PRIME) ^ c;
>> +	}
>> +	return hash;
>> +}
>> +
>> +unsigned int memihash(const void *buf, size_t len)
>> +{
>> +	unsigned int hash = FNV32_BASE;
>> +	unsigned char *ucbuf = (unsigned char*) buf;
>> +	while (len--) {
>> +		unsigned int c = *ucbuf++;
>> +		if (c >= 'a' && c <= 'z')
>> +			c -= 'a' - 'A';
>> +		hash = (hash * FNV32_PRIME) ^ c;
>> +	}
>> +	return hash;
>> +}
>> +
>> +#define HASHMAP_INITIAL_SIZE 64
>> +/* grow / shrink by 2^2 */
>> +#define HASHMAP_GROW 2
>> +/* grow if > 80% full (to 20%) */
>> +#define HASHMAP_GROW_AT 1.25
>> +/* shrink if < 16.6% full (to 66.6%) */
>> +#define HASHMAP_SHRINK_AT 6
> 
> May be I am too old fashioned, but seeing a floating-point constant
> in our otherwise pretty-much integer-only code makes me feel uneasy.
> "gcc -S -O2" does seem to generate floating point instruction for
> "i" but not for "j":
> 
>     extern void inspect(unsigned int i, unsigned int j);
> 
>     void grow(unsigned int i, unsigned int j)
>     {
>             i *= 1.25;
>             j += j >> 2;
>             inspect(i, j);
>     }
> 

I guess there's no more i486SXs around, so using floating point should not be a problem (at least performance-wise...).

However, defining the constants inversely is a bit unintuitive (i.e. 1.25 instead of 0.8, 6 instead of 0.166). Perhaps the thresholds should also be calculated once on resize, not on every add / remove.

What about this:

#define HASHMAP_GROW_AT 80
#define HASHMAP_SHRINK_AT 16

...in rehash:

map->grow_at = (unsigned int)((uint64_t) map->tablesize * HASHMAP_GROW_AT / 100);
map->shrink_at = (unsigned int)((uint64_t) map->tablesize * HASHMAP_SHRINK_AT / 100);

...in add:

size++;
if (map->size > map->grow_at)

...in remove:

size--;
if (map->size < map->shrink_at)

> Perhaps
> 
> #define HASHMAP_GROW_AT(current) ((current) + (current) >> 2)
> #define HASHMAP_SHRINK_AT(current) ((current) * 6)
> #define HASHMAP_GROW(current) ((current) << 2)
> #define HASHMAP_SHRINK(current) ((current) >> 2)
> 
> may alleviate my worries; I dunno.
> 
>> +
>> +static inline int entry_equals(const struct hashmap *map,
>> +		const struct hashmap_entry *e1, const struct hashmap_entry *e2,
>> +		const void *keydata)
>> +{
>> +	return (e1 == e2) || (e1->hash == e2->hash && !(*map->cmpfn)(e1, e2, keydata));
> 
> Our code tends not to say "this is a pointer to a function"
> explicitly, i.e.
> 
> 	!map->cmpfn(e1, e2, keydata)
> 
> is more in-line with our code and should also be sufficient.
> 
>> +}
>> +
>> +static inline unsigned int bucket(const struct hashmap *map,
>> +		const struct hashmap_entry *key)
>> +{
>> +	return key->hash & (map->tablesize - 1);
>> +}
>> +
>> +static void rehash(struct hashmap *map, unsigned int newsize)
>> +{
>> +	unsigned int i, oldsize = map->tablesize;
>> +	struct hashmap_entry **oldtable = map->table;
>> +
>> +	map->tablesize = newsize;
>> +	map->table = xcalloc(sizeof(struct hashmap_entry*), map->tablesize);
> 
> Even though multiplication is commutative, please match the official
> function signature, i.e.
> 
> 	calloc(size_t nmemb, size_t size)
> 
> where the number of elements comes first and size of an element
> comes next.  I.e.
> 
> 	xcalloc(map->tablesize, sizeof(struct hashmap_entry *));
> 
> Also don't forget the SP between type and asterisk.
> 
>> +	for (i = 0; i < oldsize; i++) {
>> +		struct hashmap_entry *e = oldtable[i];
>> +		while (e) {
>> +			struct hashmap_entry *next = e->next;
>> +			unsigned int b = bucket(map, e);
>> +			e->next = map->table[b];
>> +			map->table[b] = e;
>> +			e = next;
>> +		}
>> +	}
>> +	free(oldtable);
>> +}
>> +
>> +static inline struct hashmap_entry **find_entry_ptr(const struct hashmap *map,
>> +		const struct hashmap_entry *key, const void *keydata)
>> +{
>> +	struct hashmap_entry **e = &map->table[bucket(map, key)];
>> +	while (*e && !entry_equals(map, *e, key, keydata))
>> +		e = &(*e)->next;
>> +	return e;
>> +}
>> +
>> +static int always_equal(const void *unused1, const void *unused2, const void *unused3)
>> +{
>> +	return 0;
>> +}
>> +
>> +void hashmap_init(struct hashmap *map, hashmap_cmp_fn equals_function,
>> +		size_t initial_size)
>> +{
>> +	map->size = 0;
>> +	map->cmpfn = equals_function ? equals_function : always_equal;
>> +	/* calculate initial table size and allocate the table */
>> +	map->tablesize = HASHMAP_INITIAL_SIZE;
>> +	initial_size *= HASHMAP_GROW_AT;
>> +	while (initial_size > map->tablesize)
>> +		map->tablesize <<= HASHMAP_GROW;
>> +	map->table = xcalloc(sizeof(struct hashmap_entry*), map->tablesize);
>> +}
>> +
>> +void hashmap_free(struct hashmap *map, hashmap_free_fn free_function)
>> +{
> 
> Why is free_function not part of the constants defiend at
> hashmap_init() time?  Your API allows the same hashmap, depending on
> the way it has been used, to be cleaned up with different
> free_function, but I am not sure if that "flexibility" is intended
> (and in what application it would be useful).
> 

The free_function is a convenience so you don't have to loop over the entries yourself. It is only needed by hashmap_free (e.g. remove() and put() return the entry without freeing it), so I don't see a reason to carry the free_function around from construction time.

Git uses overallocated structs a lot (i.e. ending in char name[FLEX_ARRAY]), so stdlib free is all we need so far. If the entries have char *name1; char *name2; which are individually allocated, you need a special free function. But perhaps this is a case of premature generalization and a simple 'to free or not to free' boolean would suffice.

> Just like a NULL passed for equals_function gets the internal
> always_equal fallback function, if you make this a part of
> hashmap_init(), a NULL passed for free_funcion can be used as a
> signal to use the system's free(3) and you no longer have to say
> "free from stdlib" in the docs and the comment.
> 

No, there are cases where the entries are not owned by the hashmap, so passing NULL = 'don't free entries' is a valid case. See name-hash.c:

	hashmap_free(&istate->name_hash, NULL);
	hashmap_free(&istate->dir_hash, free);

The cache_entries are owned by index_state.cache, while the dir_entries are our own.

>> +	if (!map || !map->table)
>> +		return;
>> +	if (free_function) {
>> +		struct hashmap_iter iter;
>> +		struct hashmap_entry *e;
>> +		hashmap_iter_init(map, &iter);
>> +		while ((e = hashmap_iter_next(&iter)))
>> +			(*free_function)(e);
>> +	}
>> +	free(map->table);
>> +	memset(map, 0, sizeof(*map));
>> +}
>> +
>> +void *hashmap_get(const struct hashmap *map, const void *key, const void *keydata)
>> +{
>> +	return *find_entry_ptr(map, key, keydata);
>> +}
>> +
>> +void *hashmap_get_next(const struct hashmap *map, const void *entry)
>> +{
>> +	struct hashmap_entry *e = ((struct hashmap_entry*) entry)->next;
>> +	for (; e; e = e->next)
>> +		if (entry_equals(map, entry, e, NULL))
>> +			return e;
>> +	return NULL;
>> +}
>> +
>> +void hashmap_add(struct hashmap *map, void *entry)
>> +{
>> +	unsigned int b = bucket(map, entry);
>> +
>> +	/* add entry */
>> +	((struct hashmap_entry*) entry)->next = map->table[b];
>> +	map->table[b] = entry;
>> +
>> +	/* fix size and rehash if appropriate */
>> +	map->size++;
>> +	if (map->size * HASHMAP_GROW_AT > map->tablesize)
>> +		rehash(map, map->tablesize << HASHMAP_GROW);
>> +}
>> +
>> +void *hashmap_remove(struct hashmap *map, const void *key, const void *keydata)
>> +{
>> +	struct hashmap_entry *old;
>> +	struct hashmap_entry **e = find_entry_ptr(map, key, keydata);
>> +	if (!*e)
>> +		return NULL;
>> +
>> +	/* remove existing entry */
>> +	old = *e;
>> +	*e = old->next;
>> +	old->next = NULL;
>> +
>> +	/* fix size and rehash if appropriate */
>> +	map->size--;
>> +	if (map->tablesize > HASHMAP_INITIAL_SIZE &&
>> +	    map->size * HASHMAP_SHRINK_AT < map->tablesize)
>> +		rehash(map, map->tablesize >> HASHMAP_GROW);
> 
> This "we shrink by the same amount" looks inconsistent with the use
> of separate grow-at and shrink-at constants (see above for four
> suggested #define's).
> 

These values account for a small hysteresis so that there is no size at which a sequence of add, remove, add, remove (or put, put, put, put) results in permanent resizes.

We grow at 80% (* 4 = 20% full after grow), and shrink at 16.6% ( / 4 = 66.6% full after shrink).