All of lore.kernel.org
 help / color / mirror / Atom feed
From: Karsten Blees <karsten.blees@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Git List <git@vger.kernel.org>, Thomas Rast <tr@thomasrast.ch>,
	Jens Lehmann <Jens.Lehmann@web.de>
Subject: Re: [PATCH v4 02/14] add a hashtable implementation that supports O(1) removal
Date: Fri, 08 Nov 2013 11:27:26 +0100	[thread overview]
Message-ID: <527CBC8E.6050507@gmail.com> (raw)
In-Reply-To: <xmqq4n7nriuj.fsf@gitster.dls.corp.google.com>

Am 07.11.2013 22:40, schrieb Junio C Hamano:
> Karsten Blees <karsten.blees@gmail.com> writes:
> 
>> +`void hashmap_add(struct hashmap *map, void *entry)`::
>> +
>> +	Adds a hashmap entry. This allows to add duplicate entries (i.e.
>> +	separate values with the same key according to hashmap_cmp_fn).
>> ++
>> +`map` is the hashmap structure.
>> ++
>> +`entry` is the entry to add.
>> +
>> +`void *hashmap_put(struct hashmap *map, void *entry)`::
>> +
>> +	Adds or replaces a hashmap entry.
>> ++
>> +`map` is the hashmap structure.
>> ++
>> +`entry` is the entry to add or update.
>> ++
>> +Returns the previous entry, or NULL if not found (i.e. the entry was added).
> 
> What happens when there were duplicate entries previously added?
> All are replaced?  One of them is randomly chosen and gets replaced?
> 
> With s/replaced/removed/, the same question applies to hashmap_remove().
> 
> I am guessing that "we pick one at random and pretend as if other
> duplicates do not exist" is the answer, 

Exactly, however, you won't have duplicates if you use put exclusively.

> and I do not immediately
> have an objection to that semantics, but whatever the rule is, it
> needs to be spelled out.
> 

I'll clarify this.

>> +`void *hashmap_remove(struct hashmap *map, const void *key, const void *keydata)`::
>> +
>> +	Removes a hashmap entry matching the specified key.
>> ...
>> +Usage example
>> +-------------
>> +
>> +Here's a simple usage example that maps long keys to double values.
>> +[source,c]
>> +------------
>> +struct hashmap map;
>> +
>> +struct long2double {
>> +	struct hashmap_entry ent; /* must be the first member! */
>> +	long key;
>> +	double value;
>> +};
>> +
>> +static int long2double_cmp(const struct long2double *e1, const struct long2double *e2, const void *unused)
>> +{
>> +	return !(e1->key == e2->key);
>> +}
>> +
>> +void long2double_init()
> 
> void long2double_init(void)
> 
>> +{
>> +	hashmap_init(&map, (hashmap_cmp_fn) long2double_cmp, 0);
>> +}
>> +
>> +void long2double_free()
> 
> Likewise.
> 
>> diff --git a/hashmap.c b/hashmap.c
>> new file mode 100644
>> index 0000000..3a9f8c1
>> --- /dev/null
>> +++ b/hashmap.c
>> ...
>> +unsigned int memhash(const void *buf, size_t len)
>> +{
>> +	unsigned int hash = FNV32_BASE;
>> +	unsigned char *ucbuf = (unsigned char*) buf;
> 
> "(unsigned char *)buf"; pointer-asterisk does not stick to type.
> 

Ok, I'll recheck all casts.

>> +	while (len--) {
>> +		unsigned int c = *ucbuf++;
>> +		hash = (hash * FNV32_PRIME) ^ c;
>> +	}
>> +	return hash;
>> +}
>> +
>> +unsigned int memihash(const void *buf, size_t len)
>> +{
>> +	unsigned int hash = FNV32_BASE;
>> +	unsigned char *ucbuf = (unsigned char*) buf;
>> +	while (len--) {
>> +		unsigned int c = *ucbuf++;
>> +		if (c >= 'a' && c <= 'z')
>> +			c -= 'a' - 'A';
>> +		hash = (hash * FNV32_PRIME) ^ c;
>> +	}
>> +	return hash;
>> +}
>> +
>> +#define HASHMAP_INITIAL_SIZE 64
>> +/* grow / shrink by 2^2 */
>> +#define HASHMAP_GROW 2
>> +/* grow if > 80% full (to 20%) */
>> +#define HASHMAP_GROW_AT 1.25
>> +/* shrink if < 16.6% full (to 66.6%) */
>> +#define HASHMAP_SHRINK_AT 6
> 
> May be I am too old fashioned, but seeing a floating-point constant
> in our otherwise pretty-much integer-only code makes me feel uneasy.
> "gcc -S -O2" does seem to generate floating point instruction for
> "i" but not for "j":
> 
>     extern void inspect(unsigned int i, unsigned int j);
> 
>     void grow(unsigned int i, unsigned int j)
>     {
>             i *= 1.25;
>             j += j >> 2;
>             inspect(i, j);
>     }
> 

I guess there's no more i486SXs around, so using floating point should not be a problem (at least performance-wise...).

However, defining the constants inversely is a bit unintuitive (i.e. 1.25 instead of 0.8, 6 instead of 0.166). Perhaps the thresholds should also be calculated once on resize, not on every add / remove.

What about this:

#define HASHMAP_GROW_AT 80
#define HASHMAP_SHRINK_AT 16

...in rehash:

map->grow_at = (unsigned int)((uint64_t) map->tablesize * HASHMAP_GROW_AT / 100);
map->shrink_at = (unsigned int)((uint64_t) map->tablesize * HASHMAP_SHRINK_AT / 100);

...in add:

size++;
if (map->size > map->grow_at)

...in remove:

size--;
if (map->size < map->shrink_at)

> Perhaps
> 
> #define HASHMAP_GROW_AT(current) ((current) + (current) >> 2)
> #define HASHMAP_SHRINK_AT(current) ((current) * 6)
> #define HASHMAP_GROW(current) ((current) << 2)
> #define HASHMAP_SHRINK(current) ((current) >> 2)
> 
> may alleviate my worries; I dunno.
> 
>> +
>> +static inline int entry_equals(const struct hashmap *map,
>> +		const struct hashmap_entry *e1, const struct hashmap_entry *e2,
>> +		const void *keydata)
>> +{
>> +	return (e1 == e2) || (e1->hash == e2->hash && !(*map->cmpfn)(e1, e2, keydata));
> 
> Our code tends not to say "this is a pointer to a function"
> explicitly, i.e.
> 
> 	!map->cmpfn(e1, e2, keydata)
> 
> is more in-line with our code and should also be sufficient.
> 
>> +}
>> +
>> +static inline unsigned int bucket(const struct hashmap *map,
>> +		const struct hashmap_entry *key)
>> +{
>> +	return key->hash & (map->tablesize - 1);
>> +}
>> +
>> +static void rehash(struct hashmap *map, unsigned int newsize)
>> +{
>> +	unsigned int i, oldsize = map->tablesize;
>> +	struct hashmap_entry **oldtable = map->table;
>> +
>> +	map->tablesize = newsize;
>> +	map->table = xcalloc(sizeof(struct hashmap_entry*), map->tablesize);
> 
> Even though multiplication is commutative, please match the official
> function signature, i.e.
> 
> 	calloc(size_t nmemb, size_t size)
> 
> where the number of elements comes first and size of an element
> comes next.  I.e.
> 
> 	xcalloc(map->tablesize, sizeof(struct hashmap_entry *));
> 
> Also don't forget the SP between type and asterisk.
> 
>> +	for (i = 0; i < oldsize; i++) {
>> +		struct hashmap_entry *e = oldtable[i];
>> +		while (e) {
>> +			struct hashmap_entry *next = e->next;
>> +			unsigned int b = bucket(map, e);
>> +			e->next = map->table[b];
>> +			map->table[b] = e;
>> +			e = next;
>> +		}
>> +	}
>> +	free(oldtable);
>> +}
>> +
>> +static inline struct hashmap_entry **find_entry_ptr(const struct hashmap *map,
>> +		const struct hashmap_entry *key, const void *keydata)
>> +{
>> +	struct hashmap_entry **e = &map->table[bucket(map, key)];
>> +	while (*e && !entry_equals(map, *e, key, keydata))
>> +		e = &(*e)->next;
>> +	return e;
>> +}
>> +
>> +static int always_equal(const void *unused1, const void *unused2, const void *unused3)
>> +{
>> +	return 0;
>> +}
>> +
>> +void hashmap_init(struct hashmap *map, hashmap_cmp_fn equals_function,
>> +		size_t initial_size)
>> +{
>> +	map->size = 0;
>> +	map->cmpfn = equals_function ? equals_function : always_equal;
>> +	/* calculate initial table size and allocate the table */
>> +	map->tablesize = HASHMAP_INITIAL_SIZE;
>> +	initial_size *= HASHMAP_GROW_AT;
>> +	while (initial_size > map->tablesize)
>> +		map->tablesize <<= HASHMAP_GROW;
>> +	map->table = xcalloc(sizeof(struct hashmap_entry*), map->tablesize);
>> +}
>> +
>> +void hashmap_free(struct hashmap *map, hashmap_free_fn free_function)
>> +{
> 
> Why is free_function not part of the constants defiend at
> hashmap_init() time?  Your API allows the same hashmap, depending on
> the way it has been used, to be cleaned up with different
> free_function, but I am not sure if that "flexibility" is intended
> (and in what application it would be useful).
> 

The free_function is a convenience so you don't have to loop over the entries yourself. It is only needed by hashmap_free (e.g. remove() and put() return the entry without freeing it), so I don't see a reason to carry the free_function around from construction time.

Git uses overallocated structs a lot (i.e. ending in char name[FLEX_ARRAY]), so stdlib free is all we need so far. If the entries have char *name1; char *name2; which are individually allocated, you need a special free function. But perhaps this is a case of premature generalization and a simple 'to free or not to free' boolean would suffice.

> Just like a NULL passed for equals_function gets the internal
> always_equal fallback function, if you make this a part of
> hashmap_init(), a NULL passed for free_funcion can be used as a
> signal to use the system's free(3) and you no longer have to say
> "free from stdlib" in the docs and the comment.
> 

No, there are cases where the entries are not owned by the hashmap, so passing NULL = 'don't free entries' is a valid case. See name-hash.c:

	hashmap_free(&istate->name_hash, NULL);
	hashmap_free(&istate->dir_hash, free);

The cache_entries are owned by index_state.cache, while the dir_entries are our own.

>> +	if (!map || !map->table)
>> +		return;
>> +	if (free_function) {
>> +		struct hashmap_iter iter;
>> +		struct hashmap_entry *e;
>> +		hashmap_iter_init(map, &iter);
>> +		while ((e = hashmap_iter_next(&iter)))
>> +			(*free_function)(e);
>> +	}
>> +	free(map->table);
>> +	memset(map, 0, sizeof(*map));
>> +}
>> +
>> +void *hashmap_get(const struct hashmap *map, const void *key, const void *keydata)
>> +{
>> +	return *find_entry_ptr(map, key, keydata);
>> +}
>> +
>> +void *hashmap_get_next(const struct hashmap *map, const void *entry)
>> +{
>> +	struct hashmap_entry *e = ((struct hashmap_entry*) entry)->next;
>> +	for (; e; e = e->next)
>> +		if (entry_equals(map, entry, e, NULL))
>> +			return e;
>> +	return NULL;
>> +}
>> +
>> +void hashmap_add(struct hashmap *map, void *entry)
>> +{
>> +	unsigned int b = bucket(map, entry);
>> +
>> +	/* add entry */
>> +	((struct hashmap_entry*) entry)->next = map->table[b];
>> +	map->table[b] = entry;
>> +
>> +	/* fix size and rehash if appropriate */
>> +	map->size++;
>> +	if (map->size * HASHMAP_GROW_AT > map->tablesize)
>> +		rehash(map, map->tablesize << HASHMAP_GROW);
>> +}
>> +
>> +void *hashmap_remove(struct hashmap *map, const void *key, const void *keydata)
>> +{
>> +	struct hashmap_entry *old;
>> +	struct hashmap_entry **e = find_entry_ptr(map, key, keydata);
>> +	if (!*e)
>> +		return NULL;
>> +
>> +	/* remove existing entry */
>> +	old = *e;
>> +	*e = old->next;
>> +	old->next = NULL;
>> +
>> +	/* fix size and rehash if appropriate */
>> +	map->size--;
>> +	if (map->tablesize > HASHMAP_INITIAL_SIZE &&
>> +	    map->size * HASHMAP_SHRINK_AT < map->tablesize)
>> +		rehash(map, map->tablesize >> HASHMAP_GROW);
> 
> This "we shrink by the same amount" looks inconsistent with the use
> of separate grow-at and shrink-at constants (see above for four
> suggested #define's).
> 

These values account for a small hysteresis so that there is no size at which a sequence of add, remove, add, remove (or put, put, put, put) results in permanent resizes.

We grow at 80% (* 4 = 20% full after grow), and shrink at 16.6% ( / 4 = 66.6% full after shrink).

  reply	other threads:[~2013-11-08 10:27 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-11-07 14:32 [PATCH v4 00/14] New hash table implementation Karsten Blees
2013-11-07 14:33 ` [PATCH v4 01/14] submodule: don't access the .gitmodules cache entry after removing it Karsten Blees
2013-11-07 22:27   ` Heiko Voigt
2013-11-07 14:34 ` [PATCH v4 02/14] add a hashtable implementation that supports O(1) removal Karsten Blees
2013-11-07 21:40   ` Junio C Hamano
2013-11-08 10:27     ` Karsten Blees [this message]
2013-11-08 16:45       ` Philip Oakley
2013-11-08 17:08       ` Junio C Hamano
2013-11-13 16:37         ` Karsten Blees
2013-11-07 14:35 ` [PATCH v4 03/14] buitin/describe.c: use new hash map implementation Karsten Blees
2013-11-07 14:36 ` [PATCH v4 04/14] diffcore-rename.c: move code around to prepare for the next patch Karsten Blees
2013-11-07 14:36 ` [PATCH v4 05/14] diffcore-rename.c: simplify finding exact renames Karsten Blees
2013-11-07 14:37 ` [PATCH v4 06/14] diffcore-rename.c: use new hash map implementation Karsten Blees
2013-11-07 14:38 ` [PATCH v4 07/14] name-hash.c: use new hash map implementation for directories Karsten Blees
2013-11-07 14:38 ` [PATCH v4 08/14] name-hash.c: remove unreferenced directory entries Karsten Blees
2013-11-07 14:39 ` [PATCH v4 09/14] name-hash.c: use new hash map implementation for cache entries Karsten Blees
2013-11-07 14:39 ` [PATCH v4 10/14] name-hash.c: remove cache entries instead of marking them CE_UNHASHED Karsten Blees
2013-11-07 14:40 ` [PATCH v4 11/14] remove old hash.[ch] implementation Karsten Blees
2013-11-07 14:43 ` [PATCH v4 12/14] fix 'git update-index --verbose --again' output Karsten Blees
2013-11-07 22:12   ` Junio C Hamano
2013-11-08 10:27     ` Karsten Blees
2013-11-07 14:44 ` [PATCH v4 13/14] builtin/update-index.c: cleanup update_one Karsten Blees
2013-11-07 21:40   ` Junio C Hamano
2013-11-08 10:27     ` Karsten Blees
2013-11-07 14:45 ` [PATCH v4 14/14] read-cache.c: fix memory leaks caused by removed cache entries Karsten Blees
2013-11-07 21:40   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=527CBC8E.6050507@gmail.com \
    --to=karsten.blees@gmail.com \
    --cc=Jens.Lehmann@web.de \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=tr@thomasrast.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.