Re: [PATCH v1 3/4] hashmap: add simplified hashmap_get_from_hash() API

From: Karsten Blees <karsten.blees@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Git List <git@vger.kernel.org>
Subject: Re: [PATCH v1 3/4] hashmap: add simplified hashmap_get_from_hash() API
Date: Fri, 11 Jul 2014 21:11:51 +0200	[thread overview]
Message-ID: <53C036F7.5070104@gmail.com> (raw)
In-Reply-To: <xmqqk37pdpzb.fsf@gitster.dls.corp.google.com>

Am 07.07.2014 19:43, schrieb Junio C Hamano:
> Karsten Blees <karsten.blees@gmail.com> writes:
> 
>> Hashmap entries are typically looked up by just a key. The hashmap_get()
>> API expects an initialized entry structure instead, to support compound
>> keys. This flexibility is currently only needed by find_dir_entry() in
>> name-hash.c (and compat/win32/fscache.c in the msysgit fork). All other
>> (currently five) call sites of hashmap_get() have to set up a near emtpy
> 
> s/emtpy/empty/;
> 
>> entry structure, resulting in duplicate code like this:
>>
>>   struct hashmap_entry keyentry;
>>   hashmap_entry_init(&keyentry, hash(key));
>>   return hashmap_get(map, &keyentry, key);
>>
>> Add a hashmap_get_from_hash() API that allows hashmap lookups by just
>> specifying the key and its hash code, i.e.:
>>
>>   return hashmap_get_from_hash(map, hash(key), key);
> 
> While I think the *_get_from_hash() is an improvement over the
> three-line sequence, I find it somewhat strange that callers of it
> still must compute the hash code themselves, instead of letting the
> hashmap itself to apply the user supplied hash function to the key
> to derive it.  After all, the map must know how to compare two
> entries via a user-supplied cmpfn given at the map initialization
> time, and the algorithm to derive the hash code falls into the same
> category, in the sense that both must stay the same during the
> lifetime of a hashmap, no?  Unless there is a situation where you
> need to be able to initialize a hashmap_entry without knowing which
> hashmap the entry will eventually be stored, it feels dubious API
> that exposes to the outside callers hashmap_entry_init() that takes
> the hash code without taking the hashmap itself.
> 
> In other words, why isn't hashmap_get() more like this:
> 
>         void *hashmap_get(const struct hashmap *map, const void *key)
>         {
>                 struct hashmap_entry keyentry;
>                 hashmap_entry_init(&keyentry, map->hash(key));
>                 return *find_entry_ptr(map, &keyentry, key);
>         }
> 
> with hashmap_entry_init() purely a static helper in hashmap.c?
> 

1. Performance

Calculating hash codes is the most expensive operation when working with
hash tables (unless you already have a good hash such as sha1). And using
hash tables as a cache of some sort is probably the most common use case.
So you'll often have code like this:

1  unsigned int h = hash(key);
2  struct my_entry *e = hashmap_get_from_hash(&map, hash, key);
3  if (!e) {
4    e = malloc(sizeof(*e));
5    hashmap_entry_init(e, h);
6    e->key = key;
6    hashmap_add(&map, e);
7  }

Note that the hash code from line 1 can be reused in line 5. You cannot do
that if calculating the hash code is buried in the implementation.

Callbacks cannot be inlined either...

2. Simplicity

Most APIs take a user defined hashmap_entry structure, so you'd either need
two hash functions, or a hash function that can distinguish between the
'entry' and 'key-only' case, e.g.

  unsigned int my_hash(const struct my_entry *entry, const void *keydata)
  {
    if (keydata)
      return strhash(keydata);
    else
      return strhash(entry->key);
  }

Hashmap clients will typically provide small, type safe wrappers around the
hashmap API. That's perfect for calculating the hash code without breaking
encapsulation. IMO there's no need to make things more complex by requiring
another callback.

3. Compound keys

The key doesn't always consist of just a single word. E.g. for struct
dir_entry, the key is [char *name, int len]. So an API like this:

  void *hashmap_get(const struct hashmap *map, const void *key)

won't do in the general case (unless you require clients to define their
own key structure in addition to the entry structure...).