Re: [PATCH] fetch: replace string-list used as a look-up table with a hashmap

From: Stefan Beller <sbeller@google.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: git <git@vger.kernel.org>
Subject: Re: [PATCH] fetch: replace string-list used as a look-up table with a hashmap
Date: Wed, 26 Sep 2018 15:59:08 -0700	[thread overview]
Message-ID: <CAGZ79kamkLV7yHtCu_qSqNzj7bJm0=Uw1CKxss+zTKJWsYz3qw@mail.gmail.com> (raw)
In-Reply-To: <xmqqin2sj6df.fsf@gitster-ct.c.googlers.com>

On Wed, Sep 26, 2018 at 2:28 PM Junio C Hamano <gitster@pobox.com> wrote:

>
> +struct refname_hash_entry {
> +       struct hashmap_entry ent; /* must be the first member */

$ git grep "struct hashmap_entry" reveals that we have another
convention that we follow but not document, which is to stress
the importance of putting the hashmap_entry first. ;-)

> +       struct object_id oid;
> +       char refname[FLEX_ARRAY];
> +};
> +
> +static int refname_hash_entry_cmp(const void *hashmap_cmp_fn_data,
> +                                 const struct refname_hash_entry *e1,
> +                                 const struct refname_hash_entry *e2,
> +                                 const void *keydata)
> +{
> +       return strcmp(e1->refname, keydata ? keydata : e2->refname);
> +}

and later

    hashmap_init(...  (hashmap_cmp_fn)refname_hash_entry_cmp, ...);

I wonder if we'd want to stick to this style, as that seems to be the easiest
to do, and drop the efforts started in 55c965f3a2 (Merge branch
'sb/hashmap-cleanup', 2017-08-11), that avoids the cast in the init,
but puts the (implicit) casts in the _cmp function as we'd take
void pointers as the function signature and recast to a local variable.

> +
> +static struct refname_hash_entry *refname_hash_add(struct hashmap *map,
> +                                                  const char *refname,
> +                                                  const struct object_id *oid)
> +{
> +       struct refname_hash_entry *ent;
> +       size_t len = strlen(refname);
> +
> +       FLEX_ALLOC_MEM(ent, refname, refname, len);
> +       hashmap_entry_init(ent, memhash(refname, len));

memhash is preferred over strhash as we already know the length.
For a second I wondered why we use it instead of strhash

> +static int refname_hash_exists(struct hashmap *map, const char *refname)
> +{
> +       struct hashmap_entry key;
> +       size_t len = strlen(refname);
> +       hashmap_entry_init(&key, memhash(refname, len));

Here we could use strhash, but as they could change to
differing implementation, we keep using memhash.

>         /*
> -        * For all the tags in the remote_refs string list,
> +        * For all the tags in the remote_refs_list,
>          * add them to the list of refs to be fetched
>          */
> -       for_each_string_list_item(item, &remote_refs) {
> +       for_each_string_list_item(remote_ref_item, &remote_refs_list) {
> +               const char *refname = remote_ref_item->string;
> +               struct hashmap_entry key;
> +
> +               hashmap_entry_init(&key, memhash(refname, strlen(refname)));
> +               item = hashmap_get(&remote_refs, &key, refname);

> +               if (!item)
> +                       continue; /* can this happen??? */

I thought this could happen when we have hash collisions,
I convinced myself otherwise, as we pass the refname to hashmap_get
as the 'keydata', so when there is a hash collision we keep comparing
to the real value.

And as whenever we have an insert to remote_refs_list, we also
have a refname_hash_add to the remote_refs hashmap,
I think the return of NULL is not possible, and we could switch
it to BUG(...);

> All this code predates more modern in-core lookup API like hashmap;
> replace them with two hashmaps and one string list---the hashmaps
> are used for look-ups and the string list is to keep the order of
> items in the returned result stable (which is the only single thing
> hashmap does worse than lookups on string-list).
>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---

I wonder if we want to add an API to the hashmap that allows for ordered
iteration (as hashmap_iter_* relies on the hashing function for its output)

I think it may be too early to do so, but there are 2 thoughts on how to do it:
* Each element keeps (prev/next) pointers to the previously inserted
  element and updates the next pointer when the next element is inserted.
  When removing elements we'd have to update the next and prev pointer
  of the adjacent elements.
  Then iteration can be done in insertion-order.
  Probably this would go into its own file ordered-hashmap.{c, h}
* provide an order function to hashmap_iter_init, which then
  orders according to that function before the first call of
  hashmap_iter_next returns.

>  * Just to remind ourselves that we talked about getting rid of
>    string-list used as look-up tables by replacing them with hashmap
>    but haven't seen much effort expended on it.  I think this is my
>    first semi-serious use of hashmap, and the code is expected to be
>    full of newbie mistakes, inefficiencies and ignorance of idioms;
>    pointing out any of which is much appreciated.

I could not find newbie mistakes, but gave my thoughts anyway.

Stefan