From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS31976 209.132.180.0/23 X-Spam-Status: No, score=-3.7 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, FROM_EXCESS_BASE64,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,SPF_HELO_NONE,SPF_NONE shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by dcvr.yhbt.net (Postfix) with ESMTP id 348E31F461 for ; Wed, 28 Aug 2019 16:30:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726462AbfH1Qaw (ORCPT ); Wed, 28 Aug 2019 12:30:52 -0400 Received: from mail-wm1-f66.google.com ([209.85.128.66]:51130 "EHLO mail-wm1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726315AbfH1Qaw (ORCPT ); Wed, 28 Aug 2019 12:30:52 -0400 Received: by mail-wm1-f66.google.com with SMTP id v15so752853wml.0 for ; Wed, 28 Aug 2019 09:30:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:content-transfer-encoding:in-reply-to :user-agent; bh=nESZQ+2OR3JI6AgiOFDzOHte75ILqc3RVQA5VcCKlhc=; b=iZfYKxU/liH/QhII8hS7Jm+PuVjjlJjM3jFX4bv7KI4hWe+0xxq8pEdrATXRbmbTvb btQlupo8vKIya1Dh9AD0nbw7o/vTH2KVDKTtTJtmu0Zqe5HgGNWBeyVXLT8ol+DiVX8Y TgYjI31Q/QC5Z8VwWJnsrYI614qrEE3dO3SZQ1bIS604TM5lO1aDMegJeUKfoWoWsDxj ObzjxyFal4vo+/tA1rKd5SHqW8eDUKsaBUpgxuzIYnHK4FqegmJhXYNOjoZ/QmU9pmXz sp3pRFKmqgwDzsW3R5gM+dCAD0eXw/Q9GbBfpAvFu1z7GYKYxX4WHKj9BUJGViqjGR+B l2Xg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:content-transfer-encoding :in-reply-to:user-agent; bh=nESZQ+2OR3JI6AgiOFDzOHte75ILqc3RVQA5VcCKlhc=; b=ucJqSU307DhIYTycu/iP/gpPfcSv+nemE9Jqg+5prOVIMICGEj785csyWh3z6L7Mt1 lxms2oOQnfpehRLuOjHggtPF2CaLS0krxr1Bewn5ZOhBpd0A8wi1SYruT+rIbM5E9HHG nlRULyQRA2gOxsqtphTu1jRkeiROxlM2ngE/a9VeQBTbNw4F59dqsu8P6yt464ofYK2n OuRvbFSSgPu27sdur1E+wUlU3AX2MKAVJjKCu503SZ5rB8temg2LjSbVRy3SHootYo/I dGR8Byx1J/kxqbSfn/xGZouKR6kZeUZK5qukjUNTg0wn6JnaTi6kVtsA52OG1sgKyDRy jxqQ== X-Gm-Message-State: APjAAAUsBq1vbRlExEIsl0mv9gFCgfj24G1X/5QCrQnLC6+GXdY68eeI KDpZEElq4k8h4t7uX3K+UrhcNQHG X-Google-Smtp-Source: APXvYqxYWE/kUaSS7pwn5omEKGV/auefdtFie5/I7p2/uIiJhss8dkWByhKrI7KB0HhEw2yvtSLRfQ== X-Received: by 2002:a1c:544e:: with SMTP id p14mr5702569wmi.4.1567009848443; Wed, 28 Aug 2019 09:30:48 -0700 (PDT) Received: from szeder.dev (x4db66bea.dyn.telefonica.de. [77.182.107.234]) by smtp.gmail.com with ESMTPSA id e9sm4045815wmd.25.2019.08.28.09.30.46 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 28 Aug 2019 09:30:47 -0700 (PDT) Date: Wed, 28 Aug 2019 18:30:45 +0200 From: SZEDER =?utf-8?B?R8OhYm9y?= To: Johannes Schindelin Cc: Slavica Djukic via GitGitGadget , git@vger.kernel.org, Jeff Hostetler , Jeff King , Junio C Hamano , Slavica Djukic Subject: Re: [PATCH v3 07/11] Add a function to determine unique prefixes for a list of strings Message-ID: <20190828163045.GF8571@szeder.dev> References: <3000d7d08dfb64511b4ebf9d05617897dd7252f7.1563289115.git.gitgitgadget@gmail.com> <20190824123811.GL20404@szeder.dev> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.5.24 (2015-08-30) Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org On Tue, Aug 27, 2019 at 02:14:06PM +0200, Johannes Schindelin wrote: > Hi Gábor, > > On Sat, 24 Aug 2019, SZEDER Gábor wrote: > > > On Tue, Jul 16, 2019 at 07:58:42AM -0700, Slavica Djukic via GitGitGadget wrote: > > > In the `git add -i` command, we show unique prefixes of the commands and > > > files, to give an indication what prefix would select them. > > > > > > Naturally, the C implementation looks a lot different than the Perl > > > implementation: in Perl, a trie is much easier implemented, while we > > > already have a pretty neat hashmap implementation in C that we use for > > > the purpose of storing (not necessarily unique) prefixes. > > > > > > The idea: for each item that we add, we generate prefixes starting with > > > the first letter, then the first two letters, then three, etc, until we > > > find a prefix that is unique (or until the prefix length would be > > > longer than we want). If we encounter a previously-unique prefix on the > > > way, we adjust that item's prefix to make it unique again (or we mark it > > > as having no unique prefix if we failed to find one). These partial > > > prefixes are stored in a hash map (for quick lookup times). > > > > > > To make sure that this function works as expected, we add a test using a > > > special-purpose test helper that was added for that purpose. > > > > > > Note: We expect the list of prefix items to be passed in as a list of > > > pointers rather than as regular list to avoid having to copy information > > > (the actual items will most likely contain more information than just > > > the name and the length of the unique prefix, but passing in `struct > > > prefix_item *` would not allow for that). > > > > > diff --git a/prefix-map.c b/prefix-map.c > > > new file mode 100644 > > > index 0000000000..747ddb4ebc > > > --- /dev/null > > > +++ b/prefix-map.c > > > @@ -0,0 +1,109 @@ > > > +#include "cache.h" > > > +#include "prefix-map.h" > > > + > > > +static int map_cmp(const void *unused_cmp_data, > > > + const void *entry, > > > + const void *entry_or_key, > > > + const void *unused_keydata) > > > +{ > > > + const struct prefix_map_entry *a = entry; > > > + const struct prefix_map_entry *b = entry_or_key; > > > + > > > + return a->prefix_length != b->prefix_length || > > > + strncmp(a->name, b->name, a->prefix_length); > > > +} > > > + > > > +static void add_prefix_entry(struct hashmap *map, const char *name, > > > + size_t prefix_length, struct prefix_item *item) > > > +{ > > > + struct prefix_map_entry *result = xmalloc(sizeof(*result)); > > > + result->name = name; > > > + result->prefix_length = prefix_length; > > > + result->item = item; > > > + hashmap_entry_init(result, memhash(name, prefix_length)); > > > + hashmap_add(map, result); > > > +} > > > + > > > +static void init_prefix_map(struct prefix_map *prefix_map, > > > + int min_prefix_length, int max_prefix_length) > > > +{ > > > + hashmap_init(&prefix_map->map, map_cmp, NULL, 0); > > > + prefix_map->min_length = min_prefix_length; > > > + prefix_map->max_length = max_prefix_length; > > > +} > > > + > > > +static void add_prefix_item(struct prefix_map *prefix_map, > > > + struct prefix_item *item) > > > +{ > > > + struct prefix_map_entry e = { { NULL } }, *e2; > > > + int j; > > > + > > > + e.item = item; > > > + e.name = item->name; > > > + > > > + for (j = prefix_map->min_length; > > > + j <= prefix_map->max_length && e.name[j]; j++) { > > > + /* Avoid breaking UTF-8 multi-byte sequences */ Ok, that's a good idea, but... > > > + if (!isascii(e.name[j])) > > > + break; This doesn't seem right. It just breaks the loop, potentially leaving 'item->prefix_length' set from the previous iteration. Try the two strings "123" and "12ő", in this order: they both get prefix_length = 3, i.e. the prefix ends between the two bytes of the multi-byte character. Furthermore, neither the docstring of find_unique_prefixes() nor the commit message mention what is supposed to happen with paths with multi-byte characters. I think at least one of them should. > > > + > > > + e.prefix_length = j; > > > + hashmap_entry_init(&e, memhash(e.name, j)); > > > + e2 = hashmap_get(&prefix_map->map, &e, NULL); > > > + if (!e2) { > > > + /* prefix is unique at this stage */ > > > + item->prefix_length = j; > > > + add_prefix_entry(&prefix_map->map, e.name, j, item); > > > + break; > > > + } > > > + > > > + if (!e2->item) > > > + continue; /* non-unique prefix */ > > > + > > > + if (j != e2->item->prefix_length || memcmp(e.name, e2->name, j)) > > > + BUG("unexpected prefix length: %d != %d (%s != %s)", > > > + j, (int)e2->item->prefix_length, e.name, e2->name); > > > + > > > + /* skip common prefix */ > > > + for (; j < prefix_map->max_length && e.name[j]; j++) { > > > + if (e.item->name[j] != e2->item->name[j]) > > > + break; > > > + add_prefix_entry(&prefix_map->map, e.name, j + 1, > > > + NULL); > > > + } > > > + > > > + /* e2 no longer refers to a unique prefix */ > > > + if (j < prefix_map->max_length && e2->name[j]) { > > > + /* found a new unique prefix for e2's item */ > > > + e2->item->prefix_length = j + 1; > > > + add_prefix_entry(&prefix_map->map, e2->name, j + 1, > > > + e2->item); > > > + } > > > + else > > > + e2->item->prefix_length = 0; > > > + e2->item = NULL; > > > + > > > + if (j < prefix_map->max_length && e.name[j]) { > > > + /* found a unique prefix for the item */ > > > + e.item->prefix_length = j + 1; > > > + add_prefix_entry(&prefix_map->map, e.name, j + 1, > > > + e.item); > > > + } else > > > + /* item has no (short enough) unique prefix */ > > > + e.item->prefix_length = 0; > > > + > > > + break; > > > + } > > > +} > > > + > > > +void find_unique_prefixes(struct prefix_item **list, size_t nr, > > > + int min_length, int max_length) > > > +{ > > > + int i; > > > + struct prefix_map prefix_map; > > > + > > > + init_prefix_map(&prefix_map, min_length, max_length); > > > + for (i = 0; i < nr; i++) > > > + add_prefix_item(&prefix_map, list[i]); > > > + hashmap_free(&prefix_map.map, 1); > > > +} > > > > Between the commit message, the in-code comment, the names of the new > > files, and implementation I was left somewhat confused about what this > > is about and how it works. TBH, I didn't even try to understand how > > all the above works, in particular the add_prefix_item() function. > > Let me try to explain it here, and maybe you can help me by suggesting > an improved commit message and/or code comments? > > The problem is this: given a set of items with labels (e.g. file names), > find, for each item, the unique prefix that identifies it. Example: > given the files `hello.txt`, `heaven.txt` and `hell.txt`, the items' > unique prefixes would be `hello`, `hea` and `hell.`, respectively. > > In `git add -i`, we actually only want to allow alphanumerical prefixes, > and we also want at least one, and at most three characters, so only the > second item would have an admissible unique prefix: `hea`. You say "at most three characters", but I notive that the call added to 'add-interactive.c' in a later commit specifies 4. I wonder what's the reason for the minimum prefix length, though. I mean, you can't really have a unique prefix shorter than one character... > > However, I think it would be much-much simpler to first sort (a copy > > of?) the array of prefix item pointers based on their 'name' field, > > and then look for a unique prefix in each neighboring pair. Perhaps > > it would even be faster, because it doesn't have to allocate a bunch > > of hashmap items, though I don't think that it matters much in > > practice (i.e. I expect the number of items to be fairly small; > > presumably nobody will run interactive add after a mass refactoring > > modifying thousands of files). > > The time complexity of the sorted list would be O(n*log(n)), while the > hashmap-based complexity would be an amortized O(n). Oh, so you don't know how/why it works, either!? ;) This hashmap-based implementation calls hashmap_get() approx. n * (max_length - min_length) times. Furthermore, depending on the strings and 'max_length', it can call add_prefix_entry() more than n times. E.g. the six strings "a1", "a2", "ab1", "ab2", "abc1", "abc2", in this order, and max_length=4 result in 9 add_prefix_entry() calls. And each add_prefix_entry() call involves a memory allocation. Ok, that's theory, let's see practice. For a (very) quick'n'dirty performance test I replaced those five items in 't/helper/test-prefix-map.c' with the output of: git -C ..../webkit.git/ ls-tree -r --name-only HEAD | sed -e 's/^/{ "/; s/$/" },/' That's 280734 real paths. Then a simple 'time test-tool prefix-map' with the hashmap-based implementation ran for 0.055s, while using my PoC sort-based algorithm it ran for 0.063s. So while the hashmap-based implementation is indeed faster, they are both well below 0.1s, so in the context of 'git add -i' the performance difference between the two approaches doesn't really matter. If we look beyond 'git add -i's requirements, and specify larger max_length values, the hashmap-based implementation gets progressively slower. Already at max_length=5 it's a tad slower than the sort-based implementation, and at 9 it takes twice as long. The current hashmap-based implementation is ~50% more lines than my sort-based PoC. I'll send it out in a minute. > And yes, you would not _want_ to run interactive add after a mass > refactoring. But it happens. It happens to me more times than I care to > admit. And you know what? I really appreciate that even the Perl version > is relatively snappy in those circumstances. > > > > diff --git a/prefix-map.h b/prefix-map.h > > > new file mode 100644 > > > index 0000000000..ce3b8a4a32 > > > --- /dev/null > > > +++ b/prefix-map.h > > > @@ -0,0 +1,40 @@ > > > +#ifndef PREFIX_MAP_H > > > +#define PREFIX_MAP_H > > > > > > +#include "hashmap.h" > > > + > > > +struct prefix_item { > > > + const char *name; > > > + size_t prefix_length; > > > +}; > > > > This struct is part of find_unique_prefixes()'s signature, good. > > > > > +struct prefix_map_entry { > > > + struct hashmap_entry e; > > > + const char *name; > > > + size_t prefix_length; > > > + /* if item is NULL, the prefix is not unique */ > > > + struct prefix_item *item; > > > +}; > > > + > > > +struct prefix_map { > > > + struct hashmap map; > > > + int min_length, max_length; > > > +}; > > > > However, neither of these two structs nor the hashmap appear in the > > function's signature, but are all implementation details. Therefore, > > they should not be defined and included here in the header but in the > > .c source file. (But as mentioned above, I think this could be > > implemented much simpler without these data structures.) > > Right you are! > > > Furthermore, this is not a map. > > A map, in general, is a container of key-value pairs that allows > > efficient insertion, removal and lookup. This so-called prefix_map > > does none of that, so it should not be called a map. > > What would you call it instead? Well, I would rather get rid of it in the first place :) In my PoC I called the files 'unique-prefix.{c,h}'. > > > +/* > > > + * Find unique prefixes in a given list of strings. > > > > ... and stores the length of the unique prefixes in the > > 'prefix_length' field of the elements of the given array. > > Good idea. I changed it to also explain what is meant by "unique > prefix": > > * Given a list of names, find unique prefixes (i.e. the first characters > * that uniquely identify the names) and store the lengths of the unique > * prefixes in the 'prefix_length' field of the elements of the given array.. Sounds good. > > > + * > > > + * Typically, the `struct prefix_item` information will be but a field in the > > > + * actual item struct; For this reason, the `list` parameter is specified as a > > > + * list of pointers to the items. > > > + * > > > + * The `min_length`/`max_length` parameters define what length the unique > > > + * prefixes should have. > > > + * > > > + * If no unique prefix could be found for a given item, its `prefix_length` > > > + * will be set to 0. I ran into cases where the hashmap-based implementation didn't set 'prefix_length' to 0 when it, in my opinion, should have, but left it untouched. Just try any string shorter than 'min_length'. May or may not be related: try "ab", "ab", "abc", in this order: both "ab" get their 'prefix_length' set to 0, as they should, but the 'prefix_length' of "abc" is left untouched, although it should be set to 3. Hmm, now that I've mentioned "in this order" three times already: another benefit of the sort-based approach is that the order of elements doesn't matter. > > > + */ > > > +void find_unique_prefixes(struct prefix_item **list, size_t nr, > > > > The first argument is not a list but an array. > > Indeed. > > > > + int min_length, int max_length); > > > > size_t, perhaps? These are closely related to > > 'prefix_item.prefix_length', which is already (rightfully) size_t. > > True. > > > > +#endif > > > diff --git a/t/helper/test-prefix-map.c b/t/helper/test-prefix-map.c > > > new file mode 100644 > > > index 0000000000..3f1c90eaf0 > > > --- /dev/null > > > +++ b/t/helper/test-prefix-map.c > > > @@ -0,0 +1,58 @@ > > > +#include "test-tool.h" > > > +#include "cache.h" > > > +#include "prefix-map.h" > > > + > > > +static size_t test_count, failed_count; > > > + > > > +static void check(int succeeded, const char *file, size_t line_no, > > > + const char *fmt, ...) > > > +{ > > > + va_list ap; > > > + > > > + test_count++; > > > + if (succeeded) > > > + return; > > > + > > > + va_start(ap, fmt); > > > + fprintf(stderr, "%s:%d: ", file, (int)line_no); > > > + vfprintf(stderr, fmt, ap); > > > + fputc('\n', stderr); > > > + va_end(ap); > > > + > > > + failed_count++; > > > +} > > > + > > > +#define EXPECT_SIZE_T_EQUALS(expect, actual, hint) \ > > > + check(expect == actual, __FILE__, __LINE__, \ > > > + "size_t's do not match: %" \ > > > + PRIdMAX " != %" PRIdMAX " (%s) (%s)", \ > > > + (intmax_t)expect, (intmax_t)actual, #actual, hint) > > > + > > > +int cmd__prefix_map(int argc, const char **argv) > > > +{ > > > +#define NR 5 > > > + struct prefix_item items[NR] = { > > > > You don't have to tell the compiler how many elements this array will > > contain, it will figure that out on its own. > > > > > + { "unique" }, > > > + { "hell" }, > > > + { "hello" }, > > > + { "wok" }, > > > + { "world" }, > > > + }; > > > + struct prefix_item *list[NR] = { > > > > Likewise. > > That is correct. > > What the compiler _cannot_ figure out, on its own, however, is that > `items` and `list` _need_ to contain the same number of items. > > Hence the need for `NR`. OK, that's a good point. I'd still like to get rid of that NR with something like this instead: struct prefix_item items[] = { .... }; struct prefix_item *list[ARRAY_SIZE(items)]; for (i = 0; i < ARRAY_SIZE(items); i++) { list[i] = &items[i]; list[i]->prefix_length = 12345; /* dummy */ } find_unique_prefixes(list, ARRAY_SIZE(list), 0, 3); Note the dummy value, to catch cases when 'prefix_length' is left untouched. This made this test tool more convenient to use, because I didn't have to fiddle with NR and, worse, with those 'items, items+1, items+N'. It's still a bit PITA (having to re-build each time I change the test vector), though, compared to driving it from stdin and verifying it through its stdout. > > > + items, items + 1, items + 2, items + 3, items + 4 > > > + }; > > > + > > > + find_unique_prefixes(list, NR, 1, 3); > > > > This could be find_unique_prefixes(list, ARRAY_SIZE(list), 1, 3), and > > then there is no need for that NR macro anymore. > > > > > + > > > +#define EXPECT_PREFIX_LENGTH_EQUALS(expect, index) \ > > > + EXPECT_SIZE_T_EQUALS(expect, list[index]->prefix_length, \ > > > + list[index]->name) > > > + > > > + EXPECT_PREFIX_LENGTH_EQUALS(1, 0); > > > + EXPECT_PREFIX_LENGTH_EQUALS(0, 1); > > > + EXPECT_PREFIX_LENGTH_EQUALS(0, 2); > > > + EXPECT_PREFIX_LENGTH_EQUALS(3, 3); > > > + EXPECT_PREFIX_LENGTH_EQUALS(3, 4); > > > + > > > + return !!failed_count; > > > +} > > Thank you for your review! > Dscho