All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v1 0/4] hashmap improvements
@ 2014-07-02 22:18 Karsten Blees
  2014-07-02 22:20 ` [PATCH v1 1/4] hashmap: factor out getting an int hash code from a, SHA1 Karsten Blees
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Karsten Blees @ 2014-07-02 22:18 UTC (permalink / raw)
  To: Git List; +Cc: Tanay Abhra, Matthieu Moy

Here are a few small hashmap improvements, partly resulting from recent
discussion of the config-cache topic.

Karsten Blees (4):
  hashmap: factor out getting an int hash code from a SHA1
  hashmap: improve struct hashmap member documentation
  hashmap: add simplified hashmap_get_from_hash() API
  hashmap: add string interning API

 Documentation/technical/api-hashmap.txt | 54 ++++++++++++++++++++++++++++++---
 builtin/describe.c                      | 13 ++------
 decorate.c                              |  5 +--
 diffcore-rename.c                       | 11 +++----
 hashmap.c                               | 38 +++++++++++++++++++++++
 hashmap.h                               | 27 +++++++++++++++++
 khash.h                                 | 11 ++-----
 name-hash.c                             |  5 ++-
 object.c                                | 13 +-------
 pack-objects.c                          |  5 ++-
 t/t0011-hashmap.sh                      | 13 ++++++++
 test-hashmap.c                          | 25 ++++++++++-----
 12 files changed, 159 insertions(+), 61 deletions(-)

-- 
1.9.4.msysgit.0.dirty

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v1 1/4] hashmap: factor out getting an int hash code from a, SHA1
  2014-07-02 22:18 [PATCH v1 0/4] hashmap improvements Karsten Blees
@ 2014-07-02 22:20 ` Karsten Blees
  2014-07-07 17:22   ` Junio C Hamano
  2014-07-02 22:21 ` [PATCH v1 2/4] hashmap: improve struct hashmap member documentation Karsten Blees
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Karsten Blees @ 2014-07-02 22:20 UTC (permalink / raw)
  To: Git List

Copying the first bytes of a SHA1 is duplicated in six places, however,
the implications (wrong byte order on little-endian systems) is documented
only once.

Add a properly documented API for this.

Signed-off-by: Karsten Blees <blees@dcon.de>
---
 Documentation/technical/api-hashmap.txt |  9 +++++++++
 builtin/describe.c                      | 11 ++---------
 decorate.c                              |  5 +----
 diffcore-rename.c                       |  4 +---
 hashmap.h                               | 11 +++++++++++
 khash.h                                 | 11 ++---------
 object.c                                | 13 +------------
 pack-objects.c                          |  5 ++---
 8 files changed, 29 insertions(+), 40 deletions(-)

diff --git a/Documentation/technical/api-hashmap.txt b/Documentation/technical/api-hashmap.txt
index b977ae8..4689968 100644
--- a/Documentation/technical/api-hashmap.txt
+++ b/Documentation/technical/api-hashmap.txt
@@ -58,6 +58,15 @@ Functions
 +
 `strihash` and `memihash` are case insensitive versions.
 
+`unsigned int sha1hash(const unsigned char *sha1)`::
+
+	Converts a cryptographic hash (e.g. SHA-1) into an int-sized hash code
+	for use in hash tables. Cryptographic hashes are supposed to have
+	uniform distribution, so in contrast to `memhash()`, this just copies
+	the first `sizeof(int)` bytes without shuffling any bits. Note that
+	the results will be different on big-endian and little-endian
+	platforms, so they should not be stored or transferred over the net!
+
 `void hashmap_init(struct hashmap *map, hashmap_cmp_fn equals_function, size_t initial_size)`::
 
 	Initializes a hashmap structure.
diff --git a/builtin/describe.c b/builtin/describe.c
index 24d740c..57e84c8 100644
--- a/builtin/describe.c
+++ b/builtin/describe.c
@@ -56,17 +56,10 @@ static int commit_name_cmp(const struct commit_name *cn1,
 	return hashcmp(cn1->peeled, peeled ? peeled : cn2->peeled);
 }
 
-static inline unsigned int hash_sha1(const unsigned char *sha1)
-{
-	unsigned int hash;
-	memcpy(&hash, sha1, sizeof(hash));
-	return hash;
-}
-
 static inline struct commit_name *find_commit_name(const unsigned char *peeled)
 {
 	struct commit_name key;
-	hashmap_entry_init(&key, hash_sha1(peeled));
+	hashmap_entry_init(&key, sha1hash(peeled));
 	return hashmap_get(&names, &key, peeled);
 }
 
@@ -114,7 +107,7 @@ static void add_to_known_names(const char *path,
 		if (!e) {
 			e = xmalloc(sizeof(struct commit_name));
 			hashcpy(e->peeled, peeled);
-			hashmap_entry_init(e, hash_sha1(peeled));
+			hashmap_entry_init(e, sha1hash(peeled));
 			hashmap_add(&names, e);
 			e->path = NULL;
 		}
diff --git a/decorate.c b/decorate.c
index 7cb5d29..b2aac90 100644
--- a/decorate.c
+++ b/decorate.c
@@ -8,10 +8,7 @@
 
 static unsigned int hash_obj(const struct object *obj, unsigned int n)
 {
-	unsigned int hash;
-
-	memcpy(&hash, obj->sha1, sizeof(unsigned int));
-	return hash % n;
+	return sha1hash(obj->sha1) % n;
 }
 
 static void *insert_decoration(struct decoration *n, const struct object *base, void *decoration)
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 749a35d..6fa97d4 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -242,14 +242,12 @@ struct file_similarity {
 
 static unsigned int hash_filespec(struct diff_filespec *filespec)
 {
-	unsigned int hash;
 	if (!filespec->sha1_valid) {
 		if (diff_populate_filespec(filespec, 0))
 			return 0;
 		hash_sha1_file(filespec->data, filespec->size, "blob", filespec->sha1);
 	}
-	memcpy(&hash, filespec->sha1, sizeof(hash));
-	return hash;
+	return sha1hash(filespec->sha1);
 }
 
 static int find_identical_files(struct hashmap *srcs,
diff --git a/hashmap.h b/hashmap.h
index a816ad4..ed5425a 100644
--- a/hashmap.h
+++ b/hashmap.h
@@ -13,6 +13,17 @@ extern unsigned int strihash(const char *buf);
 extern unsigned int memhash(const void *buf, size_t len);
 extern unsigned int memihash(const void *buf, size_t len);
 
+static inline unsigned int sha1hash(const unsigned char *sha1)
+{
+	/*
+	 * Equivalent to 'return *(int *)sha1;', but safe on platforms that
+	 * don't support unaligned reads.
+	 */
+	unsigned int hash;
+	memcpy(&hash, sha1, sizeof(hash));
+	return hash;
+}
+
 /* data structures */
 
 struct hashmap_entry {
diff --git a/khash.h b/khash.h
index 57ff603..06c7906 100644
--- a/khash.h
+++ b/khash.h
@@ -320,19 +320,12 @@ static const double __ac_HASH_UPPER = 0.77;
 		code;												\
 	} }
 
-static inline khint_t __kh_oid_hash(const unsigned char *oid)
-{
-	khint_t hash;
-	memcpy(&hash, oid, sizeof(hash));
-	return hash;
-}
-
 #define __kh_oid_cmp(a, b) (hashcmp(a, b) == 0)
 
-KHASH_INIT(sha1, const unsigned char *, void *, 1, __kh_oid_hash, __kh_oid_cmp)
+KHASH_INIT(sha1, const unsigned char *, void *, 1, sha1hash, __kh_oid_cmp)
 typedef kh_sha1_t khash_sha1;
 
-KHASH_INIT(sha1_pos, const unsigned char *, int, 1, __kh_oid_hash, __kh_oid_cmp)
+KHASH_INIT(sha1_pos, const unsigned char *, int, 1, sha1hash, __kh_oid_cmp)
 typedef kh_sha1_pos_t khash_sha1_pos;
 
 #endif /* __AC_KHASH_H */
diff --git a/object.c b/object.c
index 57a0890..73acaea 100644
--- a/object.c
+++ b/object.c
@@ -50,18 +50,7 @@ int type_from_string(const char *str)
  */
 static unsigned int hash_obj(const unsigned char *sha1, unsigned int n)
 {
-	unsigned int hash;
-
-	/*
-	 * Since the sha1 is essentially random, we just take the
-	 * required number of bits directly from the first
-	 * sizeof(unsigned int) bytes of sha1.  First we have to copy
-	 * the bytes into a properly aligned integer.  If we cared
-	 * about getting consistent results across architectures, we
-	 * would have to call ntohl() here, too.
-	 */
-	memcpy(&hash, sha1, sizeof(unsigned int));
-	return hash & (n - 1);
+	return sha1hash(sha1) & (n - 1);
 }
 
 /*
diff --git a/pack-objects.c b/pack-objects.c
index 4f36c32..9992f3e 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -7,10 +7,9 @@ static uint32_t locate_object_entry_hash(struct packing_data *pdata,
 					 const unsigned char *sha1,
 					 int *found)
 {
-	uint32_t i, hash, mask = (pdata->index_size - 1);
+	uint32_t i, mask = (pdata->index_size - 1);
 
-	memcpy(&hash, sha1, sizeof(uint32_t));
-	i = hash & mask;
+	i = sha1hash(sha1) & mask;
 
 	while (pdata->index[i] > 0) {
 		uint32_t pos = pdata->index[i] - 1;
-- 
1.9.4.msysgit.0.dirty

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v1 2/4] hashmap: improve struct hashmap member documentation
  2014-07-02 22:18 [PATCH v1 0/4] hashmap improvements Karsten Blees
  2014-07-02 22:20 ` [PATCH v1 1/4] hashmap: factor out getting an int hash code from a, SHA1 Karsten Blees
@ 2014-07-02 22:21 ` Karsten Blees
  2014-07-02 22:22 ` [PATCH v1 3/4] hashmap: add simplified hashmap_get_from_hash() API Karsten Blees
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Karsten Blees @ 2014-07-02 22:21 UTC (permalink / raw)
  To: Git List

Signed-off-by: Karsten Blees <blees@dcon.de>
---
 Documentation/technical/api-hashmap.txt | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/Documentation/technical/api-hashmap.txt b/Documentation/technical/api-hashmap.txt
index 4689968..dc21a7c 100644
--- a/Documentation/technical/api-hashmap.txt
+++ b/Documentation/technical/api-hashmap.txt
@@ -8,11 +8,19 @@ Data Structures
 
 `struct hashmap`::
 
-	The hash table structure.
+	The hash table structure. Members can be used as follows, but should
+	not be modified directly:
 +
-The `size` member keeps track of the total number of entries. The `cmpfn`
-member is a function used to compare two entries for equality. The `table` and
-`tablesize` members store the hash table and its size, respectively.
+The `size` member keeps track of the total number of entries (0 means the
+hashmap is empty).
++
+`tablesize` is the allocated size of the hash table. A non-0 value indicates
+that the hashmap is initialized. It may also be useful for statistical purposes
+(i.e. `size / tablesize` is the current load factor).
++
+`cmpfn` stores the comparison function specified in `hashmap_init()`. In
+advanced scenarios, it may be useful to change this, e.g. to switch between
+case-sensitive and case-insensitive lookup.
 
 `struct hashmap_entry`::
 
-- 
1.9.4.msysgit.0.dirty

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v1 3/4] hashmap: add simplified hashmap_get_from_hash() API
  2014-07-02 22:18 [PATCH v1 0/4] hashmap improvements Karsten Blees
  2014-07-02 22:20 ` [PATCH v1 1/4] hashmap: factor out getting an int hash code from a, SHA1 Karsten Blees
  2014-07-02 22:21 ` [PATCH v1 2/4] hashmap: improve struct hashmap member documentation Karsten Blees
@ 2014-07-02 22:22 ` Karsten Blees
  2014-07-07 17:43   ` Junio C Hamano
  2014-07-02 22:22 ` [PATCH v1 4/4] hashmap: add string interning API Karsten Blees
  2014-07-03  7:23 ` [PATCH v1 0/4] hashmap improvements Matthieu Moy
  4 siblings, 1 reply; 12+ messages in thread
From: Karsten Blees @ 2014-07-02 22:22 UTC (permalink / raw)
  To: Git List

Hashmap entries are typically looked up by just a key. The hashmap_get()
API expects an initialized entry structure instead, to support compound
keys. This flexibility is currently only needed by find_dir_entry() in
name-hash.c (and compat/win32/fscache.c in the msysgit fork). All other
(currently five) call sites of hashmap_get() have to set up a near emtpy
entry structure, resulting in duplicate code like this:

  struct hashmap_entry keyentry;
  hashmap_entry_init(&keyentry, hash(key));
  return hashmap_get(map, &keyentry, key);

Add a hashmap_get_from_hash() API that allows hashmap lookups by just
specifying the key and its hash code, i.e.:

  return hashmap_get_from_hash(map, hash(key), key);

Signed-off-by: Karsten Blees <blees@dcon.de>
---
 Documentation/technical/api-hashmap.txt | 14 ++++++++++++++
 builtin/describe.c                      |  4 +---
 diffcore-rename.c                       |  7 +++----
 hashmap.h                               |  8 ++++++++
 name-hash.c                             |  5 ++---
 test-hashmap.c                          | 11 +++--------
 6 files changed, 31 insertions(+), 18 deletions(-)

diff --git a/Documentation/technical/api-hashmap.txt b/Documentation/technical/api-hashmap.txt
index dc21a7c..f9215d6 100644
--- a/Documentation/technical/api-hashmap.txt
+++ b/Documentation/technical/api-hashmap.txt
@@ -118,6 +118,20 @@ hashmap_entry) that has at least been initialized with the proper hash code
 If an entry with matching hash code is found, `key` and `keydata` are passed
 to `hashmap_cmp_fn` to decide whether the entry matches the key.
 
+`void *hashmap_get_from_hash(const struct hashmap *map, unsigned int hash, const void *keydata)`::
+
+	Returns the hashmap entry for the specified hash code and key data,
+	or NULL if not found.
++
+`map` is the hashmap structure.
++
+`hash` is the hash code of the entry to look up.
++
+If an entry with matching hash code is found, `keydata` is passed to
+`hashmap_cmp_fn` to decide whether the entry matches the key. The
+`entry_or_key` parameter points to a bogus hashmap_entry structure that
+should not be used in the comparison.
+
 `void *hashmap_get_next(const struct hashmap *map, const void *entry)`::
 
 	Returns the next equal hashmap entry, or NULL if not found. This can be
diff --git a/builtin/describe.c b/builtin/describe.c
index 57e84c8..ee6a3b9 100644
--- a/builtin/describe.c
+++ b/builtin/describe.c
@@ -58,9 +58,7 @@ static int commit_name_cmp(const struct commit_name *cn1,
 
 static inline struct commit_name *find_commit_name(const unsigned char *peeled)
 {
-	struct commit_name key;
-	hashmap_entry_init(&key, sha1hash(peeled));
-	return hashmap_get(&names, &key, peeled);
+	return hashmap_get_from_hash(&names, sha1hash(peeled), peeled);
 }
 
 static int replace_name(struct commit_name *e,
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 6fa97d4..2e44a37 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -257,15 +257,14 @@ static int find_identical_files(struct hashmap *srcs,
 	int renames = 0;
 
 	struct diff_filespec *target = rename_dst[dst_index].two;
-	struct file_similarity *p, *best, dst;
+	struct file_similarity *p, *best = NULL;
 	int i = 100, best_score = -1;
 
 	/*
 	 * Find the best source match for specified destination.
 	 */
-	best = NULL;
-	hashmap_entry_init(&dst, hash_filespec(target));
-	for (p = hashmap_get(srcs, &dst, NULL); p; p = hashmap_get_next(srcs, p)) {
+	p = hashmap_get_from_hash(srcs, hash_filespec(target), NULL);
+	for (; p; p = hashmap_get_next(srcs, p)) {
 		int score;
 		struct diff_filespec *source = p->filespec;
 
diff --git a/hashmap.h b/hashmap.h
index ed5425a..12f0668 100644
--- a/hashmap.h
+++ b/hashmap.h
@@ -68,6 +68,14 @@ extern void *hashmap_put(struct hashmap *map, void *entry);
 extern void *hashmap_remove(struct hashmap *map, const void *key,
 		const void *keydata);
 
+static inline void *hashmap_get_from_hash(const struct hashmap *map,
+		unsigned int hash, const void *keydata)
+{
+	struct hashmap_entry key;
+	hashmap_entry_init(&key, hash);
+	return hashmap_get(map, &key, keydata);
+}
+
 /* hashmap_iter functions */
 
 extern void hashmap_iter_init(struct hashmap *map, struct hashmap_iter *iter);
diff --git a/name-hash.c b/name-hash.c
index 49fd508..702cd05 100644
--- a/name-hash.c
+++ b/name-hash.c
@@ -213,12 +213,11 @@ struct cache_entry *index_dir_exists(struct index_state *istate, const char *nam
 struct cache_entry *index_file_exists(struct index_state *istate, const char *name, int namelen, int icase)
 {
 	struct cache_entry *ce;
-	struct hashmap_entry key;
 
 	lazy_init_name_hash(istate);
 
-	hashmap_entry_init(&key, memihash(name, namelen));
-	ce = hashmap_get(&istate->name_hash, &key, NULL);
+	ce = hashmap_get_from_hash(&istate->name_hash,
+				   memihash(name, namelen), NULL);
 	while (ce) {
 		if (same_name(ce, name, namelen, icase))
 			return ce;
diff --git a/test-hashmap.c b/test-hashmap.c
index f5183fb..3c9f67b 100644
--- a/test-hashmap.c
+++ b/test-hashmap.c
@@ -115,9 +115,8 @@ static void perf_hashmap(unsigned int method, unsigned int rounds)
 
 		for (j = 0; j < rounds; j++) {
 			for (i = 0; i < TEST_SIZE; i++) {
-				struct hashmap_entry key;
-				hashmap_entry_init(&key, hashes[i]);
-				hashmap_get(&map, &key, entries[i]->key);
+				hashmap_get_from_hash(&map, hashes[i],
+						      entries[i]->key);
 			}
 		}
 
@@ -199,12 +198,8 @@ int main(int argc, char *argv[])
 
 		} else if (!strcmp("get", cmd) && l1) {
 
-			/* setup static key */
-			struct hashmap_entry key;
-			hashmap_entry_init(&key, hash);
-
 			/* lookup entry in hashmap */
-			entry = hashmap_get(&map, &key, p1);
+			entry = hashmap_get_from_hash(&map, hash, p1);
 
 			/* print result */
 			if (!entry)
-- 
1.9.4.msysgit.0.dirty

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v1 4/4] hashmap: add string interning API
  2014-07-02 22:18 [PATCH v1 0/4] hashmap improvements Karsten Blees
                   ` (2 preceding siblings ...)
  2014-07-02 22:22 ` [PATCH v1 3/4] hashmap: add simplified hashmap_get_from_hash() API Karsten Blees
@ 2014-07-02 22:22 ` Karsten Blees
  2014-07-03  7:22   ` Matthieu Moy
  2014-07-07 17:44   ` Junio C Hamano
  2014-07-03  7:23 ` [PATCH v1 0/4] hashmap improvements Matthieu Moy
  4 siblings, 2 replies; 12+ messages in thread
From: Karsten Blees @ 2014-07-02 22:22 UTC (permalink / raw)
  To: Git List

Interning short strings with high probability of duplicates can reduce the
memory footprint and speed up comparisons.

Add strintern() and memintern() APIs that use a hashmap to manage the pool
of unique, interned strings.

Note: strintern(getenv()) could be used to sanitize git's use of getenv(),
in case we ever encounter a platform where a call to getenv() invalidates
previous getenv() results (which is allowed by POSIX).

Signed-off-by: Karsten Blees <blees@dcon.de>
---
 Documentation/technical/api-hashmap.txt | 15 +++++++++++++
 hashmap.c                               | 38 +++++++++++++++++++++++++++++++++
 hashmap.h                               |  8 +++++++
 t/t0011-hashmap.sh                      | 13 +++++++++++
 test-hashmap.c                          | 14 ++++++++++++
 5 files changed, 88 insertions(+)

diff --git a/Documentation/technical/api-hashmap.txt b/Documentation/technical/api-hashmap.txt
index f9215d6..00c4c29 100644
--- a/Documentation/technical/api-hashmap.txt
+++ b/Documentation/technical/api-hashmap.txt
@@ -193,6 +193,21 @@ more entries.
 `hashmap_iter_first` is a combination of both (i.e. initializes the iterator
 and returns the first entry, if any).
 
+`const char *strintern(const char *string)`::
+`const void *memintern(const void *data, size_t len)`::
+
+	Returns the unique, interned version of the specified string or data,
+	similar to the `String.intern` API in Java and .NET, respectively.
+	Interned strings remain valid for the entire lifetime of the process.
++
+Can be used as `[x]strdup()` or `xmemdupz` replacement, except that interned
+strings / data must not be modified or freed.
++
+Interned strings are best used for short strings with high probability of
+duplicates.
++
+Uses a hashmap to store the pool of interned strings.
+
 Usage example
 -------------
 
diff --git a/hashmap.c b/hashmap.c
index d1b8056..f693839 100644
--- a/hashmap.c
+++ b/hashmap.c
@@ -226,3 +226,41 @@ void *hashmap_iter_next(struct hashmap_iter *iter)
 		current = iter->map->table[iter->tablepos++];
 	}
 }
+
+struct pool_entry {
+	struct hashmap_entry ent;
+	size_t len;
+	unsigned char data[FLEX_ARRAY];
+};
+
+static int pool_entry_cmp(const struct pool_entry *e1,
+			  const struct pool_entry *e2,
+			  const unsigned char *keydata)
+{
+	return e1->data != keydata &&
+	       (e1->len != e2->len || memcmp(e1->data, keydata, e1->len));
+}
+
+const void *memintern(const void *data, size_t len)
+{
+	static struct hashmap map;
+	struct pool_entry key, *e;
+
+	/* initialize string pool hashmap */
+	if (!map.tablesize)
+		hashmap_init(&map, (hashmap_cmp_fn) pool_entry_cmp, 0);
+
+	/* lookup interned string in pool */
+	hashmap_entry_init(&key, memhash(data, len));
+	key.len = len;
+	e = hashmap_get(&map, &key, data);
+	if (!e) {
+		/* not found: create it */
+		e = xmallocz(sizeof(struct pool_entry) + len);
+		hashmap_entry_init(e, key.ent.hash);
+		e->len = len;
+		memcpy(e->data, data, len);
+		hashmap_add(&map, e);
+	}
+	return e->data;
+}
diff --git a/hashmap.h b/hashmap.h
index 12f0668..507884b 100644
--- a/hashmap.h
+++ b/hashmap.h
@@ -87,4 +87,12 @@ static inline void *hashmap_iter_first(struct hashmap *map,
 	return hashmap_iter_next(iter);
 }
 
+/* string interning */
+
+extern const void *memintern(const void *data, size_t len);
+static inline const char *strintern(const char *string)
+{
+	return memintern(string, strlen(string));
+}
+
 #endif
diff --git a/t/t0011-hashmap.sh b/t/t0011-hashmap.sh
index 391e2b6..f97c805 100755
--- a/t/t0011-hashmap.sh
+++ b/t/t0011-hashmap.sh
@@ -237,4 +237,17 @@ test_expect_success 'grow / shrink' '
 
 '
 
+test_expect_success 'string interning' '
+
+test_hashmap "intern value1
+intern Value1
+intern value2
+intern value2
+" "value1
+Value1
+value2
+value2"
+
+'
+
 test_done
diff --git a/test-hashmap.c b/test-hashmap.c
index 3c9f67b..07aa7ec 100644
--- a/test-hashmap.c
+++ b/test-hashmap.c
@@ -234,6 +234,20 @@ int main(int argc, char *argv[])
 			/* print table sizes */
 			printf("%u %u\n", map.tablesize, map.size);
 
+		} else if (!strcmp("intern", cmd) && l1) {
+
+			/* test that strintern works */
+			const char *i1 = strintern(p1);
+			const char *i2 = strintern(p1);
+			if (strcmp(i1, p1))
+				printf("strintern(%s) returns %s\n", p1, i1);
+			else if (i1 == p1)
+				printf("strintern(%s) returns input pointer\n", p1);
+			else if (i1 != i2)
+				printf("strintern(%s) != strintern(%s)", i1, i2);
+			else
+				printf("%s\n", i1);
+
 		} else if (!strcmp("perfhashmap", cmd) && l1 && l2) {
 
 			perf_hashmap(atoi(p1), atoi(p2));
-- 
1.9.4.msysgit.0.dirty

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 4/4] hashmap: add string interning API
  2014-07-02 22:22 ` [PATCH v1 4/4] hashmap: add string interning API Karsten Blees
@ 2014-07-03  7:22   ` Matthieu Moy
  2014-07-07 17:44   ` Junio C Hamano
  1 sibling, 0 replies; 12+ messages in thread
From: Matthieu Moy @ 2014-07-03  7:22 UTC (permalink / raw)
  To: Karsten Blees; +Cc: Git List

Karsten Blees <karsten.blees@gmail.com> writes:

> --- a/t/t0011-hashmap.sh
> +++ b/t/t0011-hashmap.sh
> @@ -237,4 +237,17 @@ test_expect_success 'grow / shrink' '
>  
>  '
>  
> +test_expect_success 'string interning' '
> +
> +test_hashmap "intern value1
> +intern Value1
> +intern value2
> +intern value2
> +" "value1
> +Value1
> +value2
> +value2"
> +
> +'

Indentation is broken here, but it's consistant with the current style
of the file. You may want a "modernize style" patch before yours and
then indent properly, but that shouldn't block inclusion.

-- 
Matthieu Moy
http://www-verimag.imag.fr/~moy/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 0/4] hashmap improvements
  2014-07-02 22:18 [PATCH v1 0/4] hashmap improvements Karsten Blees
                   ` (3 preceding siblings ...)
  2014-07-02 22:22 ` [PATCH v1 4/4] hashmap: add string interning API Karsten Blees
@ 2014-07-03  7:23 ` Matthieu Moy
  4 siblings, 0 replies; 12+ messages in thread
From: Matthieu Moy @ 2014-07-03  7:23 UTC (permalink / raw)
  To: Karsten Blees; +Cc: Git List, Tanay Abhra

Karsten Blees <karsten.blees@gmail.com> writes:

> Here are a few small hashmap improvements, partly resulting from recent
> discussion of the config-cache topic.
>
> Karsten Blees (4):
>   hashmap: factor out getting an int hash code from a SHA1
>   hashmap: improve struct hashmap member documentation
>   hashmap: add simplified hashmap_get_from_hash() API
>   hashmap: add string interning API

The patch series look good to me (but that's not a detailed review,
just a moderately quick look).

-- 
Matthieu Moy
http://www-verimag.imag.fr/~moy/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 1/4] hashmap: factor out getting an int hash code from a, SHA1
  2014-07-02 22:20 ` [PATCH v1 1/4] hashmap: factor out getting an int hash code from a, SHA1 Karsten Blees
@ 2014-07-07 17:22   ` Junio C Hamano
  0 siblings, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2014-07-07 17:22 UTC (permalink / raw)
  To: Karsten Blees; +Cc: Git List

Karsten Blees <karsten.blees@gmail.com> writes:

> Copying the first bytes of a SHA1 is duplicated in six places, however,
> the implications (wrong byte order on little-endian systems) is documented
> only once.

s/wrong /different /; but other than that I think this is a good
change.

> +`unsigned int sha1hash(const unsigned char *sha1)`::
> +
> +	Converts a cryptographic hash (e.g. SHA-1) into an int-sized hash code
> +	for use in hash tables. Cryptographic hashes are supposed to have
> +	uniform distribution, so in contrast to `memhash()`, this just copies
> +	the first `sizeof(int)` bytes without shuffling any bits. Note that
> +	the results will be different on big-endian and little-endian
> +	platforms, so they should not be stored or transferred over the net!

Tone down with s/!/./, perhaps?

Another thing we may want to caution against is to use it as a
tie-breaker that affects the final outcome the user can observe, but
that may be something that goes without saying.  I dunno..

> diff --git a/hashmap.h b/hashmap.h
> index a816ad4..ed5425a 100644
> --- a/hashmap.h
> +++ b/hashmap.h
> @@ -13,6 +13,17 @@ extern unsigned int strihash(const char *buf);
>  extern unsigned int memhash(const void *buf, size_t len);
>  extern unsigned int memihash(const void *buf, size_t len);
>  
> +static inline unsigned int sha1hash(const unsigned char *sha1)
> +{
> +	/*
> +	 * Equivalent to 'return *(int *)sha1;', but safe on platforms that
> +	 * don't support unaligned reads.
> +	 */

s/int/unsigned &/; other than that, the explanation is good.

> +	unsigned int hash;
> +	memcpy(&hash, sha1, sizeof(hash));
> +	return hash;
> +}
> +

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 3/4] hashmap: add simplified hashmap_get_from_hash() API
  2014-07-02 22:22 ` [PATCH v1 3/4] hashmap: add simplified hashmap_get_from_hash() API Karsten Blees
@ 2014-07-07 17:43   ` Junio C Hamano
  2014-07-11 19:11     ` Karsten Blees
  0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2014-07-07 17:43 UTC (permalink / raw)
  To: Karsten Blees; +Cc: Git List

Karsten Blees <karsten.blees@gmail.com> writes:

> Hashmap entries are typically looked up by just a key. The hashmap_get()
> API expects an initialized entry structure instead, to support compound
> keys. This flexibility is currently only needed by find_dir_entry() in
> name-hash.c (and compat/win32/fscache.c in the msysgit fork). All other
> (currently five) call sites of hashmap_get() have to set up a near emtpy

s/emtpy/empty/;

> entry structure, resulting in duplicate code like this:
>
>   struct hashmap_entry keyentry;
>   hashmap_entry_init(&keyentry, hash(key));
>   return hashmap_get(map, &keyentry, key);
>
> Add a hashmap_get_from_hash() API that allows hashmap lookups by just
> specifying the key and its hash code, i.e.:
>
>   return hashmap_get_from_hash(map, hash(key), key);

While I think the *_get_from_hash() is an improvement over the
three-line sequence, I find it somewhat strange that callers of it
still must compute the hash code themselves, instead of letting the
hashmap itself to apply the user supplied hash function to the key
to derive it.  After all, the map must know how to compare two
entries via a user-supplied cmpfn given at the map initialization
time, and the algorithm to derive the hash code falls into the same
category, in the sense that both must stay the same during the
lifetime of a hashmap, no?  Unless there is a situation where you
need to be able to initialize a hashmap_entry without knowing which
hashmap the entry will eventually be stored, it feels dubious API
that exposes to the outside callers hashmap_entry_init() that takes
the hash code without taking the hashmap itself.

In other words, why isn't hashmap_get() more like this:

        void *hashmap_get(const struct hashmap *map, const void *key)
        {
                struct hashmap_entry keyentry;
                hashmap_entry_init(&keyentry, map->hash(key));
                return *find_entry_ptr(map, &keyentry, key);
        }

with hashmap_entry_init() purely a static helper in hashmap.c?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 4/4] hashmap: add string interning API
  2014-07-02 22:22 ` [PATCH v1 4/4] hashmap: add string interning API Karsten Blees
  2014-07-03  7:22   ` Matthieu Moy
@ 2014-07-07 17:44   ` Junio C Hamano
  1 sibling, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2014-07-07 17:44 UTC (permalink / raw)
  To: Karsten Blees; +Cc: Git List

Karsten Blees <karsten.blees@gmail.com> writes:

> Interning short strings with high probability of duplicates can reduce the
> memory footprint and speed up comparisons.
>
> Add strintern() and memintern() APIs that use a hashmap to manage the pool
> of unique, interned strings.
>
> Note: strintern(getenv()) could be used to sanitize git's use of getenv(),
> in case we ever encounter a platform where a call to getenv() invalidates
> previous getenv() results (which is allowed by POSIX).

I think the attribute name/value strings are already interned, so
they may want to be converted to use this (or vice-versa).

>
> Signed-off-by: Karsten Blees <blees@dcon.de>
> ---
>  Documentation/technical/api-hashmap.txt | 15 +++++++++++++
>  hashmap.c                               | 38 +++++++++++++++++++++++++++++++++
>  hashmap.h                               |  8 +++++++
>  t/t0011-hashmap.sh                      | 13 +++++++++++
>  test-hashmap.c                          | 14 ++++++++++++
>  5 files changed, 88 insertions(+)
>
> diff --git a/Documentation/technical/api-hashmap.txt b/Documentation/technical/api-hashmap.txt
> index f9215d6..00c4c29 100644
> --- a/Documentation/technical/api-hashmap.txt
> +++ b/Documentation/technical/api-hashmap.txt
> @@ -193,6 +193,21 @@ more entries.
>  `hashmap_iter_first` is a combination of both (i.e. initializes the iterator
>  and returns the first entry, if any).
>  
> +`const char *strintern(const char *string)`::
> +`const void *memintern(const void *data, size_t len)`::
> +
> +	Returns the unique, interned version of the specified string or data,
> +	similar to the `String.intern` API in Java and .NET, respectively.
> +	Interned strings remain valid for the entire lifetime of the process.
> ++
> +Can be used as `[x]strdup()` or `xmemdupz` replacement, except that interned
> +strings / data must not be modified or freed.
> ++
> +Interned strings are best used for short strings with high probability of
> +duplicates.
> ++
> +Uses a hashmap to store the pool of interned strings.
> +
>  Usage example
>  -------------
>  
> diff --git a/hashmap.c b/hashmap.c
> index d1b8056..f693839 100644
> --- a/hashmap.c
> +++ b/hashmap.c
> @@ -226,3 +226,41 @@ void *hashmap_iter_next(struct hashmap_iter *iter)
>  		current = iter->map->table[iter->tablepos++];
>  	}
>  }
> +
> +struct pool_entry {
> +	struct hashmap_entry ent;
> +	size_t len;
> +	unsigned char data[FLEX_ARRAY];
> +};
> +
> +static int pool_entry_cmp(const struct pool_entry *e1,
> +			  const struct pool_entry *e2,
> +			  const unsigned char *keydata)
> +{
> +	return e1->data != keydata &&
> +	       (e1->len != e2->len || memcmp(e1->data, keydata, e1->len));
> +}
> +
> +const void *memintern(const void *data, size_t len)
> +{
> +	static struct hashmap map;
> +	struct pool_entry key, *e;
> +
> +	/* initialize string pool hashmap */
> +	if (!map.tablesize)
> +		hashmap_init(&map, (hashmap_cmp_fn) pool_entry_cmp, 0);
> +
> +	/* lookup interned string in pool */
> +	hashmap_entry_init(&key, memhash(data, len));
> +	key.len = len;
> +	e = hashmap_get(&map, &key, data);
> +	if (!e) {
> +		/* not found: create it */
> +		e = xmallocz(sizeof(struct pool_entry) + len);
> +		hashmap_entry_init(e, key.ent.hash);
> +		e->len = len;
> +		memcpy(e->data, data, len);
> +		hashmap_add(&map, e);
> +	}
> +	return e->data;
> +}
> diff --git a/hashmap.h b/hashmap.h
> index 12f0668..507884b 100644
> --- a/hashmap.h
> +++ b/hashmap.h
> @@ -87,4 +87,12 @@ static inline void *hashmap_iter_first(struct hashmap *map,
>  	return hashmap_iter_next(iter);
>  }
>  
> +/* string interning */
> +
> +extern const void *memintern(const void *data, size_t len);
> +static inline const char *strintern(const char *string)
> +{
> +	return memintern(string, strlen(string));
> +}
> +
>  #endif
> diff --git a/t/t0011-hashmap.sh b/t/t0011-hashmap.sh
> index 391e2b6..f97c805 100755
> --- a/t/t0011-hashmap.sh
> +++ b/t/t0011-hashmap.sh
> @@ -237,4 +237,17 @@ test_expect_success 'grow / shrink' '
>  
>  '
>  
> +test_expect_success 'string interning' '
> +
> +test_hashmap "intern value1
> +intern Value1
> +intern value2
> +intern value2
> +" "value1
> +Value1
> +value2
> +value2"
> +
> +'
> +
>  test_done
> diff --git a/test-hashmap.c b/test-hashmap.c
> index 3c9f67b..07aa7ec 100644
> --- a/test-hashmap.c
> +++ b/test-hashmap.c
> @@ -234,6 +234,20 @@ int main(int argc, char *argv[])
>  			/* print table sizes */
>  			printf("%u %u\n", map.tablesize, map.size);
>  
> +		} else if (!strcmp("intern", cmd) && l1) {
> +
> +			/* test that strintern works */
> +			const char *i1 = strintern(p1);
> +			const char *i2 = strintern(p1);
> +			if (strcmp(i1, p1))
> +				printf("strintern(%s) returns %s\n", p1, i1);
> +			else if (i1 == p1)
> +				printf("strintern(%s) returns input pointer\n", p1);
> +			else if (i1 != i2)
> +				printf("strintern(%s) != strintern(%s)", i1, i2);
> +			else
> +				printf("%s\n", i1);
> +
>  		} else if (!strcmp("perfhashmap", cmd) && l1 && l2) {
>  
>  			perf_hashmap(atoi(p1), atoi(p2));

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 3/4] hashmap: add simplified hashmap_get_from_hash() API
  2014-07-07 17:43   ` Junio C Hamano
@ 2014-07-11 19:11     ` Karsten Blees
  2014-07-11 22:21       ` Junio C Hamano
  0 siblings, 1 reply; 12+ messages in thread
From: Karsten Blees @ 2014-07-11 19:11 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git List

Am 07.07.2014 19:43, schrieb Junio C Hamano:
> Karsten Blees <karsten.blees@gmail.com> writes:
> 
>> Hashmap entries are typically looked up by just a key. The hashmap_get()
>> API expects an initialized entry structure instead, to support compound
>> keys. This flexibility is currently only needed by find_dir_entry() in
>> name-hash.c (and compat/win32/fscache.c in the msysgit fork). All other
>> (currently five) call sites of hashmap_get() have to set up a near emtpy
> 
> s/emtpy/empty/;
> 
>> entry structure, resulting in duplicate code like this:
>>
>>   struct hashmap_entry keyentry;
>>   hashmap_entry_init(&keyentry, hash(key));
>>   return hashmap_get(map, &keyentry, key);
>>
>> Add a hashmap_get_from_hash() API that allows hashmap lookups by just
>> specifying the key and its hash code, i.e.:
>>
>>   return hashmap_get_from_hash(map, hash(key), key);
> 
> While I think the *_get_from_hash() is an improvement over the
> three-line sequence, I find it somewhat strange that callers of it
> still must compute the hash code themselves, instead of letting the
> hashmap itself to apply the user supplied hash function to the key
> to derive it.  After all, the map must know how to compare two
> entries via a user-supplied cmpfn given at the map initialization
> time, and the algorithm to derive the hash code falls into the same
> category, in the sense that both must stay the same during the
> lifetime of a hashmap, no?  Unless there is a situation where you
> need to be able to initialize a hashmap_entry without knowing which
> hashmap the entry will eventually be stored, it feels dubious API
> that exposes to the outside callers hashmap_entry_init() that takes
> the hash code without taking the hashmap itself.
> 
> In other words, why isn't hashmap_get() more like this:
> 
>         void *hashmap_get(const struct hashmap *map, const void *key)
>         {
>                 struct hashmap_entry keyentry;
>                 hashmap_entry_init(&keyentry, map->hash(key));
>                 return *find_entry_ptr(map, &keyentry, key);
>         }
> 
> with hashmap_entry_init() purely a static helper in hashmap.c?
> 

1. Performance

Calculating hash codes is the most expensive operation when working with
hash tables (unless you already have a good hash such as sha1). And using
hash tables as a cache of some sort is probably the most common use case.
So you'll often have code like this:

1  unsigned int h = hash(key);
2  struct my_entry *e = hashmap_get_from_hash(&map, hash, key);
3  if (!e) {
4    e = malloc(sizeof(*e));
5    hashmap_entry_init(e, h);
6    e->key = key;
6    hashmap_add(&map, e);
7  }

Note that the hash code from line 1 can be reused in line 5. You cannot do
that if calculating the hash code is buried in the implementation.

Callbacks cannot be inlined either...


2. Simplicity

Most APIs take a user defined hashmap_entry structure, so you'd either need
two hash functions, or a hash function that can distinguish between the
'entry' and 'key-only' case, e.g.

  unsigned int my_hash(const struct my_entry *entry, const void *keydata)
  {
    if (keydata)
      return strhash(keydata);
    else
      return strhash(entry->key);
  }

Hashmap clients will typically provide small, type safe wrappers around the
hashmap API. That's perfect for calculating the hash code without breaking
encapsulation. IMO there's no need to make things more complex by requiring
another callback.


3. Compound keys

The key doesn't always consist of just a single word. E.g. for struct
dir_entry, the key is [char *name, int len]. So an API like this:

  void *hashmap_get(const struct hashmap *map, const void *key)

won't do in the general case (unless you require clients to define their
own key structure in addition to the entry structure...).

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 3/4] hashmap: add simplified hashmap_get_from_hash() API
  2014-07-11 19:11     ` Karsten Blees
@ 2014-07-11 22:21       ` Junio C Hamano
  0 siblings, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2014-07-11 22:21 UTC (permalink / raw)
  To: Karsten Blees; +Cc: Git List

Karsten Blees <karsten.blees@gmail.com> writes:

>> In other words, why isn't hashmap_get() more like this:
>>  ...
>> with hashmap_entry_init() purely a static helper in hashmap.c?
>> 
> 1. Performance

OK.

> 2. Simplicity
>
> Hashmap clients will typically provide small, type safe wrappers around the
> hashmap API.

OK.

> 3. Compound keys
>
> The key doesn't always consist of just a single word. E.g. for struct
> dir_entry, the key is [char *name, int len]. So an API like this:
>
>   void *hashmap_get(const struct hashmap *map, const void *key)
>
> won't do in the general case (unless you require clients to define their
> own key structure in addition to the entry structure...).

Yeah, I was being silly.  Thanks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2014-07-11 22:21 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-02 22:18 [PATCH v1 0/4] hashmap improvements Karsten Blees
2014-07-02 22:20 ` [PATCH v1 1/4] hashmap: factor out getting an int hash code from a, SHA1 Karsten Blees
2014-07-07 17:22   ` Junio C Hamano
2014-07-02 22:21 ` [PATCH v1 2/4] hashmap: improve struct hashmap member documentation Karsten Blees
2014-07-02 22:22 ` [PATCH v1 3/4] hashmap: add simplified hashmap_get_from_hash() API Karsten Blees
2014-07-07 17:43   ` Junio C Hamano
2014-07-11 19:11     ` Karsten Blees
2014-07-11 22:21       ` Junio C Hamano
2014-07-02 22:22 ` [PATCH v1 4/4] hashmap: add string interning API Karsten Blees
2014-07-03  7:22   ` Matthieu Moy
2014-07-07 17:44   ` Junio C Hamano
2014-07-03  7:23 ` [PATCH v1 0/4] hashmap improvements Matthieu Moy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.