From: Jeff King <peff@peff.net>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>
Subject: [PATCH v2 07/11] add oidset API
Date: Wed, 8 Feb 2017 15:53:07 -0500
Message-ID: <20170208205307.uvgj3saf3cyrvtan@sigill.intra.peff.net>
In-Reply-To: <20170208205219.twgm5rggovqbepte@sigill.intra.peff.net>

This is similar to many of our uses of sha1-array, but it
overcomes one limitation of a sha1-array: when you are
de-duplicating a large input with relatively few unique
entries, sha1-array uses 20 bytes for every input entry,
duplicates included, because it only removes duplicates
with a sort-and-uniq at the end. This set de-duplicates as
entries are inserted, so it uses memory linear in the
number of unique entries (albeit a few more than 20 bytes
each due to hashmap overhead).

Signed-off-by: Jeff King <peff@peff.net>
---
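Not part of the commit: a minimal, self-contained sketch of the
check-and-add idiom that the return value of oidset_insert()
enables. It does not use git's hashmap. The names toy_set,
toy_insert(), toy_clear() and count_unique() are hypothetical
stand-ins that follow the same contract (1 if the id was already
present, 0 if it was newly added), and the fixed bucket count is an
arbitrary assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ID_LEN 20	/* stand-in for the 20-byte hash in struct object_id */
#define TOY_BUCKETS 64	/* arbitrary; a real hashmap would grow as needed */

/* Hypothetical stand-in for an oidset entry: one node per unique id. */
struct toy_entry {
	struct toy_entry *next;
	unsigned char id[ID_LEN];
};

/* Hypothetical stand-in for struct oidset; zero-initialize before use. */
struct toy_set {
	struct toy_entry *bucket[TOY_BUCKETS];
	size_t nr;	/* number of unique ids stored */
};

/* Use the leading bytes of the id as the hash value. */
static unsigned toy_hash(const unsigned char *id)
{
	unsigned h;
	memcpy(&h, id, sizeof(h));
	return h % TOY_BUCKETS;
}

/* Same contract as oidset_insert(): 1 if already present, 0 if added. */
static int toy_insert(struct toy_set *set, const unsigned char *id)
{
	struct toy_entry **head, *e;

	assert(set && id);
	head = &set->bucket[toy_hash(id)];
	for (e = *head; e; e = e->next)
		if (!memcmp(e->id, id, ID_LEN))
			return 1;

	e = malloc(sizeof(*e));
	if (!e)
		abort();
	memcpy(e->id, id, ID_LEN);	/* copy; caller's buffer need not persist */
	e->next = *head;
	*head = e;
	set->nr++;
	return 0;
}

/* Free every entry and return the set to its zero-initialized state. */
static void toy_clear(struct toy_set *set)
{
	size_t i;

	for (i = 0; i < TOY_BUCKETS; i++) {
		while (set->bucket[i]) {
			struct toy_entry *e = set->bucket[i];
			set->bucket[i] = e->next;
			free(e);
		}
	}
	set->nr = 0;
}

/*
 * Check-and-add over a flat array of `nr` ids (ID_LEN bytes each,
 * possibly with duplicates); returns how many were seen for the
 * first time. Memory grows only with the number of unique ids.
 */
static size_t count_unique(struct toy_set *set,
			   const unsigned char *ids, size_t nr)
{
	size_t i, fresh = 0;

	for (i = 0; i < nr; i++)
		if (!toy_insert(set, ids + i * ID_LEN))
			fresh++;	/* first sighting: act on it here */
	return fresh;
}
```

In git itself the equivalent loop would call oidset_insert() on a
set declared with OIDSET_INIT and finish with oidset_clear().
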
 Makefile |  1 +
 oidset.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 oidset.h | 45 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 95 insertions(+)
 create mode 100644 oidset.c
 create mode 100644 oidset.h

diff --git a/Makefile b/Makefile
index 8e4081e06..a5433978e 100644
--- a/Makefile
+++ b/Makefile
@@ -781,6 +781,7 @@ LIB_OBJS += notes-cache.o
 LIB_OBJS += notes-merge.o
 LIB_OBJS += notes-utils.o
 LIB_OBJS += object.o
+LIB_OBJS += oidset.o
 LIB_OBJS += pack-bitmap.o
 LIB_OBJS += pack-bitmap-write.o
 LIB_OBJS += pack-check.o
diff --git a/oidset.c b/oidset.c
new file mode 100644
index 000000000..ac169f05d
--- /dev/null
+++ b/oidset.c
@@ -0,0 +1,49 @@
+#include "cache.h"
+#include "oidset.h"
+
+struct oidset_entry {
+	struct hashmap_entry hash;
+	struct object_id oid;
+};
+
+static int oidset_hashcmp(const void *va, const void *vb,
+			  const void *vkey)
+{
+	const struct oidset_entry *a = va, *b = vb;
+	const struct object_id *key = vkey;
+	return oidcmp(&a->oid, key ? key : &b->oid);
+}
+
+int oidset_contains(const struct oidset *set, const struct object_id *oid)
+{
+	struct hashmap_entry key;
+
+	if (!set->map.cmpfn)
+		return 0;
+
+	hashmap_entry_init(&key, sha1hash(oid->hash));
+	return !!hashmap_get(&set->map, &key, oid);
+}
+
+int oidset_insert(struct oidset *set, const struct object_id *oid)
+{
+	struct oidset_entry *entry;
+
+	if (!set->map.cmpfn)
+		hashmap_init(&set->map, oidset_hashcmp, 0);
+
+	if (oidset_contains(set, oid))
+		return 1;
+
+	entry = xmalloc(sizeof(*entry));
+	hashmap_entry_init(&entry->hash, sha1hash(oid->hash));
+	oidcpy(&entry->oid, oid);
+
+	hashmap_add(&set->map, entry);
+	return 0;
+}
+
+void oidset_clear(struct oidset *set)
+{
+	hashmap_free(&set->map, 1);
+}
diff --git a/oidset.h b/oidset.h
new file mode 100644
index 000000000..b7eaab5b8
--- /dev/null
+++ b/oidset.h
@@ -0,0 +1,45 @@
+#ifndef OIDSET_H
+#define OIDSET_H
+
+/**
+ * This API is similar to sha1-array, in that it maintains a set of object ids
+ * in a memory-efficient way. The major differences are:
+ *
+ *   1. It uses a hash, so we can do online duplicate removal, rather than
+ *      sort-and-uniq at the end. This can reduce memory footprint if you have
+ *      a large list of oids with many duplicates.
+ *
+ *   2. The per-unique-oid memory footprint is slightly higher due to hash
+ *      table overhead.
+ */
+
+/**
+ * A single oidset; should be zero-initialized (or use OIDSET_INIT).
+ */
+struct oidset {
+	struct hashmap map;
+};
+
+#define OIDSET_INIT { { NULL } }
+
+/**
+ * Returns true iff `set` contains `oid`.
+ */
+int oidset_contains(const struct oidset *set, const struct object_id *oid);
+
+/**
+ * Insert the oid into the set; a copy is made, so "oid" does not need
+ * to persist after this function is called.
+ *
+ * Returns 1 if the oid was already in the set, 0 otherwise. This can be used
+ * to perform an efficient check-and-add.
+ */
+int oidset_insert(struct oidset *set, const struct object_id *oid);
+
+/**
+ * Remove all entries from the oidset, freeing any resources associated with
+ * it.
+ */
+void oidset_clear(struct oidset *set);
+
+#endif /* OIDSET_H */
-- 
2.12.0.rc0.371.ga6cf8653b