All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/5] Add delta islands support
@ 2018-07-22  5:48 Christian Couder
  2018-07-22  5:48 ` [RFC PATCH 1/5] packfile: make get_delta_base() non static Christian Couder
                   ` (6 more replies)
  0 siblings, 7 replies; 37+ messages in thread
From: Christian Couder @ 2018-07-22  5:48 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder

This patch series is upstreaming work made by GitHub and available in:

https://github.com/peff/git/commits/jk/delta-islands

The patch in the above branch has been split into 5 patches with their
own new commit message, but no other change has been made.

I kept Peff as the author and took the liberty to add his
Signed-off-by to all the patches.

The patches look good to me except perhaps that "pack.islandCore" is
not documented. Maybe something could be added in
Documentation/technical/ too, or maybe improving commit messages could
be enough.

Anyway I wanted to send this nearly "as is" first to get early
feedback about what should be done.

This patch series is also available on GitHub in:

https://github.com/chriscool/git/commits/delta-islands

Jeff King (5):
  packfile: make get_delta_base() non static
  Add delta-islands.{c,h}
  pack-objects: add delta-islands support
  repack: add delta-islands support
  t: add t9930-delta-islands.sh

 Documentation/config.txt           |   8 +
 Documentation/git-pack-objects.txt |  88 ++++++
 Documentation/git-repack.txt       |   5 +
 Makefile                           |   1 +
 builtin/pack-objects.c             | 130 +++++---
 builtin/repack.c                   |   9 +
 delta-islands.c                    | 490 +++++++++++++++++++++++++++++
 delta-islands.h                    |  11 +
 pack-objects.h                     |   4 +
 packfile.c                         |  10 +-
 packfile.h                         |   3 +
 t/t9930-delta-islands.sh           | 143 +++++++++
 12 files changed, 856 insertions(+), 46 deletions(-)
 create mode 100644 delta-islands.c
 create mode 100644 delta-islands.h
 create mode 100755 t/t9930-delta-islands.sh

-- 
2.18.0.237.gffdb1dbdaa


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [RFC PATCH 1/5] packfile: make get_delta_base() non static
  2018-07-22  5:48 [RFC PATCH 0/5] Add delta islands support Christian Couder
@ 2018-07-22  5:48 ` Christian Couder
  2018-07-24 16:19   ` Junio C Hamano
  2018-07-22  5:48 ` [RFC PATCH 2/5] Add delta-islands.{c,h} Christian Couder
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 37+ messages in thread
From: Christian Couder @ 2018-07-22  5:48 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder

From: Jeff King <peff@peff.net>

As get_delta_base() will be used outside 'packfile.c' in
a following commit, let's make it non static and let's
declare it in 'packfile.h'.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 packfile.c | 10 +++++-----
 packfile.h |  3 +++
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/packfile.c b/packfile.c
index 7cd45aa4b2..4646bff5ff 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1037,11 +1037,11 @@ const struct packed_git *has_packed_and_bad(const unsigned char *sha1)
 	return NULL;
 }
 
-static off_t get_delta_base(struct packed_git *p,
-				    struct pack_window **w_curs,
-				    off_t *curpos,
-				    enum object_type type,
-				    off_t delta_obj_offset)
+off_t get_delta_base(struct packed_git *p,
+		     struct pack_window **w_curs,
+		     off_t *curpos,
+		     enum object_type type,
+		     off_t delta_obj_offset)
 {
 	unsigned char *base_info = use_pack(p, w_curs, *curpos, NULL);
 	off_t base_offset;
diff --git a/packfile.h b/packfile.h
index cc7eaffe1b..30f0811382 100644
--- a/packfile.h
+++ b/packfile.h
@@ -125,6 +125,9 @@ extern void *unpack_entry(struct repository *r, struct packed_git *, off_t, enum
 extern unsigned long unpack_object_header_buffer(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep);
 extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
 extern int unpack_object_header(struct packed_git *, struct pack_window **, off_t *, unsigned long *);
+extern off_t get_delta_base(struct packed_git *p, struct pack_window **w_curs,
+			    off_t *curpos, enum object_type type,
+			    off_t delta_obj_offset);
 
 extern void release_pack_memory(size_t);
 
-- 
2.18.0.237.gffdb1dbdaa


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 2/5] Add delta-islands.{c,h}
  2018-07-22  5:48 [RFC PATCH 0/5] Add delta islands support Christian Couder
  2018-07-22  5:48 ` [RFC PATCH 1/5] packfile: make get_delta_base() non static Christian Couder
@ 2018-07-22  5:48 ` Christian Couder
  2018-07-22  8:50   ` Duy Nguyen
                     ` (2 more replies)
  2018-07-22  5:48 ` [RFC PATCH 3/5] pack-objects: add delta-islands support Christian Couder
                   ` (4 subsequent siblings)
  6 siblings, 3 replies; 37+ messages in thread
From: Christian Couder @ 2018-07-22  5:48 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder

From: Jeff King <peff@peff.net>

Hosting providers that allow users to "fork" existing
repos want as much as possible those forks to use a
small amount of disk space.

Using alternates to keep all the objects from all
the forks into a unique central repo is way to do
that, but it can have some drawbacks. Especially when
packing the central repo, deltas will be created
between objects from different forks.

This can make cloning or fetching a fork much slower
and much more CPU intensive as Git might have to
compute new deltas for many objects to avoid sending
objects from a different fork.

Delta islands is a way to store objects from
different forks in the same repo and packfile without
having deltas between objects from different forks.

This patch implements the delta islands mechanism in
"delta-islands.{c,h}", but does not yet make use of it.

A few new fields are added in 'struct object_entry'
in "pack-objects.h" though.

The new "pack.island" config variable which can be
used to configure the different islands is described
a bit in "Documentation/config.txt", but the real
documentation will follow in a patch that actually
uses delta islands in "builtin/pack-objects.c".

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config.txt |   4 +
 Makefile                 |   1 +
 delta-islands.c          | 490 +++++++++++++++++++++++++++++++++++++++
 delta-islands.h          |  11 +
 pack-objects.h           |   4 +
 5 files changed, 510 insertions(+)
 create mode 100644 delta-islands.c
 create mode 100644 delta-islands.h

diff --git a/Documentation/config.txt b/Documentation/config.txt
index a32172a43c..f682e92a1a 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -2542,6 +2542,10 @@ Note that changing the compression level will not automatically recompress
 all existing objects. You can force recompression by passing the -F option
 to linkgit:git-repack[1].
 
+pack.island::
+	A regular expression configuring a set of delta islands. See
+	"DELTA ISLANDS" in linkgit:git-pack-objects[1] for details.
+
 pack.deltaCacheSize::
 	The maximum memory in bytes used for caching deltas in
 	linkgit:git-pack-objects[1] before writing them out to a pack.
diff --git a/Makefile b/Makefile
index 08e5c54549..2804298977 100644
--- a/Makefile
+++ b/Makefile
@@ -840,6 +840,7 @@ LIB_OBJS += csum-file.o
 LIB_OBJS += ctype.o
 LIB_OBJS += date.o
 LIB_OBJS += decorate.o
+LIB_OBJS += delta-islands.o
 LIB_OBJS += diffcore-break.o
 LIB_OBJS += diffcore-delta.o
 LIB_OBJS += diffcore-order.o
diff --git a/delta-islands.c b/delta-islands.c
new file mode 100644
index 0000000000..645fe966c5
--- /dev/null
+++ b/delta-islands.c
@@ -0,0 +1,490 @@
+#include "builtin.h"
+#include "cache.h"
+#include "attr.h"
+#include "object.h"
+#include "blob.h"
+#include "commit.h"
+#include "tag.h"
+#include "tree.h"
+#include "delta.h"
+#include "pack.h"
+#include "tree-walk.h"
+#include "diff.h"
+#include "revision.h"
+#include "list-objects.h"
+#include "progress.h"
+#include "refs.h"
+#include "khash.h"
+#include "pack-bitmap.h"
+#include "pack-objects.h"
+#include "delta-islands.h"
+#include "sha1-array.h"
+#include "config.h"
+
+KHASH_INIT(str, const char *, void *, 1, kh_str_hash_func, kh_str_hash_equal)
+
+static khash_sha1 *island_marks;
+static unsigned island_counter;
+static unsigned island_counter_core;
+
+static kh_str_t *remote_islands;
+
+struct remote_island {
+	uint64_t hash;
+	struct oid_array oids;
+};
+
+struct island_bitmap {
+	uint32_t refcount;
+	uint32_t bits[];
+};
+
+static uint32_t island_bitmap_size;
+
+/*
+ * Allocate a new bitmap; if "old" is not NULL, the new bitmap will be a copy
+ * of "old". Otherwise, the new bitmap is empty.
+ */
+static struct island_bitmap *island_bitmap_new(const struct island_bitmap *old)
+{
+	size_t size = sizeof(struct island_bitmap) + (island_bitmap_size * 4);
+	struct island_bitmap *b = xcalloc(1, size);
+
+	if (old)
+		memcpy(b, old, size);
+
+	b->refcount = 1;
+	return b;
+}
+
+static void island_bitmap_or(struct island_bitmap *a, const struct island_bitmap *b)
+{
+	uint32_t i;
+
+	for (i = 0; i < island_bitmap_size; ++i)
+		a->bits[i] |= b->bits[i];
+}
+
+static int island_bitmap_is_subset(struct island_bitmap *self,
+		struct island_bitmap *super)
+{
+	uint32_t i;
+
+	if (self == super)
+		return 1;
+
+	for (i = 0; i < island_bitmap_size; ++i) {
+		if ((self->bits[i] & super->bits[i]) != self->bits[i])
+			return 0;
+	}
+
+	return 1;
+}
+
+#define ISLAND_BITMAP_BLOCK(x) (x / 32)
+#define ISLAND_BITMAP_MASK(x) (1 << (x % 32))
+
+static void island_bitmap_set(struct island_bitmap *self, uint32_t i)
+{
+	self->bits[ISLAND_BITMAP_BLOCK(i)] |= ISLAND_BITMAP_MASK(i);
+}
+
+static int island_bitmap_get(struct island_bitmap *self, uint32_t i)
+{
+	return (self->bits[ISLAND_BITMAP_BLOCK(i)] & ISLAND_BITMAP_MASK(i)) != 0;
+}
+
+int in_same_island(const struct object_id *trg_oid, const struct object_id *src_oid)
+{
+	khiter_t trg_pos, src_pos;
+
+	/* If we aren't using islands, assume everything goes together. */
+	if (!island_marks)
+		return 1;
+
+	/*
+	 * If we don't have a bitmap for the target, we can delta it
+	 * against anything -- it's not an important object
+	 */
+	trg_pos = kh_get_sha1(island_marks, trg_oid->hash);
+	if (trg_pos >= kh_end(island_marks))
+		return 1;
+
+	/*
+	 * if the source (our delta base) doesn't have a bitmap,
+	 * we don't want to base any deltas on it!
+	 */
+	src_pos = kh_get_sha1(island_marks, src_oid->hash);
+	if (src_pos >= kh_end(island_marks))
+		return 0;
+
+	return island_bitmap_is_subset(kh_value(island_marks, trg_pos),
+				kh_value(island_marks, src_pos));
+}
+
+int island_delta_cmp(const struct object_id *a, const struct object_id *b)
+{
+	khiter_t a_pos, b_pos;
+	struct island_bitmap *a_bitmap = NULL, *b_bitmap = NULL;
+
+	if (!island_marks)
+		return 0;
+
+	a_pos = kh_get_sha1(island_marks, a->hash);
+	if (a_pos < kh_end(island_marks))
+		a_bitmap = kh_value(island_marks, a_pos);
+
+	b_pos = kh_get_sha1(island_marks, b->hash);
+	if (b_pos < kh_end(island_marks))
+		b_bitmap = kh_value(island_marks, b_pos);
+
+	if (a_bitmap) {
+		if (!b_bitmap || !island_bitmap_is_subset(a_bitmap, b_bitmap))
+			return -1;
+	}
+	if (b_bitmap) {
+		if (!a_bitmap || !island_bitmap_is_subset(b_bitmap, a_bitmap))
+			return 1;
+	}
+
+	return 0;
+}
+
+static struct island_bitmap *create_or_get_island_marks(struct object *obj)
+{
+	khiter_t pos;
+	int hash_ret;
+
+	pos = kh_put_sha1(island_marks, obj->oid.hash, &hash_ret);
+	if (hash_ret)
+		kh_value(island_marks, pos) = island_bitmap_new(NULL);
+
+	return kh_value(island_marks, pos);
+}
+
+static void set_island_marks(struct object *obj, struct island_bitmap *marks)
+{
+	struct island_bitmap *b;
+	khiter_t pos;
+	int hash_ret;
+
+	pos = kh_put_sha1(island_marks, obj->oid.hash, &hash_ret);
+	if (hash_ret) {
+		/*
+		 * We don't have one yet; make a copy-on-write of the
+		 * parent.
+		 */
+		marks->refcount++;
+		kh_value(island_marks, pos) = marks;
+		return;
+	}
+
+	/*
+	 * We do have it. Make sure we split any copy-on-write before
+	 * updating.
+	 */
+	b = kh_value(island_marks, pos);
+	if (b->refcount > 1) {
+		b->refcount--;
+		b = kh_value(island_marks, pos) = island_bitmap_new(b);
+	}
+	island_bitmap_or(b, marks);
+}
+
+static void mark_remote_island_1(struct remote_island *rl, int is_core_island)
+{
+	uint32_t i;
+
+	for (i = 0; i < rl->oids.nr; ++i) {
+		struct island_bitmap *marks;
+		struct object *obj = parse_object(&rl->oids.oid[i]);
+
+		if (!obj)
+			continue;
+
+		marks = create_or_get_island_marks(obj);
+		island_bitmap_set(marks, island_counter);
+
+		if (is_core_island && obj->type == OBJ_COMMIT)
+			obj->flags |= NEEDS_BITMAP;
+
+		/* If it was a tag, also make sure we hit the underlying object. */
+		while (obj && obj->type == OBJ_TAG) {
+			obj = ((struct tag *)obj)->tagged;
+			if (obj) {
+				parse_object(&obj->oid);
+				marks = create_or_get_island_marks(obj);
+				island_bitmap_set(marks, island_counter);
+			}
+		}
+	}
+
+	if (is_core_island)
+		island_counter_core = island_counter;
+
+	island_counter++;
+}
+
+static int cmp_tree_depth(const void *va, const void *vb)
+{
+	struct object_entry *a = *(struct object_entry **)va;
+	struct object_entry *b = *(struct object_entry **)vb;
+	return a->tree_depth - b->tree_depth;
+}
+
+void resolve_tree_islands(int progress, struct packing_data *to_pack)
+{
+	struct progress *progress_state = NULL;
+	struct object_entry **todo;
+	int nr = 0;
+	int i;
+
+	if (!island_marks)
+		return;
+
+	/*
+	 * We process only trees, as commits and tags have already been handled
+	 * (and passed their marks on to root trees, as well. We must make sure
+	 * to process them in descending tree-depth order so that marks
+	 * propagate down the tree properly, even if a sub-tree is found in
+	 * multiple parent trees.
+	 */
+	todo = xmalloc(to_pack->nr_objects * sizeof(*todo));
+	for (i = 0; i < to_pack->nr_objects; i++) {
+		if (oe_type(&to_pack->objects[i]) == OBJ_TREE)
+			todo[nr++] = &to_pack->objects[i];
+	}
+	qsort(todo, nr, sizeof(*todo), cmp_tree_depth);
+
+	if (progress)
+		progress_state = start_progress("Propagating island marks", nr);
+
+	for (i = 0; i < nr; i++) {
+		struct object_entry *ent = todo[i];
+		struct island_bitmap *root_marks;
+		struct tree *tree;
+		struct tree_desc desc;
+		struct name_entry entry;
+		khiter_t pos;
+
+		pos = kh_get_sha1(island_marks, ent->idx.oid.hash);
+		if (pos >= kh_end(island_marks))
+			continue;
+
+		root_marks = kh_value(island_marks, pos);
+
+		tree = lookup_tree(&ent->idx.oid);
+		if (!tree || parse_tree(tree) < 0)
+			die("bad tree object %s", oid_to_hex(&ent->idx.oid));
+
+		init_tree_desc(&desc, tree->buffer, tree->size);
+		while (tree_entry(&desc, &entry)) {
+			struct object *obj;
+
+			if (S_ISGITLINK(entry.mode))
+				continue;
+
+			obj = lookup_object(entry.oid->hash);
+			if (!obj)
+				continue;
+
+			set_island_marks(obj, root_marks);
+		}
+
+		free(tree->buffer);
+		tree->buffer = NULL;
+		tree->object.parsed = 0;
+
+		display_progress(progress_state, i+1);
+	}
+
+	stop_progress(&progress_state);
+	free(todo);
+}
+
+static regex_t *island_regexes;
+static unsigned int island_regexes_alloc, island_regexes_nr;
+static const char *core_island_name;
+
+static int island_config_callback(const char *k, const char *v, void *cb)
+{
+	if (!strcmp(k, "pack.island")) {
+		struct strbuf re = STRBUF_INIT;
+
+		if (!v)
+			return config_error_nonbool(k);
+
+		if (island_regexes_nr >= island_regexes_alloc) {
+			island_regexes_alloc = (island_regexes_alloc + 8) * 2;
+			island_regexes = xrealloc(island_regexes,
+					island_regexes_alloc * sizeof(regex_t));
+		}
+
+		if (*v != '^')
+			strbuf_addch(&re, '^');
+		strbuf_addstr(&re, v);
+
+		if (regcomp(&island_regexes[island_regexes_nr], re.buf, REG_EXTENDED))
+			die("failed to load island regex for '%s': %s", k, re.buf);
+
+		strbuf_release(&re);
+		island_regexes_nr++;
+		return 0;
+	}
+
+	if (!strcmp(k, "pack.islandcore"))
+		return git_config_string(&core_island_name, k, v);
+
+	return 0;
+}
+
+static void add_ref_to_island(const char *island_name, const struct object_id *oid)
+{
+	uint64_t sha_core;
+	struct remote_island *rl = NULL;
+
+	int hash_ret;
+	khiter_t pos = kh_put_str(remote_islands, island_name, &hash_ret);
+
+	if (hash_ret) {
+		kh_key(remote_islands, pos) = xstrdup(island_name);
+		kh_value(remote_islands, pos) = xcalloc(1, sizeof(struct remote_island));
+	}
+
+	rl = kh_value(remote_islands, pos);
+	oid_array_append(&rl->oids, oid);
+
+	memcpy(&sha_core, oid->hash, sizeof(uint64_t));
+	rl->hash += sha_core;
+}
+
+static int find_island_for_ref(const char *refname, const struct object_id *oid,
+			       int flags, void *data)
+{
+	int i, m;
+	regmatch_t matches[8];
+	struct strbuf island_name = STRBUF_INIT;
+
+	/* walk backwards to get last-one-wins ordering */
+	for (i = island_regexes_nr - 1; i >= 0; i--) {
+		if (!regexec(&island_regexes[i], refname,
+			     ARRAY_SIZE(matches), matches, 0))
+			break;
+	}
+
+	if (i < 0)
+		return 0;
+
+	for (m = 1; m < ARRAY_SIZE(matches); m++) {
+		regmatch_t *match = &matches[m];
+
+		if (match->rm_so == -1)
+			continue;
+
+		if (island_name.len)
+			strbuf_addch(&island_name, '-');
+
+		strbuf_add(&island_name, refname + match->rm_so, match->rm_eo - match->rm_so);
+	}
+
+	add_ref_to_island(island_name.buf, oid);
+	strbuf_release(&island_name);
+	return 0;
+}
+
+static struct remote_island *get_core_island(void)
+{
+	if (core_island_name) {
+		khiter_t pos = kh_get_str(remote_islands, core_island_name);
+		if (pos < kh_end(remote_islands))
+			return kh_value(remote_islands, pos);
+	}
+
+	return NULL;
+}
+
+static void deduplicate_islands(void)
+{
+	struct remote_island *island, *core = NULL, **list;
+	unsigned int island_count, dst, src, ref, i = 0;
+
+	island_count = kh_size(remote_islands);
+	list = xmalloc(island_count * sizeof(struct remote_island *));
+
+	kh_foreach_value(remote_islands, island, {
+		list[i++] = island;
+	});
+
+	for (ref = 0; ref + 1 < island_count; ref++) {
+		for (src = ref + 1, dst = src; src < island_count; src++) {
+			if (list[ref]->hash == list[src]->hash)
+				continue;
+
+			if (src != dst)
+				list[dst] = list[src];
+
+			dst++;
+		}
+		island_count = dst;
+	}
+
+	island_bitmap_size = (island_count / 32) + 1;
+	core = get_core_island();
+
+	for (i = 0; i < island_count; ++i) {
+		mark_remote_island_1(list[i], core && list[i]->hash == core->hash);
+	}
+
+	free(list);
+}
+
+void load_delta_islands(void)
+{
+	island_marks = kh_init_sha1();
+	remote_islands = kh_init_str();
+
+	git_config(island_config_callback, NULL);
+	for_each_ref(find_island_for_ref, NULL);
+	deduplicate_islands();
+
+	fprintf(stderr, "Marked %d islands, done.\n", island_counter);
+}
+
+void propagate_island_marks(struct commit *commit)
+{
+	khiter_t pos = kh_get_sha1(island_marks, commit->object.oid.hash);
+
+	if (pos < kh_end(island_marks)) {
+		struct commit_list *p;
+		struct island_bitmap *root_marks = kh_value(island_marks, pos);
+
+		parse_commit(commit);
+		set_island_marks(&get_commit_tree(commit)->object, root_marks);
+		for (p = commit->parents; p; p = p->next)
+			set_island_marks(&p->item->object, root_marks);
+	}
+}
+
+int compute_pack_layers(struct packing_data *to_pack)
+{
+	uint32_t i;
+
+	if (!core_island_name || !island_marks)
+		return 1;
+
+	for (i = 0; i < to_pack->nr_objects; ++i) {
+		struct object_entry *entry = &to_pack->objects[i];
+		khiter_t pos = kh_get_sha1(island_marks, entry->idx.oid.hash);
+
+		entry->layer = 1;
+
+		if (pos < kh_end(island_marks)) {
+			struct island_bitmap *bitmap = kh_value(island_marks, pos);
+
+			if (island_bitmap_get(bitmap, island_counter_core))
+				entry->layer = 0;
+		}
+	}
+
+	return 2;
+}
diff --git a/delta-islands.h b/delta-islands.h
new file mode 100644
index 0000000000..f9725730f4
--- /dev/null
+++ b/delta-islands.h
@@ -0,0 +1,11 @@
+#ifndef DELTA_ISLANDS_H
+#define DELTA_ISLANDS_H
+
+int island_delta_cmp(const struct object_id *a, const struct object_id *b);
+int in_same_island(const struct object_id *, const struct object_id *);
+void resolve_tree_islands(int progress, struct packing_data *to_pack);
+void load_delta_islands(void);
+void propagate_island_marks(struct commit *commit);
+int compute_pack_layers(struct packing_data *to_pack);
+
+#endif /* DELTA_ISLANDS_H */
diff --git a/pack-objects.h b/pack-objects.h
index edf74dabdd..8eecd67991 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -100,6 +100,10 @@ struct object_entry {
 	unsigned type_:TYPE_BITS;
 	unsigned no_try_delta:1;
 	unsigned in_pack_type:TYPE_BITS; /* could be delta */
+
+	unsigned int tree_depth; /* should be repositioned for packing? */
+	unsigned char layer;
+
 	unsigned preferred_base:1; /*
 				    * we do not pack this, but is available
 				    * to be used as the base object to delta
-- 
2.18.0.237.gffdb1dbdaa


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 3/5] pack-objects: add delta-islands support
  2018-07-22  5:48 [RFC PATCH 0/5] Add delta islands support Christian Couder
  2018-07-22  5:48 ` [RFC PATCH 1/5] packfile: make get_delta_base() non static Christian Couder
  2018-07-22  5:48 ` [RFC PATCH 2/5] Add delta-islands.{c,h} Christian Couder
@ 2018-07-22  5:48 ` Christian Couder
  2018-07-22  8:55   ` Duy Nguyen
                     ` (2 more replies)
  2018-07-22  5:48 ` [RFC PATCH 4/5] repack: " Christian Couder
                   ` (3 subsequent siblings)
  6 siblings, 3 replies; 37+ messages in thread
From: Christian Couder @ 2018-07-22  5:48 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder

From: Jeff King <peff@peff.net>

Implement support for delta islands in git pack-objects
and document how delta islands work in
"Documentation/git-pack-objects.txt".

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-pack-objects.txt |  88 +++++++++++++++++++
 builtin/pack-objects.c             | 130 ++++++++++++++++++++---------
 2 files changed, 177 insertions(+), 41 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index d95b472d16..7b7a36056f 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -289,6 +289,94 @@ Unexpected missing object will raise an error.
 --unpack-unreachable::
 	Keep unreachable objects in loose form. This implies `--revs`.
 
+--delta-islands::
+	Restrict delta matches based on "islands". See DELTA ISLANDS
+	below.
+
+
+DELTA ISLANDS
+-------------
+
+When possible, `pack-objects` tries to reuse existing on-disk deltas to
+avoid having to search for new ones on the fly. This is an important
+optimization for serving fetches, because it means the server can avoid
+inflating most objects at all and just send the bytes directly from
+disk. This optimization can't work when an object is stored as a delta
+against a base which the receiver does not have (and which we are not
+already sending). In that case the server "breaks" the delta and has to
+find a new one, which has a high CPU cost. Therefore it's important for
+performance that the set of objects in on-disk delta relationships match
+what a client would fetch.
+
+In a normal repository, this tends to work automatically. The objects
+are mostly reachable from the branches and tags, and that's what clients
+fetch. Any deltas we find on the server are likely to be between objects
+the client has or will have.
+
+But in some repository setups, you may have several related but separate
+groups of ref tips, with clients tending to fetch those groups
+independently. For example, imagine that you are hosting several "forks"
+of a repository in a single shared object store, and letting clients
+view them as separate repositories through `GIT_NAMESPACE` or separate
+repos using the alternates mechanism. A naive repack may find that the
+optimal delta for an object is against a base that is only found in
+another fork. But when a client fetches, they will not have the base
+object, and we'll have to find a new delta on the fly.
+
+A similar situation may exist if you have many refs outside of
+`refs/heads/` and `refs/tags/` that point to related objects (e.g.,
+`refs/pull` or `refs/changes` used by some hosting providers). By
+default, clients fetch only heads and tags, and deltas against objects
+found only in those other groups cannot be sent as-is.
+
+Delta islands solve this problem by allowing you to group your refs into
+distinct "islands". Pack-objects computes which objects are reachable
+from which islands, and refuses to make a delta from an object `A`
+against a base which is not present in all of `A`'s islands. This
+results in slightly larger packs (because we miss some delta
+opportunities), but guarantees that a fetch of one island will not have
+to recompute deltas on the fly due to crossing island boundaries.
+
+Islands are configured via the `pack.island` option, which can be
+specified multiple times. Each value is a left-anchored regular
+expressions matching refnames. For example:
+
+-------------------------------------------
+[pack]
+island = refs/heads/
+island = refs/tags/
+-------------------------------------------
+
+puts heads and tags into an island (whose name is the empty string; see
+below for more on naming). Any refs which do not match those regular
+expressions (e.g., `refs/pull/123`) is not in any island. Any object
+which is reachable only from `refs/pull/` (but not heads or tags) is
+therefore not a candidate to be used as a base for `refs/heads/`.
+
+Refs are grouped into islands based on their "names", and two regexes
+that produce the same name are considered to be in the same island. The
+names are computed from the regexes by concatenating any capture groups
+from the regex (and if there are none, then the name is the empty
+string, as in the above example). This allows you to create arbitrary
+numbers of islands. For example, imagine you store the refs for each
+fork in `refs/virtual/ID`, where `ID` is a numeric identifier. You might
+then configure:
+
+-------------------------------------------
+[pack]
+island = refs/virtual/([0-9]+)/heads/
+island = refs/virtual/([0-9]+)/tags/
+island = refs/virtual/([0-9]+)/(pull)/
+-------------------------------------------
+
+That puts the heads and tags for each fork in their own island (named
+"1234" or similar), and the pull refs for each go into their own
+"1234-pull".
+
+Note that we pick a single island for each regex to go into, using "last
+one wins" ordering (which allows repo-specific config to take precedence
+over user-wide config, and so forth).
+
 SEE ALSO
 --------
 linkgit:git-rev-list[1]
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index ebc8cefb53..3bd41d8b05 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -24,6 +24,7 @@
 #include "streaming.h"
 #include "thread-utils.h"
 #include "pack-bitmap.h"
+#include "delta-islands.h"
 #include "reachable.h"
 #include "sha1-array.h"
 #include "argv-array.h"
@@ -59,6 +60,7 @@ static struct packing_data to_pack;
 
 static struct pack_idx_entry **written_list;
 static uint32_t nr_result, nr_written, nr_seen;
+static uint32_t write_layer;
 
 static int non_empty;
 static int reuse_delta = 1, reuse_object = 1;
@@ -93,6 +95,8 @@ static uint16_t write_bitmap_options;
 
 static int exclude_promisor_objects;
 
+static int use_delta_islands;
+
 static unsigned long delta_cache_size = 0;
 static unsigned long max_delta_cache_size = DEFAULT_DELTA_CACHE_SIZE;
 static unsigned long cache_max_small_delta_size = 1000;
@@ -607,7 +611,7 @@ static inline void add_to_write_order(struct object_entry **wo,
 			       unsigned int *endp,
 			       struct object_entry *e)
 {
-	if (e->filled)
+	if (e->filled || e->layer != write_layer)
 		return;
 	wo[(*endp)++] = e;
 	e->filled = 1;
@@ -669,6 +673,7 @@ static void add_family_to_write_order(struct object_entry **wo,
 
 static struct object_entry **compute_write_order(void)
 {
+	uint32_t max_layers = 1;
 	unsigned int i, wo_end, last_untagged;
 
 	struct object_entry **wo;
@@ -700,51 +705,58 @@ static struct object_entry **compute_write_order(void)
 	 */
 	for_each_tag_ref(mark_tagged, NULL);
 
-	/*
-	 * Give the objects in the original recency order until
-	 * we see a tagged tip.
-	 */
+	if (use_delta_islands)
+		max_layers = compute_pack_layers(&to_pack);
+
 	ALLOC_ARRAY(wo, to_pack.nr_objects);
-	for (i = wo_end = 0; i < to_pack.nr_objects; i++) {
-		if (objects[i].tagged)
-			break;
-		add_to_write_order(wo, &wo_end, &objects[i]);
-	}
-	last_untagged = i;
+	wo_end = 0;
 
-	/*
-	 * Then fill all the tagged tips.
-	 */
-	for (; i < to_pack.nr_objects; i++) {
-		if (objects[i].tagged)
+	for (; write_layer < max_layers; ++write_layer) {
+		/*
+		 * Give the objects in the original recency order until
+		 * we see a tagged tip.
+		 */
+		for (i = 0; i < to_pack.nr_objects; i++) {
+			if (objects[i].tagged)
+				break;
 			add_to_write_order(wo, &wo_end, &objects[i]);
-	}
+		}
+		last_untagged = i;
 
-	/*
-	 * And then all remaining commits and tags.
-	 */
-	for (i = last_untagged; i < to_pack.nr_objects; i++) {
-		if (oe_type(&objects[i]) != OBJ_COMMIT &&
-		    oe_type(&objects[i]) != OBJ_TAG)
-			continue;
-		add_to_write_order(wo, &wo_end, &objects[i]);
-	}
+		/*
+		 * Then fill all the tagged tips.
+		 */
+		for (; i < to_pack.nr_objects; i++) {
+			if (objects[i].tagged)
+				add_to_write_order(wo, &wo_end, &objects[i]);
+		}
 
-	/*
-	 * And then all the trees.
-	 */
-	for (i = last_untagged; i < to_pack.nr_objects; i++) {
-		if (oe_type(&objects[i]) != OBJ_TREE)
-			continue;
-		add_to_write_order(wo, &wo_end, &objects[i]);
-	}
+		/*
+		 * And then all remaining commits and tags.
+		 */
+		for (i = last_untagged; i < to_pack.nr_objects; i++) {
+			if (oe_type(&objects[i]) != OBJ_COMMIT &&
+			    oe_type(&objects[i]) != OBJ_TAG)
+				continue;
+			add_to_write_order(wo, &wo_end, &objects[i]);
+		}
 
-	/*
-	 * Finally all the rest in really tight order
-	 */
-	for (i = last_untagged; i < to_pack.nr_objects; i++) {
-		if (!objects[i].filled)
-			add_family_to_write_order(wo, &wo_end, &objects[i]);
+		/*
+		 * And then all the trees.
+		 */
+		for (i = last_untagged; i < to_pack.nr_objects; i++) {
+			if (oe_type(&objects[i]) != OBJ_TREE)
+				continue;
+			add_to_write_order(wo, &wo_end, &objects[i]);
+		}
+
+		/*
+		 * Finally all the rest in really tight order
+		 */
+		for (i = last_untagged; i < to_pack.nr_objects; i++) {
+			if (!objects[i].filled && objects[i].layer == write_layer)
+				add_family_to_write_order(wo, &wo_end, &objects[i]);
+		}
 	}
 
 	if (wo_end != to_pack.nr_objects)
@@ -1504,7 +1516,8 @@ static void check_object(struct object_entry *entry)
 			break;
 		}
 
-		if (base_ref && (base_entry = packlist_find(&to_pack, base_ref, NULL))) {
+		if (base_ref && (base_entry = packlist_find(&to_pack, base_ref, NULL)) &&
+		    in_same_island(&entry->idx.oid, &base_entry->idx.oid)) {
 			/*
 			 * If base_ref was set above that means we wish to
 			 * reuse delta data, and we even found that base
@@ -1820,6 +1833,11 @@ static int type_size_sort(const void *_a, const void *_b)
 		return -1;
 	if (a->preferred_base < b->preferred_base)
 		return 1;
+	if (use_delta_islands) {
+		int island_cmp = island_delta_cmp(&a->idx.oid, &b->idx.oid);
+		if (island_cmp)
+			return island_cmp;
+	}
 	if (a_size > b_size)
 		return -1;
 	if (a_size < b_size)
@@ -1968,6 +1986,9 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
 	if (trg_size < src_size / 32)
 		return 0;
 
+	if (!in_same_island(&trg->entry->idx.oid, &src->entry->idx.oid))
+		return 0;
+
 	/* Load data if not already done */
 	if (!trg->data) {
 		read_lock();
@@ -2506,6 +2527,9 @@ static void prepare_pack(int window, int depth)
 	uint32_t i, nr_deltas;
 	unsigned n;
 
+	if (use_delta_islands)
+		resolve_tree_islands(progress, &to_pack);
+
 	get_object_details();
 
 	/*
@@ -2669,6 +2693,9 @@ static void show_commit(struct commit *commit, void *data)
 
 	if (write_bitmap_index)
 		index_commit_for_bitmap(commit);
+
+	if (use_delta_islands)
+		propagate_island_marks(commit);
 }
 
 static void show_object(struct object *obj, const char *name, void *data)
@@ -2676,6 +2703,19 @@ static void show_object(struct object *obj, const char *name, void *data)
 	add_preferred_base_object(name);
 	add_object_entry(&obj->oid, obj->type, name, 0);
 	obj->flags |= OBJECT_ADDED;
+
+	if (use_delta_islands) {
+		const char *p;
+		unsigned depth = 0;
+		struct object_entry *ent;
+
+		for (p = strchr(name, '/'); p; p = strchr(p + 1, '/'))
+			depth++;
+
+		ent = packlist_find(&to_pack, obj->oid.hash, NULL);
+		if (ent && depth > ent->tree_depth)
+			ent->tree_depth = depth;
+	}
 }
 
 static void show_object__ma_allow_any(struct object *obj, const char *name, void *data)
@@ -3003,6 +3043,9 @@ static void get_object_list(int ac, const char **av)
 	if (use_bitmap_index && !get_object_list_from_bitmap(&revs))
 		return;
 
+	if (use_delta_islands)
+		load_delta_islands();
+
 	if (prepare_revision_walk(&revs))
 		die("revision walk setup failed");
 	mark_edges_uninteresting(&revs, show_edge);
@@ -3182,6 +3225,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		  option_parse_missing_action },
 		OPT_BOOL(0, "exclude-promisor-objects", &exclude_promisor_objects,
 			 N_("do not pack objects in promisor packfiles")),
+		OPT_BOOL(0, "delta-islands", &use_delta_islands,
+			 N_("enable islands for delta compression")),
 		OPT_END(),
 	};
 
@@ -3308,6 +3353,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (pack_to_stdout || !rev_list_all)
 		write_bitmap_index = 0;
 
+	if (use_delta_islands)
+		argv_array_push(&rp, "--topo-order");
+
 	if (progress && all_progress_implied)
 		progress = 2;
 
-- 
2.18.0.237.gffdb1dbdaa


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 4/5] repack: add delta-islands support
  2018-07-22  5:48 [RFC PATCH 0/5] Add delta islands support Christian Couder
                   ` (2 preceding siblings ...)
  2018-07-22  5:48 ` [RFC PATCH 3/5] pack-objects: add delta-islands support Christian Couder
@ 2018-07-22  5:48 ` Christian Couder
  2018-07-22  5:48 ` [RFC PATCH 5/5] t: add t9930-delta-islands.sh Christian Couder
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 37+ messages in thread
From: Christian Couder @ 2018-07-22  5:48 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder

From: Jeff King <peff@peff.net>

Implement simple support for --delta-islands option and
repack.useDeltaIslands config variable in git repack.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config.txt     | 4 ++++
 Documentation/git-repack.txt | 5 +++++
 builtin/repack.c             | 9 +++++++++
 3 files changed, 18 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index f682e92a1a..1da39da2a6 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -3091,6 +3091,10 @@ repack.packKeptObjects::
 	index is being written (either via `--write-bitmap-index` or
 	`repack.writeBitmaps`).
 
+repack.useDeltaIslands::
+	If set to true, makes `git repack` act as if `--delta-islands`
+	was passed. Defaults to `false`.
+
 repack.writeBitmaps::
 	When true, git will write a bitmap index when packing all
 	objects to disk (e.g., when `git repack -a` is run).  This
diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index d90e7907f4..a8b2d4722f 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -155,6 +155,11 @@ depth is 4095.
 	being removed. In addition, any unreachable loose objects will
 	be packed (and their loose counterparts removed).
 
+-i::
+--delta-islands::
+	Pass the `--delta-islands` option to `git-pack-objects`, see
+	linkgit:git-pack-objects[1].
+
 Configuration
 -------------
 
diff --git a/builtin/repack.c b/builtin/repack.c
index 6c636e159e..5ab9ee69e4 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -12,6 +12,7 @@
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
 static int write_bitmaps;
+static int use_delta_islands;
 static char *packdir, *packtmp;
 
 static const char *const git_repack_usage[] = {
@@ -40,6 +41,10 @@ static int repack_config(const char *var, const char *value, void *cb)
 		write_bitmaps = git_config_bool(var, value);
 		return 0;
 	}
+	if (!strcmp(var, "repack.usedeltaislands")) {
+		use_delta_islands = git_config_bool(var, value);
+		return 0;
+	}
 	return git_default_config(var, value, cb);
 }
 
@@ -194,6 +199,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("pass --local to git-pack-objects")),
 		OPT_BOOL('b', "write-bitmap-index", &write_bitmaps,
 				N_("write bitmap index")),
+		OPT_BOOL('i', "delta-islands", &use_delta_islands,
+				N_("pass --delta-islands to git-pack-objects")),
 		OPT_STRING(0, "unpack-unreachable", &unpack_unreachable, N_("approxidate"),
 				N_("with -A, do not loosen objects older than this")),
 		OPT_BOOL('k', "keep-unreachable", &keep_unreachable,
@@ -267,6 +274,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		argv_array_pushf(&cmd.args, "--no-reuse-object");
 	if (write_bitmaps)
 		argv_array_push(&cmd.args, "--write-bitmap-index");
+	if (use_delta_islands)
+		argv_array_push(&cmd.args, "--delta-islands");
 
 	if (pack_everything & ALL_INTO_ONE) {
 		get_non_kept_pack_filenames(&existing_packs, &keep_pack_list);
-- 
2.18.0.237.gffdb1dbdaa


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 5/5] t: add t9930-delta-islands.sh
  2018-07-22  5:48 [RFC PATCH 0/5] Add delta islands support Christian Couder
                   ` (3 preceding siblings ...)
  2018-07-22  5:48 ` [RFC PATCH 4/5] repack: " Christian Couder
@ 2018-07-22  5:48 ` Christian Couder
  2018-07-24 10:24   ` Jeff King
  2018-07-24 10:16 ` [RFC PATCH 0/5] Add delta islands support Jeff King
  2018-07-26 13:34 ` Johannes Schindelin
  6 siblings, 1 reply; 37+ messages in thread
From: Christian Couder @ 2018-07-22  5:48 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Christian Couder

From: Jeff King <peff@peff.net>

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 t/t9930-delta-islands.sh | 143 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 143 insertions(+)
 create mode 100755 t/t9930-delta-islands.sh

diff --git a/t/t9930-delta-islands.sh b/t/t9930-delta-islands.sh
new file mode 100755
index 0000000000..fea92a5777
--- /dev/null
+++ b/t/t9930-delta-islands.sh
@@ -0,0 +1,143 @@
+#!/bin/sh
+
+test_description='exercise delta islands'
+. ./test-lib.sh
+
+# returns true iff $1 is a delta based on $2
+is_delta_base () {
+	delta_base=$(echo "$1" | git cat-file --batch-check='%(deltabase)') &&
+	echo >&2 "$1 has base $delta_base" &&
+	test "$delta_base" = "$2"
+}
+
+# generate a commit on branch $1 with a single file, "file", whose
+# content is mostly based on the seed $2, but with a unique bit
+# of content $3 appended. This should allow us to see whether
+# blobs of different refs delta against each other.
+commit() {
+	blob=$({ test-tool genrandom "$2" 10240 && echo "$3"; } |
+	       git hash-object -w --stdin) &&
+	tree=$(printf '100644 blob %s\tfile\n' "$blob" | git mktree) &&
+	commit=$(echo "$2-$3" | git commit-tree "$tree" ${4:+-p "$4"}) &&
+	git update-ref "refs/heads/$1" "$commit" &&
+	eval "$1"'=$(git rev-parse $1:file)' &&
+	eval "echo >&2 $1=\$$1"
+}
+
+test_expect_success 'setup commits' '
+	commit one seed 1 &&
+	commit two seed 12
+'
+
+# Note: This is heavily dependent on the "prefer larger objects as base"
+# heuristic.
+test_expect_success 'vanilla repack deltas one against two' '
+	git repack -adf &&
+	is_delta_base $one $two
+'
+
+test_expect_success 'island repack with no island definition is vanilla' '
+	git repack -adfi &&
+	is_delta_base $one $two
+'
+
+test_expect_success 'island repack with no matches is vanilla' '
+	git -c "pack.island=refs/foo" repack -adfi &&
+	is_delta_base $one $two
+'
+
+test_expect_success 'separate islands disallows delta' '
+	git -c "pack.island=refs/heads/(.*)" repack -adfi &&
+	! is_delta_base $one $two &&
+	! is_delta_base $two $one
+'
+
+test_expect_success 'same island allows delta' '
+	git -c "pack.island=refs/heads" repack -adfi &&
+	is_delta_base $one $two
+'
+
+test_expect_success 'coalesce same-named islands' '
+	git \
+		-c "pack.island=refs/(.*)/one" \
+		-c "pack.island=refs/(.*)/two" \
+		repack -adfi &&
+	is_delta_base $one $two
+'
+
+test_expect_success 'island restrictions drop reused deltas' '
+	git repack -adfi &&
+	is_delta_base $one $two &&
+	git -c "pack.island=refs/heads/(.*)" repack -adi &&
+	! is_delta_base $one $two &&
+	! is_delta_base $two $one
+'
+
+test_expect_success 'island regexes are left-anchored' '
+	git -c "pack.island=heads/(.*)" repack -adfi &&
+	is_delta_base $one $two
+'
+
+test_expect_success 'island regexes follow last-one-wins scheme' '
+	git \
+		-c "pack.island=refs/heads/(.*)" \
+		-c "pack.island=refs/heads/" \
+		repack -adfi &&
+	is_delta_base $one $two
+'
+
+test_expect_success 'setup shared history' '
+	commit root shared root &&
+	commit one shared 1 root &&
+	commit two shared 12-long root
+'
+
+# We know that $two will be preferred as a base from $one,
+# because we can transform it with a pure deletion.
+#
+# We also expect $root as a delta against $two by the "longest is base" rule.
+test_expect_success 'vanilla delta goes between branches' '
+	git repack -adf &&
+	is_delta_base $one $two &&
+	is_delta_base $root $two
+'
+
+# Here we should allow $one to base itself on $root; even though
+# they are in different islands, the objects in $root are in a superset
+# of islands compared to those in $one.
+#
+# Similarly, $two can delta against $root by our rules. And unlike $one,
+# in which we are just allowing it, the island rules actually put $root
+# as a possible base for $two, which it would not otherwise be (due to the size
+# sorting).
+test_expect_success 'deltas allowed against superset islands' '
+	git -c "pack.island=refs/heads/(.*)" repack -adfi &&
+	is_delta_base $one $root &&
+	is_delta_base $two $root
+'
+
+# We are going to test the packfile order here, so we again have to make some
+# assumptions. We assume that "$root", as part of our core "one", must come
+# before "$two". This should be guaranteed by the island code. However, for
+# this test to fail without islands, we are also assuming that it would not
+# otherwise do so. This is true by the current write order, which will put
+# commits (and their contents) before their parents.
+test_expect_success 'island core places core objects first' '
+	cat >expect <<-EOF &&
+	$root
+	$two
+	EOF
+	git -c "pack.island=refs/heads/(.*)" \
+	    -c "pack.islandcore=one" \
+	    repack -adfi &&
+	git verify-pack -v .git/objects/pack/*.pack |
+	cut -d" " -f1 |
+	egrep "$root|$two" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'unmatched island core is not fatal' '
+	git -c "pack.islandcore=one" repack -adfi
+'
+
+test_done
-- 
2.18.0.237.gffdb1dbdaa


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/5] Add delta-islands.{c,h}
  2018-07-22  5:48 ` [RFC PATCH 2/5] Add delta-islands.{c,h} Christian Couder
@ 2018-07-22  8:50   ` Duy Nguyen
  2018-07-22 13:57     ` Christian Couder
  2018-08-05 18:53     ` Christian Couder
  2018-07-24 16:47   ` Junio C Hamano
  2018-07-27  9:40   ` Jeff King
  2 siblings, 2 replies; 37+ messages in thread
From: Duy Nguyen @ 2018-07-22  8:50 UTC (permalink / raw)
  To: Christian Couder
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Christian Couder

On Sun, Jul 22, 2018 at 7:51 AM Christian Couder
<christian.couder@gmail.com> wrote:
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index a32172a43c..f682e92a1a 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -2542,6 +2542,10 @@ Note that changing the compression level will not automatically recompress
>  all existing objects. You can force recompression by passing the -F option
>  to linkgit:git-repack[1].
>
> +pack.island::
> +       A regular expression configuring a set of delta islands. See
> +       "DELTA ISLANDS" in linkgit:git-pack-objects[1] for details.
> +

That section is not added until 3/5 though.

> diff --git a/delta-islands.c b/delta-islands.c
> new file mode 100644
> index 0000000000..645fe966c5
> --- /dev/null
> +++ b/delta-islands.c
> @@ -0,0 +1,490 @@
> +#include "builtin.h"

A bit weird that builtin.h would be needed...

> +void resolve_tree_islands(int progress, struct packing_data *to_pack)
> +{
> +       struct progress *progress_state = NULL;
> +       struct object_entry **todo;
> +       int nr = 0;
> +       int i;
> +
> +       if (!island_marks)
> +               return;
> +
> +       /*
> +        * We process only trees, as commits and tags have already been handled
> +        * (and passed their marks on to root trees, as well. We must make sure
> +        * to process them in descending tree-depth order so that marks
> +        * propagate down the tree properly, even if a sub-tree is found in
> +        * multiple parent trees.
> +        */
> +       todo = xmalloc(to_pack->nr_objects * sizeof(*todo));
> +       for (i = 0; i < to_pack->nr_objects; i++) {
> +               if (oe_type(&to_pack->objects[i]) == OBJ_TREE)
> +                       todo[nr++] = &to_pack->objects[i];
> +       }
> +       qsort(todo, nr, sizeof(*todo), cmp_tree_depth);
> +
> +       if (progress)
> +               progress_state = start_progress("Propagating island marks", nr);

_() (same comment for other strings too)

> diff --git a/pack-objects.h b/pack-objects.h
> index edf74dabdd..8eecd67991 100644
> --- a/pack-objects.h
> +++ b/pack-objects.h
> @@ -100,6 +100,10 @@ struct object_entry {
>         unsigned type_:TYPE_BITS;
>         unsigned no_try_delta:1;
>         unsigned in_pack_type:TYPE_BITS; /* could be delta */
> +
> +       unsigned int tree_depth; /* should be repositioned for packing? */
> +       unsigned char layer;
> +

This looks very much like an optional feature. To avoid increasing
pack-objects memory usage for common case, please move this to struct
packing_data.

>         unsigned preferred_base:1; /*
>                                     * we do not pack this, but is available
>                                     * to be used as the base object to delta
> --
> 2.18.0.237.gffdb1dbdaa
-- 
Duy

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 3/5] pack-objects: add delta-islands support
  2018-07-22  5:48 ` [RFC PATCH 3/5] pack-objects: add delta-islands support Christian Couder
@ 2018-07-22  8:55   ` Duy Nguyen
  2018-08-05 17:28     ` Christian Couder
  2018-07-23 18:52   ` Stefan Beller
  2018-07-24 17:03   ` Junio C Hamano
  2 siblings, 1 reply; 37+ messages in thread
From: Duy Nguyen @ 2018-07-22  8:55 UTC (permalink / raw)
  To: Christian Couder
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Christian Couder

On Sun, Jul 22, 2018 at 7:52 AM Christian Couder
<christian.couder@gmail.com> wrote:
> @@ -700,51 +705,58 @@ static struct object_entry **compute_write_order(void)
>          */
>         for_each_tag_ref(mark_tagged, NULL);
>
> -       /*
> -        * Give the objects in the original recency order until
> -        * we see a tagged tip.
> -        */
> +       if (use_delta_islands)
> +               max_layers = compute_pack_layers(&to_pack);
> +
>         ALLOC_ARRAY(wo, to_pack.nr_objects);
> -       for (i = wo_end = 0; i < to_pack.nr_objects; i++) {
> -               if (objects[i].tagged)
> -                       break;
> -               add_to_write_order(wo, &wo_end, &objects[i]);
> -       }
> -       last_untagged = i;
> +       wo_end = 0;
>
> -       /*
> -        * Then fill all the tagged tips.
> -        */
> -       for (; i < to_pack.nr_objects; i++) {
> -               if (objects[i].tagged)
> +       for (; write_layer < max_layers; ++write_layer) {
> +               /*
> +                * Give the objects in the original recency order until
> +                * we see a tagged tip.
> +                */
> +               for (i = 0; i < to_pack.nr_objects; i++) {
> +                       if (objects[i].tagged)
> +                               break;
>                         add_to_write_order(wo, &wo_end, &objects[i]);
> -       }
> +               }
> +               last_untagged = i;
>
> -       /*
> -        * And then all remaining commits and tags.
> -        */
> -       for (i = last_untagged; i < to_pack.nr_objects; i++) {
> -               if (oe_type(&objects[i]) != OBJ_COMMIT &&
> -                   oe_type(&objects[i]) != OBJ_TAG)
> -                       continue;
> -               add_to_write_order(wo, &wo_end, &objects[i]);
> -       }
> +               /*
> +                * Then fill all the tagged tips.
> +                */

If we move the code in this loop to a separate function, in a separate
patch, first, would it produce a better diff? I think all the
indentation change here makes it a bit hard to read.

> +               for (; i < to_pack.nr_objects; i++) {
> +                       if (objects[i].tagged)
> +                               add_to_write_order(wo, &wo_end, &objects[i]);
> +               }
-- 
Duy

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/5] Add delta-islands.{c,h}
  2018-07-22  8:50   ` Duy Nguyen
@ 2018-07-22 13:57     ` Christian Couder
  2018-08-05 18:53     ` Christian Couder
  1 sibling, 0 replies; 37+ messages in thread
From: Christian Couder @ 2018-07-22 13:57 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Git Mailing List, Junio C Hamano, Jeff King, Christian Couder

On Sun, Jul 22, 2018 at 10:50 AM, Duy Nguyen <pclouds@gmail.com> wrote:
> On Sun, Jul 22, 2018 at 7:51 AM Christian Couder
> <christian.couder@gmail.com> wrote:

>> +pack.island::
>> +       A regular expression configuring a set of delta islands. See
>> +       "DELTA ISLANDS" in linkgit:git-pack-objects[1] for details.
>> +
>
> That section is not added until 3/5 though.

Yeah, so I guess it is better to move this hunk to 3/5 and keep
pack.island undocumented until the delta islands code is actually used
by pack-objects.

>> diff --git a/delta-islands.c b/delta-islands.c
>> new file mode 100644
>> index 0000000000..645fe966c5
>> --- /dev/null
>> +++ b/delta-islands.c
>> @@ -0,0 +1,490 @@
>> +#include "builtin.h"
>
> A bit weird that builtin.h would be needed...

Yeah, I will get rid of that include in the next iteration.

>> +       if (progress)
>> +               progress_state = start_progress("Propagating island marks", nr);
>
> _() (same comment for other strings too)

Ok, the strings will be marked for translation in the next iteration.

>> diff --git a/pack-objects.h b/pack-objects.h
>> index edf74dabdd..8eecd67991 100644
>> --- a/pack-objects.h
>> +++ b/pack-objects.h
>> @@ -100,6 +100,10 @@ struct object_entry {
>>         unsigned type_:TYPE_BITS;
>>         unsigned no_try_delta:1;
>>         unsigned in_pack_type:TYPE_BITS; /* could be delta */
>> +
>> +       unsigned int tree_depth; /* should be repositioned for packing? */
>> +       unsigned char layer;
>> +
>
> This looks very much like an optional feature. To avoid increasing
> pack-objects memory usage for common case, please move this to struct
> packing_data.

Ok, I will take a look at that.

Thanks for the review,
Christian.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 3/5] pack-objects: add delta-islands support
  2018-07-22  5:48 ` [RFC PATCH 3/5] pack-objects: add delta-islands support Christian Couder
  2018-07-22  8:55   ` Duy Nguyen
@ 2018-07-23 18:52   ` Stefan Beller
  2018-07-24  9:58     ` Jeff King
  2018-07-24 17:03   ` Junio C Hamano
  2 siblings, 1 reply; 37+ messages in thread
From: Stefan Beller @ 2018-07-23 18:52 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, Junio C Hamano, Jeff King, Christian Couder

On Sat, Jul 21, 2018 at 10:49 PM Christian Couder
<christian.couder@gmail.com> wrote:
>
> From: Jeff King <peff@peff.net>
>
> Implement support for delta islands in git pack-objects
> and document how delta islands work in
> "Documentation/git-pack-objects.txt".
>
> Signed-off-by: Jeff King <peff@peff.net>
> Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> ---
>  Documentation/git-pack-objects.txt |  88 +++++++++++++++++++
>  builtin/pack-objects.c             | 130 ++++++++++++++++++++---------
>  2 files changed, 177 insertions(+), 41 deletions(-)
>
> diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
> index d95b472d16..7b7a36056f 100644
> --- a/Documentation/git-pack-objects.txt
> +++ b/Documentation/git-pack-objects.txt
> @@ -289,6 +289,94 @@ Unexpected missing object will raise an error.
>  --unpack-unreachable::
>         Keep unreachable objects in loose form. This implies `--revs`.
>
> +--delta-islands::
> +       Restrict delta matches based on "islands". See DELTA ISLANDS
> +       below.
> +
> +
> +DELTA ISLANDS
> +-------------
> +
> +When possible, `pack-objects` tries to reuse existing on-disk deltas to
> +avoid having to search for new ones on the fly. This is an important
> +optimization for serving fetches, because it means the server can avoid
> +inflating most objects at all and just send the bytes directly from
> +disk. This optimization can't work when an object is stored as a delta
> +against a base which the receiver does not have (and which we are not
> +already sending). In that case the server "breaks" the delta and has to
> +find a new one, which has a high CPU cost. Therefore it's important for
> +performance that the set of objects in on-disk delta relationships match
> +what a client would fetch.
> +
> +In a normal repository, this tends to work automatically. The objects
> +are mostly reachable from the branches and tags, and that's what clients
> +fetch. Any deltas we find on the server are likely to be between objects
> +the client has or will have.
> +
> +But in some repository setups, you may have several related but separate
> +groups of ref tips, with clients tending to fetch those groups
> +independently. For example, imagine that you are hosting several "forks"
> +of a repository in a single shared object store, and letting clients
> +view them as separate repositories through `GIT_NAMESPACE` or separate
> +repos using the alternates mechanism. A naive repack may find that the
> +optimal delta for an object is against a base that is only found in
> +another fork. But when a client fetches, they will not have the base
> +object, and we'll have to find a new delta on the fly.
> +
> +A similar situation may exist if you have many refs outside of
> +`refs/heads/` and `refs/tags/` that point to related objects (e.g.,
> +`refs/pull` or `refs/changes` used by some hosting providers). By
> +default, clients fetch only heads and tags, and deltas against objects
> +found only in those other groups cannot be sent as-is.
> +
> +Delta islands solve this problem by allowing you to group your refs into
> +distinct "islands". Pack-objects computes which objects are reachable
> +from which islands, and refuses to make a delta from an object `A`
> +against a base which is not present in all of `A`'s islands. This
> +results in slightly larger packs (because we miss some delta
> +opportunities), but guarantees that a fetch of one island will not have
> +to recompute deltas on the fly due to crossing island boundaries.
> +
> +Islands are configured via the `pack.island` option, which can be
> +specified multiple times. Each value is a left-anchored regular
> +expressions matching refnames. For example:
> +
> +-------------------------------------------
> +[pack]
> +island = refs/heads/
> +island = refs/tags/
> +-------------------------------------------
> +
> +puts heads and tags into an island (whose name is the empty string; see
> +below for more on naming). Any refs which do not match those regular
> +expressions (e.g., `refs/pull/123`) is not in any island. Any object
> +which is reachable only from `refs/pull/` (but not heads or tags) is
> +therefore not a candidate to be used as a base for `refs/heads/`.
> +
> +Refs are grouped into islands based on their "names", and two regexes
> +that produce the same name are considered to be in the same island. The
> +names are computed from the regexes by concatenating any capture groups
> +from the regex (and if there are none, then the name is the empty
> +string, as in the above example). This allows you to create arbitrary
> +numbers of islands. For example, imagine you store the refs for each
> +fork in `refs/virtual/ID`, where `ID` is a numeric identifier. You might
> +then configure:
> +
> +-------------------------------------------
> +[pack]
> +island = refs/virtual/([0-9]+)/heads/
> +island = refs/virtual/([0-9]+)/tags/
> +island = refs/virtual/([0-9]+)/(pull)/
> +-------------------------------------------
> +
> +That puts the heads and tags for each fork in their own island (named
> +"1234" or similar), and the pull refs for each go into their own
> +"1234-pull".
> +
> +Note that we pick a single island for each regex to go into, using "last
> +one wins" ordering (which allows repo-specific config to take precedence
> +over user-wide config, and so forth).

I had to read all of this [background information] to understand the
concept and I think it is misnamed, as my gut instinct first told me
to have deltas only "within an island and no island hopping is allowed".
(This message reads a bit like a commit message, not as documentation
as it is long winded, too).

This feature makes sure that the "common foundation" base is packed
in a way that it is not borrowing construction pieces from any of
the different things atop the common foundation.
It really is about packing the base, but naming it related to the
islands, that are on top of the common sea bed led me to think
that the islands are important of this feature, but really it is about
making the sea bed easy to use and not tied to one of the islands?

What about renaming this feature to

[pack]
    excludePartialReach = refs/virtual/[0-9]]+/tags/

  "By setting `pack.excludePartialReach`, object deltafication is
  prohibited for objects that are not reachable from all
  manifestations of the given regex"

Cryptic, but it explains it in my mind in a shorter, more concise way. ;-)


> @@ -3182,6 +3225,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>                   option_parse_missing_action },
>                 OPT_BOOL(0, "exclude-promisor-objects", &exclude_promisor_objects,
>                          N_("do not pack objects in promisor packfiles")),
> +               OPT_BOOL(0, "delta-islands", &use_delta_islands,
> +                        N_("enable islands for delta compression")),

We enable this feature, but we disallow certain patterns to be used in packing,
so it sounds weird to me as we tell Git to *not* explore the full design space,
so we're not enabling it, but rather restricting it?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 3/5] pack-objects: add delta-islands support
  2018-07-23 18:52   ` Stefan Beller
@ 2018-07-24  9:58     ` Jeff King
  2018-07-24 17:20       ` Stefan Beller
  0 siblings, 1 reply; 37+ messages in thread
From: Jeff King @ 2018-07-24  9:58 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Christian Couder, git, Junio C Hamano, Christian Couder

On Mon, Jul 23, 2018 at 11:52:49AM -0700, Stefan Beller wrote:

> > +DELTA ISLANDS
> > +-------------
> [...]
> 
> I had to read all of this [background information] to understand the
> concept and I think it is misnamed, as my gut instinct first told me
> to have deltas only "within an island and no island hopping is allowed".
> (This message reads a bit like a commit message, not as documentation
> as it is long winded, too).

I'm not sure if I'm parsing your sentence correctly, but the reason I
called them "islands" is exactly that you'd have deltas within an island
and want to forbid island hopping. So I wasn't sure if you were saying
"that's what I think, but not how the feature works".

There _is_ a tricky thing, which is that a given object is going to
appear in many islands. So the rule is really "you cannot hop to a base
that is not in all of your islands".

> What about renaming this feature to
> 
> [pack]
>     excludePartialReach = refs/virtual/[0-9]]+/tags/
> 
>   "By setting `pack.excludePartialReach`, object deltafication is
>   prohibited for objects that are not reachable from all
>   manifestations of the given regex"
> 
> Cryptic, but it explains it in my mind in a shorter, more concise way. ;-)

So I'm hopelessly biased at this point, having developed and worked with
the feature under the "island" name for several years. But I find your
name and explanation pretty confusing. :)

Worse, though, it does not have any noun to refer to the reachable set.
The regex capture and the island names are an important part of the
feature, because it lets you make a single island out of:

  refs/virtual/([0-9]+)/heads
  refs/virtual/([0-9]+)/tags

but exclude:

  refs/virtual/([0-9]+)/(foo)

which goes into its own island ("123-foo" instead of "123"). So what's
the equivalent nomenclature to "island name"?

> > @@ -3182,6 +3225,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
> >                   option_parse_missing_action },
> >                 OPT_BOOL(0, "exclude-promisor-objects", &exclude_promisor_objects,
> >                          N_("do not pack objects in promisor packfiles")),
> > +               OPT_BOOL(0, "delta-islands", &use_delta_islands,
> > +                        N_("enable islands for delta compression")),
> 
> We enable this feature, but we disallow certain patterns to be used in packing,
> so it sounds weird to me as we tell Git to *not* explore the full design space,
> so we're not enabling it, but rather restricting it?

Yeah, perhaps "respect islands during delta compression" makes more sense.

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 0/5] Add delta islands support
  2018-07-22  5:48 [RFC PATCH 0/5] Add delta islands support Christian Couder
                   ` (4 preceding siblings ...)
  2018-07-22  5:48 ` [RFC PATCH 5/5] t: add t9930-delta-islands.sh Christian Couder
@ 2018-07-24 10:16 ` Jeff King
  2018-07-24 17:18   ` Junio C Hamano
  2018-07-26 13:34 ` Johannes Schindelin
  6 siblings, 1 reply; 37+ messages in thread
From: Jeff King @ 2018-07-24 10:16 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, Junio C Hamano, Christian Couder

On Sun, Jul 22, 2018 at 07:48:31AM +0200, Christian Couder wrote:

> I kept Peff as the author and took the liberty to add his
> Signed-off-by to all the patches.

Yes, that's fine. This is a topic I've been meaning to upstream for a
long time, so I'm happy to see it progressing in that direction.

A few reasons I've been hesitant:

 - this produces worse packs for people cloning. The packs we serve at
   GitHub for heavily-forked repositories can be significantly larger
   than what you could get by repacking in isolation.

   I think this is inherent in the scheme (we're losing some delta
   opportunities). But I think it's also made worse because the delta
   window gets clogged with candidates that are forbidden by the island
   config. Repacking with a big --window helps (and doesn't take as long
   as it otherwise might because we can reject some object pairs based
   on islands before doing any computation on the content).

   I also haven't done much modeling on how this scheme progresses over
   time. Imagine users tend to create forks at random points in time and
   then let them get stale. You're going to accumulate a bunch of
   islands that increasingly don't overlap with each other. And as time
   goes on, your delta opportunities get worse and worse.

   Which isn't to say the scheme doesn't work to some degree. It has
   been running in production at GitHub for several years. But it's very
   cobbled together, and I had always hoped to give it some more careful
   thought before sending it upstream. ;) That may be "the perfect is
   the enemy of the good" though (or at least the enemy of the
   mediocre).

 - the whole regex island-naming configuration feels complicated and
   hacky to me. I haven't been able to come up with a better solution,
   though.

 - the whole islandCore thing is it's own weirdness (more below)

> The patches look good to me except perhaps that "pack.islandCore" is
> not documented. Maybe something could be added in
> Documentation/technical/ too, or maybe improving commit messages could
> be enough.

So basically what islandCore does is make one island "special", and we
write all of its objects out first in the packfile. The idea here was
that it would interact well with the pack-reuse code which tries to send
the early part of the pack out verbatim. And if you could pick the fork
which is most-often cloned, then that would be a win (so you'd pick
torvalds/linux.git here).

But:

 - the pack-reuse code we have upstream is crap; it doesn't kick in for
   this instance anyway, because it's too concerned with whether you're
   sending all of the pack. So if you have a 10GB pack and the "core"
   island is the entire first 1GB, you should be able to ship out that
   first GB easily. But the code says "well, that's only 10%!", which is
   not enough to trigger.

   I have replacement code (which we have been running in production)
   that is more clever about the threshold, and also can handle gaps in
   the continuity (so we might realize we need to send objects 1-5000,
   then skip a few, then 5037-8000, and so on). And after that, it's not
   at all clear to me if the islandCore thing is actually still helpful.

 - it's hard to configure, because it may actually change over time
   (whereas all the other island config is basically static and just impacts
   how we consider the refs that are there)

 - there may actually be _multiple_ forks that are "special". In
   torvalds/linux, for example, there's a fair bit of hierarchy, and
   there are people who forked Linus and have their own active
   sub-community. The actual island concept handles this kind of
   layering OK, but the islandCore thing is pretty static.

So I'm not sure if islandCore is even needed or not. But somebody would
have to do some experiments to figure it out (and obviously I should
share the replacement pack-reuse patches).

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 5/5] t: add t9930-delta-islands.sh
  2018-07-22  5:48 ` [RFC PATCH 5/5] t: add t9930-delta-islands.sh Christian Couder
@ 2018-07-24 10:24   ` Jeff King
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2018-07-24 10:24 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, Junio C Hamano, Christian Couder

On Sun, Jul 22, 2018 at 07:48:36AM +0200, Christian Couder wrote:

> From: Jeff King <peff@peff.net>
> 
> Signed-off-by: Jeff King <peff@peff.net>
> Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> ---
>  t/t9930-delta-islands.sh | 143 +++++++++++++++++++++++++++++++++++++++

For topics that I'm not immediately sending upstream, I usually stick
them in the t99xx range, so they don't conflict with upstream tests. But
for upstream, this should probably be in the t53xx range.

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 1/5] packfile: make get_delta_base() non static
  2018-07-22  5:48 ` [RFC PATCH 1/5] packfile: make get_delta_base() non static Christian Couder
@ 2018-07-24 16:19   ` Junio C Hamano
  2018-07-27 11:29     ` Jeff King
  0 siblings, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2018-07-24 16:19 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, Jeff King, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> From: Jeff King <peff@peff.net>
>
> As get_delta_base() will be used outside 'packfile.c' in
> a following commit, let's make it non static and let's
> declare it in 'packfile.h'.

As a public function in *.h, don't we want a bit of comment there to
help those who want to use it from outside packfile.c?


> Signed-off-by: Jeff King <peff@peff.net>
> Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> ---
>  packfile.c | 10 +++++-----
>  packfile.h |  3 +++
>  2 files changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/packfile.c b/packfile.c
> index 7cd45aa4b2..4646bff5ff 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -1037,11 +1037,11 @@ const struct packed_git *has_packed_and_bad(const unsigned char *sha1)
>  	return NULL;
>  }
>  
> -static off_t get_delta_base(struct packed_git *p,
> -				    struct pack_window **w_curs,
> -				    off_t *curpos,
> -				    enum object_type type,
> -				    off_t delta_obj_offset)
> +off_t get_delta_base(struct packed_git *p,
> +		     struct pack_window **w_curs,
> +		     off_t *curpos,
> +		     enum object_type type,
> +		     off_t delta_obj_offset)
>  {
>  	unsigned char *base_info = use_pack(p, w_curs, *curpos, NULL);
>  	off_t base_offset;
> diff --git a/packfile.h b/packfile.h
> index cc7eaffe1b..30f0811382 100644
> --- a/packfile.h
> +++ b/packfile.h
> @@ -125,6 +125,9 @@ extern void *unpack_entry(struct repository *r, struct packed_git *, off_t, enum
>  extern unsigned long unpack_object_header_buffer(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep);
>  extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
>  extern int unpack_object_header(struct packed_git *, struct pack_window **, off_t *, unsigned long *);
> +extern off_t get_delta_base(struct packed_git *p, struct pack_window **w_curs,
> +			    off_t *curpos, enum object_type type,
> +			    off_t delta_obj_offset);
>  
>  extern void release_pack_memory(size_t);

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/5] Add delta-islands.{c,h}
  2018-07-22  5:48 ` [RFC PATCH 2/5] Add delta-islands.{c,h} Christian Couder
  2018-07-22  8:50   ` Duy Nguyen
@ 2018-07-24 16:47   ` Junio C Hamano
  2018-07-27 13:02     ` Jeff King
  2018-07-27  9:40   ` Jeff King
  2 siblings, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2018-07-24 16:47 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, Jeff King, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> From: Jeff King <peff@peff.net>
>
> Hosting providers that allow users to "fork" existing
> repos want as much as possible those forks to use a
> small amount of disk space.

"... want those forks to consume as little amount of disk space as
possible?"  Or perhaps "... want those forks to share as much disk
space as possible"?

> Using alternates to keep all the objects from all the forks into a
> unique central repo is way to do that, but ...

s/is way to/is a way to/, I guess, but that makes it sound as if the
series invented a way to share objects without using alternates.

> it can have some
> drawbacks. Especially when packing the central repo, deltas will
> be created between objects from different forks.
>
> This can make cloning or fetching a fork much slower and much more
> CPU intensive as Git might have to compute new deltas for many
> objects to avoid sending objects from a different fork.
>
> Delta islands is a way to store objects from different forks in
> the same repo and packfile without having deltas between objects
> from different forks.

I think another paragraph between the above two is needed to briefly
outline the central idea (which would make it clear why that is
called "delta islands") is called for.  Perhaps

	Because the inefficiency primarily arises when an object is
	delitified against another object that does not exist in the
	same fork, we partion objects into sets that appear in the
	same fork, and define "delta islands".  When finding delta
	base, we do not allow an object outside the same island to
	be considered as its base.

or something along that line.  Perhaps that would even make the last
paragraph unnecessary.

> +struct island_bitmap {
> +	uint32_t refcount;
> +	uint32_t bits[];
> +};
> +
> +static uint32_t island_bitmap_size;
> +
> +/*
> + * Allocate a new bitmap; if "old" is not NULL, the new bitmap will be a copy
> + * of "old". Otherwise, the new bitmap is empty.
> + */
> +static struct island_bitmap *island_bitmap_new(const struct island_bitmap *old)
> +{
> +	size_t size = sizeof(struct island_bitmap) + (island_bitmap_size * 4);
> +	struct island_bitmap *b = xcalloc(1, size);

Is one of the variants of flex array macros applicable here?

> +	if (old)
> +		memcpy(b, old, size);
> +
> +	b->refcount = 1;
> +	return b;
> +}
> +
> +static void island_bitmap_or(struct island_bitmap *a, const struct island_bitmap *b)
> +{
> +	uint32_t i;
> +
> +	for (i = 0; i < island_bitmap_size; ++i)
> +		a->bits[i] |= b->bits[i];
> +}
> +
> +static int island_bitmap_is_subset(struct island_bitmap *self,
> +		struct island_bitmap *super)
> +{
> +	uint32_t i;
> +
> +	if (self == super)
> +		return 1;
> +
> +	for (i = 0; i < island_bitmap_size; ++i) {
> +		if ((self->bits[i] & super->bits[i]) != self->bits[i])
> +			return 0;
> +	}
> +
> +	return 1;
> +}
> +#define ISLAND_BITMAP_BLOCK(x) (x / 32)
> +#define ISLAND_BITMAP_MASK(x) (1 << (x % 32))
> +
> +static void island_bitmap_set(struct island_bitmap *self, uint32_t i)
> +{
> +	self->bits[ISLAND_BITMAP_BLOCK(i)] |= ISLAND_BITMAP_MASK(i);
> +}
> +
> +static int island_bitmap_get(struct island_bitmap *self, uint32_t i)
> +{
> +	return (self->bits[ISLAND_BITMAP_BLOCK(i)] & ISLAND_BITMAP_MASK(i)) != 0;
> +}
> +

Not necessarily a complaint, but do we need another implementation
of bitmap here, or the compressed bitmap used for the pack bitmap is
unsuited for the purpose of this thing (e.g. perhaps it is overkill,
as we won't be shooting for saving disk footprint of a bitmap that
we are not going to save on disk anyway)?

> +static int cmp_tree_depth(const void *va, const void *vb)
> +{
> +	struct object_entry *a = *(struct object_entry **)va;
> +	struct object_entry *b = *(struct object_entry **)vb;
> +	return a->tree_depth - b->tree_depth;
> +}
> +
> +void resolve_tree_islands(int progress, struct packing_data *to_pack)
> +{
> +	struct progress *progress_state = NULL;
> +	struct object_entry **todo;
> +	int nr = 0;
> +	int i;
> +
> +	if (!island_marks)
> +		return;
> +
> +	/*
> +	 * We process only trees, as commits and tags have already been handled
> +	 * (and passed their marks on to root trees, as well. We must make sure
> +	 * to process them in descending tree-depth order so that marks
> +	 * propagate down the tree properly, even if a sub-tree is found in
> +	 * multiple parent trees.
> +	 */
> +	todo = xmalloc(to_pack->nr_objects * sizeof(*todo));
> +	for (i = 0; i < to_pack->nr_objects; i++) {
> +		if (oe_type(&to_pack->objects[i]) == OBJ_TREE)
> +			todo[nr++] = &to_pack->objects[i];
> +	}
> +	qsort(todo, nr, sizeof(*todo), cmp_tree_depth);

Hmph, at this stage nobody actually seems to set tree_depth; I am
wondering how a tree that is at the root in one commit but appears
as a subdirectory in another commit is handled by this code.

> +static regex_t *island_regexes;
> +static unsigned int island_regexes_alloc, island_regexes_nr;
> +static const char *core_island_name;

Are these (and the bitmap & hashtable) something that "everything in
the_repository" folks would come in and nuke as global variables?  I
haven't thought deeply about it, but these smell like that there
would be one set of such in-core variables per one in-core object
store plus refs (i.e. repository instance).


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 3/5] pack-objects: add delta-islands support
  2018-07-22  5:48 ` [RFC PATCH 3/5] pack-objects: add delta-islands support Christian Couder
  2018-07-22  8:55   ` Duy Nguyen
  2018-07-23 18:52   ` Stefan Beller
@ 2018-07-24 17:03   ` Junio C Hamano
  2018-07-24 17:11     ` Junio C Hamano
  2 siblings, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2018-07-24 17:03 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, Jeff King, Christian Couder

Christian Couder <christian.couder@gmail.com> writes:

> +Islands are configured via the `pack.island` option, which can be
> +specified multiple times. Each value is a left-anchored regular
> +expressions matching refnames. For example:
> +
> +-------------------------------------------
> +[pack]
> +island = refs/heads/
> +island = refs/tags/
> +-------------------------------------------
> +
> +puts heads and tags into an island (whose name is the empty string; see
> +below for more on naming). Any refs which do not match those regular
> +expressions (e.g., `refs/pull/123`) is not in any island. Any object
> +which is reachable only from `refs/pull/` (but not heads or tags) is
> +therefore not a candidate to be used as a base for `refs/heads/`.
> +
> +Refs are grouped into islands based on their "names", and two regexes
> +that produce the same name are considered to be in the same island. The
> +names are computed from the regexes by concatenating any capture groups
> +from the regex (and if there are none, then the name is the empty
> +string, as in the above example). This allows you to create arbitrary
> +numbers of islands. For example, imagine you store the refs for each
> +fork in `refs/virtual/ID`, where `ID` is a numeric identifier. You might
> +then configure:
> +
> +-------------------------------------------
> +[pack]
> +island = refs/virtual/([0-9]+)/heads/
> +island = refs/virtual/([0-9]+)/tags/
> +island = refs/virtual/([0-9]+)/(pull)/
> +-------------------------------------------

It becomes clear only from this example that what the feature calls
(and documented in patch 2/5) "regexp" is not BRE but ERE.  Update
2/5 so that it is clear to readers of "git config --help" who looks
for "pack.island" in the output.

> +That puts the heads and tags for each fork in their own island (named
> +"1234" or similar), and the pull refs for each go into their own
> +"1234-pull".

"by concatenating any capture groups" made me imagine that the last
one would be "1234pull" without dash.  The actual rule should be
mentioned in that paragraph (i.e.  "concatenating any capture groups
from the regex, with a '-' dash in between" or something like that).

> +Note that we pick a single island for each regex to go into, using "last
> +one wins" ordering (which allows repo-specific config to take precedence
> +over user-wide config, and so forth).

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 3/5] pack-objects: add delta-islands support
  2018-07-24 17:03   ` Junio C Hamano
@ 2018-07-24 17:11     ` Junio C Hamano
  2018-08-05 17:40       ` Christian Couder
  0 siblings, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2018-07-24 17:11 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, Jeff King, Christian Couder

Junio C Hamano <gitster@pobox.com> writes:

>> +-------------------------------------------
>> +[pack]
>> +island = refs/virtual/([0-9]+)/heads/
>> +island = refs/virtual/([0-9]+)/tags/
>> +island = refs/virtual/([0-9]+)/(pull)/
>> +-------------------------------------------
>
> It becomes clear only from this example that what the feature calls
> (and documented in patch 2/5) "regexp" is not BRE but ERE.  Update
> 2/5 so that it is clear to readers of "git config --help" who looks
> for "pack.island" in the output.
>
>> +That puts the heads and tags for each fork in their own island (named
>> +"1234" or similar), and the pull refs for each go into their own
>> +"1234-pull".
>
> "by concatenating any capture groups" made me imagine that the last
> one would be "1234pull" without dash.  The actual rule should be
> mentioned in that paragraph (i.e.  "concatenating any capture groups
> from the regex, with a '-' dash in between" or something like that).

Another thing I noticed from 2/5 is that you can have up to 7 such
capture groups.  I do not have any opinion if 7 is too few or too
many, but we would want the number to be documented, and end-user
input diagnosed when it requires more captures than we support (if
we are not already checking, that is).

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 0/5] Add delta islands support
  2018-07-24 10:16 ` [RFC PATCH 0/5] Add delta islands support Jeff King
@ 2018-07-24 17:18   ` Junio C Hamano
  2018-07-24 21:14     ` Jeff King
  0 siblings, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2018-07-24 17:18 UTC (permalink / raw)
  To: Jeff King; +Cc: Christian Couder, git, Christian Couder

Jeff King <peff@peff.net> writes:

>    I think this is inherent in the scheme (we're losing some delta
>    opportunities). But I think it's also made worse because the delta
>    window gets clogged with candidates that are forbidden by the island
>    config.

Hmph, and the reason why objects that do not even belong to the same
island to be usable as a base are in the object list in the first
place is...?

>    Repacking with a big --window helps (and doesn't take as long
>    as it otherwise might because we can reject some object pairs based
>    on islands before doing any computation on the content).

Ah, then yes, a large window with early culling based on the delta
island criteria definitely sounds like the right solution to that
problem.

>    I have replacement code (which we have been running in production)
>    that is more clever about the threshold, and also can handle gaps in
>    the continuity (so we might realize we need to send objects 1-5000,
>    then skip a few, then 5037-8000, and so on).

That vaguely sounds similar to what folks at $DAYJOB runs in their
Gerrit/jgit thing.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 3/5] pack-objects: add delta-islands support
  2018-07-24  9:58     ` Jeff King
@ 2018-07-24 17:20       ` Stefan Beller
  2018-07-27 13:13         ` Jeff King
  0 siblings, 1 reply; 37+ messages in thread
From: Stefan Beller @ 2018-07-24 17:20 UTC (permalink / raw)
  To: Jeff King; +Cc: Christian Couder, git, Junio C Hamano, Christian Couder

On Tue, Jul 24, 2018 at 2:58 AM Jeff King <peff@peff.net> wrote:
>
> On Mon, Jul 23, 2018 at 11:52:49AM -0700, Stefan Beller wrote:
>
> > > +DELTA ISLANDS
> > > +-------------
> > [...]
> >
> > I had to read all of this [background information] to understand the
> > concept and I think it is misnamed, as my gut instinct first told me
> > to have deltas only "within an island and no island hopping is allowed".
> > (This message reads a bit like a commit message, not as documentation
> > as it is long winded, too).
>
> I'm not sure if I'm parsing your sentence correctly, but the reason I
> called them "islands" is exactly that you'd have deltas within an island
> and want to forbid island hopping. So I wasn't sure if you were saying
> "that's what I think, but not how the feature works".

Yeah, you want to avoid island hopping, but still want the sea bed to be
useful for all islands (i.e. to have the largest possible base pack, that all
islands can share without being overexposed with objects they should
not see). And that is the main feature IMHO. It is not so much about
island separation (as you could use its own physically separated repo
for these separation tasks), but the main selling point of this feature
is that it enables a base pack that can be shared across all islands
without violating ACLs for example or making one island "too big"
(having unneeded objects on the island).

So metaphorically speaking I assumed the sea bed to support
all islands, which themselves are independent from each other.

> There _is_ a tricky thing, which is that a given object is going to
> appear in many islands. So the rule is really "you cannot hop to a base
> that is not in all of your islands".

Yes, if islands were numbers, we'd be looking for the Greatest common
common divisor for all of them, as all islands have to have access to
all objects in the base pack.

>
> > What about renaming this feature to
> >
> > [pack]
> >     excludePartialReach = refs/virtual/[0-9]]+/tags/
> >
> >   "By setting `pack.excludePartialReach`, object deltafication is
> >   prohibited for objects that are not reachable from all
> >   manifestations of the given regex"
> >
> > Cryptic, but it explains it in my mind in a shorter, more concise way. ;-)
>
> So I'm hopelessly biased at this point, having developed and worked with
> the feature under the "island" name for several years. But I find your
> name and explanation pretty confusing. :)

Ok.

>
> Worse, though, it does not have any noun to refer to the reachable set.
> The regex capture and the island names are an important part of the
> feature, because it lets you make a single island out of:
>
>   refs/virtual/([0-9]+)/heads
>   refs/virtual/([0-9]+)/tags
>
> but exclude:
>
>   refs/virtual/([0-9]+)/(foo)
>
> which goes into its own island ("123-foo" instead of "123"). So what's
> the equivalent nomenclature to "island name"?

So in my understanding we have a "common base pack" and specific
packs on top for each "island".

Regarding naming I find islands interesting and well fitting here, I don't
know of any better name.

I was just confused in the beginning as the name indicates that we care
about the islands, but we rather care about *not* hopping them.

Do you envision to have "groups of islands" (an atoll) for say all
open source clones of linux.git, such that you can layer the packs?
You would not just have the base pack + island pack, but have one
pack that is common to most islands?

Sorry if the initial email came off as a rant; I think the islands
metaphor is very good.

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 0/5] Add delta islands support
  2018-07-24 17:18   ` Junio C Hamano
@ 2018-07-24 21:14     ` Jeff King
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2018-07-24 21:14 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Christian Couder, git, Christian Couder

On Tue, Jul 24, 2018 at 10:18:03AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> >    I think this is inherent in the scheme (we're losing some delta
> >    opportunities). But I think it's also made worse because the delta
> >    window gets clogged with candidates that are forbidden by the island
> >    config.
> 
> Hmph, and the reason why objects that do not even belong to the same
> island to be usable as a base are in the object list in the first
> place is...?

Because we are doing a "git repack -ad" here. So we are considering
_all_ of the objects, but avoiding making deltas between some of them.

During an actual fetch, the islands are not used at all (but the delta
relationships left on disk are important for letting us reuse those
deltas as-is).

> >    Repacking with a big --window helps (and doesn't take as long
> >    as it otherwise might because we can reject some object pairs based
> >    on islands before doing any computation on the content).
> 
> Ah, then yes, a large window with early culling based on the delta
> island criteria definitely sounds like the right solution to that
> problem.

I try to account for this somewhat by looking at islands in the sort we
do of the delta candidates.  But to be honest, I am not sure how much
sense that makes (but I did verify experimentally that it helps). Again,
this is one of the reasons I had not sent this previously: I am not at
all confident that it is doing the right thing in all places.

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 0/5] Add delta islands support
  2018-07-22  5:48 [RFC PATCH 0/5] Add delta islands support Christian Couder
                   ` (5 preceding siblings ...)
  2018-07-24 10:16 ` [RFC PATCH 0/5] Add delta islands support Jeff King
@ 2018-07-26 13:34 ` Johannes Schindelin
  6 siblings, 0 replies; 37+ messages in thread
From: Johannes Schindelin @ 2018-07-26 13:34 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, Junio C Hamano, Jeff King, Christian Couder

Hi Chris,


On Sun, 22 Jul 2018, Christian Couder wrote:

> This patch series is upstreaming work made by GitHub and available in:
> 
> https://github.com/peff/git/commits/jk/delta-islands
> 
> The patch in the above branch has been split into 5 patches with their
> own new commit message, but no other change has been made.
> 
> I kept Peff as the author and took the liberty to add his
> Signed-off-by to all the patches.
> 
> The patches look good to me except perhaps that "pack.islandCore" is
> not documented. Maybe something could be added in
> Documentation/technical/ too, or maybe improving commit messages could
> be enough.
> 
> Anyway I wanted to send this nearly "as is" first to get early
> feedback about what should be done.
> 
> This patch series is also available on GitHub in:
> 
> https://github.com/chriscool/git/commits/delta-islands

It might make sense to explain *somewhere* in the cover letter what the
patches are all about, i.e. what problem they are supposed to solve. I did
not see any indication here.

Ciao,
Dscho

> 
> Jeff King (5):
>   packfile: make get_delta_base() non static
>   Add delta-islands.{c,h}
>   pack-objects: add delta-islands support
>   repack: add delta-islands support
>   t: add t9930-delta-islands.sh
> 
>  Documentation/config.txt           |   8 +
>  Documentation/git-pack-objects.txt |  88 ++++++
>  Documentation/git-repack.txt       |   5 +
>  Makefile                           |   1 +
>  builtin/pack-objects.c             | 130 +++++---
>  builtin/repack.c                   |   9 +
>  delta-islands.c                    | 490 +++++++++++++++++++++++++++++
>  delta-islands.h                    |  11 +
>  pack-objects.h                     |   4 +
>  packfile.c                         |  10 +-
>  packfile.h                         |   3 +
>  t/t9930-delta-islands.sh           | 143 +++++++++
>  12 files changed, 856 insertions(+), 46 deletions(-)
>  create mode 100644 delta-islands.c
>  create mode 100644 delta-islands.h
>  create mode 100755 t/t9930-delta-islands.sh
> 
> -- 
> 2.18.0.237.gffdb1dbdaa
> 
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/5] Add delta-islands.{c,h}
  2018-07-22  5:48 ` [RFC PATCH 2/5] Add delta-islands.{c,h} Christian Couder
  2018-07-22  8:50   ` Duy Nguyen
  2018-07-24 16:47   ` Junio C Hamano
@ 2018-07-27  9:40   ` Jeff King
  2 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2018-07-27  9:40 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, Junio C Hamano, Christian Couder

On Sun, Jul 22, 2018 at 07:48:33AM +0200, Christian Couder wrote:

> +	/*
> +	 * We process only trees, as commits and tags have already been handled
> +	 * (and passed their marks on to root trees, as well. We must make sure
> +	 * to process them in descending tree-depth order so that marks
> +	 * propagate down the tree properly, even if a sub-tree is found in
> +	 * multiple parent trees.
> +	 */
> +	todo = xmalloc(to_pack->nr_objects * sizeof(*todo));

I was fiddling with "make coccicheck", and it looks like this code could
stand some modernization. This could use ALLOC_ARRAY().

> +	for (i = 0; i < to_pack->nr_objects; i++) {
> +		if (oe_type(&to_pack->objects[i]) == OBJ_TREE)
> +			todo[nr++] = &to_pack->objects[i];
> +	}
> +	qsort(todo, nr, sizeof(*todo), cmp_tree_depth);

And this QSORT().

There are a few others, I won't list them all. The only tricky one I see
is:

> +		free(tree->buffer);
> +		tree->buffer = NULL;
> +		tree->object.parsed = 0;

This suggests FREE_AND_NULL(), but I think it actually the whole block
should become a call to free_tree_buffer().

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 1/5] packfile: make get_delta_base() non static
  2018-07-24 16:19   ` Junio C Hamano
@ 2018-07-27 11:29     ` Jeff King
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2018-07-27 11:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Christian Couder, git, Christian Couder

On Tue, Jul 24, 2018 at 09:19:27AM -0700, Junio C Hamano wrote:

> Christian Couder <christian.couder@gmail.com> writes:
> 
> > From: Jeff King <peff@peff.net>
> >
> > As get_delta_base() will be used outside 'packfile.c' in
> > a following commit, let's make it non static and let's
> > declare it in 'packfile.h'.
> 
> As a public function in *.h, don't we want a bit of comment there to
> help those who want to use it from outside packfile.c?

Arguably it may want a better name, too.

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/5] Add delta-islands.{c,h}
  2018-07-24 16:47   ` Junio C Hamano
@ 2018-07-27 13:02     ` Jeff King
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2018-07-27 13:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Christian Couder, git, Christian Couder

On Tue, Jul 24, 2018 at 09:47:29AM -0700, Junio C Hamano wrote:

> > +/*
> > + * Allocate a new bitmap; if "old" is not NULL, the new bitmap will be a copy
> > + * of "old". Otherwise, the new bitmap is empty.
> > + */
> > +static struct island_bitmap *island_bitmap_new(const struct island_bitmap *old)
> > +{
> > +	size_t size = sizeof(struct island_bitmap) + (island_bitmap_size * 4);
> > +	struct island_bitmap *b = xcalloc(1, size);
> 
> Is one of the variants of flex array macros applicable here?

Hmm, maybe. The tricky thing about this bitmap (and this touches on your
"why another bitmap" question below) is that we have a _ton_ of them
(one per object). So we want to do what we can to keep memory usage
down. So there are two weird things:

 - these bitmaps do not know their own size. They are all exactly the
   same size (one bit per island), and we store the size only once in
   the whole program.

 - they are copy-on-write. This helps because there's a lot of
   locality to the maps within the object graph. But it also means
   sometimes we initialize empty, and sometimes from another array.

So I think it might work to do:

  if (old) {
	/* copy bits from existing bitmap */
	FLEX_ALLOC_MEM(b, bits, old->bits, st_mult(island_bitmap_size, 4));
  } else {
	/* start with all-zero bits courtesy of xcalloc */
	FLEX_ALLOC_MEM(b, bits, NULL, 0);
  }

That'd still waste an extra byte per struct (or 4, I guess, due to
padding), since we unconditionally NUL-terminate the flex-struct (since
they were really designed around string storage).

> > +static int island_bitmap_get(struct island_bitmap *self, uint32_t i)
> > +{
> > +	return (self->bits[ISLAND_BITMAP_BLOCK(i)] & ISLAND_BITMAP_MASK(i)) != 0;
> > +}
> > +
> 
> Not necessarily a complaint, but do we need another implementation
> of bitmap here, or the compressed bitmap used for the pack bitmap is
> unsuited for the purpose of this thing (e.g. perhaps it is overkill,
> as we won't be shooting for saving disk footprint of a bitmap that
> we are not going to save on disk anyway)?

Mostly it's about memory savings, as I discussed above. We _could_ use
ewah bitmaps instead of naive ones here. Those are bad for random access
to bits, but I think most of the operations are "is this bitmap a subset
of this other one" which can be done by walking them linearly.

I doubt the compression would be all that significant, though. They do
best when there's a lot of locality in the bitmap. Here our bits are the
islands, grouped and ordered by some subset of the refname. I don't
think there's any reason to think that refs/heads/a and refs/heads/b
would be correlated. OTOH, if most objects are in most islands (e.g.,
because the islands represent a bunch of forks of some "base" history),
then you might get a big run of 1's for those objects.

I don't think we really experimented with it. To be honest, I don't
recall how much measuring we did for the refcounted bitmaps, either. So
I'm not sure how much the copy-on-write saves (though my vague
recollection is that it was worth doing if you have a lot of islands).

> > +	/*
> > +	 * We process only trees, as commits and tags have already been handled
> > +	 * (and passed their marks on to root trees, as well. We must make sure
> > +	 * to process them in descending tree-depth order so that marks
> > +	 * propagate down the tree properly, even if a sub-tree is found in
> > +	 * multiple parent trees.
> > +	 */
> > +	todo = xmalloc(to_pack->nr_objects * sizeof(*todo));
> > +	for (i = 0; i < to_pack->nr_objects; i++) {
> > +		if (oe_type(&to_pack->objects[i]) == OBJ_TREE)
> > +			todo[nr++] = &to_pack->objects[i];
> > +	}
> > +	qsort(todo, nr, sizeof(*todo), cmp_tree_depth);
> 
> Hmph, at this stage nobody actually seems to set tree_depth; I am
> wondering how a tree that is at the root in one commit but appears
> as a subdirectory in another commit is handled by this code.

The tree depth is set when we actually walk the graph collecting
objects, and we use the max depth at which we've seen it. That said,
this happens in show_object(), which I think we'd avoid entering twice
for the same object (due to list-objects setting the SEEN flag).

So I suspect that yes, you could have a case where a tree has multiple
depths, and we'd visit them topologically out of order. And that could
lead to objects reachable from that tree missing some of their islands.
Something like:

  - refs/remotes/123/master points to commit C which points to tree X,
    which points to blob B

  - refs/remotes/456/master points to commit D which points to tree Y,
    which points to tree X, which points to blob B

We mark C with island "123" and D with island "456", and ditto for their
root trees X and Y. Then we sort the trees by depth, with the intent
that Y would come before its child X. But due to the bogus depth, X may
come first.  We propagate the mark for 123 down to children of X,
including B. When we get to Y, we propagate the mark for 456 down to
tree X. But we don't ever visit X again, because we assume we're going
in topological order through the list.

And we might decide not to allow a delta that we otherwise would
(because B really is in both 123 and 456, but is only marked for one).

> > +static regex_t *island_regexes;
> > +static unsigned int island_regexes_alloc, island_regexes_nr;
> > +static const char *core_island_name;
> 
> Are these (and the bitmap & hashtable) something that "everything in
> the_repository" folks would come in and nuke as global variables?  I
> haven't thought deeply about it, but these smell like that there
> would be one set of such in-core variables per one in-core object
> store plus refs (i.e. repository instance).

I don't think so. They're really local to the single pack-objects operation, so
they're no worse than all the other file-local static globals.

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 3/5] pack-objects: add delta-islands support
  2018-07-24 17:20       ` Stefan Beller
@ 2018-07-27 13:13         ` Jeff King
  2018-07-27 17:22           ` Stefan Beller
  0 siblings, 1 reply; 37+ messages in thread
From: Jeff King @ 2018-07-27 13:13 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Christian Couder, git, Junio C Hamano, Christian Couder

On Tue, Jul 24, 2018 at 10:20:05AM -0700, Stefan Beller wrote:

> So in my understanding we have a "common base pack" and specific
> packs on top for each "island".

Sort of. This is another hacky part. The islands themselves are
generally just about forbidding deltas, and not any particular kind of
layering.

But there's some magic layering only for the "core" island, which gets
to go first (and makes a sort of pseudo-pack at the front of the one
pack). And then everything else is written willy nilly. This is a hack
to try to make the "blit the pack bytes out" code path for cloning fast.
And that has to pick _one_ winner, so ideally you'd point it at the
thing that gets cloned the most, and everybody else gets to be a loser.

Again, this was designed for the current pack-reuse code we have
upstream, which we (GitHub) found to be pretty crappy (which I feel
justified in saying as one of the authors). I need to clean up and share
the alternative strategy we ended up with.

> Do you envision to have "groups of islands" (an atoll) for say all
> open source clones of linux.git, such that you can layer the packs?
> You would not just have the base pack + island pack, but have one
> pack that is common to most islands?

So no, we don't really layer in any sane way. If pack-objects were fed
the topological relationships between the forks, in theory we could
create a layered packfile that respects that.

But even that is not quite enough. At the time of forking, you might
imagine that torvalds/linux has the base pack, and then somebody forks
from them and contains all of those objects plus more, and somebody
forks from them, and so on. But that's just a snapshot. Later
torvalds/linux will get a bunch of new objects pushed to it. And some of
its forks will merge those objects, too. But some of them will just rot,
abandoned, as nobody ever touches them again.

So I don't think there's much to be gained by paying attention to the
external forking relationships. We have to discover afresh the
relationships between objects, and which refs (and thus which islands)
point to them.

One thing I don't think we ever tried was doubling down on the
islandCore concept and making the "root" fork as tightly packed as it
could be (with the assumption that _most_ people grab that). And then
just respect the islands for all the other objects (remember this is an
optimization, so the worst case is somebody asks for an object during a
fetch and we have to throw away its on-disk delta).

That would solve the problem that fetching torvalds/linux from GitHub
yields a bigger pack than fetching it from kernel.org. But again, it's
making that root fork weirdly magical. People who fetch solely from
other forks won't get any benefit (and may even see worse packs).

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 3/5] pack-objects: add delta-islands support
  2018-07-27 13:13         ` Jeff King
@ 2018-07-27 17:22           ` Stefan Beller
  2018-07-28  9:00             ` Jeff King
  0 siblings, 1 reply; 37+ messages in thread
From: Stefan Beller @ 2018-07-27 17:22 UTC (permalink / raw)
  To: Jeff King; +Cc: Christian Couder, git, Junio C Hamano, Christian Couder

On Fri, Jul 27, 2018 at 6:13 AM Jeff King <peff@peff.net> wrote:
>
> On Tue, Jul 24, 2018 at 10:20:05AM -0700, Stefan Beller wrote:
>
> > So in my understanding we have a "common base pack" and specific
> > packs on top for each "island".
>
> Sort of. This is another hacky part. The islands themselves are
> generally just about forbidding deltas, and not any particular kind of
> layering.
>
> But there's some magic layering only for the "core" island, which gets
> to go first (and makes a sort of pseudo-pack at the front of the one
> pack). And then everything else is written willy nilly. This is a hack
> to try to make the "blit the pack bytes out" code path for cloning fast.

yup, I do understand its purpose; we had the same discussions here
for the JGit based hosting.

So you are saying island hopping is disallowed, but the core island
has an airport using the spokes system to reach all other islands?
(I described the core island as sea bed before). Sounds reasonable.

> So no, we don't really layer in any sane way. If pack-objects were fed
> the topological relationships between the forks, in theory we could
> create a layered packfile that respects that.
>
> But even that is not quite enough. At the time of forking, you might
> imagine that torvalds/linux has the base pack, and then somebody forks
> from them and contains all of those objects plus more, and somebody
> forks from them, and so on. But that's just a snapshot. Later
> torvalds/linux will get a bunch of new objects pushed to it. And some of
> its forks will merge those objects, too. But some of them will just rot,
> abandoned, as nobody ever touches them again.
>
> So I don't think there's much to be gained by paying attention to the
> external forking relationships. We have to discover afresh the
> relationships between objects, and which refs (and thus which islands)
> point to them.
>
> One thing I don't think we ever tried was doubling down on the
> islandCore concept and making the "root" fork as tightly packed as it
> could be (with the assumption that _most_ people grab that). And then
> just respect the islands for all the other objects (remember this is an
> optimization, so the worst case is somebody asks for an object during a
> fetch and we have to throw away its on-disk delta).
>
> That would solve the problem that fetching torvalds/linux from GitHub
> yields a bigger pack than fetching it from kernel.org. But again, it's
> making that root fork weirdly magical. People who fetch solely from
> other forks won't get any benefit (and may even see worse packs).

Thanks for the explanation.
I think this discussion just hints at me being dense reading the
documentation. After all I like the concept.

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 3/5] pack-objects: add delta-islands support
  2018-07-27 17:22           ` Stefan Beller
@ 2018-07-28  9:00             ` Jeff King
  2018-07-28 12:12               ` Christian Couder
  0 siblings, 1 reply; 37+ messages in thread
From: Jeff King @ 2018-07-28  9:00 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Christian Couder, git, Junio C Hamano, Christian Couder

On Fri, Jul 27, 2018 at 10:22:09AM -0700, Stefan Beller wrote:

> > That would solve the problem that fetching torvalds/linux from GitHub
> > yields a bigger pack than fetching it from kernel.org. But again, it's
> > making that root fork weirdly magical. People who fetch solely from
> > other forks won't get any benefit (and may even see worse packs).
> 
> Thanks for the explanation.
> I think this discussion just hints at me being dense reading the
> documentation. After all I like the concept.

I actually think it hints that the documentation in the commits
themselves is not adequate. :)

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 3/5] pack-objects: add delta-islands support
  2018-07-28  9:00             ` Jeff King
@ 2018-07-28 12:12               ` Christian Couder
  0 siblings, 0 replies; 37+ messages in thread
From: Christian Couder @ 2018-07-28 12:12 UTC (permalink / raw)
  To: Jeff King; +Cc: Stefan Beller, git, Junio C Hamano, Christian Couder

On Sat, Jul 28, 2018 at 11:00 AM, Jeff King <peff@peff.net> wrote:
>
> On Fri, Jul 27, 2018 at 10:22:09AM -0700, Stefan Beller wrote:
>
> > > That would solve the problem that fetching torvalds/linux from GitHub
> > > yields a bigger pack than fetching it from kernel.org. But again, it's
> > > making that root fork weirdly magical. People who fetch solely from
> > > other forks won't get any benefit (and may even see worse packs).
> >
> > Thanks for the explanation.
> > I think this discussion just hints at me being dense reading the
> > documentation. After all I like the concept.
>
> I actually think it hints that the documentation in the commits
> themselves is not adequate. :)

Ok, I will try to improve it using information from the discussion threads.

Thanks,
Christian.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 3/5] pack-objects: add delta-islands support
  2018-07-22  8:55   ` Duy Nguyen
@ 2018-08-05 17:28     ` Christian Couder
  0 siblings, 0 replies; 37+ messages in thread
From: Christian Couder @ 2018-08-05 17:28 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Git Mailing List, Junio C Hamano, Jeff King, Christian Couder

On Sun, Jul 22, 2018 at 10:55 AM, Duy Nguyen <pclouds@gmail.com> wrote:
> On Sun, Jul 22, 2018 at 7:52 AM Christian Couder
> <christian.couder@gmail.com> wrote:
>>
>> -       /*
>> -        * And then all remaining commits and tags.
>> -        */
>> -       for (i = last_untagged; i < to_pack.nr_objects; i++) {
>> -               if (oe_type(&objects[i]) != OBJ_COMMIT &&
>> -                   oe_type(&objects[i]) != OBJ_TAG)
>> -                       continue;
>> -               add_to_write_order(wo, &wo_end, &objects[i]);
>> -       }
>> +               /*
>> +                * Then fill all the tagged tips.
>> +                */
>
> If we move the code in this loop to a separate function, in a separate
> patch, first, would it produce a better diff? I think all the
> indentation change here makes it a bit hard to read.

Sorry I just realized that I forgot to try to do that. It will be on
my todo list for the next iteration.

Thanks,
Christian.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 3/5] pack-objects: add delta-islands support
  2018-07-24 17:11     ` Junio C Hamano
@ 2018-08-05 17:40       ` Christian Couder
  2018-08-06  8:44         ` Christian Couder
  0 siblings, 1 reply; 37+ messages in thread
From: Christian Couder @ 2018-08-05 17:40 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Christian Couder

On Tue, Jul 24, 2018 at 7:11 PM, Junio C Hamano <gitster@pobox.com> wrote:
>
> Another thing I noticed from 2/5 is that you can have up to 7 such
> capture groups.  I do not have any opinion if 7 is too few or too
> many, but we would want the number to be documented, and end-user
> input diagnosed when it requires more captures than we support (if
> we are not already checking, that is).

This is the new documentation:

-> Refs are grouped into islands based on their "names", and two regexes
-> that produce the same name are considered to be in the same
-> island. The names are computed from the regexes by concatenating any
-> capture groups from the regex, with a '-' dash in between. (And if
-> there are no capture groups, then the name is the empty string, as in
-> the above example.) This allows you to create arbitrary numbers of
-> islands. Only up to 7 such capture groups are supported though.

I added the last sentence above, but I wonder if it is 7 or 8. The
code is the following:

-> static int find_island_for_ref(const char *refname, const struct
object_id *oid,
->                    int flags, void *data)
-> {
->     int i, m;
->     regmatch_t matches[8];
->     struct strbuf island_name = STRBUF_INIT;
->
->     /* walk backwards to get last-one-wins ordering */
->     for (i = island_regexes_nr - 1; i >= 0; i--) {
->         if (!regexec(&island_regexes[i], refname,
->                  ARRAY_SIZE(matches), matches, 0))
->             break;
->     }

I also wonder if the above is enough to diagnose end-user input when
it requires more captures than we support. A quick look at the man
page of the regex functions wasn't enough to enlighten me. Any input
on these issues is very welcome!

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/5] Add delta-islands.{c,h}
  2018-07-22  8:50   ` Duy Nguyen
  2018-07-22 13:57     ` Christian Couder
@ 2018-08-05 18:53     ` Christian Couder
  2018-08-06 14:17       ` Jeff King
  2018-08-06 15:53       ` Duy Nguyen
  1 sibling, 2 replies; 37+ messages in thread
From: Christian Couder @ 2018-08-05 18:53 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Git Mailing List, Junio C Hamano, Jeff King, Christian Couder

On Sun, Jul 22, 2018 at 10:50 AM, Duy Nguyen <pclouds@gmail.com> wrote:
> On Sun, Jul 22, 2018 at 7:51 AM Christian Couder
> <christian.couder@gmail.com> wrote:

>> diff --git a/pack-objects.h b/pack-objects.h
>> index edf74dabdd..8eecd67991 100644
>> --- a/pack-objects.h
>> +++ b/pack-objects.h
>> @@ -100,6 +100,10 @@ struct object_entry {
>>         unsigned type_:TYPE_BITS;
>>         unsigned no_try_delta:1;
>>         unsigned in_pack_type:TYPE_BITS; /* could be delta */
>> +
>> +       unsigned int tree_depth; /* should be repositioned for packing? */
>> +       unsigned char layer;
>> +
>
> This looks very much like an optional feature. To avoid increasing
> pack-objects memory usage for common case, please move this to struct
> packing_data.

As you can see the patch 6/6 (in the v2 of this patch series that I
just sent) moves `unsigned int tree_depth` from 'struct object_entry'
to 'struct packing_data'. I am not sure that I did it right and that
it is worth it though as it is a bit more complex.

About `unsigned char layer` I am even less sure that it's worth it for
the following reasons:

- 'struct object_entry' contains bit fields as its last members and
then the following comment:

    /*
     * pahole results on 64-bit linux (gcc and clang)
     *
     *   size: 80, bit_padding: 20 bits, holes: 8 bits
     *
     * and on 32-bit (gcc)
     *
     *   size: 76, bit_padding: 20 bits, holes: 8 bits
     */

I am not sure if this comment is still up to date, but if it true
maybe there is enough space in the padding to add 'layer' as a 8 bit
bit field somewhere without increasing the size of 'struct
object_entry'.

- 'layer' is used in add_to_write_order() which is called from many
places in add_descendants_to_write_order() and compute_write_order()
for example like this:

            for (s = DELTA_SIBLING(e); s; s = DELTA_SIBLING(s)) {
                add_to_write_order(wo, endp, s);
            }

So it seems to me that moving 'layer' to 'struct packing_data' would
require changing the DELTA_SIBLING macros or adding a hack to easily
find the index in an array corresponding to a 'struct object_entry'.

- I don't see why it is different from other fields in 'struct
object_entry' that are also used in compute_write_order(), for
example:

    uint32_t delta_idx;    /* delta base object */
    uint32_t delta_child_idx; /* deltified objects who bases me */
    uint32_t delta_sibling_idx; /* other deltified objects who ... */

Would we also want to move these fields to 'struct packing_data'? Why
or why not? If we want to move them, then it might be a good idea to
do all the moving at the same time to make sure we get something
coherent.

Thanks,
Christian.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 3/5] pack-objects: add delta-islands support
  2018-08-05 17:40       ` Christian Couder
@ 2018-08-06  8:44         ` Christian Couder
  2018-08-06 13:58           ` Jeff King
  0 siblings, 1 reply; 37+ messages in thread
From: Christian Couder @ 2018-08-06  8:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Christian Couder

On Sun, Aug 5, 2018 at 7:40 PM, Christian Couder
<christian.couder@gmail.com> wrote:
> On Tue, Jul 24, 2018 at 7:11 PM, Junio C Hamano <gitster@pobox.com> wrote:

> This is the new documentation:
>
> -> Refs are grouped into islands based on their "names", and two regexes
> -> that produce the same name are considered to be in the same
> -> island. The names are computed from the regexes by concatenating any
> -> capture groups from the regex, with a '-' dash in between. (And if
> -> there are no capture groups, then the name is the empty string, as in
> -> the above example.) This allows you to create arbitrary numbers of
> -> islands. Only up to 7 such capture groups are supported though.
>
> I added the last sentence above, but I wonder if it is 7 or 8.

Actually after reading the man page again, it looks like the first
element in the array we pass is used for "the entire regular
expression's match addresses", so we only get 7 capture groups (not
8).

> The code is the following:
>
> -> static int find_island_for_ref(const char *refname, const struct
> object_id *oid,
> ->                    int flags, void *data)
> -> {
> ->     int i, m;
> ->     regmatch_t matches[8];
> ->     struct strbuf island_name = STRBUF_INIT;
> ->
> ->     /* walk backwards to get last-one-wins ordering */
> ->     for (i = island_regexes_nr - 1; i >= 0; i--) {
> ->         if (!regexec(&island_regexes[i], refname,
> ->                  ARRAY_SIZE(matches), matches, 0))
> ->             break;
> ->     }
>
> I also wonder if the above is enough to diagnose end-user input when
> it requires more captures than we support. A quick look at the man
> page of the regex functions wasn't enough to enlighten me. Any input
> on these issues is very welcome!

Taking a look at how we use regexec() in our code base, it looks like
it might be better to use regexec_buf() defined in
"git-compat-util.h".

I am not completely sure about that because apply.c has:

    status = regexec(stamp, timestamp, ARRAY_SIZE(m), m, 0);
    if (status) {
        if (status != REG_NOMATCH)
            warning(_("regexec returned %d for input: %s"),
                status, timestamp);
        return 0;
    }

Though the above uses a regex that is defined in apply.c. The regex
doesn't come from the config file.

Actually except the above there is a mix of regexec() and
regexec_buf() in our code base, which are used with only 0, 1 or 2
capture groups, so it is not very clear what should be used.

And anyway I still don't see how we could diagnose when the end user
input requires more captures than we support.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 3/5] pack-objects: add delta-islands support
  2018-08-06  8:44         ` Christian Couder
@ 2018-08-06 13:58           ` Jeff King
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2018-08-06 13:58 UTC (permalink / raw)
  To: Christian Couder; +Cc: Junio C Hamano, git, Christian Couder

On Mon, Aug 06, 2018 at 10:44:14AM +0200, Christian Couder wrote:

> Taking a look at how we use regexec() in our code base, it looks like
> it might be better to use regexec_buf() defined in
> "git-compat-util.h".
> 
> I am not completely sure about that because apply.c has:
> 
>     status = regexec(stamp, timestamp, ARRAY_SIZE(m), m, 0);
>     if (status) {
>         if (status != REG_NOMATCH)
>             warning(_("regexec returned %d for input: %s"),
>                 status, timestamp);
>         return 0;
>     }
> 
> Though the above uses a regex that is defined in apply.c. The regex
> doesn't come from the config file.
> 
> Actually except the above there is a mix of regexec() and
> regexec_buf() in our code base, which are used with only 0, 1 or 2
> capture groups, so it is not very clear what should be used.

I don't think we need regexec_buf(). The advantage it has is that it can
operate on strings that aren't NUL-terminated, but that isn't the case
here.

> And anyway I still don't see how we could diagnose when the end user
> input requires more captures than we support.

We can use the final element as a sentinel, and complain if it gets
filled in:

diff --git a/delta-islands.c b/delta-islands.c
index dcc6590cc1..18426ffb18 100644
--- a/delta-islands.c
+++ b/delta-islands.c
@@ -375,6 +375,10 @@ static int find_island_for_ref(const char *refname, const struct object_id *oid,
 	if (i < 0)
 		return 0;
 
+	if (matches[ARRAY_SIZE(matches)-1].rm_so != -1)
+		die("island regex had too many matches (max=%d)",
+		    (int)ARRAY_SIZE(matches) - 2);
+
 	for (m = 1; m < ARRAY_SIZE(matches); m++) {
 		regmatch_t *match = &matches[m];
 

The big downside is that it only kicks in when you actually successfully
make a match. So you could have:

  [pack]
  island = refs/(one)/(two)/(three)/(four)/(five)/(six)/(seven)

in your config for years, and then one day it blows up when somebody
actually has a ref that matches it.

I think it would be fine to just say "we only respect the first N
capture groups". And maybe even issue a warning (based on the detection
above). I'd also be fine with bumping the "matches" array to something
more ridiculous, like 32. The current value of 8 was supposed to be
ridiculous already (we've never used more than 2).

-Peff

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/5] Add delta-islands.{c,h}
  2018-08-05 18:53     ` Christian Couder
@ 2018-08-06 14:17       ` Jeff King
  2018-08-06 15:53       ` Duy Nguyen
  1 sibling, 0 replies; 37+ messages in thread
From: Jeff King @ 2018-08-06 14:17 UTC (permalink / raw)
  To: Christian Couder
  Cc: Duy Nguyen, Git Mailing List, Junio C Hamano, Christian Couder

On Sun, Aug 05, 2018 at 08:53:19PM +0200, Christian Couder wrote:

> - 'layer' is used in add_to_write_order() which is called from many
> places in add_descendants_to_write_order() and compute_write_order()
> for example like this:
> 
>             for (s = DELTA_SIBLING(e); s; s = DELTA_SIBLING(s)) {
>                 add_to_write_order(wo, endp, s);
>             }
> 
> So it seems to me that moving 'layer' to 'struct packing_data' would
> require changing the DELTA_SIBLING macros or adding a hack to easily
> find the index in an array corresponding to a 'struct object_entry'.

Right, that hack is how the various the existing memory-shrinking macros
work. Every "struct object_entry *" that we deal with _must_ be in the
list in "packing_data", so that we can compute an index as
"entry - to_pack.objects".

So I think Duy is just suggesting:

  void oe_set_layer(struct packing_data *pack,
                    struct object_entry *entry,
		    int layer)
  {
          if (!pack->layers)
	          ALLOC_ARRAY(pack->layers, pack->nr_objects);
	  pack->layers[entry - pack->objects] = layer;
  }

  int oe_get_layer(struct packing_data *pack,
                   struct object_entry *entry)
  {
          if (!pack->layers)
	          return 0;
          return pack->layers[entry - pack->nr_objects];
  }

That said, I wonder if need to cache the layer at all. We compute the
layers all at once in the new compute_pack_layers(). But it is just a
linear walk over the objects, asking for each "is this in the core
island?".

Could we just ask "is this in the core island?" when we consider each
object? That's a hash lookup each time we ask instead of looking at our
int, but I'd think we would only ask in linear proportion to the number
of objects. So there's no algorithmic speedup, though maybe a
constant-time one (for two layers, I'd expect we'd probably ask about
each object twice).

> - I don't see why it is different from other fields in 'struct
> object_entry' that are also used in compute_write_order(), for
> example:
> 
>     uint32_t delta_idx;    /* delta base object */
>     uint32_t delta_child_idx; /* deltified objects who bases me */
>     uint32_t delta_sibling_idx; /* other deltified objects who ... */
> 
> Would we also want to move these fields to 'struct packing_data'? Why
> or why not? If we want to move them, then it might be a good idea to
> do all the moving at the same time to make sure we get something
> coherent.

We actually use those multiple times. They're reset and used in
compute_write_order(), but I think we use them earlier as part of the
delta search (at the very least we use delta_idx; maybe no the
child/sibling stuff).

-Peff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/5] Add delta-islands.{c,h}
  2018-08-05 18:53     ` Christian Couder
  2018-08-06 14:17       ` Jeff King
@ 2018-08-06 15:53       ` Duy Nguyen
  2018-08-06 18:54         ` Christian Couder
  1 sibling, 1 reply; 37+ messages in thread
From: Duy Nguyen @ 2018-08-06 15:53 UTC (permalink / raw)
  To: Christian Couder
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Christian Couder

On Sun, Aug 5, 2018 at 8:53 PM Christian Couder
<christian.couder@gmail.com> wrote:
>
> On Sun, Jul 22, 2018 at 10:50 AM, Duy Nguyen <pclouds@gmail.com> wrote:
> > On Sun, Jul 22, 2018 at 7:51 AM Christian Couder
> > <christian.couder@gmail.com> wrote:
>
> >> diff --git a/pack-objects.h b/pack-objects.h
> >> index edf74dabdd..8eecd67991 100644
> >> --- a/pack-objects.h
> >> +++ b/pack-objects.h
> >> @@ -100,6 +100,10 @@ struct object_entry {
> >>         unsigned type_:TYPE_BITS;
> >>         unsigned no_try_delta:1;
> >>         unsigned in_pack_type:TYPE_BITS; /* could be delta */
> >> +
> >> +       unsigned int tree_depth; /* should be repositioned for packing? */
> >> +       unsigned char layer;
> >> +
> >
> > This looks very much like an optional feature. To avoid increasing
> > pack-objects memory usage for common case, please move this to struct
> > packing_data.
>
> As you can see the patch 6/6 (in the v2 of this patch series that I
> just sent) moves `unsigned int tree_depth` from 'struct object_entry'
> to 'struct packing_data'. I am not sure that I did it right and that
> it is worth it though as it is a bit more complex.

It is a bit more complex than I expected. But I think if you go with
Jeff's suggestion (i.e. think of the new tree_depth array an extension
of objects array) it's a bit simpler: you access both arrays using the
same index, both arrays should have the same size and be realloc'd at
the same time in packlist_alloc().

Is it worth it? The "pahole" comment in this file is up to date. We
use 80 bytes per object. This series makes the struct 88 bytes (I've
just rerun pahole). On linux repo with 12M objects, "git pack-objects
--all" needs extra 96MB memory even if this feature is not used. So
yes I still think it's worth moving these fields out of struct
object_entry.
-- 
Duy

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/5] Add delta-islands.{c,h}
  2018-08-06 15:53       ` Duy Nguyen
@ 2018-08-06 18:54         ` Christian Couder
  2018-08-06 19:21           ` Duy Nguyen
  0 siblings, 1 reply; 37+ messages in thread
From: Christian Couder @ 2018-08-06 18:54 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Git Mailing List, Junio C Hamano, Jeff King, Christian Couder

On Mon, Aug 6, 2018 at 5:53 PM, Duy Nguyen <pclouds@gmail.com> wrote:
> On Sun, Aug 5, 2018 at 8:53 PM Christian Couder
> <christian.couder@gmail.com> wrote:
>>
>> As you can see the patch 6/6 (in the v2 of this patch series that I
>> just sent) moves `unsigned int tree_depth` from 'struct object_entry'
>> to 'struct packing_data'. I am not sure that I did it right and that
>> it is worth it though as it is a bit more complex.
>
> It is a bit more complex than I expected. But I think if you go with
> Jeff's suggestion (i.e. think of the new tree_depth array an extension
> of objects array) it's a bit simpler: you access both arrays using the
> same index, both arrays should have the same size and be realloc'd at
> the same time in packlist_alloc().

Ok, I will take a look at doing that to simplify things. Thanks to
Peff and you for that suggestion!

> Is it worth it? The "pahole" comment in this file is up to date. We
> use 80 bytes per object. This series makes the struct 88 bytes (I've
> just rerun pahole).

Did you run it on V1 or on V2? I guess on V2, but then what do you
think about converting the 'layer' field into a bit field, which might
be simpler and save space?

> On linux repo with 12M objects, "git pack-objects
> --all" needs extra 96MB memory even if this feature is not used. So
> yes I still think it's worth moving these fields out of struct
> object_entry.

And what about the fields already in struct object_entry? While I am
at it, I think I could move some of them too if it is really so worth
it.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 2/5] Add delta-islands.{c,h}
  2018-08-06 18:54         ` Christian Couder
@ 2018-08-06 19:21           ` Duy Nguyen
  0 siblings, 0 replies; 37+ messages in thread
From: Duy Nguyen @ 2018-08-06 19:21 UTC (permalink / raw)
  To: Christian Couder
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Christian Couder

On Mon, Aug 6, 2018 at 8:54 PM Christian Couder
<christian.couder@gmail.com> wrote:
> > Is it worth it? The "pahole" comment in this file is up to date. We
> > use 80 bytes per object. This series makes the struct 88 bytes (I've
> > just rerun pahole).
>
> Did you run it on V1 or on V2? I guess on V2, but then what do you
> think about converting the 'layer' field into a bit field, which might
> be simpler and save space?

V2. I kinda ignored the layer field because Jeff was suggesting that
it may not be needed. It may be better to just move it out though
because it's not that complicated and even if we have enough bits
left, they may be spread out that it's hard to reorder so that they're
grouped together and use can use them.

> > On linux repo with 12M objects, "git pack-objects
> > --all" needs extra 96MB memory even if this feature is not used. So
> > yes I still think it's worth moving these fields out of struct
> > object_entry.
>
> And what about the fields already in struct object_entry? While I am
> at it, I think I could move some of them too if it is really so worth

Way ahead of you. In v2.17.0 this struct is 136 bytes so going down to
80 bytes now is a big memory saving. I think I  squeezed everything
possible and was a bit too aggressive that I created a new performance
regression in v2.18.0 ;-)
-- 
Duy

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2018-08-06 19:21 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-22  5:48 [RFC PATCH 0/5] Add delta islands support Christian Couder
2018-07-22  5:48 ` [RFC PATCH 1/5] packfile: make get_delta_base() non static Christian Couder
2018-07-24 16:19   ` Junio C Hamano
2018-07-27 11:29     ` Jeff King
2018-07-22  5:48 ` [RFC PATCH 2/5] Add delta-islands.{c,h} Christian Couder
2018-07-22  8:50   ` Duy Nguyen
2018-07-22 13:57     ` Christian Couder
2018-08-05 18:53     ` Christian Couder
2018-08-06 14:17       ` Jeff King
2018-08-06 15:53       ` Duy Nguyen
2018-08-06 18:54         ` Christian Couder
2018-08-06 19:21           ` Duy Nguyen
2018-07-24 16:47   ` Junio C Hamano
2018-07-27 13:02     ` Jeff King
2018-07-27  9:40   ` Jeff King
2018-07-22  5:48 ` [RFC PATCH 3/5] pack-objects: add delta-islands support Christian Couder
2018-07-22  8:55   ` Duy Nguyen
2018-08-05 17:28     ` Christian Couder
2018-07-23 18:52   ` Stefan Beller
2018-07-24  9:58     ` Jeff King
2018-07-24 17:20       ` Stefan Beller
2018-07-27 13:13         ` Jeff King
2018-07-27 17:22           ` Stefan Beller
2018-07-28  9:00             ` Jeff King
2018-07-28 12:12               ` Christian Couder
2018-07-24 17:03   ` Junio C Hamano
2018-07-24 17:11     ` Junio C Hamano
2018-08-05 17:40       ` Christian Couder
2018-08-06  8:44         ` Christian Couder
2018-08-06 13:58           ` Jeff King
2018-07-22  5:48 ` [RFC PATCH 4/5] repack: " Christian Couder
2018-07-22  5:48 ` [RFC PATCH 5/5] t: add t9930-delta-islands.sh Christian Couder
2018-07-24 10:24   ` Jeff King
2018-07-24 10:16 ` [RFC PATCH 0/5] Add delta islands support Jeff King
2018-07-24 17:18   ` Junio C Hamano
2018-07-24 21:14     ` Jeff King
2018-07-26 13:34 ` Johannes Schindelin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.