git.vger.kernel.org archive mirror
* [PATCH v2 00/14] Serialized Git Commit Graph
@ 2018-01-30 21:39 Derrick Stolee
  2018-01-30 21:39 ` [PATCH v2 01/14] commit-graph: add format document Derrick Stolee
                   ` (15 more replies)
  0 siblings, 16 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-01-30 21:39 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, git, sbeller, dstolee

Thanks to everyone who gave comments on v1. I tried my best to respond to
all of the feedback, but may have missed some while I was doing several
renames, including:

* builtin/graph.c -> builtin/commit-graph.c
* packed-graph.[c|h] -> commit-graph.[c|h]
* t/t5319-graph.sh -> t/t5318-commit-graph.sh

Because of these renames (and several type/function renames) the diff
is too large to conveniently share here.

Some issues that came up and are addressed:

* Use <hash> instead of <oid> when referring to the graph-<hash>.graph
  filenames and the contents of graph-head.
* 32-bit timestamps will not cause undefined behavior.
* timestamp_t is unsigned, so timestamps are never negative.
* The config setting "core.commitgraph" now only controls consuming the
  graph during normal operations and will not block the commit-graph
  plumbing command.
* The --stdin-commits option is better about sanitizing input strings
  that do not parse to OIDs or that name non-commit objects.

One unresolved comment that I would like consensus on is the use of
globals to store the config setting and the graph state. I'm currently
using the pattern from packed_git instead of putting these values in
the_repository. However, we want to eventually remove globals like
packed_git. Should I deviate from the pattern _now_ in order to keep
the problem from growing, or should I keep to the known pattern?

Finally, I tried to clean up my incorrect style as I was recreating
these commits. Feel free to be merciless in style feedback now that the
architecture is more stable.

Thanks,
-Stolee

-- >8 --

As promised [1], this patch contains a way to serialize the commit graph.
The current implementation defines a new file format to store the graph
structure (parent relationships) and basic commit metadata (commit date,
root tree OID) in order to prevent parsing raw commits while performing
basic graph walks. For example, we do not need to parse the full commit
when performing these walks:

* 'git log --topo-order -1000' walks all reachable commits to avoid
  incorrect topological orders, but only needs the commit message for
  the top 1000 commits.

* 'git merge-base <A> <B>' may walk many commits to find the correct
  boundary between the commits reachable from A and those reachable
  from B. No commit messages are needed.

* 'git branch -vv' checks ahead/behind status for all local branches
  compared to their upstream remote branches. This is essentially as
  hard as computing merge bases for each.

The current patch speeds up these calculations by adding a check to
parse_commit_gently(): if a graph file is present, it is used to fill in
the required metadata in the struct commit.

The file format has room to store generation numbers, which will be
provided as a patch after this framework is merged. Generation numbers
are referenced by the design document but not implemented in order to
make the current patch focus on the graph construction process. Once
that is stable, it will be easier to add generation numbers and make
graph walks aware of generation numbers one-by-one.

Here are some performance results for a copy of the Linux repository
where 'master' has 704,766 reachable commits and is behind 'origin/master'
by 19,610 commits.

| Command                          | Before | After  | Rel % |
|----------------------------------|--------|--------|-------|
| log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
| branch -vv                       |  0.42s |  0.27s | -35%  |
| rev-list --all                   |  6.4s  |  1.0s  | -84%  |
| rev-list --all --objects         | 32.6s  | 27.6s  | -15%  |

To test this yourself, run the following on your repo:

  git config core.commitgraph true
  git show-ref -s | git commit-graph --write --update-head --stdin-commits

The second command writes a commit graph file containing every commit
reachable from your refs. Now, all git commands that walk commits will
check your graph first before consulting the ODB. You can run your own
performance comparisons by toggling the 'core.commitgraph' setting.

[1] https://public-inbox.org/git/d154319e-bb9e-b300-7c37-27b1dcd2a2ce@jeffhostetler.com/
    Re: What's cooking in git.git (Jan 2018, #03; Tue, 23)

[2] https://github.com/derrickstolee/git/pull/2
    A GitHub pull request containing the latest version of this patch.

Derrick Stolee (14):
  commit-graph: add format document
  graph: add commit graph design document
  commit-graph: create git-commit-graph builtin
  commit-graph: implement construct_commit_graph()
  commit-graph: implement git-commit-graph --write
  commit-graph: implement git-commit-graph --read
  commit-graph: implement git-commit-graph --update-head
  commit-graph: implement git-commit-graph --clear
  commit-graph: teach git-commit-graph --delete-expired
  commit-graph: add core.commitgraph setting
  commit: integrate commit graph with commit parsing
  commit-graph: read only from specific pack-indexes
  commit-graph: close under reachability
  commit-graph: build graph from starting commits

 .gitignore                                      |   1 +
 Documentation/config.txt                        |   3 +
 Documentation/git-commit-graph.txt              | 100 +++
 Documentation/technical/commit-graph-format.txt |  89 +++
 Documentation/technical/commit-graph.txt        | 189 ++++++
 Makefile                                        |   2 +
 alloc.c                                         |   1 +
 builtin.h                                       |   1 +
 builtin/commit-graph.c                          | 229 +++++++
 cache.h                                         |   1 +
 command-list.txt                                |   1 +
 commit-graph.c                                  | 841 ++++++++++++++++++++++++
 commit-graph.h                                  |  69 ++
 commit.c                                        |  10 +-
 commit.h                                        |   4 +
 config.c                                        |   5 +
 environment.c                                   |   1 +
 git.c                                           |   1 +
 log-tree.c                                      |   3 +-
 packfile.c                                      |   4 +-
 packfile.h                                      |   2 +
 t/t5318-commit-graph.sh                         | 272 ++++++++
 22 files changed, 1824 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/git-commit-graph.txt
 create mode 100644 Documentation/technical/commit-graph-format.txt
 create mode 100644 Documentation/technical/commit-graph.txt
 create mode 100644 builtin/commit-graph.c
 create mode 100644 commit-graph.c
 create mode 100644 commit-graph.h
 create mode 100755 t/t5318-commit-graph.sh

-- 
2.16.0



* [PATCH v2 01/14] commit-graph: add format document
  2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
@ 2018-01-30 21:39 ` Derrick Stolee
  2018-02-01 21:44   ` Jonathan Tan
  2018-01-30 21:39 ` [PATCH v2 02/14] graph: add commit graph design document Derrick Stolee
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-01-30 21:39 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, git, sbeller, dstolee

Add document specifying the binary format for commit graphs. This
format allows for:

* New versions.
* New hash functions and hash lengths.
* Optional extensions.

Basic header information is followed by a binary table of contents
into "chunks" that include:

* An ordered list of commit object IDs.
* A 256-entry fanout into that list of OIDs.
* A list of metadata for the commits.
* A list of "large edges" to enable octopus merges.

The format automatically includes two parent positions for every
commit. This favors speed over space, since using only one position
per commit would cause an extra level of indirection for every merge
commit. (Octopus merges suffer from this indirection, but they are
very rare.)

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph-format.txt | 89 +++++++++++++++++++++++++
 1 file changed, 89 insertions(+)
 create mode 100644 Documentation/technical/commit-graph-format.txt

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
new file mode 100644
index 0000000000..8a987c7aa9
--- /dev/null
+++ b/Documentation/technical/commit-graph-format.txt
@@ -0,0 +1,89 @@
+Git commit graph format
+=======================
+
+The Git commit graph stores a list of commit OIDs and some associated
+metadata, including:
+
+- The generation number of the commit. Commits with no parents have
+  generation number 1; a commit with parents has generation number
+  one more than the maximum generation number of its parents. We
+  reserve zero as special; it can be used to mark a generation
+  number invalid or as "not computed".
+
+- The root tree OID.
+
+- The commit date.
+
+- The parents of the commit, stored using positional references within
+  the graph file.
+
+== graph-*.graph files have the following format:
+
+In order to allow extensions that add extra data to the graph, we organize
+the body into "chunks" and provide a binary lookup table at the beginning
+of the body. The header includes certain values, such as number of chunks,
+hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+  4-byte signature:
+      The signature is: {'C', 'G', 'P', 'H'}
+
+  1-byte version number:
+      Currently, the only valid version is 1.
+
+  1-byte Object Id Version (1 = SHA-1)
+
+  1-byte Object Id Length (H)
+
+  1-byte number (C) of "chunks"
+
+CHUNK LOOKUP:
+
+  (C + 1) * 12 bytes listing the table of contents for the chunks:
+      First 4 bytes describe chunk id. Value 0 is a terminating label.
+      Other 8 bytes provide offset in current file for chunk to start.
+      (Chunks are ordered contiguously in the file, so you can infer
+      the length using the next chunk position if necessary.)
+
+  The remaining data in the body is described one chunk at a time, and
+  these chunks may be given in any order. Chunks are required unless
+  otherwise specified.
+
+CHUNK DATA:
+
+  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+      The ith entry, F[i], stores the number of OIDs with first
+      byte at most i. Thus F[255] stores the total
+      number of commits (N).
+
+  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+      The OIDs for all commits in the graph, sorted in ascending order.
+
+  Commit Data (ID: {'C', 'D', 'A', 'T' }) (N * (H + 16) bytes)
+    * The first H bytes are for the OID of the root tree.
+    * The next 8 bytes are for the int-ids of the first two parents
+      of the ith commit. Stores value 0x70000000 if no parent in that
+      position. If there are more than two parents, the second value
+      has its most-significant bit on and the other bits store an array
+      position into the Large Edge List chunk.
+    * The next 8 bytes store the generation number of the commit and
+      the commit time in seconds since EPOCH. The generation number
+      uses the higher 30 bits of the first 4 bytes, while the commit
+      time uses the 32 bits of the second 4 bytes, along with the lowest
+      2 bits of the first 4 bytes, which store the 33rd and 34th bits
+      of the commit time.
+
+  Large Edge List (ID: {'E', 'D', 'G', 'E'})
+      This list of 4-byte values stores the second through nth parents for
+      all octopus merges. The second parent value in the commit data has its
+      most-significant bit on; its other bits give a position in this list.
+      Iterate from that position until reaching a value whose most-significant
+      bit is on; the other bits of that value are the int-id of the last
+      parent. This chunk should always be present, but may be empty.
+
+TRAILER:
+
+	H-byte HASH-checksum of all of the above.
-- 
2.16.0.15.g9c3cf44.dirty
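
A minimal sketch (not part of this series) of how a reader might parse the
header and chunk lookup table described in commit-graph-format.txt above.
The names graph_header, read_be32(), read_be64(), parse_graph_header() and
find_chunk_offset() are hypothetical. The sketch also assumes the 8-byte
chunk offsets are stored in network order like the 4-byte values, which the
document does not state explicitly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of the 8-byte header. */
struct graph_header {
	uint32_t signature;   /* "CGPH" read as a big-endian integer */
	uint8_t version;      /* only 1 is valid */
	uint8_t oid_version;  /* 1 = SHA-1 */
	uint8_t oid_len;      /* H */
	uint8_t num_chunks;   /* C */
};

static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Assumption: 8-byte offsets are big-endian as well. */
static uint64_t read_be64(const unsigned char *p)
{
	return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
}

/* Returns 0 on success, -1 on short input or unknown signature/version. */
static int parse_graph_header(const unsigned char *data, size_t len,
			      struct graph_header *out)
{
	if (len < 8)
		return -1;
	out->signature = read_be32(data);
	out->version = data[4];
	out->oid_version = data[5];
	out->oid_len = data[6];
	out->num_chunks = data[7];
	if (out->signature != 0x43475048 || out->version != 1)
		return -1;
	return 0;
}

/*
 * Scan the (C + 1) * 12-byte lookup table that follows the header.
 * Returns the file offset of the chunk with the given id, or 0 if the
 * terminating entry (id 0) or the end of the buffer is reached first.
 */
static uint64_t find_chunk_offset(const unsigned char *data, size_t len,
				  const struct graph_header *hdr,
				  uint32_t chunk_id)
{
	size_t i;

	for (i = 0; i <= hdr->num_chunks; i++) {
		size_t entry = 8 + i * 12;
		uint32_t id;

		if (entry + 12 > len)
			return 0;
		id = read_be32(data + entry);
		if (!id)
			return 0;
		if (id == chunk_id)
			return read_be64(data + entry + 4);
	}
	return 0;
}
```

Because chunks are contiguous, a reader that also needs a chunk's length
could subtract its offset from the offset in the next table entry.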

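The generation/date packing in the Commit Data chunk can be sketched as
follows. This is an illustration only: the function names are hypothetical,
and the words are kept in host order (a writer would still convert each one
with htonl() before writing). The generation number is assumed to fit in 30
bits and the date in 34 bits.

```c
#include <stdint.h>

/*
 * words[0]: generation number in the upper 30 bits, bits 33 and 34 of
 *           the commit time in the lowest 2 bits.
 * words[1]: the low 32 bits of the commit time.
 */
static void pack_generation_and_date(uint32_t generation, uint64_t date,
				     uint32_t words[2])
{
	words[0] = (generation << 2) | (uint32_t)((date >> 32) & 0x3);
	words[1] = (uint32_t)(date & 0xffffffff);
}

static uint32_t unpack_generation(const uint32_t words[2])
{
	return words[0] >> 2;
}

static uint64_t unpack_date(const uint32_t words[2])
{
	return ((uint64_t)(words[0] & 0x3) << 32) | words[1];
}
```

With the generation number left at zero ("not computed"), the first word
carries only the two high date bits.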

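To illustrate why the OID Fanout chunk helps, here is a small sketch of a
lookup that uses F[i] to narrow the binary search over the sorted OID list.
It is not code from this series: build_fanout() and lookup_oid() are
hypothetical names, OIDs are assumed to be 20 bytes (SHA-1), and the fanout
values are assumed to be already converted to host order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OID_LEN 20 /* assumption: SHA-1 */

/* fanout[b] = number of OIDs whose first byte is at most b. */
static void build_fanout(const unsigned char *oids, uint32_t n,
			 uint32_t fanout[256])
{
	uint32_t i = 0;
	int b;

	for (b = 0; b < 256; b++) {
		while (i < n && oids[(size_t)i * OID_LEN] == b)
			i++;
		fanout[b] = i;
	}
}

/*
 * Search the sorted, flat OID table. Only the positions in
 * [fanout[first_byte - 1], fanout[first_byte]) can match.
 * Returns 1 and sets *pos when found, 0 otherwise.
 */
static int lookup_oid(const uint32_t fanout[256], const unsigned char *oids,
		      const unsigned char *oid, uint32_t *pos)
{
	uint32_t first = oid[0] ? fanout[oid[0] - 1] : 0;
	uint32_t last = fanout[oid[0]];

	while (first < last) {
		uint32_t mid = first + (last - first) / 2;
		int cmp = memcmp(oid, oids + (size_t)mid * OID_LEN, OID_LEN);

		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp > 0)
			first = mid + 1;
		else
			last = mid;
	}
	return 0;
}
```

The position returned by such a lookup is the "int-id" that the Commit Data
chunk uses to refer to parents.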

* [PATCH v2 02/14] graph: add commit graph design document
  2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
  2018-01-30 21:39 ` [PATCH v2 01/14] commit-graph: add format document Derrick Stolee
@ 2018-01-30 21:39 ` Derrick Stolee
  2018-01-31  2:19   ` Stefan Beller
  2018-01-30 21:39 ` [PATCH v2 03/14] commit-graph: create git-commit-graph builtin Derrick Stolee
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-01-30 21:39 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, git, sbeller, dstolee

Add Documentation/technical/commit-graph.txt with details of the planned
commit graph feature, including future plans.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 189 +++++++++++++++++++++++++++++++
 1 file changed, 189 insertions(+)
 create mode 100644 Documentation/technical/commit-graph.txt

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
new file mode 100644
index 0000000000..cbf88f7264
--- /dev/null
+++ b/Documentation/technical/commit-graph.txt
@@ -0,0 +1,189 @@
+Git Commit Graph Design Notes
+=============================
+
+Git walks the commit graph for many reasons, including:
+
+1. Listing and filtering commit history.
+2. Computing merge bases.
+
+These operations can become slow as the commit count grows. The merge
+base calculation shows up in many user-facing commands, such as 'merge-base'
+or 'status', and can take minutes to compute depending on history
+shape.
+
+There are two main costs here:
+
+1. Decompressing and parsing commits.
+2. Walking the entire graph to avoid topological order mistakes.
+
+The commit graph file is a supplemental data structure that accelerates
+commit graph walks. If a user downgrades or disables the 'core.commitgraph'
+config setting, then the existing ODB is sufficient. The file is stored
+next to packfiles either in the .git/objects/pack directory or in the pack
+directory of an alternate.
+
+The commit graph file stores the commit graph structure along with some
+extra metadata to speed up graph walks. By listing commit OIDs in
+lexicographic order, we can identify an integer position for each commit and
+refer to the parents of a commit using those integer positions. We use
+binary search to find initial commits and then use the integer positions
+for fast lookups during the walk.
+
+A consumer may load the following info for a commit from the graph:
+
+1. The commit OID.
+2. The list of parents, along with their integer position.
+3. The commit date.
+4. The root tree OID.
+5. The generation number (see definition below).
+
+Values 1-4 satisfy the requirements of parse_commit_gently().
+
+Define the "generation number" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has generation number one.
+
+ * A commit with at least one parent has generation number one more than
+   the largest generation number among its parents.
+
+Equivalently, the generation number of a commit A is one more than the
+length of a longest path from A to a root commit. The recursive definition
+is easier to use for computation and for observing the following property:
+
+    If A and B are commits with generation numbers N and M, respectively,
+    and N <= M, then A cannot reach B. That is, we know without searching
+    that B is not an ancestor of A because it is further from a root commit
+    than A.
+
+    Conversely, when checking if A is an ancestor of B, then we only need
+    to walk commits until all commits on the walk boundary have generation
+    number at most N. If we walk commits using a priority queue seeded by
+    generation numbers, then we always expand the boundary commit with highest
+    generation number and can easily detect the stopping condition.
+
+This property can be used to significantly reduce the time it takes to
+walk commits and determine topological relationships. Without generation
+numbers, the general heuristic is the following:
+
+    If A and B are commits with commit time X and Y, respectively, and
+    X < Y, then A _probably_ cannot reach B.
+
+This heuristic is currently used whenever the computation can make
+mistakes with topological orders (such as "git log" with default order),
+but is not used when the topological order is required (such as merge
+base calculations, "git log --graph").
+
+In practice, we expect some commits to be created recently and not stored
+in the commit graph. We can treat these commits as having "infinite"
+generation number and walk until reaching commits with known generation
+number.
+
+Design Details
+--------------
+
+- A graph file is stored in a file named 'graph-<hash>.graph' in the pack
+  directory. This could be stored in an alternate.
+
+- The most-recent graph file hash is stored in a 'graph-head' file for
+  immediate access and storing backup graphs. This could be stored in an
+  alternate, and refers to a 'graph-<hash>.graph' file in the same pack
+  directory.
+
+- The core.commitgraph config setting must be on to consume graph files.
+
+- The file format includes parameters for the object id length and hash
+  algorithm, so a future change of hash algorithm does not require a change
+  in format.
+
+Current Limitations
+-------------------
+
+- Only one graph file is used at one time. This allows the integer position
+  to seek into the single graph file. It is possible to extend the model
+  for multiple graph files, but that is currently not part of the design.
+
+- .graph files are managed only by the 'commit-graph' builtin. These are not
+  updated automatically during clone, fetch, repack, or creating new commits.
+
+- There is no '--verify' option for the 'commit-graph' builtin to verify the
+  contents of the graph file agree with the contents in the ODB.
+
+- When rewriting the graph, we do not check that a commit still exists
+  in the ODB, so the graph may list commits that garbage collection removed.
+
+- Generation numbers are not computed in the current version. The file
+  format supports storing them, along with a mechanism to upgrade from
+  a file without generation numbers to one that uses them.
+
+Future Work
+-----------
+
+- The file format includes room for precomputed generation numbers. These
+  are not currently computed, so all generation numbers will be marked as
+  0 (or "uncomputed"). A later patch will include this calculation.
+
+- The commit graph is currently incompatible with commit grafts. This can be
+  remedied by duplicating or refactoring the current graft logic.
+
+- After computing and storing generation numbers, we must make graph
+  walks aware of generation numbers to gain the performance benefits they
+  enable. This will mostly be accomplished by swapping a commit-date-ordered
+  priority queue with one ordered by generation number. The following
+  operations are important candidates:
+
+    - paint_down_to_common()
+    - 'log --topo-order'
+
+- The graph currently only adds commits to a previously existing graph.
+  When writing a new graph, we could check that the ODB still contains
+  the commits and choose to remove the commits that are deleted from the
+  ODB. For performance reasons, this check should remain optional.
+
+- Currently, parse_commit_gently() requires filling in the root tree
+  object for a commit. This passes through lookup_tree() and consequently
+  lookup_object(). Also, it calls lookup_commit() when loading the parents.
+  These method calls check the ODB for object existence, even if the
+  consumer does not need the content. For example, we do not need the
+  tree contents when computing merge bases. Now that commit parsing is
+  removed from the computation time, these lookup operations are the
+  slowest operations keeping graph walks from being fast. Consider
+  loading these objects without verifying their existence in the ODB and
+  only loading them fully when consumers need them. Consider a method
+  such as "ensure_tree_loaded(commit)" that fully loads a tree before
+  using commit->tree.
+
+- The current design uses the 'commit-graph' builtin to generate the graph.
+  When this feature stabilizes enough to recommend to most users, we should
+  add automatic graph writes to common operations that create many commits.
+  For example, one could compute a graph on 'clone', 'fetch', or 'repack'
+  commands.
+
+- A server could provide a commit graph file as part of the network protocol
+  to avoid extra calculations by clients.
+
+Related Links
+-------------
+[0] https://bugs.chromium.org/p/git/issues/detail?id=8
+    Chromium work item for: Serialized Commit Graph
+
+[1] https://public-inbox.org/git/20110713070517.GC18566@sigill.intra.peff.net/
+    An abandoned patch that introduced generation numbers.
+
+[2] https://public-inbox.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/
+    Discussion about generation numbers on commits and how they interact
+    with fsck.
+
+[3] https://public-inbox.org/git/20170907094718.b6kuzp2uhvkmwcso@sigill.intra.peff.net/t/#m7a2ea7b355aeda962e6b86404bcbadc648abfbba
+    More discussion about generation numbers and not storing them inside
+    commit objects. A valuable quote:
+
+    "I think we should be moving more in the direction of keeping
+     repo-local caches for optimizations. Reachability bitmaps have been
+     a big performance win. I think we should be doing the same with our
+     properties of commits. Not just generation numbers, but making it
+     cheap to access the graph structure without zlib-inflating whole
+     commit objects (i.e., packv4 or something like the "metapacks" I
+     proposed a few years ago)."
+
+[4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u
+    A patch to remove the ahead-behind calculation from 'status'.
-- 
2.16.0.15.g9c3cf44.dirty
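
The recursive definition of generation numbers in the design document above
can be sketched on a toy graph as follows. This is only an illustration
under stated assumptions: struct toy_commit, NO_PARENT, compute_generation()
and definitely_cannot_reach() are hypothetical names, each commit has at most
two parents, and every commit has a known generation number (the "infinite"
case for commits outside the graph is not handled).

```c
#include <stdint.h>

#define NO_PARENT UINT32_MAX /* hypothetical sentinel for this sketch */

struct toy_commit {
	uint32_t parents[2]; /* integer positions of up to two parents */
	uint32_t generation; /* 0 means "not computed" */
};

/*
 * Root commits get generation 1; any other commit gets one more than
 * the largest generation number among its parents. Recursion keeps the
 * sketch short; deep histories would need an iterative walk.
 */
static uint32_t compute_generation(struct toy_commit *commits, uint32_t pos)
{
	uint32_t max_parent = 0;
	int i;

	if (commits[pos].generation)
		return commits[pos].generation;

	for (i = 0; i < 2; i++) {
		uint32_t p = commits[pos].parents[i];

		if (p != NO_PARENT) {
			uint32_t g = compute_generation(commits, p);

			if (g > max_parent)
				max_parent = g;
		}
	}

	commits[pos].generation = max_parent + 1;
	return commits[pos].generation;
}

/*
 * For two distinct commits A and B with generation numbers N and M:
 * if N <= M, then A cannot reach B, so no walk is needed.
 */
static int definitely_cannot_reach(uint32_t gen_a, uint32_t gen_b)
{
	return gen_a <= gen_b;
}
```

A return value of 0 from definitely_cannot_reach() says nothing by itself;
the walk is still required to decide reachability in that case.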



* [PATCH v2 03/14] commit-graph: create git-commit-graph builtin
  2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
  2018-01-30 21:39 ` [PATCH v2 01/14] commit-graph: add format document Derrick Stolee
  2018-01-30 21:39 ` [PATCH v2 02/14] graph: add commit graph design document Derrick Stolee
@ 2018-01-30 21:39 ` Derrick Stolee
  2018-02-02  0:53   ` SZEDER Gábor
  2018-01-30 21:39 ` [PATCH v2 04/14] commit-graph: implement construct_commit_graph() Derrick Stolee
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-01-30 21:39 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, git, sbeller, dstolee

Teach git the 'commit-graph' builtin that will be used for writing and
reading commit graph files. The current implementation is mostly
empty, except for a '--pack-dir' option.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 .gitignore                         |  1 +
 Documentation/git-commit-graph.txt |  7 +++++++
 Makefile                           |  1 +
 builtin.h                          |  1 +
 builtin/commit-graph.c             | 33 +++++++++++++++++++++++++++++++++
 command-list.txt                   |  1 +
 git.c                              |  1 +
 7 files changed, 45 insertions(+)
 create mode 100644 Documentation/git-commit-graph.txt
 create mode 100644 builtin/commit-graph.c

diff --git a/.gitignore b/.gitignore
index 833ef3b0b7..e82f90184d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -34,6 +34,7 @@
 /git-clone
 /git-column
 /git-commit
+/git-commit-graph
 /git-commit-tree
 /git-config
 /git-count-objects
diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
new file mode 100644
index 0000000000..c8ea548dfb
--- /dev/null
+++ b/Documentation/git-commit-graph.txt
@@ -0,0 +1,7 @@
+git-commit-graph(1)
+===================
+
+NAME
+----
+git-commit-graph - Write and verify Git commit graphs (.graph files)
+
diff --git a/Makefile b/Makefile
index 1a9b23b679..aee5d3f7b9 100644
--- a/Makefile
+++ b/Makefile
@@ -965,6 +965,7 @@ BUILTIN_OBJS += builtin/for-each-ref.o
 BUILTIN_OBJS += builtin/fsck.o
 BUILTIN_OBJS += builtin/gc.o
 BUILTIN_OBJS += builtin/get-tar-commit-id.o
+BUILTIN_OBJS += builtin/commit-graph.o
 BUILTIN_OBJS += builtin/grep.o
 BUILTIN_OBJS += builtin/hash-object.o
 BUILTIN_OBJS += builtin/help.o
diff --git a/builtin.h b/builtin.h
index 42378f3aa4..079855b6d4 100644
--- a/builtin.h
+++ b/builtin.h
@@ -149,6 +149,7 @@ extern int cmd_clone(int argc, const char **argv, const char *prefix);
 extern int cmd_clean(int argc, const char **argv, const char *prefix);
 extern int cmd_column(int argc, const char **argv, const char *prefix);
 extern int cmd_commit(int argc, const char **argv, const char *prefix);
+extern int cmd_commit_graph(int argc, const char **argv, const char *prefix);
 extern int cmd_commit_tree(int argc, const char **argv, const char *prefix);
 extern int cmd_config(int argc, const char **argv, const char *prefix);
 extern int cmd_count_objects(int argc, const char **argv, const char *prefix);
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
new file mode 100644
index 0000000000..2104550d25
--- /dev/null
+++ b/builtin/commit-graph.c
@@ -0,0 +1,33 @@
+#include "builtin.h"
+#include "cache.h"
+#include "config.h"
+#include "dir.h"
+#include "git-compat-util.h"
+#include "lockfile.h"
+#include "packfile.h"
+#include "parse-options.h"
+
+static char const * const builtin_commit_graph_usage[] = {
+	N_("git commit-graph [--pack-dir <packdir>]"),
+	NULL
+};
+
+static struct opts_commit_graph {
+	const char *pack_dir;
+} opts;
+
+int cmd_commit_graph(int argc, const char **argv, const char *prefix)
+{
+	static struct option builtin_commit_graph_options[] = {
+		{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
+			N_("dir"),
+			N_("The pack directory to store the graph") },
+		OPT_END(),
+	};
+
+	if (argc == 2 && !strcmp(argv[1], "-h"))
+		usage_with_options(builtin_commit_graph_usage,
+				   builtin_commit_graph_options);
+
+	return 0;
+}
diff --git a/command-list.txt b/command-list.txt
index a1fad28fd8..835c5890be 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -34,6 +34,7 @@ git-clean                               mainporcelain
 git-clone                               mainporcelain           init
 git-column                              purehelpers
 git-commit                              mainporcelain           history
+git-commit-graph                        plumbingmanipulators
 git-commit-tree                         plumbingmanipulators
 git-config                              ancillarymanipulators
 git-count-objects                       ancillaryinterrogators
diff --git a/git.c b/git.c
index c870b9719c..c7b5adae7b 100644
--- a/git.c
+++ b/git.c
@@ -388,6 +388,7 @@ static struct cmd_struct commands[] = {
 	{ "clone", cmd_clone },
 	{ "column", cmd_column, RUN_SETUP_GENTLY },
 	{ "commit", cmd_commit, RUN_SETUP | NEED_WORK_TREE },
+	{ "commit-graph", cmd_commit_graph, RUN_SETUP },
 	{ "commit-tree", cmd_commit_tree, RUN_SETUP },
 	{ "config", cmd_config, RUN_SETUP_GENTLY },
 	{ "count-objects", cmd_count_objects, RUN_SETUP },
-- 
2.16.0.15.g9c3cf44.dirty



* [PATCH v2 04/14] commit-graph: implement construct_commit_graph()
  2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
                   ` (2 preceding siblings ...)
  2018-01-30 21:39 ` [PATCH v2 03/14] commit-graph: create git-commit-graph builtin Derrick Stolee
@ 2018-01-30 21:39 ` Derrick Stolee
  2018-02-01 22:23   ` Jonathan Tan
                     ` (2 more replies)
  2018-01-30 21:39 ` [PATCH v2 05/14] commit-graph: implement git-commit-graph --write Derrick Stolee
                   ` (11 subsequent siblings)
  15 siblings, 3 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-01-30 21:39 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, git, sbeller, dstolee

Teach Git to write a commit graph file by checking all packed objects
to see if they are commits, then store the file in the given pack
directory.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Makefile       |   1 +
 commit-graph.c | 376 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 commit-graph.h |  20 +++
 3 files changed, 397 insertions(+)
 create mode 100644 commit-graph.c
 create mode 100644 commit-graph.h

diff --git a/Makefile b/Makefile
index aee5d3f7b9..894432b35b 100644
--- a/Makefile
+++ b/Makefile
@@ -773,6 +773,7 @@ LIB_OBJS += color.o
 LIB_OBJS += column.o
 LIB_OBJS += combine-diff.o
 LIB_OBJS += commit.o
+LIB_OBJS += commit-graph.o
 LIB_OBJS += compat/obstack.o
 LIB_OBJS += compat/terminal.o
 LIB_OBJS += config.o
diff --git a/commit-graph.c b/commit-graph.c
new file mode 100644
index 0000000000..db2b7390c7
--- /dev/null
+++ b/commit-graph.c
@@ -0,0 +1,376 @@
+#include "cache.h"
+#include "config.h"
+#include "git-compat-util.h"
+#include "pack.h"
+#include "packfile.h"
+#include "commit.h"
+#include "object.h"
+#include "commit-graph.h"
+
+#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
+#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
+#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
+#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_LARGEEDGES 0x45444745 /* "EDGE" */
+
+#define GRAPH_DATA_WIDTH 36
+
+#define GRAPH_VERSION_1 0x1
+#define GRAPH_VERSION GRAPH_VERSION_1
+
+#define GRAPH_OID_VERSION_SHA1 1
+#define GRAPH_OID_LEN_SHA1 20
+#define GRAPH_OID_VERSION GRAPH_OID_VERSION_SHA1
+#define GRAPH_OID_LEN GRAPH_OID_LEN_SHA1
+
+#define GRAPH_LARGE_EDGES_NEEDED 0x80000000
+#define GRAPH_PARENT_MISSING 0x7fffffff
+#define GRAPH_EDGE_LAST_MASK 0x7fffffff
+#define GRAPH_PARENT_NONE 0x70000000
+
+#define GRAPH_LAST_EDGE 0x80000000
+
+#define GRAPH_FANOUT_SIZE (4*256)
+#define GRAPH_CHUNKLOOKUP_SIZE (5 * 12)
+#define GRAPH_MIN_SIZE (GRAPH_CHUNKLOOKUP_SIZE + GRAPH_FANOUT_SIZE + \
+			GRAPH_OID_LEN + sizeof(struct commit_graph_header))
+
+char *get_commit_graph_filename_hash(const char *pack_dir,
+				     struct object_id *hash)
+{
+	size_t len;
+	struct strbuf head_path = STRBUF_INIT;
+	strbuf_addstr(&head_path, pack_dir);
+	strbuf_addstr(&head_path, "/graph-");
+	strbuf_addstr(&head_path, oid_to_hex(hash));
+	strbuf_addstr(&head_path, ".graph");
+
+	return strbuf_detach(&head_path, &len);
+}
+
+static void write_graph_chunk_fanout(struct sha1file *f,
+				     struct commit **commits,
+				     int nr_commits)
+{
+	uint32_t i, count = 0;
+	struct commit **list = commits;
+	struct commit **last = commits + nr_commits;
+
+	/*
+	 * Write the first-level table (the list is sorted,
+	 * but we use a 256-entry lookup to be able to avoid
+	 * having to do eight extra binary search iterations).
+	 */
+	for (i = 0; i < 256; i++) {
+		uint32_t swap_count;
+
+		while (list < last) {
+			if ((*list)->object.oid.hash[0] != i)
+				break;
+			count++;
+			list++;
+		}
+
+		swap_count = htonl(count);
+		sha1write(f, &swap_count, 4);
+	}
+}
+
+static void write_graph_chunk_oids(struct sha1file *f, int hash_len,
+				   struct commit **commits, int nr_commits)
+{
+	struct commit **list, **last = commits + nr_commits;
+	for (list = commits; list < last; list++)
+		sha1write(f, (*list)->object.oid.hash, (int)hash_len);
+}
+
+static int commit_pos(struct commit **commits, int nr_commits,
+		      const struct object_id *oid, uint32_t *pos)
+{
+	uint32_t first = 0, last = nr_commits;
+
+	while (first < last) {
+		uint32_t mid = first + (last - first) / 2;
+		struct object_id *current;
+		int cmp;
+
+		current = &(commits[mid]->object.oid);
+		cmp = oidcmp(oid, current);
+		if (!cmp) {
+			*pos = mid;
+			return 1;
+		}
+		if (cmp > 0) {
+			first = mid + 1;
+			continue;
+		}
+		last = mid;
+	}
+
+	*pos = first;
+	return 0;
+}
+
+static void write_graph_chunk_data(struct sha1file *f, int hash_len,
+				   struct commit **commits, int nr_commits)
+{
+	struct commit **list = commits;
+	struct commit **last = commits + nr_commits;
+	uint32_t num_large_edges = 0;
+
+	while (list < last) {
+		struct commit_list *parent;
+		uint32_t int_id, swap_int_id;
+		uint32_t packedDate[2];
+
+		parse_commit(*list);
+		sha1write(f, (*list)->tree->object.oid.hash, hash_len);
+
+		parent = (*list)->parents;
+
+		if (!parent)
+			swap_int_id = htonl(GRAPH_PARENT_NONE);
+		else if (commit_pos(commits, nr_commits,
+				    &(parent->item->object.oid), &int_id))
+			swap_int_id = htonl(int_id);
+		else
+			swap_int_id = htonl(GRAPH_PARENT_MISSING);
+
+		sha1write(f, &swap_int_id, 4);
+
+		if (parent)
+			parent = parent->next;
+
+		if (!parent)
+			swap_int_id = htonl(GRAPH_PARENT_NONE);
+		else if (parent->next)
+			swap_int_id = htonl(GRAPH_LARGE_EDGES_NEEDED | num_large_edges);
+		else if (commit_pos(commits, nr_commits,
+				    &(parent->item->object.oid), &int_id))
+			swap_int_id = htonl(int_id);
+		else
+			swap_int_id = htonl(GRAPH_PARENT_MISSING);
+
+		sha1write(f, &swap_int_id, 4);
+
+		if (parent && parent->next) {
+			do {
+				num_large_edges++;
+				parent = parent->next;
+			} while (parent);
+		}
+
+		if (sizeof((*list)->date) > 4)
+			packedDate[0] = htonl(((*list)->date >> 32) & 0x3);
+		else
+			packedDate[0] = 0;
+
+		packedDate[1] = htonl((*list)->date);
+		sha1write(f, packedDate, 8);
+
+		list++;
+	}
+}
+
+static void write_graph_chunk_large_edges(struct sha1file *f,
+					  struct commit **commits,
+					  int nr_commits)
+{
+	struct commit **list = commits;
+	struct commit **last = commits + nr_commits;
+	struct commit_list *parent;
+
+	while (list < last) {
+		int num_parents = 0;
+		for (parent = (*list)->parents; num_parents < 3 && parent;
+		     parent = parent->next)
+			num_parents++;
+
+		if (num_parents <= 2) {
+			list++;
+			continue;
+		}
+
+		for (parent = (*list)->parents; parent; parent = parent->next) {
+			uint32_t int_id, swap_int_id;
+			uint32_t last_edge = 0;
+
+			if (parent == (*list)->parents)
+				continue;
+
+			if (!parent->next)
+				last_edge |= GRAPH_LAST_EDGE;
+
+			if (commit_pos(commits, nr_commits,
+				       &(parent->item->object.oid),
+				       &int_id))
+				swap_int_id = htonl(int_id | last_edge);
+			else
+				swap_int_id = htonl(GRAPH_PARENT_MISSING | last_edge);
+
+			sha1write(f, &swap_int_id, 4);
+		}
+
+		list++;
+	}
+}
+
+static int commit_compare(const void *_a, const void *_b)
+{
+	struct object_id *a = *(struct object_id **)_a;
+	struct object_id *b = *(struct object_id **)_b;
+	return oidcmp(a, b);
+}
+
+struct packed_commit_list {
+	struct commit **list;
+	int num;
+	int size;
+};
+
+struct packed_oid_list {
+	struct object_id **list;
+	int num;
+	int size;
+};
+
+static int if_packed_commit_add_to_list(const struct object_id *oid,
+					struct packed_git *pack,
+					uint32_t pos,
+					void *data)
+{
+	struct packed_oid_list *list = (struct packed_oid_list*)data;
+	enum object_type type;
+	unsigned long size;
+	void *inner_data;
+	off_t offset = nth_packed_object_offset(pack, pos);
+	inner_data = unpack_entry(pack, offset, &type, &size);
+
+	if (inner_data)
+		free(inner_data);
+
+	if (type != OBJ_COMMIT)
+		return 0;
+
+	ALLOC_GROW(list->list, list->num + 1, list->size);
+	list->list[list->num] = (struct object_id *)malloc(sizeof(struct object_id));
+	oidcpy(list->list[list->num], oid);
+	(list->num)++;
+
+	return 0;
+}
+
+struct object_id *construct_commit_graph(const char *pack_dir)
+{
+	struct packed_oid_list oids;
+	struct packed_commit_list commits;
+	struct commit_graph_header hdr;
+	struct sha1file *f;
+	int i, count_distinct = 0;
+	struct strbuf tmp_file = STRBUF_INIT;
+	unsigned char final_hash[GIT_MAX_RAWSZ];
+	char *graph_name;
+	int fd;
+	uint32_t chunk_ids[5];
+	uint64_t chunk_offsets[5];
+	int num_long_edges;
+	struct object_id *f_hash;
+	char *fname;
+	struct commit_list *parent;
+
+	oids.num = 0;
+	oids.size = 1024;
+	ALLOC_ARRAY(oids.list, oids.size);
+	for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
+	QSORT(oids.list, oids.num, commit_compare);
+
+	count_distinct = 1;
+	for (i = 1; i < oids.num; i++) {
+		if (oidcmp(oids.list[i-1], oids.list[i]))
+			count_distinct++;
+	}
+
+	commits.num = 0;
+	commits.size = count_distinct;
+	ALLOC_ARRAY(commits.list, commits.size);
+
+	num_long_edges = 0;
+	for (i = 0; i < oids.num; i++) {
+		int num_parents = 0;
+		if (i > 0 && !oidcmp(oids.list[i-1], oids.list[i]))
+			continue;
+
+		commits.list[commits.num] = lookup_commit(oids.list[i]);
+		parse_commit(commits.list[commits.num]);
+
+		for (parent = commits.list[commits.num]->parents;
+		     parent; parent = parent->next)
+			num_parents++;
+
+		if (num_parents > 2)
+			num_long_edges += num_parents - 1;
+
+		commits.num++;
+	}
+
+	strbuf_addstr(&tmp_file, pack_dir);
+	strbuf_addstr(&tmp_file, "/tmp_graph_XXXXXX");
+
+	fd = git_mkstemp_mode(tmp_file.buf, 0444);
+	if (fd < 0)
+		die_errno("unable to create '%s'", tmp_file.buf);
+
+	graph_name = strbuf_detach(&tmp_file, NULL);
+	f = sha1fd(fd, graph_name);
+
+	hdr.graph_signature = htonl(GRAPH_SIGNATURE);
+	hdr.graph_version = GRAPH_VERSION;
+	hdr.hash_version = GRAPH_OID_VERSION;
+	hdr.hash_len = GRAPH_OID_LEN;
+	hdr.num_chunks = 4;
+
+	assert(sizeof(hdr) == 8);
+	sha1write(f, &hdr, sizeof(hdr));
+
+	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
+	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
+	chunk_ids[2] = GRAPH_CHUNKID_DATA;
+	chunk_ids[3] = GRAPH_CHUNKID_LARGEEDGES;
+	chunk_ids[4] = 0;
+
+	chunk_offsets[0] = sizeof(hdr) + GRAPH_CHUNKLOOKUP_SIZE;
+	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
+	chunk_offsets[2] = chunk_offsets[1] + GRAPH_OID_LEN * commits.num;
+	chunk_offsets[3] = chunk_offsets[2] + (GRAPH_OID_LEN + 16) * commits.num;
+	chunk_offsets[4] = chunk_offsets[3] + 4 * num_long_edges;
+
+	for (i = 0; i <= hdr.num_chunks; i++) {
+		uint32_t chunk_write[3];
+
+		chunk_write[0] = htonl(chunk_ids[i]);
+		chunk_write[1] = htonl(chunk_offsets[i] >> 32);
+		chunk_write[2] = htonl(chunk_offsets[i] & 0xffffffff);
+		sha1write(f, chunk_write, 12);
+	}
+
+	write_graph_chunk_fanout(f, commits.list, commits.num);
+	write_graph_chunk_oids(f, GRAPH_OID_LEN, commits.list, commits.num);
+	write_graph_chunk_data(f, GRAPH_OID_LEN, commits.list, commits.num);
+	write_graph_chunk_large_edges(f, commits.list, commits.num);
+
+	sha1close(f, final_hash, CSUM_CLOSE | CSUM_FSYNC);
+
+	f_hash = (struct object_id *)malloc(sizeof(struct object_id));
+	memcpy(f_hash->hash, final_hash, GIT_MAX_RAWSZ);
+	fname = get_commit_graph_filename_hash(pack_dir, f_hash);
+
+	if (rename(graph_name, fname))
+		die("failed to rename %s to %s", graph_name, fname);
+
+	free(oids.list);
+	oids.size = 0;
+	oids.num = 0;
+
+	return f_hash;
+}
+
diff --git a/commit-graph.h b/commit-graph.h
new file mode 100644
index 0000000000..7b3469a7df
--- /dev/null
+++ b/commit-graph.h
@@ -0,0 +1,20 @@
+#ifndef COMMIT_GRAPH_H
+#define COMMIT_GRAPH_H
+
+#include "git-compat-util.h"
+#include "commit.h"
+
+extern char* get_commit_graph_filename_hash(const char *pack_dir,
+					    struct object_id *hash);
+
+struct commit_graph_header {
+	uint32_t graph_signature;
+	unsigned char graph_version;
+	unsigned char hash_version;
+	unsigned char hash_len;
+	unsigned char num_chunks;
+};
+
+extern struct object_id *construct_commit_graph(const char *pack_dir);
+
+#endif
-- 
2.16.0.15.g9c3cf44.dirty



* [PATCH v2 05/14] commit-graph: implement git-commit-graph --write
  2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
                   ` (3 preceding siblings ...)
  2018-01-30 21:39 ` [PATCH v2 04/14] commit-graph: implement construct_commit_graph() Derrick Stolee
@ 2018-01-30 21:39 ` Derrick Stolee
  2018-02-01 23:33   ` Jonathan Tan
                     ` (2 more replies)
  2018-01-30 21:39 ` [PATCH v2 06/14] commit-graph: implement git-commit-graph --read Derrick Stolee
                   ` (10 subsequent siblings)
  15 siblings, 3 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-01-30 21:39 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, git, sbeller, dstolee

Teach git-commit-graph to write graph files. Add a new test script to
verify that the command succeeds.
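
As background for what --write produces: the graph file begins with a
256-entry fanout table (see write_graph_chunk_fanout() in the previous
patch), where entry i counts the commits whose first OID byte is at most i.
Below is a minimal standalone sketch of that idea, not the patch's code.
build_fanout() and fanout_range() are hypothetical names, and the sketch
works on host-order integers rather than the network-order values stored
in the file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FANOUT_ENTRIES 256

/*
 * Fill fanout[i] with the number of IDs whose first byte is <= i.
 * first_bytes must be sorted ascending, like the OID list in the
 * graph file. (Hypothetical helper; byte-order conversion omitted.)
 */
static void build_fanout(const unsigned char *first_bytes, size_t nr,
			 uint32_t fanout[FANOUT_ENTRIES])
{
	size_t pos = 0;
	int i;

	for (i = 0; i < FANOUT_ENTRIES; i++) {
		/* The count is cumulative: pos is never reset. */
		while (pos < nr && first_bytes[pos] == i)
			pos++;
		fanout[i] = (uint32_t)pos;
	}

	/* Every entry is consumed only if the input was sorted. */
	assert(pos == nr);
}

/*
 * Report the half-open range [*lo, *hi) of positions whose first
 * byte equals b. A binary search only has to look in that range.
 */
static void fanout_range(const uint32_t fanout[FANOUT_ENTRIES],
			 unsigned char b, uint32_t *lo, uint32_t *hi)
{
	*lo = b ? fanout[b - 1] : 0;
	*hi = fanout[b];
}
```

A reader of the file can use the pair returned by fanout_range() as the
initial bounds of its binary search over the OID lookup chunk, which is
where the saving of roughly eight iterations comes from.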

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt | 18 +++++++
 builtin/commit-graph.c             | 30 ++++++++++++
 t/t5318-commit-graph.sh            | 96 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 144 insertions(+)
 create mode 100755 t/t5318-commit-graph.sh

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index c8ea548dfb..3f3790d9a8 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -5,3 +5,21 @@ NAME
 ----
 git-commit-graph - Write and verify Git commit graphs (.graph files)
 
+
+SYNOPSIS
+--------
+[verse]
+'git commit-graph' --write <options> [--pack-dir <pack_dir>]
+
+EXAMPLES
+--------
+
+* Write a commit graph file for the packed commits in your local .git folder.
++
+------------------------------------------------
+$ git commit-graph --write
+------------------------------------------------
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 2104550d25..7affd512f1 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -6,22 +6,38 @@
 #include "lockfile.h"
 #include "packfile.h"
 #include "parse-options.h"
+#include "commit-graph.h"
 
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--pack-dir <packdir>]"),
+	N_("git commit-graph --write [--pack-dir <packdir>]"),
 	NULL
 };
 
 static struct opts_commit_graph {
 	const char *pack_dir;
+	int write;
 } opts;
 
+static int graph_write(void)
+{
+	struct object_id *graph_hash = construct_commit_graph(opts.pack_dir);
+
+	if (graph_hash)
+		printf("%s\n", oid_to_hex(graph_hash));
+
+	free(graph_hash);
+	return 0;
+}
+
 int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 {
 	static struct option builtin_commit_graph_options[] = {
 		{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
 			N_("dir"),
 			N_("The pack directory to store the graph") },
+		OPT_BOOL('w', "write", &opts.write,
+			N_("write commit graph file")),
 		OPT_END(),
 	};
 
@@ -29,5 +45,19 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 		usage_with_options(builtin_commit_graph_usage,
 				   builtin_commit_graph_options);
 
+	argc = parse_options(argc, argv, prefix,
+			     builtin_commit_graph_options,
+			     builtin_commit_graph_usage, 0);
+
+	if (!opts.pack_dir) {
+		struct strbuf path = STRBUF_INIT;
+		strbuf_addstr(&path, get_object_directory());
+		strbuf_addstr(&path, "/pack");
+		opts.pack_dir = strbuf_detach(&path, NULL);
+	}
+
+	if (opts.write)
+		return graph_write();
+
 	return 0;
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
new file mode 100755
index 0000000000..6bcd1cc264
--- /dev/null
+++ b/t/t5318-commit-graph.sh
@@ -0,0 +1,96 @@
+#!/bin/sh
+
+test_description='commit graph'
+. ./test-lib.sh
+
+test_expect_success 'setup full repo' \
+    'rm -rf .git &&
+     mkdir full &&
+     cd full &&
+     git init &&
+     git config core.commitgraph true &&
+     git config pack.threads 1 &&
+     packdir=".git/objects/pack"'
+
+test_expect_success 'write graph with no packs' \
+    'git commit-graph --write --pack-dir .'
+
+test_expect_success 'create commits and repack' \
+    'for i in $(test_seq 5)
+     do
+        echo $i >$i.txt &&
+        git add $i.txt &&
+        git commit -m "commit $i" &&
+        git branch commits/$i
+     done &&
+     git repack'
+
+test_expect_success 'write graph' \
+    'graph1=$(git commit-graph --write) &&
+     test_path_is_file ${packdir}/graph-${graph1}.graph'
+
+t_expect_success 'Add more commits' \
+    'git reset --hard commits/3 &&
+     for i in $(test_seq 6 10)
+     do
+        echo $i >$i.txt &&
+        git add $i.txt &&
+        git commit -m "commit $i" &&
+        git branch commits/$i
+     done &&
+     git reset --hard commits/3 &&
+     for i in $(test_seq 11 15)
+     do
+        echo $i >$i.txt &&
+        git add $i.txt &&
+        git commit -m "commit $i" &&
+        git branch commits/$i
+     done &&
+     git reset --hard commits/7 &&
+     git merge commits/11 &&
+     git branch merge/1 &&
+     git reset --hard commits/8 &&
+     git merge commits/12 &&
+     git branch merge/2 &&
+     git reset --hard commits/5 &&
+     git merge commits/10 commits/15 &&
+     git branch merge/3 &&
+     git repack'
+
+# Current graph structure:
+#
+#      M3
+#     / |\_____
+#    / 10      15
+#   /   |      |
+#  /    9 M2   14
+# |     |/  \  |
+# |     8 M1 | 13
+# |     |/ | \_|
+# 5     7  |   12
+# |     |   \__|
+# 4     6      11
+# |____/______/
+# 3
+# |
+# 2
+# |
+# 1
+
+test_expect_success 'write graph with merges' \
+    'graph2=$(git commit-graph --write) &&
+     test_path_is_file ${packdir}/graph-${graph2}.graph'
+
+test_expect_success 'setup bare repo' \
+    'cd .. &&
+     git clone --bare full bare &&
+     cd bare &&
+     git config core.graph true &&
+     git config pack.threads 1 &&
+     baredir="objects/pack"'
+
+test_expect_success 'write graph in bare repo' \
+    'graphbare=$(git commit-graph --write) &&
+     test_path_is_file ${baredir}/graph-${graphbare}.graph'
+
+test_done
-- 
2.16.0.15.g9c3cf44.dirty



* [PATCH v2 06/14] commit-graph: implement git-commit-graph --read
  2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
                   ` (4 preceding siblings ...)
  2018-01-30 21:39 ` [PATCH v2 05/14] commit-graph: implement git-commit-graph --write Derrick Stolee
@ 2018-01-30 21:39 ` Derrick Stolee
  2018-01-31  2:22   ` Stefan Beller
                     ` (2 more replies)
  2018-01-30 21:39 ` [PATCH v2 07/14] commit-graph: implement git-commit-graph --update-head Derrick Stolee
                   ` (9 subsequent siblings)
  15 siblings, 3 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-01-30 21:39 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, git, sbeller, dstolee

Teach git-commit-graph to read commit graph files and summarize their contents.

Use the --read option to verify the contents of a commit graph file in the
tests.
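
To illustrate what --read has to parse: the file starts with an 8-byte
header, followed by 12-byte chunk table entries (a 4-byte chunk id and an
8-byte offset, both big-endian). The sketch below decodes one such entry
from a memory buffer. It is an illustration only: the helper names are
made up, and it assembles integers byte by byte instead of using the
pointer casts and ntohl() found in load_commit_graph_one().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HDR_SIZE 8		/* signature + 4 one-byte fields */
#define CHUNK_ENTRY_SIZE 12	/* 4-byte id + 8-byte offset */

struct chunk_entry {
	uint32_t id;
	uint64_t offset;
};

/* Read big-endian integers without relying on alignment. */
static uint32_t get_be32_sketch(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t get_be64_sketch(const unsigned char *p)
{
	return ((uint64_t)get_be32_sketch(p) << 32) | get_be32_sketch(p + 4);
}

/*
 * Decode chunk table entry i from a buffer of len bytes.
 * Returns 0 on success, or -1 if the entry lies outside the buffer
 * or its offset points past the end of the buffer.
 */
static int read_chunk_entry(const unsigned char *data, size_t len,
			    unsigned int i, struct chunk_entry *out)
{
	size_t pos = HDR_SIZE + (size_t)CHUNK_ENTRY_SIZE * i;

	assert(data && out);
	if (len < HDR_SIZE || pos + CHUNK_ENTRY_SIZE > len)
		return -1;

	out->id = get_be32_sketch(data + pos);
	out->offset = get_be64_sketch(data + pos + 4);
	if (out->offset > len)
		return -1;
	return 0;
}
```

Checking each entry against the buffer length before dereferencing keeps a
truncated or corrupt file from causing an out-of-bounds read.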

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt |   7 ++
 builtin/commit-graph.c             |  55 +++++++++++++++
 commit-graph.c                     | 138 ++++++++++++++++++++++++++++++++++++-
 commit-graph.h                     |  25 +++++++
 t/t5318-commit-graph.sh            |  28 ++++++--
 5 files changed, 247 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 3f3790d9a8..09aeaf6c82 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -10,6 +10,7 @@ SYNOPSIS
 --------
 [verse]
 'git commit-graph' --write <options> [--pack-dir <pack_dir>]
+'git commit-graph' --read <options> [--pack-dir <pack_dir>]
 
 EXAMPLES
 --------
@@ -20,6 +21,12 @@ EXAMPLES
 $ git commit-graph --write
 ------------------------------------------------
 
+* Read basic information from a graph file.
++
+------------------------------------------------
+$ git commit-graph --read --graph-hash=<hash>
+------------------------------------------------
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 7affd512f1..218740b1f8 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -10,15 +10,58 @@
 
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--pack-dir <packdir>]"),
+	N_("git commit-graph --read [--graph-hash=<hash>]"),
 	N_("git commit-graph --write [--pack-dir <packdir>]"),
 	NULL
 };
 
 static struct opts_commit_graph {
 	const char *pack_dir;
+	int read;
+	const char *graph_hash;
 	int write;
 } opts;
 
+static int graph_read(void)
+{
+	struct object_id graph_hash;
+	struct commit_graph *graph = 0;
+	const char *graph_file;
+
+	if (opts.graph_hash && strlen(opts.graph_hash) == GIT_MAX_HEXSZ)
+		get_oid_hex(opts.graph_hash, &graph_hash);
+	else
+		die("no graph hash specified");
+
+	graph_file = get_commit_graph_filename_hash(opts.pack_dir, &graph_hash);
+	graph = load_commit_graph_one(graph_file, opts.pack_dir);
+
+	if (!graph)
+		die("graph file %s does not exist", graph_file);
+
+	printf("header: %08x %02x %02x %02x %02x\n",
+		ntohl(graph->hdr->graph_signature),
+		graph->hdr->graph_version,
+		graph->hdr->hash_version,
+		graph->hdr->hash_len,
+		graph->hdr->num_chunks);
+	printf("num_commits: %u\n", graph->num_commits);
+	printf("chunks:");
+
+	if (graph->chunk_oid_fanout)
+		printf(" oid_fanout");
+	if (graph->chunk_oid_lookup)
+		printf(" oid_lookup");
+	if (graph->chunk_commit_data)
+		printf(" commit_metadata");
+	if (graph->chunk_large_edges)
+		printf(" large_edges");
+	printf("\n");
+
+	printf("pack_dir: %s\n", graph->pack_dir);
+	return 0;
+}
+
 static int graph_write(void)
 {
 	struct object_id *graph_hash = construct_commit_graph(opts.pack_dir);
@@ -36,8 +79,14 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 		{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
 			N_("dir"),
 			N_("The pack directory to store the graph") },
+		OPT_BOOL('r', "read", &opts.read,
+			N_("read graph file")),
 		OPT_BOOL('w', "write", &opts.write,
 			N_("write commit graph file")),
+		{ OPTION_STRING, 'H', "graph-hash", &opts.graph_hash,
+			N_("hash"),
+			N_("A hash for a specific graph file in the pack-dir."),
+			PARSE_OPT_OPTARG, NULL, (intptr_t) "" },
 		OPT_END(),
 	};
 
@@ -49,6 +98,10 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 			     builtin_commit_graph_options,
 			     builtin_commit_graph_usage, 0);
 
+	if (opts.write + opts.read > 1)
+		usage_with_options(builtin_commit_graph_usage,
+				   builtin_commit_graph_options);
+
 	if (!opts.pack_dir) {
 		struct strbuf path = STRBUF_INIT;
 		strbuf_addstr(&path, get_object_directory());
@@ -56,6 +109,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 		opts.pack_dir = strbuf_detach(&path, NULL);
 	}
 
+	if (opts.read)
+		return graph_read();
 	if (opts.write)
 		return graph_write();
 
diff --git a/commit-graph.c b/commit-graph.c
index db2b7390c7..622a650259 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -48,6 +48,142 @@ char* get_commit_graph_filename_hash(const char *pack_dir,
 	return strbuf_detach(&head_path, &len);
 }
 
+static struct commit_graph *alloc_commit_graph(int extra)
+{
+	struct commit_graph *g = xmalloc(st_add(sizeof(*g), extra));
+	memset(g, 0, sizeof(*g));
+	g->graph_fd = -1;
+
+	return g;
+}
+
+int close_commit_graph(struct commit_graph *g)
+{
+	if (g->graph_fd < 0)
+		return 0;
+
+	munmap((void *)g->data, g->data_len);
+	g->data = 0;
+
+	close(g->graph_fd);
+	g->graph_fd = -1;
+
+	return 1;
+}
+
+static void free_commit_graph(struct commit_graph **g)
+{
+	if (!g || !*g)
+		return;
+
+	close_commit_graph(*g);
+
+	free(*g);
+	*g = NULL;
+}
+
+struct commit_graph *load_commit_graph_one(const char *graph_file, const char *pack_dir)
+{
+	void *graph_map;
+	const unsigned char *data;
+	struct commit_graph_header *hdr;
+	size_t graph_size;
+	struct stat st;
+	uint32_t i;
+	struct commit_graph *graph;
+	int fd = git_open(graph_file);
+	uint64_t last_chunk_offset;
+	uint32_t last_chunk_id;
+
+	if (fd < 0)
+		return 0;
+	if (fstat(fd, &st)) {
+		close(fd);
+		return 0;
+	}
+	graph_size = xsize_t(st.st_size);
+
+	if (graph_size < GRAPH_MIN_SIZE) {
+		close(fd);
+		die("graph file %s is too small", graph_file);
+	}
+	graph_map = xmmap(NULL, graph_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	data = (const unsigned char *)graph_map;
+
+	hdr = graph_map;
+	if (ntohl(hdr->graph_signature) != GRAPH_SIGNATURE) {
+		uint32_t signature = ntohl(hdr->graph_signature);
+		munmap(graph_map, graph_size);
+		close(fd);
+		die("graph signature %X does not match signature %X",
+			signature, GRAPH_SIGNATURE);
+	}
+	if (hdr->graph_version != GRAPH_VERSION) {
+		unsigned char version = hdr->graph_version;
+		munmap(graph_map, graph_size);
+		close(fd);
+		die("graph version %X does not match version %X",
+			version, GRAPH_VERSION);
+	}
+
+	graph = alloc_commit_graph(strlen(pack_dir) + 1);
+
+	graph->hdr = hdr;
+	graph->graph_fd = fd;
+	graph->data = graph_map;
+	graph->data_len = graph_size;
+
+	last_chunk_id = 0;
+	last_chunk_offset = (uint64_t)sizeof(*hdr);
+	for (i = 0; i < hdr->num_chunks; i++) {
+		uint32_t chunk_id = ntohl(*(uint32_t*)(data + sizeof(*hdr) + 12 * i));
+		uint64_t chunk_offset1 = ntohl(*(uint32_t*)(data + sizeof(*hdr) + 12 * i + 4));
+		uint32_t chunk_offset2 = ntohl(*(uint32_t*)(data + sizeof(*hdr) + 12 * i + 8));
+		uint64_t chunk_offset = (chunk_offset1 << 32) | chunk_offset2;
+
+		if (chunk_offset > graph_size - GIT_MAX_RAWSZ)
+			die("improper chunk offset %08x%08x", (uint32_t)(chunk_offset >> 32),
+			    (uint32_t)chunk_offset);
+
+		switch (chunk_id) {
+			case GRAPH_CHUNKID_OIDFANOUT:
+				graph->chunk_oid_fanout = data + chunk_offset;
+				break;
+
+			case GRAPH_CHUNKID_OIDLOOKUP:
+				graph->chunk_oid_lookup = data + chunk_offset;
+				break;
+
+			case GRAPH_CHUNKID_DATA:
+				graph->chunk_commit_data = data + chunk_offset;
+				break;
+
+			case GRAPH_CHUNKID_LARGEEDGES:
+				graph->chunk_large_edges = data + chunk_offset;
+				break;
+
+			case 0:
+				break;
+
+			default:
+				free_commit_graph(&graph);
+				die("unrecognized graph chunk id: %08x", chunk_id);
+		}
+
+		if (last_chunk_id == GRAPH_CHUNKID_OIDLOOKUP)
+		{
+			graph->num_commits = (chunk_offset - last_chunk_offset)
+					     / hdr->hash_len;
+		}
+
+		last_chunk_id = chunk_id;
+		last_chunk_offset = chunk_offset;
+	}
+
+	strcpy(graph->pack_dir, pack_dir);
+	return graph;
+}
+
 static void write_graph_chunk_fanout(struct sha1file *f,
 				     struct commit **commits,
 				     int nr_commits)
@@ -361,7 +497,7 @@ struct object_id *construct_commit_graph(const char *pack_dir)
 	sha1close(f, final_hash, CSUM_CLOSE | CSUM_FSYNC);
 
 	f_hash = (struct object_id *)malloc(sizeof(struct object_id));
-	memcpy(f_hash->hash, final_hash, GIT_MAX_RAWSZ);
+	hashcpy(f_hash->hash, final_hash);
 	fname = get_commit_graph_filename_hash(pack_dir, f_hash);
 
 	if (rename(graph_name, fname))
diff --git a/commit-graph.h b/commit-graph.h
index 7b3469a7df..e046ae575c 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -15,6 +15,31 @@ struct commit_graph_header {
 	unsigned char num_chunks;
 };
 
+extern struct commit_graph {
+	int graph_fd;
+
+	const unsigned char *data;
+	size_t data_len;
+
+	const struct commit_graph_header *hdr;
+
+	struct object_id oid;
+
+	uint32_t num_commits;
+
+	const unsigned char *chunk_oid_fanout;
+	const unsigned char *chunk_oid_lookup;
+	const unsigned char *chunk_commit_data;
+	const unsigned char *chunk_large_edges;
+
+	/* something like ".git/objects/pack" */
+	char pack_dir[FLEX_ARRAY]; /* more */
+} *commit_graph;
+
+extern int close_commit_graph(struct commit_graph *g);
+
+extern struct commit_graph *load_commit_graph_one(const char *graph_file, const char *pack_dir);
+
 extern struct object_id *construct_commit_graph(const char *pack_dir);
 
 #endif
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 6bcd1cc264..da565624e3 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -25,11 +25,23 @@ test_expect_success 'create commits and repack' \
      done &&
      git repack'
 
+_graph_read_expect() {
+    cat >expect <<- EOF
+header: 43475048 01 01 14 04
+num_commits: $1
+chunks: oid_fanout oid_lookup commit_metadata large_edges
+pack_dir: $2
+EOF
+}
+
 test_expect_success 'write graph' \
     'graph1=$(git commit-graph --write) &&
-     test_path_is_file ${packdir}/graph-${graph1}.graph'
+     test_path_is_file ${packdir}/graph-${graph1}.graph &&
+     git commit-graph --read --graph-hash=${graph1} >output &&
+     _graph_read_expect "5" "${packdir}" &&
+     cmp expect output'
 
-t_expect_success 'Add more commits' \
+test_expect_success 'Add more commits' \
     'git reset --hard commits/3 &&
      for i in $(test_seq 6 10)
      do
@@ -79,7 +91,10 @@ t_expect_success 'Add more commits' \
 
 test_expect_success 'write graph with merges' \
     'graph2=$(git commit-graph --write) &&
-     test_path_is_file ${packdir}/graph-${graph2}.graph'
+     test_path_is_file ${packdir}/graph-${graph2}.graph &&
+     git commit-graph --read --graph-hash=${graph2} >output &&
+     _graph_read_expect "18" "${packdir}" &&
+     cmp expect output'
 
 test_expect_success 'setup bare repo' \
     'cd .. &&
@@ -87,10 +102,13 @@ test_expect_success 'setup bare repo' \
      cd bare &&
      git config core.graph true &&
      git config pack.threads 1 &&
-     baredir="objects/pack"'
+     baredir="./objects/pack"'
 
 test_expect_success 'write graph in bare repo' \
     'graphbare=$(git commit-graph --write) &&
-     test_path_is_file ${baredir}/graph-${graphbare}.graph'
+     test_path_is_file ${baredir}/graph-${graphbare}.graph &&
+     git commit-graph --read --graph-hash=${graphbare} >output &&
+     _graph_read_expect "18" "${baredir}" &&
+     cmp expect output'
 
 test_done
-- 
2.16.0.15.g9c3cf44.dirty



* [PATCH v2 07/14] commit-graph: implement git-commit-graph --update-head
  2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
                   ` (5 preceding siblings ...)
  2018-01-30 21:39 ` [PATCH v2 06/14] commit-graph: implement git-commit-graph --read Derrick Stolee
@ 2018-01-30 21:39 ` Derrick Stolee
  2018-02-02  1:35   ` SZEDER Gábor
  2018-02-02  2:45   ` SZEDER Gábor
  2018-01-30 21:39 ` [PATCH v2 08/14] commit-graph: implement git-commit-graph --clear Derrick Stolee
                   ` (8 subsequent siblings)
  15 siblings, 2 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-01-30 21:39 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, git, sbeller, dstolee

It is possible to have multiple commit graph files in a pack directory,
but only one is active at a time. Use a 'graph-head' file to point to
the active file. Teach git-commit-graph to write 'graph-head' when
writing a new commit graph file if --update-head is given.
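
A rough sketch of the lookup this enables: take the contents of
graph-head, check that they look like a hash, and derive the name of the
graph file. It assumes a 40-character SHA-1 hex string and takes the file
contents as a string. is_graph_hash() and graph_path_from_head() are
hypothetical names, not functions from this series.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define HEX_LEN 40	/* assumed: SHA-1 written as hex */

/*
 * Return 1 if s begins with HEX_LEN hex digits followed by either
 * the end of the string or a newline, 0 otherwise.
 */
static int is_graph_hash(const char *s)
{
	size_t i;

	for (i = 0; i < HEX_LEN; i++)
		if (!isxdigit((unsigned char)s[i]))
			return 0;
	return s[HEX_LEN] == '\0' || s[HEX_LEN] == '\n';
}

/*
 * Build "<pack_dir>/graph-<hash>.graph" into out.
 * Returns 0 on success, or -1 if the hash is malformed or out is
 * too small to hold the path.
 */
static int graph_path_from_head(const char *pack_dir,
				const char *head_contents,
				char *out, size_t out_len)
{
	int n;

	assert(pack_dir && head_contents && out);
	if (!is_graph_hash(head_contents))
		return -1;

	n = snprintf(out, out_len, "%s/graph-%.*s.graph",
		     pack_dir, HEX_LEN, head_contents);
	return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

Validating the contents before building the path means a damaged
graph-head is reported as an error rather than silently producing a
file name that cannot exist.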

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt | 34 ++++++++++++++++++++++++++++++++++
 builtin/commit-graph.c             | 38 +++++++++++++++++++++++++++++++++++---
 commit-graph.c                     | 25 +++++++++++++++++++++++++
 commit-graph.h                     |  2 ++
 t/t5318-commit-graph.sh            | 12 ++++++++++--
 5 files changed, 106 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 09aeaf6c82..99ced16ddc 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -12,15 +12,49 @@ SYNOPSIS
 'git commit-graph' --write <options> [--pack-dir <pack_dir>]
 'git commit-graph' --read <options> [--pack-dir <pack_dir>]
 
+OPTIONS
+-------
+--pack-dir::
+	Use given directory for the location of packfiles, graph-head,
+	and graph files.
+
+--read::
+	Read a graph file given by the graph-head file and output basic
+	details about the graph file. (Cannot be combined with --write.)
+
+--graph-hash=<hash>::
+	When used with --read, consider the graph file graph-<hash>.graph.
+
+--write::
+	Write a new graph file to the pack directory. (Cannot be combined
+	with --read.)
+
+--update-head::
+	When used with --write, update the graph-head file to point to
+	the written graph file.
+
 EXAMPLES
 --------
 
+* Output the hash of the graph file pointed to by <dir>/graph-head.
++
+------------------------------------------------
+$ git commit-graph --pack-dir=<dir>
+------------------------------------------------
+
 * Write a commit graph file for the packed commits in your local .git folder.
 +
 ------------------------------------------------
 $ git commit-graph --write
 ------------------------------------------------
 
+* Write a graph file for the packed commits in your local .git folder,
+  and update graph-head.
++
+------------------------------------------------
+$ git commit-graph --write --update-head
+------------------------------------------------
+
 * Read basic information from a graph file.
 +
 ------------------------------------------------
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 218740b1f8..d73cbc907d 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -11,7 +11,7 @@
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--pack-dir <packdir>]"),
 	N_("git commit-graph --read [--graph-hash=<hash>]"),
-	N_("git commit-graph --write [--pack-dir <packdir>]"),
+	N_("git commit-graph --write [--pack-dir <packdir>] [--update-head]"),
 	NULL
 };
 
@@ -20,6 +20,9 @@ static struct opts_commit_graph {
 	int read;
 	const char *graph_hash;
 	int write;
+	int update_head;
+	int has_existing;
+	struct object_id old_graph_hash;
 } opts;
 
 static int graph_read(void)
@@ -30,8 +33,8 @@ static int graph_read(void)
 
 	if (opts.graph_hash && strlen(opts.graph_hash) == GIT_MAX_HEXSZ)
 		get_oid_hex(opts.graph_hash, &graph_hash);
-	else
-		die("no graph hash specified");
+	else if (!get_graph_head_hash(opts.pack_dir, &graph_hash))
+		die("no graph-head exists");
 
 	graph_file = get_commit_graph_filename_hash(opts.pack_dir, &graph_hash);
 	graph = load_commit_graph_one(graph_file, opts.pack_dir);
@@ -62,10 +65,33 @@ static int graph_read(void)
 	return 0;
 }
 
+static void update_head_file(const char *pack_dir, const struct object_id *graph_hash)
+{
+	struct strbuf head_path = STRBUF_INIT;
+	int fd;
+	struct lock_file lk = LOCK_INIT;
+
+	strbuf_addstr(&head_path, pack_dir);
+	strbuf_addstr(&head_path, "/");
+	strbuf_addstr(&head_path, "graph-head");
+
+	fd = hold_lock_file_for_update(&lk, head_path.buf, LOCK_DIE_ON_ERROR);
+	strbuf_release(&head_path);
+
+	if (fd < 0)
+		die_errno("unable to open graph-head");
+
+	write_in_full(fd, oid_to_hex(graph_hash), GIT_MAX_HEXSZ);
+	commit_lock_file(&lk);
+}
+
 static int graph_write(void)
 {
 	struct object_id *graph_hash = construct_commit_graph(opts.pack_dir);
 
+	if (opts.update_head)
+		update_head_file(opts.pack_dir, graph_hash);
+
 	if (graph_hash)
 		printf("%s\n", oid_to_hex(graph_hash));
 
@@ -83,6 +109,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 			N_("read graph file")),
 		OPT_BOOL('w', "write", &opts.write,
 			N_("write commit graph file")),
+		OPT_BOOL('u', "update-head", &opts.update_head,
+			N_("update graph-head to written graph file")),
 		{ OPTION_STRING, 'H', "graph-hash", &opts.graph_hash,
 			N_("hash"),
 			N_("A hash for a specific graph file in the pack-dir."),
@@ -109,10 +137,14 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 		opts.pack_dir = strbuf_detach(&path, NULL);
 	}
 
+	opts.has_existing = !!get_graph_head_hash(opts.pack_dir, &opts.old_graph_hash);
+
 	if (opts.read)
 		return graph_read();
 	if (opts.write)
 		return graph_write();
 
+	if (opts.has_existing)
+		printf("%s\n", oid_to_hex(&opts.old_graph_hash));
 	return 0;
 }
diff --git a/commit-graph.c b/commit-graph.c
index 622a650259..764e016ddb 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -35,6 +35,31 @@
 #define GRAPH_MIN_SIZE (GRAPH_CHUNKLOOKUP_SIZE + GRAPH_FANOUT_SIZE + \
 			GRAPH_OID_LEN + sizeof(struct commit_graph_header))
 
+struct object_id *get_graph_head_hash(const char *pack_dir, struct object_id *hash)
+{
+	struct strbuf head_filename = STRBUF_INIT;
+	char hex[GIT_MAX_HEXSZ + 1];
+	FILE *f;
+
+	strbuf_addstr(&head_filename, pack_dir);
+	strbuf_addstr(&head_filename, "/graph-head");
+
+	f = fopen(head_filename.buf, "r");
+	strbuf_release(&head_filename);
+
+	if (!f)
+		return 0;
+
+	if (!fgets(hex, sizeof(hex), f))
+		die("failed to read graph-head");
+
+	fclose(f);
+
+	if (get_oid_hex(hex, hash))
+		return 0;
+	return hash;
+}
+
 char* get_commit_graph_filename_hash(const char *pack_dir,
 				     struct object_id *hash)
 {
diff --git a/commit-graph.h b/commit-graph.h
index e046ae575c..43eb0aec84 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -4,6 +4,8 @@
 #include "git-compat-util.h"
 #include "commit.h"
 
+extern struct object_id *get_graph_head_hash(const char *pack_dir,
+					     struct object_id *hash);
 extern char* get_commit_graph_filename_hash(const char *pack_dir,
 					    struct object_id *hash);
 
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index da565624e3..d1a23bcdaf 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -13,7 +13,8 @@ test_expect_success 'setup full repo' \
      packdir=".git/objects/pack"'
 
 test_expect_success 'write graph with no packs' \
-    'git commit-graph --write --pack-dir .'
+    'git commit-graph --write --pack-dir . &&
+     test_path_is_missing graph-head'
 
 test_expect_success 'create commits and repack' \
     'for i in $(test_seq 5)
@@ -37,6 +38,7 @@ EOF
 test_expect_success 'write graph' \
     'graph1=$(git commit-graph --write) &&
      test_path_is_file ${packdir}/graph-${graph1}.graph &&
+     test_path_is_missing ${packdir}/graph-head &&
      git commit-graph --read --graph-hash=${graph1} >output &&
      _graph_read_expect "5" "${packdir}" &&
      cmp expect output'
@@ -90,8 +92,11 @@ test_expect_success 'Add more commits' \
 # 1
 
 test_expect_success 'write graph with merges' \
-    'graph2=$(git commit-graph --write) &&
+    'graph2=$(git commit-graph --write --update-head) &&
      test_path_is_file ${packdir}/graph-${graph2}.graph &&
+     test_path_is_file ${packdir}/graph-head &&
+     echo ${graph2} >expect &&
+     cmp -n 40 expect ${packdir}/graph-head &&
      git commit-graph --read --graph-hash=${graph2} >output &&
      _graph_read_expect "18" "${packdir}" &&
      cmp expect output'
@@ -107,6 +112,9 @@ test_expect_success 'setup bare repo' \
 test_expect_success 'write graph in bare repo' \
     'graphbare=$(git commit-graph --write) &&
      test_path_is_file ${baredir}/graph-${graphbare}.graph &&
+     test_path_is_file ${baredir}/graph-head &&
+     echo ${graphbare} >expect &&
+     cmp -n 40 expect ${baredir}/graph-head &&
      git commit-graph --read --graph-hash=${graphbare} >output &&
      _graph_read_expect "18" "${baredir}" &&
      cmp expect output'
-- 
2.16.0.15.g9c3cf44.dirty



* [PATCH v2 08/14] commit-graph: implement git-commit-graph --clear
  2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
                   ` (6 preceding siblings ...)
  2018-01-30 21:39 ` [PATCH v2 07/14] commit-graph: implement git-commit-graph --update-head Derrick Stolee
@ 2018-01-30 21:39 ` Derrick Stolee
  2018-02-02  4:01   ` SZEDER Gábor
  2018-01-30 21:39 ` [PATCH v2 09/14] commit-graph: teach git-commit-graph --delete-expired Derrick Stolee
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-01-30 21:39 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, git, sbeller, dstolee

Teach Git to delete the current 'graph-head' file and the commit graph
file it references. This is a safety valve in case the graph file is
corrupted and needs to be recalculated. Since the commit graph is a
summary of contents already in the ODB, it can always be regenerated.
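
As a rough illustration of the behavior described above, here is a
standalone sketch that does not use Git's internal helpers. The names
read_graph_head() and clear_graph() are hypothetical, the 40-digit hash
length assumes SHA-1, and plain unlink() stands in for remove_path(),
which in Git also prunes empty leading directories.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define HEX_LEN 40 /* SHA-1 hex digits; an assumption of this sketch */

/*
 * Hypothetical helper: read the hex hash stored in <pack_dir>/graph-head.
 * Returns 0 on success, -1 if the file is missing or too short.
 */
static int read_graph_head(const char *pack_dir, char hex[HEX_LEN + 1])
{
	char path[4096];
	FILE *f;
	size_t n;

	snprintf(path, sizeof(path), "%s/graph-head", pack_dir);
	f = fopen(path, "r");
	if (!f)
		return -1;
	n = fread(hex, 1, HEX_LEN, f);
	fclose(f);
	if (n != HEX_LEN)
		return -1;
	hex[HEX_LEN] = '\0';
	return 0;
}

/*
 * Hypothetical stand-in for graph_clear(): remove graph-head and the
 * graph-<hash>.graph file it names. Returns 0 on success (including
 * when there is nothing to clear), -1 if a removal fails.
 */
static int clear_graph(const char *pack_dir)
{
	char hex[HEX_LEN + 1];
	char path[4096];

	assert(pack_dir != NULL);
	if (read_graph_head(pack_dir, hex))
		return 0; /* no usable graph-head: nothing to do */

	snprintf(path, sizeof(path), "%s/graph-head", pack_dir);
	if (unlink(path))
		return -1;

	snprintf(path, sizeof(path), "%s/graph-%s.graph", pack_dir, hex);
	if (unlink(path))
		return -1;
	return 0;
}
```

Calling clear_graph() a second time is a no-op, which matches the early
return on !opts.has_existing in the patch.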

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt | 16 ++++++++++++++--
 builtin/commit-graph.c             | 32 +++++++++++++++++++++++++++++++-
 t/t5318-commit-graph.sh            |  7 ++++++-
 3 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 99ced16ddc..33d6567f11 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -11,6 +11,7 @@ SYNOPSIS
 [verse]
 'git commit-graph' --write <options> [--pack-dir <pack_dir>]
 'git commit-graph' --read <options> [--pack-dir <pack_dir>]
+'git commit-graph' --clear [--pack-dir <pack_dir>]
 
 OPTIONS
 -------
@@ -18,16 +19,21 @@ OPTIONS
 	Use given directory for the location of packfiles, graph-head,
 	and graph files.
 
+--clear::
+	Delete the graph-head file and the graph file it references.
+	(Cannot be combined with --read or --write.)
+
 --read::
 	Read a graph file given by the graph-head file and output basic
-	details about the graph file. (Cannot be combined with --write.)
+	details about the graph file. (Cannot be combined with --clear
+	or --write.)
 
 --graph-id::
 	When used with --read, consider the graph file graph-<oid>.graph.
 
 --write::
 	Write a new graph file to the pack directory. (Cannot be combined
-	with --read.)
+	with --clear or --read.)
 
 --update-head::
 	When used with --write, update the graph-head file to point to
@@ -61,6 +67,12 @@ $ git commit-graph --write --update-head
 $ git commit-graph --read --graph-hash=<hash>
 ------------------------------------------------
 
+* Delete <dir>/graph-head and the file it references.
++
+------------------------------------------------
+$ git commit-graph --clear --pack-dir=<dir>
+------------------------------------------------
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index d73cbc907d..4970dec133 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -10,6 +10,7 @@
 
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--pack-dir <packdir>]"),
+	N_("git commit-graph --clear [--pack-dir <packdir>]"),
 	N_("git commit-graph --read [--graph-hash=<hash>]"),
 	N_("git commit-graph --write [--pack-dir <packdir>] [--update-head]"),
 	NULL
@@ -17,6 +18,7 @@ static char const * const builtin_commit_graph_usage[] = {
 
 static struct opts_commit_graph {
 	const char *pack_dir;
+	int clear;
 	int read;
 	const char *graph_hash;
 	int write;
@@ -25,6 +27,30 @@ static struct opts_commit_graph {
 	struct object_id old_graph_hash;
 } opts;
 
+static int graph_clear(void)
+{
+	struct strbuf head_path = STRBUF_INIT;
+	char *old_path;
+
+	if (!opts.has_existing)
+		return 0;
+
+	strbuf_addstr(&head_path, opts.pack_dir);
+	strbuf_addstr(&head_path, "/");
+	strbuf_addstr(&head_path, "graph-head");
+	if (remove_path(head_path.buf))
+		die("failed to remove path %s", head_path.buf);
+	strbuf_release(&head_path);
+
+	old_path = get_commit_graph_filename_hash(opts.pack_dir,
+						  &opts.old_graph_hash);
+	if (remove_path(old_path))
+		die("failed to remove path %s", old_path);
+	free(old_path);
+
+	return 0;
+}
+
 static int graph_read(void)
 {
 	struct object_id graph_hash;
@@ -105,6 +131,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 		{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
 			N_("dir"),
 			N_("The pack directory to store the graph") },
+		OPT_BOOL('c', "clear", &opts.clear,
+			N_("clear graph file and graph-head")),
 		OPT_BOOL('r', "read", &opts.read,
 			N_("read graph file")),
 		OPT_BOOL('w', "write", &opts.write,
@@ -126,7 +154,7 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 			     builtin_commit_graph_options,
 			     builtin_commit_graph_usage, 0);
 
-	if (opts.write + opts.read > 1)
+	if (opts.write + opts.read + opts.clear > 1)
 		usage_with_options(builtin_commit_graph_usage,
 				   builtin_commit_graph_options);
 
@@ -139,6 +167,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 
 	opts.has_existing = !!get_graph_head_hash(opts.pack_dir, &opts.old_graph_hash);
 
+	if (opts.clear)
+		return graph_clear();
 	if (opts.read)
 		return graph_read();
 	if (opts.write)
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index d1a23bcdaf..6e3b62b754 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -101,6 +101,11 @@ test_expect_success 'write graph with merges' \
      _graph_read_expect "18" "${packdir}" &&
      cmp expect output'
 
+test_expect_success 'clear graph' \
+    'git commit-graph --clear &&
+     test_path_is_missing ${packdir}/graph-${graph2}.graph &&
+     test_path_is_missing ${packdir}/graph-head'
+
 test_expect_success 'setup bare repo' \
     'cd .. &&
      git clone --bare full bare &&
@@ -110,7 +115,7 @@ test_expect_success 'setup bare repo' \
      baredir="./objects/pack"'
 
 test_expect_success 'write graph in bare repo' \
-    'graphbare=$(git commit-graph --write) &&
+    'graphbare=$(git commit-graph --write --update-head) &&
      test_path_is_file ${baredir}/graph-${graphbare}.graph &&
      test_path_is_file ${baredir}/graph-head &&
      echo ${graphbare} >expect &&
-- 
2.16.0.15.g9c3cf44.dirty



* [PATCH v2 09/14] commit-graph: teach git-commit-graph --delete-expired
  2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
                   ` (7 preceding siblings ...)
  2018-01-30 21:39 ` [PATCH v2 08/14] commit-graph: implement git-commit-graph --clear Derrick Stolee
@ 2018-01-30 21:39 ` Derrick Stolee
  2018-02-02 15:04   ` SZEDER Gábor
  2018-01-30 21:39 ` [PATCH v2 10/14] commit-graph: add core.commitgraph setting Derrick Stolee
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-01-30 21:39 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, git, sbeller, dstolee

Teach git-commit-graph to delete the graph previously referenced by
'graph-head' when writing a new graph file and updating 'graph-head'.
This keeps stale graph files from accumulating in the pack directory.
Be careful not to delete the old graph when its hash matches the new
one, since both names then refer to the same file.
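
The deletion condition can be sketched in isolation as follows. This is
an illustration only: struct write_opts and should_delete_old_graph()
are hypothetical names, the 20-byte hash length assumes SHA-1, and the
NULL check on new_hash is an extra guard that the patch itself does not
contain.

```c
#include <assert.h>
#include <string.h>

#define HASH_LEN 20 /* raw SHA-1 length; an assumption of this sketch */

/* Hypothetical subset of the command's options. */
struct write_opts {
	int update_head;
	int delete_expired;
	int has_existing; /* a graph-head existed before the write */
};

/*
 * Return 1 if the previously referenced graph file should be deleted.
 * new_hash may be NULL when no graph was written.
 */
static int should_delete_old_graph(const struct write_opts *o,
				   const unsigned char *old_hash,
				   const unsigned char *new_hash)
{
	assert(o != NULL);
	if (!o->delete_expired || !o->update_head || !o->has_existing)
		return 0;
	if (!new_hash)
		return 0; /* nothing was written, keep the old file */
	/* Same hash means same file: deleting it would remove the new graph. */
	return memcmp(old_hash, new_hash, HASH_LEN) != 0;
}
```

Requiring update_head as well means the old file is only removed once
graph-head no longer points at it.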

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt |  8 +++--
 builtin/commit-graph.c             | 16 ++++++++-
 t/t5318-commit-graph.sh            | 66 +++++++++++++++++++++++++++++++++++++-
 3 files changed, 86 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 33d6567f11..7b376e9212 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -39,6 +39,10 @@ OPTIONS
 	When used with --write, update the graph-head file to point to
 	the written graph file.
 
+--delete-expired::
+	When used with --write and --update-head, delete the graph file
+	previously referenced by graph-head.
+
 EXAMPLES
 --------
 
@@ -55,10 +59,10 @@ $ git commit-graph --write
 ------------------------------------------------
 
 * Write a graph file for the packed commits in your local .git folder,
-* and update graph-head.
+* update graph-head, and delete the old graph-<oid>.graph file.
 +
 ------------------------------------------------
-$ git commit-graph --write --update-head
+$ git commit-graph --write --update-head --delete-expired
 ------------------------------------------------
 
 * Read basic information from a graph file.
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 4970dec133..766f09e6fc 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -12,7 +12,7 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--pack-dir <packdir>]"),
 	N_("git commit-graph --clear [--pack-dir <packdir>]"),
 	N_("git commit-graph --read [--graph-hash=<hash>]"),
-	N_("git commit-graph --write [--pack-dir <packdir>] [--update-head]"),
+	N_("git commit-graph --write [--pack-dir <packdir>] [--update-head] [--delete-expired]"),
 	NULL
 };
 
@@ -23,6 +23,7 @@ static struct opts_commit_graph {
 	const char *graph_hash;
 	int write;
 	int update_head;
+	int delete_expired;
 	int has_existing;
 	struct object_id old_graph_hash;
 } opts;
@@ -121,6 +122,17 @@ static int graph_write(void)
 	if (graph_hash)
 		printf("%s\n", oid_to_hex(graph_hash));
 
+
+	if (opts.delete_expired && opts.update_head && opts.has_existing &&
+	    oidcmp(graph_hash, &opts.old_graph_hash)) {
+		char *old_path = get_commit_graph_filename_hash(opts.pack_dir,
+								&opts.old_graph_hash);
+		if (remove_path(old_path))
+			die("failed to remove path %s", old_path);
+
+		free(old_path);
+	}
+
 	free(graph_hash);
 	return 0;
 }
@@ -139,6 +151,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 			N_("write commit graph file")),
 		OPT_BOOL('u', "update-head", &opts.update_head,
 			N_("update graph-head to written graph file")),
+		OPT_BOOL('d', "delete-expired", &opts.delete_expired,
+			N_("delete expired head graph file")),
 		{ OPTION_STRING, 'H', "graph-hash", &opts.graph_hash,
 			N_("hash"),
 			N_("A hash for a specific graph file in the pack-dir."),
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 6e3b62b754..b56a6d4217 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -101,9 +101,73 @@ test_expect_success 'write graph with merges' \
      _graph_read_expect "18" "${packdir}" &&
      cmp expect output'
 
+test_expect_success 'Add more commits' \
+    'for i in $(test_seq 16 20)
+     do
+        echo $i >$i.txt &&
+        git add $i.txt &&
+        git commit -m "commit $i" &&
+        git branch commits/$i
+     done &&
+     git repack'
+
+# Current graph structure:
+#
+#      20
+#       |
+#      19
+#       |
+#      18
+#       |
+#      17
+#       |
+#      16
+#       |
+#      M3
+#     / |\_____
+#    / 10      15
+#   /   |      |
+#  /    9 M2   14
+# |     |/  \  |
+# |     8 M1 | 13
+# |     |/ | \_|
+# 5     7  |   12
+# |     |   \__|
+# 4     6      11
+# |____/______/
+# 3
+# |
+# 2
+# |
+# 1
+
+test_expect_success 'write graph with merges' \
+    'graph3=$(git commit-graph --write --update-head --delete-expired) &&
+     test_path_is_file ${packdir}/graph-${graph3}.graph &&
+     test_path_is_missing ${packdir}/graph-${graph2}.graph &&
+     test_path_is_file ${packdir}/graph-${graph1}.graph &&
+     test_path_is_file ${packdir}/graph-head &&
+     echo ${graph3} >expect &&
+     cmp -n 40 expect ${packdir}/graph-head &&
+     git commit-graph --read --graph-hash=${graph3} >output &&
+     _graph_read_expect "23" "${packdir}" &&
+     cmp expect output'
+
+test_expect_success 'write graph with nothing new' \
+    'graph4=$(git commit-graph --write --update-head --delete-expired) &&
+     test_path_is_file ${packdir}/graph-${graph4}.graph &&
+     test_path_is_file ${packdir}/graph-${graph1}.graph &&
+     test_path_is_file ${packdir}/graph-head &&
+     echo ${graph4} >expect &&
+     cmp -n 40 expect ${packdir}/graph-head &&
+     git commit-graph --read --graph-hash=${graph4} >output &&
+     _graph_read_expect "23" "${packdir}" &&
+     cmp expect output'
+
 test_expect_success 'clear graph' \
     'git commit-graph --clear &&
      test_path_is_missing ${packdir}/graph-${graph2}.graph &&
+     test_path_is_file ${packdir}/graph-${graph1}.graph &&
      test_path_is_missing ${packdir}/graph-head'
 
 test_expect_success 'setup bare repo' \
@@ -121,7 +185,7 @@ test_expect_success 'write graph in bare repo' \
      echo ${graphbare} >expect &&
      cmp -n 40 expect ${baredir}/graph-head &&
      git commit-graph --read --graph-hash=${graphbare} >output &&
-     _graph_read_expect "18" "${baredir}" &&
+     _graph_read_expect "23" "${baredir}" &&
      cmp expect output'
 
 test_done
-- 
2.16.0.15.g9c3cf44.dirty



* [PATCH v2 10/14] commit-graph: add core.commitgraph setting
  2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
                   ` (8 preceding siblings ...)
  2018-01-30 21:39 ` [PATCH v2 09/14] commit-graph: teach git-commit-graph --delete-expired Derrick Stolee
@ 2018-01-30 21:39 ` Derrick Stolee
  2018-01-31 22:44   ` Igor Djordjevic
  2018-02-02 16:01   ` SZEDER Gábor
  2018-01-30 21:39 ` [PATCH v2 11/14] commit: integrate commit graph with commit parsing Derrick Stolee
                   ` (5 subsequent siblings)
  15 siblings, 2 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-01-30 21:39 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, git, sbeller, dstolee

The commit graph feature is controlled by the new core.commitgraph config
setting. This defaults to 0, so the feature is opt-in.

The intention is that a user can always make Git stop checking for or
parsing commit graph files by setting core.commitgraph=0.
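
A minimal sketch of that guarantee, assuming the flag is tested before
any graph state is touched. Only the name core_commitgraph comes from
this patch; graph_accesses, stub_lookup_result, lookup_in_graph_stub()
and parse_in_graph_sketch() are hypothetical and exist purely to make
the ordering observable.

```c
#include <assert.h>
#include <stddef.h>

static int core_commitgraph;       /* mirrors the patch's global; 0 = off */
static int graph_accesses;         /* hypothetical counter for the sketch */
static int stub_lookup_result = 1; /* hypothetical: pretend the commit is found */

/* Hypothetical stand-in for opening graph-head and searching the graph. */
static int lookup_in_graph_stub(const void *item)
{
	(void)item;
	graph_accesses++;
	return stub_lookup_result;
}

/*
 * Return 1 if the commit was filled from the graph, 0 to tell the
 * caller to fall back to parsing the commit object itself.
 */
static int parse_in_graph_sketch(const void *item)
{
	assert(item != NULL);
	if (!core_commitgraph)
		return 0; /* disabled: never look at graph files */
	return lookup_in_graph_stub(item);
}
```

With the flag at its default of 0, the counter never moves, so a broken
graph file cannot affect normal operations.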

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config.txt | 3 +++
 cache.h                  | 1 +
 config.c                 | 5 +++++
 environment.c            | 1 +
 4 files changed, 10 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0e25b2c92b..5b63559a2b 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -898,6 +898,9 @@ core.notesRef::
 This setting defaults to "refs/notes/commits", and it can be overridden by
 the `GIT_NOTES_REF` environment variable.  See linkgit:git-notes[1].
 
+core.commitgraph::
+	Enable git commit graph feature. Allows reading from .graph files.
+
 core.sparseCheckout::
 	Enable "sparse checkout" feature. See section "Sparse checkout" in
 	linkgit:git-read-tree[1] for more information.
diff --git a/cache.h b/cache.h
index d8b975a571..e50e447a4f 100644
--- a/cache.h
+++ b/cache.h
@@ -825,6 +825,7 @@ extern char *git_replace_ref_base;
 extern int fsync_object_files;
 extern int core_preload_index;
 extern int core_apply_sparse_checkout;
+extern int core_commitgraph;
 extern int precomposed_unicode;
 extern int protect_hfs;
 extern int protect_ntfs;
diff --git a/config.c b/config.c
index e617c2018d..99153fcfdb 100644
--- a/config.c
+++ b/config.c
@@ -1223,6 +1223,11 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.commitgraph")) {
+		core_commitgraph = git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.sparsecheckout")) {
 		core_apply_sparse_checkout = git_config_bool(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index 63ac38a46f..faa4323cc5 100644
--- a/environment.c
+++ b/environment.c
@@ -61,6 +61,7 @@ enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE;
 char *notes_ref_name;
 int grafts_replace_parents = 1;
 int core_apply_sparse_checkout;
+int core_commitgraph;
 int merge_log_config = -1;
 int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
 unsigned long pack_size_limit_cfg;
-- 
2.16.0.15.g9c3cf44.dirty



* [PATCH v2 11/14] commit: integrate commit graph with commit parsing
  2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
                   ` (9 preceding siblings ...)
  2018-01-30 21:39 ` [PATCH v2 10/14] commit-graph: add core.commitgraph setting Derrick Stolee
@ 2018-01-30 21:39 ` Derrick Stolee
  2018-02-02  1:51   ` Jonathan Tan
  2018-01-30 21:39 ` [PATCH v2 12/14] commit-graph: read only from specific pack-indexes Derrick Stolee
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-01-30 21:39 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, git, sbeller, dstolee

Teach Git to inspect a commit graph file to supply the contents of a
struct commit when calling parse_commit_gently(). This implementation
satisfies all post-conditions on the struct commit, including loading
parents, the root tree, and the commit date. The only loosely-expected
condition is that the commit buffer is loaded into the cache. This
was checked in log-tree.c:show_log(), but the "return;" on failure
produced unexpected results (i.e. the message line was never terminated).
The new behavior of loading the buffer when needed prevents the
unexpected behavior.

If core.commitgraph is false, then do not check graph files.
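
For readers unfamiliar with the lookup that backs this, the following is
a simplified, standalone sketch of a fanout-narrowed binary search in
the spirit of bsearch_graph() in this patch. It assumes a host-order
uint32_t fanout table and 20-byte hashes rather than the on-disk
network-order chunk, and find_oid() and build_fanout() are hypothetical
names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HASH_LEN 20 /* raw SHA-1 length; an assumption of this sketch */

/*
 * fanout[b] holds the number of OIDs whose first byte is <= b, so the
 * OIDs starting with byte b occupy positions [fanout[b - 1], fanout[b]).
 * oids is a sorted array of n * HASH_LEN bytes.
 */
static int find_oid(const uint32_t fanout[256], const unsigned char *oids,
		    const unsigned char *want, uint32_t *pos)
{
	uint32_t first, last;

	assert(fanout && oids && want && pos);
	first = want[0] ? fanout[want[0] - 1] : 0;
	last = fanout[want[0]];

	while (first < last) {
		uint32_t mid = first + (last - first) / 2;
		int cmp = memcmp(want, oids + (size_t)HASH_LEN * mid, HASH_LEN);

		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp > 0)
			first = mid + 1;
		else
			last = mid;
	}
	*pos = first; /* insertion point when not found */
	return 0;
}

/* Build the fanout table for a sorted OID array (helper for the sketch). */
static void build_fanout(uint32_t fanout[256], const unsigned char *oids,
			 uint32_t n)
{
	uint32_t i = 0;
	int b;

	for (b = 0; b < 256; b++) {
		while (i < n && oids[(size_t)HASH_LEN * i] <= b)
			i++;
		fanout[b] = i;
	}
}
```

The fanout step restricts the binary search to the OIDs sharing the
first byte of the wanted hash, so each lookup touches only a small slice
of the table.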

In test script t5318-commit-graph.sh, add output-matching conditions on
read-only graph operations.

By loading commits from the graph instead of parsing commit buffers, we
save a lot of time on long commit walks. Here are some performance
results for a copy of the Linux repository where 'master' has 704,766
reachable commits and is behind 'origin/master' by 19,610 commits.

| Command                          | Before | After  | Rel % |
|----------------------------------|--------|--------|-------|
| log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
| branch -vv                       |  0.42s |  0.27s | -35%  |
| rev-list --all                   |  6.4s  |  1.0s  | -84%  |
| rev-list --all --objects         | 32.6s  | 27.6s  | -15%  |

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 alloc.c                 |   1 +
 commit-graph.c          | 237 ++++++++++++++++++++++++++++++++++++++++++++++++
 commit-graph.h          |  20 +++-
 commit.c                |  10 +-
 commit.h                |   4 +
 log-tree.c              |   3 +-
 t/t5318-commit-graph.sh |  47 ++++++++++
 7 files changed, 318 insertions(+), 4 deletions(-)

diff --git a/alloc.c b/alloc.c
index 12afadfacd..cf4f8b61e1 100644
--- a/alloc.c
+++ b/alloc.c
@@ -93,6 +93,7 @@ void *alloc_commit_node(void)
 	struct commit *c = alloc_node(&commit_state, sizeof(struct commit));
 	c->object.type = OBJ_COMMIT;
 	c->index = alloc_commit_index();
+	c->graph_pos = COMMIT_NOT_FROM_GRAPH;
 	return c;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 764e016ddb..fc816533c6 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -35,6 +35,9 @@
 #define GRAPH_MIN_SIZE (GRAPH_CHUNKLOOKUP_SIZE + GRAPH_FANOUT_SIZE + \
 			GRAPH_OID_LEN + sizeof(struct commit_graph_header))
 
+/* global storage */
+struct commit_graph *commit_graph = 0;
+
 struct object_id *get_graph_head_hash(const char *pack_dir, struct object_id *hash)
 {
 	struct strbuf head_filename = STRBUF_INIT;
@@ -209,6 +212,220 @@ struct commit_graph *load_commit_graph_one(const char *graph_file, const char *p
 	return graph;
 }
 
+static void prepare_commit_graph_one(const char *obj_dir)
+{
+	char *graph_file;
+	struct object_id oid;
+	struct strbuf pack_dir = STRBUF_INIT;
+	strbuf_addstr(&pack_dir, obj_dir);
+	strbuf_add(&pack_dir, "/pack", 5);
+
+	if (!get_graph_head_hash(pack_dir.buf, &oid))
+		return;
+
+	graph_file = get_commit_graph_filename_hash(pack_dir.buf, &oid);
+
+	commit_graph = load_commit_graph_one(graph_file, pack_dir.buf);
+	strbuf_release(&pack_dir);
+}
+
+static int prepare_commit_graph_run_once = 0;
+void prepare_commit_graph(void)
+{
+	struct alternate_object_database *alt;
+	char *obj_dir;
+
+	if (prepare_commit_graph_run_once)
+		return;
+	prepare_commit_graph_run_once = 1;
+
+	obj_dir = get_object_directory();
+	prepare_commit_graph_one(obj_dir);
+	prepare_alt_odb();
+	for (alt = alt_odb_list; !commit_graph && alt; alt = alt->next)
+		prepare_commit_graph_one(alt->path);
+}
+
+static int bsearch_graph(struct commit_graph *g, struct object_id *oid, uint32_t *pos)
+{
+	uint32_t last, first = 0;
+
+	if (oid->hash[0])
+		first = ntohl(*(uint32_t*)(g->chunk_oid_fanout + 4 * (oid->hash[0] - 1)));
+	last = ntohl(*(uint32_t*)(g->chunk_oid_fanout + 4 * oid->hash[0]));
+
+	while (first < last) {
+		uint32_t mid = first + (last - first) / 2;
+		const unsigned char *current;
+		int cmp;
+
+		current = g->chunk_oid_lookup + g->hdr->hash_len * mid;
+		cmp = hashcmp(oid->hash, current);
+		if (!cmp) {
+			*pos = mid;
+			return 1;
+		}
+		if (cmp > 0) {
+			first = mid + 1;
+			continue;
+		}
+		last = mid;
+	}
+
+	*pos = first;
+	return 0;
+}
+
+struct object_id *get_nth_commit_oid(struct commit_graph *g,
+				     uint32_t n,
+				     struct object_id *oid)
+{
+	hashcpy(oid->hash, g->chunk_oid_lookup + g->hdr->hash_len * n);
+	return oid;
+}
+
+static int full_parse_commit(struct commit *item, struct commit_graph *g,
+			     uint32_t pos, const unsigned char *commit_data)
+{
+	struct object_id oid;
+	struct commit *new_parent;
+	uint32_t new_parent_pos;
+	uint32_t *parent_data_ptr;
+	uint64_t date_low, date_high;
+	struct commit_list **pptr;
+
+	item->object.parsed = 1;
+	item->graph_pos = pos;
+
+	hashcpy(oid.hash, commit_data);
+	item->tree = lookup_tree(&oid);
+
+	date_high = ntohl(*(uint32_t*)(commit_data + g->hdr->hash_len + 8)) & 0x3;
+	date_low = ntohl(*(uint32_t*)(commit_data + g->hdr->hash_len + 12));
+	item->date = (timestamp_t)((date_high << 32) | date_low);
+
+	pptr = &item->parents;
+
+	new_parent_pos = ntohl(*(uint32_t*)(commit_data + g->hdr->hash_len));
+	if (new_parent_pos == GRAPH_PARENT_NONE)
+		return 1;
+	get_nth_commit_oid(g, new_parent_pos, &oid);
+	new_parent = lookup_commit(&oid);
+	if (new_parent) {
+		new_parent->graph_pos = new_parent_pos;
+		pptr = &commit_list_insert(new_parent, pptr)->next;
+	} else {
+		die("could not find commit %s", oid_to_hex(&oid));
+	}
+
+	new_parent_pos = ntohl(*(uint32_t*)(commit_data + g->hdr->hash_len + 4));
+	if (new_parent_pos == GRAPH_PARENT_NONE)
+		return 1;
+	if (!(new_parent_pos & GRAPH_LARGE_EDGES_NEEDED)) {
+		get_nth_commit_oid(g, new_parent_pos, &oid);
+		new_parent = lookup_commit(&oid);
+		if (new_parent) {
+			new_parent->graph_pos = new_parent_pos;
+			pptr = &commit_list_insert(new_parent, pptr)->next;
+		} else
+			die("could not find commit %s", oid_to_hex(&oid));
+		return 1;
+	}
+
+	parent_data_ptr = (uint32_t*)(g->chunk_large_edges + 4 * (new_parent_pos ^ GRAPH_LARGE_EDGES_NEEDED));
+	do {
+		new_parent_pos = ntohl(*parent_data_ptr);
+
+		get_nth_commit_oid(g, new_parent_pos & GRAPH_EDGE_LAST_MASK, &oid);
+		new_parent = lookup_commit(&oid);
+		if (new_parent) {
+			new_parent->graph_pos = new_parent_pos & GRAPH_EDGE_LAST_MASK;
+			pptr = &commit_list_insert(new_parent, pptr)->next;
+		} else
+			die("could not find commit %s", oid_to_hex(&oid));
+		parent_data_ptr++;
+	} while (!(new_parent_pos & GRAPH_LAST_EDGE));
+
+	return 1;
+}
+
+/**
+ * Fill 'item' to contain all information that would be parsed by parse_commit_buffer.
+ */
+static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos)
+{
+	uint32_t new_parent_pos;
+	uint32_t *parent_data_ptr;
+	const unsigned char *commit_data = g->chunk_commit_data + (g->hdr->hash_len + 16) * pos;
+
+	new_parent_pos = ntohl(*(uint32_t*)(commit_data + g->hdr->hash_len));
+
+	if (new_parent_pos == GRAPH_PARENT_MISSING)
+		return 0;
+
+	if (new_parent_pos == GRAPH_PARENT_NONE)
+		return full_parse_commit(item, g, pos, commit_data);
+
+	new_parent_pos = ntohl(*(uint32_t*)(commit_data + g->hdr->hash_len + 4));
+
+	if (new_parent_pos == GRAPH_PARENT_MISSING)
+		return 0;
+	if (!(new_parent_pos & GRAPH_LARGE_EDGES_NEEDED))
+		return full_parse_commit(item, g, pos, commit_data);
+
+	new_parent_pos = new_parent_pos ^ GRAPH_LARGE_EDGES_NEEDED;
+
+	if (new_parent_pos == GRAPH_PARENT_MISSING)
+		return 0;
+
+	parent_data_ptr = (uint32_t*)(g->chunk_large_edges + 4 * new_parent_pos);
+	do {
+		new_parent_pos = ntohl(*parent_data_ptr);
+
+		if ((new_parent_pos & GRAPH_EDGE_LAST_MASK) == GRAPH_PARENT_MISSING)
+			return 0;
+
+		parent_data_ptr++;
+	} while (!(new_parent_pos & GRAPH_LAST_EDGE));
+
+	return full_parse_commit(item, g, pos, commit_data);
+}
+
+/**
+ * Given a commit struct, try to fill the commit struct info, including:
+ *  1. tree object
+ *  2. date
+ *  3. parents.
+ *
+ * Returns 1 if and only if the commit was found in the commit graph.
+ *
+ * See parse_commit_buffer() for the fallback after this call.
+ */
+int parse_commit_in_graph(struct commit *item)
+{
+	if (!core_commitgraph)
+		return 0;
+	if (item->object.parsed)
+		return 1;
+
+	prepare_commit_graph();
+	if (commit_graph) {
+		uint32_t pos;
+		int found;
+		if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
+			pos = item->graph_pos;
+			found = 1;
+		} else {
+			found = bsearch_graph(commit_graph, &(item->object.oid), &pos);
+		}
+
+		if (found)
+			return fill_commit_in_graph(item, commit_graph, pos);
+	}
+
+	return 0;
+}
+
 static void write_graph_chunk_fanout(struct sha1file *f,
 				     struct commit **commits,
 				     int nr_commits)
@@ -439,9 +656,24 @@ struct object_id *construct_commit_graph(const char *pack_dir)
 	char *fname;
 	struct commit_list *parent;
 
+	prepare_commit_graph();
+
 	oids.num = 0;
 	oids.size = 1024;
+
+	if (commit_graph && oids.size < commit_graph->num_commits)
+		oids.size = commit_graph->num_commits;
+
 	ALLOC_ARRAY(oids.list, oids.size);
+
+	if (commit_graph) {
+		for (i = 0; i < commit_graph->num_commits; i++) {
+			oids.list[i] = malloc(sizeof(struct object_id));
+			get_nth_commit_oid(commit_graph, i, oids.list[i]);
+		}
+		oids.num = commit_graph->num_commits;
+	}
+
 	for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
 	QSORT(oids.list, oids.num, commit_compare);
 
@@ -525,6 +757,11 @@ struct object_id *construct_commit_graph(const char *pack_dir)
 	hashcpy(f_hash->hash, final_hash);
 	fname = get_commit_graph_filename_hash(pack_dir, f_hash);
 
+	if (commit_graph) {
+		close_commit_graph(commit_graph);
+		FREE_AND_NULL(commit_graph);
+	}
+
 	if (rename(graph_name, fname))
 		die("failed to rename %s to %s", graph_name, fname);
 
diff --git a/commit-graph.h b/commit-graph.h
index 43eb0aec84..05ddbbe165 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -4,6 +4,18 @@
 #include "git-compat-util.h"
 #include "commit.h"
 
+/**
+ * Given a commit struct, try to fill the commit struct info, including:
+ *  1. tree object
+ *  2. date
+ *  3. parents.
+ *
+ * Returns 1 if and only if the commit was found in the commit graph.
+ *
+ * See parse_commit_buffer() for the fallback after this call.
+ */
+extern int parse_commit_in_graph(struct commit *item);
+
 extern struct object_id *get_graph_head_hash(const char *pack_dir,
 					     struct object_id *hash);
 extern char* get_commit_graph_filename_hash(const char *pack_dir,
@@ -40,7 +52,13 @@ extern struct commit_graph {
 
 extern int close_commit_graph(struct commit_graph *g);
 
-extern struct commit_graph *load_commit_graph_one(const char *graph_file, const char *pack_dir);
+extern struct commit_graph *load_commit_graph_one(const char *graph_file,
+						  const char *pack_dir);
+extern void prepare_commit_graph(void);
+
+extern struct object_id *get_nth_commit_oid(struct commit_graph *g,
+					    uint32_t n,
+					    struct object_id *oid);
 
 extern struct object_id *construct_commit_graph(const char *pack_dir);
 
diff --git a/commit.c b/commit.c
index cab8d4455b..4437798e84 100644
--- a/commit.c
+++ b/commit.c
@@ -12,6 +12,7 @@
 #include "prio-queue.h"
 #include "sha1-lookup.h"
 #include "wt-status.h"
+#include "commit-graph.h"
 
 static struct commit_extra_header *read_commit_extra_header_lines(const char *buf, size_t len, const char **);
 
@@ -374,7 +375,7 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
 	return 0;
 }
 
-int parse_commit_gently(struct commit *item, int quiet_on_missing)
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int check_packed)
 {
 	enum object_type type;
 	void *buffer;
@@ -385,6 +386,8 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
 		return -1;
 	if (item->object.parsed)
 		return 0;
+	if (check_packed && parse_commit_in_graph(item))
+		return 0;
 	buffer = read_sha1_file(item->object.oid.hash, &type, &size);
 	if (!buffer)
 		return quiet_on_missing ? -1 :
@@ -404,6 +407,11 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
 	return ret;
 }
 
+int parse_commit_gently(struct commit *item, int quiet_on_missing)
+{
+	return parse_commit_internal(item, quiet_on_missing, 1);
+}
+
 void parse_commit_or_die(struct commit *item)
 {
 	if (parse_commit(item))
diff --git a/commit.h b/commit.h
index 8c68ca1a5a..fc8880d187 100644
--- a/commit.h
+++ b/commit.h
@@ -9,6 +9,8 @@
 #include "string-list.h"
 #include "pretty.h"
 
+#define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
+
 struct commit_list {
 	struct commit *item;
 	struct commit_list *next;
@@ -21,6 +23,7 @@ struct commit {
 	timestamp_t date;
 	struct commit_list *parents;
 	struct tree *tree;
+	uint32_t graph_pos;
 };
 
 extern int save_commit_buffer;
@@ -60,6 +63,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
 struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
 
 int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size);
+extern int parse_commit_internal(struct commit *item, int quiet_on_missing, int check_packed);
 int parse_commit_gently(struct commit *item, int quiet_on_missing);
 static inline int parse_commit(struct commit *item)
 {
diff --git a/log-tree.c b/log-tree.c
index fca29d4799..156aed4541 100644
--- a/log-tree.c
+++ b/log-tree.c
@@ -659,8 +659,7 @@ void show_log(struct rev_info *opt)
 		show_mergetag(opt, commit);
 	}
 
-	if (!get_cached_commit_buffer(commit, NULL))
-		return;
+	get_commit_buffer(commit, NULL);
 
 	if (opt->show_notes) {
 		int raw;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index b56a6d4217..93b0d4f51b 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -26,6 +26,24 @@ test_expect_success 'create commits and repack' \
      done &&
      git repack'
 
+_graph_git_two_modes() {
+    git -c core.commitgraph=true $1 >output
+    git -c core.commitgraph=false $1 >expect
+    cmp output expect
+}
+
+_graph_git_behavior() {
+    BRANCH=$1
+    COMPARE=$2
+    test_expect_success 'check normal git operations' \
+        '_graph_git_two_modes "log --oneline ${BRANCH}" &&
+         _graph_git_two_modes "log --topo-order ${BRANCH}" &&
+         _graph_git_two_modes "branch -vv" &&
+         _graph_git_two_modes "merge-base -a ${BRANCH} ${COMPARE}"'
+}
+
+_graph_git_behavior "commits/3" "commits/4"
+
 _graph_read_expect() {
     cat >expect <<- EOF
 header: 43475048 01 01 14 04
@@ -43,6 +61,8 @@ test_expect_success 'write graph' \
      _graph_read_expect "5" "${packdir}" &&
      cmp expect output'
 
+_graph_git_behavior "commits/3" "commits/4"
+
 test_expect_success 'Add more commits' \
     'git reset --hard commits/3 &&
      for i in $(test_seq 6 10)
@@ -91,6 +111,10 @@ test_expect_success 'Add more commits' \
 # |
 # 1
 
+_graph_git_behavior "merge/1" "merge/2"
+_graph_git_behavior "merge/1" "merge/3"
+_graph_git_behavior "merge/2" "merge/3"
+
 test_expect_success 'write graph with merges' \
     'graph2=$(git commit-graph --write --update-head) &&
      test_path_is_file ${packdir}/graph-${graph2}.graph &&
@@ -101,6 +125,10 @@ test_expect_success 'write graph with merges' \
      _graph_read_expect "18" "${packdir}" &&
      cmp expect output'
 
+_graph_git_behavior merge/1 merge/2
+_graph_git_behavior merge/1 merge/3
+_graph_git_behavior merge/2 merge/3
+
 test_expect_success 'Add more commits' \
     'for i in $(test_seq 16 20)
      do
@@ -141,6 +169,10 @@ test_expect_success 'Add more commits' \
 # |
 # 1
 
+# Test behavior while in mixed mode
+_graph_git_behavior commits/20 merge/1
+_graph_git_behavior commits/20 merge/2
+
 test_expect_success 'write graph with merges' \
     'graph3=$(git commit-graph --write --update-head --delete-expired) &&
      test_path_is_file ${packdir}/graph-${graph3}.graph &&
@@ -153,6 +185,9 @@ test_expect_success 'write graph with merges' \
      _graph_read_expect "23" "${packdir}" &&
      cmp expect output'
 
+_graph_git_behavior commits/20 merge/1
+_graph_git_behavior commits/20 merge/2
+
 test_expect_success 'write graph with nothing new' \
     'graph4=$(git commit-graph --write --update-head --delete-expired) &&
      test_path_is_file ${packdir}/graph-${graph4}.graph &&
@@ -164,12 +199,18 @@ test_expect_success 'write graph with nothing new' \
      _graph_read_expect "23" "${packdir}" &&
      cmp expect output'
 
+_graph_git_behavior commits/20 merge/1
+_graph_git_behavior commits/20 merge/2
+
 test_expect_success 'clear graph' \
     'git commit-graph --clear &&
      test_path_is_missing ${packdir}/graph-${graph2}.graph &&
      test_path_is_file ${packdir}/graph-${graph1}.graph &&
      test_path_is_missing ${packdir}/graph-head'
 
+_graph_git_behavior commits/20 merge/1
+_graph_git_behavior commits/20 merge/2
+
 test_expect_success 'setup bare repo' \
     'cd .. &&
      git clone --bare full bare &&
@@ -178,6 +219,9 @@ test_expect_success 'setup bare repo' \
      git config pack.threads 1 &&
      baredir="./objects/pack"'
 
+_graph_git_behavior commits/20 merge/1
+_graph_git_behavior commits/20 merge/2
+
 test_expect_success 'write graph in bare repo' \
     'graphbare=$(git commit-graph --write --update-head) &&
      test_path_is_file ${baredir}/graph-${graphbare}.graph &&
@@ -188,4 +232,7 @@ test_expect_success 'write graph in bare repo' \
      _graph_read_expect "23" "${baredir}" &&
      cmp expect output'
 
+_graph_git_behavior commits/20 merge/1
+_graph_git_behavior commits/20 merge/2
+
 test_done
-- 
2.16.0.15.g9c3cf44.dirty
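
For readers following the parsing change above, the lookup order in
parse_commit_internal() (try the graph first, then fall back to reading
the object) can be sketched in isolation. This is a minimal illustration
and not code from this series: every toy_* name, the integer commit ids,
and the array-backed "graph" are assumptions made up for the example.
Only COMMIT_NOT_FROM_GRAPH and the "returns 1 if and only if found"
contract of parse_commit_in_graph() are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF

/* Toy stand-in for struct commit; only the fields this sketch needs. */
struct toy_commit {
	int id;             /* hypothetical key replacing the object id */
	int parsed;
	uint32_t graph_pos; /* assumed to start as COMMIT_NOT_FROM_GRAPH */
	uint64_t date;
};

/* Toy "graph file": ids sorted ascending, one date per position. */
struct toy_graph {
	const int *ids;
	const uint64_t *dates;
	uint32_t nr;
};

/* Binary search over the sorted ids; returns 1 and sets *pos on a hit. */
static int toy_graph_find(const struct toy_graph *g, int id, uint32_t *pos)
{
	uint32_t lo = 0, hi = g->nr;

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		if (g->ids[mid] == id) {
			*pos = mid;
			return 1;
		}
		if (g->ids[mid] < id)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}

/* Returns 1 if and only if the commit was filled from the graph. */
static int toy_parse_in_graph(const struct toy_graph *g, struct toy_commit *c)
{
	uint32_t pos;

	if (!g || !toy_graph_find(g, c->id, &pos))
		return 0;
	c->graph_pos = pos;
	c->date = g->dates[pos];
	c->parsed = 1;
	return 1;
}

/* Stand-in for the slow path that reads and parses the object itself. */
static int toy_parse_from_object(struct toy_commit *c)
{
	c->date = 0; /* a real parser would read this from the commit */
	c->parsed = 1;
	return 0;
}

/* Returns 0 on success, in the style of parse_commit_gently(). */
int toy_parse_commit(const struct toy_graph *g, struct toy_commit *c,
		     int check_graph)
{
	assert(c != NULL);
	if (c->parsed)
		return 0;
	if (check_graph && toy_parse_in_graph(g, c))
		return 0;
	return toy_parse_from_object(c);
}
```

A commit that is missing from the toy graph keeps graph_pos equal to
COMMIT_NOT_FROM_GRAPH, so later code can tell the two cases apart.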


* [PATCH v2 12/14] commit-graph: read only from specific pack-indexes
  2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
                   ` (10 preceding siblings ...)
  2018-01-30 21:39 ` [PATCH v2 11/14] commit: integrate commit graph with commit parsing Derrick Stolee
@ 2018-01-30 21:39 ` Derrick Stolee
  2018-01-30 21:39 ` [PATCH v2 13/14] commit-graph: close under reachability Derrick Stolee
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-01-30 21:39 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, git, sbeller, dstolee

Teach git-commit-graph to inspect only the objects in a given list of
pack-indexes within the pack directory. Because all commits stored in
the previous commit graph are added as well, this allows the commit
graph to be updated incrementally.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt | 13 +++++++++++++
 builtin/commit-graph.c             | 25 ++++++++++++++++++++++---
 commit-graph.c                     | 25 +++++++++++++++++++++++--
 commit-graph.h                     |  4 +++-
 packfile.c                         |  4 ++--
 packfile.h                         |  2 ++
 t/t5318-commit-graph.sh            |  6 ++++--
 7 files changed, 69 insertions(+), 10 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 7b376e9212..d0571cd896 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -43,6 +43,11 @@ OPTIONS
 	When used with --write and --update-head, delete the graph file
 	previously referenced by graph-head.
 
+--stdin-packs::
+	When used with --write, generate the new graph by walking objects
+	only in the specified packfiles and any commits in the
+	existing graph-head.
+
 EXAMPLES
 --------
 
@@ -65,6 +70,14 @@ $ git commit-graph --write
 $ git commit-graph --write --update-head --delete-expired
 ------------------------------------------------
 
+* Write a graph file, extending the current graph file using commits
+  in <pack-index>, update graph-head, and delete the old graph-<hash>.graph
+  file.
++
+------------------------------------------------
+$ echo <pack-index> | git commit-graph --write --update-head --delete-expired --stdin-packs
+------------------------------------------------
+
 * Read basic information from a graph file.
 +
 ------------------------------------------------
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 766f09e6fc..80a409e784 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -12,7 +12,7 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--pack-dir <packdir>]"),
 	N_("git commit-graph --clear [--pack-dir <packdir>]"),
 	N_("git commit-graph --read [--graph-hash=<hash>]"),
-	N_("git commit-graph --write [--pack-dir <packdir>] [--update-head] [--delete-expired]"),
+	N_("git commit-graph --write [--pack-dir <packdir>] [--update-head] [--delete-expired] [--stdin-packs]"),
 	NULL
 };
 
@@ -24,6 +24,7 @@ static struct opts_commit_graph {
 	int write;
 	int update_head;
 	int delete_expired;
+	int stdin_packs;
 	int has_existing;
 	struct object_id old_graph_hash;
 } opts;
@@ -114,7 +115,24 @@ static void update_head_file(const char *pack_dir, const struct object_id *graph
 
 static int graph_write(void)
 {
-	struct object_id *graph_hash = construct_commit_graph(opts.pack_dir);
+	struct object_id *graph_hash;
+	char **pack_indexes = NULL;
+	int num_packs = 0;
+	int size_packs = 0;
+
+	if (opts.stdin_packs) {
+		struct strbuf buf = STRBUF_INIT;
+		size_packs = 128;
+		ALLOC_ARRAY(pack_indexes, size_packs);
+
+		while (strbuf_getline(&buf, stdin) != EOF) {
+			ALLOC_GROW(pack_indexes, num_packs + 1, size_packs);
+			pack_indexes[num_packs++] = buf.buf;
+			strbuf_detach(&buf, NULL);
+		}
+	}
+
+	graph_hash = construct_commit_graph(opts.pack_dir, pack_indexes, num_packs);
 
 	if (opts.update_head)
 		update_head_file(opts.pack_dir, graph_hash);
@@ -122,7 +140,6 @@ static int graph_write(void)
 	if (graph_hash)
 		printf("%s\n", oid_to_hex(graph_hash));
 
-
 	if (opts.delete_expired && opts.update_head && opts.has_existing &&
 	    oidcmp(graph_hash, &opts.old_graph_hash)) {
 		char *old_path = get_commit_graph_filename_hash(opts.pack_dir,
@@ -153,6 +170,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 			N_("update graph-head to written graph file")),
 		OPT_BOOL('d', "delete-expired", &opts.delete_expired,
 			N_("delete expired head graph file")),
+		OPT_BOOL('s', "stdin-packs", &opts.stdin_packs,
+			N_("only scan packfiles listed by stdin")),
 		{ OPTION_STRING, 'H', "graph-hash", &opts.graph_hash,
 			N_("hash"),
 			N_("A hash for a specific graph file in the pack-dir."),
diff --git a/commit-graph.c b/commit-graph.c
index fc816533c6..e5a1d9ee8b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -638,7 +638,9 @@ static int if_packed_commit_add_to_list(const struct object_id *oid,
 	return 0;
 }
 
-struct object_id *construct_commit_graph(const char *pack_dir)
+struct object_id *construct_commit_graph(const char *pack_dir,
+					 char **pack_indexes,
+					 int nr_packs)
 {
 	struct packed_oid_list oids;
 	struct packed_commit_list commits;
@@ -674,7 +676,26 @@ struct object_id *construct_commit_graph(const char *pack_dir)
 		oids.num = commit_graph->num_commits;
 	}
 
-	for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
+	if (pack_indexes) {
+		int pack_dir_len = strlen(pack_dir) + 1;
+		struct strbuf packname = STRBUF_INIT;
+		strbuf_add(&packname, pack_dir, pack_dir_len - 1);
+		strbuf_addch(&packname, '/');
+		for (i = 0; i < nr_packs; i++) {
+			struct packed_git *p;
+			strbuf_setlen(&packname, pack_dir_len);
+			strbuf_addstr(&packname, pack_indexes[i]);
+			p = add_packed_git(packname.buf, packname.len, 1);
+			if (!p)
+				die("error adding pack %s", packname.buf);
+			if (open_pack_index(p))
+				die("error opening index for %s", packname.buf);
+			for_each_object_in_pack(p, if_packed_commit_add_to_list, &oids);
+			close_pack(p);
+		}
+	} else {
+		for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
+	}
 	QSORT(oids.list, oids.num, commit_compare);
 
 	count_distinct = 1;
diff --git a/commit-graph.h b/commit-graph.h
index 05ddbbe165..3ae1eadce0 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -60,6 +60,8 @@ extern struct object_id *get_nth_commit_oid(struct commit_graph *g,
 					    uint32_t n,
 					    struct object_id *oid);
 
-extern struct object_id *construct_commit_graph(const char *pack_dir);
+extern struct object_id *construct_commit_graph(const char *pack_dir,
+						char **pack_indexes,
+						int nr_packs);
 
 #endif
diff --git a/packfile.c b/packfile.c
index 4a5fe7ab18..48133bd669 100644
--- a/packfile.c
+++ b/packfile.c
@@ -299,7 +299,7 @@ void close_pack_index(struct packed_git *p)
 	}
 }
 
-static void close_pack(struct packed_git *p)
+void close_pack(struct packed_git *p)
 {
 	close_pack_windows(p);
 	close_pack_fd(p);
@@ -1860,7 +1860,7 @@ int has_pack_index(const unsigned char *sha1)
 	return 1;
 }
 
-static int for_each_object_in_pack(struct packed_git *p, each_packed_object_fn cb, void *data)
+int for_each_object_in_pack(struct packed_git *p, each_packed_object_fn cb, void *data)
 {
 	uint32_t i;
 	int r = 0;
diff --git a/packfile.h b/packfile.h
index 0cdeb54dcd..cde868feb6 100644
--- a/packfile.h
+++ b/packfile.h
@@ -61,6 +61,7 @@ extern void close_pack_index(struct packed_git *);
 
 extern unsigned char *use_pack(struct packed_git *, struct pack_window **, off_t, unsigned long *);
 extern void close_pack_windows(struct packed_git *);
+extern void close_pack(struct packed_git *);
 extern void close_all_packs(void);
 extern void unuse_pack(struct pack_window **);
 extern void clear_delta_base_cache(void);
@@ -133,6 +134,7 @@ typedef int each_packed_object_fn(const struct object_id *oid,
 				  struct packed_git *pack,
 				  uint32_t pos,
 				  void *data);
+extern int for_each_object_in_pack(struct packed_git *p, each_packed_object_fn cb, void *data);
 extern int for_each_packed_object(each_packed_object_fn, void *, unsigned flags);
 
 #endif
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 93b0d4f51b..b9a73f398c 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -137,7 +137,9 @@ test_expect_success 'Add more commits' \
         git commit -m "commit $i" &&
         git branch commits/$i
      done &&
-     git repack'
+     ls ${packdir} | grep idx >existing-idx &&
+     git repack &&
+     ls ${packdir} | grep idx | grep -v --file=existing-idx >new-idx'
 
 # Current graph structure:
 #
@@ -174,7 +176,7 @@ _graph_git_behavior commits/20 merge/1
 _graph_git_behavior commits/20 merge/2
 
 test_expect_success 'write graph with merges' \
-    'graph3=$(git commit-graph --write --update-head --delete-expired) &&
+    'graph3=$(cat new-idx | git commit-graph --write --update-head --delete-expired --stdin-packs) &&
      test_path_is_file ${packdir}/graph-${graph3}.graph &&
      test_path_is_missing ${packdir}/graph-${graph2}.graph &&
      test_path_is_file ${packdir}/graph-${graph1}.graph &&
-- 
2.16.0.15.g9c3cf44.dirty
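
The way construct_commit_graph() turns each index name read from stdin
into a path under the pack directory can be sketched without any Git
internals. This is a rough standalone illustration, not the code from
the patch: toy_pack_index_paths() is a hypothetical helper, and it uses
plain malloc() where the patch uses a strbuf that is truncated back to
the "<pack_dir>/" prefix on every iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Join pack_dir and each index name into "<pack_dir>/<name>".
 * Returns a malloc'd array of nr malloc'd strings; the caller
 * frees every string and then the array.
 */
char **toy_pack_index_paths(const char *pack_dir,
			    const char *const *names, size_t nr)
{
	size_t prefix_len, i;
	char **paths;

	assert(pack_dir != NULL);
	prefix_len = strlen(pack_dir) + 1; /* directory plus '/' */

	/* Allocate at least one slot so nr == 0 still yields a pointer. */
	paths = calloc(nr ? nr : 1, sizeof(*paths));
	if (!paths)
		abort();

	for (i = 0; i < nr; i++) {
		size_t name_len = strlen(names[i]);
		char *p = malloc(prefix_len + name_len + 1);

		if (!p)
			abort();
		memcpy(p, pack_dir, prefix_len - 1);
		p[prefix_len - 1] = '/';
		/* name_len + 1 also copies the terminating NUL */
		memcpy(p + prefix_len, names[i], name_len + 1);
		paths[i] = p;
	}
	return paths;
}
```

The names are expected to be bare file names such as "pack-<hash>.idx",
which is what the test above feeds in by listing the pack directory.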


* [PATCH v2 13/14] commit-graph: close under reachability
  2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
                   ` (11 preceding siblings ...)
  2018-01-30 21:39 ` [PATCH v2 12/14] commit-graph: read only from specific pack-indexes Derrick Stolee
@ 2018-01-30 21:39 ` Derrick Stolee
  2018-01-30 21:39 ` [PATCH v2 14/14] commit-graph: build graph from starting commits Derrick Stolee
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-01-30 21:39 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, git, sbeller, dstolee

Teach construct_commit_graph() to walk all parents from the commits
discovered in packfiles. This closes the commit list under reachability,
so commits that exist only as loose objects or in previously-missed
packfiles no longer leave gaps in the graph.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c          | 26 ++++++++++++++++++++++++++
 t/t5318-commit-graph.sh | 14 ++++++++++++++
 2 files changed, 40 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index e5a1d9ee8b..cfa0415a21 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -5,6 +5,7 @@
 #include "packfile.h"
 #include "commit.h"
 #include "object.h"
+#include "revision.h"
 #include "commit-graph.h"
 
 #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
@@ -638,6 +639,29 @@ static int if_packed_commit_add_to_list(const struct object_id *oid,
 	return 0;
 }
 
+static void close_reachable(struct packed_oid_list *oids)
+{
+	int i;
+	struct rev_info revs;
+	struct commit *commit;
+	init_revisions(&revs, NULL);
+
+	for (i = 0; i < oids->num; i++) {
+		commit = lookup_commit(oids->list[i]);
+		if (commit && !parse_commit(commit))
+			revs.commits = commit_list_insert(commit, &revs.commits);
+	}
+
+	if (prepare_revision_walk(&revs))
+		die(_("revision walk setup failed"));
+
+	while ((commit = get_revision(&revs)) != NULL) {
+		ALLOC_GROW(oids->list, oids->num + 1, oids->size);
+		oids->list[oids->num] = &(commit->object.oid);
+		(oids->num)++;
+	}
+}
+
 struct object_id *construct_commit_graph(const char *pack_dir,
 					 char **pack_indexes,
 					 int nr_packs)
@@ -696,6 +720,8 @@ struct object_id *construct_commit_graph(const char *pack_dir,
 	} else {
 		for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
 	}
+
+	close_reachable(&oids);
 	QSORT(oids.list, oids.num, commit_compare);
 
 	count_distinct = 1;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index b9a73f398c..2001b0b5b5 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -213,6 +213,20 @@ test_expect_success 'clear graph' \
 _graph_git_behavior commits/20 merge/1
 _graph_git_behavior commits/20 merge/2
 
+test_expect_success 'build graph from latest pack with closure' \
+    'graph5=$(cat new-idx | git commit-graph --write --update-head --stdin-packs) &&
+     test_path_is_file ${packdir}/graph-${graph5}.graph &&
+     test_path_is_file ${packdir}/graph-${graph1}.graph &&
+     test_path_is_file ${packdir}/graph-head &&
+     echo ${graph5} >expect &&
+     cmp -n 40 expect ${packdir}/graph-head &&
+     git commit-graph --read --graph-hash=${graph5} >output &&
+     _graph_read_expect "21" "${packdir}" &&
+     cmp expect output'
+
+_graph_git_behavior commits/20 merge/1
+_graph_git_behavior commits/20 merge/2
+
 test_expect_success 'setup bare repo' \
     'cd .. &&
      git clone --bare full bare &&
-- 
2.16.0.15.g9c3cf44.dirty
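
What "close under reachability" means can be shown on a toy DAG. The
sketch below is only an illustration under stated assumptions: commits
are array indexes, each has at most two parents, and membership is a
byte per commit. The patch itself drives a revision walk
(prepare_revision_walk() and get_revision()) instead of the explicit
stack used here, and all toy_* and TOY_* names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_MAX_PARENTS 2
#define TOY_NO_PARENT (-1)

/* A commit in the toy DAG: indexes of its parents, or TOY_NO_PARENT. */
struct toy_node {
	int parents[TOY_MAX_PARENTS];
};

/*
 * Mark every ancestor of the commits already flagged in in_set.
 * Returns the number of flagged commits once the set is closed.
 */
size_t toy_close_reachable(const struct toy_node *nodes, size_t nr,
			   unsigned char *in_set)
{
	size_t top = 0, count = 0, i;
	int *stack;

	assert(nodes != NULL && in_set != NULL);

	/* Each commit is pushed at most once, so nr slots are enough. */
	stack = malloc((nr ? nr : 1) * sizeof(*stack));
	if (!stack)
		abort();

	for (i = 0; i < nr; i++)
		if (in_set[i])
			stack[top++] = (int)i;

	while (top) {
		int cur = stack[--top];
		int k;

		for (k = 0; k < TOY_MAX_PARENTS; k++) {
			int p = nodes[cur].parents[k];

			if (p == TOY_NO_PARENT || in_set[p])
				continue;
			in_set[p] = 1; /* newly discovered ancestor */
			stack[top++] = p;
		}
	}

	for (i = 0; i < nr; i++)
		if (in_set[i])
			count++;

	free(stack);
	return count;
}
```

Seeding the set with a single tip pulls in exactly its ancestors, which
is why a graph built from only the newest pack can still cover commits
stored elsewhere.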


* [PATCH v2 14/14] commit-graph: build graph from starting commits
  2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
                   ` (12 preceding siblings ...)
  2018-01-30 21:39 ` [PATCH v2 13/14] commit-graph: close under reachability Derrick Stolee
@ 2018-01-30 21:39 ` Derrick Stolee
  2018-01-30 21:47 ` [PATCH v2 00/14] Serialized Git Commit Graph Stefan Beller
  2018-02-08 20:37 ` [PATCH v3 " Derrick Stolee
  15 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-01-30 21:39 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, git, sbeller, dstolee

Teach git-commit-graph to read commits from stdin when the
--stdin-commits flag is specified. Commits reachable from these
commits are added to the graph. This is a much faster way to construct
the graph than inspecting all packed objects, but is restricted to
known tips.

For the Linux repository, 700,000+ commits were added to the graph
file starting from 'master' in 7-9 seconds, depending on the number
of packfiles in the repo (1, 24, or 120).

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt |  7 ++++++-
 builtin/commit-graph.c             | 34 +++++++++++++++++++++++++---------
 commit-graph.c                     | 26 +++++++++++++++++++++++---
 commit-graph.h                     |  4 +++-
 t/t5318-commit-graph.sh            | 18 ++++++++++++++++++
 5 files changed, 75 insertions(+), 14 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index d0571cd896..3357c0cf8f 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -46,7 +46,12 @@ OPTIONS
 --stdin-packs::
 	When used with --write, generate the new graph by walking objects
 	only in the specified packfiles and any commits in the
-	existing graph-head.
+	existing graph-head. (Cannot be combined with --stdin-commits.)
+
+--stdin-commits::
+	When used with --write, generate the new graph by walking commits
+	starting at the commits specified in stdin as a list of OIDs in
+	hex, one OID per line. (Cannot be combined with --stdin-packs.)
 
 EXAMPLES
 --------
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 80a409e784..adc05f0582 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -12,7 +12,7 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--pack-dir <packdir>]"),
 	N_("git commit-graph --clear [--pack-dir <packdir>]"),
 	N_("git commit-graph --read [--graph-hash=<hash>]"),
-	N_("git commit-graph --write [--pack-dir <packdir>] [--update-head] [--delete-expired] [--stdin-packs]"),
+	N_("git commit-graph --write [--pack-dir <packdir>] [--update-head] [--delete-expired] [--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
@@ -25,6 +25,7 @@ static struct opts_commit_graph {
 	int update_head;
 	int delete_expired;
 	int stdin_packs;
+	int stdin_commits;
 	int has_existing;
 	struct object_id old_graph_hash;
 } opts;
@@ -117,23 +118,36 @@ static int graph_write(void)
 {
 	struct object_id *graph_hash;
 	char **pack_indexes = NULL;
+	char **commits = NULL;
 	int num_packs = 0;
-	int size_packs = 0;
+	int num_commits = 0;
+	char **lines = NULL;
+	int num_lines = 0;
+	int size_lines = 0;
 
-	if (opts.stdin_packs) {
+	if (opts.stdin_packs || opts.stdin_commits) {
 		struct strbuf buf = STRBUF_INIT;
-		size_packs = 128;
-		ALLOC_ARRAY(pack_indexes, size_packs);
+		size_lines = 128;
+		ALLOC_ARRAY(lines, size_lines);
 
 		while (strbuf_getline(&buf, stdin) != EOF) {
-			ALLOC_GROW(pack_indexes, num_packs + 1, size_packs);
-			pack_indexes[num_packs++] = buf.buf;
+			ALLOC_GROW(lines, num_lines + 1, size_lines);
+			lines[num_lines++] = buf.buf;
 			strbuf_detach(&buf, NULL);
 		}
-	}
 
-	graph_hash = construct_commit_graph(opts.pack_dir, pack_indexes, num_packs);
+		if (opts.stdin_packs) {
+			pack_indexes = lines;
+			num_packs = num_lines;
+		}
+		if (opts.stdin_commits) {
+			commits = lines;
+			num_commits = num_lines;
+		}
+	}
 
+	graph_hash = construct_commit_graph(opts.pack_dir, pack_indexes, num_packs,
+					    commits, num_commits);
 	if (opts.update_head)
 		update_head_file(opts.pack_dir, graph_hash);
 
@@ -172,6 +186,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 			N_("delete expired head graph file")),
 		OPT_BOOL('s', "stdin-packs", &opts.stdin_packs,
 			N_("only scan packfiles listed by stdin")),
+		OPT_BOOL('C', "stdin-commits", &opts.stdin_commits,
+			N_("start walk at commits listed by stdin")),
 		{ OPTION_STRING, 'H', "graph-hash", &opts.graph_hash,
 			N_("hash"),
 			N_("A hash for a specific graph file in the pack-dir."),
diff --git a/commit-graph.c b/commit-graph.c
index cfa0415a21..7f31a6c795 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -664,7 +664,9 @@ static void close_reachable(struct packed_oid_list *oids)
 
 struct object_id *construct_commit_graph(const char *pack_dir,
 					 char **pack_indexes,
-					 int nr_packs)
+					 int nr_packs,
+					 char **commit_hex,
+					 int nr_commits)
 {
 	struct packed_oid_list oids;
 	struct packed_commit_list commits;
@@ -717,10 +719,28 @@ struct object_id *construct_commit_graph(const char *pack_dir,
 			for_each_object_in_pack(p, if_packed_commit_add_to_list, &oids);
 			close_pack(p);
 		}
-	} else {
-		for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
 	}
 
+	if (commit_hex) {
+		for (i = 0; i < nr_commits; i++) {
+			const char *end;
+			ALLOC_GROW(oids.list, oids.num + 1, oids.size);
+
+			oids.list[oids.num] = malloc(sizeof(struct object_id));
+
+			if (parse_oid_hex(commit_hex[i], oids.list[oids.num], &end)) {
+				free(oids.list[oids.num]);
+				continue;
+			}
+
+			if (lookup_commit(oids.list[oids.num]))
+				oids.num++;
+		}
+	}
+
+	if (!pack_indexes && !commit_hex)
+		for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
+
 	close_reachable(&oids);
 	QSORT(oids.list, oids.num, commit_compare);
 
diff --git a/commit-graph.h b/commit-graph.h
index 3ae1eadce0..619b1f6def 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -62,6 +62,8 @@ extern struct object_id *get_nth_commit_oid(struct commit_graph *g,
 
 extern struct object_id *construct_commit_graph(const char *pack_dir,
 						char **pack_indexes,
-						int nr_packs);
+						int nr_packs,
+						char **commits,
+						int nr_commits);
 
 #endif
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 2001b0b5b5..0bf27a2e7c 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -227,6 +227,24 @@ test_expect_success 'build graph from latest pack with closure' \
 _graph_git_behavior commits/20 merge/1
 _graph_git_behavior commits/20 merge/2
 
+test_expect_success 'build graph from commits with closure' \
+    'git rev-parse commits/20 >commits-in &&
+     git rev-parse merge/1 >>commits-in &&
+     git rev-parse merge/2 >>commits-in &&
+     graph6=$(cat commits-in | git commit-graph --write --update-head --delete-expired --stdin-commits) &&
+     test_path_is_file ${packdir}/graph-${graph6}.graph &&
+     test_path_is_missing ${packdir}/graph-${graph5}.graph &&
+     test_path_is_file ${packdir}/graph-${graph1}.graph &&
+     test_path_is_file ${packdir}/graph-head &&
+     echo ${graph6} >expect &&
+     cmp -n 40 expect ${packdir}/graph-head &&
+     git commit-graph --read --graph-hash=${graph6} >output &&
+     _graph_read_expect "23" "${packdir}" &&
+     cmp expect output'
+
+_graph_git_behavior commits/20 merge/1
+_graph_git_behavior commits/20 merge/2
+
 test_expect_success 'setup bare repo' \
     'cd .. &&
      git clone --bare full bare &&
-- 
2.16.0.15.g9c3cf44.dirty
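
The input handling for --stdin-commits (parse each line as a hex OID and
skip lines that do not parse) can be sketched on its own. This is a
simplified illustration, not the patch's code: toy_parse_oid_hex() is a
hypothetical stand-in for parse_oid_hex(), it assumes 40-digit SHA-1
names, and it is stricter than the real function because it rejects any
line that is not exactly 40 hex digits. The sketch also leaves out the
lookup_commit() check that filters out non-commit objects.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_RAWSZ 20
#define TOY_HEXSZ 40

struct toy_oid {
	unsigned char hash[TOY_RAWSZ];
};

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int toy_hexval(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/* Returns 0 on success, -1 if hex is not exactly 40 hex digits. */
int toy_parse_oid_hex(const char *hex, struct toy_oid *oid)
{
	size_t i;

	assert(hex != NULL && oid != NULL);
	if (strlen(hex) != TOY_HEXSZ)
		return -1;
	for (i = 0; i < TOY_RAWSZ; i++) {
		int hi = toy_hexval(hex[2 * i]);
		int lo = toy_hexval(hex[2 * i + 1]);

		if (hi < 0 || lo < 0)
			return -1;
		oid->hash[i] = (unsigned char)((hi << 4) | lo);
	}
	return 0;
}

/*
 * Parse every line and keep only the ones that are valid OIDs.
 * out must have room for nr entries; returns how many were kept.
 */
size_t toy_collect_oids(const char *const *lines, size_t nr,
			struct toy_oid *out)
{
	size_t i, kept = 0;

	for (i = 0; i < nr; i++)
		if (!toy_parse_oid_hex(lines[i], &out[kept]))
			kept++; /* a rejected slot is reused by the next line */
	return kept;
}
```

Writing into a caller-provided array means a rejected line needs no
cleanup, since its slot is simply overwritten by the next valid OID.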


* Re: [PATCH v2 00/14] Serialized Git Commit Graph
  2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
                   ` (13 preceding siblings ...)
  2018-01-30 21:39 ` [PATCH v2 14/14] commit-graph: build graph from starting commits Derrick Stolee
@ 2018-01-30 21:47 ` Stefan Beller
  2018-02-01  2:34   ` Stefan Beller
  2018-02-08 20:37 ` [PATCH v3 " Derrick Stolee
  15 siblings, 1 reply; 146+ messages in thread
From: Stefan Beller @ 2018-01-30 21:47 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Jeff Hostetler, Derrick Stolee

On Tue, Jan 30, 2018 at 1:39 PM, Derrick Stolee <stolee@gmail.com> wrote:
> Thanks to everyone who gave comments on v1. I tried my best to respond to
> all of the feedback, but may have missed some while I was doing several
> renames, including:
>
> * builtin/graph.c -> builtin/commit-graph.c
> * packed-graph.[c|h] -> commit-graph.[c|h]
> * t/t5319-graph.sh -> t/t5318-commit-graph.sh
>
> Because of these renames (and several type/function renames) the diff
> is too large to conveniently share here.
>
> Some issues that came up and are addressed:
>
> * Use <hash> instead of <oid> when referring to the graph-<hash>.graph
>   filenames and the contents of graph-head.
> * 32-bit timestamps will not cause undefined behavior.
> * timestamp_t is unsigned, so they are never negative.
> * The config setting "core.commitgraph" now only controls consuming the
>   graph during normal operations and will not block the commit-graph
>   plumbing command.
> * The --stdin-commits is better about sanitizing the input for strings
>   that do not parse to OIDs or are OIDs for non-commit objects.
>
> One unresolved comment that I would like consensus on is the use of
> globals to store the config setting and the graph state. I'm currently
> using the pattern from packed_git instead of putting these values in
> the_repository. However, we want to eventually remove globals like
> packed_git. Should I deviate from the pattern _now_ in order to keep
> the problem from growing, or should I keep to the known pattern?

I have a series doing the conversion in
https://github.com/stefanbeller/git/tree/object-store
that is based on 2.16.

While the commits are structured for easy review (so as not to miss any
of the globals that the series is based upon), I have not come up with a
good strategy for handling series in flight that add more globals.

So I think for now you'd want to keep these as global variables, so that
the code stays consistent with the rest of the code base, and then we'll
figure out how to do the conversion one step at a time.

Please do not feel stopped or hindered by my slow pace of working
through that series. Maybe I'll have to come up with another approach
that is better for upstream: rebasing that series is a pain, as upstream
moves rather quickly, so maybe I'll have to send it in smaller chunks.

> Finally, I tried to clean up my incorrect style as I was recreating
> these commits. Feel free to be merciless in style feedback now that the
> architecture is more stable.

ok, will do.

Thanks,
Stefan

* Re: [PATCH v2 02/14] graph: add commit graph design document
  2018-01-30 21:39 ` [PATCH v2 02/14] graph: add commit graph design document Derrick Stolee
@ 2018-01-31  2:19   ` Stefan Beller
  0 siblings, 0 replies; 146+ messages in thread
From: Stefan Beller @ 2018-01-31  2:19 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Jeff Hostetler, Derrick Stolee

On Tue, Jan 30, 2018 at 1:39 PM, Derrick Stolee <stolee@gmail.com> wrote:
> Add Documentation/technical/commit-graph.txt with details of the planned
> commit graph feature, including future plans.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/commit-graph.txt | 189 +++++++++++++++++++++++++++++++
>  1 file changed, 189 insertions(+)
>  create mode 100644 Documentation/technical/commit-graph.txt
>
> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> new file mode 100644
> index 0000000000..cbf88f7264
> --- /dev/null
> +++ b/Documentation/technical/commit-graph.txt
> @@ -0,0 +1,189 @@
> +Git Commit Graph Design Notes
> +=============================
> +
> +Git walks the commit graph for many reasons, including:
> +
> +1. Listing and filtering commit history.
> +2. Computing merge bases.
> +
> +These operations can become slow as the commit count grows. The merge
> +base calculation shows up in many user-facing commands, such as 'merge-base'
> +or 'git show --remerge-diff' and can take minutes to compute depending on
> +history shape.

Sorry for appearing more authoritative than I am here. The --remerge-diff
flag is just floating around the mailing list and was never merged. (It is
such a cool feature, but mentioning it here would confuse users who look
for it and do not find it.)


> +There are two main costs here:
> +
> +1. Decompressing and parsing commits.
> +2. Walking the entire graph to avoid topological order mistakes.
> +
> +The commit graph file is a supplemental data structure that accelerates
> +commit graph walks. If a user downgrades or disables the 'core.commitgraph'
> +config setting, then the existing ODB is sufficient. The file is stored
> +next to packfiles either in the .git/objects/pack directory or in the pack
> +directory of an alternate.
> +
> +The commit graph file stores the commit graph structure along with some
> +extra metadata to speed up graph walks. By listing commit OIDs in lexi-
> +cographic order, we can identify an integer position for each commit and
> +refer to the parents of a commit using those integer positions. We use
> +binary search to find initial commits and then use the integer positions
> +for fast lookups during the walk.
> +
> +A consumer may load the following info for a commit from the graph:
> +
> +1. The commit OID.
> +2. The list of parents, along with their integer position.
> +3. The commit date.
> +4. The root tree OID.
> +5. The generation number (see definition below).
> +
> +Values 1-4 satisfy the requirements of parse_commit_gently().
> +
> +Define the "generation number" of a commit recursively as follows:
> +
> + * A commit with no parents (a root commit) has generation number one.
> +
> + * A commit with at least one parent has generation number one more than
> +   the largest generation number among its parents.
> +
> +Equivalently, the generation number of a commit A is one more than the
> +length of a longest path from A to a root commit. The recursive definition
> +is easier to use for computation and observing the following property:
> +
> +    If A and B are commits with generation numbers N and M, respectively,
> +    and N <= M, then A cannot reach B. That is, we know without searching
> +    that B is not an ancestor of A because it is further from a root commit
> +    than A.
> +
> +    Conversely, when checking if A is an ancestor of B, then we only need
> +    to walk commits until all commits on the walk boundary have generation
> +    number at most N. If we walk commits using a priority queue seeded by
> +    generation numbers, then we always expand the boundary commit with highest
> +    generation number and can easily detect the stopping condition.
> +
> +This property can be used to significantly reduce the time it takes to
> +walk commits and determine topological relationships. Without generation
> +numbers, the general heuristic is the following:
> +
> +    If A and B are commits with commit time X and Y, respectively, and
> +    X < Y, then A _probably_ cannot reach B.
> +
> +This heuristic is currently used whenever the computation can make
> +mistakes with topological orders (such as "git log" with default order),
> +but is not used when the topological order is required (such as merge
> +base calculations, "git log --graph").
> +
> +In practice, we expect some commits to be created recently and not stored
> +in the commit graph. We can treat these commits as having "infinite"
> +generation number and walk until reaching commits with known generation
> +number.
> +
> +Design Details
> +--------------
> +
> +- A graph file is stored in a file named 'graph-<hash>.graph' in the pack
> +  directory. This could be stored in an alternate.
> +
> +- The most-recent graph file hash is stored in a 'graph-head' file for
> +  immediate access and storing backup graphs. This could be stored in an
> +  alternate, and refers to a 'graph-<hash>.graph' file in the same pack
> +  directory.
> +
> +- The core.commitgraph config setting must be on to consume graph files.
> +
> +- The file format includes parameters for the object id length and hash
> +  algorithm, so a future change of hash algorithm does not require a change
> +  in format.
> +
> +Current Limitations
> +-------------------
> +
> +- Only one graph file is used at one time. This allows the integer position
> +  to seek into the single graph file. It is possible to extend the model
> +  for multiple graph files, but that is currently not part of the design.
> +
> +- .graph files are managed only by the 'commit-graph' builtin. These are not
> +  updated automatically during clone, fetch, repack, or creating new commits.
> +
> +- There is no '--verify' option for the 'commit-graph' builtin to verify the
> +  contents of the graph file agree with the contents in the ODB.
> +
> +- When rewriting the graph, we do not check for a commit still existing
> +  in the ODB, so garbage collection may remove commits.
> +
> +- Generation numbers are not computed in the current version. The file
> +  format supports storing them, along with a mechanism to upgrade from
> +  a file without generation numbers to one that uses them.
> +
> +Future Work
> +-----------
> +
> +- The file format includes room for precomputed generation numbers. These
> +  are not currently computed, so all generation numbers will be marked as
> +  0 (or "uncomputed"). A later patch will include this calculation.
> +
> +- The commit graph is currently incompatible with commit grafts. This can be
> +  remedied by duplicating or refactoring the current graft logic.
> +
> +- After computing and storing generation numbers, we must make graph
> +  walks aware of generation numbers to gain the performance benefits they
> +  enable. This will mostly be accomplished by swapping a commit-date-ordered
> +  priority queue with one ordered by generation number. The following
> +  operations are important candidates:
> +
> +    - paint_down_to_common()
> +    - 'log --topo-order'
> +
> +- The graph currently only adds commits to a previously existing graph.
> +  When writing a new graph, we could check that the ODB still contains
> +  the commits and choose to remove the commits that are deleted from the
> +  ODB. For performance reasons, this check should remain optional.
> +
> +- Currently, parse_commit_gently() requires filling in the root tree
> +  object for a commit. This passes through lookup_tree() and consequently
> +  lookup_object(). Also, it calls lookup_commit() when loading the parents.
> +  These method calls check the ODB for object existence, even if the
> +  consumer does not need the content. For example, we do not need the
> +  tree contents when computing merge bases. Now that commit parsing is
> +  removed from the computation time, these lookup operations are the
> +  slowest operations keeping graph walks from being fast. Consider
> +  loading these objects without verifying their existence in the ODB and
> +  only loading them fully when consumers need them. Consider a method
> +  such as "ensure_tree_loaded(commit)" that fully loads a tree before
> +  using commit->tree.
> +
> +- The current design uses the 'commit-graph' builtin to generate the graph.
> +  When this feature stabilizes enough to recommend to most users, we should
> +  add automatic graph writes to common operations that create many commits.
> +  For example, one could compute a graph on 'clone', 'fetch', or 'repack'
> +  commands.
> +
> +- A server could provide a commit graph file as part of the network protocol
> +  to avoid extra calculations by clients.
> +
> +Related Links
> +-------------
> +[0] https://bugs.chromium.org/p/git/issues/detail?id=8
> +    Chromium work item for: Serialized Commit Graph
> +
> +[1] https://public-inbox.org/git/20110713070517.GC18566@sigill.intra.peff.net/
> +    An abandoned patch that introduced generation numbers.
> +
> +[2] https://public-inbox.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/
> +    Discussion about generation numbers on commits and how they interact
> +    with fsck.
> +
> +[3] https://public-inbox.org/git/20170907094718.b6kuzp2uhvkmwcso@sigill.intra.peff.net/t/#m7a2ea7b355aeda962e6b86404bcbadc648abfbba
> +    More discussion about generation numbers and not storing them inside
> +    commit objects. A valuable quote:
> +
> +    "I think we should be moving more in the direction of keeping
> +     repo-local caches for optimizations. Reachability bitmaps have been
> +     a big performance win. I think we should be doing the same with our
> +     properties of commits. Not just generation numbers, but making it
> +     cheap to access the graph structure without zlib-inflating whole
> +     commit objects (i.e., packv4 or something like the "metapacks" I
> +     proposed a few years ago)."
> +
> +[4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u
> +    A patch to remove the ahead-behind calculation from 'status'.
> --
> 2.16.0.15.g9c3cf44.dirty
>
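
As an aside on the generation number definition quoted above, here is a
minimal sketch of the recursive rule and the "N <= M means A cannot reach
B" shortcut. It is not taken from the series: struct toy_commit,
MAX_PARENTS and both function names are hypothetical, commits are modelled
as array indexes rather than OIDs, and a real implementation would
probably avoid recursion on deep histories.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PARENTS 4            /* arbitrary cap for this toy model */
#define GENERATION_UNCOMPUTED 0  /* the notes reserve 0 for "uncomputed" */

/* Hypothetical in-memory commit: parents are indexes into the same array. */
struct toy_commit {
	size_t parents[MAX_PARENTS];
	size_t nr_parents;
	uint32_t generation;
};

/* Root commits get 1; others get 1 + the largest generation of a parent. */
static uint32_t compute_generation(struct toy_commit *commits, size_t pos)
{
	struct toy_commit *c = &commits[pos];
	uint32_t max_parent = 0;
	size_t i;

	assert(c->nr_parents <= MAX_PARENTS);
	if (c->generation != GENERATION_UNCOMPUTED)
		return c->generation;	/* already memoized */

	for (i = 0; i < c->nr_parents; i++) {
		uint32_t g = compute_generation(commits, c->parents[i]);
		if (g > max_parent)
			max_parent = g;
	}
	c->generation = max_parent + 1;
	return c->generation;
}

/*
 * For two distinct commits with computed generations: if gen(A) <= gen(B)
 * then A cannot reach B, so a walk is only needed when this returns 1.
 */
static int can_possibly_reach(const struct toy_commit *a,
			      const struct toy_commit *b)
{
	return a->generation > b->generation;
}
```

A return value of 1 from can_possibly_reach() only means that a walk is
still required; it does not prove reachability.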


* Re: [PATCH v2 06/14] commit-graph: implement git-commit-graph --read
  2018-01-30 21:39 ` [PATCH v2 06/14] commit-graph: implement git-commit-graph --read Derrick Stolee
@ 2018-01-31  2:22   ` Stefan Beller
  2018-02-02  0:02   ` SZEDER Gábor
  2018-02-02  0:23   ` Jonathan Tan
  2 siblings, 0 replies; 146+ messages in thread
From: Stefan Beller @ 2018-01-31  2:22 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Jeff Hostetler, Derrick Stolee

> +static void free_commit_graph(struct commit_graph **g)
> +{
> +       if (!g || !*g)
> +               return;
> +
> +       close_commit_graph(*g);
> +
> +       free(*g);
> +       *g = NULL;

nit: You may want to use FREE_AND_NULL(*g) instead.
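
A sketch of what that would look like, assuming the macro has the shape I
remember from git-compat-util.h (free the pointer, then assign NULL).
struct toy_graph and free_toy_graph() are hypothetical stand-ins for the
patch's struct commit_graph and free_commit_graph():

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed to match the shape of the macro in Git's git-compat-util.h. */
#define FREE_AND_NULL(p) do { free(p); (p) = NULL; } while (0)

/* Hypothetical stand-in for the patch's struct commit_graph. */
struct toy_graph {
	int fd;
};

static void free_toy_graph(struct toy_graph **g)
{
	if (!g || !*g)
		return;
	/* the real function would call close_commit_graph(*g) here */
	FREE_AND_NULL(*g);
	assert(*g == NULL);	/* the caller's pointer is cleared as well */
}
```

The two statements collapse into one, and the caller's pointer cannot be
left dangling by accident.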


* Re: [PATCH v2 10/14] commit-graph: add core.commitgraph setting
  2018-01-30 21:39 ` [PATCH v2 10/14] commit-graph: add core.commitgraph setting Derrick Stolee
@ 2018-01-31 22:44   ` Igor Djordjevic
  2018-02-02 16:01   ` SZEDER Gábor
  1 sibling, 0 replies; 146+ messages in thread
From: Igor Djordjevic @ 2018-01-31 22:44 UTC (permalink / raw)
  To: Derrick Stolee, git; +Cc: gitster, peff, git, sbeller, dstolee

Hi Derrick,

On 30/01/2018 22:39, Derrick Stolee wrote:
>
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index 0e25b2c92b..5b63559a2b 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -898,6 +898,9 @@ core.notesRef::
>  This setting defaults to "refs/notes/commits", and it can be overridden by
>  the `GIT_NOTES_REF` environment variable.  See linkgit:git-notes[1].
>  
> +core.commitgraph::
             ^^^
A small style nitpick - you may want to use "core.commitGraph" 
throughout the series (note capital "G"), making it more readable and 
aligning with the rest of `git config` variable names (using "bumpyCaps" 
as per coding guidelines[1], and as seen a few lines below, at the 
end of this very patch, too, "core.sparseCheckout").

> +	Enable git commit graph feature. Allows reading from .graph files.
> +
>  core.sparseCheckout::
>  	Enable "sparse checkout" feature. See section "Sparse checkout" in
>  	linkgit:git-read-tree[1] for more information.
> diff --git a/cache.h b/cache.h
> index d8b975a571..e50e447a4f 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -825,6 +825,7 @@ extern char *git_replace_ref_base;
>  extern int fsync_object_files;
>  extern int core_preload_index;
>  extern int core_apply_sparse_checkout;
> +extern int core_commitgraph;
                        ^^^
Similar nit here: "core_commit_graph" (throughout the series) might
align better with the existing variable names around it (note the
additional underscore between "commit" and "graph"), and also with
your own naming "scheme" used for cmd_commit_graph(),
builtin_commit_graph_usage[], construct_commit_graph(), etc.

>  extern int precomposed_unicode;
>  extern int protect_hfs;
>  extern int protect_ntfs;
> diff --git a/config.c b/config.c
> index e617c2018d..99153fcfdb 100644
> --- a/config.c
> +++ b/config.c
> @@ -1223,6 +1223,11 @@ static int git_default_core_config(const char *var, const char *value)
>  		return 0;
>  	}
>  
> +	if (!strcmp(var, "core.commitgraph")) {
> +		core_commitgraph = git_config_bool(var, value);
> +		return 0;
> +	}
> +
>  	if (!strcmp(var, "core.sparsecheckout")) {
>  		core_apply_sparse_checkout = git_config_bool(var, value);
>  		return 0;
> diff --git a/environment.c b/environment.c
> index 63ac38a46f..faa4323cc5 100644
> --- a/environment.c
> +++ b/environment.c
> @@ -61,6 +61,7 @@ enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE;
>  char *notes_ref_name;
>  int grafts_replace_parents = 1;
>  int core_apply_sparse_checkout;
> +int core_commitgraph;
>  int merge_log_config = -1;
>  int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
>  unsigned long pack_size_limit_cfg;
> 

Thanks, Buga

[1] https://github.com/git/git/blob/master/Documentation/CodingGuidelines

  Externally Visible Names
  
  ...
  
  The section and variable names that consist of multiple words are
  formed by concatenating the words without punctuations (e.g. `-`),
  and are broken using bumpyCaps in documentation as a hint to the
  reader.
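
To illustrate why the documentation can say "core.commitGraph" while
config.c keeps comparing against "core.commitgraph": as far as I know,
section and variable names are case-insensitive and are lowercased before
callbacks see them, while subsections keep their case. A rough sketch of
that normalization follows; normalize_config_key() is a hypothetical
helper, not a function in config.c.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Lowercase the section (before the first dot) and the variable name
 * (after the last dot) in place; any subsection in between is left alone.
 */
static void normalize_config_key(char *key)
{
	char *first_dot = strchr(key, '.');
	char *last_dot = strrchr(key, '.');
	char *p;

	assert(first_dot != NULL);	/* a key needs at least "section.name" */
	for (p = key; p < first_dot; p++)
		*p = (char)tolower((unsigned char)*p);
	for (p = last_dot + 1; *p; p++)
		*p = (char)tolower((unsigned char)*p);
}
```

So the bumpyCaps spelling is purely a readability hint for the
documentation; the strcmp() in git_default_core_config() stays lowercase.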


* Re: [PATCH v2 00/14] Serialized Git Commit Graph
  2018-01-30 21:47 ` [PATCH v2 00/14] Serialized Git Commit Graph Stefan Beller
@ 2018-02-01  2:34   ` Stefan Beller
  0 siblings, 0 replies; 146+ messages in thread
From: Stefan Beller @ 2018-02-01  2:34 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Jeff Hostetler, Derrick Stolee

>> Finally, I tried to clean up my incorrect style as I was recreating
>> these commits. Feel free to be merciless in style feedback now that the
>> architecture is more stable.
>
> ok, will do.

I reviewed all patches and found no nits worth mentioning.
The whole series has been reviewed by me.

>
> Thanks,
> Stefan


* Re: [PATCH v2 01/14] commit-graph: add format document
  2018-01-30 21:39 ` [PATCH v2 01/14] commit-graph: add format document Derrick Stolee
@ 2018-02-01 21:44   ` Jonathan Tan
  0 siblings, 0 replies; 146+ messages in thread
From: Jonathan Tan @ 2018-02-01 21:44 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, peff, git, sbeller, dstolee

On Tue, 30 Jan 2018 16:39:30 -0500
Derrick Stolee <stolee@gmail.com> wrote:

> +  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
> +    * The first H bytes are for the OID of the root tree.
> +    * The next 8 bytes are for the int-ids of the first two parents
> +      of the ith commit. Stores value 0xffffffff if no parent in that
> +      position. If there are more than two parents, the second value
> +      has its most-significant bit on and the other bits store an array
> +      position into the Large Edge List chunk.

[snip]

> +  Large Edge List (ID: {'E', 'D', 'G', 'E'})
> +      This list of 4-byte values store the second through nth parents for
> +      all octopus merges. The second parent value in the commit data is a
> +      negative number pointing into this list. 

Looking at the paragraph which I quoted before the [snip], this sentence
is confusing (in particular, the second parent value is not interpreted
as the normal two's-complement negative value, and the semantics of
negative values indexing into the list is unclear). Maybe it's better to
omit it entirely.


* Re: [PATCH v2 04/14] commit-graph: implement construct_commit_graph()
  2018-01-30 21:39 ` [PATCH v2 04/14] commit-graph: implement construct_commit_graph() Derrick Stolee
@ 2018-02-01 22:23   ` Jonathan Tan
  2018-02-01 23:46   ` SZEDER Gábor
  2018-02-02 15:32   ` SZEDER Gábor
  2 siblings, 0 replies; 146+ messages in thread
From: Jonathan Tan @ 2018-02-01 22:23 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, peff, git, sbeller, dstolee

On Tue, 30 Jan 2018 16:39:33 -0500
Derrick Stolee <stolee@gmail.com> wrote:

> +#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
> +#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
> +#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
> +#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
> +#define GRAPH_CHUNKID_LARGEEDGES 0x45444745 /* "EDGE" */

Could all these just be string constants? sha1write can handle them well
enough.

> +static void write_graph_chunk_fanout(struct sha1file *f,
> +				     struct commit **commits,
> +				     int nr_commits)
> +{
> +	uint32_t i, count = 0;
> +	struct commit **list = commits;
> +	struct commit **last = commits + nr_commits;
> +
> +	/*
> +	 * Write the first-level table (the list is sorted,
> +	 * but we use a 256-entry lookup to be able to avoid
> +	 * having to do eight extra binary search iterations).
> +	 */
> +	for (i = 0; i < 256; i++) {
> +		uint32_t swap_count;
> +
> +		while (list < last) {
> +			if ((*list)->object.oid.hash[0] != i)
> +				break;
> +			count++;
> +			list++;
> +		}
> +
> +		swap_count = htonl(count);
> +		sha1write(f, &swap_count, 4);

You can use sha1write_be32() instead of swapping.
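
For readers following along, a small sketch of what the quoted loop
computes and how a reader of the file could use it. It works on an
in-memory array of first bytes instead of a sha1file, and build_fanout()
and fanout_range() are hypothetical names, not functions from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fill fanout[i] with the number of entries whose first byte is <= i.
 * first_bytes must be sorted ascending, as the OID list is.
 */
static void build_fanout(const unsigned char *first_bytes, size_t nr,
			 uint32_t fanout[256])
{
	size_t pos = 0;
	int i;

	for (i = 0; i < 256; i++) {
		while (pos < nr && first_bytes[pos] == i)
			pos++;
		fanout[i] = (uint32_t)pos;
	}
	assert(pos == nr);	/* fails if the input was not sorted */
}

/* Narrow a lookup to [*lo, *hi) before any binary search starts. */
static void fanout_range(const uint32_t fanout[256], unsigned char first_byte,
			 uint32_t *lo, uint32_t *hi)
{
	*lo = first_byte ? fanout[first_byte - 1] : 0;
	*hi = fanout[first_byte];
}
```

Restricting the search to one of 256 buckets is what saves the "eight
extra binary search iterations" mentioned in the code comment.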

> +static void write_graph_chunk_large_edges(struct sha1file *f,
> +					  struct commit **commits,
> +					  int nr_commits)
> +{
> +	struct commit **list = commits;
> +	struct commit **last = commits + nr_commits;
> +	struct commit_list *parent;
> +
> +	while (list < last) {
> +		int num_parents = 0;
> +		for (parent = (*list)->parents; num_parents < 3 && parent;
> +		     parent = parent->next)
> +			num_parents++;
> +
> +		if (num_parents <= 2) {
> +			list++;
> +			continue;
> +		}
> +
> +		for (parent = (*list)->parents; parent; parent = parent->next) {
> +			uint32_t int_id, swap_int_id;
> +			uint32_t last_edge = 0;
> +
> +			if (parent == (*list)->parents)
> +				continue;

Probably better to just initialize "parent = (*list)->parents->next".
Also probably best to add a comment describing why you are doing this
(e.g. "The first parent is already in the main commit table; the large
edges table only contains the second parent onwards").
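
Roughly what I have in mind, as a sketch on a toy list type. struct
toy_parent stands in for struct commit_list, and collect_large_edges() is
a hypothetical name; the real loop would write int-ids to the sha1file
instead of filling an array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Git's struct commit_list. */
struct toy_parent {
	int id;
	struct toy_parent *next;
};

/*
 * Copy the second parent onwards into out[]; returns how many were copied.
 * The first parent is already in the main commit table, so the loop
 * starts at parents->next instead of testing for it on every iteration.
 */
static size_t collect_large_edges(const struct toy_parent *parents,
				  int *out, size_t out_len)
{
	const struct toy_parent *p;
	size_t n = 0;

	if (!parents)
		return 0;
	for (p = parents->next; p; p = p->next) {
		assert(n < out_len);
		out[n++] = p->id;
	}
	return n;
}
```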

> +struct packed_commit_list {
> +	struct commit **list;
> +	int num;
> +	int size;
> +};
> +
> +struct packed_oid_list {
> +	struct object_id **list;
> +	int num;
> +	int size;
> +};

What are num and size? If they're nr and alloc, maybe use those names
instead.
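
For reference, the usual pattern looks something like the sketch below.
It uses plain realloc() so that it stands alone; toy_grow() is a
hypothetical stand-in for ALLOC_GROW, and the growth rule is just one
chosen for the sketch.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_int_list {
	int *list;
	size_t nr;	/* entries in use */
	size_t alloc;	/* entries allocated */
};

/* Plain-C stand-in for ALLOC_GROW: make room for at least `needed` items. */
static void toy_grow(struct toy_int_list *l, size_t needed)
{
	size_t new_alloc;
	int *p;

	if (needed <= l->alloc)
		return;
	new_alloc = (l->alloc + 16) * 3 / 2;	/* growth rule for this sketch */
	if (new_alloc < needed)
		new_alloc = needed;
	p = realloc(l->list, new_alloc * sizeof(*p));
	if (!p)
		abort();	/* Git's x*alloc wrappers die() instead */
	l->list = p;
	l->alloc = new_alloc;
}

static void toy_append(struct toy_int_list *l, int value)
{
	toy_grow(l, l->nr + 1);
	assert(l->nr < l->alloc);	/* room for one more entry */
	l->list[l->nr++] = value;
}
```

With 'nr' and 'alloc' the two counters are self-describing, whereas 'num'
and 'size' leave it unclear which one is the capacity.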

> +static int if_packed_commit_add_to_list(const struct object_id *oid,
> +					struct packed_git *pack,
> +					uint32_t pos,
> +					void *data)
> +{
> +	struct packed_oid_list *list = (struct packed_oid_list*)data;
> +	enum object_type type;
> +	unsigned long size;
> +	void *inner_data;
> +	off_t offset = nth_packed_object_offset(pack, pos);
> +	inner_data = unpack_entry(pack, offset, &type, &size);
> +
> +	if (inner_data)
> +		free(inner_data);
> +
> +	if (type != OBJ_COMMIT)
> +		return 0;
> +
> +	ALLOC_GROW(list->list, list->num + 1, list->size);
> +	list->list[list->num] = (struct object_id *)malloc(sizeof(struct object_id));

No need to cast return value of malloc. Also, use xmalloc?

> +struct object_id *construct_commit_graph(const char *pack_dir)
> +{
> +	struct packed_oid_list oids;
> +	struct packed_commit_list commits;
> +	struct commit_graph_header hdr;
> +	struct sha1file *f;
> +	int i, count_distinct = 0;
> +	struct strbuf tmp_file = STRBUF_INIT;
> +	unsigned char final_hash[GIT_MAX_RAWSZ];
> +	char *graph_name;
> +	int fd;
> +	uint32_t chunk_ids[5];
> +	uint64_t chunk_offsets[5];
> +	int num_long_edges;
> +	struct object_id *f_hash;
> +	char *fname;
> +	struct commit_list *parent;
> +
> +	oids.num = 0;
> +	oids.size = 1024;
> +	ALLOC_ARRAY(oids.list, oids.size);
> +	for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
> +	QSORT(oids.list, oids.num, commit_compare);
> +
> +	count_distinct = 1;
> +	for (i = 1; i < oids.num; i++) {
> +		if (oidcmp(oids.list[i-1], oids.list[i]))
> +			count_distinct++;
> +	}
> +
> +	commits.num = 0;
> +	commits.size = count_distinct;
> +	ALLOC_ARRAY(commits.list, commits.size);
> +
> +	num_long_edges = 0;
> +	for (i = 0; i < oids.num; i++) {
> +		int num_parents = 0;
> +		if (i > 0 && !oidcmp(oids.list[i-1], oids.list[i]))
> +			continue;
> +
> +		commits.list[commits.num] = lookup_commit(oids.list[i]);
> +		parse_commit(commits.list[commits.num]);
> +
> +		for (parent = commits.list[commits.num]->parents;
> +		     parent; parent = parent->next)
> +			num_parents++;
> +
> +		if (num_parents > 2)
> +			num_long_edges += num_parents - 1;
> +
> +		commits.num++;
> +	}
> +
> +	strbuf_addstr(&tmp_file, pack_dir);
> +	strbuf_addstr(&tmp_file, "/tmp_graph_XXXXXX");
> +
> +	fd = git_mkstemp_mode(tmp_file.buf, 0444);
> +	if (fd < 0)
> +		die_errno("unable to create '%s'", tmp_file.buf);
> +
> +	graph_name = strbuf_detach(&tmp_file, NULL);
> +	f = sha1fd(fd, graph_name);
> +
> +	hdr.graph_signature = htonl(GRAPH_SIGNATURE);
> +	hdr.graph_version = GRAPH_VERSION;
> +	hdr.hash_version = GRAPH_OID_VERSION;
> +	hdr.hash_len = GRAPH_OID_LEN;
> +	hdr.num_chunks = 4;
> +
> +	assert(sizeof(hdr) == 8);
> +	sha1write(f, &hdr, sizeof(hdr));

Instead of assembling these into a data structure, could you just use
individual calls to sha1write_be32 and sha1write_u8? Same comment
throughout this function.
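
A sketch of the field-by-field approach, writing into a byte buffer
rather than a sha1file so that it stands alone. put_be32_bytes() and
write_toy_header() are hypothetical helpers; in the patch the same effect
would come from the sha1write_be32 and sha1write_u8 calls mentioned
above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value in network (big-endian) byte order. */
static void put_be32_bytes(unsigned char *out, uint32_t v)
{
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
}

/*
 * Serialize the 8-byte header one field at a time, so the result does not
 * depend on struct padding or host endianness. Returns the bytes written.
 */
static size_t write_toy_header(unsigned char *out, size_t out_len,
			       uint32_t signature, unsigned char graph_version,
			       unsigned char hash_version, unsigned char hash_len,
			       unsigned char num_chunks)
{
	assert(out_len >= 8);
	put_be32_bytes(out, signature);
	out[4] = graph_version;
	out[5] = hash_version;
	out[6] = hash_len;
	out[7] = num_chunks;
	return 8;
}
```

Written this way, the "assert(sizeof(hdr) == 8)" goes away, and the
signature 0x43475048 lands on disk as the bytes "CGPH" on any host.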

> diff --git a/commit-graph.h b/commit-graph.h
> new file mode 100644
> index 0000000000..7b3469a7df
> --- /dev/null
> +++ b/commit-graph.h
> @@ -0,0 +1,20 @@
> +#ifndef COMMIT_GRAPH_H
> +#define COMMIT_GRAPH_H
> +
> +#include "git-compat-util.h"
> +#include "commit.h"
> +
> +extern char* get_commit_graph_filename_hash(const char *pack_dir,
> +					    struct object_id *hash);
> +
> +struct commit_graph_header {
> +	uint32_t graph_signature;
> +	unsigned char graph_version;
> +	unsigned char hash_version;
> +	unsigned char hash_len;
> +	unsigned char num_chunks;
> +};

This seems like it should be an internal detail of commit-graph.c.
(Also, as commented above, this struct might not be necessary at all.)


* Re: [PATCH v2 05/14] commit-graph: implement git-commit-graph --write
  2018-01-30 21:39 ` [PATCH v2 05/14] commit-graph: implement git-commit-graph --write Derrick Stolee
@ 2018-02-01 23:33   ` Jonathan Tan
  2018-02-02 18:36     ` Stefan Beller
  2018-02-01 23:48   ` SZEDER Gábor
  2018-02-02  1:47   ` SZEDER Gábor
  2 siblings, 1 reply; 146+ messages in thread
From: Jonathan Tan @ 2018-02-01 23:33 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, peff, git, sbeller, dstolee

On Tue, 30 Jan 2018 16:39:34 -0500
Derrick Stolee <stolee@gmail.com> wrote:

> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index c8ea548dfb..3f3790d9a8 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -5,3 +5,21 @@ NAME
>  ----
>  git-commit-graph - Write and verify Git commit graphs (.graph files)
>  
> +
> +SYNOPSIS
> +--------
> +[verse]
> +'git commit-graph' --write <options> [--pack-dir <pack_dir>]

Subcommands (like those in git submodule) generally don't take "--", as
far as I know.

> +static int graph_write(void)
> +{
> +	struct object_id *graph_hash = construct_commit_graph(opts.pack_dir);

I should have noticed this when I was reviewing the previous patch, but
this is probably better named write_commit_graph.

> +test_expect_success 'create commits and repack' \
> +    'for i in $(test_seq 5)
> +     do
> +        echo $i >$i.txt &&
> +        git add $i.txt &&
> +        git commit -m "commit $i" &&

You can generate commits more easily with test_commit. Also, the final
character of the test_expect_success line should be the apostrophe that
starts the big text block, like in other files. (That also means that
the backslash is unnecessary.)

> +# Current graph structure:
> +#
> +#      M3
> +#     / |\_____
> +#    / 10      15
> +#   /   |      |
> +#  /    9 M2   14
> +# |     |/  \  |
> +# |     8 M1 | 13
> +# |     |/ | \_|
> +# 5     7  |   12
> +# |     |   \__|
> +# 4     6      11
> +# |____/______/
> +# 3
> +# |
> +# 2
> +# |
> +# 1

I don't think we need such a complicated graph structure - maybe it's
sufficient to have one 2-way merge and one 3-way merge. E.g.

6
|\-.
5 \ \
|\ \ \
1 2 3 4

Also, I wonder if it is possible to test the output to a greater extent.
We don't want anything that relies on the ordering of commits
(especially since we plan on transitioning away from using SHA-1 as the
hash function) but we could still test, for example, that a 3-way merge
results in an edge list of the form "EDGE_..._..." (where the first _
does not have its high bit set, but the second one does). This could be
done by using my hex_unpack() function as seen here [1] with grep (e.g.
"45444745[0-7].......[8-9a-f].......").

[1] https://public-inbox.org/git/b9ea93edabc42754dc3643d6307c22a947eabaf3.1506714999.git.jonathantanmy@google.com/


* Re: [PATCH v2 04/14] commit-graph: implement construct_commit_graph()
  2018-01-30 21:39 ` [PATCH v2 04/14] commit-graph: implement construct_commit_graph() Derrick Stolee
  2018-02-01 22:23   ` Jonathan Tan
@ 2018-02-01 23:46   ` SZEDER Gábor
  2018-02-02 15:32   ` SZEDER Gábor
  2 siblings, 0 replies; 146+ messages in thread
From: SZEDER Gábor @ 2018-02-01 23:46 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, git, gitster, peff, git, sbeller, dstolee


> Teach Git to write a commit graph file by checking all packed objects
> to see if they are commits, then store the file in the given pack
> directory.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Makefile       |   1 +
>  commit-graph.c | 376 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  commit-graph.h |  20 +++
>  3 files changed, 397 insertions(+)
>  create mode 100644 commit-graph.c
>  create mode 100644 commit-graph.h


> diff --git a/commit-graph.c b/commit-graph.c
> new file mode 100644
> index 0000000000..db2b7390c7
> --- /dev/null
> +++ b/commit-graph.c

> +struct packed_commit_list {
> +	struct commit **list;
> +	int num;
> +	int size;
> +};
> +
> +struct packed_oid_list {
> +	struct object_id **list;
> +	int num;
> +	int size;
> +};

When we manage the memory allocation of an array with the ALLOC_GROW
macro, we tend to name the helper fields 'alloc' and 'nr'.

> +static int if_packed_commit_add_to_list(const struct object_id *oid,
> +					struct packed_git *pack,
> +					uint32_t pos,
> +					void *data)
> +{
> +	struct packed_oid_list *list = (struct packed_oid_list*)data;
> +	enum object_type type;
> +	unsigned long size;
> +	void *inner_data;
> +	off_t offset = nth_packed_object_offset(pack, pos);
> +	inner_data = unpack_entry(pack, offset, &type, &size);
> +
> +	if (inner_data)
> +		free(inner_data);
> +
> +	if (type != OBJ_COMMIT)
> +		return 0;
> +
> +	ALLOC_GROW(list->list, list->num + 1, list->size);
> +	list->list[list->num] = (struct object_id *)malloc(sizeof(struct object_id));
> +	oidcpy(list->list[list->num], oid);
> +	(list->num)++;
> +
> +	return 0;
> +}
> +
> +struct object_id *construct_commit_graph(const char *pack_dir)
> +{
> +	struct packed_oid_list oids;
> +	struct packed_commit_list commits;
> +	struct commit_graph_header hdr;
> +	struct sha1file *f;
> +	int i, count_distinct = 0;
> +	struct strbuf tmp_file = STRBUF_INIT;
> +	unsigned char final_hash[GIT_MAX_RAWSZ];
> +	char *graph_name;
> +	int fd;
> +	uint32_t chunk_ids[5];
> +	uint64_t chunk_offsets[5];
> +	int num_long_edges;
> +	struct object_id *f_hash;
> +	char *fname;
> +	struct commit_list *parent;
> +
> +	oids.num = 0;
> +	oids.size = 1024;
> +	ALLOC_ARRAY(oids.list, oids.size);
> +	for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
> +	QSORT(oids.list, oids.num, commit_compare);
> +
> +	count_distinct = 1;
> +	for (i = 1; i < oids.num; i++) {
> +		if (oidcmp(oids.list[i-1], oids.list[i]))
> +			count_distinct++;
> +	}
> +
> +	commits.num = 0;
> +	commits.size = count_distinct;
> +	ALLOC_ARRAY(commits.list, commits.size);
> +
> +	num_long_edges = 0;
> +	for (i = 0; i < oids.num; i++) {
> +		int num_parents = 0;
> +		if (i > 0 && !oidcmp(oids.list[i-1], oids.list[i]))
> +			continue;
> +
> +		commits.list[commits.num] = lookup_commit(oids.list[i]);
> +		parse_commit(commits.list[commits.num]);
> +
> +		for (parent = commits.list[commits.num]->parents;
> +		     parent; parent = parent->next)
> +			num_parents++;
> +
> +		if (num_parents > 2)
> +			num_long_edges += num_parents - 1;
> +
> +		commits.num++;
> +	}
> +
> +	strbuf_addstr(&tmp_file, pack_dir);
> +	strbuf_addstr(&tmp_file, "/tmp_graph_XXXXXX");
> +
> +	fd = git_mkstemp_mode(tmp_file.buf, 0444);
> +	if (fd < 0)
> +		die_errno("unable to create '%s'", tmp_file.buf);
> +
> +	graph_name = strbuf_detach(&tmp_file, NULL);
> +	f = sha1fd(fd, graph_name);
> +
> +	hdr.graph_signature = htonl(GRAPH_SIGNATURE);
> +	hdr.graph_version = GRAPH_VERSION;
> +	hdr.hash_version = GRAPH_OID_VERSION;
> +	hdr.hash_len = GRAPH_OID_LEN;
> +	hdr.num_chunks = 4;
> +
> +	assert(sizeof(hdr) == 8);
> +	sha1write(f, &hdr, sizeof(hdr));
> +
> +	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
> +	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
> +	chunk_ids[2] = GRAPH_CHUNKID_DATA;
> +	chunk_ids[3] = GRAPH_CHUNKID_LARGEEDGES;
> +	chunk_ids[4] = 0;
> +
> +	chunk_offsets[0] = sizeof(hdr) + GRAPH_CHUNKLOOKUP_SIZE;
> +	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
> +	chunk_offsets[2] = chunk_offsets[1] + GRAPH_OID_LEN * commits.num;
> +	chunk_offsets[3] = chunk_offsets[2] + (GRAPH_OID_LEN + 16) * commits.num;
> +	chunk_offsets[4] = chunk_offsets[3] + 4 * num_long_edges;
> +
> +	for (i = 0; i <= hdr.num_chunks; i++) {
> +		uint32_t chunk_write[3];
> +
> +		chunk_write[0] = htonl(chunk_ids[i]);
> +		chunk_write[1] = htonl(chunk_offsets[i] >> 32);
> +		chunk_write[2] = htonl(chunk_offsets[i] & 0xffffffff);
> +		sha1write(f, chunk_write, 12);
> +	}
> +
> +	write_graph_chunk_fanout(f, commits.list, commits.num);
> +	write_graph_chunk_oids(f, GRAPH_OID_LEN, commits.list, commits.num);
> +	write_graph_chunk_data(f, GRAPH_OID_LEN, commits.list, commits.num);
> +	write_graph_chunk_large_edges(f, commits.list, commits.num);
> +
> +	sha1close(f, final_hash, CSUM_CLOSE | CSUM_FSYNC);
> +
> +	f_hash = (struct object_id *)malloc(sizeof(struct object_id));
> +	memcpy(f_hash->hash, final_hash, GIT_MAX_RAWSZ);

hashcpy(), perhaps?

> +	fname = get_commit_graph_filename_hash(pack_dir, f_hash);
> +
> +	if (rename(graph_name, fname))
> +		die("failed to rename %s to %s", graph_name, fname);
> +
> +	free(oids.list);
> +	oids.size = 0;
> +	oids.num = 0;
> +
> +	return f_hash;
> +}
> +
> diff --git a/commit-graph.h b/commit-graph.h
> new file mode 100644
> index 0000000000..7b3469a7df
> --- /dev/null
> +++ b/commit-graph.h
> @@ -0,0 +1,20 @@
> +#ifndef COMMIT_GRAPH_H
> +#define COMMIT_GRAPH_H
> +
> +#include "git-compat-util.h"
> +#include "commit.h"
> +
> +extern char* get_commit_graph_filename_hash(const char *pack_dir,
> +					    struct object_id *hash);
> +
> +struct commit_graph_header {
> +	uint32_t graph_signature;
> +	unsigned char graph_version;
> +	unsigned char hash_version;
> +	unsigned char hash_len;
> +	unsigned char num_chunks;
> +};
> +
> +extern struct object_id *construct_commit_graph(const char *pack_dir);
> +
> +#endif
> -- 
> 2.16.0.15.g9c3cf44.dirty




* Re: [PATCH v2 05/14] commit-graph: implement git-commit-graph --write
  2018-01-30 21:39 ` [PATCH v2 05/14] commit-graph: implement git-commit-graph --write Derrick Stolee
  2018-02-01 23:33   ` Jonathan Tan
@ 2018-02-01 23:48   ` SZEDER Gábor
  2018-02-05 18:07     ` Derrick Stolee
  2018-02-02  1:47   ` SZEDER Gábor
  2 siblings, 1 reply; 146+ messages in thread
From: SZEDER Gábor @ 2018-02-01 23:48 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, git, gitster, peff, git, sbeller, dstolee

> Teach git-commit-graph to write graph files. Create new test script to verify
> this command succeeds without failure.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-commit-graph.txt | 18 +++++++
>  builtin/commit-graph.c             | 30 ++++++++++++
>  t/t5318-commit-graph.sh            | 96 ++++++++++++++++++++++++++++++++++++++
>  3 files changed, 144 insertions(+)
>  create mode 100755 t/t5318-commit-graph.sh
> 
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index c8ea548dfb..3f3790d9a8 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -5,3 +5,21 @@ NAME
>  ----
>  git-commit-graph - Write and verify Git commit graphs (.graph files)
>  
> +
> +SYNOPSIS
> +--------
> +[verse]
> +'git commit-graph' --write <options> [--pack-dir <pack_dir>]
> +

What do these options do and what is the command's output?  IOW, an
'OPTIONS' section would be nice.

> +EXAMPLES
> +--------
> +
> +* Write a commit graph file for the packed commits in your local .git folder.
> ++
> +------------------------------------------------
> +$ git commit-graph --write
> +------------------------------------------------
> +
> +GIT
> +---
> +Part of the linkgit:git[1] suite

> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> new file mode 100755
> index 0000000000..6bcd1cc264
> --- /dev/null
> +++ b/t/t5318-commit-graph.sh
> @@ -0,0 +1,96 @@
> +#!/bin/sh
> +
> +test_description='commit graph'
> +. ./test-lib.sh
> +
> +test_expect_success 'setup full repo' \
> +    'rm -rf .git &&
> +     mkdir full &&
> +     cd full &&
> +     git init &&
> +     git config core.commitgraph true &&
> +     git config pack.threads 1 &&

Does this pack.threads=1 make a difference?

> +     packdir=".git/objects/pack"'

We tend to put single quotes around tests like this:

  test_expect_success 'setup full repo' '
        do-this &&
        check-that
  '

This is not a mere style nit: those newlines before and after the test
block make the test's output with '--verbose-log' slightly more
readable.

Furthermore, we prefer tabs for indentation.

Finally, 'cd'-ing around such that it affects subsequent tests is
usually frowned upon.  However, in this particular case (going into
one repo, doing a bunch of tests there, then going into another repo,
and doing another bunch of tests) I think it's better than changing
directory in a subshell in every single test.

> +
> +test_expect_success 'write graph with no packs' \
> +    'git commit-graph --write --pack-dir .'
> +
> +test_expect_success 'create commits and repack' \
> +    'for i in $(test_seq 5)
> +     do
> +        echo $i >$i.txt &&
> +        git add $i.txt &&
> +        git commit -m "commit $i" &&
> +        git branch commits/$i
> +     done &&
> +     git repack'
> +
> +test_expect_success 'write graph' \
> +    'graph1=$(git commit-graph --write) &&
> +     test_path_is_file ${packdir}/graph-${graph1}.graph'

Style nit:  those {} around the variable names are unnecessary, but I
see you use them a lot.

> +
> +t_expect_success 'Add more commits' \

This must be test_expect_success.

> +    'git reset --hard commits/3 &&
> +     for i in $(test_seq 6 10)
> +     do
> +        echo $i >$i.txt &&
> +        git add $i.txt &&
> +        git commit -m "commit $i" &&
> +        git branch commits/$i
> +     done &&
> +     git reset --hard commits/3 &&
> +     for i in $(test_seq 11 15)
> +     do
> +        echo $i >$i.txt &&
> +        git add $i.txt &&
> +        git commit -m "commit $i" &&
> +        git branch commits/$i
> +     done &&
> +     git reset --hard commits/7 &&
> +     git merge commits/11 &&
> +     git branch merge/1 &&
> +     git reset --hard commits/8 &&
> +     git merge commits/12 &&
> +     git branch merge/2 &&
> +     git reset --hard commits/5 &&
> +     git merge commits/10 commits/15 &&
> +     git branch merge/3 &&
> +     git repack'
> +
> +# Current graph structure:
> +#
> +#      M3
> +#     / |\_____
> +#    / 10      15
> +#   /   |      |
> +#  /    9 M2   14
> +# |     |/  \  |
> +# |     8 M1 | 13
> +# |     |/ | \_|
> +# 5     7  |   12
> +# |     |   \__|
> +# 4     6      11
> +# |____/______/
> +# 3
> +# |
> +# 2
> +# |
> +# 1
> +
> +test_expect_success 'write graph with merges' \
> +    'graph2=$(git commit-graph --write) &&
> +     test_path_is_file ${packdir}/graph-${graph2}.graph'
> +
> +test_expect_success 'setup bare repo' \
> +    'cd .. &&
> +     git clone --bare full bare &&
> +     cd bare &&
> +     git config core.graph true &&
> +     git config pack.threads 1 &&
> +     baredir="objects/pack"'
> +
> +test_expect_success 'write graph in bare repo' \
> +    'graphbare=$(git commit-graph --write) &&
> +     test_path_is_file ${baredir}/graph-${graphbare}.graph'
> +
> +test_done
> -- 
> 2.16.0.15.g9c3cf44.dirty




* Re: [PATCH v2 06/14] commit-graph: implement git-commit-graph --read
  2018-01-30 21:39 ` [PATCH v2 06/14] commit-graph: implement git-commit-graph --read Derrick Stolee
  2018-01-31  2:22   ` Stefan Beller
@ 2018-02-02  0:02   ` SZEDER Gábor
  2018-02-02  0:23   ` Jonathan Tan
  2 siblings, 0 replies; 146+ messages in thread
From: SZEDER Gábor @ 2018-02-02  0:02 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, git, gitster, peff, git, sbeller, dstolee


> Teach git-commit-graph to read commit graph files and summarize their contents.
> 
> Use the --read option to verify the contents of a commit graph file in the
> tests.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-commit-graph.txt |   7 ++
>  builtin/commit-graph.c             |  55 +++++++++++++++
>  commit-graph.c                     | 138 ++++++++++++++++++++++++++++++++++++-
>  commit-graph.h                     |  25 +++++++
>  t/t5318-commit-graph.sh            |  28 ++++++--
>  5 files changed, 247 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index 3f3790d9a8..09aeaf6c82 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -10,6 +10,7 @@ SYNOPSIS
>  --------
>  [verse]
>  'git commit-graph' --write <options> [--pack-dir <pack_dir>]
> +'git commit-graph' --read <options> [--pack-dir <pack_dir>]

Again, what does this option do?

>  EXAMPLES
>  --------
> @@ -20,6 +21,12 @@ EXAMPLES
>  $ git commit-graph --write
>  ------------------------------------------------
>  
> +* Read basic information from a graph file.
> ++
> +------------------------------------------------
> +$ git commit-graph --read --graph-hash=<hash>
> +------------------------------------------------
> +
>  GIT
>  ---
>  Part of the linkgit:git[1] suite
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index 7affd512f1..218740b1f8 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c

> +int close_commit_graph(struct commit_graph *g)

static, perhaps?  I see it's declared as extern in the header file
below, but I don't see it called outside of this source file by the
end of the patch series.

> +{
> +	if (g->graph_fd < 0)
> +		return 0;
> +
> +	munmap((void *)g->data, g->data_len);
> +	g->data = 0;
> +
> +	close(g->graph_fd);
> +	g->graph_fd = -1;
> +
> +	return 1;
> +}
> +
> +static void free_commit_graph(struct commit_graph **g)
> +{
> +	if (!g || !*g)
> +		return;
> +
> +	close_commit_graph(*g);
> +
> +	free(*g);
> +	*g = NULL;
> +}
> +
> +struct commit_graph *load_commit_graph_one(const char *graph_file, const char *pack_dir)
> +{
> +	void *graph_map;
> +	const unsigned char *data;
> +	struct commit_graph_header *hdr;
> +	size_t graph_size;
> +	struct stat st;
> +	uint32_t i;
> +	struct commit_graph *graph;
> +	int fd = git_open(graph_file);
> +	uint64_t last_chunk_offset;
> +	uint32_t last_chunk_id;
> +
> +	if (fd < 0)
> +		return 0;
> +	if (fstat(fd, &st)) {
> +		close(fd);
> +		return 0;
> +	}
> +	graph_size = xsize_t(st.st_size);
> +
> +	if (graph_size < GRAPH_MIN_SIZE) {
> +		close(fd);
> +		die("graph file %s is too small", graph_file);
> +	}
> +	graph_map = xmmap(NULL, graph_size, PROT_READ, MAP_PRIVATE, fd, 0);
> +	data = (const unsigned char *)graph_map;
> +
> +	hdr = graph_map;
> +	if (ntohl(hdr->graph_signature) != GRAPH_SIGNATURE) {
> +		uint32_t signature = ntohl(hdr->graph_signature);
> +		munmap(graph_map, graph_size);
> +		close(fd);
> +		die("graph signature %X does not match signature %X",
> +			signature, GRAPH_SIGNATURE);
> +	}
> +	if (hdr->graph_version != GRAPH_VERSION) {
> +		unsigned char version = hdr->graph_version;
> +		munmap(graph_map, graph_size);
> +		close(fd);
> +		die("graph version %X does not match version %X",
> +			version, GRAPH_VERSION);
> +	}
> +
> +	graph = alloc_commit_graph(strlen(pack_dir) + 1);
> +
> +	graph->hdr = hdr;
> +	graph->graph_fd = fd;
> +	graph->data = graph_map;
> +	graph->data_len = graph_size;
> +
> +	last_chunk_id = 0;
> +	last_chunk_offset = (uint64_t)sizeof(*hdr);
> +	for (i = 0; i < hdr->num_chunks; i++) {
> +		uint32_t chunk_id = ntohl(*(uint32_t*)(data + sizeof(*hdr) + 12 * i));
> +		uint64_t chunk_offset1 = ntohl(*(uint32_t*)(data + sizeof(*hdr) + 12 * i + 4));
> +		uint32_t chunk_offset2 = ntohl(*(uint32_t*)(data + sizeof(*hdr) + 12 * i + 8));

There are a lot of magic numbers in these three lines, but at least
they are all multiples of 4.
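
One way to name those numbers is sketched below. This is only an
illustration under assumptions: the layout (a 4-byte chunk id followed
by an 8-byte offset, 12 bytes per entry) is read off the quoted loop,
while CHUNK_ID_WIDTH, CHUNK_OFFSET_WIDTH, CHUNK_ENTRY_WIDTH, struct
chunk_entry, read_be32() and read_chunk_entry() are hypothetical names
that do not appear in the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the 4/8/12 values come from the quoted loop. */
#define CHUNK_ID_WIDTH     4	/* 32-bit chunk id */
#define CHUNK_OFFSET_WIDTH 8	/* 64-bit chunk offset */
#define CHUNK_ENTRY_WIDTH  (CHUNK_ID_WIDTH + CHUNK_OFFSET_WIDTH)

struct chunk_entry {
	uint32_t id;
	uint64_t offset;
};

/* Read a big-endian 32-bit value byte by byte (no alignment assumptions). */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode the i-th entry of a chunk table that starts at 'table'. */
static struct chunk_entry read_chunk_entry(const unsigned char *table,
					   uint32_t i)
{
	const unsigned char *p = table + (size_t)CHUNK_ENTRY_WIDTH * i;
	struct chunk_entry e;

	e.id = read_be32(p);
	/* The offset is stored as two big-endian 32-bit halves, high first. */
	e.offset = ((uint64_t)read_be32(p + CHUNK_ID_WIDTH) << 32) |
		   read_be32(p + CHUNK_ID_WIDTH + 4);
	return e;
}
```

With something like this, the loop body would call
read_chunk_entry(data + sizeof(*hdr), i) once instead of repeating the
pointer arithmetic three times.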

> +		uint64_t chunk_offset = (chunk_offset1 << 32) | chunk_offset2;
> +
> +		if (chunk_offset > graph_size - GIT_MAX_RAWSZ)
> +			die("improper chunk offset %08x%08x", (uint32_t)(chunk_offset >> 32),
> +			    (uint32_t)chunk_offset);
> +
> +		switch (chunk_id) {
> +			case GRAPH_CHUNKID_OIDFANOUT:
> +				graph->chunk_oid_fanout = data + chunk_offset;
> +				break;
> +
> +			case GRAPH_CHUNKID_OIDLOOKUP:
> +				graph->chunk_oid_lookup = data + chunk_offset;
> +				break;
> +
> +			case GRAPH_CHUNKID_DATA:
> +				graph->chunk_commit_data = data + chunk_offset;
> +				break;
> +
> +			case GRAPH_CHUNKID_LARGEEDGES:
> +				graph->chunk_large_edges = data + chunk_offset;
> +				break;
> +
> +			case 0:
> +				break;
> +
> +			default:
> +				free_commit_graph(&graph);
> +				die("unrecognized graph chunk id: %08x", chunk_id);
> +		}
> +
> +		if (last_chunk_id == GRAPH_CHUNKID_OIDLOOKUP)
> +		{
> +			graph->num_commits = (chunk_offset - last_chunk_offset)
> +					     / hdr->hash_len;
> +		}
> +
> +		last_chunk_id = chunk_id;
> +		last_chunk_offset = chunk_offset;
> +	}
> +
> +	strcpy(graph->pack_dir, pack_dir);
> +	return graph;
> +}
> +
>  static void write_graph_chunk_fanout(struct sha1file *f,
>  				     struct commit **commits,
>  				     int nr_commits)
> @@ -361,7 +497,7 @@ struct object_id *construct_commit_graph(const char *pack_dir)
>  	sha1close(f, final_hash, CSUM_CLOSE | CSUM_FSYNC);
>  
>  	f_hash = (struct object_id *)malloc(sizeof(struct object_id));
> -	memcpy(f_hash->hash, final_hash, GIT_MAX_RAWSZ);
> +	hashcpy(f_hash->hash, final_hash);

Oh, look, I told you it's hashcpy()! ;)

>  	fname = get_commit_graph_filename_hash(pack_dir, f_hash);
>  
>  	if (rename(graph_name, fname))
> diff --git a/commit-graph.h b/commit-graph.h
> index 7b3469a7df..e046ae575c 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -15,6 +15,31 @@ struct commit_graph_header {
>  	unsigned char num_chunks;
>  };
>  
> +extern struct commit_graph {
> +	int graph_fd;
> +
> +	const unsigned char *data;
> +	size_t data_len;
> +
> +	const struct commit_graph_header *hdr;
> +
> +	struct object_id oid;
> +
> +	uint32_t num_commits;
> +
> +	const unsigned char *chunk_oid_fanout;
> +	const unsigned char *chunk_oid_lookup;
> +	const unsigned char *chunk_commit_data;
> +	const unsigned char *chunk_large_edges;
> +
> +	/* something like ".git/objects/pack" */
> +	char pack_dir[FLEX_ARRAY]; /* more */
> +} *commit_graph;
> +
> +extern int close_commit_graph(struct commit_graph *g);
> +
> +extern struct commit_graph *load_commit_graph_one(const char *graph_file, const char *pack_dir);
> +
>  extern struct object_id *construct_commit_graph(const char *pack_dir);
>  
>  #endif
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 6bcd1cc264..da565624e3 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -25,11 +25,23 @@ test_expect_success 'create commits and repack' \
>       done &&
>       git repack'
>  
> +_graph_read_expect() {
> +    cat >expect <<- EOF
> +header: 43475048 01 01 14 04
> +num_commits: $1
> +chunks: oid_fanout oid_lookup commit_metadata large_edges
> +pack_dir: $2
> +EOF
> +}

Style nit: since you are already using the '<<-' operator for the
here-doc, you could indent it with tabs.

> +
>  test_expect_success 'write graph' \
>      'graph1=$(git commit-graph --write) &&
> -     test_path_is_file ${packdir}/graph-${graph1}.graph'
> +     test_path_is_file ${packdir}/graph-${graph1}.graph &&
> +     git commit-graph --read --graph-hash=${graph1} >output &&
> +     _graph_read_expect "5" "${packdir}" &&
> +     cmp expect output'

Please use the 'test_cmp' helper throughout the tests instead.
If the two files don't match, 'cmp' will only tell you where they
start to differ, whereas 'test_cmp' will actually show the difference.

> -t_expect_success 'Add more commits' \
> +test_expect_success 'Add more commits' \

This should be squashed into the earlier commit.

>      'git reset --hard commits/3 &&
>       for i in $(test_seq 6 10)
>       do
> @@ -79,7 +91,10 @@ t_expect_success 'Add more commits' \
>  
>  test_expect_success 'write graph with merges' \
>      'graph2=$(git commit-graph --write) &&
> -     test_path_is_file ${packdir}/graph-${graph2}.graph'
> +     test_path_is_file ${packdir}/graph-${graph2}.graph &&
> +     git commit-graph --read --graph-hash=${graph2} >output &&
> +     _graph_read_expect "18" "${packdir}" &&
> +     cmp expect output'
>  
>  test_expect_success 'setup bare repo' \
>      'cd .. &&
> @@ -87,10 +102,13 @@ test_expect_success 'setup bare repo' \
>       cd bare &&
>       git config core.graph true &&
>       git config pack.threads 1 &&
> -     baredir="objects/pack"'
> +     baredir="./objects/pack"'

Is this change really necessary?  If it is, then perhaps it should
have been written this way upon its introduction.

>  test_expect_success 'write graph in bare repo' \
>      'graphbare=$(git commit-graph --write) &&
> -     test_path_is_file ${baredir}/graph-${graphbare}.graph'
> +     test_path_is_file ${baredir}/graph-${graphbare}.graph &&
> +     git commit-graph --read --graph-hash=${graphbare} >output &&
> +     _graph_read_expect "18" "${baredir}" &&
> +     cmp expect output'
>  
>  test_done
> -- 
> 2.16.0.15.g9c3cf44.dirty




* Re: [PATCH v2 06/14] commit-graph: implement git-commit-graph --read
  2018-01-30 21:39 ` [PATCH v2 06/14] commit-graph: implement git-commit-graph --read Derrick Stolee
  2018-01-31  2:22   ` Stefan Beller
  2018-02-02  0:02   ` SZEDER Gábor
@ 2018-02-02  0:23   ` Jonathan Tan
  2018-02-05 19:29     ` Derrick Stolee
  2 siblings, 1 reply; 146+ messages in thread
From: Jonathan Tan @ 2018-02-02  0:23 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, peff, git, sbeller, dstolee

On Tue, 30 Jan 2018 16:39:35 -0500
Derrick Stolee <stolee@gmail.com> wrote:

> Teach git-commit-graph to read commit graph files and summarize their contents.

One overall question - is the "read" command meant to be a command used
by the end user, or is it here just to test that some aspects of reading
work? If the former, I'm not sure how useful it is. And if the latter,
I think that it is more useful to just implement something that reads
it, then make the 11/14 change (modifying parse_commit_gently) and
include a perf test to show that your commit graph reading is both
correct and (performance-)effective.


* Re: [PATCH v2 03/14] commit-graph: create git-commit-graph builtin
  2018-01-30 21:39 ` [PATCH v2 03/14] commit-graph: create git-commit-graph builtin Derrick Stolee
@ 2018-02-02  0:53   ` SZEDER Gábor
  0 siblings, 0 replies; 146+ messages in thread
From: SZEDER Gábor @ 2018-02-02  0:53 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, git, gitster, peff, git, sbeller, dstolee

> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> new file mode 100644
> index 0000000000..c8ea548dfb
> --- /dev/null
> +++ b/Documentation/git-commit-graph.txt
> @@ -0,0 +1,7 @@
> +git-commit-graph(1)
> +============

Here the length of the '====' must match the length of the title line
above, or AsciiDoc will complain about missing document title.



* Re: [PATCH v2 07/14] commit-graph: implement git-commit-graph --update-head
  2018-01-30 21:39 ` [PATCH v2 07/14] commit-graph: implement git-commit-graph --update-head Derrick Stolee
@ 2018-02-02  1:35   ` SZEDER Gábor
  2018-02-05 21:01     ` Derrick Stolee
  2018-02-02  2:45   ` SZEDER Gábor
  1 sibling, 1 reply; 146+ messages in thread
From: SZEDER Gábor @ 2018-02-02  1:35 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, git, gitster, peff, git, sbeller, dstolee

> It is possible to have multiple commit graph files in a pack directory,
> but only one is important at a time. Use a 'graph_head' file to point
> to the important file.

This implies that all those other files are ignored, right?

> Teach git-commit-graph to write 'graph_head' upon
> writing a new commit graph file.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-commit-graph.txt | 34 ++++++++++++++++++++++++++++++++++
>  builtin/commit-graph.c             | 38 +++++++++++++++++++++++++++++++++++---
>  commit-graph.c                     | 25 +++++++++++++++++++++++++
>  commit-graph.h                     |  2 ++
>  t/t5318-commit-graph.sh            | 12 ++++++++++--
>  5 files changed, 106 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index 09aeaf6c82..99ced16ddc 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -12,15 +12,49 @@ SYNOPSIS
>  'git commit-graph' --write <options> [--pack-dir <pack_dir>]
>  'git commit-graph' --read <options> [--pack-dir <pack_dir>]
>  
> +OPTIONS
> +-------

Oh, look, the 'OPTIONS' section I missed in the previous patches! ;)

This should be split up and squashed into the previous patches where
the individual --options are first mentioned.

> +--pack-dir::
> +	Use given directory for the location of packfiles, graph-head,
> +	and graph files.
> +
> +--read::
> +	Read a graph file given by the graph-head file and output basic
> +	details about the graph file. (Cannot be combined with --write.)

From the output of 'git commit-graph --read' it seems that it's not a
generally useful option to the user.  Perhaps it should be mentioned
that it's only intended as a debugging aid?  Or maybe it doesn't
really matter, because eventually this command will become irrelevant,
as other commands (clone, fetch, gc) will invoke it automagically...

> +--graph-id::
> +	When used with --read, consider the graph file graph-<oid>.graph.
> +
> +--write::
> +	Write a new graph file to the pack directory. (Cannot be combined
> +	with --read.)

I think this should also mention that it prints the generated graph
file's checksum.

> +
> +--update-head::
> +	When used with --write, update the graph-head file to point to
> +	the written graph file.

So it should be used with '--write', noted.

>  EXAMPLES
>  --------
>  
> +* Output the hash of the graph file pointed to by <dir>/graph-head.
> ++
> +------------------------------------------------
> +$ git commit-graph --pack-dir=<dir>
> +------------------------------------------------
> +
>  * Write a commit graph file for the packed commits in your local .git folder.
>  +
>  ------------------------------------------------
>  $ git commit-graph --write
>  ------------------------------------------------
>  
> +* Write a graph file for the packed commits in your local .git folder,
> +* and update graph-head.
> ++
> +------------------------------------------------
> +$ git commit-graph --write --update-head
> +------------------------------------------------
> +
>  * Read basic information from a graph file.
>  +
>  ------------------------------------------------
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index 218740b1f8..d73cbc907d 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -11,7 +11,7 @@
>  static char const * const builtin_commit_graph_usage[] = {
>  	N_("git commit-graph [--pack-dir <packdir>]"),
>  	N_("git commit-graph --read [--graph-hash=<hash>]"),
> -	N_("git commit-graph --write [--pack-dir <packdir>]"),
> +	N_("git commit-graph --write [--pack-dir <packdir>] [--update-head]"),
>  	NULL
>  };
>  
> @@ -20,6 +20,9 @@ static struct opts_commit_graph {
>  	int read;
>  	const char *graph_hash;
>  	int write;
> +	int update_head;
> +	int has_existing;
> +	struct object_id old_graph_hash;
>  } opts;
>  
>  static int graph_read(void)
> @@ -30,8 +33,8 @@ static int graph_read(void)
>  
>  	if (opts.graph_hash && strlen(opts.graph_hash) == GIT_MAX_HEXSZ)
>  		get_oid_hex(opts.graph_hash, &graph_hash);
> -	else
> -		die("no graph hash specified");
> +	else if (!get_graph_head_hash(opts.pack_dir, &graph_hash))
> +		die("no graph-head exists");
>  
>  	graph_file = get_commit_graph_filename_hash(opts.pack_dir, &graph_hash);
>  	graph = load_commit_graph_one(graph_file, opts.pack_dir);
> @@ -62,10 +65,33 @@ static int graph_read(void)
>  	return 0;
>  }
>  
> +static void update_head_file(const char *pack_dir, const struct object_id *graph_hash)
> +{
> +	struct strbuf head_path = STRBUF_INIT;
> +	int fd;
> +	struct lock_file lk = LOCK_INIT;
> +
> +	strbuf_addstr(&head_path, pack_dir);
> +	strbuf_addstr(&head_path, "/");
> +	strbuf_addstr(&head_path, "graph-head");

strbuf_addstr(&head_path, "/graph-head"); ?

> +
> +	fd = hold_lock_file_for_update(&lk, head_path.buf, LOCK_DIE_ON_ERROR);
> +	strbuf_release(&head_path);
> +
> +	if (fd < 0)
> +		die_errno("unable to open graph-head");
> +
> +	write_in_full(fd, oid_to_hex(graph_hash), GIT_MAX_HEXSZ);
> +	commit_lock_file(&lk);

The new graph-head file will be writable.  All other files in
.git/objects/pack are created read-only, including graph files.  Just
pointing it out, but I don't think it's a big deal; other than
consistency with the permissions of other files I don't have any
argument for making it read-only.

> +}
> +
>  static int graph_write(void)
>  {
>  	struct object_id *graph_hash = construct_commit_graph(opts.pack_dir);

First the new graph file is written ...

>  
> +	if (opts.update_head)
> +		update_head_file(opts.pack_dir, graph_hash);

... and then the new graph head, good.  There could be a race if it
were the other way around.
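
The ordering can be illustrated outside of Git with a minimal sketch.
It assumes POSIX rename() semantics (the pointer file is replaced in
one step), which is similar in spirit to what the lock-file calls in
the quoted patch do. write_file() and publish_graph() are hypothetical
helpers, not functions from the patch.

```c
#include <stdio.h>

/* Write 'content' to 'path'; return 0 on success, -1 on failure. */
static int write_file(const char *path, const char *content)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fputs(content, f) == EOF) {
		fclose(f);
		return -1;
	}
	return fclose(f) ? -1 : 0;
}

/*
 * Publish a new graph: the data file is written first, and only then is
 * the pointer file swapped in.  A reader that follows 'head_path' thus
 * never sees a hash whose graph file does not exist yet.
 */
static int publish_graph(const char *graph_path, const char *graph_data,
			 const char *head_path, const char *head_tmp,
			 const char *hash)
{
	if (write_file(graph_path, graph_data))
		return -1;
	if (write_file(head_tmp, hash))
		return -1;
	/* rename() replaces head_path in a single step on POSIX systems. */
	return rename(head_tmp, head_path) ? -1 : 0;
}
```

If the two writes were swapped, a concurrent reader could pick up the
new hash from the pointer file and then fail to open the graph file.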

> +
>  	if (graph_hash)
>  		printf("%s\n", oid_to_hex(graph_hash));
>  
> @@ -83,6 +109,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
>  			N_("read graph file")),
>  		OPT_BOOL('w', "write", &opts.write,
>  			N_("write commit graph file")),
> +		OPT_BOOL('u', "update-head", &opts.update_head,
> +			N_("update graph-head to written graph file")),
>  		{ OPTION_STRING, 'H', "graph-hash", &opts.graph_hash,
>  			N_("hash"),
>  			N_("A hash for a specific graph file in the pack-dir."),
> @@ -109,10 +137,14 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
>  		opts.pack_dir = strbuf_detach(&path, NULL);
>  	}
>  
> +	opts.has_existing = !!get_graph_head_hash(opts.pack_dir, &opts.old_graph_hash);
> +
>  	if (opts.read)
>  		return graph_read();
>  	if (opts.write)
>  		return graph_write();
>  
> +	if (opts.has_existing)
> +		printf("%s\n", oid_to_hex(&opts.old_graph_hash));

It seems that a command like 'git commit-graph --read --update-head'
succeeds and '--update-head' has no effect.  I think it should error
out.  'git commit-graph --update-head' doesn't complain, either.

Would it be more appropriate to have 'read' and 'write' subcommands
instead of '--read' and '--write' options?  Then parse-options alone
would take care of a command line like 'git commit-graph read
--update-head' and error out because of the unrecognized option.
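
As long as these stay options, the combinations could be rejected by
hand.  The following is a rough sketch only: check_mode_options() is a
hypothetical helper and the messages are made up; the two rules
themselves are taken from the OPTIONS text quoted earlier.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of the patch: return NULL when the
 * option combination is acceptable, or a message describing the
 * problem otherwise.
 */
static const char *check_mode_options(int read, int write, int update_head)
{
	if (read && write)
		return "--read and --write cannot be combined";
	if (update_head && !write)
		return "--update-head requires --write";
	return NULL;
}
```

cmd_commit_graph() could call this right after parsing the options and
pass a non-NULL result to die() or to the usage machinery.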

>  	return 0;
>  }
> diff --git a/commit-graph.c b/commit-graph.c
> index 622a650259..764e016ddb 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -35,6 +35,31 @@
>  #define GRAPH_MIN_SIZE (GRAPH_CHUNKLOOKUP_SIZE + GRAPH_FANOUT_SIZE + \
>  			GRAPH_OID_LEN + sizeof(struct commit_graph_header))
>  
> +struct object_id *get_graph_head_hash(const char *pack_dir, struct object_id *hash)
> +{
> +	struct strbuf head_filename = STRBUF_INIT;
> +	char hex[GIT_MAX_HEXSZ + 1];
> +	FILE *f;
> +
> +	strbuf_addstr(&head_filename, pack_dir);
> +	strbuf_addstr(&head_filename, "/graph-head");
> +
> +	f = fopen(head_filename.buf, "r");
> +	strbuf_release(&head_filename);
> +
> +	if (!f)
> +		return 0;
> +
> +	if (!fgets(hex, sizeof(hex), f))
> +		die("failed to read graph-head");
> +
> +	fclose(f);
> +
> +	if (get_oid_hex(hex, hash))
> +		return 0;
> +	return hash;
> +}
> +
>  char* get_commit_graph_filename_hash(const char *pack_dir,
>  				     struct object_id *hash)
>  {
> diff --git a/commit-graph.h b/commit-graph.h
> index e046ae575c..43eb0aec84 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -4,6 +4,8 @@
>  #include "git-compat-util.h"
>  #include "commit.h"
>  
> +extern struct object_id *get_graph_head_hash(const char *pack_dir,
> +					     struct object_id *hash);
>  extern char* get_commit_graph_filename_hash(const char *pack_dir,
>  					    struct object_id *hash);
>  
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index da565624e3..d1a23bcdaf 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -13,7 +13,8 @@ test_expect_success 'setup full repo' \
>       packdir=".git/objects/pack"'
>  
>  test_expect_success 'write graph with no packs' \
> -    'git commit-graph --write --pack-dir .'
> +    'git commit-graph --write --pack-dir . &&
> +     test_path_is_missing graph-head'
>  
>  test_expect_success 'create commits and repack' \
>      'for i in $(test_seq 5)
> @@ -37,6 +38,7 @@ EOF
>  test_expect_success 'write graph' \
>      'graph1=$(git commit-graph --write) &&
>       test_path_is_file ${packdir}/graph-${graph1}.graph &&
> +     test_path_is_missing ${packdir}/graph-head &&
>       git commit-graph --read --graph-hash=${graph1} >output &&
>       _graph_read_expect "5" "${packdir}" &&
>       cmp expect output'
> @@ -90,8 +92,11 @@ test_expect_success 'Add more commits' \
>  # 1
>  
>  test_expect_success 'write graph with merges' \
> -    'graph2=$(git commit-graph --write) &&
> +    'graph2=$(git commit-graph --write --update-head) &&
>       test_path_is_file ${packdir}/graph-${graph2}.graph &&
> +     test_path_is_file ${packdir}/graph-head &&
> +     echo ${graph2} >expect &&
> +     cmp -n 40 expect ${packdir}/graph-head &&

This check is fishy, and that '-n 40' will need adjustment once we
migrate to a longer hash function.  I presume you used it, because
'graph-head' contains only 40 hexdigits without a trailing newline,
but 'expect' created with 'echo' does contain a newline as well,
right?  Then this would be better instead:

  printf $graph2 >expect &&
  test_cmp expect $packdir/graph-head &&

>       git commit-graph --read --graph-hash=${graph2} >output &&
>       _graph_read_expect "18" "${packdir}" &&
>       cmp expect output'
> @@ -107,6 +112,9 @@ test_expect_success 'setup bare repo' \
>  test_expect_success 'write graph in bare repo' \
>      'graphbare=$(git commit-graph --write) &&
>       test_path_is_file ${baredir}/graph-${graphbare}.graph &&
> +     test_path_is_file ${baredir}/graph-head &&
> +     echo ${graphbare} >expect &&
> +     cmp -n 40 expect ${baredir}/graph-head &&

Likewise.

>       git commit-graph --read --graph-hash=${graphbare} >output &&
>       _graph_read_expect "18" "${baredir}" &&
>       cmp expect output'
> -- 
> 2.16.0.15.g9c3cf44.dirty




* Re: [PATCH v2 05/14] commit-graph: implement git-commit-graph --write
  2018-01-30 21:39 ` [PATCH v2 05/14] commit-graph: implement git-commit-graph --write Derrick Stolee
  2018-02-01 23:33   ` Jonathan Tan
  2018-02-01 23:48   ` SZEDER Gábor
@ 2018-02-02  1:47   ` SZEDER Gábor
  2 siblings, 0 replies; 146+ messages in thread
From: SZEDER Gábor @ 2018-02-02  1:47 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, git, gitster, peff, git, sbeller, dstolee

> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> new file mode 100755
> index 0000000000..6bcd1cc264
> --- /dev/null
> +++ b/t/t5318-commit-graph.sh
> @@ -0,0 +1,96 @@
> +#!/bin/sh
> +
> +test_description='commit graph'
> +. ./test-lib.sh
> +
> +test_expect_success 'setup full repo' \
> +    'rm -rf .git &&
> +     mkdir full &&
> +     cd full &&
> +     git init &&
> +     git config core.commitgraph true &&

This config variable is unknown at this point.
I think the test shouldn't set it before it's introduced in patch 10.

> +     git config pack.threads 1 &&
> +     packdir=".git/objects/pack"'


> +test_expect_success 'setup bare repo' \
> +    'cd .. &&
> +     git clone --bare full bare &&
> +     cd bare &&
> +     git config core.graph true &&

Likewise, and its name should be updated as well.

> +     git config pack.threads 1 &&
> +     baredir="objects/pack"'


* Re: [PATCH v2 11/14] commit: integrate commit graph with commit parsing
  2018-01-30 21:39 ` [PATCH v2 11/14] commit: integrate commit graph with commit parsing Derrick Stolee
@ 2018-02-02  1:51   ` Jonathan Tan
  2018-02-06 14:53     ` Derrick Stolee
  0 siblings, 1 reply; 146+ messages in thread
From: Jonathan Tan @ 2018-02-02  1:51 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, peff, git, sbeller, dstolee

On Tue, 30 Jan 2018 16:39:40 -0500
Derrick Stolee <stolee@gmail.com> wrote:

> +/* global storage */
> +struct commit_graph *commit_graph = 0;

NULL, not 0.

> +static int bsearch_graph(struct commit_graph *g, struct object_id *oid, uint32_t *pos)
> +{
> +	uint32_t last, first = 0;
> +
> +	if (oid->hash[0])
> +		first = ntohl(*(uint32_t*)(g->chunk_oid_fanout + 4 * (oid->hash[0] - 1)));
> +	last = ntohl(*(uint32_t*)(g->chunk_oid_fanout + 4 * oid->hash[0]));
> +
> +	while (first < last) {
> +		uint32_t mid = first + (last - first) / 2;
> +		const unsigned char *current;
> +		int cmp;
> +
> +		current = g->chunk_oid_lookup + g->hdr->hash_len * mid;
> +		cmp = hashcmp(oid->hash, current);
> +		if (!cmp) {
> +			*pos = mid;
> +			return 1;
> +		}
> +		if (cmp > 0) {
> +			first = mid + 1;
> +			continue;
> +		}
> +		last = mid;
> +	}
> +
> +	*pos = first;
> +	return 0;
> +}

This would be better in sha1-lookup.c, something like the reverse of commit
f1068efefe6d ("sha1_file: drop experimental GIT_USE_LOOKUP search",
2017-08-09), except that it can be done using a simple binary search.
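
For illustration, a generic version that does not depend on struct
commit_graph might look roughly like the sketch below.  The name
lookup_hash() and its signature are assumptions, and the fanout table
is taken to be in host byte order already, unlike the on-disk chunk
that the quoted code reads with ntohl().

```c
#include <stdint.h>
#include <string.h>

/*
 * Look up 'key' in 'table', a sorted array of hashes of 'hash_len'
 * bytes each.  fanout[b] is the number of entries whose first byte is
 * <= b.  Returns 1 and sets *pos on a hit; returns 0 and sets *pos to
 * the insertion point on a miss.
 */
static int lookup_hash(const unsigned char *table, const uint32_t *fanout,
		       size_t hash_len, const unsigned char *key,
		       uint32_t *pos)
{
	/* The fanout table narrows the search to one first-byte bucket. */
	uint32_t first = key[0] ? fanout[key[0] - 1] : 0;
	uint32_t last = fanout[key[0]];

	while (first < last) {
		uint32_t mid = first + (last - first) / 2;
		int cmp = memcmp(key, table + hash_len * mid, hash_len);

		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp > 0)
			first = mid + 1;
		else
			last = mid;
	}

	*pos = first;
	return 0;
}
```

bsearch_graph() would then shrink to a thin wrapper that passes the
graph's lookup chunk, a decoded fanout and hdr->hash_len.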

> +static int full_parse_commit(struct commit *item, struct commit_graph *g,
> +			     uint32_t pos, const unsigned char *commit_data)
> +{
> +	struct object_id oid;
> +	struct commit *new_parent;
> +	uint32_t new_parent_pos;
> +	uint32_t *parent_data_ptr;
> +	uint64_t date_low, date_high;
> +	struct commit_list **pptr;
> +
> +	item->object.parsed = 1;
> +	item->graph_pos = pos;
> +
> +	hashcpy(oid.hash, commit_data);
> +	item->tree = lookup_tree(&oid);
> +
> +	date_high = ntohl(*(uint32_t*)(commit_data + g->hdr->hash_len + 8)) & 0x3;
> +	date_low = ntohl(*(uint32_t*)(commit_data + g->hdr->hash_len + 12));
> +	item->date = (timestamp_t)((date_high << 32) | date_low);
> +
> +	pptr = &item->parents;
> +
> +	new_parent_pos = ntohl(*(uint32_t*)(commit_data + g->hdr->hash_len));
> +	if (new_parent_pos == GRAPH_PARENT_NONE)
> +		return 1;
> +	get_nth_commit_oid(g, new_parent_pos, &oid);
> +	new_parent = lookup_commit(&oid);
> +	if (new_parent) {
> +		new_parent->graph_pos = new_parent_pos;
> +		pptr = &commit_list_insert(new_parent, pptr)->next;
> +	} else {
> +		die("could not find commit %s", oid_to_hex(&oid));
> +	}
> +
> +	new_parent_pos = ntohl(*(uint32_t*)(commit_data + g->hdr->hash_len + 4));
> +	if (new_parent_pos == GRAPH_PARENT_NONE)
> +		return 1;
> +	if (!(new_parent_pos & GRAPH_LARGE_EDGES_NEEDED)) {
> +		get_nth_commit_oid(g, new_parent_pos, &oid);
> +		new_parent = lookup_commit(&oid);
> +		if (new_parent) {
> +			new_parent->graph_pos = new_parent_pos;
> +			pptr = &commit_list_insert(new_parent, pptr)->next;
> +		} else
> +			die("could not find commit %s", oid_to_hex(&oid));
> +		return 1;
> +	}
> +
> +	parent_data_ptr = (uint32_t*)(g->chunk_large_edges + 4 * (new_parent_pos ^ GRAPH_LARGE_EDGES_NEEDED));
> +	do {
> +		new_parent_pos = ntohl(*parent_data_ptr);
> +
> +		get_nth_commit_oid(g, new_parent_pos & GRAPH_EDGE_LAST_MASK, &oid);
> +		new_parent = lookup_commit(&oid);
> +		if (new_parent) {
> +			new_parent->graph_pos = new_parent_pos & GRAPH_EDGE_LAST_MASK;
> +			pptr = &commit_list_insert(new_parent, pptr)->next;
> +		} else
> +			die("could not find commit %s", oid_to_hex(&oid));
> +		parent_data_ptr++;
> +	} while (!(new_parent_pos & GRAPH_LAST_EDGE));
> +
> +	return 1;
> +}

The part that converts <pointer to parent data> into <struct commit *>
seems to be duplicated 3 times. Refactor into a function?
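
A rough, self-contained sketch of what such a helper could look like. The
types and commit_at() below are simplified stand-ins (the real code would use
struct commit, get_nth_commit_oid(), lookup_commit() and
commit_list_insert()), and insert_parent() is a hypothetical name.

```c
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-ins for the real types in commit.h. */
struct commit {
	uint32_t graph_pos;
};

struct commit_list {
	struct commit *item;
	struct commit_list *next;
};

/* Stand-in for get_nth_commit_oid() + lookup_commit(): index a table. */
static struct commit *commit_at(struct commit *table, uint32_t nr,
				uint32_t pos)
{
	return pos < nr ? &table[pos] : NULL;
}

/*
 * Hypothetical helper: resolve graph position 'pos' to a commit, record
 * the position on it, link it in at *pptr and return the new tail
 * pointer. Returns NULL if the commit cannot be found (the patch calls
 * die() at that point) or if allocation fails.
 */
static struct commit_list **insert_parent(struct commit *table, uint32_t nr,
					  uint32_t pos,
					  struct commit_list **pptr)
{
	struct commit *parent = commit_at(table, nr, pos);
	struct commit_list *entry;

	if (!parent)
		return NULL;
	entry = malloc(sizeof(*entry));
	if (!entry)
		return NULL;

	parent->graph_pos = pos;
	entry->item = parent;
	entry->next = *pptr;
	*pptr = entry;
	return &entry->next;
}
```

With something along these lines, each of the three call sites shrinks to one
call of the form "pptr = insert_parent(..., pos, pptr);".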

> +/**
> + * Fill 'item' to contain all information that would be parsed by parse_commit_buffer.
> + */
> +static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos)
> +{
> +	uint32_t new_parent_pos;
> +	uint32_t *parent_data_ptr;
> +	const unsigned char *commit_data = g->chunk_commit_data + (g->hdr->hash_len + 16) * pos;
> +
> +	new_parent_pos = ntohl(*(uint32_t*)(commit_data + g->hdr->hash_len));
> +
> +	if (new_parent_pos == GRAPH_PARENT_MISSING)
> +		return 0;
> +
> +	if (new_parent_pos == GRAPH_PARENT_NONE)
> +		return full_parse_commit(item, g, pos, commit_data);
> +
> +	new_parent_pos = ntohl(*(uint32_t*)(commit_data + g->hdr->hash_len + 4));
> +
> +	if (new_parent_pos == GRAPH_PARENT_MISSING)
> +		return 0;
> +	if (!(new_parent_pos & GRAPH_LARGE_EDGES_NEEDED))
> +		return full_parse_commit(item, g, pos, commit_data);
> +
> +	new_parent_pos = new_parent_pos ^ GRAPH_LARGE_EDGES_NEEDED;
> +
> +	if (new_parent_pos == GRAPH_PARENT_MISSING)
> +		return 0;
> +
> +	parent_data_ptr = (uint32_t*)(g->chunk_large_edges + 4 * new_parent_pos);
> +	do {
> +		new_parent_pos = ntohl(*parent_data_ptr);
> +
> +		if ((new_parent_pos & GRAPH_EDGE_LAST_MASK) == GRAPH_PARENT_MISSING)
> +			return 0;
> +
> +		parent_data_ptr++;
> +	} while (!(new_parent_pos & GRAPH_LAST_EDGE));
> +
> +	return full_parse_commit(item, g, pos, commit_data);
> +}

This function seems to just check for GRAPH_PARENT_MISSING - could that
check be folded into full_parse_commit() instead? (Then
full_parse_commit can be renamed to fill_commit_in_graph.)

> @@ -439,9 +656,24 @@ struct object_id *construct_commit_graph(const char *pack_dir)
>  	char *fname;
>  	struct commit_list *parent;
>  
> +	prepare_commit_graph();
> +
>  	oids.num = 0;
>  	oids.size = 1024;
> +
> +	if (commit_graph && oids.size < commit_graph->num_commits)
> +		oids.size = commit_graph->num_commits;
> +
>  	ALLOC_ARRAY(oids.list, oids.size);
> +
> +	if (commit_graph) {
> +		for (i = 0; i < commit_graph->num_commits; i++) {
> +			oids.list[i] = malloc(sizeof(struct object_id));
> +			get_nth_commit_oid(commit_graph, i, oids.list[i]);
> +		}
> +		oids.num = commit_graph->num_commits;
> +	}
> +
>  	for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
>  	QSORT(oids.list, oids.num, commit_compare);
>  
> @@ -525,6 +757,11 @@ struct object_id *construct_commit_graph(const char *pack_dir)
>  	hashcpy(f_hash->hash, final_hash);
>  	fname = get_commit_graph_filename_hash(pack_dir, f_hash);
>  
> +	if (commit_graph) {
> +		close_commit_graph(commit_graph);
> +		FREE_AND_NULL(commit_graph);
> +	}
> +
>  	if (rename(graph_name, fname))
>  		die("failed to rename %s to %s", graph_name, fname);

What is the relation of these changes to construct_commit_graph() to the
rest of the patch?

> diff --git a/commit-graph.h b/commit-graph.h
> index 43eb0aec84..05ddbbe165 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -4,6 +4,18 @@
>  #include "git-compat-util.h"
>  #include "commit.h"
>  
> +/**
> + * Given a commit struct, try to fill the commit struct info, including:
> + *  1. tree object
> + *  2. date
> + *  3. parents.
> + *
> + * Returns 1 if and only if the commit was found in the packed graph.
> + *
> + * See parse_commit_buffer() for the fallback after this call.
> + */
> +extern int parse_commit_in_graph(struct commit *item);
> +
>  extern struct object_id *get_graph_head_hash(const char *pack_dir,
>  					     struct object_id *hash);
>  extern char* get_commit_graph_filename_hash(const char *pack_dir,
> @@ -40,7 +52,13 @@ extern struct commit_graph {
>  
>  extern int close_commit_graph(struct commit_graph *g);
>  
> -extern struct commit_graph *load_commit_graph_one(const char *graph_file, const char *pack_dir);
> +extern struct commit_graph *load_commit_graph_one(const char *graph_file,
> +						  const char *pack_dir);
> +extern void prepare_commit_graph(void);
> +
> +extern struct object_id *get_nth_commit_oid(struct commit_graph *g,
> +					    uint32_t n,
> +					    struct object_id *oid);
>  
>  extern struct object_id *construct_commit_graph(const char *pack_dir);

This header file now contains functions for reading the commit graph,
and functions for writing one. It seems to me that those are (and should
be) quite disjoint, so it might be better to separate them into two.

> -int parse_commit_gently(struct commit *item, int quiet_on_missing)
> +int parse_commit_internal(struct commit *item, int quiet_on_missing, int check_packed)
>  {
>  	enum object_type type;
>  	void *buffer;
> @@ -385,6 +386,8 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
>  		return -1;
>  	if (item->object.parsed)
>  		return 0;
> +	if (check_packed && parse_commit_in_graph(item))
> +		return 0;
>  	buffer = read_sha1_file(item->object.oid.hash, &type, &size);
>  	if (!buffer)
>  		return quiet_on_missing ? -1 :
> @@ -404,6 +407,11 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
>  	return ret;
>  }
>  
> +int parse_commit_gently(struct commit *item, int quiet_on_missing)
> +{
> +	return parse_commit_internal(item, quiet_on_missing, 1);
> +}

Are you planning to use parse_commit_internal() from elsewhere? (It
doesn't seem to be the case, at least from this patch series.)

> diff --git a/log-tree.c b/log-tree.c
> index fca29d4799..156aed4541 100644
> --- a/log-tree.c
> +++ b/log-tree.c
> @@ -659,8 +659,7 @@ void show_log(struct rev_info *opt)
>  		show_mergetag(opt, commit);
>  	}
>  
> -	if (!get_cached_commit_buffer(commit, NULL))
> -		return;
> +	get_commit_buffer(commit, NULL);

This undoes an optimization that I discuss in my e-mail message here
[1]. If we decide to do this, it should at least be called out in the
commit message.

[1] https://public-inbox.org/git/b88725476d9f13ba4381d85e5fe049f6ef93f621.1506714999.git.jonathantanmy@google.com/

> +_graph_git_two_modes() {

No need for the name to start with an underscore, I think.

> +    git -c core.commitgraph=true $1 >output
> +    git -c core.commitgraph=false $1 >expect
> +    cmp output expect

Use test_cmp.

> +}
> +
> +_graph_git_behavior() {
> +    BRANCH=$1
> +    COMPARE=$2
> +    test_expect_success 'check normal git operations' \
> +        '_graph_git_two_modes "log --oneline ${BRANCH}" &&
> +         _graph_git_two_modes "log --topo-order ${BRANCH}" &&
> +         _graph_git_two_modes "branch -vv" &&
> +         _graph_git_two_modes "merge-base -a ${BRANCH} ${COMPARE}"'
> +}

This makes it difficult to debug failing tests, since they're all named
the same. Better to just run the commands inline, and wrap the
invocations of _graph_git_behavior in an appropriately named
test_expect_success.


* Re: [PATCH v2 07/14] commit-graph: implement git-commit-graph --update-head
  2018-01-30 21:39 ` [PATCH v2 07/14] commit-graph: implement git-commit-graph --update-head Derrick Stolee
  2018-02-02  1:35   ` SZEDER Gábor
@ 2018-02-02  2:45   ` SZEDER Gábor
  1 sibling, 0 replies; 146+ messages in thread
From: SZEDER Gábor @ 2018-02-02  2:45 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, git, gitster, peff, git, sbeller, dstolee

> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index da565624e3..d1a23bcdaf 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh


> @@ -107,6 +112,9 @@ test_expect_success 'setup bare repo' \
>  test_expect_success 'write graph in bare repo' \
>      'graphbare=$(git commit-graph --write) &&
>       test_path_is_file ${baredir}/graph-${graphbare}.graph &&
> +     test_path_is_file ${baredir}/graph-head &&

This test and the one preceding it are wrong.

Note that 'git commit-graph --write' above is missing the
'--update-head' option, so there should be no graph-head file written,
yet this 'test_path_is_file' doesn't fail the test.

The devil lies in the previous test 'setup bare repo', where this bare
repo is created by cloning from a local remote: a simple 'git clone
--bare full bare' hardlinks all files under .git/objects, including
all graph and graph-head files that exist in the remote repo.

The previous test should run 'git clone --bare --no-local full bare'
instead, and then this test would fail because of the missing
graph-head file, as it should.  Specifying '--update-head' will make
it work again.


> +     echo ${graphbare} >expect &&
> +     cmp -n 40 expect ${baredir}/graph-head &&
>       git commit-graph --read --graph-hash=${graphbare} >output &&
>       _graph_read_expect "18" "${baredir}" &&
>       cmp expect output'




* Re: [PATCH v2 08/14] commit-graph: implement git-commit-graph --clear
  2018-01-30 21:39 ` [PATCH v2 08/14] commit-graph: implement git-commit-graph --clear Derrick Stolee
@ 2018-02-02  4:01   ` SZEDER Gábor
  0 siblings, 0 replies; 146+ messages in thread
From: SZEDER Gábor @ 2018-02-02  4:01 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, git, gitster, peff, git, sbeller, dstolee

> Teach Git to delete the current 'graph_head' file and the commit graph
> it references.

And will it leave other, non-important graph files behind?  Looking at
the code it indeed does.  What is the use case for keeping the
non-important graph files?

> This is a good safety valve if somehow the file is
> corrupted and needs to be recalculated. Since the commit graph is a
> summary of contents already in the ODB, it can be regenerated.

Wouldn't a simple 'git commit-graph --write --update-head' regenerate
it on its own, without cleaning first?  It appears, after running a
few tests, that a corrupt graph file can be recreated without
cleaning, which is great.  However, if graph-head is corrupt, then the
command errors out with 'failed to read graph-head'.  I don't
understand the rationale behind this, it would be overwritten anyway,
and its content is not necessary for recreating the graph.  And
indeed, after commenting out that get_graph_head_hash() call in
cmd_commit_graph() it doesn't want to read my corrupted graph-head
file, and recreates both the graph and graph-head files just fine.

I think the requirement for explicitly cleaning a corrupt graph-head
before re-writing it is just unnecessary complication.

On second thought, what's the point of '--write' without
'--update-head', when consumers (thinking 'log --topo-order ...') will
need the graph-head anyway?  I think '--write' should create a
graph-head without requiring an additional option.

Hmph, another second thought: the word 'head' has a rather specific
meaning in Git, although it's usually capitalized.  Using this word in
options and filenames may lead to confusion, especially the option
'--update-head'.


> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-commit-graph.txt | 16 ++++++++++++++--
>  builtin/commit-graph.c             | 32 +++++++++++++++++++++++++++++++-
>  t/t5318-commit-graph.sh            |  7 ++++++-
>  3 files changed, 51 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index 99ced16ddc..33d6567f11 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -11,6 +11,7 @@ SYNOPSIS
>  [verse]
>  'git commit-graph' --write <options> [--pack-dir <pack_dir>]
>  'git commit-graph' --read <options> [--pack-dir <pack_dir>]
> +'git commit-graph' --clear [--pack-dir <pack_dir>]
>  
>  OPTIONS
>  -------
> @@ -18,16 +19,21 @@ OPTIONS
>  	Use given directory for the location of packfiles, graph-head,
>  	and graph files.
>  
> +--clear::
> +	Delete the graph-head file and the graph file it references.
> +	(Cannot be combined with --read or --write.)
> +
>  --read::
>  	Read a graph file given by the graph-head file and output basic
> -	details about the graph file. (Cannot be combined with --write.)
> +	details about the graph file. (Cannot be combined with --clear
> +	or --write.)
>  
>  --graph-id::
>  	When used with --read, consider the graph file graph-<oid>.graph.
>  
>  --write::
>  	Write a new graph file to the pack directory. (Cannot be combined
> -	with --read.)
> +	with --clear or --read.)

All these "cannot be combined with --this and --that" remarks make
subcommands more and more appealing.

>  
>  --update-head::
>  	When used with --write, update the graph-head file to point to
> @@ -61,6 +67,12 @@ $ git commit-graph --write --update-head
>  $ git commit-graph --read --graph-hash=<hash>
>  ------------------------------------------------
>  
> +* Delete <dir>/graph-head and the file it references.
> ++
> +------------------------------------------------
> +$ git commit-graph --clear --pack-dir=<dir>
> +------------------------------------------------
> +
>  GIT
>  ---
>  Part of the linkgit:git[1] suite
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index d73cbc907d..4970dec133 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -10,6 +10,7 @@
>  
>  static char const * const builtin_commit_graph_usage[] = {
>  	N_("git commit-graph [--pack-dir <packdir>]"),
> +	N_("git commit-graph --clear [--pack-dir <packdir>]"),
>  	N_("git commit-graph --read [--graph-hash=<hash>]"),
>  	N_("git commit-graph --write [--pack-dir <packdir>] [--update-head]"),
>  	NULL
> @@ -17,6 +18,7 @@ static char const * const builtin_commit_graph_usage[] = {
>  
>  static struct opts_commit_graph {
>  	const char *pack_dir;
> +	int clear;
>  	int read;
>  	const char *graph_hash;
>  	int write;
> @@ -25,6 +27,30 @@ static struct opts_commit_graph {
>  	struct object_id old_graph_hash;
>  } opts;
>  
> +static int graph_clear(void)
> +{
> +	struct strbuf head_path = STRBUF_INIT;
> +	char *old_path;
> +
> +	if (!opts.has_existing)
> +		return 0;
> +
> +	strbuf_addstr(&head_path, opts.pack_dir);
> +	strbuf_addstr(&head_path, "/");
> +	strbuf_addstr(&head_path, "graph-head");

strbuf_addstr(&head_path, "/graph-head")

Although, considering that this is the third place assembling this
path, maybe a helper function would be worth it.
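
A minimal sketch of such a helper, written against plain libc so that it
stands alone. The name get_graph_head_filename() is hypothetical; inside Git
itself, xstrfmt("%s/graph-head", pack_dir) would probably be the more natural
way to build the string.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a newly allocated "<pack_dir>/graph-head"
 * path, or NULL on allocation failure. The caller frees the result.
 */
static char *get_graph_head_filename(const char *pack_dir)
{
	static const char suffix[] = "/graph-head";
	/* sizeof(suffix) already counts the terminating NUL. */
	size_t len = strlen(pack_dir) + sizeof(suffix);
	char *path = malloc(len);

	if (path)
		snprintf(path, len, "%s%s", pack_dir, suffix);
	return path;
}
```

All three places that assemble this path could then share one definition.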

> +	if (remove_path(head_path.buf))
> +		die("failed to remove path %s", head_path.buf);
> +	strbuf_release(&head_path);
> +
> +	old_path = get_commit_graph_filename_hash(opts.pack_dir,
> +						  &opts.old_graph_hash);
> +	if (remove_path(old_path))
> +		die("failed to remove path %s", old_path);
> +	free(old_path);
> +
> +	return 0;
> +}
> +
>  static int graph_read(void)
>  {
>  	struct object_id graph_hash;
> @@ -105,6 +131,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
>  		{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
>  			N_("dir"),
>  			N_("The pack directory to store the graph") },
> +		OPT_BOOL('c', "clear", &opts.clear,
> +			N_("clear graph file and graph-head")),
>  		OPT_BOOL('r', "read", &opts.read,
>  			N_("read graph file")),
>  		OPT_BOOL('w', "write", &opts.write,
> @@ -126,7 +154,7 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
>  			     builtin_commit_graph_options,
>  			     builtin_commit_graph_usage, 0);
>  
> -	if (opts.write + opts.read > 1)
> +	if (opts.write + opts.read + opts.clear > 1)
>  		usage_with_options(builtin_commit_graph_usage,
>  				   builtin_commit_graph_options);
>  
> @@ -139,6 +167,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
>  
>  	opts.has_existing = !!get_graph_head_hash(opts.pack_dir, &opts.old_graph_hash);
>  
> +	if (opts.clear)
> +		return graph_clear();
>  	if (opts.read)
>  		return graph_read();
>  	if (opts.write)
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index d1a23bcdaf..6e3b62b754 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -101,6 +101,11 @@ test_expect_success 'write graph with merges' \
>       _graph_read_expect "18" "${packdir}" &&
>       cmp expect output'
>  
> +test_expect_success 'clear graph' \
> +    'git commit-graph --clear &&
> +     test_path_is_missing ${packdir}/graph-${graph2}.graph &&
> +     test_path_is_missing ${packdir}/graph-head'
> +
>  test_expect_success 'setup bare repo' \
>      'cd .. &&
>       git clone --bare full bare &&
> @@ -110,7 +115,7 @@ test_expect_success 'setup bare repo' \
>       baredir="./objects/pack"'
>  
>  test_expect_success 'write graph in bare repo' \
> -    'graphbare=$(git commit-graph --write) &&
> +    'graphbare=$(git commit-graph --write --update-head) &&

This should have been done in the previous patch.

>       test_path_is_file ${baredir}/graph-${graphbare}.graph &&
>       test_path_is_file ${baredir}/graph-head &&
>       echo ${graphbare} >expect &&
> -- 
> 2.16.0.15.g9c3cf44.dirty



* Re: [PATCH v2 09/14] commit-graph: teach git-commit-graph --delete-expired
  2018-01-30 21:39 ` [PATCH v2 09/14] commit-graph: teach git-commit-graph --delete-expired Derrick Stolee
@ 2018-02-02 15:04   ` SZEDER Gábor
  0 siblings, 0 replies; 146+ messages in thread
From: SZEDER Gábor @ 2018-02-02 15:04 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, git, gitster, peff, git, sbeller, dstolee

> Teach git-commit-graph to delete the graph previously referenced by 'graph_head'
> when writing a new graph file and updating 'graph_head'. This prevents
> data creep by storing a list of useless graphs. Be careful to not delete
> the graph if the file did not change.

We have to be careful with deleting the previously referenced graph
file right away after generating the new one.  Consider two processes
running concurrently, one writing new graph files with
'--delete-expired', and the other reading the commit graph, e.g. a
future graph-aware 'git gc' and 'git log --topo-order':

  1. 'log' reads the hash of the graph file from graph-head.
  2. 'gc' writes the new graph and graph head files and deletes the
     old graph file.
  3. 'log' tries to open the graph file with the hash it just
     read, but that file is already gone.

At this point 'log' could simply error out, but that would be rather
unfriendly.  Or it could try harder and could just ignore the missing
graph file and walk revisions the old school way.  It would be slower,
depending on the history size maybe much slower, but it would work.
Good.
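
A minimal sketch of the "try harder" reader, assuming a hypothetical
open_graph_gently() (not part of the patch) that tells "the graph file is
gone" apart from other failures, so that the caller can quietly fall back to
the regular revision walk in the first case:

```c
#include <errno.h>
#include <fcntl.h>

/*
 * Hypothetical helper: open a graph file for reading.
 * Returns the file descriptor (>= 0) on success, -1 if the file no
 * longer exists (the caller should fall back to parsing commits from
 * the object database), and -2 on any other error.
 */
static int open_graph_gently(const char *path)
{
	int fd = open(path, O_RDONLY);

	if (fd >= 0)
		return fd;
	return errno == ENOENT ? -1 : -2;
}
```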

However, in addition to the reader trying harder, I think we should
also consider making the writer more careful, too, and only delete a
stale graph file after a certain grace period has elapsed, similar to
how 'git gc' only deletes old loose objects.  And then perhaps it
should delete all graph files that are older than that grace period;
as it is, neither '--clear' nor '--delete-expired' seem to care about
graph files that aren't or weren't referenced by the graph-head.
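
To make the grace period idea concrete, here is a minimal sketch using plain
POSIX calls. Both function names and the use of the file's mtime as its "age"
are assumptions for illustration, not something the patch provides.

```c
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical: 1 if a file last modified at 'mtime' is past the grace period. */
static int graph_file_expired(time_t mtime, time_t now, time_t grace)
{
	/* A timestamp in the future is never treated as expired. */
	return mtime <= now && now - mtime > grace;
}

/*
 * Hypothetical: delete 'path' only if it is older than the grace period.
 * Returns 1 if the file was removed, 0 if it was kept, -1 on error.
 */
static int prune_graph_file(const char *path, time_t now, time_t grace)
{
	struct stat st;

	if (stat(path, &st))
		return -1;
	if (!graph_file_expired(st.st_mtime, now, grace))
		return 0;
	return unlink(path) ? -1 : 1;
}
```

A writer could run prune_graph_file() over every graph-*.graph file in the
pack directory except the one it has just written.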


> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index 4970dec133..766f09e6fc 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c

> @@ -121,6 +122,17 @@ static int graph_write(void)
>  	if (graph_hash)
>  		printf("%s\n", oid_to_hex(graph_hash));
>  
> +
> +	if (opts.delete_expired && opts.update_head && opts.has_existing &&
> +	    oidcmp(graph_hash, &opts.old_graph_hash)) {
> +		char *old_path = get_commit_graph_filename_hash(opts.pack_dir,
> +								&opts.old_graph_hash);
> +		if (remove_path(old_path))
> +			die("failed to remove path %s", old_path);
> +
> +		free(old_path);
> +	}
> +
>  	free(graph_hash);
>  	return 0;
>  }
> @@ -139,6 +151,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
>  			N_("write commit graph file")),
>  		OPT_BOOL('u', "update-head", &opts.update_head,
>  			N_("update graph-head to written graph file")),
> +		OPT_BOOL('d', "delete-expired", &opts.delete_expired,
> +			N_("delete expired head graph file")),
>  		{ OPTION_STRING, 'H', "graph-hash", &opts.graph_hash,
>  			N_("hash"),
>  			N_("A hash for a specific graph file in the pack-dir."),

Like '--update-head', '--delete-expired' is silently ignored when it's
not used with '--write'.
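
One way to reject such combinations instead of ignoring them, as a minimal
sketch: struct graph_opts is a stand-in for the flags in 'opts', and
graph_opts_valid() is a hypothetical name.

```c
/* Stand-in for the option flags kept in 'opts' in builtin/commit-graph.c. */
struct graph_opts {
	int clear;
	int read;
	int write;
	int update_head;
	int delete_expired;
};

/*
 * Hypothetical check: return 1 if the combination is usable, 0 if the
 * caller should print the usage message instead.
 */
static int graph_opts_valid(const struct graph_opts *o)
{
	/* At most one mode may be selected, as in the patch. */
	if (o->clear + o->read + o->write > 1)
		return 0;
	/* These modifiers only make sense together with --write. */
	if ((o->update_head || o->delete_expired) && !o->write)
		return 0;
	return 1;
}
```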


> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 6e3b62b754..b56a6d4217 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh

> +test_expect_success 'write graph with merges' \
> +    'graph3=$(git commit-graph --write --update-head --delete-expired) &&
> +     test_path_is_file ${packdir}/graph-${graph3}.graph &&
> +     test_path_is_missing ${packdir}/graph-${graph2}.graph &&
> +     test_path_is_file ${packdir}/graph-${graph1}.graph &&
> +     test_path_is_file ${packdir}/graph-head &&
> +     echo ${graph3} >expect &&
> +     cmp -n 40 expect ${packdir}/graph-head &&

printf and test_cmp.

> +     git commit-graph --read --graph-hash=${graph3} >output &&
> +     _graph_read_expect "23" "${packdir}" &&
> +     cmp expect output'
> +
> +test_expect_success 'write graph with nothing new' \
> +    'graph4=$(git commit-graph --write --update-head --delete-expired) &&
> +     test_path_is_file ${packdir}/graph-${graph4}.graph &&
> +     test_path_is_file ${packdir}/graph-${graph1}.graph &&
> +     test_path_is_file ${packdir}/graph-head &&
> +     echo ${graph4} >expect &&
> +     cmp -n 40 expect ${packdir}/graph-head &&

Likewise.

> +     git commit-graph --read --graph-hash=${graph4} >output &&
> +     _graph_read_expect "23" "${packdir}" &&
> +     cmp expect output'
> +
>  test_expect_success 'clear graph' \
>      'git commit-graph --clear &&
>       test_path_is_missing ${packdir}/graph-${graph2}.graph &&
> +     test_path_is_file ${packdir}/graph-${graph1}.graph &&
>       test_path_is_missing ${packdir}/graph-head'
>  
>  test_expect_success 'setup bare repo' \
> @@ -121,7 +185,7 @@ test_expect_success 'write graph in bare repo' \
>       echo ${graphbare} >expect &&
>       cmp -n 40 expect ${baredir}/graph-head &&
>       git commit-graph --read --graph-hash=${graphbare} >output &&
> -     _graph_read_expect "18" "${baredir}" &&
> +     _graph_read_expect "23" "${baredir}" &&
>       cmp expect output'
>  
>  test_done
> -- 
> 2.16.0.15.g9c3cf44.dirty




* Re: [PATCH v2 04/14] commit-graph: implement construct_commit_graph()
  2018-01-30 21:39 ` [PATCH v2 04/14] commit-graph: implement construct_commit_graph() Derrick Stolee
  2018-02-01 22:23   ` Jonathan Tan
  2018-02-01 23:46   ` SZEDER Gábor
@ 2018-02-02 15:32   ` SZEDER Gábor
  2018-02-05 16:06     ` Derrick Stolee
  2 siblings, 1 reply; 146+ messages in thread
From: SZEDER Gábor @ 2018-02-02 15:32 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, git, gitster, peff, git, sbeller, dstolee


> Teach Git to write a commit graph file by checking all packed objects
> to see if they are commits, then store the file in the given pack
> directory.

I'm afraid that scanning all packed objects is a bit of a roundabout
way to approach this.

In my git repo, with 9 pack files at the moment, i.e. not that big a
repo and not that many pack files:

  $ time ./git commit-graph --write --update-head
  4df41a3d1cc408b7ad34bea87b51ec4ccbf4b803

  real    0m27.550s
  user    0m27.113s
  sys     0m0.376s

In comparison, performing a good old revision walk to gather all the
info that is written into the graph file:

  $ time git log --all --topo-order --format='%H %T %P %cd' |wc -l
  52954

  real    0m0.903s
  user    0m0.972s
  sys     0m0.058s



> +char* get_commit_graph_filename_hash(const char *pack_dir,
> +				     struct object_id *hash)
> +{
> +	size_t len;
> +	struct strbuf head_path = STRBUF_INIT;
> +	strbuf_addstr(&head_path, pack_dir);
> +	strbuf_addstr(&head_path, "/graph-");
> +	strbuf_addstr(&head_path, oid_to_hex(hash));
> +	strbuf_addstr(&head_path, ".graph");

Nit: this is assembling the path of a graph file, not that of a
graph-head, so the strbuf should be renamed accordingly.

> +
> +	return strbuf_detach(&head_path, &len);
> +}



* Re: [PATCH v2 10/14] commit-graph: add core.commitgraph setting
  2018-01-30 21:39 ` [PATCH v2 10/14] commit-graph: add core.commitgraph setting Derrick Stolee
  2018-01-31 22:44   ` Igor Djordjevic
@ 2018-02-02 16:01   ` SZEDER Gábor
  1 sibling, 0 replies; 146+ messages in thread
From: SZEDER Gábor @ 2018-02-02 16:01 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, git, gitster, peff, git, sbeller, dstolee

> The commit graph feature is controlled by the new core.commitgraph config
> setting. This defaults to 0, so the feature is opt-in.
> 
> The intention of core.commitgraph is that a user can always stop checking
> for or parsing commit graph files if core.commitgraph=0.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/config.txt | 3 +++
>  cache.h                  | 1 +
>  config.c                 | 5 +++++
>  environment.c            | 1 +
>  4 files changed, 10 insertions(+)

Please squash this in to keep the completion script up to date.


  -- >8 --

diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash
index 3683c772c..53880f627 100644
--- a/contrib/completion/git-completion.bash
+++ b/contrib/completion/git-completion.bash
@@ -2419,6 +2419,7 @@ _git_config ()
 		core.bigFileThreshold
 		core.checkStat
 		core.commentChar
+		core.commitGraph
 		core.compression
 		core.createObject
 		core.deltaBaseCacheLimit



* Re: [PATCH v2 05/14] commit-graph: implement git-commit-graph --write
  2018-02-01 23:33   ` Jonathan Tan
@ 2018-02-02 18:36     ` Stefan Beller
  2018-02-02 22:48       ` Junio C Hamano
  0 siblings, 1 reply; 146+ messages in thread
From: Stefan Beller @ 2018-02-02 18:36 UTC (permalink / raw)
  To: Jonathan Tan
  Cc: Derrick Stolee, git, Junio C Hamano, Jeff King, Jeff Hostetler,
	Derrick Stolee

On Thu, Feb 1, 2018 at 3:33 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
> On Tue, 30 Jan 2018 16:39:34 -0500
> Derrick Stolee <stolee@gmail.com> wrote:
>
>> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
>> index c8ea548dfb..3f3790d9a8 100644
>> --- a/Documentation/git-commit-graph.txt
>> +++ b/Documentation/git-commit-graph.txt
>> @@ -5,3 +5,21 @@ NAME
>>  ----
>>  git-commit-graph - Write and verify Git commit graphs (.graph files)
>>
>> +
>> +SYNOPSIS
>> +--------
>> +[verse]
>> +'git commit-graph' --write <options> [--pack-dir <pack_dir>]
>
> Subcommands (like those in git submodule) generally don't take "--", as
> far as I know.

Then you know only the ugly side of Git. ;)

It is true for git-submodule and a few others (the minority of commands IIRC)
git-tag for example takes subcommands such as --list or --verify.
https://public-inbox.org/git/xmqqiomodkt9.fsf@gitster.dls.corp.google.com/


* Re: [PATCH v2 05/14] commit-graph: implement git-commit-graph --write
  2018-02-02 18:36     ` Stefan Beller
@ 2018-02-02 22:48       ` Junio C Hamano
  2018-02-03  1:58         ` Derrick Stolee
  0 siblings, 1 reply; 146+ messages in thread
From: Junio C Hamano @ 2018-02-02 22:48 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Jonathan Tan, Derrick Stolee, git, Jeff King, Jeff Hostetler,
	Derrick Stolee

Stefan Beller <sbeller@google.com> writes:

> It is true for git-submodule and a few others (the minority of commands IIRC)
> git-tag for example takes subcommands such as --list or --verify.
> https://public-inbox.org/git/xmqqiomodkt9.fsf@gitster.dls.corp.google.com/

Thanks.  It refers to an article at gmane, which is not readily
accessible unless you use newsreader.  The original discussion it
refers to appears at:

    https://public-inbox.org/git/7vbo5itjfl.fsf@alter.siamese.dyndns.org/

for those who are interested.

I am still not sure if it is a good design to add a new command like
this series does, though.  I would naively have expected that this
would be a new pack index format that is produced by pack-objects
and index-pack, for example, in which case its maintenance would
almost be invisible to end users (i.e. just like how the pack bitmap
feature was added to the system).




* Re: [PATCH v2 05/14] commit-graph: implement git-commit-graph --write
  2018-02-02 22:48       ` Junio C Hamano
@ 2018-02-03  1:58         ` Derrick Stolee
  2018-02-03  9:28           ` Jeff King
  0 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-03  1:58 UTC (permalink / raw)
  To: Junio C Hamano, Stefan Beller
  Cc: Jonathan Tan, git, Jeff King, Jeff Hostetler, Derrick Stolee, szeder.dev

On 2/2/2018 5:48 PM, Junio C Hamano wrote:
> Stefan Beller <sbeller@google.com> writes:
> 
>> It is true for git-submodule and a few others (the minority of commands IIRC)
>> git-tag for example takes subcommands such as --list or --verify.
>> https://public-inbox.org/git/xmqqiomodkt9.fsf@gitster.dls.corp.google.com/
> 
> Thanks.  It refers to an article at gmane, which is not readily
> accessible unless you use newsreader.  The original discussion it
> refers to appears at:
> 
>      https://public-inbox.org/git/7vbo5itjfl.fsf@alter.siamese.dyndns.org/
> 
> for those who are interested.

Thanks for the links.

> I am still not sure if it is a good design to add a new command like
> this series does, though.  I would naively have expected that this
> would be a new pack index format that is produced by pack-objects
> and index-pack, for example, in which case its maintenance would
> almost be invisible to end users (i.e. just like how the pack bitmap
> feature was added to the system).

I agree that the medium-term goal is to have this happen without user 
intervention. Something like a "core.autoCommitGraph" setting to trigger 
commit-graph writes during other cleanup activities, such as a repack or 
a gc.

I don't think pairing this with pack-objects or index-pack is a good 
direction, because the commit graph is not locked into a packfile the 
way the bitmap is. In fact, the entire ODB could be replaced 
independently and the graph is still valid (the commits in the graph may 
no longer have their paired commits in the ODB due to a GC; you should 
never navigate to those commits without having a ref pointing to them, 
so this is not immediately a problem).

This sort of interaction with GC is one reason why I did not include the 
automatic updates in this patch. The integration with existing 
maintenance tasks will be worth discussion in its own right. I'd rather 
demonstrate the value of having a graph (even if it is currently 
maintained manually) and then follow up with a focus to integrate with 
repack, gc, etc.

I plan to clean up this patch on Monday given the feedback I received 
the last two days (Thanks Jonathan and Szeder!). However, if the current 
builtin design will block merging, then I'll wait until we can find one 
that works.

Thanks,
-Stolee


* Re: [PATCH v2 05/14] commit-graph: implement git-commit-graph --write
  2018-02-03  1:58         ` Derrick Stolee
@ 2018-02-03  9:28           ` Jeff King
  2018-02-05 18:48             ` Junio C Hamano
  0 siblings, 1 reply; 146+ messages in thread
From: Jeff King @ 2018-02-03  9:28 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Stefan Beller, Jonathan Tan, git, Jeff Hostetler,
	Derrick Stolee, szeder.dev

On Fri, Feb 02, 2018 at 08:58:52PM -0500, Derrick Stolee wrote:

> I don't think pairing this with pack-objects or index-pack is a good
> direction, because the commit graph is not locked into a packfile the way
> the bitmap is. In fact, the entire ODB could be replaced independently and
> the graph is still valid (the commits in the graph may no longer have their
> paired commits in the ODB due to a GC; you should never navigate to those
> commits without having a ref pointing to them, so this is not immediately a
> problem).

One advantage of tying this to packs is that you can piggy-back on the
.idx to avoid storing object ids a second time. If we imagine that you
use a 32-bit index into the .idx instead, that's a savings of 16 bytes
per object (or more when we switch to a longer hash). You only need to
refer to commits and their root trees, though. So on something like
linux.git, you're talking about 2 * 700k * 16 = 21 megabytes you could
save.
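
As a rough check of that arithmetic, here is a minimal sketch in C. It
assumes 20-byte SHA-1 ids, a 4-byte index replacing each stored id, and
two ids per commit (the commit and its root tree). The function name
oid_bytes_saved is made up for illustration and is not part of any patch.

```c
#include <stdint.h>

/*
 * Bytes saved by replacing each stored object id with a 32-bit index
 * into the .idx. hash_len is the id size in bytes (20 for SHA-1).
 */
static uint64_t oid_bytes_saved(uint64_t num_commits, unsigned hash_len)
{
	const unsigned index_len = 4;      /* assumed 32-bit position */
	const unsigned ids_per_commit = 2; /* commit id + root tree id */

	return num_commits * ids_per_commit * (hash_len - index_len);
}
```

With 700,000 commits this comes to 22,400,000 bytes, which is roughly
21 MiB. A longer hash only increases the per-id saving.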

That may not be worth worrying about too much, compared to the size of
the rest of the data. Disk space is obviously cheap, but I'm more
concerned about working-set size. However, 21 megabytes probably isn't
breaking the bank there these days (and it may even be faster, since the
commit-graph lookups can use the more compact index that contains only
commits, not other objects).

The big advantage of your scheme is that you can update the graph index
without repacking. The traditional advice has been that you should
always do a full repack during a gc (since it gives the most delta
opportunities). So metadata like reachability bitmaps were happy to be tied
to packs, since you're repacking anyway during a gc. But my
understanding is that this doesn't really fly with the Windows
repository, where it's simply so big that you never obtain a single
pack, and just pass around slices of history in pack format.

So I think I'm OK with the direction here of keeping metadata caches
separate from the pack storage.

> This sort of interaction with GC is one reason why I did not include the
> automatic updates in this patch. The integration with existing maintenance
> tasks will be worth discussion in its own right. I'd rather demonstrate the
> value of having a graph (even if it is currently maintained manually) and
> then follow up with a focus to integrate with repack, gc, etc.
> 
> I plan to clean up this patch on Monday given the feedback I received the
> last two days (Thanks Jonathan and Szeder!). However, if the current builtin
> design will block merging, then I'll wait until we can find one that works.

If they're not tied to packs, then I think having a separate builtin
like this is the best approach. It gives you a plumbing command to
experiment with, and it can easily be called from git-gc.

-Peff


* Re: [PATCH v2 04/14] commit-graph: implement construct_commit_graph()
  2018-02-02 15:32   ` SZEDER Gábor
@ 2018-02-05 16:06     ` Derrick Stolee
  2018-02-07 15:08       ` SZEDER Gábor
  0 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-05 16:06 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: git, gitster, peff, git, sbeller, dstolee

On 2/2/2018 10:32 AM, SZEDER Gábor wrote:
>> Teach Git to write a commit graph file by checking all packed objects
>> to see if they are commits, then store the file in the given pack
>> directory.
> I'm afraid that scanning all packed objects is a bit of a roundabout
> way to approach this.
>
> In my git repo, with 9 pack files at the moment, i.e. not that big a
> repo and not that many pack files:
>
>    $ time ./git commit-graph --write --update-head
>    4df41a3d1cc408b7ad34bea87b51ec4ccbf4b803
>
>    real    0m27.550s
>    user    0m27.113s
>    sys     0m0.376s
>
> In comparison, performing a good old revision walk to gather all the
> info that is written into the graph file:
>
>    $ time git log --all --topo-order --format='%H %T %P %cd' |wc -l
>    52954
>
>    real    0m0.903s
>    user    0m0.972s
>    sys     0m0.058s

Two reasons this is in here:

(1) It's easier to get the write implemented this way and add the 
reachable closure later (which I do).

(2) For GVFS, we want to add all commits that arrived in a "prefetch 
pack" to the graph even if we do not have a ref that points to the 
commit yet. We expect many commits to become reachable soon and having 
them in the graph saves a lot of time in merge-base calculations.

So, (1) is for patch simplicity, and (2) is why I want it to be an 
option in the final version. See the --stdin-packs argument later for a 
way to do this incrementally.

I expect almost all users to use the reachable closure method with 
--stdin-commits (and that's how I will integrate automatic updates with 
'fetch', 'repack', and 'gc' in a later patch).
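
To make "reachable closure" concrete, here is a minimal sketch in C over
a toy graph. It is not the code from this series: commits are plain array
indexes, each commit has at most two parents, and the names toy_commit
and reachable_closure are assumptions for illustration only.

```c
#include <stdlib.h>

#define MAX_PARENTS 2

/* Toy commit: parents are indexes into the same array, -1 means "none". */
struct toy_commit {
	int parents[MAX_PARENTS];
};

/*
 * Mark every commit reachable from tips[0..nr_tips) in seen[], which
 * must be zero-initialized and hold nr_commits entries. Returns the
 * number of commits in the closure, or -1 on allocation failure.
 */
static int reachable_closure(const struct toy_commit *commits, int nr_commits,
			     const int *tips, int nr_tips, char *seen)
{
	/* Each commit is pushed at most once, so nr_commits slots suffice. */
	int *stack = malloc(sizeof(*stack) * (nr_commits > 0 ? nr_commits : 1));
	int top = 0, count = 0, i;

	if (!stack)
		return -1;

	for (i = 0; i < nr_tips; i++) {
		if (tips[i] < 0 || tips[i] >= nr_commits || seen[tips[i]])
			continue;
		seen[tips[i]] = 1;
		stack[top++] = tips[i];
	}

	while (top > 0) {
		int cur = stack[--top];

		count++;
		for (i = 0; i < MAX_PARENTS; i++) {
			int p = commits[cur].parents[i];

			if (p < 0 || p >= nr_commits || seen[p])
				continue;
			seen[p] = 1;
			stack[top++] = p;
		}
	}

	free(stack);
	return count;
}
```

The cost is proportional to the commits actually reachable from the
given tips, rather than to the number of objects in all packs.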

>
>> +char* get_commit_graph_filename_hash(const char *pack_dir,
>> +				     struct object_id *hash)
>> +{
>> +	size_t len;
>> +	struct strbuf head_path = STRBUF_INIT;
>> +	strbuf_addstr(&head_path, pack_dir);
>> +	strbuf_addstr(&head_path, "/graph-");
>> +	strbuf_addstr(&head_path, oid_to_hex(hash));
>> +	strbuf_addstr(&head_path, ".graph");
> Nit: this is assembling the path of a graph file, not that of a
> graph-head, so the strbuf should be renamed accordingly.
>
>> +
>> +	return strbuf_detach(&head_path, &len);
>> +}



* Re: [PATCH v2 05/14] commit-graph: implement git-commit-graph --write
  2018-02-01 23:48   ` SZEDER Gábor
@ 2018-02-05 18:07     ` Derrick Stolee
  0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-05 18:07 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: git, gitster, peff, git, sbeller, dstolee

On 2/1/2018 6:48 PM, SZEDER Gábor wrote:
>> Teach git-commit-graph to write graph files. Create new test script to verify
>> this command succeeds without failure.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   Documentation/git-commit-graph.txt | 18 +++++++
>>   builtin/commit-graph.c             | 30 ++++++++++++
>>   t/t5318-commit-graph.sh            | 96 ++++++++++++++++++++++++++++++++++++++
>>   3 files changed, 144 insertions(+)
>>   create mode 100755 t/t5318-commit-graph.sh
>>
>> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
>> index c8ea548dfb..3f3790d9a8 100644
>> --- a/Documentation/git-commit-graph.txt
>> +++ b/Documentation/git-commit-graph.txt
>> @@ -5,3 +5,21 @@ NAME
>>   ----
>>   git-commit-graph - Write and verify Git commit graphs (.graph files)
>>   
>> +
>> +SYNOPSIS
>> +--------
>> +[verse]
>> +'git commit-graph' --write <options> [--pack-dir <pack_dir>]
>> +
> What do these options do and what is the command's output?  IOW, an
> 'OPTIONS' section would be nice.
>
>> +EXAMPLES
>> +--------
>> +
>> +* Write a commit graph file for the packed commits in your local .git folder.
>> ++
>> +------------------------------------------------
>> +$ git commit-graph --write
>> +------------------------------------------------
>> +
>> +GIT
>> +---
>> +Part of the linkgit:git[1] suite
>> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
>> new file mode 100755
>> index 0000000000..6bcd1cc264
>> --- /dev/null
>> +++ b/t/t5318-commit-graph.sh
>> @@ -0,0 +1,96 @@
>> +#!/bin/sh
>> +
>> +test_description='commit graph'
>> +. ./test-lib.sh
>> +
>> +test_expect_success 'setup full repo' \
>> +    'rm -rf .git &&
>> +     mkdir full &&
>> +     cd full &&
>> +     git init &&
>> +     git config core.commitgraph true &&
>> +     git config pack.threads 1 &&
> Does this pack.threads=1 make a difference?
>
>> +     packdir=".git/objects/pack"'
> We tend to put single quotes around tests like this:
>
>    test_expect_success 'setup full repo' '
>          do-this &&
>          check-that
>    '
>
> This is not a mere style nit: those newlines before and after the test
> block make the test's output with '--verbose-log' slightly more
> readable.
>
> Furthermore, we prefer tabs for indentation.

Oops! My bad for using t5302-pack-index.sh as my model for creating test 
scripts. It's pretty old, but I do see some of the newer tests using 
this newer style.

> Finally, 'cd'-ing around such that it affects subsequent tests is
> usually frowned upon.  However, in this particular case (going into
> one repo, doing a bunch of tests there, then going into another repo,
> and doing another bunch of tests) I think it's better than changing
> directory in a subshell in every single test.
>
>> +
>> +test_expect_success 'write graph with no packs' \
>> +    'git commit-graph --write --pack-dir .'
>> +
>> +test_expect_success 'create commits and repack' \
>> +    'for i in $(test_seq 5)
>> +     do
>> +        echo $i >$i.txt &&
>> +        git add $i.txt &&
>> +        git commit -m "commit $i" &&
>> +        git branch commits/$i
>> +     done &&
>> +     git repack'
>> +
>> +test_expect_success 'write graph' \
>> +    'graph1=$(git commit-graph --write) &&
>> +     test_path_is_file ${packdir}/graph-${graph1}.graph'
> Style nit:  those {} around the variable names are unnecessary, but I
> see you use them a lot.
>
>> +
>> +t_expect_success 'Add more commits' \
> This must be test_expect_success.
>
>> +    'git reset --hard commits/3 &&
>> +     for i in $(test_seq 6 10)
>> +     do
>> +        echo $i >$i.txt &&
>> +        git add $i.txt &&
>> +        git commit -m "commit $i" &&
>> +        git branch commits/$i
>> +     done &&
>> +     git reset --hard commits/3 &&
>> +     for i in $(test_seq 11 15)
>> +     do
>> +        echo $i >$i.txt &&
>> +        git add $i.txt &&
>> +        git commit -m "commit $i" &&
>> +        git branch commits/$i
>> +     done &&
>> +     git reset --hard commits/7 &&
>> +     git merge commits/11 &&
>> +     git branch merge/1 &&
>> +     git reset --hard commits/8 &&
>> +     git merge commits/12 &&
>> +     git branch merge/2 &&
>> +     git reset --hard commits/5 &&
>> +     git merge commits/10 commits/15 &&
>> +     git branch merge/3 &&
>> +     git repack'
>> +
>> +# Current graph structure:
>> +#
>> +#      M3
>> +#     / |\_____
>> +#    / 10      15
>> +#   /   |      |
>> +#  /    9 M2   14
>> +# |     |/  \  |
>> +# |     8 M1 | 13
>> +# |     |/ | \_|
>> +# 5     7  |   12
>> +# |     |   \__|
>> +# 4     6      11
>> +# |____/______/
>> +# 3
>> +# |
>> +# 2
>> +# |
>> +# 1
>> +
>> +test_expect_success 'write graph with merges' \
>> +    'graph2=$(git commit-graph --write) &&
>> +     test_path_is_file ${packdir}/graph-${graph2}.graph'
>> +
>> +test_expect_success 'setup bare repo' \
>> +    'cd .. &&
>> +     git clone --bare full bare &&
>> +     cd bare &&
>> +     git config core.graph true &&
>> +     git config pack.threads 1 &&
>> +     baredir="objects/pack"'
>> +
>> +test_expect_success 'write graph in bare repo' \
>> +    'graphbare=$(git commit-graph --write) &&
>> +     test_path_is_file ${baredir}/graph-${graphbare}.graph'
>> +
>> +test_done
>> -- 
>> 2.16.0.15.g9c3cf44.dirty
>



* Re: [PATCH v2 05/14] commit-graph: implement git-commit-graph --write
  2018-02-03  9:28           ` Jeff King
@ 2018-02-05 18:48             ` Junio C Hamano
  2018-02-06 18:55               ` Derrick Stolee
  0 siblings, 1 reply; 146+ messages in thread
From: Junio C Hamano @ 2018-02-05 18:48 UTC (permalink / raw)
  To: Jeff King
  Cc: Derrick Stolee, Stefan Beller, Jonathan Tan, git, Jeff Hostetler,
	Derrick Stolee, szeder.dev

Jeff King <peff@peff.net> writes:

> The big advantage of your scheme is that you can update the graph index
> without repacking. The traditional advice has been that you should
> always do a full repack during a gc (since it gives the most delta
> opportunities). So metadata like reachability bitmaps were happy to be tied
> to packs, since you're repacking anyway during a gc. But my
> understanding is that this doesn't really fly with the Windows
> repository, where it's simply so big that you never obtain a single
> pack, and just pass around slices of history in pack format.
>
> So I think I'm OK with the direction here of keeping metadata caches
> separate from the pack storage.

OK.  I guess that the topology information surviving repacking is a
reason good enough to keep this separate from pack files, and I
agree with your "If they're not tied to packs,..." below, too.

Thanks.

> If they're not tied to packs, then I think having a separate builtin
> like this is the best approach. It gives you a plumbing command to
> experiment with, and it can easily be called from git-gc.
>
> -Peff


* Re: [PATCH v2 06/14] commit-graph: implement git-commit-graph --read
  2018-02-02  0:23   ` Jonathan Tan
@ 2018-02-05 19:29     ` Derrick Stolee
  0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-05 19:29 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, gitster, peff, git, sbeller, dstolee

On 2/1/2018 7:23 PM, Jonathan Tan wrote:
> On Tue, 30 Jan 2018 16:39:35 -0500
> Derrick Stolee <stolee@gmail.com> wrote:
>
>> Teach git-commit-graph to read commit graph files and summarize their contents.
> One overall question - is the "read" command meant to be a command used
> by the end user, or is it here just to test that some aspects of reading
> works? If the former, I'm not sure how useful it is. And if the latter,
> I think that it is more useful to just implement something that reads
> it, then make the 11/14 change (modifying parse_commit_gently) and
> include a perf test to show that your commit graph reading is both
> correct and (performance-)effective.

The "read" command is intended for use with the tests to verify that the 
different --write commands write the correct number of commits at a 
time. For example, we can verify that the closure under reachability 
works when using --stdin-commits, that we get every commit in a pack 
when using --stdin-packs, and that we don't get more commits than we should.

It doesn't serve much purpose on the user-facing side, but this is 
intended to be a plumbing command that is called by other porcelain 
commands.


* Re: [PATCH v2 07/14] commit-graph: implement git-commit-graph --update-head
  2018-02-02  1:35   ` SZEDER Gábor
@ 2018-02-05 21:01     ` Derrick Stolee
  0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-05 21:01 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: git, gitster, peff, git, sbeller, dstolee

On 2/1/2018 8:35 PM, SZEDER Gábor wrote:
>> It is possible to have multiple commit graph files in a pack directory,
>> but only one is important at a time. Use a 'graph_head' file to point
>> to the important file.
> This implies that all those other files are ignored, right?

Yes. We do not use directory listings to find graph files.

>
>> Teach git-commit-graph to write 'graph_head' upon
>> writing a new commit graph file.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   Documentation/git-commit-graph.txt | 34 ++++++++++++++++++++++++++++++++++
>>   builtin/commit-graph.c             | 38 +++++++++++++++++++++++++++++++++++---
>>   commit-graph.c                     | 25 +++++++++++++++++++++++++
>>   commit-graph.h                     |  2 ++
>>   t/t5318-commit-graph.sh            | 12 ++++++++++--
>>   5 files changed, 106 insertions(+), 5 deletions(-)
>>
>> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
>> index 09aeaf6c82..99ced16ddc 100644
>> --- a/Documentation/git-commit-graph.txt
>> +++ b/Documentation/git-commit-graph.txt
>> @@ -12,15 +12,49 @@ SYNOPSIS
>>   'git commit-graph' --write <options> [--pack-dir <pack_dir>]
>>   'git commit-graph' --read <options> [--pack-dir <pack_dir>]
>>   
>> +OPTIONS
>> +-------
> Oh, look, the 'OPTIONS' section I missed in the previous patches! ;)
>
> This should be split up and squashed into the previous patches where
> the individual --options are first mentioned.
>
>> +--pack-dir::
>> +	Use given directory for the location of packfiles, graph-head,
>> +	and graph files.
>> +
>> +--read::
>> +	Read a graph file given by the graph-head file and output basic
>> +	details about the graph file. (Cannot be combined with --write.)
> From the output of 'git commit-graph --read' it seems that it's not a
> generally useful option to the user.  Perhaps it should be mentioned
> that it's only intended as a debugging aid?  Or maybe it doesn't
> really matter, because eventually this command will become irrelevant,
> as other commands (clone, fetch, gc) will invoke it automagically...

I'll add some wording to make this clear.

>
>> +--graph-id::
>> +	When used with --read, consider the graph file graph-<oid>.graph.
>> +
>> +--write::
>> +	Write a new graph file to the pack directory. (Cannot be combined
>> +	with --read.)
> I think this should also mention that it prints the generated graph
> file's checksum.
>
>> +
>> +--update-head::
>> +	When used with --write, update the graph-head file to point to
>> +	the written graph file.
> So it should be used with '--write', noted.
>
>>   EXAMPLES
>>   --------
>>   
>> +* Output the hash of the graph file pointed to by <dir>/graph-head.
>> ++
>> +------------------------------------------------
>> +$ git commit-graph --pack-dir=<dir>
>> +------------------------------------------------
>> +
>>   * Write a commit graph file for the packed commits in your local .git folder.
>>   +
>>   ------------------------------------------------
>>   $ git commit-graph --write
>>   ------------------------------------------------
>>   
>> +* Write a graph file for the packed commits in your local .git folder,
>> +* and update graph-head.
>> ++
>> +------------------------------------------------
>> +$ git commit-graph --write --update-head
>> +------------------------------------------------
>> +
>>   * Read basic information from a graph file.
>>   +
>>   ------------------------------------------------
>> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
>> index 218740b1f8..d73cbc907d 100644
>> --- a/builtin/commit-graph.c
>> +++ b/builtin/commit-graph.c
>> @@ -11,7 +11,7 @@
>>   static char const * const builtin_commit_graph_usage[] = {
>>   	N_("git commit-graph [--pack-dir <packdir>]"),
>>   	N_("git commit-graph --read [--graph-hash=<hash>]"),
>> -	N_("git commit-graph --write [--pack-dir <packdir>]"),
>> +	N_("git commit-graph --write [--pack-dir <packdir>] [--update-head]"),
>>   	NULL
>>   };
>>   
>> @@ -20,6 +20,9 @@ static struct opts_commit_graph {
>>   	int read;
>>   	const char *graph_hash;
>>   	int write;
>> +	int update_head;
>> +	int has_existing;
>> +	struct object_id old_graph_hash;
>>   } opts;
>>   
>>   static int graph_read(void)
>> @@ -30,8 +33,8 @@ static int graph_read(void)
>>   
>>   	if (opts.graph_hash && strlen(opts.graph_hash) == GIT_MAX_HEXSZ)
>>   		get_oid_hex(opts.graph_hash, &graph_hash);
>> -	else
>> -		die("no graph hash specified");
>> +	else if (!get_graph_head_hash(opts.pack_dir, &graph_hash))
>> +		die("no graph-head exists");
>>   
>>   	graph_file = get_commit_graph_filename_hash(opts.pack_dir, &graph_hash);
>>   	graph = load_commit_graph_one(graph_file, opts.pack_dir);
>> @@ -62,10 +65,33 @@ static int graph_read(void)
>>   	return 0;
>>   }
>>   
>> +static void update_head_file(const char *pack_dir, const struct object_id *graph_hash)
>> +{
>> +	struct strbuf head_path = STRBUF_INIT;
>> +	int fd;
>> +	struct lock_file lk = LOCK_INIT;
>> +
>> +	strbuf_addstr(&head_path, pack_dir);
>> +	strbuf_addstr(&head_path, "/");
>> +	strbuf_addstr(&head_path, "graph-head");
> strbuf_addstr(&head_path, "/graph-head"); ?
>
>> +
>> +	fd = hold_lock_file_for_update(&lk, head_path.buf, LOCK_DIE_ON_ERROR);
>> +	strbuf_release(&head_path);
>> +
>> +	if (fd < 0)
>> +		die_errno("unable to open graph-head");
>> +
>> +	write_in_full(fd, oid_to_hex(graph_hash), GIT_MAX_HEXSZ);
>> +	commit_lock_file(&lk);
> The new graph-head file will be writable.  All other files in
> .git/objects/pack are created read-only, including graph files.  Just
> pointing it out, but I don't think it's a big deal; other than
> consistency with the permissions of other files I don't have any
> argument for making it read-only.

I don't have strong opinions on the permissions difference, except that 
a graph-<hash>.graph file should not change contents (or the hash will 
be wrong) but a user or external tool could change the graph-head 
contents to point to a different file. I can't think of a case where 
that is important.
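
For readers unfamiliar with the lock-file pattern used by
update_head_file() in the quoted patch, here is a minimal sketch of the
same idea using plain POSIX calls instead of Git's lockfile API. The
function name write_graph_head and the fixed buffer sizes are assumptions
for illustration. Unlike write_in_full(), this sketch treats a short
write as a failure rather than retrying.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write 'hex' to <dir>/graph-head by way of <dir>/graph-head.lock.
 * Returns 0 on success and -1 on failure, including the case where
 * the lock file already exists because another writer holds it.
 */
static int write_graph_head(const char *dir, const char *hex)
{
	char final_path[4096], lock_path[4096 + 8];
	size_t len = strlen(hex);
	ssize_t written;
	int fd, n;

	n = snprintf(final_path, sizeof(final_path), "%s/graph-head", dir);
	if (n < 0 || (size_t)n >= sizeof(final_path))
		return -1;
	snprintf(lock_path, sizeof(lock_path), "%s.lock", final_path);

	/* O_EXCL makes creation fail if the lock file already exists. */
	fd = open(lock_path, O_WRONLY | O_CREAT | O_EXCL, 0666);
	if (fd < 0)
		return -1;

	written = write(fd, hex, len);
	if (close(fd) < 0 || written != (ssize_t)len) {
		unlink(lock_path);
		return -1;
	}

	/* rename() atomically replaces any previous graph-head. */
	if (rename(lock_path, final_path) < 0) {
		unlink(lock_path);
		return -1;
	}
	return 0;
}
```

Because the new contents only become visible through rename(), a reader
sees either the old graph-head or the new one, never a partial write.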


* Re: [PATCH v2 11/14] commit: integrate commit graph with commit parsing
  2018-02-02  1:51   ` Jonathan Tan
@ 2018-02-06 14:53     ` Derrick Stolee
  0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-06 14:53 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, gitster, peff, git, sbeller, dstolee

On 2/1/2018 8:51 PM, Jonathan Tan wrote:
> On Tue, 30 Jan 2018 16:39:40 -0500
> Derrick Stolee <stolee@gmail.com> wrote:
>
>> +/* global storage */
>> +struct commit_graph *commit_graph = 0;
> NULL, not 0.
>
>> +static int bsearch_graph(struct commit_graph *g, struct object_id *oid, uint32_t *pos)
>> +{
>> +	uint32_t last, first = 0;
>> +
>> +	if (oid->hash[0])
>> +		first = ntohl(*(uint32_t*)(g->chunk_oid_fanout + 4 * (oid->hash[0] - 1)));
>> +	last = ntohl(*(uint32_t*)(g->chunk_oid_fanout + 4 * oid->hash[0]));
>> +
>> +	while (first < last) {
>> +		uint32_t mid = first + (last - first) / 2;
>> +		const unsigned char *current;
>> +		int cmp;
>> +
>> +		current = g->chunk_oid_lookup + g->hdr->hash_len * mid;
>> +		cmp = hashcmp(oid->hash, current);
>> +		if (!cmp) {
>> +			*pos = mid;
>> +			return 1;
>> +		}
>> +		if (cmp > 0) {
>> +			first = mid + 1;
>> +			continue;
>> +		}
>> +		last = mid;
>> +	}
>> +
>> +	*pos = first;
>> +	return 0;
>> +}
> This would be better in sha1-lookup.c, something like the reverse of commit
> f1068efefe6d ("sha1_file: drop experimental GIT_USE_LOOKUP search",
> 2017-08-09), except that it can be done using a simple binary search.

I rebased my patch onto your binary search patch, so I'll use that in 
the future.
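
For reference, the fanout-plus-binary-search lookup in the quoted
bsearch_graph() can be sketched in a self-contained way as below. This is
an illustration, not the sha1-lookup.c code: it assumes 20-byte ids,
reads the big-endian fanout entries byte by byte instead of casting to
uint32_t*, and the names get_be32_at and lookup_oid are made up.

```c
#include <stdint.h>
#include <string.h>

#define HASH_LEN 20 /* SHA-1; an assumption for this sketch */

/* Read a 32-bit big-endian value without an unaligned pointer cast. */
static uint32_t get_be32_at(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * fanout: 256 big-endian 32-bit entries; entry i is the number of ids
 *         whose first byte is <= i.
 * table:  ids sorted in memcmp() order, HASH_LEN bytes each.
 * Returns 1 and sets *pos if oid is found; otherwise returns 0 and
 * sets *pos to the position where it would be inserted.
 */
static int lookup_oid(const unsigned char *fanout, const unsigned char *table,
		      const unsigned char *oid, uint32_t *pos)
{
	uint32_t first = 0, last;

	/* The fanout narrows the search to ids sharing the first byte. */
	if (oid[0])
		first = get_be32_at(fanout + 4 * (oid[0] - 1));
	last = get_be32_at(fanout + 4 * oid[0]);

	while (first < last) {
		uint32_t mid = first + (last - first) / 2;
		int cmp = memcmp(oid, table + (size_t)HASH_LEN * mid, HASH_LEN);

		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp > 0)
			first = mid + 1;
		else
			last = mid;
	}

	*pos = first;
	return 0;
}
```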

>
>> +static int full_parse_commit(struct commit *item, struct commit_graph *g,
>> +			     uint32_t pos, const unsigned char *commit_data)
>> +{
>> +	struct object_id oid;
>> +	struct commit *new_parent;
>> +	uint32_t new_parent_pos;
>> +	uint32_t *parent_data_ptr;
>> +	uint64_t date_low, date_high;
>> +	struct commit_list **pptr;
>> +
>> +	item->object.parsed = 1;
>> +	item->graph_pos = pos;
>> +
>> +	hashcpy(oid.hash, commit_data);
>> +	item->tree = lookup_tree(&oid);
>> +
>> +	date_high = ntohl(*(uint32_t*)(commit_data + g->hdr->hash_len + 8)) & 0x3;
>> +	date_low = ntohl(*(uint32_t*)(commit_data + g->hdr->hash_len + 12));
>> +	item->date = (timestamp_t)((date_high << 32) | date_low);
>> +
>> +	pptr = &item->parents;
>> +
>> +	new_parent_pos = ntohl(*(uint32_t*)(commit_data + g->hdr->hash_len));
>> +	if (new_parent_pos == GRAPH_PARENT_NONE)
>> +		return 1;
>> +	get_nth_commit_oid(g, new_parent_pos, &oid);
>> +	new_parent = lookup_commit(&oid);
>> +	if (new_parent) {
>> +		new_parent->graph_pos = new_parent_pos;
>> +		pptr = &commit_list_insert(new_parent, pptr)->next;
>> +	} else {
>> +		die("could not find commit %s", oid_to_hex(&oid));
>> +	}
>> +
>> +	new_parent_pos = ntohl(*(uint32_t*)(commit_data + g->hdr->hash_len + 4));
>> +	if (new_parent_pos == GRAPH_PARENT_NONE)
>> +		return 1;
>> +	if (!(new_parent_pos & GRAPH_LARGE_EDGES_NEEDED)) {
>> +		get_nth_commit_oid(g, new_parent_pos, &oid);
>> +		new_parent = lookup_commit(&oid);
>> +		if (new_parent) {
>> +			new_parent->graph_pos = new_parent_pos;
>> +			pptr = &commit_list_insert(new_parent, pptr)->next;
>> +		} else
>> +			die("could not find commit %s", oid_to_hex(&oid));
>> +		return 1;
>> +	}
>> +
>> +	parent_data_ptr = (uint32_t*)(g->chunk_large_edges + 4 * (new_parent_pos ^ GRAPH_LARGE_EDGES_NEEDED));
>> +	do {
>> +		new_parent_pos = ntohl(*parent_data_ptr);
>> +
>> +		get_nth_commit_oid(g, new_parent_pos & GRAPH_EDGE_LAST_MASK, &oid);
>> +		new_parent = lookup_commit(&oid);
>> +		if (new_parent) {
>> +			new_parent->graph_pos = new_parent_pos & GRAPH_EDGE_LAST_MASK;
>> +			pptr = &commit_list_insert(new_parent, pptr)->next;
>> +		} else
>> +			die("could not find commit %s", oid_to_hex(&oid));
>> +		parent_data_ptr++;
>> +	} while (!(new_parent_pos & GRAPH_LAST_EDGE));
>> +
>> +	return 1;
>> +}
> The part that converts <pointer to parent data> into <struct commit *>
> seems to be duplicated 3 times. Refactor into a function?

Will do.
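
One possible shape for that refactoring, sketched outside of Git's data
structures: decode all parent positions in one place, so the
position-to-commit conversion is written once by the caller. The constant
values below are assumptions (the quoted patch does not show them), the
input words are assumed to be already converted to host byte order, and
decode_parents is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values; the quoted patch does not show the definitions. */
#define GRAPH_PARENT_NONE        0x70000000u
#define GRAPH_LARGE_EDGES_NEEDED 0x80000000u
#define GRAPH_LAST_EDGE          0x80000000u
#define GRAPH_EDGE_LAST_MASK     0x7fffffffu

/*
 * Decode the parent positions of one commit.
 * p1, p2:      the two parent words from the commit data.
 * large_edges: overflow list, consulted only when p2 has the
 *              GRAPH_LARGE_EDGES_NEEDED bit set.
 * Writes at most max positions to out[] and returns how many were
 * written.
 */
static size_t decode_parents(uint32_t p1, uint32_t p2,
			     const uint32_t *large_edges,
			     uint32_t *out, size_t max)
{
	const uint32_t *edge;
	uint32_t word;
	size_t n = 0;

	if (p1 == GRAPH_PARENT_NONE)
		return 0;
	if (n < max)
		out[n++] = p1;

	if (p2 == GRAPH_PARENT_NONE)
		return n;
	if (!(p2 & GRAPH_LARGE_EDGES_NEEDED)) {
		if (n < max)
			out[n++] = p2;
		return n;
	}

	/* Octopus merge: walk the overflow list until the last-edge bit. */
	edge = large_edges + (p2 ^ GRAPH_LARGE_EDGES_NEEDED);
	do {
		word = *edge++;
		if (n < max)
			out[n++] = word & GRAPH_EDGE_LAST_MASK;
	} while (!(word & GRAPH_LAST_EDGE));

	return n;
}
```

A caller would then loop over out[] and do the lookup and list insertion
in a single place.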

>
>> +/**
>> + * Fill 'item' to contain all information that would be parsed by parse_commit_buffer.
>> + */
>> +static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos)
>> +{
>> +	uint32_t new_parent_pos;
>> +	uint32_t *parent_data_ptr;
>> +	const unsigned char *commit_data = g->chunk_commit_data + (g->hdr->hash_len + 16) * pos;
>> +
>> +	new_parent_pos = ntohl(*(uint32_t*)(commit_data + g->hdr->hash_len));
>> +
>> +	if (new_parent_pos == GRAPH_PARENT_MISSING)
>> +		return 0;
>> +
>> +	if (new_parent_pos == GRAPH_PARENT_NONE)
>> +		return full_parse_commit(item, g, pos, commit_data);
>> +
>> +	new_parent_pos = ntohl(*(uint32_t*)(commit_data + g->hdr->hash_len + 4));
>> +
>> +	if (new_parent_pos == GRAPH_PARENT_MISSING)
>> +		return 0;
>> +	if (!(new_parent_pos & GRAPH_LARGE_EDGES_NEEDED))
>> +		return full_parse_commit(item, g, pos, commit_data);
>> +
>> +	new_parent_pos = new_parent_pos ^ GRAPH_LARGE_EDGES_NEEDED;
>> +
>> +	if (new_parent_pos == GRAPH_PARENT_MISSING)
>> +		return 0;
>> +
>> +	parent_data_ptr = (uint32_t*)(g->chunk_large_edges + 4 * new_parent_pos);
>> +	do {
>> +		new_parent_pos = ntohl(*parent_data_ptr);
>> +
>> +		if ((new_parent_pos & GRAPH_EDGE_LAST_MASK) == GRAPH_PARENT_MISSING)
>> +			return 0;
>> +
>> +		parent_data_ptr++;
>> +	} while (!(new_parent_pos & GRAPH_LAST_EDGE));
>> +
>> +	return full_parse_commit(item, g, pos, commit_data);
>> +}
> This function seems to just check for GRAPH_PARENT_MISSING - could that
> check be folded into full_parse_commit() instead? (Then
> full_parse_commit can be renamed to fill_commit_in_graph.)

I'd rather not have a really long method, but I could make the two steps 
their own static methods (one for checking and one for full parsing) to 
make it more clear that there are two steps here.

>
>> @@ -439,9 +656,24 @@ struct object_id *construct_commit_graph(const char *pack_dir)
>>   	char *fname;
>>   	struct commit_list *parent;
>>   
>> +	prepare_commit_graph();
>> +
>>   	oids.num = 0;
>>   	oids.size = 1024;
>> +
>> +	if (commit_graph && oids.size < commit_graph->num_commits)
>> +		oids.size = commit_graph->num_commits;
>> +
>>   	ALLOC_ARRAY(oids.list, oids.size);
>> +
>> +	if (commit_graph) {
>> +		for (i = 0; i < commit_graph->num_commits; i++) {
>> +			oids.list[i] = malloc(sizeof(struct object_id));
>> +			get_nth_commit_oid(commit_graph, i, oids.list[i]);
>> +		}
>> +		oids.num = commit_graph->num_commits;
>> +	}
>> +
>>   	for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
>>   	QSORT(oids.list, oids.num, commit_compare);

This change auto-includes the commits that were in the existing graph 
into the new graph.
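
Since a commit can then appear both in the existing graph and in a pack,
the combined list may contain duplicates. A minimal sketch of sorting and
de-duplicating such a list is below. It assumes 20-byte ids stored back
to back in one buffer, which differs from the array of object_id pointers
in the patch, and the names oid_cmp and sort_and_dedupe are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define HASH_LEN 20 /* SHA-1; an assumption for this sketch */

static int oid_cmp(const void *a, const void *b)
{
	return memcmp(a, b, HASH_LEN);
}

/*
 * Sort nr ids (HASH_LEN bytes each, stored contiguously) and drop
 * duplicates in place. Returns the number of distinct ids.
 */
static size_t sort_and_dedupe(unsigned char *ids, size_t nr)
{
	size_t in, out = 0;

	if (!nr)
		return 0;

	qsort(ids, nr, HASH_LEN, oid_cmp);

	for (in = 1; in < nr; in++) {
		/* After sorting, duplicates are adjacent; skip them. */
		if (!memcmp(ids + out * HASH_LEN, ids + in * HASH_LEN, HASH_LEN))
			continue;
		out++;
		if (out != in)
			memcpy(ids + out * HASH_LEN, ids + in * HASH_LEN, HASH_LEN);
	}

	return out + 1;
}
```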

>>   
>> @@ -525,6 +757,11 @@ struct object_id *construct_commit_graph(const char *pack_dir)
>>   	hashcpy(f_hash->hash, final_hash);
>>   	fname = get_commit_graph_filename_hash(pack_dir, f_hash);
>>   
>> +	if (commit_graph) {
>> +		close_commit_graph(commit_graph);
>> +		FREE_AND_NULL(commit_graph);
>> +	}
>> +
>>   	if (rename(graph_name, fname))
>>   		die("failed to rename %s to %s", graph_name, fname);

This change is necessary if we are going to use --delete-expired, as we 
need to unmap the file before we can delete it. Perhaps it would be 
better to close the graph in the builtin instead so that relationship is 
clearer.

> What is the relation of these changes to construct_commit_graph() to the
> rest of the patch?

(answered above, since the two changes have different purposes)

>> diff --git a/commit-graph.h b/commit-graph.h
>> index 43eb0aec84..05ddbbe165 100644
>> --- a/commit-graph.h
>> +++ b/commit-graph.h
>> @@ -4,6 +4,18 @@
>>   #include "git-compat-util.h"
>>   #include "commit.h"
>>   
>> +/**
>> + * Given a commit struct, try to fill the commit struct info, including:
>> + *  1. tree object
>> + *  2. date
>> + *  3. parents.
>> + *
>> + * Returns 1 if and only if the commit was found in the packed graph.
>> + *
>> + * See parse_commit_buffer() for the fallback after this call.
>> + */
>> +extern int parse_commit_in_graph(struct commit *item);
>> +
>>   extern struct object_id *get_graph_head_hash(const char *pack_dir,
>>   					     struct object_id *hash);
>>   extern char* get_commit_graph_filename_hash(const char *pack_dir,
>> @@ -40,7 +52,13 @@ extern struct commit_graph {
>>   
>>   extern int close_commit_graph(struct commit_graph *g);
>>   
>> -extern struct commit_graph *load_commit_graph_one(const char *graph_file, const char *pack_dir);
>> +extern struct commit_graph *load_commit_graph_one(const char *graph_file,
>> +						  const char *pack_dir);
>> +extern void prepare_commit_graph(void);
>> +
>> +extern struct object_id *get_nth_commit_oid(struct commit_graph *g,
>> +					    uint32_t n,
>> +					    struct object_id *oid);
>>   
>>   extern struct object_id *construct_commit_graph(const char *pack_dir);
> This header file now contains functions for reading the commit graph,
> and functions for writing one. It seems to me that those are (and should
> be) quite disjoint, so it might be better to separate them into two.

This header file provides a unified API surface for interacting with 
commit graphs. I'm not a fan of how other write commands are hidden in 
the builtins (like 'builtin/pack-objects.c' for writing packs). If there 
is a better example of how this split has been done in the root 
directory, I'm happy to consider it.

>
>> -int parse_commit_gently(struct commit *item, int quiet_on_missing)
>> +int parse_commit_internal(struct commit *item, int quiet_on_missing, int check_packed)
>>   {
>>   	enum object_type type;
>>   	void *buffer;
>> @@ -385,6 +386,8 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
>>   		return -1;
>>   	if (item->object.parsed)
>>   		return 0;
>> +	if (check_packed && parse_commit_in_graph(item))
>> +		return 0;
>>   	buffer = read_sha1_file(item->object.oid.hash, &type, &size);
>>   	if (!buffer)
>>   		return quiet_on_missing ? -1 :
>> @@ -404,6 +407,11 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
>>   	return ret;
>>   }
>>   
>> +int parse_commit_gently(struct commit *item, int quiet_on_missing)
>> +{
>> +	return parse_commit_internal(item, quiet_on_missing, 1);
>> +}
> Are you planning to use parse_commit_internal() from elsewhere? (It
> doesn't seem to be the case, at least from this patch series.)

At one point I was using it, but I removed the one caller and forgot to 
clean up.

>
>> diff --git a/log-tree.c b/log-tree.c
>> index fca29d4799..156aed4541 100644
>> --- a/log-tree.c
>> +++ b/log-tree.c
>> @@ -659,8 +659,7 @@ void show_log(struct rev_info *opt)
>>   		show_mergetag(opt, commit);
>>   	}
>>   
>> -	if (!get_cached_commit_buffer(commit, NULL))
>> -		return;
>> +	get_commit_buffer(commit, NULL);
> This undoes an optimization that I discuss in my e-mail message here
> [1]. If we decide to do this, it should at least be called out in the
> commit message.
>
> [1] https://public-inbox.org/git/b88725476d9f13ba4381d85e5fe049f6ef93f621.1506714999.git.jonathantanmy@google.com/

I will call this out more clearly in my commit message next time. My
problem with the existing code is that it doesn't just skip the commit
contents: it also fails to write a newline. I noticed this while testing
'git log --oneline' with the graph enabled, where the output listed
several short-shas on one line.

>
>> +_graph_git_two_modes() {
> No need for the name to start with an underscore, I think.
>
>> +    git -c core.commitgraph=true $1 >output
>> +    git -c core.commitgraph=false $1 >expect
>> +    cmp output expect
> Use test_cmp.
>
>> +}
>> +
>> +_graph_git_behavior() {
>> +    BRANCH=$1
>> +    COMPARE=$2
>> +    test_expect_success 'check normal git operations' \
>> +        '_graph_git_two_modes "log --oneline ${BRANCH}" &&
>> +         _graph_git_two_modes "log --topo-order ${BRANCH}" &&
>> +         _graph_git_two_modes "branch -vv" &&
>> +         _graph_git_two_modes "merge-base -a ${BRANCH} ${COMPARE}"'
>> +}
> This makes it difficult to debug failing tests, since they're all named
> the same. Better to just run the commands inline, and wrap the
> invocations of _graph_git_behavior in an appropriately named
> test_expect_success.

I'll add a parameter that adds a message to each test about the state of 
the repo and graph.

Thanks,
-Stolee


* Re: [PATCH v2 05/14] commit-graph: implement git-commit-graph --write
  2018-02-05 18:48             ` Junio C Hamano
@ 2018-02-06 18:55               ` Derrick Stolee
  0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-06 18:55 UTC (permalink / raw)
  To: Junio C Hamano, Jeff King
  Cc: Stefan Beller, Jonathan Tan, git, Jeff Hostetler, Derrick Stolee,
	szeder.dev

On 2/5/2018 1:48 PM, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
>
>> The big advantage of your scheme is that you can update the graph index
>> without repacking. The traditional advice has been that you should
>> always do a full repack during a gc (since it gives the most delta
>> opportunities). So metadata like reachability bitmaps were happy to be tied
>> to packs, since you're repacking anyway during a gc. But my
>> understanding is that this doesn't really fly with the Windows
>> repository, where it's simply so big that you never obtain a single
>> pack, and just pass around slices of history in pack format.
>>
>> So I think I'm OK with the direction here of keeping metadata caches
>> separate from the pack storage.
> OK.  I guess that the topology information surviving repacking is a
> reason good enough to keep this separate from pack files, and I
> agree with your "If they're not tied to packs,..." below, too.
>
> Thanks.
>
>> If they're not tied to packs, then I think having a separate builtin
>> like this is the best approach. It gives you a plumbing command to
>> experiment with, and it can easily be called from git-gc.
>>
>> -Peff

Thanks for all the advice here. In addition to the many cleanups that 
were suggested, I'm going to take a try at the "subcommand" approach. 
I'll use git-submodule--helper and git-remote as models for my 
implementation.

Thanks,
-Stolee


* Re: [PATCH v2 04/14] commit-graph: implement construct_commit_graph()
  2018-02-05 16:06     ` Derrick Stolee
@ 2018-02-07 15:08       ` SZEDER Gábor
  2018-02-07 15:10         ` Derrick Stolee
  0 siblings, 1 reply; 146+ messages in thread
From: SZEDER Gábor @ 2018-02-07 15:08 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git mailing list, Junio C Hamano, Jeff King, git, Stefan Beller,
	Derrick Stolee

On Mon, Feb 5, 2018 at 5:06 PM, Derrick Stolee <stolee@gmail.com> wrote:
> On 2/2/2018 10:32 AM, SZEDER Gábor wrote:

>> In my git repo, with 9 pack files at the moment, i.e. not that big a
>> repo and not that many pack files:
>>
>>    $ time ./git commit-graph --write --update-head
>>    4df41a3d1cc408b7ad34bea87b51ec4ccbf4b803
>>
>>    real    0m27.550s
>>    user    0m27.113s
>>    sys     0m0.376s
>>
>> In comparison, performing a good old revision walk to gather all the
>> info that is written into the graph file:
>>
>>    $ time git log --all --topo-order --format='%H %T %P %cd' |wc -l
>>    52954
>>
>>    real    0m0.903s
>>    user    0m0.972s
>>    sys     0m0.058s
>
>
> Two reasons this is in here:
>
> (1) It's easier to get the write implemented this way and add the reachable
> closure later (which I do).
>
> (2) For GVFS, we want to add all commits that arrived in a "prefetch pack"
> to the graph even if we do not have a ref that points to the commit yet. We
> expect many commits to become reachable soon and having them in the graph
> saves a lot of time in merge-base calculations.
>
> So, (1) is for patch simplicity, and (2) is why I want it to be an option in
> the final version. See the --stdin-packs argument later for a way to do this
> incrementally.
>
> I expect almost all users to use the reachable closure method with
> --stdin-commits (and that's how I will integrate automatic updates with
> 'fetch', 'repack', and 'gc' in a later patch).

I see.  I was about to ask about the expected use-cases of the
'--stdin-packs' option, considering how much slower it is to enumerate
all objects in pack files, but ran out of time after patch 10.

The run-time using '--stdin-commits' is indeed great:

  $ time { git for-each-ref --format='%(objectname)' refs/heads/ |
      ./git commit-graph --write --update-head --stdin-commits ; }
  82fe9a5cd715ff578f01f7f44e0611d7902d20c8

  real  0m0.985s
  user  0m0.916s
  sys   0m0.024s

Considering the run-time difference, I think in the end it would be a
better default for a plain 'git commit-graph --write' to traverse
history from all refs, and it should enumerate pack files only if
explicitly told so via '--stdin-packs'.

To be clear: I'm not saying that traversing history should already be
the default when introducing construct_commit_graph() and '--write'.  If
enumerating pack files keeps the earlier patches simpler and easier to
review, then by all means stick with it, and only change the
'--stdin-*'-less behavior near the end of the series, when all the
building blocks are already in place (but then mention this in the early
commit messages).


I have also noticed a segfault when feeding non-commit object names to
'--stdin-commits', i.e. when I run the above command without restricting
'git for-each-ref' to branches and it listed object names of tags as
well.

  $ git rev-parse v2.16.1 | ./git commit-graph --write --update-head --stdin-commits
  error: Object eb5fcb24f69e13335cf6a6a1b1d4553fa2b0f202 not a commit
  error: Object eb5fcb24f69e13335cf6a6a1b1d4553fa2b0f202 not a commit
  error: Object eb5fcb24f69e13335cf6a6a1b1d4553fa2b0f202 not a commit
  Segmentation fault

(gdb) bt
#0  __memcpy_avx_unaligned ()
    at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:126
#1  0x00000000004ea97c in sha1write (f=0x356bbf0, buf=0x4, count=20)
    at csum-file.c:104
#2  0x00000000004d98e1 in write_graph_chunk_data (f=0x356bbf0, hash_len=20,
    commits=0x3508de0, nr_commits=50615) at commit-graph.c:506
#3  0x00000000004da9ca in construct_commit_graph (
    pack_dir=0x8ff360 ".git/objects/pack", pack_indexes=0x0, nr_packs=0,
    commit_hex=0x8ff790, nr_commits=1) at commit-graph.c:818
#4  0x000000000044184e in graph_write () at builtin/commit-graph.c:149
#5  0x0000000000441a8c in cmd_commit_graph (argc=0, argv=0x7fffffffe310,
    prefix=0x0) at builtin/commit-graph.c:224
#6  0x0000000000405a0a in run_builtin (p=0x8ad950 <commands+528>, argc=4,
    argv=0x7fffffffe310) at git.c:346
#7  0x0000000000405ce4 in handle_builtin (argc=4, argv=0x7fffffffe310)
    at git.c:555
#8  0x0000000000405ec8 in run_argv (argcp=0x7fffffffe1cc, argv=0x7fffffffe1c0)
    at git.c:607
#9  0x0000000000406079 in cmd_main (argc=4, argv=0x7fffffffe310) at git.c:684
#10 0x00000000004a85c8 in main (argc=5, argv=0x7fffffffe308)
    at common-main.c:43


* Re: [PATCH v2 04/14] commit-graph: implement construct_commit_graph()
  2018-02-07 15:08       ` SZEDER Gábor
@ 2018-02-07 15:10         ` Derrick Stolee
  0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-07 15:10 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Git mailing list, Junio C Hamano, Jeff King, git, Stefan Beller,
	Derrick Stolee

On 2/7/2018 10:08 AM, SZEDER Gábor wrote:
> On Mon, Feb 5, 2018 at 5:06 PM, Derrick Stolee <stolee@gmail.com> wrote:
>> On 2/2/2018 10:32 AM, SZEDER Gábor wrote:
>>> In my git repo, with 9 pack files at the moment, i.e. not that big a
>>> repo and not that many pack files:
>>>
>>>     $ time ./git commit-graph --write --update-head
>>>     4df41a3d1cc408b7ad34bea87b51ec4ccbf4b803
>>>
>>>     real    0m27.550s
>>>     user    0m27.113s
>>>     sys     0m0.376s
>>>
>>> In comparison, performing a good old revision walk to gather all the
>>> info that is written into the graph file:
>>>
>>>     $ time git log --all --topo-order --format='%H %T %P %cd' |wc -l
>>>     52954
>>>
>>>     real    0m0.903s
>>>     user    0m0.972s
>>>     sys     0m0.058s
>>
>> Two reasons this is in here:
>>
>> (1) It's easier to get the write implemented this way and add the reachable
>> closure later (which I do).
>>
>> (2) For GVFS, we want to add all commits that arrived in a "prefetch pack"
>> to the graph even if we do not have a ref that points to the commit yet. We
>> expect many commits to become reachable soon and having them in the graph
>> saves a lot of time in merge-base calculations.
>>
>> So, (1) is for patch simplicity, and (2) is why I want it to be an option in
>> the final version. See the --stdin-packs argument later for a way to do this
>> incrementally.
>>
>> I expect almost all users to use the reachable closure method with
>> --stdin-commits (and that's how I will integrate automatic updates with
>> 'fetch', 'repack', and 'gc' in a later patch).
> I see.  I was about to ask about the expected use-cases of the
> '--stdin-packs' option, considering how much slower it is to enumerate
> all objects in pack files, but ran out of time after patch 10.
>
> The run-time using '--stdin-commits' is indeed great:
>
>    $ time { git for-each-ref --format='%(objectname)' refs/heads/ |
>        ./git commit-graph --write --update-head --stdin-commits ; }
>    82fe9a5cd715ff578f01f7f44e0611d7902d20c8
>
>    real  0m0.985s
>    user  0m0.916s
>    sys   0m0.024s
>
> Considering the run-time difference, I think in the end it would be a
> better default for a plain 'git commit-graph --write' to traverse
> history from all refs, and it should enumerate pack files only if
> explicitly told so via '--stdin-packs'.
>
> To be clear: I'm not saying that traversing history should already be
> the default when introducing construct_commit_graph() and '--write'.  If
> enumerating pack files keeps the earlier patches simpler and easier to
> review, then by all means stick with it, and only change the
> '--stdin-*'-less behavior near the end of the series, when all the
> building blocks are already in place (but then mention this in the early
> commit messages).

I will consider this.

> I have also noticed a segfault when feeding non-commit object names to
> '--stdin-commits', i.e. when I run the above command without restricting
> 'git for-each-ref' to branches and it listed object names of tags as
> well.
>
>    $ git rev-parse v2.16.1 | ./git commit-graph --write --update-head --stdin-commits
>    error: Object eb5fcb24f69e13335cf6a6a1b1d4553fa2b0f202 not a commit
>    error: Object eb5fcb24f69e13335cf6a6a1b1d4553fa2b0f202 not a commit
>    error: Object eb5fcb24f69e13335cf6a6a1b1d4553fa2b0f202 not a commit
>    Segmentation fault
>
> (gdb) bt
> #0  __memcpy_avx_unaligned ()
>      at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:126
> #1  0x00000000004ea97c in sha1write (f=0x356bbf0, buf=0x4, count=20)
>      at csum-file.c:104
> #2  0x00000000004d98e1 in write_graph_chunk_data (f=0x356bbf0, hash_len=20,
>      commits=0x3508de0, nr_commits=50615) at commit-graph.c:506
> #3  0x00000000004da9ca in construct_commit_graph (
>      pack_dir=0x8ff360 ".git/objects/pack", pack_indexes=0x0, nr_packs=0,
>      commit_hex=0x8ff790, nr_commits=1) at commit-graph.c:818
> #4  0x000000000044184e in graph_write () at builtin/commit-graph.c:149
> #5  0x0000000000441a8c in cmd_commit_graph (argc=0, argv=0x7fffffffe310,
>      prefix=0x0) at builtin/commit-graph.c:224
> #6  0x0000000000405a0a in run_builtin (p=0x8ad950 <commands+528>, argc=4,
>      argv=0x7fffffffe310) at git.c:346
> #7  0x0000000000405ce4 in handle_builtin (argc=4, argv=0x7fffffffe310)
>      at git.c:555
> #8  0x0000000000405ec8 in run_argv (argcp=0x7fffffffe1cc, argv=0x7fffffffe1c0)
>      at git.c:607
> #9  0x0000000000406079 in cmd_main (argc=4, argv=0x7fffffffe310) at git.c:684
> #10 0x00000000004a85c8 in main (argc=5, argv=0x7fffffffe308)
>      at common-main.c:43

I noticed this while preparing v3. I have a fix, but you now remind me 
that I need to add tags to the test script.

Thanks,
-Stolee


* [PATCH v3 00/14] Serialized Git Commit Graph
  2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
                   ` (14 preceding siblings ...)
  2018-01-30 21:47 ` [PATCH v2 00/14] Serialized Git Commit Graph Stefan Beller
@ 2018-02-08 20:37 ` Derrick Stolee
  2018-02-08 20:37   ` [PATCH v3 01/14] commit-graph: add format document Derrick Stolee
                     ` (14 more replies)
  15 siblings, 15 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-08 20:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, git, gitster, peff, jonathantanmy, sbeller, szeder.dev

Thanks to everyone who gave comments on v1 and v2.

Hopefully the following points have been addressed:

* Fixed inter-commit problems where certain fixes did not arrive until
  later commits.

* Converted from submode flags ("git commit-graph --write") to
  subcommands ("git commit-graph write").

* Fixed a bug where a non-commit OID would cause a segfault when using
  --stdin-commits. Added a test for an annotated tag.

* Numerous style issues, especially in the test script.

I also based my patches on the branch jt/binsearch-with-fanout to make
use of the bsearch_hash() method.
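
For readers unfamiliar with fanout-assisted lookups, here is a minimal
sketch of the idea. It is not the bsearch_hash() code from
jt/binsearch-with-fanout; fanout_lookup() and HASH_LEN are hypothetical
names, and the fanout values are assumed to be already converted to host
byte order. The fanout entry for the first OID byte narrows the range,
and a plain binary search finishes the job:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_LEN 20 /* assumption: SHA-1 sized OIDs */

/*
 * Hypothetical helper (not Git's bsearch_hash()): look up `oid` in
 * `table`, a sorted array of HASH_LEN-byte OIDs. fanout[i] holds the
 * number of OIDs whose first byte is at most i.
 * Returns 1 and sets *pos on a hit, 0 on a miss.
 */
static int fanout_lookup(const uint32_t fanout[256],
			 const unsigned char *table,
			 const unsigned char *oid,
			 uint32_t *pos)
{
	uint32_t lo, hi;

	assert(fanout && table && oid && pos);

	/* OIDs sharing the first byte of `oid` live in [lo, hi). */
	lo = oid[0] ? fanout[oid[0] - 1] : 0;
	hi = fanout[oid[0]];

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(oid, table + (size_t)mid * HASH_LEN, HASH_LEN);

		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return 0;
}
```

The position returned on a hit is the same integer position that the
graph uses to refer to parents, which is why a single lookup per
starting commit is enough.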

I look forward to your feedback.

Thanks,
-Stolee

-- >8 --

As promised [1], this patch contains a way to serialize the commit graph.
The current implementation defines a new file format to store the graph
structure (parent relationships) and basic commit metadata (commit date,
root tree OID) in order to prevent parsing raw commits while performing
basic graph walks. For example, we do not need to parse the full commit
when performing these walks:

* 'git log --topo-order -1000' walks all reachable commits to avoid
  incorrect topological orders, but only needs the commit message for
  the top 1000 commits.

* 'git merge-base <A> <B>' may walk many commits to find the correct
  boundary between the commits reachable from A and those reachable
  from B. No commit messages are needed.

* 'git branch -vv' checks ahead/behind status for all local branches
  compared to their upstream remote branches. This is essentially as
  hard as computing merge bases for each.

The current patch speeds up these calculations by injecting a check in
parse_commit_gently() to check if there is a graph file and using that
to provide the required metadata to the struct commit.

The file format has room to store generation numbers, which will be
provided as a patch after this framework is merged. Generation numbers
are referenced by the design document but not implemented in order to
make the current patch focus on the graph construction process. Once
that is stable, it will be easier to add generation numbers and make
graph walks aware of generation numbers one-by-one.
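
To make the definition concrete (root commits get generation number 1,
every other commit gets one more than the maximum over its parents, and
0 means "not computed"), here is a rough sketch of how the numbers could
be filled in. This is not code from the series, which does not compute
them yet; struct toy_commit and compute_generation() are hypothetical,
and parents are assumed to be positions into the same array:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define GENERATION_NOT_COMPUTED 0 /* zero is reserved by the format */

/* Hypothetical in-memory commit; parents are positions in the same array. */
struct toy_commit {
	const uint32_t *parents;
	uint32_t nr_parents;
	uint32_t generation;
};

/*
 * Fill in the generation number of commits[start] and of every ancestor
 * that still lacks one. An explicit stack holds the current ancestor
 * chain, so long histories do not overflow the C stack.
 * Returns 0 on success, -1 if the stack cannot be allocated.
 */
static int compute_generation(struct toy_commit *commits, uint32_t nr,
			      uint32_t start)
{
	uint32_t *stack, top = 0;

	assert(start < nr);
	stack = malloc(nr * sizeof(*stack));
	if (!stack)
		return -1;
	stack[top++] = start;

	while (top) {
		struct toy_commit *c = &commits[stack[top - 1]];
		uint32_t i, max_gen = 0;
		int pending = 0;

		if (c->generation != GENERATION_NOT_COMPUTED) {
			top--;
			continue;
		}
		for (i = 0; i < c->nr_parents; i++) {
			const struct toy_commit *p = &commits[c->parents[i]];

			if (p->generation == GENERATION_NOT_COMPUTED) {
				/* Visit this parent first, then retry c. */
				stack[top++] = c->parents[i];
				pending = 1;
				break;
			}
			if (p->generation > max_gen)
				max_gen = p->generation;
		}
		if (!pending) {
			/* A root commit has max_gen == 0, so it gets 1. */
			c->generation = max_gen + 1;
			top--;
		}
	}
	free(stack);
	return 0;
}
```

Because only one uncomputed parent is pushed at a time, the stack never
holds more than one chain of distinct ancestors and so fits in nr
entries.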

Here are some performance results for a copy of the Linux repository
where 'master' has 704,766 reachable commits and is behind 'origin/master'
by 19,610 commits.

| Command                          | Before | After  | Rel % |
|----------------------------------|--------|--------|-------|
| log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
| branch -vv                       |  0.42s |  0.27s | -35%  |
| rev-list --all                   |  6.4s  |  1.0s  | -84%  |
| rev-list --all --objects         | 32.6s  | 27.6s  | -15%  |

To test this yourself, run the following on your repo:

  git config core.commitGraph true
  git show-ref -s | git commit-graph write --update-head --stdin-commits

The second command writes a commit graph file containing every commit
reachable from your refs. Now, all git commands that walk commits will
check your graph first before consulting the ODB. You can run your own
performance comparisons by toggling the 'core.commitgraph' setting.

[1] https://public-inbox.org/git/d154319e-bb9e-b300-7c37-27b1dcd2a2ce@jeffhostetler.com/
    Re: What's cooking in git.git (Jan 2018, #03; Tue, 23)

[2] https://github.com/derrickstolee/git/pull/2
    A GitHub pull request containing the latest version of this patch.

Derrick Stolee (14):
  commit-graph: add format document
  graph: add commit graph design document
  commit-graph: create git-commit-graph builtin
  commit-graph: implement write_commit_graph()
  commit-graph: implement 'git-commit-graph write'
  commit-graph: implement 'git-commit-graph read'
  commit-graph: update graph-head during write
  commit-graph: implement 'git-commit-graph clear'
  commit-graph: implement --delete-expired
  commit-graph: add core.commitGraph setting
  commit: integrate commit graph with commit parsing
  commit-graph: close under reachability
  commit-graph: read only from specific pack-indexes
  commit-graph: build graph from starting commits

 .gitignore                                      |   1 +
 Documentation/config.txt                        |   3 +
 Documentation/git-commit-graph.txt              | 115 ++++
 Documentation/technical/commit-graph-format.txt |  91 +++
 Documentation/technical/commit-graph.txt        | 189 ++++++
 Makefile                                        |   2 +
 alloc.c                                         |   1 +
 builtin.h                                       |   1 +
 builtin/commit-graph.c                          | 335 ++++++++++
 cache.h                                         |   1 +
 command-list.txt                                |   1 +
 commit-graph.c                                  | 828 ++++++++++++++++++++++++
 commit-graph.h                                  |  60 ++
 commit.c                                        |   3 +
 commit.h                                        |   3 +
 config.c                                        |   5 +
 environment.c                                   |   1 +
 git.c                                           |   1 +
 log-tree.c                                      |   3 +-
 packfile.c                                      |   4 +-
 packfile.h                                      |   2 +
 t/t5318-commit-graph.sh                         | 228 +++++++
 22 files changed, 1874 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/git-commit-graph.txt
 create mode 100644 Documentation/technical/commit-graph-format.txt
 create mode 100644 Documentation/technical/commit-graph.txt
 create mode 100644 builtin/commit-graph.c
 create mode 100644 commit-graph.c
 create mode 100644 commit-graph.h
 create mode 100755 t/t5318-commit-graph.sh

-- 
2.7.4



* [PATCH v3 01/14] commit-graph: add format document
  2018-02-08 20:37 ` [PATCH v3 " Derrick Stolee
@ 2018-02-08 20:37   ` Derrick Stolee
  2018-02-08 21:21     ` Junio C Hamano
  2018-02-08 20:37   ` [PATCH v3 02/14] graph: add commit graph design document Derrick Stolee
                     ` (13 subsequent siblings)
  14 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-08 20:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, git, gitster, peff, jonathantanmy, sbeller, szeder.dev

Add document specifying the binary format for commit graphs. This
format allows for:

* New versions.
* New hash functions and hash lengths.
* Optional extensions.

Basic header information is followed by a binary table of contents
into "chunks" that include:

* An ordered list of commit object IDs.
* A 256-entry fanout into that list of OIDs.
* A list of metadata for the commits.
* A list of "large edges" to enable octopus merges.

The format automatically includes two parent positions for every
commit. This favors speed over space, since using only one position
per commit would cause an extra level of indirection for every merge
commit. (Octopus merges suffer from this indirection, but they are
very rare.)

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph-format.txt | 91 +++++++++++++++++++++++++
 1 file changed, 91 insertions(+)
 create mode 100644 Documentation/technical/commit-graph-format.txt

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
new file mode 100644
index 0000000000..349fa0c14c
--- /dev/null
+++ b/Documentation/technical/commit-graph-format.txt
@@ -0,0 +1,91 @@
+Git commit graph format
+=======================
+
+The Git commit graph stores a list of commit OIDs and some associated
+metadata, including:
+
+- The generation number of the commit. Commits with no parents have
+  generation number 1; commits with parents have generation number
+  one more than the maximum generation number of their parents. We
+  reserve zero as special; it can be used to mark a generation
+  number invalid or as "not computed".
+
+- The root tree OID.
+
+- The commit date.
+
+- The parents of the commit, stored using positional references within
+  the graph file.
+
+== graph-*.graph files have the following format:
+
+In order to allow extensions that add extra data to the graph, we organize
+the body into "chunks" and provide a binary lookup table at the beginning
+of the body. The header includes certain values, such as number of chunks,
+hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+  4-byte signature:
+      The signature is: {'C', 'G', 'P', 'H'}
+
+  1-byte version number:
+      Currently, the only valid version is 1.
+
+  1-byte Object Id Version (1 = SHA-1)
+
+  1-byte Object Id Length (H)
+
+  1-byte number (C) of "chunks"
+
+CHUNK LOOKUP:
+
+  (C + 1) * 12 bytes listing the table of contents for the chunks:
+      First 4 bytes describe chunk id. Value 0 is a terminating label.
+      Other 8 bytes provide offset in current file for chunk to start.
+      (Chunks are ordered contiguously in the file, so you can infer
+      the length using the next chunk position if necessary.)
+
+  The remaining data in the body is described one chunk at a time, and
+  these chunks may be given in any order. Chunks are required unless
+  otherwise specified.
+
+CHUNK DATA:
+
+  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+      The ith entry, F[i], stores the number of OIDs with first
+      byte at most i. Thus F[255] stores the total
+      number of commits (N).
+
+  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+      The OIDs for all commits in the graph, sorted in ascending order.
+
+  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
+    * The first H bytes are for the OID of the root tree.
+    * The next 8 bytes are for the int-ids of the first two parents
+      of the ith commit. Stores value 0xffffffff if no parent in that
+      position. If there are more than two parents, the second value
+      has its most-significant bit on and the other bits store an array
+      position into the Large Edge List chunk.
+    * The next 8 bytes store the generation number of the commit and
+      the commit time in seconds since EPOCH. The generation number
+      uses the higher 30 bits of the first 4 bytes, while the commit
+      time uses the 32 bits of the second 4 bytes, along with the lowest
+      2 bits of the lowest byte, storing the 33rd and 34th bit of the
+      commit time.
+
+  Large Edge List (ID: {'E', 'D', 'G', 'E'})
+      This list of 4-byte values stores the second through nth parents for
+      all octopus merges. The second parent value in the commit data stores
+      an array position within this list along with the most-significant bit
+      on. Starting at that array position, iterate through this list of int-ids
+      for the parents until reaching a value with the most-significant bit on.
+      The other bits correspond to the int-id of the last parent. This chunk
+      should always be present, but may be empty.
+
+TRAILER:
+
+	H-byte HASH-checksum of all of the above.
+
-- 
2.15.1.45.g9b7079f
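
The Commit Data layout described in the format document above can be
sketched as a small decoder. This is only one reading of the text, not
code from the series: decode_commit_data(), struct decoded_commit and
the GRAPH_* names are hypothetical. It assumes that the "lowest 2 bits"
holding bits 33 and 34 of the commit time are the low bits of the first
4-byte word of the final 8 bytes, and that 0xffffffff ("no parent") must
be tested before the most-significant-bit check, since that value also
has the bit set:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GRAPH_NO_PARENT 0xffffffffu
#define GRAPH_EDGE_FLAG 0x80000000u /* most-significant bit */

/* Hypothetical decoded form of one Commit Data entry. */
struct decoded_commit {
	uint32_t parent1;     /* int-id, or GRAPH_NO_PARENT */
	uint32_t parent2;     /* int-id, or GRAPH_NO_PARENT (also when octopus) */
	int is_octopus;       /* 1 if parents 2..n are in the Large Edge List */
	uint32_t edge_pos;    /* position in the Large Edge List if octopus */
	uint32_t generation;  /* 0 means "not computed" */
	uint64_t commit_time; /* 34-bit seconds since the epoch */
};

/* All 4-byte numbers in the file are in network (big-endian) order. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* `rec` points at one (hash_len + 16)-byte Commit Data entry. */
static void decode_commit_data(const unsigned char *rec, size_t hash_len,
			       struct decoded_commit *out)
{
	const unsigned char *p = rec + hash_len; /* skip the root tree OID */
	uint32_t second = read_be32(p + 4);
	uint32_t gen_word = read_be32(p + 8);

	assert(rec && out);

	out->parent1 = read_be32(p);
	out->is_octopus = second != GRAPH_NO_PARENT &&
			  (second & GRAPH_EDGE_FLAG);
	out->parent2 = out->is_octopus ? GRAPH_NO_PARENT : second;
	out->edge_pos = out->is_octopus ? (second & ~GRAPH_EDGE_FLAG) : 0;

	/* Upper 30 bits: generation. Lower 2 bits: bits 33-34 of the date. */
	out->generation = gen_word >> 2;
	out->commit_time = ((uint64_t)(gen_word & 0x3) << 32) |
			   read_be32(p + 12);
}
```

For an octopus merge, a reader would then start at edge_pos in the Large
Edge List and read int-ids until it meets a value with the
most-significant bit on, which marks the last parent.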



* [PATCH v3 02/14] graph: add commit graph design document
  2018-02-08 20:37 ` [PATCH v3 " Derrick Stolee
  2018-02-08 20:37   ` [PATCH v3 01/14] commit-graph: add format document Derrick Stolee
@ 2018-02-08 20:37   ` Derrick Stolee
  2018-02-08 20:37   ` [PATCH v3 03/14] commit-graph: create git-commit-graph builtin Derrick Stolee
                     ` (12 subsequent siblings)
  14 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-08 20:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, git, gitster, peff, jonathantanmy, sbeller, szeder.dev

Add Documentation/technical/commit-graph.txt with details of the planned
commit graph feature, including future plans.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 189 +++++++++++++++++++++++++++++++
 1 file changed, 189 insertions(+)
 create mode 100644 Documentation/technical/commit-graph.txt

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
new file mode 100644
index 0000000000..fc86b06041
--- /dev/null
+++ b/Documentation/technical/commit-graph.txt
@@ -0,0 +1,189 @@
+Git Commit Graph Design Notes
+=============================
+
+Git walks the commit graph for many reasons, including:
+
+1. Listing and filtering commit history.
+2. Computing merge bases.
+
+These operations can become slow as the commit count grows. The merge
+base calculation shows up in many user-facing commands, such as 'merge-base'
+or 'status' and can take minutes to compute depending on history shape.
+
+There are two main costs here:
+
+1. Decompressing and parsing commits.
+2. Walking the entire graph to avoid topological order mistakes.
+
+The commit graph file is a supplemental data structure that accelerates
+commit graph walks. If a user downgrades or disables the 'core.commitGraph'
+config setting, then the existing ODB is sufficient. The file is stored
+next to packfiles either in the .git/objects/pack directory or in the pack
+directory of an alternate.
+
+The commit graph file stores the commit graph structure along with some
+extra metadata to speed up graph walks. By listing commit OIDs in
+lexicographic order, we can identify an integer position for each commit and
+refer to the parents of a commit using those integer positions. We use
+binary search to find initial commits and then use the integer positions
+for fast lookups during the walk.
+
+A consumer may load the following info for a commit from the graph:
+
+1. The commit OID.
+2. The list of parents, along with their integer position.
+3. The commit date.
+4. The root tree OID.
+5. The generation number (see definition below).
+
+Values 1-4 satisfy the requirements of parse_commit_gently().
+
+Define the "generation number" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has generation number one.
+
+ * A commit with at least one parent has generation number one more than
+   the largest generation number among its parents.
+
+Equivalently, the generation number of a commit A is one more than the
+length of a longest path from A to a root commit. The recursive definition
+is easier to use for computation and observing the following property:
+
+    If A and B are commits with generation numbers N and M, respectively,
+    and N <= M, then A cannot reach B. That is, we know without searching
+    that B is not an ancestor of A because it is further from a root commit
+    than A.
+
+    Conversely, when checking if A is an ancestor of B, then we only need
+    to walk commits until all commits on the walk boundary have generation
+    number at most N. If we walk commits using a priority queue seeded by
+    generation numbers, then we always expand the boundary commit with highest
+    generation number and can easily detect the stopping condition.
+
+This property can be used to significantly reduce the time it takes to
+walk commits and determine topological relationships. Without generation
+numbers, the general heuristic is the following:
+
+    If A and B are commits with commit time X and Y, respectively, and
+    X < Y, then A _probably_ cannot reach B.
+
+This heuristic is currently used whenever the computation can make
+mistakes with topological orders (such as "git log" with default order),
+but is not used when the topological order is required (such as merge
+base calculations, "git log --graph").
+
+In practice, we expect some commits to be created recently and not stored
+in the commit graph. We can treat these commits as having "infinite"
+generation number and walk until reaching commits with known generation
+number.
+
+Design Details
+--------------
+
+- A graph file is stored in a file named 'graph-<hash>.graph' in the pack
+  directory. This could be stored in an alternate.
+
+- The most-recent graph file hash is stored in a 'graph-head' file for
+  immediate access and storing backup graphs. This could be stored in an
+  alternate, and refers to a 'graph-<hash>.graph' file in the same pack
+  directory.
+
+- The core.commitGraph config setting must be on to consume graph files.
+
+- The file format includes parameters for the object id length and hash
+  algorithm, so a future change of hash algorithm does not require a change
+  in format.
+
+Current Limitations
+-------------------
+
+- Only one graph file is used at one time. This allows the integer position
+  to seek into the single graph file. It is possible to extend the model
+  for multiple graph files, but that is currently not part of the design.
+
+- .graph files are managed only by the 'commit-graph' builtin. These are not
+  updated automatically during clone, fetch, repack, or creating new commits.
+
+- There is no '--verify' option for the 'commit-graph' builtin to verify the
+  contents of the graph file agree with the contents in the ODB.
+
+- When rewriting the graph, we do not check for a commit still existing
+  in the ODB, so garbage collection may remove commits.
+
+- Generation numbers are not computed in the current version. The file
+  format supports storing them, along with a mechanism to upgrade from
+  a file without generation numbers to one that uses them.
+
+Future Work
+-----------
+
+- The file format includes room for precomputed generation numbers. These
+  are not currently computed, so all generation numbers will be marked as
+  0 (or "uncomputed"). A later patch will include this calculation.
+
+- The commit graph is currently incompatible with commit grafts. This can be
+  remedied by duplicating or refactoring the current graft logic.
+
+- After computing and storing generation numbers, we must make graph
+  walks aware of generation numbers to gain the performance benefits they
+  enable. This will mostly be accomplished by swapping a commit-date-ordered
+  priority queue with one ordered by generation number. The following
+  operations are important candidates:
+
+    - paint_down_to_common()
+    - 'log --topo-order'
+
+- The graph currently only adds commits to a previously existing graph.
+  When writing a new graph, we could check that the ODB still contains
+  the commits and choose to remove the commits that are deleted from the
+  ODB. For performance reasons, this check should remain optional.
+
+- Currently, parse_commit_gently() requires filling in the root tree
+  object for a commit. This passes through lookup_tree() and consequently
+  lookup_object(). Also, it calls lookup_commit() when loading the parents.
+  These method calls check the ODB for object existence, even if the
+  consumer does not need the content. For example, we do not need the
+  tree contents when computing merge bases. Now that commit parsing is
+  removed from the computation time, these lookup operations are the
+  slowest operations keeping graph walks from being fast. Consider
+  loading these objects without verifying their existence in the ODB and
+  only loading them fully when consumers need them. Consider a method
+  such as "ensure_tree_loaded(commit)" that fully loads a tree before
+  using commit->tree.
+
+- The current design uses the 'commit-graph' builtin to generate the graph.
+  When this feature stabilizes enough to recommend to most users, we should
+  add automatic graph writes to common operations that create many commits.
+  For example, one could compute a graph on 'clone', 'fetch', or 'repack'
+  commands.
+
+- A server could provide a commit graph file as part of the network protocol
+  to avoid extra calculations by clients.
+
+Related Links
+-------------
+[0] https://bugs.chromium.org/p/git/issues/detail?id=8
+    Chromium work item for: Serialized Commit Graph
+
+[1] https://public-inbox.org/git/20110713070517.GC18566@sigill.intra.peff.net/
+    An abandoned patch that introduced generation numbers.
+
+[2] https://public-inbox.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/
+    Discussion about generation numbers on commits and how they interact
+    with fsck.
+
+[3] https://public-inbox.org/git/20170907094718.b6kuzp2uhvkmwcso@sigill.intra.peff.net/t/#m7a2ea7b355aeda962e6b86404bcbadc648abfbba
+    More discussion about generation numbers and not storing them inside
+    commit objects. A valuable quote:
+
+    "I think we should be moving more in the direction of keeping
+     repo-local caches for optimizations. Reachability bitmaps have been
+     a big performance win. I think we should be doing the same with our
+     properties of commits. Not just generation numbers, but making it
+     cheap to access the graph structure without zlib-inflating whole
+     commit objects (i.e., packv4 or something like the "metapacks" I
+     proposed a few years ago)."
+
+[4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u
+    A patch to remove the ahead-behind calculation from 'status'.
+
-- 
2.15.1.45.g9b7079f



* [PATCH v3 03/14] commit-graph: create git-commit-graph builtin
  2018-02-08 20:37 ` [PATCH v3 " Derrick Stolee
  2018-02-08 20:37   ` [PATCH v3 01/14] commit-graph: add format document Derrick Stolee
  2018-02-08 20:37   ` [PATCH v3 02/14] graph: add commit graph design document Derrick Stolee
@ 2018-02-08 20:37   ` Derrick Stolee
  2018-02-08 21:27     ` Junio C Hamano
  2018-02-08 20:37   ` [PATCH v3 04/14] commit-graph: implement write_commit_graph() Derrick Stolee
                     ` (11 subsequent siblings)
  14 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-08 20:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, git, gitster, peff, jonathantanmy, sbeller, szeder.dev

Teach git the 'commit-graph' builtin that will be used for writing and
reading commit graph files. The current implementation is mostly
empty, except for a '--pack-dir' option.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 .gitignore                         |  1 +
 Documentation/git-commit-graph.txt | 11 +++++++++++
 Makefile                           |  1 +
 builtin.h                          |  1 +
 builtin/commit-graph.c             | 37 +++++++++++++++++++++++++++++++++++++
 command-list.txt                   |  1 +
 git.c                              |  1 +
 7 files changed, 53 insertions(+)
 create mode 100644 Documentation/git-commit-graph.txt
 create mode 100644 builtin/commit-graph.c

diff --git a/.gitignore b/.gitignore
index 833ef3b0b7..e82f90184d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -34,6 +34,7 @@
 /git-clone
 /git-column
 /git-commit
+/git-commit-graph
 /git-commit-tree
 /git-config
 /git-count-objects
diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
new file mode 100644
index 0000000000..e1c3078ca1
--- /dev/null
+++ b/Documentation/git-commit-graph.txt
@@ -0,0 +1,11 @@
+git-commit-graph(1)
+===================
+
+NAME
+----
+git-commit-graph - Write and verify Git commit graphs (.graph files)
+
+GIT
+---
+Part of the linkgit:git[1] suite
+
diff --git a/Makefile b/Makefile
index ee9d5eb11e..fc40b816dc 100644
--- a/Makefile
+++ b/Makefile
@@ -932,6 +932,7 @@ BUILTIN_OBJS += builtin/clone.o
 BUILTIN_OBJS += builtin/column.o
 BUILTIN_OBJS += builtin/commit-tree.o
 BUILTIN_OBJS += builtin/commit.o
+BUILTIN_OBJS += builtin/commit-graph.o
 BUILTIN_OBJS += builtin/config.o
 BUILTIN_OBJS += builtin/count-objects.o
 BUILTIN_OBJS += builtin/credential.o
diff --git a/builtin.h b/builtin.h
index 42378f3aa4..079855b6d4 100644
--- a/builtin.h
+++ b/builtin.h
@@ -149,6 +149,7 @@ extern int cmd_clone(int argc, const char **argv, const char *prefix);
 extern int cmd_clean(int argc, const char **argv, const char *prefix);
 extern int cmd_column(int argc, const char **argv, const char *prefix);
 extern int cmd_commit(int argc, const char **argv, const char *prefix);
+extern int cmd_commit_graph(int argc, const char **argv, const char *prefix);
 extern int cmd_commit_tree(int argc, const char **argv, const char *prefix);
 extern int cmd_config(int argc, const char **argv, const char *prefix);
 extern int cmd_count_objects(int argc, const char **argv, const char *prefix);
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
new file mode 100644
index 0000000000..a01c5d9981
--- /dev/null
+++ b/builtin/commit-graph.c
@@ -0,0 +1,37 @@
+#include "builtin.h"
+#include "config.h"
+#include "parse-options.h"
+
+static char const * const builtin_commit_graph_usage[] = {
+	N_("git commit-graph [--pack-dir <packdir>]"),
+	NULL
+};
+
+static struct opts_commit_graph {
+	const char *pack_dir;
+} opts;
+
+
+int cmd_commit_graph(int argc, const char **argv, const char *prefix)
+{
+	static struct option builtin_commit_graph_options[] = {
+		{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
+			N_("dir"),
+			N_("The pack directory to store the graph") },
+		OPT_END(),
+	};
+
+	if (argc == 2 && !strcmp(argv[1], "-h"))
+		usage_with_options(builtin_commit_graph_usage,
+				   builtin_commit_graph_options);
+
+	git_config(git_default_config, NULL);
+	argc = parse_options(argc, argv, prefix,
+			     builtin_commit_graph_options,
+			     builtin_commit_graph_usage,
+			     PARSE_OPT_STOP_AT_NON_OPTION);
+
+	usage_with_options(builtin_commit_graph_usage,
+			   builtin_commit_graph_options);
+}
+
diff --git a/command-list.txt b/command-list.txt
index a1fad28fd8..835c5890be 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -34,6 +34,7 @@ git-clean                               mainporcelain
 git-clone                               mainporcelain           init
 git-column                              purehelpers
 git-commit                              mainporcelain           history
+git-commit-graph                        plumbingmanipulators
 git-commit-tree                         plumbingmanipulators
 git-config                              ancillarymanipulators
 git-count-objects                       ancillaryinterrogators
diff --git a/git.c b/git.c
index 9e96dd4090..d4832c1e0d 100644
--- a/git.c
+++ b/git.c
@@ -388,6 +388,7 @@ static struct cmd_struct commands[] = {
 	{ "clone", cmd_clone },
 	{ "column", cmd_column, RUN_SETUP_GENTLY },
 	{ "commit", cmd_commit, RUN_SETUP | NEED_WORK_TREE },
+	{ "commit-graph", cmd_commit_graph, RUN_SETUP },
 	{ "commit-tree", cmd_commit_tree, RUN_SETUP },
 	{ "config", cmd_config, RUN_SETUP_GENTLY },
 	{ "count-objects", cmd_count_objects, RUN_SETUP },
-- 
2.15.1.45.g9b7079f



* [PATCH v3 04/14] commit-graph: implement write_commit_graph()
  2018-02-08 20:37 ` [PATCH v3 " Derrick Stolee
                     ` (2 preceding siblings ...)
  2018-02-08 20:37   ` [PATCH v3 03/14] commit-graph: create git-commit-graph builtin Derrick Stolee
@ 2018-02-08 20:37   ` Derrick Stolee
  2018-02-08 22:14     ` Junio C Hamano
  2018-02-15 18:19     ` Junio C Hamano
  2018-02-08 20:37   ` [PATCH v3 05/14] commit-graph: implement 'git-commit-graph write' Derrick Stolee
                     ` (10 subsequent siblings)
  14 siblings, 2 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-08 20:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, git, gitster, peff, jonathantanmy, sbeller, szeder.dev

Teach Git to write a commit graph file by checking all packed objects
to see if they are commits, then store the file in the given pack
directory.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
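Note for readers: write_graph_chunk_fanout() in this patch stores 256
cumulative counts, one per leading OID byte, so a lookup can restrict its
binary search to the OIDs that share that byte. The following is a minimal
sketch of that idea, not the code in this patch. build_fanout() and
find_oid() are hypothetical names, the OID list is assumed to be a flat,
sorted array of 20-byte entries, and values stay in host byte order (the
patch writes them big-endian).

```c
#include <stdint.h>
#include <string.h>

#define OID_LEN 20 /* same value as GRAPH_OID_LEN_SHA1 in the patch */

/*
 * fanout[b] = number of OIDs whose first byte is <= b.
 * 'oids' is a flat array of 'nr' sorted OIDs, OID_LEN bytes each.
 */
static void build_fanout(const unsigned char *oids, uint32_t nr,
			 uint32_t fanout[256])
{
	uint32_t i = 0;
	int b;

	for (b = 0; b < 256; b++) {
		while (i < nr && oids[(size_t)i * OID_LEN] == b)
			i++;
		fanout[b] = i;
	}
}

/*
 * Binary search limited to the bucket for oid[0].
 * Returns 1 and sets *pos if found; otherwise returns 0 and sets
 * *pos to the insertion point.
 */
static int find_oid(const unsigned char *oids, const uint32_t fanout[256],
		    const unsigned char *oid, uint32_t *pos)
{
	uint32_t first = oid[0] ? fanout[oid[0] - 1] : 0;
	uint32_t last = fanout[oid[0]];

	while (first < last) {
		uint32_t mid = first + (last - first) / 2;
		int cmp = memcmp(oid, oids + (size_t)mid * OID_LEN, OID_LEN);

		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp > 0)
			first = mid + 1;
		else
			last = mid;
	}

	*pos = first;
	return 0;
}
```

Because 256 = 2^8, starting from the bucket boundaries is what the comment
in write_graph_chunk_fanout() means by avoiding eight extra binary search
iterations.
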
 Makefile       |   1 +
 commit-graph.c | 368 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 commit-graph.h |  13 ++
 3 files changed, 382 insertions(+)
 create mode 100644 commit-graph.c
 create mode 100644 commit-graph.h

diff --git a/Makefile b/Makefile
index fc40b816dc..eeaeb6a745 100644
--- a/Makefile
+++ b/Makefile
@@ -761,6 +761,7 @@ LIB_OBJS += color.o
 LIB_OBJS += column.o
 LIB_OBJS += combine-diff.o
 LIB_OBJS += commit.o
+LIB_OBJS += commit-graph.o
 LIB_OBJS += compat/obstack.o
 LIB_OBJS += compat/terminal.o
 LIB_OBJS += config.o
diff --git a/commit-graph.c b/commit-graph.c
new file mode 100644
index 0000000000..cb47b68871
--- /dev/null
+++ b/commit-graph.c
@@ -0,0 +1,368 @@
+#include "cache.h"
+#include "config.h"
+#include "git-compat-util.h"
+#include "pack.h"
+#include "packfile.h"
+#include "commit.h"
+#include "object.h"
+#include "revision.h"
+#include "sha1-lookup.h"
+#include "commit-graph.h"
+
+#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
+#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
+#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
+#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_LARGEEDGES 0x45444745 /* "EDGE" */
+
+#define GRAPH_DATA_WIDTH 36
+
+#define GRAPH_VERSION_1 0x1
+#define GRAPH_VERSION GRAPH_VERSION_1
+
+#define GRAPH_OID_VERSION_SHA1 1
+#define GRAPH_OID_LEN_SHA1 20
+#define GRAPH_OID_VERSION GRAPH_OID_VERSION_SHA1
+#define GRAPH_OID_LEN GRAPH_OID_LEN_SHA1
+
+#define GRAPH_LARGE_EDGES_NEEDED 0x80000000
+#define GRAPH_PARENT_MISSING 0x7fffffff
+#define GRAPH_EDGE_LAST_MASK 0x7fffffff
+#define GRAPH_PARENT_NONE 0x70000000
+
+#define GRAPH_LAST_EDGE 0x80000000
+
+#define GRAPH_FANOUT_SIZE (4 * 256)
+#define GRAPH_CHUNKLOOKUP_WIDTH 12
+#define GRAPH_CHUNKLOOKUP_SIZE (5 * GRAPH_CHUNKLOOKUP_WIDTH)
+#define GRAPH_MIN_SIZE (GRAPH_CHUNKLOOKUP_SIZE + GRAPH_FANOUT_SIZE + \
+			GRAPH_OID_LEN + 8)
+
+char* get_commit_graph_filename_hash(const char *pack_dir,
+				     struct object_id *hash)
+{
+	size_t len;
+	struct strbuf path = STRBUF_INIT;
+	strbuf_addstr(&path, pack_dir);
+	strbuf_addstr(&path, "/graph-");
+	strbuf_addstr(&path, oid_to_hex(hash));
+	strbuf_addstr(&path, ".graph");
+
+	return strbuf_detach(&path, &len);
+}
+
+static void write_graph_chunk_fanout(struct sha1file *f,
+				     struct commit **commits,
+				     int nr_commits)
+{
+	uint32_t i, count = 0;
+	struct commit **list = commits;
+	struct commit **last = commits + nr_commits;
+
+	/*
+	 * Write the first-level table (the list is sorted,
+	 * but we use a 256-entry lookup to be able to avoid
+	 * having to do eight extra binary search iterations).
+	 */
+	for (i = 0; i < 256; i++) {
+		while (list < last) {
+			if ((*list)->object.oid.hash[0] != i)
+				break;
+			count++;
+			list++;
+		}
+
+		sha1write_be32(f, count);
+	}
+}
+
+static void write_graph_chunk_oids(struct sha1file *f, int hash_len,
+				   struct commit **commits, int nr_commits)
+{
+	struct commit **list, **last = commits + nr_commits;
+	for (list = commits; list < last; list++)
+		sha1write(f, (*list)->object.oid.hash, (int)hash_len);
+}
+
+static int commit_pos(struct commit **commits, int nr_commits,
+		      const struct object_id *oid, uint32_t *pos)
+{
+	uint32_t first = 0, last = nr_commits;
+
+	while (first < last) {
+		uint32_t mid = first + (last - first) / 2;
+		struct object_id *current;
+		int cmp;
+
+		current = &(commits[mid]->object.oid);
+		cmp = oidcmp(oid, current);
+		if (!cmp) {
+			*pos = mid;
+			return 1;
+		}
+		if (cmp > 0) {
+			first = mid + 1;
+			continue;
+		}
+		last = mid;
+	}
+
+	*pos = first;
+	return 0;
+}
+
+static void write_graph_chunk_data(struct sha1file *f, int hash_len,
+				   struct commit **commits, int nr_commits)
+{
+	struct commit **list = commits;
+	struct commit **last = commits + nr_commits;
+	uint32_t num_large_edges = 0;
+
+	while (list < last) {
+		struct commit_list *parent;
+		uint32_t int_id;
+		uint32_t packedDate[2];
+
+		parse_commit(*list);
+		sha1write(f, (*list)->tree->object.oid.hash, hash_len);
+
+		parent = (*list)->parents;
+
+		if (!parent)
+			int_id = GRAPH_PARENT_NONE;
+		else if (!commit_pos(commits, nr_commits,
+				     &(parent->item->object.oid), &int_id))
+			int_id = GRAPH_PARENT_MISSING;
+
+		sha1write_be32(f, int_id);
+
+		if (parent)
+			parent = parent->next;
+
+		if (!parent)
+			int_id = GRAPH_PARENT_NONE;
+		else if (parent->next)
+			int_id = GRAPH_LARGE_EDGES_NEEDED | num_large_edges;
+		else if (!commit_pos(commits, nr_commits,
+				    &(parent->item->object.oid), &int_id))
+			int_id = GRAPH_PARENT_MISSING;
+
+		sha1write_be32(f, int_id);
+
+		if (parent && parent->next) {
+			do {
+				num_large_edges++;
+				parent = parent->next;
+			} while (parent);
+		}
+
+		if (sizeof((*list)->date) > 4)
+			packedDate[0] = htonl(((*list)->date >> 32) & 0x3);
+		else
+			packedDate[0] = 0;
+
+		packedDate[1] = htonl((*list)->date);
+		sha1write(f, packedDate, 8);
+
+		list++;
+	}
+}
+
+static void write_graph_chunk_large_edges(struct sha1file *f,
+					  struct commit **commits,
+					  int nr_commits)
+{
+	struct commit **list = commits;
+	struct commit **last = commits + nr_commits;
+	struct commit_list *parent;
+
+	while (list < last) {
+		int num_parents = 0;
+		for (parent = (*list)->parents; num_parents < 3 && parent;
+		     parent = parent->next)
+			num_parents++;
+
+		if (num_parents <= 2) {
+			list++;
+			continue;
+		}
+
+		/* Since num_parents > 2, this initializer is safe. */
+		for (parent = (*list)->parents->next; parent; parent = parent->next) {
+			uint32_t int_id, swap_int_id;
+			uint32_t last_edge = 0;
+			if (!parent->next)
+				last_edge |= GRAPH_LAST_EDGE;
+
+			if (commit_pos(commits, nr_commits,
+				       &(parent->item->object.oid),
+				       &int_id))
+				swap_int_id = htonl(int_id | last_edge);
+			else
+				swap_int_id = htonl(GRAPH_PARENT_MISSING | last_edge);
+
+			sha1write(f, &swap_int_id, 4);
+		}
+
+		list++;
+	}
+}
+
+static int commit_compare(const void *_a, const void *_b)
+{
+	struct object_id *a = *(struct object_id **)_a;
+	struct object_id *b = *(struct object_id **)_b;
+	return oidcmp(a, b);
+}
+
+struct packed_commit_list {
+	struct commit **list;
+	int nr;
+	int alloc;
+};
+
+struct packed_oid_list {
+	struct object_id **list;
+	int nr;
+	int alloc;
+};
+
+static int if_packed_commit_add_to_list(const struct object_id *oid,
+					struct packed_git *pack,
+					uint32_t pos,
+					void *data)
+{
+	struct packed_oid_list *list = (struct packed_oid_list*)data;
+	enum object_type type;
+	unsigned long size;
+	void *inner_data;
+	off_t offset = nth_packed_object_offset(pack, pos);
+	inner_data = unpack_entry(pack, offset, &type, &size);
+
+	if (inner_data)
+		free(inner_data);
+
+	if (type != OBJ_COMMIT)
+		return 0;
+
+	ALLOC_GROW(list->list, list->nr + 1, list->alloc);
+	list->list[list->nr] = xmalloc(sizeof(struct object_id));
+	oidcpy(list->list[list->nr], oid);
+	(list->nr)++;
+
+	return 0;
+}
+
+struct object_id *write_commit_graph(const char *pack_dir)
+{
+	struct packed_oid_list oids;
+	struct packed_commit_list commits;
+	struct sha1file *f;
+	int i, count_distinct = 0;
+	struct strbuf tmp_file = STRBUF_INIT;
+	unsigned char final_hash[GIT_MAX_RAWSZ];
+	char *graph_name;
+	int fd;
+	uint32_t chunk_ids[5];
+	uint64_t chunk_offsets[5];
+	int num_long_edges;
+	struct object_id *f_hash;
+	char *fname;
+	struct commit_list *parent;
+
+	oids.nr = 0;
+	oids.alloc = 1024;
+	ALLOC_ARRAY(oids.list, oids.alloc);
+
+	for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
+
+	QSORT(oids.list, oids.nr, commit_compare);
+
+	count_distinct = 1;
+	for (i = 1; i < oids.nr; i++) {
+		if (oidcmp(oids.list[i-1], oids.list[i]))
+			count_distinct++;
+	}
+
+	commits.nr = 0;
+	commits.alloc = count_distinct;
+	ALLOC_ARRAY(commits.list, commits.alloc);
+
+	num_long_edges = 0;
+	for (i = 0; i < oids.nr; i++) {
+		int num_parents = 0;
+		if (i > 0 && !oidcmp(oids.list[i-1], oids.list[i]))
+			continue;
+
+		commits.list[commits.nr] = lookup_commit(oids.list[i]);
+		parse_commit(commits.list[commits.nr]);
+
+		for (parent = commits.list[commits.nr]->parents;
+		     parent; parent = parent->next)
+			num_parents++;
+
+		if (num_parents > 2)
+			num_long_edges += num_parents - 1;
+
+		commits.nr++;
+	}
+
+	strbuf_addstr(&tmp_file, pack_dir);
+	strbuf_addstr(&tmp_file, "/tmp_graph_XXXXXX");
+
+	fd = git_mkstemp_mode(tmp_file.buf, 0444);
+	if (fd < 0)
+		die_errno("unable to create '%s'", tmp_file.buf);
+
+	graph_name = strbuf_detach(&tmp_file, NULL);
+	f = sha1fd(fd, graph_name);
+
+	sha1write_be32(f, GRAPH_SIGNATURE);
+
+	sha1write_u8(f, GRAPH_VERSION);
+	sha1write_u8(f, GRAPH_OID_VERSION);
+	sha1write_u8(f, GRAPH_OID_LEN);
+	sha1write_u8(f, 4); /* number of chunks */
+
+	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
+	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
+	chunk_ids[2] = GRAPH_CHUNKID_DATA;
+	chunk_ids[3] = GRAPH_CHUNKID_LARGEEDGES;
+	chunk_ids[4] = 0;
+
+	chunk_offsets[0] = 8 + GRAPH_CHUNKLOOKUP_SIZE;
+	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
+	chunk_offsets[2] = chunk_offsets[1] + GRAPH_OID_LEN * commits.nr;
+	chunk_offsets[3] = chunk_offsets[2] + (GRAPH_OID_LEN + 16) * commits.nr;
+	chunk_offsets[4] = chunk_offsets[3] + 4 * num_long_edges;
+
+	for (i = 0; i <= 4; i++) {
+		uint32_t chunk_write[3];
+
+		chunk_write[0] = htonl(chunk_ids[i]);
+		chunk_write[1] = htonl(chunk_offsets[i] >> 32);
+		chunk_write[2] = htonl(chunk_offsets[i] & 0xffffffff);
+		sha1write(f, chunk_write, 12);
+	}
+
+	write_graph_chunk_fanout(f, commits.list, commits.nr);
+	write_graph_chunk_oids(f, GRAPH_OID_LEN, commits.list, commits.nr);
+	write_graph_chunk_data(f, GRAPH_OID_LEN, commits.list, commits.nr);
+	write_graph_chunk_large_edges(f, commits.list, commits.nr);
+
+	sha1close(f, final_hash, CSUM_CLOSE | CSUM_FSYNC);
+
+	f_hash = xmalloc(sizeof(struct object_id));
+	hashcpy(f_hash->hash, final_hash);
+	fname = get_commit_graph_filename_hash(pack_dir, f_hash);
+
+	if (rename(graph_name, fname))
+		die("failed to rename %s to %s", graph_name, fname);
+
+	free(oids.list);
+	oids.alloc = 0;
+	oids.nr = 0;
+
+	return f_hash;
+}
+
diff --git a/commit-graph.h b/commit-graph.h
new file mode 100644
index 0000000000..4756f6ba5b
--- /dev/null
+++ b/commit-graph.h
@@ -0,0 +1,13 @@
+#ifndef COMMIT_GRAPH_H
+#define COMMIT_GRAPH_H
+
+#include "git-compat-util.h"
+#include "commit.h"
+
+extern char* get_commit_graph_filename_hash(const char *pack_dir,
+					    struct object_id *hash);
+
+extern struct object_id *write_commit_graph(const char *pack_dir);
+
+#endif
+
-- 
2.15.1.45.g9b7079f



* [PATCH v3 05/14] commit-graph: implement 'git-commit-graph write'
  2018-02-08 20:37 ` [PATCH v3 " Derrick Stolee
                     ` (3 preceding siblings ...)
  2018-02-08 20:37   ` [PATCH v3 04/14] commit-graph: implement write_commit_graph() Derrick Stolee
@ 2018-02-08 20:37   ` Derrick Stolee
  2018-02-13 21:57     ` Jonathan Tan
  2018-02-08 20:37   ` [PATCH v3 06/14] commit-graph: implement 'git-commit-graph read' Derrick Stolee
                     ` (9 subsequent siblings)
  14 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-08 20:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, git, gitster, peff, jonathantanmy, sbeller, szeder.dev

Teach git-commit-graph to write graph files. Create a new test script to
verify that this command succeeds.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
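Note for readers: the octopus merge in the test script (merge/3, with three
parents) exercises the large-edge list that write_commit_graph() emits for
commits with more than two parents. The following is a minimal sketch of
that encoding under stated assumptions, not the code from the series.
encode_parents() and decode_parents() are hypothetical helpers, parents are
given as already-resolved graph positions, values stay in host byte order
(the patch writes them big-endian), and GRAPH_PARENT_MISSING is left out
for brevity. The constants are copied from commit-graph.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Constants copied from commit-graph.c in this series. */
#define GRAPH_PARENT_NONE 0x70000000u
#define GRAPH_LARGE_EDGES_NEEDED 0x80000000u
#define GRAPH_LAST_EDGE 0x80000000u
#define GRAPH_EDGE_LAST_MASK 0x7fffffffu

/*
 * Fill the two fixed parent slots of a commit. With more than two
 * parents, slot[1] holds an index into 'large' (flag bit set) and
 * parents 2..n are appended there, the last one flagged.
 * Returns the number of entries appended to 'large'.
 */
static uint32_t encode_parents(const uint32_t *parents, uint32_t nr,
			       uint32_t slot[2],
			       uint32_t *large, uint32_t large_nr)
{
	uint32_t i, added = 0;

	slot[0] = nr ? parents[0] : GRAPH_PARENT_NONE;
	if (nr < 2) {
		slot[1] = GRAPH_PARENT_NONE;
	} else if (nr == 2) {
		slot[1] = parents[1];
	} else {
		slot[1] = GRAPH_LARGE_EDGES_NEEDED | large_nr;
		for (i = 1; i < nr; i++) {
			uint32_t v = parents[i];
			if (i == nr - 1)
				v |= GRAPH_LAST_EDGE;
			large[large_nr + added++] = v;
		}
	}
	return added;
}

/* Reverse of the above; returns the number of parents stored in 'out'. */
static uint32_t decode_parents(const uint32_t slot[2],
			       const uint32_t *large, uint32_t large_nr,
			       uint32_t *out, uint32_t max)
{
	uint32_t n = 0, i;

	if (slot[0] == GRAPH_PARENT_NONE)
		return 0;
	if (n < max)
		out[n++] = slot[0];
	if (slot[1] == GRAPH_PARENT_NONE)
		return n;
	if (!(slot[1] & GRAPH_LARGE_EDGES_NEEDED)) {
		if (n < max)
			out[n++] = slot[1];
		return n;
	}
	for (i = slot[1] & GRAPH_EDGE_LAST_MASK; i < large_nr; i++) {
		if (n < max)
			out[n++] = large[i] & GRAPH_EDGE_LAST_MASK;
		if (large[i] & GRAPH_LAST_EDGE)
			break;
	}
	return n;
}
```

A commit with three parents adds two entries to the list, which matches the
"num_long_edges += num_parents - 1" accounting in write_commit_graph().
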
 Documentation/git-commit-graph.txt |  39 +++++++++++++
 builtin/commit-graph.c             |  43 +++++++++++++++
 t/t5318-commit-graph.sh            | 109 +++++++++++++++++++++++++++++++++++++
 3 files changed, 191 insertions(+)
 create mode 100755 t/t5318-commit-graph.sh

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index e1c3078ca1..55dfe5c3d8 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -5,6 +5,45 @@ NAME
 ----
 git-commit-graph - Write and verify Git commit graphs (.graph files)
 
+
+SYNOPSIS
+--------
+[verse]
+'git commit-graph write' <options> [--pack-dir <pack_dir>]
+
+
+DESCRIPTION
+-----------
+
+Manage the serialized commit graph file.
+
+
+OPTIONS
+-------
+--pack-dir::
+	Use given directory for the location of packfiles, graph-head,
+	and graph files.
+
+
+COMMANDS
+--------
+'write'::
+
+Write a commit graph file based on the commits found in packfiles.
+Includes all commits from the existing commit graph file. Outputs the
+checksum hash of the written file.
+
+
+EXAMPLES
+--------
+
+* Write a commit graph file for the packed commits in your local .git folder.
++
+------------------------------------------------
+$ git commit-graph write
+------------------------------------------------
+
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index a01c5d9981..5dac033bfe 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -1,9 +1,16 @@
 #include "builtin.h"
 #include "config.h"
 #include "parse-options.h"
+#include "commit-graph.h"
 
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--pack-dir <packdir>]"),
+	N_("git commit-graph write [--pack-dir <packdir>]"),
+	NULL
+};
+
+static const char * const builtin_commit_graph_write_usage[] = {
+	N_("git commit-graph write [--pack-dir <packdir>]"),
 	NULL
 };
 
@@ -11,6 +18,37 @@ static struct opts_commit_graph {
 	const char *pack_dir;
 } opts;
 
+static int graph_write(int argc, const char **argv)
+{
+	struct object_id *graph_hash;
+
+	static struct option builtin_commit_graph_write_options[] = {
+		{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
+			N_("dir"),
+			N_("The pack directory to store the graph") },
+		OPT_END(),
+	};
+
+	argc = parse_options(argc, argv, NULL,
+			     builtin_commit_graph_write_options,
+			     builtin_commit_graph_write_usage, 0);
+
+	if (!opts.pack_dir) {
+		struct strbuf path = STRBUF_INIT;
+		strbuf_addstr(&path, get_object_directory());
+		strbuf_addstr(&path, "/pack");
+		opts.pack_dir = strbuf_detach(&path, NULL);
+	}
+
+	graph_hash = write_commit_graph(opts.pack_dir);
+
+	if (graph_hash) {
+		printf("%s\n", oid_to_hex(graph_hash));
+		FREE_AND_NULL(graph_hash);
+	}
+
+	return 0;
+}
 
 int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 {
@@ -31,6 +69,11 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 			     builtin_commit_graph_usage,
 			     PARSE_OPT_STOP_AT_NON_OPTION);
 
+	if (argc > 0) {
+		if (!strcmp(argv[0], "write"))
+			return graph_write(argc, argv);
+	}
+
 	usage_with_options(builtin_commit_graph_usage,
 			   builtin_commit_graph_options);
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
new file mode 100755
index 0000000000..b762587595
--- /dev/null
+++ b/t/t5318-commit-graph.sh
@@ -0,0 +1,109 @@
+#!/bin/sh
+
+test_description='commit graph'
+. ./test-lib.sh
+
+test_expect_success 'setup full repo' '
+	rm -rf .git &&
+	mkdir full &&
+	cd full &&
+	git init &&
+	packdir=".git/objects/pack"'
+
+test_expect_success 'write graph with no packs' '
+	git commit-graph write --pack-dir .'
+
+test_expect_success 'create commits and repack' '
+	for i in $(test_seq 3)
+	do
+		test_commit $i &&
+		git branch commits/$i
+	done &&
+	git repack'
+
+test_expect_success 'write graph' '
+	graph1=$(git commit-graph write) &&
+	test_path_is_file $packdir/graph-$graph1.graph'
+
+test_expect_success 'Add more commits' '
+	git reset --hard commits/1 &&
+	for i in $(test_seq 4 5)
+	do
+		test_commit $i &&
+		git branch commits/$i
+	done &&
+	git reset --hard commits/2 &&
+	for i in $(test_seq 6 7)
+	do
+		test_commit $i &&
+		git branch commits/$i
+	done &&
+	git reset --hard commits/2 &&
+	git merge commits/4 &&
+	git branch merge/1 &&
+	git reset --hard commits/4 &&
+	git merge commits/6 &&
+	git branch merge/2 &&
+	git reset --hard commits/3 &&
+	git merge commits/5 commits/7 &&
+	git branch merge/3 &&
+	git repack'
+
+# Current graph structure:
+#
+#   __M3___
+#  /   |   \
+# 3 M1 5 M2 7
+# |/  \|/  \|
+# 2    4    6
+# |___/____/
+# 1
+
+
+test_expect_success 'write graph with merges' '
+	graph2=$(git commit-graph write) &&
+	test_path_is_file $packdir/graph-$graph2.graph'
+
+test_expect_success 'Add one more commit' '
+	test_commit 8 &&
+	git branch commits/8 &&
+	ls $packdir | grep idx >existing-idx &&
+	git repack &&
+	ls $packdir | grep idx | grep -v --file=existing-idx >new-idx'
+
+# Current graph structure:
+#
+#      8
+#      |
+#   __M3___
+#  /   |   \
+# 3 M1 5 M2 7
+# |/  \|/  \|
+# 2    4    6
+# |___/____/
+# 1
+
+test_expect_success 'write graph with new commit' '
+	graph3=$(git commit-graph write) &&
+	test_path_is_file $packdir/graph-$graph3.graph'
+
+
+test_expect_success 'write graph with nothing new' '
+	graph4=$(git commit-graph write) &&
+	test_path_is_file $packdir/graph-$graph4.graph &&
+	printf $graph3 >expect &&
+	printf $graph4 >output &&
+	test_cmp expect output'
+
+test_expect_success 'setup bare repo' '
+	cd .. &&
+	git clone --bare --no-local full bare &&
+	cd bare &&
+	baredir="./objects/pack"'
+
+test_expect_success 'write graph in bare repo' '
+	graphbare=$(git commit-graph write) &&
+	test_path_is_file $baredir/graph-$graphbare.graph'
+
+test_done
+
-- 
2.15.1.45.g9b7079f



* [PATCH v3 06/14] commit-graph: implement 'git-commit-graph read'
  2018-02-08 20:37 ` [PATCH v3 " Derrick Stolee
                     ` (4 preceding siblings ...)
  2018-02-08 20:37   ` [PATCH v3 05/14] commit-graph: implement 'git-commit-graph write' Derrick Stolee
@ 2018-02-08 20:37   ` Derrick Stolee
  2018-02-08 23:38     ` Junio C Hamano
  2018-02-08 20:37   ` [PATCH v3 07/14] commit-graph: update graph-head during write Derrick Stolee
                     ` (8 subsequent siblings)
  14 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-08 20:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, git, gitster, peff, jonathantanmy, sbeller, szeder.dev

Teach git-commit-graph to read commit graph files and summarize their contents.

Use the read subcommand to verify the contents of a commit graph file in the
tests.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
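Note for readers: the "header:" line printed by the read subcommand
reflects the 8-byte file header that write_commit_graph() emits, which is
followed by 12-byte chunk lookup entries (a 4-byte chunk ID and an 8-byte
file offset, all big-endian). The following is a minimal sketch of decoding
that layout from a byte buffer. It is not load_commit_graph_one(): the
struct and function names are hypothetical, and only the signature value
and the field widths are taken from the series.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SIGNATURE 0x43475048u /* "CGPH", as in commit-graph.c */
#define SKETCH_HEADER_SIZE 8
#define SKETCH_CHUNK_ENTRY_SIZE 12 /* GRAPH_CHUNKLOOKUP_WIDTH */

struct sketch_header {
	uint32_t signature;
	uint8_t version;
	uint8_t oid_version;
	uint8_t oid_len;
	uint8_t num_chunks;
};

struct sketch_chunk {
	uint32_t id;
	uint64_t offset;
};

/* Read a big-endian 32-bit value without assuming alignment. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns 0 on success, -1 if the buffer is too short or not a graph. */
static int parse_header(const unsigned char *data, size_t len,
			struct sketch_header *h)
{
	if (len < SKETCH_HEADER_SIZE)
		return -1;
	h->signature = read_be32(data);
	if (h->signature != SKETCH_SIGNATURE)
		return -1;
	h->version = data[4];
	h->oid_version = data[5];
	h->oid_len = data[6];
	h->num_chunks = data[7];
	return 0;
}

/* Decode the i-th chunk lookup entry; returns -1 if it is out of bounds. */
static int parse_chunk(const unsigned char *data, size_t len, uint32_t i,
		       struct sketch_chunk *c)
{
	size_t pos = SKETCH_HEADER_SIZE + (size_t)i * SKETCH_CHUNK_ENTRY_SIZE;

	if (pos + SKETCH_CHUNK_ENTRY_SIZE > len)
		return -1;
	c->id = read_be32(data + pos);
	c->offset = ((uint64_t)read_be32(data + pos + 4) << 32) |
		    read_be32(data + pos + 8);
	return 0;
}
```

The writer emits one more lookup entry than the chunk count in the header.
The final entry has chunk ID 0, and its offset marks the end of the last
chunk.
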
 Documentation/git-commit-graph.txt |  16 ++++
 builtin/commit-graph.c             |  71 ++++++++++++++++++
 commit-graph.c                     | 147 +++++++++++++++++++++++++++++++++++++
 commit-graph.h                     |  23 ++++++
 t/t5318-commit-graph.sh            |  34 +++++++--
 5 files changed, 286 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 55dfe5c3d8..67e107f06a 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -9,6 +9,7 @@ git-commit-graph - Write and verify Git commit graphs (.graph files)
 SYNOPSIS
 --------
 [verse]
+'git commit-graph read' <options> [--pack-dir <pack_dir>]
 'git commit-graph write' <options> [--pack-dir <pack_dir>]
 
 
@@ -34,6 +35,15 @@ Includes all commits from the existing commit graph file. Outputs the
 checksum hash of the written file.
 
 
+'read'::
+
+Read a graph file given by the graph-head file and output basic
+details about the graph file.
++
+With the `--graph-hash=<hash>` option, consider the graph file
+graph-<hash>.graph in the pack directory.
+
+
 EXAMPLES
 --------
 
@@ -43,6 +53,12 @@ EXAMPLES
 $ git commit-graph write
 ------------------------------------------------
 
+* Read basic information from a graph file.
++
+------------------------------------------------
+$ git commit-graph read --graph-hash=<hash>
+------------------------------------------------
+
 
 GIT
 ---
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 5dac033bfe..3ffa7ec433 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -5,10 +5,16 @@
 
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--pack-dir <packdir>]"),
+	N_("git commit-graph read [--graph-hash=<hash>] [--pack-dir <packdir>]"),
 	N_("git commit-graph write [--pack-dir <packdir>]"),
 	NULL
 };
 
+static const char * const builtin_commit_graph_read_usage[] = {
+	N_("git commit-graph read [--graph-hash=<hash>] [--pack-dir <packdir>]"),
+	NULL
+};
+
 static const char * const builtin_commit_graph_write_usage[] = {
 	N_("git commit-graph write [--pack-dir <packdir>]"),
 	NULL
@@ -16,8 +22,71 @@ static const char * const builtin_commit_graph_write_usage[] = {
 
 static struct opts_commit_graph {
 	const char *pack_dir;
+	const char *graph_hash;
 } opts;
 
+static int graph_read(int argc, const char **argv)
+{
+	struct object_id graph_hash;
+	struct commit_graph *graph = NULL;
+	const char *graph_file;
+
+	static struct option builtin_commit_graph_read_options[] = {
+		{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
+			N_("dir"),
+			N_("The pack directory to store the graph") },
+		{ OPTION_STRING, 'H', "graph-hash", &opts.graph_hash,
+			N_("hash"),
+			N_("A hash for a specific graph file in the pack-dir."),
+			PARSE_OPT_OPTARG, NULL, (intptr_t) "" },
+		OPT_END(),
+	};
+
+	argc = parse_options(argc, argv, NULL,
+			     builtin_commit_graph_read_options,
+			     builtin_commit_graph_read_usage, 0);
+
+	if (!opts.pack_dir) {
+		struct strbuf path = STRBUF_INIT;
+		strbuf_addstr(&path, get_object_directory());
+		strbuf_addstr(&path, "/pack");
+		opts.pack_dir = strbuf_detach(&path, NULL);
+	}
+
+	if (opts.graph_hash && strlen(opts.graph_hash) == GIT_MAX_HEXSZ)
+		get_oid_hex(opts.graph_hash, &graph_hash);
+	else
+		die("no graph hash specified");
+
+	graph_file = get_commit_graph_filename_hash(opts.pack_dir, &graph_hash);
+	graph = load_commit_graph_one(graph_file, opts.pack_dir);
+
+	if (!graph)
+		die("graph file %s does not exist", graph_file);
+
+	printf("header: %08x %02x %02x %02x %02x\n",
+		ntohl(*(uint32_t*)graph->data),
+		*(unsigned char*)(graph->data + 4),
+		*(unsigned char*)(graph->data + 5),
+		graph->hash_len,
+		graph->num_chunks);
+	printf("num_commits: %u\n", graph->num_commits);
+	printf("chunks:");
+
+	if (graph->chunk_oid_fanout)
+		printf(" oid_fanout");
+	if (graph->chunk_oid_lookup)
+		printf(" oid_lookup");
+	if (graph->chunk_commit_data)
+		printf(" commit_metadata");
+	if (graph->chunk_large_edges)
+		printf(" large_edges");
+	printf("\n");
+
+	printf("pack_dir: %s\n", graph->pack_dir);
+	return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
 	struct object_id *graph_hash;
@@ -70,6 +139,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 			     PARSE_OPT_STOP_AT_NON_OPTION);
 
 	if (argc > 0) {
+		if (!strcmp(argv[0], "read"))
+			return graph_read(argc, argv);
 		if (!strcmp(argv[0], "write"))
 			return graph_write(argc, argv);
 	}
diff --git a/commit-graph.c b/commit-graph.c
index cb47b68871..9a337cea4d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -51,6 +51,153 @@ char* get_commit_graph_filename_hash(const char *pack_dir,
 	return strbuf_detach(&path, &len);
 }
 
+static struct commit_graph *alloc_commit_graph(int extra)
+{
+	struct commit_graph *g = xmalloc(st_add(sizeof(*g), extra));
+	memset(g, 0, sizeof(*g));
+	g->graph_fd = -1;
+
+	return g;
+}
+
+static int close_commit_graph(struct commit_graph *g)
+{
+	if (g->graph_fd < 0)
+		return 0;
+
+	munmap((void *)g->data, g->data_len);
+	g->data = 0;
+
+	close(g->graph_fd);
+	g->graph_fd = -1;
+
+	return 1;
+}
+
+static void free_commit_graph(struct commit_graph **g)
+{
+	if (!g || !*g)
+		return;
+
+	close_commit_graph(*g);
+	FREE_AND_NULL(*g);
+}
+
+struct commit_graph *load_commit_graph_one(const char *graph_file, const char *pack_dir)
+{
+	void *graph_map;
+	const unsigned char *data, *chunk_lookup;
+	size_t graph_size;
+	struct stat st;
+	uint32_t i;
+	struct commit_graph *graph;
+	int fd = git_open(graph_file);
+	uint64_t last_chunk_offset;
+	uint32_t last_chunk_id;
+	uint32_t graph_signature;
+	unsigned char graph_version, hash_version;
+
+	if (fd < 0)
+		return 0;
+	if (fstat(fd, &st)) {
+		close(fd);
+		return 0;
+	}
+	graph_size = xsize_t(st.st_size);
+
+	if (graph_size < GRAPH_MIN_SIZE) {
+		close(fd);
+		die("graph file %s is too small", graph_file);
+	}
+	graph_map = xmmap(NULL, graph_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	data = (const unsigned char *)graph_map;
+
+	graph_signature = ntohl(*(uint32_t*)data);
+	if (graph_signature != GRAPH_SIGNATURE) {
+		munmap(graph_map, graph_size);
+		close(fd);
+		die("graph signature %X does not match signature %X",
+			graph_signature, GRAPH_SIGNATURE);
+	}
+
+	graph_version = *(unsigned char*)(data + 4);
+	if (graph_version != GRAPH_VERSION) {
+		munmap(graph_map, graph_size);
+		close(fd);
+		die("graph version %X does not match version %X",
+			graph_version, GRAPH_VERSION);
+	}
+
+	hash_version = *(unsigned char*)(data + 5);
+	if (hash_version != GRAPH_OID_VERSION) {
+		munmap(graph_map, graph_size);
+		close(fd);
+		die("hash version %X does not match version %X",
+			hash_version, GRAPH_OID_VERSION);
+	}
+
+	graph = alloc_commit_graph(strlen(pack_dir) + 1);
+
+	graph->hash_len = *(unsigned char*)(data + 6);
+	graph->num_chunks = *(unsigned char*)(data + 7);
+	graph->graph_fd = fd;
+	graph->data = graph_map;
+	graph->data_len = graph_size;
+
+	last_chunk_id = 0;
+	last_chunk_offset = 8;
+	chunk_lookup = data + 8;
+	for (i = 0; i < graph->num_chunks; i++) {
+		uint32_t chunk_id = get_be32(chunk_lookup + 0);
+		uint64_t chunk_offset1 = get_be32(chunk_lookup + 4);
+		uint32_t chunk_offset2 = get_be32(chunk_lookup + 8);
+		uint64_t chunk_offset = (chunk_offset1 << 32) | chunk_offset2;
+
+		chunk_lookup += GRAPH_CHUNKLOOKUP_WIDTH;
+
+		if (chunk_offset > graph_size - GIT_MAX_RAWSZ)
+			die("improper chunk offset %08x%08x", (uint32_t)(chunk_offset >> 32),
+			    (uint32_t)chunk_offset);
+
+		switch (chunk_id) {
+			case GRAPH_CHUNKID_OIDFANOUT:
+				graph->chunk_oid_fanout = data + chunk_offset;
+				break;
+
+			case GRAPH_CHUNKID_OIDLOOKUP:
+				graph->chunk_oid_lookup = data + chunk_offset;
+				break;
+
+			case GRAPH_CHUNKID_DATA:
+				graph->chunk_commit_data = data + chunk_offset;
+				break;
+
+			case GRAPH_CHUNKID_LARGEEDGES:
+				graph->chunk_large_edges = data + chunk_offset;
+				break;
+
+			case 0:
+				break;
+
+			default:
+				free_commit_graph(&graph);
+				die("unrecognized graph chunk id: %08x", chunk_id);
+		}
+
+		if (last_chunk_id == GRAPH_CHUNKID_OIDLOOKUP)
+		{
+			graph->num_commits = (chunk_offset - last_chunk_offset)
+					     / graph->hash_len;
+		}
+
+		last_chunk_id = chunk_id;
+		last_chunk_offset = chunk_offset;
+	}
+
+	strcpy(graph->pack_dir, pack_dir);
+	return graph;
+}
+
 static void write_graph_chunk_fanout(struct sha1file *f,
 				     struct commit **commits,
 				     int nr_commits)
diff --git a/commit-graph.h b/commit-graph.h
index 4756f6ba5b..c1608976b3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -7,6 +7,29 @@
 extern char* get_commit_graph_filename_hash(const char *pack_dir,
 					    struct object_id *hash);
 
+struct commit_graph {
+	int graph_fd;
+
+	const unsigned char *data;
+	size_t data_len;
+
+	unsigned char hash_len;
+	unsigned char num_chunks;
+	uint32_t num_commits;
+	struct object_id oid;
+
+	const unsigned char *chunk_oid_fanout;
+	const unsigned char *chunk_oid_lookup;
+	const unsigned char *chunk_commit_data;
+	const unsigned char *chunk_large_edges;
+
+	/* something like ".git/objects/pack" */
+	char pack_dir[FLEX_ARRAY]; /* more */
+};
+
+extern struct commit_graph *load_commit_graph_one(const char *graph_file,
+						  const char *pack_dir);
+
 extern struct object_id *write_commit_graph(const char *pack_dir);
 
 #endif
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index b762587595..ad1d0e621d 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -21,9 +21,21 @@ test_expect_success 'create commits and repack' '
 	done &&
 	git repack'
 
+graph_read_expect() {
+	cat >expect <<- EOF
+	header: 43475048 01 01 14 04
+	num_commits: $1
+	chunks: oid_fanout oid_lookup commit_metadata large_edges
+	pack_dir: $2
+	EOF
+}
+
 test_expect_success 'write graph' '
 	graph1=$(git commit-graph write) &&
-	test_path_is_file $packdir/graph-$graph1.graph'
+	test_path_is_file $packdir/graph-$graph1.graph &&
+	git commit-graph read --graph-hash=$graph1 >output &&
+	graph_read_expect "3" "$packdir" &&
+	test_cmp expect output'
 
 test_expect_success 'Add more commits' '
 	git reset --hard commits/1 &&
@@ -62,7 +74,10 @@ test_expect_success 'Add more commits' '
 
 test_expect_success 'write graph with merges' '
 	graph2=$(git commit-graph write)&&
-	test_path_is_file $packdir/graph-$graph2.graph'
+	test_path_is_file $packdir/graph-$graph2.graph &&
+	git commit-graph read --graph-hash=$graph2 >output &&
+	graph_read_expect "10" "$packdir" &&
+	test_cmp expect output'
 
 test_expect_success 'Add one more commit' '
 	test_commit 8 &&
@@ -85,14 +100,20 @@ test_expect_success 'Add one more commit' '
 
 test_expect_success 'write graph with new commit' '
 	graph3=$(git commit-graph write) &&
-	test_path_is_file $packdir/graph-$graph3.graph'
-
+	test_path_is_file $packdir/graph-$graph3.graph &&
+	test_path_is_file $packdir/graph-$graph3.graph &&
+	git commit-graph read --graph-hash=$graph3 >output &&
+	graph_read_expect "11" "$packdir" &&
+	test_cmp expect output'
 
 test_expect_success 'write graph with nothing new' '
 	graph4=$(git commit-graph write) &&
 	test_path_is_file $packdir/graph-$graph4.graph &&
 	printf $graph3 >expect &&
 	printf $graph4 >output &&
+	test_cmp expect output &&
+	git commit-graph read --graph-hash=$graph4 >output &&
+	graph_read_expect "11" "$packdir" &&
 	test_cmp expect output'
 
 test_expect_success 'setup bare repo' '
@@ -103,7 +124,10 @@ test_expect_success 'setup bare repo' '
 
 test_expect_success 'write graph in bare repo' '
 	graphbare=$(git commit-graph write) &&
-	test_path_is_file $baredir/graph-$graphbare.graph'
+	test_path_is_file $baredir/graph-$graphbare.graph &&
+	git commit-graph read --graph-hash=$graphbare >output &&
+	graph_read_expect "11" "$baredir" &&
+	test_cmp expect output'
 
 test_done
 
-- 
2.15.1.45.g9b7079f



* [PATCH v3 07/14] commit-graph: update graph-head during write
  2018-02-08 20:37 ` [PATCH v3 " Derrick Stolee
                     ` (5 preceding siblings ...)
  2018-02-08 20:37   ` [PATCH v3 06/14] commit-graph: implement 'git-commit-graph read' Derrick Stolee
@ 2018-02-08 20:37   ` Derrick Stolee
  2018-02-12 18:56     ` Junio C Hamano
  2018-02-13 22:38     ` Jonathan Tan
  2018-02-08 20:37   ` [PATCH v3 08/14] commit-graph: implement 'git-commit-graph clear' Derrick Stolee
                     ` (7 subsequent siblings)
  14 siblings, 2 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-08 20:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, git, gitster, peff, jonathantanmy, sbeller, szeder.dev

It is possible to have multiple commit graph files in a pack directory,
but only one is active at a time. Use a 'graph-head' file to point
to the active file. Teach git-commit-graph to write 'graph-head' upon
writing a new commit graph file when given the --update-head option.
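
For readers unfamiliar with the lock-then-rename idiom behind such a
pointer file, here is a minimal sketch in plain POSIX C. It is not the
code in this patch, which goes through Git's lockfile API; the function
name update_head_sketch(), the fixed buffer sizes, and the ".lock"
suffix are assumptions made for illustration only.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): store `hex` in
 * <pack_dir>/graph-head by writing a lock file and renaming it
 * into place. Returns 0 on success, -1 on failure.
 */
int update_head_sketch(const char *pack_dir, const char *hex)
{
	char head[4096], lock[4096 + 8];
	size_t len = strlen(hex);
	int fd, n;

	n = snprintf(head, sizeof(head), "%s/graph-head", pack_dir);
	if (n < 0 || (size_t)n >= sizeof(head))
		return -1;
	snprintf(lock, sizeof(lock), "%s.lock", head);

	/* O_EXCL: only one writer at a time can create the lock file. */
	fd = open(lock, O_WRONLY | O_CREAT | O_EXCL, 0666);
	if (fd < 0)
		return -1;

	/* A short write is treated as a failure in this sketch. */
	if (write(fd, hex, len) != (ssize_t)len) {
		close(fd);
		unlink(lock);
		return -1;
	}

	/* rename(2) replaces graph-head in one step for readers. */
	if (close(fd) < 0 || rename(lock, head) < 0) {
		unlink(lock);
		return -1;
	}
	return 0;
}
```

A reader that opens graph-head sees either the old hash or the new
one, never a partially written file, which is the property the
pointer file relies on.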

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt | 11 ++++++++++-
 builtin/commit-graph.c             | 27 +++++++++++++++++++++++++--
 commit-graph.c                     |  8 ++++++++
 commit-graph.h                     |  1 +
 t/t5318-commit-graph.sh            | 25 +++++++++++++++++++------
 5 files changed, 63 insertions(+), 9 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 67e107f06a..5e32c43b27 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -33,7 +33,9 @@ COMMANDS
 Write a commit graph file based on the commits found in packfiles.
 Includes all commits from the existing commit graph file. Outputs the
 checksum hash of the written file.
-
++
+With `--update-head` option, update the graph-head file to point
+to the written graph file.
 
 'read'::
 
@@ -53,6 +55,13 @@ EXAMPLES
 $ git commit-graph write
 ------------------------------------------------
 
+* Write a graph file for the packed commits in your local .git folder
+* and update graph-head.
++
+------------------------------------------------
+$ git commit-graph write --update-head
+------------------------------------------------
+
 * Read basic information from a graph file.
 +
 ------------------------------------------------
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 3ffa7ec433..776ca087e8 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -1,12 +1,13 @@
 #include "builtin.h"
 #include "config.h"
+#include "lockfile.h"
 #include "parse-options.h"
 #include "commit-graph.h"
 
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--pack-dir <packdir>]"),
 	N_("git commit-graph read [--graph-hash=<hash>]"),
-	N_("git commit-graph write [--pack-dir <packdir>]"),
+	N_("git commit-graph write [--pack-dir <packdir>] [--update-head]"),
 	NULL
 };
 
@@ -16,13 +17,14 @@ static const char * const builtin_commit_graph_read_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--pack-dir <packdir>]"),
+	N_("git commit-graph write [--pack-dir <packdir>] [--update-head]"),
 	NULL
 };
 
 static struct opts_commit_graph {
 	const char *pack_dir;
 	const char *graph_hash;
+	int update_head;
 } opts;
 
 static int graph_read(int argc, const char **argv)
@@ -87,6 +89,22 @@ static int graph_read(int argc, const char **argv)
 	return 0;
 }
 
+static void update_head_file(const char *pack_dir, const struct object_id *graph_hash)
+{
+	int fd;
+	struct lock_file lk = LOCK_INIT;
+	char *head_fname = get_graph_head_filename(pack_dir);
+
+	fd = hold_lock_file_for_update(&lk, head_fname, LOCK_DIE_ON_ERROR);
+	FREE_AND_NULL(head_fname);
+
+	if (fd < 0)
+		die_errno("unable to open graph-head");
+
+	write_in_full(fd, oid_to_hex(graph_hash), GIT_MAX_HEXSZ);
+	commit_lock_file(&lk);
+}
+
 static int graph_write(int argc, const char **argv)
 {
 	struct object_id *graph_hash;
@@ -95,6 +113,8 @@ static int graph_write(int argc, const char **argv)
 		{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
 			N_("dir"),
 			N_("The pack directory to store the graph") },
+		OPT_BOOL('u', "update-head", &opts.update_head,
+			N_("update graph-head to written graph file")),
 		OPT_END(),
 	};
 
@@ -111,6 +131,9 @@ static int graph_write(int argc, const char **argv)
 
 	graph_hash = write_commit_graph(opts.pack_dir);
 
+	if (opts.update_head)
+		update_head_file(opts.pack_dir, graph_hash);
+
 	if (graph_hash) {
 		printf("%s\n", oid_to_hex(graph_hash));
 		FREE_AND_NULL(graph_hash);
diff --git a/commit-graph.c b/commit-graph.c
index 9a337cea4d..9789fe37f9 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,6 +38,14 @@
 #define GRAPH_MIN_SIZE (GRAPH_CHUNKLOOKUP_SIZE + GRAPH_FANOUT_SIZE + \
 			GRAPH_OID_LEN + 8)
 
+char *get_graph_head_filename(const char *pack_dir)
+{
+	struct strbuf fname = STRBUF_INIT;
+	strbuf_addstr(&fname, pack_dir);
+	strbuf_addstr(&fname, "/graph-head");
+	return strbuf_detach(&fname, 0);
+}
+
 char* get_commit_graph_filename_hash(const char *pack_dir,
 				     struct object_id *hash)
 {
diff --git a/commit-graph.h b/commit-graph.h
index c1608976b3..726b8aa0f4 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -4,6 +4,7 @@
 #include "git-compat-util.h"
 #include "commit.h"
 
+extern char *get_graph_head_filename(const char *pack_dir);
 extern char* get_commit_graph_filename_hash(const char *pack_dir,
 					    struct object_id *hash);
 
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index ad1d0e621d..21352d5a3c 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -11,7 +11,8 @@ test_expect_success 'setup full repo' '
 	packdir=".git/objects/pack"'
 
 test_expect_success 'write graph with no packs' '
-	git commit-graph write --pack-dir .'
+	git commit-graph write --pack-dir . &&
+	test_path_is_missing graph-head'
 
 test_expect_success 'create commits and repack' '
 	for i in $(test_seq 3)
@@ -32,6 +33,7 @@ graph_read_expect() {
 
 test_expect_success 'write graph' '
 	graph1=$(git commit-graph write) &&
+	test_path_is_missing graph-head &&
 	test_path_is_file $packdir/graph-$graph1.graph &&
 	git commit-graph read --graph-hash=$graph1 >output &&
 	graph_read_expect "3" "$packdir" &&
@@ -73,8 +75,11 @@ test_expect_success 'Add more commits' '
 
 
 test_expect_success 'write graph with merges' '
-	graph2=$(git commit-graph write)&&
+	graph2=$(git commit-graph write --update-head)&&
 	test_path_is_file $packdir/graph-$graph2.graph &&
+	test_path_is_file $packdir/graph-head &&
+	printf $graph2 >expect &&
+	test_cmp expect $packdir/graph-head &&
 	git commit-graph read --graph-hash=$graph2 >output &&
 	graph_read_expect "10" "$packdir" &&
 	test_cmp expect output'
@@ -99,19 +104,24 @@ test_expect_success 'Add one more commit' '
 # 1
 
 test_expect_success 'write graph with new commit' '
-	graph3=$(git commit-graph write) &&
-	test_path_is_file $packdir/graph-$graph3.graph &&
+	graph3=$(git commit-graph write --update-head) &&
 	test_path_is_file $packdir/graph-$graph3.graph &&
+        test_path_is_file $packdir/graph-head &&
+        printf $graph3 >expect &&
+        test_cmp expect $packdir/graph-head &&
 	git commit-graph read --graph-hash=$graph3 >output &&
 	graph_read_expect "11" "$packdir" &&
 	test_cmp expect output'
 
 test_expect_success 'write graph with nothing new' '
-	graph4=$(git commit-graph write) &&
+	graph4=$(git commit-graph write --update-head) &&
 	test_path_is_file $packdir/graph-$graph4.graph &&
 	printf $graph3 >expect &&
 	printf $graph4 >output &&
 	test_cmp expect output &&
+        test_path_is_file $packdir/graph-head &&
+        printf $graph4 >expect &&
+        test_cmp expect $packdir/graph-head &&
 	git commit-graph read --graph-hash=$graph4 >output &&
 	graph_read_expect "11" "$packdir" &&
 	test_cmp expect output'
@@ -123,8 +133,11 @@ test_expect_success 'setup bare repo' '
 	baredir="./objects/pack"'
 
 test_expect_success 'write graph in bare repo' '
-	graphbare=$(git commit-graph write) &&
+	graphbare=$(git commit-graph write --update-head) &&
 	test_path_is_file $baredir/graph-$graphbare.graph &&
+        test_path_is_file $baredir/graph-head &&
+        printf $graphbare >expect &&
+        test_cmp expect $baredir/graph-head &&
 	git commit-graph read --graph-hash=$graphbare >output &&
 	graph_read_expect "11" "$baredir" &&
 	test_cmp expect output'
-- 
2.15.1.45.g9b7079f



* [PATCH v3 08/14] commit-graph: implement 'git-commit-graph clear'
  2018-02-08 20:37 ` [PATCH v3 " Derrick Stolee
                     ` (6 preceding siblings ...)
  2018-02-08 20:37   ` [PATCH v3 07/14] commit-graph: update graph-head during write Derrick Stolee
@ 2018-02-08 20:37   ` Derrick Stolee
  2018-02-13 22:49     ` Jonathan Tan
  2018-02-08 20:37   ` [PATCH v3 09/14] commit-graph: implement --delete-expired Derrick Stolee
                     ` (6 subsequent siblings)
  14 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-08 20:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, git, gitster, peff, jonathantanmy, sbeller, szeder.dev

Teach Git to delete the current 'graph-head' file and the commit graph
it references. This is a good safety valve if the graph file is somehow
corrupted and needs to be recomputed. Since the commit graph is a
summary of contents already in the ODB, it can be regenerated.
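
The order of operations can be sketched in standalone C as follows.
This is an illustration under stated assumptions, not the patch's
code: clear_graph_sketch() is a hypothetical name, a 40-character
lowercase hex hash is assumed, and errors are returned rather than
reported with die().

```c
#include <stdio.h>
#include <string.h>

#define HEXSZ_SKETCH 40 /* assumption: SHA-1 hex length */

/*
 * Hypothetical helper (not part of the patch). Returns 1 if the
 * head file and its graph were removed, 0 if there was no
 * graph-head to clear, and -1 on error.
 */
int clear_graph_sketch(const char *pack_dir)
{
	char head[4096], graph[4200], hex[HEXSZ_SKETCH + 1];
	FILE *f;
	size_t n;

	snprintf(head, sizeof(head), "%s/graph-head", pack_dir);
	f = fopen(head, "r");
	if (!f)
		return 0; /* nothing to clear */

	n = fread(hex, 1, HEXSZ_SKETCH, f);
	fclose(f);
	hex[n] = '\0';

	/* Refuse to build a path from anything but a full hex hash. */
	if (n != HEXSZ_SKETCH ||
	    strspn(hex, "0123456789abcdef") != HEXSZ_SKETCH)
		return -1;

	snprintf(graph, sizeof(graph), "%s/graph-%s.graph", pack_dir, hex);

	/* Drop the pointer first, then the file it named. */
	if (remove(head) != 0)
		return -1;
	if (remove(graph) != 0)
		return -1;
	return 1;
}
```

Removing graph-head before the graph file means a concurrent reader
never follows the pointer to a file that is already gone.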

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt | 11 +++++++++
 builtin/commit-graph.c             | 50 ++++++++++++++++++++++++++++++++++++++
 commit-graph.c                     | 23 ++++++++++++++++++
 commit-graph.h                     |  2 ++
 t/t5318-commit-graph.sh            |  5 ++++
 5 files changed, 91 insertions(+)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 5e32c43b27..8c2cbbc923 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -9,6 +9,7 @@ git-commit-graph - Write and verify Git commit graphs (.graph files)
 SYNOPSIS
 --------
 [verse]
+'git commit-graph clear' [--pack-dir <pack_dir>]
 'git commit-graph read' <options> [--pack-dir <pack_dir>]
 'git commit-graph write' <options> [--pack-dir <pack_dir>]
 
@@ -45,6 +46,10 @@ details about the graph file.
 With `--graph-hash=<hash>` option, consider the graph file
 graph-<hash>.graph in the pack directory.
 
+'clear'::
+
+Delete the graph-head file and the graph file it references.
+
 
 EXAMPLES
 --------
@@ -68,6 +73,12 @@ $ git commit-graph write --update-head
 $ git commit-graph read --graph-hash=<hash>
 ------------------------------------------------
 
+* Delete <dir>/graph-head and the file it references.
++
+------------------------------------------------
+$ git commit-graph clear --pack-dir=<dir>
+------------------------------------------------
+
 
 GIT
 ---
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 776ca087e8..529cb80de6 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -1,16 +1,23 @@
 #include "builtin.h"
 #include "config.h"
+#include "dir.h"
 #include "lockfile.h"
 #include "parse-options.h"
 #include "commit-graph.h"
 
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--pack-dir <packdir>]"),
+	N_("git commit-graph clear [--pack-dir <packdir>]"),
 	N_("git commit-graph read [--graph-hash=<hash>]"),
 	N_("git commit-graph write [--pack-dir <packdir>] [--update-head]"),
 	NULL
 };
 
+static const char * const builtin_commit_graph_clear_usage[] = {
+	N_("git commit-graph clear [--pack-dir <packdir>]"),
+	NULL
+};
+
 static const char * const builtin_commit_graph_read_usage[] = {
 	N_("git commit-graph read [--pack-dir <packdir>]"),
 	NULL
@@ -27,6 +34,47 @@ static struct opts_commit_graph {
 	int update_head;
 } opts;
 
+static int graph_clear(int argc, const char **argv)
+{
+	char *old_path;
+	char *head_fname;
+	struct object_id old_graph_hash;
+
+	static struct option builtin_commit_graph_clear_options[] = {
+		{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
+			N_("dir"),
+			N_("The pack directory to store the graph") },
+		OPT_END(),
+	};
+
+	argc = parse_options(argc, argv, NULL,
+			     builtin_commit_graph_clear_options,
+			     builtin_commit_graph_clear_usage, 0);
+
+	if (!opts.pack_dir) {
+		struct strbuf path = STRBUF_INIT;
+		strbuf_addstr(&path, get_object_directory());
+		strbuf_addstr(&path, "/pack");
+		opts.pack_dir = strbuf_detach(&path, NULL);
+	}
+
+	if (!get_graph_head_hash(opts.pack_dir, &old_graph_hash))
+		return 0;
+
+	head_fname = get_graph_head_filename(opts.pack_dir);
+	if (remove_path(head_fname))
+		die("failed to remove path %s", head_fname);
+	FREE_AND_NULL(head_fname);
+
+	old_path = get_commit_graph_filename_hash(opts.pack_dir,
+						  &old_graph_hash);
+	if (remove_path(old_path))
+		die("failed to remove path %s", old_path);
+	free(old_path);
+
+	return 0;
+}
+
 static int graph_read(int argc, const char **argv)
 {
 	struct object_id graph_hash;
@@ -162,6 +210,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 			     PARSE_OPT_STOP_AT_NON_OPTION);
 
 	if (argc > 0) {
+		if (!strcmp(argv[0], "clear"))
+			return graph_clear(argc, argv);
 		if (!strcmp(argv[0], "read"))
 			return graph_read(argc, argv);
 		if (!strcmp(argv[0], "write"))
diff --git a/commit-graph.c b/commit-graph.c
index 9789fe37f9..95b813c2c7 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -46,6 +46,29 @@ char *get_graph_head_filename(const char *pack_dir)
 	return strbuf_detach(&fname, 0);
 }
 
+struct object_id *get_graph_head_hash(const char *pack_dir, struct object_id *hash)
+{
+	char hex[GIT_MAX_HEXSZ + 1];
+	char *fname;
+	FILE *f;
+
+	fname = get_graph_head_filename(pack_dir);
+	f = fopen(fname, "r");
+	FREE_AND_NULL(fname);
+
+	if (!f)
+		return 0;
+
+	if (!fgets(hex, sizeof(hex), f))
+		die("failed to read graph-head");
+
+	fclose(f);
+
+	if (get_oid_hex(hex, hash))
+		return 0;
+	return hash;
+}
+
 char* get_commit_graph_filename_hash(const char *pack_dir,
 				     struct object_id *hash)
 {
diff --git a/commit-graph.h b/commit-graph.h
index 726b8aa0f4..75427cd5f6 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -5,6 +5,8 @@
 #include "commit.h"
 
 extern char *get_graph_head_filename(const char *pack_dir);
+extern struct object_id *get_graph_head_hash(const char *pack_dir,
+					     struct object_id *hash);
 extern char* get_commit_graph_filename_hash(const char *pack_dir,
 					    struct object_id *hash);
 
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 21352d5a3c..de81253790 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -126,6 +126,11 @@ test_expect_success 'write graph with nothing new' '
 	graph_read_expect "11" "$packdir" &&
 	test_cmp expect output'
 
+test_expect_success 'clear graph' '
+	git commit-graph clear &&
+	test_path_is_missing $packdir/graph-$graph4.graph &&
+	test_path_is_missing $packdir/graph-head'
+
 test_expect_success 'setup bare repo' '
 	cd .. &&
 	git clone --bare --no-local full bare &&
-- 
2.15.1.45.g9b7079f



* [PATCH v3 09/14] commit-graph: implement --delete-expired
  2018-02-08 20:37 ` [PATCH v3 " Derrick Stolee
                     ` (7 preceding siblings ...)
  2018-02-08 20:37   ` [PATCH v3 08/14] commit-graph: implement 'git-commit-graph clear' Derrick Stolee
@ 2018-02-08 20:37   ` Derrick Stolee
  2018-02-08 20:37   ` [PATCH v3 10/14] commit-graph: add core.commitGraph setting Derrick Stolee
                     ` (5 subsequent siblings)
  14 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-08 20:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, git, gitster, peff, jonathantanmy, sbeller, szeder.dev

Teach git-commit-graph to delete the graph files in the pack directory
that are referenced by neither the previous nor the new 'graph-head'.
This frees disk space without racing against other running Git
processes that may still be reading the previous graph file.

To delete old graph files, a user (or managing process) would call

	git commit-graph write --update-head --delete-expired

but the caller takes on some responsibility. Specifically, ensure that
processes that started before the previous 'commit-graph write' command
have completed. Otherwise, they may hold an open handle on a graph file
that the new call deletes.
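
The selection rule can be written as a small predicate. The sketch
below is an illustration, not the patch's code: is_expired_graph() is
a hypothetical name, and it compares bare file names where the patch
compares full paths built from the pack directory.

```c
#include <string.h>

/*
 * Hypothetical predicate (not part of the patch): return 1 if `name`
 * ends in ".graph" and is neither the graph named by the previous
 * graph-head (`old_name`, may be NULL) nor the one just written
 * (`new_name`); return 0 otherwise.
 */
int is_expired_graph(const char *name, const char *old_name,
		     const char *new_name)
{
	static const char suffix[] = ".graph";
	size_t len = strlen(name);
	size_t slen = sizeof(suffix) - 1;

	/* Only files with the ".graph" suffix are candidates. */
	if (len < slen || strcmp(name + len - slen, suffix))
		return 0;

	/* Keep the graph that was just written. */
	if (!strcmp(name, new_name))
		return 0;

	/* Keep the previous head; older processes may still read it. */
	if (old_name && !strcmp(name, old_name))
		return 0;

	return 1;
}
```

A caller would walk the pack directory with readdir(3) and remove
every entry for which the predicate returns 1. This keeps up to two
graph files alive after each write, which is why the previous head is
only deleted by the following write.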

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt | 11 ++++--
 builtin/commit-graph.c             | 73 ++++++++++++++++++++++++++++++++++++--
 t/t5318-commit-graph.sh            |  7 ++--
 3 files changed, 84 insertions(+), 7 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 8c2cbbc923..7ae8f7484d 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -37,6 +37,11 @@ checksum hash of the written file.
 +
 With `--update-head` option, update the graph-head file to point
 to the written graph file.
++
+With the `--delete-expired` option, delete the graph files in the pack
+directory that are not referred to by the graph-head file. To avoid race
+conditions, do not delete the file previously referred to by the
+graph-head file if it is updated by the `--update-head` option.
 
 'read'::
 
@@ -60,11 +65,11 @@ EXAMPLES
 $ git commit-graph write
 ------------------------------------------------
 
-* Write a graph file for the packed commits in your local .git folder
-* and update graph-head.
+* Write a graph file for the packed commits in your local .git folder,
+* update graph-head, and delete stale graph files.
 +
 ------------------------------------------------
-$ git commit-graph write --update-head
+$ git commit-graph write --update-head --delete-expired
 ------------------------------------------------
 
 * Read basic information from a graph file.
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 529cb80de6..15f647fd81 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -9,7 +9,7 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--pack-dir <packdir>]"),
 	N_("git commit-graph clear [--pack-dir <packdir>]"),
 	N_("git commit-graph read [--graph-hash=<hash>]"),
-	N_("git commit-graph write [--pack-dir <packdir>] [--update-head]"),
+	N_("git commit-graph write [--pack-dir <packdir>] [--update-head] [--delete-expired]"),
 	NULL
 };
 
@@ -24,7 +24,7 @@ static const char * const builtin_commit_graph_read_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--pack-dir <packdir>] [--update-head]"),
+	N_("git commit-graph write [--pack-dir <packdir>] [--update-head] [--delete-expired]"),
 	NULL
 };
 
@@ -32,6 +32,7 @@ static struct opts_commit_graph {
 	const char *pack_dir;
 	const char *graph_hash;
 	int update_head;
+	int delete_expired;
 } opts;
 
 static int graph_clear(int argc, const char **argv)
@@ -153,9 +154,68 @@ static void update_head_file(const char *pack_dir, const struct object_id *graph
 	commit_lock_file(&lk);
 }
 
+/*
+ * To avoid race conditions and deleting graph files that are being
+ * used by other processes, look inside a pack directory for all files
+ * of the form "graph-<hash>.graph" that do not match the old or new
+ * graph hashes and delete them.
+ */
+static void do_delete_expired(const char *pack_dir,
+			      struct object_id *old_graph_hash,
+			      struct object_id *new_graph_hash)
+{
+	DIR *dir;
+	struct dirent *de;
+	int dirnamelen;
+	struct strbuf path = STRBUF_INIT;
+	char *old_graph_path, *new_graph_path;
+
+	if (old_graph_hash)
+		old_graph_path = get_commit_graph_filename_hash(pack_dir, old_graph_hash);
+	else
+		old_graph_path = NULL;
+	new_graph_path = get_commit_graph_filename_hash(pack_dir, new_graph_hash);
+
+	dir = opendir(pack_dir);
+	if (!dir) {
+		if (errno != ENOENT)
+			error_errno("unable to open object pack directory: %s",
+				    pack_dir);
+		return;
+	}
+
+	strbuf_addstr(&path, pack_dir);
+	strbuf_addch(&path, '/');
+
+	dirnamelen = path.len;
+	while ((de = readdir(dir)) != NULL) {
+		size_t base_len;
+
+		if (is_dot_or_dotdot(de->d_name))
+			continue;
+
+		strbuf_setlen(&path, dirnamelen);
+		strbuf_addstr(&path, de->d_name);
+
+		base_len = path.len;
+		if (strip_suffix_mem(path.buf, &base_len, ".graph") &&
+		    strcmp(new_graph_path, path.buf) &&
+		    (!old_graph_path || strcmp(old_graph_path, path.buf)) &&
+		    remove_path(path.buf))
+			die("failed to remove path %s", path.buf);
+	}
+
+	strbuf_release(&path);
+	if (old_graph_path)
+		FREE_AND_NULL(old_graph_path);
+	FREE_AND_NULL(new_graph_path);
+}
+
 static int graph_write(int argc, const char **argv)
 {
 	struct object_id *graph_hash;
+	struct object_id old_graph_hash;
+	int has_existing;
 
 	static struct option builtin_commit_graph_write_options[] = {
 		{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
@@ -163,6 +223,8 @@ static int graph_write(int argc, const char **argv)
 			N_("The pack directory to store the graph") },
 		OPT_BOOL('u', "update-head", &opts.update_head,
 			N_("update graph-head to written graph file")),
+		OPT_BOOL('d', "delete-expired", &opts.delete_expired,
+			N_("delete expired head graph file")),
 		OPT_END(),
 	};
 
@@ -177,11 +239,18 @@ static int graph_write(int argc, const char **argv)
 		opts.pack_dir = strbuf_detach(&path, NULL);
 	}
 
+	has_existing = !!get_graph_head_hash(opts.pack_dir, &old_graph_hash);
+
 	graph_hash = write_commit_graph(opts.pack_dir);
 
 	if (opts.update_head)
 		update_head_file(opts.pack_dir, graph_hash);
 
+	if (opts.delete_expired && graph_hash)
+		do_delete_expired(opts.pack_dir,
+				  has_existing ? &old_graph_hash : NULL,
+				  graph_hash);
+
 	if (graph_hash) {
 		printf("%s\n", oid_to_hex(graph_hash));
 		FREE_AND_NULL(graph_hash);
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index de81253790..10dfb6c5cf 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -104,8 +104,10 @@ test_expect_success 'Add one more commit' '
 # 1
 
 test_expect_success 'write graph with new commit' '
-	graph3=$(git commit-graph write --update-head) &&
+	graph3=$(git commit-graph write --update-head --delete-expired) &&
 	test_path_is_file $packdir/graph-$graph3.graph &&
+	test_path_is_file $packdir/graph-$graph2.graph &&
+	test_path_is_missing $packdir/graph-$graph1.graph &&
         test_path_is_file $packdir/graph-head &&
         printf $graph3 >expect &&
         test_cmp expect $packdir/graph-head &&
@@ -114,8 +116,9 @@ test_expect_success 'write graph with new commit' '
 	test_cmp expect output'
 
 test_expect_success 'write graph with nothing new' '
-	graph4=$(git commit-graph write --update-head) &&
+	graph4=$(git commit-graph write --update-head --delete-expired) &&
 	test_path_is_file $packdir/graph-$graph4.graph &&
+	test_path_is_missing $packdir/graph-$graph2.graph &&
 	printf $graph3 >expect &&
 	printf $graph4 >output &&
 	test_cmp expect output &&
-- 
2.15.1.45.g9b7079f



* [PATCH v3 10/14] commit-graph: add core.commitGraph setting
  2018-02-08 20:37 ` [PATCH v3 " Derrick Stolee
                     ` (8 preceding siblings ...)
  2018-02-08 20:37   ` [PATCH v3 09/14] commit-graph: implement --delete-expired Derrick Stolee
@ 2018-02-08 20:37   ` Derrick Stolee
  2018-02-08 20:37   ` [PATCH v3 11/14] commit: integrate commit graph with commit parsing Derrick Stolee
                     ` (4 subsequent siblings)
  14 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-08 20:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, git, gitster, peff, jonathantanmy, sbeller, szeder.dev

The commit graph feature is controlled by the new core.commitGraph config
setting. This defaults to 0, so the feature is opt-in.

The intention of core.commitGraph is that a user can always stop checking
for or parsing commit graph files if core.commitGraph=0.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
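For readers who want to see how a boolean setting like this gates a
feature, here is a minimal standalone sketch. It is not Git's
git_config_bool(): parse_bool_sketch(), handle_config_sketch() and the
accepted spellings are simplified assumptions for illustration only.
The zero-initialized global mirrors core_commit_graph, which is why the
feature stays off until the user opts in.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>

/* Hypothetical stand-in for core_commit_graph; zero-initialized, so opt-in. */
static int core_commit_graph_sketch;

/*
 * Simplified boolean parser (an assumption, not Git's implementation):
 * a key with no value counts as true, as do "true", "yes", "on" and "1".
 * Everything else, including the empty string, counts as false.
 */
static int parse_bool_sketch(const char *value)
{
	if (!value)
		return 1;
	return !strcasecmp(value, "true") || !strcasecmp(value, "yes") ||
	       !strcasecmp(value, "on") || !strcmp(value, "1");
}

/* Returns 1 if the key was consumed by this handler, 0 otherwise. */
static int handle_config_sketch(const char *var, const char *value)
{
	if (!strcmp(var, "core.commitgraph")) {
		core_commit_graph_sketch = parse_bool_sketch(value);
		return 1;
	}
	return 0;
}
```

The key is compared in lowercase because Git normalizes section and
variable names to lowercase before invoking config callbacks, which is
also why the patch compares against "core.commitgraph".
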
 Documentation/config.txt | 3 +++
 cache.h                  | 1 +
 config.c                 | 5 +++++
 environment.c            | 1 +
 4 files changed, 10 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 9593bfabaa..e90d0d1262 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -883,6 +883,9 @@ core.notesRef::
 This setting defaults to "refs/notes/commits", and it can be overridden by
 the `GIT_NOTES_REF` environment variable.  See linkgit:git-notes[1].
 
+core.commitGraph::
+	Enable the commit graph feature. Allows Git to read from .graph files.
+
 core.sparseCheckout::
 	Enable "sparse checkout" feature. See section "Sparse checkout" in
 	linkgit:git-read-tree[1] for more information.
diff --git a/cache.h b/cache.h
index 6440e2bf21..1063873316 100644
--- a/cache.h
+++ b/cache.h
@@ -771,6 +771,7 @@ extern char *git_replace_ref_base;
 
 extern int fsync_object_files;
 extern int core_preload_index;
+extern int core_commit_graph;
 extern int core_apply_sparse_checkout;
 extern int precomposed_unicode;
 extern int protect_hfs;
diff --git a/config.c b/config.c
index 41862d4a32..614cf59ac4 100644
--- a/config.c
+++ b/config.c
@@ -1213,6 +1213,11 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.commitgraph")) {
+		core_commit_graph = git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.sparsecheckout")) {
 		core_apply_sparse_checkout = git_config_bool(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index 8289c25b44..81fed83c50 100644
--- a/environment.c
+++ b/environment.c
@@ -60,6 +60,7 @@ enum push_default_type push_default = PUSH_DEFAULT_UNSPECIFIED;
 enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE;
 char *notes_ref_name;
 int grafts_replace_parents = 1;
+int core_commit_graph;
 int core_apply_sparse_checkout;
 int merge_log_config = -1;
 int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
-- 
2.15.1.45.g9b7079f


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v3 11/14] commit: integrate commit graph with commit parsing
  2018-02-08 20:37 ` [PATCH v3 " Derrick Stolee
                     ` (9 preceding siblings ...)
  2018-02-08 20:37   ` [PATCH v3 10/14] commit-graph: add core.commitGraph setting Derrick Stolee
@ 2018-02-08 20:37   ` Derrick Stolee
  2018-02-14  0:12     ` Jonathan Tan
  2018-02-15 18:25     ` Junio C Hamano
  2018-02-08 20:37   ` [PATCH v3 12/14] commit-graph: close under reachability Derrick Stolee
                     ` (3 subsequent siblings)
  14 siblings, 2 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-08 20:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, git, gitster, peff, jonathantanmy, sbeller, szeder.dev

Teach Git to inspect a commit graph file to supply the contents of a
struct commit when calling parse_commit_gently(). This implementation
satisfies all post-conditions on the struct commit, including loading
parents, the root tree, and the commit date. The one post-condition it
does not satisfy is the loosely expected one that the commit buffer is
loaded into the cache. log-tree.c:show_log() checked for this, but its
"return;" on failure left the output malformed (the message line was
never terminated). show_log() now loads the buffer when needed, which
avoids the problem.

If core.commitGraph is false, then do not check graph files.

In test script t5318-commit-graph.sh, add output-matching conditions on
read-only graph operations.

By loading commits from the graph instead of parsing commit buffers, we
save a lot of time on long commit walks. Here are some performance
results for a copy of the Linux repository where 'master' has 704,766
reachable commits and is behind 'origin/master' by 19,610 commits.

| Command                          | Before | After  | Rel % |
|----------------------------------|--------|--------|-------|
| log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
| branch -vv                       |  0.42s |  0.27s | -35%  |
| rev-list --all                   |  6.4s  |  1.0s  | -84%  |
| rev-list --all --objects         | 32.6s  | 27.6s  | -15%  |

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
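The parent decoding in full_parse_commit() can be hard to follow in
diff form, so here is a minimal standalone sketch of the same idea. It
assumes the two parent slots have already been converted to host byte
order. The SK_* constant values are assumptions: the patch uses the
corresponding GRAPH_* names but their definitions are not part of this
diff. decode_parents() is a hypothetical helper, not a function in the
series.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values; the definitions are not shown in this patch. */
#define SK_PARENT_NONE        0x70000000u
#define SK_LARGE_EDGES_NEEDED 0x80000000u
#define SK_LAST_EDGE          0x80000000u
#define SK_EDGE_LAST_MASK     0x7fffffffu

/*
 * Decode the parent positions of one commit. p1 and p2 are the two
 * parent slots in host byte order; large_edges is the overflow list
 * used for commits with more than two parents.
 *
 * Writes at most max positions into out. Returns the number of
 * parents, or -1 if out is too small.
 */
static int decode_parents(uint32_t p1, uint32_t p2,
			  const uint32_t *large_edges,
			  uint32_t *out, size_t max)
{
	size_t n = 0;
	uint32_t edge;

	if (p1 == SK_PARENT_NONE)
		return 0;
	if (n == max)
		return -1;
	out[n++] = p1;

	if (p2 == SK_PARENT_NONE)
		return (int)n;
	if (!(p2 & SK_LARGE_EDGES_NEEDED)) {
		if (n == max)
			return -1;
		out[n++] = p2;
		return (int)n;
	}

	/* The low 31 bits of p2 index into the overflow list. */
	large_edges += p2 & SK_EDGE_LAST_MASK;
	do {
		edge = *large_edges++;
		if (n == max)
			return -1;
		/* Strip the "last edge" flag before using the position. */
		out[n++] = edge & SK_EDGE_LAST_MASK;
	} while (!(edge & SK_LAST_EDGE));

	return (int)n;
}
```

Note that every entry read from the overflow list is masked before it
is used as a position, since the final entry carries the flag bit.
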
 alloc.c                 |   1 +
 commit-graph.c          | 202 ++++++++++++++++++++++++++++++++++++++++++++++++
 commit-graph.h          |  21 ++++-
 commit.c                |   3 +
 commit.h                |   3 +
 log-tree.c              |   3 +-
 t/t5318-commit-graph.sh |  44 ++++++++++-
 7 files changed, 272 insertions(+), 5 deletions(-)
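
The date handling in full_parse_commit() reads two big-endian 32-bit
words and keeps only the low 2 bits of the first one. A minimal sketch
of that decoding follows, assuming the caller passes a pointer to the
8 bytes that follow the two parent slots. get_be32_sketch() and
decode_commit_date() are hypothetical names; the sketch reads byte by
byte instead of casting to uint32_t, so it does not depend on alignment
or on ntohl().

```c
#include <stdint.h>

/* Read a big-endian 32-bit word from a byte buffer. */
static uint32_t get_be32_sketch(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Decode a 34-bit commit date: the low 2 bits of the first word are
 * bits 32-33 of the date, the second word holds bits 0-31. The other
 * 30 bits of the first word are ignored here.
 */
static uint64_t decode_commit_date(const unsigned char *date_field)
{
	uint64_t high = get_be32_sketch(date_field) & 0x3;
	uint64_t low = get_be32_sketch(date_field + 4);

	return (high << 32) | low;
}
```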

diff --git a/alloc.c b/alloc.c
index 12afadfacd..cf4f8b61e1 100644
--- a/alloc.c
+++ b/alloc.c
@@ -93,6 +93,7 @@ void *alloc_commit_node(void)
 	struct commit *c = alloc_node(&commit_state, sizeof(struct commit));
 	c->object.type = OBJ_COMMIT;
 	c->index = alloc_commit_index();
+	c->graph_pos = COMMIT_NOT_FROM_GRAPH;
 	return c;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 95b813c2c7..aff67c458e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,6 +38,9 @@
 #define GRAPH_MIN_SIZE (GRAPH_CHUNKLOOKUP_SIZE + GRAPH_FANOUT_SIZE + \
 			GRAPH_OID_LEN + 8)
 
+/* global storage */
+struct commit_graph *commit_graph = NULL;
+
 char *get_graph_head_filename(const char *pack_dir)
 {
 	struct strbuf fname = STRBUF_INIT;
@@ -229,6 +232,205 @@ struct commit_graph *load_commit_graph_one(const char *graph_file, const char *p
 	return graph;
 }
 
+static void prepare_commit_graph_one(const char *obj_dir)
+{
+	char *graph_file;
+	struct object_id oid;
+	struct strbuf pack_dir = STRBUF_INIT;
+	strbuf_addstr(&pack_dir, obj_dir);
+	strbuf_add(&pack_dir, "/pack", 5);
+
+	if (!get_graph_head_hash(pack_dir.buf, &oid))
+		return;
+
+	graph_file = get_commit_graph_filename_hash(pack_dir.buf, &oid);
+
+	commit_graph = load_commit_graph_one(graph_file, pack_dir.buf);
+	strbuf_release(&pack_dir);
+}
+
+static int prepare_commit_graph_run_once = 0;
+void prepare_commit_graph(void)
+{
+	struct alternate_object_database *alt;
+	char *obj_dir;
+
+	if (prepare_commit_graph_run_once)
+		return;
+	prepare_commit_graph_run_once = 1;
+
+	obj_dir = get_object_directory();
+	prepare_commit_graph_one(obj_dir);
+	prepare_alt_odb();
+	for (alt = alt_odb_list; !commit_graph && alt; alt = alt->next)
+		prepare_commit_graph_one(alt->path);
+}
+
+static int bsearch_graph(struct commit_graph *g, struct object_id *oid, uint32_t *pos)
+{
+	int result = bsearch_hash(oid->hash, g->chunk_oid_fanout,
+				  g->chunk_oid_lookup, g->hash_len);
+
+	if (result >= 0) {
+		*pos = result;
+		return 1;
+	} else {
+		*pos = -result - 1;
+		return 0;
+	}
+}
+
+struct object_id *get_nth_commit_oid(struct commit_graph *g,
+				     uint32_t n,
+				     struct object_id *oid)
+{
+	hashcpy(oid->hash, g->chunk_oid_lookup + g->hash_len * n);
+	return oid;
+}
+
+static struct commit_list **insert_parent_or_die(struct commit_graph *g,
+					   int pos,
+					   struct commit_list **pptr)
+{
+	struct commit *c;
+	struct object_id oid;
+	get_nth_commit_oid(g, pos, &oid);
+	c = lookup_commit(&oid);
+	if (!c)
+		die("could not find commit %s", oid_to_hex(&oid));
+	c->graph_pos = pos;
+	return &commit_list_insert(c, pptr)->next;
+}
+
+static int check_commit_parents(struct commit *item, struct commit_graph *g,
+				uint32_t pos, const unsigned char *commit_data)
+{
+	uint32_t new_parent_pos;
+	uint32_t *parent_data_ptr;
+
+	new_parent_pos = ntohl(*(uint32_t*)(commit_data + g->hash_len));
+
+	if (new_parent_pos == GRAPH_PARENT_MISSING)
+		return 0;
+
+	if (new_parent_pos == GRAPH_PARENT_NONE)
+		return 1;
+
+	new_parent_pos = ntohl(*(uint32_t*)(commit_data + g->hash_len + 4));
+
+	if (new_parent_pos == GRAPH_PARENT_MISSING)
+		return 0;
+	if (!(new_parent_pos & GRAPH_LARGE_EDGES_NEEDED))
+		return 1;
+
+	new_parent_pos = new_parent_pos ^ GRAPH_LARGE_EDGES_NEEDED;
+
+	if (new_parent_pos == GRAPH_PARENT_MISSING)
+		return 0;
+
+	parent_data_ptr = (uint32_t*)(g->chunk_large_edges + 4 * new_parent_pos);
+	do {
+		new_parent_pos = ntohl(*parent_data_ptr);
+
+		if ((new_parent_pos & GRAPH_EDGE_LAST_MASK) == GRAPH_PARENT_MISSING)
+			return 0;
+
+		parent_data_ptr++;
+	} while (!(new_parent_pos & GRAPH_LAST_EDGE));
+
+	return 1;
+}
+
+static int full_parse_commit(struct commit *item, struct commit_graph *g,
+			     uint32_t pos, const unsigned char *commit_data)
+{
+	struct object_id oid;
+	uint32_t new_parent_pos;
+	uint32_t *parent_data_ptr;
+	uint64_t date_low, date_high;
+	struct commit_list **pptr;
+
+	item->object.parsed = 1;
+	item->graph_pos = pos;
+
+	hashcpy(oid.hash, commit_data);
+	item->tree = lookup_tree(&oid);
+
+	date_high = ntohl(*(uint32_t*)(commit_data + g->hash_len + 8)) & 0x3;
+	date_low = ntohl(*(uint32_t*)(commit_data + g->hash_len + 12));
+	item->date = (timestamp_t)((date_high << 32) | date_low);
+
+	pptr = &item->parents;
+
+	new_parent_pos = ntohl(*(uint32_t*)(commit_data + g->hash_len));
+	if (new_parent_pos == GRAPH_PARENT_NONE)
+		return 1;
+	pptr = insert_parent_or_die(g, new_parent_pos, pptr);
+
+	new_parent_pos = ntohl(*(uint32_t*)(commit_data + g->hash_len + 4));
+	if (new_parent_pos == GRAPH_PARENT_NONE)
+		return 1;
+	if (!(new_parent_pos & GRAPH_LARGE_EDGES_NEEDED)) {
+		pptr = insert_parent_or_die(g, new_parent_pos, pptr);
+		return 1;
+	}
+
+	parent_data_ptr = (uint32_t*)(g->chunk_large_edges +
+			  4 * (new_parent_pos ^ GRAPH_LARGE_EDGES_NEEDED));
+	do {
+		new_parent_pos = ntohl(*parent_data_ptr);
+		pptr = insert_parent_or_die(g, new_parent_pos & GRAPH_EDGE_LAST_MASK, pptr);
+		parent_data_ptr++;
+	} while (!(new_parent_pos & GRAPH_LAST_EDGE));
+
+	return 1;
+}
+
+static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos)
+{
+	const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos;
+
+	if (!check_commit_parents(item, g, pos, commit_data))
+		return 0;
+
+	return full_parse_commit(item, g, pos, commit_data);
+}
+
+/*
+ * Given a commit struct, try to fill the commit struct info, including:
+ *  1. tree object
+ *  2. date
+ *  3. parents.
+ *
+ * Returns 1 if and only if the commit was found in the commit graph.
+ *
+ * See parse_commit_buffer() for the fallback after this call.
+ */
+int parse_commit_in_graph(struct commit *item)
+{
+	if (!core_commit_graph)
+		return 0;
+	if (item->object.parsed)
+		return 1;
+
+	prepare_commit_graph();
+	if (commit_graph) {
+		uint32_t pos;
+		int found;
+		if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
+			pos = item->graph_pos;
+			found = 1;
+		} else {
+			found = bsearch_graph(commit_graph, &(item->object.oid), &pos);
+		}
+
+		if (found)
+			return fill_commit_in_graph(item, commit_graph, pos);
+	}
+
+	return 0;
+}
+
 static void write_graph_chunk_fanout(struct sha1file *f,
 				     struct commit **commits,
 				     int nr_commits)
diff --git a/commit-graph.h b/commit-graph.h
index 75427cd5f6..7c4c9c38ab 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -4,13 +4,25 @@
 #include "git-compat-util.h"
 #include "commit.h"
 
+/*
+ * Given a commit struct, try to fill the commit struct info, including:
+ *  1. tree object
+ *  2. date
+ *  3. parents.
+ *
+ * Returns 1 if and only if the commit was found in the commit graph.
+ *
+ * See parse_commit_buffer() for the fallback after this call.
+ */
+extern int parse_commit_in_graph(struct commit *item);
+
 extern char *get_graph_head_filename(const char *pack_dir);
 extern struct object_id *get_graph_head_hash(const char *pack_dir,
 					     struct object_id *hash);
 extern char* get_commit_graph_filename_hash(const char *pack_dir,
 					    struct object_id *hash);
 
-struct commit_graph {
+extern struct commit_graph {
 	int graph_fd;
 
 	const unsigned char *data;
@@ -28,10 +40,15 @@ struct commit_graph {
 
 	/* something like ".git/objects/pack" */
 	char pack_dir[FLEX_ARRAY]; /* more */
-};
+} *commit_graph;
 
 extern struct commit_graph *load_commit_graph_one(const char *graph_file,
 						  const char *pack_dir);
+extern void prepare_commit_graph(void);
+
+extern struct object_id *get_nth_commit_oid(struct commit_graph *g,
+					    uint32_t n,
+					    struct object_id *oid);
 
 extern struct object_id *write_commit_graph(const char *pack_dir);
 
diff --git a/commit.c b/commit.c
index cab8d4455b..a8b464d901 100644
--- a/commit.c
+++ b/commit.c
@@ -1,6 +1,7 @@
 #include "cache.h"
 #include "tag.h"
 #include "commit.h"
+#include "commit-graph.h"
 #include "pkt-line.h"
 #include "utf8.h"
 #include "diff.h"
@@ -385,6 +386,8 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
 		return -1;
 	if (item->object.parsed)
 		return 0;
+	if (parse_commit_in_graph(item))
+		return 0;
 	buffer = read_sha1_file(item->object.oid.hash, &type, &size);
 	if (!buffer)
 		return quiet_on_missing ? -1 :
diff --git a/commit.h b/commit.h
index 99a3fea68d..57963d86c3 100644
--- a/commit.h
+++ b/commit.h
@@ -8,6 +8,8 @@
 #include "gpg-interface.h"
 #include "string-list.h"
 
+#define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
+
 struct commit_list {
 	struct commit *item;
 	struct commit_list *next;
@@ -20,6 +22,7 @@ struct commit {
 	timestamp_t date;
 	struct commit_list *parents;
 	struct tree *tree;
+	uint32_t graph_pos;
 };
 
 extern int save_commit_buffer;
diff --git a/log-tree.c b/log-tree.c
index 580b3a98a0..14735d412b 100644
--- a/log-tree.c
+++ b/log-tree.c
@@ -647,8 +647,7 @@ void show_log(struct rev_info *opt)
 		show_mergetag(opt, commit);
 	}
 
-	if (!get_cached_commit_buffer(commit, NULL))
-		return;
+	get_commit_buffer(commit, NULL);
 
 	if (opt->show_notes) {
 		int raw;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 10dfb6c5cf..1e3fe59d70 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -8,6 +8,7 @@ test_expect_success 'setup full repo' '
 	mkdir full &&
 	cd full &&
 	git init &&
+	git config core.commitGraph true &&
 	packdir=".git/objects/pack"'
 
 test_expect_success 'write graph with no packs' '
@@ -22,6 +23,26 @@ test_expect_success 'create commits and repack' '
 	done &&
 	git repack'
 
+graph_git_two_modes() {
+	git -c core.commitGraph=true $1 >output &&
+	git -c core.commitGraph=false $1 >expect &&
+	test_cmp output expect
+}
+
+graph_git_behavior() {
+	MSG=$1
+	BRANCH=$2
+	COMPARE=$3
+	test_expect_success "check normal git operations: $MSG" '
+		graph_git_two_modes "log --oneline $BRANCH" &&
+		graph_git_two_modes "log --topo-order $BRANCH" &&
+		graph_git_two_modes "log --graph $COMPARE..$BRANCH" &&
+		graph_git_two_modes "branch -vv" &&
+		graph_git_two_modes "merge-base -a $BRANCH $COMPARE"'
+}
+
+graph_git_behavior 'no graph' commits/3 commits/1
+
 graph_read_expect() {
 	cat >expect <<- EOF
 	header: 43475048 01 01 14 04
@@ -39,6 +60,8 @@ test_expect_success 'write graph' '
 	graph_read_expect "3" "$packdir" &&
 	test_cmp expect output'
 
+graph_git_behavior 'graph exists, no head' commits/3 commits/1
+
 test_expect_success 'Add more commits' '
 	git reset --hard commits/1 &&
 	for i in $(test_seq 4 5)
@@ -73,7 +96,6 @@ test_expect_success 'Add more commits' '
 # |___/____/
 # 1
 
-
 test_expect_success 'write graph with merges' '
 	graph2=$(git commit-graph write --update-head)&&
 	test_path_is_file $packdir/graph-$graph2.graph &&
@@ -84,6 +106,10 @@ test_expect_success 'write graph with merges' '
 	graph_read_expect "10" "$packdir" &&
 	test_cmp expect output'
 
+graph_git_behavior 'merge 1 vs 2' merge/1 merge/2
+graph_git_behavior 'merge 1 vs 3' merge/1 merge/3
+graph_git_behavior 'merge 2 vs 3' merge/2 merge/3
+
 test_expect_success 'Add one more commit' '
 	test_commit 8 &&
 	git branch commits/8 &&
@@ -103,6 +129,9 @@ test_expect_success 'Add one more commit' '
 # |___/____/
 # 1
 
+graph_git_behavior 'mixed mode, commit 8 vs merge 1' commits/8 merge/1
+graph_git_behavior 'mixed mode, commit 8 vs merge 2' commits/8 merge/2
+
 test_expect_success 'write graph with new commit' '
 	graph3=$(git commit-graph write --update-head --delete-expired) &&
 	test_path_is_file $packdir/graph-$graph3.graph &&
@@ -115,6 +144,9 @@ test_expect_success 'write graph with new commit' '
 	graph_read_expect "11" "$packdir" &&
 	test_cmp expect output'
 
+graph_git_behavior 'full graph, commit 8 vs merge 1' commits/8 merge/1
+graph_git_behavior 'full graph, commit 8 vs merge 2' commits/8 merge/2
+
 test_expect_success 'write graph with nothing new' '
 	graph4=$(git commit-graph write --update-head --delete-expired) &&
 	test_path_is_file $packdir/graph-$graph4.graph &&
@@ -134,12 +166,19 @@ test_expect_success 'clear graph' '
 	test_path_is_missing $packdir/graph-$graph4.graph &&
 	test_path_is_missing $packdir/graph-head'
 
+graph_git_behavior 'cleared graph, commit 8 vs merge 1' commits/8 merge/1
+graph_git_behavior 'cleared graph, commit 8 vs merge 2' commits/8 merge/2
+
 test_expect_success 'setup bare repo' '
 	cd .. &&
 	git clone --bare --no-local full bare &&
 	cd bare &&
+	git config core.commitGraph true &&
 	baredir="./objects/pack"'
 
+graph_git_behavior 'bare repo, commit 8 vs merge 1' commits/8 merge/1
+graph_git_behavior 'bare repo, commit 8 vs merge 2' commits/8 merge/2
+
 test_expect_success 'write graph in bare repo' '
 	graphbare=$(git commit-graph write --update-head) &&
 	test_path_is_file $baredir/graph-$graphbare.graph &&
@@ -150,5 +189,8 @@ test_expect_success 'write graph in bare repo' '
 	graph_read_expect "11" "$baredir" &&
 	test_cmp expect output'
 
+graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' commits/8 merge/1
+graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' commits/8 merge/2
+
 test_done
 
-- 
2.15.1.45.g9b7079f


* [PATCH v3 12/14] commit-graph: close under reachability
  2018-02-08 20:37 ` [PATCH v3 " Derrick Stolee
                     ` (10 preceding siblings ...)
  2018-02-08 20:37   ` [PATCH v3 11/14] commit: integrate commit graph with commit parsing Derrick Stolee
@ 2018-02-08 20:37   ` Derrick Stolee
  2018-02-08 20:37   ` [PATCH v3 13/14] commit-graph: read only from specific pack-indexes Derrick Stolee
                     ` (2 subsequent siblings)
  14 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-08 20:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, git, gitster, peff, jonathantanmy, sbeller, szeder.dev

Teach write_commit_graph() to walk all parents from the commits
discovered in packfiles. This prevents gaps caused by loose objects or
previously-missed packfiles.

Also automatically add commits from the existing graph file, if it
exists.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
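To illustrate what "closed under reachability" means, here is a minimal
standalone sketch on a toy graph. The patch itself uses Git's revision
walk; this sketch swaps in an explicit stack over an array of toy
commits, and every name in it (struct toy_commit,
close_reachable_sketch, SK_MAX_PARENTS) is hypothetical.

```c
#include <stddef.h>

#define SK_MAX_PARENTS 2

/* Toy commit: parents are indexes into the same array, -1 means unused. */
struct toy_commit {
	int parents[SK_MAX_PARENTS];
};

/*
 * Mark every commit reachable from the seeds, seeds included.
 * in_set[i] is nonzero for seed commits on entry and for all
 * reachable commits on return. in_set and stack must each hold nr
 * entries. Returns the number of commits in the closed set.
 */
static size_t close_reachable_sketch(const struct toy_commit *commits,
				     size_t nr, unsigned char *in_set,
				     size_t *stack)
{
	size_t top = 0, count = 0, i;
	int k;

	for (i = 0; i < nr; i++) {
		if (in_set[i]) {
			stack[top++] = i;
			count++;
		}
	}

	while (top) {
		const struct toy_commit *c = &commits[stack[--top]];

		for (k = 0; k < SK_MAX_PARENTS; k++) {
			int p = c->parents[k];

			if (p < 0 || in_set[p])
				continue;
			/* Each commit is marked, and pushed, at most once. */
			in_set[p] = 1;
			count++;
			stack[top++] = (size_t)p;
		}
	}
	return count;
}
```

Starting from a single merge commit, the closed set contains the merge
and all of its ancestors, while unrelated commits stay out.
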
 commit-graph.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index aff67c458e..d711a2cd81 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -633,6 +633,28 @@ static int if_packed_commit_add_to_list(const struct object_id *oid,
 	return 0;
 }
 
+static void close_reachable(struct packed_oid_list *oids)
+{
+	int i;
+	struct rev_info revs;
+	struct commit *commit;
+	init_revisions(&revs, NULL);
+	for (i = 0; i < oids->nr; i++) {
+		commit = lookup_commit(oids->list[i]);
+		if (commit && !parse_commit(commit))
+			revs.commits = commit_list_insert(commit, &revs.commits);
+	}
+
+	if (prepare_revision_walk(&revs))
+		die(_("revision walk setup failed"));
+
+	while ((commit = get_revision(&revs)) != NULL) {
+		ALLOC_GROW(oids->list, oids->nr + 1, oids->alloc);
+		oids->list[oids->nr] = &(commit->object.oid);
+		(oids->nr)++;
+	}
+}
+
 struct object_id *write_commit_graph(const char *pack_dir)
 {
 	struct packed_oid_list oids;
@@ -650,12 +672,27 @@ struct object_id *write_commit_graph(const char *pack_dir)
 	char *fname;
 	struct commit_list *parent;
 
+	prepare_commit_graph();
+
 	oids.nr = 0;
 	oids.alloc = 1024;
+
+	if (commit_graph && oids.alloc < commit_graph->num_commits)
+		oids.alloc = commit_graph->num_commits;
+
 	ALLOC_ARRAY(oids.list, oids.alloc);
 
+	if (commit_graph) {
+		for (i = 0; i < commit_graph->num_commits; i++) {
+			oids.list[i] = xmalloc(sizeof(struct object_id));
+			get_nth_commit_oid(commit_graph, i, oids.list[i]);
+		}
+		oids.nr = commit_graph->num_commits;
+	}
+
 	for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
 
+	close_reachable(&oids);
 	QSORT(oids.list, oids.nr, commit_compare);
 
 	count_distinct = 1;
-- 
2.15.1.45.g9b7079f


* [PATCH v3 13/14] commit-graph: read only from specific pack-indexes
  2018-02-08 20:37 ` [PATCH v3 " Derrick Stolee
                     ` (11 preceding siblings ...)
  2018-02-08 20:37   ` [PATCH v3 12/14] commit-graph: close under reachability Derrick Stolee
@ 2018-02-08 20:37   ` Derrick Stolee
  2018-02-08 20:37   ` [PATCH v3 14/14] commit-graph: build graph from starting commits Derrick Stolee
  2018-02-14 18:15   ` [PATCH v3 00/14] Serialized Git Commit Graph Derrick Stolee
  14 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-08 20:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, git, gitster, peff, jonathantanmy, sbeller, szeder.dev

Teach git-commit-graph to inspect the objects only in a certain list
of pack-indexes within the given pack directory. This allows updating
the commit graph iteratively, since we add all commits stored in a
previous commit graph.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
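The --stdin-packs handling boils down to collecting one pack-index name
per input line. A minimal sketch of that step in plain ISO C follows.
It does not use Git's strbuf API; read_line_sketch() and
read_all_lines_sketch() are hypothetical helpers, and the caller is
assumed to own (and eventually free) the returned array and strings.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Read one line, without its trailing newline, into a fresh buffer.
 * Returns NULL at end of input or on allocation failure.
 */
static char *read_line_sketch(FILE *in)
{
	size_t len = 0, alloc = 64;
	char *buf, *tmp;
	int ch = fgetc(in);

	if (ch == EOF)
		return NULL;
	buf = malloc(alloc);
	if (!buf)
		return NULL;
	while (ch != EOF && ch != '\n') {
		if (len + 1 == alloc) {
			/* Keep one spare byte for the terminating NUL. */
			alloc *= 2;
			tmp = realloc(buf, alloc);
			if (!tmp) {
				free(buf);
				return NULL;
			}
			buf = tmp;
		}
		buf[len++] = (char)ch;
		ch = fgetc(in);
	}
	buf[len] = '\0';
	return buf;
}

/* Collect all lines from in; *nr receives the number of lines read. */
static char **read_all_lines_sketch(FILE *in, size_t *nr)
{
	size_t alloc = 0;
	char **lines = NULL, **tmp, *line;

	*nr = 0;
	while ((line = read_line_sketch(in)) != NULL) {
		if (*nr == alloc) {
			alloc = alloc ? alloc * 2 : 16;
			tmp = realloc(lines, alloc * sizeof(*lines));
			if (!tmp) {
				free(line);
				break;
			}
			lines = tmp;
		}
		lines[(*nr)++] = line;
	}
	return lines;
}
```
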
 Documentation/git-commit-graph.txt | 11 +++++++++++
 builtin/commit-graph.c             | 32 +++++++++++++++++++++++++++++---
 commit-graph.c                     | 25 +++++++++++++++++++++++--
 commit-graph.h                     |  4 +++-
 packfile.c                         |  4 ++--
 packfile.h                         |  2 ++
 t/t5318-commit-graph.sh            | 13 +++++++++++++
 7 files changed, 83 insertions(+), 8 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 7ae8f7484d..727d5d70bb 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -42,6 +42,10 @@ With the `--delete-expired` option, delete the graph files in the pack
 directory that are not referred to by the graph-head file. To avoid race
 conditions, do not delete the file previously referred to by the
 graph-head file if it is updated by the `--update-head` option.
++
+With the `--stdin-packs` option, generate the new commit graph by
+walking objects only in the specified packfiles and any commits in
+the existing graph-head.
 
 'read'::
 
@@ -72,6 +76,13 @@ $ git commit-graph write
 $ git commit-graph write --update-head --delete-expired
 ------------------------------------------------
 
+* Write a graph file, extending the current graph file using commits
+in <pack-index>, update graph-head, and delete stale graph files.
++
+------------------------------------------------
+$ echo <pack-index> | git commit-graph write --update-head --delete-expired --stdin-packs
+------------------------------------------------
+
 * Read basic information from a graph file.
 +
 ------------------------------------------------
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 15f647fd81..fe5f00551c 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -9,7 +9,7 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--pack-dir <packdir>]"),
 	N_("git commit-graph clear [--pack-dir <packdir>]"),
 	N_("git commit-graph read [--graph-hash=<hash>]"),
-	N_("git commit-graph write [--pack-dir <packdir>] [--update-head] [--delete-expired]"),
+	N_("git commit-graph write [--pack-dir <packdir>] [--update-head] [--delete-expired] [--stdin-packs]"),
 	NULL
 };
 
@@ -24,7 +24,7 @@ static const char * const builtin_commit_graph_read_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--pack-dir <packdir>] [--update-head] [--delete-expired]"),
+	N_("git commit-graph write [--pack-dir <packdir>] [--update-head] [--delete-expired] [--stdin-packs]"),
 	NULL
 };
 
@@ -33,6 +33,7 @@ static struct opts_commit_graph {
 	const char *graph_hash;
 	int update_head;
 	int delete_expired;
+	int stdin_packs;
 } opts;
 
 static int graph_clear(int argc, const char **argv)
@@ -216,6 +217,11 @@ static int graph_write(int argc, const char **argv)
 	struct object_id *graph_hash;
 	struct object_id old_graph_hash;
 	int has_existing;
+	const char **pack_indexes = NULL;
+	int nr_packs = 0;
+	const char **lines = NULL;
+	int nr_lines = 0;
+	int alloc_lines = 0;
 
 	static struct option builtin_commit_graph_write_options[] = {
 		{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
@@ -225,6 +231,8 @@ static int graph_write(int argc, const char **argv)
 			N_("update graph-head to written graph file")),
 		OPT_BOOL('d', "delete-expired", &opts.delete_expired,
 			N_("delete expired head graph file")),
+		OPT_BOOL('s', "stdin-packs", &opts.stdin_packs,
+			N_("only scan packfiles listed by stdin")),
 		OPT_END(),
 	};
 
@@ -241,7 +249,25 @@ static int graph_write(int argc, const char **argv)
 
 	has_existing = !!get_graph_head_hash(opts.pack_dir, &old_graph_hash);
 
-	graph_hash = write_commit_graph(opts.pack_dir);
+	if (opts.stdin_packs) {
+		struct strbuf buf = STRBUF_INIT;
+		nr_lines = 0;
+		alloc_lines = 128;
+		ALLOC_ARRAY(lines, alloc_lines);
+
+		while (strbuf_getline(&buf, stdin) != EOF) {
+			ALLOC_GROW(lines, nr_lines + 1, alloc_lines);
+			lines[nr_lines++] = buf.buf;
+			strbuf_detach(&buf, NULL);
+		}
+
+		pack_indexes = lines;
+		nr_packs = nr_lines;
+	}
+
+	graph_hash = write_commit_graph(opts.pack_dir,
+					pack_indexes,
+					nr_packs);
 
 	if (opts.update_head)
 		update_head_file(opts.pack_dir, graph_hash);
diff --git a/commit-graph.c b/commit-graph.c
index d711a2cd81..27a34f5eda 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -655,7 +655,9 @@ static void close_reachable(struct packed_oid_list *oids)
 	}
 }
 
-struct object_id *write_commit_graph(const char *pack_dir)
+struct object_id *write_commit_graph(const char *pack_dir,
+				     const char **pack_indexes,
+				     int nr_packs)
 {
 	struct packed_oid_list oids;
 	struct packed_commit_list commits;
@@ -690,7 +692,26 @@ struct object_id *write_commit_graph(const char *pack_dir)
 		oids.nr = commit_graph->num_commits;
 	}
 
-	for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
+	if (pack_indexes) {
+		int pack_dir_len = strlen(pack_dir) + 1;
+		struct strbuf packname = STRBUF_INIT;
+		strbuf_add(&packname, pack_dir, pack_dir_len - 1);
+		strbuf_addch(&packname, '/');
+		for (i = 0; i < nr_packs; i++) {
+			struct packed_git *p;
+			strbuf_setlen(&packname, pack_dir_len);
+			strbuf_addstr(&packname, pack_indexes[i]);
+			p = add_packed_git(packname.buf, packname.len, 1);
+			if (!p)
+				die("error adding pack %s", packname.buf);
+			if (open_pack_index(p))
+				die("error opening index for %s", packname.buf);
+			for_each_object_in_pack(p, if_packed_commit_add_to_list, &oids);
+			close_pack(p);
+		}
+	}
+	else
+		for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
 
 	close_reachable(&oids);
 	QSORT(oids.list, oids.nr, commit_compare);
diff --git a/commit-graph.h b/commit-graph.h
index 7c4c9c38ab..918b34dd2b 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -50,7 +50,9 @@ extern struct object_id *get_nth_commit_oid(struct commit_graph *g,
 					    uint32_t n,
 					    struct object_id *oid);
 
-extern struct object_id *write_commit_graph(const char *pack_dir);
+extern struct object_id *write_commit_graph(const char *pack_dir,
+					    const char **pack_indexes,
+					    int nr_packs);
 
 #endif
 
diff --git a/packfile.c b/packfile.c
index 29f5dc2398..e4b1dc02bc 100644
--- a/packfile.c
+++ b/packfile.c
@@ -299,7 +299,7 @@ void close_pack_index(struct packed_git *p)
 	}
 }
 
-static void close_pack(struct packed_git *p)
+void close_pack(struct packed_git *p)
 {
 	close_pack_windows(p);
 	close_pack_fd(p);
@@ -1840,7 +1840,7 @@ int has_pack_index(const unsigned char *sha1)
 	return 1;
 }
 
-static int for_each_object_in_pack(struct packed_git *p, each_packed_object_fn cb, void *data)
+int for_each_object_in_pack(struct packed_git *p, each_packed_object_fn cb, void *data)
 {
 	uint32_t i;
 	int r = 0;
diff --git a/packfile.h b/packfile.h
index 0cdeb54dcd..9281e909d5 100644
--- a/packfile.h
+++ b/packfile.h
@@ -61,6 +61,7 @@ extern void close_pack_index(struct packed_git *);
 
 extern unsigned char *use_pack(struct packed_git *, struct pack_window **, off_t, unsigned long *);
 extern void close_pack_windows(struct packed_git *);
+extern void close_pack(struct packed_git *);
 extern void close_all_packs(void);
 extern void unuse_pack(struct pack_window **);
 extern void clear_delta_base_cache(void);
@@ -133,6 +134,7 @@ typedef int each_packed_object_fn(const struct object_id *oid,
 				  struct packed_git *pack,
 				  uint32_t pos,
 				  void *data);
+extern int for_each_object_in_pack(struct packed_git *p, each_packed_object_fn, void *data);
 extern int for_each_packed_object(each_packed_object_fn, void *, unsigned flags);
 
 #endif
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 1e3fe59d70..e3546e6844 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -169,6 +169,19 @@ test_expect_success 'clear graph' '
 graph_git_behavior 'cleared graph, commit 8 vs merge 1' commits/8 merge/1
 graph_git_behavior 'cleared graph, commit 8 vs merge 2' commits/8 merge/2
 
+test_expect_success 'build graph from latest pack with closure' '
+	graph5=$(cat new-idx | git commit-graph write --update-head --delete-expired --stdin-packs) &&
+	test_path_is_file $packdir/graph-$graph5.graph &&
+	test_path_is_file $packdir/graph-head &&
+	printf $graph5 >expect &&
+	test_cmp expect $packdir/graph-head &&
+	git commit-graph read --graph-hash=$graph5 >output &&
+	graph_read_expect "9" "$packdir" &&
+	test_cmp expect output'
+
+graph_git_behavior 'graph from pack, commit 8 vs merge 1' commits/8 merge/1
+graph_git_behavior 'graph from pack, commit 8 vs merge 2' commits/8 merge/2
+
 test_expect_success 'setup bare repo' '
 	cd .. &&
 	git clone --bare --no-local full bare &&
-- 
2.15.1.45.g9b7079f


* [PATCH v3 14/14] commit-graph: build graph from starting commits
  2018-02-08 20:37 ` [PATCH v3 " Derrick Stolee
                     ` (12 preceding siblings ...)
  2018-02-08 20:37   ` [PATCH v3 13/14] commit-graph: read only from specific pack-indexes Derrick Stolee
@ 2018-02-08 20:37   ` Derrick Stolee
  2018-02-09 13:02     ` SZEDER Gábor
  2018-02-14 18:15   ` [PATCH v3 00/14] Serialized Git Commit Graph Derrick Stolee
  14 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-08 20:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, git, gitster, peff, jonathantanmy, sbeller, szeder.dev

Teach git-commit-graph to read commits from stdin when the
--stdin-commits flag is specified. Commits reachable from these
commits are added to the graph. This is a much faster way to construct
the graph than inspecting all packed objects, but is restricted to
known tips.

For the Linux repository, 700,000+ commits were added to the graph
file starting from 'master' in 7-9 seconds, depending on the number
of packfiles in the repo (1, 24, or 120).

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
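Since --stdin-commits takes one hex OID per line, each line has to be
validated before it is used. A minimal standalone sketch of that check
follows, assuming 20-byte (40 hex digit) object names. It only covers
the "does this parse as an OID" part; checking that the object is a
commit is not shown. parse_hex_oid_sketch() and hexval_sketch() are
hypothetical names, not Git functions.

```c
#include <stddef.h>

#define SK_RAWSZ 20
#define SK_HEXSZ (2 * SK_RAWSZ)

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hexval_sketch(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Parse exactly SK_HEXSZ hex digits followed by NUL into out, which
 * must hold SK_RAWSZ bytes. Returns 0 on success, -1 otherwise.
 * Stops at the first invalid character, so short input is never
 * read past its terminating NUL.
 */
static int parse_hex_oid_sketch(const char *line, unsigned char *out)
{
	size_t i;

	for (i = 0; i < SK_RAWSZ; i++) {
		int hi = hexval_sketch(line[2 * i]);
		int lo;

		if (hi < 0)
			return -1;
		lo = hexval_sketch(line[2 * i + 1]);
		if (lo < 0)
			return -1;
		out[i] = (unsigned char)((hi << 4) | lo);
	}
	return line[SK_HEXSZ] == '\0' ? 0 : -1;
}
```
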
 Documentation/git-commit-graph.txt | 15 ++++++++++++++-
 builtin/commit-graph.c             | 26 +++++++++++++++++++++-----
 commit-graph.c                     | 26 ++++++++++++++++++++++++--
 commit-graph.h                     |  4 +++-
 t/t5318-commit-graph.sh            | 19 +++++++++++++++++++
 5 files changed, 81 insertions(+), 9 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 727d5d70bb..bd1c54025a 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -45,7 +45,12 @@ graph-head file if it is updated by the `--update-head` option.
 +
 With the `--stdin-packs` option, generate the new commit graph by
 walking objects only in the specified packfiles and any commits in
-the existing graph-head.
+the existing graph-head. (Cannot be combined with --stdin-commits.)
++
+With the `--stdin-commits` option, generate the new commit graph by
+walking commits starting at the commits specified in stdin as a list
+of OIDs in hex, one OID per line. (Cannot be combined with
+--stdin-packs.)
 
 'read'::
 
@@ -83,6 +88,14 @@ $ git commit-graph write --update-head --delete-expired
 $ echo <pack-index> | git commit-graph write --update-head --delete-expired --stdin-packs
 ------------------------------------------------
 
+* Write a graph file, extending the current graph file using all
+  commits reachable from refs/heads/*, update graph-head, and delete
+  stale graph files.
++
+------------------------------------------------
+$ git show-ref --heads -s | git commit-graph write --update-head --delete-expired --stdin-commits
+------------------------------------------------
+
 * Read basic information from a graph file.
 +
 ------------------------------------------------
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index fe5f00551c..28d043b5a8 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -9,7 +9,7 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--pack-dir <packdir>]"),
 	N_("git commit-graph clear [--pack-dir <packdir>]"),
 	N_("git commit-graph read [--graph-hash=<hash>]"),
-	N_("git commit-graph write [--pack-dir <packdir>] [--update-head] [--delete-expired] [--stdin-packs]"),
+	N_("git commit-graph write [--pack-dir <packdir>] [--update-head] [--delete-expired] [--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
@@ -24,7 +24,7 @@ static const char * const builtin_commit_graph_read_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--pack-dir <packdir>] [--update-head] [--delete-expired] [--stdin-packs]"),
+	N_("git commit-graph write [--pack-dir <packdir>] [--update-head] [--delete-expired] [--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
@@ -34,6 +34,7 @@ static struct opts_commit_graph {
 	int update_head;
 	int delete_expired;
 	int stdin_packs;
+	int stdin_commits;
 } opts;
 
 static int graph_clear(int argc, const char **argv)
@@ -219,6 +220,8 @@ static int graph_write(int argc, const char **argv)
 	int has_existing;
 	const char **pack_indexes = NULL;
 	int nr_packs = 0;
+	const char **commit_hex = NULL;
+	int nr_commits = 0;
 	const char **lines = NULL;
 	int nr_lines = 0;
 	int alloc_lines = 0;
@@ -233,6 +236,8 @@ static int graph_write(int argc, const char **argv)
 			N_("delete expired head graph file")),
 		OPT_BOOL('s', "stdin-packs", &opts.stdin_packs,
 			N_("only scan packfiles listed by stdin")),
+		OPT_BOOL('C', "stdin-commits", &opts.stdin_commits,
+			N_("start walk at commits listed by stdin")),
 		OPT_END(),
 	};
 
@@ -240,6 +245,9 @@ static int graph_write(int argc, const char **argv)
 			     builtin_commit_graph_write_options,
 			     builtin_commit_graph_write_usage, 0);
 
+	if (opts.stdin_packs && opts.stdin_commits)
+		die(_("cannot use both --stdin-commits and --stdin-packs"));
+
 	if (!opts.pack_dir) {
 		struct strbuf path = STRBUF_INIT;
 		strbuf_addstr(&path, get_object_directory());
@@ -261,13 +269,21 @@ static int graph_write(int argc, const char **argv)
 			strbuf_detach(&buf, NULL);
 		}
 
-		pack_indexes = lines;
-		nr_packs = nr_lines;
+		if (opts.stdin_packs) {
+			pack_indexes = lines;
+			nr_packs = nr_lines;
+		}
+		if (opts.stdin_commits) {
+			commit_hex = lines;
+			nr_commits = nr_lines;
+		}
 	}
 
 	graph_hash = write_commit_graph(opts.pack_dir,
 					pack_indexes,
-					nr_packs);
+					nr_packs,
+					commit_hex,
+					nr_commits);
 
 	if (opts.update_head)
 		update_head_file(opts.pack_dir, graph_hash);
diff --git a/commit-graph.c b/commit-graph.c
index 27a34f5eda..3ff3ab03ca 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -657,7 +657,9 @@ static void close_reachable(struct packed_oid_list *oids)
 
 struct object_id *write_commit_graph(const char *pack_dir,
 				     const char **pack_indexes,
-				     int nr_packs)
+				     int nr_packs,
+				     const char **commit_hex,
+				     int nr_commits)
 {
 	struct packed_oid_list oids;
 	struct packed_commit_list commits;
@@ -710,7 +712,27 @@ struct object_id *write_commit_graph(const char *pack_dir,
 			close_pack(p);
 		}
 	}
-	else
+
+	if (commit_hex) {
+		for (i = 0; i < nr_commits; i++) {
+			const char *end;
+			struct object_id oid;
+			struct commit *result;
+
+			if (commit_hex[i] && parse_oid_hex(commit_hex[i], &oid, &end))
+				continue;
+
+			result = lookup_commit_reference_gently(&oid, 1);
+
+			if (result) {
+				ALLOC_GROW(oids.list, oids.nr + 1, oids.alloc);
+				oids.list[oids.nr] = &(result->object.oid);
+				oids.nr++;
+			}
+		}
+	}
+
+	if (!pack_indexes && !commit_hex)
 		for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
 
 	close_reachable(&oids);
diff --git a/commit-graph.h b/commit-graph.h
index 918b34dd2b..c412f76707 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -52,7 +52,9 @@ extern struct object_id *get_nth_commit_oid(struct commit_graph *g,
 
 extern struct object_id *write_commit_graph(const char *pack_dir,
 					    const char **pack_indexes,
-					    int nr_packs);
+					    int nr_packs,
+					    const char **commit_hex,
+					    int nr_commits);
 
 #endif
 
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index e3546e6844..d803c12afd 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -182,6 +182,25 @@ test_expect_success 'build graph from latest pack with closure' '
 graph_git_behavior 'graph from pack, commit 8 vs merge 1' commits/8 merge/1
 graph_git_behavior 'graph from pack, commit 8 vs merge 2' commits/8 merge/2
 
+test_expect_success 'build graph from commits with closure' '
+	git tag -a -m "merge" tag/merge merge/3 &&
+	git rev-parse tag/merge >commits-in &&
+	git rev-parse commits/8 >>commits-in &&
+	git rev-parse merge/1 >>commits-in &&
+	git rev-parse merge/2 >>commits-in &&
+	graph6=$(cat commits-in | git commit-graph write --update-head --delete-expired --stdin-commits) &&
+	test_path_is_file $packdir/graph-$graph6.graph &&
+	test_path_is_file $packdir/graph-$graph5.graph &&
+	test_path_is_file $packdir/graph-head &&
+	printf $graph6 >expect &&
+	test_cmp expect $packdir/graph-head &&
+	git commit-graph read --graph-hash=$graph6 >output &&
+	graph_read_expect "11" "$packdir" &&
+	test_cmp expect output'
+
+graph_git_behavior 'graph from commits, commit 8 vs merge 1' commits/8 merge/1
+graph_git_behavior 'graph from commits, commit 8 vs merge 2' commits/8 merge/2
+
 test_expect_success 'setup bare repo' '
 	cd .. &&
 	git clone --bare --no-local full bare &&
-- 
2.15.1.45.g9b7079f
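
The input handling that --stdin-commits relies on (one hex OID per line,
skipping anything that does not parse) can be sketched in isolation. This
is only an illustration under stated assumptions: toy_parse_oid_line(),
hex_value(), TOY_HEXSZ and TOY_RAWSZ are hypothetical names, a 40-digit
SHA-1 hex form is assumed, and Git's own parse_oid_hex() is not used here.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_HEXSZ 40 /* assumed: SHA-1 in hex */
#define TOY_RAWSZ 20 /* assumed: SHA-1 in raw bytes */

/* Map one hex digit to its value, or -1 if it is not a hex digit. */
static int hex_value(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Parse a line holding exactly TOY_HEXSZ hex digits, optionally followed
 * by a newline. Returns 0 on success and -1 otherwise. A NUL byte is not
 * a hex digit, so the loop stops before reading past a short string.
 */
static int toy_parse_oid_line(const char *line, unsigned char out[TOY_RAWSZ])
{
	size_t i;

	assert(out != NULL);
	if (!line)
		return -1;
	for (i = 0; i < TOY_RAWSZ; i++) {
		int hi = hex_value(line[2 * i]);
		int lo;

		if (hi < 0)
			return -1;
		lo = hex_value(line[2 * i + 1]);
		if (lo < 0)
			return -1;
		out[i] = (unsigned char)((hi << 4) | lo);
	}
	return (line[TOY_HEXSZ] == '\0' || line[TOY_HEXSZ] == '\n') ? 0 : -1;
}
```

A caller would skip any line for which this returns -1. Note that the
sketch rejects a NULL line outright, whereas the condition in the
commit-graph.c hunk above only skips an entry when it is non-NULL and
fails to parse.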



* Re: [PATCH v3 01/14] commit-graph: add format document
  2018-02-08 20:37   ` [PATCH v3 01/14] commit-graph: add format document Derrick Stolee
@ 2018-02-08 21:21     ` Junio C Hamano
  2018-02-08 21:33       ` Derrick Stolee
  0 siblings, 1 reply; 146+ messages in thread
From: Junio C Hamano @ 2018-02-08 21:21 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, dstolee, git, peff, jonathantanmy, sbeller, szeder.dev

Derrick Stolee <stolee@gmail.com> writes:

> Add document specifying the binary format for commit graphs. This
> format allows for:
>
> * New versions.
> * New hash functions and hash lengths.

It still is unclear, at least to me, why OID and OID length are
stored as if they can be independent.  If a reader does not
understand a new Object Id hash, is there anything the reader can
still do by knowing how long the hash (which it cannot recompute to
validate) is?  And if a reader does know what OID hashing scheme is
used to refer to the objects, it certainly would know how long the
OIDs are.

Giving length may make sense only when a reader can treat these OIDs
as completely opaque identifiers, without having to (re)hash from
the contents, but if that is the case, then there is no point saying
what exact hash function is used to compute OID.

So I'd understand storing only either one or the other, but not
both.  Am I missing something?

> +The Git commit graph stores a list of commit OIDs and some associated
> +metadata, including:
> +
> +- The generation number of the commit. Commits with no parents have
> +  generation number 1; commits with parents have generation number
> +  one more than the maximum generation number of its parents. We
> +  reserve zero as special, and can be used to mark a generation
> +  number invalid or as "not computed".

This "most natural" definition of generation number is stricter than
absolutely necessary (a looser definition that is sufficient is
"gennum of a child is larger than all of its parents'").  While I
personally think that is OK, some people who floated different ideas
in previous discussions on generation numbers may want to articulate
their ideas again.  One idea that I found clever was to use the
total number of commits that are ancestors of a commit instead (it
is far more expensive to compute than the most natural gennum, but
doing so may help other topology math, like "describe").

> +CHUNK LOOKUP:
> +
> +  (C + 1) * 12 bytes listing the table of contents for the chunks:
> +      First 4 bytes describe chunk id. Value 0 is a terminating label.
> +      Other 8 bytes provide offset in current file for chunk to start.
> +      (Chunks are ordered contiguously in the file, so you can infer
> +      the length using the next chunk position if necessary.)

Aren't chunks numbered contiguously, starting from #1, thereby
making it unnecessary to store the 4-byte?

How does a reader obtain the length of the last chunk?  Ahh, that is
why there are C+1 entries in this table, not just C, so that the
reader knows where to stop while reading the last one.  Does that
mean that this table looks like this?
   
    { 1, offset_1 },
    { 2, offset_2 },
    ...
    { C, offset_C },
    { 0, offset_C+1 },

where (offset_N+1 - offset_N) gives the length of chunk #N?

> +  The remaining data in the body is described one chunk at a time, and
> +  these chunks may be given in any order. Chunks are required unless
> +  otherwise specified.
> +
> +CHUNK DATA:
> +
> +  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
> +      The ith entry, F[i], stores the number of OIDs with first
> +      byte at most i. Thus F[255] stores the total
> +      number of commits (N).
> +
> +  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
> +      The OIDs for all commits in the graph, sorted in ascending order.
> +
> +  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
> +    * The first H bytes are for the OID of the root tree.
> +    * The next 8 bytes are for the int-ids of the first two parents
> +      of the ith commit. Stores value 0xffffffff if no parent in that
> +      position. If there are more than two parents, the second value
> +      has its most-significant bit on and the other bits store an array
> +      position into the Large Edge List chunk.
> +    * The next 8 bytes store the generation number of the commit and
> +      the commit time in seconds since EPOCH. The generation number
> +      uses the higher 30 bits of the first 4 bytes, while the commit
> +      time uses the 32 bits of the second 4 bytes, along with the lowest
> +      2 bits of the lowest byte, storing the 33rd and 34th bit of the
> +      commit time.
> +
> +  Large Edge List (ID: {'E', 'D', 'G', 'E'})
> +      This list of 4-byte values store the second through nth parents for
> +      all octopus merges. The second parent value in the commit data stores
> +      an array position within this list along with the most-significant bit
> +      on. Starting at that array position, iterate through this list of int-ids
> +      for the parents until reaching a value with the most-significant bit on.
> +      The other bits correspond to the int-id of the last parent. This chunk
> +      should always be present, but may be empty.
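
One possible reading of the generation/commit-time layout quoted above,
as a sketch only: the generation number sits in the upper 30 bits of the
first 32-bit word, the low 2 bits of that word hold bits 33 and 34 of the
commit time, and the second word holds the low 32 bits of the commit
time. The function names are hypothetical, and the words are shown in
host order; byte order on disk is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a 30-bit generation number and a 34-bit commit time into two words. */
static void toy_pack_gen_and_time(uint32_t generation, uint64_t commit_time,
				  uint32_t out[2])
{
	assert(generation < (UINT32_C(1) << 30));
	assert(commit_time < (UINT64_C(1) << 34));

	/* upper 30 bits: generation; lower 2 bits: time bits 33 and 34 */
	out[0] = (generation << 2) | (uint32_t)(commit_time >> 32);
	/* low 32 bits of the commit time */
	out[1] = (uint32_t)(commit_time & UINT64_C(0xffffffff));
}

/* Inverse of toy_pack_gen_and_time(). */
static void toy_unpack_gen_and_time(const uint32_t in[2],
				    uint32_t *generation,
				    uint64_t *commit_time)
{
	*generation = in[0] >> 2;
	*commit_time = ((uint64_t)(in[0] & 0x3) << 32) | in[1];
}
```

With this reading, a generation number of 0 still round-trips, so the
"not computed" marker needs no extra bit.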

I am not convinced about the value of these 4-byte section IDs.  

They are useless unless you define what should happen when a reader
sees a block of data with an unknown ID; presence of these IDs gives
an impression that you allow a reimplementation to reorder the
sections unnecessarily, especially when all of these are required
anyway, making the canonical reader implementation unnecessarily
complex, no?



* Re: [PATCH v3 03/14] commit-graph: create git-commit-graph builtin
  2018-02-08 20:37   ` [PATCH v3 03/14] commit-graph: create git-commit-graph builtin Derrick Stolee
@ 2018-02-08 21:27     ` Junio C Hamano
  2018-02-08 21:36       ` Derrick Stolee
  0 siblings, 1 reply; 146+ messages in thread
From: Junio C Hamano @ 2018-02-08 21:27 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, dstolee, git, peff, jonathantanmy, sbeller, szeder.dev

Derrick Stolee <stolee@gmail.com> writes:

> Teach git the 'commit-graph' builtin that will be used for writing and
> reading packed graph files. The current implementation is mostly
> empty, except for a '--pack-dir' option.

Why do we want to use "pack" dir, when this is specifically designed
not to be tied to packfiles?  .git/objects/pack/ certainly is a possibility
in the sense that anywhere inside .git/objects/ would make sense,
but using the "pack" dir smells like signalling a wrong message to
users.



* Re: [PATCH v3 01/14] commit-graph: add format document
  2018-02-08 21:21     ` Junio C Hamano
@ 2018-02-08 21:33       ` Derrick Stolee
  2018-02-08 23:16         ` Junio C Hamano
  0 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-08 21:33 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, dstolee, git, peff, jonathantanmy, sbeller, szeder.dev

On 2/8/2018 4:21 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> Add document specifying the binary format for commit graphs. This
>> format allows for:
>>
>> * New versions.
>> * New hash functions and hash lengths.
> It still is unclear, at least to me, why OID and OID length are
> stored as if they can be independent.  If a reader does not
> understand a new Object Id hash, is there anything the reader can
> still do by knowing how long the hash (which it cannot recompute to
> validate) is?  And if a reader does know what OID hashing scheme is
> used to refer to the objects, it certainly would know how long the
> OIDs are.
>
> Giving length may make sense only when a reader can treat these OIDs
> as completely opaque identifiers, without having to (re)hash from
> the contents, but if that is the case, then there is no point saying
> what exact hash function is used to compute OID.
>
> So I'd understand storing only either one or the other, but not
> both.  Am I missing something?

You're right that this data is redundant. It is easy to describe the 
width of the tables using the OID length, so it is convenient to have 
that part of the format. Also, it is good to have 4-byte alignment here, 
so we are not wasting space.

There isn't a strong reason to put that here, but I don't have a great 
reason to remove it, either.

Perhaps leave a byte blank for possible future use?

>
>> +The Git commit graph stores a list of commit OIDs and some associated
>> +metadata, including:
>> +
>> +- The generation number of the commit. Commits with no parents have
>> +  generation number 1; commits with parents have generation number
>> +  one more than the maximum generation number of its parents. We
>> +  reserve zero as special, and can be used to mark a generation
>> +  number invalid or as "not computed".
> This "most natural" definition of generation number is stricter than
> absolutely necessary (a looser definition that is sufficient is
> "gennum of a child is larger than all of its parents'").  While I
> personally think that is OK, some people who floated different ideas
> in previous discussions on generation numbers may want to articulate
> their ideas again.  One idea that I found clever was to use the
> total number of commits that are ancestors of a commit instead (it
> is far more expensive to compute than the most natural gennum, but
> doing so may help other topology math, like "describe").

It is more difficult to compute the number of reachable commits, since 
you cannot learn that only by looking at the parents (you need to know 
how many commits are in the intersection of their reachable sets for a 
two-parent merge, or just walk all of the commits). This leads to a 
quadratic computation to discover the value for N commits.

I define it this rigidly now because I will submit a patch soon after 
this one lands that computes generation numbers and consumes them in 
paint_down_to_common(). I've got it sitting in my local repo ready for a 
rebase.
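
To make the rigid definition concrete: a commit with no parents gets
generation 1, and any other commit gets one more than the largest
generation among its parents, so each value follows from the parents
alone. A minimal sketch, assuming a hypothetical struct toy_commit (not
Git's struct commit) and an array already in topological order with
parents before children:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory commit; parents are indexes into the same array. */
struct toy_commit {
	size_t nr_parents;
	const size_t *parents;
	uint32_t generation; /* 0 means "not computed" */
};

/*
 * Assumes every parent index is smaller than the index of its child,
 * so each parent's generation is known before the child is visited.
 */
static void toy_compute_generations(struct toy_commit *commits, size_t nr)
{
	size_t i, j;

	for (i = 0; i < nr; i++) {
		uint32_t max_parent = 0;

		for (j = 0; j < commits[i].nr_parents; j++) {
			size_t p = commits[i].parents[j];

			assert(p < i); /* parents must come first */
			if (commits[p].generation > max_parent)
				max_parent = commits[p].generation;
		}
		/* a root has max_parent == 0 and therefore generation 1 */
		commits[i].generation = max_parent + 1;
	}
}
```

Each commit and each parent edge is visited once, which is the contrast
with counting all reachable ancestors.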

>
>> +CHUNK LOOKUP:
>> +
>> +  (C + 1) * 12 bytes listing the table of contents for the chunks:
>> +      First 4 bytes describe chunk id. Value 0 is a terminating label.
>> +      Other 8 bytes provide offset in current file for chunk to start.
>> +      (Chunks are ordered contiguously in the file, so you can infer
>> +      the length using the next chunk position if necessary.)
> Aren't chunks numbered contiguously, starting from #1, thereby
> making it unnecessary to store the 4-byte?
>
> How does a reader obtain the length of the last chunk?  Ahh, that is
> why there are C+1 entries in this table, not just C, so that the
> reader knows where to stop while reading the last one.  Does that
> mean that this table looks like this?
>     
>      { 1, offset_1 },
>      { 2, offset_2 },
>      ...
>      { C, offset_C },
>      { 0, offset_C+1 },
>
> where (offset_N+1 - offset_N) gives the length of chunk #N?

This is correct.
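
A reader-side sketch of that table, under assumptions: each entry is a
4-byte chunk id followed by an 8-byte file offset, both taken to be
big-endian here, and the table holds C + 1 entries with a terminating id
of 0. toy_find_chunk() and the read helpers are hypothetical names, not
functions from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_TOC_ENTRY_SIZE 12 /* 4-byte id + 8-byte offset */

static uint32_t toy_read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t toy_read_be64(const unsigned char *p)
{
	return ((uint64_t)toy_read_be32(p) << 32) | toy_read_be32(p + 4);
}

/*
 * Search the first nr_chunks entries for `id`. The table must hold
 * nr_chunks + 1 entries, so the entry after a match always exists and
 * supplies the end of the chunk. Returns 1 if found, 0 otherwise.
 */
static int toy_find_chunk(const unsigned char *toc, size_t nr_chunks,
			  uint32_t id, uint64_t *offset, uint64_t *length)
{
	size_t i;

	assert(id != 0); /* 0 is the terminating label */
	for (i = 0; i < nr_chunks; i++) {
		const unsigned char *entry = toc + i * TOY_TOC_ENTRY_SIZE;
		const unsigned char *next = entry + TOY_TOC_ENTRY_SIZE;

		if (toy_read_be32(entry) != id)
			continue;
		*offset = toy_read_be64(entry + 4);
		*length = toy_read_be64(next + 4) - *offset;
		return 1;
	}
	return 0;
}
```

Because the lookup goes by id rather than by position, a reader written
this way does not care about chunk order and can simply fail to find (or
ignore) ids it does not know.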

>
>> +  The remaining data in the body is described one chunk at a time, and
>> +  these chunks may be given in any order. Chunks are required unless
>> +  otherwise specified.
>> +
>> +CHUNK DATA:
>> +
>> +  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
>> +      The ith entry, F[i], stores the number of OIDs with first
>> +      byte at most i. Thus F[255] stores the total
>> +      number of commits (N).
>> +
>> +  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
>> +      The OIDs for all commits in the graph, sorted in ascending order.
>> +
>> +  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
>> +    * The first H bytes are for the OID of the root tree.
>> +    * The next 8 bytes are for the int-ids of the first two parents
>> +      of the ith commit. Stores value 0xffffffff if no parent in that
>> +      position. If there are more than two parents, the second value
>> +      has its most-significant bit on and the other bits store an array
>> +      position into the Large Edge List chunk.
>> +    * The next 8 bytes store the generation number of the commit and
>> +      the commit time in seconds since EPOCH. The generation number
>> +      uses the higher 30 bits of the first 4 bytes, while the commit
>> +      time uses the 32 bits of the second 4 bytes, along with the lowest
>> +      2 bits of the lowest byte, storing the 33rd and 34th bit of the
>> +      commit time.
>> +
>> +  Large Edge List (ID: {'E', 'D', 'G', 'E'})
>> +      This list of 4-byte values store the second through nth parents for
>> +      all octopus merges. The second parent value in the commit data stores
>> +      an array position within this list along with the most-significant bit
>> +      on. Starting at that array position, iterate through this list of int-ids
>> +      for the parents until reaching a value with the most-significant bit on.
>> +      The other bits correspond to the int-id of the last parent. This chunk
>> +      should always be present, but may be empty.
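
The parent encoding quoted above can be walked as follows. This is a
sketch of one reading of the text, not the patch's code: TOY_PARENT_NONE,
TOY_HIGH_BIT and toy_collect_parents() are hypothetical names, and the
"no parent" value is checked first because 0xffffffff also has its
most-significant bit set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PARENT_NONE 0xffffffffu /* "no parent in that position" */
#define TOY_HIGH_BIT    0x80000000u /* most-significant bit */

/*
 * Collect the int-ids of all parents of one commit into out[].
 * `first` and `second` are the two parent slots from the commit data;
 * `edges` is the Large Edge List. Returns the number of parents.
 */
static size_t toy_collect_parents(uint32_t first, uint32_t second,
				  const uint32_t *edges, size_t nr_edges,
				  uint32_t *out, size_t max_out)
{
	size_t n = 0;
	uint32_t pos;

	if (first == TOY_PARENT_NONE)
		return 0;
	assert(n < max_out);
	out[n++] = first;

	if (second == TOY_PARENT_NONE)
		return n;
	if (!(second & TOY_HIGH_BIT)) {
		/* ordinary two-parent merge */
		assert(n < max_out);
		out[n++] = second;
		return n;
	}

	/* octopus merge: follow the list until an entry has the high bit */
	for (pos = second & ~TOY_HIGH_BIT; ; pos++) {
		assert(pos < nr_edges && n < max_out);
		out[n++] = edges[pos] & ~TOY_HIGH_BIT;
		if (edges[pos] & TOY_HIGH_BIT)
			break;
	}
	return n;
}
```
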
> I am not convinced about the value of these 4-byte section IDs.
>
> They are useless unless you define what should happen when a reader
> sees a block of data with an unknown ID; presence of these IDs gives
> an impression that you allow a reimplementation to reorder the
> sections unnecessarily, especially when all of these are required
> anyway, making the canonical reader implementation unnecessarily
> complex, no?
>

One reason to have chunks is to be simple: we have a clear table of 
contents at the beginning of the file and can immediately navigate to 
portions of the file we care about. The chunk order also doesn't matter 
for the purpose of the format.

The true value they present is flexibility: We can extend the v1 format 
to include extra metadata in a chunk and insert it anywhere in the 
format. If there is some extra information that would be beneficial for 
graphs (say, a de-duplicated author list) then we could extend the 
format to be v1-compatible.

BUT: I do notice now that load_commit_graph_one() will die() if seeing a 
chunk id it doesn't recognize, so that should be fixed if we keep this 
chunk model. We should be able to read "v1.1" files if they have extra 
chunks (but ignore that data).

Thanks,
-Stolee


* Re: [PATCH v3 03/14] commit-graph: create git-commit-graph builtin
  2018-02-08 21:27     ` Junio C Hamano
@ 2018-02-08 21:36       ` Derrick Stolee
  2018-02-08 23:21         ` Junio C Hamano
  0 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-08 21:36 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, dstolee, git, peff, jonathantanmy, sbeller, szeder.dev

On 2/8/2018 4:27 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> Teach git the 'commit-graph' builtin that will be used for writing and
>> reading packed graph files. The current implementation is mostly
>> empty, except for a '--pack-dir' option.
> Why do we want to use "pack" dir, when this is specifically designed
> not to be tied to packfiles?  .git/objects/pack/ certainly is a possibility
> in the sense that anywhere inside .git/objects/ would make sense,
> but using the "pack" dir smells like signalling a wrong message to
> users.
>

I wanted to have the smallest possible footprint in the objects 
directory, and the .git/objects directory currently only holds folders.

I suppose this feature, along with the multi-pack-index (MIDX), extends 
the concept of the pack directory to be a "compressed data" directory, 
but keeps the "pack" name to be compatible with earlier versions.

Another option is to create a .git/objects/graph directory instead, but 
then we need to worry about that directory being present.

Thanks,
-Stolee


* Re: [PATCH v3 04/14] commit-graph: implement write_commit_graph()
  2018-02-08 20:37   ` [PATCH v3 04/14] commit-graph: implement write_commit_graph() Derrick Stolee
@ 2018-02-08 22:14     ` Junio C Hamano
  2018-02-15 18:19     ` Junio C Hamano
  1 sibling, 0 replies; 146+ messages in thread
From: Junio C Hamano @ 2018-02-08 22:14 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, dstolee, git, peff, jonathantanmy, sbeller, szeder.dev

Derrick Stolee <stolee@gmail.com> writes:

> +char* get_commit_graph_filename_hash(const char *pack_dir,

Asterisk sticks to the identifier, not type, in our codebase.

> +				     struct object_id *hash)
> +{
> +	size_t len;
> +	struct strbuf path = STRBUF_INIT;
> +	strbuf_addstr(&path, pack_dir);
> +	strbuf_addstr(&path, "/graph-");
> +	strbuf_addstr(&path, oid_to_hex(hash));
> +	strbuf_addstr(&path, ".graph");

Use of strbuf_addf() would make it easier to read and maintain, no?

> +
> +	return strbuf_detach(&path, &len);
> +}
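
The single-format-string idea behind the strbuf_addf() suggestion can be
shown outside of Git. In this sketch snprintf() stands in for the strbuf
API, and toy_graph_filename() is a hypothetical helper, not the function
from the patch; inside Git the same format string would be passed to
strbuf_addf() instead.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build "<pack_dir>/graph-<hash_hex>.graph" in a freshly allocated
 * buffer. The caller frees the result. Returns NULL if malloc() fails.
 */
static char *toy_graph_filename(const char *pack_dir, const char *hash_hex)
{
	size_t len;
	char *path;

	assert(pack_dir && hash_hex);
	len = strlen(pack_dir) + strlen("/graph-") +
	      strlen(hash_hex) + strlen(".graph") + 1;
	path = malloc(len);
	if (!path)
		return NULL;
	/* one format string describes the whole name */
	snprintf(path, len, "%s/graph-%s.graph", pack_dir, hash_hex);
	return path;
}
```
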
> +
> +static void write_graph_chunk_fanout(struct sha1file *f,
> +				     struct commit **commits,
> +				     int nr_commits)
> +{
> +	uint32_t i, count = 0;
> +	struct commit **list = commits;
> +	struct commit **last = commits + nr_commits;
> +
> +	/*
> +	 * Write the first-level table (the list is sorted,
> +	 * but we use a 256-entry lookup to be able to avoid
> +	 * having to do eight extra binary search iterations).
> +	 */
> +	for (i = 0; i < 256; i++) {
> +		while (list < last) {
> +			if ((*list)->object.oid.hash[0] != i)
> +				break;
> +			count++;
> +			list++;
> +		}

If count and list are always incremented in unison, perhaps you do
not need an extra variable "last".  If typeof(nr_commits) is wider
than typeof(count), this loop and the next write-be32 is screwed
anyway ;-)

This comment probably applies equally to some other uses of the same
"compute last pointer to compare with running pointer for
termination" pattern in this patch.

> +		sha1write_be32(f, count);
> +	}
> +}
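
The point about "last" can be illustrated with a standalone version of
the fanout loop in which the running count doubles as the list position.
This is a sketch under assumptions: toy_compute_fanout() is a
hypothetical name, it takes only the first byte of each OID, and the
input is assumed to be sorted in ascending order as the OID list is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * first_bytes[k] is the first byte of the k-th OID in sorted order.
 * On return, fanout[i] is the number of OIDs whose first byte is at
 * most i, so fanout[255] equals nr.
 */
static void toy_compute_fanout(const unsigned char *first_bytes,
			       uint32_t nr, uint32_t fanout[256])
{
	unsigned int i;
	uint32_t count = 0;

	assert(nr == 0 || first_bytes != NULL);
	for (i = 0; i < 256; i++) {
		/* count is both the total so far and the next index */
		while (count < nr && first_bytes[count] == i)
			count++;
		fanout[i] = count;
	}
}
```

Here the count and the length share one type, which sidesteps the width
mismatch mentioned above.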

> +static int if_packed_commit_add_to_list(const struct object_id *oid,

That is a strange name.  "collect packed commits", perhaps?

> +					struct packed_git *pack,
> +					uint32_t pos,
> +					void *data)
> +{
> +	struct packed_oid_list *list = (struct packed_oid_list*)data;
> +	enum object_type type;
> +	unsigned long size;
> +	void *inner_data;
> +	off_t offset = nth_packed_object_offset(pack, pos);
> +	inner_data = unpack_entry(pack, offset, &type, &size);
> +
> +	if (inner_data)
> +		free(inner_data);
> +
> +	if (type != OBJ_COMMIT)
> +		return 0;
> +
> +	ALLOC_GROW(list->list, list->nr + 1, list->alloc);

This probably will become inefficient in large repositories.  You
know you'll be walking all the pack files, and total number of
objects in a packfile can be read cheaply, so it may make sense to
make a rough guesstimate of the number of commits (e.g. 15-25% of the
total number of objects) in the repository and allocate the list
upfront, instead of a hardcoded 1024.
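
A sketch of such an upfront estimate, with assumptions stated plainly:
toy_initial_commit_alloc() and TOY_MIN_ALLOC are hypothetical, the
percentage is a tunable guess taken from the 15-25% range above, and the
total object count is assumed to have been summed from the packs
beforehand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MIN_ALLOC 1024 /* floor, matching the hardcoded size above */

/*
 * Guess how many list slots to allocate before walking the packs:
 * `percent` of all packed objects, but never fewer than TOY_MIN_ALLOC.
 */
static size_t toy_initial_commit_alloc(uint64_t total_objects,
				       unsigned int percent)
{
	uint64_t guess;

	assert(percent > 0 && percent <= 100);
	/* divide first so the product cannot exceed total_objects */
	guess = total_objects / 100 * percent;
	if (guess < TOY_MIN_ALLOC)
		guess = TOY_MIN_ALLOC;
	return (size_t)guess;
}
```

The list would still need to grow if the guess turns out low, so this
only replaces the starting size, not the growth logic.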



* Re: [PATCH v3 01/14] commit-graph: add format document
  2018-02-08 21:33       ` Derrick Stolee
@ 2018-02-08 23:16         ` Junio C Hamano
  0 siblings, 0 replies; 146+ messages in thread
From: Junio C Hamano @ 2018-02-08 23:16 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, dstolee, git, peff, jonathantanmy, sbeller, szeder.dev

Derrick Stolee <stolee@gmail.com> writes:

> You're right that this data is redundant. It is easy to describe the
> width of the tables using the OID length, so it is convenient to have
> that part of the format. Also, it is good to have 4-byte alignment
> here, so we are not wasting space.
>
> There isn't a strong reason to put that here, but I don't have a great
> reason to remove it, either.

Redundant information that can go out of sync is a great enough
reason not to have it in the first place.


* Re: [PATCH v3 03/14] commit-graph: create git-commit-graph builtin
  2018-02-08 21:36       ` Derrick Stolee
@ 2018-02-08 23:21         ` Junio C Hamano
  0 siblings, 0 replies; 146+ messages in thread
From: Junio C Hamano @ 2018-02-08 23:21 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, dstolee, git, peff, jonathantanmy, sbeller, szeder.dev

Derrick Stolee <stolee@gmail.com> writes:

> I wanted to have the smallest possible footprint in the objects
> directory, and the .git/objects directory currently only holds
> folders.

When we cull stale files from pack directory, we rely on the related
files to share pack-<hash>.* pattern.  It is better not to contaminate
the directory with unrelated cruft.

As this is purely optional auxiliary information used for optimization,
perhaps .git/objects/info is a better place?  I dunno.

In any case, even if its default position ends up in .git/objects/pack/,
if this thing conceptually does not have any ties with packs
(i.e. it is not a corruption if the graph file also described
topologies including loose objects, and it is not a corruption if
the graph file did not cover objects in all packs), then the end
user visible option name shouldn't say "--pack-dir".  "--graph-dir"
that defaults to .git/objects/pack/ might be acceptable but it still
feels wrong.




* Re: [PATCH v3 06/14] commit-graph: implement 'git-commit-graph read'
  2018-02-08 20:37   ` [PATCH v3 06/14] commit-graph: implement 'git-commit-graph read' Derrick Stolee
@ 2018-02-08 23:38     ` Junio C Hamano
  0 siblings, 0 replies; 146+ messages in thread
From: Junio C Hamano @ 2018-02-08 23:38 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, dstolee, git, peff, jonathantanmy, sbeller, szeder.dev

Derrick Stolee <stolee@gmail.com> writes:

> +'read'::
> +
> +Read a graph file given by the graph-head file and output basic
> +details about the graph file.

"a graph file", assuming that there must be only one in the
specified place?  Or if there are more than one, read all of them?
Or is it an error to have more than one?

    Do not answer questions in a message that is a response to _me_;
    the purpose of a review is not to educate reviewers---it is to
    improve the patch.

> +With `--graph-hash=<hash>` option, consider the graph file
> +graph-<hash>.graph in the pack directory.

I think it is more in line with how plumbing works to just let the
full pathname be specifiable (e.g. learn from how pack-objects takes
"pack-" prefix from the command line, even though in practice names
of all packs you see in any repos start from "pack-").

> +struct commit_graph *load_commit_graph_one(const char *graph_file, const char *pack_dir)

This somehow smells like a screwed up API.  It gets a filename to
read from that is directly passed to git_open().  Why does an
instance of graph have to know and remember the path to the directory
(i.e. pack_dir) that was given when it was constructed?  "I am an
instance that holds commit topology learned from this object
database" is something it might want to know and remember, but "I am
told that I'll eventually be written back to there when I was
created" does not sound like a useful thing to have.



* Re: [PATCH v3 14/14] commit-graph: build graph from starting commits
  2018-02-08 20:37   ` [PATCH v3 14/14] commit-graph: build graph from starting commits Derrick Stolee
@ 2018-02-09 13:02     ` SZEDER Gábor
  2018-02-09 13:45       ` Derrick Stolee
  0 siblings, 1 reply; 146+ messages in thread
From: SZEDER Gábor @ 2018-02-09 13:02 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git mailing list, Derrick Stolee, git, Junio C Hamano, Jeff King,
	jonathantanmy, Stefan Beller

On Thu, Feb 8, 2018 at 9:37 PM, Derrick Stolee <stolee@gmail.com> wrote:
> Teach git-commit-graph to read commits from stdin when the
> --stdin-commits flag is specified. Commits reachable from these
> commits are added to the graph. This is a much faster way to construct
> the graph than inspecting all packed objects, but is restricted to
> known tips.
>
> For the Linux repository, 700,000+ commits were added to the graph
> file starting from 'master' in 7-9 seconds, depending on the number
> of packfiles in the repo (1, 24, or 120).

It seems something went wrong with '--stdin-commits' in v3, look:

  ~/src/git (commit-graph-v2 %)$ time { git rev-parse HEAD | ./git commit-graph --write --update-head --stdin-commits ; }
  ee3223fe116bf7031a6c1ad6d41e0456beefa754

  real  0m1.199s
  user  0m1.123s
  sys   0m0.024s

  ~/src/git (commit-graph-v3 %)$ time { git rev-parse HEAD | ./git commit-graph write --update-head --stdin-commits ; }
  ee3223fe116bf7031a6c1ad6d41e0456beefa754

  real  0m30.766s
  user  0m29.120s
  sys   0m0.546s

* Re: [PATCH v3 14/14] commit-graph: build graph from starting commits
  2018-02-09 13:02     ` SZEDER Gábor
@ 2018-02-09 13:45       ` Derrick Stolee
  0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-09 13:45 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Git mailing list, Derrick Stolee, git, Junio C Hamano, Jeff King,
	jonathantanmy, Stefan Beller

On 2/9/2018 8:02 AM, SZEDER Gábor wrote:
> On Thu, Feb 8, 2018 at 9:37 PM, Derrick Stolee <stolee@gmail.com> wrote:
>> Teach git-commit-graph to read commits from stdin when the
>> --stdin-commits flag is specified. Commits reachable from these
>> commits are added to the graph. This is a much faster way to construct
>> the graph than inspecting all packed objects, but is restricted to
>> known tips.
>>
>> For the Linux repository, 700,000+ commits were added to the graph
>> file starting from 'master' in 7-9 seconds, depending on the number
>> of packfiles in the repo (1, 24, or 120).
> It seems something went wrong with '--stdin-commits' in v3, look:
>
>    ~/src/git (commit-graph-v2 %)$ time { git rev-parse HEAD | ./git
> commit-graph --write --update-head --stdin-commits ; }
>    ee3223fe116bf7031a6c1ad6d41e0456beefa754
>
>    real  0m1.199s
>    user  0m1.123s
>    sys   0m0.024s
>
>    ~/src/git (commit-graph-v3 %)$ time { git rev-parse HEAD | ./git
> commit-graph write --update-head --stdin-commits ; }
>    ee3223fe116bf7031a6c1ad6d41e0456beefa754
>
>    real  0m30.766s
>    user  0m29.120s
>    sys   0m0.546s

Thanks, Szeder. You're right. This is the diff that I forgot to apply in 
the last commit:

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 28d043b..175b967 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -257,7 +257,7 @@ static int graph_write(int argc, const char **argv)

         has_existing = !!get_graph_head_hash(opts.pack_dir, &old_graph_hash);

-       if (opts.stdin_packs) {
+       if (opts.stdin_packs || opts.stdin_commits) {
                 struct strbuf buf = STRBUF_INIT;
                 nr_lines = 0;
                 alloc_lines = 128;


I'll work to create a test that ensures we are only adding commits 
reachable from specific commits to prevent this regression.

Thanks,
-Stolee

* Re: [PATCH v3 07/14] commit-graph: update graph-head during write
  2018-02-08 20:37   ` [PATCH v3 07/14] commit-graph: update graph-head during write Derrick Stolee
@ 2018-02-12 18:56     ` Junio C Hamano
  2018-02-12 20:37       ` Junio C Hamano
  2018-02-13 22:38     ` Jonathan Tan
  1 sibling, 1 reply; 146+ messages in thread
From: Junio C Hamano @ 2018-02-12 18:56 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, dstolee, git, peff, jonathantanmy, sbeller, szeder.dev

Derrick Stolee <stolee@gmail.com> writes:

> It is possible to have multiple commit graph files in a pack directory,
> but only one is important at a time. Use a 'graph_head' file to point
> to the important file. Teach git-commit-graph to write 'graph_head' upon
> writing a new commit graph file.

Why this design, instead of what "repack -a" would do, iow, if there
always is a singleton that is the only one that matters, shouldn't
the creation of that latest singleton just clear the older ones
before it returns control?

* Re: [PATCH v3 07/14] commit-graph: update graph-head during write
  2018-02-12 18:56     ` Junio C Hamano
@ 2018-02-12 20:37       ` Junio C Hamano
  2018-02-12 21:24         ` Derrick Stolee
  0 siblings, 1 reply; 146+ messages in thread
From: Junio C Hamano @ 2018-02-12 20:37 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, dstolee, git, peff, jonathantanmy, sbeller, szeder.dev

Junio C Hamano <gitster@pobox.com> writes:

> Derrick Stolee <stolee@gmail.com> writes:
>
>> It is possible to have multiple commit graph files in a pack directory,
>> but only one is important at a time. Use a 'graph_head' file to point
>> to the important file. Teach git-commit-graph to write 'graph_head' upon
>> writing a new commit graph file.
>
> Why this design, instead of what "repack -a" would do, iow, if there
> always is a singleton that is the only one that matters, shouldn't
> the creation of that latest singleton just clear the older ones
> before it returns control?

Note that I am not complaining---I am just curious why we want to
expose this "there is one relevant one but we keep irrelevant ones
we usually do not look at and need to be garbage collected" to end
users, and also expect readers of the series, resulting code and
docs would have the same puzzled feeling.



* Re: [PATCH v3 07/14] commit-graph: update graph-head during write
  2018-02-12 20:37       ` Junio C Hamano
@ 2018-02-12 21:24         ` Derrick Stolee
  0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-12 21:24 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, dstolee, git, peff, jonathantanmy, sbeller, szeder.dev

On 2/12/2018 3:37 PM, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
>> Derrick Stolee <stolee@gmail.com> writes:
>>
>>> It is possible to have multiple commit graph files in a pack directory,
>>> but only one is important at a time. Use a 'graph_head' file to point
>>> to the important file. Teach git-commit-graph to write 'graph_head' upon
>>> writing a new commit graph file.
>> Why this design, instead of what "repack -a" would do, iow, if there
>> always is a singleton that is the only one that matters, shouldn't
>> the creation of that latest singleton just clear the older ones
>> before it returns control?
> Note that I am not complaining---I am just curious why we want to
> expose this "there is one relevant one but we keep irrelevant ones
> we usually do not look at and need to be garbage collected" to end
> users, and also expect readers of the series, resulting code and
> docs would have the same puzzled feeling.
>

Aside: I forgot to mention in my cover letter that the experience around 
the "--delete-expired" flag for "git commit-graph write" is different 
than v2. If specified, we delete all ".graph" files in the pack 
directory other than the one referenced by "graph_head" at the beginning 
of the process or the one written by the process. If these deletes fail, 
then we ignore the failure (assuming that they are being used by another 
Git process). In usual cases, we will delete these expired files in the 
next instance. I believe this matches similar behavior in gc and repack.

-- Back to discussion about the value of "graph_head" --

The current design of using a pointer file (graph_head) is intended to 
have these benefits:

1. We do not need to rely on a directory listing and mtimes to determine 
which graph file to use.

2. If we write a new graph file while another git process is reading the 
existing graph file, we can update the graph_head pointer without 
deleting the file that is currently memory-mapped. (This is why we 
cannot just rely on a canonical file name, such as "the_graph", to store 
the data.)

3. We can atomically change the 'graph_head' file without interrupting 
concurrent git processes. I think this is different from the "repack" 
situation because a concurrent process would load all packfiles in the 
pack directory and possibly have open handles when the repack is trying 
to delete them.

4. We remain open to making the graph file incremental (as the MIDX 
feature is designed to do; see [1]). It is less crucial to have an 
incremental graph file structure (the graph file for the Windows 
repository is currently ~120MB versus a MIDX file of 1.25 GB), but the 
graph_head pattern makes this a possibility.
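
Benefits 2 and 3 rest on replacing the pointer file in a single step while
the old graph file stays on disk. A minimal sketch of that idea in C,
assuming a POSIX rename(); the helper names (update_graph_head,
read_graph_head) and the ".lock" suffix are hypothetical and are not taken
from the patch series:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: make head_path name a new graph file by writing
 * the hash to a temporary file and renaming it into place.
 * Returns 0 on success, -1 on failure.
 */
int update_graph_head(const char *head_path, const char *hex_hash)
{
	char tmp_path[4096];
	FILE *fp;
	int len;

	len = snprintf(tmp_path, sizeof(tmp_path), "%s.lock", head_path);
	if (len < 0 || (size_t)len >= sizeof(tmp_path))
		return -1;

	fp = fopen(tmp_path, "w");
	if (!fp)
		return -1;
	if (fprintf(fp, "%s\n", hex_hash) < 0) {
		fclose(fp);
		remove(tmp_path);
		return -1;
	}
	if (fclose(fp) != 0) {
		remove(tmp_path);
		return -1;
	}

	/*
	 * On POSIX systems rename() replaces the destination atomically,
	 * so a concurrent reader sees either the old or the new pointer.
	 * The graph file named by the old pointer is left untouched.
	 */
	if (rename(tmp_path, head_path) != 0) {
		remove(tmp_path);
		return -1;
	}
	return 0;
}

/* Hypothetical reader: copy the stored hash into out, without newline. */
int read_graph_head(const char *head_path, char *out, size_t out_len)
{
	FILE *fp = fopen(head_path, "r");

	if (!fp)
		return -1;
	if (!fgets(out, (int)out_len, fp)) {
		fclose(fp);
		return -1;
	}
	fclose(fp);
	out[strcspn(out, "\n")] = '\0';
	return 0;
}
```

A reader that already has the old graph file memory-mapped is unaffected by
the rename, because only the small pointer file changes.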

I tried to avoid the directory listing and mtime approach (item 1) out of
personal taste, and because I am storing the files in the objects/pack
directory, where the listing may be very large with a lot of wasted
entries. That concern is less important with our pending change of moving
the graph files to a different directory. A listing-based approach would
also satisfy items 2 and 3, as long as we never write graph files so
quickly that we have a collision on mtime.

I cannot think of another design that satisfies item 4.

As for your end user concerns: My philosophy with this feature is that 
end users will never interact with the commit-graph builtin. 99% of 
users will benefit from a repack or GC automatically computing a commit 
graph (when we add that integration point). The other uses for the 
builtin are for users who want extreme control over their data, such as 
code servers and build agents.

Perhaps someone with experience managing large repositories with git in 
a server environment could chime in with some design requirements they 
would need.

Thanks,
-Stolee

[1] 
https://public-inbox.org/git/20180107181459.222909-2-dstolee@microsoft.com/
     [RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes

* Re: [PATCH v3 05/14] commit-graph: implement 'git-commit-graph write'
  2018-02-08 20:37   ` [PATCH v3 05/14] commit-graph: implement 'git-commit-graph write' Derrick Stolee
@ 2018-02-13 21:57     ` Jonathan Tan
  0 siblings, 0 replies; 146+ messages in thread
From: Jonathan Tan @ 2018-02-13 21:57 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, dstolee, git, gitster, peff, sbeller, szeder.dev

On Thu,  8 Feb 2018 15:37:29 -0500
Derrick Stolee <stolee@gmail.com> wrote:

> +test_expect_success 'setup full repo' '
> +	rm -rf .git &&
> +	mkdir full &&
> +	cd full &&
> +	git init &&
> +	packdir=".git/objects/pack"'

Thanks for simplifying the repo generated in the test. One more style
nit: the final apostrophe goes onto its own line.

* Re: [PATCH v3 07/14] commit-graph: update graph-head during write
  2018-02-08 20:37   ` [PATCH v3 07/14] commit-graph: update graph-head during write Derrick Stolee
  2018-02-12 18:56     ` Junio C Hamano
@ 2018-02-13 22:38     ` Jonathan Tan
  1 sibling, 0 replies; 146+ messages in thread
From: Jonathan Tan @ 2018-02-13 22:38 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, dstolee, git, gitster, peff, sbeller, szeder.dev

On Thu,  8 Feb 2018 15:37:31 -0500
Derrick Stolee <stolee@gmail.com> wrote:

> It is possible to have multiple commit graph files in a pack directory,
> but only one is important at a time. Use a 'graph_head' file to point
> to the important file. Teach git-commit-graph to write 'graph_head' upon
> writing a new commit graph file.

You should probably include the rationale for a special "graph_head"
file that you describe here [1] in the commit message.

[1] https://public-inbox.org/git/99543db0-26e4-8daa-a580-b618497e48ba@gmail.com/

> +char *get_graph_head_filename(const char *pack_dir)
> +{
> +	struct strbuf fname = STRBUF_INIT;
> +	strbuf_addstr(&fname, pack_dir);
> +	strbuf_addstr(&fname, "/graph-head");
> +	return strbuf_detach(&fname, 0);

NULL, not 0.

> +}

* Re: [PATCH v3 08/14] commit-graph: implement 'git-commit-graph clear'
  2018-02-08 20:37   ` [PATCH v3 08/14] commit-graph: implement 'git-commit-graph clear' Derrick Stolee
@ 2018-02-13 22:49     ` Jonathan Tan
  0 siblings, 0 replies; 146+ messages in thread
From: Jonathan Tan @ 2018-02-13 22:49 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, dstolee, git, gitster, peff, sbeller, szeder.dev

On Thu,  8 Feb 2018 15:37:32 -0500
Derrick Stolee <stolee@gmail.com> wrote:

> Teach Git to delete the current 'graph_head' file and the commit graph
> it references. This is a good safety valve if somehow the file is
> corrupted and needs to be recalculated. Since the commit graph is a
> summary of contents already in the ODB, it can be regenerated.

Spelling of graph-head (hyphen, not underscore).

I'm not sure of the usefulness of this feature - if the graph is indeed
corrupt, the user can just be instructed to delete graph-head (not even
the commit graph it references, since when we create a new graph-head,
--delete-expired will take care of deleting the old one).

>  extern char *get_graph_head_filename(const char *pack_dir);
> +extern struct object_id *get_graph_head_hash(const char *pack_dir,
> +					     struct object_id *hash);
>  extern char* get_commit_graph_filename_hash(const char *pack_dir,
>  					    struct object_id *hash);

This file is starting to need documentation - in particular, the
difference between get_graph_head_hash() and
get_commit_graph_filename_hash() is not clear.

* Re: [PATCH v3 11/14] commit: integrate commit graph with commit parsing
  2018-02-08 20:37   ` [PATCH v3 11/14] commit: integrate commit graph with commit parsing Derrick Stolee
@ 2018-02-14  0:12     ` Jonathan Tan
  2018-02-14 18:08       ` Derrick Stolee
  2018-02-15 18:25     ` Junio C Hamano
  1 sibling, 1 reply; 146+ messages in thread
From: Jonathan Tan @ 2018-02-14  0:12 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, dstolee, git, gitster, peff, sbeller, szeder.dev

On Thu,  8 Feb 2018 15:37:35 -0500
Derrick Stolee <stolee@gmail.com> wrote:

> | Command                          | Before | After  | Rel % |
> |----------------------------------|--------|--------|-------|
> | log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
> | branch -vv                       |  0.42s |  0.27s | -35%  |
> | rev-list --all                   |  6.4s  |  1.0s  | -84%  |
> | rev-list --all --objects         | 32.6s  | 27.6s  | -15%  |

Could we have a performance test (in t/perf) demonstrating this?

> +static int check_commit_parents(struct commit *item, struct commit_graph *g,
> +				uint32_t pos, const unsigned char *commit_data)

Document what this function does? Also, this function probably needs a
better name.

> +/*
> + * Given a commit struct, try to fill the commit struct info, including:
> + *  1. tree object
> + *  2. date
> + *  3. parents.
> + *
> + * Returns 1 if and only if the commit was found in the commit graph.
> + *
> + * See parse_commit_buffer() for the fallback after this call.
> + */
> +int parse_commit_in_graph(struct commit *item)
> +{

The documentation above duplicates what's in the header file, so we can
probably omit it.

> +extern struct object_id *get_nth_commit_oid(struct commit_graph *g,
> +					    uint32_t n,
> +					    struct object_id *oid);

This doesn't seem to be used elsewhere - do you plan for a future patch
to use it?

* Re: [PATCH v3 11/14] commit: integrate commit graph with commit parsing
  2018-02-14  0:12     ` Jonathan Tan
@ 2018-02-14 18:08       ` Derrick Stolee
  0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-14 18:08 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, dstolee, git, gitster, peff, sbeller, szeder.dev

On 2/13/2018 7:12 PM, Jonathan Tan wrote:
> On Thu,  8 Feb 2018 15:37:35 -0500
> Derrick Stolee <stolee@gmail.com> wrote:
>
>> | Command                          | Before | After  | Rel % |
>> |----------------------------------|--------|--------|-------|
>> | log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
>> | branch -vv                       |  0.42s |  0.27s | -35%  |
>> | rev-list --all                   |  6.4s  |  1.0s  | -84%  |
>> | rev-list --all --objects         | 32.6s  | 27.6s  | -15%  |
> Could we have a performance test (in t/perf) demonstrating this?

The rev-list perf tests are found in t/perf/p0001-rev-list.sh

The "log --oneline --topo-order -1000" test would be good to add to 
t/perf/p4211-line-log.sh

The "branch -vv" test is pretty uninteresting unless you set up your 
repo to have local branches significantly behind the remote branches. It 
depends a lot more on the data shape than the others which only need a 
large number of reachable objects.

One reason I did not use the builtin perf test scripts is that they seem 
to ignore all local config options, and hence do not inherit the 
core.commitGraph=true setting from the repos pointed at by GIT_PERF_REPO.

>
>> +static int check_commit_parents(struct commit *item, struct commit_graph *g,
>> +				uint32_t pos, const unsigned char *commit_data)
> Document what this function does? Also, this function probably needs a
> better name.
>
>> +/*
>> + * Given a commit struct, try to fill the commit struct info, including:
>> + *  1. tree object
>> + *  2. date
>> + *  3. parents.
>> + *
>> + * Returns 1 if and only if the commit was found in the commit graph.
>> + *
>> + * See parse_commit_buffer() for the fallback after this call.
>> + */
>> +int parse_commit_in_graph(struct commit *item)
>> +{
> The documentation above duplicates what's in the header file, so we can
> probably omit it.
>
>> +extern struct object_id *get_nth_commit_oid(struct commit_graph *g,
>> +					    uint32_t n,
>> +					    struct object_id *oid);
> This doesn't seem to be used elsewhere - do you plan for a future patch
> to use it?


* Re: [PATCH v3 00/14] Serialized Git Commit Graph
  2018-02-08 20:37 ` [PATCH v3 " Derrick Stolee
                     ` (13 preceding siblings ...)
  2018-02-08 20:37   ` [PATCH v3 14/14] commit-graph: build graph from starting commits Derrick Stolee
@ 2018-02-14 18:15   ` Derrick Stolee
  2018-02-14 18:27     ` Stefan Beller
  2018-02-19 18:53     ` [PATCH v4 00/13] " Derrick Stolee
  14 siblings, 2 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-14 18:15 UTC (permalink / raw)
  To: git; +Cc: dstolee, git, gitster, peff, jonathantanmy, sbeller, szeder.dev

There has been a lot of interesting discussion on this topic. Some of 
that involves some decently significant changes from v3, so I wanted to 
summarize my understanding of the feedback and seek out more feedback 
from reviewers before rolling v4.

If we have consensus on these topics, then I'll re-roll on Friday, Feb 
16th. Please let me know if you are planning on reviewing v3 and need 
more time than that.


* Graph Storage:

     - Move the graph files to a different directory than the "pack" 
directory. Currently considering ".git/objects/info"

     - Change the "--pack-dir" command-line arguments to "--object-dir" 
arguments.

     - Keep a "graph_head" file, but expand on the reasons (as discussed 
[1]) in the commit message.

     - Adjust "graph_head" and the "--graph-id" argument to use a full 
filename (assuming based in {object-dir}/info/).

     - Remove "pack_dir" from struct commit_graph and 
load_commit_graph_one().

     - Drop "git commit-graph clear" subcommand.


* Graph format:

     - remove redundant hash type & length bytes in favor of a combined 
type/length enum byte.

     - emphasize the fact that the file can contain chunk ids unknown to 
Git and will be ignored on read. Also fix the read code to not die() on 
unknown chunk ids.

     - Don't write the large-edge chunk if it is going to be empty. 
Modify tests to verify this.


* Tests:

     - Format (last apostrophe on new line)

     - Bug check (--stdin-commits should limit by reachability)


* Other style fixes.
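
The "unknown chunk ids" item under "Graph format" could look roughly like
the sketch below. It assumes a chunk table whose entries are a 4-byte id
followed by an 8-byte big-endian offset; the authoritative layout is the
format document in patch 01. The function name, the struct, and the two
chunk ids are placeholders for illustration, not the series' code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_ENTRY_SIZE 12 /* assumed: 4-byte id + 8-byte offset */

/* Placeholder chunk ids, used here only for illustration. */
#define CHUNK_ID_FANOUT 0x4f494446u /* "OIDF" */
#define CHUNK_ID_LOOKUP 0x4f49444cu /* "OIDL" */

struct chunk_offsets {
	uint64_t fanout;
	uint64_t lookup;
	unsigned int nr_unknown;
};

static uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t get_be64(const unsigned char *p)
{
	return ((uint64_t)get_be32(p) << 32) | get_be32(p + 4);
}

/*
 * Walk the chunk table and record offsets of the chunks we understand.
 * Unknown ids are counted and skipped instead of treated as fatal.
 * Returns 0 if both required chunks were found, -1 otherwise.
 */
int read_chunk_table(const unsigned char *table, size_t nr_chunks,
		     struct chunk_offsets *out)
{
	size_t i;

	memset(out, 0, sizeof(*out));
	for (i = 0; i < nr_chunks; i++) {
		const unsigned char *entry = table + i * CHUNK_ENTRY_SIZE;
		uint32_t id = get_be32(entry);
		uint64_t offset = get_be64(entry + 4);

		switch (id) {
		case CHUNK_ID_FANOUT:
			out->fanout = offset;
			break;
		case CHUNK_ID_LOOKUP:
			out->lookup = offset;
			break;
		default:
			/* A newer writer may add chunks; ignore them. */
			out->nr_unknown++;
			break;
		}
	}
	return (out->fanout && out->lookup) ? 0 : -1;
}
```

Skipping in the default case is what lets an older reader open a file
written by a newer Git that added optional chunks.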


[1] 
https://public-inbox.org/git/99543db0-26e4-8daa-a580-b618497e48ba@gmail.com/

* Re: [PATCH v3 00/14] Serialized Git Commit Graph
  2018-02-14 18:15   ` [PATCH v3 00/14] Serialized Git Commit Graph Derrick Stolee
@ 2018-02-14 18:27     ` Stefan Beller
  2018-02-14 19:11       ` Derrick Stolee
  2018-02-19 18:53     ` [PATCH v4 00/13] " Derrick Stolee
  1 sibling, 1 reply; 146+ messages in thread
From: Stefan Beller @ 2018-02-14 18:27 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Derrick Stolee, Jeff Hostetler, Junio C Hamano, Jeff King,
	Jonathan Tan, SZEDER Gábor

On Wed, Feb 14, 2018 at 10:15 AM, Derrick Stolee <stolee@gmail.com> wrote:
> There has been a lot of interesting discussion on this topic. Some of that
> involves some decently significant changes from v3, so I wanted to summarize
> my understanding of the feedback and seek out more feedback from reviewers
> before rolling v4.
>
> If we have consensus on these topics, then I'll re-roll on Friday, Feb 16th.
> Please let me know if you are planning on reviewing v3 and need more time
> than that.
>
>
> * Graph Storage:
>
>     - Move the graph files to a different directory than the "pack"
> directory. Currently considering ".git/objects/info"

In my copy of git there is already a file

  $ cat .git/objects/info/packs
  P pack-8fdfd126aa8c2a868baf1f89788b07b79a4d365b.pack

which seems to be in line with the information provided in
'man gitrepository-layout':
    objects/info
           Additional information about the object store is
           recorded in this directory.

The commit graph files are not exactly "additional info about the
object store" but rather "about the objects". Close enough IMO.

Stefan

* Re: [PATCH v3 00/14] Serialized Git Commit Graph
  2018-02-14 18:27     ` Stefan Beller
@ 2018-02-14 19:11       ` Derrick Stolee
  0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-14 19:11 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git, Derrick Stolee, Jeff Hostetler, Junio C Hamano, Jeff King,
	Jonathan Tan, SZEDER Gábor



On 2/14/2018 1:27 PM, Stefan Beller wrote:
> On Wed, Feb 14, 2018 at 10:15 AM, Derrick Stolee <stolee@gmail.com> wrote:
>> There has been a lot of interesting discussion on this topic. Some of that
>> involves some decently significant changes from v3, so I wanted to summarize
>> my understanding of the feedback and seek out more feedback from reviewers
>> before rolling v4.
>>
>> If we have consensus on these topics, then I'll re-roll on Friday, Feb 16th.
>> Please let me know if you are planning on reviewing v3 and need more time
>> than that.
>>
>>
>> * Graph Storage:
>>
>>      - Move the graph files to a different directory than the "pack"
>> directory. Currently considering ".git/objects/info"
> In my copy of git there is already a file
>
>    $ cat .git/objects/info/packs
>    P pack-8fdfd126aa8c2a868baf1f89788b07b79a4d365b.pack
>
> which seems to be in line with the information provided in
> 'man gitrepository-layout':
>      objects/info
>             Additional information about the object store is
>             recorded in this directory.
>
> The commit graph files are not exactly "additional info about the
> object store" but rather "about the objects". Close enough IMO.
>
> Stefan

Thanks for the tip [1]. I was unfamiliar with it because it doesn't 
exist in repos that don't repack.

[1] 
https://git-scm.com/docs/gitrepository-layout/2.12.0#gitrepository-layout-objectsinfopacks

* Re: [PATCH v3 04/14] commit-graph: implement write_commit_graph()
  2018-02-08 20:37   ` [PATCH v3 04/14] commit-graph: implement write_commit_graph() Derrick Stolee
  2018-02-08 22:14     ` Junio C Hamano
@ 2018-02-15 18:19     ` Junio C Hamano
  2018-02-15 18:23       ` Derrick Stolee
  1 sibling, 1 reply; 146+ messages in thread
From: Junio C Hamano @ 2018-02-15 18:19 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, dstolee, git, peff, jonathantanmy, sbeller, szeder.dev

Derrick Stolee <stolee@gmail.com> writes:

> +struct packed_oid_list {
> +	struct object_id **list;
> +	int nr;
> +	int alloc;
> +};

What is the typical access pattern for this data structure?  If it
is pretty much "allocate and grow as we find more", then a dynamic
array of struct (rather than a dynamic array of pointers to struct)
would be a lot more appropriate.  IOW

	struct packed_oid_list {
		struct object_id *list;
		int nr, alloc;
	};

The version in the posted patch has to pay malloc overhead plus an
extra pointer for each object id in the list; unless you often
replace elements in the list randomly and/or you borrow object ID
field in other existing data structure whose lifetime is longer than
this list by pointing at it, I do not see how the extra indirection
is worth it.



* Re: [PATCH v3 04/14] commit-graph: implement write_commit_graph()
  2018-02-15 18:19     ` Junio C Hamano
@ 2018-02-15 18:23       ` Derrick Stolee
  0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-15 18:23 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, dstolee, git, peff, jonathantanmy, sbeller, szeder.dev

On 2/15/2018 1:19 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> +struct packed_oid_list {
>> +	struct object_id **list;
>> +	int nr;
>> +	int alloc;
>> +};
> What is the typical access pattern for this data structure?  If it
> is pretty much "allocate and grow as we find more", then a dynamic
> array of struct (rather than a dynamic array of pointers to struct)
> would be a lot more appropriate.  IOW
>
> 	struct packed_oid_list {
> 		struct object_id *list;
> 		int nr, alloc;
> 	};
>
> The version in the posted patch has to pay malloc overhead plus an
> extra pointer for each object id in the list; unless you often
> replace elements in the list randomly and/or you borrow object ID
> field in other existing data structure whose lifetime is longer than
> this list by pointing at it, I do not see how the extra indirection
> is worth it.
>

The pattern used in write_commit_graph() is to append OIDs to the list 
as we discover them and then sort in lexicographic order. The sort then 
only swaps pointers.

I can switch this to sort the 'struct object_id' elements themselves, if 
that is a better pattern.
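
For comparison, the append-then-sort pattern with an array of structs might
look like the sketch below. It is self-contained, so it uses a stand-in
"struct oid" with an assumed 20-byte hash instead of struct object_id, plain
realloc() instead of ALLOC_GROW, and size_t counters; all names are
hypothetical:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define HASH_LEN 20 /* assumed hash width for this sketch */

/* Stand-in for git's struct object_id. */
struct oid {
	unsigned char hash[HASH_LEN];
};

struct oid_list {
	struct oid *list; /* array of structs, not array of pointers */
	size_t nr, alloc;
};

/* Append a copy of oid, growing the array as needed. 0 on success. */
int oid_list_append(struct oid_list *l, const struct oid *oid)
{
	if (l->nr == l->alloc) {
		size_t new_alloc = l->alloc ? 2 * l->alloc : 128;
		struct oid *grown = realloc(l->list,
					    new_alloc * sizeof(*grown));
		if (!grown)
			return -1;
		l->list = grown;
		l->alloc = new_alloc;
	}
	l->list[l->nr++] = *oid;
	return 0;
}

/* Compare two elements by their raw hash bytes. */
static int oid_compare(const void *a, const void *b)
{
	const struct oid *x = a;
	const struct oid *y = b;

	return memcmp(x->hash, y->hash, HASH_LEN);
}

/* Sort in place; qsort() moves whole structs rather than pointers. */
void oid_list_sort(struct oid_list *l)
{
	if (l->nr)
		qsort(l->list, l->nr, sizeof(*l->list), oid_compare);
}
```

The sort now swaps HASH_LEN-byte elements instead of pointers, in exchange
for one allocation for the whole list and no per-element indirection.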

Thanks,
-Stolee

* Re: [PATCH v3 11/14] commit: integrate commit graph with commit parsing
  2018-02-08 20:37   ` [PATCH v3 11/14] commit: integrate commit graph with commit parsing Derrick Stolee
  2018-02-14  0:12     ` Jonathan Tan
@ 2018-02-15 18:25     ` Junio C Hamano
  1 sibling, 0 replies; 146+ messages in thread
From: Junio C Hamano @ 2018-02-15 18:25 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, dstolee, git, peff, jonathantanmy, sbeller, szeder.dev

Derrick Stolee <stolee@gmail.com> writes:

> +struct object_id *get_nth_commit_oid(struct commit_graph *g,
> +				     uint32_t n,
> +				     struct object_id *oid)
> +{
> +	hashcpy(oid->hash, g->chunk_oid_lookup + g->hash_len * n);
> +	return oid;
> +}

This looks like a rather klunky API to me.  

It seems that many current callers in this series (not limited to
this step but in later patches in the series) discard the returned
value.

I would understand the API a lot better if the function returned
"const struct object_id *" that points into the copy of the oid the
graph structure keeps (and the caller can do hashcpy() if it wants
to).

That would allow the API to later check for errors when the caller
gives 'n' that is too large by returning a NULL, for example.
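
A rough sketch of the shape I have in mind, using stand-in types so that it
compiles on its own; "struct oid", "struct graph", its fields and the
function name are illustrative only and do not match the structures in the
series:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HASH_LEN 20 /* assumed hash width for this sketch */

/* Stand-in for git's struct object_id. */
struct oid {
	unsigned char hash[HASH_LEN];
};

/* Stand-in for the graph: a sorted table of oids and its length. */
struct graph {
	const struct oid *oid_lookup;
	uint32_t num_commits;
};

/*
 * Return a pointer into the graph's own oid table, or NULL when n is
 * out of range. The caller copies the oid only if it needs its own.
 */
const struct oid *nth_commit_oid(const struct graph *g, uint32_t n)
{
	if (n >= g->num_commits)
		return NULL;
	return &g->oid_lookup[n];
}
```

With this shape a caller such as insert_parent_or_die() could pass the
returned pointer on directly and treat NULL as a corrupt-graph error.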

> +static struct commit_list **insert_parent_or_die(struct commit_graph *g,
> +					   int pos,
> +					   struct commit_list **pptr)
> +{
> +	struct commit *c;
> +	struct object_id oid;
> +	get_nth_commit_oid(g, pos, &oid);
> +	c = lookup_commit(&oid);

* [PATCH v4 00/13] Serialized Git Commit Graph
  2018-02-14 18:15   ` [PATCH v3 00/14] Serialized Git Commit Graph Derrick Stolee
  2018-02-14 18:27     ` Stefan Beller
@ 2018-02-19 18:53     ` Derrick Stolee
  2018-02-19 18:53       ` [PATCH v4 01/13] commit-graph: add format document Derrick Stolee
                         ` (13 more replies)
  1 sibling, 14 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-19 18:53 UTC (permalink / raw)
  To: git, git
  Cc: peff, jonathantanmy, szeder.dev, sbeller, gitster, Derrick Stolee

Thanks for all of the feedback. I've learned a lot working on this patch.

As discussed [0], this version changes several fundamental structures
and operations, including:

* Graph files are stored in .git/objects/info

* The "graph-head" file is now called "graph-latest" to avoid confusion
  with HEAD. We use "--set-latest" argument instead of "--update-head".

* The graph format no longer stores the hash width, but expects the hash
  version to imply a length. This frees up one byte for future use while
  preserving 4-byte alignment.

* The graph format is more explicit about optional chunks and no longer
  dies when seeing an unknown chunk. The large edges chunk is now optional.

* There is no longer a "clear" subcommand to git-commit-graph.

* Fixed the bug related to "--stdin-commits" and added a check that the
  command only includes commits reachable from the input OIDs.

* The struct packed_oid_list type is now an array of struct object_id
  instead of an array of pointers. In my testing, I saw no performance
  difference between these two options, so I switched to the simpler
  pattern.

* My patch is based on jt/binsearch-with-fanout (b4e00f730), with the
  newer method prototype since v3.

Thanks,
-Stolee

[0] https://public-inbox.org/git/1517348383-112294-1-git-send-email-dstolee@microsoft.com/T/#m22bfdb7cf7b3d6e5f380b8bf0eec957e2cfd2dd7

-- >8 --

As promised [1], this patch contains a way to serialize the commit graph.
The current implementation defines a new file format to store the graph
structure (parent relationships) and basic commit metadata (commit date,
root tree OID) in order to prevent parsing raw commits while performing
basic graph walks. For example, we do not need to parse the full commit
when performing these walks:

* 'git log --topo-order -1000' walks all reachable commits to avoid
  incorrect topological orders, but only needs the commit message for
  the top 1000 commits.

* 'git merge-base <A> <B>' may walk many commits to find the correct
  boundary between the commits reachable from A and those reachable
  from B. No commit messages are needed.

* 'git branch -vv' checks ahead/behind status for all local branches
  compared to their upstream remote branches. This is essentially as
  hard as computing merge bases for each.

The current patch speeds up these calculations by adding a check in
parse_commit_gently() for a graph file and, when one exists, using it
to provide the required metadata to the struct commit.

The file format has room to store generation numbers, which will be
provided as a patch after this framework is merged. Generation numbers
are referenced by the design document but not implemented in order to
make the current patch focus on the graph construction process. Once
that is stable, it will be easier to add generation numbers and make
graph walks aware of generation numbers one-by-one.

Here are some performance results for a copy of the Linux repository
where 'master' has 704,766 reachable commits and is behind 'origin/master'
by 19,610 commits.

| Command                          | Before | After  | Rel % |
|----------------------------------|--------|--------|-------|
| log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
| branch -vv                       |  0.42s |  0.27s | -35%  |
| rev-list --all                   |  6.4s  |  1.0s  | -84%  |
| rev-list --all --objects         | 32.6s  | 27.6s  | -15%  |

To test this yourself, run the following on your repo:

  git config core.commitGraph true
  git show-ref -s | git commit-graph write --set-latest --stdin-commits

The second command writes a commit graph file containing every commit
reachable from your refs. Now, all git commands that walk commits will
check your graph first before consulting the ODB. You can run your own
performance comparisons by toggling the 'core.commitGraph' setting.

[1] https://public-inbox.org/git/d154319e-bb9e-b300-7c37-27b1dcd2a2ce@jeffhostetler.com/
    Re: What's cooking in git.git (Jan 2018, #03; Tue, 23)

[2] https://github.com/derrickstolee/git/pull/2
    A GitHub pull request containing the latest version of this patch.

Derrick Stolee (13):
  commit-graph: add format document
  graph: add commit graph design document
  commit-graph: create git-commit-graph builtin
  commit-graph: implement write_commit_graph()
  commit-graph: implement 'git-commit-graph write'
  commit-graph: implement git commit-graph read
  commit-graph: implement --set-latest
  commit-graph: implement --delete-expired
  commit-graph: add core.commitGraph setting
  commit-graph: close under reachability
  commit: integrate commit graph with commit parsing
  commit-graph: read only from specific pack-indexes
  commit-graph: build graph from starting commits

 .gitignore                                      |   1 +
 Documentation/config.txt                        |   3 +
 Documentation/git-commit-graph.txt              | 105 ++++
 Documentation/technical/commit-graph-format.txt |  90 +++
 Documentation/technical/commit-graph.txt        | 185 ++++++
 Makefile                                        |   2 +
 alloc.c                                         |   1 +
 builtin.h                                       |   1 +
 builtin/commit-graph.c                          | 261 +++++++++
 cache.h                                         |   1 +
 command-list.txt                                |   1 +
 commit-graph.c                                  | 724 ++++++++++++++++++++++++
 commit-graph.h                                  |  49 ++
 commit.c                                        |   3 +
 commit.h                                        |   3 +
 config.c                                        |   5 +
 environment.c                                   |   1 +
 git.c                                           |   1 +
 log-tree.c                                      |   3 +-
 packfile.c                                      |   4 +-
 packfile.h                                      |   2 +
 t/t5318-commit-graph.sh                         | 244 ++++++++
 22 files changed, 1686 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/git-commit-graph.txt
 create mode 100644 Documentation/technical/commit-graph-format.txt
 create mode 100644 Documentation/technical/commit-graph.txt
 create mode 100644 builtin/commit-graph.c
 create mode 100644 commit-graph.c
 create mode 100644 commit-graph.h
 create mode 100755 t/t5318-commit-graph.sh

-- 
2.15.1.44.g453ed2b


* [PATCH v4 01/13] commit-graph: add format document
  2018-02-19 18:53     ` [PATCH v4 00/13] " Derrick Stolee
@ 2018-02-19 18:53       ` Derrick Stolee
  2018-02-20 20:49         ` Junio C Hamano
                           ` (2 more replies)
  2018-02-19 18:53       ` [PATCH v4 02/13] graph: add commit graph design document Derrick Stolee
                         ` (12 subsequent siblings)
  13 siblings, 3 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-19 18:53 UTC (permalink / raw)
  To: git, git
  Cc: peff, jonathantanmy, szeder.dev, sbeller, gitster, Derrick Stolee

Add document specifying the binary format for commit graphs. This
format allows for:

* New versions.
* New hash functions and hash lengths.
* Optional extensions.

Basic header information is followed by a binary table of contents
into "chunks" that include:

* An ordered list of commit object IDs.
* A 256-entry fanout into that list of OIDs.
* A list of metadata for the commits.
* A list of "large edges" to enable octopus merges.

The format automatically includes two parent positions for every
commit. This favors speed over space, since using only one position
per commit would cause an extra level of indirection for every merge
commit. (Octopus merges suffer from this indirection, but they are
very rare.)

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph-format.txt | 90 +++++++++++++++++++++++++
 1 file changed, 90 insertions(+)
 create mode 100644 Documentation/technical/commit-graph-format.txt
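
The OID Fanout and OID Lookup chunks described in the format document below
can be combined into a narrowed binary search. The following is a minimal
sketch, not code from this series: build_fanout() and oid_lookup() are
hypothetical names, the fanout is assumed to be already converted to host
byte order, and the OID list is assumed to be sorted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Fill fanout[b] with the number of OIDs whose first byte is at most b.
 * 'oids' holds 'nr' sorted OIDs of 'hash_len' bytes each, back to back.
 */
static void build_fanout(uint32_t fanout[256], const unsigned char *oids,
			 size_t nr, size_t hash_len)
{
	size_t i = 0;
	int b;

	assert(hash_len > 0);
	for (b = 0; b < 256; b++) {
		while (i < nr && oids[i * hash_len] == b)
			i++;
		fanout[b] = (uint32_t)i;
	}
}

/*
 * Binary search restricted to the OIDs sharing the first byte of 'oid'.
 * Returns 1 and stores the integer position in *pos when found;
 * returns 0 and stores the insertion position otherwise.
 */
static int oid_lookup(const uint32_t fanout[256], const unsigned char *oids,
		      size_t hash_len, const unsigned char *oid, uint32_t *pos)
{
	uint32_t lo = oid[0] ? fanout[oid[0] - 1] : 0;
	uint32_t hi = fanout[oid[0]];

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(oid, oids + (size_t)mid * hash_len, hash_len);

		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp > 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	*pos = lo;
	return 0;
}
```

The position returned by the lookup is the "int-id" that the Commit Data
chunk uses to refer to parents, so only the starting commits of a walk
need a search by OID.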

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
new file mode 100644
index 0000000..11b18b5
--- /dev/null
+++ b/Documentation/technical/commit-graph-format.txt
@@ -0,0 +1,90 @@
+Git commit graph format
+=======================
+
+The Git commit graph stores a list of commit OIDs and some associated
+metadata, including:
+
+- The generation number of the commit. Commits with no parents have
+  generation number 1; commits with parents have generation number
+  one more than the maximum generation number of their parents. We
+  reserve zero as special; it can be used to mark a generation
+  number invalid or as "not computed".
+
+- The root tree OID.
+
+- The commit date.
+
+- The parents of the commit, stored using positional references within
+  the graph file.
+
+== graph-*.graph files have the following format:
+
+In order to allow extensions that add extra data to the graph, we organize
+the body into "chunks" and provide a binary lookup table at the beginning
+of the body. The header includes certain values, such as number of chunks,
+hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+  4-byte signature:
+      The signature is: {'C', 'G', 'P', 'H'}
+
+  1-byte version number:
+      Currently, the only valid version is 1.
+
+  1-byte Object Id Version (1 = SHA-1)
+
+  1-byte number (C) of "chunks"
+
+  1-byte (reserved for later use)
+
+CHUNK LOOKUP:
+
+  (C + 1) * 12 bytes listing the table of contents for the chunks:
+      First 4 bytes describe chunk id. Value 0 is a terminating label.
+      Other 8 bytes provide offset in current file for chunk to start.
+      (Chunks are ordered contiguously in the file, so you can infer
+      the length using the next chunk position if necessary.)
+
+  The remaining data in the body is described one chunk at a time, and
+  these chunks may be given in any order. Chunks are required unless
+  otherwise specified.
+
+CHUNK DATA:
+
+  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+      The ith entry, F[i], stores the number of OIDs with first
+      byte at most i. Thus F[255] stores the total
+      number of commits (N).
+
+  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+      The OIDs for all commits in the graph, sorted in ascending order.
+
+  Commit Data (ID: {'C', 'D', 'A', 'T'}) (N * (H + 16) bytes)
+    * The first H bytes are for the OID of the root tree.
+    * The next 8 bytes are for the int-ids of the first two parents
+      of the ith commit. Stores value 0x70000000 if no parent in that
+      position. If there are more than two parents, the second value
+      has its most-significant bit on and the other bits store an array
+      position into the Large Edge List chunk.
+    * The next 8 bytes store the generation number of the commit and
+      the commit time in seconds since EPOCH. The generation number
+      uses the higher 30 bits of the first 4 bytes, while the commit
+      time uses the 32 bits of the second 4 bytes, along with the lowest
+      2 bits of the first 4 bytes, which store the 33rd and 34th bits of
+      the commit time.
+
+  Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
+      This list of 4-byte values stores the second through nth parents for
+      all octopus merges. The second parent value in the commit data stores
+      an array position within this list along with the most-significant bit
+      on. Starting at that array position, iterate through this list of int-ids
+      for the parents until reaching a value with the most-significant bit on.
+      The other bits correspond to the int-id of the last parent.
+
+TRAILER:
+
+	H-byte HASH-checksum of all of the above.
+
-- 
2.7.4
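
As an illustration of the Commit Data and Large Edge List layouts described
in the patch above, a reader could decode one entry roughly as follows. This
is a sketch under assumptions, not part of the series: the struct and
function names are hypothetical, the "no parent" value is taken from
GRAPH_PARENT_NONE in patch 04/13, and the caller is assumed to have located
the chunks already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PARENT_NONE      0x70000000u /* GRAPH_PARENT_NONE in patch 04/13 */
#define LARGE_EDGES_FLAG 0x80000000u /* most-significant bit */
#define EDGE_VALUE_MASK  0x7fffffffu

struct commit_entry {
	uint32_t parent1;    /* int-id, or PARENT_NONE */
	uint32_t parent2;    /* int-id, or PARENT_NONE (see has_large_edges) */
	int has_large_edges; /* parents 2..n live in the Large Edge List */
	uint32_t edge_pos;   /* array position in the Large Edge List */
	uint32_t generation; /* 0 means "not computed" */
	uint64_t date;       /* 34-bit commit time in seconds */
};

/* Read a 4-byte value stored in network (big-endian) order. */
static uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode one (hash_len + 16)-byte entry of the Commit Data chunk. */
static void decode_commit_entry(const unsigned char *entry, size_t hash_len,
				struct commit_entry *out)
{
	const unsigned char *p;
	uint32_t second, gen_word;

	assert(entry && out);
	p = entry + hash_len; /* skip the root tree OID */

	out->parent1 = get_be32(p);
	second = get_be32(p + 4);
	out->has_large_edges = (second & LARGE_EDGES_FLAG) != 0;
	out->edge_pos = out->has_large_edges ? (second & EDGE_VALUE_MASK) : 0;
	out->parent2 = out->has_large_edges ? PARENT_NONE : second;

	gen_word = get_be32(p + 8);
	out->generation = gen_word >> 2; /* upper 30 bits */
	out->date = ((uint64_t)(gen_word & 0x3) << 32) | get_be32(p + 12);
}

/*
 * Collect parents 2..n of an octopus merge, starting at array position
 * 'pos' and stopping at the value with the most-significant bit on.
 * 'nr_values' is the number of 4-byte values in the chunk.
 */
static size_t read_large_edges(const unsigned char *chunk, size_t nr_values,
			       uint32_t pos, uint32_t *parents, size_t max)
{
	size_t n = 0;
	size_t i;

	for (i = pos; i < nr_values && n < max; i++) {
		uint32_t v = get_be32(chunk + 4 * i);

		parents[n++] = v & EDGE_VALUE_MASK;
		if (v & LARGE_EDGES_FLAG)
			break;
	}
	return n;
}
```

For an ordinary commit or a two-parent merge, decode_commit_entry() yields
both parents directly. Only when has_large_edges is set does a reader need
the second call, which matches the remark in the commit message that
octopus merges pay for one extra level of indirection.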


* [PATCH v4 02/13] graph: add commit graph design document
  2018-02-19 18:53     ` [PATCH v4 00/13] " Derrick Stolee
  2018-02-19 18:53       ` [PATCH v4 01/13] commit-graph: add format document Derrick Stolee
@ 2018-02-19 18:53       ` Derrick Stolee
  2018-02-20 21:42         ` Junio C Hamano
  2018-02-21 19:34         ` Stefan Beller
  2018-02-19 18:53       ` [PATCH v4 03/13] commit-graph: create git-commit-graph builtin Derrick Stolee
                         ` (11 subsequent siblings)
  13 siblings, 2 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-19 18:53 UTC (permalink / raw)
  To: git, git
  Cc: peff, jonathantanmy, szeder.dev, sbeller, gitster, Derrick Stolee

Add Documentation/technical/commit-graph.txt with details of the planned
commit graph feature, including future plans.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 185 +++++++++++++++++++++++++++++++
 1 file changed, 185 insertions(+)
 create mode 100644 Documentation/technical/commit-graph.txt

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
new file mode 100644
index 0000000..e52ab23
--- /dev/null
+++ b/Documentation/technical/commit-graph.txt
@@ -0,0 +1,185 @@
+Git Commit Graph Design Notes
+=============================
+
+Git walks the commit graph for many reasons, including:
+
+1. Listing and filtering commit history.
+2. Computing merge bases.
+
+These operations can become slow as the commit count grows. The merge
+base calculation shows up in many user-facing commands, such as 'merge-base'
+or 'status', and can take minutes to compute depending on history shape.
+
+There are two main costs here:
+
+1. Decompressing and parsing commits.
+2. Walking the entire graph to avoid topological order mistakes.
+
+The commit graph file is a supplemental data structure that accelerates
+commit graph walks. If a user downgrades or disables the 'core.commitGraph'
+config setting, then the existing ODB is sufficient. The file is stored
+either in the .git/objects/info directory or in the info directory of an
+alternate.
+
+The commit graph file stores the commit graph structure along with some
+extra metadata to speed up graph walks. By listing commit OIDs in
+lexicographic order, we can identify an integer position for each commit
+and refer to the parents of a commit using those integer positions. We use
+binary search to find initial commits and then use the integer positions
+for fast lookups during the walk.
+
+A consumer may load the following info for a commit from the graph:
+
+1. The commit OID.
+2. The list of parents, along with their integer position.
+3. The commit date.
+4. The root tree OID.
+5. The generation number (see definition below).
+
+Values 1-4 satisfy the requirements of parse_commit_gently().
+
+Define the "generation number" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has generation number one.
+
+ * A commit with at least one parent has generation number one more than
+   the largest generation number among its parents.
+
+Equivalently, the generation number of a commit A is one more than the
+length of a longest path from A to a root commit. The recursive definition
+is easier to use for computation and for observing the following property:
+
+    If A and B are commits with generation numbers N and M, respectively,
+    and N <= M, then A cannot reach B. That is, we know without searching
+    that B is not an ancestor of A because it is at least as far from a
+    root commit as A.
+
+    Conversely, when checking if A is an ancestor of B, then we only need
+    to walk commits until all commits on the walk boundary have generation
+    number at most N. If we walk commits using a priority queue seeded by
+    generation numbers, then we always expand the boundary commit with highest
+    generation number and can easily detect the stopping condition.
+
+This property can be used to significantly reduce the time it takes to
+walk commits and determine topological relationships. Without generation
+numbers, the general heuristic is the following:
+
+    If A and B are commits with commit time X and Y, respectively, and
+    X < Y, then A _probably_ cannot reach B.
+
+This heuristic is currently used whenever the computation can make
+mistakes with topological orders (such as "git log" with default order),
+but is not used when the topological order is required (such as merge
+base calculations, "git log --graph").
+
+In practice, we expect some commits to be created recently and not stored
+in the commit graph. We can treat these commits as having "infinite"
+generation number and walk until reaching commits with known generation
+number.
+
+Design Details
+--------------
+
+- A graph file is stored in a file named 'graph-<hash>.graph' in the
+  .git/objects/info directory. This could be stored in the info directory
+  of an alternate.
+
+- The latest graph file name is stored in a 'graph-latest' file next to
+  the graph files. This allows atomic swaps of latest graph files without
+  race conditions with concurrent processes.
+
+- The core.commitGraph config setting must be on to consume graph files.
+
+- The file format includes parameters for the object ID hash function,
+  so a future change of hash algorithm does not require a change in format.
+
+Current Limitations
+-------------------
+
+- Only one graph file is used at one time. This allows the integer position
+  to seek into the single graph file. It is possible to extend the model
+  for multiple graph files, but that is currently not part of the design.
+
+- .graph files are managed only by the 'commit-graph' builtin. These are not
+  updated automatically during clone, fetch, repack, or creating new commits.
+
+- There is no 'verify' subcommand for the 'commit-graph' builtin to verify
+  the contents of the graph file agree with the contents in the ODB.
+
+- Generation numbers are not computed in the current version. The file
+  format supports storing them, along with a mechanism to upgrade from
+  a file without generation numbers to one that uses them.
+
+Future Work
+-----------
+
+- The file format includes room for precomputed generation numbers. These
+  are not currently computed, so all generation numbers will be marked as
+  0 (or "uncomputed"). A later patch will include this calculation.
+
+- The commit graph is currently incompatible with commit grafts. This can be
+  remedied by duplicating or refactoring the current graft logic.
+
+- After computing and storing generation numbers, we must make graph
+  walks aware of generation numbers to gain the performance benefits they
+  enable. This will mostly be accomplished by swapping a commit-date-ordered
+  priority queue with one ordered by generation number. The following
+  operations are important candidates:
+
+    - paint_down_to_common()
+    - 'log --topo-order'
+
+- The graph currently only adds commits to a previously existing graph.
+  When writing a new graph, we could check that the ODB still contains
+  the commits and choose to remove the commits that are deleted from the
+  ODB. For performance reasons, this check should remain optional.
+
+- Currently, parse_commit_gently() requires filling in the root tree
+  object for a commit. This passes through lookup_tree() and consequently
+  lookup_object(). Also, it calls lookup_commit() when loading the parents.
+  These method calls check the ODB for object existence, even if the
+  consumer does not need the content. For example, we do not need the
+  tree contents when computing merge bases. Now that commit parsing is
+  removed from the computation time, these lookup operations are the
+  slowest operations keeping graph walks from being fast. Consider
+  loading these objects without verifying their existence in the ODB and
+  only loading them fully when consumers need them. Consider a method
+  such as "ensure_tree_loaded(commit)" that fully loads a tree before
+  using commit->tree.
+
+- The current design uses the 'commit-graph' builtin to generate the graph.
+  When this feature stabilizes enough to recommend to most users, we should
+  add automatic graph writes to common operations that create many commits.
+  For example, one could compute a graph on 'clone', 'fetch', or 'repack'
+  commands.
+
+- A server could provide a commit graph file as part of the network protocol
+  to avoid extra calculations by clients.
+
+Related Links
+-------------
+[0] https://bugs.chromium.org/p/git/issues/detail?id=8
+    Chromium work item for: Serialized Commit Graph
+
+[1] https://public-inbox.org/git/20110713070517.GC18566@sigill.intra.peff.net/
+    An abandoned patch that introduced generation numbers.
+
+[2] https://public-inbox.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/
+    Discussion about generation numbers on commits and how they interact
+    with fsck.
+
+[3] https://public-inbox.org/git/20170907094718.b6kuzp2uhvkmwcso@sigill.intra.peff.net/t/#m7a2ea7b355aeda962e6b86404bcbadc648abfbba
+    More discussion about generation numbers and not storing them inside
+    commit objects. A valuable quote:
+
+    "I think we should be moving more in the direction of keeping
+     repo-local caches for optimizations. Reachability bitmaps have been
+     a big performance win. I think we should be doing the same with our
+     properties of commits. Not just generation numbers, but making it
+     cheap to access the graph structure without zlib-inflating whole
+     commit objects (i.e., packv4 or something like the "metapacks" I
+     proposed a few years ago)."
+
+[4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u
+    A patch to remove the ahead-behind calculation from 'status'.
+
-- 
2.7.4
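
The recursive definition of generation numbers in the design document above,
and the cutoff it enables, can be sketched on a tiny in-memory graph. This
is an illustration only: struct node, compute_generation() and can_reach()
are hypothetical names, each node is limited to two parents, and the walk is
a plain recursive depth-first search rather than the priority queue the
document describes. A real implementation would avoid recursion on deep
histories and would track visited commits.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PARENTS 2
#define NO_PARENT UINT32_MAX

struct node {
	uint32_t parents[MAX_PARENTS]; /* integer positions, or NO_PARENT */
	uint32_t generation;           /* 0 means "not computed" */
};

/*
 * A root commit has generation number 1; any other commit has one more
 * than the largest generation number among its parents.
 */
static uint32_t compute_generation(struct node *g, uint32_t pos)
{
	uint32_t max = 0;
	int i;

	assert(g);
	if (g[pos].generation)
		return g[pos].generation;

	for (i = 0; i < MAX_PARENTS; i++) {
		uint32_t p = g[pos].parents[i];

		if (p != NO_PARENT) {
			uint32_t gen = compute_generation(g, p);

			if (gen > max)
				max = gen;
		}
	}
	g[pos].generation = max + 1;
	return g[pos].generation;
}

/*
 * Return 1 if 'target' is reachable from 'from'. The generation check
 * prunes every branch that has already dropped to the level of 'target'.
 */
static int can_reach(struct node *g, uint32_t from, uint32_t target)
{
	int i;

	if (from == target)
		return 1;
	if (compute_generation(g, from) <= compute_generation(g, target))
		return 0; /* cannot reach a commit at the same or a higher level */

	for (i = 0; i < MAX_PARENTS; i++) {
		uint32_t p = g[from].parents[i];

		if (p != NO_PARENT && can_reach(g, p, target))
			return 1;
	}
	return 0;
}
```

With a root commit, two children of that root and a merge of the two
children, the children both get generation number 2, so asking whether one
child can reach the other is answered by the comparison alone, without
visiting any parent.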


* [PATCH v4 03/13] commit-graph: create git-commit-graph builtin
  2018-02-19 18:53     ` [PATCH v4 00/13] " Derrick Stolee
  2018-02-19 18:53       ` [PATCH v4 01/13] commit-graph: add format document Derrick Stolee
  2018-02-19 18:53       ` [PATCH v4 02/13] graph: add commit graph design document Derrick Stolee
@ 2018-02-19 18:53       ` Derrick Stolee
  2018-02-20 21:51         ` Junio C Hamano
  2018-02-26 16:25         ` SZEDER Gábor
  2018-02-19 18:53       ` [PATCH v4 04/13] commit-graph: implement write_commit_graph() Derrick Stolee
                         ` (10 subsequent siblings)
  13 siblings, 2 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-19 18:53 UTC (permalink / raw)
  To: git, git
  Cc: peff, jonathantanmy, szeder.dev, sbeller, gitster, Derrick Stolee

Teach git the 'commit-graph' builtin that will be used for writing and
reading commit graph files. The current implementation is mostly
empty, except for an '--object-dir' option.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 .gitignore                         |  1 +
 Documentation/git-commit-graph.txt | 11 +++++++++++
 Makefile                           |  1 +
 builtin.h                          |  1 +
 builtin/commit-graph.c             | 37 +++++++++++++++++++++++++++++++++++++
 command-list.txt                   |  1 +
 git.c                              |  1 +
 7 files changed, 53 insertions(+)
 create mode 100644 Documentation/git-commit-graph.txt
 create mode 100644 builtin/commit-graph.c

diff --git a/.gitignore b/.gitignore
index 833ef3b..e82f901 100644
--- a/.gitignore
+++ b/.gitignore
@@ -34,6 +34,7 @@
 /git-clone
 /git-column
 /git-commit
+/git-commit-graph
 /git-commit-tree
 /git-config
 /git-count-objects
diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
new file mode 100644
index 0000000..e1c3078
--- /dev/null
+++ b/Documentation/git-commit-graph.txt
@@ -0,0 +1,11 @@
+git-commit-graph(1)
+===================
+
+NAME
+----
+git-commit-graph - Write and verify Git commit graphs (.graph files)
+
+GIT
+---
+Part of the linkgit:git[1] suite
+
diff --git a/Makefile b/Makefile
index ee9d5eb..fc40b81 100644
--- a/Makefile
+++ b/Makefile
@@ -932,6 +932,7 @@ BUILTIN_OBJS += builtin/clone.o
 BUILTIN_OBJS += builtin/column.o
 BUILTIN_OBJS += builtin/commit-tree.o
 BUILTIN_OBJS += builtin/commit.o
+BUILTIN_OBJS += builtin/commit-graph.o
 BUILTIN_OBJS += builtin/config.o
 BUILTIN_OBJS += builtin/count-objects.o
 BUILTIN_OBJS += builtin/credential.o
diff --git a/builtin.h b/builtin.h
index 42378f3..079855b 100644
--- a/builtin.h
+++ b/builtin.h
@@ -149,6 +149,7 @@ extern int cmd_clone(int argc, const char **argv, const char *prefix);
 extern int cmd_clean(int argc, const char **argv, const char *prefix);
 extern int cmd_column(int argc, const char **argv, const char *prefix);
 extern int cmd_commit(int argc, const char **argv, const char *prefix);
+extern int cmd_commit_graph(int argc, const char **argv, const char *prefix);
 extern int cmd_commit_tree(int argc, const char **argv, const char *prefix);
 extern int cmd_config(int argc, const char **argv, const char *prefix);
 extern int cmd_count_objects(int argc, const char **argv, const char *prefix);
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
new file mode 100644
index 0000000..98110bb
--- /dev/null
+++ b/builtin/commit-graph.c
@@ -0,0 +1,37 @@
+#include "builtin.h"
+#include "config.h"
+#include "parse-options.h"
+
+static char const * const builtin_commit_graph_usage[] = {
+	N_("git commit-graph [--object-dir <objdir>]"),
+	NULL
+};
+
+static struct opts_commit_graph {
+	const char *obj_dir;
+} opts;
+
+
+int cmd_commit_graph(int argc, const char **argv, const char *prefix)
+{
+	static struct option builtin_commit_graph_options[] = {
+		{ OPTION_STRING, 'p', "object-dir", &opts.obj_dir,
+			N_("dir"),
+			N_("The object directory to store the graph") },
+		OPT_END(),
+	};
+
+	if (argc == 2 && !strcmp(argv[1], "-h"))
+		usage_with_options(builtin_commit_graph_usage,
+				   builtin_commit_graph_options);
+
+	git_config(git_default_config, NULL);
+	argc = parse_options(argc, argv, prefix,
+			     builtin_commit_graph_options,
+			     builtin_commit_graph_usage,
+			     PARSE_OPT_STOP_AT_NON_OPTION);
+
+	usage_with_options(builtin_commit_graph_usage,
+			   builtin_commit_graph_options);
+}
+
diff --git a/command-list.txt b/command-list.txt
index a1fad28..835c589 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -34,6 +34,7 @@ git-clean                               mainporcelain
 git-clone                               mainporcelain           init
 git-column                              purehelpers
 git-commit                              mainporcelain           history
+git-commit-graph                        plumbingmanipulators
 git-commit-tree                         plumbingmanipulators
 git-config                              ancillarymanipulators
 git-count-objects                       ancillaryinterrogators
diff --git a/git.c b/git.c
index 9e96dd4..d4832c1 100644
--- a/git.c
+++ b/git.c
@@ -388,6 +388,7 @@ static struct cmd_struct commands[] = {
 	{ "clone", cmd_clone },
 	{ "column", cmd_column, RUN_SETUP_GENTLY },
 	{ "commit", cmd_commit, RUN_SETUP | NEED_WORK_TREE },
+	{ "commit-graph", cmd_commit_graph, RUN_SETUP },
 	{ "commit-tree", cmd_commit_tree, RUN_SETUP },
 	{ "config", cmd_config, RUN_SETUP_GENTLY },
 	{ "count-objects", cmd_count_objects, RUN_SETUP },
-- 
2.7.4


* [PATCH v4 04/13] commit-graph: implement write_commit_graph()
  2018-02-19 18:53     ` [PATCH v4 00/13] " Derrick Stolee
                         ` (2 preceding siblings ...)
  2018-02-19 18:53       ` [PATCH v4 03/13] commit-graph: create git-commit-graph builtin Derrick Stolee
@ 2018-02-19 18:53       ` Derrick Stolee
  2018-02-20 22:57         ` Junio C Hamano
                           ` (2 more replies)
  2018-02-19 18:53       ` [PATCH v4 05/13] commit-graph: implement 'git-commit-graph write' Derrick Stolee
                         ` (9 subsequent siblings)
  13 siblings, 3 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-19 18:53 UTC (permalink / raw)
  To: git, git
  Cc: peff, jonathantanmy, szeder.dev, sbeller, gitster, Derrick Stolee

Teach Git to write a commit graph file by checking all packed objects
to see if they are commits, then store the file in the given object
directory.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Makefile       |   1 +
 commit-graph.c | 370 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 commit-graph.h |   7 ++
 3 files changed, 378 insertions(+)
 create mode 100644 commit-graph.c
 create mode 100644 commit-graph.h

diff --git a/Makefile b/Makefile
index fc40b81..eeaeb6a 100644
--- a/Makefile
+++ b/Makefile
@@ -761,6 +761,7 @@ LIB_OBJS += color.o
 LIB_OBJS += column.o
 LIB_OBJS += combine-diff.o
 LIB_OBJS += commit.o
+LIB_OBJS += commit-graph.o
 LIB_OBJS += compat/obstack.o
 LIB_OBJS += compat/terminal.o
 LIB_OBJS += config.o
diff --git a/commit-graph.c b/commit-graph.c
new file mode 100644
index 0000000..f9e39b0
--- /dev/null
+++ b/commit-graph.c
@@ -0,0 +1,370 @@
+#include "cache.h"
+#include "config.h"
+#include "git-compat-util.h"
+#include "pack.h"
+#include "packfile.h"
+#include "commit.h"
+#include "object.h"
+#include "revision.h"
+#include "sha1-lookup.h"
+#include "commit-graph.h"
+
+#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
+#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
+#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
+#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_LARGEEDGES 0x45444745 /* "EDGE" */
+
+#define GRAPH_DATA_WIDTH 36
+
+#define GRAPH_VERSION_1 0x1
+#define GRAPH_VERSION GRAPH_VERSION_1
+
+#define GRAPH_OID_VERSION_SHA1 1
+#define GRAPH_OID_LEN_SHA1 20
+#define GRAPH_OID_VERSION GRAPH_OID_VERSION_SHA1
+#define GRAPH_OID_LEN GRAPH_OID_LEN_SHA1
+
+#define GRAPH_LARGE_EDGES_NEEDED 0x80000000
+#define GRAPH_PARENT_MISSING 0x7fffffff
+#define GRAPH_EDGE_LAST_MASK 0x7fffffff
+#define GRAPH_PARENT_NONE 0x70000000
+
+#define GRAPH_LAST_EDGE 0x80000000
+
+#define GRAPH_FANOUT_SIZE (4 * 256)
+#define GRAPH_CHUNKLOOKUP_WIDTH 12
+#define GRAPH_CHUNKLOOKUP_SIZE (5 * GRAPH_CHUNKLOOKUP_WIDTH)
+#define GRAPH_MIN_SIZE (GRAPH_CHUNKLOOKUP_SIZE + GRAPH_FANOUT_SIZE + \
+			GRAPH_OID_LEN + 8)
+
+static void write_graph_chunk_fanout(struct sha1file *f,
+				     struct commit **commits,
+				     int nr_commits)
+{
+	uint32_t i, count = 0;
+	struct commit **list = commits;
+	struct commit **last = commits + nr_commits;
+
+	/*
+	 * Write the first-level table (the list is sorted,
+	 * but we use a 256-entry lookup to be able to avoid
+	 * having to do eight extra binary search iterations).
+	 */
+	for (i = 0; i < 256; i++) {
+		while (list < last) {
+			if ((*list)->object.oid.hash[0] != i)
+				break;
+			count++;
+			list++;
+		}
+
+		sha1write_be32(f, count);
+	}
+}
+
+static void write_graph_chunk_oids(struct sha1file *f, int hash_len,
+				   struct commit **commits, int nr_commits)
+{
+	struct commit **list, **last = commits + nr_commits;
+	for (list = commits; list < last; list++)
+		sha1write(f, (*list)->object.oid.hash, (int)hash_len);
+}
+
+static int commit_pos(struct commit **commits, int nr_commits,
+		      const struct object_id *oid, uint32_t *pos)
+{
+	uint32_t first = 0, last = nr_commits;
+
+	while (first < last) {
+		uint32_t mid = first + (last - first) / 2;
+		struct object_id *current;
+		int cmp;
+
+		current = &(commits[mid]->object.oid);
+		cmp = oidcmp(oid, current);
+		if (!cmp) {
+			*pos = mid;
+			return 1;
+		}
+		if (cmp > 0) {
+			first = mid + 1;
+			continue;
+		}
+		last = mid;
+	}
+
+	*pos = first;
+	return 0;
+}
+
+static void write_graph_chunk_data(struct sha1file *f, int hash_len,
+				   struct commit **commits, int nr_commits)
+{
+	struct commit **list = commits;
+	struct commit **last = commits + nr_commits;
+	uint32_t num_large_edges = 0;
+
+	while (list < last) {
+		struct commit_list *parent;
+		uint32_t int_id;
+		uint32_t packedDate[2];
+
+		parse_commit(*list);
+		sha1write(f, (*list)->tree->object.oid.hash, hash_len);
+
+		parent = (*list)->parents;
+
+		if (!parent)
+			int_id = GRAPH_PARENT_NONE;
+		else if (!commit_pos(commits, nr_commits,
+				     &(parent->item->object.oid), &int_id))
+			int_id = GRAPH_PARENT_MISSING;
+
+		sha1write_be32(f, int_id);
+
+		if (parent)
+			parent = parent->next;
+
+		if (!parent)
+			int_id = GRAPH_PARENT_NONE;
+		else if (parent->next)
+			int_id = GRAPH_LARGE_EDGES_NEEDED | num_large_edges;
+		else if (!commit_pos(commits, nr_commits,
+				    &(parent->item->object.oid), &int_id))
+			int_id = GRAPH_PARENT_MISSING;
+
+		sha1write_be32(f, int_id);
+
+		if (parent && parent->next) {
+			do {
+				num_large_edges++;
+				parent = parent->next;
+			} while (parent);
+		}
+
+		if (sizeof((*list)->date) > 4)
+			packedDate[0] = htonl(((*list)->date >> 32) & 0x3);
+		else
+			packedDate[0] = 0;
+
+		packedDate[1] = htonl((*list)->date);
+		sha1write(f, packedDate, 8);
+
+		list++;
+	}
+}
+
+static void write_graph_chunk_large_edges(struct sha1file *f,
+					  struct commit **commits,
+					  int nr_commits)
+{
+	struct commit **list = commits;
+	struct commit **last = commits + nr_commits;
+	struct commit_list *parent;
+
+	while (list < last) {
+		int num_parents = 0;
+		for (parent = (*list)->parents; num_parents < 3 && parent;
+		     parent = parent->next)
+			num_parents++;
+
+		if (num_parents <= 2) {
+			list++;
+			continue;
+		}
+
+		/* Since num_parents > 2, this initializer is safe. */
+		for (parent = (*list)->parents->next; parent; parent = parent->next) {
+			uint32_t int_id, swap_int_id;
+			uint32_t last_edge = 0;
+			if (!parent->next)
+				last_edge |= GRAPH_LAST_EDGE;
+
+			if (commit_pos(commits, nr_commits,
+				       &(parent->item->object.oid),
+				       &int_id))
+				swap_int_id = htonl(int_id | last_edge);
+			else
+				swap_int_id = htonl(GRAPH_PARENT_MISSING | last_edge);
+
+			sha1write(f, &swap_int_id, 4);
+		}
+
+		list++;
+	}
+}
+
+static int commit_compare(const void *_a, const void *_b)
+{
+	struct object_id *a = (struct object_id *)_a;
+	struct object_id *b = (struct object_id *)_b;
+	return oidcmp(a, b);
+}
+
+struct packed_commit_list {
+	struct commit **list;
+	int nr;
+	int alloc;
+};
+
+struct packed_oid_list {
+	struct object_id *list;
+	int nr;
+	int alloc;
+};
+
+static int if_packed_commit_add_to_list(const struct object_id *oid,
+					struct packed_git *pack,
+					uint32_t pos,
+					void *data)
+{
+	struct packed_oid_list *list = (struct packed_oid_list*)data;
+	enum object_type type;
+	unsigned long size;
+	void *inner_data;
+	off_t offset = nth_packed_object_offset(pack, pos);
+	inner_data = unpack_entry(pack, offset, &type, &size);
+
+	if (inner_data)
+		free(inner_data);
+
+	if (type != OBJ_COMMIT)
+		return 0;
+
+	ALLOC_GROW(list->list, list->nr + 1, list->alloc);
+	oidcpy(&(list->list[list->nr]), oid);
+	(list->nr)++;
+
+	return 0;
+}
+
+char *write_commit_graph(const char *obj_dir)
+{
+	struct packed_oid_list oids;
+	struct packed_commit_list commits;
+	struct sha1file *f;
+	int i, count_distinct = 0;
+	DIR *info_dir;
+	struct strbuf tmp_file = STRBUF_INIT;
+	struct strbuf graph_file = STRBUF_INIT;
+	unsigned char final_hash[GIT_MAX_RAWSZ];
+	char *graph_name;
+	int fd;
+	uint32_t chunk_ids[5];
+	uint64_t chunk_offsets[5];
+	int num_chunks;
+	int num_long_edges;
+	struct commit_list *parent;
+
+	oids.nr = 0;
+	oids.alloc = (int)(0.15 * approximate_object_count());
+
+	if (oids.alloc < 1024)
+		oids.alloc = 1024;
+	ALLOC_ARRAY(oids.list, oids.alloc);
+
+	for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
+
+	QSORT(oids.list, oids.nr, commit_compare);
+
+	count_distinct = 1;
+	for (i = 1; i < oids.nr; i++) {
+		if (oidcmp(&oids.list[i-1], &oids.list[i]))
+			count_distinct++;
+	}
+
+	commits.nr = 0;
+	commits.alloc = count_distinct;
+	ALLOC_ARRAY(commits.list, commits.alloc);
+
+	num_long_edges = 0;
+	for (i = 0; i < oids.nr; i++) {
+		int num_parents = 0;
+		if (i > 0 && !oidcmp(&oids.list[i-1], &oids.list[i]))
+			continue;
+
+		commits.list[commits.nr] = lookup_commit(&oids.list[i]);
+		parse_commit(commits.list[commits.nr]);
+
+		for (parent = commits.list[commits.nr]->parents;
+		     parent; parent = parent->next)
+			num_parents++;
+
+		if (num_parents > 2)
+			num_long_edges += num_parents - 1;
+
+		commits.nr++;
+	}
+	num_chunks = num_long_edges ? 4 : 3;
+
+	strbuf_addf(&tmp_file, "%s/info", obj_dir);
+	info_dir = opendir(tmp_file.buf);
+
+	if (!info_dir && mkdir(tmp_file.buf, 0777) < 0)
+		die_errno(_("cannot mkdir %s"), tmp_file.buf);
+	if (info_dir)
+		closedir(info_dir);
+
+	strbuf_addstr(&tmp_file, "/tmp_graph_XXXXXX");
+
+	fd = git_mkstemp_mode(tmp_file.buf, 0444);
+	if (fd < 0)
+		die_errno("unable to create '%s'", tmp_file.buf);
+
+	f = sha1fd(fd, tmp_file.buf);
+
+	sha1write_be32(f, GRAPH_SIGNATURE);
+
+	sha1write_u8(f, GRAPH_VERSION);
+	sha1write_u8(f, GRAPH_OID_VERSION);
+	sha1write_u8(f, num_chunks);
+	sha1write_u8(f, 0); /* unused padding byte */
+
+	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
+	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
+	chunk_ids[2] = GRAPH_CHUNKID_DATA;
+	if (num_long_edges)
+		chunk_ids[3] = GRAPH_CHUNKID_LARGEEDGES;
+	else
+		chunk_ids[3] = 0;
+	chunk_ids[4] = 0;
+
+	chunk_offsets[0] = 8 + GRAPH_CHUNKLOOKUP_SIZE;
+	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
+	chunk_offsets[2] = chunk_offsets[1] + GRAPH_OID_LEN * commits.nr;
+	chunk_offsets[3] = chunk_offsets[2] + (GRAPH_OID_LEN + 16) * commits.nr;
+	chunk_offsets[4] = chunk_offsets[3] + 4 * num_long_edges;
+
+	for (i = 0; i <= num_chunks; i++) {
+		uint32_t chunk_write[3];
+
+		chunk_write[0] = htonl(chunk_ids[i]);
+		chunk_write[1] = htonl(chunk_offsets[i] >> 32);
+		chunk_write[2] = htonl(chunk_offsets[i] & 0xffffffff);
+		sha1write(f, chunk_write, 12);
+	}
+
+	write_graph_chunk_fanout(f, commits.list, commits.nr);
+	write_graph_chunk_oids(f, GRAPH_OID_LEN, commits.list, commits.nr);
+	write_graph_chunk_data(f, GRAPH_OID_LEN, commits.list, commits.nr);
+	write_graph_chunk_large_edges(f, commits.list, commits.nr);
+
+	sha1close(f, final_hash, CSUM_CLOSE | CSUM_FSYNC);
+
+	strbuf_addf(&graph_file, "graph-%s.graph", sha1_to_hex(final_hash));
+	graph_name = strbuf_detach(&graph_file, NULL);
+	strbuf_addf(&graph_file, "%s/info/%s", obj_dir, graph_name);
+
+	if (rename(tmp_file.buf, graph_file.buf))
+		die("failed to rename %s to %s", tmp_file.buf, graph_file.buf);
+
+	strbuf_release(&tmp_file);
+	strbuf_release(&graph_file);
+	free(oids.list);
+	oids.alloc = 0;
+	oids.nr = 0;
+
+	return graph_name;
+}
+
diff --git a/commit-graph.h b/commit-graph.h
new file mode 100644
index 0000000..dc8c73a
--- /dev/null
+++ b/commit-graph.h
@@ -0,0 +1,7 @@
+#ifndef COMMIT_GRAPH_H
+#define COMMIT_GRAPH_H
+
+extern char *write_commit_graph(const char *obj_dir);
+
+#endif
+
-- 
2.7.4
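
As a reading aid for write_commit_graph() in the patch above, the following is a minimal sketch of how the large-edges accounting works. Every commit with more than two parents contributes (num_parents - 1) four-byte entries, and the optional fourth chunk is written only when that total is non-zero. The helper names sketch_count_long_edges() and sketch_count_chunks() are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: given the parent count of each commit, return the
 * number of 4-byte entries the large-edges chunk would need. This mirrors
 * the loop in write_commit_graph(): only commits with more than two
 * parents (octopus merges) contribute, and each contributes
 * (num_parents - 1) entries.
 */
size_t sketch_count_long_edges(const int *num_parents, size_t nr)
{
	size_t i, total = 0;

	assert(num_parents || nr == 0);
	for (i = 0; i < nr; i++)
		if (num_parents[i] > 2)
			total += (size_t)(num_parents[i] - 1);
	return total;
}

/*
 * Hypothetical helper: three chunks are always present (fanout, OID
 * lookup, commit data); the large-edges chunk is added only when needed.
 */
int sketch_count_chunks(size_t num_long_edges)
{
	return num_long_edges ? 4 : 3;
}
```

Under this rule a history that contains only ordinary two-parent merges produces a three-chunk file, and a single three-parent merge is enough to add the fourth chunk.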



* [PATCH v4 05/13] commit-graph: implement 'git-commit-graph write'
  2018-02-19 18:53     ` [PATCH v4 00/13] " Derrick Stolee
                         ` (3 preceding siblings ...)
  2018-02-19 18:53       ` [PATCH v4 04/13] commit-graph: implement write_commit_graph() Derrick Stolee
@ 2018-02-19 18:53       ` Derrick Stolee
  2018-02-21 19:25         ` Junio C Hamano
  2018-02-19 18:53       ` [PATCH v4 06/13] commit-graph: implement git commit-graph read Derrick Stolee
                         ` (8 subsequent siblings)
  13 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-19 18:53 UTC (permalink / raw)
  To: git, git
  Cc: peff, jonathantanmy, szeder.dev, sbeller, gitster, Derrick Stolee

Teach git-commit-graph to write graph files. Add a new test script to verify
that this command succeeds.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt |  40 +++++++++++++
 builtin/commit-graph.c             |  43 +++++++++++++-
 t/t5318-commit-graph.sh            | 119 +++++++++++++++++++++++++++++++++++++
 3 files changed, 201 insertions(+), 1 deletion(-)
 create mode 100755 t/t5318-commit-graph.sh

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index e1c3078..c3f222f 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -5,6 +5,46 @@ NAME
 ----
 git-commit-graph - Write and verify Git commit graphs (.graph files)
 
+
+SYNOPSIS
+--------
+[verse]
+'git commit-graph write' <options> [--object-dir <dir>]
+
+
+DESCRIPTION
+-----------
+
+Manage the serialized commit graph file.
+
+
+OPTIONS
+-------
+--object-dir::
+	Use given directory for the location of packfiles and graph files.
+	The graph files will be in <dir>/info and the packfiles are expected
+	to be in <dir>/pack.
+
+
+COMMANDS
+--------
+'write'::
+
+Write a commit graph file based on the commits found in packfiles.
+Includes all commits from the existing commit graph file. Outputs the
+resulting filename.
+
+
+EXAMPLES
+--------
+
+* Write a commit graph file for the packed commits in your local .git folder.
++
+------------------------------------------------
+$ git commit-graph write
+------------------------------------------------
+
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 98110bb..a51d87b 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -1,9 +1,18 @@
 #include "builtin.h"
 #include "config.h"
+#include "dir.h"
+#include "lockfile.h"
 #include "parse-options.h"
+#include "commit-graph.h"
 
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <objdir>]"),
+	N_("git commit-graph write [--object-dir <objdir>]"),
+	NULL
+};
+
+static const char * const builtin_commit_graph_write_usage[] = {
+	N_("git commit-graph write [--object-dir <objdir>]"),
 	NULL
 };
 
@@ -11,11 +20,38 @@ static struct opts_commit_graph {
 	const char *obj_dir;
 } opts;
 
+static int graph_write(int argc, const char **argv)
+{
+	char *graph_name;
+
+	static struct option builtin_commit_graph_write_options[] = {
+		{ OPTION_STRING, 'o', "object-dir", &opts.obj_dir,
+			N_("dir"),
+			N_("The object directory to store the graph") },
+		OPT_END(),
+	};
+
+	argc = parse_options(argc, argv, NULL,
+			     builtin_commit_graph_write_options,
+			     builtin_commit_graph_write_usage, 0);
+
+	if (!opts.obj_dir)
+		opts.obj_dir = get_object_directory();
+
+	graph_name = write_commit_graph(opts.obj_dir);
+
+	if (graph_name) {
+		printf("%s\n", graph_name);
+		FREE_AND_NULL(graph_name);
+	}
+
+	return 0;
+}
 
 int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 {
 	static struct option builtin_commit_graph_options[] = {
-		{ OPTION_STRING, 'p', "object-dir", &opts.obj_dir,
+		{ OPTION_STRING, 'o', "object-dir", &opts.obj_dir,
 			N_("dir"),
 			N_("The object directory to store the graph") },
 		OPT_END(),
@@ -31,6 +67,11 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 			     builtin_commit_graph_usage,
 			     PARSE_OPT_STOP_AT_NON_OPTION);
 
+	if (argc > 0) {
+		if (!strcmp(argv[0], "write"))
+			return graph_write(argc, argv);
+	}
+
 	usage_with_options(builtin_commit_graph_usage,
 			   builtin_commit_graph_options);
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
new file mode 100755
index 0000000..6a5e93c
--- /dev/null
+++ b/t/t5318-commit-graph.sh
@@ -0,0 +1,119 @@
+#!/bin/sh
+
+test_description='commit graph'
+. ./test-lib.sh
+
+test_expect_success 'setup full repo' '
+	rm -rf .git &&
+	mkdir full &&
+	cd full &&
+	git init &&
+	objdir=".git/objects"
+'
+
+test_expect_success 'write graph with no packs' '
+	git commit-graph write --object-dir .
+'
+
+test_expect_success 'create commits and repack' '
+	for i in $(test_seq 3)
+	do
+		test_commit $i &&
+		git branch commits/$i
+	done &&
+	git repack
+'
+
+test_expect_success 'write graph' '
+	graph1=$(git commit-graph write) &&
+	test_path_is_file $objdir/info/$graph1
+'
+
+test_expect_success 'Add more commits' '
+	git reset --hard commits/1 &&
+	for i in $(test_seq 4 5)
+	do
+		test_commit $i &&
+		git branch commits/$i
+	done &&
+	git reset --hard commits/2 &&
+	for i in $(test_seq 6 7)
+	do
+		test_commit $i &&
+		git branch commits/$i
+	done &&
+	git reset --hard commits/2 &&
+	git merge commits/4 &&
+	git branch merge/1 &&
+	git reset --hard commits/4 &&
+	git merge commits/6 &&
+	git branch merge/2 &&
+	git reset --hard commits/3 &&
+	git merge commits/5 commits/7 &&
+	git branch merge/3 &&
+	git repack
+'
+
+# Current graph structure:
+#
+#   __M3___
+#  /   |   \
+# 3 M1 5 M2 7
+# |/  \|/  \|
+# 2    4    6
+# |___/____/
+# 1
+
+
+test_expect_success 'write graph with merges' '
+	graph2=$(git commit-graph write)&&
+	test_path_is_file $objdir/info/$graph2
+'
+
+test_expect_success 'Add one more commit' '
+	test_commit 8 &&
+	git branch commits/8 &&
+	ls $objdir/pack | grep idx >existing-idx &&
+	git repack &&
+	ls $objdir/pack| grep idx | grep -v --file=existing-idx >new-idx
+'
+
+# Current graph structure:
+#
+#      8
+#      |
+#   __M3___
+#  /   |   \
+# 3 M1 5 M2 7
+# |/  \|/  \|
+# 2    4    6
+# |___/____/
+# 1
+
+test_expect_success 'write graph with new commit' '
+	graph3=$(git commit-graph write) &&
+	test_path_is_file $objdir/info/$graph3
+'
+
+test_expect_success 'write graph with nothing new' '
+	graph4=$(git commit-graph write) &&
+	test_path_is_file $objdir/info/$graph4 &&
+	printf $graph3 >expect &&
+	printf $graph4 >output &&
+	test_cmp expect output
+'
+
+test_expect_success 'setup bare repo' '
+	cd .. &&
+	git clone --bare --no-local full bare &&
+	cd bare &&
+	baredir="./objects"
+'
+
+test_expect_success 'write graph in bare repo' '
+	graphbare=$(git commit-graph write) &&
+	test_path_is_file $baredir/info/$graphbare
+'
+
+test_done
+
-- 
2.7.4



* [PATCH v4 06/13] commit-graph: implement git commit-graph read
  2018-02-19 18:53     ` [PATCH v4 00/13] " Derrick Stolee
                         ` (4 preceding siblings ...)
  2018-02-19 18:53       ` [PATCH v4 05/13] commit-graph: implement 'git-commit-graph write' Derrick Stolee
@ 2018-02-19 18:53       ` Derrick Stolee
  2018-02-21 20:11         ` Junio C Hamano
  2018-02-19 18:53       ` [PATCH v4 07/13] commit-graph: implement --set-latest Derrick Stolee
                         ` (7 subsequent siblings)
  13 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-19 18:53 UTC (permalink / raw)
  To: git, git
  Cc: peff, jonathantanmy, szeder.dev, sbeller, gitster, Derrick Stolee

Teach git-commit-graph to read commit graph files and summarize their contents.

Use the read subcommand to verify the contents of a commit graph file in the
tests.
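
For readers following along, here is a minimal sketch of the 8-byte header that the read subcommand summarizes: a 4-byte big-endian signature, then one byte each for graph version, hash version, chunk count and padding. The signature value 0x43475048 is taken from the expected test output in this patch; the struct and function names are hypothetical and are not part of commit-graph.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Signature value as it appears in the expected "header:" test output. */
#define SKETCH_GRAPH_SIGNATURE 0x43475048u

/* Hypothetical container for the decoded 8-byte header. */
struct sketch_graph_header {
	uint32_t signature;
	unsigned char graph_version;
	unsigned char hash_version;
	unsigned char num_chunks;
	unsigned char padding;
};

/* Read a big-endian 32-bit value byte by byte (no alignment assumptions). */
uint32_t sketch_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Decode the header from a raw buffer. Returns 0 on success, or -1 if the
 * buffer is shorter than 8 bytes or the signature does not match.
 */
int sketch_parse_header(const unsigned char *data, size_t len,
			struct sketch_graph_header *out)
{
	assert(data && out);
	if (len < 8)
		return -1;

	out->signature = sketch_get_be32(data);
	if (out->signature != SKETCH_GRAPH_SIGNATURE)
		return -1;

	out->graph_version = data[4];
	out->hash_version = data[5];
	out->num_chunks = data[6];
	out->padding = data[7];
	return 0;
}
```

Unlike load_commit_graph_one(), which calls die() on a mismatch, this sketch reports failure through its return value so that it can be exercised in isolation.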

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt |  15 +++++
 builtin/commit-graph.c             |  63 ++++++++++++++++++++
 commit-graph.c                     | 116 +++++++++++++++++++++++++++++++++++++
 commit-graph.h                     |  21 +++++++
 t/t5318-commit-graph.sh            |  38 ++++++++++--
 5 files changed, 249 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index c3f222f..6d26e56 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -9,6 +9,7 @@ git-commit-graph - Write and verify Git commit graphs (.graph files)
 SYNOPSIS
 --------
 [verse]
+'git commit-graph read' <options> [--object-dir <dir>]
 'git commit-graph write' <options> [--object-dir <dir>]
 
 
@@ -34,6 +35,14 @@ Write a commit graph file based on the commits found in packfiles.
 Includes all commits from the existing commit graph file. Outputs the
 resulting filename.
 
+'read'::
+
+Read a graph file given by the graph-head file and output basic
+details about the graph file.
++
+With `--file=<name>` option, consider the graph stored in the file at
+the path  <object-dir>/info/<name>.
+
 
 EXAMPLES
 --------
@@ -44,6 +53,12 @@ EXAMPLES
 $ git commit-graph write
 ------------------------------------------------
 
+* Read basic information from a graph file.
++
+------------------------------------------------
+$ git commit-graph read --file=<name>
+------------------------------------------------
+
 
 GIT
 ---
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index a51d87b..28cd097 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -7,10 +7,16 @@
 
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <objdir>]"),
+	N_("git commit-graph read [--object-dir <objdir>] [--file=<hash>]"),
 	N_("git commit-graph write [--object-dir <objdir>]"),
 	NULL
 };
 
+static const char * const builtin_commit_graph_read_usage[] = {
+	N_("git commit-graph read [--object-dir <objdir>] [--file=<hash>]"),
+	NULL
+};
+
 static const char * const builtin_commit_graph_write_usage[] = {
 	N_("git commit-graph write [--object-dir <objdir>]"),
 	NULL
@@ -18,8 +24,63 @@ static const char * const builtin_commit_graph_write_usage[] = {
 
 static struct opts_commit_graph {
 	const char *obj_dir;
+	const char *graph_file;
 } opts;
 
+static int graph_read(int argc, const char **argv)
+{
+	struct commit_graph *graph = 0;
+	struct strbuf full_path = STRBUF_INIT;
+
+	static struct option builtin_commit_graph_read_options[] = {
+		{ OPTION_STRING, 'o', "object-dir", &opts.obj_dir,
+			N_("dir"),
+			N_("The object directory to store the graph") },
+		{ OPTION_STRING, 'H', "file", &opts.graph_file,
+			N_("file"),
+			N_("The filename for a specific commit graph file in the object directory."),
+			PARSE_OPT_OPTARG, NULL, (intptr_t) "" },
+		OPT_END(),
+	};
+
+	argc = parse_options(argc, argv, NULL,
+			     builtin_commit_graph_read_options,
+			     builtin_commit_graph_read_usage, 0);
+
+	if (!opts.obj_dir)
+		opts.obj_dir = get_object_directory();
+
+	if (!opts.graph_file)
+		die("no graph hash specified");
+
+	strbuf_addf(&full_path, "%s/info/%s", opts.obj_dir, opts.graph_file);
+	graph = load_commit_graph_one(full_path.buf);
+
+	if (!graph)
+		die("graph file %s does not exist", full_path.buf);
+
+	printf("header: %08x %d %d %d %d\n",
+		ntohl(*(uint32_t*)graph->data),
+		*(unsigned char*)(graph->data + 4),
+		*(unsigned char*)(graph->data + 5),
+		*(unsigned char*)(graph->data + 6),
+		*(unsigned char*)(graph->data + 7));
+	printf("num_commits: %u\n", graph->num_commits);
+	printf("chunks:");
+
+	if (graph->chunk_oid_fanout)
+		printf(" oid_fanout");
+	if (graph->chunk_oid_lookup)
+		printf(" oid_lookup");
+	if (graph->chunk_commit_data)
+		printf(" commit_metadata");
+	if (graph->chunk_large_edges)
+		printf(" large_edges");
+	printf("\n");
+
+	return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
 	char *graph_name;
@@ -68,6 +129,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 			     PARSE_OPT_STOP_AT_NON_OPTION);
 
 	if (argc > 0) {
+		if (!strcmp(argv[0], "read"))
+			return graph_read(argc, argv);
 		if (!strcmp(argv[0], "write"))
 			return graph_write(argc, argv);
 	}
diff --git a/commit-graph.c b/commit-graph.c
index f9e39b0..2a8594f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,6 +38,122 @@
 #define GRAPH_MIN_SIZE (GRAPH_CHUNKLOOKUP_SIZE + GRAPH_FANOUT_SIZE + \
 			GRAPH_OID_LEN + 8)
 
+static struct commit_graph *alloc_commit_graph(void)
+{
+	struct commit_graph *g = xmalloc(sizeof(*g));
+	memset(g, 0, sizeof(*g));
+	g->graph_fd = -1;
+
+	return g;
+}
+
+struct commit_graph *load_commit_graph_one(const char *graph_file)
+{
+	void *graph_map;
+	const unsigned char *data, *chunk_lookup;
+	size_t graph_size;
+	struct stat st;
+	uint32_t i;
+	struct commit_graph *graph;
+	int fd = git_open(graph_file);
+	uint64_t last_chunk_offset;
+	uint32_t last_chunk_id;
+	uint32_t graph_signature;
+	unsigned char graph_version, hash_version;
+
+	if (fd < 0)
+		return 0;
+	if (fstat(fd, &st)) {
+		close(fd);
+		return 0;
+	}
+	graph_size = xsize_t(st.st_size);
+
+	if (graph_size < GRAPH_MIN_SIZE) {
+		close(fd);
+		die("graph file %s is too small", graph_file);
+	}
+	graph_map = xmmap(NULL, graph_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	data = (const unsigned char *)graph_map;
+
+	graph_signature = ntohl(*(uint32_t*)data);
+	if (graph_signature != GRAPH_SIGNATURE) {
+		munmap(graph_map, graph_size);
+		close(fd);
+		die("graph signature %X does not match signature %X",
+			graph_signature, GRAPH_SIGNATURE);
+	}
+
+	graph_version = *(unsigned char*)(data + 4);
+	if (graph_version != GRAPH_VERSION) {
+		munmap(graph_map, graph_size);
+		close(fd);
+		die("graph version %X does not match version %X",
+			graph_version, GRAPH_VERSION);
+	}
+
+	hash_version = *(unsigned char*)(data + 5);
+	if (hash_version != GRAPH_OID_VERSION) {
+		munmap(graph_map, graph_size);
+		close(fd);
+		die("hash version %X does not match version %X",
+			hash_version, GRAPH_OID_VERSION);
+	}
+
+	graph = alloc_commit_graph();
+
+	graph->hash_len = GRAPH_OID_LEN;
+	graph->num_chunks = *(unsigned char*)(data + 6);
+	graph->graph_fd = fd;
+	graph->data = graph_map;
+	graph->data_len = graph_size;
+
+	last_chunk_id = 0;
+	last_chunk_offset = 8;
+	chunk_lookup = data + 8;
+	for (i = 0; i < graph->num_chunks; i++) {
+		uint32_t chunk_id = get_be32(chunk_lookup + 0);
+		uint64_t chunk_offset1 = get_be32(chunk_lookup + 4);
+		uint32_t chunk_offset2 = get_be32(chunk_lookup + 8);
+		uint64_t chunk_offset = (chunk_offset1 << 32) | chunk_offset2;
+
+		chunk_lookup += GRAPH_CHUNKLOOKUP_WIDTH;
+
+		if (chunk_offset > graph_size - GIT_MAX_RAWSZ)
+			die("improper chunk offset %08x%08x", (uint32_t)(chunk_offset >> 32),
+			    (uint32_t)chunk_offset);
+
+		switch (chunk_id) {
+			case GRAPH_CHUNKID_OIDFANOUT:
+				graph->chunk_oid_fanout = (uint32_t*)(data + chunk_offset);
+				break;
+
+			case GRAPH_CHUNKID_OIDLOOKUP:
+				graph->chunk_oid_lookup = data + chunk_offset;
+				break;
+
+			case GRAPH_CHUNKID_DATA:
+				graph->chunk_commit_data = data + chunk_offset;
+				break;
+
+			case GRAPH_CHUNKID_LARGEEDGES:
+				graph->chunk_large_edges = data + chunk_offset;
+				break;
+		}
+
+		if (last_chunk_id == GRAPH_CHUNKID_OIDLOOKUP)
+		{
+			graph->num_commits = (chunk_offset - last_chunk_offset)
+					     / graph->hash_len;
+		}
+
+		last_chunk_id = chunk_id;
+		last_chunk_offset = chunk_offset;
+	}
+
+	return graph;
+}
+
 static void write_graph_chunk_fanout(struct sha1file *f,
 				     struct commit **commits,
 				     int nr_commits)
diff --git a/commit-graph.h b/commit-graph.h
index dc8c73a..9093b97 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -1,6 +1,27 @@
 #ifndef COMMIT_GRAPH_H
 #define COMMIT_GRAPH_H
 
+#include "git-compat-util.h"
+
+struct commit_graph {
+	int graph_fd;
+
+	const unsigned char *data;
+	size_t data_len;
+
+	unsigned char hash_len;
+	unsigned char num_chunks;
+	uint32_t num_commits;
+	struct object_id oid;
+
+	const uint32_t *chunk_oid_fanout;
+	const unsigned char *chunk_oid_lookup;
+	const unsigned char *chunk_commit_data;
+	const unsigned char *chunk_large_edges;
+};
+
+extern struct commit_graph *load_commit_graph_one(const char *graph_file);
+
 extern char *write_commit_graph(const char *obj_dir);
 
 #endif
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 6a5e93c..893fa24 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -24,9 +24,27 @@ test_expect_success 'create commits and repack' '
 	git repack
 '
 
+graph_read_expect() {
+	OPTIONAL=""
+	NUM_CHUNKS=3
+	if [ ! -z $2 ]
+	then
+		OPTIONAL=" $2"
+		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
+	fi
+	cat >expect <<- EOF
+	header: 43475048 1 1 $NUM_CHUNKS 0
+	num_commits: $1
+	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
+	EOF
+}
+
 test_expect_success 'write graph' '
 	graph1=$(git commit-graph write) &&
-	test_path_is_file $objdir/info/$graph1
+	test_path_is_file $objdir/info/$graph1 &&
+	git commit-graph read --file=$graph1 >output &&
+	graph_read_expect "3" &&
+	test_cmp expect output
 '
 
 test_expect_success 'Add more commits' '
@@ -67,7 +85,10 @@ test_expect_success 'Add more commits' '
 
 test_expect_success 'write graph with merges' '
 	graph2=$(git commit-graph write)&&
-	test_path_is_file $objdir/info/$graph2
+	test_path_is_file $objdir/info/$graph2 &&
+	git commit-graph read --file=$graph2 >output &&
+	graph_read_expect "10" "large_edges" &&
+	test_cmp expect output
 '
 
 test_expect_success 'Add one more commit' '
@@ -92,7 +113,10 @@ test_expect_success 'Add one more commit' '
 
 test_expect_success 'write graph with new commit' '
 	graph3=$(git commit-graph write) &&
-	test_path_is_file $objdir/info/$graph3
+	test_path_is_file $objdir/info/$graph3 &&
+	git commit-graph read --file=$graph3 >output &&
+	graph_read_expect "11" "large_edges" &&
+	test_cmp expect output
 '
 
 test_expect_success 'write graph with nothing new' '
@@ -100,6 +124,9 @@ test_expect_success 'write graph with nothing new' '
 	test_path_is_file $objdir/info/$graph4 &&
 	printf $graph3 >expect &&
 	printf $graph4 >output &&
+	test_cmp expect output &&
+	git commit-graph read --file=$graph4 >output &&
+	graph_read_expect "11" "large_edges" &&
 	test_cmp expect output
 '
 
@@ -112,7 +139,10 @@ test_expect_success 'setup bare repo' '
 
 test_expect_success 'write graph in bare repo' '
 	graphbare=$(git commit-graph write) &&
-	test_path_is_file $baredir/info/$graphbare
+	test_path_is_file $baredir/info/$graphbare &&
+	git commit-graph read --file=$graphbare >output &&
+	graph_read_expect "11" "large_edges" &&
+	test_cmp expect output
 '
 
 test_done
-- 
2.7.4
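
The chunk lookup loop in load_commit_graph_one() above reads 12-byte table entries: a 4-byte chunk id followed by an 8-byte offset, both big-endian. The sketch below shows that encoding and the way num_commits falls out of two adjacent offsets. All names are hypothetical, the chunk id in the usage is an arbitrary value, and the hash length is passed in as a parameter rather than assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded form of one 12-byte chunk table entry. */
struct sketch_chunk_entry {
	uint32_t id;
	uint64_t offset;
};

/* Store a 32-bit value in big-endian byte order. */
void sketch_put_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

/* Load a 32-bit value stored in big-endian byte order. */
uint32_t sketch_read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Write id, then the high and low halves of the 64-bit offset. */
void sketch_encode_chunk_entry(unsigned char *out, uint32_t id,
			       uint64_t offset)
{
	assert(out);
	sketch_put_be32(out, id);
	sketch_put_be32(out + 4, (uint32_t)(offset >> 32));
	sketch_put_be32(out + 8, (uint32_t)(offset & 0xffffffffu));
}

/* Inverse of the encoder: reassemble the offset from its two halves. */
void sketch_decode_chunk_entry(const unsigned char *in,
			       struct sketch_chunk_entry *e)
{
	assert(in && e);
	e->id = sketch_read_be32(in);
	e->offset = ((uint64_t)sketch_read_be32(in + 4) << 32) |
		    sketch_read_be32(in + 8);
}

/*
 * The OID lookup chunk has no explicit length, so its size is the
 * distance to the next chunk's offset, divided by the hash length.
 */
uint32_t sketch_num_commits(uint64_t lookup_offset, uint64_t next_offset,
			    unsigned int hash_len)
{
	assert(hash_len > 0 && next_offset >= lookup_offset);
	return (uint32_t)((next_offset - lookup_offset) / hash_len);
}
```

Reading each half with a byte-wise helper avoids the unaligned uint32_t loads that a direct pointer cast could produce on a mapped buffer.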



* [PATCH v4 07/13] commit-graph: implement --set-latest
  2018-02-19 18:53     ` [PATCH v4 00/13] " Derrick Stolee
                         ` (5 preceding siblings ...)
  2018-02-19 18:53       ` [PATCH v4 06/13] commit-graph: implement git commit-graph read Derrick Stolee
@ 2018-02-19 18:53       ` Derrick Stolee
  2018-02-22 18:31         ` Junio C Hamano
  2018-02-19 18:53       ` [PATCH v4 08/13] commit-graph: implement --delete-expired Derrick Stolee
                         ` (6 subsequent siblings)
  13 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-19 18:53 UTC (permalink / raw)
  To: git, git
  Cc: peff, jonathantanmy, szeder.dev, sbeller, gitster, Derrick Stolee

It is possible to have multiple commit graph files in a directory, but
only one is active at a time.

Use a 'graph-latest' file to point to the active file. Teach
git-commit-graph to write 'graph-latest' when given the "--set-latest"
option. Using this 'graph-latest' file is more robust than relying on
directory scanning and modified times.
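
To illustrate the pointer-file idea, here is a minimal sketch that stores a graph filename in a pointer file and reads it back. It imitates the lock-then-commit pattern by writing to "<path>.lock" and renaming it into place, but it is a simplification: it takes no exclusive lock and does not use the lockfile API that the patch uses. Both function names and the fixed path buffer size are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical writer: put graph_name (no trailing newline, as in the
 * patch) into latest_path by writing a sibling ".lock" file and renaming
 * it over the destination. Returns 0 on success, -1 on failure.
 */
int sketch_write_latest(const char *latest_path, const char *graph_name)
{
	char lock_path[4096];
	FILE *fp;
	size_t len;
	int n;

	assert(latest_path && graph_name);
	n = snprintf(lock_path, sizeof(lock_path), "%s.lock", latest_path);
	if (n < 0 || (size_t)n >= sizeof(lock_path))
		return -1;

	fp = fopen(lock_path, "w");
	if (!fp)
		return -1;

	len = strlen(graph_name);
	if (fwrite(graph_name, 1, len, fp) != len) {
		fclose(fp);
		remove(lock_path);
		return -1;
	}
	if (fclose(fp) != 0 || rename(lock_path, latest_path) != 0) {
		remove(lock_path);
		return -1;
	}
	return 0;
}

/*
 * Hypothetical reader: copy the stored filename into name (always
 * NUL-terminated). Returns 0 on success, -1 if the pointer file is
 * missing or empty.
 */
int sketch_read_latest(const char *latest_path, char *name, size_t name_len)
{
	FILE *fp;
	size_t n;

	assert(latest_path && name && name_len > 0);
	fp = fopen(latest_path, "r");
	if (!fp)
		return -1;

	n = fread(name, 1, name_len - 1, fp);
	fclose(fp);
	name[n] = '\0';
	return n > 0 ? 0 : -1;
}
```

A reader that opens the pointer file sees either the old name or the new one, never a partially written name, because the rename replaces the file in a single step on POSIX systems.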

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt | 10 ++++++++++
 builtin/commit-graph.c             | 26 ++++++++++++++++++++++++--
 commit-graph.c                     |  7 +++++++
 commit-graph.h                     |  2 ++
 t/t5318-commit-graph.sh            | 24 +++++++++++++++++++-----
 5 files changed, 62 insertions(+), 7 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 6d26e56..dc948c5 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -34,6 +34,9 @@ COMMANDS
 Write a commit graph file based on the commits found in packfiles.
 Includes all commits from the existing commit graph file. Outputs the
 resulting filename.
++
+With `--set-latest` option, update the graph-latest file to point
+to the written graph file.
 
 'read'::
 
@@ -53,6 +56,13 @@ EXAMPLES
 $ git commit-graph write
 ------------------------------------------------
 
+* Write a graph file for the packed commits in your local .git folder
+* and update graph-latest.
++
+------------------------------------------------
+$ git commit-graph write --set-latest
+------------------------------------------------
+
 * Read basic information from a graph file.
 +
 ------------------------------------------------
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 28cd097..bf86172 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -8,7 +8,7 @@
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <objdir>]"),
 	N_("git commit-graph read [--object-dir <objdir>] [--file=<hash>]"),
-	N_("git commit-graph write [--object-dir <objdir>]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--set-latest]"),
 	NULL
 };
 
@@ -18,13 +18,14 @@ static const char * const builtin_commit_graph_read_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--object-dir <objdir>]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--set-latest]"),
 	NULL
 };
 
 static struct opts_commit_graph {
 	const char *obj_dir;
 	const char *graph_file;
+	int set_latest;
 } opts;
 
 static int graph_read(int argc, const char **argv)
@@ -81,6 +82,22 @@ static int graph_read(int argc, const char **argv)
 	return 0;
 }
 
+static void set_latest_file(const char *obj_dir, const char *graph_file)
+{
+	int fd;
+	struct lock_file lk = LOCK_INIT;
+	char *latest_fname = get_graph_latest_filename(obj_dir);
+
+	fd = hold_lock_file_for_update(&lk, latest_fname, LOCK_DIE_ON_ERROR);
+	FREE_AND_NULL(latest_fname);
+
+	if (fd < 0)
+		die_errno("unable to open graph-latest");
+
+	write_in_full(fd, graph_file, strlen(graph_file));
+	commit_lock_file(&lk);
+}
+
 static int graph_write(int argc, const char **argv)
 {
 	char *graph_name;
@@ -89,6 +106,8 @@ static int graph_write(int argc, const char **argv)
 		{ OPTION_STRING, 'o', "object-dir", &opts.obj_dir,
 			N_("dir"),
 			N_("The object directory to store the graph") },
+		OPT_BOOL('u', "set-latest", &opts.set_latest,
+			N_("update graph-latest to written graph file")),
 		OPT_END(),
 	};
 
@@ -102,6 +121,9 @@ static int graph_write(int argc, const char **argv)
 	graph_name = write_commit_graph(opts.obj_dir);
 
 	if (graph_name) {
+		if (opts.set_latest)
+			set_latest_file(opts.obj_dir, graph_name);
+
 		printf("%s\n", graph_name);
 		FREE_AND_NULL(graph_name);
 	}
diff --git a/commit-graph.c b/commit-graph.c
index 2a8594f..5ee0805 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,6 +38,13 @@
 #define GRAPH_MIN_SIZE (GRAPH_CHUNKLOOKUP_SIZE + GRAPH_FANOUT_SIZE + \
 			GRAPH_OID_LEN + 8)
 
+char *get_graph_latest_filename(const char *obj_dir)
+{
+	struct strbuf fname = STRBUF_INIT;
+	strbuf_addf(&fname, "%s/info/graph-latest", obj_dir);
+	return strbuf_detach(&fname, 0);
+}
+
 static struct commit_graph *alloc_commit_graph(void)
 {
 	struct commit_graph *g = xmalloc(sizeof(*g));
diff --git a/commit-graph.h b/commit-graph.h
index 9093b97..ae24b3a 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -3,6 +3,8 @@
 
 #include "git-compat-util.h"
 
+extern char *get_graph_latest_filename(const char *obj_dir);
+
 struct commit_graph {
 	int graph_fd;
 
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 893fa24..cad9d90 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -12,7 +12,8 @@ test_expect_success 'setup full repo' '
 '
 
 test_expect_success 'write graph with no packs' '
-	git commit-graph write --object-dir .
+	git commit-graph write --object-dir . &&
+	test_path_is_missing info/graph-latest
 '
 
 test_expect_success 'create commits and repack' '
@@ -42,6 +43,7 @@ graph_read_expect() {
 test_expect_success 'write graph' '
 	graph1=$(git commit-graph write) &&
 	test_path_is_file $objdir/info/$graph1 &&
+	test_path_is_missing $objdir/info/graph-latest &&
 	git commit-graph read --file=$graph1 >output &&
 	graph_read_expect "3" &&
 	test_cmp expect output
@@ -84,8 +86,11 @@ test_expect_success 'Add more commits' '
 
 
 test_expect_success 'write graph with merges' '
-	graph2=$(git commit-graph write)&&
+	graph2=$(git commit-graph write --set-latest)&&
 	test_path_is_file $objdir/info/$graph2 &&
+	test_path_is_file $objdir/info/graph-latest &&
+	printf $graph2 >expect &&
+	test_cmp expect $objdir/info/graph-latest &&
 	git commit-graph read --file=$graph2 >output &&
 	graph_read_expect "10" "large_edges" &&
 	test_cmp expect output
@@ -112,19 +117,25 @@ test_expect_success 'Add one more commit' '
 # 1
 
 test_expect_success 'write graph with new commit' '
-	graph3=$(git commit-graph write) &&
+	graph3=$(git commit-graph write --set-latest) &&
 	test_path_is_file $objdir/info/$graph3 &&
+	test_path_is_file $objdir/info/graph-latest &&
+	printf $graph3 >expect &&
+	test_cmp expect $objdir/info/graph-latest &&
 	git commit-graph read --file=$graph3 >output &&
 	graph_read_expect "11" "large_edges" &&
 	test_cmp expect output
 '
 
 test_expect_success 'write graph with nothing new' '
-	graph4=$(git commit-graph write) &&
+	graph4=$(git commit-graph write --set-latest) &&
 	test_path_is_file $objdir/info/$graph4 &&
 	printf $graph3 >expect &&
 	printf $graph4 >output &&
 	test_cmp expect output &&
+	test_path_is_file $objdir/info/graph-latest &&
+	printf $graph4 >expect &&
+	test_cmp expect $objdir/info/graph-latest &&
 	git commit-graph read --file=$graph4 >output &&
 	graph_read_expect "11" "large_edges" &&
 	test_cmp expect output
@@ -138,8 +149,11 @@ test_expect_success 'setup bare repo' '
 '
 
 test_expect_success 'write graph in bare repo' '
-	graphbare=$(git commit-graph write) &&
+	graphbare=$(git commit-graph write --set-latest) &&
 	test_path_is_file $baredir/info/$graphbare &&
+	test_path_is_file $baredir/info/graph-latest &&
+	printf $graphbare >expect &&
+	test_cmp expect $baredir/info/graph-latest &&
 	git commit-graph read --file=$graphbare >output &&
 	graph_read_expect "11" "large_edges" &&
 	test_cmp expect output
-- 
2.7.4



* [PATCH v4 08/13] commit-graph: implement --delete-expired
  2018-02-19 18:53     ` [PATCH v4 00/13] " Derrick Stolee
                         ` (6 preceding siblings ...)
  2018-02-19 18:53       ` [PATCH v4 07/13] commit-graph: implement --set-latest Derrick Stolee
@ 2018-02-19 18:53       ` Derrick Stolee
  2018-02-21 21:34         ` Stefan Beller
  2018-02-22 18:48         ` Junio C Hamano
  2018-02-19 18:53       ` [PATCH v4 09/13] commit-graph: add core.commitGraph setting Derrick Stolee
                         ` (5 subsequent siblings)
  13 siblings, 2 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-19 18:53 UTC (permalink / raw)
  To: git, git
  Cc: peff, jonathantanmy, szeder.dev, sbeller, gitster, Derrick Stolee

Teach git-commit-graph to delete the .graph files that are siblings of a
newly-written graph file, except for the file referenced by 'graph-latest'
at the beginning of the process and the newly-written file. If we fail to
delete a graph file, only report a warning because another git process may
be using that file. In a multi-process environment, we expect the previous
graph file to be used by a concurrent process, so we do not delete it to
avoid race conditions.
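
The retention rule described above can be sketched as a small predicate: a directory entry is a deletion candidate only if it is named like "graph-<hash>.graph" and is neither the newly written file nor the file that 'graph-latest' referenced when the process started. The function names are hypothetical, and the sketch leaves out the directory scan and the unlink itself.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if name has the form "graph-<something>.graph", else 0. */
int sketch_is_graph_file(const char *name)
{
	static const char prefix[] = "graph-";
	static const char suffix[] = ".graph";
	size_t plen = sizeof(prefix) - 1;
	size_t slen = sizeof(suffix) - 1;
	size_t len;

	assert(name);
	len = strlen(name);
	if (len <= plen + slen)
		return 0; /* too short to hold a non-empty hash */
	return !strncmp(name, prefix, plen) &&
	       !strcmp(name + len - slen, suffix);
}

/*
 * Hypothetical policy check. old_latest may be NULL when no graph-latest
 * file existed at the start of the process.
 */
int sketch_should_delete(const char *name, const char *old_latest,
			 const char *new_graph)
{
	assert(name && new_graph);
	if (!sketch_is_graph_file(name))
		return 0; /* never touch graph-latest or unrelated files */
	if (!strcmp(name, new_graph))
		return 0; /* the file we just wrote */
	if (old_latest && !strcmp(name, old_latest))
		return 0; /* may still be in use by a concurrent process */
	return 1;
}
```

Keeping the previously referenced file for one more cycle means a concurrent reader that resolved 'graph-latest' just before the update can still open the file it was pointed at.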

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt | 11 +++++--
 builtin/commit-graph.c             | 61 ++++++++++++++++++++++++++++++++++++--
 commit-graph.c                     | 23 ++++++++++++++
 commit-graph.h                     |  1 +
 t/t5318-commit-graph.sh            |  7 +++--
 5 files changed, 96 insertions(+), 7 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index dc948c5..b9b4031 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -37,6 +37,11 @@ resulting filename.
 +
 With `--set-latest` option, update the graph-latest file to point
 to the written graph file.
++
+With the `--delete-expired` option, delete the graph files in the pack
+directory that are not referred to by the graph-latest file. To avoid race
+conditions, do not delete the file previously referred to by the
+graph-latest file if it is updated by the `--set-latest` option.
 
 'read'::
 
@@ -56,11 +61,11 @@ EXAMPLES
 $ git commit-graph write
 ------------------------------------------------
 
-* Write a graph file for the packed commits in your local .git folder
-* and update graph-latest.
+* Write a graph file for the packed commits in your local .git folder,
+* update graph-latest, and delete stale graph files.
 +
 ------------------------------------------------
-$ git commit-graph write --set-latest
+$ git commit-graph write --set-latest --delete-expired
 ------------------------------------------------
 
 * Read basic information from a graph file.
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index bf86172..fd99169 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -8,7 +8,7 @@
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <objdir>]"),
 	N_("git commit-graph read [--object-dir <objdir>] [--file=<hash>]"),
-	N_("git commit-graph write [--object-dir <objdir>] [--set-latest]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--set-latest] [--delete-expired]"),
 	NULL
 };
 
@@ -18,7 +18,7 @@ static const char * const builtin_commit_graph_read_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--object-dir <objdir>] [--set-latest]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--set-latest] [--delete-expired]"),
 	NULL
 };
 
@@ -26,6 +26,7 @@ static struct opts_commit_graph {
 	const char *obj_dir;
 	const char *graph_file;
 	int set_latest;
+	int delete_expired;
 } opts;
 
 static int graph_read(int argc, const char **argv)
@@ -98,9 +99,56 @@ static void set_latest_file(const char *obj_dir, const char *graph_file)
 	commit_lock_file(&lk);
 }
 
+/*
+ * To avoid race conditions and deleting graph files that are being
+ * used by other processes, look inside a pack directory for all files
+ * of the form "graph-<hash>.graph" that do not match the old or new
+ * graph hashes and delete them.
+ */
+static void do_delete_expired(const char *obj_dir,
+			      const char *old_graph_name,
+			      const char *new_graph_name)
+{
+	DIR *dir;
+	struct dirent *de;
+	int dirnamelen;
+	struct strbuf path = STRBUF_INIT;
+
+	strbuf_addf(&path, "%s/info", obj_dir);
+	dir = opendir(path.buf);
+	if (!dir) {
+		if (errno != ENOENT)
+			error_errno("unable to open object pack directory: %s",
+				    obj_dir);
+		return;
+	}
+
+	strbuf_addch(&path, '/');
+	dirnamelen = path.len;
+	while ((de = readdir(dir)) != NULL) {
+		size_t base_len;
+
+		if (is_dot_or_dotdot(de->d_name))
+			continue;
+
+		strbuf_setlen(&path, dirnamelen);
+		strbuf_addstr(&path, de->d_name);
+
+		base_len = path.len;
+		if (strip_suffix_mem(path.buf, &base_len, ".graph") &&
+		    strcmp(new_graph_name, de->d_name) &&
+		    (!old_graph_name || strcmp(old_graph_name, de->d_name)) &&
+		    remove_path(path.buf))
+			die("failed to remove path %s", path.buf);
+	}
+
+	strbuf_release(&path);
+}
+
 static int graph_write(int argc, const char **argv)
 {
 	char *graph_name;
+	char *old_graph_name;
 
 	static struct option builtin_commit_graph_write_options[] = {
 		{ OPTION_STRING, 'o', "object-dir", &opts.obj_dir,
@@ -108,6 +156,8 @@ static int graph_write(int argc, const char **argv)
 			N_("The object directory to store the graph") },
 		OPT_BOOL('u', "set-latest", &opts.set_latest,
 			N_("update graph-head to written graph file")),
+		OPT_BOOL('d', "delete-expired", &opts.delete_expired,
+			N_("delete expired head graph file")),
 		OPT_END(),
 	};
 
@@ -118,12 +168,19 @@ static int graph_write(int argc, const char **argv)
 	if (!opts.obj_dir)
 		opts.obj_dir = get_object_directory();
 
+	old_graph_name = get_graph_latest_contents(opts.obj_dir);
+
 	graph_name = write_commit_graph(opts.obj_dir);
 
 	if (graph_name) {
 		if (opts.set_latest)
 			set_latest_file(opts.obj_dir, graph_name);
 
+		if (opts.delete_expired)
+			do_delete_expired(opts.obj_dir,
+					  old_graph_name,
+					  graph_name);
+
 		printf("%s\n", graph_name);
 		FREE_AND_NULL(graph_name);
 	}
diff --git a/commit-graph.c b/commit-graph.c
index 5ee0805..c8fb38f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -45,6 +45,29 @@ char *get_graph_latest_filename(const char *obj_dir)
 	return strbuf_detach(&fname, 0);
 }
 
+char *get_graph_latest_contents(const char *obj_dir)
+{
+	struct strbuf graph_file = STRBUF_INIT;
+	char *fname;
+	FILE *f;
+	char buf[64];
+
+	fname = get_graph_latest_filename(obj_dir);
+	f = fopen(fname, "r");
+	FREE_AND_NULL(fname);
+
+	if (!f)
+		return NULL;
+
+	while (!feof(f)) {
+		if (fgets(buf, sizeof(buf), f))
+			strbuf_addstr(&graph_file, buf);
+	}
+
+	fclose(f);
+	return strbuf_detach(&graph_file, NULL);
+}
+
 static struct commit_graph *alloc_commit_graph(void)
 {
 	struct commit_graph *g = xmalloc(sizeof(*g));
diff --git a/commit-graph.h b/commit-graph.h
index ae24b3a..56215ad 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -4,6 +4,7 @@
 #include "git-compat-util.h"
 
 extern char *get_graph_latest_filename(const char *obj_dir);
+extern char *get_graph_latest_contents(const char *obj_dir);
 
 struct commit_graph {
 	int graph_fd;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index cad9d90..1d5ec7d 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -117,8 +117,10 @@ test_expect_success 'Add one more commit' '
 # 1
 
 test_expect_success 'write graph with new commit' '
-	graph3=$(git commit-graph write --set-latest) &&
+	graph3=$(git commit-graph write --set-latest --delete-expired) &&
 	test_path_is_file $objdir/info/$graph3 &&
+	test_path_is_file $objdir/info/$graph2 &&
+	test_path_is_missing $objdir/info/$graph1 &&
 	test_path_is_file $objdir/info/graph-latest &&
 	printf $graph3 >expect &&
 	test_cmp expect $objdir/info/graph-latest &&
@@ -128,8 +130,9 @@ test_expect_success 'write graph with new commit' '
 '
 
 test_expect_success 'write graph with nothing new' '
-	graph4=$(git commit-graph write --set-latest) &&
+	graph4=$(git commit-graph write --set-latest --delete-expired) &&
 	test_path_is_file $objdir/info/$graph4 &&
+	test_path_is_missing $objdir/info/$graph2 &&
 	printf $graph3 >expect &&
 	printf $graph4 >output &&
 	test_cmp expect output &&
-- 
2.7.4



* [PATCH v4 09/13] commit-graph: add core.commitGraph setting
  2018-02-19 18:53     ` [PATCH v4 00/13] " Derrick Stolee
                         ` (7 preceding siblings ...)
  2018-02-19 18:53       ` [PATCH v4 08/13] commit-graph: implement --delete-expired Derrick Stolee
@ 2018-02-19 18:53       ` Derrick Stolee
  2018-02-19 18:53       ` [PATCH v4 10/13] commit-graph: close under reachability Derrick Stolee
                         ` (4 subsequent siblings)
  13 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-19 18:53 UTC (permalink / raw)
  To: git, git
  Cc: peff, jonathantanmy, szeder.dev, sbeller, gitster, Derrick Stolee

The commit graph feature is controlled by the new core.commitGraph config
setting. This defaults to 0, so the feature is opt-in.

The intent is that setting core.commitGraph=0 always stops Git from
checking for or parsing commit graph files.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config.txt | 3 +++
 cache.h                  | 1 +
 config.c                 | 5 +++++
 environment.c            | 1 +
 4 files changed, 10 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 9593bfa..e90d0d1 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -883,6 +883,9 @@ core.notesRef::
 This setting defaults to "refs/notes/commits", and it can be overridden by
 the `GIT_NOTES_REF` environment variable.  See linkgit:git-notes[1].
 
+core.commitGraph::
+	Enable git commit graph feature. Allows reading from .graph files.
+
 core.sparseCheckout::
 	Enable "sparse checkout" feature. See section "Sparse checkout" in
 	linkgit:git-read-tree[1] for more information.
diff --git a/cache.h b/cache.h
index 6440e2b..1063873 100644
--- a/cache.h
+++ b/cache.h
@@ -771,6 +771,7 @@ extern char *git_replace_ref_base;
 
 extern int fsync_object_files;
 extern int core_preload_index;
+extern int core_commit_graph;
 extern int core_apply_sparse_checkout;
 extern int precomposed_unicode;
 extern int protect_hfs;
diff --git a/config.c b/config.c
index 41862d4..614cf59 100644
--- a/config.c
+++ b/config.c
@@ -1213,6 +1213,11 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.commitgraph")) {
+		core_commit_graph = git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.sparsecheckout")) {
 		core_apply_sparse_checkout = git_config_bool(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index 8289c25..81fed83 100644
--- a/environment.c
+++ b/environment.c
@@ -60,6 +60,7 @@ enum push_default_type push_default = PUSH_DEFAULT_UNSPECIFIED;
 enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE;
 char *notes_ref_name;
 int grafts_replace_parents = 1;
+int core_commit_graph;
 int core_apply_sparse_checkout;
 int merge_log_config = -1;
 int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
-- 
2.7.4



* [PATCH v4 10/13] commit-graph: close under reachability
  2018-02-19 18:53     ` [PATCH v4 00/13] " Derrick Stolee
                         ` (8 preceding siblings ...)
  2018-02-19 18:53       ` [PATCH v4 09/13] commit-graph: add core.commitGraph setting Derrick Stolee
@ 2018-02-19 18:53       ` Derrick Stolee
  2018-02-19 18:53       ` [PATCH v4 11/13] commit: integrate commit graph with commit parsing Derrick Stolee
                         ` (3 subsequent siblings)
  13 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-19 18:53 UTC (permalink / raw)
  To: git, git
  Cc: peff, jonathantanmy, szeder.dev, sbeller, gitster, Derrick Stolee

Teach write_commit_graph() to walk all parents from the commits
discovered in packfiles. This prevents gaps caused by loose objects or
previously-missed packfiles.

Also automatically add commits from the existing graph file, if it
exists.
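
As an illustration of what "closed under reachability" means, here is a
minimal sketch on a toy graph. It is not the implementation in this
patch, which uses Git's revision walk: struct toy_commit, NO_PARENT and
close_under_parents() are hypothetical names, commits are plain array
indexes, and the sketch skips commits that are already in the set.

```c
#include <stddef.h>

#define NO_PARENT (-1)

/* Toy commit: up to two parents, stored as indexes into the array. */
struct toy_commit {
	int parent[2];
};

/*
 * Mark every ancestor of the commits already flagged in in_set[].
 * `stack` is caller-provided scratch space with room for nr entries;
 * each commit is pushed at most once, so that is always enough.
 * Returns the number of commits newly added to the set.
 */
static size_t close_under_parents(const struct toy_commit *commits, size_t nr,
				  unsigned char *in_set, int *stack)
{
	size_t top = 0, added = 0, i;

	/* Start from every commit that is already in the set. */
	for (i = 0; i < nr; i++)
		if (in_set[i])
			stack[top++] = (int)i;

	while (top) {
		int cur = stack[--top];
		int k;

		for (k = 0; k < 2; k++) {
			int p = commits[cur].parent[k];

			if (p == NO_PARENT || in_set[p])
				continue;
			in_set[p] = 1;
			added++;
			stack[top++] = p;
		}
	}
	return added;
}
```

After the call, no commit in the set has a parent outside it, which is
the property the graph file needs.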

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index c8fb38f..00bd73a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -385,6 +385,28 @@ static int if_packed_commit_add_to_list(const struct object_id *oid,
 	return 0;
 }
 
+static void close_reachable(struct packed_oid_list *oids)
+{
+	int i;
+	struct rev_info revs;
+	struct commit *commit;
+	init_revisions(&revs, NULL);
+	for (i = 0; i < oids->nr; i++) {
+		commit = lookup_commit(&oids->list[i]);
+		if (commit && !parse_commit(commit))
+			revs.commits = commit_list_insert(commit, &revs.commits);
+	}
+
+	if (prepare_revision_walk(&revs))
+		die(_("revision walk setup failed"));
+
+	while ((commit = get_revision(&revs)) != NULL) {
+		ALLOC_GROW(oids->list, oids->nr + 1, oids->alloc);
+		oidcpy(&oids->list[oids->nr], &(commit->object.oid));
+		(oids->nr)++;
+	}
+}
+
 char *write_commit_graph(const char *obj_dir)
 {
 	struct packed_oid_list oids;
@@ -411,6 +433,7 @@ char *write_commit_graph(const char *obj_dir)
 	ALLOC_ARRAY(oids.list, oids.alloc);
 
 	for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
+	close_reachable(&oids);
 
 	QSORT(oids.list, oids.nr, commit_compare);
 
-- 
2.7.4



* [PATCH v4 11/13] commit: integrate commit graph with commit parsing
  2018-02-19 18:53     ` [PATCH v4 00/13] " Derrick Stolee
                         ` (9 preceding siblings ...)
  2018-02-19 18:53       ` [PATCH v4 10/13] commit-graph: close under reachability Derrick Stolee
@ 2018-02-19 18:53       ` Derrick Stolee
  2018-02-19 18:53       ` [PATCH v4 12/13] commit-graph: read only from specific pack-indexes Derrick Stolee
                         ` (2 subsequent siblings)
  13 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-19 18:53 UTC (permalink / raw)
  To: git, git
  Cc: peff, jonathantanmy, szeder.dev, sbeller, gitster, Derrick Stolee

Teach Git to inspect a commit graph file to supply the contents of a
struct commit when calling parse_commit_gently(). This implementation
satisfies all post-conditions on the struct commit, including loading
parents, the root tree, and the commit date. The only loosely-expected
condition is that the commit buffer is loaded into the cache. This
was checked in log-tree.c:show_log(), but the "return;" on failure
produced unexpected results (i.e. the message line was never terminated).
The new behavior of loading the buffer when needed prevents the
unexpected behavior.
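
To make the date loading concrete, here is a small standalone sketch of
the decoding step. It mirrors the arithmetic in fill_commit_in_graph()
below (two bits from one 32-bit word, all 32 bits from the next), but
read_be32() and decode_commit_date() are hypothetical names, and the
sketch reads the bytes by hand instead of casting and calling ntohl().

```c
#include <stdint.h>

/* Read a 32-bit big-endian (network order) integer from raw bytes. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Decode a commit date from the two 32-bit words that follow the two
 * parent positions in a commit data entry. Only the low two bits of
 * the first word contribute (as bits 32-33 of the date); whatever else
 * that word holds is masked off. The second word is the low 32 bits.
 */
static uint64_t decode_commit_date(const unsigned char *words)
{
	uint64_t high = read_be32(words) & 0x3;
	uint64_t low = read_be32(words + 4);

	return (high << 32) | low;
}
```

Reading byte by byte avoids both the alignment assumption and the
host-endianness dependency of a pointer cast.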

If core.commitGraph is false, then do not check graph files.

In test script t5318-commit-graph.sh, add output-matching conditions on
read-only graph operations.

By loading commits from the graph instead of parsing commit buffers, we
save a lot of time on long commit walks. Here are some performance
results for a copy of the Linux repository where 'master' has 704,766
reachable commits and is behind 'origin/master' by 19,610 commits.

| Command                          | Before | After  | Rel % |
|----------------------------------|--------|--------|-------|
| log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
| branch -vv                       |  0.42s |  0.27s | -35%  |
| rev-list --all                   |  6.4s  |  1.0s  | -84%  |
| rev-list --all --objects         | 32.6s  | 27.6s  | -15%  |

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 alloc.c                 |   1 +
 commit-graph.c          | 148 ++++++++++++++++++++++++++++++++++++++++++++++++
 commit-graph.h          |  18 +++++-
 commit.c                |   3 +
 commit.h                |   3 +
 log-tree.c              |   3 +-
 t/t5318-commit-graph.sh |  45 ++++++++++++++-
 7 files changed, 216 insertions(+), 5 deletions(-)

diff --git a/alloc.c b/alloc.c
index 12afadf..cf4f8b6 100644
--- a/alloc.c
+++ b/alloc.c
@@ -93,6 +93,7 @@ void *alloc_commit_node(void)
 	struct commit *c = alloc_node(&commit_state, sizeof(struct commit));
 	c->object.type = OBJ_COMMIT;
 	c->index = alloc_commit_index();
+	c->graph_pos = COMMIT_NOT_FROM_GRAPH;
 	return c;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 00bd73a..ea07b47 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,6 +38,9 @@
 #define GRAPH_MIN_SIZE (GRAPH_CHUNKLOOKUP_SIZE + GRAPH_FANOUT_SIZE + \
 			GRAPH_OID_LEN + 8)
 
+/* global storage */
+struct commit_graph *commit_graph = NULL;
+
 char *get_graph_latest_filename(const char *obj_dir)
 {
 	struct strbuf fname = STRBUF_INIT;
@@ -184,6 +187,150 @@ struct commit_graph *load_commit_graph_one(const char *graph_file)
 	return graph;
 }
 
+static void prepare_commit_graph_one(const char *obj_dir)
+{
+	struct strbuf graph_file = STRBUF_INIT;
+	char *graph_name;
+
+	if (commit_graph)
+		return;
+
+	graph_name = get_graph_latest_contents(obj_dir);
+
+	if (!graph_name)
+		return;
+
+	strbuf_addf(&graph_file, "%s/info/%s", obj_dir, graph_name);
+
+	commit_graph = load_commit_graph_one(graph_file.buf);
+
+	FREE_AND_NULL(graph_name);
+	strbuf_release(&graph_file);
+}
+
+static int prepare_commit_graph_run_once = 0;
+void prepare_commit_graph(void)
+{
+	struct alternate_object_database *alt;
+	char *obj_dir;
+
+	if (prepare_commit_graph_run_once)
+		return;
+	prepare_commit_graph_run_once = 1;
+
+	obj_dir = get_object_directory();
+	prepare_commit_graph_one(obj_dir);
+	prepare_alt_odb();
+	for (alt = alt_odb_list; !commit_graph && alt; alt = alt->next)
+		prepare_commit_graph_one(alt->path);
+}
+
+static void close_commit_graph(void)
+{
+	if (!commit_graph)
+		return;
+
+	if (commit_graph->graph_fd >= 0) {
+		munmap((void *)commit_graph->data, commit_graph->data_len);
+		commit_graph->data = NULL;
+		close(commit_graph->graph_fd);
+	}
+
+	FREE_AND_NULL(commit_graph);
+}
+
+static int bsearch_graph(struct commit_graph *g, struct object_id *oid, uint32_t *pos)
+{
+	return bsearch_hash(oid->hash, g->chunk_oid_fanout,
+			    g->chunk_oid_lookup, g->hash_len, pos);
+}
+
+static struct commit_list **insert_parent_or_die(struct commit_graph *g,
+						 uint64_t pos,
+						 struct commit_list **pptr)
+{
+	struct commit *c;
+	struct object_id oid;
+	hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos);
+	c = lookup_commit(&oid);
+	if (!c)
+		die("could not find commit %s", oid_to_hex(&oid));
+	c->graph_pos = pos;
+	return &commit_list_insert(c, pptr)->next;
+}
+
+static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos)
+{
+	struct object_id oid;
+	uint32_t new_parent_pos;
+	uint32_t *parent_data_ptr;
+	uint64_t date_low, date_high;
+	struct commit_list **pptr;
+	const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos;
+
+	item->object.parsed = 1;
+	item->graph_pos = pos;
+
+	hashcpy(oid.hash, commit_data);
+	item->tree = lookup_tree(&oid);
+
+	date_high = ntohl(*(uint32_t*)(commit_data + g->hash_len + 8)) & 0x3;
+	date_low = ntohl(*(uint32_t*)(commit_data + g->hash_len + 12));
+	item->date = (timestamp_t)((date_high << 32) | date_low);
+
+	pptr = &item->parents;
+
+	new_parent_pos = ntohl(*(uint32_t*)(commit_data + g->hash_len));
+	if (new_parent_pos == GRAPH_PARENT_NONE)
+		return 1;
+	pptr = insert_parent_or_die(g, new_parent_pos, pptr);
+
+	new_parent_pos = ntohl(*(uint32_t*)(commit_data + g->hash_len + 4));
+	if (new_parent_pos == GRAPH_PARENT_NONE)
+		return 1;
+	if (!(new_parent_pos & GRAPH_LARGE_EDGES_NEEDED)) {
+		pptr = insert_parent_or_die(g, new_parent_pos, pptr);
+		return 1;
+	}
+
+	parent_data_ptr = (uint32_t*)(g->chunk_large_edges +
+			  4 * (uint64_t)(new_parent_pos & GRAPH_EDGE_LAST_MASK));
+	do {
+		new_parent_pos = ntohl(*parent_data_ptr);
+		pptr = insert_parent_or_die(g,
+					    new_parent_pos & GRAPH_EDGE_LAST_MASK,
+					    pptr);
+		parent_data_ptr++;
+	} while (!(new_parent_pos & GRAPH_LAST_EDGE));
+
+	return 1;
+}
+
+int parse_commit_in_graph(struct commit *item)
+{
+	if (!core_commit_graph)
+		return 0;
+	if (item->object.parsed)
+		return 1;
+
+	prepare_commit_graph();
+	if (commit_graph) {
+		uint32_t pos;
+		int found;
+		if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
+			pos = item->graph_pos;
+			found = 1;
+		} else {
+			found = bsearch_graph(commit_graph, &(item->object.oid), &pos);
+		}
+
+		if (found)
+			return fill_commit_in_graph(item, commit_graph, pos);
+	}
+
+	return 0;
+}
+
 static void write_graph_chunk_fanout(struct sha1file *f,
 				     struct commit **commits,
 				     int nr_commits)
@@ -525,6 +672,7 @@ char *write_commit_graph(const char *obj_dir)
 	graph_name = strbuf_detach(&graph_file, NULL);
 	strbuf_addf(&graph_file, "%s/info/%s", obj_dir, graph_name);
 
+	close_commit_graph();
 	if (rename(tmp_file.buf, graph_file.buf))
 		die("failed to rename %s to %s", tmp_file.buf, graph_file.buf);
 
diff --git a/commit-graph.h b/commit-graph.h
index 56215ad..4818838 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -6,7 +6,19 @@
 extern char *get_graph_latest_filename(const char *obj_dir);
 extern char *get_graph_latest_contents(const char *obj_dir);
 
-struct commit_graph {
+/*
+ * Given a commit struct, try to fill the commit struct info, including:
+ *  1. tree object
+ *  2. date
+ *  3. parents.
+ *
+ * Returns 1 if and only if the commit was found in the commit graph.
+ *
+ * See parse_commit_buffer() for the fallback after this call.
+ */
+extern int parse_commit_in_graph(struct commit *item);
+
+extern struct commit_graph {
 	int graph_fd;
 
 	const unsigned char *data;
@@ -21,10 +33,12 @@ struct commit_graph {
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_commit_data;
 	const unsigned char *chunk_large_edges;
-};
+} *commit_graph;
 
 extern struct commit_graph *load_commit_graph_one(const char *graph_file);
 
+extern void prepare_commit_graph(void);
+
 extern char *write_commit_graph(const char *obj_dir);
 
 #endif
diff --git a/commit.c b/commit.c
index cab8d44..a8b464d 100644
--- a/commit.c
+++ b/commit.c
@@ -1,6 +1,7 @@
 #include "cache.h"
 #include "tag.h"
 #include "commit.h"
+#include "commit-graph.h"
 #include "pkt-line.h"
 #include "utf8.h"
 #include "diff.h"
@@ -385,6 +386,8 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
 		return -1;
 	if (item->object.parsed)
 		return 0;
+	if (parse_commit_in_graph(item))
+		return 0;
 	buffer = read_sha1_file(item->object.oid.hash, &type, &size);
 	if (!buffer)
 		return quiet_on_missing ? -1 :
diff --git a/commit.h b/commit.h
index 99a3fea..57963d8 100644
--- a/commit.h
+++ b/commit.h
@@ -8,6 +8,8 @@
 #include "gpg-interface.h"
 #include "string-list.h"
 
+#define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
+
 struct commit_list {
 	struct commit *item;
 	struct commit_list *next;
@@ -20,6 +22,7 @@ struct commit {
 	timestamp_t date;
 	struct commit_list *parents;
 	struct tree *tree;
+	uint32_t graph_pos;
 };
 
 extern int save_commit_buffer;
diff --git a/log-tree.c b/log-tree.c
index 580b3a9..14735d4 100644
--- a/log-tree.c
+++ b/log-tree.c
@@ -647,8 +647,7 @@ void show_log(struct rev_info *opt)
 		show_mergetag(opt, commit);
 	}
 
-	if (!get_cached_commit_buffer(commit, NULL))
-		return;
+	get_commit_buffer(commit, NULL);
 
 	if (opt->show_notes) {
 		int raw;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 1d5ec7d..8c6b510 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -8,6 +8,7 @@ test_expect_success 'setup full repo' '
 	mkdir full &&
 	cd full &&
 	git init &&
+	git config core.commitGraph true &&
 	objdir=".git/objects"
 '
 
@@ -25,6 +26,27 @@ test_expect_success 'create commits and repack' '
 	git repack
 '
 
+graph_git_two_modes() {
+	git -c core.commitGraph=true $1 >output &&
+	git -c core.commitGraph=false $1 >expect &&
+	test_cmp output expect
+}
+
+graph_git_behavior() {
+	MSG=$1
+	BRANCH=$2
+	COMPARE=$3
+	test_expect_success "check normal git operations: $MSG" '
+		graph_git_two_modes "log --oneline $BRANCH" &&
+		graph_git_two_modes "log --topo-order $BRANCH" &&
+		graph_git_two_modes "log --graph $COMPARE..$BRANCH" &&
+		graph_git_two_modes "branch -vv" &&
+		graph_git_two_modes "merge-base -a $BRANCH $COMPARE"
+	'
+}
+
+graph_git_behavior 'no graph' commits/3 commits/1
+
 graph_read_expect() {
 	OPTIONAL=""
 	NUM_CHUNKS=3
@@ -49,6 +71,8 @@ test_expect_success 'write graph' '
 	test_cmp expect output
 '
 
+graph_git_behavior 'graph exists, no head' commits/3 commits/1
+
 test_expect_success 'Add more commits' '
 	git reset --hard commits/1 &&
 	for i in $(test_seq 4 5)
@@ -84,7 +108,6 @@ test_expect_success 'Add more commits' '
 # |___/____/
 # 1
 
-
 test_expect_success 'write graph with merges' '
 	graph2=$(git commit-graph write --set-latest)&&
 	test_path_is_file $objdir/info/$graph2 &&
@@ -96,6 +119,10 @@ test_expect_success 'write graph with merges' '
 	test_cmp expect output
 '
 
+graph_git_behavior 'merge 1 vs 2' merge/1 merge/2
+graph_git_behavior 'merge 1 vs 3' merge/1 merge/3
+graph_git_behavior 'merge 2 vs 3' merge/2 merge/3
+
 test_expect_success 'Add one more commit' '
 	test_commit 8 &&
 	git branch commits/8 &&
@@ -116,6 +143,9 @@ test_expect_success 'Add one more commit' '
 # |___/____/
 # 1
 
+graph_git_behavior 'mixed mode, commit 8 vs merge 1' commits/8 merge/1
+graph_git_behavior 'mixed mode, commit 8 vs merge 2' commits/8 merge/2
+
 test_expect_success 'write graph with new commit' '
 	graph3=$(git commit-graph write --set-latest --delete-expired) &&
 	test_path_is_file $objdir/info/$graph3 &&
@@ -129,6 +159,9 @@ test_expect_success 'write graph with new commit' '
 	test_cmp expect output
 '
 
+graph_git_behavior 'full graph, commit 8 vs merge 1' commits/8 merge/1
+graph_git_behavior 'full graph, commit 8 vs merge 2' commits/8 merge/2
+
 test_expect_success 'write graph with nothing new' '
 	graph4=$(git commit-graph write --set-latest --delete-expired) &&
 	test_path_is_file $objdir/info/$graph4 &&
@@ -144,13 +177,20 @@ test_expect_success 'write graph with nothing new' '
 	test_cmp expect output
 '
 
+graph_git_behavior 'cleared graph, commit 8 vs merge 1' commits/8 merge/1
+graph_git_behavior 'cleared graph, commit 8 vs merge 2' commits/8 merge/2
+
 test_expect_success 'setup bare repo' '
 	cd .. &&
 	git clone --bare --no-local full bare &&
 	cd bare &&
+	git config core.commitGraph true &&
 	baredir="./objects"
 '
 
+graph_git_behavior 'bare repo, commit 8 vs merge 1' commits/8 merge/1
+graph_git_behavior 'bare repo, commit 8 vs merge 2' commits/8 merge/2
+
 test_expect_success 'write graph in bare repo' '
 	graphbare=$(git commit-graph write --set-latest) &&
 	test_path_is_file $baredir/info/$graphbare &&
@@ -162,5 +202,8 @@ test_expect_success 'write graph in bare repo' '
 	test_cmp expect output
 '
 
+graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' commits/8 merge/1
+graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' commits/8 merge/2
+
 test_done
 
-- 
2.7.4



* [PATCH v4 12/13] commit-graph: read only from specific pack-indexes
  2018-02-19 18:53     ` [PATCH v4 00/13] " Derrick Stolee
                         ` (10 preceding siblings ...)
  2018-02-19 18:53       ` [PATCH v4 11/13] commit: integrate commit graph with commit parsing Derrick Stolee
@ 2018-02-19 18:53       ` Derrick Stolee
  2018-02-21 22:25         ` Stefan Beller
  2018-02-19 18:53       ` [PATCH v4 13/13] commit-graph: build graph from starting commits Derrick Stolee
  2018-03-30 11:10       ` [PATCH v4 00/13] Serialized Git Commit Graph Jakub Narebski
  13 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-19 18:53 UTC (permalink / raw)
  To: git, git
  Cc: peff, jonathantanmy, szeder.dev, sbeller, gitster, Derrick Stolee

Teach git-commit-graph to inspect the objects only in a certain list
of pack-indexes within the given pack directory. This allows updating
the commit graph iteratively, since we add all commits stored in a
previous commit graph.
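
The input side of --stdin-packs amounts to collecting one name per line
into a growing array. The following is a rough sketch of that pattern
using only stdio and malloc; it is not the code in this patch, which uses
strbuf_getline() and ALLOC_GROW(). read_line() and read_pack_names() are
hypothetical names.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Read one line (without its trailing newline) into a fresh heap
 * buffer. Returns NULL at end of input or on allocation failure.
 */
static char *read_line(FILE *in)
{
	size_t len = 0, alloc = 64;
	char *line = malloc(alloc);
	int ch;

	if (!line)
		return NULL;
	while ((ch = fgetc(in)) != EOF && ch != '\n') {
		if (len + 1 >= alloc) {
			char *grown = realloc(line, alloc * 2);

			if (!grown) {
				free(line);
				return NULL;
			}
			line = grown;
			alloc *= 2;
		}
		line[len++] = (char)ch;
	}
	if (ch == EOF && !len) {
		free(line);
		return NULL;	/* nothing left to read */
	}
	line[len] = '\0';
	return line;
}

/*
 * Collect every line of `in` into a heap-allocated array of strings
 * and store the count in *nr. The caller owns the array and each
 * string. Stops early if memory runs out.
 */
static char **read_pack_names(FILE *in, size_t *nr)
{
	size_t alloc = 0;
	char **names = NULL;
	char *line;

	*nr = 0;
	while ((line = read_line(in)) != NULL) {
		if (*nr == alloc) {
			size_t new_alloc = alloc ? alloc * 2 : 16;
			char **grown = realloc(names, new_alloc * sizeof(*names));

			if (!grown) {
				free(line);
				break;
			}
			names = grown;
			alloc = new_alloc;
		}
		names[(*nr)++] = line;
	}
	return names;
}
```

Each entry gets its own allocation, so the array stays valid after the
read buffer is reused for the next line.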

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt | 11 +++++++++++
 builtin/commit-graph.c             | 32 +++++++++++++++++++++++++++++---
 commit-graph.c                     | 26 ++++++++++++++++++++++++--
 commit-graph.h                     |  4 +++-
 packfile.c                         |  4 ++--
 packfile.h                         |  2 ++
 t/t5318-commit-graph.sh            | 16 ++++++++++++++++
 7 files changed, 87 insertions(+), 8 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index b9b4031..93d50d1 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -42,6 +42,10 @@ With the `--delete-expired` option, delete the graph files in the pack
 directory that are not referred to by the graph-latest file. To avoid race
 conditions, do not delete the file previously referred to by the
 graph-latest file if it is updated by the `--set-latest` option.
++
+With the `--stdin-packs` option, generate the new commit graph by
+walking objects only in the specified packfiles and any commits in
+the graph file referenced by graph-latest.
 
 'read'::
 
@@ -68,6 +72,13 @@ $ git commit-graph write
 $ git commit-graph write --set-latest --delete-expired
 ------------------------------------------------
 
+* Write a graph file, extending the current graph file using commits
+* in <pack-index>, update graph-latest, and delete stale graph files.
++
+------------------------------------------------
+$ echo <pack-index> | git commit-graph write --set-latest --delete-expired --stdin-packs
+------------------------------------------------
+
 * Read basic information from a graph file.
 +
 ------------------------------------------------
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index fd99169..5f08c40 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -8,7 +8,7 @@
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <objdir>]"),
 	N_("git commit-graph read [--object-dir <objdir>] [--file=<hash>]"),
-	N_("git commit-graph write [--object-dir <objdir>] [--set-latest] [--delete-expired]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--set-latest] [--delete-expired] [--stdin-packs]"),
 	NULL
 };
 
@@ -18,7 +18,7 @@ static const char * const builtin_commit_graph_read_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--object-dir <objdir>] [--set-latest] [--delete-expired]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--set-latest] [--delete-expired] [--stdin-packs]"),
 	NULL
 };
 
@@ -27,6 +27,7 @@ static struct opts_commit_graph {
 	const char *graph_file;
 	int set_latest;
 	int delete_expired;
+	int stdin_packs;
 } opts;
 
 static int graph_read(int argc, const char **argv)
@@ -149,6 +150,11 @@ static int graph_write(int argc, const char **argv)
 {
 	char *graph_name;
 	char *old_graph_name;
+	const char **pack_indexes = NULL;
+	int nr_packs = 0;
+	const char **lines = NULL;
+	int nr_lines = 0;
+	int alloc_lines = 0;
 
 	static struct option builtin_commit_graph_write_options[] = {
 		{ OPTION_STRING, 'o', "object-dir", &opts.obj_dir,
@@ -158,6 +164,8 @@ static int graph_write(int argc, const char **argv)
 			N_("update graph-head to written graph file")),
 		OPT_BOOL('d', "delete-expired", &opts.delete_expired,
 			N_("delete expired head graph file")),
+		OPT_BOOL('s', "stdin-packs", &opts.stdin_packs,
+			N_("only scan packfiles listed by stdin")),
 		OPT_END(),
 	};
 
@@ -170,7 +178,25 @@ static int graph_write(int argc, const char **argv)
 
 	old_graph_name = get_graph_latest_contents(opts.obj_dir);
 
-	graph_name = write_commit_graph(opts.obj_dir);
+	if (opts.stdin_packs) {
+		struct strbuf buf = STRBUF_INIT;
+		nr_lines = 0;
+		alloc_lines = 128;
+		ALLOC_ARRAY(lines, alloc_lines);
+
+		while (strbuf_getline(&buf, stdin) != EOF) {
+			ALLOC_GROW(lines, nr_lines + 1, alloc_lines);
+			lines[nr_lines++] = buf.buf;
+			strbuf_detach(&buf, NULL);
+		}
+
+		pack_indexes = lines;
+		nr_packs = nr_lines;
+	}
+
+	graph_name = write_commit_graph(opts.obj_dir,
+					pack_indexes,
+					nr_packs);
 
 	if (graph_name) {
 		if (opts.set_latest)
diff --git a/commit-graph.c b/commit-graph.c
index ea07b47..943192c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -554,7 +554,9 @@ static void close_reachable(struct packed_oid_list *oids)
 	}
 }
 
-char *write_commit_graph(const char *obj_dir)
+char *write_commit_graph(const char *obj_dir,
+			 const char **pack_indexes,
+			 int nr_packs)
 {
 	struct packed_oid_list oids;
 	struct packed_commit_list commits;
@@ -579,7 +581,27 @@ char *write_commit_graph(const char *obj_dir)
 		oids.alloc = 1024;
 	ALLOC_ARRAY(oids.list, oids.alloc);
 
-	for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
+	if (pack_indexes) {
+		struct strbuf packname = STRBUF_INIT;
+		int dirlen;
+		strbuf_addf(&packname, "%s/pack/", obj_dir);
+		dirlen = packname.len;
+		for (i = 0; i < nr_packs; i++) {
+			struct packed_git *p;
+			strbuf_setlen(&packname, dirlen);
+			strbuf_addstr(&packname, pack_indexes[i]);
+			p = add_packed_git(packname.buf, packname.len, 1);
+			if (!p)
+				die("error adding pack %s", packname.buf);
+			if (open_pack_index(p))
+				die("error opening index for %s", packname.buf);
+			for_each_object_in_pack(p, if_packed_commit_add_to_list, &oids);
+			close_pack(p);
+		}
+	}
+	else
+		for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
+
 	close_reachable(&oids);
 
 	QSORT(oids.list, oids.nr, commit_compare);
diff --git a/commit-graph.h b/commit-graph.h
index 4818838..5617842 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -39,7 +39,9 @@ extern struct commit_graph *load_commit_graph_one(const char *graph_file);
 
 extern void prepare_commit_graph(void);
 
-extern char *write_commit_graph(const char *obj_dir);
+extern char *write_commit_graph(const char *obj_dir,
+				const char **pack_indexes,
+				int nr_packs);
 
 #endif
 
diff --git a/packfile.c b/packfile.c
index 59648a1..b9ad7b1 100644
--- a/packfile.c
+++ b/packfile.c
@@ -299,7 +299,7 @@ void close_pack_index(struct packed_git *p)
 	}
 }
 
-static void close_pack(struct packed_git *p)
+void close_pack(struct packed_git *p)
 {
 	close_pack_windows(p);
 	close_pack_fd(p);
@@ -1839,7 +1839,7 @@ int has_pack_index(const unsigned char *sha1)
 	return 1;
 }
 
-static int for_each_object_in_pack(struct packed_git *p, each_packed_object_fn cb, void *data)
+int for_each_object_in_pack(struct packed_git *p, each_packed_object_fn cb, void *data)
 {
 	uint32_t i;
 	int r = 0;
diff --git a/packfile.h b/packfile.h
index 0cdeb54..9281e90 100644
--- a/packfile.h
+++ b/packfile.h
@@ -61,6 +61,7 @@ extern void close_pack_index(struct packed_git *);
 
 extern unsigned char *use_pack(struct packed_git *, struct pack_window **, off_t, unsigned long *);
 extern void close_pack_windows(struct packed_git *);
+extern void close_pack(struct packed_git *);
 extern void close_all_packs(void);
 extern void unuse_pack(struct pack_window **);
 extern void clear_delta_base_cache(void);
@@ -133,6 +134,7 @@ typedef int each_packed_object_fn(const struct object_id *oid,
 				  struct packed_git *pack,
 				  uint32_t pos,
 				  void *data);
+extern int for_each_object_in_pack(struct packed_git *p, each_packed_object_fn, void *data);
 extern int for_each_packed_object(each_packed_object_fn, void *, unsigned flags);
 
 #endif
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 8c6b510..5bd1f77 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -180,6 +180,22 @@ test_expect_success 'write graph with nothing new' '
 graph_git_behavior 'cleared graph, commit 8 vs merge 1' commits/8 merge/1
 graph_git_behavior 'cleared graph, commit 8 vs merge 2' commits/8 merge/2
 
+test_expect_success 'build graph from latest pack with closure' '
+	rm $objdir/info/graph-latest &&
+	graph5=$(cat new-idx | git commit-graph write --set-latest --delete-expired --stdin-packs) &&
+	test_path_is_file $objdir/info/$graph5 &&
+	test_path_is_missing $objdir/info/$graph4 &&
+	test_path_is_file $objdir/info/graph-latest &&
+	printf $graph5 >expect &&
+	test_cmp expect $objdir/info/graph-latest &&
+	git commit-graph read --file=$graph5 >output &&
+	graph_read_expect "9" "large_edges" &&
+	test_cmp expect output
+'
+
+graph_git_behavior 'graph from pack, commit 8 vs merge 1' commits/8 merge/1
+graph_git_behavior 'graph from pack, commit 8 vs merge 2' commits/8 merge/2
+
 test_expect_success 'setup bare repo' '
 	cd .. &&
 	git clone --bare --no-local full bare &&
-- 
2.7.4



* [PATCH v4 13/13] commit-graph: build graph from starting commits
  2018-02-19 18:53     ` [PATCH v4 00/13] " Derrick Stolee
                         ` (11 preceding siblings ...)
  2018-02-19 18:53       ` [PATCH v4 12/13] commit-graph: read only from specific pack-indexes Derrick Stolee
@ 2018-02-19 18:53       ` Derrick Stolee
  2018-03-30 11:10       ` [PATCH v4 00/13] Serialized Git Commit Graph Jakub Narebski
  13 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-19 18:53 UTC (permalink / raw)
  To: git, git
  Cc: peff, jonathantanmy, szeder.dev, sbeller, gitster, Derrick Stolee

Teach git-commit-graph to read commits from stdin when the
--stdin-commits flag is specified. Commits reachable from these
commits are added to the graph. This is a much faster way to construct
the graph than inspecting all packed objects, but is restricted to
known tips.

For the Linux repository, 700,000+ commits were added to the graph
file starting from 'master' in 7-9 seconds, depending on the number
of packfiles in the repo (1, 24, or 120). If a commit graph file
already exists (and core.commitGraph=true), then this operation takes
only 1.8 seconds due to less time spent parsing commits.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt | 15 ++++++++++++++-
 builtin/commit-graph.c             | 27 +++++++++++++++++++++------
 commit-graph.c                     | 26 ++++++++++++++++++++++++--
 commit-graph.h                     |  4 +++-
 t/t5318-commit-graph.sh            | 19 +++++++++++++++++++
 5 files changed, 81 insertions(+), 10 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 93d50d1..43ac74b 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -45,7 +45,12 @@ graph-latest file if it is updated by the `--set-latest` option.
 +
 With the `--stdin-packs` option, generate the new commit graph by
 walking objects only in the specified packfiles and any commits in
-the existing graph-head.
+the existing graph-head. (Cannot be combined with --stdin-commits.)
++
+With the `--stdin-commits` option, generate the new commit graph by
+walking commits starting at the commits specified in stdin as a list
+of OIDs in hex, one OID per line. (Cannot be combined with
+--stdin-packs.)
 
 'read'::
 
@@ -79,6 +84,14 @@ $ git commit-graph write --set-latest --delete-expired
 $ echo <pack-index> | git commit-graph write --set-latest --delete-expired --stdin-packs
 ------------------------------------------------
 
+* Write a graph file, extending the current graph file using all
+* commits reachable from refs/heads/*, update graph-latest, and delete
+* stale graph files.
++
+------------------------------------------------
+$ git show-ref -s | git commit-graph write --set-latest --delete-expired --stdin-commits
+------------------------------------------------
+
 * Read basic information from a graph file.
 +
 ------------------------------------------------
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 5f08c40..9b92549 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -8,7 +8,7 @@
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <objdir>]"),
 	N_("git commit-graph read [--object-dir <objdir>] [--file=<hash>]"),
-	N_("git commit-graph write [--object-dir <objdir>] [--set-latest] [--delete-expired] [--stdin-packs]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--set-latest] [--delete-expired] [--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
@@ -18,7 +18,7 @@ static const char * const builtin_commit_graph_read_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--object-dir <objdir>] [--set-latest] [--delete-expired] [--stdin-packs]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--set-latest] [--delete-expired] [--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
@@ -28,6 +28,7 @@ static struct opts_commit_graph {
 	int set_latest;
 	int delete_expired;
 	int stdin_packs;
+	int stdin_commits;
 } opts;
 
 static int graph_read(int argc, const char **argv)
@@ -152,6 +153,8 @@ static int graph_write(int argc, const char **argv)
 	char *old_graph_name;
 	const char **pack_indexes = NULL;
 	int nr_packs = 0;
+	const char **commit_hex = NULL;
+	int nr_commits = 0;
 	const char **lines = NULL;
 	int nr_lines = 0;
 	int alloc_lines = 0;
@@ -166,6 +169,8 @@ static int graph_write(int argc, const char **argv)
 			N_("delete expired head graph file")),
 		OPT_BOOL('s', "stdin-packs", &opts.stdin_packs,
 			N_("only scan packfiles listed by stdin")),
+		OPT_BOOL('C', "stdin-commits", &opts.stdin_commits,
+			N_("start walk at commits listed by stdin")),
 		OPT_END(),
 	};
 
@@ -173,12 +178,14 @@ static int graph_write(int argc, const char **argv)
 			     builtin_commit_graph_write_options,
 			     builtin_commit_graph_write_usage, 0);
 
+	if (opts.stdin_packs && opts.stdin_commits)
+		die(_("cannot use both --stdin-commits and --stdin-packs"));
 	if (!opts.obj_dir)
 		opts.obj_dir = get_object_directory();
 
 	old_graph_name = get_graph_latest_contents(opts.obj_dir);
 
-	if (opts.stdin_packs) {
+	if (opts.stdin_packs || opts.stdin_commits) {
 		struct strbuf buf = STRBUF_INIT;
 		nr_lines = 0;
 		alloc_lines = 128;
@@ -190,13 +197,21 @@ static int graph_write(int argc, const char **argv)
 			strbuf_detach(&buf, NULL);
 		}
 
-		pack_indexes = lines;
-		nr_packs = nr_lines;
+		if (opts.stdin_packs) {
+			pack_indexes = lines;
+			nr_packs = nr_lines;
+		}
+		if (opts.stdin_commits) {
+			commit_hex = lines;
+			nr_commits = nr_lines;
+		}
 	}
 
 	graph_name = write_commit_graph(opts.obj_dir,
 					pack_indexes,
-					nr_packs);
+					nr_packs,
+					commit_hex,
+					nr_commits);
 
 	if (graph_name) {
 		if (opts.set_latest)
diff --git a/commit-graph.c b/commit-graph.c
index 943192c..b9e938c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -556,7 +556,9 @@ static void close_reachable(struct packed_oid_list *oids)
 
 char *write_commit_graph(const char *obj_dir,
 			 const char **pack_indexes,
-			 int nr_packs)
+			 int nr_packs,
+			 const char **commit_hex,
+			 int nr_commits)
 {
 	struct packed_oid_list oids;
 	struct packed_commit_list commits;
@@ -599,7 +601,27 @@ char *write_commit_graph(const char *obj_dir,
 			close_pack(p);
 		}
 	}
-	else
+
+	if (commit_hex) {
+		for (i = 0; i < nr_commits; i++) {
+			const char *end;
+			struct object_id oid;
+			struct commit *result;
+
+			if (commit_hex[i] && parse_oid_hex(commit_hex[i], &oid, &end))
+				continue;
+
+			result = lookup_commit_reference_gently(&oid, 1);
+
+			if (result) {
+				ALLOC_GROW(oids.list, oids.nr + 1, oids.alloc);
+				oidcpy(&oids.list[oids.nr], &(result->object.oid));
+				oids.nr++;
+			}
+		}
+	}
+
+	if (!pack_indexes && !commit_hex)
 		for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
 
 	close_reachable(&oids);
diff --git a/commit-graph.h b/commit-graph.h
index 5617842..2582a3c 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -41,7 +41,9 @@ extern void prepare_commit_graph(void);
 
 extern char *write_commit_graph(const char *obj_dir,
 				const char **pack_indexes,
-				int nr_packs);
+				int nr_packs,
+				const char **commit_hex,
+				int nr_commits);
 
 #endif
 
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 5bd1f77..2ed6b19 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -196,6 +196,25 @@ test_expect_success 'build graph from latest pack with closure' '
 graph_git_behavior 'graph from pack, commit 8 vs merge 1' commits/8 merge/1
 graph_git_behavior 'graph from pack, commit 8 vs merge 2' commits/8 merge/2
 
+test_expect_success 'build graph from commits with closure' '
+	git tag -a -m "merge" tag/merge merge/2 &&
+	git rev-parse tag/merge >commits-in &&
+	git rev-parse merge/1 >>commits-in &&
+	rm $objdir/info/graph-latest &&
+	graph6=$(cat commits-in | git commit-graph write --set-latest --delete-expired --stdin-commits) &&
+	test_path_is_file $objdir/info/$graph6 &&
+	test_path_is_missing $objdir/info/$graph5 &&
+	test_path_is_file $objdir/info/graph-latest &&
+	printf $graph6 >expect &&
+	test_cmp expect $objdir/info/graph-latest &&
+	git commit-graph read --file=$graph6 >output &&
+	graph_read_expect "6" &&
+	test_cmp expect output
+'
+
+graph_git_behavior 'graph from commits, commit 8 vs merge 1' commits/8 merge/1
+graph_git_behavior 'graph from commits, commit 8 vs merge 2' commits/8 merge/2
+
 test_expect_success 'setup bare repo' '
 	cd .. &&
 	git clone --bare --no-local full bare &&
-- 
2.7.4



* Re: [PATCH v4 01/13] commit-graph: add format document
  2018-02-19 18:53       ` [PATCH v4 01/13] commit-graph: add format document Derrick Stolee
@ 2018-02-20 20:49         ` Junio C Hamano
  2018-02-21 19:23         ` Stefan Beller
  2018-03-30 13:25         ` Jakub Narebski
  2 siblings, 0 replies; 146+ messages in thread
From: Junio C Hamano @ 2018-02-20 20:49 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

>  Documentation/technical/commit-graph-format.txt | 90 +++++++++++++++++++++++++
>  1 file changed, 90 insertions(+)
>  create mode 100644 Documentation/technical/commit-graph-format.txt

Hopefully just a few remaining nits.  Overall I find this written
really clearly.

> +== graph-*.graph files have the following format:
> +
> +In order to allow extensions that add extra data to the graph, we organize
> +the body into "chunks" and provide a binary lookup table at the beginning
> +of the body. The header includes certain values, such as number of chunks,
> +hash lengths and types.

We no longer have lengths stored.

> + ...
> +  The remaining data in the body is described one chunk at a time, and
> +  these chunks may be given in any order. Chunks are required unless
> +  otherwise specified.

It is good that this explicitly says chunks can come in any order,
and which ones are required.  It should also say which chunk can
appear multiple times.  I think all four chunk types we currently
define can have at most one instance in a file.

> +CHUNK DATA:
> +
> +  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
> +      The ith entry, F[i], stores the number of OIDs with first
> +      byte at most i. Thus F[255] stores the total
> +      number of commits (N).
> +
> +  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
> +      The OIDs for all commits in the graph, sorted in ascending order.

Somewhere in this document, we probably would want to say that this
format allows at most (1<<31)-1 commits recorded in the file (as
CGET and EDGE use a 31-bit uint to index into this table, using the
MSB for other purposes, and the all-1-bit pattern is also special),
and that when we refer to the "int-id" of a commit, it is this 31-bit
number that is an index into this table.
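
A minimal sketch of how a reader might interpret the second-parent
slot under the layout quoted in this message: 0xffffffff means no
parent, the MSB selects the Large Edge List, and the low 31 bits are
the payload.  Every name below is hypothetical and is not taken from
the patch; the constants only restate the values given in the quoted
format text.

```c
#include <stdint.h>

/* Hypothetical names; values restate the quoted format document. */
#define SKETCH_NO_PARENT 0xffffffffu /* "no parent in that position" */
#define SKETCH_MSB       0x80000000u /* flag bit, not part of the index */

enum sketch_slot_kind {
	SKETCH_SLOT_NONE,      /* no parent stored in this slot */
	SKETCH_SLOT_INT_ID,    /* payload is an index into the OID Lookup table */
	SKETCH_SLOT_EDGE_LIST  /* payload is a position in the Large Edge List */
};

/*
 * Classify a second-parent value that has already been converted to
 * host byte order.  On return, *payload holds the low 31 bits unless
 * the slot is empty.
 */
static enum sketch_slot_kind sketch_classify_parent(uint32_t value,
						    uint32_t *payload)
{
	if (value == SKETCH_NO_PARENT)
		return SKETCH_SLOT_NONE;
	*payload = value & ~SKETCH_MSB;
	return (value & SKETCH_MSB) ? SKETCH_SLOT_EDGE_LIST
				    : SKETCH_SLOT_INT_ID;
}
```

Because the all-ones pattern is reserved and the MSB is a flag, the
largest usable index is 0x7ffffffe, which is where the (1<<31)-1
limit on the number of commits comes from.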

> +  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
> +    * The first H bytes are for the OID of the root tree.
> +    * The next 8 bytes are for the int-ids of the first two parents
> +      of the ith commit. Stores value 0xffffffff if no parent in that
> +      position. If there are more than two parents, the second value
> +      has its most-significant bit on and the other bits store an array
> +      position into the Large Edge List chunk.
> +    * The next 8 bytes store the generation number of the commit and
> +      the commit time in seconds since EPOCH. The generation number
> +      uses the higher 30 bits of the first 4 bytes, while the commit
> +      time uses the 32 bits of the second 4 bytes, along with the lowest
> +      2 bits of the lowest byte, storing the 33rd and 34th bit of the
> +      commit time.
> +
> +  Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
> +      This list of 4-byte values store the second through nth parents for
> +      all octopus merges. The second parent value in the commit data stores
> +      an array position within this list along with the most-significant bit
> +      on. Starting at that array position, iterate through this list of int-ids
> +      for the parents until reaching a value with the most-significant bit on.
> +      The other bits correspond to the int-id of the last parent.
> +
> +TRAILER:
> +
> +	H-byte HASH-checksum of all of the above.
> +


* Re: [PATCH v4 02/13] graph: add commit graph design document
  2018-02-19 18:53       ` [PATCH v4 02/13] graph: add commit graph design document Derrick Stolee
@ 2018-02-20 21:42         ` Junio C Hamano
  2018-02-23 15:44           ` Derrick Stolee
  2018-02-21 19:34         ` Stefan Beller
  1 sibling, 1 reply; 146+ messages in thread
From: Junio C Hamano @ 2018-02-20 21:42 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> +2. Walking the entire graph to avoid topological order mistakes.

You have at least one more mention of "topological order mistakes"
below, but we commonly refer to this issue as "clock skew".  Using
that term highlights that there is no "mistake" in the topo order
algorithm; the mistakes are in the commit timestamps.

> +In practice, we expect some commits to be created recently and not stored
> +in the commit graph. We can treat these commits as having "infinite"
> +generation number and walk until reaching commits with known generation
> +number.

Hmm, "pretend infinity" is an interesting approach---I need to think
about it a bit more if it is sufficient.

> +- .graph files are managed only by the 'commit-graph' builtin. These are not
> +  updated automatically during clone, fetch, repack, or creating new commits.

OK.  s/builtin/subcommand/; it does not make much difference if it
is a built-in or standalone command.

> +- There is no 'verify' subcommand for the 'commit-graph' builtin to verify
> +  the contents of the graph file agree with the contents in the ODB.

I am not entirely sure about the merit of going into this level of
detail.  Being able to use only a single file looks like a more
fundamental design limitation, which deserves to be described in this
section, and we could ship the subsystem with that limitation.

But the lack of a verify that can be called from fsck is merely a
matter of the subsystem not being mature enough (to be written,
reviewed and tested), not a fundamental limitation, and we will not
be shipping the subsystem until that limitation is lifted.

So I'd guess that we prefer this bullet item to be in the commit log
message, not here, that describes the current status of the
development (as opposed to the state of the subsystem).

> +- Generation numbers are not computed in the current version. The file
> +  format supports storing them, along with a mechanism to upgrade from
> +  a file without generation numbers to one that uses them.

Exactly the same comment as above applies to this item.

> +- The commit graph is currently incompatible with commit grafts. This can be
> +  remedied by duplicating or refactoring the current graft logic.

Hmm.  Can it be lifted without first allowing us to use more than
one commit graph file (i.e. one for "traverse while honoring the
grafts", the other for "traverse while ignoring the grafts")?

> +- After computing and storing generation numbers, we must make graph
> +  walks aware of generation numbers to gain the performance benefits they
> +  enable. This will mostly be accomplished by swapping a commit-date-ordered
> +  priority queue with one ordered by generation number. The following
> +  operations are important candidates:
> +
> +    - paint_down_to_common()
> +    - 'log --topo-order'

Yes.

> +- The graph currently only adds commits to a previously existing graph.
> +  When writing a new graph, we could check that the ODB still contains
> +  the commits and choose to remove the commits that are deleted from the
> +  ODB. For performance reasons, this check should remain optional.

The last sentence is somehow unconvincing.  It probably is not
appropriate for the "Future Work" section to be making a hurried
design decision before having any working verification code to run
benchmarks on.

> +- Currently, parse_commit_gently() requires filling in the root tree
> +  object for a commit. This passes through lookup_tree() and consequently
> +  lookup_object(). Also, it calls lookup_commit() when loading the parents.
> +  These method calls check the ODB for object existence, even if the
> +  consumer does not need the content. For example, we do not need the
> +  tree contents when computing merge bases. Now that commit parsing is
> +  removed from the computation time, these lookup operations are the
> +  slowest operations keeping graph walks from being fast. Consider
> +  loading these objects without verifying their existence in the ODB and
> +  only loading them fully when consumers need them. Consider a method
> +  such as "ensure_tree_loaded(commit)" that fully loads a tree before
> +  using commit->tree.

Very good idea.

> +- The current design uses the 'commit-graph' builtin to generate the graph.
> +  When this feature stabilizes enough to recommend to most users, we should
> +  add automatic graph writes to common operations that create many commits.
> +  For example, one coulde compute a graph on 'clone', 'fetch', or 'repack'
> +  commands.

s/coulde/could/.

Also do not forget "fsck" that calls "verify".  That is more urgent
than integration with any other subcommand.

> +- A server could provide a commit graph file as part of the network protocol
> +  to avoid extra calculations by clients.

We need to assess the riskiness and threat models regarding this, if
we really want to follow this "could" through.  I would imagine that
the cost for verification is comparable to the cost for regenerating,
in which case it may not be worth doing this _unless_ the user opts
into it saying that the other side over the wire is trusted without
any reservation.



* Re: [PATCH v4 03/13] commit-graph: create git-commit-graph builtin
  2018-02-19 18:53       ` [PATCH v4 03/13] commit-graph: create git-commit-graph builtin Derrick Stolee
@ 2018-02-20 21:51         ` Junio C Hamano
  2018-02-21 18:58           ` Junio C Hamano
  2018-02-26 16:25         ` SZEDER Gábor
  1 sibling, 1 reply; 146+ messages in thread
From: Junio C Hamano @ 2018-02-20 21:51 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> +int cmd_commit_graph(int argc, const char **argv, const char *prefix)
> +{
> +	static struct option builtin_commit_graph_options[] = {
> +		{ OPTION_STRING, 'p', "object-dir", &opts.obj_dir,
> +			N_("dir"),
> +			N_("The object directory to store the graph") },

I have a suspicion that this was modeled after some other built-in
that has a similar issue (perhaps written a long time ago), but isn't
OPT_STRING() sufficient to define this element these days?

Or am I missing something?

Why squat on short-and-sweet "-p"?  For that matter, since this is
not expected to be an end-user facing command anyway, I suspect that
we do not want to allocate a single letter option from day one, which
paints us into a corner from which we cannot escape.


* Re: [PATCH v4 04/13] commit-graph: implement write_commit_graph()
  2018-02-19 18:53       ` [PATCH v4 04/13] commit-graph: implement write_commit_graph() Derrick Stolee
@ 2018-02-20 22:57         ` Junio C Hamano
  2018-02-23 17:23           ` Derrick Stolee
  2018-02-26 16:10         ` SZEDER Gábor
  2018-02-28 18:47         ` Junio C Hamano
  2 siblings, 1 reply; 146+ messages in thread
From: Junio C Hamano @ 2018-02-20 22:57 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> +#define GRAPH_OID_VERSION_SHA1 1
> +#define GRAPH_OID_LEN_SHA1 20

This hardcoded 20 on the right hand side of this #define is probably
problematic.   Unless you are planning to possibly store truncated
hash value for some future hash algorithm, GRAPH_OID_LEN_$HASH should
always be the same as GIT_$HASH_RAWSZ, I would think.  IOW

    #define GRAPH_OID_LEN_SHA1 GIT_SHA1_RAWSZ

perhaps?

> +static void write_graph_chunk_fanout(struct sha1file *f,
> +				     struct commit **commits,
> +				     int nr_commits)
> +{
> +	uint32_t i, count = 0;
> +	struct commit **list = commits;
> +	struct commit **last = commits + nr_commits;
> +
> +	/*
> +	 * Write the first-level table (the list is sorted,
> +	 * but we use a 256-entry lookup to be able to avoid
> +	 * having to do eight extra binary search iterations).
> +	 */
> +	for (i = 0; i < 256; i++) {
> +		while (list < last) {
> +			if ((*list)->object.oid.hash[0] != i)
> +				break;
> +			count++;
> +			list++;
> +		}

If count and list are always incremented in unison, perhaps you do
not need an extra variable "last".  If typeof(nr_commits) is wider
than typeof(count), this loop and the next write-be32 is screwed
anyway ;-)

This comment probably applies equally to some other uses of the same
"compute last pointer to compare with running pointer for
termination" pattern in this patch.
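
A minimal sketch of the same fanout computation driven by the running
count alone, with no separate "last" pointer.  To stay self-contained
it takes an array holding the first byte of each OID (already sorted)
instead of a list of struct commit pointers; that input shape and the
function name are assumptions made for illustration only.

```c
#include <stdint.h>

/*
 * first_bytes[] holds the first byte of each OID, in ascending order.
 * After the call, fanout[i] is the number of OIDs whose first byte
 * is at most i, so fanout[255] equals nr.
 */
static void sketch_fill_fanout(const unsigned char *first_bytes,
			       uint32_t nr, uint32_t fanout[256])
{
	uint32_t count = 0;
	unsigned int i;

	for (i = 0; i < 256; i++) {
		/* count doubles as the cursor into the sorted input */
		while (count < nr && first_bytes[count] == i)
			count++;
		fanout[i] = count;
	}
}
```

Here the count and the cursor are the same variable, so the loop
terminates on "count < nr" and the two can never drift apart.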

> +		sha1write_be32(f, count);
> +	}
> +}

> +static int commit_pos(struct commit **commits, int nr_commits,
> +		      const struct object_id *oid, uint32_t *pos)
> +{

It is a bit unusual to see something_pos() return an integer that
is *NOT* the position.  Dropping the *pos parameter, returning "mid"
when commits[mid] is what we wanted to see, and otherwise returning
"-1 - first" to signal the position at which we _would_ have found
the object, if it were in the table, would make it more consistent
with the usual convention.

Don't we even have such a generalized binary search helper already
somewhere in the system?
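
A minimal sketch of that return convention, assuming a flat table of
fixed-width raw hashes rather than the struct commit list used in the
patch.  The function name and the table layout are hypothetical; this
is not the existing helper the question above alludes to.

```c
#include <stddef.h>
#include <string.h>

/*
 * Binary search over nr sorted entries of hash_len bytes each.
 * Returns the index of key when found; otherwise returns
 * "-1 - first", where first is the index at which key would have
 * to be inserted to keep the table sorted.
 */
static int sketch_hash_pos(const unsigned char *table, int nr,
			   size_t hash_len, const unsigned char *key)
{
	int first = 0, last = nr;

	while (first < last) {
		int mid = first + (last - first) / 2;
		int cmp = memcmp(key, table + (size_t)mid * hash_len,
				 hash_len);

		if (!cmp)
			return mid;
		if (cmp < 0)
			last = mid;
		else
			first = mid + 1;
	}
	return -1 - first;
}
```

A caller tests "pos < 0" for a miss and can recover the insertion
point as "-1 - pos" without needing an extra output parameter.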

> +static void write_graph_chunk_data(struct sha1file *f, int hash_len,
> +				   struct commit **commits, int nr_commits)
> +{
> +	struct commit **list = commits;
> +	struct commit **last = commits + nr_commits;
> +	uint32_t num_large_edges = 0;
> +
> +	while (list < last) {
> +		struct commit_list *parent;
> +		uint32_t int_id;
> +		uint32_t packedDate[2];
> +
> +...
> +		if (!parent)
> +			int_id = GRAPH_PARENT_NONE;
> +		else if (parent->next)
> +			int_id = GRAPH_LARGE_EDGES_NEEDED | num_large_edges;
> +		else if (!commit_pos(commits, nr_commits,
> +				    &(parent->item->object.oid), &int_id))
> +			int_id = GRAPH_PARENT_MISSING;
> +
> +		sha1write_be32(f, int_id);
> +
> +		if (parent && parent->next) {

This is equivalent to checking "int_id & GRAPH_LARGE_EDGES_NEEDED",
right?  Not suggesting to use the other form of checks, but trying
to see what's the best way to express it in the most readable way.

> +			do {
> +				num_large_edges++;
> +				parent = parent->next;
> +			} while (parent);

It feels somewhat wasteful to traverse the commit's parents list
only to count, without populating the octopus table (which I
understand is assumed to be minority case under this design).

> +		}
> +
> +		if (sizeof((*list)->date) > 4)
> +			packedDate[0] = htonl(((*list)->date >> 32) & 0x3);
> +		else
> +			packedDate[0] = 0;

OK, the undefined pattern in the previous round is now gone ;-)  Good.
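
A minimal sketch of the 8-byte layout described in the format
document (generation number in the upper 30 bits of the first word,
bits 33 and 34 of the commit time in its lowest 2 bits, and the low
32 bits of the time in the second word).  It works in host byte order
and leaves the htonl() step out; the function names are hypothetical.

```c
#include <stdint.h>

/* Pack a generation number and a 34-bit commit time into two words. */
static void sketch_pack_date(uint64_t date, uint32_t generation,
			     uint32_t out[2])
{
	/* keep only 30 bits of generation, then add date bits 32..33 */
	out[0] = ((generation & 0x3fffffffu) << 2) |
		 (uint32_t)((date >> 32) & 0x3);
	out[1] = (uint32_t)(date & 0xffffffffu);
}

/* Recover the 34-bit commit time from the two packed words. */
static uint64_t sketch_unpack_date(const uint32_t in[2])
{
	return ((uint64_t)(in[0] & 0x3) << 32) | in[1];
}

/* Recover the 30-bit generation number from the first word. */
static uint32_t sketch_unpack_generation(const uint32_t in[2])
{
	return in[0] >> 2;
}
```

Taking the date as a uint64_t keeps the ">> 32" shift well defined
regardless of how wide timestamp_t is on the platform.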

> +		packedDate[1] = htonl((*list)->date);
> +		sha1write(f, packedDate, 8);
> +
> +		list++;
> +	}
> +}
> +
> +static void write_graph_chunk_large_edges(struct sha1file *f,
> +					  struct commit **commits,
> +					  int nr_commits)
> +{
> +	struct commit **list = commits;
> +	struct commit **last = commits + nr_commits;
> +	struct commit_list *parent;
> +
> +	while (list < last) {
> +		int num_parents = 0;
> +		for (parent = (*list)->parents; num_parents < 3 && parent;
> +		     parent = parent->next)
> +			num_parents++;
> +
> +		if (num_parents <= 2) {
> +			list++;
> +			continue;
> +		}
> +
> +		/* Since num_parents > 2, this initializer is safe. */
> +		for (parent = (*list)->parents->next; parent; parent = parent->next) {
> +			uint32_t int_id, swap_int_id;
> +			uint32_t last_edge = 0;
> +			if (!parent->next)
> +				last_edge |= GRAPH_LAST_EDGE;
> +
> +			if (commit_pos(commits, nr_commits,
> +				       &(parent->item->object.oid),
> +				       &int_id))
> +				swap_int_id = htonl(int_id | last_edge);
> +			else
> +				swap_int_id = htonl(GRAPH_PARENT_MISSING | last_edge);
> +			sha1write(f, &swap_int_id, 4);

What does "swap_" in the name of this variable mean?  For some
archs, there is no swap.  The only difference between int_id and the
variable is that its MSB may possibly be smudged with last_edge bit.

This is a tangent, but after having seen many instances of "int_id",
I started to feel that it is grossly misnamed.  We do not care about
its "int" ness---what's more significant about it is that we can use
it as a short identifier in place of a full object name, given the
table of known OIDs.  "oid_table_index" may be a better name (but
others may be able to suggest an even better one).

	int pos;
	uint32_t oid_table_pos;
	pos = commit_pos(commits, nr_commits, &parent->item->object.oid);
	oid_table_pos = (pos < 0) ? GRAPH_PARENT_MISSING : pos;
	if (!parent->next)
		oid_table_pos |= GRAPH_LAST_EDGE;
	oid_table_pos = htonl(oid_table_pos);
	sha1write(f, &oid_table_pos, sizeof(oid_table_pos));

or something like that, perhaps?

> +static int commit_compare(const void *_a, const void *_b)
> +{
> +	struct object_id *a = (struct object_id *)_a;
> +	struct object_id *b = (struct object_id *)_b;
> +	return oidcmp(a, b);
> +}

I think oidcmp() takes const pointers, so there is no need to
discard constness from the parameter like this code does.  Also I
think we tend to prefer writing a_/b_ (instead of _a/_b) to appease
language lawyers who do not want us mere mortals to use names that
begin with underscore.

> +static int if_packed_commit_add_to_list(const struct object_id *oid,
> +					struct packed_git *pack,
> +					uint32_t pos,
> +					void *data)

That is a strange name.  "collect packed commits", perhaps?

> +char *write_commit_graph(const char *obj_dir)
> +{
> +	struct packed_oid_list oids;
> +	struct packed_commit_list commits;
> +	struct sha1file *f;
> +	int i, count_distinct = 0;
> +	DIR *info_dir;
> +	struct strbuf tmp_file = STRBUF_INIT;
> +	struct strbuf graph_file = STRBUF_INIT;
> +	unsigned char final_hash[GIT_MAX_RAWSZ];
> +	char *graph_name;
> +	int fd;
> +	uint32_t chunk_ids[5];
> +	uint64_t chunk_offsets[5];
> +	int num_chunks;
> +	int num_long_edges;
> +	struct commit_list *parent;
> +
> +	oids.nr = 0;
> +	oids.alloc = (int)(0.15 * approximate_object_count());

Heh, a traditionalist would probably avoid unnecessary use of float
and use something like 1/4 or 1/8 ;-)  After all, it is merely a
ballpark guesstimate.

> +	num_long_edges = 0;

This again is about naming, but I find it a bit unnatural to call
the edge between a child and its octopus parents "long".  Individual
edges are not long--the only thing that is long is your "list of
edges".  Some other codepaths in this patch seem to call the same
concept with s/long/large/, which I found somewhat puzzling.

> +	for (i = 0; i < oids.nr; i++) {
> +		int num_parents = 0;
> +		if (i > 0 && !oidcmp(&oids.list[i-1], &oids.list[i]))
> +			continue;
> +
> +		commits.list[commits.nr] = lookup_commit(&oids.list[i]);
> +		parse_commit(commits.list[commits.nr]);
> +
> +		for (parent = commits.list[commits.nr]->parents;
> +		     parent; parent = parent->next)
> +			num_parents++;
> +
> +		if (num_parents > 2)
> +			num_long_edges += num_parents - 1;

OK, so we count how many entries we will record in the overflow
parent table, and...

> +
> +		commits.nr++;
> +	}
> +	num_chunks = num_long_edges ? 4 : 3;

... if we do not have any octopus commit, we do not need the chunk
for the overflow parent table.  Makes sense.

> +	strbuf_addf(&tmp_file, "%s/info", obj_dir);
> +	info_dir = opendir(tmp_file.buf);
> +
> +	if (!info_dir && mkdir(tmp_file.buf, 0777) < 0)
> +		die_errno(_("cannot mkdir %s"), tmp_file.buf);
> +	if (info_dir)
> +		closedir(info_dir);
> +	strbuf_addstr(&tmp_file, "/tmp_graph_XXXXXX");
> +
> +	fd = git_mkstemp_mode(tmp_file.buf, 0444);
> +	if (fd < 0)
> +		die_errno("unable to create '%s'", tmp_file.buf);

It is not performance critical, but it feels a bit wasteful to
opendir merely to see if something exists as a directory, and it is
misleading to the readers (it looks as if we care about what files
we already have in the directory).

The approach that optimizes for the most common case would be to

	- prepare full path to the tempfile first
	- try create with mkstemp
	  - if successful, you do not have to worry about creating
	    the directory at all, which is the most common case
	- see why mkstemp step above failed.  Was it because you
	  did not have the surrounding directory?
	  - if not, there is no point continuing.  Just error out.
	  - if it was due to missing directory, try creating one.
	- try create with mkstemp
	  - if successful, all is well.
	  - otherwise there isn't anything more we can do here.
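
The steps above can be sketched roughly as follows.  This is only an
illustration using plain POSIX calls; create_graph_tempfile() is a
hypothetical helper, and it uses mkstemp() (which creates the file with
mode 0600) rather than git_mkstemp_mode() from the patch.  The template
is rebuilt before the retry because a failed mkstemp() may leave the
buffer modified.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the patch: create a temporary file
 * "<dir>/tmp_graph_XXXXXX", creating <dir> only when the first attempt
 * fails because the directory is missing.  Returns an open fd and
 * leaves the final name in 'path', or returns -1 with errno set.
 */
static int create_graph_tempfile(const char *dir, char *path, size_t len)
{
	int attempt;

	for (attempt = 0; attempt < 2; attempt++) {
		int fd;
		int n = snprintf(path, len, "%s/tmp_graph_XXXXXX", dir);

		if (n < 0 || (size_t)n >= len) {
			errno = ENAMETOOLONG;
			return -1;
		}

		fd = mkstemp(path);
		if (fd >= 0)
			return fd;	/* common case: directory existed */
		if (errno != ENOENT || attempt)
			return -1;	/* other failure, or the retry failed */
		if (mkdir(dir, 0777) < 0 && errno != EEXIST)
			return -1;	/* could not create the directory */
	}
	return -1;
}
```

In the common case this costs a single mkstemp() call and never looks
at the directory at all.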



> +
> +	f = sha1fd(fd, tmp_file.buf);
> +
> +	sha1write_be32(f, GRAPH_SIGNATURE);
> +
> +	sha1write_u8(f, GRAPH_VERSION);
> +	sha1write_u8(f, GRAPH_OID_VERSION);
> +	sha1write_u8(f, num_chunks);
> +	sha1write_u8(f, 0); /* unused padding byte */
> +
> +	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
> +	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
> +	chunk_ids[2] = GRAPH_CHUNKID_DATA;
> +	if (num_long_edges)
> +		chunk_ids[3] = GRAPH_CHUNKID_LARGEEDGES;
> +	else
> +		chunk_ids[3] = 0;
> +	chunk_ids[4] = 0;
> +
> +	chunk_offsets[0] = 8 + GRAPH_CHUNKLOOKUP_SIZE;
> +	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
> +	chunk_offsets[2] = chunk_offsets[1] + GRAPH_OID_LEN * commits.nr;
> +	chunk_offsets[3] = chunk_offsets[2] + (GRAPH_OID_LEN + 16) * commits.nr;
> +	chunk_offsets[4] = chunk_offsets[3] + 4 * num_long_edges;

Do we have to care about overflowing any of the above?  For example,
the format allows only up to (1<<31)-1 commits, but did something
actually check if commits.nr at this point stayed under that limit?
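
One way such a check could look is sketched below.  The names and
constants are assumptions made for illustration: the lookup table size
is taken as (C + 1) * 12 bytes from the format document, since the
value of GRAPH_CHUNKLOOKUP_SIZE is not visible here, and the limit on
the number of overflow entries is my own guess.  Once both counts are
bounded, the 64-bit sums cannot wrap.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_HEADER_SIZE 8
#define SKETCH_TOC_ENTRY 12
#define SKETCH_FANOUT_SIZE (256 * 4)
#define SKETCH_OID_LEN 20
/* The format allows at most (1 << 31) - 1 commits. */
#define SKETCH_MAX_COMMITS ((UINT32_C(1) << 31) - 1)

/*
 * Hypothetical helper: fill offsets[0..4] the same way the patch does,
 * but refuse counts that the on-disk format cannot represent.
 * Returns 0 on success, -1 if a limit is exceeded.
 */
static int compute_chunk_offsets(uint64_t nr_commits, uint64_t nr_long_edges,
				 int num_chunks, uint64_t offsets[5])
{
	if (nr_commits > SKETCH_MAX_COMMITS)
		return -1;
	/* assumed limit: edge positions must fit in 31 bits */
	if (nr_long_edges > SKETCH_MAX_COMMITS)
		return -1;

	offsets[0] = SKETCH_HEADER_SIZE +
		     (uint64_t)(num_chunks + 1) * SKETCH_TOC_ENTRY;
	offsets[1] = offsets[0] + SKETCH_FANOUT_SIZE;
	offsets[2] = offsets[1] + SKETCH_OID_LEN * nr_commits;
	offsets[3] = offsets[2] + (SKETCH_OID_LEN + 16) * nr_commits;
	offsets[4] = offsets[3] + 4 * nr_long_edges;
	return 0;
}
```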



* Re: [PATCH v4 03/13] commit-graph: create git-commit-graph builtin
  2018-02-20 21:51         ` Junio C Hamano
@ 2018-02-21 18:58           ` Junio C Hamano
  2018-02-23 16:07             ` Derrick Stolee
  0 siblings, 1 reply; 146+ messages in thread
From: Junio C Hamano @ 2018-02-21 18:58 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:

> Derrick Stolee <stolee@gmail.com> writes:
>
>> +int cmd_commit_graph(int argc, const char **argv, const char *prefix)
>> +{
>> +	static struct option builtin_commit_graph_options[] = {
>> +		{ OPTION_STRING, 'p', "object-dir", &opts.obj_dir,
>> +			N_("dir"),
>> +			N_("The object directory to store the graph") },
>
> I have a suspicion that this was modeled after some other built-in
> that has a similar issue (perhaps written long time ago), but isn't
> OPT_STRING() sufficient to define this element these days?
>
> Or am I missing something?
>
> Why squat on short-and-sweet "-p"?  For that matter, since this is
> not expected to be end-user facing command anyway, I suspect that we
> do not want to allocate a single letter option from day one, which
> paints ourselves into a corner from where we cannot escape.

I suspect that exactly the same comment applies to patches in this
series that add other subcommands (I just saw one in the patch for
adding 'write').



* Re: [PATCH v4 01/13] commit-graph: add format document
  2018-02-19 18:53       ` [PATCH v4 01/13] commit-graph: add format document Derrick Stolee
  2018-02-20 20:49         ` Junio C Hamano
@ 2018-02-21 19:23         ` Stefan Beller
  2018-02-21 19:45           ` Derrick Stolee
  2018-03-30 13:25         ` Jakub Narebski
  2 siblings, 1 reply; 146+ messages in thread
From: Stefan Beller @ 2018-02-21 19:23 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Jeff Hostetler, Jeff King, Jonathan Tan, SZEDER Gábor,
	Junio C Hamano, Derrick Stolee

On Mon, Feb 19, 2018 at 10:53 AM, Derrick Stolee <stolee@gmail.com> wrote:
> Add document specifying the binary format for commit graphs. This
> format allows for:
>
> * New versions.
> * New hash functions and hash lengths.
> * Optional extensions.
>
> Basic header information is followed by a binary table of contents
> into "chunks" that include:
>
> * An ordered list of commit object IDs.
> * A 256-entry fanout into that list of OIDs.
> * A list of metadata for the commits.
> * A list of "large edges" to enable octopus merges.
>
> The format automatically includes two parent positions for every
> commit. This favors speed over space, since using only one position
> per commit would cause an extra level of indirection for every merge
> commit. (Octopus merges suffer from this indirection, but they are
> very rare.)
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/commit-graph-format.txt | 90 +++++++++++++++++++++++++
>  1 file changed, 90 insertions(+)
>  create mode 100644 Documentation/technical/commit-graph-format.txt
>
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> new file mode 100644
> index 0000000..11b18b5
> --- /dev/null
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -0,0 +1,90 @@
> +Git commit graph format
> +=======================
> +
> +The Git commit graph stores a list of commit OIDs and some associated
> +metadata, including:
> +
> +- The generation number of the commit. Commits with no parents have
> +  generation number 1; commits with parents have generation number
> +  one more than the maximum generation number of its parents. We
> +  reserve zero as special, and can be used to mark a generation
> +  number invalid or as "not computed".
> +
> +- The root tree OID.
> +
> +- The commit date.
> +
> +- The parents of the commit, stored using positional references within
> +  the graph file.
> +
> +== graph-*.graph files have the following format:
> +
> +In order to allow extensions that add extra data to the graph, we organize
> +the body into "chunks" and provide a binary lookup table at the beginning
> +of the body. The header includes certain values, such as number of chunks,
> +hash lengths and types.
> +
> +All 4-byte numbers are in network order.
> +
> +HEADER:
> +
> +  4-byte signature:
> +      The signature is: {'C', 'G', 'P', 'H'}
> +
> +  1-byte version number:
> +      Currently, the only valid version is 1.
> +
> +  1-byte Object Id Version (1 = SHA-1)
> +
> +  1-byte number (C) of "chunks"
> +
> +  1-byte (reserved for later use)

What should clients of today do with it?
* ignore it completely [as they have no idea what it is] or
* throw hands up in the air if it is anything other than 0 ?
  [because clearly we will increment the version
   or have new information in a new chunk instead of just sneaking
   in information here?]

> +CHUNK LOOKUP:
> +
> +  (C + 1) * 12 bytes listing the table of contents for the chunks:
> +      First 4 bytes describe chunk id. Value 0 is a terminating label.
> +      Other 8 bytes provide offset in current file for chunk to start.

offset [in bytes? I could imagine having a larger granularity here,
because chunks don't sound small.]

> +      (Chunks are ordered contiguously in the file, so you can infer
> +      the length using the next chunk position if necessary.)
> +
> +  The remaining data in the body is described one chunk at a time, and
> +  these chunks may be given in any order. Chunks are required unless
> +  otherwise specified.
> +
> +CHUNK DATA:
> +
> +  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
> +      The ith entry, F[i], stores the number of OIDs with first
> +      byte at most i. Thus F[255] stores the total
> +      number of commits (N).

[ so in small repos, where there are fewer than 256 objects,
F[i] == F[i+1], for all i'th where there is no object starting with i byte]
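
To make that reading concrete, here is a small sketch of how a fanout
table could be built and used.  build_fanout() and fanout_range() are
hypothetical names, and the input is assumed to be the first byte of
each OID, already in ascending order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * fanout[i] = number of OIDs whose first byte is at most i.
 * 'first_bytes' must be sorted in ascending order.
 */
static void build_fanout(const unsigned char *first_bytes, size_t n,
			 uint32_t fanout[256])
{
	size_t pos = 0;
	int i;

	for (i = 0; i < 256; i++) {
		while (pos < n && first_bytes[pos] <= i)
			pos++;
		fanout[i] = (uint32_t)pos;
	}
}

/*
 * Positions [*lo, *hi) in the OID lookup table hold the OIDs that
 * start with byte 'first'; the range is empty when *lo == *hi.
 */
static void fanout_range(const uint32_t fanout[256], unsigned char first,
			 uint32_t *lo, uint32_t *hi)
{
	*lo = first ? fanout[first - 1] : 0;
	*hi = fanout[first];
}
```

A lookup would then binary-search only within [lo, hi) of the OID
Lookup chunk instead of all N entries.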

> +  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
> +      The OIDs for all commits in the graph, sorted in ascending order.
> +
> +  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
> +    * The first H bytes are for the OID of the root tree.
> +    * The next 8 bytes are for the int-ids of the first two parents
> +      of the ith commit. Stores value 0xffffffff if no parent in that
> +      position. If there are more than two parents, the second value
> +      has its most-significant bit on and the other bits store an array
> +      position into the Large Edge List chunk.
> +    * The next 8 bytes store the generation number of the commit and
> +      the commit time in seconds since EPOCH. The generation number
> +      uses the higher 30 bits of the first 4 bytes, while the commit
> +      time uses the 32 bits of the second 4 bytes, along with the lowest
> +      2 bits of the lowest byte, storing the 33rd and 34th bit of the
> +      commit time.
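
As I read the last bullet, the two fields unpack as in the sketch
below.  The helper names are hypothetical, and I am assuming "the
lowest 2 bits of the lowest byte" refers to the low two bits of the
first 4-byte word.

```c
#include <assert.h>
#include <stdint.h>

/* Read a 4-byte value stored in network (big-endian) order. */
static uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * 'p' points at the 8 bytes that follow the two parent positions.
 * Word 1: upper 30 bits = generation, lower 2 bits = date bits 33-34.
 * Word 2: lower 32 bits of the commit date.
 */
static void decode_gen_and_date(const unsigned char *p,
				uint32_t *generation, uint64_t *date)
{
	uint32_t word1 = get_be32(p);
	uint32_t word2 = get_be32(p + 4);

	*generation = word1 >> 2;
	*date = ((uint64_t)(word1 & 0x3) << 32) | word2;
}
```
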
> +
> +  Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
> +      This list of 4-byte values store the second through nth parents for
> +      all octopus merges. The second parent value in the commit data stores
> +      an array position within this list along with the most-significant bit
> +      on. Starting at that array position, iterate through this list of int-ids
> +      for the parents until reaching a value with the most-significant bit on.
> +      The other bits correspond to the int-id of the last parent.
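
A sketch of the walk described here, under some assumptions:
collect_extra_parents() is a hypothetical name, the list is taken to be
already converted to host byte order, and 0xffffffff in the second
parent slot is treated as "no second parent" as in the Commit Data
description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NO_PARENT 0xffffffffu
#define SKETCH_MSB       0x80000000u
#define SKETCH_POS_MASK  0x7fffffffu

/*
 * Collect the second through nth parents of an octopus merge.
 * 'second_value' is the second parent field from the commit data.
 * Returns the number of int-ids written to 'out', 0 if there is no
 * second parent, or -1 if the value does not point into the list or
 * the list is malformed.
 */
static int collect_extra_parents(const uint32_t *edges, size_t nr_edges,
				 uint32_t second_value,
				 uint32_t *out, size_t out_cap)
{
	size_t pos, n = 0;

	if (second_value == SKETCH_NO_PARENT)
		return 0;
	if (!(second_value & SKETCH_MSB))
		return -1;	/* plain second parent, not a list position */

	for (pos = second_value & SKETCH_POS_MASK; pos < nr_edges; pos++) {
		if (n == out_cap)
			return -1;
		out[n++] = edges[pos] & SKETCH_POS_MASK;
		if (edges[pos] & SKETCH_MSB)
			return (int)n;	/* MSB marks the last parent */
	}
	return -1;	/* ran off the end without a terminator */
}
```
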
> +
> +TRAILER:
> +
> +       H-byte HASH-checksum of all of the above.
> +
> --
> 2.7.4

Makes sense so far, I'll read on.
I agree with Junio, that I could read this documentation without
the urge to point out nits. :)

Thanks,
Stefan


* Re: [PATCH v4 05/13] commit-graph: implement 'git-commit-graph write'
  2018-02-19 18:53       ` [PATCH v4 05/13] commit-graph: implement 'git-commit-graph write' Derrick Stolee
@ 2018-02-21 19:25         ` Junio C Hamano
  0 siblings, 0 replies; 146+ messages in thread
From: Junio C Hamano @ 2018-02-21 19:25 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> +static int graph_write(int argc, const char **argv)
> +{
> + ...
> +	graph_name = write_commit_graph(opts.obj_dir);
> +
> +	if (graph_name) {
> +		printf("%s\n", graph_name);
> +		FREE_AND_NULL(graph_name);
> +	}
> +
> +	return 0;
> +}

After successfully writing a graph file out, write_commit_graph()
signals that fact by returning a non-NULL pointer, so that this
caller can report the filename to the end user.  This caller
protects itself from a NULL return, presumably because the callee
uses it to signal an error when writing the graph file out?  

Is it OK to lose that 1-bit of information, or should we have more like

	if (graph_name) {
		printf;
		return 0;
	} else {
		return -1;
	}

>  int cmd_commit_graph(int argc, const char **argv, const char *prefix)
>  {
>  	static struct option builtin_commit_graph_options[] = {
> -		{ OPTION_STRING, 'p', "object-dir", &opts.obj_dir,
> +		{ OPTION_STRING, 'o', "object-dir", &opts.obj_dir,
>  			N_("dir"),
>  			N_("The object directory to store the graph") },
>  		OPT_END(),

The same comment for a no-op patch from an earlier step applies
here, and we have another one that we saw above in graph_write().

> @@ -31,6 +67,11 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
>  			     builtin_commit_graph_usage,
>  			     PARSE_OPT_STOP_AT_NON_OPTION);
>  
> +	if (argc > 0) {
> +		if (!strcmp(argv[0], "write"))
> +			return graph_write(argc, argv);

And if we fix "graph_write" to report an error with negative return,
this needs to become something like

		return !!graph_write(argc, argv);

as we do not want to return a negative value to be passed via
run_builtin() to exit(3) in handle_builtin().
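
Put together, the shape I have in mind is roughly the following.  The
function names are stand-ins for illustration, not the ones in the
patch.  An exit status keeps only its low 8 bits, so a raw -1 would
show up as 255 to the caller; !! folds every failure into 1.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for graph_write(): 0 on success, -1 on failure. */
static int sketch_graph_write(const char *graph_name)
{
	if (!graph_name)
		return -1;	/* the writer signalled an error */
	printf("%s\n", graph_name);
	return 0;
}

/* Hypothetical stand-in for the dispatch in cmd_commit_graph(). */
static int sketch_cmd(const char *graph_name)
{
	/* never hand a negative value on towards exit(3) */
	return !!sketch_graph_write(graph_name);
}
```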

> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> new file mode 100755
> index 0000000..6a5e93c
> --- /dev/null
> +++ b/t/t5318-commit-graph.sh
> @@ -0,0 +1,119 @@
> +#!/bin/sh
> +
> +test_description='commit graph'
> +. ./test-lib.sh
> +
> +test_expect_success 'setup full repo' '
> +	rm -rf .git &&

I am perfectly OK with creating a separate subdirectory called
'full' in the trash directory given by the test framework, but
unless absolutely necessary I'd rather not see "rm -rf",
especially on ".git", in our test scripts.  People can screw up
doing various things (like copying and pasting).

> +	mkdir full &&
> +	cd full &&
> +	git init &&
> +	objdir=".git/objects"
> +'

And I absolutely do not want to see "cd full" that leaves and stays
in the subdirectory after this step is done.  

Imagine what happens if any earlier step fails before doing "cd
full", causing this "setup full" step to report failure, and then
the test goes on to the next step?  We will not be in "full" and
worse yet because we do not have "$TRASH_DIRECTORY/.git" (you
removed it), the "git commit-graph write --object-dir" command we
end up doing next will see the git source repository as the
repository it is working on.  Never risk trashing our source
repository with your test.  That is why we give you $TRASH_DIRECTORY
to play in.  Make use of it when you can.

I'd make this step just a single

	git init full

and then the next one

	git -C full commit-graph write --object-dir .

In later tests that have multi-step things, I'd instead make them

	(
		cd full &&
		... whatever you do  &&
		... in that separate  &&
		... 'full' repository
	)

if I were writing this test *and* if I really wanted to do things
inside $TRASH_DIRECTORY/full/.git repository.  I am not convinced
yet about the latter.  I know that it will make certain things
simpler to use a separate /full hierarchy (e.g. cleaning up, having
another unrelated test repository, etc.) while making other things
more cumbersome (e.g. you need to be careful when you "cd" and the
easiest way to do so is to ( do things in a subshell )).  I just do
not know what the trade-off would look like in this particular case.

A simple rule of thumb I try to follow is not to change $(pwd) for
the process that runs these test_expect_success shell functions.

> +
> +test_expect_success 'write graph with no packs' '
> +	git commit-graph write --object-dir .
> +'
> +
> +test_expect_success 'create commits and repack' '
> +	for i in $(test_seq 3)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i
> +	done &&
> +	git repack
> +'


* Re: [PATCH v4 02/13] graph: add commit graph design document
  2018-02-19 18:53       ` [PATCH v4 02/13] graph: add commit graph design document Derrick Stolee
  2018-02-20 21:42         ` Junio C Hamano
@ 2018-02-21 19:34         ` Stefan Beller
  1 sibling, 0 replies; 146+ messages in thread
From: Stefan Beller @ 2018-02-21 19:34 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Jeff Hostetler, Jeff King, Jonathan Tan, SZEDER Gábor,
	Junio C Hamano, Derrick Stolee

> +[3] https://public-inbox.org/git/20170907094718.b6kuzp2uhvkmwcso@sigill.intra.peff.net/t/#m7a2ea7b355aeda962e6b86404bcbadc648abfbba
> +    More discussion about generation numbers and not storing them inside
> +    commit objects. A valuable quote:

Unlike the other public inbox links, this links to a discussion with
all messages on one page;
https://public-inbox.org/git/20170908034739.4op3w4f2ma5s65ku@sigill.intra.peff.net/
would have this be more in line with the other links. (This is a super
small nit, which I am not sure if we care about at all; the rest of the
doc is awesome!)


* Re: [PATCH v4 01/13] commit-graph: add format document
  2018-02-21 19:23         ` Stefan Beller
@ 2018-02-21 19:45           ` Derrick Stolee
  2018-02-21 19:48             ` Stefan Beller
  0 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-21 19:45 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git, Jeff Hostetler, Jeff King, Jonathan Tan, SZEDER Gábor,
	Junio C Hamano, Derrick Stolee

On 2/21/2018 2:23 PM, Stefan Beller wrote:
> On Mon, Feb 19, 2018 at 10:53 AM, Derrick Stolee <stolee@gmail.com> wrote:
>> +In order to allow extensions that add extra data to the graph, we organize
>> +the body into "chunks" and provide a binary lookup table at the beginning
>> +of the body. The header includes certain values, such as number of chunks,
>> +hash lengths and types.
>> +
>> +All 4-byte numbers are in network order.
>> +
>> +HEADER:
>> +
>> +  4-byte signature:
>> +      The signature is: {'C', 'G', 'P', 'H'}
>> +
>> +  1-byte version number:
>> +      Currently, the only valid version is 1.
>> +
>> +  1-byte Object Id Version (1 = SHA-1)
>> +
>> +  1-byte number (C) of "chunks"
>> +
>> +  1-byte (reserved for later use)
> What should clients of today do with it?
> * ignore it completely [as they have no idea what it is] or
> * throw hands up in the air if it is anything other than 0 ?
>    [because clearly we will increment the version
>     or have new information in a new chunk instead of just sneaking
>     in information here?]

They should ignore it completely, which will allow using the value for 
something meaningful later without causing a version change (which we DO 
die() for). A user could downgrade from a version that uses this byte 
for something meaningful and not require a new commit-graph file.

The "commit-graph read" subcommand does output this byte, so we can 
verify that the "write" subcommand places a 0 in this position.
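
In other words, a reader would behave roughly like the sketch below.
parse_header() and the SKETCH_* constants are illustrative names, not
the ones in the patch, and the function returns an error code where
the real code would die().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SIGNATURE   0x43475048u	/* "CGPH" */
#define SKETCH_VERSION     1
#define SKETCH_OID_VERSION 1

struct sketch_header {
	uint8_t num_chunks;
	uint8_t reserved;	/* reported by "read", never validated */
};

/* Returns 0 on success, -1 where the real code would die(). */
static int parse_header(const unsigned char *data, size_t len,
			struct sketch_header *out)
{
	uint32_t sig;

	if (len < 8)
		return -1;
	sig = ((uint32_t)data[0] << 24) | ((uint32_t)data[1] << 16) |
	      ((uint32_t)data[2] << 8) | (uint32_t)data[3];
	if (sig != SKETCH_SIGNATURE)
		return -1;
	if (data[4] != SKETCH_VERSION)
		return -1;
	if (data[5] != SKETCH_OID_VERSION)
		return -1;
	out->num_chunks = data[6];
	out->reserved = data[7];	/* any value is accepted */
	return 0;
}
```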

>
>> +CHUNK LOOKUP:
>> +
>> +  (C + 1) * 12 bytes listing the table of contents for the chunks:
>> +      First 4 bytes describe chunk id. Value 0 is a terminating label.
>> +      Other 8 bytes provide offset in current file for chunk to start.
> offset [in bytes? I could imagine having a larger granularity here,
> because chunks don't sound small.]

It is good to specify "offset in bytes".

>
>> +      (Chunks are ordered contiguously in the file, so you can infer
>> +      the length using the next chunk position if necessary.)
>> +
>> +  The remaining data in the body is described one chunk at a time, and
>> +  these chunks may be given in any order. Chunks are required unless
>> +  otherwise specified.
>> +
>> +CHUNK DATA:
>> +
>> +  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
>> +      The ith entry, F[i], stores the number of OIDs with first
>> +      byte at most i. Thus F[255] stores the total
>> +      number of commits (N).
> [ so in small repos, where there are fewer than 256 objects,
> F[i] == F[i+1], for all i'th where there is no object starting with i byte]

Correct. I'm not sure this additional information is valuable for the 
document, though.

>
>> +  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
>> +      The OIDs for all commits in the graph, sorted in ascending order.
>> +
>> +  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
>> +    * The first H bytes are for the OID of the root tree.
>> +    * The next 8 bytes are for the int-ids of the first two parents
>> +      of the ith commit. Stores value 0xffffffff if no parent in that
>> +      position. If there are more than two parents, the second value
>> +      has its most-significant bit on and the other bits store an array
>> +      position into the Large Edge List chunk.
>> +    * The next 8 bytes store the generation number of the commit and
>> +      the commit time in seconds since EPOCH. The generation number
>> +      uses the higher 30 bits of the first 4 bytes, while the commit
>> +      time uses the 32 bits of the second 4 bytes, along with the lowest
>> +      2 bits of the lowest byte, storing the 33rd and 34th bit of the
>> +      commit time.
>> +
>> +  Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
>> +      This list of 4-byte values store the second through nth parents for
>> +      all octopus merges. The second parent value in the commit data stores
>> +      an array position within this list along with the most-significant bit
>> +      on. Starting at that array position, iterate through this list of int-ids
>> +      for the parents until reaching a value with the most-significant bit on.
>> +      The other bits correspond to the int-id of the last parent.
>> +
>> +TRAILER:
>> +
>> +       H-byte HASH-checksum of all of the above.
>> +
>> --
>> 2.7.4
> Makes sense so far, I'll read on.
> I agree with Junio, that I could read this documentation without
> the urge to point out nits. :)
>
> Thanks,
> Stefan



* Re: [PATCH v4 01/13] commit-graph: add format document
  2018-02-21 19:45           ` Derrick Stolee
@ 2018-02-21 19:48             ` Stefan Beller
  0 siblings, 0 replies; 146+ messages in thread
From: Stefan Beller @ 2018-02-21 19:48 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Jeff Hostetler, Jeff King, Jonathan Tan, SZEDER Gábor,
	Junio C Hamano, Derrick Stolee

>>
>> [ so in small repos, where there are fewer than 256 objects,
>> F[i] == F[i+1], for all i'th where there is no object starting with i
>> byte]
>
>
> Correct. I'm not sure this additional information is valuable for the
> document, though.

It is not, I was just making sure I'd understand correctly.


* Re: [PATCH v4 06/13] commit-graph: implement git commit-graph read
  2018-02-19 18:53       ` [PATCH v4 06/13] commit-graph: implement git commit-graph read Derrick Stolee
@ 2018-02-21 20:11         ` Junio C Hamano
  2018-02-22 18:25           ` Junio C Hamano
  0 siblings, 1 reply; 146+ messages in thread
From: Junio C Hamano @ 2018-02-21 20:11 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> +'read'::
> +
> +Read a graph file given by the graph-head file and output basic
> +details about the graph file.
> ++
> +With `--file=<name>` option, consider the graph stored in the file at
> +the path  <object-dir>/info/<name>.
> +

A sample reader confusion after reading the above twice:

    What is "the graph-head file" and how does the user specify it?  Is
    it given by the value for the "--file=<name>" command line option?

Another sample reader reaction after reading the above:

    What kind of "basic details" we can learn from this command is
    unclear, but perhaps there is an example to help me decide if
    this command is worth studying.

> @@ -44,6 +53,12 @@ EXAMPLES
>  $ git commit-graph write
>  ------------------------------------------------
>  
> +* Read basic information from a graph file.
> ++
> +------------------------------------------------
> +$ git commit-graph read --file=<name>
> +------------------------------------------------
> +

And the sample reader is utterly disappointed at this point.

> +static int graph_read(int argc, const char **argv)
> +{
> +	struct commit_graph *graph = 0;
> +	struct strbuf full_path = STRBUF_INIT;
> +
> +	static struct option builtin_commit_graph_read_options[] = {
> +		{ OPTION_STRING, 'o', "object-dir", &opts.obj_dir,
> +			N_("dir"),
> +			N_("The object directory to store the graph") },
> +		{ OPTION_STRING, 'H', "file", &opts.graph_file,
> +			N_("file"),
> +			N_("The filename for a specific commit graph file in the object directory."),
> +			PARSE_OPT_OPTARG, NULL, (intptr_t) "" },
> +		OPT_END(),
> +	};

The same comment as all the previous ones applies, wrt short options
and non-use of OPT_STRING().

Also, I suspect that these two would want to use OPT_FILENAME
instead, if we anticipate that the command might want to be
sometimes run from a subdirectory.  Otherwise wouldn't

	cd t && git commit-graph read --file=../.git/objects/info/$whatever

end up referring to a wrong place because the code that uses the
value obtained from OPTION_STRING does not do the equivalent of
parse-options.c::fix_filename()?  The same applies to object-dir
handling.

> +	argc = parse_options(argc, argv, NULL,
> +			     builtin_commit_graph_read_options,
> +			     builtin_commit_graph_read_usage, 0);
> +
> +	if (!opts.obj_dir)
> +		opts.obj_dir = get_object_directory();
> +
> +	if (!opts.graph_file)
> +		die("no graph hash specified");
> +
> +	strbuf_addf(&full_path, "%s/info/%s", opts.obj_dir, opts.graph_file);

Ahh, I was fooled by a misnamed option.  --file does *not* name the
file.  It is a filename in a fixed place that is determined by other
things.

So it would be a mistake to use OPT_FILENAME() in the parser for
that misnamed "--file" option.  The parser for --object-dir still
would want to be OPT_FILENAME(), but quite honestly, I do not see
the point of having --object-dir option in the first place.  The
graph file is not relative to it but is forced to have /info/ in
between that directory and the filename, so it is not like the user
gets useful flexibility out of being able to specify two different
places using --object-dir= option and $GIT_OBJECT_DIRECTORY
environment (iow, a caller that wants to work on a specific object
directory can use the environment, which is how it would tell any
other Git subcommand which object store it wants to work with).

But stepping back a bit, I think the way --file argument is defined
is halfway off from two possible more useful ways to define it.  If
it were just "path to the file" (iow, what OPT_FILENAME() is suited
for parsing it), then a user could say "I have this graph file that
I created for testing, it is not installed in its usual place in
$GIT_OBJECT_DIRECTORY/info/ at all, but I want you to read it
because I am debugging".  That is one possible useful extreme.  The
other possibility would be to allow *only* the hash part to be
specified, iow, not just forcing /info/ relative to object
directory, you would force the "graph-" prefix and ".graph" suffix.
That would be the other extreme that is useful (less typing and less
error prone).

For a low-level command like this, my gut feeling is that it would
be better to allow paths to the object directory and the graph file
to be totally independently specified.

> +	if (graph_signature != GRAPH_SIGNATURE) {
> +		munmap(graph_map, graph_size);
> +		close(fd);
> +		die("graph signature %X does not match signature %X",
> +			graph_signature, GRAPH_SIGNATURE);
> +	}
> +
> +	graph_version = *(unsigned char*)(data + 4);
> +	if (graph_version != GRAPH_VERSION) {
> +		munmap(graph_map, graph_size);
> +		close(fd);
> +		die("graph version %X does not match version %X",
> +			graph_version, GRAPH_VERSION);
> +	}
> +
> +	hash_version = *(unsigned char*)(data + 5);
> +	if (hash_version != GRAPH_OID_VERSION) {
> +		munmap(graph_map, graph_size);
> +		close(fd);
> +		die("hash version %X does not match version %X",
> +			hash_version, GRAPH_OID_VERSION);

It becomes a bit tiring to see munmap/close/die pattern repeated
over and over again, doesn't it?  Can we make it simpler, perhaps by
letting die() take care of the clean-up?  After all, if the very
next step dies because alloc_commit_graph() got NULL in xmalloc(),
we are letting die() there take care of the clean-up anyway already,
and die() in the chunk parsing loop has no such cleanup, either.

Of course, when we later want to libify this part of the code, then
we wouldn't be calling die() from this codepath, but the change
required to do so will not be just s/die/error/; it would be more
like

	if (x_version != X_VERSION) {
		error("X version %X does not match",...);
		goto cleanup_fail;
	}

with munmap/close done at the jumped-to label.
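
A fuller sketch of that shape, using plain POSIX calls and a made-up
function name.  Unlike the fragment above, it shares one label between
the success and failure paths, and it only checks the 4-byte signature
to keep the example short.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical: return 0 if 'path' starts with "CGPH", -1 otherwise. */
static int check_graph_magic(const char *path)
{
	static const unsigned char magic[4] = { 'C', 'G', 'P', 'H' };
	void *map = MAP_FAILED;
	size_t size = 0;
	struct stat st;
	int ret = -1;
	int fd = open(path, O_RDONLY);

	if (fd < 0) {
		fprintf(stderr, "error: cannot open %s\n", path);
		goto cleanup;
	}
	if (fstat(fd, &st) < 0 || st.st_size < 4) {
		fprintf(stderr, "error: %s is too small\n", path);
		goto cleanup;
	}
	size = (size_t)st.st_size;
	map = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED) {
		fprintf(stderr, "error: cannot mmap %s\n", path);
		goto cleanup;
	}
	if (memcmp(map, magic, sizeof(magic))) {
		fprintf(stderr, "error: bad signature in %s\n", path);
		goto cleanup;
	}
	ret = 0;

cleanup:
	/* the only place that releases the mapping and the descriptor */
	if (map != MAP_FAILED)
		munmap(map, size);
	if (fd >= 0)
		close(fd);
	return ret;
}
```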

> +	}
> +
> +	graph = alloc_commit_graph();
> +
> +	graph->hash_len = GRAPH_OID_LEN;
> + ...
> +		if (chunk_offset > graph_size - GIT_MAX_RAWSZ)
> +			die("improper chunk offset %08x%08x", (uint32_t)(chunk_offset >> 32),
> +			    (uint32_t)chunk_offset);
> +
> +		switch (chunk_id) {
> +			case GRAPH_CHUNKID_OIDFANOUT:
> +				graph->chunk_oid_fanout = (uint32_t*)(data + chunk_offset);
> +				break;

This is over-indented from our point of view.  In our codebase, case
arms align with switch, i.e.

		switch (chunk_id) {
		case GRAPH_CHUNKID_OIDFANOUT:
			graph->chunk_oid_fanout = ...;
			break;

When the input file has GRAPH_CHUNKID_OIDFANOUT twice, I think it
should be flagged as a corrupt/malformed input file, causing the
reader to reject it.  It is plausible that you wanted to make it
"the last one wins", but even if that is the case, I think the user
should at least get a warning, as (I'd imagine) it is an unusual
condition.

The same applies to multiple instances of any currently-defined
chunk types.
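
A sketch of what rejecting duplicates could look like.  The names are
illustrative and the chunk IDs are the four-character codes from the
format document; offset 0 doubles as "not seen yet", which is safe
only because the header lives at offset 0 and no chunk can start there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_CHUNKID_OIDFANOUT  0x4f494446u	/* "OIDF" */
#define SKETCH_CHUNKID_OIDLOOKUP  0x4f49444cu	/* "OIDL" */
#define SKETCH_CHUNKID_DATA       0x43474554u	/* "CGET" */
#define SKETCH_CHUNKID_LARGEEDGES 0x45444745u	/* "EDGE" */

struct sketch_chunks {
	uint64_t fanout, lookup, data, edges;	/* 0 means "not seen" */
};

static uint64_t get_be(const unsigned char *p, int bytes)
{
	uint64_t v = 0;

	while (bytes--)
		v = (v << 8) | *p++;
	return v;
}

/* Returns 0 on success, -1 for a duplicate chunk or a zero offset. */
static int parse_chunk_toc(const unsigned char *toc, int num_chunks,
			   struct sketch_chunks *out)
{
	int i;

	memset(out, 0, sizeof(*out));
	for (i = 0; i < num_chunks; i++) {
		const unsigned char *entry = toc + 12 * i;
		uint32_t id = (uint32_t)get_be(entry, 4);
		uint64_t offset = get_be(entry + 4, 8);
		uint64_t *slot;

		switch (id) {
		case SKETCH_CHUNKID_OIDFANOUT:
			slot = &out->fanout;
			break;
		case SKETCH_CHUNKID_OIDLOOKUP:
			slot = &out->lookup;
			break;
		case SKETCH_CHUNKID_DATA:
			slot = &out->data;
			break;
		case SKETCH_CHUNKID_LARGEEDGES:
			slot = &out->edges;
			break;
		default:
			continue;	/* unknown chunk: skip it */
		}
		if (!offset || *slot)
			return -1;	/* malformed, or seen twice */
		*slot = offset;
	}
	return 0;
}
```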

> +graph_read_expect() {
> +	OPTIONAL=""
> +	NUM_CHUNKS=3
> +	if [ ! -z $2 ]

We use "test" and do not use "[ ... ]" or "[[ ... ]]".

I'll stop here.



* Re: [PATCH v4 08/13] commit-graph: implement --delete-expired
  2018-02-19 18:53       ` [PATCH v4 08/13] commit-graph: implement --delete-expired Derrick Stolee
@ 2018-02-21 21:34         ` Stefan Beller
  2018-02-23 17:43           ` Derrick Stolee
  2018-02-22 18:48         ` Junio C Hamano
  1 sibling, 1 reply; 146+ messages in thread
From: Stefan Beller @ 2018-02-21 21:34 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Jeff Hostetler, Jeff King, Jonathan Tan, SZEDER Gábor,
	Junio C Hamano, Derrick Stolee

On Mon, Feb 19, 2018 at 10:53 AM, Derrick Stolee <stolee@gmail.com> wrote:

>         graph_name = write_commit_graph(opts.obj_dir);
>
>         if (graph_name) {
>                 if (opts.set_latest)
>                         set_latest_file(opts.obj_dir, graph_name);
>
> +               if (opts.delete_expired)
> +                       do_delete_expired(opts.obj_dir,
> +                                         old_graph_name,
> +                                         graph_name);
> +

So this only allows deleting expired things and setting the latest
when writing a new graph. Would we ever envision a user producing
a new graph (e.g. via obtaining a graph that they got from a server) and
then manually rerouting the latest to that new graph file without writing
that graph file in the same process? The same for expired.

I guess these operations are just available via editing the
latest or deleting files manually, which slightly contradicts
e.g. "git update-ref", which in olden times was just a fancy way
of rewriting the refs file manually. (though it claims to be less
prone to errors as it takes lock files)

>
>  extern char *get_graph_latest_filename(const char *obj_dir);
> +extern char *get_graph_latest_contents(const char *obj_dir);

Did
https://public-inbox.org/git/20180208213806.GA6381@sigill.intra.peff.net/
ever make it into tree? (It is sort of new, but I feel we'd want to
strive for consistency in the code base, eventually.)


* Re: [PATCH v4 12/13] commit-graph: read only from specific pack-indexes
  2018-02-19 18:53       ` [PATCH v4 12/13] commit-graph: read only from specific pack-indexes Derrick Stolee
@ 2018-02-21 22:25         ` Stefan Beller
  2018-02-23 19:19           ` Derrick Stolee
  0 siblings, 1 reply; 146+ messages in thread
From: Stefan Beller @ 2018-02-21 22:25 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Jeff Hostetler, Jeff King, Jonathan Tan, SZEDER Gábor,
	Junio C Hamano, Derrick Stolee

On Mon, Feb 19, 2018 at 10:53 AM, Derrick Stolee <stolee@gmail.com> wrote:
>
> Teach git-commit-graph to inspect the objects only in a certain list
> of pack-indexes within the given pack directory. This allows updating
> the commit graph iteratively, since we add all commits stored in a
> previous commit graph.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-commit-graph.txt | 11 +++++++++++
>  builtin/commit-graph.c             | 32 +++++++++++++++++++++++++++++---
>  commit-graph.c                     | 26 ++++++++++++++++++++++++--
>  commit-graph.h                     |  4 +++-
>  packfile.c                         |  4 ++--
>  packfile.h                         |  2 ++
>  t/t5318-commit-graph.sh            | 16 ++++++++++++++++
>  7 files changed, 87 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index b9b4031..93d50d1 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -42,6 +42,10 @@ With the `--delete-expired` option, delete the graph files in the pack
>  directory that are not referred to by the graph-latest file. To avoid race
>  conditions, do not delete the file previously referred to by the
>  graph-latest file if it is updated by the `--set-latest` option.
> ++
> +With the `--stdin-packs` option, generate the new commit graph by
> +walking objects only in the specified packfiles and any commits in
> +the existing graph-head.

A general question on this series:
How do commit graph buildups deal with garbage collected commits?
(my personal workflow is heavy on rebase, which generates lots of
dangling commits, to be thrown out later)

The second half of the sentence makes it sound like once a
commit is in the graph it cannot be pulled out easily again, hence
the question on the impact of graphs on a long living repository
which is garbage collected frequently.

AFAICT you could just run
    git commit-graph write --set-latest [--delete-expired]
as that actually looks up objects from outside the existing graph files,
such that lost objects are ignored?

> +       const char **lines = NULL;
> +       int nr_lines = 0;
> +       int alloc_lines = 0;

(nit:)
I had the impression that these triplet-variables, that are used in
ALLOC_GROW are all X, X_nr and X_alloc, but I might be wrong.

> @@ -170,7 +178,25 @@ static int graph_write(int argc, const char **argv)
>
>         old_graph_name = get_graph_latest_contents(opts.obj_dir);
>
> -       graph_name = write_commit_graph(opts.obj_dir);
> +       if (opts.stdin_packs) {
> +               struct strbuf buf = STRBUF_INIT;
> +               nr_lines = 0;
> +               alloc_lines = 128;

alloc_lines has been initialized before, so why redo it here again?
Also what is the rationale for choosing 128 as a good default?
I would guess 0 is just as fine, because ALLOC_GROW makes sure
that it grows fast over the first couple of entries by having an additional
offset. (no need to fine tune the starting allocation IMHO)
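
For illustration, a standalone sketch of the grow-on-demand pattern,
showing that a zero starting allocation works. This is not Git's
ALLOC_GROW macro: line_list, grow_size and line_list_push are made-up
names, and the growth rule (alloc + 16) * 3 / 2 is assumed to mirror
Git's alloc_nr().

```c
#include <stdlib.h>

/* Hypothetical growable array of strings; not a Git type. */
struct line_list {
	const char **items;
	size_t nr;
	size_t alloc;
};

/* Growth rule assumed to match Git's alloc_nr(): 0 -> 24 -> 60 -> ... */
static size_t grow_size(size_t alloc)
{
	return (alloc + 16) * 3 / 2;
}

/* Append one entry, growing only when the array is full. 0 on success. */
static int line_list_push(struct line_list *l, const char *line)
{
	if (l->nr + 1 > l->alloc) {
		size_t new_alloc = grow_size(l->alloc);
		const char **p;

		if (new_alloc < l->nr + 1)
			new_alloc = l->nr + 1;
		/* realloc(NULL, n) behaves like malloc(n), so alloc == 0 is fine. */
		p = realloc(l->items, new_alloc * sizeof(*p));
		if (!p)
			return -1;
		l->items = p;
		l->alloc = new_alloc;
	}
	l->items[l->nr++] = line;
	return 0;
}
```

Starting from { NULL, 0, 0 }, the first push already allocates room for
24 entries, so a hand-picked initial size buys very little.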

> +               ALLOC_ARRAY(lines, alloc_lines);
> +
> +               while (strbuf_getline(&buf, stdin) != EOF) {
> +                       ALLOC_GROW(lines, nr_lines + 1, alloc_lines);
> +                       lines[nr_lines++] = buf.buf;
> +                       strbuf_detach(&buf, NULL);

strbuf_detach returns its previous buf.buf, such that you can combine these
two lines as
    lines[nr_lines++] = strbuf_detach(&buf, NULL);


> +               }
> +
> +               pack_indexes = lines;
> +               nr_packs = nr_lines;

Technically we do not need to strbuf_release(&buf) here, because
strbuf_detach is always called, and by knowing its implementation,
it is just as good.
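
A toy model of why the release is unnecessary there. This is not Git's
strbuf: struct buf, buf_add, buf_detach and buf_release are
illustrative names. The assumption being modeled is that detach hands
the heap pointer to the caller and resets the buffer to its empty
state, so a later release has nothing left to free.

```c
#include <stdlib.h>
#include <string.h>

/* Toy growable string; a stand-in for illustration, not Git's strbuf. */
struct buf {
	char *data;	/* NULL while empty */
	size_t len;
};

/* Append a NUL-terminated string. Returns 0 on success. */
static int buf_add(struct buf *b, const char *s)
{
	size_t n = strlen(s);
	char *p = realloc(b->data, b->len + n + 1);

	if (!p)
		return -1;
	memcpy(p + b->len, s, n + 1);
	b->data = p;
	b->len += n;
	return 0;
}

/* Transfer ownership to the caller and reset the buffer to empty. */
static char *buf_detach(struct buf *b)
{
	char *p = b->data;

	b->data = NULL;
	b->len = 0;
	return p;
}

/* Free whatever the buffer still owns; a no-op right after buf_detach(). */
static void buf_release(struct buf *b)
{
	free(b->data);
	b->data = NULL;
	b->len = 0;
}
```

With this shape, "lines[nr++] = buf_detach(&b);" both stores the line
and leaves b ready for the next read.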


> @@ -579,7 +581,27 @@ char *write_commit_graph(const char *obj_dir)
>                 oids.alloc = 1024;
>         ALLOC_ARRAY(oids.list, oids.alloc);
>
> -       for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
> +       if (pack_indexes) {
> +               struct strbuf packname = STRBUF_INIT;
> +               int dirlen;
> +               strbuf_addf(&packname, "%s/pack/", obj_dir);
> +               dirlen = packname.len;
> +               for (i = 0; i < nr_packs; i++) {
> +                       struct packed_git *p;
> +                       strbuf_setlen(&packname, dirlen);
> +                       strbuf_addstr(&packname, pack_indexes[i]);
> +                       p = add_packed_git(packname.buf, packname.len, 1);
> +                       if (!p)
> +                               die("error adding pack %s", packname.buf);
> +                       if (open_pack_index(p))
> +                               die("error opening index for %s", packname.buf);
> +                       for_each_object_in_pack(p, if_packed_commit_add_to_list, &oids);
> +                       close_pack(p);
> +               }

strbuf_release(&packname);

> +       }
> +       else

(micro style nit)

    } else


* Re: [PATCH v4 06/13] commit-graph: implement git commit-graph read
  2018-02-21 20:11         ` Junio C Hamano
@ 2018-02-22 18:25           ` Junio C Hamano
  0 siblings, 0 replies; 146+ messages in thread
From: Junio C Hamano @ 2018-02-22 18:25 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:

> Derrick Stolee <stolee@gmail.com> writes:
>
>> +'read'::
>> +
>> +Read a graph file given by the graph-head file and output basic
>> +details about the graph file.
>> ++
>> +With `--file=<name>` option, consider the graph stored in the file at
>> +the path  <object-dir>/info/<name>.
>> +
>
> A sample reader confusion after reading the above twice:
>
>     What is "the graph-head file" and how does the user specify it?  Is
>     it given by  the value for the "--file=<name>" command line option?

This confusion is somewhat lightened with s/graph-head/graph-latest/
(I just saw 07/13 to realize that the file is renamed).

Perhaps describe it as "Read the graph file currently active
(i.e. the one pointed at by graph-latest file in the objects/info
directory) and output blah" + "With --file parameter, read the
graph file specified with that parameter instead"?


* Re: [PATCH v4 07/13] commit-graph: implement --set-latest
  2018-02-19 18:53       ` [PATCH v4 07/13] commit-graph: implement --set-latest Derrick Stolee
@ 2018-02-22 18:31         ` Junio C Hamano
  2018-02-23 17:53           ` Derrick Stolee
  0 siblings, 1 reply; 146+ messages in thread
From: Junio C Hamano @ 2018-02-22 18:31 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

>  static struct opts_commit_graph {
>  	const char *obj_dir;
>  	const char *graph_file;
> +	int set_latest;
>  } opts;
> ...
> @@ -89,6 +106,8 @@ static int graph_write(int argc, const char **argv)
>  		{ OPTION_STRING, 'o', "object-dir", &opts.obj_dir,
>  			N_("dir"),
>  			N_("The object directory to store the graph") },
> +		OPT_BOOL('u', "set-latest", &opts.set_latest,
> +			N_("update graph-head to written graph file")),
>  		OPT_END(),
>  	};
>  
> @@ -102,6 +121,9 @@ static int graph_write(int argc, const char **argv)
>  	graph_name = write_commit_graph(opts.obj_dir);
>  
>  	if (graph_name) {
> +		if (opts.set_latest)
> +			set_latest_file(opts.obj_dir, graph_name);
> +

This feels like a very strange API from potential caller's point of
view.  Because you have to decide that you are going to mark it as
the latest one upfront before actually writing the graph file, if
you forget to pass --set-latest, you have to know how to manually
mark the file as latest anyway.  I would understand if it were one
of the following:

 (1) whenever a new commit graph file is written in the
     objects/info/ directory, always mark it as the latest (drop
     --set-latest option altogether); or

 (2) make set-latest command that takes a name of an existing graph
     file in the objects/info/ directory, and sets the latest
     pointer to point at it (make it separate from 'write' command).

though.



* Re: [PATCH v4 08/13] commit-graph: implement --delete-expired
  2018-02-19 18:53       ` [PATCH v4 08/13] commit-graph: implement --delete-expired Derrick Stolee
  2018-02-21 21:34         ` Stefan Beller
@ 2018-02-22 18:48         ` Junio C Hamano
  2018-02-23 17:59           ` Derrick Stolee
  1 sibling, 1 reply; 146+ messages in thread
From: Junio C Hamano @ 2018-02-22 18:48 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> Teach git-commit-graph to delete the .graph files that are siblings of a
> newly-written graph file, except for the file referenced by 'graph-latest'
> at the beginning of the process and the newly-written file. If we fail to
> delete a graph file, only report a warning because another git process may
> be using that file. In a multi-process environment, we expect the previous
> graph file to be used by a concurrent process, so we do not delete it to
> avoid race conditions.

I do not understand the later part of the above.  On some operating
systems, you actually can remove a file that is open by another
process without any ill effect.  There are systems that do not allow
removing a file that is in use, and an attempt to unlink it may
fail.  The need to handle such a failure gracefully is not limited
to the case of removing a commit graph file---we need to deal with
it when removing file of _any_ type.

Especially the last sentence "we do not delete it to avoid race
conditions" I find problematic.  If a system does not allow removing
a file in use and we detect a failure after an attempt to do so, it
is not "we do not delete it" --- even if you do, you won't succeed
anyway, so there is no point saying that.  And on systems that do
allow safe removal of a file in use (i.e. they allow an open file to
be used by processes that have open filehandles to it after its
removal), there is no point refraining to delete it "to avoid race
conditions", either---in fact it is unlikely that you would even know
somebody else had it open and was using it.
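
The "safe removal of a file in use" case can be checked with a small
sketch. It assumes a POSIX system, where unlink() on an open file only
removes the name; readable_after_unlink and the /tmp template are
illustrative and not part of the series.

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Create a temporary file, unlink it while it is still open, and then
 * read the data back through the open descriptor.
 * Returns 1 if the data is still readable after the unlink, 0 if not,
 * and -1 on a setup error.
 */
static int readable_after_unlink(void)
{
	char path[] = "/tmp/unlink_demo_XXXXXX";
	const char msg[] = "graph";
	char got[sizeof(msg)] = { 0 };
	int fd = mkstemp(path);
	int ok;

	if (fd < 0)
		return -1;
	if (write(fd, msg, sizeof(msg)) != (ssize_t)sizeof(msg) ||
	    unlink(path) < 0) {
		close(fd);
		return -1;
	}
	/* The name is gone, but the descriptor still refers to the file. */
	ok = lseek(fd, 0, SEEK_SET) == 0 &&
	     read(fd, got, sizeof(got)) == (ssize_t)sizeof(got) &&
	     !memcmp(got, msg, sizeof(msg));
	close(fd);
	return ok;
}
```

On systems that refuse to remove a file in use, the unlink() call is
the step that would fail instead, which is the case a warning covers.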

In any case, I do not think '--delete-expired' option that can be
given only when you are writing makes much sense as an API.  An
'expire' command, just like 'set-latest' command, that is a separate
command from 'write',  may make sense, though.


* Re: [PATCH v4 02/13] graph: add commit graph design document
  2018-02-20 21:42         ` Junio C Hamano
@ 2018-02-23 15:44           ` Derrick Stolee
  0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-23 15:44 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee

On 2/20/2018 4:42 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> +2. Walking the entire graph to avoid topological order mistakes.
> You have at least one more mention of "topological order mistakes"
> below, but we commonly refer to this issue and blame it for "clock
> skew".  Using the word highlights that there is no "mistake" in topo
> order algorithm and mistakes are in the commit timestamps.

I'll drop the word "mistakes" and instead here say:

   2. Walking the entire graph to satisfy topological order constraints.

and later say

   This heuristic is currently used whenever the computation is allowed to
   violate topological relationships due to clock skew (such as "git log"
   with default order), but is not used when the topological order is
   required (such as merge base calculations, "git log --graph").

>
>> +In practice, we expect some commits to be created recently and not stored
>> +in the commit graph. We can treat these commits as having "infinite"
>> +generation number and walk until reaching commits with known generation
>> +number.
> Hmm, "pretend infinity" is an interesting approach---I need to think
> about it a bit more if it is sufficient.

Since we require the commit graph file to be closed under reachability, 
the commits reachable from the file all have "finite" generation number.
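
A minimal sketch of how the "infinite" value could be used to rule out
reachability. It assumes the usual definition in which a commit's
generation number is strictly greater than that of each parent;
GENERATION_INFINITY, struct toy_commit and both functions are
hypothetical names, not code from this series.

```c
#include <stdint.h>

/* Assumed sentinel for "not in the graph file"; not defined in this series. */
#define GENERATION_INFINITY UINT32_MAX

/* Minimal stand-in for a commit; field names are illustrative. */
struct toy_commit {
	int in_graph;		/* nonzero if loaded from the graph file */
	uint32_t generation;	/* meaningful only when in_graph is set */
};

/* Commits outside the graph file are treated as having infinite generation. */
static uint32_t effective_generation(const struct toy_commit *c)
{
	return c->in_graph ? c->generation : GENERATION_INFINITY;
}

/*
 * If a's generation is lower than b's, a cannot reach b.
 * Returns 1 when reachability is ruled out, 0 when a walk is still needed.
 */
static int cannot_reach(const struct toy_commit *a, const struct toy_commit *b)
{
	return effective_generation(a) < effective_generation(b);
}
```

Because the file is closed under reachability, a commit in the file can
never reach one outside it, which is the case where b is "infinite" and
a is finite.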

>
>> +- .graph files are managed only by the 'commit-graph' builtin. These are not
>> +  updated automatically during clone, fetch, repack, or creating new commits.
> OK.  s/builtin/subcommand/; it does not make much difference if it
> is a built-in or standalone command.
>
>> +- There is no 'verify' subcommand for the 'commit-graph' builtin to verify
>> +  the contents of the graph file agree with the contents in the ODB.
> I am not entirely sure about the merit of going into this level of
> detail.  Being able to use only a single file looks like a more
> fundamental design limitation, which deserves to be described in this
> section, and we could ship the subsystem with that limitation.
>
> But the lack of verify that can be called from fsck is merely a
> matter of the subsystem not being mature enough (to be written,
> reviewed and tested) and not a fundamental one, and we will not be
> shipping the subsystem until that limitation is lifted.
>
> So I'd guess that we prefer this bullet item to be in the commit log
> message, not here, that describes the current status of the
> development (as opposed to the state of the subsystem).

I was treating this design document as a living document that will be 
updated as the feature matures. It is difficult to time when to discuss 
these limitations, since in this commit the graph feature is not 
implemented at all. But, it is important to have _some_ design document 
before continuing to implement the feature.

I can remove this bullet, but I'm not sure which commit message would be 
appropriate to contain that information.

I do intend to remove these limitations and future work bullets as they 
are implemented in later patches.

>
>> +- Generation numbers are not computed in the current version. The file
>> +  format supports storing them, along with a mechanism to upgrade from
>> +  a file without generation numbers to one that uses them.
> Exactly the same comment as above applies to this item.
>
>> +- The commit graph is currently incompatible with commit grafts. This can be
>> +  remedied by duplicating or refactoring the current graft logic.
> Hmm.  Can it be lifted without first allowing us to use more than
> one commit graph file (i.e. one for "traverse while honoring the
> grafts", the other for "traverse while ignoring the grafts")?

I consider this list unordered, but will move this bullet to the top and 
replace its first sentence with:

   The commit graph feature currently does not honor commit grafts.

>
>> +- After computing and storing generation numbers, we must make graph
>> +  walks aware of generation numbers to gain the performance benefits they
>> +  enable. This will mostly be accomplished by swapping a commit-date-ordered
>> +  priority queue with one ordered by generation number. The following
>> +  operations are important candidates:
>> +
>> +    - paint_down_to_common()
>> +    - 'log --topo-order'
> Yes.
>
>> +- The graph currently only adds commits to a previously existing graph.
>> +  When writing a new graph, we could check that the ODB still contains
>> +  the commits and choose to remove the commits that are deleted from the
>> +  ODB. For performance reasons, this check should remain optional.
> The last sentence is somehow unconvincing.  It probably is not
> appropriate for the "Future Work" section to be making a hurried
> design decision before having any working verification code to run
> benchmark on.

I'll remove this entire block, since it is not relevant starting at v4. 
I dropped this "additive only" step in v4 and forgot to remove the bullet.

>
>> +- Currently, parse_commit_gently() requires filling in the root tree
>> +  object for a commit. This passes through lookup_tree() and consequently
>> +  lookup_object(). Also, it calls lookup_commit() when loading the parents.
>> +  These method calls check the ODB for object existence, even if the
>> +  consumer does not need the content. For example, we do not need the
>> +  tree contents when computing merge bases. Now that commit parsing is
>> +  removed from the computation time, these lookup operations are the
>> +  slowest operations keeping graph walks from being fast. Consider
>> +  loading these objects without verifying their existence in the ODB and
>> +  only loading them fully when consumers need them. Consider a method
>> +  such as "ensure_tree_loaded(commit)" that fully loads a tree before
>> +  using commit->tree.
> Very good idea.

I will likely submit an orthogonal patch that does this, as it will save 
time even without the commit graph. The time spent in 'lookup_tree()' is 
less significant when the majority of the time is spent parsing commits, 
but it is still 1-2% in some cases.
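
A rough sketch of the lazy-loading idea, under heavy assumptions: the
structs and load_tree() are stand-ins, and only the name
ensure_tree_loaded() comes from the design document quoted above. The
counter exists purely to show that the expensive lookup happens once.

```c
#include <stddef.h>

struct toy_tree {
	unsigned id;
};

struct toy_commit {
	unsigned tree_id;	/* cheap: known as soon as the commit is parsed */
	struct toy_tree *tree;	/* expensive: filled in only on demand */
};

static int tree_loads;	/* counts expensive loads, for demonstration */

/* Stand-in for an object-database lookup. */
static struct toy_tree *load_tree(unsigned id)
{
	static struct toy_tree slots[16];

	tree_loads++;
	slots[id % 16].id = id;
	return &slots[id % 16];
}

/* Load the tree the first time a consumer actually needs it. */
static struct toy_tree *ensure_tree_loaded(struct toy_commit *c)
{
	if (!c->tree)
		c->tree = load_tree(c->tree_id);
	return c->tree;
}
```

A merge-base walk would never call ensure_tree_loaded() and so would
never pay for the lookup.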

>
>> +- The current design uses the 'commit-graph' builtin to generate the graph.
>> +  When this feature stabilizes enough to recommend to most users, we should
>> +  add automatic graph writes to common operations that create many commits.
>> +  For example, one coulde compute a graph on 'clone', 'fetch', or 'repack'
>> +  commands.
> s/coulde/could/.
>
> Also do not forget "fsck" that calls "verify".  That is more urgent
> than integration with any other subcommand.

Noted.

>
>> +- A server could provide a commit graph file as part of the network protocol
>> +  to avoid extra calculations by clients.
> We need to assess the riskiness and threat models regarding this, if
> we really want to follow this "could" through.  I would imagine that
> the cost for verification is comparable to the cost for regenerating,
> in which case it may not be worth doing this _unless_ the user opts
> into it saying that the other side over the wire is trusted without
> any reservation.

I agree. There is a certain level of trust that is required here.

Thanks,
-Stolee


* Re: [PATCH v4 03/13] commit-graph: create git-commit-graph builtin
  2018-02-21 18:58           ` Junio C Hamano
@ 2018-02-23 16:07             ` Derrick Stolee
  0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-23 16:07 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee

On 2/21/2018 1:58 PM, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
>> Derrick Stolee <stolee@gmail.com> writes:
>>
>>> +int cmd_commit_graph(int argc, const char **argv, const char *prefix)
>>> +{
>>> +	static struct option builtin_commit_graph_options[] = {
>>> +		{ OPTION_STRING, 'p', "object-dir", &opts.obj_dir,
>>> +			N_("dir"),
>>> +			N_("The object directory to store the graph") },
>> I have a suspicion that this was modeled after some other built-in
>> that has a similar issue (perhaps written long time ago), but isn't
>> OPT_STRING() sufficient to define this element these days?
>>
>> Or am I missing something?

You are not. There are several places in the history of this patch 
where I was using old patterns because I was using old code as my model 
(places like 'index-pack').

>> Why squat on short-and-sweet "-p"?  For that matter, since this is
>> not expected to be end-user facing command anyway, I suspect that we
>> do not want to allocate a single letter option from day one, which
>> paints ourselves into a corner from where we cannot escape.

I'll drop all single-letter shortcuts.

> I suspect that exactly the same comment applies to patches in this
> series that add other subcommands (I just saw one in the patch for
> adding 'write').
>

Thanks,
-Stolee


* Re: [PATCH v4 04/13] commit-graph: implement write_commit_graph()
  2018-02-20 22:57         ` Junio C Hamano
@ 2018-02-23 17:23           ` Derrick Stolee
  2018-02-23 19:30             ` Junio C Hamano
  0 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-23 17:23 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee

On 2/20/2018 5:57 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> +#define GRAPH_OID_VERSION_SHA1 1
>> +#define GRAPH_OID_LEN_SHA1 20
> This hardcoded 20 on the right hand side of this #define is probably
> problematic.   Unless you are planning to possibly store truncated
> hash value for some future hash algorithm, GRAPH_OID_LEN_$HASH should
> always be the same as GIT_$HASH_RAWSZ, I would think.  IOW
>
>      #define GRAPH_OID_LEN_SHA1 GIT_SHA1_RAWSZ
>
> perhaps?

Yes.

>
>> +static void write_graph_chunk_fanout(struct sha1file *f,
>> +				     struct commit **commits,
>> +				     int nr_commits)
>> +{
>> +	uint32_t i, count = 0;
>> +	struct commit **list = commits;
>> +	struct commit **last = commits + nr_commits;
>> +
>> +	/*
>> +	 * Write the first-level table (the list is sorted,
>> +	 * but we use a 256-entry lookup to be able to avoid
>> +	 * having to do eight extra binary search iterations).
>> +	 */
>> +	for (i = 0; i < 256; i++) {
>> +		while (list < last) {
>> +			if ((*list)->object.oid.hash[0] != i)
>> +				break;
>> +			count++;
>> +			list++;
>> +		}
> If count and list are always incremented in unison, perhaps you do
> not need an extra variable "last".  If typeof(nr_commits) is wider
> than typeof(count), this loop and the next write-be32 is screwed
> anyway ;-)
>
> This comment probably applies equally to some other uses of the same
> "compute last pointer to compare with running pointer for
> termination" pattern in this patch.

Yes. Also turning i and count into int to match nr_commits.
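
For reference, a standalone sketch of the fanout loop with a single
running index and no "last" pointer. It works on an array holding the
first byte of each sorted object ID; compute_fanout and its parameters
are illustrative names, and it fills an array rather than writing to
the file.

```c
#include <stdint.h>

/*
 * Fill fanout[i] with the number of entries whose first hash byte is <= i.
 * first_bytes must hold the first byte of each object ID, in sorted order.
 */
static void compute_fanout(const unsigned char *first_bytes, int nr,
			   uint32_t fanout[256])
{
	int i, count = 0;

	for (i = 0; i < 256; i++) {
		while (count < nr && first_bytes[count] == i)
			count++;
		fanout[i] = (uint32_t)count;
	}
}
```

A lookup for an ID starting with byte b then only needs to search the
range from fanout[b - 1] (or 0 when b is 0) up to fanout[b].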

>
>> +		sha1write_be32(f, count);
>> +	}
>> +}
>> +static int commit_pos(struct commit **commits, int nr_commits,
>> +		      const struct object_id *oid, uint32_t *pos)
>> +{
> It is a bit unusual to see something_pos() that returns an integer
> that does *NOT* return the position as its return value.  Dropping
> the *pos parameter, and returning "mid" when commits[mid] is what we
> wanted to see, and otherwise returning "-1 - first" to signal the
> position at which we _would_ have found the object, if it were in
> the table, would make it more consistent with the usual convention.

I can make this change. I found the boolean return to make the 
consumer's logic simpler, but it isn't that much simpler.

> Don't we even have such a generalized binary search helper already
> somewhere in the system?

jt/binsearch-with-fanout introduces one when there is a 256-entry fanout 
table (not the case here).

The bsearch() method in stdlib.h (and used in 
pack-write.c:need_large_offset) does not return the _position_ of a 
found element.

Neither of these suit my needs, but I could just be searching for the 
wrong strings. Also, I could divert my energies in this area to create a 
generic search in the style of jt/binsearch-with-fanout.
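
A sketch of the suggested return convention, written against a flat
table of fixed-width hashes rather than struct commit pointers.
hash_pos and TOY_HASH_LEN are illustrative names; the 20-byte width is
an assumption matching SHA-1.

```c
#include <string.h>

#define TOY_HASH_LEN 20	/* assumed raw hash size, as for SHA-1 */

/*
 * Binary search over a sorted, flat table of nr fixed-width hashes.
 * Returns the index when found; otherwise returns -1 - pos, where pos
 * is the index at which the hash would have to be inserted.
 */
static int hash_pos(const unsigned char *table, int nr,
		    const unsigned char *hash)
{
	int first = 0, last = nr;

	while (first < last) {
		int mid = first + (last - first) / 2;
		int cmp = memcmp(hash, table + (size_t)mid * TOY_HASH_LEN,
				 TOY_HASH_LEN);

		if (!cmp)
			return mid;
		if (cmp > 0)
			first = mid + 1;
		else
			last = mid;
	}
	return -1 - first;
}
```

A caller tests "pos < 0" for a miss and can recover the insertion point
as "-1 - pos", so no separate out-parameter is needed.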

>
>> +static void write_graph_chunk_data(struct sha1file *f, int hash_len,
>> +				   struct commit **commits, int nr_commits)
>> +{
>> +	struct commit **list = commits;
>> +	struct commit **last = commits + nr_commits;
>> +	uint32_t num_large_edges = 0;
>> +
>> +	while (list < last) {
>> +		struct commit_list *parent;
>> +		uint32_t int_id;
>> +		uint32_t packedDate[2];
>> +
>> +...
>> +		if (!parent)
>> +			int_id = GRAPH_PARENT_NONE;
>> +		else if (parent->next)
>> +			int_id = GRAPH_LARGE_EDGES_NEEDED | num_large_edges;
>> +		else if (!commit_pos(commits, nr_commits,
>> +				    &(parent->item->object.oid), &int_id))
>> +			int_id = GRAPH_PARENT_MISSING;
>> +
>> +		sha1write_be32(f, int_id);
>> +
>> +		if (parent && parent->next) {
> This is equivalent to checking "int_id & GRAPH_LARGE_EDGES_NEEDED",
> right?  Not suggesting to use the other form of checks, but trying
> to see what's the best way to express it in the most readable way.

You're right, we already set the bit above, so let's make use of that 
check. It is important to note that
(GRAPH_LARGE_EDGES_NEEDED & GRAPH_PARENT_MISSING) == 0.

>
>> +			do {
>> +				num_large_edges++;
>> +				parent = parent->next;
>> +			} while (parent);
> It feels somewhat wasteful to traverse the commit's parents list
> only to count, without populating the octopus table (which I
> understand is assumed to be minority case under this design).

Since we are writing the commit graph file in-order, we cannot write the 
octopus table until after the chunk lengths are known. We could store 
the octopus table in memory and then dump it into the file later, but 
walking the parents is quite fast after all the commits are loaded. I'm 
not sure the time optimization merits the extra complexity here. (I'm 
happy to revisit this if we do see this performance lacking.)

P.S. I really like the name "octopus table" and will use that for 
informal discussions of this format.

>
>> +		}
>> +
>> +		if (sizeof((*list)->date) > 4)
>> +			packedDate[0] = htonl(((*list)->date >> 32) & 0x3);
>> +		else
>> +			packedDate[0] = 0;
> OK, the undefined pattern in the previous round is now gone ;-)  Good.
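
For context, a standalone sketch of the same packing. Shifting a 32-bit
value right by 32 is undefined in C, which is what the sizeof() guard
in the hunk above avoids; the sketch sidesteps it by widening to
uint64_t first. pack_date is an illustrative name, and the bytes are
written by hand instead of with htonl().

```c
#include <stdint.h>

/*
 * Store the low 34 bits of a commit date big-endian: bits 32-33 go into
 * the low bits of out[3], and the low 32 bits go into out[4..7].
 * The argument is already 64 bits wide, so ">> 32" is well defined.
 */
static void pack_date(uint64_t date, unsigned char out[8])
{
	uint32_t hi = (uint32_t)((date >> 32) & 0x3);
	uint32_t lo = (uint32_t)(date & 0xffffffff);
	int i;

	for (i = 0; i < 4; i++) {
		out[i] = (unsigned char)(hi >> (24 - 8 * i));
		out[4 + i] = (unsigned char)(lo >> (24 - 8 * i));
	}
}
```
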
>
>> +		packedDate[1] = htonl((*list)->date);
>> +		sha1write(f, packedDate, 8);
>> +
>> +		list++;
>> +	}
>> +}
>> +
>> +static void write_graph_chunk_large_edges(struct sha1file *f,
>> +					  struct commit **commits,
>> +					  int nr_commits)
>> +{
>> +	struct commit **list = commits;
>> +	struct commit **last = commits + nr_commits;
>> +	struct commit_list *parent;
>> +
>> +	while (list < last) {
>> +		int num_parents = 0;
>> +		for (parent = (*list)->parents; num_parents < 3 && parent;
>> +		     parent = parent->next)
>> +			num_parents++;
>> +
>> +		if (num_parents <= 2) {
>> +			list++;
>> +			continue;
>> +		}
>> +
>> +		/* Since num_parents > 2, this initializer is safe. */
>> +		for (parent = (*list)->parents->next; parent; parent = parent->next) {
>> +			uint32_t int_id, swap_int_id;
>> +			uint32_t last_edge = 0;
>> +			if (!parent->next)
>> +				last_edge |= GRAPH_LAST_EDGE;
>> +
>> +			if (commit_pos(commits, nr_commits,
>> +				       &(parent->item->object.oid),
>> +				       &int_id))
>> +				swap_int_id = htonl(int_id | last_edge);
>> +			else
>> +				swap_int_id = htonl(GRAPH_PARENT_MISSING | last_edge);
>> +			sha1write(f, &swap_int_id, 4);
> What does "swap_" in the name of this variable mean?  For some
> archs, there is no swap.  The only difference between int_id and the
> variable is that its MSB may possibly be smudged with last_edge bit.

Sorry, I tried to catch all of these, but some fell through the cracks. 
I should be using sha1write_be32() after modifying int_id directly.

This whole block is a bit of a mess. I'll replace it with something like:

    uint32_t int_id;
    if (!commit_pos(commits, nr_commits,
                    &(parent->item->object.oid),
                    &int_id))
            int_id = GRAPH_PARENT_MISSING;
    else if (!parent->next)
            int_id |= GRAPH_LAST_EDGE;

    sha1write_be32(f, int_id);

> This is a tangent, but after having seen many instances of "int_id",
> I started to feel that it is grossly misnamed.  We do not care about
> its "int" ness---what's more significant about it is that we can use
> it as a short identifier in place of a full object name, given the
> table of known OIDs.  "oid_table_index" may be a better name (but
> others may be able to suggest an even better one).
>
> 	int pos;
> 	pos = commit_pos(commits, nr_commits, &parent->item->object.oid);
> 	oid_table_pos = (pos < 0) ? GRAPH_PARENT_MISSING : pos;
> 	if (!parent->next)
> 		oid_table_pos |= GRAPH_LAST_EDGE;
> 	oid_table_pos = htonl(oid_table_pos);
> 	sha1write(f, &oid_table_pos, sizeof(oid_table_pos));
>
> or something like that, perhaps?

You're right that int_id isn't great, and your more-specific 
"oid_table_pos" shows an extra reason why it isn't great: when we add 
the GRAPH_LAST_EDGE bit or set it to GRAPH_PARENT_MISSING, the value is 
NOT a table position.

I'll rework references of "int_id" into "edge_value" to store a value 
that goes into - or is read from - the file, either in the two parent 
columns or the octopus table.
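
A sketch of what an "edge_value" for the octopus table could look like
when it is built and read back. The two constants are assumed values
chosen so that they share no bits; the thread does not show the real
definitions, and the TOY_ names and helper functions are made up.

```c
#include <stdint.h>

/* Values assumed for illustration; the thread does not show the definitions. */
#define TOY_PARENT_MISSING 0x7fffffffu
#define TOY_LAST_EDGE      0x80000000u

/*
 * Build the value stored for one octopus-table entry.
 * pos is the parent's position in the OID table, or negative if the
 * parent is not in the table. is_last marks the final parent of a commit.
 */
static uint32_t make_edge_value(int pos, int is_last)
{
	uint32_t edge_value = pos < 0 ? TOY_PARENT_MISSING : (uint32_t)pos;

	if (is_last)
		edge_value |= TOY_LAST_EDGE;
	return edge_value;
}

/* Recover the table position, or -1 if the parent was recorded as missing. */
static int edge_position(uint32_t edge_value)
{
	uint32_t pos = edge_value & ~TOY_LAST_EDGE;

	return pos == TOY_PARENT_MISSING ? -1 : (int)pos;
}

/* Nonzero if this entry terminates a commit's list of extra parents. */
static int edge_is_last(uint32_t edge_value)
{
	return (edge_value & TOY_LAST_EDGE) != 0;
}
```

Here the last-edge flag is applied to a missing parent as well, as in
the original hunk's "GRAPH_PARENT_MISSING | last_edge", so a reader can
always find the end of the list.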

>
>> +static int commit_compare(const void *_a, const void *_b)
>> +{
>> +	struct object_id *a = (struct object_id *)_a;
>> +	struct object_id *b = (struct object_id *)_b;
>> +	return oidcmp(a, b);
>> +}
> I think oidcmp() takes const pointers, so there is no need to
> discard constness from the parameter like this code does.  Also I
> think we tend to prefer writing a_/b_ (instead of _a/_b) to appease
> language lawyers who do not want us mere mortals to use names that
> begin with underscore.
>
>> +static int if_packed_commit_add_to_list(const struct object_id *oid,
>> +					struct packed_git *pack,
>> +					uint32_t pos,
>> +					void *data)
> That is a strange name.  "collect packed commits", perhaps?

We are walking all objects in the pack-index and calling this method. If 
the object is a commit, we add it to the list; otherwise do nothing. 
"data" points to the list.

I think the current name makes the following call very clear:

   for_each_object_in_pack(p, if_packed_commit_add_to_list, &oids);

i.e. "for each object in the pack p: if it is a commit, then add it to 
the list of oids".
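
The same "for each ... if ... add" reading, as a toy sketch with a
function-pointer callback. Nothing here is Git's API: the toy_ types,
for_each_toy_object and if_commit_add_to_list are stand-ins that only
mirror the shape of the call.

```c
#include <stddef.h>

enum toy_type { TOY_COMMIT, TOY_TREE, TOY_BLOB };

struct toy_object {
	int id;
	enum toy_type type;
};

struct toy_id_list {
	int ids[64];
	size_t nr;
};

typedef int (*each_object_fn)(const struct toy_object *obj, void *data);

/* Call fn for every object; stop early if fn returns nonzero. */
static int for_each_toy_object(const struct toy_object *objs, size_t nr,
			       each_object_fn fn, void *data)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		int ret = fn(&objs[i], data);
		if (ret)
			return ret;
	}
	return 0;
}

/* Callback: if the object is a commit, add it to the list in data. */
static int if_commit_add_to_list(const struct toy_object *obj, void *data)
{
	struct toy_id_list *list = data;

	if (obj->type != TOY_COMMIT)
		return 0;
	if (list->nr == sizeof(list->ids) / sizeof(list->ids[0]))
		return -1;	/* list full */
	list->ids[list->nr++] = obj->id;
	return 0;
}
```
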

>
>> +char *write_commit_graph(const char *obj_dir)
>> +{
>> +	struct packed_oid_list oids;
>> +	struct packed_commit_list commits;
>> +	struct sha1file *f;
>> +	int i, count_distinct = 0;
>> +	DIR *info_dir;
>> +	struct strbuf tmp_file = STRBUF_INIT;
>> +	struct strbuf graph_file = STRBUF_INIT;
>> +	unsigned char final_hash[GIT_MAX_RAWSZ];
>> +	char *graph_name;
>> +	int fd;
>> +	uint32_t chunk_ids[5];
>> +	uint64_t chunk_offsets[5];
>> +	int num_chunks;
>> +	int num_long_edges;
>> +	struct commit_list *parent;
>> +
>> +	oids.nr = 0;
>> +	oids.alloc = (int)(0.15 * approximate_object_count());
> Heh, traditionalist would probably avoid unnecessary use of float
> and use something like 1/4 or 1/8 ;-)  After all, it is merely a
> ballpark guesstimate.
>
>> +	num_long_edges = 0;
> This again is about naming, but I find it a bit unnatural to call
> the edge between a child and its octopus parents "long".  Individual
> edges are not long--the only thing that is long is your "list of
> edges".  Some other codepaths in this patch seem to call the same
> concept with s/long/large/, which I found somewhat puzzling.

How about "num_extra_edges" which counts the "overflow" into the octopus 
table. Note: "num_octopus_edges" sounds like summing the out-degree of 
octopus merges, but this count is really "(total number of parents of 
octopus merges) - (number of octopus merges)".
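
The count can be written out as a small sketch over per-commit parent
counts. count_extra_edges and its parameters are illustrative names;
the rule itself follows the "num_parents > 2" check in the patch.

```c
/*
 * Count the entries needed in the overflow ("octopus") table.
 * parent_counts[i] is the number of parents of commit i. A commit with
 * more than two parents keeps its first parent inline and spills the
 * remaining ones, so it contributes (parents - 1) entries.
 */
static int count_extra_edges(const int *parent_counts, int nr)
{
	int i, num_extra_edges = 0;

	for (i = 0; i < nr; i++)
		if (parent_counts[i] > 2)
			num_extra_edges += parent_counts[i] - 1;
	return num_extra_edges;
}
```

For commits with 0, 1, 2, 3 and 5 parents, the two octopus merges have
8 parents in total, so the count is 8 - 2 = 6.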

>
>> +	for (i = 0; i < oids.nr; i++) {
>> +		int num_parents = 0;
>> +		if (i > 0 && !oidcmp(&oids.list[i-1], &oids.list[i]))
>> +			continue;
>> +
>> +		commits.list[commits.nr] = lookup_commit(&oids.list[i]);
>> +		parse_commit(commits.list[commits.nr]);
>> +
>> +		for (parent = commits.list[commits.nr]->parents;
>> +		     parent; parent = parent->next)
>> +			num_parents++;
>> +
>> +		if (num_parents > 2)
>> +			num_long_edges += num_parents - 1;
> OK, so we count how many entries we will record in the overflow
> parent table, and...
>
>> +
>> +		commits.nr++;
>> +	}
>> +	num_chunks = num_long_edges ? 4 : 3;
> ... if we do not have any octopus commit, we do not need the chunk
> for the overflow parent table.  Makes sense.
>
>> +	strbuf_addf(&tmp_file, "%s/info", obj_dir);
>> +	info_dir = opendir(tmp_file.buf);
>> +
>> +	if (!info_dir && mkdir(tmp_file.buf, 0777) < 0)
>> +		die_errno(_("cannot mkdir %s"), tmp_file.buf);
>> +	if (info_dir)
>> +		closedir(info_dir);
>> +	strbuf_addstr(&tmp_file, "/tmp_graph_XXXXXX");
>> +
>> +	fd = git_mkstemp_mode(tmp_file.buf, 0444);
>> +	if (fd < 0)
>> +		die_errno("unable to create '%s'", tmp_file.buf);
> It is not performance critical, but it feels a bit wasteful to
> opendir merely to see if something exists as a directory, and it is
> misleading to the readers (it looks as if we care about what files
> we already have in the directory).
>
> The approach that optimizes for the most common case would be to
>
> 	- prepare full path to the tempfile first
> 	- try create with mkstemp
> 	  - if successful, you do not have to worry about creating
> 	    the directory at all, which is the most common case
>          - see why mkstemp step above failed.  Was it because you
> 	  did not have the surrounding directory?
>            - if not, there is no point continuing.  Just error out.
> 	  - if it was due to missing directory, try creating one.
> 	- try create with mkstemp
> 	  - if successful, all is well.
>          - otherwise there isn't anything more we can do here.

It looks like sha1_file.c:create_tmpfile() has code I can use as a model 
here. Thanks.

I wonder: should we move that method into wrapper.c and have its 
external definition be available in cache.h? Then I can just consume it 
from here.
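
The "try mkstemp first, create the directory only on failure" order suggested above could look roughly like the following. This is a sketch using plain POSIX calls, not git's create_tmpfile(); the helper name `mkstemp_with_parent` is hypothetical, and it assumes the template ends in "XXXXXX":

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a temporary file from a mkstemp()
 * template such as "<dir>/tmp_graph_XXXXXX".  The parent directory is
 * created only if the first attempt fails because it is missing.
 * Returns an open fd, or -1 with errno set.
 */
static int mkstemp_with_parent(char *template_path)
{
	size_t len;
	char *slash;
	int fd, ret;

	assert(template_path != NULL);
	len = strlen(template_path);
	if (len < 6) {
		errno = EINVAL;
		return -1;
	}

	/* Common case: the directory exists and this just works. */
	fd = mkstemp(template_path);
	if (fd >= 0 || errno != ENOENT)
		return fd;

	/* The failure was a missing path component; create the parent. */
	slash = strrchr(template_path, '/');
	if (!slash)
		return -1;
	*slash = '\0';
	ret = mkdir(template_path, 0777);
	*slash = '/';
	if (ret < 0 && errno != EEXIST)
		return -1;

	/* A failed mkstemp() may have scribbled on the X's; restore them. */
	memset(template_path + len - 6, 'X', 6);
	return mkstemp(template_path);
}
```

Only one directory level is created here, which matches the "objects/info" case under discussion; deeper missing paths still fail.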

>
>> +
>> +	f = sha1fd(fd, tmp_file.buf);
>> +
>> +	sha1write_be32(f, GRAPH_SIGNATURE);
>> +
>> +	sha1write_u8(f, GRAPH_VERSION);
>> +	sha1write_u8(f, GRAPH_OID_VERSION);
>> +	sha1write_u8(f, num_chunks);
>> +	sha1write_u8(f, 0); /* unused padding byte */
>> +
>> +	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
>> +	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
>> +	chunk_ids[2] = GRAPH_CHUNKID_DATA;
>> +	if (num_long_edges)
>> +		chunk_ids[3] = GRAPH_CHUNKID_LARGEEDGES;
>> +	else
>> +		chunk_ids[3] = 0;
>> +	chunk_ids[4] = 0;
>> +
>> +	chunk_offsets[0] = 8 + GRAPH_CHUNKLOOKUP_SIZE;
>> +	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
>> +	chunk_offsets[2] = chunk_offsets[1] + GRAPH_OID_LEN * commits.nr;
>> +	chunk_offsets[3] = chunk_offsets[2] + (GRAPH_OID_LEN + 16) * commits.nr;
>> +	chunk_offsets[4] = chunk_offsets[3] + 4 * num_long_edges;
> Do we have to care about overflowing any of the above?  For example,
> the format allows only up to (1<<31)-1 commits, but did something
> actually check if commits.nr at this point stayed under that limit?

Thanks for pointing this out. It should be a while before we have a repo 
with 2 billion commits, but there's no time like the present to be safe.
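
One possible shape for that check, as a sketch only: the `SKETCH_` constant and both function names are hypothetical, and the 36-byte row width in the usage note assumes a 20-byte hash plus 16 bytes of commit data, as in the quoted offsets. The point is to validate the count once and to widen to 64 bits before multiplying, since a multiplication done in `int` could overflow before it is stored into a uint64_t:

```c
#include <assert.h>
#include <stdint.h>

/* (1<<31)-1, the limit on commits mentioned in the review above. */
#define SKETCH_MAX_COMMITS 0x7fffffffu

/* Returns 1 if nr commits can be represented in the format, else 0. */
static int commit_count_fits(uint64_t nr)
{
	return nr <= SKETCH_MAX_COMMITS;
}

/*
 * Offset just past a chunk holding nr fixed-width rows.  The width is
 * widened to 64 bits first, so the product cannot wrap at 32 bits.
 */
static uint64_t chunk_end(uint64_t start, uint32_t row_width, uint32_t nr)
{
	assert(commit_count_fits(nr));
	return start + (uint64_t)row_width * nr;
}
```

With 36-byte rows, the largest allowed commit count already yields an offset above 4 GiB, so the widening is not just theoretical.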

Thanks,
-Stolee


* Re: [PATCH v4 08/13] commit-graph: implement --delete-expired
  2018-02-21 21:34         ` Stefan Beller
@ 2018-02-23 17:43           ` Derrick Stolee
  0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-23 17:43 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git, Jeff Hostetler, Jeff King, Jonathan Tan, SZEDER Gábor,
	Junio C Hamano, Derrick Stolee

On 2/21/2018 4:34 PM, Stefan Beller wrote:
> On Mon, Feb 19, 2018 at 10:53 AM, Derrick Stolee <stolee@gmail.com> wrote:
>
>>          graph_name = write_commit_graph(opts.obj_dir);
>>
>>          if (graph_name) {
>>                  if (opts.set_latest)
>>                          set_latest_file(opts.obj_dir, graph_name);
>>
>> +               if (opts.delete_expired)
>> +                       do_delete_expired(opts.obj_dir,
>> +                                         old_graph_name,
>> +                                         graph_name);
>> +
> So this only allows to delete expired things and setting the latest
> when writing a new graph. Would we ever envision a user to produce
> a new graph (e.g. via obtaining a graph that they got from a server) and
> then manually rerouting the latest to that new graph file without writing
> that graph file in the same process? The same for expired.
>
> I guess these operations are just available via editing the
> latest or deleting files manually, which slightly contradicts
> e.g. "git update-ref", which in olden times was just a fancy way
> of rewriting the refs file manually. (though it claims to be less
> prone to errors as it takes lock files)

I imagine these alternatives for placing a new, latest commit graph file 
would want Git to handle rewriting the "graph-latest" file. Given such a 
use case, we could consider extending the 'commit-graph' interface, but 
I don't want to plan for it now.

>
>>   extern char *get_graph_latest_filename(const char *obj_dir);
>> +extern char *get_graph_latest_contents(const char *obj_dir);
> Did
> https://public-inbox.org/git/20180208213806.GA6381@sigill.intra.peff.net/
> ever make it into tree? (It is sort of new, but I feel we'd want to
> strive for consistency in the code base, eventually.)

Thank you for the reminder. I've removed the externs from 'commit-graph.h'.

Should I also remove the externs from other methods I introduce even 
though their surrounding definitions include 'extern'?

Thanks,
-Stolee


* Re: [PATCH v4 07/13] commit-graph: implement --set-latest
  2018-02-22 18:31         ` Junio C Hamano
@ 2018-02-23 17:53           ` Derrick Stolee
  0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-23 17:53 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee

On 2/22/2018 1:31 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>>   static struct opts_commit_graph {
>>   	const char *obj_dir;
>>   	const char *graph_file;
>> +	int set_latest;
>>   } opts;
>> ...
>> @@ -89,6 +106,8 @@ static int graph_write(int argc, const char **argv)
>>   		{ OPTION_STRING, 'o', "object-dir", &opts.obj_dir,
>>   			N_("dir"),
>>   			N_("The object directory to store the graph") },
>> +		OPT_BOOL('u', "set-latest", &opts.set_latest,
>> +			N_("update graph-head to written graph file")),
>>   		OPT_END(),
>>   	};
>>   
>> @@ -102,6 +121,9 @@ static int graph_write(int argc, const char **argv)
>>   	graph_name = write_commit_graph(opts.obj_dir);
>>   
>>   	if (graph_name) {
>> +		if (opts.set_latest)
>> +			set_latest_file(opts.obj_dir, graph_name);
>> +
> This feels like a very strange API from potential caller's point of
> view.  Because you have to decide that you are going to mark it as
> the latest one upfront before actually writing the graph file, if
> you forget to pass --set-latest, you have to know how to manually
> mark the file as latest anyway.  I would understand if it were one
> of the following:
>
>   (1) whenever a new commit graph file is written in the
>       objects/info/ directory, always mark it as the latest (drop
>       --set-latest option altogether); or
>
>   (2) make set-latest command that takes a name of an existing graph
>       file in the objects/info/ directory, and sets the latest
>       pointer to point at it (make it separate from 'write' command).
>
> though.

Perhaps the 'write' subcommand should be replaced with 'replace' which 
does the following:

1. Write a new commit graph based on the starting commits (from all 
packs, from specified packs, from OIDs).
2. Update 'graph-latest' to point to that new file.
3. Delete all "expired" commit graph files.

(Hence, we would drop the "--set-latest" and "--delete-expired" options.)

Due to the concerns with concurrency, I really don't want to split these 
operations into independent processes that consumers need to call in 
series. Since this sequence of events is the only real interaction we 
expect (for now), this interface will simplify the design.

The biggest reason I didn't design it like this in the first place is 
that the behavior changes as the patch develops.

Thanks,
-Stolee


* Re: [PATCH v4 08/13] commit-graph: implement --delete-expired
  2018-02-22 18:48         ` Junio C Hamano
@ 2018-02-23 17:59           ` Derrick Stolee
  2018-02-23 19:33             ` Junio C Hamano
  0 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-23 17:59 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee

On 2/22/2018 1:48 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> Teach git-commit-graph to delete the .graph files that are siblings of a
>> newly-written graph file, except for the file referenced by 'graph-latest'
>> at the beginning of the process and the newly-written file. If we fail to
>> delete a graph file, only report a warning because another git process may
>> be using that file. In a multi-process environment, we expect the previous
>> graph file to be used by a concurrent process, so we do not delete it to
>> avoid race conditions.
> I do not understand the later part of the above.  On some operating
> systems, you actually can remove a file that is open by another
> process without any ill effect.  There are systems that do not allow
> removing a file that is in use, and an attempt to unlink it may
> fail.  The need to handle such a failure gracefully is not limited
> to the case of removing a commit graph file---we need to deal with
> it when removing file of _any_ type.

My thought is that we should _warn_ when we fail to delete a .graph file 
that we think should be safe to delete. However, if we are warning for a 
file that is currently being accessed (as is the case on Windows, at 
least), then we will add a lot of noise. This is especially true when 
using IDEs that run 'status' or 'fetch' in the background, frequently.

> Especially the last sentence "we do not delete it to avoid race
> conditions" I find problematic.  If a system does not allow removing
> a file in use and we detect a failure after an attempt to do so, it
> is not "we do not delete it" --- even if you do, you won't succeed
> anyway, so there is no point saying that.  And on systems that do
> allow safe removal of a file in use (i.e. they allow an open file to
> be used by processes that have open filehandles to it after its
> removal), there is no point refraining to delete it "to avoid race
> conditions", either---in fact it is unlikely that you would even know
> somebody else had it open and was using it.

The (unlikely, but possible) race condition involves two processes (P1 
and P2):

1. P1 reads from graph-latest to see commit graph file F1.
2. P2 updates graph-latest to point to F2 and deletes F1.
3. P1 tries to read F1 and fails.

I could explicitly mention this condition in the message, or we can just 
let P2 fail by deleting all files other than the one referenced by 
'graph-latest'. Thoughts?

> In any case, I do not think '--delete-expired' option that can be
> given only when you are writing makes much sense as an API.  An
> 'expire' command, just like 'set-latest' command, that is a separate
> command from 'write',  may make sense, though.

In another message, I proposed dropping the argument and assuming 
expires happen on every write.

Thanks,
-Stolee


* Re: [PATCH v4 12/13] commit-graph: read only from specific pack-indexes
  2018-02-21 22:25         ` Stefan Beller
@ 2018-02-23 19:19           ` Derrick Stolee
  0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-23 19:19 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git, Jeff Hostetler, Jeff King, Jonathan Tan, SZEDER Gábor,
	Junio C Hamano, Derrick Stolee

On 2/21/2018 5:25 PM, Stefan Beller wrote:
> On Mon, Feb 19, 2018 at 10:53 AM, Derrick Stolee <stolee@gmail.com> wrote:
>> Teach git-commit-graph to inspect the objects only in a certain list
>> of pack-indexes within the given pack directory. This allows updating
>> the commit graph iteratively, since we add all commits stored in a
>> previous commit graph.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   Documentation/git-commit-graph.txt | 11 +++++++++++
>>   builtin/commit-graph.c             | 32 +++++++++++++++++++++++++++++---
>>   commit-graph.c                     | 26 ++++++++++++++++++++++++--
>>   commit-graph.h                     |  4 +++-
>>   packfile.c                         |  4 ++--
>>   packfile.h                         |  2 ++
>>   t/t5318-commit-graph.sh            | 16 ++++++++++++++++
>>   7 files changed, 87 insertions(+), 8 deletions(-)
>>
>> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
>> index b9b4031..93d50d1 100644
>> --- a/Documentation/git-commit-graph.txt
>> +++ b/Documentation/git-commit-graph.txt
>> @@ -42,6 +42,10 @@ With the `--delete-expired` option, delete the graph files in the pack
>>   directory that are not referred to by the graph-latest file. To avoid race
>>   conditions, do not delete the file previously referred to by the
>>   graph-latest file if it is updated by the `--set-latest` option.
>> ++
>> +With the `--stdin-packs` option, generate the new commit graph by
>> +walking objects only in the specified packfiles and any commits in
>> +the existing graph-head.
> A general question on this series:
> How do commit graph buildups deal with garbage collected commits?
> (my personal workflow is heavy on rebase, which generates lots of
> dangling commits, to be thrown out later)
>
> The second half of the sentence makes it sound like once a
> commit is in the graph it cannot be pulled out easily again, hence
> the question on the impact of graphs on a long living repository
> which is garbage collected frequently.

This is another place that I failed to update when I stopped 
automatically including commits from the existing graph. As of v4, the 
new graph should only contain commits reachable from the commits 
discovered by the three mechanisms (inspecting all packs, inspecting the 
--stdin-packs, or reading the OIDs with --stdin-commits). Thus, commits 
that are GC'd will not appear in the new graph.

If a commit has been GC'd, then parse_commit_gently() will never be 
called since it is called after lookup_object() to create the struct 
commit. The only case we could have is where we navigate to a parent 
using the commit graph but that parent is GC'd (this does not make sense).

It may be helpful to add an "--additive" argument to specify that we 
want to keep all commits that are already in the graph.

> AFAICT you could just run
>      git commit-graph write --set-latest [--delete-expired]
> as that actually looks up objects from outside the existing graph files,
> such that lost objects are ignored?
>
>> +       const char **lines = NULL;
>> +       int nr_lines = 0;
>> +       int alloc_lines = 0;
> (nit:)
> I had the impression that these triplet-variables, that are used in
> ALLOC_GROW are all X, X_nr and X_alloc, but I might be wrong.

"git grep ALLOC_GROW" confirms your impression. Will fix.

>
>> @@ -170,7 +178,25 @@ static int graph_write(int argc, const char **argv)
>>
>>          old_graph_name = get_graph_latest_contents(opts.obj_dir);
>>
>> -       graph_name = write_commit_graph(opts.obj_dir);
>> +       if (opts.stdin_packs) {
>> +               struct strbuf buf = STRBUF_INIT;
>> +               nr_lines = 0;
>> +               alloc_lines = 128;
> alloc_lines has been initialized before, so why redo it here again?
> Also what is the rationale for choosing 128 as a good default?
> I would guess 0 is just as fine, because ALLOC_GROW makes sure
> that it grows fast in the first couple of entries by having an additional
> offset. (no need to fine tune the starting allocation IMHO)

I was unaware that ALLOC_GROW() handled the alloc == 0 case. Thanks.
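
For readers unfamiliar with the idiom, here is a standalone sketch of why a zero starting allocation is fine. The growth rule is modeled on git's alloc_nr(); the `SKETCH_ALLOC_NR` macro, the `line_list` type and the append helper are hypothetical and use plain realloc() rather than git's wrappers:

```c
#include <assert.h>
#include <stdlib.h>

/* Growth rule modeled on git's alloc_nr(): the +16 makes 0 a usable start. */
#define SKETCH_ALLOC_NR(x) (((x) + 16) * 3 / 2)

struct line_list {
	char **items;
	size_t nr, alloc;
};

/* Append one line; returns 0 on success, -1 if allocation fails. */
static int line_list_append(struct line_list *list, char *line)
{
	assert(list != NULL);
	if (list->nr + 1 > list->alloc) {
		size_t new_alloc = SKETCH_ALLOC_NR(list->alloc);
		char **grown = realloc(list->items,
				       new_alloc * sizeof(*grown));

		if (!grown)
			return -1;
		list->items = grown;
		list->alloc = new_alloc;
	}
	list->items[list->nr++] = line;
	return 0;
}
```

Starting from `{ NULL, 0, 0 }`, the first append allocates room for 24 entries, so there is no need to pre-size the array with a guess like 128.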

>
>> +               ALLOC_ARRAY(lines, alloc_lines);
>> +
>> +               while (strbuf_getline(&buf, stdin) != EOF) {
>> +                       ALLOC_GROW(lines, nr_lines + 1, alloc_lines);
>> +                       lines[nr_lines++] = buf.buf;
>> +                       strbuf_detach(&buf, NULL);
> strbuf_detach returns its previous buf.buf, such that you can combine these
> two lines as
>      lines[nr_lines++] = strbuf_detach(&buf, NULL);
>
>
>> +               }
>> +
>> +               pack_indexes = lines;
>> +               nr_packs = nr_lines;
> Technically we do not need to strbuf_release(&buf) here, because
> strbuf_detach is always called, and by knowing its implementation,
> it is just as good.
>
>
>> @@ -579,7 +581,27 @@ char *write_commit_graph(const char *obj_dir)
>>                  oids.alloc = 1024;
>>          ALLOC_ARRAY(oids.list, oids.alloc);
>>
>> -       for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
>> +       if (pack_indexes) {
>> +               struct strbuf packname = STRBUF_INIT;
>> +               int dirlen;
>> +               strbuf_addf(&packname, "%s/pack/", obj_dir);
>> +               dirlen = packname.len;
>> +               for (i = 0; i < nr_packs; i++) {
>> +                       struct packed_git *p;
>> +                       strbuf_setlen(&packname, dirlen);
>> +                       strbuf_addstr(&packname, pack_indexes[i]);
>> +                       p = add_packed_git(packname.buf, packname.len, 1);
>> +                       if (!p)
>> +                               die("error adding pack %s", packname.buf);
>> +                       if (open_pack_index(p))
>> +                               die("error opening index for %s", packname.buf);
>> +                       for_each_object_in_pack(p, if_packed_commit_add_to_list, &oids);
>> +                       close_pack(p);
>> +               }
> strbuf_release(&packname);
>
>> +       }
>> +       else
> (micro style nit)
>
>      } else



* Re: [PATCH v4 04/13] commit-graph: implement write_commit_graph()
  2018-02-23 17:23           ` Derrick Stolee
@ 2018-02-23 19:30             ` Junio C Hamano
  2018-02-23 19:48               ` Junio C Hamano
  2018-02-23 20:02               ` Derrick Stolee
  0 siblings, 2 replies; 146+ messages in thread
From: Junio C Hamano @ 2018-02-23 19:30 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> jt/binsearch-with-fanout introduces one when there is a 256-entry
> fanout table (not the case here).
>
> The bsearch() method in search.h (and used in
> pack-write.c:need_large_offset) does not return the _position_ of a
> found element.
>
> Neither of these suit my needs, but I could just be searching for the
> wrong strings. Also, I could divert my energies in this area to create
> a generic search in the style of jt/binsearch-with-fanout.

... me goes and digs ...

What I had in mind was the one in sha1-lookup.c, actually.  Having
said that, hand-rolling another one is not too huge a deal;
eventually people will notice and consolidate code after the series
stabilizes anyway ;-)

>>> +				num_large_edges++;
>>> +				parent = parent->next;
>>> +			} while (parent);
>> It feels somewhat wasteful to traverse the commit's parents list
>> only to count, without populating the octopus table (which I
>> understand is assumed to be minority case under this design).
>
> Since we are writing the commit graph file in-order, we cannot write
> the octopus table until after the chunk lengths are known.

Oh, my "minority case" comment was me wondering "since we expect
there are only a few, why not keep them in memory while we discover
them here, so that the writing phase that comes later does not have to
go through all commits again counting their parents?  would it be
more performant and a better trade-off?"  We can certainly do such
an optimization later (iow, it is not all that crucial an issue and
certainly I didn't mention the above as something that needs to be
"fixed"--there is nothing broken).

> store the octopus table in memory and then dump it into the file
> later, but walking the parents is quite fast after all the commits are
> loaded. I'm not sure the time optimization merits the extra complexity
> here. (I'm happy to revisit this if we do see this performance
> lacking.)
>
> P.S. I really like the name "octopus table" and will use that for
> informal discussions of this format.

I actually do not mind that name used as the official term.  I find
it far more descriptive and understandable than "long edge" / "large
edge" at least to a Git person.

> You're right that int_id isn't great, and your more-specific
> "oid_table_pos" shows an extra reason why it isn't great: when we add
> the GRAPH_LAST_EDGE bit or set it to GRAPH_PARENT_MISSING, the value
> is NOT a table position.

Perhaps I am somewhat biased, but it is quite natural for our
codebase and internal API to say something like this:

    x_pos(table, key) function's return value is the non-negative
    position for the key in the table when the key is there; when it
    returns a negative value, it is (-1 - position) where the "position"
    is the position in the table the key would have been found if
    it was in there.

and store the return value from such a function in a variable called
"pos".  Surely, sometimes "pos" does not have _the_ position, but
that does not make it a bad name.
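
As an illustration of that return-value convention, here is a sketch of a lookup over a sorted table of fixed-width hashes. It is not the code in sha1-lookup.c; the function name and the `SKETCH_HASH_LEN` constant are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_HASH_LEN 20

/*
 * Binary search in a sorted table of nr fixed-width hashes.
 * Returns the index of key when it is present.  Otherwise returns
 * (-1 - pos), where pos is the index at which key would have to be
 * inserted to keep the table sorted.
 */
static int sketch_hash_pos(const unsigned char *table, size_t nr,
			   const unsigned char *key)
{
	size_t lo = 0, hi = nr;

	assert(key != NULL);
	assert(table != NULL || nr == 0);
	while (lo < hi) {
		size_t mi = lo + (hi - lo) / 2;
		int cmp = memcmp(table + mi * SKETCH_HASH_LEN, key,
				 SKETCH_HASH_LEN);

		if (!cmp)
			return (int)mi;
		if (cmp < 0)
			lo = mi + 1;
		else
			hi = mi;
	}
	return -1 - (int)lo;
}
```

A caller that gets a negative value back can recover the insertion point as `-1 - ret`, which is what makes the encoding useful for "find or insert" loops.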

Saying "MISSING is a special value that denotes 'nothing is here'"
and allowing it to be set to a variable that is meant to hold the
position is not such a bad thing, though.  After all, that is how
you use NULL as a special value for a pointer variable ;-).

Same for using the high bit to mean something else.  Taking these
together you would explain "low 31-bit of pos holds the position for
the item in the table.  MISSING is a special value that you can use
to denote there is nothing.  The LAST_EDGE bit denotes that one
group of positions ends there", or something like that.
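
That explanation maps onto a few small helpers. The sketch below uses hypothetical `SKETCH_` names, and the numeric values are assumptions chosen for illustration (high bit for the end marker, all low bits set for "missing"); the patch's own GRAPH_LAST_EDGE and GRAPH_PARENT_MISSING definitions are authoritative:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define SKETCH_LAST_EDGE      0x80000000u
#define SKETCH_PARENT_MISSING 0x7fffffffu
#define SKETCH_POS_MASK       0x7fffffffu

/* Pack a table position and an "end of group" flag into 32 bits. */
static uint32_t edge_encode(uint32_t pos, int is_last)
{
	assert(pos < SKETCH_PARENT_MISSING);
	return pos | (is_last ? SKETCH_LAST_EDGE : 0);
}

/* Low 31 bits: the position in the table (or the MISSING sentinel). */
static uint32_t edge_pos(uint32_t value)
{
	return value & SKETCH_POS_MASK;
}

/* High bit: this entry ends one group of positions. */
static int edge_is_last(uint32_t value)
{
	return (value & SKETCH_LAST_EDGE) != 0;
}

/* The sentinel means "nothing is here", much like a NULL pointer. */
static int edge_is_missing(uint32_t value)
{
	return edge_pos(value) == SKETCH_PARENT_MISSING;
}
```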

> I think the current name makes the following call very clear:

It is still a strange name nevertheless.

>>> +char *write_commit_graph(const char *obj_dir)
>>> +{
>>> +	struct packed_oid_list oids;
>>> +	struct packed_commit_list commits;
>>> +	struct sha1file *f;
>>> +	int i, count_distinct = 0;
>>> +	DIR *info_dir;
>>> +	struct strbuf tmp_file = STRBUF_INIT;
>>> +	struct strbuf graph_file = STRBUF_INIT;
>>> +	unsigned char final_hash[GIT_MAX_RAWSZ];
>>> +	char *graph_name;
>>> +	int fd;
>>> +	uint32_t chunk_ids[5];
>>> +	uint64_t chunk_offsets[5];
>>> +	int num_chunks;
>>> +	int num_long_edges;
>>> +	struct commit_list *parent;
>>> +
>>> +	oids.nr = 0;
>>> +	oids.alloc = (int)(0.15 * approximate_object_count());
>> Heh, traditionalist would probably avoid unnecessary use of float
>> and use something like 1/4 or 1/8 ;-)  After all, it is merely a
>> ballpark guestimate.
>>
>>> +	num_long_edges = 0;
>> This again is about naming, but I find it a bit unnatural to call
>> the edge between a child and its octopus parents "long".  Individual
>> edges are not long--the only thing that is long is your "list of
>> edges".  Some other codepaths in this patch seem to call the same
>> concept with s/long/large/, which I found somewhat puzzling.
>
> How about "num_extra_edges"...

Yes, "extra" in the name makes it very understandable.



* Re: [PATCH v4 08/13] commit-graph: implement --delete-expired
  2018-02-23 17:59           ` Derrick Stolee
@ 2018-02-23 19:33             ` Junio C Hamano
  2018-02-23 19:41               ` Derrick Stolee
  0 siblings, 1 reply; 146+ messages in thread
From: Junio C Hamano @ 2018-02-23 19:33 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> The (unlikely, but possible) race condition involves two processes (P1
> and P2):
>
> 1. P1 reads from graph-latest to see commit graph file F1.
> 2. P2 updates graph-latest to point to F2 and deletes F1.
> 3. P1 tries to read F1 and fails.
>
> I could explicitly mention this condition in the message, or we can
> just let P2 fail by deleting all files other than the one referenced
> by 'graph-latest'. Thoughts?

The established way we do this is not to have -latest pointer, I
would think, and instead, make -latest be the actual thing.  That is
how $GIT_DIR/index is updated, for example, by first writing into a
temporary file and then moving it to the final destination.  Is
there a reason why the same pattern cannot be used?
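
For reference, the write-then-rename pattern can be sketched with plain POSIX calls as below. This is not git's lockfile or tempfile API; the helper name is hypothetical, the temporary file is assumed to live on the same filesystem as the destination, and durability concerns such as fsync() are left out:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write data to a temporary file, then rename()
 * it over final_path.  On POSIX systems a reader that opens
 * final_path sees either the complete old file or the complete new
 * one, and a reader that already has the old file open keeps it.
 * tmp_template must end in "XXXXXX".  Returns 0 on success, -1 on
 * failure.
 */
static int replace_file_atomically(char *tmp_template, const char *final_path,
				   const void *data, size_t len)
{
	const char *p = data;
	int fd;

	assert(tmp_template && final_path);
	fd = mkstemp(tmp_template);
	if (fd < 0)
		return -1;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			close(fd);
			unlink(tmp_template);
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}

	if (close(fd) < 0 || rename(tmp_template, final_path) < 0) {
		unlink(tmp_template);
		return -1;
	}
	return 0;
}
```

With a fixed destination name there is no separate pointer file to read first, so the P1/P2 sequence described above has no window in which the name P1 looked up has been deleted.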


* Re: [PATCH v4 08/13] commit-graph: implement --delete-expired
  2018-02-23 19:33             ` Junio C Hamano
@ 2018-02-23 19:41               ` Derrick Stolee
  2018-02-23 19:51                 ` Junio C Hamano
  0 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-23 19:41 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee

On 2/23/2018 2:33 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> The (unlikely, but possible) race condition involves two processes (P1
>> and P2):
>>
>> 1. P1 reads from graph-latest to see commit graph file F1.
>> 2. P2 updates graph-latest to point to F2 and deletes F1.
>> 3. P1 tries to read F1 and fails.
>>
>> I could explicitly mention this condition in the message, or we can
>> just let P2 fail by deleting all files other than the one referenced
>> by 'graph-latest'. Thoughts?
> The established way we do this is not to have -latest pointer, I
> would think, and instead, make -latest be the actual thing.  That is
> how $GIT_DIR/index is updated, for example, by first writing into a
> temporary file and then moving it to the final destination.  Is
> there a reason why the same pattern cannot be used?

My thought was that using a fixed name (e.g. 
.git/objects/info/commit-graph) would block making the graph 
incremental. Upon thinking again, this is not the case. That feature 
could be designed with a fixed name for the small, frequently-updated 
file and use .../info/graph-<hash>.graph names for the "base" graph files.

I'll spend some time thinking about the ramifications of this fixed-name 
concept. At the moment, it would remove two commits from this patch 
series, which is nice.

Thanks,
-Stolee


* Re: [PATCH v4 04/13] commit-graph: implement write_commit_graph()
  2018-02-23 19:30             ` Junio C Hamano
@ 2018-02-23 19:48               ` Junio C Hamano
  2018-02-23 20:02               ` Derrick Stolee
  1 sibling, 0 replies; 146+ messages in thread
From: Junio C Hamano @ 2018-02-23 19:48 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:

>> I think the current name makes the following call very clear:
>
> It is still a strange name nevertheless.

Sorry for simply repeating "strange" without spelling out why in the
previous message.  This certainly is subjective and depends on your
cultural background, but in our codebase, I tried to name functions
after "what" they do and "why", rather than "how" they do so.  In a
way, it's the same kind of uneasiness I feel when I see variables
named in hungarian notation.

You would inspect the object and treat 'data' as a list and add to
the object if it is a commit, and if_packed_commit_add_to_list()
certainly is a name that describes all of that well, but does it
give readers a good answer when they wonder why the function is
doing so?  You described with the name of the function how it
collects commits that are in the pack, without explicitly saying
that you want to collect packed commits and that is why you are
inspecting for type and doing so only for commit (i.e.
"if_packed_commit" part of the name) and why you are adding to a
list.


* Re: [PATCH v4 08/13] commit-graph: implement --delete-expired
  2018-02-23 19:41               ` Derrick Stolee
@ 2018-02-23 19:51                 ` Junio C Hamano
  0 siblings, 0 replies; 146+ messages in thread
From: Junio C Hamano @ 2018-02-23 19:51 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> My thought was that using a fixed name
> (e.g. .git/objects/info/commit-graph) would block making the graph
> incremental. Upon thinking again, this is not the case. That feature
> could be designed with a fixed name for the small, frequently-updated
> file and use .../info/graph-<hash>.graph names for the "base" graph
> files.

I guess that is in line with the way how split-index folks did their
thing ;-)


* Re: [PATCH v4 04/13] commit-graph: implement write_commit_graph()
  2018-02-23 19:30             ` Junio C Hamano
  2018-02-23 19:48               ` Junio C Hamano
@ 2018-02-23 20:02               ` Derrick Stolee
  1 sibling, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-23 20:02 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee



On 2/23/2018 2:30 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> jt/binsearch-with-fanout introduces one when there is a 256-entry
>> fanout table (not the case here).
>>
>> The bsearch() method in search.h (and used in
>> pack-write.c:need_large_offset) does not return the _position_ of a
>> found element.
>>
>> Neither of these suit my needs, but I could just be searching for the
>> wrong strings. Also, I could divert my energies in this area to create
>> a generic search in the style of jt/binsearch-with-fanout.
> ... me goes and digs ...
>
> What I had in mind was the one in sha1-lookup.c, actually.  Having
> said that, hand-rolling another one is not too huge a deal;
> eventually people will notice and consolidate code after the series
> stabilizes anyway ;-)

Ah, sha1_pos(). That definitely satisfies my use case. Thanks! My local 
branch has this replacement.

>
>>>> +				num_large_edges++;
>>>> +				parent = parent->next;
>>>> +			} while (parent);
>>> It feels somewhat wasteful to traverse the commit's parents list
>>> only to count, without populating the octopus table (which I
>>> understand is assumed to be minority case under this design).
>> Since we are writing the commit graph file in-order, we cannot write
>> the octopus table until after the chunk lengths are known.
> Oh, my "minority case" comment was me wondering "since we expect
> there are only a few, why not keep them in memory while we discover
> them here, so that the writing phase that comes later does not have to
> go through all commits again counting their parents?  would it be
> more performant and a better trade-off?"  We can certainly do such
> an optimization later (iow, it is not all that crucial an issue and
> certainly I didn't mention the above as something that needs to be
> "fixed"--there is nothing broken).
>
>> store the octopus table in memory and then dump it into the file
>> later, but walking the parents is quite fast after all the commits are
>> loaded. I'm not sure the time optimization merits the extra complexity
>> here. (I'm happy to revisit this if we do see this performance
>> lacking.)
>>
>> P.S. I really like the name "octopus table" and will use that for
>> informal discussions of this format.
> I actually do not mind that name used as the official term.  I find
> it far more descriptive and understandable than "long edge" / "large
> edge" at least to a Git person.

I will consider using this in the format and design documents.

>
>> You're right that int_id isn't great, and your more-specific
>> "oid_table_pos" shows an extra reason why it isn't great: when we add
>> the GRAPH_LAST_EDGE bit or set it to GRAPH_PARENT_MISSING, the value
>> is NOT a table position.
> Perhaps I am somewhat biased, but it is quite natural for our
> codebase and internal API to say something like this:
>
>      x_pos(table, key) function's return value is the non-negative
>      position for the key in the table when the key is there; when it
>      returns a negative value, it is (-1 - position) where the "position"
>      is the position in the table the key would have been found if
>      it was in there.
>
> and store the return value from such a function in a variable called
> "pos".  Surely, sometimes "pos" does not have _the_ position, but
> that does not make it a bad name.
>
> Saying "MISSING is a special value that denotes 'nothing is here'"
> and allowing it to be set to a variable that meant to hold the
> position is not such a bad thing, though.  After all, that is how
> you use NULL as a special value for a pointer variable ;-).
>
> Same for using the high bit to mean something else.  Taking these
> together you would explain "low 31-bit of pos holds the position for
> the item in the table.  MISSING is a special value that you can use
> to denote there is nothing.  The LAST_EDGE bit denotes that one
> group of positions ends there", or something like that.
>
>> I think the current name makes the following call very clear:
> It is still a strange name nevertheless.
>
>>>> +char *write_commit_graph(const char *obj_dir)
>>>> +{
>>>> +	struct packed_oid_list oids;
>>>> +	struct packed_commit_list commits;
>>>> +	struct sha1file *f;
>>>> +	int i, count_distinct = 0;
>>>> +	DIR *info_dir;
>>>> +	struct strbuf tmp_file = STRBUF_INIT;
>>>> +	struct strbuf graph_file = STRBUF_INIT;
>>>> +	unsigned char final_hash[GIT_MAX_RAWSZ];
>>>> +	char *graph_name;
>>>> +	int fd;
>>>> +	uint32_t chunk_ids[5];
>>>> +	uint64_t chunk_offsets[5];
>>>> +	int num_chunks;
>>>> +	int num_long_edges;
>>>> +	struct commit_list *parent;
>>>> +
>>>> +	oids.nr = 0;
>>>> +	oids.alloc = (int)(0.15 * approximate_object_count());
>>> Heh, traditionalist would probably avoid unnecessary use of float
>>> and use something like 1/4 or 1/8 ;-)  After all, it is merely a
>>> ballpark guestimate.
>>>
>>>> +	num_long_edges = 0;
>>> This again is about naming, but I find it a bit unnatural to call
>>> the edge between a child and its octopus parents "long".  Individual
>>> edges are not long--the only thing that is long is your "list of
>>> edges".  Some other codepaths in this patch seem to call the same
>>> concept with s/long/large/, which I found somewhat puzzling.
>> How about "num_extra_edges"...
> Yes, "extra" in the name makes it very understandable.
>
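
For reference, the "pos" convention quoted above can be sketched in a
few lines. This is only an illustration over a sorted table of ints;
the name int_table_pos() is made up for this example and is not a
function from the patch or from Git.

```c
/*
 * Hypothetical illustration of the "pos" convention: binary search in
 * a sorted table of ints. Returns the index of key when it is present;
 * otherwise returns (-1 - position), where position is the index at
 * which key would have to be inserted to keep the table sorted.
 */
static int int_table_pos(const int *table, int nr, int key)
{
	int lo = 0, hi = nr;

	while (lo < hi) {
		int mi = lo + (hi - lo) / 2;

		if (table[mi] == key)
			return mi;
		if (table[mi] < key)
			lo = mi + 1;
		else
			hi = mi;
	}
	/* Not found: lo is the insertion point. */
	return -1 - lo;
}
```

A caller can recover the insertion point from a negative result with
(-1 - pos), so a single return value carries both cases.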



* Re: [PATCH v4 04/13] commit-graph: implement write_commit_graph()
  2018-02-19 18:53       ` [PATCH v4 04/13] commit-graph: implement write_commit_graph() Derrick Stolee
  2018-02-20 22:57         ` Junio C Hamano
@ 2018-02-26 16:10         ` SZEDER Gábor
  2018-02-28 18:47         ` Junio C Hamano
  2 siblings, 0 replies; 146+ messages in thread
From: SZEDER Gábor @ 2018-02-26 16:10 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git mailing list, git, Jeff King, Jonathan Tan, Stefan Beller,
	Junio C Hamano, Derrick Stolee

On Mon, Feb 19, 2018 at 7:53 PM, Derrick Stolee <stolee@gmail.com> wrote:

> +static int if_packed_commit_add_to_list(const struct object_id *oid,
> +                                       struct packed_git *pack,
> +                                       uint32_t pos,
> +                                       void *data)
> +{
> +       struct packed_oid_list *list = (struct packed_oid_list*)data;
> +       enum object_type type;
> +       unsigned long size;
> +       void *inner_data;
> +       off_t offset = nth_packed_object_offset(pack, pos);
> +       inner_data = unpack_entry(pack, offset, &type, &size);
> +
> +       if (inner_data)
> +               free(inner_data);

The condition is unnecessary: free() handles a NULL argument just fine.

(Suggested by Coccinelle and 'contrib/coccinelle/free.cocci.patch'.)
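
As a small illustration of the point (not code from the patch), here is
a sketch that frees a mix of allocated and NULL pointers without any
guard. The helper name release_buffers() is hypothetical.

```c
#include <stdlib.h>

/*
 * Hypothetical helper, for illustration only: releases every buffer in
 * bufs[] (entries may be NULL) and returns how many slots it visited.
 */
static int release_buffers(void **bufs, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		/* No "if (bufs[i])" guard: free(NULL) is a no-op. */
		free(bufs[i]);
		bufs[i] = NULL;
	}
	return nr;
}
```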


* Re: [PATCH v4 03/13] commit-graph: create git-commit-graph builtin
  2018-02-19 18:53       ` [PATCH v4 03/13] commit-graph: create git-commit-graph builtin Derrick Stolee
  2018-02-20 21:51         ` Junio C Hamano
@ 2018-02-26 16:25         ` SZEDER Gábor
  2018-02-26 17:08           ` Derrick Stolee
  1 sibling, 1 reply; 146+ messages in thread
From: SZEDER Gábor @ 2018-02-26 16:25 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, git, git, peff, jonathantanmy, sbeller,
	gitster, Derrick Stolee

> Teach git the 'commit-graph' builtin that will be used for writing and
> reading packed graph files. The current implementation is mostly
> empty, except for an '--object-dir' option.

Since 'git commit-graph' is a builtin command, it shouldn't show up in
completion when doing 'git co<TAB>'.
Please squash in the patch below to make it so.

Furthermore, please have a look at
  
  https://public-inbox.org/git/20180202160132.31550-1-szeder.dev@gmail.com/

for another one-line change.


diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash
index 17929b0809..fafed13c06 100644
--- a/contrib/completion/git-completion.bash
+++ b/contrib/completion/git-completion.bash
@@ -841,6 +841,7 @@ __git_list_porcelain_commands ()
 		check-ref-format) : plumbing;;
 		checkout-index)   : plumbing;;
 		column)           : internal helper;;
+		commit-graph)     : plumbing;;
 		commit-tree)      : plumbing;;
 		count-objects)    : infrequent;;
 		credential)       : credentials;;



* Re: [PATCH v4 03/13] commit-graph: create git-commit-graph builtin
  2018-02-26 16:25         ` SZEDER Gábor
@ 2018-02-26 17:08           ` Derrick Stolee
  0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-26 17:08 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: git, git, peff, jonathantanmy, sbeller, gitster, Derrick Stolee

On 2/26/2018 11:25 AM, SZEDER Gábor wrote:
>> Teach git the 'commit-graph' builtin that will be used for writing and
>> reading packed graph files. The current implementation is mostly
>> empty, except for an '--object-dir' option.
> Since 'git commit-graph' is a builtin command, it shouldn't show up in
> completion when doing 'git co<TAB>'.
> Please squash in the patch below to make it so.
>
> Furthermore, please have a look at
>    
>    https://public-inbox.org/git/20180202160132.31550-1-szeder.dev@gmail.com/
>
> for another one-line change.
>
>
> diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash
> index 17929b0809..fafed13c06 100644
> --- a/contrib/completion/git-completion.bash
> +++ b/contrib/completion/git-completion.bash
> @@ -841,6 +841,7 @@ __git_list_porcelain_commands ()
>   		check-ref-format) : plumbing;;
>   		checkout-index)   : plumbing;;
>   		column)           : internal helper;;
> +		commit-graph)     : plumbing;;
>   		commit-tree)      : plumbing;;
>   		count-objects)    : infrequent;;
>   		credential)       : credentials;;

Thanks for this, and the reminder. I made these changes locally, so they 
will be in v5.

-Stolee


* Re: [PATCH v4 04/13] commit-graph: implement write_commit_graph()
  2018-02-19 18:53       ` [PATCH v4 04/13] commit-graph: implement write_commit_graph() Derrick Stolee
  2018-02-20 22:57         ` Junio C Hamano
  2018-02-26 16:10         ` SZEDER Gábor
@ 2018-02-28 18:47         ` Junio C Hamano
  2 siblings, 0 replies; 146+ messages in thread
From: Junio C Hamano @ 2018-02-28 18:47 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> diff --git a/commit-graph.h b/commit-graph.h
> new file mode 100644
> index 0000000..dc8c73a
> --- /dev/null
> +++ b/commit-graph.h
> @@ -0,0 +1,7 @@
> +#ifndef COMMIT_GRAPH_H
> +#define COMMIT_GRAPH_H
> +
> +extern char *write_commit_graph(const char *obj_dir);
> +
> +#endif
> +

Trailing blank line at the end of the file.  Remove it.

t5318 has the same issue.

Thanks.


* Re: [PATCH v4 00/13] Serialized Git Commit Graph
  2018-02-19 18:53     ` [PATCH v4 00/13] " Derrick Stolee
                         ` (12 preceding siblings ...)
  2018-02-19 18:53       ` [PATCH v4 13/13] commit-graph: build graph from starting commits Derrick Stolee
@ 2018-03-30 11:10       ` Jakub Narebski
  2018-04-02 13:02         ` Derrick Stolee
  13 siblings, 1 reply; 146+ messages in thread
From: Jakub Narebski @ 2018-03-30 11:10 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, gitster,
	Derrick Stolee

I hope that I am addressing the most recent version of this series.

Derrick Stolee <stolee@gmail.com> writes:

> As promised [1], this patch contains a way to serialize the commit graph.
> The current implementation defines a new file format to store the graph
> structure (parent relationships) and basic commit metadata (commit date,
> root tree OID) in order to prevent parsing raw commits while performing
> basic graph walks. For example, we do not need to parse the full commit
> when performing these walks:
>
> * 'git log --topo-order -1000' walks all reachable commits to avoid
>   incorrect topological orders, but only needs the commit message for
>   the top 1000 commits.
>
> * 'git merge-base <A> <B>' may walk many commits to find the correct
>   boundary between the commits reachable from A and those reachable
>   from B. No commit messages are needed.
>
> * 'git branch -vv' checks ahead/behind status for all local branches
>   compared to their upstream remote branches. This is essentially as
>   hard as computing merge bases for each.
>
> The current patch speeds up these calculations by injecting a check in
> parse_commit_gently() to check if there is a graph file and using that
> to provide the required metadata to the struct commit.

That's nice.

What are the assumptions about the serialized commit graph format? Does
it need to be:
 - extensible without rewriting (e.g. append-only)?
 - like the above, but may need rewriting for optimal performance?
 - extending it needs to rewrite whole file?

Excuse me if it was already asked and answered.

>
> The file format has room to store generation numbers, which will be
> provided as a patch after this framework is merged. Generation numbers
> are referenced by the design document but not implemented in order to
> make the current patch focus on the graph construction process. Once
> that is stable, it will be easier to add generation numbers and make
> graph walks aware of generation numbers one-by-one.

As the serialized commit graph format is versioned, I wonder if it would
be possible to speed up graph walks even more by adding to it FELINE
index (pair of numbers) from "Reachability Queries in Very Large Graphs:
A Fast Refined Online Search Approach" (2014) - available at
http://openproceedings.org/EDBT/2014/paper_166.pdf

The implementation would probably need adjustments to make it
unambiguous and unambiguously extensible; unless there is place for
indices that are local-only and need to be recalculated from scratch
when graph changes (to cover all graph).

>
> Here are some performance results for a copy of the Linux repository
> where 'master' has 704,766 reachable commits and is behind 'origin/master'
> by 19,610 commits.
>
> | Command                          | Before | After  | Rel % |
> |----------------------------------|--------|--------|-------|
> | log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
> | branch -vv                       |  0.42s |  0.27s | -35%  |
> | rev-list --all                   |  6.4s  |  1.0s  | -84%  |
> | rev-list --all --objects         | 32.6s  | 27.6s  | -15%  |

That's the "Rel %" of "Before", that is delta/before, isn't it?

> To test this yourself, run the following on your repo:
>
>   git config core.commitGraph true
>   git show-ref -s | git commit-graph write --set-latest --stdin-commits
>
> The second command writes a commit graph file containing every commit
> reachable from your refs. Now, all git commands that walk commits will
> check your graph first before consulting the ODB. You can run your own
> performance comparisons by toggling the 'core.commitgraph' setting.

Good.  It is nicely similar to how bitmap indices (of reachability) are
handled.

I just wonder what happens in the (rare) presence of grafts (old
mechanism), or "git replace"-d commits...

Regards,
-- 
Jakub Narębski


* Re: [PATCH v4 01/13] commit-graph: add format document
  2018-02-19 18:53       ` [PATCH v4 01/13] commit-graph: add format document Derrick Stolee
  2018-02-20 20:49         ` Junio C Hamano
  2018-02-21 19:23         ` Stefan Beller
@ 2018-03-30 13:25         ` Jakub Narebski
  2018-04-02 13:09           ` Derrick Stolee
  2 siblings, 1 reply; 146+ messages in thread
From: Jakub Narebski @ 2018-03-30 13:25 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, gitster,
	Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> +== graph-*.graph files have the following format:

What is this '*' here?

[...]
> +  The remaining data in the body is described one chunk at a time, and
> +  these chunks may be given in any order. Chunks are required unless
> +  otherwise specified.

Does Git need to understand all chunks, or could there be optional
chunks that can be safely ignored (like in PNG format)?  Though this may
be overkill, and could be left for later revision of the format if
deemed necessary.

> +
> +CHUNK DATA:
> +
> +  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
> +      The ith entry, F[i], stores the number of OIDs with first
> +      byte at most i. Thus F[255] stores the total
> +      number of commits (N).

All right, it is small enough that it can be required even for a very
small number of commits.
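
To make the size and meaning concrete, here is a rough sketch of how
such a fanout table could be filled. It only looks at the first byte of
each OID; build_fanout() and first_bytes[] are names invented for this
example, not part of the patch.

```c
#include <stdint.h>
#include <string.h>

/*
 * Fill a 256-entry fanout table (256 * 4 = 1024 bytes). first_bytes[]
 * is a hypothetical stand-in for "the first byte of each OID in the
 * graph". After the call, fanout[b] is the number of OIDs whose first
 * byte is at most b, so fanout[255] is the total number of OIDs.
 */
static void build_fanout(const unsigned char *first_bytes, uint32_t nr,
			 uint32_t fanout[256])
{
	uint32_t i;
	int b;

	memset(fanout, 0, 256 * sizeof(uint32_t));
	/* Count OIDs per first byte. */
	for (i = 0; i < nr; i++)
		fanout[first_bytes[i]]++;
	/* Turn the counts into running totals. */
	for (b = 1; b < 256; b++)
		fanout[b] += fanout[b - 1];
}
```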

> +
> +  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
> +      The OIDs for all commits in the graph, sorted in ascending order.
> +
> +  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)

Do commits need to be put here in the ascending order of OIDs?

If so, this would mean that it is not possible to add information about
new commits by only appending data and maybe overwriting some fields, I
think.  You would need to do full rewrite to insert new commit in
appropriate place.

> +    * The first H bytes are for the OID of the root tree.
> +    * The next 8 bytes are for the int-ids of the first two parents
> +      of the ith commit. Stores value 0xffffffff if no parent in that
> +      position. If there are more than two parents, the second value
> +      has its most-significant bit on and the other bits store an array
> +      position into the Large Edge List chunk.
> +    * The next 8 bytes store the generation number of the commit and
> +      the commit time in seconds since EPOCH. The generation number
> +      uses the higher 30 bits of the first 4 bytes, while the commit
> +      time uses the 32 bits of the second 4 bytes, along with the lowest
> +      2 bits of the lowest byte, storing the 33rd and 34th bit of the
> +      commit time.
> +
> +  Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
> +      This list of 4-byte values store the second through nth parents for
> +      all octopus merges. The second parent value in the commit data stores
> +      an array position within this list along with the most-significant bit
> +      on. Starting at that array position, iterate through this list of int-ids
> +      for the parents until reaching a value with the most-significant bit on.
> +      The other bits correspond to the int-id of the last parent.

All right, that is one chunk that cannot use fixed-length records; this
shouldn't matter much, as we iterate only up to the number of parents
less two.
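
By contrast, the Commit Data records quoted above are fixed-length. As
a sketch of my reading of the 8-byte generation/date field, the decoding
could look like the code below. It assumes the two 4-byte words are
stored big-endian, which is not stated in the quoted text, and the
sketch_* names are made up for this example.

```c
#include <stdint.h>

/* Read a 4-byte big-endian value (the byte order is an assumption). */
static uint32_t sketch_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Decode the 8-byte field: the upper 30 bits of the first word hold
 * the generation number; its lowest 2 bits are bits 33 and 34 of the
 * commit time, whose low 32 bits are the second word.
 */
static void sketch_decode_gen_date(const unsigned char *field,
				   uint32_t *generation, uint64_t *date)
{
	uint32_t w0 = sketch_get_be32(field);
	uint32_t w1 = sketch_get_be32(field + 4);

	*generation = w0 >> 2;
	*date = ((uint64_t)(w0 & 0x3) << 32) | w1;
}
```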

A question: what happens to the last list of parents?  Is there a
guardian value of 0xffffffff at last place?

> +
> +TRAILER:
> +
> +	H-byte HASH-checksum of all of the above.
> +

Best,
--
Jakub Narębski


* Re: [PATCH v4 00/13] Serialized Git Commit Graph
  2018-03-30 11:10       ` [PATCH v4 00/13] Serialized Git Commit Graph Jakub Narebski
@ 2018-04-02 13:02         ` Derrick Stolee
  2018-04-02 14:46           ` Jakub Narebski
  0 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-04-02 13:02 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, gitster,
	Derrick Stolee

On 3/30/2018 7:10 AM, Jakub Narebski wrote:
> I hope that I am addressing the most recent version of this series.

Hi Jakub. Thanks for the interest in this patch series.

The most-recent version is v6 [1], but I will re-roll to v7 soon (after 
v2.17.0 is marked).

[1] 
https://public-inbox.org/git/20180314192736.70602-1-dstolee@microsoft.com/T/#u

> Derrick Stolee <stolee@gmail.com> writes:
>
>> As promised [1], this patch contains a way to serialize the commit graph.
>> The current implementation defines a new file format to store the graph
>> structure (parent relationships) and basic commit metadata (commit date,
>> root tree OID) in order to prevent parsing raw commits while performing
>> basic graph walks. For example, we do not need to parse the full commit
>> when performing these walks:
>>
>> * 'git log --topo-order -1000' walks all reachable commits to avoid
>>    incorrect topological orders, but only needs the commit message for
>>    the top 1000 commits.
>>
>> * 'git merge-base <A> <B>' may walk many commits to find the correct
>>    boundary between the commits reachable from A and those reachable
>>    from B. No commit messages are needed.
>>
>> * 'git branch -vv' checks ahead/behind status for all local branches
>>    compared to their upstream remote branches. This is essentially as
>>    hard as computing merge bases for each.
>>
>> The current patch speeds up these calculations by injecting a check in
>> parse_commit_gently() to check if there is a graph file and using that
>> to provide the required metadata to the struct commit.
> That's nice.
>
> What are the assumptions about the serialized commit graph format? Does
> it need to be:
>   - extensible without rewriting (e.g. append-only)?
>   - like the above, but may need rewriting for optimal performance?
>   - extending it needs to rewrite whole file?
>
> Excuse me if it was already asked and answered.

It is not extensible without rewriting. Reducing write time was not a 
main goal, since the graph will be written only occasionally during data 
management phases (like 'gc' or 'repack'; this integration is not 
implemented yet).

>
>> The file format has room to store generation numbers, which will be
>> provided as a patch after this framework is merged. Generation numbers
>> are referenced by the design document but not implemented in order to
>> make the current patch focus on the graph construction process. Once
>> that is stable, it will be easier to add generation numbers and make
>> graph walks aware of generation numbers one-by-one.
> As the serialized commit graph format is versioned, I wonder if it would
> be possible to speed up graph walks even more by adding to it FELINE
> index (pair of numbers) from "Reachability Queries in Very Large Graphs:
> A Fast Refined Online Search Approach" (2014) - available at
> http://openproceedings.org/EDBT/2014/paper_166.pdf
>
> The implementation would probably need adjustments to make it
> unambiguous and unambiguously extensible; unless there is place for
> indices that are local-only and need to be recalculated from scratch
> when graph changes (to cover all graph).

The chunk-based format is intended to allow extra indexes like the one 
you recommend, without needing to increase the version number. Using an 
optional chunk allows older versions of Git to read the file without 
error, since the data is "extra", and newer versions can take advantage 
of the acceleration.
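
A rough sketch of what "read the file without error" could mean for a
reader: walk the chunk table, record the chunks it knows, and skip the
rest. The in-memory table layout (parallel arrays of IDs and offsets),
the struct, and all sketch_* names are assumptions made for this
example; only the four chunk IDs come from the format document.

```c
#include <stdint.h>

#define SKETCH_CHUNK_ID(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Offsets of the chunks this reader understands; 0 means "not seen". */
struct sketch_graph {
	uint64_t fanout_off, lookup_off, data_off, edges_off;
};

/*
 * Returns 0 on success, -1 if a required chunk is missing. Treating
 * offset 0 as "missing" assumes a header always precedes the chunks.
 */
static int sketch_parse_chunks(const uint32_t *ids, const uint64_t *offsets,
			       int nr, struct sketch_graph *g)
{
	int i;

	g->fanout_off = g->lookup_off = g->data_off = g->edges_off = 0;
	for (i = 0; i < nr; i++) {
		switch (ids[i]) {
		case SKETCH_CHUNK_ID('O', 'I', 'D', 'F'):
			g->fanout_off = offsets[i];
			break;
		case SKETCH_CHUNK_ID('O', 'I', 'D', 'L'):
			g->lookup_off = offsets[i];
			break;
		case SKETCH_CHUNK_ID('C', 'G', 'E', 'T'):
			g->data_off = offsets[i];
			break;
		case SKETCH_CHUNK_ID('E', 'D', 'G', 'E'):
			g->edges_off = offsets[i];
			break;
		default:
			/* Unknown chunk: extra data, ignore it. */
			break;
		}
	}
	/* The edge list is optional; the other three are required. */
	if (!g->fanout_off || !g->lookup_off || !g->data_off)
		return -1;
	return 0;
}
```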

At one point, I was investigating these reachability indexes (I read 
"SCARAB: Scaling Reachability Computation on Large Graphs" by Jin, 
Ruan, Dey, and Yu [2]) but found the question that these indexes target 
to be lacking for most of the Git uses. That is, they ask the boolean 
question "Can X reach Y?". More often, Git needs to answer "What is the 
set of commits reachable from X but not from Y" or "Topologically sort 
commits reachable from X" or "How many commits are in each part of the 
symmetric difference between reachable from X or reachable from Y?"

The case for "Can X reach Y?" is mostly for commands like 'git branch 
--contains', when 'git fetch' checks for forced-updates of branches, or 
when the server decides enough negotiation has occurred during a 'git 
fetch'. While these may be worth investigating, they also benefit 
greatly from the accelerated graph walk introduced in the current format.

I would be happy to review any effort to extend the commit-graph format 
to include such indexes, as long as the performance benefits outweigh 
the complexity to create them.

[2] 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396&rep=rep1&type=pdf

>
>> Here are some performance results for a copy of the Linux repository
>> where 'master' has 704,766 reachable commits and is behind 'origin/master'
>> by 19,610 commits.
>>
>> | Command                          | Before | After  | Rel % |
>> |----------------------------------|--------|--------|-------|
>> | log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
>> | branch -vv                       |  0.42s |  0.27s | -35%  |
>> | rev-list --all                   |  6.4s  |  1.0s  | -84%  |
>> | rev-list --all --objects         | 32.6s  | 27.6s  | -15%  |
> That's the "Rel %" of "Before", that is delta/before, isn't it?

I do mean the relative change.
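
Spelled out, that is 100 * (after - before) / before. A tiny sketch
(rel_percent() is just a name for this example):

```c
/* Relative change of "after" with respect to "before", in percent. */
static double rel_percent(double before, double after)
{
	return 100.0 * (after - before) / before;
}
```

For the first row of the table, rel_percent(5.9, 0.7) is about -88.1,
which the table shows as -88%.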

>
>> To test this yourself, run the following on your repo:
>>
>>    git config core.commitGraph true
>>    git show-ref -s | git commit-graph write --set-latest --stdin-commits
>>
>> The second command writes a commit graph file containing every commit
>> reachable from your refs. Now, all git commands that walk commits will
>> check your graph first before consulting the ODB. You can run your own
>> performance comparisons by toggling the 'core.commitgraph' setting.
> Good.  It is nicely similar to how bitmap indices (of reachability) are
> handled.
>
> I just wonder what happens in the (rare) presence of grafts (old
> mechanism), or "git replace"-d commits...

In the design document, I mention that the current implementation does 
not work with grafts (it will ignore them). A later patch will refactor 
the graft code so we can access it from the commit-graph parsing of a 
commit without copy-pasting the code out of parse_commit_gently().

The commit-graph is only a compact representation of the object 
database. If a commit is replaced with 'git replace' before 'git 
commit-graph write' then the commit-graph write will write the replaced 
object. I haven't tested what happens when a commit-graph is written and 
then a commit is replaced, but my guess is that the replacement does not 
occur until a full parse is attempted (i.e. when reading author or 
commit message information). This will lead to unknown results.

Thanks for pointing out the interaction with 'git replace'. I have items 
to fix grafts and replaced commits before integrating commit-graph 
writes into automatic actions like 'gc.auto'.

Thanks,
-Stolee


* Re: [PATCH v4 01/13] commit-graph: add format document
  2018-03-30 13:25         ` Jakub Narebski
@ 2018-04-02 13:09           ` Derrick Stolee
  2018-04-02 14:09             ` Jakub Narebski
  0 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-04-02 13:09 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, gitster,
	Derrick Stolee

On 3/30/2018 9:25 AM, Jakub Narebski wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> +== graph-*.graph files have the following format:
> What is this '*' here?

No longer necessary. It used to be a placeholder for a hash value, but 
now the graph is stored in objects/info/commit-graph.

>
> [...]
>> +  The remaining data in the body is described one chunk at a time, and
>> +  these chunks may be given in any order. Chunks are required unless
>> +  otherwise specified.
> Does Git need to understand all chunks, or could there be optional
> chunks that can be safely ignored (like in PNG format)?  Though this may
> be overkill, and could be left for later revision of the format if
> deemed necessary.

In v6, the format and design documents are edited to make clear the use 
of optional chunks, specifically for future extension without increasing 
the version number.

>
>> +
>> +CHUNK DATA:
>> +
>> +  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
>> +      The ith entry, F[i], stores the number of OIDs with first
>> +      byte at most i. Thus F[255] stores the total
>> +      number of commits (N).
> All right, it is small enough that it can be required even for a very
> small number of commits.
>
>> +
>> +  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
>> +      The OIDs for all commits in the graph, sorted in ascending order.
>> +
>> +  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
> Do commits need to be put here in the ascending order of OIDs?

Yes.

> If so, this would mean that it is not possible to add information about
> new commits by only appending data and maybe overwriting some fields, I
> think.  You would need to do full rewrite to insert new commit in
> appropriate place.

That is the idea. This file is not updated with every new commit, but 
instead will be updated on some scheduled cleanup events. The 
commit-graph file is designed in a way to be non-critical, and not tied 
to the packfile layout. This allows flexibility for when to do the write.

For example, in GVFS, we will write a new commit-graph when there are 
new daily prefetch packs.

This could also integrate with 'gc' and 'repack' so whenever they are 
triggered the commit-graph is written as well.

Commits that do not exist in the commit-graph file will load from the 
object database as normal (after a failed lookup in the commit-graph file).
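
As a sketch of that lookup (not the code from the patch): use the
fanout table to narrow the range, binary search the sorted OID list,
and report "not found" so the caller can fall back to the object
database. The sketch_* names and the fixed H = 20 are assumptions for
this example.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_HASH_LEN 20 /* assumption: H = 20 bytes */

/*
 * Return the position of oid in the sorted OID Lookup table, or -1 if
 * the commit is not in the graph (the caller would then load it from
 * the object database as usual).
 */
static int64_t sketch_find_commit(const uint32_t fanout[256],
				  const unsigned char *oid_lookup,
				  const unsigned char *oid)
{
	/* OIDs with first byte oid[0] occupy positions [lo, hi). */
	uint32_t lo = oid[0] ? fanout[oid[0] - 1] : 0;
	uint32_t hi = fanout[oid[0]];

	while (lo < hi) {
		uint32_t mi = lo + (hi - lo) / 2;
		int cmp = memcmp(oid,
				 oid_lookup + (size_t)mi * SKETCH_HASH_LEN,
				 SKETCH_HASH_LEN);

		if (!cmp)
			return mi;
		if (cmp > 0)
			lo = mi + 1;
		else
			hi = mi;
	}
	return -1;
}
```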

>> +    * The first H bytes are for the OID of the root tree.
>> +    * The next 8 bytes are for the int-ids of the first two parents
>> +      of the ith commit. Stores value 0xffffffff if no parent in that
>> +      position. If there are more than two parents, the second value
>> +      has its most-significant bit on and the other bits store an array
>> +      position into the Large Edge List chunk.
>> +    * The next 8 bytes store the generation number of the commit and
>> +      the commit time in seconds since EPOCH. The generation number
>> +      uses the higher 30 bits of the first 4 bytes, while the commit
>> +      time uses the 32 bits of the second 4 bytes, along with the lowest
>> +      2 bits of the lowest byte, storing the 33rd and 34th bit of the
>> +      commit time.
>> +
>> +  Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
>> +      This list of 4-byte values store the second through nth parents for
>> +      all octopus merges. The second parent value in the commit data stores
>> +      an array position within this list along with the most-significant bit
>> +      on. Starting at that array position, iterate through this list of int-ids
>> +      for the parents until reaching a value with the most-significant bit on.
>> +      The other bits correspond to the int-id of the last parent.
> All right, that is one chunk that cannot use fixed-length records; this
> shouldn't matter much, as we iterate only up to the number of parents
> less two.

Less one: the second "parent" column of the commit data chunk is used to 
point into this list, so (P-1) parents are in this chunk for a commit 
with P parents.

> A question: what happens to the last list of parents?  Is there a
> guardian value of 0xffffffff at last place?

The termination condition is in the position of the last parent, since 
the most-significant bit is on. The other 31 bits contain the int-id of 
the parent.
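
In code, the walk could look roughly like the sketch below. It works on
values already converted to host byte order, and the sketch_* names and
macros are invented for this example rather than taken from the patch.

```c
#include <stdint.h>

#define SKETCH_LAST_EDGE 0x80000000u /* most-significant bit */
#define SKETCH_POS_MASK  0x7fffffffu /* the other 31 bits */

/*
 * Copy the int-ids of parents 2..P of one octopus merge into out[],
 * starting at position "start" in the edge list. Returns the number of
 * parents copied (P - 1), or -1 if out[] is too small or the list ends
 * before an entry with the most-significant bit is seen.
 */
static int sketch_read_extra_parents(const uint32_t *edges,
				     uint32_t nr_edges, uint32_t start,
				     uint32_t *out, int out_alloc)
{
	int n = 0;
	uint32_t i;

	for (i = start; i < nr_edges; i++) {
		if (n == out_alloc)
			return -1;
		out[n++] = edges[i] & SKETCH_POS_MASK;
		/* The last parent carries the terminating bit. */
		if (edges[i] & SKETCH_LAST_EDGE)
			return n;
	}
	return -1;
}
```

Because the terminator is folded into the last real entry, no separate
sentinel value is stored between lists.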

Thanks,
-Stolee



* Re: [PATCH v4 01/13] commit-graph: add format document
  2018-04-02 13:09           ` Derrick Stolee
@ 2018-04-02 14:09             ` Jakub Narebski
  0 siblings, 0 replies; 146+ messages in thread
From: Jakub Narebski @ 2018-04-02 14:09 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, gitster,
	Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> On 3/30/2018 9:25 AM, Jakub Narebski wrote:
>> Derrick Stolee <stolee@gmail.com> writes:
>>
>>> +== graph-*.graph files have the following format:
>> What is this '*' here?
>
> No longer necessary. It used to be a placeholder for a hash value, but
> now the graph is stored in objects/info/commit-graph.

All right.

Excuse me replying to v4 instead of v6 of the patch series, where it
would be answered or rather made moot already.

>>
>> [...]
>>> +  The remaining data in the body is described one chunk at a time, and
>>> +  these chunks may be given in any order. Chunks are required unless
>>> +  otherwise specified.
>> Does Git need to understand all chunks, or could there be optional
>> chunks that can be safely ignored (like in PNG format)?  Though this may
>> be overkill, and could be left for later revision of the format if
>> deemed necessary.
>
> In v6, the format and design documents are edited to make clear the
> use of optional chunks, specifically for future extension without
> increasing the version number.

That's good.

>>> +CHUNK DATA:
>>> +
>>> +  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
>>> +      The ith entry, F[i], stores the number of OIDs with first
>>> +      byte at most i. Thus F[255] stores the total
>>> +      number of commits (N).
>> All right, it is small enough that it can be required even for a very
>> small number of commits.
>>
>>> +
>>> +  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
>>> +      The OIDs for all commits in the graph, sorted in ascending order.
>>> +
>>> +  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
>> Do commits need to be put here in the ascending order of OIDs?
>
> Yes.
>
>> If so, this would mean that it is not possible to add information about
>> new commits by only appending data and maybe overwriting some fields, I
>> think.  You would need to do full rewrite to insert new commit in
>> appropriate place.
>
> That is the idea. This file is not updated with every new commit, but
> instead will be updated on some scheduled cleanup events. The
> commit-graph file is designed in a way to be non-critical, and not
> tied to the packfile layout. This allows flexibility for when to do
> the write.
>
> For example, in GVFS, we will write a new commit-graph when there are
> new daily prefetch packs.
>
> This could also integrate with 'gc' and 'repack' so whenever they are
> triggered the commit-graph is written as well.

I wonder if it would be possible to use existing hooks...

> Commits that do not exist in the commit-graph file will load from the
> object database as normal (after a failed lookup in the commit-graph
> file).

Ah. I thought wrongly that it would (or at least could) be something
that can be kept up to date, and extended when adding any new commit.

>>> +    * The first H bytes are for the OID of the root tree.
>>> +    * The next 8 bytes are for the int-ids of the first two parents
>>> +      of the ith commit. Stores value 0xffffffff if no parent in that
>>> +      position. If there are more than two parents, the second value
>>> +      has its most-significant bit on and the other bits store an array
>>> +      position into the Large Edge List chunk.
>>> +    * The next 8 bytes store the generation number of the commit and
>>> +      the commit time in seconds since EPOCH. The generation number
>>> +      uses the higher 30 bits of the first 4 bytes, while the commit
>>> +      time uses the 32 bits of the second 4 bytes, along with the lowest
>>> +      2 bits of the lowest byte, storing the 33rd and 34th bit of the
>>> +      commit time.
>>> +
>>> +  Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
>>> +      This list of 4-byte values store the second through nth parents for
>>> +      all octopus merges. The second parent value in the commit data stores
>>> +      an array position within this list along with the most-significant bit
>>> +      on. Starting at that array position, iterate through this list of int-ids
>>> +      for the parents until reaching a value with the most-significant bit on.
>>> +      The other bits correspond to the int-id of the last parent.
>>
>> All right, that is one chunk that cannot use fixed-length records; this
>> shouldn't matter much, as we iterate only up to the number of parents
>> less two.
>
> Less one: the second "parent" column of the commit data chunk is used
> to point into this list, so (P-1) parents are in this chunk for a
> commit with P parents.

Right.

>> A question: what happens to the last list of parents?  Is there a
>> guardian value of 0xffffffff at last place?
>
> The termination condition is in the position of the last parent, since
> the most-significant bit is on. The other 31 bits contain the int-id
> of the parent.

Ah. I misunderstood the format: I thought that the first entry is
marked with the most-significant bit set to 1, and all the rest with 0.
Instead, it is the last entry (last parent) that has the most-significant
bit set, while all others (if any) do not. So there is no need for a
guardian value.
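
As an illustration of the encoding described above, here is a minimal
Python sketch of reading the parents of one commit. The name
decode_parents and the in-memory list large_edges are hypothetical; the
list stands in for the already-parsed 4-byte values of the Large Edge
List chunk.

```python
NO_PARENT = 0xFFFFFFFF  # "no parent in that position"
MSB = 0x80000000        # most-significant bit used as a flag
LOW31 = 0x7FFFFFFF      # remaining 31 bits

def decode_parents(first, second, large_edges):
    """Return the parent int-ids for one row of the Commit Data chunk."""
    if first == NO_PARENT:
        return []
    parents = [first]
    if second == NO_PARENT:          # must be checked before the MSB test
        return parents
    if not second & MSB:             # ordinary two-parent merge
        parents.append(second)
        return parents
    # Octopus merge: the low 31 bits are a position in the Large Edge List.
    pos = second & LOW31
    while True:
        value = large_edges[pos]
        parents.append(value & LOW31)
        if value & MSB:              # the last parent carries the flag
            return parents
        pos += 1

# Two octopus merges: parents (1, 2, 3) and parents (4, 7, 8, 9).
large_edges = [2, 3 | MSB, 7, 8, 9 | MSB]
```

A commit with P parents contributes P-1 entries to the list, and the
flagged last entry doubles as the terminator.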

Best regards,
-- 
Jakub Narębski


* Re: [PATCH v4 00/13] Serialized Git Commit Graph
  2018-04-02 13:02         ` Derrick Stolee
@ 2018-04-02 14:46           ` Jakub Narebski
  2018-04-02 15:02             ` Derrick Stolee
  0 siblings, 1 reply; 146+ messages in thread
From: Jakub Narebski @ 2018-04-02 14:46 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, gitster,
	Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> On 3/30/2018 7:10 AM, Jakub Narebski wrote:
>> I hope that I am addressing the most recent version of this series.
>
> Hi Jakub. Thanks for the interest in this patch series.
>
> The most-recent version is v6 [1], but I will re-roll to v7 soon
> (after v2.17.0 is marked).
>
> [1] https://public-inbox.org/git/20180314192736.70602-1-dstolee@microsoft.com/T/#u

Ooops.  Sorry about that.

>> Derrick Stolee <stolee@gmail.com> writes:
[...]

>> What are the assumptions about the serialized commit graph format? Does
>> it need to be:
>>   - extensible without rewriting (e.g. append-only)?
>>   - like the above, but may need rewriting for optimal performance?
>>   - extending it needs to rewrite whole file?
>>
>> Excuse me if it was already asked and answered.
>
> It is not extensible without rewriting. Reducing write time was not a
> main goal, since the graph will be written only occasionally during
> data management phases (like 'gc' or 'repack'; this integration is not
> implemented yet).

Ah.  I thought that it could be something easily extensible in-place,
and thus easy to keep up to date on each commit.

Recalculating it on 'gc' or 'repack' is still good, especially since it
works even when some commits are outside the commit-graph and lack this
information.

>>
>>> The file format has room to store generation numbers, which will be
>>> provided as a patch after this framework is merged. Generation numbers
>>> are referenced by the design document but not implemented in order to
>>> make the current patch focus on the graph construction process. Once
>>> that is stable, it will be easier to add generation numbers and make
>>> graph walks aware of generation numbers one-by-one.
>>>
>> As the serialized commit graph format is versioned, I wonder if it would
>> be possible to speed up graph walks even more by adding to it FELINE
>> index (pair of numbers) from "Reachability Queries in Very Large Graphs:
>> A Fast Refined Online Search Approach" (2014) - available at
>> http://openproceedings.org/EDBT/2014/paper_166.pdf
>>
>> The implementation would probably need adjustments to make it
>> unambiguous and unambiguously extensible; unless there is place for
>> indices that are local-only and need to be recalculated from scratch
>> when graph changes (to cover all graph).
>
> The chunk-based format is intended to allow extra indexes like the one
> you recommend, without needing to increase the version number. Using
> an optional chunk allows older versions of Git to read the file
> without error, since the data is "extra", and newer versions can take
> advantage of the acceleration.

That's good.

> At one point, I was investigating these reachability indexes (I read
> "SCARAB: Scaling Reachability Computation on Large Graphs" by Jin,
> Ruan, Dey, and Xu [2]) but find the question that these indexes target
> to be lacking for most of the Git uses. That is, they ask the boolean
> question "Can X reach Y?". More often, Git needs to answer "What is
> the set of commits reachable from X but not from Y" or "Topologically
> sort commits reachable from X" or "How many commits are in each part
> of the symmetric difference between reachable from X or reachable from
> Y?"

In the FELINE-index paper ("Reachability Queries in Very Large Graphs..."
by Veloso, Cerf, Meira and Zaki), the authors mention SCARAB as something
that can be used in addition to the FELINE index, as complementary data
(FELINE-SCARAB in the paper, section 4.4).

I see the FELINE-index as a stronger form of generation numbers (also
called the level of the vertex / node), in that it allows negative-cutting
even more, pruning paths that are known to be unreachable (or marking nodes
known to be unreachable in the "calculate difference" scenario).

Also, FELINE-index uses two integer numbers (coordinates in 2d space);
one of those indices can be the topological numbering (topological
sorting order) of nodes in the commit graph.  That would help to answer
even more Git questions.

> The case for "Can X reach Y?" is mostly for commands like 'git branch
> --contains', when 'git fetch' checks for forced-updates of branches,
> or when the server decides enough negotiation has occurred during a
> 'git fetch'. While these may be worth investigating, they also benefit
> greatly from the accelerated graph walk introduced in the current
> format.
>
> I would be happy to review any effort to extend the commit-graph
> format to include such indexes, as long as the performance benefits
> outweigh the complexity to create them.
>
> [2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396&rep=rep1&type=pdf

The complexity of calculating FELINE index is O(|V| log(|V|) + |E|), the
storage complexity is 2*|V|.

>>
>>> Here are some performance results for a copy of the Linux repository
>>> where 'master' has 704,766 reachable commits and is behind 'origin/master'
>>> by 19,610 commits.
>>>
>>> | Command                          | Before | After  | Rel % |
>>> |----------------------------------|--------|--------|-------|
>>> | log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
>>> | branch -vv                       |  0.42s |  0.27s | -35%  |
>>> | rev-list --all                   |  6.4s  |  1.0s  | -84%  |
>>> | rev-list --all --objects         | 32.6s  | 27.6s  | -15%  |
>>
>> That's the "Rel %" of "Before", that is delta/before, isn't it?
>
> I do mean the relative change.

But is it relative to the state before, or relative to the state after?
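
The table itself seems to settle this. A small Python check, using only
the numbers from the table (the function names are illustrative), shows
that dividing by the "Before" time reproduces the Rel % column once the
result is truncated toward zero, while dividing by the "After" time would
give values far beyond -100%.

```python
# (before, after) timings in seconds, copied from the table above.
rows = {
    "log --oneline --topo-order -1000": (5.9, 0.7),
    "branch -vv": (0.42, 0.27),
    "rev-list --all": (6.4, 1.0),
    "rev-list --all --objects": (32.6, 27.6),
}

def rel_to_before(before, after):
    """Change expressed as a percentage of the 'Before' time."""
    return (after - before) / before * 100

def rel_to_after(before, after):
    """Change expressed as a percentage of the 'After' time."""
    return (after - before) / after * 100

for command, (before, after) in rows.items():
    print(f"{command:34s} {rel_to_before(before, after):8.1f}% "
          f"{rel_to_after(before, after):8.1f}%")
```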

[...]
>> I just wonder what happens in the (rare) presence of grafts (old
>> mechanism), or "git replace"-d commits...
>
> In the design document, I mention that the current implementation does
> not work with grafts (it will ignore them). A later patch will
> refactor the graft code so we can access it from the commit-graph
> parsing of a commit without copy-pasting the code out of
> parse_commit_gently().
>
> The commit-graph is only a compact representation of the object
> database. If a commit is replaced with 'git replace' before 'git
> commit-graph write' then the commit-graph write will write the
> replaced object. I haven't tested what happens when a commit-graph is
> written and then a commit is replaced, but my guess is that the
> replacement does not occur until a full parse is attempted (i.e. when
> reading author or commit message information). This will lead to
> unknown results.
>
> Thanks for pointing out the interaction with 'git replace'. I have
> items to fix grafts and replaced commits before integrating
> commit-graph writes into automatic actions like 'gc.auto'.

Note that you can make Git ignore replacements with the appropriate
command-line option for the "git" wrapper (--no-replace-objects); the
transfer mechanism can safely ignore replacements (treating refs/replace/
just like it would any other refs/).

Best regards,
--
Jakub Narębski


* Re: [PATCH v4 00/13] Serialized Git Commit Graph
  2018-04-02 14:46           ` Jakub Narebski
@ 2018-04-02 15:02             ` Derrick Stolee
  2018-04-02 17:35               ` Stefan Beller
  2018-04-07 22:37               ` Jakub Narebski
  0 siblings, 2 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-04-02 15:02 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, gitster,
	Derrick Stolee

On 4/2/2018 10:46 AM, Jakub Narebski wrote:
> Derrick Stolee <stolee@gmail.com> writes:
[...]
>> At one point, I was investigating these reachability indexes (I read
>> "SCARAB: Scaling Reachability Computation on Large Graphs" by Jin,
>> Ruan, Dey, and Xu [2]) but find the question that these indexes target
>> to be lacking for most of the Git uses. That is, they ask the boolean
>> question "Can X reach Y?". More often, Git needs to answer "What is
>> the set of commits reachable from X but not from Y" or "Topologically
>> sort commits reachable from X" or "How many commits are in each part
>> of the symmetric difference between reachable from X or reachable from
>> Y?"
> In the "Reachability Queries in Very Large Graphs..." by Veloso, Cerf,
> Meira and Zaki FELINE-index work, authors mention SCARAB as something
> that can be used in addition to FELINE-index, as a complementary data
> (FELINE-SCARAB in the work, section 4.4).
>
> I see the FELINE-index as a stronger form of generation numbers (called
> also level of the vertex / node), in that it allows to negative-cut even
> more, pruning paths that are known to be unreachable (or marking nodes
> known to be unreachable in the "calculate difference" scenario).
>
> Also, FELINE-index uses two integer numbers (coordinates in 2d space);
> one of those indices can be the topological numbering (topological
> sorting order) of nodes in the commit graph.  That would help to answer
> even more Git questions.

This two-dimensional generation number is helpful for non-reachability 
queries, but is something that needs the "full" commit graph in order to 
define the value for a single commit (hence the O(N lg N) performance 
mentioned below). Generation numbers are effective while being easy to 
compute and immutable.

I wonder if FELINE was compared directly to a one-dimensional index (I 
apologize that I have not read the paper in detail, so I don't 
understand the indexes they compare against). It also appears the graphs 
they use for their tests are not commit graphs, which have a different 
shape than many of the digraphs studied by that work.

This is all to say: I would love to see an interesting study in this 
direction, specifically comparing this series' definition of generation 
numbers to the 2-dimensional system in FELINE, and on a large sample of 
commit graphs available in open-source data sets (Linux kernel, Git, 
etc.) and possibly on interesting closed-source data sets.

>
>> The case for "Can X reach Y?" is mostly for commands like 'git branch
>> --contains', when 'git fetch' checks for forced-updates of branches,
>> or when the server decides enough negotiation has occurred during a
>> 'git fetch'. While these may be worth investigating, they also benefit
>> greatly from the accelerated graph walk introduced in the current
>> format.
>>
>> I would be happy to review any effort to extend the commit-graph
>> format to include such indexes, as long as the performance benefits
>> outweigh the complexity to create them.
>>
>> [2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396&rep=rep1&type=pdf
> The complexity of calculating FELINE index is O(|V| log(|V|) + |E|), the
> storage complexity is 2*|V|.
>

This would be very easy to add as an optional chunk, since it can use 
one row per commit.

Thanks,
-Stolee


* Re: [PATCH v4 00/13] Serialized Git Commit Graph
  2018-04-02 15:02             ` Derrick Stolee
@ 2018-04-02 17:35               ` Stefan Beller
  2018-04-02 17:54                 ` Derrick Stolee
  2018-04-07 22:37               ` Jakub Narebski
  1 sibling, 1 reply; 146+ messages in thread
From: Stefan Beller @ 2018-04-02 17:35 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Jakub Narebski, git, Jeff Hostetler, Jeff King, Jonathan Tan,
	SZEDER Gábor, Junio C Hamano, Derrick Stolee

On Mon, Apr 2, 2018 at 8:02 AM, Derrick Stolee <stolee@gmail.com> wrote:
>>>
>>> I would be happy to review any effort to extend the commit-graph
>>> format to include such indexes, as long as the performance benefits
>>> outweigh the complexity to create them.
>>>
>>> [2]
>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396&rep=rep1&type=pdf
>>
>> The complexity of calculating FELINE index is O(|V| log(|V|) + |E|), the
>> storage complexity is 2*|V|.
>>
>
> This would be very easy to add as an optional chunk, since it can use one
> row per commit.

Given this discussion, I wonder if we want to include generation numbers
as a first class citizen in the current format. They could also go as an
optional chunk and we may want to discuss further if we want generation
numbers or FELINE numbers or GRAIL or SCARAB, which are all graph related
speedup mechanisms AFAICT.
In case we decide against generation numbers in the long run,
the row of mandatory generation numbers would be dead weight
that we'd need to carry.

I only glanced at the paper, but it looks like a "more advanced 2d
generation number" that seems to be able to answer questions
that gen numbers can answer, but that paper also refers
to SCARAB as well as GRAIL as the state of the art, so maybe
there are even more papers to explore?

Stefan


* Re: [PATCH v4 00/13] Serialized Git Commit Graph
  2018-04-02 17:35               ` Stefan Beller
@ 2018-04-02 17:54                 ` Derrick Stolee
  2018-04-02 18:02                   ` Stefan Beller
  0 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-04-02 17:54 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Jakub Narebski, git, Jeff Hostetler, Jeff King, Jonathan Tan,
	SZEDER Gábor, Junio C Hamano, Derrick Stolee

On 4/2/2018 1:35 PM, Stefan Beller wrote:
> On Mon, Apr 2, 2018 at 8:02 AM, Derrick Stolee <stolee@gmail.com> wrote:
>>>> I would be happy to review any effort to extend the commit-graph
>>>> format to include such indexes, as long as the performance benefits
>>>> outweigh the complexity to create them.
>>>>
>>>> [2]
>>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396&rep=rep1&type=pdf
>>> The complexity of calculating FELINE index is O(|V| log(|V|) + |E|), the
>>> storage complexity is 2*|V|.
>>>
>> This would be very easy to add as an optional chunk, since it can use one
>> row per commit.
> Given this discussion, I wonder if we want to include generation numbers
> as a first class citizen in the current format. They could also go as an
> optional chunk and we may want to discuss further if we want generation
> numbers or FELINE numbers or GRAIL or SCARAB, which are all graph related
> speedup mechanisms AFAICT.
> In case we decide against generation numbers in the long run,
> the row of mandatory generation numbers would be dead weight
> that we'd need to carry.

Currently, the format includes 8 bytes to share between the generation 
number and commit date. Due to alignment concerns, we will want to keep 
this as 8 bytes or truncate it to 4 bytes. Either we would be wasting at 
least 3 bytes or truncating dates too much (presenting the 2038 problem 
[1] since dates are signed).
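
For reference, the 8-byte layout from the format document (30 bits of
generation number, 34 bits of commit time) can be sketched in Python as
below. The helper names are hypothetical, and big-endian (network) byte
order is an assumption here, since this part of the thread does not state
the byte order.

```python
import struct

def pack_gen_date(generation, commit_time):
    """Pack a 30-bit generation number and a 34-bit commit time into 8 bytes."""
    if not 0 <= generation < (1 << 30):
        raise ValueError("generation number does not fit in 30 bits")
    if not 0 <= commit_time < (1 << 34):
        raise ValueError("commit time does not fit in 34 bits")
    # First word: generation in the higher 30 bits, time bits 33 and 34
    # in the lowest 2 bits.
    word0 = (generation << 2) | (commit_time >> 32)
    # Second word: the low 32 bits of the commit time.
    word1 = commit_time & 0xFFFFFFFF
    return struct.pack(">II", word0, word1)  # assumed network byte order

def unpack_gen_date(data):
    """Inverse of pack_gen_date."""
    word0, word1 = struct.unpack(">II", data)
    generation = word0 >> 2
    commit_time = ((word0 & 0x3) << 32) | word1
    return generation, commit_time

packed = pack_gen_date(5, (1 << 33) + 7)
```

With 34 bits the time field covers roughly 544 years from the epoch,
while both halves stay 4-byte aligned.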

> I only glanced at the paper, but it looks like a "more advanced 2d
> generation number" that seems to be able to answer questions
> that gen numbers can answer, but that paper also refers
> to SCARAB as well as GRAIL as the state of the art, so maybe
> there are even more papers to explore?

The biggest reason I can say to advance this series (and the small 
follow-up series that computes and consumes generation numbers) is that 
generation numbers are _extremely simple_. You only need to know your 
parents and their generation numbers to compute your own. These other 
reachability indexes require examining the entire graph to create "good" 
index values.
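
To illustrate how little is needed, here is a minimal Python sketch. It
assumes the usual definition (a commit with no parents has generation 1,
otherwise one more than the maximum generation of its parents), which is
not spelled out in this message; the function names and the dict-based
graph are hypothetical.

```python
def compute_generations(parents):
    """parents maps each commit to the list of its parent commits."""
    gen = {}
    for start in parents:
        stack = [start]            # explicit stack: histories can be deep
        while stack:
            commit = stack[-1]
            if commit in gen:
                stack.pop()
                continue
            missing = [p for p in parents[commit] if p not in gen]
            if missing:
                stack.extend(missing)   # compute the parents first
            else:
                gen[commit] = 1 + max((gen[p] for p in parents[commit]),
                                      default=0)
                stack.pop()
    return gen

def definitely_unreachable(gen, a, b):
    """True if b cannot be an ancestor of a, judging by generations alone."""
    return a != b and gen[a] <= gen[b]

history = {"E": ["D"], "D": ["B", "C"], "B": ["A"], "C": ["A"], "A": []}
gen = compute_generations(history)
```

Adding a new commit only requires the generation numbers of its parents,
so values already stored never change.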

The hard part about using generation numbers (or any other reachability 
index) in Git is refactoring the revision-walk machinery to take 
advantage of them; current code requires O(reachable commits) to 
topo-order instead of O(commits that will be output). I think we should 
table any discussion of these advanced indexes until that work is done 
and a valuable comparison can be done. "Premature optimization is the 
root of all evil" and all that.

Thanks,
-Stolee

[1] https://en.wikipedia.org/wiki/Year_2038_problem


* Re: [PATCH v4 00/13] Serialized Git Commit Graph
  2018-04-02 17:54                 ` Derrick Stolee
@ 2018-04-02 18:02                   ` Stefan Beller
  0 siblings, 0 replies; 146+ messages in thread
From: Stefan Beller @ 2018-04-02 18:02 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Jakub Narebski, git, Jeff Hostetler, Jeff King, Jonathan Tan,
	SZEDER Gábor, Junio C Hamano, Derrick Stolee

> Currently, the format includes 8 bytes to share between the generation
> number and commit date. Due to alignment concerns, we will want to keep this
> as 8 bytes or truncate it to 4 bytes. Either we would be wasting at least 3
> bytes or truncating dates too much (presenting the 2038 problem [1] since
> dates are signed).

Good point. I forgot about them while writing the previous email.
That is reason enough to keep the generation numbers, sorry
for the noise.

>
>> I only glanced at the paper, but it looks like a "more advanced 2d
>> generation number" that seems to be able to answer questions
>> that gen numbers can answer, but that paper also refers
>> to SCARAB as well as GRAIL as the state of the art, so maybe
>> there are even more papers to explore?
>
>
> The biggest reason I can say to advance this series (and the small follow-up
> series that computes and consumes generation numbers) is that generation
> numbers are _extremely simple_. You only need to know your parents and their
> generation numbers to compute your own. These other reachability indexes
> require examining the entire graph to create "good" index values.

Yes, that is a good point, too. Generation numbers can be computed
"commit locally" and do not need expensive setups, which the others
presumably need.

> The hard part about using generation numbers (or any other reachability
> index) in Git is refactoring the revision-walk machinery to take advantage
> of them; current code requires O(reachable commits) to topo-order instead of
> O(commits that will be output). I think we should table any discussion of
> these advanced indexes until that work is done and a valuable comparison can
> be done. "Premature optimization is the root of all evil" and all that.

agreed,

Stefan


* Re: [PATCH v4 00/13] Serialized Git Commit Graph
  2018-04-02 15:02             ` Derrick Stolee
  2018-04-02 17:35               ` Stefan Beller
@ 2018-04-07 22:37               ` Jakub Narebski
  1 sibling, 0 replies; 146+ messages in thread
From: Jakub Narebski @ 2018-04-07 22:37 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Jeff Hostetler, Jeff King, Jonathan Tan, Szeder Gábor,
	Stefan Beller, Junio C Hamano, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> On 4/2/2018 10:46 AM, Jakub Narebski wrote:
>> Derrick Stolee <stolee@gmail.com> writes:
> [...]
>> I see the FELINE-index as a stronger form of generation numbers (called
>> also level of the vertex / node), in that it allows to negative-cut even
>> more, pruning paths that are known to be unreachable (or marking nodes
>> known to be unreachable in the "calculate difference" scenario).
>>
>> Also, FELINE-index uses two integer numbers (coordinates in 2d space);
>> one of those indices can be the topological numbering (topological
>> sorting order) of nodes in the commit graph.  That would help to answer
>> even more Git questions.
>
> This two-dimensional generation number is helpful for non-reachability
> queries, but is something that needs the "full" commit graph in order
> to define the value for a single commit (hence the O(N lg N)
> performance mentioned below). Generation numbers are effective while
> being easy to compute and immutable.

O(N log N) is the cost of the sorting algorithm, namely performing a
topological sort of the commits, which Git can already do.

We can use the main idea behind the FELINE index, namely weak dominance
drawing, while perhaps changing the details.  The main idea is that if
you always draw the edges of a DAG within a selected quadrant (the default
is drawing them up and to the right), then paths will also always stay in
that quadrant; if the target node is not in the given quadrant with respect
to the source node, then the target node cannot be reached from the source
node.

For fast answering of reachability queries it is important to have a
minimal number of false positives (falsely implied paths), that is,
situations where the FELINE index implies that there could be a path but
in fact the target is not reachable from the source.  The authors propose a
heuristic: use positions in [some] topological order for the X coordinates
of the FELINE index, and generate a new, locally optimal topological sorting
to use for the Y coordinates.
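
The quadrant test can be sketched in Python as follows. This is a
simplified illustration, not the paper's exact algorithm: X comes from one
topological order, and Y from a second topological order that prefers,
among the nodes ready to be placed, the one with the largest X. All names
are hypothetical, and edges point from source to target.

```python
import heapq

def topo_positions(nodes, edges, key):
    """Kahn's algorithm; among ready nodes, place the smallest key first."""
    indegree = {n: 0 for n in nodes}
    for u in nodes:
        for v in edges.get(u, ()):
            indegree[v] += 1
    ready = [(key(n), n) for n in nodes if indegree[n] == 0]
    heapq.heapify(ready)
    position = {}
    while ready:
        _, u = heapq.heappop(ready)
        position[u] = len(position)
        for v in edges.get(u, ()):
            indegree[v] -= 1
            if indegree[v] == 0:
                heapq.heappush(ready, (key(v), v))
    return position

def feline_index(nodes, edges):
    """Give every node an (x, y) pair such that each edge goes up and right."""
    listed = {n: i for i, n in enumerate(nodes)}
    x = topo_positions(nodes, edges, key=lambda n: listed[n])
    y = topo_positions(nodes, edges, key=lambda n: -x[n])
    return {n: (x[n], y[n]) for n in nodes}

def maybe_reachable(index, source, target):
    """False means definitely unreachable; True still needs a graph walk."""
    sx, sy = index[source]
    tx, ty = index[target]
    return sx <= tx and sy <= ty

nodes = ["a", "b", "c", "d"]
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}  # a diamond
index = feline_index(nodes, edges)
```

In the diamond, b and c get swapped positions in the two orders, so
neither dominates the other and both queries between them are cut
without any walk.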


Generation numbers (which can be used together with the FELINE index, and
which the paper does use, calling them Level Filter) have the advantage of
being able to be calculated incrementally. I think this is what you
wanted to say.

I think it would be possible to calculate FELINE index incrementally,
too.  If Git's topological order is stable, at least for X coordinates
of FELINE index it would be easy.


I have started experimenting with this approach in Jupyter Notebook,
available on Google Colaboratory (you can examine it, run it and edit it
from web browser -- the default is to use remote runtime on Google
cloud).  Currently I am at the stage of reproducing the theoretical
parts of the FELINE paper.

  https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
  https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing

> I wonder if FELINE was compared directly to a one-dimensional index (I
> apologize that I have not read the paper in detail, so I don't
> understand the indexes they compare against).

They have compared FELINE using real graphs (like ArXiv, Pubmed,
CiteseerX, Uniprot150m) and synthetic DAG against:
 - GRAIL (Graph Reachability Indexing via RAndomized Interval Labeling)
 - FERRARI (Flexible and Efficient Reachability Range Assignment for
   gRaph Indexing), with interval set compression to approximate ones
 - Nuutila's INTERVAL, which extracts complete transitive closure of
   the graph and stores it using PWAH bit-vector compression
 - TF-Label (Topological Folding LABELling)

>                                                It also appears the
> graphs they use for their tests are not commit graphs, which have a
> different shape than many of the digraphs studied by that work.

I plan to check how FELINE index works for commit graphs in
above-mentioned Jupyter Notebook.

> This is all to say: I would love to see an interesting study in this
> direction, specifically comparing this series' definition of
> generation numbers to the 2-dimensional system in FELINE, and on a
> large sample of commit graphs available in open-source data sets
> (Linux kernel, Git, etc.) and possibly on interesting closed-source
> data sets.

I wonder if there are more recent works, with even faster solutions.

>>> The case for "Can X reach Y?" is mostly for commands like 'git branch
>>> --contains', when 'git fetch' checks for forced-updates of branches,
>>> or when the server decides enough negotiation has occurred during a
>>> 'git fetch'. While these may be worth investigating, they also benefit
>>> greatly from the accelerated graph walk introduced in the current
>>> format.
>>>
>>> I would be happy to review any effort to extend the commit-graph
>>> format to include such indexes, as long as the performance benefits
>>> outweigh the complexity to create them.
>>>
>>> [2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396&rep=rep1&type=pdf

I wonder if the next low-hanging branch would be to store the topological
ordering of commits.  It could be done, I think, with two chunks (or two
parts of one chunk): the first to store the position in topological order
for each commit (entries sorted by hash), the second to store the list of
commits in topological order (entries sorted by topological sort).
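
A rough Python sketch of what those two tables could contain is below.
Everything in it is hypothetical (no such chunk exists in this series);
it only shows that the two tables are inverse permutations of each other
over the int-ids.

```python
def build_topo_chunks(sorted_oids, topo_order):
    """Return (position_chunk, order_chunk) for the idea described above.

    position_chunk[int_id] is the commit's position in topological order
    (entries sorted by hash); order_chunk[pos] is the int-id of the commit
    at that topological position (entries sorted by topological order).
    """
    int_id = {oid: i for i, oid in enumerate(sorted_oids)}
    order_chunk = [int_id[oid] for oid in topo_order]
    position_chunk = [0] * len(sorted_oids)
    for pos, commit_id in enumerate(order_chunk):
        position_chunk[commit_id] = pos
    return position_chunk, order_chunk

# Toy OIDs; "cc" comes first in the (made-up) topological order.
position_chunk, order_chunk = build_topo_chunks(
    ["aa", "bb", "cc"], ["cc", "aa", "bb"])
```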

>>
>> The complexity of calculating FELINE index is O(|V| log(|V|) + |E|), the
>> storage complexity is 2*|V|.
>>
>
> This would be very easy to add as an optional chunk, since it can use
> one row per commit.

Right.

-- 
Jakub Narębski


end of thread, other threads:[~2018-04-07 22:37 UTC | newest]

Thread overview: 146+ messages
2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
2018-01-30 21:39 ` [PATCH v2 01/14] commit-graph: add format document Derrick Stolee
2018-02-01 21:44   ` Jonathan Tan
2018-01-30 21:39 ` [PATCH v2 02/14] graph: add commit graph design document Derrick Stolee
2018-01-31  2:19   ` Stefan Beller
2018-01-30 21:39 ` [PATCH v2 03/14] commit-graph: create git-commit-graph builtin Derrick Stolee
2018-02-02  0:53   ` SZEDER Gábor
2018-01-30 21:39 ` [PATCH v2 04/14] commit-graph: implement construct_commit_graph() Derrick Stolee
2018-02-01 22:23   ` Jonathan Tan
2018-02-01 23:46   ` SZEDER Gábor
2018-02-02 15:32   ` SZEDER Gábor
2018-02-05 16:06     ` Derrick Stolee
2018-02-07 15:08       ` SZEDER Gábor
2018-02-07 15:10         ` Derrick Stolee
2018-01-30 21:39 ` [PATCH v2 05/14] commit-graph: implement git-commit-graph --write Derrick Stolee
2018-02-01 23:33   ` Jonathan Tan
2018-02-02 18:36     ` Stefan Beller
2018-02-02 22:48       ` Junio C Hamano
2018-02-03  1:58         ` Derrick Stolee
2018-02-03  9:28           ` Jeff King
2018-02-05 18:48             ` Junio C Hamano
2018-02-06 18:55               ` Derrick Stolee
2018-02-01 23:48   ` SZEDER Gábor
2018-02-05 18:07     ` Derrick Stolee
2018-02-02  1:47   ` SZEDER Gábor
2018-01-30 21:39 ` [PATCH v2 06/14] commit-graph: implement git-commit-graph --read Derrick Stolee
2018-01-31  2:22   ` Stefan Beller
2018-02-02  0:02   ` SZEDER Gábor
2018-02-02  0:23   ` Jonathan Tan
2018-02-05 19:29     ` Derrick Stolee
2018-01-30 21:39 ` [PATCH v2 07/14] commit-graph: implement git-commit-graph --update-head Derrick Stolee
2018-02-02  1:35   ` SZEDER Gábor
2018-02-05 21:01     ` Derrick Stolee
2018-02-02  2:45   ` SZEDER Gábor
2018-01-30 21:39 ` [PATCH v2 08/14] commit-graph: implement git-commit-graph --clear Derrick Stolee
2018-02-02  4:01   ` SZEDER Gábor
2018-01-30 21:39 ` [PATCH v2 09/14] commit-graph: teach git-commit-graph --delete-expired Derrick Stolee
2018-02-02 15:04   ` SZEDER Gábor
2018-01-30 21:39 ` [PATCH v2 10/14] commit-graph: add core.commitgraph setting Derrick Stolee
2018-01-31 22:44   ` Igor Djordjevic
2018-02-02 16:01   ` SZEDER Gábor
2018-01-30 21:39 ` [PATCH v2 11/14] commit: integrate commit graph with commit parsing Derrick Stolee
2018-02-02  1:51   ` Jonathan Tan
2018-02-06 14:53     ` Derrick Stolee
2018-01-30 21:39 ` [PATCH v2 12/14] commit-graph: read only from specific pack-indexes Derrick Stolee
2018-01-30 21:39 ` [PATCH v2 13/14] commit-graph: close under reachability Derrick Stolee
2018-01-30 21:39 ` [PATCH v2 14/14] commit-graph: build graph from starting commits Derrick Stolee
2018-01-30 21:47 ` [PATCH v2 00/14] Serialized Git Commit Graph Stefan Beller
2018-02-01  2:34   ` Stefan Beller
2018-02-08 20:37 ` [PATCH v3 " Derrick Stolee
2018-02-08 20:37   ` [PATCH v3 01/14] commit-graph: add format document Derrick Stolee
2018-02-08 21:21     ` Junio C Hamano
2018-02-08 21:33       ` Derrick Stolee
2018-02-08 23:16         ` Junio C Hamano
2018-02-08 20:37   ` [PATCH v3 02/14] graph: add commit graph design document Derrick Stolee
2018-02-08 20:37   ` [PATCH v3 03/14] commit-graph: create git-commit-graph builtin Derrick Stolee
2018-02-08 21:27     ` Junio C Hamano
2018-02-08 21:36       ` Derrick Stolee
2018-02-08 23:21         ` Junio C Hamano
2018-02-08 20:37   ` [PATCH v3 04/14] commit-graph: implement write_commit_graph() Derrick Stolee
2018-02-08 22:14     ` Junio C Hamano
2018-02-15 18:19     ` Junio C Hamano
2018-02-15 18:23       ` Derrick Stolee
2018-02-08 20:37   ` [PATCH v3 05/14] commit-graph: implement 'git-commit-graph write' Derrick Stolee
2018-02-13 21:57     ` Jonathan Tan
2018-02-08 20:37   ` [PATCH v3 06/14] commit-graph: implement 'git-commit-graph read' Derrick Stolee
2018-02-08 23:38     ` Junio C Hamano
2018-02-08 20:37   ` [PATCH v3 07/14] commit-graph: update graph-head during write Derrick Stolee
2018-02-12 18:56     ` Junio C Hamano
2018-02-12 20:37       ` Junio C Hamano
2018-02-12 21:24         ` Derrick Stolee
2018-02-13 22:38     ` Jonathan Tan
2018-02-08 20:37   ` [PATCH v3 08/14] commit-graph: implement 'git-commit-graph clear' Derrick Stolee
2018-02-13 22:49     ` Jonathan Tan
2018-02-08 20:37   ` [PATCH v3 09/14] commit-graph: implement --delete-expired Derrick Stolee
2018-02-08 20:37   ` [PATCH v3 10/14] commit-graph: add core.commitGraph setting Derrick Stolee
2018-02-08 20:37   ` [PATCH v3 11/14] commit: integrate commit graph with commit parsing Derrick Stolee
2018-02-14  0:12     ` Jonathan Tan
2018-02-14 18:08       ` Derrick Stolee
2018-02-15 18:25     ` Junio C Hamano
2018-02-08 20:37   ` [PATCH v3 12/14] commit-graph: close under reachability Derrick Stolee
2018-02-08 20:37   ` [PATCH v3 13/14] commit-graph: read only from specific pack-indexes Derrick Stolee
2018-02-08 20:37   ` [PATCH v3 14/14] commit-graph: build graph from starting commits Derrick Stolee
2018-02-09 13:02     ` SZEDER Gábor
2018-02-09 13:45       ` Derrick Stolee
2018-02-14 18:15   ` [PATCH v3 00/14] Serialized Git Commit Graph Derrick Stolee
2018-02-14 18:27     ` Stefan Beller
2018-02-14 19:11       ` Derrick Stolee
2018-02-19 18:53     ` [PATCH v4 00/13] " Derrick Stolee
2018-02-19 18:53       ` [PATCH v4 01/13] commit-graph: add format document Derrick Stolee
2018-02-20 20:49         ` Junio C Hamano
2018-02-21 19:23         ` Stefan Beller
2018-02-21 19:45           ` Derrick Stolee
2018-02-21 19:48             ` Stefan Beller
2018-03-30 13:25         ` Jakub Narebski
2018-04-02 13:09           ` Derrick Stolee
2018-04-02 14:09             ` Jakub Narebski
2018-02-19 18:53       ` [PATCH v4 02/13] graph: add commit graph design document Derrick Stolee
2018-02-20 21:42         ` Junio C Hamano
2018-02-23 15:44           ` Derrick Stolee
2018-02-21 19:34         ` Stefan Beller
2018-02-19 18:53       ` [PATCH v4 03/13] commit-graph: create git-commit-graph builtin Derrick Stolee
2018-02-20 21:51         ` Junio C Hamano
2018-02-21 18:58           ` Junio C Hamano
2018-02-23 16:07             ` Derrick Stolee
2018-02-26 16:25         ` SZEDER Gábor
2018-02-26 17:08           ` Derrick Stolee
2018-02-19 18:53       ` [PATCH v4 04/13] commit-graph: implement write_commit_graph() Derrick Stolee
2018-02-20 22:57         ` Junio C Hamano
2018-02-23 17:23           ` Derrick Stolee
2018-02-23 19:30             ` Junio C Hamano
2018-02-23 19:48               ` Junio C Hamano
2018-02-23 20:02               ` Derrick Stolee
2018-02-26 16:10         ` SZEDER Gábor
2018-02-28 18:47         ` Junio C Hamano
2018-02-19 18:53       ` [PATCH v4 05/13] commit-graph: implement 'git-commit-graph write' Derrick Stolee
2018-02-21 19:25         ` Junio C Hamano
2018-02-19 18:53       ` [PATCH v4 06/13] commit-graph: implement git commit-graph read Derrick Stolee
2018-02-21 20:11         ` Junio C Hamano
2018-02-22 18:25           ` Junio C Hamano
2018-02-19 18:53       ` [PATCH v4 07/13] commit-graph: implement --set-latest Derrick Stolee
2018-02-22 18:31         ` Junio C Hamano
2018-02-23 17:53           ` Derrick Stolee
2018-02-19 18:53       ` [PATCH v4 08/13] commit-graph: implement --delete-expired Derrick Stolee
2018-02-21 21:34         ` Stefan Beller
2018-02-23 17:43           ` Derrick Stolee
2018-02-22 18:48         ` Junio C Hamano
2018-02-23 17:59           ` Derrick Stolee
2018-02-23 19:33             ` Junio C Hamano
2018-02-23 19:41               ` Derrick Stolee
2018-02-23 19:51                 ` Junio C Hamano
2018-02-19 18:53       ` [PATCH v4 09/13] commit-graph: add core.commitGraph setting Derrick Stolee
2018-02-19 18:53       ` [PATCH v4 10/13] commit-graph: close under reachability Derrick Stolee
2018-02-19 18:53       ` [PATCH v4 11/13] commit: integrate commit graph with commit parsing Derrick Stolee
2018-02-19 18:53       ` [PATCH v4 12/13] commit-graph: read only from specific pack-indexes Derrick Stolee
2018-02-21 22:25         ` Stefan Beller
2018-02-23 19:19           ` Derrick Stolee
2018-02-19 18:53       ` [PATCH v4 13/13] commit-graph: build graph from starting commits Derrick Stolee
2018-03-30 11:10       ` [PATCH v4 00/13] Serialized Git Commit Graph Jakub Narebski
2018-04-02 13:02         ` Derrick Stolee
2018-04-02 14:46           ` Jakub Narebski
2018-04-02 15:02             ` Derrick Stolee
2018-04-02 17:35               ` Stefan Beller
2018-04-02 17:54                 ` Derrick Stolee
2018-04-02 18:02                   ` Stefan Beller
2018-04-07 22:37               ` Jakub Narebski
