From: Abhishek Kumar <abhishekkumar8222@gmail.com>
To: abhishekkumar8222@gmail.com
Cc: jnareb@gmail.com, stolee@gmail.com, git@vger.kernel.org
Subject: Re: [RFC] Metadata vs Generation Data Chunk
Date: Tue, 30 Jun 2020 20:30:56 +0530 [thread overview]
Message-ID: <20200630150056.GA4111@Abhishek-Arch> (raw)
Hello everyone,
I have finished all tests that I can think of and instead of throwing
performance numbers at you, I am going to try and tell you a story.
WRITING COMMIT GRAPH
====================
The run time of writing a commit-graph is affected by time taken to
compute generation number(s) and write them to the file.
We had the following possible ideas for writing commit-graph:
MO: Calculate generation number V5 and write the offset into CDAT
(along with a metadata chunk).
G5O: Calculate generation number V5 and write the corrected date into
Generation Data chunk.
G5D: Calculate generation number V5 and write the offset into Generation
Data chunk.
G3IO: Calculate generation number V3 and write GENERATION_NUMBER_MAX into
CDAT and the offset into Generation Data chunk.
G3ID: Calculate generation number V3 and write GENERATION_NUMBER_MAX into
CDAT and the corrected date into Generation Data chunk.
G3TO: Calculate generation number V3 and generation number V0, write
topological levels into CDAT and the offset into Generation Data
chunk.
G3TD: Calculate generation number V3 and generation number V0, write
topological levels into CDAT and the offset into Generation Data
chunk.
Key:
- M -> Metadata, G -> Generation
- 5 -> Generation Number V5, 3 -> Generation Number V3,
- T -> Topological level, I (for infinity) -> Generation Number Max
- O -> Offset, D -> Date
On the Linux repo with HEAD at 08bf1a27, time is taken by different
versions:
| Version | time Taken |
|---------|-----------------|
| master | 14.200s |
| MO | 14.094s |
| G5D | 14.406s |
| G5O | 14.323s |
| G3IO | 14.175s |
| G3ID | 14.258s |
| G3TO | 14.331s |
| G3TD | 14.419s |
Observations:
- Writing offset (i.e., 32-bits) is faster than writing corrected date
(which are 64-bit long). However, we would have to deal with overflow
issues.
The largest offset value observed was of the order 2 ** 29. Assuming
no further "skips" in offset, we can store around ~635 million
commits without overflows. This, however, might not be true for other
repositories.
A repository with a commit with Unix epoch '0' will overflow.
If we choose some sane default like commit date must be higher than
the release date of git, we can store around 110 million commits even
with '0' epoch.
- Calculating topological levels with generation number V3 takes nearly
the same time as calculating generation number V5.
- The commit-graph file is larger by 4 Mb (+8%) and 8 Mb (+16%) when
storing offsets and dates.
Reading Commit Graph
====================
The time taken in using commit-graph to answer queries depends on:
- The number of commits walked.
- The time taken to load commit data from commit-graph.
Number of commits walked for different commands:
| Command | Master | V3 | V5 |
|---------------------------------|--------|--------|--------|
| log --topo-order -10000 | 48011 | 49954 | 49933 |
| log --topo-order -100 v5.4 v5.5 | 3794 | 3314 | 3312 |
| log --topo-order -100 v4.8 v4.9 | 5736 | 3887 | 3625 |
| merge-base v5.4 v5.5 | 55053 | 57097 | 55109 |
| merge-base v4.8 v4.9 | 635579 | 167468 | 506577 |
- For log --topo-order, V3, and V5 walk 35% fewer commits than
the topological level when comparing v4.8 and v4.9.
- V3 walks far fewer commits than V5 when comparing by generation
numbers and then date for paint_down_to_common(). This is unusual
and is probably due to my implementation. Based on log --topo-order,
we might expect V5 to perform better than V3.
- V3 walks the same number of commits when compared using commit
date for merge-base.
The time taken for each command rough corresponds to the number of
commits walked, with no difference between Metadata chunk and Generation
Data chunk approaches.
| Command | Master | V3 | V5 |
|---------------------------------|---------|---------|--------|
| log --topo-order -10000 | 0.210s | 0.209s | 0.207s |
| log --topo-order -100 v5.4 v5.5 | 0.013s | 0.013s | 0.013s |
| log --topo-order -100 v4.8 v4.9 | 0.015s | 0.013s | 0.013s |
| merge-base v5.4 v5.5 | 0.048s | 0.047s | 0.047s |
| merge-base v4.8 v4.9 | 0.135s | 0.133s | 0.134s |
CONCLUSION
==========
With all this, my inital recommendation becomes more specific as:
- If overflows are unlikely to happen and not accounted for, implement
generation number V5 using metadata chunk. It has the lowest write
time and walks just as few commits as V3.
- If overflows are a concern, implement generation number V5 with
generation data chunk, storing offsets in generation data chunk.
Thanks
- Abhishek
next reply other threads:[~2020-06-30 15:03 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-06-30 15:00 Abhishek Kumar [this message]
2020-06-30 20:46 ` [RFC] Metadata vs Generation Data Chunk Derrick Stolee
2020-07-03 8:29 ` Abhishek Kumar
-- strict thread matches above, loose matches on Subject: below --
2020-06-26 13:44 Abhishek Kumar
2020-06-26 14:40 ` Derrick Stolee
2020-06-22 9:34 Abhishek Kumar
2020-06-22 13:40 ` Jakub Narębski
2020-06-22 14:46 ` Derrick Stolee
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200630150056.GA4111@Abhishek-Arch \
--to=abhishekkumar8222@gmail.com \
--cc=20200622093451.GA1719@abhishek-arch \
--cc=git@vger.kernel.org \
--cc=jnareb@gmail.com \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).