All of lore.kernel.org
 help / color / mirror / Atom feed
From: Plato Kiorpelidis <kioplato@gmail.com>
To: git <git@vger.kernel.org>
Cc: Taylor Blau <me@ttaylorr.com>,
	Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
Subject: [GSoC][RFC][Proposal] Reachability bitmap improvements
Date: Tue, 19 Apr 2022 03:10:22 +0300	[thread overview]
Message-ID: <CAO2gv81YQRTAvJOW4J9ROK5bi0xizhveLoyVbdXniRZ8N+ic0A@mail.gmail.com> (raw)

Hello all,

I completed my proposal and I'm sharing it with you. Even thought it's
close to the deadline I hope it gets reviewed. I'm looking for more
related material like mailing list conversations to include in my
proposal and learn more.
How can I further improve it? Have I missed something?

I know I didn't include any links here. The PDF version I submitted has links.

Thanks in advance, I'm sorry for the late submission,
Plato

~~~~~~~~~~

# Contact info [...]

# About me

Final year undergrad @ University of Athens, majoring Computer Science
and minoring Mathematics. Throughout my studies I was selected to help
conduct uni lab lectures voluntarily for courses: data structures in C
(4 years), oop in C++ (2 years). GPA: 8.4/10.

I was a founding member of my university department’s ACM student
chapter, with the goal to grow and bond our uni community by
organizing lectures and workshops by students, academic researchers
and professionals. Mostly I was active in open source channels where I
directed, along with others, conversations about how to get involved
and contribute to open source projects. We even launched our own dummy
project to help provide aided exposure with PRs, reviews and related
procedures, which was accompanied by a talk I conducted (pt. 1, pt. 2)
regarding the workflow with Git & Github. Covid outbreak hindered the
development of the ACM student chapter.

My computing world volunteerism does not stop there as I helped
construct the website & autograder for the cryptography course in ‘15
for NTUA ECE, a different university from the one I’m enrolled in. I
also gave a talk for first-year undergrads on how to switch from
Windows to Linux and why, including its advantages and disadvantages.

In late 2019 I was recommended by a prof. to work in my uni department
to develop systems for academic and educational purposes. I focused on
setting up and synchronizing LDAP databases, SCIM identity management,
LSC and maintenance of our GitLab instance.

# Benefits to Community

I plan to become a long term Git contributor, help develop and
influence its design and attract new contributors. I plan to give a
talk to my fellow undergrad and graduate classmates about my
experience in GSoC with Git. How they can become involved, the GSoC
program, Git’s codebase structure and how things are done within the
Git organization.

I will invite them to participate in GSoC ‘23 with Git where,
hopefully, I will mentor the selected candidates. My participation
with GSoC this year will lay the foundations for me to learn and
accumulate the necessary experience and understanding so I can give
back to GSoC, Git, my university and fellow classmates.

Throughout my GSoC journey, and after it, I’ll publish digests about
my bitmap work and related knowledge under kioplato.github.io, similar
to the ones in Git Rev News. I’ll also PR related parts to Git Rev
News. This way my work will be transparent and accessible to anyone
who’s interested. It will also serve as reference throughout my GSoC
journey and towards future GSoC candidates interested in participating
with Git organization.

Improving Git will benefit open source projects and their communities
as well, increasing productivity due to reduction in wait times as the
proposed project will, hopefully, yield non-trivial speed-up over
network operations. Git is a widely used technology throughout the
industry and further developing it will benefit everyone who’s trying
to improve our lives through technology.

# Microproject & Git related work

  –  In 2019, I planned to participate in GSoC ‘20. To familiarize
myself with Git and get involved early, I picked up this microproject
from GSoC ‘19 and submitted a patch converting remove_subtree()
function from entry.c to use dir-iterator API instead of opendir,
readdir, closedir API. This function is required to iterate directory
paths after their contents, rendering my patch invalid, as pointed out
by Matheus, since dir-iterator didn’t and still doesn’t offer such
functionality. I didn’t follow up with a dir-iterator API patch to
implement the required functionality nor did I attempt to participate
in GSoC ‘20 due to having limited time.

  –  My dream to participate in GSoC continues and I plan to
participate in GSoC ‘22. I picked up from where I left off and
introduced the first set of changes requested by this thread to
dir-iterator and submitted RFC v1. The first set of changes refer to
dir-iterator iterating directories before and/or after their contents.
My goal is to considerably improve dir-iterator, guided by the needs
of its possible customers like read_directory_recursive(), resulting
in a simpler, more maintainable codebase. Getting selected for GSoC
will put this goal on hold.

  –  While reading through the SoC documentation in git.github.com I
noticed the CI section of the General Microproject Information webpage
wasn’t updated after the switch from Travis-ci to Github actions. The
CI instructions were out of date, pointing to Travis-ci. I builded the
site, updated the section and opened a PR to git.github.com repo,
which was merged.

# Proposal Abstract

Pre Git v2.0.0, network operations like fetches and clones were
hindered by the counting objects phase. In this phase, Git needed to
calculate which objects are required to be sent. However, Git only
knows the tips of all branches, so its only option was to walk down
the graph. To deal with this, Git adopted JGit’s reachability bitmaps
in EWAH format. This method led to a significant speed-up over the
network operations, as it indexed which objects are reachable from
salient commits.

While EWAH bitmaps eliminate a performance bottleneck by indexing
which commits are reachable from a given commit, there are possible
improvements to be made. There is a sense that EWAH bitmap compression
is slow in some cases. Are Roaring bitmaps better in Git’s case? There
is strong evidence they will be. Are Roaring bitmaps faster to access
specific parts of the bitmap due to its random-access feature and how
can we use this? Is the current bitmap format sufficient to
incorporate a lookup table which specifies the commits which have a
bitmap? How can we improve it? This project attempts to answer these
questions.

The proposed project would entail a) Designing & implementing a suite
of performance tests benchmarking EWAH to validate our assumption that
the performance of bitmap decompression is worth pursuing. Compare the
two bitmap compressions and perform any necessary changes to the
.bitmap format to accommodate the new compression scheme. Speeding-up
read, write bitmap operations or reducing used memory will convince us
to switch to Roaring bitmaps. b) Further explore the idea of
introducing a small “table of contents”, indexing which commits have a
bitmap and where to find them in the .bitmap file. In this case we
will need to implement a new .bitmap format to add a variable-width
table of contents.

# Proposal Timeline

The proposed project is challenging and in combination with its
experimental nature requires diligence and careful planning. I need to
be realistic. I will need time to familiarize myself and be confident
in my understanding regarding Git’s implementation related to bitmaps,
packing objects and related material, before proceeding with the
actual changes to .bitmap format and performance testing. These
require time and effort, therefore I’ll reserve sufficient time to get
them right: I’ll dedicate 35-40 hours/week. I will reserve more if the
project requires it and I’m open to extending the duration of the
program in case the proposed goals are not achieved.

## Overview

The plan is to first validate the need for an alternate bitmap
compression algorithm. This will be verified from the performance test
suite. → If there is indeed a need, we will attempt to adopt Roaring
bitmaps instead of EWAH, and compare the two implementations.
Hopefully Roaring bitmaps will speed-up bitmap operations and convince
us to adopt. → Regardless of the necessity of an alternate bitmap
compression algorithm, the idea of introducing a “table of contents”
to the bitmap format will be explored, even post GSoC. If the
performance tests prove that EWAH is good enough, then the “table of
contents” along with a new .bitmap format will become my primary
focus.

1. Spend time understanding bitmap data structure, EWAH, Git’s bitmap
format, codebase and respective implementations. Self code and tinker
with codebase. Don’t be afraid to ask questions and make sure it’s
clear to me what’s expected.
Material: Bitmap EWAH Bitmap format Bitmaps & inverted lists Git: 1 2 3 4 5
2. Design & implement a suite of performance tests. Validate our
assumption that the performance of bitmap decompression is worth
pursuing.
Material: Git performance tests Taylor’s .bitmap format: mail-list fork
3. Implement Roaring bitmap with run containers. Think about the
architecture & design. Compare it against the existing implementation.
Adopt if bitmap operations are faster.
Material: Roaring paper Roaring spec Bitmap format Git: 1 2 3 4 5
4. (alternative) Dedicate some time to investigate and sketch out a
new .bitmap format for the “table of contents” sub-project. Current
bitmap format comes with limitations.
Material: Bitmap format Taylor’s .bitmap format: mail-list fork
Stolee’s chunk-format API
5. (alternative) Implement a new bitmap format to incorporate the
variable-width “table of contents” metadata. Implement and run related
tests to verify speed-up.
Material: Bitmap format Taylor’s .bitmap format: mail-list fork
Stolee’s chunk-format API

## Timeline

- April 19: GSoC contributor application deadline
- April 20 - May 08 (~3 weeks, 19 days, 120h): Understand codebase
Tinker with and understand Bitmap, EWAH, Git’s EWAH, Bitmap format.
Local testing and self code. Gather a good understanding of Git’s
codebase related to bitmaps in general 1 2 3 4 5. Explore related Git
commands like git gc, pack-objects, unpack-objects, repack, bundle and
other related commands. See which code paths trigger and understand
what’s happening. Ask questions to mentors Taylor & Sivaraam and wider
audience. Read documentation.
- May 09 - May 13 (~1 week, 5 days, 40h): Understand performance tests library
Understand the architecture of Git’s performance tests & existing
bitmap performance tests. Investigate which code paths trigger in the
codebase. Consult Taylor’s performance test parts: mail-list to aid
with understanding of how the performance tests work.
- May 20: Accepted GSoC contributor projects announced & Community
bonding period begins
- May 14 - June 03 (3 weeks, 21 days, 120+h): Compression perf tests &
validate assumptions
Design and implement a suite of performance tests. Consult existing
tests, README, conversations, Taylor’s previous work on .bitmap
format: mail-list fork. Validate our assumption that the performance
of bitmap decompression is worth pursuing. Discuss possible new
.bitmap format. Tie up loose ends.
- June 13: Coding officially begins
- June 04  - July 03 (4 weeks, 30 days, 160h): Implement Roaring &
compare to EWAH
Implement Roaring bitmaps with run containers. Understand Roaring
specification and other compressions. Self code and local testing to
understand Roaring format. Read Roaring paper. Minimally implement
Roaring bitmaps and compare against existing compression. Adopt if
bitmap operations are faster. Discuss possible new .bitmap format. Tie
up loose ends.
- July 04 - July 24 (3 weeks, 21 days, 120h): Bitmap perf tests &
further discuss bitmap format
Consult Taylor’s previous work on .bitmap format with commit lookup
table: mail-list fork. Implement more tests to identify & discuss
possible improvements to be made to the current .bitmap format.
Investigate how we can incorporate a variable-width table of contents
to a new .bitmap format. How does the way we read hash-cache extension
limit the current .bitmap format? Can we make it compatible with
Stolee’s chunk-format API? Further understand current .bitmap format.
Understand pack objects, deltas, related operations & code paths. Self
code and tinker.
- July 29: Phase 1 Evaluation deadline: Submit code
- July 25 - July 29 (5 days): Submit phase 1 evaluation
- July 30 - Aug 21 (4 weeks, 160h): Implement new .bitmap format ideas
& run perf tests
Experiment with new .bitmap format ideas and document possible
speed-up. Continue discussions related to .bitmap formats. Consult
previous conversations. Hopefully arrive at a satisfying new .bitmap
format. Think about its architecture & design.
- Aug 22 - Sep 04 (2 weeks, 80h): Unallocated time in case of delays &
bad time management
- Sep 05 - Sep 12 (1 week): Submit final work product & mentor evaluation.

The above timeline serves mostly as a guideline. The overall pace and
progress of the project depends on intermediate steps like
“Compression perf tests & validate assumptions”. There is a plethora
of work to be done and possible improvements to be discussed in this
area. That’s the purpose behind the multiple subprojects of the
proposed project i.e. additional flexibility.

As a possible GSoC candidate, I want to successfully complete GSoC and
get this project right. I want to explore bitmaps, their compressions
and offer as much as I can. For this reason I’ll take advantage and
start preparing from 20 April as the proposal timeline suggests. This
will give me additional flexibility, given the nature of the project
(e.g. in case EWAH compression is good enough), and confidence since
I’ll have more time to work and understand underlying implementations.
Even if I’m not selected, these conversations and my exposure to
bitmaps will help the selected candidate. Additionally, I will try to
participate in the process and help the selected candidate.

# Related Work

There is strong evidence that Roaring bitmaps, usually, perform better
than its alternatives:
 –  Guzun et al compared Concise, WAH, EWAH, Roaring (without run
containers) with a hybrid system combining EWAH and uncompressed
bitsets. In many of their tests, but not all, Roaring was superior to
Concise, WAH and EWAH.
 –  Wang et al compared a wide range of bitmap compression techniques.
They conclude their comparison by a strong recommendation in favor of
Roaring: “Use Roaring for bitmap compression whenever possible. Do not
use other bitmap compression methods such as BBC, WAH, EWAH, PLWAH,
CONCISE, VALWAH, and SBH.”
 –  Lemire et al reviewed and evaluated the CRoaring library. They
compared BitMagic, EWAH (code from this library is part of the Git
version-control tool), uncompressed bitset, WAH & Concise, C++
std::vector and std::unordered_set for hash-set performance. In their
tests Roaring performance was superior. They concluded by saying “No
single approach can be best in all cases, but Roaring offers good
all-around performance”.
 –  CRoaring is an implementation of the Roaring bitmaps in C and C++.
Will be used as reference while adopting Roaring bitmaps in Git.
 –  Bitmap implementation in C. Excellent for getting introduced into
the bitmap data structure. Will be used as reference.
 –  Roaring format specification thoroughly describes the compressed
bitmap format.
Will be consulted throughout the project’s duration.
 –  EWAH implementation in C. Excellent for getting introduced into
how things are currently implemented. Git’s EWAH implementation will
complement that.
 –  Roaring bitmaps publication: Consistently faster and smaller
compressed bitmaps with Roaring. Will be used as reference.
 –  Comparison between bitmap compressions: An Experimental Study of
Bitmap Compression vs. Inverted List Compression. Will be used for
additional context since it provides a brief explanation of all bitmap
compressions.
 –  Taylor’s previous work on .bitmap table of contents: mail-list fork
Taylor found that we could get a significant speed-up in some cases,
but that the speed-up was
basically obliterated whenever we had to do any follow-on traversal if
the bitmap coverage wasn't completely up-to-date.

Previous work that introduced EWAH bitmaps could be used for possible
additional context, exposure, related conversations, code paths and
reference.
 –  Initial patch series, designed by Shawn Pearce for JGit,
implemented by Vicent Marti
 –  Initial patch improvements by Jeff v1, v2, v3, v4.
 –  Merged initial patch: compressed bitmaps, use bitmaps when packing objects

Related Github blog posts: Counting objects, Scaling monorepo maintenance

# Biographical Information

 –  Me and Git
I have been using Git since 2015, in my work and to collaborate on uni
and fun projects. Throughout these years I’ve read the entire pro Git
book while following most parts through my terminal, reproducing the
behavior, in order to make me a better software engineer. The workshop
I conducted (pt. 1, pt. 2) regarding the workflow with Git & Github
was heavily influenced by pro Git book which is apparent in the
workshop’s pdf. I’ve also read material like hacking Git, github.blog
posts, Git Rev News and similar technical documents.

 –  Me, C and Shell
C is my native programming language. I have written 80.000+ LOC and
I’ve implemented various systems. Most notably: B+ Tree, LSH for KNN &
clustering, simple search engine, graph editor, MPI & OpenMP parallel
system, threaded server, multiprocess systems with workers, archiver,
many algorithms & data structures (kd-trees, red-black, AVL, splay
tree, linked-lists, skip lists, hashtable, etc), many of which I’ve
converted to be thread-safe. In my uni we’re C/C++ oriented.
I’ve been using Arch Linux exclusively for 4+ years to learn and
tinker with my setup. I run dwm, st, self compiled Linux kernel,
various shell scripts, btrfs, full disk encryption, lvm to configure
my computer just how I like it. I’m confident in my C & shell skills.

 – Professional Experience
Jan 2020 - June 2021 (1.5 year) GUnet
Worked on setting up and synchronizing LDAP databases, SCIM identity
management, LSC and maintenance of our GitLab instance.

 –  Teaching
2019: Workshop regarding the workflow of Git & Github. (pt. 1, pt. 2)
2017: Workshop for first-year undergrads on how to switch from Windows
to Linux and why, including its advantages & disadvantages and related
utilities like IDEs, compilers, packages.
2016 - 2020: Lab assistant courses data structures in C (4 years), oop
course in C++ (2 years).

             reply	other threads:[~2022-04-19  0:10 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-19  0:10 Plato Kiorpelidis [this message]
  -- strict thread matches above, loose matches on Subject: below --
2022-04-06 20:46 [GSoC][RFC][PROPOSAL] Reachability bitmap improvements Abhradeep Chakraborty
2022-04-06 20:09 Abhradeep Chakraborty
2022-04-13 20:03 ` Kaartic Sivaraam
2022-04-14  8:20   ` Abhradeep Chakraborty
2022-04-14 20:29     ` Kaartic Sivaraam
2022-04-15  5:23       ` Abhradeep Chakraborty
2022-04-06 20:04 Abhradeep Chakraborty
2022-04-13 19:29 ` Taylor Blau
2022-04-14  7:58   ` Abhradeep Chakraborty
2022-04-14 14:24   ` Philip Oakley
2022-04-15  5:28     ` Abhradeep Chakraborty

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAO2gv81YQRTAvJOW4J9ROK5bi0xizhveLoyVbdXniRZ8N+ic0A@mail.gmail.com \
    --to=kioplato@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=kaartic.sivaraam@gmail.com \
    --cc=me@ttaylorr.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.