git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [GSoC][RFC][PROPOSAL] Reachability bitmap improvements
@ 2022-04-06 20:46 Abhradeep Chakraborty
  0 siblings, 0 replies; 12+ messages in thread
From: Abhradeep Chakraborty @ 2022-04-06 20:46 UTC (permalink / raw)
  To: git

Hello everyone,

Here is the initial version of my proposal on "Reachability bitmap
improvements". I would really love to have comments on it. If you
find any mistakes please notify me about that.

The proposal template is inspired by Hariom Verma's (This year's
GSoC mentor) previous year's gsoc proposal.

Thanks :)

=============================================

View on Docs: https://docs.google.com/document/d/15tZ5QbMtYN41No5wdMw-2Ilp63vquB7OqAwBCLl_0n0/edit?usp=sharing

==============================================

**Reachability bitmap improvements**

Name  :  Abhradeep Chakraborty
Major : Information Technology
Mobile No. : [ ... ]
Email : chakrabortyabhradeep79@gmail.com
GitHub Profile: (Abhra303)[https://github.com/Abhra303]
LinkedIn : www.linkedin.com/in/abhradeep-chakraborty-b36426201
Time Zone : Indian Standard Time ( IST ) ( UTC +05:30 )

## About Me

I am Abhradeep Chakraborty, currently pursuing B.Tech in Information
Technology (now in my 3rd year) at Jalpaiguri Government Engineering
College, India.

In the last two years, I mainly concentrated on learning and building
web development things. I mainly use Django, React for my projects. But
for the past 6 months, I have been learning about devops, cloud technologies
( e.g. docker, Kubernetes, AWS etc.), computer networking and core
technologies (such as git, linux, gcc etc.). I also have an interest in
some research based technologies like WebRTC, Web3 etc.

My ultimate objective is to be a dedicated and active contributor to all
of these technologies. So, I want to start this contribution journey with
`git` and therefore I want to be a part of this GSoC program.

## Me & Git

I have started exploring the Git codebase since Sept 2021. At first, I
mainly focused on knowing the git internals. During this time I read
documentations (Under `Documentation` directory). `MyFirstContribution.txt`[1],
`MyFirstObjectWalk.txt`[2] and `Hacking Git`[3] helped me to learn about the git
contribution workflow, how `git log` ( in other words `object walk`) works
internally.

Though I had read many documentations, I was still not able to fully
understand the terminologies (like `refs`, `packfiles`, `blobs`, `trees`
etc.). (ProGit)[https://git-scm.com/book/en/v2] helped me tremendously here.
I read the full book and it cleared almost every doubt that came into my mind.

I have provided my git involvement history below. Activities are sorted
by timeline from top to bottom.

==============================
PRs, Commits & Discussion Participations
==============================

---------------------------------
[patch] Amend error messages that violate git coding style convention
Github link : https://github.com/gitgitgadget/git/pull/1062
Status : Not Merged/ PR
Description : There are many error strings which do not follow coding
style guides. This PR was to fix those. There is a ticket dedicated to
this ( github Issue - (#635)[https://github.com/gitgitgadget/git/issues/635]).
The author of this issue said it would be better to manually correct
these things.
Outcome : As this was my first PR, I made many mistakes and I faced
difficulties. Though this was not merged, I learned tremendous new
things. These violating strings were in many files ( nearly 200 files).
At first, I fixed all the files at once manually. I fixed errors
generated by those changes. Ultimately, I learned about overall git
structures, how to fix errors etc.

---------------------------------
[RFC PATCH 0/1] making set-upstream have default arguments
Mailing List : https://lore.kernel.org/git/20211202144354.17416-1-chakrabortyabhradeep79@gmail.com/
Status : Not Merged/ Closed
Description : `git push -u` generally ( if `push.default` not set)
throws an error if no arguments are provided. This patch request was to
teach `git push -u` to push and set a default upstream if no arguments
are provided. There is a github
(issue)[https://github.com/gitgitgadget/git/issues/1061]
also.
Outcome : This PR let `git push -u` to push to a default branch and
set that as the upstream irrespective of `push.default` configuration.
I realized it was badly affecting the reason `push.default` is built
for ( See this (comment)[https://lore.kernel.org/git/20220104132839.1209-1-chakrabortyabhradeep79@gmail.com/]).
So, I intentionally stopped working on it.

----------------------------------
Avoid checking for -h in commands that already use parse-options API ( Issue)
Github Issue Link : https://github.com/gitgitgadget/git/issues/1125
Status : Closed

----------------------------------
[PATCH] Add usage strings ci check and amend remaining usage strings
Github PR Link : https://github.com/gitgitgadget/git/pull/1147
Mailing List Link :
https://lore.kernel.org/git/pull.1147.git.1645030949730.gitgitgadget@gmail.com/
Status : Merged into Master
Description : According to dscho’s
(comment)[https://github.com/gitgitgadget/git/issues/636#issuecomment-1018660439],
There were still some usage strings that violated the style guide.
So, this patch request fixed those strings and added a CI check to
verify all the usage strings are passing the style check. In the first
version, I built a shell script to check the validity of strings.
Outcome : The first commit was merged into master. There were some
contradictions about the different approaches suggested by reviewers
in the discussion about the last commit i.e. whether the check should
be static check or runtime check, whether I should use `parse options`
method or coccinelle etc (see the (mailing
list)[https://lore.kernel.org/git/nycvar.QRO.7.76.6.2202251632320.11118@tvgsbejvaqbjf.bet/]).
Lastly, it was decided that I should try the `coccinelle` method first
and check if it is better ( see this
(comment)[https://lore.kernel.org/git/20220304142154.2350-1-chakrabortyabhradeep79@gmail.com/]).
I will try to submit a PR regarding this before April 19.

==================================
MicroProject
==================================

[PATCH] Partial-Clone : add a partial-clone test case
Github PR Link : https://github.com/gitgitgadget/git/pull/1175
Status : Merged into Master
Github Issue : https://github.com/gitgitgadget/git/issues/827
Description : In a blobless-cloned repo, `git log –follow – <path>`
(`<path>` have an exact OID rename) shouldn’t download blobs of the
file from where the new file is renamed. This patch added a test case
to verify it.
Outcome : The test was added to the master branch. Moreover, another
issue was discovered while discussing the patch request. Junio discovered
that the test helper function `test_subcommand_inexact` was poorly designed.
Later Derrick submitted a (PR)[https://github.com/gitgitgadget/git/pull/1185]
to fix it ( now merged).

==================================
Review
==================================
[PATCH] test-lib-functions : fix test_subcommand_inexact
Github PR Link : https://github.com/gitgitgadget/git/pull/1185
Description : Junio discovered that the `test_subcommand_inexact` function
(test-helper function) is poorly designed. This PR addresses that.
My comment : Though I placed it under the `Review` section; but I didn’t
review the code here. Instead I gave my opinion regarding this in
(this comment)[https://lore.kernel.org/git/20220324181056.53824-1-chakrabortyabhradeep79@gmail.com/].

## Proposed Project

### Abstract

While sending large repositories, git has to decide what needs to be
sent, how to be sent. For large repositories, git creates pack files
where it packs all the related commits, trees, blobs( in the form of
deltas and bases ), tags etc. Now, finding the related objects among
all the objects might be very expensive.

As git only knows about the branch tips, it is very expensive to find
the related objects from all the objects if we try to traverse down
from the branch tips to all their respective related objects. It becomes
more expensive as the number of objects (i.e. commits, trees, blobs
etc.) increases. Ultimately resulting in slow fetching from the git
daemon.

To counter that, reachability bitmaps were introduced. It uses bitmap
data structure (an array of bits) to detect if an object is reachable
from a particular commit or not. The commit messages of this
(patch series)[https://lore.kernel.org/git/1372116193-32762-1-git-send-email-tanoku@gmail.com/]
are itself a documentation of how it is achieved. All the bitmap
indexes for selected commits (in a EWAH compressed form) are stored in
a `.bitmap` file which has the same name of its respective packfile.

The current task is to improve the performance of this bitmap approach.
There are three points that we need to work on -

1. Decompression of bitmaps is slow with EWAH. To know if an object is
   related to a particular commit, we have to decompress the whole thing.
   So, we have to consider other alternative techniques besides EWAH.

2. Loading a `.bitmap` file can be slow for large bitmaps. This is because
   we need to read the file sequentially in order to discover the offset of
   each bitmap within the file. It would be better if we can create a `table
   of contents` for these bitmaps.

3. Regenerating bitmaps can take a long time. One possible approach is to
   add a new mode which only adds new bitmaps for commits introduced between
   successive bitmap generations.

4. If we solve the above points, we would think of other aspects like
   rethinking how we select which commit receives bitmaps etc.

### Previous Work

I didn’t find any previous patches regarding point no. 1. But there were
some discussions about this. Derrick’s
(comment)[https://lore.kernel.org/git/1685a358-f033-64e0-2e12-df3a1c10af19@gmail.com/]
suggested considering the
(Roaring+Run)[https://roaringbitmap.org/about/]
technique. It is much faster than the currently implemented
(EWAH)[https://arxiv.org/abs/0901.3751].

Taylor initiated the `developing a table of contents for bitmap` work
by creating a
(branch and committing some
patches)[https://github.com/git/git/compare/master...ttaylorr:tb/bitmap-commit-table]
in that branch. From the
(discussion)[https://lore.kernel.org/git/YNuiM8TR5evSeNsN@nand.local/]
and the performance test results provided in that discussion,  it seems
that it would improve overall performance. So, this would work as a
reference for me while developing the table of contents for bitmaps
(obviously after running performance tests and discussing the best
approach).

### Potential Problem

The biggest difficulty of this project would be to replace the EWAH
compression/decompression altogether with better techniques like the
Roaring+Run method. I have to be aware of it so that it doesn’t
generate any unintentional behavior or bugs.

### The plan

As this project includes three ( actually four) sub projects, I would
address them one by one -
1. Firstly I will understand the working of EWAH compression better.
   I have gone through their documentation and have seen git’s
   implementation code for it. The commit messages were really helpful
   here. But I think I should research more. After that,  I can compare
   (and discuss) each approach with respect to the EWAH approach on
   some parameters (like amount of computation in each approach,
   simplicity, memory usage, correctness i.e. does it always give the
   correct output etc.). This comparison would lead me to take the
   most ideal approach in this case. There are several techniques which
   may work/perform better than the current EWAH implementation. I
   have researched some of these techniques. One of them is obviously
   the Roaring+Run method. I have gone through their
   (research paper)[https://arxiv.org/pdf/1709.07821v6.pdf]
   on this technique; it seems like a good approach to me. Other than
   that, there is a golang package called
   (sroar)[https://dgraph.io/blog/post/serialized-roaring-bitmaps-golang/]
   (i.e. Serialized Roaring bitmaps), which is an open source package
   ( mainly built for (Dgraph)[https://dgraph.io/])
   and the creator of this package claims it is
   (7x faster using 15x less
memory)[https://github.com/dgraph-io/sroar#benchmarks].
   It doesn’t require any encoding/decoding step. I would look into
   it ( It needs to be implemented in C language though). I have also
   looked into approaches like
   (bloom filter)[https://en.wikipedia.org/wiki/Bloom_filter],
   (Hyperloglog)[https://en.wikipedia.org/wiki/HyperLogLog]. These are
   very fast and use very less memory. But they give probabilistic
   answers and sometimes they may give wrong answers. So, those
   would not be a good fit for us I guess.
2. I will understand the bitmap related files such as
   `pack-bitmap-write.c`[4], `pack-bitmap.c`[5] etc. It would
   help me to decide where and how I can add the code related to
   `table of contents` for bitmap.
   Taylor’s (branch
work)[https://github.com/git/git/compare/master...ttaylorr:tb/bitmap-commit-table]
   would work as a reference for me. As the Project idea section
   suggests, I would also include performance tests to continuously
   check whether my approach is really improving the performance or
   not.
3. As I understand more about bitmap related files, I will surely
   be able to think whether `adding a new mode` is really necessary
   or not (It is conceptually a good idea though). If necessary,
   then I would first make the flow design and discuss it with the
   community. If all is good, I will start working on it. Creating
   and running performance tests is a must.
4. While developing all of these, my understanding of bitmap and
   its implementation will get better. So, that would help me to
   think more practically about the questions provided in the `Project
   Idea` section. Then I would work on it accordingly.

## Estimated Timeline

My Semester exams will most probably start from May last week. So,
I may be inactive (or less active) that time.
(Smart India Hackathon)[https://www.sih.gov.in]
(semi finals and finals) may also be held during the GSoC event.
I will get you notified about that.

--> Period
- Pre GSoC Period
- Till May 20
--> Tasks
- Research more on the bitmap files, compression techniques etc.
- See if there are any other (better) techniques for it.

--> Period
- Community Bonding
- May 20 - June 12
--> Tasks
- Discussion on the project started
- Decide which approach is better for compression/decompression of
  bitmaps.
- If a decision is not taken, create branches for each approach.
- Create branches for each sub project and make skeletons for them.

--> Period
- Coding Phase 1
- June 13 - July 25
--> Tasks
- Start working on sub-project no. 1 (i.e. bitmap compression
  technique)
- Add Performance test
- If multiple branch are create for this sub-project, compare
  the test results
- Take the best performing approach, delete other branches.
- Update documentation accordingly
- Fix bugs
- Finish sub-project 1

--> Period
- Coding Phase 2
- July 25 - Sept 4
--> Tasks
- Fix sub-project-1 related small bugs
- Start working on sub-project-2 (i.e. adding
  `table of content` for bitmap files)
- Add Performance test
- Add Documentation
- Fix bugs
- If finished, start working on sub-project-3 (i.e. adding a
  `new mode`)
- Try to finish it with documentation and performance tests

--> Period
- Final Week 1
- Sept 5 - Sept 12
--> Tasks
- If both are finished, fix bugs and test overall performance.
  Else extend the period.
- Discuss about the last sub project (i.e. questions )
- If anything needs to be changed/modified or added, extend the
  GSoC Period

--> Period
- Coding Phase 3
- Sept 13 - Nov 13
--> Tasks
- Finish code related to all sub projects
- Add overall documentation
- Add overall performance tests


## Blogging about Git

I love to write blogs and technical articles. Those are my ways to
connect and communicate with the developers community. I have
previously written articles about Django in a portal named
(GeeksForGeeks)[https://auth.geeksforgeeks.org/user/riter79/articles].
Recently I started writing  blogs on
(Medium platform)[https://medium.com/@abhra303].

During the GSoC event, I will frequently write blogs about my
progress. I will also share the problems I face and the solutions
of those problems that I consider.

I got interested in git by reading Olga’s GSoC blogs. That led me
to contribute to this codebase. So, someone like me will hopefully
be motivated by seeing my blogs ;-)

## Availability

I intentionally freed my summer slot for GSoC. So, there wouldn’t
be much disturbance/inactive days while working on my project.
Generally, I can spend nearly 35-40 hours a week on this project.
If the college gets closed for some reason, then the amount of
available time will increase. Smart India Hackathon finale would
be held in July. So, I may be inactive for a few days. But overall,
I will be available most of the time.

## Post GSoC

Once the GSoC finishes, I would look into other issues/tickets and
would send PRs solving those issues. As I said before, I want to
be a long term contributor to git - I would definitely maintain the
flow and continue to work on improving the overall git tool.

## Experience with Open Source

I have a pretty good contribution history in open source. The
organizations/tools where I contributed to are -

 - Mozilla/bedrock - written mainly in Python Django and javaScript
   link - https://github.com/mozilla/bedrock/commits?author=Abhra303

 - Datree - a Kubernetes misconfiguration checker tool written mainly
   in golang
   link - https://github.com/datreeio/datree/commits?author=Abhra303

 - KubeBuilder - A tool to extend the Kubernetes API by creating custom
   CRDs - written mainly in golang. I recently started looking into it.
   link : https://github.com/kubernetes-sigs/kubebuilder/issues/2567

## Motivation

I always wanted to develop/contribute to projects that affect millions
of people. Git is a tool that most ( or maybe every) developers
use for their daily work. Contributing to it means my written code
will be used behind the scenes in every developer’s machine. That’s my
motivation.

## Closing Remarks

Lastly, I want to say that I am a dedicated, research preferred quick
learner. I am always ready to solve real world problems and learn
required skills for them. For example, I recently took part in a
hackathon called “Cloud Native Hackathon” and won 1st place in the
category of “best use of Datree”. I knew nothing about
(Datree)[https://datree.io/] ( and (Kubernetes)[https://kubernetes.io]
partially) before this hackathon and the hackathon was only 2 days
long. But I researched hard ( to know the Kubernetes misconfigurations)
and learned  jsonSchema ( Here is my
(blog link)[https://medium.com/@abhra303/how-i-won-cloud-native-hackathon-2021-4b1f5b2a84f2]
if you want to know my full experience).

I hope you’ll give me a chance to make an impact among the developers
community by considering my application.

Thanks & Regards,

Abhradeep Chakraborty


[1] https://github.com/gitgitgadget/git/blob/master/Documentation/MyFirstContribution.txt
[2] https://github.com/gitgitgadget/git/blob/master/Documentation/MyFirstObjectWalk.txt
[3] https://git-scm.com/docs/user-manual#hacking-git
[4] https://github.com/git/git/blob/master/pack-bitmap-write.c
[5] https://github.com/git/git/blob/master/pack-bitmap.c

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [GSoC][RFC][Proposal] Reachability bitmap improvements
@ 2022-04-19  0:10 Plato Kiorpelidis
  0 siblings, 0 replies; 12+ messages in thread
From: Plato Kiorpelidis @ 2022-04-19  0:10 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau, Kaartic Sivaraam

Hello all,

I completed my proposal and I'm sharing it with you. Even thought it's
close to the deadline I hope it gets reviewed. I'm looking for more
related material like mailing list conversations to include in my
proposal and learn more.
How can I further improve it? Have I missed something?

I know I didn't include any links here. The PDF version I submitted has links.

Thanks in advance, I'm sorry for the late submission,
Plato

~~~~~~~~~~

# Contact info [...]

# About me

Final year undergrad @ University of Athens, majoring Computer Science
and minoring Mathematics. Throughout my studies I was selected to help
conduct uni lab lectures voluntarily for courses: data structures in C
(4 years), oop in C++ (2 years). GPA: 8.4/10.

I was a founding member of my university department’s ACM student
chapter, with the goal to grow and bond our uni community by
organizing lectures and workshops by students, academic researchers
and professionals. Mostly I was active in open source channels where I
directed, along with others, conversations about how to get involved
and contribute to open source projects. We even launched our own dummy
project to help provide aided exposure with PRs, reviews and related
procedures, which was accompanied by a talk I conducted (pt. 1, pt. 2)
regarding the workflow with Git & Github. Covid outbreak hindered the
development of the ACM student chapter.

My computing world volunteerism does not stop there as I helped
construct the website & autograder for the cryptography course in ‘15
for NTUA ECE, a different university from the one I’m enrolled in. I
also gave a talk for first-year undergrads on how to switch from
Windows to Linux and why, including its advantages and disadvantages.

In late 2019 I was recommended by a prof. to work in my uni department
to develop systems for academic and educational purposes. I focused on
setting up and synchronizing LDAP databases, SCIM identity management,
LSC and maintenance of our GitLab instance.

# Benefits to Community

I plan to become a long term Git contributor, help develop and
influence its design and attract new contributors. I plan to give a
talk to my fellow undergrad and graduate classmates about my
experience in GSoC with Git. How they can become involved, the GSoC
program, Git’s codebase structure and how things are done within the
Git organization.

I will invite them to participate in GSoC ‘23 with Git where,
hopefully, I will mentor the selected candidates. My participation
with GSoC this year will lay the foundations for me to learn and
accumulate the necessary experience and understanding so I can give
back to GSoC, Git, my university and fellow classmates.

Throughout my GSoC journey, and after it, I’ll publish digests about
my bitmap work and related knowledge under kioplato.github.io, similar
to the ones in Git Rev News. I’ll also PR related parts to Git Rev
News. This way my work will be transparent and accessible to anyone
who’s interested. It will also serve as reference throughout my GSoC
journey and towards future GSoC candidates interested in participating
with Git organization.

Improving Git will benefit open source projects and their communities
as well, increasing productivity due to reduction in wait times as the
proposed project will, hopefully, yield non-trivial speed-up over
network operations. Git is a widely used technology throughout the
industry and further developing it will benefit everyone who’s trying
to improve our lives through technology.

# Microproject & Git related work

  –  In 2019, I planned to participate in GSoC ‘20. To familiarize
myself with Git and get involved early, I picked up this microproject
from GSoC ‘19 and submitted a patch converting remove_subtree()
function from entry.c to use dir-iterator API instead of opendir,
readdir, closedir API. This function is required to iterate directory
paths after their contents, rendering my patch invalid, as pointed out
by Matheus, since dir-iterator didn’t and still doesn’t offer such
functionality. I didn’t follow up with a dir-iterator API patch to
implement the required functionality nor did I attempt to participate
in GSoC ‘20 due to having limited time.

  –  My dream to participate in GSoC continues and I plan to
participate in GSoC ‘22. I picked up from where I left off and
introduced the first set of changes requested by this thread to
dir-iterator and submitted RFC v1. The first set of changes refer to
dir-iterator iterating directories before and/or after their contents.
My goal is to considerably improve dir-iterator, guided by the needs
of its possible customers like read_directory_recursive(), resulting
in a simpler, more maintainable codebase. Getting selected for GSoC
will put this goal on hold.

  –  While reading through the SoC documentation in git.github.com I
noticed the CI section of the General Microproject Information webpage
wasn’t updated after the switch from Travis-ci to Github actions. The
CI instructions were out of date, pointing to Travis-ci. I builded the
site, updated the section and opened a PR to git.github.com repo,
which was merged.

# Proposal Abstract

Pre Git v2.0.0, network operations like fetches and clones were
hindered by the counting objects phase. In this phase, Git needed to
calculate which objects are required to be sent. However, Git only
knows the tips of all branches, so its only option was to walk down
the graph. To deal with this, Git adopted JGit’s reachability bitmaps
in EWAH format. This method led to a significant speed-up over the
network operations, as it indexed which objects are reachable from
salient commits.

While EWAH bitmaps eliminate a performance bottleneck by indexing
which commits are reachable from a given commit, there are possible
improvements to be made. There is a sense that EWAH bitmap compression
is slow in some cases. Are Roaring bitmaps better in Git’s case? There
is strong evidence they will be. Are Roaring bitmaps faster to access
specific parts of the bitmap due to its random-access feature and how
can we use this? Is the current bitmap format sufficient to
incorporate a lookup table which specifies the commits which have a
bitmap? How can we improve it? This project attempts to answer these
questions.

The proposed project would entail a) Designing & implementing a suite
of performance tests benchmarking EWAH to validate our assumption that
the performance of bitmap decompression is worth pursuing. Compare the
two bitmap compressions and perform any necessary changes to the
.bitmap format to accommodate the new compression scheme. Speeding-up
read, write bitmap operations or reducing used memory will convince us
to switch to Roaring bitmaps. b) Further explore the idea of
introducing a small “table of contents”, indexing which commits have a
bitmap and where to find them in the .bitmap file. In this case we
will need to implement a new .bitmap format to add a variable-width
table of contents.

# Proposal Timeline

The proposed project is challenging and in combination with its
experimental nature requires diligence and careful planning. I need to
be realistic. I will need time to familiarize myself and be confident
in my understanding regarding Git’s implementation related to bitmaps,
packing objects and related material, before proceeding with the
actual changes to .bitmap format and performance testing. These
require time and effort, therefore I’ll reserve sufficient time to get
them right: I’ll dedicate 35-40 hours/week. I will reserve more if the
project requires it and I’m open to extending the duration of the
program in case the proposed goals are not achieved.

## Overview

The plan is to first validate the need for an alternate bitmap
compression algorithm. This will be verified from the performance test
suite. → If there is indeed a need, we will attempt to adopt Roaring
bitmaps instead of EWAH, and compare the two implementations.
Hopefully Roaring bitmaps will speed-up bitmap operations and convince
us to adopt. → Regardless of the necessity of an alternate bitmap
compression algorithm, the idea of introducing a “table of contents”
to the bitmap format will be explored, even post GSoC. If the
performance tests prove that EWAH is good enough, then the “table of
contents” along with a new .bitmap format will become my primary
focus.

1. Spend time understanding bitmap data structure, EWAH, Git’s bitmap
format, codebase and respective implementations. Self code and tinker
with codebase. Don’t be afraid to ask questions and make sure it’s
clear to me what’s expected.
Material: Bitmap EWAH Bitmap format Bitmaps & inverted lists Git: 1 2 3 4 5
2. Design & implement a suite of performance tests. Validate our
assumption that the performance of bitmap decompression is worth
pursuing.
Material: Git performance tests Taylor’s .bitmap format: mail-list fork
3. Implement Roaring bitmap with run containers. Think about the
architecture & design. Compare it against the existing implementation.
Adopt if bitmap operations are faster.
Material: Roaring paper Roaring spec Bitmap format Git: 1 2 3 4 5
4. (alternative) Dedicate some time to investigate and sketch out a
new .bitmap format for the “table of contents” sub-project. Current
bitmap format comes with limitations.
Material: Bitmap format Taylor’s .bitmap format: mail-list fork
Stolee’s chunk-format API
5. (alternative) Implement a new bitmap format to incorporate the
variable-width “table of contents” metadata. Implement and run related
tests to verify speed-up.
Material: Bitmap format Taylor’s .bitmap format: mail-list fork
Stolee’s chunk-format API

## Timeline

- April 19: GSoC contributor application deadline
- April 20 - May 08 (~3 weeks, 19 days, 120h): Understand codebase
Tinker with and understand Bitmap, EWAH, Git’s EWAH, Bitmap format.
Local testing and self code. Gather a good understanding of Git’s
codebase related to bitmaps in general 1 2 3 4 5. Explore related Git
commands like git gc, pack-objects, unpack-objects, repack, bundle and
other related commands. See which code paths trigger and understand
what’s happening. Ask questions to mentors Taylor & Sivaraam and wider
audience. Read documentation.
- May 09 - May 13 (~1 week, 5 days, 40h): Understand performance tests library
Understand the architecture of Git’s performance tests & existing
bitmap performance tests. Investigate which code paths trigger in the
codebase. Consult Taylor’s performance test parts: mail-list to aid
with understanding of how the performance tests work.
- May 20: Accepted GSoC contributor projects announced & Community
bonding period begins
- May 14 - June 03 (3 weeks, 21 days, 120+h): Compression perf tests &
validate assumptions
Design and implement a suite of performance tests. Consult existing
tests, README, conversations, Taylor’s previous work on .bitmap
format: mail-list fork. Validate our assumption that the performance
of bitmap decompression is worth pursuing. Discuss possible new
.bitmap format. Tie up loose ends.
- June 13: Coding officially begins
- June 04  - July 03 (4 weeks, 30 days, 160h): Implement Roaring &
compare to EWAH
Implement Roaring bitmaps with run containers. Understand Roaring
specification and other compressions. Self code and local testing to
understand Roaring format. Read Roaring paper. Minimally implement
Roaring bitmaps and compare against existing compression. Adopt if
bitmap operations are faster. Discuss possible new .bitmap format. Tie
up loose ends.
- July 04 - July 24 (3 weeks, 21 days, 120h): Bitmap perf tests &
further discuss bitmap format
Consult Taylor’s previous work on .bitmap format with commit lookup
table: mail-list fork. Implement more tests to identify & discuss
possible improvements to be made to the current .bitmap format.
Investigate how we can incorporate a variable-width table of contents
to a new .bitmap format. How does the way we read hash-cache extension
limit the current .bitmap format? Can we make it compatible with
Stolee’s chunk-format API? Further understand current .bitmap format.
Understand pack objects, deltas, related operations & code paths. Self
code and tinker.
- July 29: Phase 1 Evaluation deadline: Submit code
- July 25 - July 29 (5 days): Submit phase 1 evaluation
- July 30 - Aug 21 (4 weeks, 160h): Implement new .bitmap format ideas
& run perf tests
Experiment with new .bitmap format ideas and document possible
speed-up. Continue discussions related to .bitmap formats. Consult
previous conversations. Hopefully arrive at a satisfying new .bitmap
format. Think about its architecture & design.
- Aug 22 - Sep 04 (2 weeks, 80h): Unallocated time in case of delays &
bad time management
- Sep 05 - Sep 12 (1 week): Submit final work product & mentor evaluation.

The above timeline serves mostly as a guideline. The overall pace and
progress of the project depends on intermediate steps like
“Compression perf tests & validate assumptions”. There is a plethora
of work to be done and possible improvements to be discussed in this
area. That’s the purpose behind the multiple subprojects of the
proposed project i.e. additional flexibility.

As a possible GSoC candidate, I want to successfully complete GSoC and
get this project right. I want to explore bitmaps, their compressions
and offer as much as I can. For this reason I’ll take advantage and
start preparing from 20 April as the proposal timeline suggests. This
will give me additional flexibility, given the nature of the project
(e.g. in case EWAH compression is good enough), and confidence since
I’ll have more time to work and understand underlying implementations.
Even if I’m not selected, these conversations and my exposure to
bitmaps will help the selected candidate. Additionally, I will try to
participate in the process and help the selected candidate.

# Related Work

There is strong evidence that Roaring bitmaps, usually, perform better
than its alternatives:
 –  Guzun et al compared Concise, WAH, EWAH, Roaring (without run
containers) with a hybrid system combining EWAH and uncompressed
bitsets. In many of their tests, but not all, Roaring was superior to
Concise, WAH and EWAH.
 –  Wang et al compared a wide range of bitmap compression techniques.
They conclude their comparison by a strong recommendation in favor of
Roaring: “Use Roaring for bitmap compression whenever possible. Do not
use other bitmap compression methods such as BBC, WAH, EWAH, PLWAH,
CONCISE, VALWAH, and SBH.”
 –  Lemire et al reviewed and evaluated the CRoaring library. They
compared BitMagic, EWAH (code from this library is part of the Git
version-control tool), uncompressed bitset, WAH & Concise, C++
std::vector and std::unordered_set for hash-set performance. In their
tests Roaring performance was superior. They concluded by saying “No
single approach can be best in all cases, but Roaring offers good
all-around performance”.
 –  CRoaring is an implementation of the Roaring bitmaps in C and C++.
Will be used as reference while adopting Roaring bitmaps in Git.
 –  Bitmap implementation in C. Excellent for getting introduced into
the bitmap data structure. Will be used as reference.
 –  Roaring format specification thoroughly describes the compressed
bitmap format.
Will be consulted throughout the project’s duration.
 –  EWAH implementation in C. Excellent for getting introduced into
how things are currently implemented. Git’s EWAH implementation will
complement that.
 –  Roaring bitmaps publication: Consistently faster and smaller
compressed bitmaps with Roaring. Will be used as reference.
 –  Comparison between bitmap compressions: An Experimental Study of
Bitmap Compression vs. Inverted List Compression. Will be used for
additional context since it provides a brief explanation of all bitmap
compressions.
 –  Taylor’s previous work on .bitmap table of contents: mail-list fork
Taylor found that we could get a significant speed-up in some cases,
but that the speed-up was
basically obliterated whenever we had to do any follow-on traversal if
the bitmap coverage wasn't completely up-to-date.

Previous work that introduced EWAH bitmaps could be used for possible
additional context, exposure, related conversations, code paths and
reference.
 –  Initial patch series, designed by Shawn Pearce for JGit,
implemented by Vicent Marti
 –  Initial patch improvements by Jeff v1, v2, v3, v4.
 –  Merged initial patch: compressed bitmaps, use bitmaps when packing objects

Related Github blog posts: Counting objects, Scaling monorepo maintenance

# Biographical Information

 –  Me and Git
I have been using Git since 2015, in my work and to collaborate on uni
and fun projects. Throughout these years I’ve read the entire pro Git
book while following most parts through my terminal, reproducing the
behavior, in order to make me a better software engineer. The workshop
I conducted (pt. 1, pt. 2) regarding the workflow with Git & Github
was heavily influenced by pro Git book which is apparent in the
workshop’s pdf. I’ve also read material like hacking Git, github.blog
posts, Git Rev News and similar technical documents.

 –  Me, C and Shell
C is my native programming language. I have written 80.000+ LOC and
I’ve implemented various systems. Most notably: B+ Tree, LSH for KNN &
clustering, simple search engine, graph editor, MPI & OpenMP parallel
system, threaded server, multiprocess systems with workers, archiver,
many algorithms & data structures (kd-trees, red-black, AVL, splay
tree, linked-lists, skip lists, hashtable, etc), many of which I’ve
converted to be thread-safe. In my uni we’re C/C++ oriented.
I’ve been using Arch Linux exclusively for 4+ years to learn and
tinker with my setup. I run dwm, st, self compiled Linux kernel,
various shell scripts, btrfs, full disk encryption, lvm to configure
my computer just how I like it. I’m confident in my C & shell skills.

 – Professional Experience
Jan 2020 - June 2021 (1.5 year) GUnet
Worked on setting up and synchronizing LDAP databases, SCIM identity
management, LSC and maintenance of our GitLab instance.

 –  Teaching
2019: Workshop regarding the workflow of Git & Github. (pt. 1, pt. 2)
2017: Workshop for first-year undergrads on how to switch from Windows
to Linux and why, including its advantages & disadvantages and related
utilities like IDEs, compilers, packages.
2016 - 2020: Lab assistant courses data structures in C (4 years), oop
course in C++ (2 years).

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GSoC][RFC][PROPOSAL] Reachability bitmap improvements
  2022-04-14 14:24   ` Philip Oakley
@ 2022-04-15  5:28     ` Abhradeep Chakraborty
  0 siblings, 0 replies; 12+ messages in thread
From: Abhradeep Chakraborty @ 2022-04-15  5:28 UTC (permalink / raw)
  To: Philip Oakley; +Cc: Abhradeep Chakraborty, Git, Taylor Blau, Kaartic Sivaraam

Philip Oakley <philipoakley@iee.email> wrote:

> It's also worth looking at the Git Rev News articles and light reading links
>      https://git.github.io/rev_news/ and
>      https://git.github.io/rev_news/archive/

Thanks Philip! will surely look into it.

Thanks :)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GSoC][RFC][PROPOSAL] Reachability bitmap improvements
  2022-04-14 20:29     ` Kaartic Sivaraam
@ 2022-04-15  5:23       ` Abhradeep Chakraborty
  0 siblings, 0 replies; 12+ messages in thread
From: Abhradeep Chakraborty @ 2022-04-15  5:23 UTC (permalink / raw)
  To: Kaartic Sivaraam; +Cc: Abhradeep Chakraborty, Git, Taylor Blau

Kaartic Sivaraam <kaartic.sivaraam@gmail.com> wrote:

> Rather than deciding the plan in haste, I would recommend that you put 
> some more thought into deciding the plan for your proposal. Take some 
> time to reflect on the different parts required for the project, 
> potential things that might interfere with the project etc. This would 
> help you come up with a more refined plan.
>
> This might need some deeper understanding of the project that you would 
> be working on. Do feel to ask us in case you face any issues with this.
>
> Also, you could use Taylor's suggestion as a possible starting point for
> the plan:
>
>  > If you end up working on this, I would suggest budgeting a week or so
>  > to try out a couple of different sub-projects, and then decide where
>  > to spend your time from there.

Got it! thanks! Will do as you suggested.

Thanks :)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GSoC][RFC][PROPOSAL] Reachability bitmap improvements
  2022-04-14  8:20   ` Abhradeep Chakraborty
@ 2022-04-14 20:29     ` Kaartic Sivaraam
  2022-04-15  5:23       ` Abhradeep Chakraborty
  0 siblings, 1 reply; 12+ messages in thread
From: Kaartic Sivaraam @ 2022-04-14 20:29 UTC (permalink / raw)
  To: Abhradeep Chakraborty; +Cc: Git, Taylor Blau

On 14-04-2022 13:50, Abhradeep Chakraborty wrote:
> Kaartic Sivaraam <kaartic.sivaraam@gmail.com> wrote:
> 
>> That's good to know. Do let us know if you have any experience with C
>> and shell programming outside of your contributions to Git.
> 
> Thank you for reviewing my proposal. I used C language in my B.tech
> classes. I never used C in practical/real world projects before. So,
> putting altogether, no, I didn't use C and shell in real world projects
> before.
>

That's good to know. Thanks.

>> Taylor has shared feedback on the proposal and timeline which I hope are
>> helpful to you. Specifically, Taylor said:
>>
>>>
>>> So I think it makes sense to try and find a way you can dip your toes
>>> into 2-3 of the sub-projects and get a sense for what feels most doable
>>> and interesting, then focus on that before allocating time for more
>>> projects in the future.
>>   >
>>
>> I agree with this. It's better to be practical than ambituous when
>> planning the proposal.
> 
> Okay, I am willing to work on first sub-project then. Should I remove the
> other subprojects then?
> 

Rather than deciding the plan in haste, I would recommend that you put 
some more thought into deciding the plan for your proposal. Take some 
time to reflect on the different parts required for the project, 
potential things that might interfere with the project etc. This would 
help you come up with a more refined plan.

This might need some deeper understanding of the project that you would 
be working on. Do feel to ask us in case you face any issues with this.

Also, you could use Taylor's suggestion as a possible starting point for
the plan:

 > If you end up working on this, I would suggest budgeting a week or so
 > to try out a couple of different sub-projects, and then decide where
 > to spend your time from there.

Hope this helps,
Sivaraam

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GSoC][RFC][PROPOSAL] Reachability bitmap improvements
  2022-04-13 19:29 ` Taylor Blau
  2022-04-14  7:58   ` Abhradeep Chakraborty
@ 2022-04-14 14:24   ` Philip Oakley
  2022-04-15  5:28     ` Abhradeep Chakraborty
  1 sibling, 1 reply; 12+ messages in thread
From: Philip Oakley @ 2022-04-14 14:24 UTC (permalink / raw)
  To: Taylor Blau, Abhradeep Chakraborty; +Cc: git, Kaartic Sivaraam

Hi Abhradeep,


On 13/04/2022 20:29, Taylor Blau wrote:
>> Though I had read many documentations, I was still not able to fully
>> understand the terminologies (like `refs`, `packfiles`, `blobs`, `trees`
>> etc.). (ProGit)[https://git-scm.com/book/en/v2] helped me tremendously here.
>> I read the full book and it cleared almost every doubt that came into my mind.
> Terrific! I am really glad that the MyFirst... documents were helpful
> and made it easier for you to contribute. The ProGit book is a great
> resource, too.
>
> If you are looking for more resources, I would encourage you to search
> around for blog posts written by Git contributors, particularly related
> to reachability bitmaps (at least for this GSoC project). Some helpful
> places to start there would be:
>
>     https://github.blog/2015-09-22-counting-objects/
>     https://github.blog/2021-04-29-scaling-monorepo-maintenance/
>
It's also worth looking at the Git Rev News articles and light reading links
     https://git.github.io/rev_news/ and
     https://git.github.io/rev_news/archive/
--
Philip.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GSoC][RFC][PROPOSAL] Reachability bitmap improvements
  2022-04-13 20:03 ` Kaartic Sivaraam
@ 2022-04-14  8:20   ` Abhradeep Chakraborty
  2022-04-14 20:29     ` Kaartic Sivaraam
  0 siblings, 1 reply; 12+ messages in thread
From: Abhradeep Chakraborty @ 2022-04-14  8:20 UTC (permalink / raw)
  To: Kaartic Sivaraam; +Cc: Abhradeep Chakraborty, Git, Taylor Blau

Kaartic Sivaraam <kaartic.sivaraam@gmail.com> wrote:

> That's good to know. Do let us know if you have any experience with C
> and shell programming outside of your contributions to Git.

Thank you for reviewing my proposal. I used C language in my B.tech
classes. I never used C in practical/real world projects before. So,
putting altogether, no, I didn't use C and shell in real world projects
before.

> Good to see your various contributions to Git so far. Good work!

Thanks.

> s/indexes/indices/

Oops, correcting it.

> > [ ... snip ... ]
> >> ## Estimated Timeline
> >
> > [ ... snip ... ]
>
> Taylor has shared feedback on the proposal and timeline which I hope are 
> helpful to you. Specifically, Taylor said:
>
> > 
> > So I think it makes sense to try and find a way you can dip your toes
> > into 2-3 of the sub-projects and get a sense for what feels most doable
> > and interesting, then focus on that before allocating time for more
> > projects in the future.
>  >
>
> I agree with this. It's better to be practical than ambituous when 
> planning the proposal.

Okay, I am willing to work on first sub-project then. Should I remove the
other subprojects then?

> Thanks for letting us know of possible things that would interfere with 
> GSoC! Do make sure to keep us posted in case you come to know of 
> anything else.

Yes! will surely inform you in that case.

Thanks :)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GSoC][RFC][PROPOSAL] Reachability bitmap improvements
  2022-04-13 19:29 ` Taylor Blau
@ 2022-04-14  7:58   ` Abhradeep Chakraborty
  2022-04-14 14:24   ` Philip Oakley
  1 sibling, 0 replies; 12+ messages in thread
From: Abhradeep Chakraborty @ 2022-04-14  7:58 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Abhradeep Chakraborty, Git, Kaartic Sivaraam

Taylor Blau <me@ttaylorr.com> wrote:

> Thanks so much for submitting this proposal! I have been excited to look
> through it, and am sorry for not getting to it sooner. Let me take a
> look now and give some of my thoughts.

Thanks so much Taylor for reviewing my proposal. It's my pleasure!

> Great! It sounds like working on Git would give you some new experience
> in a different domain than some of your past projects, and help you
> learn more about how it works and is developed.

Yeah, definitely! Looking forward to explore more in this domain :)

> Terrific! I am really glad that the MyFirst... documents were helpful
> and made it easier for you to contribute. The ProGit book is a great
> resource, too.

Thanks :)

> If you are looking for more resources, I would encourage you to search
> around for blog posts written by Git contributors, particularly related
> to reachability bitmaps (at least for this GSoC project). Some helpful
> places to start there would be:
>
>     https://github.blog/2015-09-22-counting-objects/
>     https://github.blog/2021-04-29-scaling-monorepo-maintenance/

I have read the "counting objects" article already. It is a very good
article to understand the reachability bitmaps in git. I would surely
go through the other article you shared.

> Note that commits and trees can be stored as deltas against other
> commits and trees (which themselves might be packed as deltas) and so
> on. But we only consider a pair of objects to be delta candidates for
> one another when they share a common type (e.g., we would never delta a
> tree object against a blob or vice-versa).

Oops. That is new to me. I didn't know that commits and trees can also
be delta-ed. Thanks for this info!!

> Exactly; though note that since that series we now have multi-pack
> bitmaps, too, which use a slightly different ordering called the
> "pseudo-pack" order, which (more or less) looks like all of the objects
> in a MIDX first sorted by which pack they came from, and then sorted by
> their order within the pack, removing duplicates in favor of the
> "preferred" pack.
>
> More thorough documentation on this can be found in the
> "multi-pack reverse indexes" section of
> "Documentation/technical/pack-format.txt".

Thanks. Will look into it :)

> But I want to be careful about the word "slow" here. They might be slow,
> or it might be really fast to decompress a given bitmap, depending on
> the disk performance, CPU pressure, how large the bitmaps are, and so
> on. So I think a good first step here is to validate our assumption that
> EWAH decompression is even slow to begin with.
>
> (Of course, the literature on newer formats will readily tell you that
> the old stuff is slower, but I think it's worth reevaluating those
> statements in the context of a practical application instead of in
> theory / benchmarks).

I think, it is "conceptually/theoritically" slow approach. Because as
you said -
> we can't begin to interpret the bitmaps until after decompressing them

It has to decompress the whole thing instead of evaluating the target
bitmap. Hardwares can make it fast but theoritically the time complexity
is slow.

I will validate whether it is practically slow or not though.

> Same comments here about whether or not this is slow to begin with. From
> my experimental patches in this area earlier, I found that we could get
> a significant speed-up in some cases, but that the speed-up was
> basically obliterated whenever we had to do any follow-on traversal if
> the bitmap coverage wasn't completely up-to-date.

True. But atleast it is faster(hopefully) when bitmap coverage is up-to-date.
We may optimize/improve it to make it fast overall.

> Arguably we could do this whether or not we solve the above, but doing
> some combination of the above may make it easier (e.g., if we wanted to
> change the bitmap selection to store dramatically more commits, then we
> may want to investigate how much of a bottleneck the sequential read
> requirement of .bitmap files would become for different numbers of
> selected commits).

True.

> For bitmaps, the number one thing we care about is correctness. I have
> never thought about using Bloom filters before; even though the
> one-sided nature of their errors means that we would never forget to
> send an object that we should have, having an absolute result is vastly
> preferred.
>
> After that, I think we mostly care about how quickly they can be
> decompressed. And after that, I think we care about things like "how
> fast can they be written", and "how large are they".

That's obvious. I had included those probabilistic approaches because
I thought we could have a check whether the obtained result is right
or wrong. In case of wrong answers (2% chance for Hyperloglog), we can
follow another approach that alters only the wrong results. But I
think, it will become complex. More research is needed to check whether
it can be done or not.

For other approaches ( i.e. Roaring+run and sroar), I think these two
fits the criteria you listed in your comment. Though, I need to work
more to fully understand the sroar methodology.

> This all sounds very good and ambitious. Keep in mind, though, that
> these projects have a tendency to take much more time than expected ;-).
> If we get one done, I'll be thrilled! (The goal with suggesting a few
> potential directions to go in is not to say: "let's do all of these
> things", but rather to give you some options in terms of what you find
> most interesting and exciting to work on).
>
> So I think it makes sense to try and find a way you can dip your toes
> into 2-3 of the sub-projects and get a sense for what feels most doable
> and interesting, then focus on that before allocating time for more
> projects in the future.
>
> > ## Estimated Timeline
>
> Like I said, I'm fine if you spend your entire GSoC project working on
> just one of these projects. So I don't want to get hung up on specific
> timelines just yet. If you end up working on this, I would suggest
> budgeting a week or so to try out a couple of different sub-projects,
> and then decide where to spend your time from there.

If that is the case, then I would go for the first sub-project i.e.
"Think of an better alternative of EWAH". It interests me a lot. Though
I will do what you suggested i.e. "try and find a way you can dip your
toes into 2-3 of the sub-projects and get a sense for what feels most
doable and interesting".

Should I remove "Estimated Timeline" ( or/and point 2, 3 from "The Plan"
section) for now then?

> Thanks very much for your interest and the time you spent putting this
> proposal together. I hope that some of my comments were helpful in
> refining it, and that you didn't mind my slow response too much.

The comments and suggestions are indeed helpful! I even didn't mind
anything.

Thanks :)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GSoC][RFC][PROPOSAL] Reachability bitmap improvements
  2022-04-06 20:09 [GSoC][RFC][PROPOSAL] " Abhradeep Chakraborty
@ 2022-04-13 20:03 ` Kaartic Sivaraam
  2022-04-14  8:20   ` Abhradeep Chakraborty
  0 siblings, 1 reply; 12+ messages in thread
From: Kaartic Sivaraam @ 2022-04-13 20:03 UTC (permalink / raw)
  To: Abhradeep Chakraborty; +Cc: Taylor Blau, git

Hi Abhradeep,

Thanks for drafting your initial GSoC proposal. Some thoughts inline:

On 07-04-2022 01:39, Abhradeep Chakraborty wrote:
>
> I am Abhradeep Chakraborty, currently pursuing B.Tech in Information
> Technology (now in my 3rd year) at Jalpaiguri Government Engineering
> College, India.
> 
> In the last two years, I mainly concentrated on learning and building
> web development things. I mainly use Django, React for my projects. But
> for the past 6 months, I have been learning about devops, cloud technologies
> ( e.g. docker, Kubernetes, AWS etc.), computer networking and core
> technologies (such as git, linux, gcc etc.). I also have an interest in
> some research based technologies like WebRTC, Web3 etc.
>

That's good to know. Do let us know if you have any experience with C
and shell programming outside of your contributions to Git.

> My ultimate objective is to be a dedicated and active contributor to all
> of these technologies. So, I want to start this contribution journey with
> `git` and therefore I want to be a part of this GSoC program.
> 
> ## Me & Git
> 
> I have started exploring the Git codebase since Sept 2021. At first, I
> mainly focused on knowing the git internals. During this time I read
> documentations (Under `Documentation` directory). `MyFirstContribution.txt`[1],
> `MyFirstObjectWalk.txt`[2] and `Hacking Git`[3] helped me to learn about the git
> contribution workflow, how `git log` ( in other words `object walk`) works
> internally.
> 
> Though I had read many documentations, I was still not able to fully
> understand the terminologies (like `refs`, `packfiles`, `blobs`, `trees`
> etc.). (ProGit)[https://git-scm.com/book/en/v2] helped me tremendously here.
> I read the full book and it cleared almost every doubt that came into my mind.
> 
> I have provided my git involvement history below. Activities are sorted
> by timeline from top to bottom.
> 
> ==============================
> PRs, Commits & Discussion Participations
> ==============================
 >
 > [ ... snip ... ]

Good to see your various contributions to Git so far. Good work!

> ## Proposed Project
> 
> ### Abstract
> 
> While sending large repositories, git has to decide what needs to be
> sent, how to be sent. For large repositories, git creates pack files
> where it packs all the related commits, trees, blobs( in the form of
> deltas and bases ), tags etc. Now, finding the related objects among
> all the objects might be very expensive.
> 
> As git only knows about the branch tips, it is very expensive to find
> the related objects from all the objects if we try to traverse down
> from the branch tips to all their respective related objects. It becomes
> more expensive as the number of objects (i.e. commits, trees, blobs
> etc.) increases. Ultimately resulting in slow fetching from the git
> daemon.
> 
> To counter that, reachability bitmaps were introduced. It uses bitmap
> data structure (an array of bits) to detect if an object is reachable
> from a particular commit or not. The commit messages of this
> (patch series)[https://lore.kernel.org/git/1372116193-32762-1-git-send-email-tanoku@gmail.com/]
> are itself a documentation of how it is achieved. All the bitmap
> indexes for selected commits (in a EWAH compressed form) are stored in
> a `.bitmap` file which has the same name of its respective packfile.
>

s/indexes/indices/

> [ ... snip ... ]
 >> ## Estimated Timeline
>
 > [ ... snip ... ]

Taylor has shared feedback on the proposal and timeline which I hope are 
helpful to you. Specifically, Taylor said:

> 
> So I think it makes sense to try and find a way you can dip your toes
> into 2-3 of the sub-projects and get a sense for what feels most doable
> and interesting, then focus on that before allocating time for more
> projects in the future.
 >

I agree with this. It's better to be practical than ambituous when 
planning the proposal.

> 
> ## Availability
> 
> I intentionally freed my summer slot for GSoC. So, there wouldn’t
> be much disturbance/inactive days while working on my project.
> Generally, I can spend nearly 35-40 hours a week on this project.
> If the college gets closed for some reason, then the amount of
> available time will increase. Smart India Hackathon finale would
> be held in July. So, I may be inactive for a few days. But overall,
> I will be available most of the time.
>

Thanks for letting us know of possible things that would interfere with 
GSoC! Do make sure to keep us posted in case you come to know of 
anything else.

> ## Post GSoC
> 
> Once the GSoC finishes, I would look into other issues/tickets and
> would send PRs solving those issues. As I said before, I want to
> be a long term contributor to git - I would definitely maintain the
> flow and continue to work on improving the overall git tool.
>

That's nice.

--
Sivaraam

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GSoC][RFC][PROPOSAL] Reachability bitmap improvements
  2022-04-06 20:04 Abhradeep Chakraborty
@ 2022-04-13 19:29 ` Taylor Blau
  2022-04-14  7:58   ` Abhradeep Chakraborty
  2022-04-14 14:24   ` Philip Oakley
  0 siblings, 2 replies; 12+ messages in thread
From: Taylor Blau @ 2022-04-13 19:29 UTC (permalink / raw)
  To: Abhradeep Chakraborty; +Cc: git, Kaartic Sivaraam

Hi Abhradeep,

On Thu, Apr 07, 2022 at 01:34:40AM +0530, Abhradeep Chakraborty wrote:
> Hello everyone,
>
> Here is the initial version of my proposal on "Reachability bitmap
> improvements". I would really love to have comments on it. If you
> find any mistakes please notify me about that.

Thanks so much for submitting this proposal! I have been excited to look
through it, and am sorry for not getting to it sooner. Let me take a
look now and give some of my thoughts.

> ## About Me
>
> I am Abhradeep Chakraborty, currently pursuing B.Tech in Information
> Technology (now in my 3rd year) at Jalpaiguri Government Engineering
> College, India.
>
> In the last two years, I mainly concentrated on learning and building
> web development things. I mainly use Django, React for my projects. But
> for the past 6 months, I have been learning about devops, cloud technologies
> ( e.g. docker, Kubernetes, AWS etc.), computer networking and core
> technologies (such as git, linux, gcc etc.). I also have an interest in
> some research based technologies like WebRTC, Web3 etc.
>
> My ultimate objective is to be a dedicated and active contributor to all
> of these technologies. So, I want to start this contribution journey with
> `git` and therefore I want to be a part of this GSoC program.

Great! It sounds like working on Git would give you some new experience
in a different domain than some of your past projects, and help you
learn more about how it works and is developed.

> ## Me & Git
>
> I have started exploring the Git codebase since Sept 2021. At first, I
> mainly focused on knowing the git internals. During this time I read
> documentations (Under `Documentation` directory). `MyFirstContribution.txt`[1],
> `MyFirstObjectWalk.txt`[2] and `Hacking Git`[3] helped me to learn about the git
> contribution workflow, how `git log` ( in other words `object walk`) works
> internally.
>
> Though I had read many documentations, I was still not able to fully
> understand the terminologies (like `refs`, `packfiles`, `blobs`, `trees`
> etc.). (ProGit)[https://git-scm.com/book/en/v2] helped me tremendously here.
> I read the full book and it cleared almost every doubt that came into my mind.

Terrific! I am really glad that the MyFirst... documents were helpful
and made it easier for you to contribute. The ProGit book is a great
resource, too.

If you are looking for more resources, I would encourage you to search
around for blog posts written by Git contributors, particularly related
to reachability bitmaps (at least for this GSoC project). Some helpful
places to start there would be:

    https://github.blog/2015-09-22-counting-objects/
    https://github.blog/2021-04-29-scaling-monorepo-maintenance/

> ## Proposed Project
>
> ### Abstract
>
> While sending large repositories, git has to decide what needs to be
> sent, how to be sent. For large repositories, git creates pack files
> where it packs all the related commits, trees, blobs( in the form of
> deltas and bases ), tags etc. Now, finding the related objects among
> all the objects might be very expensive.

Note that commits and trees can be stored as deltas against other
commits and trees (which themselves might be packed as deltas) and so
on. But we only consider a pair of objects to be delta candidates for
one another when they share a common type (e.g., we would never delta a
tree object against a blob or vice-versa).

> As git only knows about the branch tips, it is very expensive to find
> the related objects from all the objects if we try to traverse down
> from the branch tips to all their respective related objects. It becomes
> more expensive as the number of objects (i.e. commits, trees, blobs
> etc.) increases. Ultimately resulting in slow fetching from the git
> daemon.
>
> To counter that, reachability bitmaps were introduced. It uses bitmap
> data structure (an array of bits) to detect if an object is reachable
> from a particular commit or not. The commit messages of this
> (patch series)[https://lore.kernel.org/git/1372116193-32762-1-git-send-email-tanoku@gmail.com/]
> are itself a documentation of how it is achieved. All the bitmap
> indexes for selected commits (in a EWAH compressed form) are stored in
> a `.bitmap` file which has the same name of its respective packfile.

Exactly; though note that since that series we now have multi-pack
bitmaps, too, which use a slightly different ordering called the
"pseudo-pack" order, which (more or less) looks like all of the objects
in a MIDX first sorted by which pack they came from, and then sorted by
their order within the pack, removing duplicates in favor of the
"preferred" pack.

More thorough documentation on this can be found in the
"multi-pack reverse indexes" section of
"Documentation/technical/pack-format.txt".

> The current task is to improve the performance of this bitmap approach.
> There are three points that we need to work on -
>
> 1. Decompression of bitmaps is slow with EWAH. To know if an object is
>    related to a particular commit, we have to decompress the whole thing.
>    So, we have to consider other alternative techniques besides EWAH.

Right, we can't begin to interpret the bitmaps until after decompressing
them (ditto for operations that involve compressed bitmaps, e.g., we
can't OR two EWAH compressed bitmaps together without first
decompressing them).

But I want to be careful about the word "slow" here. They might be slow,
or it might be really fast to decompress a given bitmap, depending on
the disk performance, CPU pressure, how large the bitmaps are, and so
on. So I think a good first step here is to validate our assumption that
EWAH decompression is even slow to begin with.

(Of course, the literature on newer formats will readily tell you that
the old stuff is slower, but I think it's worth reevaluating those
statements in the context of a practical application instead of in
theory / benchmarks).

> 2. Loading a `.bitmap` file can be slow for large bitmaps. This is because
>    we need to read the file sequentially in order to discover the offset of
>    each bitmap within the file. It would be better if we can create a `table
>    of contents` for these bitmaps.

Same comments here about whether or not this is slow to begin with. From
my experimental patches in this area earlier, I found that we could get
a significant speed-up in some cases, but that the speed-up was
basically obliterated whenever we had to do any follow-on traversal if
the bitmap coverage wasn't completely up-to-date.

> 3. Regenerating bitmaps can take a long time. One possible approach is to
>    add a new mode which only adds new bitmaps for commits introduced between
>    successive bitmap generations.

Definitely true. Another way to look at this is that it's now possible
to do some "incremental" work as far as repacking a repository goes, but
that updating its bitmap is still much less "incremental" than it could
be.

> 4. If we solve the above points, we would think of other aspects like
>    rethinking how we select which commit receives bitmaps etc.

Arguably we could do this whether or not we solve the above, but doing
some combination of the above may make it easier (e.g., if we wanted to
change the bitmap selection to store dramatically more commits, then we
may want to investigate how much of a bottleneck the sequential read
requirement of .bitmap files would become for different numbers of
selected commits).

> ### Previous Work
>
> I didn’t find any previous patches regarding point no. 1. But there were
> some discussions about this. Derrick’s
> (comment)[https://lore.kernel.org/git/1685a358-f033-64e0-2e12-df3a1c10af19@gmail.com/]
> suggested considering the
> (Roaring+Run)[https://roaringbitmap.org/about/]
> technique. It is much faster than the currently implemented
> (EWAH)[https://arxiv.org/abs/0901.3751].
>
> Taylor initiated the `developing a table of contents for bitmap` work
> by creating a
> (branch and committing some patches)[https://github.com/git/git/compare/master...ttaylorr:tb/bitmap-commit-table]
> in that branch. From the
> (discussion)[https://lore.kernel.org/git/YNuiM8TR5evSeNsN@nand.local/]
> and the performance test results provided in that discussion,  it seems
> that it would improve overall performance. So, this would work as a
> reference for me while developing the table of contents for bitmaps
> (obviously after running performance tests and discussing the best
> approach).

This is a great summary of some of the existing work in this area!

> ### The plan
>
> As this project includes three ( actually four) sub projects, I would
> address them one by one -
>
> 1. Firstly I will understand the working of EWAH compression better.
>    I have gone through their documentation and have seen git’s
>    implementation code for it. The commit messages were really helpful
>    here. But I think I should research more. After that,  I can compare
>    (and discuss) each approach with respect to the EWAH approach on
>    some parameters (like amount of computation in each approach,
>    simplicity, memory usage, correctness i.e. does it always give the
>    correct output etc.). This comparison would lead me to take the
>    most ideal approach in this case. There are several techniques which
>    may work/perform better than the current EWAH implementation. I
>    have researched some of these techniques. One of them is obviously
>    the Roaring+Run method. I have gone through their
>    (research paper)[https://arxiv.org/pdf/1709.07821v6.pdf]
>    on this technique; it seems like a good approach to me. Other than
>    that, there is a golang package called
>    (sroar)[https://dgraph.io/blog/post/serialized-roaring-bitmaps-golang/]
>    (i.e. Serialized Roaring bitmaps), which is an open source package
>    ( mainly built for (Dgraph)[https://dgraph.io/])
>    and the creator of this package claims it is
>    (7x faster using 15x less memory)[https://github.com/dgraph-io/sroar#benchmarks].
>    It doesn’t require any encoding/decoding step. I would look into
>    it ( It needs to be implemented in C language though). I have also
>    looked into approaches like
>    (bloom filter)[https://en.wikipedia.org/wiki/Bloom_filter],
>    (Hyperloglog)[https://en.wikipedia.org/wiki/HyperLogLog]. These are
>    very fast and use very less memory. But they give probabilistic
>    answers and sometimes they may give wrong answers. So, those
>    would not be a good fit for us I guess.

For bitmaps, the number one thing we care about is correctness. I have
never thought about using Bloom filters before; even though the
one-sided nature of their errors means that we would never forget to
send an object that we should have, having an absolute result is vastly
preferred.

After that, I think we mostly care about how quickly they can be
decompressed. And after that, I think we care about things like "how
fast can they be written", and "how large are they".

> 2. I will understand the bitmap related files such as
>    `pack-bitmap-write.c`[4], `pack-bitmap.c`[5] etc. It would
>    help me to decide where and how I can add the code related to
>    `table of contents` for bitmap.
>    Taylor’s (branch work)[https://github.com/git/git/compare/master...ttaylorr:tb/bitmap-commit-table]
>    would work as a reference for me. As the Project idea section
>    suggests, I would also include performance tests to continuously
>    check whether my approach is really improving the performance or
>    not.

Developing some performance tests up-front (including expanding on the
existing ones in the t/perf suite) I think is a good first step after
doing some basic research. That will help keep us honest about whether
our changes are moving us in the right direction or the wrong one.

> 3. As I understand more about bitmap related files, I will surely
>    be able to think whether `adding a new mode` is really necessary
>    or not (It is conceptually a good idea though). If necessary,
>    then I would first make the flow design and discuss it with the
>    community. If all is good, I will start working on it. Creating
>    and running performance tests is a must.

> 4. While developing all of these, my understanding of bitmap and
>    its implementation will get better. So, that would help me to
>    think more practically about the questions provided in the `Project
>    Idea` section. Then I would work on it accordingly.

This all sounds very good and ambitious. Keep in mind, though, that
these projects have a tendency to take much more time than expected ;-).
If we get one done, I'll be thrilled! (The goal with suggesting a few
potential directions to go in is not to say: "let's do all of these
things", but rather to give you some options in terms of what you find
most interesting and exciting to work on).

So I think it makes sense to try and find a way you can dip your toes
into 2-3 of the sub-projects and get a sense for what feels most doable
and interesting, then focus on that before allocating time for more
projects in the future.

> ## Estimated Timeline

Like I said, I'm fine if you spend your entire GSoC project working on
just one of these projects. So I don't want to get hung up on specific
timelines just yet. If you end up working on this, I would suggest
budgeting a week or so to try out a couple of different sub-projects,
and then decide where to spend your time from there.

> ## Blogging about Git
>
> I love to write blogs and technical articles. Those are my ways to
> connect and communicate with the developers community. I have
> previously written articles about Django in a portal named
> (GeeksForGeeks)[https://auth.geeksforgeeks.org/user/riter79/articles].
> Recently I started writing  blogs on
> (Medium platform)[https://medium.com/@abhra303].
>
> During the GSoC event, I will frequently write blogs about my
> progress. I will also share the problems I face and the solutions
> of those problems that I consider.

Great!

> I got interested in git by reading Olga’s GSoC blogs. That led me
> to contribute to this codebase. So, someone like me will hopefully
> be motivated by seeing my blogs ;-)
>
> ## Availability
>
> I intentionally freed my summer slot for GSoC. So, there wouldn’t
> be much disturbance/inactive days while working on my project.
> Generally, I can spend nearly 35-40 hours a week on this project.
> If the college gets closed for some reason, then the amount of
> available time will increase. Smart India Hackathon finale would
> be held in July. So, I may be inactive for a few days. But overall,
> I will be available most of the time.

All sounds good to me.

> I hope you’ll give me a chance to make an impact among the developers
> community by considering my application.

Thanks very much for your interest and the time you spent putting this
proposal together. I hope that some of my comments were helpful in
refining it, and that you didn't mind my slow response too much.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [GSoC][RFC][PROPOSAL] Reachability bitmap improvements
@ 2022-04-06 20:09 Abhradeep Chakraborty
  2022-04-13 20:03 ` Kaartic Sivaraam
  0 siblings, 1 reply; 12+ messages in thread
From: Abhradeep Chakraborty @ 2022-04-06 20:09 UTC (permalink / raw)
  To: git; +Cc: Abhradeep Chakraborty, Taylor Blau, Kaartic Sivaraam

Hello everyone,

Here is the initial version of my proposal on "Reachability bitmap
improvements". I would really love to have comments on it. If you
find any mistakes please notify me about that.

The proposal template is inspired by Hariom Verma's (This year's
GSoC mentor) previous year's gsoc proposal.

Thanks :)

=============================================

View on Docs: https://docs.google.com/document/d/15tZ5QbMtYN41No5wdMw-2Ilp63vquB7OqAwBCLl_0n0/edit?usp=sharing

==============================================

**Reachability bitmap improvements**

Name  :  Abhradeep Chakraborty
Major : Information Technology
Mobile No. : [ ... ]
Email : chakrabortyabhradeep79@gmail.com
GitHub Profile: (Abhra303)[https://github.com/Abhra303]
LinkedIn : www.linkedin.com/in/abhradeep-chakraborty-b36426201
Time Zone : Indian Standard Time ( IST ) ( UTC +05:30 )

## About Me

I am Abhradeep Chakraborty, currently pursuing B.Tech in Information
Technology (now in my 3rd year) at Jalpaiguri Government Engineering
College, India.

In the last two years, I mainly concentrated on learning and building
web development things. I mainly use Django, React for my projects. But
for the past 6 months, I have been learning about devops, cloud technologies
( e.g. docker, Kubernetes, AWS etc.), computer networking and core
technologies (such as git, linux, gcc etc.). I also have an interest in
some research based technologies like WebRTC, Web3 etc.

My ultimate objective is to be a dedicated and active contributor to all
of these technologies. So, I want to start this contribution journey with
`git` and therefore I want to be a part of this GSoC program.

## Me & Git

I have started exploring the Git codebase since Sept 2021. At first, I
mainly focused on knowing the git internals. During this time I read
documentations (Under `Documentation` directory). `MyFirstContribution.txt`[1],
`MyFirstObjectWalk.txt`[2] and `Hacking Git`[3] helped me to learn about the git
contribution workflow, how `git log` ( in other words `object walk`) works
internally.

Though I had read many documentations, I was still not able to fully
understand the terminologies (like `refs`, `packfiles`, `blobs`, `trees`
etc.). (ProGit)[https://git-scm.com/book/en/v2] helped me tremendously here.
I read the full book and it cleared almost every doubt that came into my mind.

I have provided my git involvement history below. Activities are sorted
by timeline from top to bottom.

==============================
PRs, Commits & Discussion Participations
==============================

---------------------------------
[patch] Amend error messages that violate git coding style convention
Github link : https://github.com/gitgitgadget/git/pull/1062
Status : Not Merged/ PR
Description : There are many error strings which do not follow coding
style guides. This PR was to fix those. There is a ticket dedicated to
this ( github Issue - (#635)[https://github.com/gitgitgadget/git/issues/635]).
The author of this issue said it would be better to manually correct
these things.
Outcome : As this was my first PR, I made many mistakes and I faced
difficulties. Though this was not merged, I learned tremendous new
things. These violating strings were in many files ( nearly 200 files).
At first, I fixed all the files at once manually. I fixed errors
generated by those changes. Ultimately, I learned about overall git
structures, how to fix errors etc.

---------------------------------
[RFC PATCH 0/1] making set-upstream have default arguments
Mailing List : https://lore.kernel.org/git/20211202144354.17416-1-chakrabortyabhradeep79@gmail.com/
Status : Not Merged/ Closed
Description : `git push -u` generally ( if `push.default` not set)
throws an error if no arguments are provided. This patch request was to
teach `git push -u` to push and set a default upstream if no arguments
are provided. There is a github (issue)[https://github.com/gitgitgadget/git/issues/1061]
also.
Outcome : This PR let `git push -u` to push to a default branch and
set that as the upstream irrespective of `push.default` configuration.
I realized it was badly affecting the reason `push.default` is built
for ( See this (comment)[https://lore.kernel.org/git/20220104132839.1209-1-chakrabortyabhradeep79@gmail.com/]).
So, I intentionally stopped working on it.

----------------------------------
Avoid checking for -h in commands that already use parse-options API ( Issue)
Github Issue Link : https://github.com/gitgitgadget/git/issues/1125
Status : Closed

----------------------------------
[PATCH] Add usage strings ci check and amend remaining usage strings
Github PR Link : https://github.com/gitgitgadget/git/pull/1147
Mailing List Link : https://lore.kernel.org/git/pull.1147.git.1645030949730.gitgitgadget@gmail.com/
Status : Merged into Master
Description : According to dscho’s (comment)[https://github.com/gitgitgadget/git/issues/636#issuecomment-1018660439],
There were still some usage strings that violated the style guide.
So, this patch request fixed those strings and added a CI check to
verify all the usage strings are passing the style check. In the first
version, I built a shell script to check the validity of strings.
Outcome : The first commit was merged into master. There were some
contradictions about the different approaches suggested by reviewers
in the discussion about the last commit i.e. whether the check should
be static check or runtime check, whether I should use `parse options`
method or coccinelle etc (see the (mailing list)[https://lore.kernel.org/git/nycvar.QRO.7.76.6.2202251632320.11118@tvgsbejvaqbjf.bet/]).
Lastly, it was decided that I should try the `coccinelle` method first
and check if it is better ( see this (comment)[https://lore.kernel.org/git/20220304142154.2350-1-chakrabortyabhradeep79@gmail.com/]).
I will try to submit a PR regarding this before April 19.

==================================
MicroProject
==================================

[PATCH] Partial-Clone : add a partial-clone test case
Github PR Link : https://github.com/gitgitgadget/git/pull/1175
Status : Merged into Master
Github Issue : https://github.com/gitgitgadget/git/issues/827
Description : In a blobless-cloned repo, `git log –follow – <path>`
(`<path>` have an exact OID rename) shouldn’t download blobs of the
file from where the new file is renamed. This patch added a test case
to verify it.
Outcome : The test was added to the master branch. Moreover, another
issue was discovered while discussing the patch request. Junio discovered
that the test helper function `test_subcommand_inexact` was poorly designed.
Later Derrick submitted a (PR)[https://github.com/gitgitgadget/git/pull/1185]
to fix it ( now merged).

==================================
Review
==================================
[PATCH] test-lib-functions : fix test_subcommand_inexact
Github PR Link : https://github.com/gitgitgadget/git/pull/1185
Description : Junio discovered that the `test_subcommand_inexact` function
(test-helper function) is poorly designed. This PR addresses that.
My comment : Though I placed it under the `Review` section; but I didn’t
review the code here. Instead I gave my opinion regarding this in
(this comment)[https://lore.kernel.org/git/20220324181056.53824-1-chakrabortyabhradeep79@gmail.com/].

## Proposed Project

### Abstract

While sending large repositories, git has to decide what needs to be
sent, how to be sent. For large repositories, git creates pack files
where it packs all the related commits, trees, blobs( in the form of
deltas and bases ), tags etc. Now, finding the related objects among
all the objects might be very expensive.

As git only knows about the branch tips, it is very expensive to find
the related objects from all the objects if we try to traverse down
from the branch tips to all their respective related objects. It becomes
more expensive as the number of objects (i.e. commits, trees, blobs
etc.) increases. Ultimately resulting in slow fetching from the git
daemon.

To counter that, reachability bitmaps were introduced. It uses bitmap
data structure (an array of bits) to detect if an object is reachable
from a particular commit or not. The commit messages of this
(patch series)[https://lore.kernel.org/git/1372116193-32762-1-git-send-email-tanoku@gmail.com/]
are itself a documentation of how it is achieved. All the bitmap
indexes for selected commits (in a EWAH compressed form) are stored in
a `.bitmap` file which has the same name of its respective packfile.

The current task is to improve the performance of this bitmap approach.
There are three points that we need to work on -

1. Decompression of bitmaps is slow with EWAH. To know if an object is
   related to a particular commit, we have to decompress the whole thing.
   So, we have to consider other alternative techniques besides EWAH.

2. Loading a `.bitmap` file can be slow for large bitmaps. This is because
   we need to read the file sequentially in order to discover the offset of
   each bitmap within the file. It would be better if we can create a `table
   of contents` for these bitmaps.

3. Regenerating bitmaps can take a long time. One possible approach is to
   add a new mode which only adds new bitmaps for commits introduced between
   successive bitmap generations.

4. If we solve the above points, we would think of other aspects like
   rethinking how we select which commit receives bitmaps etc.

### Previous Work

I didn’t find any previous patches regarding point no. 1. But there were
some discussions about this. Derrick’s
(comment)[https://lore.kernel.org/git/1685a358-f033-64e0-2e12-df3a1c10af19@gmail.com/]
suggested considering the
(Roaring+Run)[https://roaringbitmap.org/about/]
technique. It is much faster than the currently implemented
(EWAH)[https://arxiv.org/abs/0901.3751].

Taylor initiated the `developing a table of contents for bitmap` work
by creating a
(branch and committing some patches)[https://github.com/git/git/compare/master...ttaylorr:tb/bitmap-commit-table]
in that branch. From the
(discussion)[https://lore.kernel.org/git/YNuiM8TR5evSeNsN@nand.local/]
and the performance test results provided in that discussion,  it seems
that it would improve overall performance. So, this would work as a
reference for me while developing the table of contents for bitmaps
(obviously after running performance tests and discussing the best
approach).

### Potential Problem

The biggest difficulty of this project would be to replace the EWAH
compression/decompression altogether with better techniques like the
Roaring+Run method. I have to be aware of it so that it doesn’t
generate any unintentional behavior or bugs.

### The plan

As this project includes three ( actually four) sub projects, I would
address them one by one -
1. Firstly I will understand the working of EWAH compression better.
   I have gone through their documentation and have seen git’s
   implementation code for it. The commit messages were really helpful
   here. But I think I should research more. After that,  I can compare
   (and discuss) each approach with respect to the EWAH approach on
   some parameters (like amount of computation in each approach,
   simplicity, memory usage, correctness i.e. does it always give the
   correct output etc.). This comparison would lead me to take the
   most ideal approach in this case. There are several techniques which
   may work/perform better than the current EWAH implementation. I
   have researched some of these techniques. One of them is obviously
   the Roaring+Run method. I have gone through their
   (research paper)[https://arxiv.org/pdf/1709.07821v6.pdf]
   on this technique; it seems like a good approach to me. Other than
   that, there is a golang package called
   (sroar)[https://dgraph.io/blog/post/serialized-roaring-bitmaps-golang/]
   (i.e. Serialized Roaring bitmaps), which is an open source package
   ( mainly built for (Dgraph)[https://dgraph.io/])
   and the creator of this package claims it is
   (7x faster using 15x less memory)[https://github.com/dgraph-io/sroar#benchmarks].
   It doesn’t require any encoding/decoding step. I would look into
   it ( It needs to be implemented in C language though). I have also
   looked into approaches like
   (bloom filter)[https://en.wikipedia.org/wiki/Bloom_filter],
   (Hyperloglog)[https://en.wikipedia.org/wiki/HyperLogLog]. These are
   very fast and use very less memory. But they give probabilistic
   answers and sometimes they may give wrong answers. So, those
   would not be a good fit for us I guess.
2. I will understand the bitmap related files such as
   `pack-bitmap-write.c`[4], `pack-bitmap.c`[5] etc. It would
   help me to decide where and how I can add the code related to
   `table of contents` for bitmap.
   Taylor’s (branch work)[https://github.com/git/git/compare/master...ttaylorr:tb/bitmap-commit-table]
   would work as a reference for me. As the Project idea section
   suggests, I would also include performance tests to continuously
   check whether my approach is really improving the performance or
   not.
3. As I understand more about bitmap related files, I will surely
   be able to think whether `adding a new mode` is really necessary
   or not (It is conceptually a good idea though). If necessary,
   then I would first make the flow design and discuss it with the
   community. If all is good, I will start working on it. Creating
   and running performance tests is a must.
4. While developing all of these, my understanding of bitmap and
   its implementation will get better. So, that would help me to
   think more practically about the questions provided in the `Project
   Idea` section. Then I would work on it accordingly.

## Estimated Timeline

My Semester exams will most probably start from May last week. So,
I may be inactive (or less active) that time.
(Smart India Hackathon)[https://www.sih.gov.in]
(semi finals and finals) may also be held during the GSoC event.
I will get you notified about that.

--> Period
- Pre GSoC Period
- Till May 20
--> Tasks
- Research more on the bitmap files, compression techniques etc.
- See if there are any other (better) techniques for it.

--> Period
- Community Bonding
- May 20 - June 12
--> Tasks
- Discussion on the project started
- Decide which approach is better for compression/decompression of
  bitmaps.
- If a decision is not taken, create branches for each approach.
- Create branches for each sub project and make skeletons for them.

--> Period
- Coding Phase 1
- June 13 - July 25
--> Tasks
- Start working on sub-project no. 1 (i.e. bitmap compression
  technique)
- Add Performance test
- If multiple branch are create for this sub-project, compare
  the test results
- Take the best performing approach, delete other branches.
- Update documentation accordingly
- Fix bugs
- Finish sub-project 1

--> Period
- Coding Phase 2
- July 25 - Sept 4
--> Tasks
- Fix sub-project-1 related small bugs
- Start working on sub-project-2 (i.e. adding
  `table of content` for bitmap files)
- Add Performance test
- Add Documentation
- Fix bugs
- If finished, start working on sub-project-3 (i.e. adding a
  `new mode`)
- Try to finish it with documentation and performance tests

--> Period
- Final Week 1
- Sept 5 - Sept 12
--> Tasks
- If both are finished, fix bugs and test overall performance.
  Else extend the period.
- Discuss about the last sub project (i.e. questions )
- If anything needs to be changed/modified or added, extend the
  GSoC Period

--> Period
- Coding Phase 3
- Sept 13 - Nov 13
--> Tasks
- Finish code related to all sub projects
- Add overall documentation
- Add overall performance tests


## Blogging about Git

I love to write blogs and technical articles. Those are my ways to
connect and communicate with the developers community. I have
previously written articles about Django in a portal named
(GeeksForGeeks)[https://auth.geeksforgeeks.org/user/riter79/articles].
Recently I started writing  blogs on
(Medium platform)[https://medium.com/@abhra303].

During the GSoC event, I will frequently write blogs about my
progress. I will also share the problems I face and the solutions
of those problems that I consider.

I got interested in git by reading Olga’s GSoC blogs. That led me
to contribute to this codebase. So, someone like me will hopefully
be motivated by seeing my blogs ;-)

## Availability

I intentionally freed my summer slot for GSoC. So, there wouldn’t
be much disturbance/inactive days while working on my project.
Generally, I can spend nearly 35-40 hours a week on this project.
If the college gets closed for some reason, then the amount of
available time will increase. Smart India Hackathon finale would
be held in July. So, I may be inactive for a few days. But overall,
I will be available most of the time.

## Post GSoC

Once the GSoC finishes, I would look into other issues/tickets and
would send PRs solving those issues. As I said before, I want to
be a long term contributor to git - I would definitely maintain the
flow and continue to work on improving the overall git tool.

## Experience with Open Source

I have a pretty good contribution history in open source. The
organizations/tools where I contributed to are - 

 - Mozilla/bedrock - written mainly in Python Django and javaScript
   link - https://github.com/mozilla/bedrock/commits?author=Abhra303

 - Datree - a Kubernetes misconfiguration checker tool written mainly
   in golang
   link - https://github.com/datreeio/datree/commits?author=Abhra303

 - KubeBuilder - A tool to extend the Kubernetes API by creating custom
   CRDs - written mainly in golang. I recently started looking into it.
   link : https://github.com/kubernetes-sigs/kubebuilder/issues/2567

## Motivation

I always wanted to develop/contribute to projects that affect millions
of people. Git is a tool that most ( or maybe every) developers
use for their daily work. Contributing to it means my written code
will be used behind the scenes in every developer’s machine. That’s my
motivation.

## Closing Remarks

Lastly, I want to say that I am a dedicated, research preferred quick
learner. I am always ready to solve real world problems and learn
required skills for them. For example, I recently took part in a
hackathon called “Cloud Native Hackathon” and won 1st place in the
category of “best use of Datree”. I knew nothing about
(Datree)[https://datree.io/] ( and (Kubernetes)[https://kubernetes.io]
partially) before this hackathon and the hackathon was only 2 days
long. But I researched hard ( to know the Kubernetes misconfigurations)
and learned  jsonSchema ( Here is my
(blog link)[https://medium.com/@abhra303/how-i-won-cloud-native-hackathon-2021-4b1f5b2a84f2]
if you want to know my full experience).

I hope you’ll give me a chance to make an impact among the developers
community by considering my application.

Thanks & Regards,

Abhradeep Chakraborty



[1] https://github.com/gitgitgadget/git/blob/master/Documentation/MyFirstContribution.txt
[2] https://github.com/gitgitgadget/git/blob/master/Documentation/MyFirstObjectWalk.txt
[3] https://git-scm.com/docs/user-manual#hacking-git
[4] https://github.com/git/git/blob/master/pack-bitmap-write.c
[5] https://github.com/git/git/blob/master/pack-bitmap.c

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [GSoC][RFC][PROPOSAL] Reachability bitmap improvements
@ 2022-04-06 20:04 Abhradeep Chakraborty
  2022-04-13 19:29 ` Taylor Blau
  0 siblings, 1 reply; 12+ messages in thread
From: Abhradeep Chakraborty @ 2022-04-06 20:04 UTC (permalink / raw)
  To: git; +Cc: Abhradeep Chakraborty, Taylor Blau, Kaartic Sivaraam

Hello everyone,

Here is the initial version of my proposal on "Reachability bitmap
improvements". I would really love to have comments on it. If you
find any mistakes please notify me about that.

The proposal template is inspired by Hariom Verma's (This year's
GSoC mentor) previous year's gsoc proposal.

Thanks :)

=============================================

View on Docs: https://docs.google.com/document/d/15tZ5QbMtYN41No5wdMw-2Ilp63vquB7OqAwBCLl_0n0/edit?usp=sharing

==============================================

**Reachability bitmap improvements**

Name  :  Abhradeep Chakraborty
Major : Information Technology
Mobile No. : [ ... ]
Email : chakrabortyabhradeep79@gmail.com
GitHub Profile: (Abhra303)[https://github.com/Abhra303]
LinkedIn : www.linkedin.com/in/abhradeep-chakraborty-b36426201
Time Zone : Indian Standard Time ( IST ) ( UTC +05:30 )

## About Me

I am Abhradeep Chakraborty, currently pursuing B.Tech in Information
Technology (now in my 3rd year) at Jalpaiguri Government Engineering
College, India.

In the last two years, I mainly concentrated on learning and building
web development things. I mainly use Django, React for my projects. But
for the past 6 months, I have been learning about devops, cloud technologies
( e.g. docker, Kubernetes, AWS etc.), computer networking and core
technologies (such as git, linux, gcc etc.). I also have an interest in
some research based technologies like WebRTC, Web3 etc.

My ultimate objective is to be a dedicated and active contributor to all
of these technologies. So, I want to start this contribution journey with
`git` and therefore I want to be a part of this GSoC program.

## Me & Git

I have started exploring the Git codebase since Sept 2021. At first, I
mainly focused on knowing the git internals. During this time I read
documentations (Under `Documentation` directory). `MyFirstContribution.txt`[1],
`MyFirstObjectWalk.txt`[2] and `Hacking Git`[3] helped me to learn about the git
contribution workflow, how `git log` ( in other words `object walk`) works
internally.

Though I had read many documentations, I was still not able to fully
understand the terminologies (like `refs`, `packfiles`, `blobs`, `trees`
etc.). (ProGit)[https://git-scm.com/book/en/v2] helped me tremendously here.
I read the full book and it cleared almost every doubt that came into my mind.

I have provided my git involvement history below. Activities are sorted
by timeline from top to bottom.

==============================
PRs, Commits & Discussion Participations
==============================

---------------------------------
[patch] Amend error messages that violate git coding style convention
Github link : https://github.com/gitgitgadget/git/pull/1062
Status : Not Merged/ PR
Description : There are many error strings which do not follow coding
style guides. This PR was to fix those. There is a ticket dedicated to
this ( github Issue - (#635)[https://github.com/gitgitgadget/git/issues/635]).
The author of this issue said it would be better to manually correct
these things.
Outcome : As this was my first PR, I made many mistakes and I faced
difficulties. Though this was not merged, I learned tremendous new
things. These violating strings were in many files ( nearly 200 files).
At first, I fixed all the files at once manually. I fixed errors
generated by those changes. Ultimately, I learned about overall git
structures, how to fix errors etc.

---------------------------------
[RFC PATCH 0/1] making set-upstream have default arguments
Mailing List : https://lore.kernel.org/git/20211202144354.17416-1-chakrabortyabhradeep79@gmail.com/
Status : Not Merged/ Closed
Description : `git push -u` generally ( if `push.default` not set)
throws an error if no arguments are provided. This patch request was to
teach `git push -u` to push and set a default upstream if no arguments
are provided. There is a github (issue)[https://github.com/gitgitgadget/git/issues/1061]
also.
Outcome : This PR let `git push -u` to push to a default branch and
set that as the upstream irrespective of `push.default` configuration.
I realized it was badly affecting the reason `push.default` is built
for ( See this (comment)[https://lore.kernel.org/git/20220104132839.1209-1-chakrabortyabhradeep79@gmail.com/]).
So, I intentionally stopped working on it.

----------------------------------
Avoid checking for -h in commands that already use parse-options API ( Issue)
Github Issue Link : https://github.com/gitgitgadget/git/issues/1125
Status : Closed

----------------------------------
[PATCH] Add usage strings ci check and amend remaining usage strings
Github PR Link : https://github.com/gitgitgadget/git/pull/1147
Mailing List Link : https://lore.kernel.org/git/pull.1147.git.1645030949730.gitgitgadget@gmail.com/
Status : Merged into Master
Description : According to dscho’s (comment)[https://github.com/gitgitgadget/git/issues/636#issuecomment-1018660439],
There were still some usage strings that violated the style guide.
So, this patch request fixed those strings and added a CI check to
verify all the usage strings are passing the style check. In the first
version, I built a shell script to check the validity of strings.
Outcome : The first commit was merged into master. There were some
contradictions about the different approaches suggested by reviewers
in the discussion about the last commit i.e. whether the check should
be static check or runtime check, whether I should use `parse options`
method or coccinelle etc (see the (mailing list)[https://lore.kernel.org/git/nycvar.QRO.7.76.6.2202251632320.11118@tvgsbejvaqbjf.bet/]).
Lastly, it was decided that I should try the `coccinelle` method first
and check if it is better ( see this (comment)[https://lore.kernel.org/git/20220304142154.2350-1-chakrabortyabhradeep79@gmail.com/]).
I will try to submit a PR regarding this before April 19.

==================================
MicroProject
==================================

[PATCH] Partial-Clone : add a partial-clone test case
Github PR Link : https://github.com/gitgitgadget/git/pull/1175
Status : Merged into Master
Github Issue : https://github.com/gitgitgadget/git/issues/827
Description : In a blobless-cloned repo, `git log –follow – <path>`
(`<path>` have an exact OID rename) shouldn’t download blobs of the
file from where the new file is renamed. This patch added a test case
to verify it.
Outcome : The test was added to the master branch. Moreover, another
issue was discovered while discussing the patch request. Junio discovered
that the test helper function `test_subcommand_inexact` was poorly designed.
Later Derrick submitted a (PR)[https://github.com/gitgitgadget/git/pull/1185]
to fix it ( now merged).

==================================
Review
==================================
[PATCH] test-lib-functions : fix test_subcommand_inexact
Github PR Link : https://github.com/gitgitgadget/git/pull/1185
Description : Junio discovered that the `test_subcommand_inexact` function
(test-helper function) is poorly designed. This PR addresses that.
My comment : Though I placed it under the `Review` section; but I didn’t
review the code here. Instead I gave my opinion regarding this in
(this comment)[https://lore.kernel.org/git/20220324181056.53824-1-chakrabortyabhradeep79@gmail.com/].

## Proposed Project

### Abstract

While sending large repositories, git has to decide what needs to be
sent, how to be sent. For large repositories, git creates pack files
where it packs all the related commits, trees, blobs( in the form of
deltas and bases ), tags etc. Now, finding the related objects among
all the objects might be very expensive.

As git only knows about the branch tips, it is very expensive to find
the related objects from all the objects if we try to traverse down
from the branch tips to all their respective related objects. It becomes
more expensive as the number of objects (i.e. commits, trees, blobs
etc.) increases. Ultimately resulting in slow fetching from the git
daemon.

To counter that, reachability bitmaps were introduced. It uses bitmap
data structure (an array of bits) to detect if an object is reachable
from a particular commit or not. The commit messages of this
(patch series)[https://lore.kernel.org/git/1372116193-32762-1-git-send-email-tanoku@gmail.com/]
are itself a documentation of how it is achieved. All the bitmap
indexes for selected commits (in a EWAH compressed form) are stored in
a `.bitmap` file which has the same name of its respective packfile.

The current task is to improve the performance of this bitmap approach.
There are three points that we need to work on -

1. Decompression of bitmaps is slow with EWAH. To know if an object is
   related to a particular commit, we have to decompress the whole thing.
   So, we have to consider other alternative techniques besides EWAH.

2. Loading a `.bitmap` file can be slow for large bitmaps. This is because
   we need to read the file sequentially in order to discover the offset of
   each bitmap within the file. It would be better if we can create a `table
   of contents` for these bitmaps.

3. Regenerating bitmaps can take a long time. One possible approach is to
   add a new mode which only adds new bitmaps for commits introduced between
   successive bitmap generations.

4. If we solve the above points, we would think of other aspects like
   rethinking how we select which commit receives bitmaps etc.

### Previous Work

I didn’t find any previous patches regarding point no. 1. But there were
some discussions about this. Derrick’s
(comment)[https://lore.kernel.org/git/1685a358-f033-64e0-2e12-df3a1c10af19@gmail.com/]
suggested considering the
(Roaring+Run)[https://roaringbitmap.org/about/]
technique. It is much faster than the currently implemented
(EWAH)[https://arxiv.org/abs/0901.3751].

Taylor initiated the `developing a table of contents for bitmap` work
by creating a
(branch and committing some patches)[https://github.com/git/git/compare/master...ttaylorr:tb/bitmap-commit-table]
in that branch. From the
(discussion)[https://lore.kernel.org/git/YNuiM8TR5evSeNsN@nand.local/]
and the performance test results provided in that discussion,  it seems
that it would improve overall performance. So, this would work as a
reference for me while developing the table of contents for bitmaps
(obviously after running performance tests and discussing the best
approach).

### Potential Problem

The biggest difficulty of this project would be to replace the EWAH
compression/decompression altogether with better techniques like the
Roaring+Run method. I have to be aware of it so that it doesn’t
generate any unintentional behavior or bugs.

### The plan

As this project includes three ( actually four) sub projects, I would
address them one by one -
1. Firstly I will understand the working of EWAH compression better.
   I have gone through their documentation and have seen git’s
   implementation code for it. The commit messages were really helpful
   here. But I think I should research more. After that,  I can compare
   (and discuss) each approach with respect to the EWAH approach on
   some parameters (like amount of computation in each approach,
   simplicity, memory usage, correctness i.e. does it always give the
   correct output etc.). This comparison would lead me to take the
   most ideal approach in this case. There are several techniques which
   may work/perform better than the current EWAH implementation. I
   have researched some of these techniques. One of them is obviously
   the Roaring+Run method. I have gone through their
   (research paper)[https://arxiv.org/pdf/1709.07821v6.pdf]
   on this technique; it seems like a good approach to me. Other than
   that, there is a golang package called
   (sroar)[https://dgraph.io/blog/post/serialized-roaring-bitmaps-golang/]
   (i.e. Serialized Roaring bitmaps), which is an open source package
   ( mainly built for (Dgraph)[https://dgraph.io/])
   and the creator of this package claims it is
   (7x faster using 15x less memory)[https://github.com/dgraph-io/sroar#benchmarks].
   It doesn’t require any encoding/decoding step. I would look into
   it ( It needs to be implemented in C language though). I have also
   looked into approaches like
   (bloom filter)[https://en.wikipedia.org/wiki/Bloom_filter],
   (Hyperloglog)[https://en.wikipedia.org/wiki/HyperLogLog]. These are
   very fast and use very less memory. But they give probabilistic
   answers and sometimes they may give wrong answers. So, those
   would not be a good fit for us I guess.
2. I will understand the bitmap related files such as
   `pack-bitmap-write.c`[4], `pack-bitmap.c`[5] etc. It would
   help me to decide where and how I can add the code related to
   `table of contents` for bitmap.
   Taylor’s (branch work)[https://github.com/git/git/compare/master...ttaylorr:tb/bitmap-commit-table]
   would work as a reference for me. As the Project idea section
   suggests, I would also include performance tests to continuously
   check whether my approach is really improving the performance or
   not.
3. As I understand more about bitmap related files, I will surely
   be able to think whether `adding a new mode` is really necessary
   or not (It is conceptually a good idea though). If necessary,
   then I would first make the flow design and discuss it with the
   community. If all is good, I will start working on it. Creating
   and running performance tests is a must.
4. While developing all of these, my understanding of bitmap and
   its implementation will get better. So, that would help me to
   think more practically about the questions provided in the `Project
   Idea` section. Then I would work on it accordingly.

## Estimated Timeline

My Semester exams will most probably start from May last week. So,
I may be inactive (or less active) that time.
(Smart India Hackathon)[https://www.sih.gov.in]
(semi finals and finals) may also be held during the GSoC event.
I will get you notified about that.

--> Period
- Pre GSoC Period
- Till May 20
--> Tasks
- Research more on the bitmap files, compression techniques etc.
- See if there are any other (better) techniques for it.

--> Period
- Community Bonding
- May 20 - June 12
--> Tasks
- Discussion on the project started
- Decide which approach is better for compression/decompression of
  bitmaps.
- If a decision is not taken, create branches for each approach.
- Create branches for each sub project and make skeletons for them.

--> Period
- Coding Phase 1
- June 13 - July 25
--> Tasks
- Start working on sub-project no. 1 (i.e. bitmap compression
  technique)
- Add Performance test
- If multiple branch are create for this sub-project, compare
  the test results
- Take the best performing approach, delete other branches.
- Update documentation accordingly
- Fix bugs
- Finish sub-project 1

--> Period
- Coding Phase 2
- July 25 - Sept 4
--> Tasks
- Fix sub-project-1 related small bugs
- Start working on sub-project-2 (i.e. adding
  `table of content` for bitmap files)
- Add Performance test
- Add Documentation
- Fix bugs
- If finished, start working on sub-project-3 (i.e. adding a
  `new mode`)
- Try to finish it with documentation and performance tests

--> Period
- Final Week 1
- Sept 5 - Sept 12
--> Tasks
- If both are finished, fix bugs and test overall performance.
  Else extend the period.
- Discuss about the last sub project (i.e. questions )
- If anything needs to be changed/modified or added, extend the
  GSoC Period

--> Period
- Coding Phase 3
- Sept 13 - Nov 13
--> Tasks
- Finish code related to all sub projects
- Add overall documentation
- Add overall performance tests


## Blogging about Git

I love to write blogs and technical articles. Those are my ways to
connect and communicate with the developers community. I have
previously written articles about Django in a portal named
(GeeksForGeeks)[https://auth.geeksforgeeks.org/user/riter79/articles].
Recently I started writing  blogs on
(Medium platform)[https://medium.com/@abhra303].

During the GSoC event, I will frequently write blogs about my
progress. I will also share the problems I face and the solutions
of those problems that I consider.

I got interested in git by reading Olga’s GSoC blogs. That led me
to contribute to this codebase. So, someone like me will hopefully
be motivated by seeing my blogs ;-)

## Availability

I intentionally freed my summer slot for GSoC. So, there wouldn’t
be much disturbance/inactive days while working on my project.
Generally, I can spend nearly 35-40 hours a week on this project.
If the college gets closed for some reason, then the amount of
available time will increase. Smart India Hackathon finale would
be held in July. So, I may be inactive for a few days. But overall,
I will be available most of the time.

## Post GSoC

Once the GSoC finishes, I would look into other issues/tickets and
would send PRs solving those issues. As I said before, I want to
be a long term contributor to git - I would definitely maintain the
flow and continue to work on improving the overall git tool.

## Experience with Open Source

I have a pretty good contribution history in open source. The
organizations/tools where I contributed to are - 

 - Mozilla/bedrock - written mainly in Python Django and javaScript
   link - https://github.com/mozilla/bedrock/commits?author=Abhra303

 - Datree - a Kubernetes misconfiguration checker tool written mainly
   in golang
   link - https://github.com/datreeio/datree/commits?author=Abhra303

 - KubeBuilder - A tool to extend the Kubernetes API by creating custom
   CRDs - written mainly in golang. I recently started looking into it.
   link : https://github.com/kubernetes-sigs/kubebuilder/issues/2567

## Motivation

I always wanted to develop/contribute to projects that affect millions
of people. Git is a tool that most ( or maybe every) developers
use for their daily work. Contributing to it means my written code
will be used behind the scenes in every developer’s machine. That’s my
motivation.

## Closing Remarks

Lastly, I want to say that I am a dedicated, research preferred quick
learner. I am always ready to solve real world problems and learn
required skills for them. For example, I recently took part in a
hackathon called “Cloud Native Hackathon” and won 1st place in the
category of “best use of Datree”. I knew nothing about
(Datree)[https://datree.io/] ( and (Kubernetes)[https://kubernetes.io]
partially) before this hackathon and the hackathon was only 2 days
long. But I researched hard ( to know the Kubernetes misconfigurations)
and learned  jsonSchema ( Here is my
(blog link)[https://medium.com/@abhra303/how-i-won-cloud-native-hackathon-2021-4b1f5b2a84f2]
if you want to know my full experience).

I hope you’ll give me a chance to make an impact among the developers
community by considering my application.

Thanks & Regards,

Abhradeep Chakraborty



[1] https://github.com/gitgitgadget/git/blob/master/Documentation/MyFirstContribution.txt
[2] https://github.com/gitgitgadget/git/blob/master/Documentation/MyFirstObjectWalk.txt
[3] https://git-scm.com/docs/user-manual#hacking-git
[4] https://github.com/git/git/blob/master/pack-bitmap-write.c
[5] https://github.com/git/git/blob/master/pack-bitmap.c

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-04-19  0:10 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-06 20:46 [GSoC][RFC][PROPOSAL] Reachability bitmap improvements Abhradeep Chakraborty
  -- strict thread matches above, loose matches on Subject: below --
2022-04-19  0:10 [GSoC][RFC][Proposal] " Plato Kiorpelidis
2022-04-06 20:09 [GSoC][RFC][PROPOSAL] " Abhradeep Chakraborty
2022-04-13 20:03 ` Kaartic Sivaraam
2022-04-14  8:20   ` Abhradeep Chakraborty
2022-04-14 20:29     ` Kaartic Sivaraam
2022-04-15  5:23       ` Abhradeep Chakraborty
2022-04-06 20:04 Abhradeep Chakraborty
2022-04-13 19:29 ` Taylor Blau
2022-04-14  7:58   ` Abhradeep Chakraborty
2022-04-14 14:24   ` Philip Oakley
2022-04-15  5:28     ` Abhradeep Chakraborty

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).