git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [GSoC] [PATCH 0/1] draft proposal for reachability bitmap
@ 2022-04-09 18:43 Shubham Mishra
  2022-04-09 18:43 ` [PATCH 1/1] [GSoC] Initial draft proposal Shubham Mishra
  2022-04-09 18:43 ` [PATCH 1/1] " Shubham Mishra
  0 siblings, 2 replies; 4+ messages in thread
From: Shubham Mishra @ 2022-04-09 18:43 UTC (permalink / raw)
  To: git; +Cc: me, kaartic.sivaraam, Shubham Mishra

Hi Community,
I am sharing draft for my proposal on reachability bitmap. I am looking
to get your reviews on proposal on every aspect that includes planning,
grammar, or anything I might missed to mention.

Thanks, looking forward to get some comments and improve the initial
version.

google doc link - https://docs.google.com/document/d/1Zei9g3mYa0SAz6Z94cIx_1tgFeID6xIL5Qb3Un_itVQ

Thanks,
Shubham


Shubham Mishra (1):
  Initial draft proposal

 proposal.txt | 132 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 132 insertions(+)
 create mode 100644 proposal.txt


base-commit: bf23fe5c37d62f37267d31d4afa1a1444f70cdac
-- 
2.25.1


^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH 1/1] [GSoC] Initial draft proposal
  2022-04-09 18:43 [GSoC] [PATCH 0/1] draft proposal for reachability bitmap Shubham Mishra
@ 2022-04-09 18:43 ` Shubham Mishra
  2022-04-09 18:43 ` [PATCH 1/1] " Shubham Mishra
  1 sibling, 0 replies; 4+ messages in thread
From: Shubham Mishra @ 2022-04-09 18:43 UTC (permalink / raw)
  To: git; +Cc: me, kaartic.sivaraam, Shubham Mishra

Initial draft proposal for reachability bitmap

Signed-off-by: Shubham Mishra <shivam828787@gmail.com>
---
 proposal.txt | 132 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 132 insertions(+)
 create mode 100644 proposal.txt

diff --git a/proposal.txt b/proposal.txt
new file mode 100644
index 0000000000..0888ba8b38
--- /dev/null
+++ b/proposal.txt
@@ -0,0 +1,132 @@
+# GIT
+
+# Reachability bitmap performance improvement
+
+Name - Shubham Mishra
+
+Email - [shivam828787@gmail.com](mailto:shivam828787@gmail.com)
+
+IRC Nick - shubham828
+
+Location - Ghaziabad, Uttar Pradesh , India , UTC+5:30
+
+Proposal Title - Reachability bitmap performance improvement.
+
+* * *
+
+About Me
+
+I am Shubham, I am currently working as a Software Engineer at Microsoft India. I am a 2021 graduate from Delhi University. I am passionate about core engineering and backend technologies. I love to see beyond all abstractions and how things really work under the hood. So, I can work from their roots and make things better. I feel engineering is all about the tradeoffs that we make and I am trying to learn them to become a better Engineer.
+
+I am passionate about open source technologies and have quite a good amount of contribution to them, I participated in GSoC 2020 with [KDE](https://summerofcode.withgoogle.com/archive/2020/projects/6473982317953024), did Internship with Linux Foundation - [HDV](https://wiki.lfnetworking.org/display/LN/HDV), Season of KDE - [report](https://community.kde.org/SoK/2020/StatusReport/Shubham), and I am doing voluntary contributions to VSperf, BoostC++ and some other open-source projects.
+
+Motivation for Proposal
+
+I have been using Git for the last 3 years now, and I always find myself curious about how it manages a lot of files seamlessly to make developer collaboration so smooth.
+
+I was listening to one of Derrick Stolee's podcasts, where I felt git contributors really do cool stuff, So I also wanted to be one. By becoming a regular contributor of git, I can give my contribution back to it as well as I can show off it to my friends :) . I love to study advanced data structures and Algorithms that's the reason I chose a bitmap related project. I will get a chance to learn different compression algorithms and analyze their performance.
+
+Project Abstract
+
+During repository clones, the Git server needs to find out all the objects which clients do not have and need to be sent to the client.
+
+To make the process faster, Git uses bitmaps to quickly find all the related objects from an object. Bitmap approach is a performance optimization over the legacy "Counting Objects" - the process in which the git server used to iterate through the graph from branch tips to the beginning of history to list down all objects that need to be sent.
+
+bitmap made reachability faster but uncompressed bitmaps can cost a lot of extra storage. Git uses a C ported version of [EWAHBoolArray](https://github.com/lemire/EWAHBoolArray) to compress bitmaps which get stored in the ".bitmap" file with the same prefix "sha" as ".pack" and ".idx".
+
+The aim of the project is to design a performance test suite as well as do the necessary changes to improve bitmap performance by trying out a new compression scheme that can make read operations along with other common operations like intersect, union and negate faster.
+
+Me & Git:
+
+## Microproject:
+
+I worked on the microproject "Avoid pipes in git related commands in test scripts", the patches for it has been merged to master now
+
+*   [https://public-inbox.org/git/20220224054720.23996-3-shivam828787@gmail.com](https://public-inbox.org/git/20220224054720.23996-3-shivam828787@gmail.com/)
+*   [https://public-inbox.org/git/20220224054720.23996-3-shivam828787@gmail.co/](https://public-inbox.org/git/20220224054720.23996-3-shivam828787@gmail.com/)
+    
+
+I run a pattern matching grep to find all git commands on LHS of pipes and fix all of them from file t001-t050.
+
+As an outcome of this process, I got to learn the code review process at git work, which is quite cool and different from other organization's I contributed to before.
+
+I learned about building source code, running tests, using email to send patches, communicating with reviewers and sending the next patch version process.
+
+## Current understanding:
+
+*   I have gone through git internals, and I well understood about the pack files as well as the difference between git objects (tree, blob, commit).
+*   I have gone through some documentations - "MyFirstObjectWalk", etc. it was a good hands-on to get some glimpse of general object related tasks.
+*   I understand how bitmap works in general, I have got some idea how EWAH compression works and also I have gone through the research paper on roaring run.
+*   I played with commands of pack-object - "git pack-objects dir --progress < obj_lists.txt" and read the code of related files "pack-bitmap.c" and parts of "pack-object.c"
+*   I checked the general documentation of [Croaring](https://github.com/RoaringBitmap/CRoaring) as one of the potential alternatives to EWAH.
+    
+
+Execution plan:
+
+I am interested in keeping my primary focus on "building a performance suite and improving bitmaps performance by finding a better compression scheme" project and if I finish this early or even after the GSoC timeline, I will be happy to contribute to other tasks too.
+
+From the idea page, I got some sense that decompressing a bitmap for reading bits or doing operations like intersection, negation and union makes them slow and can be improved.
+
+Roaring + Run was the suggested alternative to explore. It divides the data into chunks of 2^16, which allows you to check for the presence of any one value faster. As a result, Roaring can compute many operations much faster than run-length-encoded formats like WAH, EWAH, Concise. After getting a high level understanding of algorithms, I explored a bit of the [Croaring](https://github.com/RoaringBitmap/CRoaring) library which is a C implementation of roaring bitmap. It provides a lot of useful functions to do all general operations (find cardinality, and, or, copy, equals). Which I think we will be using in the "pack-bitmap.c" and "pack-bitmap-write.c" files as a replacement of ewah/bitmap functions. I do not have enough knowledge yet to figure out how compatible croaring is with the current .bitmap format. We might need to make changes in the current .bitmap format accordingly.
+
+Steps I will be following to accomplish the task-
+
+1\. Get a better understanding of bitmap related functionalities/ codebase, EWAH, roaring+run techniques.
+
+2\. try to build out an initial draft version implementing only minimal required core changes, I will try to get a review on it from a wider audience (including mentors).
+
+3\. make changes according to the comment and repeat the review process.
+
+4\. build a performance suite for the changes I have done in the above steps.
+
+5\. If performance improves, I will be writing the rest of the required code changes to use the Croaring including perf tests for them.
+
+6\. Until this time, I will also get a good understanding of the bitmap related projects, so if we will be able to make good progress on roaring+run. I can start picking other subprojects too like 'table of contents' for the .bitmap file where past work - [https://lore.kernel.org/git/YNuiM8TR5evSeNsN@nand.local/](https://lore.kernel.org/git/YNuiM8TR5evSeNsN@nand.local/) can act as a good reference to me or/and 'append-only bitmap generation' subproject.
+
+I feel for any mentoring program, Communication is the key to success. I will be utilizing the community bonding periods to figure out a process for being in regular touch with mentors, which is essential to make sure I am going in the right direction.
+
+Timeline
+
+I am available to dedicate around 30-35 hours every week to the project.
+
+Community bonding periods -
+
+1.  Exploring code base mostly related to bitmaps
+2.  research on other bitmap compression techniques
+3.  reading technical documents
+4.  interacting with mentors to understand them and project in more detail.
+    
+
+12 June - 25 July:
+
+1.  Writing the first version with the new compression technique
+2.  Get the initial version reviewed by reviewers and make changes accordingly.
+3.  write performance test for these changes
+    
+
+25 July - 4 sept
+
+1.  if tests result well, extending the above functionality to completely move to a new technique.
+2.  writing more performance tests.
+3.  start picking up other tasks if time left
+    
+
+Sept 5 - Sept 12
+
+1.  I will make sure to get all changes merged before this week including tests
+2.  if not, make a decision with mentior on extending the project
+    
+
+Other proposals
+
+No, This is the only proposal I am making for GSoC 2022.
+
+Blog
+
+I want to make blog writing a habit so I planned to publish biweekly blogs at [https://shubham828.github.io/](https://shubham828.github.io/) during this GSoC and after that. This is something I started during my last GSoC too but unfortunately couldn't continue post-gsoc. This GSoC gives me another opportunity to become a regular blogger.
+
+After GSoC
+
+I would love to be a regular contributor of Git. After GSoC, I can pick any left out subprojects of bitmap reachability and I would also be happy to extend my knowledge beyond bitmaps to learn and contribute to other parts of git too.
+
+EndFragment
\ No newline at end of file
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH 1/1] Initial draft proposal
  2022-04-09 18:43 [GSoC] [PATCH 0/1] draft proposal for reachability bitmap Shubham Mishra
  2022-04-09 18:43 ` [PATCH 1/1] [GSoC] Initial draft proposal Shubham Mishra
@ 2022-04-09 18:43 ` Shubham Mishra
  2022-04-13 20:44   ` Taylor Blau
  1 sibling, 1 reply; 4+ messages in thread
From: Shubham Mishra @ 2022-04-09 18:43 UTC (permalink / raw)
  To: git; +Cc: me, kaartic.sivaraam, Shubham Mishra

Initial draft proposal for reachability bitmap

Signed-off-by: Shubham Mishra <shivam828787@gmail.com>
---
 proposal.txt | 132 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 132 insertions(+)
 create mode 100644 proposal.txt

diff --git a/proposal.txt b/proposal.txt
new file mode 100644
index 0000000000..0888ba8b38
--- /dev/null
+++ b/proposal.txt
@@ -0,0 +1,132 @@
+# GIT
+
+# Reachability bitmap performance improvement
+
+Name - Shubham Mishra
+
+Email - [shivam828787@gmail.com](mailto:shivam828787@gmail.com)
+
+IRC Nick - shubham828
+
+Location - Ghaziabad, Uttar Pradesh , India , UTC+5:30
+
+Proposal Title - Reachability bitmap performance improvement.
+
+* * *
+
+About Me
+
+I am Shubham, I am currently working as a Software Engineer at Microsoft India. I am a 2021 graduate from Delhi University. I am passionate about core engineering and backend technologies. I love to see beyond all abstractions and how things really work under the hood. So, I can work from their roots and make things better. I feel engineering is all about the tradeoffs that we make and I am trying to learn them to become a better Engineer.
+
+I am passionate about open source technologies and have quite a good amount of contribution to them, I participated in GSoC 2020 with [KDE](https://summerofcode.withgoogle.com/archive/2020/projects/6473982317953024), did Internship with Linux Foundation - [HDV](https://wiki.lfnetworking.org/display/LN/HDV), Season of KDE - [report](https://community.kde.org/SoK/2020/StatusReport/Shubham), and I am doing voluntary contributions to VSperf, BoostC++ and some other open-source projects.
+
+Motivation for Proposal
+
+I have been using Git for the last 3 years now, and I always find myself curious about how it manages a lot of files seamlessly to make developer collaboration so smooth.
+
+I was listening to one of Derrick Stolee's podcasts, where I felt git contributors really do cool stuff, So I also wanted to be one. By becoming a regular contributor of git, I can give my contribution back to it as well as I can show off it to my friends :) . I love to study advanced data structures and Algorithms that's the reason I chose a bitmap related project. I will get a chance to learn different compression algorithms and analyze their performance.
+
+Project Abstract
+
+During repository clones, the Git server needs to find out all the objects which clients do not have and need to be sent to the client.
+
+To make the process faster, Git uses bitmaps to quickly find all the related objects from an object. Bitmap approach is a performance optimization over the legacy "Counting Objects" - the process in which the git server used to iterate through the graph from branch tips to the beginning of history to list down all objects that need to be sent.
+
+bitmap made reachability faster but uncompressed bitmaps can cost a lot of extra storage. Git uses a C ported version of [EWAHBoolArray](https://github.com/lemire/EWAHBoolArray) to compress bitmaps which get stored in the ".bitmap" file with the same prefix "sha" as ".pack" and ".idx".
+
+The aim of the project is to design a performance test suite as well as do the necessary changes to improve bitmap performance by trying out a new compression scheme that can make read operations along with other common operations like intersect, union and negate faster.
+
+Me & Git:
+
+## Microproject:
+
+I worked on the microproject "Avoid pipes in git related commands in test scripts", the patches for it has been merged to master now
+
+*   [https://public-inbox.org/git/20220224054720.23996-3-shivam828787@gmail.com](https://public-inbox.org/git/20220224054720.23996-3-shivam828787@gmail.com/)
+*   [https://public-inbox.org/git/20220224054720.23996-3-shivam828787@gmail.co/](https://public-inbox.org/git/20220224054720.23996-3-shivam828787@gmail.com/)
+    
+
+I run a pattern matching grep to find all git commands on LHS of pipes and fix all of them from file t001-t050.
+
+As an outcome of this process, I got to learn the code review process at git work, which is quite cool and different from other organization's I contributed to before.
+
+I learned about building source code, running tests, using email to send patches, communicating with reviewers and sending the next patch version process.
+
+## Current understanding:
+
+*   I have gone through git internals, and I well understood about the pack files as well as the difference between git objects (tree, blob, commit).
+*   I have gone through some documentations - "MyFirstObjectWalk", etc. it was a good hands-on to get some glimpse of general object related tasks.
+*   I understand how bitmap works in general, I have got some idea how EWAH compression works and also I have gone through the research paper on roaring run.
+*   I played with commands of pack-object - "git pack-objects dir --progress < obj_lists.txt" and read the code of related files "pack-bitmap.c" and parts of "pack-object.c"
+*   I checked the general documentation of [Croaring](https://github.com/RoaringBitmap/CRoaring) as one of the potential alternatives to EWAH.
+    
+
+Execution plan:
+
+I am interested in keeping my primary focus on "building a performance suite and improving bitmaps performance by finding a better compression scheme" project and if I finish this early or even after the GSoC timeline, I will be happy to contribute to other tasks too.
+
+From the idea page, I got some sense that decompressing a bitmap for reading bits or doing operations like intersection, negation and union makes them slow and can be improved.
+
+Roaring + Run was the suggested alternative to explore. It divides the data into chunks of 2^16, which allows you to check for the presence of any one value faster. As a result, Roaring can compute many operations much faster than run-length-encoded formats like WAH, EWAH, Concise. After getting a high level understanding of algorithms, I explored a bit of the [Croaring](https://github.com/RoaringBitmap/CRoaring) library which is a C implementation of roaring bitmap. It provides a lot of useful functions to do all general operations (find cardinality, and, or, copy, equals). Which I think we will be using in the "pack-bitmap.c" and "pack-bitmap-write.c" files as a replacement of ewah/bitmap functions. I do not have enough knowledge yet to figure out how compatible croaring is with the current .bitmap format. We might need to make changes in the current .bitmap format accordingly.
+
+Steps I will be following to accomplish the task-
+
+1\. Get a better understanding of bitmap related functionalities/ codebase, EWAH, roaring+run techniques.
+
+2\. try to build out an initial draft version implementing only minimal required core changes, I will try to get a review on it from a wider audience (including mentors).
+
+3\. make changes according to the comment and repeat the review process.
+
+4\. build a performance suite for the changes I have done in the above steps.
+
+5\. If performance improves, I will be writing the rest of the required code changes to use the Croaring including perf tests for them.
+
+6\. Until this time, I will also get a good understanding of the bitmap related projects, so if we will be able to make good progress on roaring+run. I can start picking other subprojects too like 'table of contents' for the .bitmap file where past work - [https://lore.kernel.org/git/YNuiM8TR5evSeNsN@nand.local/](https://lore.kernel.org/git/YNuiM8TR5evSeNsN@nand.local/) can act as a good reference to me or/and 'append-only bitmap generation' subproject.
+
+I feel for any mentoring program, Communication is the key to success. I will be utilizing the community bonding periods to figure out a process for being in regular touch with mentors, which is essential to make sure I am going in the right direction.
+
+Timeline
+
+I am available to dedicate around 30-35 hours every week to the project.
+
+Community bonding periods -
+
+1.  Exploring code base mostly related to bitmaps
+2.  research on other bitmap compression techniques
+3.  reading technical documents
+4.  interacting with mentors to understand them and project in more detail.
+    
+
+12 June - 25 July:
+
+1.  Writing the first version with the new compression technique
+2.  Get the initial version reviewed by reviewers and make changes accordingly.
+3.  write performance test for these changes
+    
+
+25 July - 4 sept
+
+1.  if tests result well, extending the above functionality to completely move to a new technique.
+2.  writing more performance tests.
+3.  start picking up other tasks if time left
+    
+
+Sept 5 - Sept 12
+
+1.  I will make sure to get all changes merged before this week including tests
+2.  if not, make a decision with mentior on extending the project
+    
+
+Other proposals
+
+No, This is the only proposal I am making for GSoC 2022.
+
+Blog
+
+I want to make blog writing a habit so I planned to publish biweekly blogs at [https://shubham828.github.io/](https://shubham828.github.io/) during this GSoC and after that. This is something I started during my last GSoC too but unfortunately couldn't continue post-gsoc. This GSoC gives me another opportunity to become a regular blogger.
+
+After GSoC
+
+I would love to be a regular contributor of Git. After GSoC, I can pick any left out subprojects of bitmap reachability and I would also be happy to extend my knowledge beyond bitmaps to learn and contribute to other parts of git too.
+
+EndFragment
\ No newline at end of file
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH 1/1] Initial draft proposal
  2022-04-09 18:43 ` [PATCH 1/1] " Shubham Mishra
@ 2022-04-13 20:44   ` Taylor Blau
  0 siblings, 0 replies; 4+ messages in thread
From: Taylor Blau @ 2022-04-13 20:44 UTC (permalink / raw)
  To: Shubham Mishra; +Cc: git, kaartic.sivaraam

Hi Shubham,

On Sun, Apr 10, 2022 at 12:13:50AM +0530, Shubham Mishra wrote:
> Initial draft proposal for reachability bitmap

Thanks so much for your interest in this project and the time you've
spent putting together this proposal. Sorry that it has taken me a few
days to get back to you. Let's take a look...

(I should mention that this shouldn't be reviewed to be picked up by
Junio, so it would be fine to just send the contents of "proposal.txt"
as an email to the list directly instead of a patch. But it's fine to
review it either way, just stating this in case other readers aren't
sure).

> About Me
>
> I am Shubham, I am currently working as a Software Engineer at
> Microsoft India. I am a 2021 graduate from Delhi University. I am
> passionate about core engineering and backend technologies. I love to
> see beyond all abstractions and how things really work under the hood.
> So, I can work from their roots and make things better. I feel
> engineering is all about the tradeoffs that we make and I am trying to
> learn them to become a better Engineer.
>
> I am passionate about open source technologies and have quite a good
> amount of contribution to them, I participated in GSoC 2020 with
> [KDE](https://summerofcode.withgoogle.com/archive/2020/projects/6473982317953024),
> did Internship with Linux Foundation -
> [HDV](https://wiki.lfnetworking.org/display/LN/HDV), Season of KDE -
> [report](https://community.kde.org/SoK/2020/StatusReport/Shubham), and
> I am doing voluntary contributions to VSperf, BoostC++ and some other
> open-source projects.

Great, that all sounds like terrific experience! Typically GSoC only
allows accepted contributors to participate once, but an exception has
been made for participants in the 2020 and 2021 cycles, so you would be
eligible here under those rules.

   https://developers.google.com/open-source/gsoc/faq#what_are_the_eligibility_requirements_for_participation

> Project Abstract
>
> During repository clones, the Git server needs to find out all the
> objects which clients do not have and need to be sent to the client.
>
> To make the process faster, Git uses bitmaps to quickly find all the
> related objects from an object. Bitmap approach is a performance
> optimization over the legacy "Counting Objects" - the process in which
> the git server used to iterate through the graph from branch tips to
> the beginning of history to list down all objects that need to be
> sent.
>
> bitmap made reachability faster but uncompressed bitmaps can cost a
> lot of extra storage. Git uses a C ported version of
> [EWAHBoolArray](https://github.com/lemire/EWAHBoolArray) to compress
> bitmaps which get stored in the ".bitmap" file with the same prefix
> "sha" as ".pack" and ".idx".

Right; there are more than 8.5M objects in my (very outdated copy of
linux.git), so it would be impractical to store an 8.5M-bit array for
each selected commit.

> The aim of the project is to design a performance test suite as well
> as do the necessary changes to improve bitmap performance by trying
> out a new compression scheme that can make read operations along with
> other common operations like intersect, union and negate faster.

Well stated. Like I mentioned in another similar proposal, I think
validating our assumption that the performance of bitmap decompression
is worth pursuing is a good first step. We may find that using a
different compression scheme:

  - makes it faster to read bitmaps (by taking less time to decompress
    them)
  - makes it faster to write them (by taking less time to compress the
    bit-arrays)
  - use less memory for one, the other, or both of the above

, alternatively, we may find that switching EWAH for something else
doesn't make a meaningful dent in any of those areas, in which case we
can find something else to work on, too ;-).

> Me & Git:
>
> ## Microproject:
>
> I worked on the microproject "Avoid pipes in git related commands in
> test scripts", the patches for it has been merged to master now
>
>   * https://public-inbox.org/git/20220224054720.23996-3-shivam828787@gmail.com/
>   * https://public-inbox.org/git/20220224054720.23996-3-shivam828787@gmail.com/
>
> I run a pattern matching grep to find all git commands on LHS of pipes
> and fix all of them from file t001-t050.
>
> As an outcome of this process, I got to learn the code review process
> at git work, which is quite cool and different from other
> organization's I contributed to before.
>
> I learned about building source code, running tests, using email to
> send patches, communicating with reviewers and sending the next patch
> version process.

Terrific; I was glad to see that you were able to work on modernizing
those test scripts by removing git invocations on the left-hand side of
a pipe. Hopefully that gave you some good hands-on experience with our
development process, which it sounds like it did.

> ## Current understanding:
>
> - I have gone through git internals, and I well understood about the
>   pack files as well as the difference between git objects (tree, blob,
>   commit).
>
> - I have gone through some documentations - "MyFirstObjectWalk", etc.
>   it was a good hands-on to get some glimpse of general object related
>   tasks.
>
> - I understand how bitmap works in general, I have got some idea how
>   EWAH compression works and also I have gone through the research paper
>   on roaring run.
>
> - I played with commands of pack-object - "git pack-objects dir
>   --progress < obj_lists.txt" and read the code of related files
>   "pack-bitmap.c" and parts of "pack-object.c"

Great; if you are looking for more areas to dig in, I'd recommend
reading:

    - pack-bitmap-writer.c
    - midx.c (the bitmap-related parts)
    - and spending extra time with pack-objects.c (as you mention you
      already have, but this is the best place to look for how
      pack-objects uses bitmaps to accelerate its work).

There are other parts of the code that use bitmaps, too, for some more
straightforward tasks. See for e.g., the parts of builtin/rev-list.c
that match a search for "bitmap".

> Execution plan:
>
> I am interested in keeping my primary focus on "building a performance
> suite and improving bitmaps performance by finding a better
> compression scheme" project and if I finish this early or even after
> the GSoC timeline, I will be happy to contribute to other tasks too.

I think that's a good strategy. Like I mentioned in the other proposal,
too, the goal behind having a few different sub-projects to pick from is
that it gives you some additional flexibility to choose what is most
interesting to you. It definitely isn't meant to suggest that we should
do all of the sub-projects within, since I think any one of them done
right could easily fill up the entire GSoC contribution period.

> [...] I do not have enough knowledge yet to figure out how
> compatible croaring is with the current .bitmap format. We might need
> to make changes in the current .bitmap format accordingly.

Good summary in the parts I elided with the [...] in the quoted section
above.

I think that we will likely want to introduce a new bitmap format. We
*could* use the existing format, and indicate to older readers that we
are using a newer format by incrementing the version identifier and
otherwise leaving the format unchanged.

But I would like to get away from the existing .bitmap format, since the
way we read the hash-cache extension makes it difficult to add things
like a variable-width table of contents. So it would be nice to make the
.bitmap format compatible with the new-ish chunk-format API that Stolee
worked on a year or two ago.

Of course, this will be a not-insignificant project in and of itself. It
also carries some notable downside that JGit won't be able to read the
newer format of bitmaps. We could consider expanding the GSoC project to
help JGit add support for that, but I don't think it's strictly necessary
for you to take that on, since JGit should gracefully ignore the unknown
version number.

> Steps I will be following to accomplish the task-
>
> 1\. Get a better understanding of bitmap related functionalities/
> codebase, EWAH, roaring+run techniques.
>
> 2\. try to build out an initial draft version implementing only
> minimal required core changes, I will try to get a review on it from a
> wider audience (including mentors).
>
> 3\. make changes according to the comment and repeat the review
> process.
>
> 4\. build a performance suite for the changes I have done in the above
> steps.

It may be worth making this occur after the first step in your proposal,
since it removes bias from when you design the performance tests and
keeps us honest about whether or not our changes are worth making.

(Please don't worry about the possibility of replacing EWAH compression
not helping bitmap performance; if it doesn't, there is no shortage of
other things that we can work on in this area ;-)).

> 6\. Until this time, I will also get a good understanding of the
> bitmap related projects, so if we will be able to make good progress
> on roaring+run. I can start picking other subprojects too like 'table
> of contents' for the .bitmap file where past work -
> [https://lore.kernel.org/git/YNuiM8TR5evSeNsN@nand.local/](https://lore.kernel.org/git/YNuiM8TR5evSeNsN@nand.local/)
> can act as a good reference to me or/and 'append-only bitmap
> generation' subproject.

I think it makes sense to pick one or the other, since both will be
fairly involved. Like I said earlier, I think we'll have to investigate
a new .bitmap format for the "table of contents" project, so we may want
to dedicate some time to sketching out what that would look like.

> Timeline
>
> I am available to dedicate around 30-35 hours every week to the project.

That seems reasonable to me.

> Community bonding periods -
>
> 1.  Exploring code base mostly related to bitmaps
> 2.  research on other bitmap compression techniques
> 3.  reading technical documents
> 4.  interacting with mentors to understand them and project in more detail.
>
> 12 June - 25 July:
>
> 1.  Writing the first version with the new compression technique
> 2.  Get the initial version reviewed by reviewers and make changes accordingly.
> 3.  write performance test for these changes
>
> 25 July - 4 sept
>
> 1.  if tests result well, extending the above functionality to completely move to a new technique.
> 2.  writing more performance tests.
> 3.  start picking up other tasks if time left
>
> Sept 5 - Sept 12
>
> 1.  I will make sure to get all changes merged before this week including tests
> 2.  if not, make a decision with mentior on extending the project

This looks reasonable-ish to me, though I would suggest a couple of
changes:

  - Dedicate some time to investigating whether or not a new .bitmap
    format is required up front, adjusting the rest of the schedule
    accordingly if it is.

  - Move the writing the performance tests up much earlier, since I
    think it makes sense to start there, rather than to wait until the
    end to start evaluating our changes (and it removes some degree of
    bias, like I mentioned earlier).

I'm not too attached to the particular dates, TBH, and much more
interested in the general pace and priority of things.

> Blog
>
> I want to make blog writing a habit so I planned to publish biweekly
> blogs at [https://shubham828.github.io/](https://shubham828.github.io/)
> during this GSoC and after that. This is something I started during my
> last GSoC too but unfortunately couldn't continue post-gsoc. This GSoC
> gives me another opportunity to become a regular blogger.

Wonderful, I'll look forward to reading your blog posts in this area if
we end up working on it.

> After GSoC
>
> I would love to be a regular contributor of Git. After GSoC, I can
> pick any left out subprojects of bitmap reachability and I would also
> be happy to extend my knowledge beyond bitmaps to learn and contribute
> to other parts of git too.

That's great!

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2022-04-13 20:44 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-09 18:43 [GSoC] [PATCH 0/1] draft proposal for reachability bitmap Shubham Mishra
2022-04-09 18:43 ` [PATCH 1/1] [GSoC] Initial draft proposal Shubham Mishra
2022-04-09 18:43 ` [PATCH 1/1] " Shubham Mishra
2022-04-13 20:44   ` Taylor Blau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).