git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Neeraj Singh <nksingh85@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: "Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com>,
	"Git List" <git@vger.kernel.org>,
	"Johannes Schindelin" <Johannes.Schindelin@gmx.de>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Patrick Steinhardt" <ps@pks.im>,
	"Neeraj K. Singh" <neerajsi@microsoft.com>
Subject: Re: [PATCH 1/7] bulk-checkin: rename 'state' variable and separate 'plugged' boolean
Date: Wed, 16 Mar 2022 00:33:15 -0700	[thread overview]
Message-ID: <CANQDOdcak1nV1Pr9cmyk9dgEjHOH8Au92pUMskJipUodzskzqQ@mail.gmail.com> (raw)
In-Reply-To: <xmqqbky6bgw0.fsf@gitster.g>

On Tue, Mar 15, 2022 at 10:33 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Neeraj Singh <neerajsi@microsoft.com>
> >
> > Preparation for adding bulk-fsync to the bulk-checkin.c infrastructure.
> >
> > * Rename 'state' variable to 'bulk_checkin_state', since we will later
> >   be adding 'bulk_fsync_objdir'.  This also makes the variable easier to
> >   find in the debugger, since the name is more unique.
> >
> > * Move the 'plugged' data member of 'bulk_checkin_state' into a separate
> >   static variable. Doing this avoids resetting the variable in
> >   finish_bulk_checkin when zeroing the 'bulk_checkin_state'. As-is, we
> >   seem to unintentionally disable the plugging functionality the first
> >   time a new packfile must be created due to packfile size limits. While
> >   disabling the plugging state only results in suboptimal behavior for
> >   the current code, it would be fatal for the bulk-fsync functionality
> >   later in this patch series.
>
> Sorry, but I am confused.  The bulk-checkin infrastructure is there
> so that we can send many little objects into a single packfile
> instead of creating many little loose object files.  Everything we
> throw at object-file.c::index_stream() will be concatenated into the
> single packfile while we are "plugged" until we get "unplugged".
>

I noticed that you invented bulk-checkin back in 2011, but I don't think your
description matches what the code actually does.  index_bulk_checkin
is only called from index_stream, which is only called from index_fd. index_fd
goes down the index_bulk_checkin path for large files (512MB by default). It
looks like the effect of the 'plug/unplug' code is to allow multiple
large blobs to
go into a single packfile rather than each getting one getting its own separate
packfile.

> My understanding of what you are doing in this series is to still
> create many little loose object files, but avoid the overhead of
> having to fsync them individually.  And I am not sure how well the
> original idea behind the bulk-checkin infrastructure to avoid
> overhead of having to create many loose objects by creating a single
> packfile (and presumably having to fsync at the end, but that is
> just a single .pack file) with your goal of still creating many
> loose object files but synching them more efficiently.
>
> Is it just the new feature is piggybacking on the existing bulk
> checkin infrastructure, even though these two have nothing in
> common?
>

I think my new usage is congruent with the existing API, which seems
to be about combining multiple add operations into a large transaction,
where we can do some cleanup operations once we're finished. In the
preexisting code, the transaction is about adding a bunch of large objects
to a single pack file (while leaving small objects loose), and then completing
the packfile when the adds are finished.

---
On a side note, I've also been thinking about how we could use a packfile
approach as an alternative means to achieve faster addition of many small
objects. It's essentially what you stated above, where we'd send our
little objects
into a pack file. But to avoid frequent repacking overhead, we might
want to reuse
the 'latest' packfile across multiple Git invocations by appending
objects to it, with
an fsync on the file at the end.

We'd need sufficient padding between objects created by different Git
invocations to
ensure that previously synced data doesn't get disturbed by later
operations.  We'd
need to rewrite the pack indexes each time, but that's at least
derived metadata, so it
doesn't need to be fsynced. To make the pack indexes more
incrementally-updatable,
we might want to have the fanout table be checksummed, with
checksummed pointers to
leaf blocks. If we detect corruption during an index lookup, we could
recreate the index
from the packfile.

Essentially the above proposal is to move away from storing loose
objects in the filesystem
and instead to index the data within Git itself.

Thanks,
Neeraj

  reply	other threads:[~2022-03-16  7:33 UTC|newest]

Thread overview: 175+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-15 21:30 [PATCH 0/7] core.fsyncmethod: add 'batch' mode for faster fsyncing of multiple objects Neeraj K. Singh via GitGitGadget
2022-03-15 21:30 ` [PATCH 1/7] bulk-checkin: rename 'state' variable and separate 'plugged' boolean Neeraj Singh via GitGitGadget
2022-03-16  5:33   ` Junio C Hamano
2022-03-16  7:33     ` Neeraj Singh [this message]
2022-03-16 16:14       ` Junio C Hamano
2022-03-16 17:59         ` Neeraj Singh
2022-03-16 18:10           ` Junio C Hamano
2022-03-16 19:50             ` Neeraj Singh
2022-03-15 21:30 ` [PATCH 2/7] core.fsyncmethod: batched disk flushes for loose-objects Neeraj Singh via GitGitGadget
2022-03-16  7:31   ` Patrick Steinhardt
2022-03-16 18:21     ` Neeraj Singh
2022-03-17  5:48       ` Patrick Steinhardt
2022-03-16 11:50   ` Bagas Sanjaya
2022-03-16 19:59     ` Neeraj Singh
2022-03-15 21:30 ` [PATCH 3/7] update-index: use the bulk-checkin infrastructure Neeraj Singh via GitGitGadget
2022-03-15 21:30 ` [PATCH 4/7] unpack-objects: " Neeraj Singh via GitGitGadget
2022-03-15 21:30 ` [PATCH 5/7] core.fsync: use batch mode and sync loose objects by default on Windows Neeraj Singh via GitGitGadget
2022-03-15 21:30 ` [PATCH 6/7] core.fsyncmethod: tests for batch mode Neeraj Singh via GitGitGadget
2022-03-15 21:30 ` [PATCH 7/7] core.fsyncmethod: performance tests for add and stash Neeraj Singh via GitGitGadget
2022-03-20  7:15 ` [PATCH v2 0/7] core.fsyncmethod: add 'batch' mode for faster fsyncing of multiple objects Neeraj K. Singh via GitGitGadget
2022-03-20  7:15   ` [PATCH v2 1/7] bulk-checkin: rename 'state' variable and separate 'plugged' boolean Neeraj Singh via GitGitGadget
2022-03-20  7:15   ` [PATCH v2 2/7] core.fsyncmethod: batched disk flushes for loose-objects Neeraj Singh via GitGitGadget
2022-03-21 14:41     ` Ævar Arnfjörð Bjarmason
2022-03-21 18:28       ` Neeraj Singh
2022-03-21 15:47     ` Ævar Arnfjörð Bjarmason
2022-03-21 20:14       ` Neeraj Singh
2022-03-21 20:18         ` Ævar Arnfjörð Bjarmason
2022-03-22  0:13           ` Neeraj Singh
2022-03-22  8:52             ` Ævar Arnfjörð Bjarmason
2022-03-22 20:05               ` Neeraj Singh
2022-03-23  3:47                 ` [RFC PATCH 0/7] bottom-up ns/batched-fsync & "plugging" in object-file.c Ævar Arnfjörð Bjarmason
2022-03-23  3:47                   ` [RFC PATCH 1/7] write-or-die.c: remove unused fsync_component() function Ævar Arnfjörð Bjarmason
2022-03-23  5:27                     ` Neeraj Singh
2022-03-23  3:47                   ` [RFC PATCH 2/7] unpack-objects: add skeleton HASH_N_OBJECTS{,_{FIRST,LAST}} flags Ævar Arnfjörð Bjarmason
2022-03-23  3:47                   ` [RFC PATCH 3/7] object-file: pass down unpack-objects.c flags for "bulk" checkin Ævar Arnfjörð Bjarmason
2022-03-23  3:47                   ` [RFC PATCH 4/7] update-index: use a utility function for stdin consumption Ævar Arnfjörð Bjarmason
2022-03-23  3:47                   ` [RFC PATCH 5/7] update-index: pass down an "oflags" argument Ævar Arnfjörð Bjarmason
2022-03-23  3:47                   ` [RFC PATCH 6/7] update-index: rename "buf" to "line" Ævar Arnfjörð Bjarmason
2022-03-23  3:47                   ` [RFC PATCH 7/7] update-index: make use of HASH_N_OBJECTS{,_{FIRST,LAST}} flags Ævar Arnfjörð Bjarmason
2022-03-23  5:51                     ` Neeraj Singh
2022-03-23  9:48                       ` Ævar Arnfjörð Bjarmason
2022-03-23 20:19                         ` Neeraj Singh
2022-03-23 14:18                   ` [RFC PATCH v2 0/7] bottom-up ns/batched-fsync & "plugging" in object-file.c Ævar Arnfjörð Bjarmason
2022-03-23 14:18                     ` [RFC PATCH v2 1/7] unpack-objects: add skeleton HASH_N_OBJECTS{,_{FIRST,LAST}} flags Ævar Arnfjörð Bjarmason
2022-03-23 20:23                       ` Neeraj Singh
2022-03-23 14:18                     ` [RFC PATCH v2 2/7] object-file: pass down unpack-objects.c flags for "bulk" checkin Ævar Arnfjörð Bjarmason
2022-03-23 20:25                       ` Neeraj Singh
2022-03-23 14:18                     ` [RFC PATCH v2 3/7] update-index: pass down skeleton "oflags" argument Ævar Arnfjörð Bjarmason
2022-03-23 14:18                     ` [RFC PATCH v2 4/7] update-index: have the index fsync() flush the loose objects Ævar Arnfjörð Bjarmason
2022-03-23 20:30                       ` Neeraj Singh
2022-03-23 14:18                     ` [RFC PATCH v2 5/7] add: use WLI_NEED_LOOSE_FSYNC for new "only the index" bulk fsync() Ævar Arnfjörð Bjarmason
2022-03-23 14:18                     ` [RFC PATCH v2 6/7] fsync docs: update for new syncing semantics Ævar Arnfjörð Bjarmason
2022-03-23 14:18                     ` [RFC PATCH v2 7/7] fsync docs: add new fsyncMethod.batch.quarantine, elaborate on old Ævar Arnfjörð Bjarmason
2022-03-23 21:08                       ` Neeraj Singh
2022-03-21 17:30     ` [PATCH v2 2/7] core.fsyncmethod: batched disk flushes for loose-objects Junio C Hamano
2022-03-21 20:23       ` Neeraj Singh
2022-03-23 13:26     ` Ævar Arnfjörð Bjarmason
2022-03-24  2:04       ` Neeraj Singh
2022-03-20  7:15   ` [PATCH v2 3/7] update-index: use the bulk-checkin infrastructure Neeraj Singh via GitGitGadget
2022-03-21 15:01     ` Ævar Arnfjörð Bjarmason
2022-03-21 22:09       ` Neeraj Singh
2022-03-21 23:16         ` Ævar Arnfjörð Bjarmason
2022-03-21 17:50     ` Junio C Hamano
2022-03-21 22:18       ` Neeraj Singh
2022-03-20  7:15   ` [PATCH v2 4/7] unpack-objects: " Neeraj Singh via GitGitGadget
2022-03-21 17:55     ` Junio C Hamano
2022-03-21 23:02       ` Neeraj Singh
2022-03-22 20:54         ` Neeraj Singh
2022-03-20  7:15   ` [PATCH v2 5/7] core.fsync: use batch mode and sync loose objects by default on Windows Neeraj Singh via GitGitGadget
2022-03-20  7:15   ` [PATCH v2 6/7] core.fsyncmethod: tests for batch mode Neeraj Singh via GitGitGadget
2022-03-21 18:34     ` Junio C Hamano
2022-03-22  5:54       ` Neeraj Singh
2022-03-20  7:16   ` [PATCH v2 7/7] core.fsyncmethod: performance tests for add and stash Neeraj Singh via GitGitGadget
2022-03-21 17:03   ` [PATCH v2 0/7] core.fsyncmethod: add 'batch' mode for faster fsyncing of multiple objects Junio C Hamano
2022-03-21 18:14     ` Neeraj Singh
2022-03-21 20:49       ` Junio C Hamano
2022-03-24  4:58   ` [PATCH v3 00/11] " Neeraj K. Singh via GitGitGadget
2022-03-24  4:58     ` [PATCH v3 01/11] bulk-checkin: rebrand plug/unplug APIs as 'odb transactions' Neeraj Singh via GitGitGadget
2022-03-24 16:10       ` Ævar Arnfjörð Bjarmason
2022-03-24 17:52         ` Neeraj Singh
2022-03-24  4:58     ` [PATCH v3 02/11] bulk-checkin: rename 'state' variable and separate 'plugged' boolean Neeraj Singh via GitGitGadget
2022-03-24  4:58     ` [PATCH v3 03/11] object-file: pass filename to fsync_or_die Neeraj Singh via GitGitGadget
2022-03-24  4:58     ` [PATCH v3 04/11] core.fsyncmethod: batched disk flushes for loose-objects Neeraj Singh via GitGitGadget
2022-03-24  4:58     ` [PATCH v3 05/11] update-index: use the bulk-checkin infrastructure Neeraj Singh via GitGitGadget
2022-03-24 18:18       ` Junio C Hamano
2022-03-24 20:25         ` Neeraj Singh
2022-03-24 21:34           ` Junio C Hamano
2022-03-24 22:21             ` Neeraj Singh
2022-03-24  4:58     ` [PATCH v3 06/11] unpack-objects: " Neeraj Singh via GitGitGadget
2022-03-24  4:58     ` [PATCH v3 07/11] core.fsync: use batch mode and sync loose objects by default on Windows Neeraj Singh via GitGitGadget
2022-03-24  4:58     ` [PATCH v3 08/11] test-lib-functions: add parsing helpers for ls-files and ls-tree Neeraj Singh via GitGitGadget
2022-03-24  4:58     ` [PATCH v3 09/11] core.fsyncmethod: tests for batch mode Neeraj Singh via GitGitGadget
2022-03-24 16:29       ` Ævar Arnfjörð Bjarmason
2022-03-24 18:23         ` Neeraj Singh
2022-03-26 15:35           ` Ævar Arnfjörð Bjarmason
2022-03-24  4:58     ` [PATCH v3 10/11] core.fsyncmethod: performance tests for add and stash Neeraj Singh via GitGitGadget
2022-03-24  4:58     ` [PATCH v3 11/11] core.fsyncmethod: correctly camel-case warning message Neeraj Singh via GitGitGadget
2022-03-24 17:44     ` [PATCH v3 00/11] core.fsyncmethod: add 'batch' mode for faster fsyncing of multiple objects Junio C Hamano
2022-03-24 19:21       ` Neeraj Singh
2022-03-29  0:42     ` [PATCH v4 00/13] " Neeraj K. Singh via GitGitGadget
2022-03-29  0:42       ` [PATCH v4 01/13] bulk-checkin: rename 'state' variable and separate 'plugged' boolean Neeraj Singh via GitGitGadget
2022-03-29  0:42       ` [PATCH v4 02/13] bulk-checkin: rebrand plug/unplug APIs as 'odb transactions' Neeraj Singh via GitGitGadget
2022-03-29  0:42       ` [PATCH v4 03/13] object-file: pass filename to fsync_or_die Neeraj Singh via GitGitGadget
2022-03-29  0:42       ` [PATCH v4 04/13] core.fsyncmethod: batched disk flushes for loose-objects Neeraj Singh via GitGitGadget
2022-03-29  0:42       ` [PATCH v4 05/13] cache-tree: use ODB transaction around writing a tree Neeraj Singh via GitGitGadget
2022-03-29  0:42       ` [PATCH v4 06/13] update-index: use the bulk-checkin infrastructure Neeraj Singh via GitGitGadget
2022-03-29  0:42       ` [PATCH v4 07/13] unpack-objects: " Neeraj Singh via GitGitGadget
2022-03-29  0:42       ` [PATCH v4 08/13] core.fsync: use batch mode and sync loose objects by default on Windows Neeraj Singh via GitGitGadget
2022-03-29  0:42       ` [PATCH v4 09/13] test-lib-functions: add parsing helpers for ls-files and ls-tree Neeraj Singh via GitGitGadget
2022-03-29  0:42       ` [PATCH v4 10/13] core.fsyncmethod: tests for batch mode Neeraj Singh via GitGitGadget
2022-03-29  0:42       ` [PATCH v4 11/13] t/perf: add iteration setup mechanism to perf-lib Neeraj Singh via GitGitGadget
2022-03-29 17:14         ` Neeraj Singh
2022-03-29 18:50           ` Junio C Hamano
2022-03-29  0:42       ` [PATCH v4 12/13] core.fsyncmethod: performance tests for add and stash Neeraj Singh via GitGitGadget
2022-03-29 17:38         ` Neeraj Singh
2022-03-29  0:42       ` [PATCH v4 13/13] core.fsyncmethod: correctly camel-case warning message Neeraj Singh via GitGitGadget
2022-03-29 10:47       ` [PATCH v4 00/13] core.fsyncmethod: add 'batch' mode for faster fsyncing of multiple objects Ævar Arnfjörð Bjarmason
2022-03-29 17:09         ` Neeraj Singh
2022-03-29 11:45       ` Ævar Arnfjörð Bjarmason
2022-03-29 16:51         ` Neeraj Singh
2022-03-30  5:05       ` [PATCH v5 00/14] " Neeraj K. Singh via GitGitGadget
2022-03-30  5:05         ` [PATCH v5 01/14] bulk-checkin: rename 'state' variable and separate 'plugged' boolean Neeraj Singh via GitGitGadget
2022-03-30 17:11           ` Junio C Hamano
2022-03-30 18:34             ` Neeraj Singh
2022-03-30 20:24               ` Junio C Hamano
2022-03-31  4:17                 ` Neeraj Singh
2022-03-31 17:50                   ` Junio C Hamano
2022-03-31 19:08                     ` Neeraj Singh
2022-03-30  5:05         ` [PATCH v5 02/14] bulk-checkin: rebrand plug/unplug APIs as 'odb transactions' Neeraj Singh via GitGitGadget
2022-03-30 17:17           ` Junio C Hamano
2022-03-31  5:51             ` Neeraj Singh
2022-03-30  5:05         ` [PATCH v5 03/14] object-file: pass filename to fsync_or_die Neeraj Singh via GitGitGadget
2022-03-30 17:18           ` Junio C Hamano
2022-03-30 17:54             ` Neeraj Singh
2022-03-30  5:05         ` [PATCH v5 04/14] core.fsyncmethod: batched disk flushes for loose-objects Neeraj Singh via GitGitGadget
2022-03-30 17:37           ` Junio C Hamano
2022-03-31  6:28             ` Neeraj Singh
2022-03-31 18:05               ` Junio C Hamano
2022-03-31 19:18                 ` Neeraj Singh
2022-04-01 15:56                   ` Junio C Hamano
2022-03-30  5:05         ` [PATCH v5 05/14] cache-tree: use ODB transaction around writing a tree Neeraj Singh via GitGitGadget
2022-03-30 17:46           ` Junio C Hamano
2022-03-30 19:04             ` Neeraj Singh
2022-03-30  5:05         ` [PATCH v5 06/14] builtin/add: add ODB transaction around add_files_to_cache Neeraj Singh via GitGitGadget
2022-03-30 17:47           ` Junio C Hamano
2022-03-30  5:05         ` [PATCH v5 07/14] update-index: use the bulk-checkin infrastructure Neeraj Singh via GitGitGadget
2022-03-30 17:52           ` Junio C Hamano
2022-03-30 19:09             ` Neeraj Singh
2022-03-30  5:05         ` [PATCH v5 08/14] unpack-objects: " Neeraj Singh via GitGitGadget
2022-03-30  5:05         ` [PATCH v5 09/14] core.fsync: use batch mode and sync loose objects by default on Windows Neeraj Singh via GitGitGadget
2022-03-30  5:05         ` [PATCH v5 10/14] test-lib-functions: add parsing helpers for ls-files and ls-tree Neeraj Singh via GitGitGadget
2022-03-30  5:05         ` [PATCH v5 11/14] core.fsyncmethod: tests for batch mode Neeraj Singh via GitGitGadget
2022-03-30 18:13           ` Junio C Hamano
2022-03-31  3:55             ` Neeraj Singh
2022-03-30  5:05         ` [PATCH v5 12/14] t/perf: add iteration setup mechanism to perf-lib Neeraj Singh via GitGitGadget
2022-03-30  5:05         ` [PATCH v5 13/14] core.fsyncmethod: performance tests for batch mode Neeraj Singh via GitGitGadget
2022-03-31  4:09           ` Neeraj Singh
2022-03-30  5:05         ` [PATCH v5 14/14] core.fsyncmethod: correctly camel-case warning message Neeraj Singh via GitGitGadget
2022-04-05  5:20         ` [PATCH v6 00/12] core.fsyncmethod: add 'batch' mode for faster fsyncing of multiple objects nksingh85
2022-04-06 20:32           ` Junio C Hamano
2022-05-19 21:47             ` Junio C Hamano
2022-05-19 21:54               ` Neeraj Singh
2022-05-24 12:31                 ` Johannes Schindelin
2022-04-05  5:20         ` [PATCH v6 01/12] bulk-checkin: rename 'state' variable and separate 'plugged' boolean nksingh85
2022-04-05  5:20         ` [PATCH v6 02/12] bulk-checkin: rebrand plug/unplug APIs as 'odb transactions' nksingh85
2022-04-05  5:20         ` [PATCH v6 03/12] core.fsyncmethod: batched disk flushes for loose-objects nksingh85
2022-04-05  5:20         ` [PATCH v6 04/12] cache-tree: use ODB transaction around writing a tree nksingh85
2022-04-05  5:20         ` [PATCH v6 05/12] builtin/add: add ODB transaction around add_files_to_cache nksingh85
2022-04-05  5:20         ` [PATCH v6 06/12] update-index: use the bulk-checkin infrastructure nksingh85
2022-04-05  5:20         ` [PATCH v6 07/12] unpack-objects: " nksingh85
2022-04-05  5:20         ` [PATCH v6 08/12] core.fsync: use batch mode and sync loose objects by default on Windows nksingh85
2022-04-05  5:20         ` [PATCH v6 09/12] test-lib-functions: add parsing helpers for ls-files and ls-tree nksingh85
2022-04-05  5:20         ` [PATCH v6 10/12] core.fsyncmethod: tests for batch mode nksingh85
2022-04-05  5:20         ` [PATCH v6 11/12] t/perf: add iteration setup mechanism to perf-lib nksingh85
2022-04-05  5:20         ` [PATCH v6 12/12] core.fsyncmethod: performance tests for batch mode nksingh85

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CANQDOdcak1nV1Pr9cmyk9dgEjHOH8Au92pUMskJipUodzskzqQ@mail.gmail.com \
    --to=nksingh85@gmail.com \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=neerajsi@microsoft.com \
    --cc=ps@pks.im \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).