All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: gitster@pobox.com, vdye@github.com, avarab@gmail.com,
	newren@gmail.com, Jacob Keller <jacob.keller@gmail.com>,
	Derrick Stolee <derrickstolee@github.com>
Subject: [PATCH v5 0/4] Optionally skip hashing index on write
Date: Fri, 06 Jan 2023 16:31:52 +0000	[thread overview]
Message-ID: <pull.1439.v5.git.1673022717.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.1439.v4.git.1671204678.gitgitgadget@gmail.com>

Writing the index is a critical action that takes place in multiple Git
commands. The recent performance improvements available with the sparse
index show how often the I/O costs around the index can affect different Git
commands, although reading the index takes place more often than a write.

The index is written through the hashfile API, both as a buffered write and
as a computation of its trailing hash. This trailing hash is an expectation
of the file format: several optional index extensions are provided using
indicators immediately preceding the trailing hash. 'git fsck' will complain
if the trailing hash does not match the contents of the file up to that
point.

However, computing the hash is expensive. This is even more expensive now
that we've moved to sha1dc instead of hardware-accelerated SHA1
computations.

This series provides a new config option, index.skipHash, that allows users
to disable computing the hash as Git writes the index. In this case, the
trailing hash is stored as the null hash (all bytes are zero).

The implementation is rather simple, but the patches organize different
aspects in a way that we could split things out:

 * Patch 1 adds a 'skip_hash' option to 'struct hashfile' that creates this
   as a general option to the hashwrite API.
 * The motivation in Patch 1 avoids the talk about the chunk-format API and
   instead focuses on the index as the first example that could make use of
   this.
 * Patch 2 creates the index.skipHash config option and updates 'git fsck'
   to ignore a null hash on the index. This is demonstrated by a very simple
   test.
 * Patch 3 creates a test helper to get the trailing hash of a file, and
   adds a concrete check on the trailing hash when index.skipHash=true.
 * Patch 4 (which could be deferred until later) adds index.skipHash=true as
   an implication of feature.manyFiles and as a setting in Scalar.

The end result is that index writes speed up significantly, enough that 'git
update-index --force-write' improves by 1.75x.

Similar patches appeared in my earlier refs RFC [1].

[1]
https://lore.kernel.org/git/pull.1408.git.1667846164.gitgitgadget@gmail.com/


Updates since v4
================

 * Patch 2 now uses repo_config_get_bool() instead of
   repo_config_get_maybe_bool().
 * A test is added for submodule-level config.


Updates since v3
================

 * Patch 1 uses an "int" instead of "unsigned int". This was a carry-over
   from Kevin's patch in microsoft/git, but our use of it in patch 2 is
   different and thus is better with a signed int.
 * Patch 2 uses a fallback to the_repository if istate->repo is NULL. This
   allows the setting to be applied to more cases where istate->repo is not
   set.
 * Patch 2 also uses 'start' in more places, since it already computed the
   beginning of the hash.
 * Patch 4 slightly modifies its use of prepare_repo_settings() due to Patch
   2's fallback to the_repository.


Updates since v2
================

 * Patch 2 now has additional details about why to use the config option in
   its message. The discussion about why the index is special is included in
   the commit message.


Updates since v1
================

 * In Patch 1, use hashcler() instead of memset().
 * In Patches 2 and 4, be explicit about which Git versions will exhibit
   different behavior when reading an index with a null trailing hash.
 * In Patch 2, used a more compact way of loading from config.
 * In Patch 2 (and following) dropped the sub-shell in t1600-index.sh.
 * In Patch 3, dropped back-ticks when using a multi-line pipe.
 * In Patch 4, use "! cmp" instead of "! test_cmp"


Updates since RFC
=================

 * The config name is now index.skipHash to make it more clear that the
   default is to not skip the hash, and choosing this option requires a true
   value.
 * Use oidread() instead of an ad-hoc loop.
 * Patches 3 and 4 are new. 3 adds some extra test verification that wasn't
   in the RFC.
 * I think that patch 4 is helpful to see now, but I could understand if we
   wanted to defer that patch until after index.skipHash has had more time
   to bake.

Thanks, -Stolee

Derrick Stolee (4):
  hashfile: allow skipping the hash function
  read-cache: add index.skipHash config option
  test-lib-functions: add helper for trailing hash
  features: feature.manyFiles implies fast index writes

 Documentation/config/feature.txt |  5 +++++
 Documentation/config/index.txt   | 11 +++++++++++
 csum-file.c                      | 14 +++++++++++---
 csum-file.h                      |  7 +++++++
 read-cache.c                     | 14 +++++++++++++-
 repo-settings.c                  |  2 ++
 repository.h                     |  1 +
 scalar.c                         |  1 +
 t/t1600-index.sh                 | 30 ++++++++++++++++++++++++++++++
 t/test-lib-functions.sh          |  8 ++++++++
 10 files changed, 89 insertions(+), 4 deletions(-)


base-commit: 805265fcf7a737664a8321aaf4a0587b78435184
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1439%2Fderrickstolee%2Fhashfile-skip-hash-v5
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1439/derrickstolee/hashfile-skip-hash-v5
Pull-Request: https://github.com/gitgitgadget/git/pull/1439

Range-diff vs v4:

 1:  c99470d4676 = 1:  c99470d4676 hashfile: allow skipping the hash function
 2:  aae047cbc9f ! 2:  a3c79308171 read-cache: add index.skipHash config option
     @@ Commit message
          ability to opt-in to this feature, even with the potential confusion
          with older 'git fsck' versions.
      
     +    Test this new config option, both at a command-line level and within a
     +    submodule. The confirmation is currently limited to confirm that 'git
     +    fsck' does not complain about the index. Future updates will make this
     +    test more robust.
     +
          It is critical that this test is placed before the test_index_version
          tests, since those tests obliterate the .git/config file and hence lose
          the setting from GIT_TEST_DEFAULT_HASH, if set.
     @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf
       
       	f = hashfd(tempfile->fd, tempfile->filename.buf);
       
     -+	repo_config_get_maybe_bool(r, "index.skiphash", &f->skip_hash);
     ++	repo_config_get_bool(r, "index.skiphash", &f->skip_hash);
      +
       	for (i = removed = extended = 0; i < entries; i++) {
       		if (cache[i]->ce_flags & CE_REMOVE)
     @@ t/t1600-index.sh: test_expect_success 'out of bounds index.version issues warnin
      +test_expect_success 'index.skipHash config option' '
      +	rm -f .git/index &&
      +	git -c index.skipHash=true add a &&
     -+	git fsck
     ++	git fsck &&
     ++
     ++	test_commit start &&
     ++	git -c protocol.file.allow=always submodule add ./ sub &&
     ++	git config index.skipHash false &&
     ++	git -C sub config index.skipHash true &&
     ++	>sub/file &&
     ++	git -C sub add a &&
     ++	git -C sub fsck
      +'
      +
       test_index_version () {
 3:  27fbe52e748 ! 3:  dab9b15de47 test-lib-functions: add helper for trailing hash
     @@ Commit message
      
          Add a new test_trailing_hash helper and use it in t1600 to verify that
          index.skipHash=true really does skip the hash computation, since
     -    'git fsck' does not actually verify the hash.
     +    'git fsck' does not actually verify the hash. This confirms that when
     +    the config is disabled explicitly in a super project but enabled in a
     +    submodule, then the use of repo_config_get_bool() loads config from the
     +    correct repository in the case of 'git add'. There are other cases where
     +    istate->repo is NULL and thus this config is loaded instead from
     +    the_repository, but that's due to many different code paths initializing
     +    index_state structs in their own way.
      
          Keep the 'git fsck' call to ensure that any potential future change to
          check the index hash does not cause an error in this case.
     @@ t/t1600-index.sh: test_expect_success 'out of bounds index.version issues warnin
      +	test_trailing_hash .git/index >hash &&
      +	echo $(test_oid zero) >expect &&
      +	test_cmp expect hash &&
     - 	git fsck
     + 	git fsck &&
     + 
     + 	test_commit start &&
     +@@ t/t1600-index.sh: test_expect_success 'index.skipHash config option' '
     + 	git -C sub config index.skipHash true &&
     + 	>sub/file &&
     + 	git -C sub add a &&
     ++	test_trailing_hash .git/modules/sub/index >hash &&
     ++	test_cmp expect hash &&
     + 	git -C sub fsck
       '
       
      
 4:  075921514f2 ! 4:  1beedcb5ba1 features: feature.manyFiles implies fast index writes
     @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf
       
       	f = hashfd(tempfile->fd, tempfile->filename.buf);
       
     --	repo_config_get_maybe_bool(r, "index.skiphash", &f->skip_hash);
     +-	repo_config_get_bool(r, "index.skiphash", &f->skip_hash);
      +	prepare_repo_settings(r);
      +	f->skip_hash = r->settings.index_skip_hash;
       
     @@ scalar.c: static int set_recommended_config(int reconfigure)
      
       ## t/t1600-index.sh ##
      @@ t/t1600-index.sh: test_expect_success 'index.skipHash config option' '
     - 	test_trailing_hash .git/index >hash &&
     - 	echo $(test_oid zero) >expect &&
       	test_cmp expect hash &&
     --	git fsck
     -+	git fsck &&
     -+
     + 	git fsck &&
     + 
      +	rm -f .git/index &&
      +	git -c feature.manyFiles=true add a &&
      +	test_trailing_hash .git/index >hash &&
     -+	test_cmp expect hash &&
     ++	cmp expect hash &&
      +
      +	rm -f .git/index &&
      +	git -c feature.manyFiles=true \
     -+		-c index.skipHash=false add a &&
     ++	    -c index.skipHash=false add a &&
      +	test_trailing_hash .git/index >hash &&
     -+	! cmp expect hash
     - '
     - 
     - test_index_version () {
     ++	! cmp expect hash &&
     ++
     + 	test_commit start &&
     + 	git -c protocol.file.allow=always submodule add ./ sub &&
     + 	git config index.skipHash false &&

-- 
gitgitgadget

  parent reply	other threads:[~2023-01-06 16:32 UTC|newest]

Thread overview: 59+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-12-07 17:25 [PATCH 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget
2022-12-07 17:25 ` [PATCH 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
2022-12-07 22:13   ` Ævar Arnfjörð Bjarmason
2022-12-08  7:32     ` Jeff King
2022-12-07 17:25 ` [PATCH 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
2022-12-07 18:59   ` Eric Sunshine
2022-12-12 13:59     ` Derrick Stolee
2022-12-12 18:55       ` Eric Sunshine
2022-12-07 22:25   ` Ævar Arnfjörð Bjarmason
2022-12-07 23:06   ` Ævar Arnfjörð Bjarmason
2022-12-08  0:05     ` Junio C Hamano
2022-12-12 14:05     ` Derrick Stolee
2022-12-12 18:01       ` Ævar Arnfjörð Bjarmason
2022-12-07 17:25 ` [PATCH 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
2022-12-07 22:27   ` Ævar Arnfjörð Bjarmason
2022-12-12 14:10     ` Derrick Stolee
2022-12-07 17:25 ` [PATCH 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
2022-12-07 22:30   ` Ævar Arnfjörð Bjarmason
2022-12-12 14:18     ` Derrick Stolee
2022-12-12 18:27       ` Ævar Arnfjörð Bjarmason
2022-12-07 23:27 ` [PATCH 0/4] Optionally skip hashing index on write Junio C Hamano
2022-12-07 23:42   ` Ævar Arnfjörð Bjarmason
2022-12-08 16:38   ` Derrick Stolee
2022-12-12 22:22     ` Jacob Keller
2022-12-12 16:31 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2022-12-12 16:31   ` [PATCH v2 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
2022-12-12 16:31   ` [PATCH v2 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
2022-12-12 16:31   ` [PATCH v2 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
2022-12-12 18:14     ` SZEDER Gábor
2022-12-13  0:55       ` Junio C Hamano
2022-12-17 17:37         ` SZEDER Gábor
2022-12-12 16:31   ` [PATCH v2 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
2022-12-15 15:06   ` [PATCH v3 0/4] Optionally skip hashing index on write Derrick Stolee via GitGitGadget
2022-12-15 15:06     ` [PATCH v3 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
2022-12-15 15:06     ` [PATCH v3 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
2022-12-15 16:12       ` Ævar Arnfjörð Bjarmason
2022-12-15 15:06     ` [PATCH v3 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
2022-12-15 15:07     ` [PATCH v3 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
2022-12-15 15:56     ` [PATCH v3 0/4] Optionally skip hashing index on write Ævar Arnfjörð Bjarmason
2022-12-16 13:41       ` Derrick Stolee
2022-12-16 15:31     ` [PATCH v4 " Derrick Stolee via GitGitGadget
2022-12-16 15:31       ` [PATCH v4 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
2022-12-16 15:31       ` [PATCH v4 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
2022-12-16 15:31       ` [PATCH v4 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
2022-12-16 15:31       ` [PATCH v4 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
2022-12-16 15:43       ` [PATCH v4 0/4] Optionally skip hashing index on write Ævar Arnfjörð Bjarmason
2023-01-06 15:33         ` Derrick Stolee
2023-01-06 22:45           ` Junio C Hamano
2023-01-06 23:40             ` Derrick Stolee
2023-01-09 17:15               ` Ævar Arnfjörð Bjarmason
2023-01-09 18:00                 ` Derrick Stolee
2023-01-09 19:22                   ` Ævar Arnfjörð Bjarmason
2023-01-06 16:31       ` Derrick Stolee via GitGitGadget [this message]
2023-01-06 16:31         ` [PATCH v5 1/4] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
2023-01-06 16:31         ` [PATCH v5 2/4] read-cache: add index.skipHash config option Derrick Stolee via GitGitGadget
2023-01-06 16:31         ` [PATCH v5 3/4] test-lib-functions: add helper for trailing hash Derrick Stolee via GitGitGadget
2023-01-06 16:31         ` [PATCH v5 4/4] features: feature.manyFiles implies fast index writes Derrick Stolee via GitGitGadget
2023-01-15  9:31         ` [PATCH v5 0/4] Optionally skip hashing index on write Junio C Hamano
2023-01-17 14:49           ` Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=pull.1439.v5.git.1673022717.gitgitgadget@gmail.com \
    --to=gitgitgadget@gmail.com \
    --cc=avarab@gmail.com \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jacob.keller@gmail.com \
    --cc=newren@gmail.com \
    --cc=vdye@github.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.