Re: [PATCH v4 4/4] cache-tree: Write updated cache-tree after commit

From: Junio C Hamano <gitster@pobox.com>
To: Duy Nguyen <pclouds@gmail.com>
Cc: David Turner <dturner@twopensource.com>,
	Git Mailing List <git@vger.kernel.org>,
	David Turner <dturner@twitter.com>
Subject: Re: [PATCH v4 4/4] cache-tree: Write updated cache-tree after commit
Date: Tue, 08 Jul 2014 10:05:24 -0700	[thread overview]
Message-ID: <xmqqwqbnaii3.fsf@gitster.dls.corp.google.com> (raw)
In-Reply-To: <CACsJy8C20oFdATHKTLK=9U3_kHu1QsuS4i74RPgQn0aTwVCC8w@mail.gmail.com> (Duy Nguyen's message of "Tue, 8 Jul 2014 17:32:20 +0700")

Duy Nguyen <pclouds@gmail.com> writes:

> I wonder if we need to update_main_cache_tree() so many times in this
> function. If I read the code correctly, all roads must lead to
> update_main_cache_tree(0) in prepare_to_commit().

I think prepare-to-commit is too late; it does not want to know if
the index it was given to create a tree out of is the one that the
user will keep using after this invocation of "git commit" or just a
temporary one used for partial commit.  The cache-tree rebuilt there
is what is recorded with commit_tree_extended() in cmd_commit(), but
if you are doing a partial commit, these generic code paths are
given a temporary index file on the filesystem to work on, and
cache-tree in that will not help user's later operation.

For three main uses of "git commit", prepare_index() does these:

 - "git commit -a" and "git commit -i $paths" update the index with
   the new contents from the working tree, fully builds cache-tree
   in-core to write out the tree objects, and writes the index file
   to the filesystem.  Because this index is the one used after this
   invocation of "git commit" returns, we have a fully populated
   cache-tree after this happens.  This code path is perfect and
   there is no need to change.

 - "git commit" commits the contents of the index as-is, so
   technically there is no reason for it to update the index on the
   filesystem at all, but it refreshes the cached stat data to help
   the "status" part of the command, and if it finds that stat data
   was stale, updates the index on the filesystem because it is
   wasteful not to do so.  As we would be spending I/O cycles to
   update the index file in that case anyway, we also rebuild the
   cache-tree and include that in the updated index.

   When the cached stat data was already up-to-date, however, we do
   not update the index on the filesystem, so the series under
   discussion will change the trade-off by doing one more I/O to
   write out a new index to the filesystem only to update cache-tree.

 - "git commit $paths" updates the "real" index with given $paths
   and writes it out to the filesystem first.  This is the index the
   user will use after "git commit" finishes; traditionally our
   trade-off was "populate cache-tree only when we do not have to
   spend any cycle only to do so (i.e. when we are writing trees
   anyway, or when we read from a tree), and let it degrade as paths
   are added, removed and modified" and we avoided populating
   cache-tree in this codepath.  The series under discussion will
   change the trade-off here, too.

   After it updates this "real" index, it builds another temporary
   index that represents the tree state to be committed (starting
   from HEAD and updates with the given $paths), but that will be
   discarded and we do not want to spend any extra cycle to do
   anything only to make its later use more efficient (like writing
   updated cache-tree to it).

> If we find out that
> we have incomplete cache-tree before that call, we could write the
> index one more time after that call,

and that will make an extra I/O only to write out cache-tree to the
temporary index that we will discard immediately after for a partial
commit.