From: Quentin Casasnovas <quentin.casasnovas@oracle.com>
To: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>,
	Junio C Hamano <gitster@pobox.com>,
	Duy Nguyen <pclouds@gmail.com>,
	Git Mailing List <git@vger.kernel.org>
Subject: Re: Refreshing index timestamps without reading content
Date: Tue, 10 Jan 2017 15:17:31 +0100
Message-ID: <20170110141731.GH7000@chrystal.oracle.com>
In-Reply-To: <20170109155537.GG7000@chrystal.oracle.com>

On Mon, Jan 09, 2017 at 04:55:37PM +0100, Quentin Casasnovas wrote:
> On Mon, Jan 09, 2017 at 07:01:36AM -0800, Junio C Hamano wrote:
> > Duy Nguyen <pclouds@gmail.com> writes:
> >
> > > On Thu, Jan 5, 2017 at 6:23 PM, Quentin Casasnovas
> > > <quentin.casasnovas@oracle.com> wrote:
> > >> Is there any way to tell git, after the git ls-tree command above, to
> > >> refresh its stat cache information and trust us that the file content has
> > > >> not changed, so as to avoid any useless file read (though it will obviously
> > > >> have to stat all of them, but that's not something we can really
> > >> avoid)
> > >
> > > I don't think there's any way to do that, unfortunately.
> >
> > Lose "unfortunately".
> >
> > >> If not, I am willing to implement a --assume-content-unchanged to the git
> > >> update-index if you guys don't see something fundamentally wrong with this
> > >> approach.
> > >
> > > If you do that, I think you should go with either of the following options
> > >
> > > - Extend git-update-index --index-info to take stat info as well (or
> > > maybe make a new option instead). Then you can feed stat info directly
> > > to git without a use-case-specific "assume-content-unchanged".
> > >
> > > - Add "git update-index --touch" that does what "touch" does. In this
> > > case, it blindly updates stat info to latest. But like touch, we can
> > > also specify  mtime from command line if we need to. It's a bit less
> > > generic than the above option, but easier to use.
> >
> > Even if we assume that it is a good idea to let people muck with the
> > index like this, either of the above would be a usable addition,
> > because the cached stat information does not consist solely of
> > mtime.
> >
> > "git update-index --index-info" was invented for the case where a
> > user or a script _knows_ the object ID of the blob that _would_
> > result if the contents of a file on the filesystem were run through
> > hash-object.  So from the interface's point of view, it may make
> > sense to teach it to take an extra/optional argument that is the
> > path to the file and take the stat info out of the named file when
> > the extra/optional argument was given.
> >
> > But that assumes that it is a good idea to do this in the first
> > place.  It was a deliberate design decision that setting the cached
> > stat info for the entry was protected behind actual content
> > comparison, and removing that protection will open the index to
> > abuse.
> >
> 
> Hi Junio,
> 
> Thanks for your feedback, appreciated :)
> 
> I do understand how someone could shoot themselves in the foot with such an
> option, but it solves real-life use cases and improves build times very
> significantly here.
> 
> Another use case we have is setting up very lightweight linux work trees,
> by reflinking from a base work-tree.  This allows for a completely
> different work-tree taking up almost no size at first, whereas using a
> shared clone or the recent worktree subcommand would "waste" ~500MB*:
> 
>  # linux-2.6 is a shared clone of a bare clone residing locally
>  ~ $ cp --reflink -a linux-2.6 linux-2.6-reflinked
> 
>  # At this point, the mtimes inside linux-2.6-reflinked match the
>  # mtimes of the source linux-2.6 (since we used the '-a' option of 'cp')
>  ~ $ diff -u <(stat linux-2.6/README) <(stat linux-2.6-reflinked/README)
>  --- /proc/self/fd/11  2017-01-09 16:34:04.523438942 +0100
>  +++ /proc/self/fd/12  2017-01-09 16:34:04.523438942 +0100
>  @@ -1,8 +1,8 @@
>  -  File: 'linux-2.6/README'
>  +  File: 'linux-2.6-reflinked/README'
>     Size: 18372		Blocks: 40         IO Block: 4096   regular file
>  -Device: fd00h/64768d	Inode: 268467090   Links: 1
>  +Device: fd00h/64768d	Inode: 805970606   Links: 1
>   Access: (0644/-rw-r--r--)  Uid: ( 1000/ quentin)   Gid: ( 1000/ quentin)
>   Access: 2017-01-09 12:04:15.317758718 +0100
>   Modify: 2017-01-09 12:04:12.566758772 +0100
>  -Change: 2017-01-09 12:04:12.566758772 +0100
>  +Change: 2017-01-09 16:29:48.305444003 +0100
>    Birth:
> 
>   # Now let's check how long it takes to refresh the index from the source
>   # and destination..
>   ~/linux-2.6 $ time git update-index --refresh
>   git update-index --refresh  0.04s user 0.08s system 204% cpu 0.058 total
>                                                                ~~~~~~~~~~~
>   ~/linux-2.6-reflinked $ time git update-index --refresh
>   git update-index --refresh  2.40s user 1.43s system 38% cpu 10.003 total
>                                                               ~~~~~~~~~~~~
> 

I discussed this with my friend Vegard, and he found the core.checkStat
config option: when set to 'minimal', it ignores the inode number, which is
enough for the above use case to work just fine - so please excuse my ignorance!
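
A minimal sketch of that setup, for illustration only. It assumes a plain
'cp -a' copy stands in for the reflinked tree (it also keeps mtimes and
allocates new inodes), and the scratch directory, the README content and the
"demo" commit identity are made up for the example:

```shell
#!/bin/sh
# Sketch: enable core.checkStat=minimal in a copied work tree, then refresh.
set -e

# Hypothetical scratch area standing in for linux-2.6 and its copy.
tmp=$(mktemp -d)
git init -q "$tmp/base"
cd "$tmp/base"
echo hello > README
git add README
git -c user.name=demo -c user.email=demo@example.com commit -q -m init
cd "$tmp"

# 'cp -a' preserves mtime but the copied files get new inode numbers.
cp -a base copy
cd copy

# Tell git to skip the inode (and other non-essential stat fields) when
# deciding whether an index entry is stale.
git config core.checkStat minimal
check_stat=$(git config core.checkStat)

# Refresh the cached stat info and confirm the tree is still seen as clean.
git update-index --refresh
status=$(git status --porcelain)
echo "core.checkStat=$check_stat"
echo "status=[$status]"
```

The setting is stored in the copy's own .git/config, so the source work tree
keeps the default checks.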

For the initial problem I had when changing the mtime of all the files in
the tree, I should be able to change the mtime of the object files instead,
hence I don't really need the patch I sent earlier.

Sorry for the wasted time! :)

Q

Thread overview: 7+ messages
2017-01-05 11:23 Refreshing index timestamps without reading content Quentin Casasnovas
2017-01-09 12:02 ` Duy Nguyen
2017-01-09 12:17   ` Quentin Casasnovas
2017-01-09 12:22     ` Quentin Casasnovas
2017-01-09 15:01   ` Junio C Hamano
2017-01-09 15:55     ` Quentin Casasnovas
2017-01-10 14:17       ` Quentin Casasnovas [this message]
