All of lore.kernel.org
 help / color / mirror / Atom feed
From: Eric Wong <e@80x24.org>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: git@vger.kernel.org, Jeff King <peff@peff.net>,
	Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH] update-server-info: avoid needless overwrites
Date: Sun, 12 May 2019 00:38:59 +0000	[thread overview]
Message-ID: <20190512003859.GA22361@dcvr> (raw)
In-Reply-To: <87v9ygwoj0.fsf@evledraar.gmail.com>

Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> On Sat, May 11 2019, Eric Wong wrote:
> > +static int files_differ(FILE *fp, const char *path)
> > +{
> > +	struct stat st;
> > +	git_hash_ctx c;
> > +	struct object_id oid_old, oid_new;
> > +	struct strbuf tmp = STRBUF_INIT;
> > +	long new_len = ftell(fp);
> > +
> > +	if (new_len < 0 || stat(path, &st) < 0)
> > +		return 1;
> > +	if (!S_ISREG(st.st_mode))
> > +		return 1;
> > +	if ((off_t)new_len != st.st_size)
> > +		return 1;
> > +
> > +	rewind(fp);
> > +	if (strbuf_fread(&tmp, (size_t)new_len, fp) != (size_t)new_len)
> > +		return 1;
> > +	the_hash_algo->init_fn(&c);
> > +	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
> > +	the_hash_algo->final_fn(oid_new.hash, &c);
> > +	strbuf_release(&tmp);
> > +
> > +	if (strbuf_read_file(&tmp, path, (size_t)st.st_size) < 0)
> > +		return 1;
> > +	the_hash_algo->init_fn(&c);
> > +	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
> > +	the_hash_algo->final_fn(oid_old.hash, &c);
> > +	strbuf_release(&tmp);
> > +
> > +	return hashcmp(oid_old.hash, oid_new.hash);
> > +}
> 
> This way of doing it just seems so weirdly convoluted. Read them one at
> a time, compute the SHA-1, just to see if they're different. Why not
> something closer to a plain memcmp():
> 
> 	static int files_differ(FILE *fp, const char *path)
> 	{
> 		struct strbuf old = STRBUF_INIT, new = STRBUF_INIT;
> 		long new_len = ftell(fp);
> 		int diff = 1;
> 
> 		rewind(fp);
> 		if (strbuf_fread(&new, (size_t)new_len, fp) != (size_t)new_len)
> 			goto release_return;
> 		if (strbuf_read_file(&old, path, 0) < 0)
> 			goto release_return;
> 
> 		diff = strbuf_cmp(&old, &new);
> 
> 	release_return:
> 		strbuf_release(&old);
> 		strbuf_release(&new);
> 
> 		return diff;
> 	}
> 
> I.e. optimze for code simplicity with something close to a dumb "cmp
> --silent". I'm going to make the bold claim that nobody using "dumb
> http" is going to be impacted by the performance of reading their
> for-each-ref or for-each-pack dump, hence opting for not even bothing to
> stat() to get the size before reading.

I've been trying to improve dumb HTTP for more cases; actually.
(since it's much cheaper than smart HTTP in server memory/CPU)

> Because really, if we were *trying* to micro-optimize this for time or
> memory use there's much better ways, e.g. reading the old file and
> memcmp() as we go and stream the "generate" callback, but I just don't
> see the point of trying in this case.

I was actually going towards that route; but wasn't sure if this
idea would be accepted at all (and I've been trying to stay away
from using non-scripting languages).

I don't slurping all of info/refs into memory at all; so maybe
a streaming memcmp of the existing file is worth doing...

> >  /*
> >   * Create the file "path" by writing to a temporary file and renaming
> >   * it into place. The contents of the file come from "generate", which
> >   * should return non-zero if it encounters an error.
> >   */
> > -static int update_info_file(char *path, int (*generate)(FILE *))
> > +static int update_info_file(char *path, int (*generate)(FILE *), int force)
> >  {
> >  	char *tmp = mkpathdup("%s_XXXXXX", path);
> 
> Unrelated to this, patch, but I hadn't thought about this nasty race
> condition. We recommend users run this from the "post-update" (or
> "post-receive") hook, and don't juggle the lock along with the ref
> update, thus due to the vagaries of scheduling you can end up with two
> concurrent ref updates where the "old" one wins.
> 
> But I guess that brings me back to something close to "nobody with that
> sort of update rate is using 'dumb http'" :)
> 
> For this to work properly we'd need to take some sort of global "ref
> update/pack update" lock, and I guess at that point this "cmp" case
> would be a helper similar to commit_lock_file_to(),
> i.e. a commit_lock_file_to_if_different().

Worth a separate patch, at some point, I think.  I'm not too
familiar with the existing locking in git, actually...

Along those lines, I think repack/gc should automatically
update objects/info/packs if the file already exists.

> >  	int ret = -1;
> >  	int fd = -1;
> >  	FILE *fp = NULL, *to_close;
> > +	int do_update;
> >
> >  	safe_create_leading_directories(path);
> >  	fd = git_mkstemp_mode(tmp, 0666);
> >  	if (fd < 0)
> >  		goto out;
> > -	to_close = fp = fdopen(fd, "w");
> > +	to_close = fp = fdopen(fd, "w+");
> >  	if (!fp)
> >  		goto out;
> >  	fd = -1;
> >  	ret = generate(fp);
> >  	if (ret)
> >  		goto out;
> > +
> > +	do_update = force || files_differ(fp, path);
> > [...]
> >
> > -static int update_info_refs(void)
> > +static int update_info_refs(int force)
> 
> So, I was going to say "shouldn't we update the docs?" which for --force
> say "Update the info files from scratch.".
> 
> But reading through it that "from scratch" wording is from c743e6e3c0
> ("Add a link from update-server-info documentation to repository
> layout.", 2005-09-04).

Yes, that wording is awkward and I can update it.  But maybe making
it undocumented is sufficient and would save us the trouble
of describing it :)

"--force" might be seen as a performance optimization for cases
where you're certain the result will differ, but I'm not
sure if that's worth mentioning in the manpage.

> There it was a refrence to a bug since fixed in 60d0526aaa ("Unoptimize
> info/refs creation.", 2005-09-14), and then removed from the docs in
> c5fe5b6de9 ("Remove obsolete bug warning in man git-update-server-info",
> 2009-04-25).
> 
> Then in b3223761c8 ("update_info_refs(): drop unused force parameter",
> 2019-04-05) Jeff removed the unused-for-a-decade "force" param.
> 
> So having gone through the history I think we're better off just
> dropping the --force argument entirely, or at least changing the
> docs.

I can update the docs, or make it undocumented.  Compatibility
from the command-line needs to remain in case there are scripts
using it.

  reply	other threads:[~2019-05-12  0:39 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-05-11  1:34 [PATCH] update-server-info: avoid needless overwrites Eric Wong
2019-05-11  7:35 ` Eric Sunshine
2019-05-11 20:47   ` [PATCH v2] " Eric Wong
2019-05-11 21:17 ` [PATCH] " Eric Wong
2019-05-11 23:37 ` Ævar Arnfjörð Bjarmason
2019-05-12  0:38   ` Eric Wong [this message]
2019-05-12  4:08   ` Jeff King
2019-05-12  7:16     ` Ævar Arnfjörð Bjarmason
2019-05-14  9:47       ` Jeff King
2019-05-14 10:33         ` Ævar Arnfjörð Bjarmason
2019-05-14 11:24           ` Jeff King
2019-05-14 11:57             ` Ævar Arnfjörð Bjarmason
2019-05-14 11:50         ` Eric Wong
2019-05-14 12:13           ` dumb HTTP things I want to do Eric Wong
2019-05-14 12:27             ` Jeff King
2019-05-14 12:19           ` [PATCH] update-server-info: avoid needless overwrites Ævar Arnfjörð Bjarmason
2019-05-14 12:29             ` Jeff King
2019-05-15  0:45             ` [PATCH 2/1] server-info: conditionally update on fetch Eric Wong
2019-05-15 20:38               ` [WIP] repack leaving stale entries in objects/info/packs Eric Wong
2019-05-15 21:48                 ` Jeff King
2019-05-23  8:59                   ` [PATCH] server-info: do not list unlinked packs Eric Wong
2019-05-23 10:24                     ` Jeff King
2019-05-23 17:27                       ` [PATCH v2] " Eric Wong
2019-05-24  6:05                         ` Jeff King
2019-05-24  7:34                         ` Ævar Arnfjörð Bjarmason
2019-05-13 23:17 ` [PATCH v3] update-server-info: avoid needless overwrites Eric Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190512003859.GA22361@dcvr \
    --to=e@80x24.org \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.