git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Taylor Blau <ttaylorr@github.com>
Cc: Sun Chao <16657101987@163.com>, Taylor Blau <me@ttaylorr.com>,
	Sun Chao via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org
Subject: Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration
Date: Wed, 14 Jul 2021 20:19:15 +0200	[thread overview]
Message-ID: <877dhs20x3.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <YO8XrOChAtxhpuxS@nand.local>


On Wed, Jul 14 2021, Taylor Blau wrote:

> On Thu, Jul 15, 2021 at 12:46:47AM +0800, Sun Chao wrote:
>> > Stepping back, I'm not sure I understand why freshening a pack is so
>> > slow for you. freshen_file() just calls utime(2), and any sync back to
>> > the disk shouldn't need to update the pack itself, just a couple of
>> > fields in its inode. Maybe you could help explain further.
>> >
>> > [ ... ]
>>
>> The reason why we want to avoid freshen the mtime of ".pack" file is to
>> improve the reading speed of Git Servers.
>>
>> We have some large repositories in our Git Severs (some are bigger than 10GB),
>> and we created '.keep' files for large ".pack" files, we want the big files
>> unchanged to speed up git upload-pack, because in our mind the file system
>> cache will reduce the disk IO if a file does not changed.
>>
>> However we find the mtime of ".pack" files changes over time which makes the
>> file system always reload the big files, that takes a lot of IO time and result
>> in lower speed of git upload-pack and even further the disk IOPS is exhausted.
>
> That's surprising behavior to me. Are you saying that calling utime(2)
> causes the *page* cache to be invalidated and that most reads are
> cache-misses lowering overall IOPS?
>
> If so, then I am quite surprised ;). The only state that should be
> dirtied by calling utime(2) is the inode itself, so the blocks referred
> to by the inode corresponding to a pack should be left in-tact.
>
> If you're on Linux, you can try observing the behavior of evicting
> inodes, blocks, or both from the disk cache by changing "2" in the
> following:
>
>     hyperfine 'git pack-objects --all --stdout --delta-base-offset >/dev/null'
>       --prepare='sync; echo 2 | sudo tee /proc/sys/vm/drop_caches'
>
> where "1" drops the page cache, "2" drops the inodes, and "3" evicts
> both.
>
> I wonder if you could share the results of running the above varying
> the value of "1", "2", and "3", as well as swapping the `--prepare` for
> `--warmup=3` to warm your caches (and give us an idea of what your
> expected performance is probably like).

I think you may be right narrowly, but wrong in this context :)

I.e. my understanding of this problem is that they have some incremental
backup job, e.g. rsync without --checksum (not that doing that would
help, chicken & egg issue)..

So by changing the mtime you cause the file to be re-synced.

Yes Linux (or hopefully any modern OS) isn't so dumb as to evict your FS
cache because of such a metadata change, but that's besides the point.

If you have a backup job like that your FS cache will get evicted or be
subject to churn anyway, because you'll shortly be having to deal with
the "rsync" job that's noticed the changed mtime competing for caching
resources with "real" traffic.

Sun: Does that summarize the problem you're having?

<large digression ahead>

Sun, also: Note that in general doing backups of live git repositories
with rsync is a bad idea, and will lead to corruption.

The most common cause of such corruption is that a tool like "rsync"
will iterate recursively through say "objects" followed by "refs".

So by the time it gets to the latter (or is doing a deep iteration
within those dirs) git's state has changed in such a way as to yield an
rsync backup in a state that the repository was never in.

(As an aside, I've often wondered what it is about git exactly makes
people who'd never think of doing the same thing with the FS part of an
RDMBS's data store think that implementing such an ad-hoc backup
solution for git would be a good idea, but I digress. Perhaps we need
more scarier looking BerkeleyDB-looking names in the .git directory :)

Even if you do FS snapshots of live git repositories you're likely to
get corruption, search this mailing list for references to fsync(),
e.g. [1].

In short, git's historically (and still) been sloppy about rsync, and
relied on non-standard behavior such as "if I do N updates for N=1..100,
and fsync just "100", then I can assume 1..99 are fsynced (spoiler: you
can't assume that).

Our use of fsync is still broken in that sense today, git is not a safe
place to store your data in the POSIXLY pedantic sense (and no, I don't
just mean that core.fsyncObjectFiles is `false` by default, it only
covers a small part of this, e.g. we don't fsync dir entries even with
that).

On a real live filesystem this is usually not an issue, because if
you're not dealing with yanked power cords (and even then, journals
might save you), then even if you fsync a file but don't fsync the dir
entry it's in, the FS is usually forgiving about such cases.

I.e. if someone does a concurrent request for the could-be-outdated dir
entry they'll service the up-to-date one, even without that having been
fsync'd, because the VFS layer isn't going to the synced disk, it's
checking it's current state and servicing your request from that.

But at least some FS snapshot implementations have a habit of exposing
the most pedantic interpretation possible of FS semantics, and one that
you wouldn't ever get on a live FS. I.e. you might be hooking into the
equivalent of the order in which things are written to disk, and end up
with a state that would never have been exposed to a running program
(there would be a 1=1 correspondence if we fsync'd properly, which we
don't).

The best way to get backups of git repositories you know are correct are
is to use git's own transport mechanisms, i.e. fetch/pull the data, or
create bundles from it. This would be the case even if we fixed all our
fsync issues, because doing so wouldn't help you in the case of a
bit-flip, but an "index-pack" on the other end will spot such issues.

1. https://lore.kernel.org/git/20200917112830.26606-2-avarab@gmail.com/

  reply	other threads:[~2021-07-14 18:41 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-10 19:01 [PATCH] packfile: enhance the mtime of packfile by idx file Sun Chao via GitGitGadget
2021-07-11 23:44 ` Ævar Arnfjörð Bjarmason
2021-07-12 16:17   ` Sun Chao
2021-07-14  1:28 ` [PATCH v2] packfile: freshen the mtime of packfile by configuration Sun Chao via GitGitGadget
2021-07-14  1:39   ` Ævar Arnfjörð Bjarmason
2021-07-14  2:52     ` Taylor Blau
2021-07-14 16:46       ` Sun Chao
2021-07-14 17:04         ` Taylor Blau
2021-07-14 18:19           ` Ævar Arnfjörð Bjarmason [this message]
2021-07-14 19:11             ` Martin Fick
2021-07-14 19:41               ` Ævar Arnfjörð Bjarmason
2021-07-14 20:20                 ` Martin Fick
2021-07-20  6:32                   ` Ævar Arnfjörð Bjarmason
2021-07-15  8:23                 ` Son Luong Ngoc
2021-07-20  6:29                   ` Ævar Arnfjörð Bjarmason
2021-07-14 19:30             ` Taylor Blau
2021-07-14 19:32               ` Ævar Arnfjörð Bjarmason
2021-07-14 19:52                 ` Taylor Blau
2021-07-14 21:40               ` Junio C Hamano
2021-07-15 16:30           ` Sun Chao
2021-07-15 16:42             ` Taylor Blau
2021-07-15 16:48               ` Sun Chao
2021-07-14 16:11     ` Sun Chao
2021-07-19 19:53   ` [PATCH v3] " Sun Chao via GitGitGadget
2021-07-19 20:51     ` Taylor Blau
2021-07-20  0:07       ` Junio C Hamano
2021-07-20 15:07         ` Sun Chao
2021-07-20  6:19       ` Ævar Arnfjörð Bjarmason
2021-07-20 15:34         ` Sun Chao
2021-07-20 15:00       ` Sun Chao
2021-07-20 16:53         ` Taylor Blau
2021-08-15 17:08     ` [PATCH v4 0/2] " Sun Chao via GitGitGadget
2021-08-15 17:08       ` [PATCH v4 1/2] packfile: rename `derive_filename()` to `derive_pack_filename()` Sun Chao via GitGitGadget
2021-08-15 17:08       ` [PATCH v4 2/2] packfile: freshen the mtime of packfile by bump file Sun Chao via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=877dhs20x3.fsf@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=16657101987@163.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=me@ttaylorr.com \
    --cc=ttaylorr@github.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).