git.vger.kernel.org archive mirror
* Working with git binary stream
@ 2021-08-09 16:12 anatoly techtonik
  2021-08-09 21:07 ` Jeff King
  2021-08-10 13:57 ` Elijah Newren
  0 siblings, 2 replies; 3+ messages in thread
From: anatoly techtonik @ 2021-08-09 16:12 UTC (permalink / raw)
  To: Git Mailing List

Hi.

In https://lore.kernel.org/git/CAPkN8xK7JnhatkdurEb16bC0wb+=Khd=xJ51YQUXmf2H23YCGw@mail.gmail.com/T/#u
it became clear that fast-export followed by fast-import cannot be
made to produce identical commit hashes for the resulting
repository (try https://github.com/simons-public/protonfixes).
It is also impossible to detect which commits would be altered
as a result of this operation. Because fast-export/import does
some implicit commit normalization, fixing that probably requires
too much effort.

As an alternative, it appears that there is also a
"git binary stream" produced by

git cat-file --batch --batch-all-objects

Is there a way to reconstruct the repository given that stream?
Is there documentation on how to read it?
-- 
anatoly t.


* Re: Working with git binary stream
  2021-08-09 16:12 Working with git binary stream anatoly techtonik
@ 2021-08-09 21:07 ` Jeff King
  2021-08-10 13:57 ` Elijah Newren
  1 sibling, 0 replies; 3+ messages in thread
From: Jeff King @ 2021-08-09 21:07 UTC (permalink / raw)
  To: anatoly techtonik; +Cc: Git Mailing List

On Mon, Aug 09, 2021 at 07:12:13PM +0300, anatoly techtonik wrote:

> As an alternative, it appears that there is also a
> "git binary stream" produced by
> 
> git cat-file --batch --batch-all-objects
> 
> Is there a way to reconstruct the repository given that stream?

Yes, though it is probably not the easiest way to do so. Just dumping
all of the object contents back into another repository will indeed give
you the same hashes, etc. But if you change one object, then its hash
will change, and all of the other objects pointing to it will need
to change, etc. And that dump is in apparently random order with respect
to the actual graph structure and relationship between objects.

You'd probably do better to build a tool around rev-list, and only use
cat-file to fetch the verbatim object contents. At some point your tool
would start to look a lot like fast-export/fast-import, and it may be
less work to teach them whatever features you need to avoid any
normalization (e.g., retaining signatures, encodings, etc).

> Is there documentation on how to read it?

The output format is described in the "BATCH FORMAT" section of "git
help cat-file". Basically you get each object id, type, and size in
bytes, followed by the object contents. You can use the size from the
header to know how many bytes to read.
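As an illustration, a batch stream containing just the well-known "hello"
blob and the empty blob would look like this (each object's contents are
followed by a newline, which is why the empty blob is followed by a blank
line):

"""
ce013625030ba8dba906f756967f9e9ca394464a blob 6
hello

e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 blob 0

"""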

There's no tool to accept the whole stream. You'd have to parse each
entry and feed it to "git hash-object" with the appropriate type.
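For illustration, a minimal Python sketch of such a parser (the function
names are hypothetical, and the hash-object step assumes you run it inside
the destination repository):

```python
import subprocess

def parse_batch_stream(data):
    """Yield (oid, type, content) from `git cat-file --batch` output.

    Each entry is a header line "<oid> <type> <size>\n" followed by
    <size> bytes of content and a trailing newline.
    """
    pos = 0
    while pos < len(data):
        nl = data.index(b"\n", pos)
        oid, otype, size = data[pos:nl].split()
        content = data[nl + 1 : nl + 1 + int(size)]
        yield oid.decode(), otype.decode(), content
        pos = nl + 1 + int(size) + 1   # skip content plus trailing newline

def import_objects(data):
    """Feed each parsed object to `git hash-object` in the current repo."""
    for oid, otype, content in parse_batch_stream(data):
        new_oid = subprocess.run(
            ["git", "hash-object", "-w", "-t", otype, "--stdin"],
            input=content, capture_output=True, check=True,
        ).stdout.decode().strip()
        # Identical content and type must produce the identical oid.
        assert new_oid == oid, f"hash mismatch for {oid}"
```

As noted above, this rewrites each object loose, so it will be much
slower than a packfile-based import.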

Teaching hash-object a mode to read in a bunch of objects in "cat-file
--batch" format wouldn't be unreasonable, but nobody has found a need
for it so far. It would also be quite slow (it writes out individual
loose objects, whereas something like fast-import writes out a packfile,
including at least a basic attempt at deltas).

-Peff


* Re: Working with git binary stream
  2021-08-09 16:12 Working with git binary stream anatoly techtonik
  2021-08-09 21:07 ` Jeff King
@ 2021-08-10 13:57 ` Elijah Newren
  1 sibling, 0 replies; 3+ messages in thread
From: Elijah Newren @ 2021-08-10 13:57 UTC (permalink / raw)
  To: anatoly techtonik; +Cc: Git Mailing List

On Mon, Aug 9, 2021 at 9:16 AM anatoly techtonik <techtonik@gmail.com> wrote:
>
> Hi.
>
> In https://lore.kernel.org/git/CAPkN8xK7JnhatkdurEb16bC0wb+=Khd=xJ51YQUXmf2H23YCGw@mail.gmail.com/T/#u
> it became clear that fast-export followed by fast-import cannot be
> made to produce identical commit hashes for the resulting
> repository (try https://github.com/simons-public/protonfixes).
> It is also impossible to detect which commits would be altered
> as a result of this operation. Because fast-export/import does
> some implicit commit normalization, fixing that probably requires
> too much effort.
>
> As an alternative, it appears that there is also a
> "git binary stream" produced by
>
> git cat-file --batch --batch-all-objects
>
> Is there a way to reconstruct the repository given that stream?
> Is there documentation on how to read it?

Peff already responded about hash-object.  And pointed you, again, to
the manual for cat-file.

Can I suggest an alternative, even if it changes the problem statement
slightly?  For some reason you didn't like my
--reference-excluded-parents suggestion, but there's another way to do
this as well with fast-export and fast-import as they exist today: use
fast-export's --show-original-ids flag.  With that flag, you'll know
the original hashes.  And if your filtering process does not modify a
commit or any of its ancestors, it can simply omit that commit (i.e.
not pass it along to fast-import) and replace any references to the
commit with a reference to the original hash.  So, for example if the
`git fast-export --show-original-ids ...` output looked as follows (a
simple repository with just three commits for demonstration purposes):

"""
reset refs/heads/main
commit refs/heads/main
mark :1
original-oid 81b642ea15a614e84cdd52514a963735426ab06c
author Developer Name <developer@foo.corp> 1628603376 -0400
committer Developer Name <developer@foo.corp> 1628603376 -0400
data 35
First commit, which was gpg signed
M 100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 fileA

commit refs/heads/main
mark :2
original-oid 0024a18e9bfef3fd1091305cef4dd5a789164809
author Developer Name <developer@foo.corp> 1628603396 -0400
committer Developer Name <developer@foo.corp> 1628603396 -0400
data 14
Second commit
from :1
M 100644 f2e41136eac73c39554dede1fd7e67b12502d577 fileA
M 100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 fileB

commit refs/heads/main
mark :3
original-oid 96efb1173ad5c037f03f3639976f2465b1c58186
author Developer Name <developer@foo.corp> 1628603422 -0400
committer Developer Name <developer@foo.corp> 1628603422 -0400
data 13
Third commit
from :2
M 100644 f15bf479158b73b9bb79e158ce93d75190bc9597 fileA
M 100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 fileC
"""

Then we'd parse the first commit, decide we didn't want to filter it,
note that we hadn't filtered it or any of its parents, and then decide
to replace any references to ":1" (the stream's name for the
replacement for that commit) with
"81b642ea15a614e84cdd52514a963735426ab06c" (the original hash).

Then we'd parse the second commit.  Perhaps on this one we decide we
want to remove fileB.  So we output it after removing the fileB line,
and after replacing ":1" with the appropriate hash.

Then we'd parse the third commit.  We decide we don't want to change
this one, but we did change the second commit (the one with "mark
:2"), so we still have to output it.  There are no direct references
to :1, so we don't need to update those either.

In the end, we'd pass this stream to fast-import:

"""
reset refs/heads/main
commit refs/heads/main
mark :2
original-oid 0024a18e9bfef3fd1091305cef4dd5a789164809
author Developer Name <developer@foo.corp> 1628603396 -0400
committer Developer Name <developer@foo.corp> 1628603396 -0400
data 14
Second commit
from 81b642ea15a614e84cdd52514a963735426ab06c
M 100644 f2e41136eac73c39554dede1fd7e67b12502d577 fileA

commit refs/heads/main
mark :3
original-oid 96efb1173ad5c037f03f3639976f2465b1c58186
author Developer Name <developer@foo.corp> 1628603422 -0400
committer Developer Name <developer@foo.corp> 1628603422 -0400
data 13
Third commit
from :2
M 100644 f15bf479158b73b9bb79e158ce93d75190bc9597 fileA
M 100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 fileC
"""

and it'd recover the original commit as you wanted.

This does presume that you're importing into the original repository
(or a clone --mirror of it), because it expects certain hashes to
already exist.  And when importing into such a repo, you want to use
--force with fast-import.  But it should do what you're asking for,
without needing to do any extra work in fast-export or fast-import.

