* RCS Keywords in Git done right
@ 2014-11-26 16:44 Derek Moore
  2014-11-26 18:10 ` Stefan Beller
  0 siblings, 1 reply; 7+ messages in thread
From: Derek Moore @ 2014-11-26 16:44 UTC (permalink / raw)
  To: git

Junio, et al.,

I've completed my first pass at RCS Keywords in Git. I believe I've
come up with a solution that is accurate, performant and complete (but
I have not tested it on big repos yet, I'm doing that today...).

https://github.com/derekm/git-keywords

This work basically takes advantage of all the state-machine
transitions in git to surgically perform "git update-index $(git
archive $(git log -1 --format=%H @ -- $path) -- $path | tar vx)"
overwrites in the work tree. It also exposes some state transitions
that are entirely absent from git, which creates a few edge cases.
These are relatively unimportant if your deployed git repos will be
managed by an automated system. Humans doing development workflows
can trigger them when cancelling certain operations, but every edge
case just leaves you with un-substituted files, which become
substituted again after checkouts, commits, merges, rewrites, etc.

Only $Author$, $Date$ and $Revision$ can be emulated presently. $Id$
and other tags requiring filename paths or basenames are possible, but
would require changes internal to git allowing "pretty format" codes
inside a file to triangulate filenames from blob hash and commit hash
pairs.
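
As a rough illustration of where the three emulated values could come
from, the sketch below asks git for the last commit that touched each
path and prints it with pretty-format codes. The mapping of $Author$,
$Date$ and $Revision$ to %an, %ad and %h is an assumption made for this
example (the hooks in the repository may use different codes), and the
throwaway repository, the file names and the commit_as helper are
hypothetical.

```shell
#!/bin/sh
set -eu

# Throwaway repository for the demonstration.
repo=$(mktemp -d)
cd "$repo"
git init -q .

# commit_as NAME MESSAGE: commit the staged changes under the given identity.
commit_as() {
  GIT_AUTHOR_NAME=$1 GIT_AUTHOR_EMAIL=demo@example.com \
  GIT_COMMITTER_NAME=$1 GIT_COMMITTER_EMAIL=demo@example.com \
    git commit -q -m "$2"
}

echo one > a.txt; git add a.txt; commit_as Alice "add a.txt"
echo two > b.txt; git add b.txt; commit_as Bob "add b.txt"

# keyword_values PATH: values taken from the last commit touching PATH.
# Assumed mapping: Author=%an, Date=%ad, Revision=%h (short hash).
keyword_values() {
  git log -1 --format='Author=%an Date=%ad Revision=%h' HEAD -- "$1"
}

keyword_values a.txt
keyword_values b.txt

# Each file reports its own last author, not the author of HEAD.
author_a=$(git log -1 --format=%an HEAD -- a.txt)
author_b=$(git log -1 --format=%an HEAD -- b.txt)
```

The point of the per-path lookup is that a.txt keeps reporting Alice
even though the newest commit in the repository was made by Bob.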

I believe this work fundamentally proves that the theory of RCS
keywords is sound in the context of Git, and that full support in
git-core is entirely achievable in short order. In fact, other areas
in git would become improved for several reasons if git devs ingested
some of the results of this work.

There is a lot of gainsaying and kneejerk reaction to the idea of
keywords under the assumption of distributed development because of
the fallacy of thinking in terms of shared/universal linear history
instead of in terms of relative spacetime events.

Keyword substitution can be done accurately relative to the history of
the possessor of that history. Last edit timestamps and last authors
and revision IDs are important to many workflows inside and outside
development.

Of the keywords emulated, the only thing I couldn't achieve
(obviously) was monotonically increasing revision numbers. Instead I
went with the file's most recent commit short hash (which is more
proper for git anyway).

To test it out...

1) clone the repo:

git clone https://github.com/derekm/git-keywords

2) cd into the repo and set up the hooks:

ln -sf ../../post-checkout-filter.pl .git/hooks/post-checkout
ln -sf ../../pre-commit-check.pl .git/hooks/pre-commit
ln -sf ../../post-commit-filter.pl .git/hooks/post-commit
ln -sf ../../post-merge-filter.pl .git/hooks/post-merge
ln -sf ../../post-rewrite-filter.pl .git/hooks/post-rewrite

3) edit .git/config and set up the filters:

[filter "keywords"]
        smudge = ./keyword-smudge.pl %f
        clean = ./keyword-clean.pl

4) inspect the lack of substitutions:

head -4 *

5) initialize the repo with first substitutions:

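# Re-extract each top-level path from its own most recent commit, then
# update the index entries for whatever tar reports as extracted.
for i in $(git ls-tree --name-only @); do
 git update-index \
  $(git archive \
   "$(git log -1 --format=%H @ -- "$i")" -- "$i" | tar vx)
done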

6) inspect the presence of substitutions:

head -4 *

7) ??? (start hacking, try to break it, etc.)

8) PROFIT!
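
For anyone who prefers not to hand-edit .git/config, step 3 above can
also be expressed with "git config". The sketch below does that in a
throwaway repository and reads the values back. The .gitattributes line
is an assumption: git only runs a filter on paths that carry a matching
filter= attribute, and the '*.txt' pattern and the file name notes.txt
are examples only, not taken from the git-keywords repository.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)   # throwaway repository for the demonstration
cd "$repo"
git init -q .

# Same two settings as the [filter "keywords"] section in step 3.
git config filter.keywords.smudge './keyword-smudge.pl %f'
git config filter.keywords.clean './keyword-clean.pl'

# Assumption: paths are routed through the filter by a gitattributes line.
# The '*.txt' pattern is only an example.
printf '*.txt filter=keywords\n' > .gitattributes

# Read the configuration back and ask which filter a sample path would get.
smudge=$(git config --get filter.keywords.smudge)
clean=$(git config --get filter.keywords.clean)
attr=$(git check-attr filter -- notes.txt)
printf '%s\n%s\n%s\n' "$smudge" "$clean" "$attr"
```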

PS: I may consider rewriting the hooks in Bash, but I need to audit
what commands are available under msys-git.


* Re: RCS Keywords in Git done right
  2014-11-26 16:44 RCS Keywords in Git done right Derek Moore
@ 2014-11-26 18:10 ` Stefan Beller
  2014-11-26 19:22   ` Derek Moore
  0 siblings, 1 reply; 7+ messages in thread
From: Stefan Beller @ 2014-11-26 18:10 UTC (permalink / raw)
  To: Derek Moore; +Cc: git

On Wed, Nov 26, 2014 at 8:44 AM, Derek Moore <derek.p.moore@gmail.com> wrote:
> Junio, et al.,
>
> I've completed my first pass at RCS Keywords in Git. I believe I've
> come up with a solution that is accurate, performant and complete (but
> I have not tested it on big repos yet, I'm doing that today...).
>
> https://github.com/derekm/git-keywords
>
> This work basically takes advantage of all the state-machine
> transitions in git to surgically perform "git update-index $(git
> archive $(git log -1 --format=%H @ -- $path) -- $path | tar vx)"
> overwrites in the work tree. It also exposes some state transitions
> that are entirely absent from git, which creates a few edge cases.
> These are relatively unimportant if your deployed git repos will be
> managed by an automated system. Humans doing development workflows
> can trigger them when cancelling certain operations, but every edge
> case just leaves you with un-substituted files, which become
> substituted again after checkouts, commits, merges, rewrites, etc.

Knowing now that the edge cases won't work, I still don't have an idea
of the standard case, i.e. what should work with this. Would you mind
writing a more detailed example, or a paragraph advertising what this can do?
My not getting the big picture may be because I have not worked with RCS yet.

Thanks,
Stefan


* Re: RCS Keywords in Git done right
  2014-11-26 18:10 ` Stefan Beller
@ 2014-11-26 19:22   ` Derek Moore
  2014-11-26 21:15     ` Stefan Beller
  0 siblings, 1 reply; 7+ messages in thread
From: Derek Moore @ 2014-11-26 19:22 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git

> Knowing now that the edge cases won't work, I still don't have an idea
> of the standard case, i.e. what should work with this. Would you mind
> writing a more detailed example, or a paragraph advertising what this can do?
> My not getting the big picture may be because I have not worked with RCS yet.

Stefan,

RCS Keywords, while originating from RCS, are commonly used in CVS and
SVN. A lot of LaTeX workflows in the scientific community, for
example, use these keyword substitutions, trapping scientists in
legacy SCMs. In my environment, we do builds and deployments from
within pristine working copies or "checkouts of trunk". We also have
some deployments that are symlinks into live checkouts of trunk, and
we have production support workflows where support personnel inspect
files remotely and subsequent escalation procedures rely on the
contents of the $Author$ substitutions, etc. As a result,
projects that have migrated to git are demanding the restoration of
their RCS keyword substitutions.

In CVS/SVN, keywords are expanded on checkout, placing text related to
the most recent history of a given file into that file. RCS has one
keyword that takes action on check-in (or commit): the $Log$
keyword, which is a running commit log of the file in the file.
Keyword expansions are not stored in the repo, but are substituted on
their way out of the repo into the working copy, and substitutions are
reversed on their way from the working copy into the repo.
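
To make the round trip concrete, here is a minimal sketch of the
"reverse" (clean) direction using sed. It assumes the RCS-style expanded
form "$Keyword: value $", which is not necessarily the format that
keyword-clean.pl handles, and the function name clean_keywords is made
up for this example.

```shell
#!/bin/sh
set -eu

# clean_keywords: collapse "$Keyword: value $" back to "$Keyword$" on stdin.
# Assumption: only Author, Date and Revision are recognised.
clean_keywords() {
  sed -E 's/\$(Author|Date|Revision):[^$]*\$/$\1$/g'
}

expanded='$Author: Alice $ $Revision: 1a2b3c4 $'
collapsed=$(printf '%s\n' "$expanded" | clean_keywords)
printf '%s\n' "$collapsed"   # prints: $Author$ $Revision$
```

Because the collapsed form is what gets hashed, a file whose only
difference is an expanded keyword would not count as modified.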

Git's export-subst feature in git-archive is very similar to RCS
Keywords. What I'm providing here is a mechanism to enable
export-subst functionality throughout normal git workflows and not
just during builds that employ git-archive, as if export-subst worked
alongside git's ident feature.
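
For readers who have not used export-subst, the following sketch shows
the stock git behaviour this builds on: a file marked export-subst has
its "$Format:...$" placeholders expanded when it passes through
git-archive. Everything here runs in a throwaway repository; the file
name notes.txt and the author identity are invented for the example, and
it does not use the hooks or filters from git-keywords.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)   # throwaway repository for the demonstration
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME='Alice Example' GIT_AUTHOR_EMAIL=alice@example.com
export GIT_COMMITTER_NAME='Alice Example' GIT_COMMITTER_EMAIL=alice@example.com

# git-archive reads attributes from the archived tree, so commit them too.
printf 'notes.txt export-subst\n' > .gitattributes
printf 'Last author: $Format:%%an$\n' > notes.txt
git add .gitattributes notes.txt
git commit -q -m 'add notes'

# Archive the file from its own latest commit and unpack it in place.
commit=$(git log -1 --format=%H HEAD -- notes.txt)
git archive "$commit" -- notes.txt | tar xf -

expanded=$(cat notes.txt)
printf '%s\n' "$expanded"   # prints: Last author: Alice Example
```

On its own this leaves the work tree file different from the committed
blob, which is presumably why the approach pairs it with clean/smudge
filters and a "git update-index" call.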

Perhaps describing the known issues I've found will also help towards
understanding...


Known Issues
------------

Edge Case #1 (aka, modified smudge filter)

1. create new branch
2. edit smudge filter
3. commit
4. switch back to previous branch
5. smudge filter has temporarily disappeared at the moment the smudge
filter wants to run

This edge case is a side-effect of the order in which git performs
deletions in the worktree, extractions from the index and
executions of the filters. It is similar to one seen in older git
versions, where the smudge filter disappeared during
a "git checkout-index -a -f", but the sequence of operations has been
fixed in more recent gits, so the smudge filter does not disappear
during a checkout-index.


Edge Case #2

1. create branch B from branch A
2. make changes in branch A, commit
3. checkout branch B
4. git merge A
5. while editing commit file, file being modified lacks keywords (expected)
6. delete commit message, cancelling commit
7. file remains w/o substituted keywords
8. cancel merge w/ git reset --merge ORIG_HEAD & restored original
file is w/o substituted keywords

Reason: no available state transition on reset --merge


Edge Case #3

1. create branch B from branch A, checkout B
2. modify file, commit
3. checkout A
4. make conflicting edit to same file
5. git rebase B, rebase will conflict
6. git rebase --abort
7. file will be w/o substituted keywords


Known Unissues
--------------

Not-an-Edge-Case #1

1. create branch B from branch A, checkout B
2. modify file, commit
3. checkout A
4. git merge --squash B
5. file as modified from B is w/o substituted keywords
AS EXPECTED - that version of file does not yet contain history in A,
file will gain substitutions following commit


* Re: RCS Keywords in Git done right
  2014-11-26 19:22   ` Derek Moore
@ 2014-11-26 21:15     ` Stefan Beller
  2014-12-02 16:31       ` Derek Moore
  0 siblings, 1 reply; 7+ messages in thread
From: Stefan Beller @ 2014-11-26 21:15 UTC (permalink / raw)
  To: Derek Moore; +Cc: git

On Wed, Nov 26, 2014 at 11:22 AM, Derek Moore <derek.p.moore@gmail.com> wrote:
>> Knowing now that the edge cases won't work, I still don't have an idea
>> of the standard case, i.e. what should work with this. Would you mind
>> writing a more detailed example, or a paragraph advertising what this can do?
>> My not getting the big picture may be because I have not worked with RCS yet.
>
> Stefan,
>
> RCS Keywords, while originating from RCS, are commonly used in CVS and
> SVN. A lot of LaTeX workflows in the scientific community, for
> example, use these keyword substitutions, trapping scientists in
> legacy SCMs. In my environment, we do builds and deployments from
> within pristine working copies or "checkouts of trunk". We also have
> some deployments that are symlinks into live checkouts of trunk, and
> we have production support workflows where support personnel inspect
> files remotely and subsequent escalation procedures rely on the
> contents of the $Author$ substitutions, etc. As a result,
> projects that have migrated to git are demanding the restoration of
> their RCS keyword substitutions.
>
> In CVS/SVN, keywords are expanded on checkout, placing text related to
> the most recent history of a given file into that file. RCS has one
> keyword that takes action on check-in (or commit): the $Log$
> keyword, which is a running commit log of the file in the file.
> Keyword expansions are not stored in the repo, but are substituted on
> their way out of the repo into the working copy, and substitutions are
> reversed on their way from the working copy into the repo.

Thanks for the explanation!



* Re: RCS Keywords in Git done right
  2014-11-26 21:15     ` Stefan Beller
@ 2014-12-02 16:31       ` Derek Moore
  2014-12-02 17:03         ` Derek Moore
  0 siblings, 1 reply; 7+ messages in thread
From: Derek Moore @ 2014-12-02 16:31 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git

I've finished testing this work in larger repositories.

The approach is performant and works nicely in small repos, but
in larger repos one of the requirements for the "correctness" of
substitutions slows things down (1 or 2 minutes to perform checkouts
between branches with 10,000+ files).

The operation that is slowing things down is discovering the relative
complement of commits between the common files of two branches (i.e.,
which files are common between two branches but differ in their latest
commit).

My current approach is:
1) find files common between @ & @{-1}, "ls-tree --full-tree
--name-only -r" both branches, take the intersection
2) find current branch's commits for common files, for each file in
intersection "log -1 --format=%H $current_branch -- $file"
3) find common files where latest commits differ, for each file in
intersection keep the file if current branch's latest commit does not
equal prior branch's latest commit
4) overwrite all kept files with the results of git-archive

It is steps 2 & 3 that consume the most time in a large repo with
large intersections of common files between branches.
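
Steps 1 through 3 can be sketched with plain CLI tools roughly as
follows. This is a reconstruction from the description above, not the
code in the hooks: the function name changed_common_files, the use of
sort/comm for the intersection, and the demo branches "current" and
"prior" are all assumptions for illustration.

```shell
#!/bin/sh
set -eu

# changed_common_files CURRENT PRIOR
# Prints "<commit> <path>" for every path present in both revisions whose
# latest commit differs between them. Run from the top of the work tree.
changed_common_files() {
  cur_list=$(mktemp)
  prior_list=$(mktemp)
  # Step 1: list both trees; comm -12 keeps only the common paths.
  git ls-tree --full-tree --name-only -r "$1" | sort > "$cur_list"
  git ls-tree --full-tree --name-only -r "$2" | sort > "$prior_list"
  comm -12 "$cur_list" "$prior_list" | while IFS= read -r file; do
    # Steps 2 and 3: one "git log" per side per file (the slow part).
    cur_commit=$(git log -1 --format=%H "$1" -- "$file")
    prior_commit=$(git log -1 --format=%H "$2" -- "$file")
    if [ "$cur_commit" != "$prior_commit" ]; then
      printf '%s %s\n' "$cur_commit" "$file"
    fi
  done
  rm -f "$cur_list" "$prior_list"
}

# Demo in a throwaway repository with two hypothetical branches.
repo=$(mktemp -d)
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
echo one > a.txt; echo two > b.txt
git add a.txt b.txt; git commit -q -m 'base'
git branch prior
git checkout -q -b current
echo changed >> b.txt; echo three > c.txt
git add b.txt c.txt; git commit -q -m 'edit b, add c'

# Only b.txt is common to both branches and differs in its latest commit.
result=$(changed_common_files current prior)
printf '%s\n' "$result"
```

The sketch makes the cost visible: the loop spawns two "git log"
processes per common file, so the work grows with the size of the
intersection rather than with the number of files that actually changed.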

I've tried to conceive of other ways of arriving at the same
"filename"/"latest current branch commit hash" pairs where filenames
are common between branches and latest current branch commit hash
differs from latest prior branch commit hash. I've thought maybe I
could traverse commits starting from merge-base instead of traversing
files, but that doesn't seem like it would be a huge improvement.
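
For comparison, a commit-oriented traversal might look like the sketch
below: a single "git log --name-only" walk, with awk recording the first
(newest) commit in which each path appears. This is only an illustration
of the idea, with no claim about how it performs on a large repo. The
function name last_commit_per_file is made up, and the result is not
guaranteed to match "git log -1 -- $file" in every case (merge commits,
renames and paths that have since been deleted are not handled).

```shell
#!/bin/sh
set -eu

# last_commit_per_file REV
# One walk over REV's history; the first time a path appears (newest
# commit first) is recorded as that path's latest commit.
last_commit_per_file() {
  git log --format='commit %H' --name-only "$1" |
    awk '/^commit [0-9a-f]+$/ { commit = $2; next }
         NF && !($0 in seen)  { seen[$0] = 1; print commit, $0 }'
}

# Demo in a throwaway repository.
repo=$(mktemp -d)
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
echo one > a.txt; echo two > b.txt
git add a.txt b.txt; git commit -q -m 'base'
echo changed >> b.txt
git add b.txt; git commit -q -m 'edit b'

table=$(last_commit_per_file HEAD)
printf '%s\n' "$table"

# Cross-check one path against the per-file lookup used in step 2.
from_walk=$(printf '%s\n' "$table" | awk '$2 == "a.txt" { print $1 }')
from_log=$(git log -1 --format=%H HEAD -- a.txt)
```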

I'm sure internal to git in C there would be a better/faster way (and
it would probably look like writing Btrieve queries). Can anyone think
of a good solution for the intersection of files and complement of
commits using only the git CLI tools?

Thanks,

Derek


* Re: RCS Keywords in Git done right
  2014-12-02 16:31       ` Derek Moore
@ 2014-12-02 17:03         ` Derek Moore
  2014-12-02 17:36           ` Derek Moore
  0 siblings, 1 reply; 7+ messages in thread
From: Derek Moore @ 2014-12-02 17:03 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git

> My current approach is:
> 1) find files common between @ & @{-1}, "ls-tree --full-tree
> --name-only -r" both branches, take the intersection
> 2) find current branch's commits for common files, for each file in
> intersection "log -1 --format=%H $current_branch -- $file"
> 3) find common files where latest commits differ, for each file in
> intersection keep the file if current branch's latest commit does not
> equal prior branch's latest commit
> 4) overwrite all kept files with the results of git-archive

PS: In large repos, I can dump the entire contents of the repo out of
git-archive faster than I can look up the commits of common files
between two branches for a more limited and surgical dump from
git-archive (say, 30 seconds to dump everything out of git-archive vs.
1 minute 30 seconds to find the intersection of files and look up the
latest commits).


* Re: RCS Keywords in Git done right
  2014-12-02 17:03         ` Derek Moore
@ 2014-12-02 17:36           ` Derek Moore
  0 siblings, 0 replies; 7+ messages in thread
From: Derek Moore @ 2014-12-02 17:36 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git

PPS: Sounds like I need Peff's git-blame-tree from here:
https://github.com/peff/git/compare/jk/faster-blame-tree
