* Effectively tracing project contributions with git
@ 2009-09-12 12:30 Joseph Wakeling
  2009-09-12 18:59 ` Jeff King
  0 siblings, 1 reply; 8+ messages in thread
From: Joseph Wakeling @ 2009-09-12 12:30 UTC
  To: git

Hello,

I've recently begun contributing to a FOSS project that has a problem --
although it has extensive git logs (some being CVS/SVN imports) dating
back over many years, there has not been maintenance of contribution
records on a file-by-file basis.

I'm trying to rectify this and track down who contributed what.
Unfortunately while I'm used to basic operations with git, I don't know
it well enough to be confident in how to go about tracing contributions
in this way.

'git annotate' of course is a nice starting point but of limited use
because every time someone tweaks a line (and there have been many such
tweaks in the history of the project) the responsibility of the original
contributor is replaced by that of the tweaker.

An alternative is to use gitk to trace the history of individual files
(or paths, as gitk has it).  The problem here is that files have been
renamed, content has been moved about between different files and so on.

Finally, there's the option to use gitk to trace contributors (someone
has prepared a .mailmap file with a complete list of contributors by
name and email) and manually or otherwise tally their significant
contributions.  Again, I'm not sure to what extent this is made
difficult by copy/pasting and tweaking of file content.

I'm just hoping that the git community can offer some good advice on
this, to what extent the process of tracing contributions can be
automated, and so on.  I'm not expecting anyone to provide a solution
for me, but suggestions and pointers in the possible right directions
would be much appreciated.

Thanks & best wishes,

    -- Joe


* Re: Effectively tracing project contributions with git
  2009-09-12 12:30 Effectively tracing project contributions with git Joseph Wakeling
@ 2009-09-12 18:59 ` Jeff King
  2009-09-12 19:03   ` Sverre Rabbelier
  2009-09-13  0:03   ` Joseph Wakeling
  0 siblings, 2 replies; 8+ messages in thread
From: Jeff King @ 2009-09-12 18:59 UTC
  To: Joseph Wakeling; +Cc: Sverre Rabbelier, git

On Sat, Sep 12, 2009 at 02:30:17PM +0200, Joseph Wakeling wrote:

> I've recently begun contributing to a FOSS project that has a problem --
> although it has extensive git logs (some being CVS/SVN imports) dating
> back over many years, there has not been maintenance of contribution
> records on a file-by-file basis.
> 
> I'm trying to rectify this and track down who contributed what.
> Unfortunately while I'm used to basic operations with git, I don't know
> it well enough to be confident in how to go about tracing contributions
> in this way.

We can probably help you with the git side of things, but defining "who
contributed what" is kind of a hairy problem. You will need to define
exactly how you want to count contributions.

For example:

> 'git annotate' of course is a nice starting point but of limited use
> because every time someone tweaks a line (and there have been many such
> tweaks in the history of the project) the responsibility of the original
> contributor is replaced by that of the tweaker.

But often the tweaking of the line _does_ make it their own. One of the
metrics often discussed in git is "of the surviving lines in the code,
how many were authored by each person". Which really is the output of
"git blame" (or annotate, which is more or less the same thing). So
people who contribute code that needs a lot of changes or cleanup don't
get as much credit for that code, because their lines got tweaked later.

It's an OK metric if you assume that lines are a good atom of
contribution. That is, if I replace your line, then I remove everything
of value that you added and I should get credit. That is arguably not
the case with something like a style cleanup. Changing:

  for(i = 0; i < n; i++)

to

  for (i = 0; i < n; i++)

to fix whitespace should probably leave authorship with the original
author. But I don't know if you can determine programmatically how
significant a change was. In the case of whitespace, "git blame" has an
option to ignore whitespace changes (-w), which probably covers a large
portion of such "trivial change" cases.
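
As a rough sketch of the "surviving lines per author" metric (not
something posted in the thread; the function names tally_blame and
blame_file are made up for illustration), one could count the "author"
header lines that "git blame --line-porcelain" prints for every line of
the file. It assumes git is on the PATH and that whitespace-only changes
should be ignored via -w:

```python
import subprocess
from collections import Counter


def tally_blame(porcelain: str) -> Counter:
    """Count surviving lines per author in `git blame --line-porcelain` text."""
    counts = Counter()
    for line in porcelain.split("\n"):
        # Each blamed line carries a header "author <name>". File content
        # lines start with a TAB, and "author-mail"/"author-time" headers
        # do not match the prefix "author " (note the trailing space).
        if line.startswith("author "):
            counts[line[len("author "):]] += 1
    return counts


def blame_file(path: str) -> Counter:
    """Hypothetical helper: blame `path`, ignoring whitespace changes (-w)."""
    out = subprocess.run(
        ["git", "blame", "-w", "--line-porcelain", "--", path],
        check=True, capture_output=True,
        encoding="utf-8", errors="replace",
    ).stdout
    return tally_blame(out)
```

Calling blame_file("some/file.c").most_common() inside a repository
would then list authors by the number of lines still attributed to them.
This only measures survival of lines, so it inherits every caveat about
lines being a crude atom of contribution.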

> An alternative is to use gitk to trace the history of individual files
> (or paths, as gitk has it).  The problem here is that files have been
> renamed, content has been moved about between different files and so on.

You can use rename detection via "git log --follow" and simply count the
lines changed (and by whom) in each commit. This differs from the "git
blame" strategy in that it counts every change as valuable, even if the
line doesn't survive.

But no, that won't handle the movement of some chunk of content from
one file to another. Only "git blame" really looks at code movement on a
smaller-than-file level.
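
For the "count every change" approach, a minimal sketch might look like
the following. It is only an illustration: the "author %aN" marker line
and the names tally_changes and changes_for_file are assumptions chosen
here, not anything git or this thread prescribes. It parses the output of
"git log --follow --numstat" and sums added and deleted lines per author:

```python
import re
import subprocess
from collections import defaultdict

# A numstat line is "<added>\t<deleted>\t<path>"; binary files show "-".
NUMSTAT = re.compile(r"^(\d+|-)\t(\d+|-)\t")


def tally_changes(log_text: str) -> dict:
    """Sum (added, deleted) line counts per author.

    Expects text produced with --numstat and a pretty format that prints
    one line "author <name>" per commit.
    """
    totals = defaultdict(lambda: [0, 0])
    author = None
    for line in log_text.split("\n"):
        if line.startswith("author "):
            author = line[len("author "):]
            continue
        m = NUMSTAT.match(line)
        if m and author is not None and m.group(1) != "-":
            totals[author][0] += int(m.group(1))
            totals[author][1] += int(m.group(2))
    return {name: tuple(pair) for name, pair in totals.items()}


def changes_for_file(path: str) -> dict:
    """Hypothetical helper: run git log for one path, following renames."""
    out = subprocess.run(
        ["git", "log", "--follow", "--numstat",
         "--pretty=tformat:author %aN", "--", path],
        check=True, capture_output=True,
        encoding="utf-8", errors="replace",
    ).stdout
    return tally_changes(out)
```

Note that --follow accepts only a single path, so this would have to be
run once per file, and it still says nothing about content moved between
files.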

> I'm just hoping that the git community can offer some good advice on
> this, to what extent the process of tracing contributions can be
> automated, and so on.  I'm not expecting anyone to provide a solution
> for me, but suggestions and pointers in the possible right directions
> would be much appreciated.

I think it is less a git problem and more of a "how do you want to
define contribution" problem. The above is just my thinking about it for
a few minutes. Sverre Rabbelier (cc'd) did a "git stats" GSoC project
last year, but I don't think I ever looked closely at the results or
what metrics he came up with. But that is probably a good direction to
look in.

-Peff


* Re: Effectively tracing project contributions with git
  2009-09-12 18:59 ` Jeff King
@ 2009-09-12 19:03   ` Sverre Rabbelier
  2009-09-13  0:10     ` Joseph Wakeling
  2009-09-13  0:03   ` Joseph Wakeling
  1 sibling, 1 reply; 8+ messages in thread
From: Sverre Rabbelier @ 2009-09-12 19:03 UTC
  To: Jeff King; +Cc: Joseph Wakeling, git

Heya,

On Sat, Sep 12, 2009 at 20:59, Jeff King <peff@peff.net> wrote:
> I think it is less a git problem and more of a "how do you want to
> define contribution" problem. The above is just my thinking about it for
> a few minutes. Sverre Rabbelier (cc'd) did a "git stats" GSoC project
> last year, but I don't think I ever looked closely at the results or
> what metrics he came up with. But that is probably a good direction to
> look in.

Git stats can aggregate diffs, so it can show you "this author made
changes to this many lines to this file in total", but it doesn't work
across renames. It also has an option to aggregate that to a total per
project number, but I'm not sure how useful that is to your case, as
you seem to be interested in a per-file/line basis? I agree with Jeff
that you'll need to define more precisely what it is you want to know
:).

-- 
Cheers,

Sverre Rabbelier


* Re: Effectively tracing project contributions with git
  2009-09-12 18:59 ` Jeff King
  2009-09-12 19:03   ` Sverre Rabbelier
@ 2009-09-13  0:03   ` Joseph Wakeling
  1 sibling, 0 replies; 8+ messages in thread
From: Joseph Wakeling @ 2009-09-13  0:03 UTC
  To: Jeff King; +Cc: Sverre Rabbelier, git

Jeff King wrote:
> We can probably help you with the git side of things, but defining "who
> contributed what" is kind of a hairy problem. You will need to define
> exactly how you want to count contributions.

Yes, that's pretty much what I'm looking for.  My thoughts on
contribution run along much the same lines as yours -- there's a need to
distinguish between meaningful additions and mere tweaks.

My general rule is that stuff like whitespace changes, changing the name
of variables, typo corrections etc. is not a meaningful contribution
although if someone had really done a lot of it I might see things
differently.  Substantial additions -- extending the code, comments or
documentation -- are what I'm after.  Ultimately this has to be decided
by me actually looking at things rather than metrics.

What I'm doing right now is to run a git shortlog on a file to get a
rough idea of the contributors and who are likely to be the main
authors, then using gitk to browse the commits for that file.  It's
time-consuming but works -- once I've identified at least one major
commit from someone I can ignore everything else by them and concentrate
on the remaining contributors.

What would help is some way to speed up the process of getting someone's
commits: 'give me all the diffs for file X by author Y'.  I'm not too
good at shell scripting so grep-y things don't spring easily to mind.

An alternative useful tool would be 'give me all the commits to this
file that change more than N lines'.

With those two -- particularly the first -- I think I'd be able to get a
fair way.  It won't work for the files where there has been a lot of
moving of content or renames, but that's mostly in the docs -- the code,
which is the really important thing, doesn't seem so bad (so far).

Thanks very much for the advice and careful thoughts,

Best wishes,

    -- Joe


* Re: Effectively tracing project contributions with git
  2009-09-12 19:03   ` Sverre Rabbelier
@ 2009-09-13  0:10     ` Joseph Wakeling
  2009-09-13  2:28       ` Theodore Tso
  0 siblings, 1 reply; 8+ messages in thread
From: Joseph Wakeling @ 2009-09-13  0:10 UTC
  To: Sverre Rabbelier; +Cc: Jeff King, git

Sverre Rabbelier wrote:
> Git stats can aggregate diffs, so it can show you "this author made
> changes to this many lines to this file in total", but it doesn't work
> across renames. It also has an option to aggregate that to a total per
> project number, but I'm not sure how useful that is to your case, as
> you seem to be interested in a per-file/line basis? I agree with Jeff
> that you'll need to define more precisely what it is you want to know
> :).

That would certainly be a very useful function -- it wouldn't solve my
problem for me but would make it easier to identify core authors.  After
all, 'number of commits' doesn't necessarily correspond to meaningful
contribution -- many of them could be editorial -- but number of lines
(or the ratio of lines to commits) could be a much better indicator.

I don't see any solution that doesn't see me browsing diffs -- there's
no metric that will solve the problem -- but if your stats work could
help me get an output of the form 'here are all the diffs on file X by
contributor Y in order of size, largest first' then I think it would
help a LOT.

Is there a website where I can read more about your stats/metrics work?
 Beyond the applications to the present problem I have some other
reasons to be very interested in what can be done with git history stats.

Thanks & best wishes,

    -- Joe


* Re: Effectively tracing project contributions with git
  2009-09-13  0:10     ` Joseph Wakeling
@ 2009-09-13  2:28       ` Theodore Tso
  2009-09-13  9:24         ` Jeff King
  2009-09-13 14:30         ` Joseph Wakeling
  0 siblings, 2 replies; 8+ messages in thread
From: Theodore Tso @ 2009-09-13  2:28 UTC
  To: Joseph Wakeling; +Cc: Sverre Rabbelier, Jeff King, git

On Sun, Sep 13, 2009 at 02:10:49AM +0200, Joseph Wakeling wrote:
> 
> I don't see any solution that doesn't see me browsing diffs -- there's
> no metric that will solve the problem -- but if your stats work could
> help me get an output of the form 'here are all the diffs on file X by
> contributor Y in order of size, largest first' then I think it would
> help a LOT.

This will display all of the diffs on file (pathname) XXX by contributor YYY:

	git log -p --author=YYY XXX 

You might also find the diffstats useful:

	git log --stat --author=YYY XXX

Or if you want *only* the diffstats for the file in question, you might try:

	git log --stat --pretty=format: --author=YYY XXX | grep XXX

So the bottom line is git will allow you to extract quite a lot of
information.  You might need to do some perl- or shell- or python-
scripting to analyze or format the information, but the harder
question is determining exactly what question you want to ask.

Eliminating whitespace changes isn't hard (add the -b flag).  If you
want to eliminate variable renaming, that's harder since that requires
actually parsing the patch.  There are programs that will do that
(normally used by University professors to catch students cheating at
Programming 101 courses :-), but you'd need to do some shell (or perl
or python) scripting to splice them into the git invocations to
extract out the information.

Is there a particular reason why this is important to you?  Is it out
of curiosity, or are you trying to build a case that you've
contacted all of the significant contributors for the purposes of
changing the license used on a file?  If it's the latter, what I'd
probably do is just simply collect everyone who has ever changed a
file (git log --format="%aN <%aE>" pathname/to/a/file | sort -u) and
try to get as many people as possible to agree to the license change.
For the ones who have _not_ agreed, or whom you cannot contact, you
can go back and just analyze their changes (git log --author=YYY) to
decide whether or not they are significant, and whether you need to
try extra hard to contact them, or in the worst case, find someone
to rewrite the parts of the file which they had modified in the past.
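
The bookkeeping for that workflow could be sketched roughly as below.
This is only an illustration under stated assumptions: the names
outstanding and contributors_of are hypothetical, and the list of people
who have agreed is assumed to be kept by hand in the same
"Name <email>" form that the git log command above prints:

```python
import subprocess


def outstanding(contributors_text: str, agreed) -> list:
    """Return contributors (one "Name <email>" per line) not yet in `agreed`."""
    seen = {line.strip() for line in contributors_text.split("\n")
            if line.strip()}
    return sorted(seen - set(agreed))


def contributors_of(path: str) -> str:
    """Hypothetical helper: everyone who ever authored a change to `path`."""
    return subprocess.run(
        ["git", "log", "--format=%aN <%aE>", "--", path],
        check=True, capture_output=True,
        encoding="utf-8", errors="replace",
    ).stdout
```

Something like outstanding(contributors_of("pathname/to/a/file"), agreed)
would then give the people whose changes still need a closer look.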

Or maybe you have some other reason for gathering said information.
Depending on what the high-level thing it is that you are trying to
do, there may be an easier or more elegant way to get the information
you are requesting.

						- Ted


* Re: Effectively tracing project contributions with git
  2009-09-13  2:28       ` Theodore Tso
@ 2009-09-13  9:24         ` Jeff King
  2009-09-13 14:30         ` Joseph Wakeling
  1 sibling, 0 replies; 8+ messages in thread
From: Jeff King @ 2009-09-13  9:24 UTC
  To: Theodore Tso; +Cc: Joseph Wakeling, Sverre Rabbelier, git

On Sat, Sep 12, 2009 at 10:28:43PM -0400, Theodore Tso wrote:

> > I don't see any solution that doesn't see me browsing diffs -- there's
> > no metric that will solve the problem -- but if your stats work could
> > help me get an output of the form 'here are all the diffs on file X by
> > contributor Y in order of size, largest first' then I think it would
> > help a LOT.
> 
> This will display all of the diffs on file (pathname) XXX by contributor YYY:
> 
> 	git log -p --author=YYY XXX 
> 
> You might also find the diffstats useful:
> 
> 	git log --stat --author=YYY XXX
> 
> Or if you want *only* the diffstats for the file in question, you might try:
> 
> 	git log --stat --pretty=format: --author=YYY XXX | grep XXX

There is also the "--numstat" format which is a bit easier for parsing.
I think the "all diffs on file $X by contributor $Y, ordered by size"
would look like:

  git log -z --pretty=tformat:%H --numstat --author=$Y $X |
  perl -0ne '
    my ($commit) = /^([0-9a-f]{40})$/m;
    my ($lines_added) = /^(\d+)\s/m;
    print "$lines_added $commit\n";
  ' |
  sort -rn |
  cut -d ' ' -f2 |
  xargs -n 1 git show
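
For anyone more comfortable with Python than perl, a rough sketch of the
same ranking idea follows. It is not a translation of the pipeline above:
it reads plain newline-separated output (no -z), counts added lines only,
and adds an optional threshold for the "commits that change more than N
lines" request. The names commits_by_size and ranked_commits are
assumptions made for this example:

```python
import re
import subprocess

COMMIT = re.compile(r"^[0-9a-f]{40,64}$")     # full object name on its own line
NUMSTAT = re.compile(r"^(\d+|-)\t(\d+|-)\t")  # "<added>\t<deleted>\t<path>"


def commits_by_size(log_text: str, min_lines: int = 0) -> list:
    """Return (lines_added, commit) pairs, largest first.

    Expects text from `git log --pretty=tformat:%H --numstat`.
    Commits adding fewer than `min_lines` lines are dropped.
    """
    sizes = {}
    commit = None
    for line in log_text.split("\n"):
        if COMMIT.match(line):
            commit = line
            sizes[commit] = 0
            continue
        m = NUMSTAT.match(line)
        # "-" in the added column marks a binary file; skip those.
        if m and commit is not None and m.group(1) != "-":
            sizes[commit] += int(m.group(1))
    ranked = sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)
    return [(n, c) for c, n in ranked if n >= min_lines]


def ranked_commits(author: str, path: str, min_lines: int = 0) -> list:
    """Hypothetical helper: rank one author's commits to one path."""
    out = subprocess.run(
        ["git", "log", "--pretty=tformat:%H", "--numstat",
         "--author=" + author, "--", path],
        check=True, capture_output=True,
        encoding="utf-8", errors="replace",
    ).stdout
    return commits_by_size(out, min_lines)
```

Each commit in the result can then be inspected with "git show" in turn.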

-Peff


* Re: Effectively tracing project contributions with git
  2009-09-13  2:28       ` Theodore Tso
  2009-09-13  9:24         ` Jeff King
@ 2009-09-13 14:30         ` Joseph Wakeling
  1 sibling, 0 replies; 8+ messages in thread
From: Joseph Wakeling @ 2009-09-13 14:30 UTC
  To: Theodore Tso; +Cc: Sverre Rabbelier, Jeff King, git

Theodore Tso wrote:
> This will display all of the diffs on file (pathname) XXX by contributor YYY:
> 
> 	git log -p --author=YYY XXX 
> 
> You might also find the diffstats useful:
> 
> 	git log --stat --author=YYY XXX
> 
> Or if you want *only* the diffstats for the file in question, you might try:
> 
> 	git log --stat --pretty=format: --author=YYY XXX | grep XXX

That's absolutely brilliant -- using these commands makes my task much
easier.

As for reasons -- there are several.  For the FOSS project, there is a
potential relicensing issue (guarding against future problems rather
than addressing present ones) and indeed I'm going about it the way you
suggest -- getting consent from as many contributors as possible.
Despite that, it seems useful to have up-to-date file-by-file credits
and copyright notices.

Personally, there's also a degree of curiosity and wanting to learn some
of the more complex and advanced possibilities of git -- and also
basically wanting to see to what extent this task is possible.  How
fine-grained a degree of credit/blame can I extract for a given piece of
code?  And how far back in history? etc.

Finally, there's an aspect which has nothing to do with code but could
still be very interesting for some people in the git community.  I've
long been fascinated by DVCS as a collaborative tool and over the last
year have been part of the Liquid Publications project:
http://project.liquidpub.org/

... that is trying to develop new models for scientific collaboration
and publishing/sharing of results and ideas.  One of my interests is to
see whether DVCS can be harnessed to enable better and more open
collaboration and micro-credit for scientific contributions.

We've already set up a project on Launchpad to try to turn one of our
project reports into a review paper via open collaboration:
https://code.launchpad.net/~webdrake/liquidpub/peer-review

... and I'll shortly be setting up a GitHub branch for another,
from-scratch article directly on DVCS and their potential applications
(the use of different VCS is deliberate: one thing I'll be doing is
testing different VCS and their different features).  This is something
I'd very much like to have git (and bzr, and hg) community members
involved in.

I was going to write to the git community about this at a later date
once I'd got more stuff prepared, but since the present discussion could
generate useful material for that it seems only fair to be open.

The FOSS project stuff has nothing to do with that, but I certainly see
it as a good experience to feed the LiquidPub research.

Thanks for the useful advice and best wishes,

    -- Joe

