* [RFC] Use cases for 'git statistics'
@ 2008-05-08 15:51 Sverre Rabbelier
  2008-05-12  9:38 ` Sverre Rabbelier
  0 siblings, 1 reply; 16+ messages in thread
From: Sverre Rabbelier @ 2008-05-08 15:51 UTC (permalink / raw)
  To: Git Mailing List; +Cc: David Symonds, Shawn O. Pearce, Junio C Hamano

Heya,

I've been busy writing up some use cases for 'git statistics' (a new
command that I will be implementing this summer during Google Summer
of Code). For more details on my proposal please see
http://alturin.googlepages.com/gsoc2008 (a pdf of the use cases is
hosted there as well for those who prefer pdf). I would like to ask
for comments on the current use cases; is anything missing, or should
a particular use case be removed/merged? Please let me know.

Thank you for your time.



== Terminology ==

There are four types of users: Maintainers, Developers, Bug-fixers,
and regular Users. The first three are all Contributors.

Name: Maintainer (Contributor)
Description: The Maintainer reviews commits and branches from other
Contributors and decides which ones to integrate into a 'master'
branch.

Name: Developer (Contributor)
Description: The Developer contributes enhancements to the project,
e.g. they add new content or improve existing content.

Name: Bug-fixer (Contributor)
Description: The Bug-fixer locates 'bugs' (as something unwanted that
needs to be corrected) in the content and 'fixes' them.

Name: User
Description: The User uses the content, be it in their daily work or
every now and then for a specific purpose.



== Use cases ==


A model where other Contributors review commits is assumed in all use
cases. When references are made to a Contributor addressing another
Contributor to adjust their behavior as the result of mined data, it
should be kept in mind that the Contributor concerned should foremost
be the one to do this. Using this information of one's own accord to,
say, spend more time checking one's own commits for bugs when working
on a specific part of the content is often more effective than doing
so only after being asked. </disclaimer>? :P


Name: Finding a Contributor that is active in a specific bit of content.
Description:
	Whenever a Contributor needs to know about other Contributors that
are active in a specific part of the content, they query git for this
information. This could be used to figure out whom to send a copy of a
commit (someone who has recently worked on the content a commit
modifies is likely to be interested in such a commit). This
information may be easily gathered with, say, git blame. By aggregating
its output (in the background if need be, to maintain speedy response
times), it is trivial to determine whether a Contributor has more
commits/lines of change than a predefined amount. The main difference
from plain git blame is that the output is aggregated over the history
of the content, for a specific Contributor, whereas git blame only
shows the latest changes.
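
As a rough illustration of the aggregation described above (a sketch
only, not part of the proposal), the Python below counts commits and
changed lines per author. It assumes the history of a path was dumped
with something like "git log --numstat --format=author:%an -- <path>";
the "author:" marker and both function names are made up for this
example.

```python
from collections import defaultdict

# Hypothetical marker: the input is assumed to come from
#   git log --numstat --format=author:%an -- <path>
AUTHOR_PREFIX = "author:"


def activity_per_author(log_text):
    """Return {author: [commit_count, lines_changed]} from numstat log output."""
    totals = defaultdict(lambda: [0, 0])
    current = None
    for line in log_text.split("\n"):
        if line.startswith(AUTHOR_PREFIX):
            # One marker line per commit.
            current = line[len(AUTHOR_PREFIX):]
            totals[current][0] += 1
            continue
        parts = line.split("\t", 2)
        if current is None or len(parts) != 3:
            continue  # blank separator lines and anything unexpected
        added, deleted, _path = parts
        # Binary files are reported as "-" and are skipped here.
        if added.isdigit() and deleted.isdigit():
            totals[current][1] += int(added) + int(deleted)
    return dict(totals)


def active_contributors(totals, min_commits=1, min_lines=0):
    """Authors whose activity reaches both (arbitrary) thresholds."""
    return sorted(
        author
        for author, (commits, lines) in totals.items()
        if commits >= min_commits and lines >= min_lines
    )
```

The thresholds stand in for the "predefined amount" mentioned above;
what a sensible value is would have to be decided per project.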


Name: Finding which commits touch the parts of the content that a
commit touches.
Description:
	There are several reasons that one might want to know which commits
touch the parts of the content that a commit touches. This may be
implemented similarly to how git blame works, only instead of
'stopping' after having found the author of a line, the search
continues up to a certain date in the past.
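
One possible way to approximate this with existing tools (an
assumption on my part, not necessarily how it would be implemented) is
to read the pre-image line ranges from the hunk headers of a commit's
unified diff and hand them to "git log -L<start>,<end>:<file>" on the
parent commit. The function names below are hypothetical, and the
ranges include context lines unless the diff was made with -U0.

```python
import re

# "@@ -start[,count] +start[,count] @@"; an omitted count means 1.
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def touched_old_ranges(diff_text):
    """Return {path: [(first_line, last_line), ...]} for the pre-image side."""
    ranges = {}
    path = None
    old_left = new_left = 0  # body lines still expected in the current hunk
    for line in diff_text.split("\n"):
        if old_left > 0 or new_left > 0:
            # Inside a hunk body: count lines down, never parse headers here,
            # so a removed line such as "-- comment" is not mistaken for one.
            if line.startswith("+"):
                new_left -= 1
            elif line.startswith("-"):
                old_left -= 1
            elif not line.startswith("\\"):  # "\ No newline at end of file"
                old_left -= 1
                new_left -= 1
            continue
        if line.startswith("--- "):
            name = line[4:]
            if name == "/dev/null":
                path = None  # newly added file: there is no pre-image
            elif name.startswith("a/"):
                path = name[2:]
            else:
                path = name
            continue
        match = HUNK_RE.match(line)
        if match:
            start = int(match.group(1))
            old_left = int(match.group(2)) if match.group(2) is not None else 1
            new_left = int(match.group(4)) if match.group(4) is not None else 1
            if path is not None and old_left > 0:
                ranges.setdefault(path, []).append((start, start + old_left - 1))
    return ranges


def log_range_args(ranges):
    """Build -L arguments for 'git log', to be run on the commit's parent."""
    return [
        f"-L{start},{end}:{path}"
        for path, spans in ranges.items()
        for start, end in spans
    ]
```

Limiting how far back the search goes could then be left to an option
such as --since on the same git log invocation.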


Name: Integrating the found 'bug introducing' commit with the git
commit message system.
Description:
	When a Bug-fixer sends out a commit to fix a bug it might be useful
for them to find out where exactly the bug was introduced. Using the
'which commit touched the content this commit touches' technique,
candidate commits may be retrieved. After picking which of the found
commits caused the bug, this information may then be automatically
added to the commit's description. This not only allows the Bug-fixer
to make clear the origin of their commit, but also makes it possible to
later unambiguously determine a bug/fix pair. Note that the retrieval
is automated: no user input is required to find the candidate
commits; only the picking of 'cause' commits requires input from the
user.
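
To make the bug/fix pairing concrete, here is a small sketch of how
the picked commit could be recorded in, and later read back from, a
commit message. The use case does not say what the recorded
information should look like; the "Fixes:" trailer used here is
borrowed from the Linux kernel convention and is purely an assumption,
as are the function names.

```python
import re

# Assumed trailer format (Linux kernel style): Fixes: <sha> ("<subject>")
FIXES_RE = re.compile(r"^Fixes: ([0-9a-f]{7,40})\b", re.MULTILINE)
TRAILER_LINE = re.compile(r"^[A-Za-z][A-Za-z-]*: \S")


def add_fixes_trailer(message, bug_sha, bug_subject):
    """Append a Fixes: trailer naming the commit the user picked as the cause."""
    body = message.rstrip("\n")
    last_line = body.rsplit("\n", 1)[-1]
    # Join an existing trailer block directly; otherwise start a new paragraph.
    in_trailer_block = "\n\n" in body and TRAILER_LINE.match(last_line)
    separator = "\n" if in_trailer_block else "\n\n"
    return f'{body}{separator}Fixes: {bug_sha[:12]} ("{bug_subject}")\n'


def bug_fix_pairs(commits):
    """commits: iterable of (fix_sha, message). Return [(bug_sha, fix_sha), ...]."""
    return [
        (match.group(1), fix_sha)
        for fix_sha, message in commits
        for match in FIXES_RE.finditer(message)
    ]
```

With something like this in place, the later use cases could obtain
their bug/fix pairs by scanning commit messages rather than guessing.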


Name: Finding the Authors that introduce a lot of/almost no bugs to the content.
Description:
	Contributors might be interested to know which of the Developers
introduce a lot of bugs, or on the contrary, which introduce almost no
bugs to the content. This information is highly relevant to the
Maintainer as they may now focus the time they spend on reviewing
commits on those that stem from Developers that turn out to often
introduce bugs. On the other hand, Developers that usually do not
introduce bugs need less reviewing time. While such information is
usually known to the experienced Maintainer (as they know their main
contributors well), it can be helpful to new maintainers, or as a
pointer that the opinion of the Maintainer about a specific Developer
needs to be adjusted. Bug-fixers on the other hand can use this
information to address the Developer that introduces most of the bugs
they fix, perhaps with advice on how to prevent future bugs from being
introduced.


Name: Finding the Contributor that accepted a lot of/almost no bugs
into the content.
Description:
	Similar to finding the Authors that write the bugs, there are other
Contributors that 'accept' the commit, either passively, by not
commenting when the commit is sent out for review, or actively, by
'acknowledging' (acked-by), 'signing off' (signed-off-by) or 'testing'
(tested-by) a commit. Active acceptance can later be traced, and this
information can then be used in the same ways as for Authors.
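
A minimal sketch of tracing the 'active' case: collect the acked-by,
signed-off-by and tested-by lines from commit messages and count, per
person, how many bug-introducing commits they accepted. It assumes the
set of bug-introducing commits is already known, and it leaves out the
author's own sign-off; the function names are invented for this
example.

```python
import re
from collections import Counter

ACCEPT_RE = re.compile(
    r"^(Acked-by|Signed-off-by|Tested-by):[ \t]*(.+?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def acceptors(message, author=None):
    """Return the set of people who accepted one commit, minus its author."""
    return {
        person
        for _kind, person in ACCEPT_RE.findall(message)
        if person != author
    }


def accepted_bug_counts(buggy_commits):
    """buggy_commits: iterable of (author, message) for known buggy commits."""
    counts = Counter()
    for author, message in buggy_commits:
        # A person is counted once per commit, however many tags they added.
        for person in acceptors(message, author):
            counts[person] += 1
    return counts
```

Passive acceptance (not commenting during review) leaves no trace in
the commit and cannot be counted this way.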


Name: Finding parts of the content in which a lot of bugs are
introduced and fixed
Description:
	When a Developer decides to change part of the content, it would be
interesting for them to know that many before them introduced bugs
when working on that part of the content. Knowing this the Developer
might ask for all such buggy commits to try and learn from the
mistakes made by others and prevent making the same mistake. A
Maintainer might use this information to spend extra time reviewing a
commit from a 'bug prone' part of the content.

	
Name: Finding parts of the content a particular Contributor introduces
a lot of/almost no bugs to.
Description:
	When trying to decide whether to ask a specific Contributor to work
on part of the content it might be useful to know not only how actively
they work on that part of the content, but also whether they introduced
a lot of bugs to that part, or perhaps fixed many. Similar to the more
general case, this can be split out between modifying content and
'accepting' modifications. This information may be used to decide to
ask a Contributor to spend more time on a specific part of the content
before sending in a commit for review.

	
Name: Finding how many bugs were introduced/fixed in a period of time
Description:
	As bugs are recognized by their fixes, it is always possible to match
a bug to its fix. Both commits have a time stamp and with those the
time between bug and fix can be calculated. By aggregating this data
over all known bugs (and fixes), the number of unfixed bugs may be found
over a specified period of time. For example, finding the number of
fixed bugs between two releases, or how many bugs were not fixed within
one release cycle. This number might then be calculated over several
time frames (say, each release), after which it is possible to track
'content quality' throughout releases. If this information is then
graphed one can find extremes in this figure (for example, a release
cycle in which a lot of bugs were fixed, or one that introduced many).
Knowing this, the Contributors may then determine the cause of such
extremes and learn from it.
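
The arithmetic is simple enough to sketch. Given bug/fix pairs as
(introduced, fixed) time stamps, the following computes the delay per
bug, the number of fixes landing in a period, and the number of known
bugs still open at a given moment. The input shape and function names
are assumptions for the example.

```python
from datetime import datetime, timedelta


def fix_delays(pairs):
    """Time between the bug-introducing commit and its fix, per pair."""
    return [fixed - introduced for introduced, fixed in pairs]


def fixed_within(pairs, start, end):
    """Number of bugs whose fix landed in the half-open period [start, end)."""
    return sum(1 for _introduced, fixed in pairs if start <= fixed < end)


def open_at(pairs, moment):
    """Number of bugs already introduced at `moment` but fixed only later."""
    return sum(1 for introduced, fixed in pairs if introduced <= moment < fixed)


if __name__ == "__main__":
    pairs = [
        (datetime(2008, 1, 10), datetime(2008, 2, 1)),
        (datetime(2008, 1, 20), datetime(2008, 4, 15)),
    ]
    print(fix_delays(pairs), open_at(pairs, datetime(2008, 3, 1)))
```

Since bugs are only recognized by their fixes, a bug that has never
been fixed does not show up in any of these numbers.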

	
Name: Finding how much work a contributor has done over a period of time
Description:
	When working in a team in which everybody is expected to do
approximately the same amount of work it is interesting to see how
much work each Contributor actually does. This allows the team to
discuss any extremes and attempt to handle these so as to distribute
the work more evenly.
	When work is being done by a large group of people it is interesting
to know the most active Contributors since these usually are the ones
with the most knowledge of the content. The other way around, it is
possible to determine if a specific Contributor is 'active enough' for
a specific task (such as mentoring).

	
Name: Finding whether a Contributor is mostly a Developer or a Bug-fixer
Description:
	To all Contributors it is interesting to know if they spend most of
their time fixing bugs, or contributing enhancements to the content.
This information could also be queried over a specific time frame, for
example 'weekends vs. workdays' or 'holidays vs. non-holidays'.
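
For the 'weekends vs. workdays' split, a sketch could look like the
following. It assumes each commit has already been classified as a fix
or an enhancement (how to do that is outside this example) and that the
time stamps are in the author's local time.

```python
from collections import Counter
from datetime import datetime


def role_by_day_type(commits):
    """commits: iterable of (author_datetime, is_fix).

    Returns a Counter keyed by (day_type, role), where day_type is
    "weekend" or "workday" and role is "fix" or "enhancement".
    """
    counts = Counter()
    for when, is_fix in commits:
        # datetime.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
        day_type = "weekend" if when.weekday() >= 5 else "workday"
        role = "fix" if is_fix else "enhancement"
        counts[(day_type, role)] += 1
    return counts
```

A 'holidays vs. non-holidays' split would need a calendar of holidays
as extra input, which git itself cannot provide.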



-- 
Cheers,

Sverre Rabbelier


* Re: [RFC] Use cases for 'git statistics'
  2008-05-08 15:51 [RFC] Use cases for 'git statistics' Sverre Rabbelier
@ 2008-05-12  9:38 ` Sverre Rabbelier
  2008-05-12 10:16   ` Jakub Narebski
  0 siblings, 1 reply; 16+ messages in thread
From: Sverre Rabbelier @ 2008-05-12  9:38 UTC (permalink / raw)
  To: Git Mailing List; +Cc: David Symonds, Shawn O. Pearce, Junio C Hamano

On Thu, May 8, 2008 at 5:51 PM, Sverre Rabbelier <srabbelier@gmail.com> wrote:
> Heya,
>
> I've been busy to write up some use cases for 'git statistics' (a new
> command that I will be implementing this summer during Google Summer
> of Code). For more details on my proposal please see
> http://alturin.googlepages.com/gsoc2008 (a pdf of the use cases is
> hosted there as well for those who prefer pdf). I would like to ask
> for comments on the current use cases; is anything missing, or should
> a particular use case be removed/merged? Please let me know.

Heya,

I haven't had replies to this e-mail so far, did it get lost in the list noise?

-- 
Cheers,

Sverre Rabbelier


* Re: [RFC] Use cases for 'git statistics'
  2008-05-12  9:38 ` Sverre Rabbelier
@ 2008-05-12 10:16   ` Jakub Narebski
  2008-05-12 10:19     ` Sverre Rabbelier
  0 siblings, 1 reply; 16+ messages in thread
From: Jakub Narebski @ 2008-05-12 10:16 UTC (permalink / raw)
  To: sverre; +Cc: Git Mailing List, David Symonds, Shawn O. Pearce, Junio C Hamano

"Sverre Rabbelier" <srabbelier@gmail.com> writes:

> On Thu, May 8, 2008 at 5:51 PM, Sverre Rabbelier <srabbelier@gmail.com> wrote:
> > Heya,
> >
> > I've been busy to write up some use cases for 'git statistics' (a new
> > command that I will be implementing this summer during Google Summer
> > of Code). For more details on my proposal please see
> > http://alturin.googlepages.com/gsoc2008 (a pdf of the use cases is
> > hosted there as well for those who prefer pdf). I would like to ask
> > for comments on the current use cases; is anything missing, or should
> > a particular use case be removed/merged? Please let me know.
> 
> Heya,
> 
> I haven't had replies to this e-mail so far, did it get lost in the
> list noise?

One comment: did you take a look at 'owners.sh' script posted some
time ago by (IIRC) spearce to check who "owns" egit/jgit and relevant
git code?  This is one interesting, and useful, statistic.

-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: [RFC] Use cases for 'git statistics'
  2008-05-12 10:16   ` Jakub Narebski
@ 2008-05-12 10:19     ` Sverre Rabbelier
  2008-05-12 11:19       ` Jakub Narebski
  0 siblings, 1 reply; 16+ messages in thread
From: Sverre Rabbelier @ 2008-05-12 10:19 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Git Mailing List, David Symonds, Shawn O. Pearce, Junio C Hamano

On Mon, May 12, 2008 at 12:16 PM, Jakub Narebski <jnareb@gmail.com> wrote:
> One comment: did you take a look at 'owners.sh' script posted some
> time ago by (IIRC) spearce to check who "owns" egit/jgit and relevant
> git code?  This is one interesting, and useful, statistics.

Ah, yes, I did see it, and something similar to that I intend to
include. I reckon his script would fall under the "Finding a
Contributor that is active in a specific bit of content" use case.

-- 
Cheers,

Sverre Rabbelier


* Re: [RFC] Use cases for 'git statistics'
  2008-05-12 10:19     ` Sverre Rabbelier
@ 2008-05-12 11:19       ` Jakub Narebski
  2008-05-12 11:49         ` Sverre Rabbelier
  0 siblings, 1 reply; 16+ messages in thread
From: Jakub Narebski @ 2008-05-12 11:19 UTC (permalink / raw)
  To: sverre; +Cc: Git Mailing List, David Symonds, Shawn O. Pearce, Junio C Hamano

On Mon, 12 May 2008 12:19, Sverre Rabbelier wrote:
> On Mon, May 12, 2008 at 12:16 PM, Jakub Narebski <jnareb@gmail.com> wrote:
>>
>> One comment: did you take a look at 'owners.sh' script posted some
>> time ago by (IIRC) spearce to check who "owns" egit/jgit and relevant
>> git code?  This is one interesting, and useful, statistics.

First I have to admit that I haven't read your email carefully.
One note: why not provide HTML version in addition to PDF?

> Ah, yes, I did see it, and something similar to that I intend to
> include. I reckon his script would fall under the "Finding a
> Contributor that is active in a specific bit of content" use case.

I don't agree.

This is "Finding the owner of the code" (i.e. something like
non-existent 'git blame --summary') with the goal of "Find who
needs to be contacted about changing (or adding) license / relicensing".

This is similar, but not exactly the same as "Find maintainer of given
subsystem", or "Who is responsible for this part of code".


A few use cases I thought about (perhaps repeating what you have written,
see note above):

* Maintainer: how closely should I examine a provided patch?
* Contributor: who is the maintainer of the code / whom should I contact
  and send a copy of a patch?
* Bug-fixer: who is responsible for this part of the code? Who might have
  introduced the bug?
* Contributor: what happened with my code?
* Searching where to contribute: what are the oldest parts of the code
  dealing with error messages (find ancient code)?

HTH
-- 
Jakub Narebski
Poland


* Re: [RFC] Use cases for 'git statistics'
  2008-05-12 11:19       ` Jakub Narebski
@ 2008-05-12 11:49         ` Sverre Rabbelier
  2008-05-12 12:40           ` Jakub Narebski
  0 siblings, 1 reply; 16+ messages in thread
From: Sverre Rabbelier @ 2008-05-12 11:49 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Git Mailing List, David Symonds, Shawn O. Pearce, Junio C Hamano

On Mon, May 12, 2008 at 1:19 PM, Jakub Narebski <jnareb@gmail.com> wrote:
>  One note: why not provide HTML version in addition to PDF?

HTML version is now available too: http://alturin.googlepages.com/Use_cases.html

>  This is "Finding the owner of the code" (i.e. something like
>  non-existent 'git blame --summary') with the goal of "Find who
>  needs to be contact about changing (or adding) license / relicensing".

Ah, it is a simple aggregation of 'git blame' then, you are right,
that is not the same as what I had in mind for the mentioned use case.
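
For reference, such a simple aggregation of 'git blame' could be
sketched as below. This is not the owners.sh script itself (which I am
not reproducing here), just a count of currently surviving lines per
author, assuming the input is the output of "git blame
--line-porcelain <file>"; the function name is made up.

```python
from collections import Counter


def lines_per_author(porcelain_text):
    """Count surviving lines per author in 'git blame --line-porcelain' output.

    With --line-porcelain every blamed line carries its own "author <name>"
    header, and the line content itself is prefixed with a TAB, so content
    can never be mistaken for a header.
    """
    counts = Counter()
    # split("\n") rather than splitlines(), so that unusual characters
    # inside a content line (e.g. form feed) do not create extra lines.
    for line in porcelain_text.split("\n"):
        if line.startswith("author "):
            counts[line[len("author "):]] += 1
    return counts
```

Calling counts.most_common() on the result lists the 'owners' of the
file in descending order of surviving lines.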

>  This is similar, but not exactly the same as "Find maintainer of given
>  subsystem", or "Who is responsible for this part of code".

Agreed, since those should look back in history too.

>  * Maintainer: how close should I examine provided patch?

I'm not sure I understand what you mean by this; perhaps it is related
to "Name: Finding parts of the content in which a lot of bugs are
introduced and fixed" (e.g., patches to bug prone areas should be
examined more closely).

>  * Contributor: who is maintainer of the code / whom should I contact
>   and send copy of a patch?

I think this -is- the "Finding a Contributor that is active in a
specific bit of content" use case this time.

>  * Bug-fixer: who is responsible about this part of code? Who might have
>   introduced the bug?

How would you define 'responsible'? "Having a lot of signed-off-by
lines in that part of the content" would seem like a candidate, but
the "activity" use case seems applicable again here.

>  * Contributor: what happened with my code?

Do you mean a "track my code" like feature? Showing the movement of a
particular piece of code through the code? (Displaying information
like "moved from foo.c to bar.c in commit 0123456789abcd"?)

>  * Searching where to contribute: what are oldest part of code dealing
>   with error messages (find ancient code)?

In other words, find the lines with the oldest modification time stamp
from 'git blame'?

-- 
Cheers,

Sverre Rabbelier


* Re: [RFC] Use cases for 'git statistics'
  2008-05-12 11:49         ` Sverre Rabbelier
@ 2008-05-12 12:40           ` Jakub Narebski
  2008-05-12 13:01             ` Sverre Rabbelier
       [not found]             ` <bd6139dc0805120604m349b1fbbr39c6dcb8d893e771@mail.gmail.com>
  0 siblings, 2 replies; 16+ messages in thread
From: Jakub Narebski @ 2008-05-12 12:40 UTC (permalink / raw)
  To: sverre; +Cc: Git Mailing List, David Symonds, Shawn O. Pearce, Junio C Hamano

Sverre Rabbelier wrote:
> On Mon, May 12, 2008 at 1:19 PM, Jakub Narebski <jnareb@gmail.com> wrote: 

> >  * Maintainer: how close should I examine provided patch?
> 
> I'm not sure I understand what you mean with this, perhaps related to
> "Name: Finding parts of the content in which a lot of bugs are
> introduced and fixed" (e.g., patches to bug prone areas should be
> examined more closely).

This is, IMHO, the most complex example (at least to do properly).
It begins with: does the given author have code touching the given
subsystem (i.e. is this a new contribution for him/her wrt. the
subsystem)? How many commits does he/she have affecting the given
subsystem? How often does he/she rewrite code? How many bugs were
introduced?

Details, I think, need to be provided by a maintainer...

> >  * Contributor: what happened with my code?
> 
> Do you mean a "track my code" like feature? Showing the movement of a
> particular piece of code through the code? (Displaying information
> like "moved from foo.c to bar.c in commit 0123456789abcd"?)

I was thinking there about "git blame --reverse".
 
> >  * Searching where to contribute: what are oldest part of code dealing
> >   with error messages (find ancient code)?
> 
> In other words, find the lines with the oldest modification time stamp
> from 'git blame'?

Or find the lines with the oldest modification stamp containing "die"
or "warn", or find which messages are oldest, even if the wrapper has
changed.
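
The first half of this (oldest lines matching a pattern) can be
sketched on top of "git blame --line-porcelain" output, as below. The
function name and the example pattern are assumptions. Note that this
only sees the last modification of each line, so it does not follow a
message through a change of wrapper.

```python
import re


def oldest_matching_lines(porcelain_text, pattern, limit=10):
    """Return up to `limit` (author_time, line_text) pairs, oldest first.

    Expects 'git blame --line-porcelain' output, where every blamed line
    has its own "author-time <unix seconds>" header followed (after the
    other headers) by the TAB-prefixed line content.
    """
    regex = re.compile(pattern)
    author_time = None
    hits = []
    for line in porcelain_text.split("\n"):
        if line.startswith("author-time "):
            author_time = int(line.split(" ", 1)[1])
        elif line.startswith("\t") and author_time is not None:
            text = line[1:]
            if regex.search(text):
                hits.append((author_time, text))
    hits.sort(key=lambda hit: hit[0])
    return hits[:limit]
```

A pattern such as r"\b(die|warn)\(" would pick out the calls mentioned
above; any other regular expression works the same way.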


P.S. I wonder how hard it would be to plug such an SCM statistics
system into something like project management, see
  "Joel On Software: Evidence based scheduling" (of programming tasks)
  http://www.joelonsoftware.com/items/2007/10/26.html

-- 
Jakub Narebski
Poland


* Re: [RFC] Use cases for 'git statistics'
  2008-05-12 12:40           ` Jakub Narebski
@ 2008-05-12 13:01             ` Sverre Rabbelier
       [not found]             ` <bd6139dc0805120604m349b1fbbr39c6dcb8d893e771@mail.gmail.com>
  1 sibling, 0 replies; 16+ messages in thread
From: Sverre Rabbelier @ 2008-05-12 13:01 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Git Mailing List, David Symonds, Shawn O. Pearce, Junio C Hamano

On Mon, May 12, 2008 at 2:40 PM, Jakub Narebski <jnareb@gmail.com> wrote:
>  This is, IMHO, the most complex example (at least to do properly).
>  It begins with: does given author have code touching given subsystem
>  (i.e. is it for him/her new contribution wrt. subsystem)? How many
>  commits he/she has affecting given subsystem? How often he/she rewrites
>  code? How many bugs were introduced?

Ah, there is a lot more to this example than I thought. Perhaps this
data could all be shown and then, using some "importance" metric per
item a "grade" can be calculated?




-- 
Cheers,

Sverre Rabbelier


* Re: [RFC] Use cases for 'git statistics'
       [not found]             ` <bd6139dc0805120604m349b1fbbr39c6dcb8d893e771@mail.gmail.com>
@ 2008-05-13 13:07               ` Jakub Narebski
  2008-05-13 13:37                 ` Sverre Rabbelier
  0 siblings, 1 reply; 16+ messages in thread
From: Jakub Narebski @ 2008-05-13 13:07 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: git

On Mon, 12 May 2008, Sverre Rabbelier wrote:
> [Sorry, I hit 'send' instead of 'save']

And now you apparently forgot to add the git mailing list to the recipients...

> On Mon, May 12, 2008 at 2:40 PM, Jakub Narebski <jnareb@gmail.com> wrote:
>>  This is, IMHO, the most complex example (at least to do properly).
>>  It begins with: does given author have code touching given subsystem
>>  (i.e. is it for him/her new contribution wrt. subsystem)? How many
>>  commits he/she has affecting given subsystem? How often he/she rewrites
>>  code? How many bugs were introduced?
> 
> Ah, there is a lot more to this example than I thought. Perhaps this
> data could all be shown and then, using some "importance" metric per
> item a "grade" can be calculated?

Weighting different statistics, Bayesian hypothesis/filtering, expert
systems, machine learning... I guess it would be quite a lot of work to
do well.  It would probably require calculating and adjusting scores for
code (difficulty) and authors (skill), and matching them...

This is certainly in the "wishlist" scope.

>>  Details I think need to be provided by maintainer...
> 
> Do you mean Junio, or the user of the program?

I mean that all I can provide is speculation.  I'm not, and never was,
a maintainer of an OSS project, and I don't know what criteria one uses
(perhaps unvoiced criteria) to decide whether a given patch needs to be
examined more closely, or whether cursory browsing should be enough.

>>>>  * Contributor: what happened with my code?
>>>
>>> Do you mean a "track my code" like feature? Showing the movement of a
>>> particular piece of code through the code? (Displaying information
>>> like "moved from foo.c to bar.c in commit 0123456789abcd"?)
>>
>>  I was thinking there about "git blame --reverse".
> 
> Do you mean, filter it's output for a specific user?

I mean, given the code at given version, what happened to this code?
Filtering "git blame --reverse" by user might be one way of solving it.

>>>>  * Searching where to contribute: what are oldest part of code dealing
>>>>   with error messages (find ancient code)?
>>>>
>> Or find the lines with oldest modification stamp with "die" or "warn",
>> or find which messages are oldest, even if wrapper have changed.
> 
> In that case, perhaps a regexp would be more suitable, to allow the
> user to search for any specific line, not just "die" or "warn"?

What I had in mind here, but didn't explain clearly enough, was an
extension to pickaxe search.  You want to find when the current error
message was created, even if the way of handling it (fprintf vs. die)
changed, or if the code was indented or moved.

Or find all error messages, in the order they were created, for example
in git's case to find ancient error messages and replace them with
something more user-friendly (or less selective about choosing friends ;-).

>>  P.S. I wonder how hard to be to plug-in such SCM statistic system
>>  into something like project management, see
>>   "Joel On Software: Evidence based scheduling" (of programming tasks)
>>   http://www.joelonsoftware.com/items/2007/10/26.html
> 
> Interesting article, I think integrating statistics
> (http://www.statsvn.org/ for example) can be a very powerful tool for
> project management.

You meant http://git.koha.org/gitstat/, didn't you? ;-P

Seriously, what I had in mind was to integrate author dates and commit
dates into project management system scheduling.

-- 
Jakub Narebski
Poland


* Re: [RFC] Use cases for 'git statistics'
  2008-05-13 13:07               ` Jakub Narebski
@ 2008-05-13 13:37                 ` Sverre Rabbelier
  2008-05-14 20:34                   ` Jakub Narebski
  2008-05-17  0:02                   ` Junio C Hamano
  0 siblings, 2 replies; 16+ messages in thread
From: Sverre Rabbelier @ 2008-05-13 13:37 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git, Junio C Hamano

On Tue, May 13, 2008 at 3:07 PM, Jakub Narebski <jnareb@gmail.com> wrote:
>  And now you apparently forgot to add the git mailing list to the recipients...

I guess mailing lists are not my thing, huh?

>  > On Mon, May 12, 2008 at 2:40 PM, Jakub Narebski <jnareb@gmail.com> wrote:
>  > Ah, there is a lot more to this example than I thought. Perhaps this
>  > data could all be shown and then, using some "importance" metric per
>  > item a "grade" can be calculated?
>
>  Weighting different statistics, Bayesian hypothesis/filtering, expert
>  systems, machine learning... I guess it would be quite a lot of work to
>  do well.  It would probably require calculating and adjusting scores for
>  code (difficulty) and authors (skill), and matching them...
>
>  This is certainly in the "wishlist" scope.

Yeah, I think it would go in the 'c' of 'MoSCoW', but it could be very
useful when done right.

>  >>  Details I think need to be provided by maintainer...
>  >
>  > Do you mean Junio, or the user of the program?
>
>  I mean that all I can provide is speculation.  I am not, and never was,
>  a maintainer of an OSS project, and I don't know what criteria one uses
>  (perhaps unvoiced criteria) to decide whether a given patch needs to be
>  examined more closely, or whether cursory browsing is enough.

I reckon more input from actual maintainers would be needed then.
Junio: aside from the original list of suggestions you provided,
could you shed some light on this as git maintainer?

>  I mean, given the code at a given version, what happened to this code?
>  Filtering "git blame --reverse" by user might be one way of solving it.

It sounds like it would not be too hard to implement; maybe another
'C' in 'MoSCoW' (or perhaps that first 'C' should be a 'W'...)

>  What I had in mind here, but didn't explain clearly enough, was an
>  extension to pickaxe search.  You want to find when the current error
>  message was created, even if the way of handling it (fprintf vs. die)
>  changed, or if the code was indented or moved.

I'm not familiar with pickaxe; what you suggest sounds like grepping
the content throughout history as well?

>  Or find all error messages, in the order they were created, for example
>  in git's case to find ancient error messages and replace them with
>  something more user-friendly (or less selective about choosing friends ;-).

I understand what you want, a search for specific content, from old to
new, stopping when you have a match?

>  > Interesting article, I think integrating statistics
>  > (http://www.statsvn.org/ for example) can be a very powerful tool for
>  > project management.
>
>  You meant http://git.koha.org/gitstat/, didn't you? ;-P

I used the former, never tried the latter :).

>  Seriously, what I had in mind was to integrate author dates and commit
>  dates into project management system scheduling.

I'm not sure what gain that would bring though, as it can only provide
end dates, not 'starting work now' timestamps...

-- 
Cheers,

Sverre Rabbelier


* Re: [RFC] Use cases for 'git statistics'
  2008-05-13 13:37                 ` Sverre Rabbelier
@ 2008-05-14 20:34                   ` Jakub Narebski
  2008-05-15 12:21                     ` Andreas Ericsson
  2008-05-17  0:02                   ` Junio C Hamano
  1 sibling, 1 reply; 16+ messages in thread
From: Jakub Narebski @ 2008-05-14 20:34 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: git, Junio C Hamano

On Tue, 13 May 2008, Sverre Rabbelier wrote:
> On Tue, May 13, 2008 at 3:07 PM, Jakub Narebski <jnareb@gmail.com> wrote:

[on helping maintainer decide how closely patch should be examined]

>>  Weighting different statistics, Bayesian hypothesis/filtering, expert
>>  systems, machine learning... I guess it would be quite a lot of work to
>>  do well.  It would probably require calculating and adjusting scores for
>>  code (difficulty) and authors (skill), and matching them...
>>
>>  This is certainly in the "wishlist" scope.
> 
> Yeah, I think it would go in the 'c' of 'MoSCoW', but it could be very
> useful when done right.

Errr... what do you mean by 'MoSCoW'?

[here I think you cut a bit too much]
>>
>>  What I had in mind here, but didn't explain clearly enough, was an
>>  extension to pickaxe search.  You want to find when the current error
>>  message was created, even if the way of handling it (fprintf vs. die)
>>  changed, or if the code was indented or moved.
> 
> I'm not familiar with pickaxe; what you suggest sounds like grepping
> the content throughout history as well?

Documentation/glossary.txt (linked from git(7), in "Git User's Manual")

   pickaxe::
        The term <<def_pickaxe,pickaxe>> refers to an option to the diffcore
        routines that help select changes that add or delete a given text
        string. With the `--pickaxe-all` option, it can be used to view the full
        <<def_changeset,changeset>> that introduced or removed, say, a
        particular line of text. See linkgit:git-diff[1].

git-diff(1):

       -S<string>
              Look for differences that contain the change in <string>.

       --pickaxe-all
              When -S finds a change, show all the changes in that changeset, not
              just the files that contain the change in <string>.

       --pickaxe-regex
              Make the <string> not a plain string but an extended POSIX regex to
              match.

>>  Or find all error messages, in the order they were created, for example
>>  in git's case to find ancient error messages and replace them with
>>  something more user-friendly (or less selective about choosing friends ;-).
> 
> I understand what you want, a search for specific content, from old to
> new, stopping when you have a match?

But let me elaborate a bit. What I wanted in my example is, for each
die("<message>") and error("<message>"), to have the commit and date where
<message> was introduced (even if it was in fprintf(stderr, ...) then).
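
A rough sketch of how that could be approximated with the existing
pickaxe option: pull the message strings out of the C source, then build
one "git log -S<message>" command per message. The helper names and the
sample source are hypothetical, and the regular expression only handles
a plain string literal as first argument. Searching for the message text
alone (not the surrounding call) is what lets it match even when the
message used to sit inside fprintf(stderr, ...).

```python
import re
from typing import List

# Matches die("...") and error("...") where the first argument is a plain
# string literal. Concatenated or macro-built messages are not handled.
MESSAGE_CALL = re.compile(r'\b(?:die|error)\(\s*"((?:[^"\\]|\\.)*)"')


def error_messages(c_source: str) -> List[str]:
    """Return the message literals exactly as they are spelled in the source."""
    return MESSAGE_CALL.findall(c_source)


def pickaxe_command(message: str) -> List[str]:
    """Build a git log invocation listing, oldest first, the commits that
    changed the number of occurrences of `message`."""
    return ["git", "log", "--reverse", "--format=%h %ad", "-S" + message]


# Hypothetical snippet standing in for a real source file.
SAMPLE = 'if (fd < 0)\n\tdie("unable to open %s", path);\nreturn error("bad object");\n'

for msg in error_messages(SAMPLE):
    print(" ".join(pickaxe_command(msg)))
```

The first line of each command's output would be the oldest commit that
added or removed the string. Note that combining --reverse with -1 does
not give that commit, because the limit is applied before the reversal,
so the caller has to take the first output line itself. A message that
moved between files may still show up as a later change, which is part
of why a proper extension to pickaxe would be needed.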

>>  Seriously, what I had in mind was to integrate author dates and commit
>>  dates into project management system scheduling.
> 
> I'm not sure what gain that would bring though, as it can only provide
> end dates, not 'starting work now' timestamps...

Well, if you use a patch management system such as StGit, it could
trace when a patch was created, when it was refreshed, when it was
temporarily abandoned (push, pop, float, new), and when it was
finalized (commit or clean).

But that is also in the realm of vague ideas, not concrete applications.
-- 
Jakub Narebski
Poland


* Re: [RFC] Use cases for 'git statistics'
  2008-05-14 20:34                   ` Jakub Narebski
@ 2008-05-15 12:21                     ` Andreas Ericsson
  0 siblings, 0 replies; 16+ messages in thread
From: Andreas Ericsson @ 2008-05-15 12:21 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Sverre Rabbelier, git, Junio C Hamano

Jakub Narebski wrote:
> On Tue, 13 May 2008, Sverre Rabbelier wrote:
>> On Tue, May 13, 2008 at 3:07 PM, Jakub Narebski <jnareb@gmail.com> wrote:
> 
> [on helping maintainer decide how closely patch should be examined]
> 
>>>  Weighting different statistics, Bayesian hypothesis/filtering, expert
>>>  systems, machine learning... I guess it would be quite a lot of work to
>>>  do well.  It would probably require calculating and adjusting scores for
>>>  code (difficulty) and authors (skill), and matching them...
>>>
>>>  This is certainly in the "wishlist" scope.
>> Yeah, I think it would go in the 'c' of 'MoSCoW', but it could be very
>> useful when done right.
> 
> Errr... what do you mean by 'MoSCoW'?
> 

Must have
Should have
Could have
Won't have

It's a priority scheme used in agile development techniques, where
developers, customers and users work closely together. The customer
decides "must have this, or we scrap this project", "should have this,
or users will be unhappy", "could have this, many would appreciate it"
and "won't have this, it's too expensive to develop" after the devs
have estimated the time required to develop the individual components.

Agile development is usually used to go under-feature instead of
over-budget. Since opensource projects are more driven by whatever
passing-by developers happen to find interesting (or annoying) at the
moment (nearly as predictable as Brownian motion), agile development
techniques are very rarely used successfully to develop OSS in
anything but extremely tight communities.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231


* Re: [RFC] Use cases for 'git statistics'
  2008-05-13 13:37                 ` Sverre Rabbelier
  2008-05-14 20:34                   ` Jakub Narebski
@ 2008-05-17  0:02                   ` Junio C Hamano
  2008-05-18  1:01                     ` Sverre Rabbelier
  2008-05-21 17:30                     ` Junio C Hamano
  1 sibling, 2 replies; 16+ messages in thread
From: Junio C Hamano @ 2008-05-17  0:02 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: Jakub Narebski, git

"Sverre Rabbelier" <srabbelier@gmail.com> writes:

>>  >>  Details I think need to be provided by maintainer...
>>  >
>>  > Do you mean Junio, or the user of the program?
>>
>>  I mean that all I can provide is speculation.  I am not, and never was,
>>  a maintainer of an OSS project, and I don't know what criteria one uses
>>  (perhaps unvoiced criteria) to decide whether a given patch needs to be
>>  examined more closely, or whether cursory browsing is enough.
>
> I reckon more input from actual maintainers would be needed then.
> Junio: aside from the original list of suggestions you provided,
> could you shed some light on this as git maintainer?

A cursory browsing is enough only when you trust the contributor well.
For example, I read patches from Nico to code around the pack generation
only once or at most twice before I apply them, and the same thing can be
said about git-svn patches from or acked-by Eric.  These come mostly from
the fact that (1) I know they know the area a lot better than I do,
and more importantly that (2) I know they care deeply about the subsystem
they are modifying, and they have good taste.

Project maintainers and old timers become familiar with habits, strengths
and weaknesses of known contributors over time, and that is the source of
such trust.

A clever enough automated way may be able to identify links between the
contributors and the areas they are familiar with, and using such a
mechanism people might be able to decide that a patch falls into category
(1) above.  I am not sure if any automated way could ever decide if a
patch falls into category (2) above, though.
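
As an illustration of the kind of mechanism meant for category (1), here
is a minimal sketch that counts how often each author has touched each
top-level directory. The input pairs are assumed to have been collected
elsewhere (for example from log output that lists the files of each
commit); the author names and paths below are made up, and treating a
top-level directory as an "area" is only a crude stand-in for a real
notion of subsystem.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def area_of(path: str) -> str:
    """Top-level directory of `path`, or the file name for root-level files."""
    return path.split("/", 1)[0]


def familiarity(touches: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Map each author to a Counter of how many times they touched each area.

    `touches` holds one (author, path) pair per file touched by a commit.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for author, path in touches:
        counts[author][area_of(path)] += 1
    return dict(counts)


# Hypothetical history (assumed data, not taken from any real repository).
touches = [
    ("alice", "pack/write.c"),
    ("alice", "pack/read.c"),
    ("alice", "svn/fetch.pl"),
    ("bob", "svn/fetch.pl"),
]

table = familiarity(touches)
print(table["alice"].most_common(1))
```

A count like this says nothing about category (2): it measures activity
in an area, not care or taste.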


* Re: [RFC] Use cases for 'git statistics'
  2008-05-17  0:02                   ` Junio C Hamano
@ 2008-05-18  1:01                     ` Sverre Rabbelier
  2008-05-21 17:30                     ` Junio C Hamano
  1 sibling, 0 replies; 16+ messages in thread
From: Sverre Rabbelier @ 2008-05-18  1:01 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jakub Narebski, git

[And once more with 'reply to all' instead. Wouldn't it be nice if
gmail had an 'auto-reply-to-all' feature...]

On Sat, May 17, 2008 at 2:02 AM, Junio C Hamano <gitster@pobox.com> wrote:
> A cursory browsing is enough only when you trust the contributor well.
> For example, I read patches from Nico to code around the pack generation
> only once or at most twice before I apply them, and the same thing can be
> said about git-svn patches from or acked-by Eric.  These come mostly from
> the fact that (1) I know they know the area a lot better than I do,
> and more importantly that (2) I know they care deeply about the subsystem
> they are modifying, and they have good taste.

This makes sense, patches only get a 'cursory browsing' when they come
from a trusted author, which is defined mostly by how active and how
'good' they are in the area they modify.

> Project maintainers and old timers become familiar with habits, strengths
> and weaknesses of known contributors over time, and that is the source of
> such trust.

This could only partially be done by an algorithm; while git excels in
the 'over time' part, the definition of 'habits, strengths and
weaknesses' is harder to pin down.

> A clever enough automated way may be able to identify links between the
> contributors and the areas they are familiar with, and using such a
> mechanism people might be able to decide that a patch falls into category
> (1) above.  I am not sure if any automated way could ever decide if a
> patch falls into category (2) above, though.

Yes, your solution for determining patches from (1) is along the same
lines as what I have been thinking myself. I don't think it is
possible to determine (2) without having access to the review system
(in git's case, the mailing list). If the review system became
part of the analysis, it could provide information on what improvements
had to be made to a commit before it was accepted. If 'style
improvements' were marked in such a system, then people with 'good
taste' would be people whose commits do not often need 'style
improvements'. Alas, implementing something like that would be beyond
the scope of one GSoC. Ah well, it is a nice dream to implement at
a later time, perhaps (although it would be better suited to a team
collaboration suite than to a [D]VCS).


--
Cheers,

Sverre Rabbelier


* Re: [RFC] Use cases for 'git statistics'
  2008-05-17  0:02                   ` Junio C Hamano
  2008-05-18  1:01                     ` Sverre Rabbelier
@ 2008-05-21 17:30                     ` Junio C Hamano
  2008-05-21 20:52                       ` Sverre Rabbelier
  1 sibling, 1 reply; 16+ messages in thread
From: Junio C Hamano @ 2008-05-21 17:30 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: Jakub Narebski, git

Junio C Hamano <gitster@pobox.com> writes:

> "Sverre Rabbelier" <srabbelier@gmail.com> writes:
>
>>>  >>  Details I think need to be provided by maintainer...
>>>  >
>>>  > Do you mean Junio, or the user of the program?
>>>
>>>  I mean that all I can provide is speculation.  I am not, and never was,
>>>  a maintainer of an OSS project, and I don't know what criteria one uses
>>>  (perhaps unvoiced criteria) to decide whether a given patch needs to be
>>>  examined more closely, or whether cursory browsing is enough.
>>
>> I reckon more input from actual maintainers would be needed then.
>> Junio: aside from the original list of suggestions you provided,
>> could you shed some light on this as git maintainer?
> ...
> Project maintainers and old timers become familiar with habits, strengths
> and weaknesses of known contributors over time, and that is the source of
> such trust.

I just realized another thing about "the source of trust".  The
"statistics" would count _only_ what gets accepted, but maintainers and
list participants have a much richer set of datapoints to judge the
strengths and weaknesses of contributors --- rejects.

An early round of contribution from somebody needs deeper review if the
contributor has a history of taking many rounds of refinements to get a
rather trivial change into an acceptable shape.  IOW, over time people can
learn who is meticulous and who is careless from rejection counts, which
are not recorded in the committed history.
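
Purely as a sketch of where such a count could come from outside the
committed history: if resubmissions are tagged in the mail subject as
"[PATCH v2]", "[PATCH v3]" and so on (an assumption about list
convention, not something git records), the highest version seen per
patch title approximates the number of rounds. The subjects below are
invented, and replies ("Re: ...") are deliberately ignored.

```python
import re
from typing import Dict, Iterable

# "[PATCH]", "[PATCH v3]", "[PATCH 1/2]", "[RFC PATCH v2 3/5]" followed by a title.
TAG = re.compile(r"^\[(?:RFC )?PATCH(?: v(\d+))?(?: \d+/\d+)?\]\s*(.*)$")


def rounds_per_topic(subjects: Iterable[str]) -> Dict[str, int]:
    """Map each patch title to the highest version number seen for it.

    A subject without an explicit "vN" counts as round 1.
    """
    rounds: Dict[str, int] = {}
    for subject in subjects:
        match = TAG.match(subject.strip())
        if match is None:
            continue  # replies and unrelated mail are skipped
        version = int(match.group(1) or 1)
        title = match.group(2)
        rounds[title] = max(rounds.get(title, 0), version)
    return rounds


# Invented subjects, for illustration only.
subjects = [
    "[PATCH] fix typo in README",
    "[PATCH v2] fix typo in README",
    "Re: [PATCH v2] fix typo in README",
    "[PATCH v3] fix typo in README",
    "[PATCH 1/2] add stats command",
]

print(rounds_per_topic(subjects))
```

Three rounds for a typo fix is the kind of signal meant above, but it
only exists if the review traffic is available to the tool.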


* Re: [RFC] Use cases for 'git statistics'
  2008-05-21 17:30                     ` Junio C Hamano
@ 2008-05-21 20:52                       ` Sverre Rabbelier
  0 siblings, 0 replies; 16+ messages in thread
From: Sverre Rabbelier @ 2008-05-21 20:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jakub Narebski, git

On Wed, May 21, 2008 at 7:30 PM, Junio C Hamano <gitster@pobox.com> wrote:

> I just realized another thing about "the source of trust".  The
> "statistics" would count _only_ what gets accepted, but maintainers and
> list participants have a much richer set of datapoints to judge the
> strengths and weaknesses of contributors --- rejects.

I think I know where this is coming from and what you say makes sense.

> An early round of contribution from somebody needs deeper review if the
> contributor has a history of taking many rounds of refinements to get a
> rather trivial change into an acceptable shape.  IOW, over time people can
> learn who is meticulous and who is careless from rejection counts, which
> are not recorded in the committed history.

Yup, in order to gather that kind of data a more elaborate tool (one
that is integrated with a review tool like Rietveld or such) together
with the VCS would be required. I'm confident that this project will
result in useful statistics. Perhaps they will not be enough to
determine which patches to let in and which ones to reject without
human interference (I'm actually quite sure that won't happen), but I
do think other useful statistics may be gathered and used
nevertheless.

-- 
Cheers,

Sverre Rabbelier


end of thread

Thread overview: 16+ messages
2008-05-08 15:51 [RFC] Use cases for 'git statistics' Sverre Rabbelier
2008-05-12  9:38 ` Sverre Rabbelier
2008-05-12 10:16   ` Jakub Narebski
2008-05-12 10:19     ` Sverre Rabbelier
2008-05-12 11:19       ` Jakub Narebski
2008-05-12 11:49         ` Sverre Rabbelier
2008-05-12 12:40           ` Jakub Narebski
2008-05-12 13:01             ` Sverre Rabbelier
     [not found]             ` <bd6139dc0805120604m349b1fbbr39c6dcb8d893e771@mail.gmail.com>
2008-05-13 13:07               ` Jakub Narebski
2008-05-13 13:37                 ` Sverre Rabbelier
2008-05-14 20:34                   ` Jakub Narebski
2008-05-15 12:21                     ` Andreas Ericsson
2008-05-17  0:02                   ` Junio C Hamano
2008-05-18  1:01                     ` Sverre Rabbelier
2008-05-21 17:30                     ` Junio C Hamano
2008-05-21 20:52                       ` Sverre Rabbelier
