* GSoC draft proposal: Line-level history browser
@ 2010-03-20  9:18 Bo Yang
  2010-03-20 11:30 ` Johannes Schindelin
  2010-03-20 20:35 ` Alex Riesen
  0 siblings, 2 replies; 54+ messages in thread
From: Bo Yang @ 2010-03-20  9:18 UTC (permalink / raw)
  To: git

Hi,

I am very interested in the project 'Line-level history browser'.
After some days of consideration, I have written a draft of my
proposal, and I think it is helpful to send it to the list before
submitting it. Could you please give me some advice?

-----------------------------------------------
Draft proposal: Line-level History Browser

=====Purpose of this project=====
"git blame" can tell us who is responsible for a line of code, but it
can't help if we want to see in detail how the lines of code have
evolved into what they are now.
This project will add a new utility for git called 'git line-log'. It
can trace the history of any line range of a certain file at any
revision. For simplicity, users can run a command like: ' git line-log
builtin/diff.c 6..8 ' to get the change history of the code between
line 6 and line 8 of the diff.c file. For each history entry, it
will provide the commit and the diff block which contains the changes
to the lines the user is interested in.
This utility will trace the whole modification history of the lines of
interest and stop when it finds the root of the lines, which is the
point where all the code is added from scratch. The user can also
specify how deeply the utility should trace. This tool will treat a
code move just like a modification, so it will follow code moves
inside one file.
Note that the history may not always be a single thread of commits.
If more than one commit produced the specified line range, the thread
of history will split. The utility will then stop, present all of
these commits with their code changes, and let the user select which
one to trace next.

=====Work and technical issues=====
==Command options==
This new tool should be used for exploring the history of changes to
a certain line range of code in one file.

git line-log [options] <file> <line range>

Options:
1. Since it will output the commit description, there will be an
option to control whether we show the whole commit message or just a
short title.
2. An option to choose whether we display only the diff block that
touches the lines of interest [default] or display the whole diff with
the area of interest highlighted in color.
3. The max depth to which we trace into the commit history.
4. The revision of the <file>. This is very useful when the current
line range of interest is produced by more than one commit. The user
can use this option to specify the file revision and trace down from
that revision and line range.

<line range>
Its format should be <start pos>..<end pos> or just a <line number>.
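
As a rough illustration of the range format described above, here is a minimal Python sketch of a parser. The function name parse_line_range and the choice of a 1-based inclusive result are assumptions made for this example; they are not part of the proposal.

```python
from typing import Tuple


def parse_line_range(spec: str) -> Tuple[int, int]:
    """Parse '<start>..<end>' or '<line>' into a 1-based inclusive range.

    Hypothetical helper: a single line number N is treated as N..N.
    """
    start_text, sep, end_text = spec.partition("..")
    start = int(start_text)
    end = int(end_text) if sep else start
    if start < 1 or end < start:
        raise ValueError(f"invalid line range: {spec!r}")
    return start, end
```

With this sketch, "6..8" yields (6, 8) and "7" yields (7, 7); a reversed range such as "8..6" is rejected.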

==Design and implementation==
Git stores whole blobs instead of code deltas, so we should traverse
the commit history and directly access the tree/blob objects to
compute the code delta and search for the diff which contains the
lines of interest. Since git uses libxdiff to generate its diffs, we
should iterate through all of xdiff's diff blocks and find what the
code looked like before the commit. This gives us a new line range,
which is the original code before this commit. We then start another
search from this commit with the new line range. Recursively, we can
find the whole modification history. We stop when we find that the
current line range of interest was added from scratch and was not
moved from another place in the file. We may also stop the traversal
when we reach the max search depth. Also, if the thread of change
history splits into two or more commits, we stop and present the user
with all the related commits and corresponding line ranges.
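
The central step above, mapping a line range in the post-commit file back to the range it came from in the pre-commit file, can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's difflib in place of libxdiff, and the name map_range_to_parent and its return convention are invented for the example.

```python
import difflib
from typing import List, Optional, Tuple


def map_range_to_parent(
    old: List[str], new: List[str], start: int, end: int
) -> Tuple[Optional[Tuple[int, int]], bool]:
    """Map the 1-based inclusive range start..end of `new` back to `old`.

    Returns (old_range, changed). old_range is None when every line in
    the range was added by this change ("added from scratch").
    """
    lo, hi = start - 1, end  # 0-based, half-open range in `new`
    spans = []
    changed = False
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged lines map one-to-one; keep only the overlap.
            a, b = max(lo, j1), min(hi, j2)
            if a < b:
                spans.append((i1 + a - j1, i1 + b - j1))
        elif j1 < hi and j2 > lo:
            # A replace/insert overlapping the range, or a delete that
            # falls strictly inside it (for a delete, j1 == j2).
            changed = True
            if i1 < i2:
                spans.append((i1, i2))
    if not spans:
        return None, changed
    first = min(s for s, _ in spans)
    last = max(e for _, e in spans)
    return (first + 1, last), changed
```

For example, with old = ["a", "b", "c", "d"] and new = ["a", "x", "b", "C", "d"], lines 3..4 of the new file map back to lines 2..3 of the old file and are reported as changed, while line 2 (the inserted "x") has no origin in the old file.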

As for the implementation, this tool depends heavily on libxdiff,
because we will search xdiff's output for the lines of interest to
find the right diff hunk to display and trace down. So how we search
xdiff's diff blocks is very important. After reading some libxdiff
documentation and code, I find that libxdiff outputs all the diff
blocks as strings into a memory file. Parsing the diff block strings
to find the changed lines would be very inefficient. So, I suggest
changing xdiff's xdl_diff function to let it store some metadata for
each diff hunk. I think this will be very helpful for the performance
of this tool.

Generally,
1. xdiff/xdiffi.c will be changed to make xdl_diff store the desired
metadata and pass it to the caller.
2. builtin/line-log.c will be added to implement most of the new
features; the most important function there will probably be
cmd_linelog.
3. git.c will be changed to add this new utility to the front end.
4. Documentation will be updated to introduce this new tool.

=====About me=====
I am Bo Yang, a Chinese graduate student majoring in Computer Science
at NanKai University. I first got involved with open source software
5 years ago and began to contribute code to the open source community
three years ago. I have contributed to Mozilla/Mingw/Netsurf.
Technically, I am experienced in C/Bash Shell. I took part in last
year's GSoC with the Netsurf project. In that project, I completed
most of a DOM library in C.
I began to use git for source code revision control about two years
ago. I use Git to track my Mozilla trunk source code, because updating
Mozilla code by CVS at my school is very slow. So, I wrote a script
to automatically update the trunk with CVS at midnight, when the
network is fast, on the server, and then use Git to maintain the
code. Then I use Git on my PC to clone/update the source code from my
local server, and that is very fast. I use Git to track my changes to
the code and some bug fixes. It is an excellent tool for
branch/history, I think.
Git is my favorite daily tool for revision control. I have much
experience with it, have read "Git Internals", and have also gained
some basic knowledge of Git's code base. I think the line-level
history explorer is really suitable for me and I can make a good start
with this project in the Git community.
-----------------------------------------------

Any feedback from you will be appreciated very much, thanks a lot!

Regards!
Bo
-- 
My blog: http://blog.morebits.org

* Re: GSoC draft proposal: Line-level history browser
  2010-03-20  9:18 GSoC draft proposal: Line-level history browser Bo Yang
@ 2010-03-20 11:30 ` Johannes Schindelin
  2010-03-20 13:10   ` Bo Yang
  2010-03-20 20:35 ` Alex Riesen
  1 sibling, 1 reply; 54+ messages in thread
From: Johannes Schindelin @ 2010-03-20 11:30 UTC (permalink / raw)
  To: Bo Yang; +Cc: git

Hi,

On Sat, 20 Mar 2010, Bo Yang wrote:

> I am very interested in the project 'Line-level history browser', after 
> some days consideration, now I made up a draft of my proposal, I think 
> it is helpful to send it to the list before submitting it. Could you 
> please give me some advise?

I like it very much already! You obviously put in a substantial amount of 
time to learn intricate details about the way Git operates, and what is 
already available.

And you also provided a patch (unrelated to line-level history browser), 
so you proved that you actually cloned Git, and that you can actually 
patch it and use Git itself to send a patch to this list.

Very good.

Just a few constructive criticisms (inlined):

> This project will add a new utility for git called 'git line-log'. It 
> can trace the history of any line range of certain file at any revision.

I think that that might be good for starters, but one could imagine that 
an integration into "git log" might be even better, so that gitk can use 
this without any further changes.

> For simplity, users can run the command like: ' git line-log 
> builtin/diff.c 6..8 ', he will get the change history of code between 
> line 6 and line 8 of the diff.c file.

It would be good if the code looked harder after failing with the simple 
strategy, such as looking for code removed in other files, fuzzy matching 
(optional), and looking for code duplication (i.e. literal copying, or 
slightly modified copying).

The fuzzy matching might be necessary to catch things like a Java class 
moving from one file into another (and changing its name): the first line 
changes, but not completely.

> After reading some libxdiff document and code, I find that libxdiff 
> output all the diff blocks as string into a memory file.

Almost.

Just have a look at the word-level diff (--color-words):

http://repo.or.cz/w/git/dscho.git/blob/bc1ed6aafd9ee4937559535c66c8bddf1864bec6:/diff.c#l382

You will see that there is a function fn_out_diff_words_aux(), which is 
passed to xdi_diff_outf(). That latter function calls xdiff such that the 
former function receives a complete line at a time. And this is what I 
would suggest doing in the line-level log, too.

Ciao,
Dscho

* Re: GSoC draft proposal: Line-level history browser
  2010-03-20 11:30 ` Johannes Schindelin
@ 2010-03-20 13:10   ` Bo Yang
  2010-03-20 13:30     ` Junio C Hamano
  2010-03-20 13:36     ` Johannes Schindelin
  0 siblings, 2 replies; 54+ messages in thread
From: Bo Yang @ 2010-03-20 13:10 UTC (permalink / raw)
  To: git

Hi Johannes,

    Thank you very much for your advice!

>
> I like it very much already! You obviously put in a substantial amount of
> time to learn intricate details about the way Git operates, and what is
> already available.
>
> And you also provided a patch (unrelated to line-level history browser),
> so you proved that you actually cloned Git, and that you can actually
> patch it and use Git itself to send a patch to this list.

I am very happy you like it.

>
> I think that that might be good for starters, but one could imagine that
> an integration into "git log" might be even better, so that gitk can use
> this without any further changes.

So, I think adding some new options to 'git log' is preferred.

>
> It would be good if the code looked harder after failing with the simple
> strategy, such as looking for code removed in other files, fuzzy matching
> (optional), and looking for code duplication (i.e. literal copying, or
> slightly modified copying).
>
> The fuzzy matching might be necessary to catch things like a Java class
> moving from one file into another (and changing its name): the first line
> changes, but not completely.

That's really a good idea.
So, when the program reaches the end of the history thread for some
line range, it should not stop immediately. It should then perform a
more thorough search and try to find whether the newly added lines of
code were moved there or just copied there from another place. And
this kind of search should use fuzzy matching instead of exact string
matching.

But notice that detecting code movement within one commit is much more
efficient than detecting code copies. So, I think we should add an
option to control whether we detect such code copies. By default, we
detect code moves but not code copies. What do you think about this?
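
To make the move search with optional fuzzy matching concrete, here is a small Python sketch. It is an assumption-laden illustration, not how git blame or the eventual feature works: it only searches the previous version of the same file, it compares whole blocks with difflib's similarity ratio, and the name find_block_origin and the min_ratio parameter are hypothetical.

```python
import difflib
from typing import List, Optional, Tuple


def find_block_origin(
    old_lines: List[str], block: List[str], min_ratio: float = 1.0
) -> Tuple[Optional[int], float]:
    """Look for `block` (newly added lines) in the previous file version.

    Returns (1-based start line in old_lines or None, similarity).
    min_ratio == 1.0 means exact matching only; lower values enable a
    simple fuzzy match.
    """
    if not block or len(block) > len(old_lines):
        return None, 0.0
    target = "\n".join(block)
    best_start, best_ratio = None, 0.0
    for start in range(len(old_lines) - len(block) + 1):
        candidate = "\n".join(old_lines[start:start + len(block)])
        if candidate == target:
            return start + 1, 1.0  # exact match wins immediately
        if min_ratio < 1.0:
            ratio = difflib.SequenceMatcher(
                None, candidate, target, autojunk=False
            ).ratio()
            if ratio >= min_ratio and ratio > best_ratio:
                best_start, best_ratio = start + 1, ratio
    return best_start, best_ratio
```

Searching only one file's previous version is cheap, which matches the point that move detection inside a commit costs less than copy detection; a copy search would have to run the same scan over many more candidate files.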

> Just have a look at the word-level diff (--color-words):
>
> http://repo.or.cz/w/git/dscho.git/blob/bc1ed6aafd9ee4937559535c66c8bddf1864bec6:/diff.c#l382
>
> You will see that there is a function fn_out_diff_words_aux(), which is
> passed to xdi_diff_outf(). That latter function calls xdiff such that the
> former function receives a complete line at a time. And this is what I
> would suggest doing in the line-level log, too.

I have looked over the function fn_out_diff_words_aux; this function
parses each line of an in-memory diff. We can use it to detect the
diff hunk header and find the line changes. If you think the
performance is acceptable, I think using this callback mechanism is
all right.
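
A line-at-a-time consumer of this kind could look roughly like the Python sketch below. It is not git's C callback interface; the class HunkCollector and its method names are made up for illustration, and it only relies on the standard unified diff hunk header form "@@ -a,b +c,d @@", where an omitted count means 1.

```python
import re
from typing import List, Tuple

# Unified diff hunk header: @@ -old_start[,old_count] +new_start[,new_count] @@
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


class HunkCollector:
    """Receives diff output one line at a time and records hunk ranges."""

    def __init__(self) -> None:
        # Each entry: (old_start, old_count, new_start, new_count)
        self.hunks: List[Tuple[int, int, int, int]] = []

    def consume_line(self, line: str) -> None:
        match = HUNK_RE.match(line)
        if match is None:
            return  # file headers, context, '+' and '-' lines are ignored here
        old_start, old_count, new_start, new_count = match.groups()
        self.hunks.append((
            int(old_start),
            int(old_count) if old_count is not None else 1,
            int(new_start),
            int(new_count) if new_count is not None else 1,
        ))
```

Feeding every diff line to consume_line leaves the hunk ranges in hunks, which is the information a line-level log needs in order to decide whether a hunk touches the range of interest.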

Regards!
Bo
-- 
My blog: http://blog.morebits.org

* Re: GSoC draft proposal: Line-level history browser
  2010-03-20 13:10   ` Bo Yang
@ 2010-03-20 13:30     ` Junio C Hamano
  2010-03-21  6:03       ` Bo Yang
  2010-03-20 13:36     ` Johannes Schindelin
  1 sibling, 1 reply; 54+ messages in thread
From: Junio C Hamano @ 2010-03-20 13:30 UTC (permalink / raw)
  To: Bo Yang; +Cc: git

Bo Yang <struggleyb.nku@gmail.com> writes:

> But notice that, detect code movement in one commit is much efficient
> than detecting code copy. So, I think we should add an option to
> control whether we detect such kind of code copy.

If you are hooking into "git log", it already has "-M / -C / -C -C" as a
notion to express "different levels of digging" to find code movement and
copies, and so does "git blame".  You probably will save a lot of time if
you studied the current blame implementation thoroughly before designing
or coding.

Two things that you need to think about carefully is why "blame" stops at
the commits it shows, and if you could "peel" these lines in its output to
peek what are behind the lines, what you would see.  This is not a rocket
science topic, but it is not entirely trivial.

* Re: GSoC draft proposal: Line-level history browser
  2010-03-20 13:10   ` Bo Yang
  2010-03-20 13:30     ` Junio C Hamano
@ 2010-03-20 13:36     ` Johannes Schindelin
  2010-03-21  6:05       ` Bo Yang
  1 sibling, 1 reply; 54+ messages in thread
From: Johannes Schindelin @ 2010-03-20 13:36 UTC (permalink / raw)
  To: Bo Yang; +Cc: git

Hi,

[please do not cull the Cc: list]

On Sat, 20 Mar 2010, Bo Yang wrote:

> I (Johannes) wrote:
>
> > I think that that might be good for starters, but one could imagine 
> > that an integration into "git log" might be even better, so that gitk 
> > can use this without any further changes.
> 
> So, I think add some new options to 'git log' is preferred.

Yes, I think that this should be the target for the user interface. 
However, the logic should be different enough to merit a completely new 
file for the code (think "git add --interactive").

> > It would be good if the code looked harder after failing with the 
> > simple strategy, such as looking for code removed in other files, 
> > fuzzy matching (optional), and looking for code duplication (i.e. 
> > literal copying, or slightly modified copying).
> >
> > The fuzzy matching might be necessary to catch things like a Java 
> > class moving from one file into another (and changing its name): the 
> > first line changes, but not completely.
> 
> That's really a good idea.
> So, when the program reach the end of the history thread of some
> changes of line range, it should not stop immediately. It then should
> make a harder code search and try to find whether the new add lines of
> code is moved to there or just copied from other place to there. And
> these kind of search should use fuzzy matching instead of exact string
> matching.
> 
> But notice that, detect code movement in one commit is much efficient
> than detecting code copy. So, I think we should add an option to
> control whether we detect such kind of code copy. By default, we
> detect code move but not code copy. How do you think about this?

Yes, it is much more difficult, and it is more expensive. So: there are 
several steps in the project (you could also call them "milestones"), and 
fuzzy matching end lines would come later than simple code movement. And 
still later than code movement between files.

> > Just have a look at the word-level diff (--color-words):
> >
> > http://repo.or.cz/w/git/dscho.git/blob/bc1ed6aafd9ee4937559535c66c8bddf1864bec6:/diff.c#l382
> >
> > You will see that there is a function fn_out_diff_words_aux(), which 
> > is passed to xdi_diff_outf(). That latter function calls xdiff such 
> > that the former function receives a complete line at a time. And this 
> > is what I would suggest doing in the line-level log, too.
> 
> I have look over the function fn_out_diff_words_aux, this function parse 
> each line of a memory diff. We can use it to detect the diff hunk head 
> and find the line change. If you think the performance is acceptable, I 
> think using this callback mechanism is all right.

Yes, I think that the performance is alright there, it works well enough 
for --color-words.

Thanks,
Dscho

* Re: GSoC draft proposal: Line-level history browser
  2010-03-20  9:18 GSoC draft proposal: Line-level history browser Bo Yang
  2010-03-20 11:30 ` Johannes Schindelin
@ 2010-03-20 20:35 ` Alex Riesen
  2010-03-20 20:57   ` Junio C Hamano
  2010-03-20 21:58   ` A Large Angry SCM
  1 sibling, 2 replies; 54+ messages in thread
From: Alex Riesen @ 2010-03-20 20:35 UTC (permalink / raw)
  To: Bo Yang; +Cc: git

On Sat, Mar 20, 2010 at 10:18, Bo Yang <struggleyb.nku@gmail.com> wrote:
> <line range>
> Its format should be <start pos>..<end pos> or just a <line number>.

You might want to reconsider the line range syntax. Exactly the same syntax
is already used to specify a commit range, so reusing it may lead to confusion.

* Re: GSoC draft proposal: Line-level history browser
  2010-03-20 20:35 ` Alex Riesen
@ 2010-03-20 20:57   ` Junio C Hamano
  2010-03-21  6:10     ` Bo Yang
  2010-03-20 21:58   ` A Large Angry SCM
  1 sibling, 1 reply; 54+ messages in thread
From: Junio C Hamano @ 2010-03-20 20:57 UTC (permalink / raw)
  To: Bo Yang; +Cc: Alex Riesen, git

Alex Riesen <raa.lkml@gmail.com> writes:

> On Sat, Mar 20, 2010 at 10:18, Bo Yang <struggleyb.nku@gmail.com> wrote:
>> <line range>
>> Its format should be <start pos>..<end pos> or just a <line number>.
>
> You might want to reconsider the line range syntax. Exactly the same syntax
> is already used to specify a commit range, so reusing it may lead to confusion.

I would actually recommend you take a look at -L option from blame.  What
I use most often and find very handy myself is this pattern:

	blame -L '/^void some_function()/,/^}/' -- path

as I do not have to count the line numbers.
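
For readers unfamiliar with this form, the idea of turning a pair of regexes into a line range can be sketched in Python as below. This is a simplified illustration, not blame's implementation: the function name resolve_regex_range is hypothetical, the end pattern is searched starting on the line after the start match (an assumption), and Python's re syntax differs from POSIX regexes (parentheses must be escaped here).

```python
import re
from typing import List, Tuple


def resolve_regex_range(
    lines: List[str], start_re: str, end_re: str
) -> Tuple[int, int]:
    """Return the 1-based inclusive range from the first line matching
    start_re to the next later line matching end_re."""
    start_pat = re.compile(start_re)
    end_pat = re.compile(end_re)
    for i, line in enumerate(lines):
        if start_pat.search(line):
            for j in range(i + 1, len(lines)):
                if end_pat.search(lines[j]):
                    return i + 1, j + 1
            raise ValueError(f"no line after {i + 1} matches {end_re!r}")
    raise ValueError(f"no line matches {start_re!r}")
```

Given a file whose second line is "void f(void)" and whose fifth line is "}", the call resolve_regex_range(lines, r"^void f\(", r"^}") selects lines 2 to 5 without anyone counting them by hand.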

There also was a discussion on allowing more than one -L to blame, which I
think is applicable to this feature.  Check the list archive for the past
few months.

* Re: GSoC draft proposal: Line-level history browser
  2010-03-20 20:35 ` Alex Riesen
  2010-03-20 20:57   ` Junio C Hamano
@ 2010-03-20 21:58   ` A Large Angry SCM
  2010-03-21  6:16     ` Bo Yang
  1 sibling, 1 reply; 54+ messages in thread
From: A Large Angry SCM @ 2010-03-20 21:58 UTC (permalink / raw)
  To: Alex Riesen; +Cc: Bo Yang, git

Alex Riesen wrote:
> On Sat, Mar 20, 2010 at 10:18, Bo Yang <struggleyb.nku@gmail.com> wrote:
>> <line range>
>> Its format should be <start pos>..<end pos> or just a <line number>.
> 
> You might want to reconsider the line range syntax. Exactly the same syntax
> is already used to specify a commit range, so reusing it may lead to confusion.

I, actually, think the proposed line range syntax works because it uses 
the same _range_ notation. The issue is how to differentiate the _line_ 
range(s) from the _commit_ range(s); and, yes, I would like multiple 
ranges of each type as well as multiple files.

* Re: GSoC draft proposal: Line-level history browser
  2010-03-20 13:30     ` Junio C Hamano
@ 2010-03-21  6:03       ` Bo Yang
  0 siblings, 0 replies; 54+ messages in thread
From: Bo Yang @ 2010-03-21  6:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Hi Junio,

   Thank you very much for your advice.
>
> If you are hooking into "git log", it already has "-M / -C / -C -C" as a
> notion to express "different levels of digging" to find code movement and
> copies, and so does "git blame".  You probably will save a lot of time if
> you studied the current blame implementation thoroughly before designing
> or coding.

Yes, both blame and log have such '-M/-C/-C -C' options.  But their
meanings are not quite the same:
For 'git log': -M is used to detect file renames, -C is used to trace
code copies. Neither option accepts an argument.
For 'git blame': -M is used to trace code moves, -C is used to trace
code copies. And both options accept a <num> which specifies the lower
bound on the number of matching characters.
And I think the line-level history tool acts more like 'git blame'.
So the '-C' option of 'git log' is exactly what we need, but '-M' is
not. So I think maybe we should add another '-m' option to 'git log'
for line-level code movement detection.

I have taken a rough look over blame.c; it is really very helpful, and
I find I can borrow some code from 'git blame' to build the line-level
history browser.

Thanks a lot!

>
> Two things that you need to think about carefully is why "blame" stops at
> the commits it shows, and if you could "peel" these lines in its output to
> peek what are behind the lines, what you would see.  This is not a rocket
> science topic, but it is not entirely trivial.

I think blame's purpose is to find who is responsible for each line
of code. So it stops after it finds the origin of the code. The
line-level history browser will continue further back into history
from what blame found: it will find what the lines looked like before
this commit, and go backward through history from those original lines
to get an older state, and so on. Simply put, it is something like
running 'git blame' recursively. :)
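
The "blame recursively" idea can be sketched in Python over a simplified linear history. Everything here is an assumption for illustration: history is modelled as a plain list of file versions (oldest first) rather than a commit graph, difflib stands in for xdiff, and the names map_range_to_parent and trace_history are hypothetical. The mapping helper is repeated so the block runs on its own.

```python
import difflib
from typing import List, Optional, Tuple


def map_range_to_parent(old, new, start, end):
    """Map 1-based inclusive start..end of `new` to `old`.

    Returns (range in old or None if fully added here, changed flag).
    """
    lo, hi = start - 1, end
    spans, changed = [], False
    ops = difflib.SequenceMatcher(a=old, b=new, autojunk=False).get_opcodes()
    for tag, i1, i2, j1, j2 in ops:
        if tag == "equal":
            a, b = max(lo, j1), min(hi, j2)
            if a < b:
                spans.append((i1 + a - j1, i1 + b - j1))
        elif j1 < hi and j2 > lo:
            changed = True
            if i1 < i2:
                spans.append((i1, i2))
    if not spans:
        return None, changed
    return (min(s for s, _ in spans) + 1, max(e for _, e in spans)), changed


def trace_history(
    versions: List[List[str]], start: int, end: int,
    max_depth: Optional[int] = None,
) -> List[Tuple[int, int, int]]:
    """Walk backward from the newest version, following the line range.

    Returns (version_index, start, end) for each version that changed
    the range, newest first. Stops when the range was added from
    scratch, when the root is reached, or after max_depth entries.
    """
    entries = []
    idx = len(versions) - 1
    while idx > 0 and (max_depth is None or len(entries) < max_depth):
        mapped, changed = map_range_to_parent(
            versions[idx - 1], versions[idx], start, end
        )
        if changed:
            entries.append((idx, start, end))
        if mapped is None:
            return entries  # the range did not exist before this version
        start, end = mapped
        idx -= 1
    if idx == 0 and (max_depth is None or len(entries) < max_depth):
        entries.append((0, start, end))  # the root version introduced the rest
    return entries
```

Unlike blame, which would stop at the first version reported for each line, this walk keeps going with the mapped range until nothing is left to follow.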

Thanks again for your advice, I have learned a lot from your feedback!

Regards!
Bo

* Re: GSoC draft proposal: Line-level history browser
  2010-03-20 13:36     ` Johannes Schindelin
@ 2010-03-21  6:05       ` Bo Yang
  0 siblings, 0 replies; 54+ messages in thread
From: Bo Yang @ 2010-03-21  6:05 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

>>
>> So, I think add some new options to 'git log' is preferred.
>
> Yes, I think that this should be the target for the user interface.
> However, the logic should be different enough to merit a completely new
> file for the code (think "git add --interactive").

So, a new file builtin/line-level.c will be added.

>
> Yes, it is much more difficult, and it is more expensive. So: there are
> several steps in the project (you could also call them "milestones"), and
> fuzzy matching end lines would come later than simple code movement. And
> still later than code movement between files.

OK, I will add some milestones to the next version of my proposal, thanks.

Regards!
Bo

* Re: GSoC draft proposal: Line-level history browser
  2010-03-20 20:57   ` Junio C Hamano
@ 2010-03-21  6:10     ` Bo Yang
  0 siblings, 0 replies; 54+ messages in thread
From: Bo Yang @ 2010-03-21  6:10 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Alex Riesen, git

>
> I would actually recommend you take a look at -L option from blame.  What
> I use most often and find very handy myself is this pattern:
>
>        blame -L '/^void some_function()/,/^}/' -- path
>
> as I do not have to count the line numbers.

I have looked at that option and I find it very convenient; the
line-level browser will adopt that line syntax, too.

> There also was a discussion on allowing more than one -L to blame, which I
> think is applicable to this feature.  Check the list archive for the past
> few months.

I think it is reasonable for 'git blame' to allow more than one -L to
let users see more than one block of code. But for a tool which is
used to explore history, I think the user mostly focuses on one thread
of history. If the history splits at some point, we should ask the
user to choose one to go on with. So I think the line-level browser
need not support such a thing. :)

Regards!
Bo

* Re: GSoC draft proposal: Line-level history browser
  2010-03-20 21:58   ` A Large Angry SCM
@ 2010-03-21  6:16     ` Bo Yang
  2010-03-21 13:19       ` A Large Angry SCM
  0 siblings, 1 reply; 54+ messages in thread
From: Bo Yang @ 2010-03-21  6:16 UTC (permalink / raw)
  To: gitzilla; +Cc: Alex Riesen, git

On Sun, Mar 21, 2010 at 5:58 AM, A Large Angry SCM <gitzilla@gmail.com> wrote:
> Alex Riesen wrote:
>>
>> On Sat, Mar 20, 2010 at 10:18, Bo Yang <struggleyb.nku@gmail.com> wrote:
>>>
>>> <line range>
>>> Its format should be <start pos>..<end pos> or just a <line number>.
>>
>> You might want to reconsider the line range syntax. Exactly the same
>> syntax
>> is already used to specify a commit range, so reusing it may lead to
>> confusion.
>
> I, actually, think the proposed line range syntax works because it uses the
> same _range_ notation. The issue is how to differentiate the _line_ range(s)
> from the _commit_ range(s); and, yes, I would like multiple ranges of each
> type as well as multiple files.

As I said in my previous post, I think we should adopt the 'git blame'
way: use '-L <start pos>,<end pos>' to specify the line range. It
supports both line numbers and POSIX regexes.
As for multiple ranges, I don't think it is very useful to support
them in a history browser. After all, a user can only focus on one
thread of history at a time. I would be glad to hear what your use
case for multiple ranges is.

Thanks for your precious advice!

Regards!
Bo

* Re: GSoC draft proposal: Line-level history browser
  2010-03-21  6:16     ` Bo Yang
@ 2010-03-21 13:19       ` A Large Angry SCM
  2010-03-22  3:48         ` Bo Yang
  2010-03-22  3:52         ` Bo Yang
  0 siblings, 2 replies; 54+ messages in thread
From: A Large Angry SCM @ 2010-03-21 13:19 UTC (permalink / raw)
  To: Bo Yang; +Cc: Alex Riesen, git

Bo Yang wrote:
[...]
> For multiple ranges stuff, I don't think it is very useful to support
> it for a history browser. Anyway, our users can only focus on one line
> of thread history. I am very willing to listen what is your use case
> for a multiple ranges?

More than one line range can be related and of interest to a 
forensics/archeology task.

In a simple multi range case, you'd have 2 line ranges in the same file 
that you want to see the history and graph of. Such as 2 related macro 
definitions in a header file.

In a complex multi range case, you'd have many line ranges spread over 
multiple blobs and some of the blobs have disjoint commit graphs.

The complex multi range case may be too much for a GSOC project, and the 
simple multi range case may be also. However, the command syntax should 
be general enough to handle them without being too ugly so that the 
implementation could be improved and expanded later.

* Re: GSoC draft proposal: Line-level history browser
  2010-03-21 13:19       ` A Large Angry SCM
@ 2010-03-22  3:48         ` Bo Yang
  2010-03-22  4:24           ` Junio C Hamano
  2010-03-22  3:52         ` Bo Yang
  1 sibling, 1 reply; 54+ messages in thread
From: Bo Yang @ 2010-03-22  3:48 UTC (permalink / raw)
  To: gitzilla; +Cc: Alex Riesen, git

On Sun, Mar 21, 2010 at 9:19 PM, A Large Angry SCM <gitzilla@gmail.com> wrote:
> Bo Yang wrote:
> [...]
>>
>> For multiple ranges stuff, I don't think it is very useful to support
>> it for a history browser. Anyway, our users can only focus on one line
>> of thread history. I am very willing to listen what is your use case
>> for a multiple ranges?
>
> More than one line range can be related and of interest to a
> forensics/archeology task.
>
> In a simple multi range case, you'd have 2 line ranges in the same file that
> you want to see the history and graph of. Such as 2 related macro
> definitions in a header file.
>
> In a complex multi range case, you'd have many line ranges spread over
> multiple blobs and some of the blobs have disjoint commit graphs.
>
> The complex multi range case may be too much for a GSOC project, and the
> simple multi range case may be also. However, the command syntax should be
> general enough to handle them without being too ugly so that the
> implementation could be improved and expanded later.

Yeah, what do you think about using the following syntax:

<file1>@<rev1>:<start pos>,<end pos> <file2>@<rev2>:<start pos>,<end pos>
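
To show how such an argument might be split apart, here is a minimal Python sketch. It makes several simplifying assumptions that are not in the proposal: positions are plain line numbers only, the revision contains neither '@' nor ':' (so forms like HEAD@{1} are not handled), and the names RangeSpec and parse_range_spec are hypothetical.

```python
import re
from typing import NamedTuple

# <file>@<rev>:<start>,<end>, numeric positions only (simplification)
SPEC_RE = re.compile(
    r"^(?P<path>.+)@(?P<rev>[^@:]+):(?P<start>\d+),(?P<end>\d+)$"
)


class RangeSpec(NamedTuple):
    path: str
    rev: str
    start: int
    end: int


def parse_range_spec(spec: str) -> RangeSpec:
    """Split one '<file>@<rev>:<start>,<end>' argument into its parts."""
    match = SPEC_RE.match(spec)
    if match is None:
        raise ValueError(f"cannot parse range spec: {spec!r}")
    return RangeSpec(
        path=match.group("path"),
        rev=match.group("rev"),
        start=int(match.group("start")),
        end=int(match.group("end")),
    )
```

For instance, "builtin/diff.c@HEAD~3:6,8" splits into the path builtin/diff.c, the revision HEAD~3, and lines 6 to 8. A real implementation would also need to decide how regex positions containing ',' or ':' are delimited.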

Thanks!

Regards!
Bo

* Re: GSoC draft proposal: Line-level history browser
  2010-03-21 13:19       ` A Large Angry SCM
  2010-03-22  3:48         ` Bo Yang
@ 2010-03-22  3:52         ` Bo Yang
  2010-03-22 15:48           ` Jakub Narebski
                             ` (2 more replies)
  1 sibling, 3 replies; 54+ messages in thread
From: Bo Yang @ 2010-03-22  3:52 UTC (permalink / raw)
  To: Johannes.Schindelin, gitster, gitzilla, Alex Riesen; +Cc: git

Hi all,

     Thanks a lot for your valuable advice. Based on it, I have
prepared a new version of my proposal. In general, it provides the
detailed options which I want to add to 'git log' and a new syntax for
supporting multiple line ranges in any file at any revision. This
version also provides milestones and a timeline for the project.
Thanks again for your advice; I would appreciate your feedback on this
version very much.

-----------------------------------------------------------------------
Draft proposal(v2): Line-level History Browser

=====Purpose of this project=====
"git blame" can tell us who is responsible for a line of code, but it
can't help if we want to see in detail how the lines of code have
evolved into what they are now.
This project will add a new feature to 'git log' to display line-level
history. It can trace the history of any line range of a certain file
at any revision. For simplicity, users can run a command like: ' git
log -L builtin/diff.c:6,8 ' to get the change history of the code
between line 6 and line 8 of the diff.c file. For each history entry,
it will provide the commit and the diff block which contains the
changes to the lines the user is interested in.
This utility will trace the whole modification history of the lines of
interest and stop when it finds the root of the lines, which is the
point where all the code is added from scratch. The user can also
specify how deeply the utility should trace. This tool will treat a
code move just like a modification, so it will follow code moves
inside one commit.
Note that the history may not always be a single thread of commits.
If more than one commit produced the specified line range, the thread
of history will split. The utility will then stop, present all of
these commits with their code changes, and let the user select which
one to trace next.

=====Work and technical issues=====
==Command options==
This new feature should be used for exploring the history of changes
to a certain line range of code in one file.

git log [-m<num>] [-I] [-d depth] [--fuzzy] -L file1@rev1:<start pos>,<end pos> file2@rev2:<start pos>,<end pos>

Options:
1. -m<num>, option to control whether we should follow code movement.
If one -m is given, we follow code movement inside a file; when more
than one '-m' is given, we follow movement between files within one
commit. The <num> specifies the lower bound for the number of lines of
moved code. If it is not given, it defaults to 1.
2. -I, option to control whether we display only the diff block that
touches the lines of interest [default] or display the whole diff with
the area of interest highlighted in color.
3. -d, option to control the max depth to which we trace into the
commit history.
4. --fuzzy, option to control whether fuzzy code copy matching is used.
5. '-L', option to control whether we run a simple log or a line-level
log.
6. Files and lines. I propose to use the following syntax to specify
the files at a revision and the line range:
<file>@<revision>:<start pos>,<end pos>. This looks a little complex,
but I think it is necessary because we will eventually support
multiple files at any revision with any line range. The revision can
be in any revision format of Git and the <pos> can be a number or a
POSIX regex, just like what 'git blame' does.
7. We will support code copy detection, too. The option which controls
whether we trace code copies already exists in the current 'git log':
the option '-C'. Similarly, one '-C' is used to trace copies of newly
added code inside one commit. Two '-C's will trace any code copy
inside the commit tree.

==Design and implementation==
Git stores whole blobs rather than code deltas, so we should traverse
the commit history and directly access the tree/blob objects to
compute the code delta and search for the diff that contains the
lines of interest. Since Git uses libxdiff to generate its diffs, we
should iterate through all of xdiff's diff blocks and find what the
code looked like before the commit. This will be done using the
callback mechanism. Here, we will find a new line range, which is the
original code before this commit, and then start another search from
the current commit with the new line range. Recursively, we can find
all the modification history. We will stop when we find that the
current line range of interest was added from scratch and was not
moved from another place in the file. Here, if the user wants to trace
code copies, more work will be done to find the possible code copy. We
may also stop the traversal when we reach the max search depth. Also,
if the thread of change history splits into two or more commits, we
stop and show the user all the related commits and the corresponding
line ranges.
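
As a rough illustration of the range-mapping step described above, the
following Python sketch maps a line range in the post-commit file back
to the pre-commit file. It is only a sketch under assumptions: the hunk
tuples (old_lo, old_hi, new_lo, new_hi) use 0-based half-open indices
and are a hypothetical stand-in for what an xdiff callback would
collect, and map_range_to_parent is not an existing Git function.

```python
def map_range_to_parent(hunks, lo, hi):
    """Map the half-open range [lo, hi) of the new file onto the old file.

    hunks: sorted, non-overlapping (old_lo, old_hi, new_lo, new_hi) tuples
           (hypothetical format, 0-based half-open on both sides).
    Returns (old_lo, old_hi, touched); an empty old range means the lines
    were added from scratch by this commit.
    """
    shift_lo = 0  # (old - new) offset that applies at position lo
    shift_hi = 0  # (old - new) offset that applies at position hi
    old_lo = old_hi = None
    touched = False
    for a_lo, a_hi, b_lo, b_hi in hunks:
        delta = (a_hi - a_lo) - (b_hi - b_lo)
        if b_hi <= lo:            # hunk lies entirely before the range
            shift_lo += delta
            shift_hi += delta
        elif b_lo >= hi:          # hunk lies entirely after the range
            break
        else:                     # hunk overlaps the range
            touched = True
            if b_lo <= lo:        # range starts inside the hunk: widen
                old_lo = a_lo
            if b_hi >= hi:        # range ends inside the hunk: widen
                old_hi = a_hi
            else:
                shift_hi += delta
    if old_lo is None:
        old_lo = lo + shift_lo
    if old_hi is None:
        old_hi = hi + shift_hi
    return old_lo, old_hi, touched


# One hunk: old lines [2, 4) were replaced by new lines [2, 5).
hunks = [(2, 4, 2, 5)]
print(map_range_to_parent(hunks, 6, 8))  # range below the hunk, only shifted
print(map_range_to_parent(hunks, 3, 6))  # range overlapping the hunk
```

When the range overlaps a hunk, the sketch widens it to the whole hunk
on the old side, since a plain diff cannot say which old line
corresponds to which new line inside a changed block.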

Generally,
1. A new callback for xdi_diff will parse the diff hunks and store
line-level history info.
2. builtin/line-log.c will be added to implement most of the new features.
3. builtin/log.c will be changed to expose this new utility in the front end.
4. The documentation will be updated to introduce this new tool.

=====Milestones and Timeline=====
This summer, we will add line-level history browser support for
only one file. Support for multiple ranges is currently outside the
scope of this project.

The milestones of the project are:
1. Simple modification change history.
2. Detection of code movement inside one file.
3. Detection of code movement inside one commit but across files.
4. Detection of code copied from a modified file in one commit.
5. Detection of code copied from any place in one commit tree.
6. Fuzzy matching support.

And the timeline will be:
April 26 - May 23:   Catch up with the Git code base and study the
implementation of blame.c and log.c thoroughly.

May 24 - June 21 :   Complete a version which supports tracing code
modification but without code movement and code copy support.

June 22 - June 29:   Complete a version which supports code movement
inside one file.

June 30 - July 7:    Complete a version which supports code movement
between files inside one commit.

July 8 - July 15:    Complete a version which supports code copy of
modified file in one commit.

July 16 - July 23:   Complete a version which supports code copy of
any file in one commit tree.

July 24 - August 7:  Complete fuzzy matching for code movement and copy detection.

=====About me=====
I am Bo Yang, a Chinese graduate student majoring in Computer Science
at NanKai University. I first came across open source software about 5
years ago and began contributing code to the open source community
three years ago. I have contributed to Mozilla/Mingw/Netsurf.
Technically, I am experienced in C/Bash Shell. I took part in last
year's GSoC with the Netsurf project, where I completed
most of a DOM library in C.
I began using Git for source code revision control about two years ago.
I use Git to track my Mozilla trunk source code, because updating
Mozilla code by CVS at my school is very slow. So I wrote a script
that automatically updates the trunk with CVS on the server at midnight,
when the network is fast, and then uses Git to maintain the
code. I then use Git on my PC to clone/update the source code from my
local server, which is very fast. I use Git to track my changes to
the code and some bug fixes. It is an excellent tool for
branch/history, I think.
Git is my beloved daily tool for revision control. I have a lot of
experience with it, have read "Git Internals", and have some
basic knowledge of Git's code base. I think the line-level
history explorer really suits me and that I can make a good start
in the Git community with this project.

---------------------------------------------------
Thank you very much!

Regards!
Bo

* Re: GSoC draft proposal: Line-level history browser
  2010-03-22  3:48         ` Bo Yang
@ 2010-03-22  4:24           ` Junio C Hamano
  2010-03-22  4:34             ` Bo Yang
  0 siblings, 1 reply; 54+ messages in thread
From: Junio C Hamano @ 2010-03-22  4:24 UTC (permalink / raw)
  To: Bo Yang; +Cc: gitzilla, Alex Riesen, git

Bo Yang <struggleyb.nku@gmail.com> writes:

> Yeah, how do you think use the following syntax:
>
> <file1>@<rev1>:<start pos>,<end pos> <file2>@<rev2>:<start pos>,<end pos>

Horrible.  That is not how we name things.

What's wrong with bog standard:

    $ git log -L 10,20 master -- Documentation/git.txt

which is exactly how "blame" does it?

* Re: GSoC draft proposal: Line-level history browser
  2010-03-22  4:24           ` Junio C Hamano
@ 2010-03-22  4:34             ` Bo Yang
  2010-03-22  5:32               ` Junio C Hamano
  0 siblings, 1 reply; 54+ messages in thread
From: Bo Yang @ 2010-03-22  4:34 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: gitzilla, Alex Riesen, git

> Horrible.  That is not how we name things.
>
> What's wrong with bog standard:
>
>    $ git log -L 10,20 master -- Documentation/git.txt
>
> which is exactly how "blame" does it?

The 'blame' way is very good if we only support one line range. But if
we want to support multiple line ranges, I don't think it is suitable
for that case. How can I specify multiple ranges that refer to
multiple files at multiple revisions using the
above syntax?

Apart from that, I am still not convinced that we need multiple-range
support. And how would we display such a result to our users?

Regards!
Bo

* Re: GSoC draft proposal: Line-level history browser
  2010-03-22  4:34             ` Bo Yang
@ 2010-03-22  5:32               ` Junio C Hamano
  2010-03-22  7:31                 ` Bo Yang
  2010-03-22 10:39                 ` Alex Riesen
  0 siblings, 2 replies; 54+ messages in thread
From: Junio C Hamano @ 2010-03-22  5:32 UTC (permalink / raw)
  To: Bo Yang; +Cc: gitzilla, Alex Riesen, git

Bo Yang <struggleyb.nku@gmail.com> writes:

> The 'blame' way is very good if we only support one line range. But if
> we want to support multiple line ranges, I don't think it is suitable
> for that case. Anyway, how can I specify multi-ranges which refers to
> multiple files at multiple revision and multiple line ranges using
> above syntax?

I would sort of see you may want to be able to say "explain lines 10 thru
15 of config.h and lines 100-115 of hello.c that appear in v1.2.0", but I
think it is a total nonsense to ask for "ll 10-15 of config.h in v1.2.0
and ll 110-115 of hello.c in v1.0.0".  After all they never existed in the
same revision (otherwise you would have said "ll 7-13 of config.h and ll
110-115 of hello.c that appear in v1.0.0").  So I would reject the
SVN-like "rev@" in the first place.

While I don't seriously buy "multiple files" either, if that is really
needed, I could be persuaded with  "log -- path1:10-15 path2:1-7", or
"log -L path1:10-15 -Lpath2:1-7 -- path1 path2" or something similarly
ugly like these, but that is not how we generally name things, and it
probably shouldn't be a new option to "log" anymore.

On the other hand, multiple ranges in a single file is something that
may be quite reasonable, e.g.

  $ git log -L10-15 -L200-210 -- Makefile
  $ git log -L'*/^#ifdef WINDOWS/,/^#endif \/\* WINDOWS \/\*/' -- config.h

As I already said, I wouldn't be so worried about multiple-range feature,
but I would be worried about the usefulness of this feature, even for the
case to track a single range of a single file, starting from one given
revision.  When you want to know where the first few lines of Makefile
came from, and if blame says the first line came from 2731d048, that
really means that between the revision you started digging from and the
found revision, there is no commit that touched that particular line, but
equally importantly, that before that found revision, there wasn't a
corresponding line in that file---blame stopped exactly because there is
nobody before that found revision that the line can be blamed on.

So implementing "git log -L1,10 -- Makefile" might be just the matter of
doing something like:

 1. Run "git blame -L1,10 -- Makefile";
 2. Note the commits that appear in the output;
 3. Topologically sort these commits;
 4. Run "git show <the result of that toposort>"

which is not very satisfying.
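
For illustration, steps 1 and 2 could be sketched in Python as below.
The sketch assumes its input is the text printed by
"git blame --porcelain -L1,10 -- Makefile", in which each group of
lines starts with a header of the form
"<40-hex commit> <orig line> <final line> [<count>]". The helper name
commits_from_porcelain is hypothetical, and steps 3 and 4 are left out.

```python
import re

# Header line of a porcelain blame group: commit id, original line number,
# final line number, and (only on the first line of a group) a line count.
HEADER = re.compile(r"^([0-9a-f]{40}) \d+ \d+(?: \d+)?$")


def commits_from_porcelain(text):
    """Return the distinct commit ids in blame output, in first-seen order."""
    seen = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Hand-written sample input (not real blame output) for two commits.
A = "1" * 40
B = "2" * 40
sample = (
    f"{A} 1 1 2\nauthor X\nfilename Makefile\n\tline one\n"
    f"{A} 2 2\n\tline two\n"
    f"{B} 5 3 1\nauthor Y\nfilename Makefile\n\tline three\n"
)
print(commits_from_porcelain(sample))
```

Content lines in this format start with a TAB, so they cannot be
mistaken for a header even if the file itself contains hex strings.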

And "git log -L1 -- Makefile" naturally degenerates into:

 1. Run "git blame -L1,1 -- Makefile";
 2. Note the commits that appear in the output;
 3. Run "git show <that commit>"

which is not just unsatisfying, but is almost boring.

I dunno.

* Re: GSoC draft proposal: Line-level history browser
  2010-03-22  5:32               ` Junio C Hamano
@ 2010-03-22  7:31                 ` Bo Yang
  2010-03-22  7:41                   ` Junio C Hamano
  2010-03-22 10:39                 ` Alex Riesen
  1 sibling, 1 reply; 54+ messages in thread
From: Bo Yang @ 2010-03-22  7:31 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: gitzilla, Alex Riesen, git

On Mon, Mar 22, 2010 at 1:32 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Bo Yang <struggleyb.nku@gmail.com> writes:
>
>> The 'blame' way is very good if we only support one line range. But if
>> we want to support multiple line ranges, I don't think it is suitable
>> for that case. Anyway, how can I specify multi-ranges which refers to
>> multiple files at multiple revision and multiple line ranges using
>> above syntax?
>
> I would sort of see you may want to be able to say "explain lines 10 thru
> 15 of config.h and lines 100-115 of hello.c that appear in v1.2.0", but I
> think it is a total nonsense to ask for "ll 10-15 of config.h in v1.2.0
> and ll 110-115 of hello.c in v1.0.0".  After all they never existed in the
> same revision (otherwise you would have said "ll 7-13 of config.h and ll
> 110-115 of hello.c that appear in v1.0.0").  So I would reject the
> SVN-like "rev@" in the first place.
>
> While I don't seriously buy "multiple files" either, if that is really
> needed, I could be pursuaded with  "log -- path1:10-15 path2:1-7", or
> "log -L path1:10-15 -Lpath2:1-7 -- path1 path2" or something similarly
> ugly like these, but that is not how we generally name things, and it
> probably shouldn't be a new option to "log" anymore.
>
> On the other hand, multiple ranges in a single file is something that
> may be quite reasonable, e.g.
>
>  $ git log -L10-15 -L200-210 -- Makefile
>  $ git log -L'*/^#ifdef WINDOWS/,/^#endif \/\* WINDOWS \/\*/' -- config.h

Yeah, maybe multiple ranges in one file is the most rational choice.

> As I already said, I wouldn't be so worried about multiple-range feature,
> but I would be worried about the usefulness of this feature, even for the
> case to track a single range of a single file, starting from one given
> revision.

I am sorry, but I did not follow you here. Are you worried about the
usefulness of the multi-range feature or of the line-level history
browser?

I think tracking a single range of a single file, starting from one
given revision, is useful when the line of history splits at some point.
This feature lets users focus on a single line of history.

>When you want to know where the first few lines of Makefile
> came from, and if blame says the first line came from 2731d048, that
> really means that between the revision you started digging from and the
> found revision, there is no commit that touched that particular line, but
> equally importantly, that before that found revision, there wasn't a
> corresponding line in that file---blame stopped exactly because there is
> nobody before that found revision that the line can be blamed on.
>
> So implementing "git log -L1,10 -- Makefile" might be just the matter of
> doing something like:
>
>  1. Run "git blame -L1,10 -- Makefile";
>  2. Note the commits that appear in the output;
>  3. Topologically sort these commits;
>  4. Run "git show <the result of that toposort>"
>
> which is not very satisfying.

Yes, this is not satisfying. But as I understand it, the line-level
history browser will do more than just this. It will not stop at 'step
4'; it can follow the change history recursively and deeply to find
more. I think this is useful when we focus on just one range of code
and want to know how it evolved into its current state.

Anyway, it is not a bad thing to add a new convenient feature to a
daily tool. :)

Regards!
Bo

* Re: GSoC draft proposal: Line-level history browser
  2010-03-22  7:31                 ` Bo Yang
@ 2010-03-22  7:41                   ` Junio C Hamano
  2010-03-22  7:52                     ` Bo Yang
  2010-03-22  8:10                     ` Jonathan Nieder
  0 siblings, 2 replies; 54+ messages in thread
From: Junio C Hamano @ 2010-03-22  7:41 UTC (permalink / raw)
  To: Bo Yang; +Cc: gitzilla, Alex Riesen, git

Bo Yang <struggleyb.nku@gmail.com> writes:

>> When you want to know where the first few lines of Makefile
>> came from, and if blame says the first line came from 2731d048, that
>> really means that between the revision you started digging from and the
>> found revision, there is no commit that touched that particular line, but
>> equally importantly, that before that found revision, there wasn't a
>> corresponding line in that file---blame stopped exactly because there is
>> nobody before that found revision that the line can be blamed on.
> ...
> Yes, this is not satisfying. But as I understand, the line level
> history browser will do more than just this. It will not stop on 'step
> 4', it can follow the change history recursively and deeply, to find
> more.

I am actually questioning the existence of "recursively and deeply to find
more"; the reason blame stopped at a particular commit is exactly because
there is no more---otherwise it wouldn't have stopped there but kept
digging deeper.

That is what I meant in the message you are responding to, quoted at the
top of this message.

* Re: GSoC draft proposal: Line-level history browser
  2010-03-22  7:41                   ` Junio C Hamano
@ 2010-03-22  7:52                     ` Bo Yang
  2010-03-22  8:10                     ` Jonathan Nieder
  1 sibling, 0 replies; 54+ messages in thread
From: Bo Yang @ 2010-03-22  7:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: gitzilla, Alex Riesen, git

On Mon, Mar 22, 2010 at 3:41 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Bo Yang <struggleyb.nku@gmail.com> writes:
>
>>> When you want to know where the first few lines of Makefile
>>> came from, and if blame says the first line came from 2731d048, that
>>> really means that between the revision you started digging from and the
>>> found revision, there is no commit that touched that particular line, but
>>> equally importantly, that before that found revision, there wasn't a
>>> corresponding line in that file---blame stopped exactly because there is
>>> nobody before that found revision that the line can be blamed on.
>> ...
>> Yes, this is not satisfying. But as I understand, the line level
>> history browser will do more than just this. It will not stop on 'step
>> 4', it can follow the change history recursively and deeply, to find
>> more.
>
> I am actually questioning the existence of "recursively and deeply to find
> more"; the reason blame stopped at a particular commit is exactly because
> there is no more---otherwise it wouldn't have stopped there but kept
> digging deeper.

I think an example may explain what I mean.

commit 1 of the file:
line 1 rev 1
line 2 rev 1

commit 2 of the file:
line 1 rev 2
line 2 rev 2

commit 3 of the file:
line 1 rev 3
line 2 rev 3

If we run 'git blame file', it will show that both lines are blamed on
commit 3. The line-level utility will also show rev2 and rev1 to users,
in a format like the one 'git log' provides. I think git blame focuses
on who produced the current code range. The line-level browser will
provide more than that: it also answers how the lines evolved into
their current state.
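
The three-commit example above can be replayed with a small Python
sketch. It is an illustration under assumptions only: Python's difflib
stands in for libxdiff, the revisions are plain lists of lines instead
of Git objects, and the names map_back and line_log are made up here.
Ranges are 0-based and half-open.

```python
import difflib


def map_back(old, new, lo, hi):
    """Map range [lo, hi) of `new` onto `old`; report whether it changed."""
    ops = difflib.SequenceMatcher(None, old, new, autojunk=False).get_opcodes()
    old_lo = old_hi = None
    changed = False
    for tag, i1, i2, j1, j2 in ops:
        if j2 <= lo or j1 >= hi:      # block does not overlap the range
            continue
        if tag == "equal":            # unchanged lines map one to one
            s = max(lo, j1) - j1 + i1
            e = min(hi, j2) - j1 + i1
        else:                         # changed block: take its whole old side
            changed = True
            s, e = i1, i2
        old_lo = s if old_lo is None else min(old_lo, s)
        old_hi = e if old_hi is None else max(old_hi, e)
    return old_lo, old_hi, changed


def line_log(revisions, lo, hi):
    """revisions: oldest first. Return [(rev number, lines)], newest first."""
    history = []
    for n in range(len(revisions) - 1, 0, -1):
        new, old = revisions[n], revisions[n - 1]
        old_lo, old_hi, changed = map_back(old, new, lo, hi)
        if changed:
            history.append((n + 1, new[lo:hi]))
        lo, hi = old_lo, old_hi
        if lo == hi:                  # nothing left: added from scratch
            break
    else:
        # The loop reached the first revision, which introduced what is left.
        history.append((1, revisions[0][lo:hi]))
    return history


revs = [
    ["line 1 rev 1", "line 2 rev 1"],
    ["line 1 rev 2", "line 2 rev 2"],
    ["line 1 rev 3", "line 2 rev 3"],
]
for rev, lines in line_log(revs, 0, 2):
    print("commit", rev, lines)
```

Blame would stop at commit 3 here because no line of commit 2 survives
unchanged, while the walk above keeps going by following the replaced
block back through each earlier revision.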

I hope I have explained everything clearly. :)

Regards!
Bo

* Re: GSoC draft proposal: Line-level history browser
  2010-03-22  7:41                   ` Junio C Hamano
  2010-03-22  7:52                     ` Bo Yang
@ 2010-03-22  8:10                     ` Jonathan Nieder
  2010-03-23  6:01                       ` Bo Yang
  1 sibling, 1 reply; 54+ messages in thread
From: Jonathan Nieder @ 2010-03-22  8:10 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Bo Yang, gitzilla, Alex Riesen, git

Junio C Hamano wrote:

> I am actually questioning the existence of "recursively and deeply to find
> more"; the reason blame stopped at a particular commit is exactly because
> there is no more

Hmm, I can imagine some (mutually inconsistent) heuristics:

 - Suppose in the blamed commit a single isolated line changed.  Then
   it is clear where to look next.

 - If the mystery code is at the beginning of the file (resp.
   beginning of a diff -C0 hunk), maybe it was based on the line at the
   same position within the previous commit.

 - Take the line with the lowest Levenshtein distance from the mystery
   code.

 - Expect certain common patterns of change: substituted words,
   whitespace changes, added arguments for a function, things like that.

That said, I still don’t have a clear picture of a basic strategy.
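
To make the Levenshtein heuristic from the list above concrete, here is
a minimal Python sketch. It only shows the distance computation and a
nearest-line pick; the function names and the sample lines are invented
for illustration, and nothing here says this heuristic is good enough
in practice.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def closest_line(target, candidates):
    """Index of the candidate line with the lowest distance to target."""
    return min(range(len(candidates)),
               key=lambda k: levenshtein(target, candidates[k]))


old_lines = ["return 0;", "int cnt = 0;", "free(buf);"]
print(closest_line("int count = 0;", old_lines))
```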

Interested,
Jonathan

* Re: GSoC draft proposal: Line-level history browser
  2010-03-22  5:32               ` Junio C Hamano
  2010-03-22  7:31                 ` Bo Yang
@ 2010-03-22 10:39                 ` Alex Riesen
  2010-03-22 15:05                   ` Johannes Schindelin
  1 sibling, 1 reply; 54+ messages in thread
From: Alex Riesen @ 2010-03-22 10:39 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Bo Yang, gitzilla, git

On Mon, Mar 22, 2010 at 06:32, Junio C Hamano <gitster@pobox.com> wrote:
> While I don't seriously buy "multiple files" either, if that is really

yeah, _really_

> needed, I could be pursuaded with  "log -- path1:10-15 path2:1-7", or
> "log -L path1:10-15 -Lpath2:1-7 -- path1 path2" or something similarly
> ugly like these, but that is not how we generally name things, and it
> probably shouldn't be a new option to "log" anymore.

But then, how about putting the "path" last in the argument,
so that the unambiguously defined part of the format comes first?
Less need for quoting of ":" (or "@") in pathnames.

* Re: GSoC draft proposal: Line-level history browser
  2010-03-22 10:39                 ` Alex Riesen
@ 2010-03-22 15:05                   ` Johannes Schindelin
  0 siblings, 0 replies; 54+ messages in thread
From: Johannes Schindelin @ 2010-03-22 15:05 UTC (permalink / raw)
  To: Alex Riesen; +Cc: Junio C Hamano, Bo Yang, gitzilla, git

Hi,

On Mon, 22 Mar 2010, Alex Riesen wrote:

> On Mon, Mar 22, 2010 at 06:32, Junio C Hamano <gitster@pobox.com> wrote:
> > While I don't seriously buy "multiple files" either, if that is really
> 
> yeah, _really_

Yes. Besides, it is an easy fall-out of the common "a Java class was split 
into two" case, where you follow line ranges in different files (at least 
at some stage) _anyway_.

Ciao,
Dscho

* Re: GSoC draft proposal: Line-level history browser
  2010-03-22  3:52         ` Bo Yang
@ 2010-03-22 15:48           ` Jakub Narebski
  2010-03-22 18:21             ` Johannes Schindelin
  2010-03-22 19:24           ` Johannes Schindelin
       [not found]           ` <201003282120.40536.trast@student.ethz.ch>
  2 siblings, 1 reply; 54+ messages in thread
From: Jakub Narebski @ 2010-03-22 15:48 UTC (permalink / raw)
  To: Bo Yang; +Cc: Johannes.Schindelin, gitster, gitzilla, Alex Riesen, git

Bo Yang <struggleyb.nku@gmail.com> writes:

> This project will add a new feature for 'git log' to display line
> level history. It can trace the history of any line range of certain
> file at any revision. For simplity, users can run the command like: '
> git log -L builtin/diff.c:6,8 ', he will get the change history of
> code between line 6 and line 8 of the diff.c file. 

I think that, at least at first, line-level log should follow the
git-blame, i.e.

  git log -L <begin>,<end>  <revs>  -- <file>

If we want (in the future) to follow history of some lines from one
file, and other lines from other file together, we do not need to use

  -L <file>:<begin>,<end>

syntax.  If parseopt allows, we can use the position of parameters, i.e.

  <file1> -L <m>,<n>   <file2> -L <k>,<j>

> And for each history entry, it will provide the commits, the diff
> block which contains changes of users' interested lines.

The most important *new* algorithm you need to implement is, after
finding (blame-like) the commit that created given version of given
line, what was previous version of given line and which line that was.

You can probably find some heuristic in existing merge tools, like
emerge from GNU Emacs, or graphical diff tools.

-- 
Jakub Narebski
Poland
ShadeHawk on #git

* Re: GSoC draft proposal: Line-level history browser
  2010-03-22 15:48           ` Jakub Narebski
@ 2010-03-22 18:21             ` Johannes Schindelin
  2010-03-22 18:38               ` Sverre Rabbelier
  0 siblings, 1 reply; 54+ messages in thread
From: Johannes Schindelin @ 2010-03-22 18:21 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Bo Yang, gitster, gitzilla, Alex Riesen, git

Hi,

On Mon, 22 Mar 2010, Jakub Narebski wrote:

> Bo Yang <struggleyb.nku@gmail.com> writes:
> 
> > This project will add a new feature for 'git log' to display line 
> > level history. It can trace the history of any line range of certain 
> > file at any revision. For simplity, users can run the command like: ' 
> > git log -L builtin/diff.c:6,8 ', he will get the change history of 
> > code between line 6 and line 8 of the diff.c file.
> 
> I think that, at least at first, line-level log should follow the
> git-blame, i.e.
> 
>   git log -L <begin>,<end>  <revs>  -- <file>
> 
> If we want (in the future) to follow history of some lines from one
> file, and other lines from other file together, we do not need to use
> 
>   -L <file>:<begin>,<end>
> 
> syntax.  If parseopt allows, we can use posotion of parameters, i.e.
> 
>   <file1> -L <m>,<n>   <file2> -L <k>,<j>

Oh, is it bikeshedding time already? /me might have missed the start 
signal.

> > And for each history entry, it will provide the commits, the diff 
> > block which contains changes of users' interested lines.
> 
> The most important *new* algorithm you need to implement is, after 
> finding (blame-like) the commit that created given version of given 
> line, what was previous version of given line and which line that was.
> 
> You can probably find some heuristic in existing merge tools, like
> emerge from GNU Emacs, or graphical diff tools.

I do not think that these tools can help, as they never look further than 
identical lines (and they mustn't, either).

More importantly, the first step really is about driving the libxdiff in 
such a way that you can recognize the exact same lines.

(One point to note for the technical details: the algorithm has to expect 
opposite code moves, i.e. it must cope well when the diff shows the code 
in question removed in one hunk and added in another.)
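
A very small Python sketch of that case, for illustration only: it
assumes each hunk has already been reduced to a pair
(removed_lines, added_lines), which is a hypothetical format, and it
only recognizes a block that reappears verbatim as the whole added side
of another hunk.

```python
def find_moves(hunks):
    """Return (src, dst) hunk index pairs for verbatim block moves.

    hunks: list of (removed_lines, added_lines) pairs (hypothetical format).
    A pair (src, dst) means the block removed in hunk src is exactly the
    block added in hunk dst.
    """
    moves = []
    for dst, (_, added) in enumerate(hunks):
        if not added:
            continue
        for src, (removed, _) in enumerate(hunks):
            if src != dst and removed == added:
                moves.append((src, dst))
    return moves


hunks = [
    (["a();", "b();"], []),   # block removed near the top of the file
    ([], ["x();"]),           # unrelated addition
    ([], ["a();", "b();"]),   # same block added further down
]
print(find_moves(hunks))
```

The move can go in either direction within the file, which is why every
hunk is compared against every other hunk rather than only later ones.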

We also should not get ahead of ourselves, but allow the student to get a 
full understanding of the requirements, from which he can then make a 
project plan (with milestones, Christian, no problem).

BTW by "requirements" I do not mean something as technical as the syntax, 
but rather a definition what people should be able to expect to do with 
this at the end of the summer.

As to fuzzy matching of lines that could not be attributed otherwise, I 
think that that will require a lot of playing around with different ideas. 
A simple Levenshtein-Damerau is highly unlikely to be enough.

Ciao,
Dscho

* Re: GSoC draft proposal: Line-level history browser
  2010-03-22 18:21             ` Johannes Schindelin
@ 2010-03-22 18:38               ` Sverre Rabbelier
  2010-03-22 19:26                 ` Johannes Schindelin
  0 siblings, 1 reply; 54+ messages in thread
From: Sverre Rabbelier @ 2010-03-22 18:38 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Jakub Narebski, Bo Yang, gitster, gitzilla, Alex Riesen, git

Heya,

On Mon, Mar 22, 2010 at 19:21, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> As to fuzzy matching of lines that could not be attributed otherwise, I
> think that that will require a lot of playing around with different ideas.
> A simple Levenshtein-Damerau is highly unlikely to be enough.

I'd recommend making this either the last milestone, or not a
milestone at all. As I noticed with git-stats such metrics might not
exist at all (or at least be too hard to find/implement), and it's
quite a bummer to not be able to implement your primary milestone ;).

-- 
Cheers,

Sverre Rabbelier

* Re: GSoC draft proposal: Line-level history browser
  2010-03-22  3:52         ` Bo Yang
  2010-03-22 15:48           ` Jakub Narebski
@ 2010-03-22 19:24           ` Johannes Schindelin
  2010-03-23  6:08             ` Bo Yang
  2010-03-23  6:27             ` Bo Yang
       [not found]           ` <201003282120.40536.trast@student.ethz.ch>
  2 siblings, 2 replies; 54+ messages in thread
From: Johannes Schindelin @ 2010-03-22 19:24 UTC (permalink / raw)
  To: Bo Yang; +Cc: gitster, gitzilla, Alex Riesen, git

Hi,

On Mon, 22 Mar 2010, Bo Yang wrote:

> Draft proposal(v2): Line-level History Browser
> 
> =====Purpose of this project=====
> "git blame" can tell us who is responsible for a line of code, but it
> can't help if we want to get the detail of how the lines of code have
> evolved as what it is now.
> This project will add a new feature for 'git log' to display line
> level history. It can trace the history of any line range of certain
> file at any revision. For simplity, users can run the command like: '
> git log -L builtin/diff.c:6,8 ', he will get the change history of
> code between line 6 and line 8 of the diff.c file. And for each
> history entry, it will provide the commits, the diff block which
> contains changes of users' interested lines.

I would not be too specific here about the exact syntax. I would rather 
have an example where this might be useful.

In git.git, for example, you could point to pretty_print_commit() which 
was split out from commit.c into pretty.c in 93fc05e(Split off the pretty 
print stuff into its own file), and mention that it is hard to verify 
without much hassle that the code split was really only a code split, 
rather than a split with an evil change.

Or you could point to 691f1a2(replace direct calls to unlink(2) with 
unlink_or_warn), where code was refactored, into a new function 
(unfortunately in two commits, so it might be a case not covered by your 
project) and it might be somebody's task to find out the original author 
for that function.

Basically, I would like to have a structure in the proposal like this: 
what? why? how? when?

> This utility will trace all the modification history of interested
> lines and stop until it finds the root of the lines, which is a point
> where all the new code is added from scratch. Also, the users can
> specify how deeply he wants this utility to trace. And this tool will
> treat code move just like modification too, so it will follow the code
> move inside one commit.
>
> Note that, the history may not always be a single thread of commits.
> If there are more than one commit which produce the specified line
> range, the thread of history will split.

Do not forget the case where there are more than one source of a code 
move. Think "refactoring".

> =====Work and technical issues=====
> ==Command options==
> This new feature should be used for exploring the history of changes
> for certain line range of code in one file.
> 
> git log [-m<num>] [-I] [-d depth] [--fuzzy]  -L file1@rev1:<start
> pos>,<end pos>  file2@rev2:<start pos>,<end pos>

I would like this not to be specified too much here. For example, we do 
not know yet, whether the matching will be fuzzy, or whether we find 
something cleverer than that.

So, I suggest to list not the command line options, but what you intend to 
support. I.e.:

> Options:
>
> 1. -m<num>, option to control whether we should follow code movement.  
>    If one -m is given, we follow code movement inside file, when more 
>    than one '-m' is given, we follow the movement between files in one 
>    commit. The <num> is used to specify the lower bound for the number 
>    of lines of moved code. If it is not given, we set it as 1.

Here you do not need to say that it is -m<num>, but that you want to 
support following code movements both inside and between files, but only 
optionally, for performance reasons (or some such).

In any case, this would probably just reuse the -M option.

> 2. -I, option to control whether we should display only the 'user 
>    interested lines' diff block [default] or display the whole diff with 
>    the interested area colorfully displayed.

It would be more in line with the diff options to use -U, but you do not 
have to state that. Just talk about a configurable amount of context.

> 3. -d, option to control the max depth we trace into the commit history.

Again, there are better options for "git log" already, but you do not need 
to be too explicit on the syntax side. Just say that you want to be able 
to use as many of "git log"s options as make sense in the context of 
line-level history.

> 4. --fuzzy, option to control whether fuzzy code copy mathing is used.

See above.

> 5. '-L' to control whether we run a simple log or we want a line level 
>    log.

See above.

> 6. Files and lines. I propose to use such a syntax to specify the files 
>    at revision and line range, <file>@<revision>:<start pos>,<end pos>. 
>    This looks a little complex, but I think it is neccessary because we 
>    will support multiple file at any version and any line range finally. 
>    The revision can be any revision format of Git and the <pos> can be a 
>    number, or a posix regex, just like what 'git blame' do.

See above.

> 7. And we will support code copy detect, too. The option which control 
>    whether we trace code copy does exist in current 'git log', which is 
>    the option '-C'. Similiarly, one '-C' is used to trace code copy of 
>    new added code inside one commit. Two '-C' will trace any code copy 
>    inside commit tree.

Again, do not be too specific about details that have to be fleshed out 
while working on the project. For example, we do not know yet whether it 
would make more sense to look for code movements automatically when we 
detected a deletion, and maybe fall back automatically to detecting code 
copies when we found an inter-file move.

> ==Design and implementation==
> Git stores all the blobs instead of code delta, so we should traverse
> the commit history and directly access the tree/blob objects to
> compute the code delta and search for the diff which contains the
> interested lines.

s/ed/ing/

> Since git uses libxdiff to format its diffs, we should iterate
> through all of xdiff's diff blocks and find what the code looked like
> before the commit. This will be done using the callback mechanism. Here,
> we will find a new line range which is the original code before this
> commit, and then start another search from the current commit and the
> new line range.
>
> Recursively, we can find all the modification history. We will stop when
> we find that the current interested line range is added from scratch and
> is not moved from another place in the file. Here, if the user wants to
> trace code copy, more work will be done to find the possible code copy.
> We may also stop the traversal when we reach the max search depth.
>
> Also, if the thread of change history splits into two or more commits, we
> stop and provide the users all the related commits and corresponding
> line ranges.

Good.

> Generally,
> 1. New callback for xdi_diff to parse the diff hunk and store line
> level history info.
> 2. builtin/line-log.c will be added to complete most of the new features.
> 3. builtin/log.c will be changed to add this new utility to the front end.
> 4. Documents will be updated to introduce this new tool.

Good.

> =====Milestones and Timeline=====
> In this summer, we will add support of line level history browser for
> only one file. The multiple ranges support is currently not in this
> project.
> 
> The milestones of the project are:
> 1. Simple modification change history.

IMHO this should be split into

	1a) have an initial version which does nothing else than parse
	    git-log options and a single additional -L, requiring exactly
	    one file to be specified

	1b) implement the xdiff callback and identify the commits touching
	    the line range (this is not completely trivial due to merges)

> 2. Code movement inside one file detect.

Again, this has to be split a little bit. Code can split, and it can also 
unite. So, a single line range can easily become multiple ones.

> 3. Code movement inside one commit but not a file.

s/but not a file/between files/

> 4. Code copy of modified file in one commit.

You mean code copy from somewhere in the same file?

> 5. Code copy of any place in one commit tree.
> 6. Fuzzy matching support.

For fuzzy matching support, I would add some ideas, such as trying to 
match alpha-numeric characters, or matching longest words or some such. 
Also mention the possibility that this might be infeasible. In any case, 
give an example what case this is trying to help with.

> And the timeline will be:
> April 26 - May 23:   Catch up with Git code base and study the
> implementation of blame.c and log.c thoroughly.

Hmm. Maybe it would be better to be more precise. Like: 1st week: follow 
the bird's eye view on Git's source code. 2nd week, analyze the rev-list 
machinery (probably first looking at the code of merge-base, for easier 
understanding), 3rd week, have a look at builtin/log.c, 4th week, 
understand blame.c

> May 24 - June 21 :   Complete a version which supports code
> modification trace but without code movement and code copy support.
> 
> June 22 - June 29:   Complete a version which supports code movement
> inside one file.
> 
> June 30 - July 7:    Complete a version which supports code movement
> between files inside one commit.
> 
> July 8 - July 15:    Complete a version which supports code copy of
> modified file in one commit.
> 
> July 16 - July 23:   Complete a version which supports code copy of
> any file in one commit tree.
> 
> July 24 - August 7:  Complete fuzzy matching of code movement and copy detect.

This should probably be adjusted a bit to my suggestions above.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: GSoC draft proposal: Line-level history browser
  2010-03-22 18:38               ` Sverre Rabbelier
@ 2010-03-22 19:26                 ` Johannes Schindelin
  2010-03-22 20:21                   ` Sverre Rabbelier
  0 siblings, 1 reply; 54+ messages in thread
From: Johannes Schindelin @ 2010-03-22 19:26 UTC (permalink / raw)
  To: Sverre Rabbelier
  Cc: Jakub Narebski, Bo Yang, gitster, gitzilla, Alex Riesen, git

Hi,

On Mon, 22 Mar 2010, Sverre Rabbelier wrote:

> On Mon, Mar 22, 2010 at 19:21, Johannes Schindelin 
> <Johannes.Schindelin@gmx.de> wrote:
> > As to fuzzy matching of lines that could not be attributed otherwise, 
> > I think that that will require a lot of playing around with different 
> > ideas. A simple Levenshtein-Damerau is highly unlikely to be enough.
> 
> I'd recommend making this either the last milestone, or not a milestone 
> at all. As I noticed with git-stats such metrics might not exist at all 
> (or at least be too hard to find/implement), and it's quite a bummer to 
> not be able to implement your primary milestone ;).

Indeed. TBH, I wanted to ask you to assist in that part of the project. 
You probably can give a good overview over what does not work, and why.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: GSoC draft proposal: Line-level history browser
  2010-03-22 19:26                 ` Johannes Schindelin
@ 2010-03-22 20:21                   ` Sverre Rabbelier
  0 siblings, 0 replies; 54+ messages in thread
From: Sverre Rabbelier @ 2010-03-22 20:21 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Jakub Narebski, Bo Yang, gitster, gitzilla, Alex Riesen, git

Heya,

On Mon, Mar 22, 2010 at 20:26, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> Indeed. TBH, I wanted to ask you to assist in that part of the project.
> You probably can give a good overview over what does not work, and why.

Back then I think we even talked about teaching git log to find code
moves? I have some silly code online on repo.or.cz even. maybe.
Anyway, my main problem there was finding a heuristic that would give
a sensible answer both in small _and_ large moves. It might be worth
investigating two or more metrics instead, one that works for (very)
small chunks of code, and thus require an almost exact match, then
perhaps a somewhat linear function (the longer the block moved, the
more 'fuzz' you allow), and maybe after some size, say practical
full-file moves, use an algorithm similar to what rename detection
does. </braindump>
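
The size-dependent tolerance described above could be sketched roughly as
follows. This is only an illustration in Python, not code from Git: the
function name allowed_mismatch and all thresholds (3 lines, 200 lines, 40%
fuzz) are assumptions picked for the example, and the rename-detection
style algorithm suggested for very large moves is not modelled, only capped.

```python
def allowed_mismatch(block_lines, small=3, large=200, max_fuzz=0.4):
    """Return the fraction of lines in a moved block that may differ.

    All thresholds are illustrative assumptions, not values from Git.
    """
    if block_lines <= small:
        # Very small chunks: require an exact match.
        return 0.0
    if block_lines >= large:
        # Near full-file moves: cap the tolerance; a rename-detection
        # style similarity score would take over at this size.
        return max_fuzz
    # In between: the tolerance grows linearly with the block size.
    return max_fuzz * (block_lines - small) / (large - small)


for size in (2, 10, 100, 500):
    print(size, round(allowed_mismatch(size), 3))
```

A single function keeps the three regimes in one place, so the cut-off
points can be tuned against real moves without touching the matching code.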

-- 
Cheers,

Sverre Rabbelier

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: GSoC draft proposal: Line-level history browser
  2010-03-22  8:10                     ` Jonathan Nieder
@ 2010-03-23  6:01                       ` Bo Yang
  2010-03-23 10:08                         ` Jakub Narebski
  2010-03-23 18:57                         ` Jonathan Nieder
  0 siblings, 2 replies; 54+ messages in thread
From: Bo Yang @ 2010-03-23  6:01 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Junio C Hamano, gitzilla, Alex Riesen, git

Hi,
> Hmm, I can imagine some (mutually inconsistent) heuristics:
>
>  - Suppose in the blamed commit a single isolated line changed.  Then
>   it is clear where to look next.
>
>  - If the mystery code is at the beginning of the file (resp.
>   beginning of a diff -C0 hunk), maybe it was based on the line at the
>   same position within the previous commit.
>
>  - Take the line with the lowest Levenshtein distance from the mystery
>   code.
>
>  - Expect certain common patterns of change: substituted words,
>   whitespace changes, added arguments for a function, things like that.
>
> That said, I still don’t have a clear picture of a basic strategy.
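
As one concrete reading of the Levenshtein heuristic quoted above, here is
a minimal Python sketch. It is an illustration only: the helper names
levenshtein and closest_line are hypothetical, and nothing here is taken
from Git's sources. It picks the preimage line with the smallest edit
distance to the line being traced.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def closest_line(target, candidates):
    """Return the candidate with the lowest edit distance to target.

    Raises ValueError if candidates is empty.
    """
    return min(candidates, key=lambda line: levenshtein(target, line))


preimage = ["for (i = 0; i < extra_hdr_nr; i++) {",
            "strbuf_addstr(&buf, extra_hdr[i]);"]
print(closest_line("for (i = 0; i < extra_hdr.nr; i++) {", preimage))
```

Always taking the minimum means a line added from scratch still gets
paired with something, so a real tool would also need a cut-off distance.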

I can't fully understand your strategy above. I think we can
categorize the code changes into two cases:
1. The diff looks like:

@@ -1008,29 +1000,29 @@ int cmd_format_patch(int argc, const char
**argv, const char *prefix)
                add_signoff = xmemdupz(committer, endpos - committer + 1);
        }

-       for (i = 0; i < extra_hdr_nr; i++) {
-               strbuf_addstr(&buf, extra_hdr[i]);
+       for (i = 0; i < extra_hdr.nr; i++) {
+               strbuf_addstr(&buf, extra_hdr.items[i].string);
                strbuf_addch(&buf, '\n');
        }


i.e. there is both deletion and addition in a change. And this means we
modify some lines of the code. So, what we do will be tracing the two
'minus' lines and then find another diff. Start trace from that diff
recursively.
Yes, the new added code may also be moved or copied from other place.
But, I think here, we should focus on the lines before this changeset.

2. The diff looks like:

@@ -879,9 +885,12 @@ int cmd_grep(int argc, const char **argv, const
char *prefix)
        opt.regflags = REG_NEWLINE;
        opt.max_depth = -1;

+       strcpy(opt.color_context, "");
        strcpy(opt.color_filename, "");
+       strcpy(opt.color_function, "");
        strcpy(opt.color_lineno, "");
        strcpy(opt.color_match, GIT_COLOR_BOLD_RED);

This means, the code here is added from scratch. Here, I think we have
three options.
1. Find if the new code is moved here from other place.
2. Find if the new code is copied from other place.
3. We find the end of the history, so stop here.

The remaining problem is how to find the copied/moved code. The newly
added code may be copied/moved from multiple places with small
changes.
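
The two cases above can be told apart mechanically by counting the '-' and
'+' lines of a hunk. The following Python sketch is only an illustration
under assumptions: classify_hunk is a hypothetical name, and it expects the
body lines of a single hunk (no '---'/'+++' file header lines and no '@@'
line).

```python
def classify_hunk(body_lines):
    """Classify one unified-diff hunk body by the kinds of lines it holds."""
    removed = sum(1 for line in body_lines if line.startswith("-"))
    added = sum(1 for line in body_lines if line.startswith("+"))
    if removed and added:
        return "modification"   # case 1: trace the '-' lines further back
    if added:
        return "addition"       # case 2: look for a move/copy, or stop
    if removed:
        return "deletion"
    return "context-only"


hunk = [
    " \topt.max_depth = -1;",
    "+\tstrcpy(opt.color_context, \"\");",
    " \tstrcpy(opt.color_filename, \"\");",
    "+\tstrcpy(opt.color_function, \"\");",
]
print(classify_hunk(hunk))
```

The classification only says which strategy to try next; finding where
added lines were moved or copied from is still the hard part.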

I hope I understand the requirement of the line-level browser, could
you please point it out if I have made some mistake?

Regards!
Bo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: GSoC draft proposal: Line-level history browser
  2010-03-22 19:24           ` Johannes Schindelin
@ 2010-03-23  6:08             ` Bo Yang
  2010-03-23  6:27             ` Bo Yang
  1 sibling, 0 replies; 54+ messages in thread
From: Bo Yang @ 2010-03-23  6:08 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: gitster, gitzilla, Alex Riesen, git

Hi,

>> Note that, the history may not always be a single thread of commits.
>> If there are more than one commit which produce the specified line
>> range, the thread of history will split.
>
> Do not forget the case where there are more than one source of a code
> move. Think "refactoring".

Yeah, I really overlooked that case. Thanks a lot!
And any newly added code can be moved/copied from multiple sources. This
will really be a new problem for the fuzzy matching.

>> =====Work and technical issues=====
>> ==Command options==
>> This new feature should be used for exploring the history of changes
>> for certain line range of code in one file.
>>
>> git log [-m<num>] [-I] [-d depth] [--fuzzy]  -L file1@rev1:<start
>> pos>,<end pos>  file2@rev2:<start pos>,<end pos>
>
> I would like this not to be specified too much here. For example, we do
> not know yet, whether the matching will be fuzzy, or whether we find
> something cleverer than that.

Ok, I will focus on expressing what I will support rather than on command line options.

>
>> =====Milestones and Timeline=====
>> In this summer, we will add support of line level history browser for
>> only one file. The multiple ranges support is currently not in this
>> project.
>>
>> The milestones of the project are:
>> 1. Simple modification change history.
>
> IMHO this should be split into
>
>        1a) have an initial version which does nothing else than parse
>            git-log options and a single additional -L, requiring exactly
>            one file to be specified
>
>        1b) implement the xdiff callback and identify the commits touching
>            the line range (this is not completely trivial due to merges)
>

I will make the milestones and timeline more specific, thanks!

Regards!
Bo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: GSoC draft proposal: Line-level history browser
  2010-03-22 19:24           ` Johannes Schindelin
  2010-03-23  6:08             ` Bo Yang
@ 2010-03-23  6:27             ` Bo Yang
  1 sibling, 0 replies; 54+ messages in thread
From: Bo Yang @ 2010-03-23  6:27 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: gitster, gitzilla, Alex Riesen, git

Hi,
>> 4. Code copy of modified file in one commit.
>
> You mean code copy from somewhere in the same file?

Sorry, no. I mean lines copied from other files that were
modified in the same commit, which is what 'blame' does with a single
'-C' option.

>
>> 5. Code copy of any place in one commit tree.
>> 6. Fuzzy matching support.
>
> For fuzzy matching support, I would add some ideas, such as trying to
> match alpha-numeric characters, or matching longest words or some such.
> Also mention the possibility that this might be infeasible. In any case,
> give an example what case this is trying to help with.
>

I think fuzzy matching is used to track copies/moves of multiple lines,
even when the source has changed a little.
For example, one C function is moved from file1 to file2 and gets
renamed. In this case, most of the original code of the function body
will remain unchanged except for the function name. So, simply comparing
the newly added lines with the original code line by line, and permitting
some percentage of mismatches, will help to find this kind of movement.
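
The line-by-line comparison described above could look roughly like the
following Python sketch. It is only an illustration under assumptions: the
name fuzzy_block_match and the 25% default threshold are made up for the
example, and it only compares blocks of equal length, so it would miss a
move in which lines were also inserted or deleted.

```python
def fuzzy_block_match(new_lines, old_lines, max_mismatch=0.25):
    """Return True if two equally long blocks differ in few enough lines.

    Lines are compared after stripping leading/trailing whitespace, so a
    pure re-indentation does not count as a mismatch.
    """
    if not new_lines or len(new_lines) != len(old_lines):
        return False
    mismatches = sum(1 for new, old in zip(new_lines, old_lines)
                     if new.strip() != old.strip())
    return mismatches / len(new_lines) <= max_mismatch


old = ["int old_name(int a)", "{", "\tint b = a + 1;", "\treturn b;", "}"]
new = ["int new_name(int a)", "{", "\tint b = a + 1;", "\treturn b;", "}"]
print(fuzzy_block_match(new, old))
```

With the renamed function above, one line out of five differs, which is
within the assumed threshold, so the block is reported as a match.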


Regards!
Bo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: GSoC draft proposal: Line-level history browser
  2010-03-23  6:01                       ` Bo Yang
@ 2010-03-23 10:08                         ` Jakub Narebski
  2010-03-23 10:38                           ` Bo Yang
  2010-03-23 18:57                         ` Jonathan Nieder
  1 sibling, 1 reply; 54+ messages in thread
From: Jakub Narebski @ 2010-03-23 10:08 UTC (permalink / raw)
  To: Bo Yang; +Cc: Jonathan Nieder, Junio C Hamano, gitzilla, Alex Riesen, git


Bo Yang <struggleyb.nku@gmail.com> writes:
> Jonathan Nieder <jrnieder@gmail.com> writes:

> > Hmm, I can imagine some (mutually inconsistent) heuristics:
> >
> >  - Suppose in the blamed commit a single isolated line changed.  Then
> >   it is clear where to look next.
> >
> >  - If the mystery code is at the beginning of the file (resp.
> >   beginning of a diff -C0 hunk), maybe it was based on the line at the
> >   same position within the previous commit.
> >
> >  - Take the line with the lowest Levenshtein distance from the mystery
> >   code.
> >
> >  - Expect certain common patterns of change: substituted words,
> >   whitespace changes, added arguments for a function, things like that.
> >
> > That said, I still don’t have a clear picture of a basic strategy.
> 
> I can't fully understand your strategy above. I think we can
> categorize the code changes into two cases:
>
> 1. The diff looks like this:
> 
> @@ -1008,29 +1000,29 @@ int cmd_format_patch(int argc, const char
> **argv, const char *prefix)
>                 add_signoff = xmemdupz(committer, endpos - committer + 1);
>         }
> 
> -       for (i = 0; i < extra_hdr_nr; i++) {
> -               strbuf_addstr(&buf, extra_hdr[i]);
> +       for (i = 0; i < extra_hdr.nr; i++) {
> +               strbuf_addstr(&buf, extra_hdr.items[i].string);
>                 strbuf_addch(&buf, '\n');
>         }

Errr... how does the first line in preimage differ from the first line
in postimage?  They look as if they are the same:

  -       for (i = 0; i < extra_hdr_nr; i++) {
  +       for (i = 0; i < extra_hdr.nr; i++) {

> 
> i.e. there is both deletion and addition in a change. And this means we
> modify some lines of the code. So, what we do will be tracing the two
> 'minus' lines and then find another diff. Start trace from that diff
> recursively.
>
> Yes, the new added code may also be moved or copied from other place.
> But, I think here, we should focus on the lines before this changeset.

The problem is when you are asking about tracking a subset of lines
that appear in postimage of a patch.  For example if we ask for
history of

                  strbuf_addstr(&buf, extra_hdr.items[i].string);

line, should we track history of

          for (i = 0; i < extra_hdr.nr; i++) {

line which appears in relevant diff chunk?  If not, how we should
detect which line in preimage (if any) corresponds to given line in
postimage?

> 2. The diff looks like:
> 
> @@ -879,9 +885,12 @@ int cmd_grep(int argc, const char **argv, const
> char *prefix)
>         opt.regflags = REG_NEWLINE;
>         opt.max_depth = -1;
> 
> +       strcpy(opt.color_context, "");
>         strcpy(opt.color_filename, "");
> +       strcpy(opt.color_function, "");
>         strcpy(opt.color_lineno, "");
>         strcpy(opt.color_match, GIT_COLOR_BOLD_RED);
> 
> This means, the code here is added from scratch. Here, I think we have
> three options.
> 1. Find if the new code is moved here from other place.
> 2. Find if the new code is copied from other place.
> 3. We find the end of the history, so stop here.
> 
> The remaining problem is how to find the copied/moved code. The newly
> added code may be copied/moved from multiple places with small
> changes.

I guess that you could take a look at how git-blame does handle
this... but I think you would get something like generalization of
ordinary patch, where preimage of chunk can come from different place
/ different file.


P.S. I like it that you provide real-life examples.  They really help
     with understanding what are you talking about.
-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: GSoC draft proposal: Line-level history browser
  2010-03-23 10:08                         ` Jakub Narebski
@ 2010-03-23 10:38                           ` Bo Yang
  2010-03-23 11:22                             ` Jakub Narebski
  2010-03-23 12:02                             ` Peter Kjellerstedt
  0 siblings, 2 replies; 54+ messages in thread
From: Bo Yang @ 2010-03-23 10:38 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Jonathan Nieder, Junio C Hamano, gitzilla, Alex Riesen, git

Hi,

>
> Errr... how does the first line in preimage differ from the first line
> in postimage?  They look as if they are the same:
>
>  -       for (i = 0; i < extra_hdr_nr; i++) {
>  +       for (i = 0; i < extra_hdr.nr; i++) {
>

Maybe some space... :)
>
> The problem is when you are asking about tracking a subset of lines
> that appear in postimage of a patch.  For example if we ask for
> history of
>
>                  strbuf_addstr(&buf, extra_hdr.items[i].string);
>
> line, should we track history of
>
>          for (i = 0; i < extra_hdr.nr; i++) {
>
> line which appears in relevant diff chunk?  If not, how we should
> detect which line in preimage (if any) corresponds to given line in
> postimage?

If I understand correctly, the situation is as follows.

@@ -1008,29 +1000,29 @@ int cmd_format_patch(int argc, const char
**argv, const char *prefix)
               add_signoff = xmemdupz(committer, endpos - committer + 1);
       }

-       for (i = 0; i < extra_hdr_nr; i++) {
-               strbuf_addstr(&buf, extra_hdr[i]);
+       for (i = 0; i < extra_hdr.nr; i++) {
+               strbuf_addstr(&buf, extra_hdr.items[i].string);
               strbuf_addch(&buf, '\n');
       }

Here, the user only asks for tracking the strbuf_addstr line, and we
find the above diff hunk. I think we can then find what the line would
be in the preimage using @@ -1008,29 +1000,29 @@.  The strbuf_addstr
line is located at
1000 (the postimage start line number)
+3 (the number of context lines)
+1 (the number of '+' lines before this line) in the postimage,
and we can calculate its line number in the preimage in the same way:
1008
+3
+1 (the number of '-' lines before this line).

What do you think about this method?

Regards!
Bo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: GSoC draft proposal: Line-level history browser
  2010-03-23 10:38                           ` Bo Yang
@ 2010-03-23 11:22                             ` Jakub Narebski
  2010-03-23 12:23                               ` Bo Yang
  2010-03-23 12:02                             ` Peter Kjellerstedt
  1 sibling, 1 reply; 54+ messages in thread
From: Jakub Narebski @ 2010-03-23 11:22 UTC (permalink / raw)
  To: Bo Yang; +Cc: Jonathan Nieder, Junio C Hamano, gitzilla, Alex Riesen, git

On Tue, 23 Mar 2010, Bo Yang wrote:

Please do not forget to include attribution line, like the one I have
added below:
 
> Jakub Narebski wrote:

> > The problem is when you are asking about tracking a subset of lines
> > that appear in postimage of a patch.  For example if we ask for
> > history of
> >
> >                  strbuf_addstr(&buf, extra_hdr.items[i].string);
> >
> > line, should we track history of
> >
> >          for (i = 0; i < extra_hdr.nr; i++) {
> >
> > line which appears in relevant diff chunk?  If not, how we should
> > detect which line in preimage (if any) corresponds to given line in
> > postimage?
> 
> If I understand correctly, that is as following.
> 
> @@ -1008,29 +1000,29 @@ int cmd_format_patch(int argc, const char
> **argv, const char *prefix)
>                add_signoff = xmemdupz(committer, endpos - committer + 1);
>        }
> 
> -       for (i = 0; i < extra_hdr_nr; i++) {
> -              strbuf_addstr(&buf, extra_hdr[i]);
> +       for (i = 0; i < extra_hdr.nr; i++) {
> +              strbuf_addstr(&buf, extra_hdr.items[i].string);
>                strbuf_addch(&buf, '\n');
>        }
> 
> Here, the user only ask for tracking the strbuf_addstr line. And we
> find the above diff hunk. I think we can then find what the line would
> be in the preimage using @@ -1008,29 +1000,29 @@.  The strbuf_addstr
> is located at
> 1000(the postimage start line number)
> +3(the context number)
> +1(the number of lines '+' before this line) in the postimage,
> and we can calculate its line number in the preimage by the same way
> 1008
> +3
> +1(the number of lines with '-' before this line).
> 
> How do you think about this method?

This would work with the simplest case, but not in more complicated
cases, like for example preimage and postimage with different size.

Take for example the following chunk (fragment):

diff --git a/run-command.c b/run-command.c
index 2feb493..3206d61 100644
--- a/run-command.c
+++ b/run-command.c
@@ -67,19 +67,21 @@ static int child_notifier = -1;
 
 static void notify_parent(void)
 {
-	write(child_notifier, "", 1);
+	ssize_t unused;
+	unused = write(child_notifier, "", 1);
 }
 
 static NORETURN void die_child(const char *err, va_list params)

If you follow ssize_t line, it is created.  If you follow line with
write, which is 2nd line in postimage, its previous version is 1st
line in preimage.


Another example would be reordering of lines, or reordering with
some change.
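
One way to handle the run-command.c hunk above is to walk the hunk line by
line instead of using a fixed offset, and to pair a '+' line with the most
similar '-' line of the same change group. The Python sketch below is only
an illustration under assumptions: map_to_preimage and the 0.6 similarity
threshold are made up for the example, difflib.SequenceMatcher from the
Python standard library stands in for whatever similarity measure a real
implementation would use, and the line counts in the '@@' header are
ignored because the quoted hunk is a fragment. It does not handle reordered
lines.

```python
import re
from difflib import SequenceMatcher

HUNK_RE = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def map_to_preimage(hunk, target_new_line, min_ratio=0.6):
    """Map a postimage line number to a preimage line number, or None.

    hunk is a list of strings: the '@@' header followed by the body lines.
    None means the line looks newly added (or lies outside the hunk).
    """
    header, *body = hunk
    match = HUNK_RE.match(header)
    old_no, new_no = int(match.group(1)), int(match.group(2))
    removed = []  # (preimage line number, text) of the current '-' run
    for line in body:
        tag, text = line[:1], line[1:]
        if tag == "-":
            removed.append((old_no, text))
            old_no += 1
        elif tag == "+":
            if new_no == target_new_line:
                # Pick the most similar removed line of this change group.
                scored = [(SequenceMatcher(None, old_text, text).ratio(), n)
                          for n, old_text in removed]
                if scored:
                    ratio, old_line = max(scored)
                    if ratio >= min_ratio:
                        return old_line
                return None
            new_no += 1
        else:
            # Context line: both sides advance and the change group ends.
            if new_no == target_new_line:
                return old_no
            removed = []
            old_no += 1
            new_no += 1
    return None


hunk = [
    "@@ -67,19 +67,21 @@ static int child_notifier = -1;",
    " ",
    " static void notify_parent(void)",
    " {",
    "-\twrite(child_notifier, \"\", 1);",
    "+\tssize_t unused;",
    "+\tunused = write(child_notifier, \"\", 1);",
    " }",
]
print(map_to_preimage(hunk, 70), map_to_preimage(hunk, 71))
```

Here the ssize_t line (postimage line 70) is treated as newly created,
while the line with write (postimage line 71) maps back to preimage
line 70.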

-- 
Jakub Narebski
Poland

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* RE: GSoC draft proposal: Line-level history browser
  2010-03-23 10:38                           ` Bo Yang
  2010-03-23 11:22                             ` Jakub Narebski
@ 2010-03-23 12:02                             ` Peter Kjellerstedt
  1 sibling, 0 replies; 54+ messages in thread
From: Peter Kjellerstedt @ 2010-03-23 12:02 UTC (permalink / raw)
  To: Bo Yang, Jakub Narebski
  Cc: Jonathan Nieder, Junio C Hamano, gitzilla, Alex Riesen, git

> -----Original Message-----
> From: git-owner@vger.kernel.org [mailto:git-owner@vger.kernel.org] On
> Behalf Of Bo Yang
> Sent: den 23 mars 2010 11:39
> To: Jakub Narebski
> Cc: Jonathan Nieder; Junio C Hamano; gitzilla@gmail.com; Alex Riesen;
> git@vger.kernel.org
> Subject: Re: GSoC draft proposal: Line-level history browser
> 
> Hi,
> 
> >
> > Errr... how does the first line in preimage differ from the first line
> > in postimage?  They look as if they are the same:
> >
> >  -       for (i = 0; i < extra_hdr_nr; i++) {
> >  +       for (i = 0; i < extra_hdr.nr; i++) {
> >
> 
> Maybe some space... :)

Look more closely. Hint: a _ is not the same as a . ;)

//Peter

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: GSoC draft proposal: Line-level history browser
  2010-03-23 11:22                             ` Jakub Narebski
@ 2010-03-23 12:23                               ` Bo Yang
  2010-03-23 13:49                                 ` Jakub Narebski
  0 siblings, 1 reply; 54+ messages in thread
From: Bo Yang @ 2010-03-23 12:23 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Jonathan Nieder, Junio C Hamano, gitzilla, Alex Riesen, git

Hi,

On Tue, Mar 23, 2010 at 7:22 PM, Jakub Narebski <jnareb@gmail.com> wrote:
>
> This would work with the simplest case, but not in more complicated
> cases, like for example preimage and postimage with different size.
>
> Take for example the following chunk (fragment):
>
> diff --git a/run-command.c b/run-command.c
> index 2feb493..3206d61 100644
> --- a/run-command.c
> +++ b/run-command.c
> @@ -67,19 +67,21 @@ static int child_notifier = -1;
>
>  static void notify_parent(void)
>  {
> -       write(child_notifier, "", 1);
> +       ssize_t unused;
> +       unused = write(child_notifier, "", 1);
>  }
>
>  static NORETURN void die_child(const char *err, va_list params)
>
> If you follow ssize_t line, it is created.  If you follow line with
> write, which is 2nd line in postimage, its previous version is 1st
> line in preimage.
>
>
> Another example would be reordering of lines, or reordering with
> some change.

Ah, yes, you are right.

And now I really see the difference between our understandings of the
line-level browser. :) When users want to browse the history of some
line or line range, you want to display only the related lines to them,
but I want to display the minimal diff hunk to them. :)
And I think displaying the minimal diff hunk is sensible and feasible.
Could you please tell me what you think about this?

Regards!
Bo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: GSoC draft proposal: Line-level history browser
  2010-03-23 12:23                               ` Bo Yang
@ 2010-03-23 13:49                                 ` Jakub Narebski
  2010-03-23 15:23                                   ` Bo Yang
  0 siblings, 1 reply; 54+ messages in thread
From: Jakub Narebski @ 2010-03-23 13:49 UTC (permalink / raw)
  To: Bo Yang; +Cc: Jonathan Nieder, Junio C Hamano, gitzilla, Alex Riesen, git

On Tue, Mar 23, 2010, Bo Yang wrote:
> On Tue, Mar 23, 2010 at 7:22 PM, Jakub Narebski <jnareb@gmail.com> wrote:
> >
> > This would work with the simplest case, but not in more complicated
> > cases, like for example preimage and postimage with different size.
> >
> > Take for example the following chunk (fragment):
> >
> > diff --git a/run-command.c b/run-command.c
> > index 2feb493..3206d61 100644
> > --- a/run-command.c
> > +++ b/run-command.c
> > @@ -67,19 +67,21 @@ static int child_notifier = -1;
> >
> >  static void notify_parent(void)
> >  {
> > -       write(child_notifier, "", 1);
> > +       ssize_t unused;
> > +       unused = write(child_notifier, "", 1);
> >  }
> >
> >  static NORETURN void die_child(const char *err, va_list params)
> >
> > If you follow ssize_t line, it is created.  If you follow line with
> > write, which is 2nd line in postimage, its previous version is 1st
> > line in preimage.
> >
> >
> > Another example would be reordering of lines, or reordering with
> > some change.
> 
> Ah, yes, you are right.
> 
> And now I really see the difference between our understandings of the
> line-level browser. :) When users want to browse the history of some
> line or line range, you want to display only the related lines to them,
> but I want to display the minimal diff hunk to them. :)
> And I think displaying the minimal diff hunk is sensible and feasible.
> Could you please tell me what you think about this?

The problem is not what (part of) diff you would display.  The problem
is with following the history (with history simplification).  *After*
displaying diff / chunk / chunk fragment, do we further follow history
of the whole preimage?  Or do we follow history of line pre-change
starting from blamed commit?

If we *don't* follow the history, how line-level browser is different
from (wrapped) git-blame?


Try to come with the result of line-level history for some line in
git sources "by hand": this would help in discussion about what 
line-level history browser should do, and perhaps even be first test
of it (see e.g. tests for git-blame).

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: GSoC draft proposal: Line-level history browser
  2010-03-23 13:49                                 ` Jakub Narebski
@ 2010-03-23 15:23                                   ` Bo Yang
  2010-03-23 19:57                                     ` Jonathan Nieder
  0 siblings, 1 reply; 54+ messages in thread
From: Bo Yang @ 2010-03-23 15:23 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Jonathan Nieder, Junio C Hamano, gitzilla, Alex Riesen, git

Hi,

On Tue, Mar 23, 2010 at 9:49 PM, Jakub Narebski <jnareb@gmail.com> wrote:
> Try to come with the result of line-level history for some line in
> git sources "by hand": this would help in discussion about what
> line-level history browser should do, and perhaps even be first test
> of it (see e.g. tests for git-blame).

Thanks for the advice to come up with a real example, Jakub! I can
give a not too trivial one. :)

If you look at line 1032 of pretty.c, you will find a line like:

format_commit_message(commit, user_format, sb, context);

As an example, let us trace the history of this line.
The most recent diff in which this line appears is:

@@ -900,18 +900,18 @@ char *reencode_commit_message(const struct
commit *commit, const char **encoding
...skipped...
        if (fmt == CMIT_FMT_USERFORMAT) {
-               format_commit_message(commit, user_format, sb, dmode);
+               format_commit_message(commit, user_format, sb, context);
                return;
        }
And we should trace the preimage, something like:
        if (fmt == CMIT_FMT_USERFORMAT) {
               format_commit_message(commit, user_format, sb, dmode);

We will find these below:
@@ -770,7 +775,7 @@ void pretty_print_commit(enum cmit_fmt fmt, const struct com
        const char *encoding;

        if (fmt == CMIT_FMT_USERFORMAT) {
-               format_commit_message(commit, user_format, sb);
+               format_commit_message(commit, user_format, sb, dmode);
                return;
        }

Again:
+
+       if (fmt == CMIT_FMT_USERFORMAT) {
+               format_commit_message(commit, user_format, sb);
+               return;
+       }
+

Here, we find that the line is added from scratch, and the line-level
history browser will try code movement and copy matching to find
whether this line was moved from another file.

And it was. In commit 93fc05eb9 (Split off the pretty print stuff into
its own file), some code was moved from commit.c to pretty.c, and this
line comes from commit.c.

Ok, now, we will trace into commit.c for this line.
Again:
        char *reencoded;
        const char *encoding;
-       char *buf;

-       if (fmt == CMIT_FMT_USERFORMAT)
-               return format_commit_message(commit, user_format, buf_p, space_p);
+       if (fmt == CMIT_FMT_USERFORMAT) {
+               format_commit_message(commit, user_format, sb);
+               return;
+       }

        encoding = (git_log_output_encoding
                    ? git_log_output_encoding

Now, we will trace the commit which produced the above preimage of the
diff hunk. Because there are four lines of the preimage in our
tracing window, we should follow any commit which intersects with these
four lines. Fortunately, there is only one commit.

@@ -1165,7 +1166,7 @@ unsigned long pretty_print_commit(enum cmit_fmt fmt,
        char *buf;

        if (fmt == CMIT_FMT_USERFORMAT)
-               return format_commit_message(commit, msg, buf_p, space_p);
+               return format_commit_message(commit, user_format, buf_p, space_p);

        encoding = (git_log_output_encoding
                    ? git_log_output_encoding


Again, we find:

        if (fmt == CMIT_FMT_USERFORMAT)
-               return format_commit_message(commit, msg, buf, space);
+               return format_commit_message(commit, msg, buf_p, space_p);

        encoding = (git_log_output_encoding

Again:
        char *encoding;

+       if (fmt == CMIT_FMT_USERFORMAT)
+               return format_commit_message(commit, msg, buf, space);
+
        encoding = (git_log_output_encoding
                    ? git_log_output_encoding

And here, finally, we reach a place where the code was added from
scratch and not copied or moved from another place.

The line-level history browser will just display all the related diffs
to the user and trace the code modification/move/copy.

It carefully traces the preimage of the smallest related diff hunk. If
more than one commit intersects with the preimage, we will stop and
ask the user to select which way to continue tracing.
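
The "trace the preimage" step in the walk above can be sketched in a
few lines of Python. This is only an illustration and not git code:
the Hunk tuple, map_to_preimage() and the zero-context hunk convention
are assumptions made for this sketch. When the tracked range touches a
hunk, it is widened to the whole preimage side of that hunk, and None
is returned when the lines were added from scratch.

```python
from typing import List, NamedTuple, Optional, Tuple


class Hunk(NamedTuple):
    """A zero-context hunk (hypothetical structure, not git's own).

    Preimage lines [old_start, old_start + old_len) are replaced by
    postimage lines [new_start, new_start + new_len). Lines are 1-based.
    """
    old_start: int
    old_len: int
    new_start: int
    new_len: int


def _map_line(line: int, hunks: List[Hunk], at_end: bool) -> int:
    """Map one postimage line number back to the preimage."""
    shift = 0  # preimage line = postimage line + shift, outside of hunks
    for h in hunks:  # hunks must be sorted by new_start
        new_end = h.new_start + h.new_len  # one past the hunk's last new line
        if h.new_start <= line < new_end:
            # The line was written by this hunk: widen to its preimage side.
            return h.old_start + h.old_len - 1 if at_end else h.old_start
        if new_end <= line:
            shift = (h.old_start + h.old_len) - new_end
    return line + shift


def map_to_preimage(
    start: int, end: int, hunks: List[Hunk]
) -> Tuple[Optional[Tuple[int, int]], bool]:
    """Map the inclusive postimage range [start, end] to the preimage.

    Returns (preimage_range, touched). preimage_range is None when the
    whole range was added from scratch by this commit.
    """
    pre_start = _map_line(start, hunks, at_end=False)
    pre_end = _map_line(end, hunks, at_end=True)
    touched = any(
        (h.new_start <= end and h.new_start + h.new_len > start)
        if h.new_len
        else (start < h.new_start <= end)  # pure deletion inside the range
        for h in hunks
    )
    if pre_end < pre_start:
        return None, touched
    return (pre_start, pre_end), touched


# Made-up hunks: five lines inserted at line 100, then one line changed
# (preimage line 773 became postimage line 778).
example_hunks = [Hunk(100, 0, 100, 5), Hunk(773, 1, 778, 1)]
```

With these made-up hunks, map_to_preimage(778, 778, example_hunks)
follows the changed line back to preimage line 773, while a range that
lies entirely inside the inserted block maps to None, which is the
"added from scratch" case where move/copy detection would start.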

I hope this can help us to discuss the problem, thanks!

Regards!
Bo


* Re: GSoC draft proposal: Line-level history browser
  2010-03-23  6:01                       ` Bo Yang
  2010-03-23 10:08                         ` Jakub Narebski
@ 2010-03-23 18:57                         ` Jonathan Nieder
  2010-03-24  2:39                           ` Bo Yang
  1 sibling, 1 reply; 54+ messages in thread
From: Jonathan Nieder @ 2010-03-23 18:57 UTC (permalink / raw)
  To: Bo Yang; +Cc: Junio C Hamano, gitzilla, Alex Riesen, git

Hi,

[reordering quoted text for convenience]

Bo Yang wrote:

> I can't understand fully about your above strategy. I think we can
> category the code change into two cases:

Thanks!  What you said is much more coherent than the vague things I
wrote.

> 2. The diff looks like:
[...]
> This means, the code here is added from scratch. Here, I think we have
> three options.
> 1. Find if the new code is moved here from other place.
> 2. Find if the new code is copied from other place.
> 3. We find the end of the history, so stop here.

If the code is copied verbatim from elsewhere, this is something ‘git
blame’ is already very good at.  See [1].

Fuzzy matching is a big pain.  ‘git blame’ knows how to ignore
whitespace.  Dscho suggested counting common words.  Maybe there are
some other ways.  I think there is a real danger of getting lost in this
problem and wasting a lot of time, so although it is very interesting, I
would consider any progress in this area a bonus rather than a goal.

> 1. The diff looks like:
> 
> @@ -1008,29 +1000,29 @@ int cmd_format_patch(int argc, const char
> **argv, const char *prefix)
>                 add_signoff = xmemdupz(committer, endpos - committer + 1);
>         }
> 
> -       for (i = 0; i < extra_hdr_nr; i++) {
> -               strbuf_addstr(&buf, extra_hdr[i]);
> +       for (i = 0; i < extra_hdr.nr; i++) {
> +               strbuf_addstr(&buf, extra_hdr.items[i].string);
>                 strbuf_addch(&buf, '\n');
>         }
> 
> 
> ie: there is both deletion and addition in a change. And this means we
> modify some lines of the code. So, what we do will be tracing the two
> 'minus' lines and then find another diff. Start trace from that diff
> recursively.

If you can make a heuristic along these lines work well, I think it
would be great.  I imagine it might work very well for commits that made
nice, small changes (like many of those in git.git).  Jakub pointed out
some of the difficulties, and I like to hope your idea of “when in doubt,
include more lines” may work well in many cases in git.git still.

Good luck, and thank you for taking my crazy ideas seriously. :)

Regards,
Jonathan

[1] See v1.4.4-rc1~2 (Merge branch 'jc/pickaxe', 2006-11-07) and the
commits preceding it.  About that series, Junio wrote:

	Actually the plan is to make it do _true_ pickaxe,
	although it will most likely end up either in dustbin or
	replace blame.

It replaced blame.

I am not actually sure, but I assume “true pickaxe” refers to the
goals described in <http://gitster.livejournal.com/35628.html>
and the linked-to message.


* Re: GSoC draft proposal: Line-level history browser
  2010-03-23 15:23                                   ` Bo Yang
@ 2010-03-23 19:57                                     ` Jonathan Nieder
  2010-03-23 21:51                                       ` A Large Angry SCM
  2010-03-24  2:30                                       ` Bo Yang
  0 siblings, 2 replies; 54+ messages in thread
From: Jonathan Nieder @ 2010-03-23 19:57 UTC (permalink / raw)
  To: Bo Yang; +Cc: Jakub Narebski, Junio C Hamano, gitzilla, Alex Riesen, git

Bo Yang wrote:

> It traces the preimage of the minimum related diff hunk carefully, if
> there is any case that there are more than one commit intersect with
> the preimage, we will stop and ask the users to select which way to go
> on tracing.

That might be necessary, but I will admit that I suspect it to be
harder to make useful.  One of the very nice things about ‘git log’ is
that it is easy to browse through history in a nonlinear way in a
pager (by using a pager’s search functionality).  The “backend” ‘git
rev-list’ is easy to write scripts with, also because of its simple
input and output.

If your program requires input from the user, how will it paginate its
output?  Most pagers expect the standard input to be available for
input from the user.

One approach (I will not say it is a good one) to the problem of
ambiguous origins for a line is to blame _both_ parents.  That is,
start following both lines of history in your revision walking.
Perhaps higher-level tools like ‘git log --graph’ and gitk could
visually represent the branched history you are showing.

Another approach is to just choose one parent automatically: for
example, prefer the first parent, or assign some score representing
the relatedness of each parent and choose the most related one.
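
A rough sketch of that second approach, in Python rather than C just
to keep it short. The scoring function is an assumption (difflib's
similarity ratio over whole lines), and pick_parent() is a made-up
name; the only point is that ties fall back to the first parent.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple


def relatedness(tracked: Sequence[str], candidate: Sequence[str]) -> float:
    """Similarity in [0, 1] between two line ranges, compared line by line."""
    return SequenceMatcher(None, tracked, candidate, autojunk=False).ratio()


def pick_parent(
    tracked: Sequence[str], parent_ranges: List[Sequence[str]]
) -> Tuple[int, float]:
    """Return (index, score) of the most related parent.

    parent_ranges[i] holds the candidate lines found in parent i, with
    index 0 being the first parent. The comparison is strict, so a tie
    is resolved in favour of the earlier parent.
    """
    if not parent_ranges:
        raise ValueError("a merge needs at least one parent to choose from")
    best_index, best_score = 0, -1.0
    for index, lines in enumerate(parent_ranges):
        score = relatedness(tracked, lines)
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score
```

Any other score would slot in the same way, as long as it is cheap
enough to compute for every merge on the walk.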

Jonathan


* Re: GSoC draft proposal: Line-level history browser
  2010-03-23 19:57                                     ` Jonathan Nieder
@ 2010-03-23 21:51                                       ` A Large Angry SCM
  2010-03-24  2:30                                       ` Bo Yang
  1 sibling, 0 replies; 54+ messages in thread
From: A Large Angry SCM @ 2010-03-23 21:51 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Bo Yang, Jakub Narebski, Junio C Hamano, Alex Riesen, git

Jonathan Nieder wrote:
> Bo Yang wrote:
> 
>> It traces the preimage of the minimum related diff hunk carefully, if
>> there is any case that there are more than one commit intersect with
>> the preimage, we will stop and ask the users to select which way to go
>> on tracing.
> 
> That might be necessary, but I will admit that I suspect it to be
> harder to make useful.  One of the very nice things about ‘git log’ is
> that it is easy to browse through history in a nonlinear way in a
> pager (by using a pager’s search functionality).  The “backend” ‘git
> rev-list’ is easy to write scripts with, also because of its simple
> input and output.
> 
> If your program requires input from the user, how will it paginate its
> output?  Most pagers expect the standard input to be available for
> input from the user.
> 
> One approach (I will not say it is a good one) to the problem of
> ambiguous origins for a line is to blame _both_ parents.  That is,
> start following both lines of history in your revision walking.
> Perhaps higher-level tools like ‘git log --graph’ and gitk could
> visually represent the branched history you are showing.
> 
> Another approach is to just choose one parent automatically: for
> example, prefer the first parent, or assign some score representing
> the relatedness of each parent and choose the most related one.

What I would like to see (and may be too much for a GSOC project) is the
result to be a simplified commit graph with additional annotations of
the line range mappings that could be fed into something like a modified
gitk to view the _history_ of the lines of interest.


* Re: GSoC draft proposal: Line-level history browser
  2010-03-23 19:57                                     ` Jonathan Nieder
  2010-03-23 21:51                                       ` A Large Angry SCM
@ 2010-03-24  2:30                                       ` Bo Yang
  1 sibling, 0 replies; 54+ messages in thread
From: Bo Yang @ 2010-03-24  2:30 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Jakub Narebski, Junio C Hamano, gitzilla, Alex Riesen, git

Hi,

On Wed, Mar 24, 2010 at 3:57 AM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Bo Yang wrote:
>
>> It traces the preimage of the minimum related diff hunk carefully, if
>> there is any case that there are more than one commit intersect with
>> the preimage, we will stop and ask the users to select which way to go
>> on tracing.
>
> That might be necessary, but I will admit that I suspect it to be
> harder to make useful.  One of the very nice things about ‘git log’ is
> that it is easy to browse through history in a nonlinear way in a
> pager (by using a pager’s search functionality).  The “backend” ‘git
> rev-list’ is easy to write scripts with, also because of its simple
> input and output.
>
> If your program requires input from the user, how will it paginate its
> output?  Most pagers expect the standard input to be available for
> input from the user.
>
> One approach (I will not say it is a good one) to the problem of
> ambiguous origins for a line is to blame _both_ parents.  That is,
> start following both lines of history in your revision walking.
> Perhaps higher-level tools like ‘git log --graph’ and gitk could
> visually represent the branched history you are showing.
>
> Another approach is to just choose one parent automatically: for
> example, prefer the first parent, or assign some score representing
> the relatedness of each parent and choose the most related one.

Both approaches are very valuable to me. I think I will probably
propose the first one in my real proposal to Git. Thanks a lot, you
have really helped me very much!

Regards!
Bo


* Re: GSoC draft proposal: Line-level history browser
  2010-03-23 18:57                         ` Jonathan Nieder
@ 2010-03-24  2:39                           ` Bo Yang
  2010-03-24  4:02                             ` Jonathan Nieder
  0 siblings, 1 reply; 54+ messages in thread
From: Bo Yang @ 2010-03-24  2:39 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Junio C Hamano, gitzilla, Alex Riesen, git

Hi,

On Wed, Mar 24, 2010 at 2:57 AM, Jonathan Nieder <jrnieder@gmail.com> wrote:
>
> If you can make a heuristic along these lines this work well, I think it
> would be great.  I imagine it might work very well for commits that made
> nice, small changes (like many of those in git.git).  Jakub pointed out
> some of the difficulties, and I like to hope your idea of “when in doubt,
> include more lines” may work well in many cases in git.git still.
>
> Good luck, and thank you for taking my crazy ideas seriously. :)
>
> Regards,
> Jonathan
>
> [1] See v1.4.4-rc1~2 (Merge branch 'jc/pickaxe', 2006-11-07) and the
> commits preceding it.  About that series, Junio wrote:
>
>        Actually the plan is to make it do _true_ pickaxe,
>        although it will most likely end up either in dustbin or
>        replace blame.
>
> It replaced blame.
>
> I am not actually sure, but I assume “true pickaxe” refers to the
> goals described in <http://gitster.livejournal.com/35628.html>
> and the linked-to message.

I have looked over the article and the message from Linus, and they
really helped me very much. They point out most of the things a
line-level tool should do, and I am happy to find that this is
similar to my proposal. :) Thanks again for your valuable advice; I
think I can come up with a better proposal now. Thanks!

Regards!
Bo


* Re: GSoC draft proposal: Line-level history browser
  2010-03-24  2:39                           ` Bo Yang
@ 2010-03-24  4:02                             ` Jonathan Nieder
  0 siblings, 0 replies; 54+ messages in thread
From: Jonathan Nieder @ 2010-03-24  4:02 UTC (permalink / raw)
  To: Bo Yang; +Cc: Junio C Hamano, gitzilla, Alex Riesen, git, Linus Torvalds

Bo Yang wrote:
> On Wed, Mar 24, 2010 at 2:57 AM, Jonathan Nieder <jrnieder@gmail.com> wrote:

>> I am not actually sure, but I assume “true pickaxe” refers to the
>> goals described in <http://gitster.livejournal.com/35628.html>
>> and the linked-to message.
>
> I have looked over the article and the message from Linus, it really
> help me very much.

Okay, so now I looked over that thread again.  I found this [1]:

  <http://minnie.tuhs.org/Programs/Ctcompare/index.html>

It’s for fuzzy matching of a certain kind.  The latest version is under
the GPLv3, unfortunately for us.  I would still like to reiterate my
warning to not get sidetracked on this, but maybe it would be pleasant
reading.

Enjoy,
Jonathan

[1] Thanks, Linus.
http://thread.gmane.org/gmane.comp.version-control.git/27/focus=225


* Re: GSoC draft proposal: Line-level history browser
       [not found]           ` <201003282120.40536.trast@student.ethz.ch>
@ 2010-03-29  4:14             ` Bo Yang
  2010-03-29 18:42               ` Thomas Rast
  0 siblings, 1 reply; 54+ messages in thread
From: Bo Yang @ 2010-03-29  4:14 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Johannes Schindelin, git

Hi Thomas,
On Mon, Mar 29, 2010 at 3:20 AM, Thomas Rast <trast@student.ethz.ch> wrote:
> Hi Bo
>
> I have one specific question about the draft project description:
>
> You wrote:
>> And the timeline will be:
>> April 26 - May 23:   Catch up with Git code base and study the
>> implementation of blame.c and log.c thouroughly.
>>
>> May 24 - June 21 :   Complete a version which supports code
>> modifcation trace but without code movement and code copy support.
>>
>> June 22 - June 29:   Complete a version which supports code movement
>> inside one file.
>>
>> June 30 - July 7:    Complete a version which supports code movement
>> between files inside one commit.
>>
>> July 8 - July 15:    Complete a version which supports code copy of
>> modified file in one commit.
>>
>> July 16 - July 23:   Complete a version which supports code copy of
>> any file in one commit tree.
>>
>> July 24 - August 7:  Complete fuzzy matching of code movement and copy detect.
>
> Where are you taking those numbers from?
>
> (I'm fine if the answer is "I'm making them up from whole cloth" but I
> want to know anyway :-P)

You mean the dates? They are made up according to GSoC's timeline
and my estimate of the workload of each milestone.

That was the draft proposal; after a long thread of discussion, the
timeline and milestones have changed a lot.  The fuzzy matching
milestone will become a bonus milestone instead of a primary GSoC
milestone. I think it may help if I provide the newest version, so I
have pasted it at the end of this email.

I would appreciate any feedback from you, especially about the
implementation section :)


Regards!
Bo

-------------------------------------------------------------------------
Draft proposal(v3): Line-level History Browser

=====Purpose of this project=====
"git blame" can tell us who is responsible for a line of code, but it
can't help if we want to see in detail how the lines of code have
evolved into what they are now. For example, in Git, commit 93fc05e
(Split off the pretty print stuff into its own file) split out
pretty_print_commit() from commit.c into pretty.c, and it is hard to
verify without much hassle that the code split was really only a code
split, rather than a split with an evil change.

This project will add a new feature to 'git log' to display line-level
history. It can trace the history of any line range of a given file at
any revision. For each history entry, it will show the commit and the
diff block which contains the changes to the lines the user is
interested in.

This utility will trace the whole modification history of the lines of
interest and stop when it finds the root of the lines, which is the
point where all of the code was added from scratch. Users can also
specify how deeply they want this utility to trace. The tool will
also follow code movement and copying inside one commit.

Note that the history may not always be a single thread of commits.
If more than one commit produces the specified line range, or there
is more than one source of code move/copy, the thread of history will
split. The utility may then stop, show all of the commits with their
code changes, and let the user select which one to trace next. Or it
may display the split history the way 'git log --graph' does; we will
provide options to control this.

=====Work and technical issues=====
==Scenario==
For how we use the line-level browser and how the utility should
behave, here is a scenario:
http://article.gmane.org/gmane.comp.version-control.git/143024/match=line+level+history+browser
It contains code movement between files but not code copy or fuzzy matching.

==Features==
This new feature should be used for exploring the history of changes
to a certain line range of code in one file. The following features
will be supported:
1. Follow the history of code modifications of any single line range
starting from any revision. The above scenario provides a good example
of what this function is used for and how it interacts with users.

2. Follow code movement inside one file, and optionally (for
performance reasons) follow code movement between files. With code
movement detection, we can find code refactoring easily, just as the
above scenario does.

3. Provide configurable context to users: display only the diff block
for the lines the user is interested in, or display the whole diff
with the interesting area highlighted in color.

4. Optionally detect code copies. This may help us to understand why
some code is here and help with code refactoring. For example, we can
turn frequently copied code into a function.

5. Simple fuzzy matching for code move/copy, with an option to control
whether fuzzy matching is used, for performance reasons. This can help
us to find out whether some code was really moved here literally or
with some evil changes. It may also help in situations such as moving
a Java class to another file with only its class name changed. In
general, fuzzy matching can help a lot with code detection. There can
be many fuzzy detection strategies, but for time reasons we will only
try to support the simplest one this summer, maybe a strategy like:
90% of the lines in two ranges of code are identical, or 90% of the
words are identical. I think this will be discussed again before
coding.

6. Provide a configurable way to display the history: either the 'git
log --graph' way, or stopping to ask the user when the history splits.

7. Reuse as many of the existing 'git log' options as possible.
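
As a rough illustration of the "90% of lines or 90% of words" idea in
feature 5, here is a small Python sketch. It is not a settled design:
the function names, the whitespace stripping and the way the fraction
is computed (shared items divided by the size of the larger range) are
all assumptions made for this example.

```python
from collections import Counter
from typing import List, Sequence


def shared_fraction(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of the larger multiset that also occurs in the other one."""
    if not a and not b:
        return 1.0
    common = sum((Counter(a) & Counter(b)).values())
    return common / max(len(a), len(b))


def looks_moved(old_lines: List[str], new_lines: List[str],
                threshold: float = 0.9) -> bool:
    """Guess whether new_lines is a (possibly edited) move of old_lines."""
    # Line-level score: ignore leading/trailing whitespace on each line.
    line_score = shared_fraction(
        [line.strip() for line in old_lines],
        [line.strip() for line in new_lines],
    )
    # Word-level score: split every line on whitespace.
    word_score = shared_fraction(
        [word for line in old_lines for word in line.split()],
        [word for line in new_lines for word in line.split()],
    )
    return line_score >= threshold or word_score >= threshold
```

A 20-line Java class moved to another file with only its name changed
shares 19 of 20 lines, so it passes the line test; two unrelated
ranges pass neither test.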

==Design and implementation==
Git stores whole blobs instead of code deltas, so we should traverse
the commit history and directly access the tree/blob objects to
compute the code delta and search for the diff which contains the
interesting lines. Since git uses libxdiff to generate its diffs, we
should iterate through all of xdiff's diff blocks and find what the
code looked like before the commit. This will be done using the
callback mechanism. Here, we will find a new line range which is the
original code before this commit, and then start another search from
the current commit and the new line range. Recursively, we can find
all of the modification history. We will stop when we find that the
current line range of interest was added from scratch and was not
moved from another place in the file. Here, if the user wants to trace
code copies, more work will be done to find the possible copy source.
We may also stop the traversal when we reach the maximum search depth.
Also, if the thread of change history splits into two or more commits,
we stop and show the user all of the related commits and the
corresponding line ranges.

Generally,
1. A new callback for xdi_diff will parse the diff hunks and store
line-level history info.
2. builtin/line-log.c will be added to implement most of the new features.
3. builtin/log.c will be changed to add this new utility to the front end.
4. The documentation will be updated to introduce this new tool.
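
The walk described in this section can be sketched in Python over a
plain list of file versions. This is only a model of the idea: it
assumes a linear history, uses difflib in place of libxdiff, and does
no move/copy detection; trace_range() and its 0-based half-open ranges
are made up for the sketch.

```python
from difflib import SequenceMatcher
from typing import List


def trace_range(versions: List[List[str]], start: int, end: int) -> List[int]:
    """Return indices of the versions that changed the tracked lines.

    versions: file contents as lists of lines, oldest first.
    start, end: 0-based half-open range [start, end) in the newest version.
    The result is ordered newest first and ends at the version that
    added the lines from scratch.
    """
    history = [[]] + list(versions)  # an empty file precedes the first version
    touched = []
    for idx in range(len(history) - 1, 0, -1):
        old, new = history[idx - 1], history[idx]
        pre_start = pre_end = None
        changed = False
        matcher = SequenceMatcher(None, old, new, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if j2 <= start or j1 >= end:
                continue  # this block does not overlap the tracked range
            if tag == "equal":
                # Unchanged lines: shift the overlapping part to the old side.
                lo = i1 + max(j1, start) - j1
                hi = i1 + min(j2, end) - j1
            else:
                # Changed block: when in doubt, include the whole preimage.
                changed = True
                lo, hi = i1, i2
            pre_start = lo if pre_start is None else min(pre_start, lo)
            pre_end = hi if pre_end is None else max(pre_end, hi)
        if changed:
            touched.append(idx - 1)  # index into the caller's versions list
        if pre_start is None or pre_start == pre_end:
            break  # nothing left to follow: the lines were added here
        start, end = pre_start, pre_end
    return touched
```

In the real tool the loop would follow commit parents instead of list
indices, and the point where the range becomes empty is where code
movement and copy detection would take over.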

=====Milestones and Timeline=====
This summer, we will add support for a line-level history browser for
only one file. Support for multiple ranges is currently not part of
this project.

The milestones of the project are:
1. Simple modification change history.
1a) Have an initial version which does nothing else than parse git-log
options and a single additional -L, requiring exactly one file to be
specified

1b) Implement the xdiff callback and identify the commits touching the
line range

1c) Implement a workable line-level log browser

2. Code movement inside one file.
2a) Support literal movement of a whole section of code.

2b) Support code movement with splitting.

2c) Support code movement with code uniting.

3. Code movement inside one commit between files.

4. Code lines copied from other files that were modified in the same commit.
4a) Support literal copying of a whole section of code.

4b) Support code copy with splitting and uniting.

5. Code copy from any place in one commit tree.

6. Fuzzy matching support. Note that there is not yet an exact strategy
for fuzzy matching, and I would like this to be a bonus milestone
instead of a primary milestone for GSoC. We will support this well if
time allows.


And the timeline will be:
April 26 - May 23:
1st week, follow the bird's eye view of Git's source code.
2nd week, have a look at the code of merge-base and analyze the rev-list machinery.
3rd week, have a look at builtin/log.c.
4th week, understand blame.c.

May 24 - June 13:   Complete a version which supports tracing code
modifications but without code movement and code copy support. In
detail:
1st week, milestones 1a and 1b
2nd-3rd week, milestone 1c

June 14 - July 11:   Complete a version which supports code movement.
1st week, milestone 2a
2nd week, milestone 2b
3rd week, milestone 2c
4th week, milestone 3

July 12 - August 1:   Complete a version which supports code copy.
1st week, milestone 4a
2nd week, milestone 4b
3rd week, milestone 5

August 2 - August 14:  Complete fuzzy matching for code movement and copy detection.


There is nearly one milestone per week, so every week I will post a
status update to the list to let the community know about the
project's progress. Patches will be sent on feature completion rather
than per milestone.

=====About me=====
I am Bo Yang, a Chinese graduate student majoring in Computer Science
at NanKai University. I first came into contact with open source
software five years ago and began to contribute code to the open
source community three years ago. I have contributed to
Mozilla/Mingw/Netsurf. Technically, I am experienced in C and Bash
shell. I took part in last year's GSoC with the Netsurf project, where
I completed most of a DOM library in C.
I began to use Git for source code revision control about two years
ago. I use Git to track my Mozilla trunk source code, because updating
Mozilla code by CVS at my school is very slow. So I wrote a script
that automatically updates the trunk with CVS on a server at midnight,
when the network is fast, and then uses Git to maintain the code. Then
I use Git on my PC to clone/update the source code from my local
server, and that is very fast. I use Git to track my changes to the
code and some bug fixes. I think it is an excellent tool for branches
and history.
Git is my beloved daily tool for revision control. I have much
experience with it, have read "Git Internals", and have some basic
knowledge of Git's code base. I think the line-level history explorer
is really suitable for me and that I can make a good start with this
project in the Git community.


* Re: GSoC draft proposal: Line-level history browser
  2010-03-29  4:14             ` Bo Yang
@ 2010-03-29 18:42               ` Thomas Rast
  2010-03-30  2:52                 ` Bo Yang
  0 siblings, 1 reply; 54+ messages in thread
From: Thomas Rast @ 2010-03-29 18:42 UTC (permalink / raw)
  To: Bo Yang; +Cc: Johannes Schindelin, git, Jens Lehmann

Bo Yang wrote:
> Draft proposal(v3): Line-level History Browser
> 
> =====Purpose of this project=====
> "git blame" can tell us who is responsible for a line of code, but it
> can't help if we want to get the detail of how the lines of code have
> evolved as what it is now. For example, in Git, commit 93fc05e(Split
> off the pretty print stuff into its own file) split out
> pretty_print_commit() from commit.c into pretty.c, and it is hard to
> verify without much hassle that the code split was really only a code
> split, rather than a split with an evil change.

Is this really the right use-case?  AFAICT the answer to the implied
question is given by simply running 'git blame -M 93fc05e:pretty.c'.

(Coming up with a better example should be easy; the way I currently
think of the feature means that it will mostly replace git-blame for
me...)

> Note that, the history may not always be a single thread of commits.
> If there are more than one commits which produce the specified line
> range, or there are more than one source of code move/copy, the thread
> of history will split. And this utility may stop and provide all
> commits with its code changes to the user, let the user to select
> which one to trace next. Or, it may also use 'git log --graph' way to
> display the splitted history, we will provide options to control this.

I would, by far, prefer the latter.  So far 'git log' has always been
noninteractive, and there's no really good way to make it interactive
because it also goes through the pager.  (In the case of blame this is
solved in 'git gui blame', which might also be a reasonable approach.)

OTOH, if you can really fake a history walk, then just about any
log-oriented tool should be able to work with it.  You'd get graphical
output for free with gitk and git log --graph.  I haven't really
thought through the ramifications, though.

> =====Work and technical issues=====
> ==Scenario==
> For how we use the line level browser and how the utility should act
> to us, here is an scenario:
> http://article.gmane.org/gmane.comp.version-control.git/143024/match=line+level+history+browser
> It contains code movement between files but not code copy and fuzzy matching.

I would prefer if you could inline a short example, perhaps starting
at your second diff snippet.  Examples are good ;-)

Even if not, please drop the /match= parameter since it is very
distracting.

> 5. Simply fuzzy matching for code move/copy. Provide an option to
> control whether we start a fuzzy matching for performance reason. This
> can help us to find whether some code is really literally moved to
> here or with some evil changes. And this may also help in some
> situation like if we move some Java class to another file with only
> its class name changed. Anyway, fuzzy matching can help much on code
> detection. And there can be many fuzzy detect strategies, but we will
> only try to support the simplest one in this summer for time reason.
> Maybe a strategy like: 90% of the lines between two ranges of code are
> identical or 90% of words are identical. This will be discussed again
> before coding I think.
> 
> 6. Provide a configurable way for how to display the history. A 'git
> log --graph' way or stop to ask users when we meet history splitting.

See above.

> 7. Reuse 'git log' existing options as many as possible.

One thing that IMO is missing from this list, is a plumbing mode that
just feeds the raw data to a (presumed) frontend.  It could be as
simple as supporting

  git log -L ... --pretty=raw --raw

or similar, if this provides sufficient information.  Compare 'git
blame --porcelain'.

> ==Design and implementation==
> Git store all the blobs instead of code delta, so we should traverse
> the commit history and directly access the tree/blob objects to
> compute the code delta and search for the diff which contains the
> interesting lines. Since git use libxdiff to format its diff file, we
> should iterate through all xdiff's diff blocks and find what the code
> looks like before the commit. This will be done using the callback
> mechanism. Here, we will find a new line range which is the origin
> code before this commit. And then start another search from the
> current commit and the new line range. Recursively, we can find all
> the modification history. We will stop when we find that the current
> interested line range is added from scratch and is not moved from
> other place of the file. Here, if the user want to trace code copy,
> more work will be done to find the possible code copy. We may also
> stop the traverse when we reach the max search depth. Also, if the
> thread of change history split into two or more commits, we stop and
> provide the users all the related commits and corresponding line
> range.
> 
> Generally,
> 1. New callback for xdi_diff to parse the diff hunk and store line
> level history info.
> 2. builtin/line-log.c will be added to complete most of the new features.
> 3. builtin/log.c will be changed to add this new utility to the front end.
> 4. Documents will be updated to introduce this new tool.

This section is too handwavy for my taste.  I think in most cases you
say "we can" when you really mean "git-blame already does it, so we
can just use a similar algorithm".  Which is fine, but I'd rather see
it spelled out so as to see what is not already covered by blame's code.

> =====Milestones and Timeline=====
> In this summer, we will add support of line level history browser for
> only one file. The multiple ranges support is currently not in this
> project.

I agree with what Dscho pointed out earlier in the thread: multiple
ranges will be an easy exercise once you can follow a "blame split"
where half the lines blame to some file and half the lines blame to
another.

Other than that I think the milestones look sensible.  As a theory
guy, I'm not a huge believer in timelines, so let's hope someone else
comments on it.

> And there is one milestone for each week nearly, so every week, I will
> post a stutas update to the list to let the community know the project
> progress. And, patches will be sent for feature completion but not
> milestone.

Push the code somewhere public as you go, even between feature
completions.  Post RFCs once you have workable features so people can
comment.  Generally try to be visible.

Bonus points if you can think of something visible to do during the
period where you look at code,

> April 26 - May 23:
> 1st week, follow the bird's eye view on Git's source code.
> 2nd week, have a look at the code of merge-base, analyze the rev-listmachinery
> 3rd week, have a look at builtin/log.c,
> 4th week, understand blame.c

whether it be documenting your learnings in some way, improving docs
as you go, or documenting the APIs you find.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch


* Re: GSoC draft proposal: Line-level history browser
  2010-03-29 18:42               ` Thomas Rast
@ 2010-03-30  2:52                 ` Bo Yang
  2010-03-30  9:07                   ` Michael J Gruber
  2010-03-30  9:10                   ` Jakub Narebski
  0 siblings, 2 replies; 54+ messages in thread
From: Bo Yang @ 2010-03-30  2:52 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Johannes Schindelin, git, Jens Lehmann

Hi Thomas,

On Tue, Mar 30, 2010 at 2:42 AM, Thomas Rast <trast@student.ethz.ch> wrote:
>
> Is this really the right use-case?  AFAICT the answer to the implied
> question is given by simply running 'git blame -M 93fc05e:pretty.c'.
>
> (Coming up with a better example should be easy; the way I currently
> think of the feature means that it will mostly replace git-blame for
> me...)

I will cite the same example below in the scenario. :)

> I would, by far, prefer the latter.  So far 'git log' has always been
> noninteractive, and there's no really good way to make it interactive
> because it also goes through the pager.  (In the case of blame this is
> solved in 'git gui blame', which might also be a reasonable approach.)
>
> OTOH, if you can really fake a history walk, then just about any
> log-oriented tool should be able to work with it.  You'd get graphical
> output for free with gitk and git log --graph.  I haven't really
> thought through the ramifications, though.

OK, so let us abandon the interactive approach entirely.

>> =====Work and technical issues=====
>> ==Scenario==
>> To show how we would use the line-level browser and how the utility
>> should respond, here is a scenario:
>> http://article.gmane.org/gmane.comp.version-control.git/143024/match=line+level+history+browser
>> It contains code movement between files but not code copying or fuzzy matching.
>
> I would prefer if you could inline a short example, perhaps starting
> at your second diff snippet.  Examples are good ;-)
>
> Even if not, please drop the /match= parameter since it is very
> distracting.

I put the example at the end of the proposal as a reference.

>
>> 7. Reuse as many existing 'git log' options as possible.
>
> One thing that IMO is missing from this list, is a plumbing mode that
> just feeds the raw data to a (presumed) frontend.  It could be as
> simple as supporting
>
>  git log -L ... --pretty=raw --raw
>
> or similar, if this provides sufficient information.  Compare 'git
> blame --porcelain'.

Very good feedback, I will add this, thanks a lot!
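
For reference, the kind of frontend Thomas has in mind can be sketched
against the existing 'git blame --porcelain' format, since the raw
output of 'git log -L' is not defined anywhere in this thread. The
following is only a minimal sketch: the function name and the sample
input (made-up hashes and authors) are assumptions for illustration.

```python
import re

# Header of each blamed line: <40-hex sha> <orig line> <final line> [<group size>]
HEADER = re.compile(r"^([0-9a-f]{40}) (\d+) (\d+)(?: (\d+))?$")


def parse_blame_porcelain(text):
    """Return a list of (sha, final_line, author, content) tuples."""
    authors = {}  # sha -> author; commit details appear only once per commit
    entries = []
    sha = final_line = None
    for line in text.split("\n"):
        m = HEADER.match(line)
        if m:
            sha, final_line = m.group(1), int(m.group(3))
        elif line.startswith("\t"):
            # A TAB-prefixed line carries the file content and ends the entry.
            entries.append((sha, final_line, authors.get(sha), line[1:]))
        elif line.startswith("author "):
            authors[sha] = line[len("author "):]
    return entries


# Hypothetical sample; the hashes and names are made up for illustration.
SAMPLE = (
    "a" * 40 + " 1 1 2\n"
    "author Alice\n"
    "summary first commit\n"
    "filename pretty.c\n"
    "\tint x;\n"
    + "a" * 40 + " 2 2\n"
    "\tint y;\n"
    + "b" * 40 + " 5 3 1\n"
    "author Bob\n"
    "summary second commit\n"
    "filename pretty.c\n"
    "\tint z;\n"
)

entries = parse_blame_porcelain(SAMPLE)
for sha, lineno, author, content in entries:
    print(sha[:8], lineno, author, content)
```

A plumbing mode for line-level history would presumably need a
similarly regular, line-oriented layout so that a frontend can parse it
this simply.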

>
> This section is too handwavy for my taste.  I think in most cases you
> say "we can" when you really mean "git-blame already does it, so we
> can just use a similar algorithm".  Which is fine, but I'd rather see
> it spelled out so as to see what is not already covered by blame's code.

I will change this in the next version to make it clear, though I will
only add a few words stating that 'blame does something similar'. :)

>
> Push the code somewhere public as you go, even between feature
> completions.  Post RFCs once you have workable features so people can
> comment.  Generally try to be visible.
>
> Bonus points if you can think of something visible to do during the
> period where you look at code,

Yeah, that is really a good point. I have been playing around on
github.com and am trying to set up http://github.com/byang/my_git for
this purpose. :)

>> April 26 - May 23:
>> 1st week, follow the bird's eye view on Git's source code.
>> 2nd week, have a look at the code of merge-base, analyze the rev-list machinery
>> 3rd week, have a look at builtin/log.c,
>> 4th week, understand blame.c
>
> whether it be documenting your learnings in some way, improving docs
> as you go, or documenting the APIs you find.

Thanks a lot for this good advice, I will do so.

With this feedback, I think I can put together a complete version of
the proposal and submit it to Google. Thanks!

Regards!
Bo

* Re: GSoC draft proposal: Line-level history browser
  2010-03-30  2:52                 ` Bo Yang
@ 2010-03-30  9:07                   ` Michael J Gruber
  2010-03-30  9:38                     ` Michael J Gruber
  2010-03-30 11:10                     ` Bo Yang
  2010-03-30  9:10                   ` Jakub Narebski
  1 sibling, 2 replies; 54+ messages in thread
From: Michael J Gruber @ 2010-03-30  9:07 UTC (permalink / raw)
  To: Bo Yang; +Cc: Thomas Rast, Johannes Schindelin, git, Jens Lehmann

Bo Yang venit, vidit, dixit 30.03.2010 04:52:
> Hi Thomas,
> 
> On Tue, Mar 30, 2010 at 2:42 AM, Thomas Rast <trast@student.ethz.ch> wrote:
>>
>> Is this really the right use-case?  AFAICT the answer to the implied
>> question is given by simply running 'git blame -M 93fc05e:pretty.c'.
>>
>> (Coming up with a better example should be easy; the way I currently
>> think of the feature means that it will mostly replace git-blame for
>> me...)
> 
> I will cite the same example below in the scenario. :)
> 
>> I would, by far, prefer the latter.  So far 'git log' has always been
>> noninteractive, and there's no really good way to make it interactive
>> because it also goes through the pager.  (In the case of blame this is
>> solved in 'git gui blame', which might also be a reasonable approach.)
>>
>> OTOH, if you can really fake a history walk, then just about any
>> log-oriented tool should be able to work with it.  You'd get graphical
>> output for free with gitk and git log --graph.  I haven't really
>> thought through the ramifications, though.
> 
> OK, so let us abandon the interactive approach entirely.
> 
>>> =====Work and technical issues=====
>>> ==Scenario==
>>> To show how we would use the line-level browser and how the utility
>>> should respond, here is a scenario:
>>> http://article.gmane.org/gmane.comp.version-control.git/143024/match=line+level+history+browser
>>> It contains code movement between files but not code copying or fuzzy matching.
>>
>> I would prefer if you could inline a short example, perhaps starting
>> at your second diff snippet.  Examples are good ;-)
>>
>> Even if not, please drop the /match= parameter since it is very
>> distracting.
> 
> I put the example at the end of the proposal as a reference.
> 
>>
>>> 7. Reuse as many existing 'git log' options as possible.
>>
>> One thing that IMO is missing from this list, is a plumbing mode that
>> just feeds the raw data to a (presumed) frontend.  It could be as
>> simple as supporting
>>
>>  git log -L ... --pretty=raw --raw
>>
>> or similar, if this provides sufficient information.  Compare 'git
>> blame --porcelain'.
> 
> Very good feedback, I will add this, thanks a lot!
> 
>>
>> This section is too handwavy for my taste.  I think in most cases you
>> say "we can" when you really mean "git-blame already does it, so we
>> can just use a similar algorithm".  Which is fine, but I'd rather see
>> it spelled out so as to see what is not already covered by blame's code.
> 
> I will change this in the next version to make it clear, though I will
> only add a few words stating that 'blame does something similar'. :)
> 
>>
>> Push the code somewhere public as you go, even between feature
>> completions.  Post RFCs once you have workable features so people can
>> comment.  Generally try to be visible.
>>
>> Bonus points if you can think of something visible to do during the
>> period where you look at code,
> 
> Yeah, that is really a good point. I have been playing around on
> github.com and am trying to set up http://github.com/byang/my_git for
> this purpose. :)

You may want to create your repo as a fork of gitster/git instead.
That's easier on github; they have a hard time anyway these days ;)
Seriously, it helps you make use of their network feature etc.

I don't have anything to add to your proposal (I like it), but I'll be
at NKU next week (Conference @ Chern Institute) so drop me a PM if you wish.

Cheers,
Michael

* Re: GSoC draft proposal: Line-level history browser
  2010-03-30  2:52                 ` Bo Yang
  2010-03-30  9:07                   ` Michael J Gruber
@ 2010-03-30  9:10                   ` Jakub Narebski
  2010-03-30 11:15                     ` Bo Yang
  1 sibling, 1 reply; 54+ messages in thread
From: Jakub Narebski @ 2010-03-30  9:10 UTC (permalink / raw)
  To: Bo Yang; +Cc: Thomas Rast, Johannes Schindelin, git, Jens Lehmann

Bo Yang <struggleyb.nku@gmail.com> writes:
> On Tue, Mar 30, 2010 at 2:42 AM, Thomas Rast <trast@student.ethz.ch> wrote:
> >
> > Is this really the right use-case?  AFAICT the answer to the implied
> > question is given by simply running 'git blame -M 93fc05e:pretty.c'.
> >
> > (Coming up with a better example should be easy; the way I currently
> > think of the feature means that it will mostly replace git-blame for
> > me...)
> 
> I will cite the same example below in the scenario. :)

By the way, it would be good to find an example with an "evil merge",
which means that the change to the given line(s) is in the merge commit
itself.  Correctly simplifying history in such a case might be
non-trivial.

Another example that it would be good to have is a "history split"
example, which means the case where some lines were consolidated
(e.g. after refactoring), and some of the lines in the "preimage" come
from different lines of history.

These examples would help with writing tests for this feature (compare
the tests for blame), although they are not, in my opinion, necessary
for the proposal itself.

I hope that all these cases would fall out naturally from the
implementation.
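
To make the "history split" case concrete, here is a minimal sketch of
how a tracked line range could be cut into sub-ranges once each line is
attributed to an origin. It is not taken from blame's code; the
`origins` list and the function name are hypothetical, standing in for
whatever a blame-like pass would compute.

```python
from itertools import groupby


def split_range_by_origin(origins, start, end):
    """Split lines start..end (1-based, inclusive) into runs sharing an origin.

    `origins` is a hypothetical list where origins[i] names the commit (or
    file) that line i+1 is attributed to, e.g. by a blame-like pass.
    """
    numbered = [(n, origins[n - 1]) for n in range(start, end + 1)]
    runs = []
    for origin, group in groupby(numbered, key=lambda pair: pair[1]):
        lines = [n for n, _ in group]
        runs.append((origin, lines[0], lines[-1]))
    return runs


# Lines 2..6 of a six-line file; "c1" and "c2" are made-up origin names.
origins = ["c1", "c1", "c2", "c2", "c2", "c1"]
runs = split_range_by_origin(origins, 2, 6)
print(runs)
```

Each resulting run would then be followed separately, which is where
the history of the original range splits into several threads.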

[...]
> > Push the code somewhere public as you go, even between feature
> > completions.  Post RFCs once you have workable features so people can
> > comment.  Generally try to be visible.
> >
> > Bonus points if you can think of something visible to do during the
> > period where you look at code,
> 
> Yeah, that is really a good point. I have been playing around on
> github.com and am trying to set up http://github.com/byang/my_git for
> this purpose. :)

my_git is not very descriptive... well, unless you would do your work
on GSoC2010/line-level-history-browser branch, or something like that.

It might be a good idea to have repo.or.cz as an additional repository,
as a fork of git.git repo, and with SoC / GSoC labels.  See
http://repo.or.cz/w/git.git/forks?t=soc

-- 
Jakub Narebski
Poland
ShadeHawk on #git

* Re: GSoC draft proposal: Line-level history browser
  2010-03-30  9:07                   ` Michael J Gruber
@ 2010-03-30  9:38                     ` Michael J Gruber
  2010-03-30 11:10                     ` Bo Yang
  1 sibling, 0 replies; 54+ messages in thread
From: Michael J Gruber @ 2010-03-30  9:38 UTC (permalink / raw)
  Cc: Bo Yang, Thomas Rast, Johannes Schindelin, git, Jens Lehmann

Michael J Gruber venit, vidit, dixit 30.03.2010 11:07:

> You may want to create your repo as a fork of gitster/git instead.

Actually, make this git/git, the other one isn't being updated... Sorry!

Michael

* Re: GSoC draft proposal: Line-level history browser
  2010-03-30  9:07                   ` Michael J Gruber
  2010-03-30  9:38                     ` Michael J Gruber
@ 2010-03-30 11:10                     ` Bo Yang
  1 sibling, 0 replies; 54+ messages in thread
From: Bo Yang @ 2010-03-30 11:10 UTC (permalink / raw)
  To: Michael J Gruber; +Cc: Thomas Rast, Johannes Schindelin, git, Jens Lehmann

Hi Michael,

On Tue, Mar 30, 2010 at 5:07 PM, Michael J Gruber
<git@drmicha.warpmail.net> wrote:
>
> You may want to create your repo as a fork of gitster/git instead.
> That's easier on github; they have a hard time anyway these days ;)
> Seriously, it helps you make use of their network feature etc.

Yeah, forked git/git. :)

> I don't have anything to add to your proposal (I like it), but I'll be
> at NKU next week (Conference @ Chern Institute) so drop me a PM if you wish.

That really is a big coincidence. :)
I would be very happy to meet you at NKU, and I can be your guide
around NKU and some beautiful spots in Tianjin if you have spare time. :)
Anyway, let us talk about this in personal email off the list. :-)

Regards!
Bo

* Re: GSoC draft proposal: Line-level history browser
  2010-03-30  9:10                   ` Jakub Narebski
@ 2010-03-30 11:15                     ` Bo Yang
  0 siblings, 0 replies; 54+ messages in thread
From: Bo Yang @ 2010-03-30 11:15 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Thomas Rast, Johannes Schindelin, git, Jens Lehmann

Hi Jakub,

On Tue, Mar 30, 2010 at 5:10 PM, Jakub Narebski <jnareb@gmail.com> wrote:
> By the way, it would be good to find an example with an "evil merge",
> which means that the change to the given line(s) is in the merge commit
> itself.  Correctly simplifying history in such a case might be
> non-trivial.

It is a little time-consuming to find such a change in the history. I
think we can construct some manually at the start of the project
and put them into the test cases. :)
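
As a rough illustration of what such a hand-made test case would check,
the sketch below flags lines of a merge result that appear in neither
parent, which is the defining property of an "evil merge" as Jakub
describes it. This is a simplification that compares line text only
instead of running a real diff, and the function name and sample
contents are made up for illustration.

```python
def evil_lines(merge_lines, parent_versions, start, end):
    """Return (line_number, text) pairs in start..end (1-based, inclusive)
    of the merge result whose text appears in none of the parents."""
    known = set()
    for parent in parent_versions:
        known.update(parent)
    return [
        (n, merge_lines[n - 1])
        for n in range(start, end + 1)
        if merge_lines[n - 1] not in known
    ]


# Made-up file contents for the two parents and the merge result.
ours = ["int a;", "int b;", "return a;"]
theirs = ["int a;", "int c;", "return a;"]
merged = ["int a;", "int b;", "int c;", "return a + 1;"]

result = evil_lines(merged, [ours, theirs], 1, 4)
print(result)
```

Here only the last line is reported, because it was introduced by the
merge itself rather than by either side.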

> Another example that it would be good to have is a "history split"
> example, which means the case where some lines were consolidated
> (e.g. after refactoring), and some of the lines in the "preimage" come
> from different lines of history.
>
> These examples would help with writing tests for this feature (compare
> the tests for blame), although they are not, in my opinion, necessary
> for the proposal itself.
>
> I hope that all these cases would fall out naturally from the
> implementation.
> [...]
>> > Push the code somewhere public as you go, even between feature
>> > completions.  Post RFCs once you have workable features so people can
>> > comment.  Generally try to be visible.
>> >
>> > Bonus points if you can think of something visible to do during the
>> > period where you look at code,
>>
>> Yeah, that is really a good point. I have been playing around on
>> github.com and am trying to set up http://github.com/byang/my_git for
>> this purpose. :)
>
> my_git is not very descriptive... well, unless you would do your work
> on GSoC2010/line-level-history-browser branch, or something like that.
>
> It might be a good idea to have repo.or.cz as an additional repository,
> as a fork of git.git repo, and with SoC / GSoC labels.  See
> http://repo.or.cz/w/git.git/forks?t=soc

Ah, I have created a repo at http://github.com/byang/gsoc-line-browser
and a mirror at http://repo.or.cz/w/gsoc-line-browser.git. I think
this is enough. :-)

Thanks!
Bo

Thread overview: 54+ messages
2010-03-20  9:18 GSoC draft proposal: Line-level history browser Bo Yang
2010-03-20 11:30 ` Johannes Schindelin
2010-03-20 13:10   ` Bo Yang
2010-03-20 13:30     ` Junio C Hamano
2010-03-21  6:03       ` Bo Yang
2010-03-20 13:36     ` Johannes Schindelin
2010-03-21  6:05       ` Bo Yang
2010-03-20 20:35 ` Alex Riesen
2010-03-20 20:57   ` Junio C Hamano
2010-03-21  6:10     ` Bo Yang
2010-03-20 21:58   ` A Large Angry SCM
2010-03-21  6:16     ` Bo Yang
2010-03-21 13:19       ` A Large Angry SCM
2010-03-22  3:48         ` Bo Yang
2010-03-22  4:24           ` Junio C Hamano
2010-03-22  4:34             ` Bo Yang
2010-03-22  5:32               ` Junio C Hamano
2010-03-22  7:31                 ` Bo Yang
2010-03-22  7:41                   ` Junio C Hamano
2010-03-22  7:52                     ` Bo Yang
2010-03-22  8:10                     ` Jonathan Nieder
2010-03-23  6:01                       ` Bo Yang
2010-03-23 10:08                         ` Jakub Narebski
2010-03-23 10:38                           ` Bo Yang
2010-03-23 11:22                             ` Jakub Narebski
2010-03-23 12:23                               ` Bo Yang
2010-03-23 13:49                                 ` Jakub Narebski
2010-03-23 15:23                                   ` Bo Yang
2010-03-23 19:57                                     ` Jonathan Nieder
2010-03-23 21:51                                       ` A Large Angry SCM
2010-03-24  2:30                                       ` Bo Yang
2010-03-23 12:02                             ` Peter Kjellerstedt
2010-03-23 18:57                         ` Jonathan Nieder
2010-03-24  2:39                           ` Bo Yang
2010-03-24  4:02                             ` Jonathan Nieder
2010-03-22 10:39                 ` Alex Riesen
2010-03-22 15:05                   ` Johannes Schindelin
2010-03-22  3:52         ` Bo Yang
2010-03-22 15:48           ` Jakub Narebski
2010-03-22 18:21             ` Johannes Schindelin
2010-03-22 18:38               ` Sverre Rabbelier
2010-03-22 19:26                 ` Johannes Schindelin
2010-03-22 20:21                   ` Sverre Rabbelier
2010-03-22 19:24           ` Johannes Schindelin
2010-03-23  6:08             ` Bo Yang
2010-03-23  6:27             ` Bo Yang
     [not found]           ` <201003282120.40536.trast@student.ethz.ch>
2010-03-29  4:14             ` Bo Yang
2010-03-29 18:42               ` Thomas Rast
2010-03-30  2:52                 ` Bo Yang
2010-03-30  9:07                   ` Michael J Gruber
2010-03-30  9:38                     ` Michael J Gruber
2010-03-30 11:10                     ` Bo Yang
2010-03-30  9:10                   ` Jakub Narebski
2010-03-30 11:15                     ` Bo Yang
