git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Terrible bad performance for it blame --date=iso -C -C master -- <file>
@ 2017-03-31  7:30 Ulrich Windl
  2017-03-31 15:52 ` Junio C Hamano
  0 siblings, 1 reply; 11+ messages in thread
From: Ulrich Windl @ 2017-03-31  7:30 UTC (permalink / raw)
  To: git


Hi!

I was running "vc-annotate" in Emacs for a file from a large repository (>40000 files, a big percentage being binary, about 10 commits). For the first file the result was presented rather soon, but for a second file the command did not finish even after about 10 minutes!

The file in question is a rather short text file (124 kB), and according to git log it has one commit.

While being bored, I did an strace of the command to find out that a huge number of files is inspected.

I'm using git of openSUSE LEap 42.2 (git-2.10.2-3.1.x86_64).

Here's the stat of the process:

% cat /proc/6307/stat
6307 (git) R 6189 6307 6307 0 -1 4194304 30514692 0 1860 0 142035 6333 0 0 20 0 1 0 14519800 5850595328 391884 18446744073709551615 4194304 6038988 140726752300720 140726752293728 5549816 0 0 67108864 0 0 0 0 17 2 0 0 28211 0 0 8138160 8173100 9625600 140726752306005 140726752306072 140726752306072 140726752309227 0

Regards,
Ulrich


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Terrible bad performance for it blame --date=iso -C -C master -- <file>
  2017-03-31  7:30 Terrible bad performance for it blame --date=iso -C -C master -- <file> Ulrich Windl
@ 2017-03-31 15:52 ` Junio C Hamano
  2017-03-31 16:23   ` Samuel Lijin
                     ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Junio C Hamano @ 2017-03-31 15:52 UTC (permalink / raw)
  To: Ulrich Windl; +Cc: git

"Ulrich Windl" <Ulrich.Windl@rz.uni-regensburg.de> writes:

> I was running "vc-annotate" in Emacs for a file from a large
> repository (>40000 files, a big percentage being binary, about 10
> commits). For the first file the result was presented rather soon, but
> for a second file the command did not finish even after about 10
> minutes!
>
> The file in question is a rather short text file (124 kB), and
> according to git log it has one commit.
>
> While being bored, I did an strace of the command to find out that a
> huge number of files is inspected.

With -C -C the user (vc-annotate?) is asking to inspect huge number
of files, to find if the contents of the file (except for the part
that came from its own previous version) came from other existing
files.  So this is very much expected.

It might not be a bad idea to teach "blame" not to pay attention to
any path that is marked as "-diff" (e.g. binary files) when trying
to see if remaining contents appeared by borrowing from them.  We do
not have that heuristics (yet).

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Terrible bad performance for it blame --date=iso -C -C master -- <file>
  2017-03-31 15:52 ` Junio C Hamano
@ 2017-03-31 16:23   ` Samuel Lijin
  2017-04-03  6:56   ` Antw: " Ulrich Windl
  2017-05-01 17:58   ` Terrible bad performance for it blame --date=iso -C -C master -- <file> Samuel Lijin
  2 siblings, 0 replies; 11+ messages in thread
From: Samuel Lijin @ 2017-03-31 16:23 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Ulrich Windl, git, Harrison Stall

Hi Ulrich,

Is there any chance you could share the repo where this is coming from?

This is actually something a colleague and I are looking into seeing
if we can crunch out some performance gains since -C -C isn't
threaded.

Sam

On Fri, Mar 31, 2017 at 10:52 AM, Junio C Hamano <gitster@pobox.com> wrote:
> "Ulrich Windl" <Ulrich.Windl@rz.uni-regensburg.de> writes:
>
>> I was running "vc-annotate" in Emacs for a file from a large
>> repository (>40000 files, a big percentage being binary, about 10
>> commits). For the first file the result was presented rather soon, but
>> for a second file the command did not finish even after about 10
>> minutes!
>>
>> The file in question is a rather short text file (124 kB), and
>> according to git log it has one commit.
>>
>> While being bored, I did an strace of the command to find out that a
>> huge number of files is inspected.
>
> With -C -C the user (vc-annotate?) is asking to inspect huge number
> of files, to find if the contents of the file (except for the part
> that came from its own previous version) came from other existing
> files.  So this is very much expected.
>
> It might not be a bad idea to teach "blame" not to pay attention to
> any path that is marked as "-diff" (e.g. binary files) when trying
> to see if remaining contents appeared by borrowing from them.  We do
> not have that heuristics (yet).

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Antw: Re: Terrible bad performance for it blame --date=iso -C -C master -- <file>
  2017-03-31 15:52 ` Junio C Hamano
  2017-03-31 16:23   ` Samuel Lijin
@ 2017-04-03  6:56   ` Ulrich Windl
  2017-04-03 10:56     ` Terrible bad performance for it blame --date=iso -C SZEDER Gábor
  2017-05-01 17:58   ` Terrible bad performance for it blame --date=iso -C -C master -- <file> Samuel Lijin
  2 siblings, 1 reply; 11+ messages in thread
From: Ulrich Windl @ 2017-04-03  6:56 UTC (permalink / raw)
  To: gitster; +Cc: git

Junio,

thanks for explaining! So if there are at least two commits, blame is fast, but with only one commit blame tries hard to find another commit that might have contributed to the one file?
I verified that without those "-C" options the result is very quick.
In the other case (for the user bored of waiting seeking for some entertainment ;-)) a "-v (verbose) option could be useful.
Or at the very least: If git is expecting that some operation will take (or already did take) a lot of time, give some message explaining why it is taking a lot of time, and maybe how to avoid that.

Regards,
Ulrich


>>> Junio C Hamano <gitster@pobox.com> schrieb am 31.03.2017 um 17:52 in Nachricht
<xmqq60ip1m0f.fsf@gitster.mtv.corp.google.com>:
> "Ulrich Windl" <Ulrich.Windl@rz.uni-regensburg.de> writes:
> 
>> I was running "vc-annotate" in Emacs for a file from a large
>> repository (>40000 files, a big percentage being binary, about 10
>> commits). For the first file the result was presented rather soon, but
>> for a second file the command did not finish even after about 10
>> minutes!
>>
>> The file in question is a rather short text file (124 kB), and
>> according to git log it has one commit.
>>
>> While being bored, I did an strace of the command to find out that a
>> huge number of files is inspected.
> 
> With -C -C the user (vc-annotate?) is asking to inspect huge number
> of files, to find if the contents of the file (except for the part
> that came from its own previous version) came from other existing
> files.  So this is very much expected.
> 
> It might not be a bad idea to teach "blame" not to pay attention to
> any path that is marked as "-diff" (e.g. binary files) when trying
> to see if remaining contents appeared by borrowing from them.  We do
> not have that heuristics (yet).





^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Terrible bad performance for it blame --date=iso -C
  2017-04-03  6:56   ` Antw: " Ulrich Windl
@ 2017-04-03 10:56     ` SZEDER Gábor
  2017-04-03 15:16       ` Jakub Narębski
  0 siblings, 1 reply; 11+ messages in thread
From: SZEDER Gábor @ 2017-04-03 10:56 UTC (permalink / raw)
  To: Ulrich Windl; +Cc: SZEDER Gábor, gitster, git

> In the other case (for the user bored of waiting seeking for some
> entertainment ;-)) a "-v (verbose) option could be useful.  Or at the
> very least: If git is expecting that some operation will take (or
> already did take) a lot of time, give some message explaining why it
> is taking a lot of time, and maybe how to avoid that.

It already does so by default since v2.8.0, see aba37f495 (blame: add
support for --[no-]progress option, 2015-12-12).

  $ time git blame sha1_file.c |wc -l
  4026
  
  real    0m1.744s
  user    0m1.672s
  sys     0m0.068s
  $ time git blame -C -C sha1_file.c |wc -l
  Blaming lines: 100% (4026/4026), done.
  4026
  
  real    0m3.832s
  user    0m3.716s
  sys     0m0.112s

However, after a short peek at that commit, it only displays progress
by default when stderr is a terminal, which might not be the case when
invoked from emacs.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Terrible bad performance for it blame --date=iso -C
  2017-04-03 10:56     ` Terrible bad performance for it blame --date=iso -C SZEDER Gábor
@ 2017-04-03 15:16       ` Jakub Narębski
  2017-04-04  6:23         ` Antw: " Ulrich Windl
  0 siblings, 1 reply; 11+ messages in thread
From: Jakub Narębski @ 2017-04-03 15:16 UTC (permalink / raw)
  To: SZEDER Gábor, Ulrich Windl; +Cc: Junio C Hamano, git

W dniu 03.04.2017 o 12:56, SZEDER Gábor pisze:
> Ulrich Windl wrote:

>> In the other case (for the user bored of waiting seeking for some
>> entertainment ;-)) a "-v (verbose) option could be useful.  Or at the
>> very least: If git is expecting that some operation will take (or
>> already did take) a lot of time, give some message explaining why it
>> is taking a lot of time, and maybe how to avoid that.
> 
> It already does so by default since v2.8.0, see aba37f495 (blame: add
> support for --[no-]progress option, 2015-12-12).
> 
>   $ time git blame sha1_file.c |wc -l
>   4026
>   
>   real    0m1.744s
>   user    0m1.672s
>   sys     0m0.068s
>   $ time git blame -C -C sha1_file.c |wc -l
>   Blaming lines: 100% (4026/4026), done.
>   4026
>   
>   real    0m3.832s
>   user    0m3.716s
>   sys     0m0.112s
> 
> However, after a short peek at that commit, it only displays progress
> by default when stderr is a terminal, which might not be the case when
> invoked from emacs.

Emacs (magit?) should use `git blame --porcelain`, and do its own
progress report, just like 'git gui blame' and incremental blame mode
of gitweb.

Actually... there already is git-blamed - Minor mode for incremental
blame for Git, and mo-git-blame - An interactive, iterative 'git blame'
mode for Emacs, both available on ELPA (Emacs Lisp Package Archive).

HTH
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Antw: Re: Terrible bad performance for it blame --date=iso -C
  2017-04-03 15:16       ` Jakub Narębski
@ 2017-04-04  6:23         ` Ulrich Windl
  0 siblings, 0 replies; 11+ messages in thread
From: Ulrich Windl @ 2017-04-04  6:23 UTC (permalink / raw)
  To: Jakub Narebski, SZEDER Gábor; +Cc: git

>>> Jakub Narebski <jnareb@gmail.com> schrieb am 03.04.2017 um 17:16 in
Nachricht
<0ccc5cab-26b7-4b02-b964-452b61e92579@gmail.com>:
> W dniu 03.04.2017 o 12:56, SZEDER Gábor pisze:
>> Ulrich Windl wrote:
> 
>>> In the other case (for the user bored of waiting seeking for some
>>> entertainment ;-)) a "-v (verbose) option could be useful.  Or at the
>>> very least: If git is expecting that some operation will take (or
>>> already did take) a lot of time, give some message explaining why it
>>> is taking a lot of time, and maybe how to avoid that.
>> 
>> It already does so by default since v2.8.0, see aba37f495 (blame: add
>> support for --[no-]progress option, 2015-12-12).
>> 
>>   $ time git blame sha1_file.c |wc -l
>>   4026
>>   
>>   real    0m1.744s
>>   user    0m1.672s
>>   sys     0m0.068s
>>   $ time git blame -C -C sha1_file.c |wc -l
>>   Blaming lines: 100% (4026/4026), done.
>>   4026
>>   
>>   real    0m3.832s
>>   user    0m3.716s
>>   sys     0m0.112s
>> 
>> However, after a short peek at that commit, it only displays progress
>> by default when stderr is a terminal, which might not be the case when
>> invoked from emacs.
> 
> Emacs (magit?) should use `git blame --porcelain`, and do its own
> progress report, just like 'git gui blame' and incremental blame mode
> of gitweb.

I was thinking similar: The pain vc-annotate obviously should work without
those "-C" options, and with prefix argument (C-u <num> in Emacs) it could
start looking for copied stuff. HMO...

Worse than no progress reporting is the inability to kill the process (if you
run out of patience) with C-g (that stops most commands in Emacs).

> 
> Actually... there already is git-blamed - Minor mode for incremental
> blame for Git, and mo-git-blame - An interactive, iterative 'git blame'
> mode for Emacs, both available on ELPA (Emacs Lisp Package Archive).

I confess taht is still use RCS from time to time, and I prefer the
higher-level Emacs commands ;-)

Regards,
Ulrich


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Terrible bad performance for it blame --date=iso -C -C master -- <file>
  2017-03-31 15:52 ` Junio C Hamano
  2017-03-31 16:23   ` Samuel Lijin
  2017-04-03  6:56   ` Antw: " Ulrich Windl
@ 2017-05-01 17:58   ` Samuel Lijin
  2017-05-02  0:59     ` Junio C Hamano
  2 siblings, 1 reply; 11+ messages in thread
From: Samuel Lijin @ 2017-05-01 17:58 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Ulrich Windl, git

On Fri, Mar 31, 2017 at 10:52 AM, Junio C Hamano <gitster@pobox.com> wrote:
> "Ulrich Windl" <Ulrich.Windl@rz.uni-regensburg.de> writes:
>
>> I was running "vc-annotate" in Emacs for a file from a large
>> repository (>40000 files, a big percentage being binary, about 10
>> commits). For the first file the result was presented rather soon, but
>> for a second file the command did not finish even after about 10
>> minutes!
>>
>> The file in question is a rather short text file (124 kB), and
>> according to git log it has one commit.
>>
>> While being bored, I did an strace of the command to find out that a
>> huge number of files is inspected.
>
> With -C -C the user (vc-annotate?) is asking to inspect huge number
> of files, to find if the contents of the file (except for the part
> that came from its own previous version) came from other existing
> files.  So this is very much expected.
>
> It might not be a bad idea to teach "blame" not to pay attention to
> any path that is marked as "-diff" (e.g. binary files) when trying
> to see if remaining contents appeared by borrowing from them.  We do
> not have that heuristics (yet).

Could you elaborate on this? Do you mean to tell diffcore-rename to
ignore diff_filespec objects if they're binary?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Terrible bad performance for it blame --date=iso -C -C master -- <file>
  2017-05-01 17:58   ` Terrible bad performance for it blame --date=iso -C -C master -- <file> Samuel Lijin
@ 2017-05-02  0:59     ` Junio C Hamano
  2017-05-03  7:12       ` Samuel Lijin
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2017-05-02  0:59 UTC (permalink / raw)
  To: Samuel Lijin; +Cc: Ulrich Windl, git

Samuel Lijin <sxlijin@gmail.com> writes:

> On Fri, Mar 31, 2017 at 10:52 AM, Junio C Hamano <gitster@pobox.com> wrote:
>
>> It might not be a bad idea to teach "blame" not to pay attention to
>> any path that is marked as "-diff" (e.g. binary files) when trying
>> to see if remaining contents appeared by borrowing from them.  We do
>> not have that heuristics (yet).
>
> Could you elaborate on this? Do you mean to tell diffcore-rename to
> ignore diff_filespec objects if they're binary?

No and yes ;-).  I do not think it is a good idea to unconditionally
ignore binary in diffcore-rename.

But when we know that the rename detection is called from inside
blame.c, where by definition we would be digging line-oriented
contents, there is no point checking if the contents we are looking
for came from an existing binary file.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Terrible bad performance for it blame --date=iso -C -C master -- <file>
  2017-05-02  0:59     ` Junio C Hamano
@ 2017-05-03  7:12       ` Samuel Lijin
  2017-05-03 12:29         ` Antw: " Ulrich Windl
  0 siblings, 1 reply; 11+ messages in thread
From: Samuel Lijin @ 2017-05-03  7:12 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Ulrich Windl, git

On Mon, May 1, 2017 at 7:59 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Samuel Lijin <sxlijin@gmail.com> writes:
>
>> On Fri, Mar 31, 2017 at 10:52 AM, Junio C Hamano <gitster@pobox.com> wrote:
>>
>>> It might not be a bad idea to teach "blame" not to pay attention to
>>> any path that is marked as "-diff" (e.g. binary files) when trying
>>> to see if remaining contents appeared by borrowing from them.  We do
>>> not have that heuristics (yet).
>>
>> Could you elaborate on this? Do you mean to tell diffcore-rename to
>> ignore diff_filespec objects if they're binary?
>
> No and yes ;-).  I do not think it is a good idea to unconditionally
> ignore binary in diffcore-rename.
>
> But when we know that the rename detection is called from inside
> blame.c, where by definition we would be digging line-oriented
> contents, there is no point checking if the contents we are looking
> for came from an existing binary file.

A followup thought: perhaps it would make sense for blame.c rename
detection to only consider files with the same extension?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Antw: Re: Terrible bad performance for it blame --date=iso -C -C master -- <file>
  2017-05-03  7:12       ` Samuel Lijin
@ 2017-05-03 12:29         ` Ulrich Windl
  0 siblings, 0 replies; 11+ messages in thread
From: Ulrich Windl @ 2017-05-03 12:29 UTC (permalink / raw)
  To: Samuel Lijin, gitster; +Cc: git

>>> Samuel Lijin <sxlijin@gmail.com> schrieb am 03.05.2017 um 09:12 in Nachricht
<CAJZjrdVZ7gTZvm=CcG7pUPWtXjsMsHgyMRiE8xoXm=bZ4j6FEQ@mail.gmail.com>:
> On Mon, May 1, 2017 at 7:59 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> Samuel Lijin <sxlijin@gmail.com> writes:
>>
>>> On Fri, Mar 31, 2017 at 10:52 AM, Junio C Hamano <gitster@pobox.com> wrote:
>>>
>>>> It might not be a bad idea to teach "blame" not to pay attention to
>>>> any path that is marked as "-diff" (e.g. binary files) when trying
>>>> to see if remaining contents appeared by borrowing from them.  We do
>>>> not have that heuristics (yet).
>>>
>>> Could you elaborate on this? Do you mean to tell diffcore-rename to
>>> ignore diff_filespec objects if they're binary?
>>
>> No and yes ;-).  I do not think it is a good idea to unconditionally
>> ignore binary in diffcore-rename.
>>
>> But when we know that the rename detection is called from inside
>> blame.c, where by definition we would be digging line-oriented
>> contents, there is no point checking if the contents we are looking
>> for came from an existing binary file.
> 
> A followup thought: perhaps it would make sense for blame.c rename
> detection to only consider files with the same extension?

As traditionally file type and file name extension had little to do with each other in UNIX, I thing that would not be a good solution (to assume the opposite). I also know that Git does not care too much about hierarchies also, but my idea was so find candidates "nearby", i.e. look in the subdirectories of the first level and in the parent directory; then, if needed, widen the radius of the search...

Regards,
Ulrich






^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2017-05-03 12:29 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-31  7:30 Terrible bad performance for it blame --date=iso -C -C master -- <file> Ulrich Windl
2017-03-31 15:52 ` Junio C Hamano
2017-03-31 16:23   ` Samuel Lijin
2017-04-03  6:56   ` Antw: " Ulrich Windl
2017-04-03 10:56     ` Terrible bad performance for it blame --date=iso -C SZEDER Gábor
2017-04-03 15:16       ` Jakub Narębski
2017-04-04  6:23         ` Antw: " Ulrich Windl
2017-05-01 17:58   ` Terrible bad performance for it blame --date=iso -C -C master -- <file> Samuel Lijin
2017-05-02  0:59     ` Junio C Hamano
2017-05-03  7:12       ` Samuel Lijin
2017-05-03 12:29         ` Antw: " Ulrich Windl

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).