git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: man dog <dogman888888@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: Bug report: git -L requires excessive memory.
Date: Mon, 31 Oct 2022 22:45:54 +0100	[thread overview]
Message-ID: <20221031214554.GA1714@szeder.dev> (raw)
In-Reply-To: <CAFOPqVXz2XwzX8vGU7wLuqb2ZuwTuOFAzBLRM_QPk+NJa=eC-g@mail.gmail.com>

On Sun, Oct 30, 2022 at 01:59:41AM +0900, man dog wrote:
> Thank you for filling out a Git bug report!
> Please answer the following questions to help us understand your issue.
> 
> What did you do before the bug happened? (Steps to reproduce your issue)
> git log -L /regex/,/regex/:myfile to a repo in which 2MB text file is
> committed about 2800 times.
> 
> What did you expect to happen? (Expected behavior)
> get the result.
> 
> What happened instead? (Actual behavior)
> fatal: Out of memory, malloc failed (tried to allocate 2346801 bytes)

Thanks for the report and the reproduction recipe.

This is not a buggy allocation (the size matches the size of the test
file + 1 byte), but the line-level log apparently leaks some memory
for each commit modifying the file in question, and in your case their
combined size is excessive because of the somewhat big file that is
modified in every commit.

'line-log.c' contains two "NEEDSWORK leaking like a sieve" comments,
but you managed to stumble upon yet another case (those two are in the
code path handling merge commits, but your history is linear).

The patch below plugs this leak.

  ---  >8  ---

diff --git a/line-log.c b/line-log.c
index 51d93310a4..b6ea82ac6b 100644
--- a/line-log.c
+++ b/line-log.c
@@ -1195,6 +1195,9 @@ static int process_ranges_ordinary_commit(struct rev_info *rev, struct commit *c
 	if (parent)
 		add_line_range(rev, parent, parent_range);
 	free_line_log_data(parent_range);
+	for (int i = 0; i < queue.nr; i++)
+		diff_free_filepair(queue.queue[i]);
+	free(queue.queue);
 	return changed;
 }
 
  ---  8<  ---

> What's different between what you expected and what actually happened?
> The function requires too much memory.
> -n option should work for -L function.

Line-level log does work with '-n', but the file is so big and is
modified so many times between commits that do touch the specified
line range, that by the time it gets to the 10th commit to show it has
already leaked over 4GB memory.  Had you specified an even smaller
number of commits to show it might have worked:

  $ for i in {1..7} ; do /usr/bin/time -f "n: $i  maxRSS: %M" git log -L /func_007\(/,/}$/:test.txt -n $i >/dev/null || break ; done
  n: 1  maxRSS: 531192
  n: 2  maxRSS: 989504
  n: 3  maxRSS: 1447900
  n: 4  maxRSS: 1906408
  n: 5  maxRSS: 2364740
  n: 6  maxRSS: 2823148
  n: 7  maxRSS: 3282360

In you reproduction recipe the given line range is modified every 100
commit and there are 3000 commits in total, so I estimate the total
memory usage to be somewhere around 13.5GB.  With the patch above it
tops out at around 260MB.


> Anything else you want to add:
> I made a script to reproduce this. Please run the script below.
> Results in each environments are in its header.
> A workaround which is given in other BBS is included also.
> 
> 
> 
> 
> #!/bin/bash
> #
> # Bug report: git -L requires excessive memory.
> # Run this script to reproduce
> #
> # MINGW32(git version 2.38.1.windows.1) fatal: Out of memory, malloc
> failed (tried to allocate 2346801 bytes)
> # MINGW64(git version 2.38.1.windows.1) requires  8.6GB
> # Linux64(git version 2.20.1          ) requires 13.1GB
> #
> 
> git --version
> 
> if [ ! -d .git ]; then
>   git init
>   c=${1:-3000}
>   for (( i=0;i<c;i++)); do
>     gawk -v r="$i" '
>       BEGIN{
>         for (i=0;i<100;i++) {
>           if (r>=i) {
>             printf("function func_%03d(){ // revised at %d\n",i,
> int((r-i)/100)*100+i%100)
>             printf("  // contents of function\n")
>             printf("}\n")
>             make_subfuncs(i);
>           }
>         }
>         exit
>       }
>       function make_subfuncs(i,    j){
>         for (j=0;j<300;j++) {
>           printf("function func_%03d_sub%03d(){\n",i,j)
>           printf("  // contents of sub functions are NOT revised.\n")
>           printf("}\n")
>         }
>       }' > test.txt
>     git add test.txt
>     git commit -m "revision $i"
>   done
>   git gc
> fi
> 
> git log -L /func_007\(/,/}$/:test.txt # this command requires excessive memory.
> git log -L /func_007\(/,/}$/:test.txt -n 10 # -n option doesn't work also.
> #git log -L /func_007\(/,/}$/:test.txt HEAD~10..HEAD~0 # this works.

Perhaps I misunderstood, but I got the impression that you think that
'HEAD~10..HEAD~0' and '-n 10' do the same.

They are not: 'HEAD~10..HEAD~0' means to process only the last ten
commits, so it can't leak all that much, and that's why it worked.
'-n 10', however, means to _show_ only ten commits, but process as
many commits as necessary to find those ten.  In your case, with the
line range being modified every 100 commit, that amounts to processing
over 1000 commits.

> #
> # This can be a workaround
> #
> step=50
> num=`git log | grep -c commit`
> for ((i=0;i<$num;i+=$step)); do
>   end=$((i+$step))
>   range=HEAD~$end..HEAD~$i
>   if [ $end -ge $num ]; then
>     range=HEAD~$i
>   fi
> #  echo $range
>   git --no-pager log -L /func_007\(/,/}$/:test.txt $range
> done
> 
> 
> 
> 
> [System Info]
> [Enabled Hooks]

  reply	other threads:[~2022-10-31 21:46 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-29 16:59 Bug report: git -L requires excessive memory man dog
2022-10-31 21:45 ` SZEDER Gábor [this message]
2022-10-31 21:56   ` Taylor Blau
2022-11-02 22:01     ` [PATCH 0/3] line-log: plug some memory leaks SZEDER Gábor
2022-11-02 22:01       ` [PATCH 1/3] line-log: free diff queue when processing non-merge commits SZEDER Gábor
2022-11-03  0:20         ` Taylor Blau
2022-11-07 15:11           ` SZEDER Gábor
2022-11-07 15:29             ` Ævar Arnfjörð Bjarmason
2022-11-07 15:57               ` SZEDER Gábor
2022-11-08  2:14                 ` Taylor Blau
2022-11-02 22:01       ` [PATCH 2/3] line-log: free the diff queues' arrays when processing merge commits SZEDER Gábor
2022-11-03  0:21         ` Taylor Blau
2022-11-02 22:01       ` [PATCH 3/3] diff.c: use diff_free_queue() SZEDER Gábor
2022-11-03  0:24         ` Taylor Blau
2022-11-07 16:13           ` SZEDER Gábor
2022-11-08  2:14             ` Taylor Blau
2022-11-03  9:05       ` [PATCH 0/3] line-log: plug some memory leaks Ævar Arnfjörð Bjarmason

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20221031214554.GA1714@szeder.dev \
    --to=szeder.dev@gmail.com \
    --cc=dogman888888@gmail.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).