* git blame --ignore-rev degenerate performance on large(r) line counts
@ 2021-08-01 12:58 Stefan Hoffmeister
From: Stefan Hoffmeister @ 2021-08-01 12:58 UTC (permalink / raw)
To: git
git blame has a very useful option --ignore-rev (or --ignore-revs-file)
which excludes the given commits when attributing blame.
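For illustration, here is a minimal Python sketch of how such a blame invocation might be assembled. The helper name build_blame_command and the file name .git-blame-ignore-revs are assumptions made up for this example; only --ignore-rev and --ignore-revs-file are actual git blame options.

```python
from typing import List, Optional, Sequence


def build_blame_command(path: str,
                        ignore_revs: Sequence[str] = (),
                        ignore_revs_file: Optional[str] = None) -> List[str]:
    """Assemble an argv list for git blame that skips the given commits.

    Hypothetical helper for illustration; it only builds the command line
    and does not run git.
    """
    cmd = ["git", "blame"]
    for rev in ignore_revs:
        # One --ignore-rev option per commit to skip.
        cmd += ["--ignore-rev", rev]
    if ignore_revs_file is not None:
        # A file listing commits to skip, one object name per line.
        cmd += ["--ignore-revs-file", ignore_revs_file]
    # "--" separates options from the path being blamed.
    cmd += ["--", path]
    return cmd


if __name__ == "__main__":
    print(build_blame_command("blame-me.txt",
                              ignore_revs_file=".git-blame-ignore-revs"))
```

The resulting list can be handed to subprocess.run() as is.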
One use case for this option is a scenario where a repository was
"cleaned" up for line breaks, e.g. via "git add --renormalize ."
Alas, git blame with --ignore-rev degrades badly at runtime whenever
the number of affected lines is high: execution time appears to grow
exponentially while the line count grows only linearly.
The Python script below demonstrates the undesirable performance
behaviour for a single text file and just two commits. In a real world
scenario, with a repository-wide line-ending fix commit affecting a
large number of _files_ (in addition to some files being large),
performance behaviour will be even worse.
The script initializes a new repo and runs the reproduction fully
isolated and offline. "matplotlib" as a dependency (pip install
matplotlib) is a nice optional add-on to have, as it visualizes that on
my system (i7-7820HQ, 2.9 GHz) runtime duration explodes exponentially
at around 35000 lines of "Hello Hello" text.
****************
#!/usr/bin/env python3
# Beware - no error handling
import os
import shutil
import subprocess
import time
from typing import Generator

try:
    import matplotlib.pyplot as plt  # optional, only used for the final plot
except ImportError:
    plt = None

# Absolute paths, so that every iteration recreates the same directory
# instead of nesting a new repo inside the previous one.
BASE_DIR = os.getcwd()
REPO_LOCATION = os.path.join(BASE_DIR, "git-repository-container", "git-repo")
BLAME_ME_TXT = "blame-me.txt"


def produce_some_file_content_sequence(line_count: int) -> Generator[str, None, None]:
    mytext: str = "Hello Hello\r\n"
    num = 0
    while num < line_count:
        yield mytext
        num += 1


def run_iteration(line_count: int):
    # Start from a fresh, empty repository.
    os.chdir(BASE_DIR)
    shutil.rmtree(REPO_LOCATION, ignore_errors=True)
    os.makedirs(REPO_LOCATION, exist_ok=False)
    os.chdir(REPO_LOCATION)

    subprocess.run(["git", "init"], capture_output=True)
    subprocess.run(["git", "config", "--local", "user.name", "me"],
                   capture_output=True)
    subprocess.run(["git", "config", "--local", "user.email",
                    "me@example.com"], capture_output=True)

    # First commit: the file is stored with CRLF line endings.
    with open(".gitattributes", "w") as gitattributes:
        gitattributes.write("* text=crlf\n")

    # newline="" keeps the "\r\n" exactly as written on every platform.
    with open(BLAME_ME_TXT, "w", newline="") as blame_me:
        for line in produce_some_file_content_sequence(line_count):
            blame_me.write(line)

    subprocess.run(["git", "add", "."], capture_output=True)
    subprocess.run(["git", "commit", "-m", "Initial commit"],
                   capture_output=True)

    capture = subprocess.run(["git", "rev-list", "HEAD"], capture_output=True)
    initial_commit_rev = capture.stdout.decode().strip()

    # Second commit: renormalize line endings, touching every line.
    with open(".gitattributes", "w") as gitattributes:
        gitattributes.write("* text=auto\n")

    subprocess.run(["git", "add", "--renormalize", "."], capture_output=True)
    subprocess.run(["git", "commit", "-m", "Renormalized"], capture_output=True)

    capture = subprocess.run(["git", "rev-list", "--max-count", "1",
                              "HEAD"], capture_output=True)
    renormalized_commit_rev = capture.stdout.decode().strip()

    # Blame while ignoring the renormalization commit; this is the slow step.
    # Note: "-C ./" is passed as a single argument, as in the original run.
    blame_capture = subprocess.run(["git", "blame", "-C ./", "--line-porcelain",
                                    "--ignore-rev", renormalized_commit_rev,
                                    "HEAD", BLAME_ME_TXT],
                                   capture_output=True)


x = []  # line counts
y = []  # duration of each full iteration (repo setup plus blame), in seconds

for factor in range(40):
    start = time.time()

    line_count = 1000 * factor
    x.append(line_count)
    print(f'Line count: {line_count}')
    run_iteration(line_count)

    end = time.time()
    duration = end - start
    print(duration)
    y.append(duration)

if plt is not None:
    plt.plot(x, y)
    plt.show()
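
As a rough way to characterize the growth beyond eyeballing the plot, one could estimate the slope of the timings on a log-log scale. This is only a sketch: the helper name loglog_slope is made up for this example, and the numbers in the usage part are synthetic, not measurements. A slope that stays near some constant k would be consistent with polynomial growth of degree k, while a slope that keeps rising as larger line counts are included would point to something faster.

```python
import math
from typing import Sequence


def loglog_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Least-squares slope of log(y) against log(x).

    Hypothetical helper; points with x <= 0 or y <= 0 are skipped because
    their logarithm is undefined.
    """
    pts = [(math.log(px), math.log(py))
           for px, py in zip(xs, ys) if px > 0 and py > 0]
    if len(pts) < 2:
        raise ValueError("need at least two positive data points")
    mean_x = sum(p[0] for p in pts) / len(pts)
    mean_y = sum(p[1] for p in pts) / len(pts)
    sxx = sum((p[0] - mean_x) ** 2 for p in pts)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in pts)
    return sxy / sxx


if __name__ == "__main__":
    # Synthetic data following y = c * x^2, purely to exercise the helper.
    xs = [1000, 2000, 4000, 8000]
    ys = [1e-9 * v ** 2 for v in xs]
    print(loglog_slope(xs, ys))
```

With the script above, the call would be loglog_slope(x, y). The durations collected there include repository setup, so the result is only a rough indicator.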