git.vger.kernel.org archive mirror
* git blame --ignore-rev degenerate performance on large(r) line counts
@ 2021-08-01 12:58 Stefan Hoffmeister
From: Stefan Hoffmeister @ 2021-08-01 12:58 UTC
  To: git

git blame has a very useful option, --ignore-rev (or --ignore-revs-file),
which excludes the given commits when attributing blame.
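
As a minimal sketch of how these two options are passed on the command
line, the helper below only builds the argument list for git blame; it
does not run git. The function name blame_command, the placeholder
revision "abc123" and the file name ".git-blame-ignore-revs" are
illustrative assumptions, not something git mandates.

```python
from typing import List, Optional, Sequence


def blame_command(path: str,
                  ignore_revs: Sequence[str] = (),
                  ignore_revs_file: Optional[str] = None,
                  rev: str = "HEAD") -> List[str]:
    """Build an argv list for `git blame` that skips the given commits."""
    cmd = ["git", "blame", "--line-porcelain"]
    for ignored in ignore_revs:
        # --ignore-rev can be repeated, once per commit to skip.
        cmd += ["--ignore-rev", ignored]
    if ignore_revs_file is not None:
        # The file lists the commits to skip, one object name per line.
        cmd += ["--ignore-revs-file", ignore_revs_file]
    # "--" separates the revision from the path being blamed.
    cmd += [rev, "--", path]
    return cmd


if __name__ == "__main__":
    # "abc123" is a placeholder, not a real commit.
    print(blame_command("blame-me.txt", ignore_revs=["abc123"]))
    print(blame_command("blame-me.txt",
                        ignore_revs_file=".git-blame-ignore-revs"))
```

The resulting list can be handed to subprocess.run() unchanged.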

One use case for this option is a repository whose line endings were
"cleaned up" in a single commit, e.g. via "git add --renormalize ."

Alas, git blame with --ignore-rev falls apart at runtime whenever the
ignored commit touches a large number of lines: execution time appears
to grow exponentially while the line count grows only linearly.

The Python script below demonstrates the undesirable performance
behaviour for a single text file and just two commits. In a real-world
scenario, where a repository-wide line-ending fix commit affects a
large number of _files_ (some of them large), performance will be
even worse.

The script initializes a new repo and does the reproduction dance
fully isolated and offline. As written, it imports "matplotlib"
(pip install matplotlib) to plot the results; the plot shows that on my
system (i7-7820HQ, 2.9 GHz) the runtime explodes exponentially at
around 35000 lines of "Hello Hello" text.

****************
#!/usr/bin/env python3

# Beware - no error handling

import matplotlib.pyplot as plt
import os
import shutil
import subprocess
import time
from typing import Generator

REPO_LOCATION="git-repository-container/git-repo"
BLAME_ME_TXT="blame-me.txt"

def run_iteration(line_count: int):
    shutil.rmtree(REPO_LOCATION, ignore_errors=True)
    os.makedirs(REPO_LOCATION, exist_ok=False)

    os.chdir(REPO_LOCATION)
    subprocess.run(["git", "init"], capture_output=True)

    subprocess.run(["git", "config", "--local", "user.name", "me"],
                   capture_output=True)
    subprocess.run(["git", "config", "--local", "user.email", "me@example.com"],
                   capture_output=True)

    with open(".gitattributes", "w") as gitattributes:
        gitattributes.writelines("* text=crlf\n")

    def produce_some_file_content_sequence(
            line_count: int) -> Generator[str, None, None]:
        mytext: str = "Hello Hello\r\n"
        num = 0
        while num < line_count:
            yield mytext
            num += 1

    with open(BLAME_ME_TXT, "w") as blame_me:
        for line in produce_some_file_content_sequence(line_count):
            blame_me.writelines(line)

    subprocess.run(["git", "add", "."], capture_output=True)

    subprocess.run(["git", "commit", "-m", "Initial commit"],
                   capture_output=True)

    capture = subprocess.run(["git", "rev-list", "HEAD"], capture_output=True)
    initial_commit_rev = (capture.stdout).decode().strip()

    with open(".gitattributes", "w") as gitattributes:
        gitattributes.writelines("* text=auto\n")

    subprocess.run(["git", "add", "--renormalize", "."], capture_output=True)

    subprocess.run(["git", "commit", "-m", "Renormalized"], capture_output=True)

    capture = subprocess.run(["git", "rev-list", "--max-count", "1", "HEAD"],
                             capture_output=True)
    renormalized_commit_rev = (capture.stdout).decode().strip()

    blame_capture = subprocess.run(["git", "blame", "-C ./", "--line-porcelain",
        "--ignore-rev", renormalized_commit_rev,
        "HEAD", BLAME_ME_TXT],
        capture_output=True)

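x: list[int] = []
y: list[float] = []
base_dir = os.getcwd()
for factor in range(40):
    start = time.time()
    line_count = 1000 * factor
    x.append(line_count)

    print(f'Line count: {line_count}')
    run_iteration(line_count)
    # run_iteration() changes into the repo; return to the starting directory
    # so the relative REPO_LOCATION does not nest deeper on every iteration.
    os.chdir(base_dir)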

    end = time.time()
    duration = end - start
    print(duration)
    y.append(duration)

plt.plot(x, y)
plt.show()
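
The plot alone does not tell exponential growth apart from steep
polynomial growth. One possible check, sketched here under the
assumption that the line counts are strictly increasing, is to look at
the slope between neighbouring samples on a log-log scale: for
polynomial growth t ~ n^k the slope stays near k, while for exponential
growth it keeps rising. The helper name loglog_slopes is my own and is
not part of the script above.

```python
import math
from typing import List, Sequence


def loglog_slopes(line_counts: Sequence[int],
                  durations: Sequence[float]) -> List[float]:
    """Slope of log(duration) vs. log(line_count) between neighbouring samples.

    Assumes line_counts is strictly increasing. Samples with a zero line
    count or zero duration are skipped because their logarithm is undefined.
    """
    points = [(n, t) for n, t in zip(line_counts, durations)
              if n > 0 and t > 0]
    slopes = []
    for (n0, t0), (n1, t1) in zip(points, points[1:]):
        slopes.append((math.log(t1) - math.log(t0)) /
                      (math.log(n1) - math.log(n0)))
    return slopes


if __name__ == "__main__":
    counts = [1000, 2000, 4000]
    # Synthetic quadratic timings: the slope stays at about 2.
    print(loglog_slopes(counts, [1e-9 * n * n for n in counts]))
    # Synthetic exponential timings: the slope increases from sample to sample.
    print(loglog_slopes(counts, [2.0 ** (n / 1000) for n in counts]))
```

With the measurements from the script, this would be called as
loglog_slopes(x, y) after the loop. Since each measured duration also
includes the fixed cost of setting up the repository, slopes at small
line counts will understate the growth of the blame step itself.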
