git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Ryan Zoeller <rtzoeller@rtzoeller.com>
To: git@vger.kernel.org
Cc: Ryan Zoeller <rtzoeller@rtzoeller.com>
Subject: [RFC 0/1] Leading whitespace as a function identification heuristic?
Date: Wed, 23 Sep 2020 21:59:09 +0000	[thread overview]
Message-ID: <20200923215859.102981-1-rtzoeller@rtzoeller.com> (raw)

I've recently been annoyed by git's choice of lines when invoking
`git diff --function-context`. git isn't doing a _bad_ job per se,
given that it's using regular expressions to parse nonregular languages,
but in my opinion it could do better.

The issue I'm seeing is that constructs like nested functions cause
`git diff --function-context` to stop taking lines prematurely; a line
containing a function declaration has been reached, but it's not the one
I'm expecting git to stop at. This results in some of the function being
omitted from the output of `git diff --function-context`, and also
manifests itself as the incorrect function being listed in the hunk header
in general.

A trivial example: in a Python file containing the following contents,
modifying the indicated line will result in git picking `bar()` for the
function header, rather than `foo()`. This is obviously worse for longer
functions, or code with multiple inner functions or classes.

```
def foo(xs, i):
    def bar(x):
        return x + i

    # Intentionally not using a lambda,
    # because the issue occurs for inner functions.
    # Making this comment extra long so the function name
    # doesn't get automatically included in the context.

    return map(bar, xs) # Modify this line
```

Even without properly parsing the file as code, git could use
indentation as a heuristic to determine which lines containing functions
are relevant. I've used Python as an example here, but I think the
issue and heuristic apply to many languages.

This patch implements a proof of concept for this heuristic: only
lines which are indented less than the hunk are considered eligible
to contain function definitions. It only supports `git diff`,
ignoring things like `git grep` for simplicity (and introducing
`-1`s to the diff in leiu of handling them).

While I welcome feedback on the existing implementation, I'm really
requesting commentary on the following:

1. Is this indentation-aware function identification useful, and
    generally worth pursuing further?
2. What level of granularity would be desirable for enabling this?
    Something akin to xfuncname in .gitattributes, where it can be
    enabled per path?
3. How do we handle files with a mix of tabs and spaces?

Best,
Ryan

Ryan Zoeller (1):
  xdiff: use leading whitespace in function heuristic

 grep.c            |  2 +-
 line-range.c      |  2 +-
 xdiff-interface.c | 14 +++++++++++++-
 xdiff/xdiff.h     |  2 +-
 xdiff/xemit.c     | 23 +++++++++++++++++------
 5 files changed, 33 insertions(+), 10 deletions(-)

-- 
2.28.0.586.g47c91ef7fe



             reply	other threads:[~2020-09-23 21:59 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-23 21:59 Ryan Zoeller [this message]
2020-09-23 21:59 ` [RFC 1/1] xdiff: use leading whitespace in function heuristic Ryan Zoeller
2020-09-24  6:45 ` [RFC 0/1] Leading whitespace as a function identification heuristic? Junio C Hamano
2020-09-24 21:17   ` Jeff King
2020-09-24 22:01     ` Ryan Zoeller
2020-09-25  9:11       ` Phillip Wood
2020-09-25 18:43         ` Jeff King
2020-09-25 19:01           ` Phillip Wood
2020-09-25 19:05             ` Jeff King
2020-09-25 18:12 ` Johannes Sixt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200923215859.102981-1-rtzoeller@rtzoeller.com \
    --to=rtzoeller@rtzoeller.com \
    --cc=git@vger.kernel.org \
    --subject='Re: [RFC 0/1] Leading whitespace as a function identification heuristic?' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).