From: Stefan Beller
Subject: Re: weird diff output?
Date: Tue, 29 Mar 2016 23:05:09 -0700
To: Jeff King
Cc: Jacob Keller, Junio C Hamano, Git mailing list, Jens Lehmann
In-Reply-To: <20160330045554.GA11007@sigill.intra.peff.net>

On Tue, Mar 29, 2016 at 9:55 PM, Jeff King wrote:
> On Tue, Mar 29, 2016 at 04:05:57PM -0700, Jacob Keller wrote:
>
>> > This is what we want in both cases.
>> > And I would argue it would appease many other kinds of text as well, because
>> > an empty line is usually a strong indicator for any text that a
>> > different thing comes along.
>> > (Other programming languages, such as Java, C++ and any other C like
>> > language behaves
>> > that way; even when writing latex figures you'd rather want to break
>> > at new lines?)
>> >
>> > Thanks,
>> > Stefan
>>
>> This seems like a good heuristic. Can we think of any examples where
>> it would produce wildly confusing diffs? I don't think it necessarily
>> needs to be default but just a possible option when formatting diffs,
>> much like we already have today.
>
> One thing I like to do when playing with new diff ideas is to pipe all
> of "log -p" for a real project through it and see what differences it
> produces.
>
> Below is a perl script that implements Stefan's heuristic. I checked its
> output on git.git with:
>
>   git log --format='commit %H' -p >old
>   perl /path/to/script <old >new
>   diff -F ^commit -u old new | less

Wow, that's amazing! I'll toy around with it tomorrow.
:)

>
> which shows the differences, with the commit id in the hunk header
> (which makes it easy to "git show $commit | perl /path/to/script" to
> see the new diff with more context.
>
> In addition to the cases discussed, it seems to improve C comments by
> turning:
>
>    /*
>   + * new function
>   + */
>   +void foo(void);
>   +
>   +/*
>     * old function
>     ...
>
> into:
>
>   +/*
>   + * my function
>   + */
>   +void foo(void);
>   +
>    /*
>     * old function
>     ...
>
> See 47fe3f6e for an example.
>
> It also seems to do OK with shell scripts. Commit e6bb5f78 is an example
> where it improves a here-doc, as in the motivating example from this
> thread. Similarly, the headers in 4df1e79 are much improved (though I'm
> confused why the final one in that diff doesn't seem to have been
> caught).
>
> I also ran into an interesting case in 86d26f24, where we have:
>
>   + test_expect_success '
>   + foo
>   +
>   +'
>   +

That's an interesting case :)

I was trying to generalize my thoughts on it. (How is an empty line
special?) Instead of an empty line, we could pick the line with the
fewest characters among the lines that can be shifted up or down.

Why? The more characters a line has, the more interesting it is (the
more information it carries). Assume a patch carries information that is
highly relevant in itself but may not be relevant to its surroundings.
(Think of adding a new function to a C file: the surrounding functions
are not interesting for the diff; rather, you want all the "relevant"
information bundled into that one diff.)

Splitting at the shortest line, instead of only at empty lines, is a
generalization of the empty-line rule.

So instead of looking at

>   + test_expect_success '
>   + foo
>   +
>   +'
>   +

we rather want to look at the string lengths of each line:

> 21
> 5
> 0
> 1
> 0

and then take the minimum (so instead of only acting on the 'last 0' as
shown by Jeff, we'd go to the minimum of those numbers).
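To make the counting concrete, here is a minimal Python sketch. It is not part of Jeff's script; the function name is made up for illustration, and the indentation of "foo" is an assumption chosen so that the lengths match the numbers above. It computes the length of each shiftable line and lists every position where the minimum occurs.

```python
def shortest_line_positions(lines):
    """Return (lengths, positions at which the minimum length occurs)."""
    if not lines:
        return [], []
    lengths = [len(line) for line in lines]
    shortest = min(lengths)
    return lengths, [i for i, n in enumerate(lengths) if n == shortest]


# The shiftable lines from the example, without the leading "+".
# The two-space indentation of "foo" is assumed, not taken from the commit.
example = ["test_expect_success '", "  foo", "", "'", ""]
lengths, candidates = shortest_line_positions(example)
print(lengths)     # [21, 5, 0, 1, 0]
print(candidates)  # [2, 4]
```

Two candidates come back for this example, which is exactly the tie that still has to be broken.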
Now on tie breaking (i.e. two empty lines): we need to understand the
"pattern" of whether the lonely 1-char line belongs above or below the
chunk. I do not think we can do that from the diff alone. I see two
options:

 * Check the file. ("Does the file start with the 0 1 0 pattern or does
   it end with it?" That would be a strong hint on whether to put the
   1-char line above or below the chunk.) However, a typical file has
   noise at the top and bottom, so this heuristic is not often
   applicable. By noise I mean license headers, or a Java class or
   namespace ending with another brace, or such. So this second-order
   heuristic on which of the empty lines to pick for breaking probably
   needs more thought.

 * Go through the history of the file and check for occurrences. (How
   was such a pattern added in the past? Ideally we want to find the
   first time such a pattern was added and then decide, based on that,
   whether to break at the first or second empty line.)

I guess both ways are expensive. Probably too expensive.
So for now we can just go with "take the first or last empty line
(shortest line) of the overlapping lines" and inspect that further.

>
> and there are _two_ blank lines to choose from. It looks really terrible
> if you use the first one, but the second one looks good (and the script
> below chooses the second, as it's closest to the hunk boundary). There
> may be cases where that's bad, though.
>
> This is just a proof of concept. I guess we'd want to somehow integrate
> the heuristic into git.
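For the two-blank-line case, the "first or last" choice is small enough to sketch in Python. This assumes the line lengths are listed top to bottom, so that the last shortest line is the one nearest the hunk boundary (the choice Jeff's script makes). `pick_split` and its `prefer` parameter are made up for illustration.

```python
def pick_split(lengths, prefer="last"):
    """Return the index of a shortest line; `prefer` breaks ties.

    prefer="last" picks the shortest line nearest the end of the list,
    anything else picks the first one. Returns None for an empty list.
    """
    if not lengths:
        return None
    shortest = min(lengths)
    candidates = [i for i, n in enumerate(lengths) if n == shortest]
    return candidates[-1] if prefer == "last" else candidates[0]


lengths = [21, 5, 0, 1, 0]
print(pick_split(lengths, "last"))   # 4
print(pick_split(lengths, "first"))  # 2
```

Whether "last" is the right default outside this example is exactly the open question; the sketch only makes the two policies easy to compare.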
>
> -- >8 --
> #!/usr/bin/perl
>
> use strict;
> use warnings 'all';
>
> use constant {
>     STATE_NONE => 0,
>     STATE_LEADING_CONTEXT => 1,
>     STATE_IN_CHUNK => 2,
> };
> my $state = STATE_NONE;
> my @hunk;
> while(<>) {
>     if ($state == STATE_NONE) {
>         print;
>         if (/^@/) {
>             $state = STATE_LEADING_CONTEXT;
>         }
>     } else {
>         if (/^ /) {
>             flush_hunk() if $state != STATE_LEADING_CONTEXT;
>             push @hunk, $_;
>         } elsif(/^[-+]/) {
>             push @hunk, $_;
>             $state = STATE_IN_CHUNK;
>         } else {
>             flush_hunk();
>             $state = STATE_NONE;
>             print;
>         }
>     }
> }
> flush_hunk();
>
> sub flush_hunk {
>     my $context_len = 0;
>     while ($context_len < @hunk && $hunk[$context_len] =~ /^ /) {
>         $context_len++;
>     }
>
>     # Find the length of the ambiguous portion.
>     # Assumes our hunks have context first, and ambiguous additions at the end,
>     # which is how git generates them
>     my $ambig_len = 0;
>     while ($ambig_len < $context_len) {
>         my $i = $context_len - $ambig_len - 1;
>         my $j = @hunk - $ambig_len - 1;
>         if ($hunk[$j] =~ /^\+/ && substr($hunk[$i], 1) eq substr($hunk[$j], 1)) {
>             $ambig_len++;
>         } else {
>             last;
>         }
>     }
>
>     # Now look for an empty line in the ambiguous portion (we can just look in
>     # the context side, as it is equivalent to the addition side at the end).
>     # We count down, though, as we prefer to use the line closest to the
>     # hunk as the cutoff.
>     my $empty;
>     for (my $i = $context_len - 1; $i >= $context_len - $ambig_len; $i--) {
>         if (length($hunk[$i]) == 2) {
>             $empty = $i;
>             last;
>         }
>     }
>
>     if (defined $empty) {
>         # move empty lines after the chunk to be part of it
>         for (my $i = $empty + 1; $i < $context_len; $i++) {
>             $hunk[$i] =~ s/^ /+/;
>             $hunk[@hunk - $context_len + $i] =~ s/^\+/ /;
>         }
>     }
>
>     print @hunk;
>     @hunk = ();
> }
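
To check my reading of flush_hunk, here is its core restated as a small Python function and run on the C comment example. This is only a sketch: it covers the same case the script does (leading context followed by additions), the function name is mine, and the " int bar;" context line is invented to give the example something above the blank line.

```python
def shift_hunk(hunk):
    """Slide a trailing block of additions up to the nearest blank context line.

    `hunk` is a list of diff lines without newlines: leading context
    lines (prefix " ") followed by added lines (prefix "+").
    Returns a new list; the input is not modified.
    """
    hunk = list(hunk)
    context_len = 0
    while context_len < len(hunk) and hunk[context_len].startswith(" "):
        context_len += 1

    # Count trailing additions that repeat the context just above the chunk.
    ambig_len = 0
    while ambig_len < context_len:
        i = context_len - ambig_len - 1
        j = len(hunk) - ambig_len - 1
        if hunk[j].startswith("+") and hunk[i][1:] == hunk[j][1:]:
            ambig_len += 1
        else:
            break

    # Walk the ambiguous context bottom-up so the blank line closest
    # to the chunk wins.
    for empty in range(context_len - 1, context_len - ambig_len - 1, -1):
        if hunk[empty] == " ":
            for i in range(empty + 1, context_len):
                hunk[i] = "+" + hunk[i][1:]
                k = len(hunk) - context_len + i
                hunk[k] = " " + hunk[k][1:]
            break
    return hunk


before = [
    " int bar;",  # invented context line
    " ",
    " /*",
    "+ * new function",
    "+ */",
    "+void foo(void);",
    "+",
    "+/*",
]
for line in shift_hunk(before):
    print(line)
```

The printed hunk starts the addition at "+/*" and ends with " /*" as context, which is the "into:" form quoted above.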