From: Eric Sunshine <sunshine@sunshineco.com>
To: Elijah Newren <newren@gmail.com>
Cc: "Stefan Beller" <sbeller@google.com>,
	"Junio C Hamano" <gitster@pobox.com>, git <git@vger.kernel.org>,
	"SZEDER Gábor" <szeder.dev@gmail.com>,
	"Jonathan Nieder" <jrnieder@gmail.com>,
	"Jeff King" <peff@peff.net>
Subject: Re: [PATCH v7 19/31] merge-recursive: add get_directory_renames()
Date: Sat, 3 Feb 2018 23:42:58 -0500
Message-ID: <CAPig+cQu20ZH3hj=2MSNqS3K-+qtZjtHAUQA0TL1LOG685yQtg@mail.gmail.com>
In-Reply-To: <CABPp-BGwAu7_+BJR+43G2SysmBMnZYEBtHnwjMZBRn5XDPubWg@mail.gmail.com>

On Sat, Feb 3, 2018 at 9:04 PM, Elijah Newren <newren@gmail.com> wrote:
> On Sat, Feb 3, 2018 at 2:32 PM, Elijah Newren <newren@gmail.com> wrote:
>> On Fri, Feb 2, 2018 at 5:02 PM, Stefan Beller <sbeller@google.com> wrote:
>>> On Tue, Jan 30, 2018 at 3:25 PM, Elijah Newren <newren@gmail.com> wrote:
>>>> +       while (*--end_of_new == *--end_of_old &&
>>>> +              end_of_old != old_path &&
>>>> +              end_of_new != new_path)
>>>> +               ; /* Do nothing; all in the while loop */
>>>
>>> Assuming many repos are UTF8 (including in their paths),
>>> how does this work with display characters longer than one char?
>>> It should be fine as we cut at the slash?
>>
>> Can UTF-8 characters, other than '/', have a byte whose value matches
>> (unsigned char)('/')?  If so, then I'll need to figure out how to do
>> utf-8 character parsing.  Anyone have pointers?
>
> Well, after digging around for a while, I found this claim on the
> Wikipedia page for UTF-8:
>
>   Since ASCII bytes do not occur when encoding non-ASCII code points
> into UTF-8, UTF-8 is safe to use within most programming and document
> languages that interpret certain ASCII characters in a special way,
> such as "/" in filenames, "\" in escape sequences, and "%" in printf.
>
> So, unless I'm reading something wrong here, I think that means this
> code is just fine as it is.

You're reading it correctly. Unicode code points above U+007F, when
encoded as UTF-8, consist solely of bytes in the range 0x80-0xFF, so
they never contain a byte which can be confused with any 7-bit ASCII
character.
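
A minimal sketch of why that holds, assuming a hand-rolled encoder:
utf8_encode() and find_ascii_collision() are hypothetical helpers
written for illustration, not Git functions. The sketch encodes every
code point above U+007F and checks that no resulting byte falls below
0x80, which is why a byte-wise scan for '/' cannot land inside a
multi-byte character.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of Git): encode code point cp
 * (cp <= 0x10FFFF) as UTF-8 into out and return the byte count.
 * Surrogates are not rejected; that does not affect the byte ranges.
 */
static size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
	assert(cp <= 0x10FFFFUL);
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	out[0] = (unsigned char)(0xF0 | (cp >> 18));
	out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
	out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
	out[3] = (unsigned char)(0x80 | (cp & 0x3F));
	return 4;
}

/*
 * Return the first code point above U+007F whose encoding contains
 * a byte below 0x80, or -1 if there is none.
 */
static long find_ascii_collision(void)
{
	unsigned long cp;

	for (cp = 0x80; cp <= 0x10FFFFUL; cp++) {
		unsigned char buf[4];
		size_t i, n = utf8_encode(cp, buf);

		for (i = 0; i < n; i++)
			if (buf[i] < 0x80)
				return (long)cp;
	}
	return -1;
}
```

Every lead byte produced above is at least 0xC0 and every continuation
byte is at least 0x80, so the search finds no collision.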

It's possible that Stefan was thinking of "combining characters"[1],
which allow the same character to exist in "precomposed" and
"decomposed"[2] forms that appear identical when rendered. For
instance, "ö" might be a single Unicode code point or two code points,
such as "o" followed by a combining diaeresis. That is a potential
problem if you're comparing, byte by byte, two filenames which look
the same. However, Git takes pains[3] to avoid this problem by
ensuring (if possible) that filenames are precomposed within Git even
if they happen to be decomposed on the actual filesystem. So, most
likely, your code is okay as-is.
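
To make the precomposed/decomposed distinction concrete, here is a
small sketch. The names precomposed, decomposed, and bytes_equal() are
made up for this example and are not taken from Git. Both strings
render as "ö", yet they differ in length and content at the byte
level, so a plain byte-wise comparison treats them as different names.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* "ö" as the single code point U+00F6, encoded in UTF-8. */
static const char precomposed[] = "\xC3\xB6";

/* "ö" as "o" (U+006F) followed by combining diaeresis U+0308. */
static const char decomposed[] = "o" "\xCC\x88";

/* Hypothetical helper: byte-by-byte equality of two C strings. */
static bool bytes_equal(const char *a, const char *b)
{
	assert(a != NULL && b != NULL);
	return strcmp(a, b) == 0;
}
```

Here bytes_equal(precomposed, decomposed) is false: the first string is
two bytes long and the second is three.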

[1]: https://en.wikipedia.org/wiki/Combining_character
[2]: https://en.wikipedia.org/wiki/Diaeresis_(diacritic)
[3]: https://github.com/git/git/blob/master/compat/precompose_utf8.c

