From: Esko Luontola <esko.luontola@gmail.com>
To: Robin Rosenberg <robin.rosenberg@dewire.com>
Cc: git@vger.kernel.org
Subject: Re: [RFC 1/8] UTF helpers
Date: Wed, 13 May 2009 12:24:30 +0300
Message-ID: <4A0A91CE.3080905@gmail.com>
In-Reply-To: <200905130724.44634.robin.rosenberg@dewire.com>

Robin Rosenberg wrote on 13.5.2009 8:24:
> If the conclusion is that this is a way forward, then I
> could start working on a completely new set of much cleaner patches.

That would be great!

I see that in those early patches you took the approach of converting
the filenames from the local encoding to UTF-8 at the outer edges of
Git. That was obviously the easiest way to do it with minimal changes
to Git.

I've been thinking about a bit more extensive approach, which should 
serve the interest of all stakeholders:


Currently the tree object contains the following information for each
file: filename, mode, sha1. One more string would be added to that: the
filename encoding. When the encoding is not specified (as in old commits
made before the encoding information was added), it defaults to
"binary", which matches how Git works now (it treats filenames as a
series of bytes and ignores their encoding completely).

When a file is added/committed, the following things will happen:

1. Git determines the filename encoding used by the system. Git will
try to detect it automatically from the environment, and the
autodetected value can be overridden by setting the config variable
"i18n.localFilenameEncoding". If autodetection fails, the encoding
defaults to "binary".

2. Git reads the config variable "i18n.commitFilenameEncoding". If 
localFilenameEncoding equals commitFilenameEncoding, or if either of 
them is "binary", go to step 3A. Otherwise go to step 3B.

3A. Git saves the filename together with the local filename encoding. 
The bytes of the filename are not changed when it is stored in the 
repository (the same as now).

3B. Git converts the filename from localFilenameEncoding to 
commitFilenameEncoding. (The commitFilenameEncoding may also specify a 
normalized form for UTF-8, for example "UTF-8 NFC". This is needed for 
Mac OS X.) Then Git saves the filename together with the commit filename 
encoding.
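
The add/commit steps above could be sketched roughly as follows in C.
This is only an illustration: the function names are hypothetical,
treating an unset "i18n.commitFilenameEncoding" (passed as NULL) like
step 3A is an assumption, and encoding names are compared with a plain
strcmp, which a real implementation would probably relax.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: nonzero if the encoding name is "binary". */
static int is_binary(const char *enc)
{
	return strcmp(enc, "binary") == 0;
}

/*
 * Decide how a filename is stored on add/commit.
 * Returns 0 for step 3A (bytes stored unchanged, labeled with the
 * local encoding) and 1 for step 3B (bytes must first be converted
 * to the commit encoding). *stored_enc receives the label to store.
 * commit_enc may be NULL, meaning the config variable is unset
 * (an assumption of this sketch).
 */
static int commit_needs_conversion(const char *local_enc,
				   const char *commit_enc,
				   const char **stored_enc)
{
	if (!commit_enc || is_binary(local_enc) || is_binary(commit_enc) ||
	    strcmp(local_enc, commit_enc) == 0) {
		*stored_enc = local_enc;	/* step 3A */
		return 0;
	}
	*stored_enc = commit_enc;		/* step 3B */
	return 1;
}
```

For example, with a local encoding of "ISO-8859-1" and a commit encoding
of "UTF-8" the function selects step 3B, while a "binary" local encoding
always selects step 3A.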


When a file is checked out, the following things will happen:

1. Git reads the actual filename encoding from the repository. If it is 
not specified, "binary" will be assumed.

2. Git detects the local filename encoding, the same way as before. If
the actual filename encoding equals the local filename encoding, or if
either of them is "binary", go to step 3A. Otherwise go to step 3B.

3A. Git creates the file using exactly the filename bytes that are
stored in the repository. This is the same as how Git works now.

3B. Git converts the filename from the actual filename encoding to the 
local filename encoding, and creates the file using the encoding of the 
local platform.
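
The checkout side is symmetric. A rough C sketch of the decision, again
with a hypothetical function name and with NULL standing in for "no
encoding recorded in the repository" (an assumption of this sketch):

```c
#include <stddef.h>
#include <string.h>

/*
 * Decide how a filename is written on checkout.
 * Returns 0 for step 3A (use the stored bytes as they are) and
 * 1 for step 3B (convert from stored_enc to local_enc first).
 * stored_enc may be NULL when the tree entry carries no encoding,
 * in which case "binary" is assumed (step 1).
 */
static int checkout_needs_conversion(const char *stored_enc,
				     const char *local_enc)
{
	if (!stored_enc)
		stored_enc = "binary";

	if (strcmp(stored_enc, "binary") == 0 ||
	    strcmp(local_enc, "binary") == 0 ||
	    strcmp(stored_enc, local_enc) == 0)
		return 0;	/* step 3A */

	return 1;		/* step 3B */
}
```

Old trees without encoding information therefore always take step 3A,
so they would check out exactly as they do today.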


This should fit in with Git's philosophy of not modifying the user's
data without the user's permission. The data will always be stored
unchanged in the repository, unless the user specifies
"i18n.commitFilenameEncoding". By default the conversions are done only
on checkout. Git will try to serve the needs of the user as well as it
can by detecting the local filename encoding, but if the user so
desires, he can disable the conversions by setting
"i18n.localFilenameEncoding" to "binary", in which case Git will work
the same way as it does today.


I was browsing Git's code, and it seems that the encoding information 
would need to be added to struct name_entry in tree-walk.h. A quick 
search reveals that name_entry is used in 15 files, out of which only 4 
files use it more than once. It would probably make sense to create a 
new datatype for the filename, for example "struct encoded_path { const 
char *path; const char *encoding; }", and then provide functions for 
accessing the filename with the right encoding (commit or local).
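
As a rough illustration of that datatype, here is a minimal C sketch.
Only the struct itself comes from the suggestion above; the two accessor
names are hypothetical, as is the convention that a NULL encoding means
"binary".

```c
#include <stddef.h>
#include <string.h>

/* A filename together with the name of the encoding of its bytes. */
struct encoded_path {
	const char *path;
	const char *encoding;	/* NULL: not recorded (assumption) */
};

/* Hypothetical accessor: the effective encoding, never NULL. */
static const char *encoded_path_encoding(const struct encoded_path *p)
{
	return p->encoding ? p->encoding : "binary";
}

/* Hypothetical accessor: nonzero if the path must not be converted. */
static int encoded_path_is_binary(const struct encoded_path *p)
{
	return strcmp(encoded_path_encoding(p), "binary") == 0;
}
```

Callers that only need the raw bytes could keep using the path member
directly, so most existing users of name_entry would not have to care
about the encoding at all.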

I might even be able to make that change myself, because Git is not
legacy software (it has tests) and the needed changes seem quite local.
I would just need a way to detect the encodings (at first it could rely
on manually set config variables) and a library for doing the
encoding conversions.
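
One candidate for the conversion library is POSIX iconv. The following
is a minimal sketch, not a proposal for the final code: the function
name is hypothetical, the output buffer is simply sized at four bytes
per input byte, and it assumes encodings without embedded NUL bytes and
without shift state (so no final flush call is made).

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/*
 * Convert a NUL-terminated filename from one named encoding to another.
 * Returns a malloc'd NUL-terminated string that the caller must free,
 * or NULL if the conversion is unsupported or fails.
 */
static char *convert_filename(const char *name, const char *from,
			      const char *to)
{
	iconv_t cd = iconv_open(to, from);	/* note: target comes first */
	if (cd == (iconv_t)-1)
		return NULL;

	size_t in_left = strlen(name);
	size_t out_size = in_left * 4 + 1;	/* assumed upper bound */
	size_t out_left = out_size - 1;		/* keep room for the NUL */
	char *out = malloc(out_size);
	if (!out) {
		iconv_close(cd);
		return NULL;
	}

	char *in_ptr = (char *)name;		/* iconv does not modify input */
	char *out_ptr = out;

	if (iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left) == (size_t)-1) {
		free(out);
		iconv_close(cd);
		return NULL;
	}

	*out_ptr = '\0';
	iconv_close(cd);
	return out;
}
```

Returning NULL on any failure leaves it to the caller to decide whether
to fall back to the unconverted bytes or to report an error.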

One big question is whether this change will require a change to the
repository format. Will it be possible to add the encoding field to the
tree object without breaking compatibility with older Git clients? If
compatibility needs to be broken, how can it be done in a controlled
fashion?

-- 
Esko Luontola
www.orfjackal.net
