From: Esko Luontola
Subject: Re: [RFC 1/8] UTF helpers
Date: Wed, 13 May 2009 12:24:30 +0300
Message-ID: <4A0A91CE.3080905@gmail.com>
References: <1242168631-30753-1-git-send-email-robin.rosenberg@dewire.com> <1242168631-30753-2-git-send-email-robin.rosenberg@dewire.com> <200905130724.44634.robin.rosenberg@dewire.com>
In-Reply-To: <200905130724.44634.robin.rosenberg@dewire.com>
To: Robin Rosenberg
Cc: git@vger.kernel.org

Robin Rosenberg wrote on 13.5.2009 8:24:
> If the conclusion is that this is a way forward, then I
> could start working on a completely new set of much cleaner patches.

That would be great!

I see that in those early patches you took the approach of converting the filenames from the local encoding to UTF-8 at the outer edges of Git. That was obviously the easiest way to do it with minimal changes to Git.

I've been thinking about a somewhat more extensive approach, which should serve the interests of all stakeholders:

Currently the tree object contains the following information for each file: filename, mode, sha1. One more string would be added to that: the filename encoding. Unless the encoding is specified (as in old commits made before the encoding information was added), the default encoding is "binary", which is the same as how Git works now (it treats filenames as a series of bytes, ignoring their encoding completely).

When a file is added/committed, the following things will happen:

1. Git finds out which filename encoding the system uses. Git will try to detect it automatically from the environment, and the autodetected value can be overridden by setting a config variable "i18n.localFilenameEncoding". If autodetection fails, it will default to "binary".

2. Git reads the config variable "i18n.commitFilenameEncoding". If localFilenameEncoding equals commitFilenameEncoding, or if either of them is "binary", go to step 3A. Otherwise go to step 3B.

3A. Git saves the filename together with the local filename encoding. The bytes of the filename are not changed when it is stored in the repository (the same as now).

3B. Git converts the filename from localFilenameEncoding to commitFilenameEncoding. (The commitFilenameEncoding may also specify a normalized form for UTF-8, for example "UTF-8 NFC". This is needed for Mac OS X.) Then Git saves the filename together with the commit filename encoding.

When a file is checked out, the following things will happen:

1. Git reads the actual filename encoding from the repository. If it is not specified, "binary" will be assumed.

2. Git detects the local filename encoding, the same way as before. If the actual filename encoding equals the local filename encoding, or if either of them is "binary", go to step 3A. Otherwise go to step 3B.

3A. Git creates the file using exactly the filename bytes that are stored in the repository. This is the same as how Git works now.

3B. Git converts the filename from the actual filename encoding to the local filename encoding, and creates the file using the encoding of the local platform.

This should fit in with Git's philosophy of not modifying the user's data without the user's permission. The data will always be stored unchanged in the repository, unless the user specifies "i18n.commitFilenameEncoding". By default, conversions are done only on checkout. Git will try to serve the needs of the user as well as it can by detecting the local filename encoding, but if the user so desires, he can disable the conversions by setting "i18n.localFilenameEncoding" to "binary", in which case Git will work the same way as it does today.
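Both procedures above share the same decision rule in step 2, so here is a minimal sketch of it in C. All names in it (name_action, is_binary, choose_action, stored_encoding) are hypothetical and do not exist in Git. It compares encoding names as exact strings, which a real implementation would probably want to normalize, and it leaves out the actual conversion.

```c
#include <string.h>

/* Hypothetical names for illustration; none of this exists in Git. */
enum name_action {
	NAME_KEEP_BYTES,	/* step 3A: use the filename bytes unchanged */
	NAME_CONVERT		/* step 3B: convert between the two encodings */
};

/* A missing encoding counts as "binary", as for old commits. */
int is_binary(const char *enc)
{
	return !enc || !strcmp(enc, "binary");
}

/*
 * Step 2 for both directions. On add/commit, "from" is the local
 * encoding and "to" is the commit encoding. On checkout, "from" is
 * the encoding stored in the tree and "to" is the local encoding.
 */
enum name_action choose_action(const char *from, const char *to)
{
	if (is_binary(from) || is_binary(to))
		return NAME_KEEP_BYTES;
	if (!strcmp(from, to))
		return NAME_KEEP_BYTES;
	return NAME_CONVERT;
}

/* Encoding that would be recorded in the tree entry on add/commit. */
const char *stored_encoding(const char *local_enc, const char *commit_enc)
{
	if (choose_action(local_enc, commit_enc) == NAME_CONVERT)
		return commit_enc;	/* 3B: name was converted */
	/* 3A: bytes unchanged, so record the local encoding */
	return is_binary(local_enc) ? "binary" : local_enc;
}
```

With these definitions, a repository that leaves commitFilenameEncoding unset (or "binary") never triggers a conversion on commit, while still recording the local encoding so that a later checkout on another platform can convert.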
I was browsing Git's code, and it seems that the encoding information would need to be added to struct name_entry in tree-walk.h. A quick search reveals that name_entry is used in 15 files, of which only 4 use it more than once. It would probably make sense to create a new datatype for the filename, for example "struct encoded_path { const char *path; const char *encoding; }", and then provide functions for accessing the filename in the right encoding (commit or local).

I might even be able to make that change myself, because Git is not legacy software (it has tests) and the needed changes seem quite local. I would just need a way to detect the encodings (at first it could rely on manually set config variables) and a library for doing the encoding conversions.

One big question is whether this change will require a change to the repository format. Will it be possible to add the encoding field to the tree object without breaking compatibility with older Git clients? If compatibility needs to be broken, how can it be done in a controlled fashion?

-- 
Esko Luontola
www.orfjackal.net