From: Paul Eggert
Subject: Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
Date: Tue, 11 Oct 2005 23:51:41 -0700
Message-ID: <877jcjmdmq.fsf@penguin.cs.ucla.edu>
To: Linus Torvalds
Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git@vger.kernel.org, Kai Ruemmler
In-Reply-To: (Linus Torvalds's message of "Tue, 11 Oct 2005 13:56:12 -0700 (PDT)")
X-Mailing-List: git@vger.kernel.org

Linus Torvalds writes:

> you can read it as a UTF-8 stream, but then quote things at a byte
> level (ie if you quote one "character", you quote _all_ bytes in
> that character).

Yes, that's what I had in mind.

> And you quote if:
>
>  - the UTF-8 _character_ is in the 0x80-0x9f control range

Yes.  Or more generally, if it's any UTF-8 control character.

>  - any _raw_byte_ is in the 0x80-0x9f range (it might not be UTF-8)

Why quote the raw bytes?  Is this for terminal escapes on older xterm
(or xterm-like) implementations that don't understand UTF-8?  If so,
I'm not sure I'd bother, as it would introduce a lot of annoying
quoting with perfectly reasonable UTF-8, and (if we assume the world
is moving to UTF-8) it addresses a problem that is going away.

>  - any _raw_byte_ is 0xfe-0xff (illegal UTF-8 character)
>  - misformed UTF-8 (non-shortest sequence, or just generally invalid
>    sequences with missing or wrong high bits)

Yes, that makes sense.

> quite frankly, that's a pretty painful thing to write.

It's not trivially short, yes.  But it shouldn't be that hard.

Also, I guess we don't have to write it, at least not at first.  As
long as we specify something like the C quoted-string format mentioned
earlier, we can encode into that format using a naive algorithm (e.g.,
quote any non-ASCII byte or ASCII control character), and beautify the
encoding method later.

> The upside is that it's easy to decode: you can _unquote_ it just as
> a byte stream.

Yes, that's the idea.

Also, the interchange format is the most important thing.  We have to
decode anything that is in the format, and we must encode into the
format.  Encoding prettily is nice, but not necessary.
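
To make the encode/decode split concrete, here is a rough sketch in
Python of the two encoders and the single byte-level decoder discussed
above.  It is only an illustration, not a proposed patch.  Assumptions:
the exact escape syntax is not spelled out in this message, so the
sketch guesses at double quotes with three-digit octal escapes plus
\" and \\; the function names are made up; and the "pretty" encoder
follows the variant that does not quote raw 0x80-0x9f bytes inside
otherwise valid UTF-8.

```python
import unicodedata


def quote_naive(path: bytes) -> bytes:
    """Naive encoder: quote every ASCII control byte and every non-ASCII byte."""
    out = bytearray(b'"')
    for b in path:
        if b in b'"\\':                  # keep the quoted form unambiguous
            out += b"\\" + bytes([b])
        elif b < 0x20 or b >= 0x7F:      # control, DEL, or non-ASCII
            out += b"\\%03o" % b
        else:
            out.append(b)
    return bytes(out + b'"')


def quote_utf8_aware(path: bytes) -> bytes:
    """Prettier encoder: leave valid, non-control UTF-8 characters alone."""
    out = bytearray(b'"')
    # surrogateescape turns each undecodable byte 0xNN into the lone
    # surrogate U+DCNN, so malformed input (0xfe/0xff, non-shortest forms,
    # stray continuation bytes) shows up as code points U+DC80..U+DCFF.
    for ch in path.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:       # malformed: quote the raw byte
            out += b"\\%03o" % (cp - 0xDC00)
        elif ch in '"\\':
            out += b"\\" + ch.encode("ascii")
        elif unicodedata.category(ch) == "Cc":
            for b in ch.encode("utf-8"):  # quote _all_ bytes of the character
                out += b"\\%03o" % b
        else:
            out += ch.encode("utf-8")
    return bytes(out + b'"')


def unquote(quoted: bytes) -> bytes:
    """Decoder: works on bytes only and needs no knowledge of UTF-8."""
    if len(quoted) < 2 or quoted[:1] != b'"' or quoted[-1:] != b'"':
        raise ValueError("not a quoted string")
    body = quoted[1:-1]
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i] != 0x5C:              # ordinary byte, copy through
            out.append(body[i])
            i += 1
        elif body[i + 1:i + 2] in (b'"', b"\\"):
            out.append(body[i + 1])
            i += 2
        else:
            digits = body[i + 1:i + 4]
            if len(digits) != 3:
                raise ValueError("truncated escape")
            out.append(int(digits, 8))   # raises ValueError on bad digits
            i += 4
    return bytes(out)
```

Both encoders produce output that the same unquote() accepts, which is
the point: the decoder and the format are fixed first, and a prettier
encoder can replace the naive one later without touching either.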