git.vger.kernel.org archive mirror
* [PATCH] Unicode: update of combining code points
@ 2014-04-07 19:30 Torsten Bögershausen
  2014-04-15 19:10 ` Peter Krefting
  0 siblings, 1 reply; 15+ messages in thread
From: Torsten Bögershausen @ 2014-04-07 19:30 UTC
  To: git; +Cc: tboegi

Unicode 6.3 defines the following code points as combining characters or
accents, so git_wcwidth() should return 0 for them.

Earlier Unicode standards had defined these code points as "reserved":

358 COMBINING DOT ABOVE RIGHT
359 COMBINING ASTERISK BELOW
35A COMBINING DOUBLE RING BELOW
35B COMBINING ZIGZAG ABOVE
35C COMBINING DOUBLE BREVE BELOW
487 COMBINING CYRILLIC POKRYTIE
5A2 HEBREW ACCENT ATNAH HAFUKH
5BA HEBREW POINT HOLAM HASER FOR VAV
5C5 HEBREW MARK LOWER DOT
5C7 HEBREW POINT QAMATS QATAN
604 ARABIC SIGN SAMVAT
616 ARABIC SMALL HIGH LIGATURE ALEF WITH LAM WITH YEH
617 ARABIC SMALL HIGH ZAIN
618 ARABIC SMALL FATHA
619 ARABIC SMALL DAMMA
61A ARABIC SMALL KASRA
659 ARABIC ZWARAKAY
65A ARABIC VOWEL SIGN SMALL V ABOVE
65B ARABIC VOWEL SIGN INVERTED SMALL V ABOVE
65C ARABIC VOWEL SIGN DOT BELOW
65D ARABIC REVERSED DAMMA
65E ARABIC FATHA WITH TWO DOTS
65F ARABIC WAVY HAMZA BELOW

This commit touches only the range 300-6FF; there may be more to be updated.

Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 utf8.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/utf8.c b/utf8.c
index a831d50..77c28d4 100644
--- a/utf8.c
+++ b/utf8.c
@@ -84,11 +84,10 @@ static int git_wcwidth(ucs_char_t ch)
 	 *   "uniset +cat=Me +cat=Mn +cat=Cf -00AD +1160-11FF +200B c".
 	 */
 	static const struct interval combining[] = {
-		{ 0x0300, 0x0357 }, { 0x035D, 0x036F }, { 0x0483, 0x0486 },
-		{ 0x0488, 0x0489 }, { 0x0591, 0x05A1 }, { 0x05A3, 0x05B9 },
-		{ 0x05BB, 0x05BD }, { 0x05BF, 0x05BF }, { 0x05C1, 0x05C2 },
-		{ 0x05C4, 0x05C4 }, { 0x0600, 0x0603 }, { 0x0610, 0x0615 },
-		{ 0x064B, 0x0658 }, { 0x0670, 0x0670 }, { 0x06D6, 0x06E4 },
+		{ 0x0300, 0x036F }, { 0x0483, 0x0489 }, { 0x0591, 0x05BD },
+		{ 0x05BF, 0x05BF }, { 0x05C1, 0x05C2 }, { 0x05C4, 0x05C5 },
+		{ 0x05C7, 0x05C7 }, { 0x0600, 0x0604 }, { 0x0610, 0x061A },
+		{ 0x064B, 0x065F }, { 0x0670, 0x0670 }, { 0x06D6, 0x06E4 },
 		{ 0x06E7, 0x06E8 }, { 0x06EA, 0x06ED }, { 0x070F, 0x070F },
 		{ 0x0711, 0x0711 }, { 0x0730, 0x074A }, { 0x07A6, 0x07B0 },
 		{ 0x0901, 0x0902 }, { 0x093C, 0x093C }, { 0x0941, 0x0948 },
-- 
1.9.0
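
As an illustration of how a table like the one in this diff is typically
consulted, here is a minimal sketch in C. It assumes the intervals are sorted
and non-overlapping and uses a plain binary search. The names demo_combining,
in_table and is_demo_combining are hypothetical, and this is not the code in
utf8.c; the table holds only the first ten updated ranges from the patch.

```c
#include <assert.h>
#include <stddef.h>

struct interval {
	unsigned int first;
	unsigned int last;
};

/* First ten ranges of the updated table; sorted and non-overlapping. */
static const struct interval demo_combining[] = {
	{ 0x0300, 0x036F }, { 0x0483, 0x0489 }, { 0x0591, 0x05BD },
	{ 0x05BF, 0x05BF }, { 0x05C1, 0x05C2 }, { 0x05C4, 0x05C5 },
	{ 0x05C7, 0x05C7 }, { 0x0600, 0x0604 }, { 0x0610, 0x061A },
	{ 0x064B, 0x065F },
};

/* Return 1 if ch falls inside one of the n intervals, 0 otherwise. */
static int in_table(unsigned int ch, const struct interval *table, size_t n)
{
	size_t lo = 0, hi = n;	/* search the half-open index range [lo, hi) */

	assert(table != NULL || n == 0);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ch < table[mid].first)
			hi = mid;
		else if (ch > table[mid].last)
			lo = mid + 1;
		else
			return 1;
	}
	return 0;
}

/* Convenience wrapper around the demo table (hypothetical helper). */
static int is_demo_combining(unsigned int ch)
{
	return in_table(ch, demo_combining,
			sizeof(demo_combining) / sizeof(demo_combining[0]));
}
```

With this table, is_demo_combining(0x0308) reports a hit, while
is_demo_combining(0x05BE) does not, since U+05BE falls in the gap between
0x05BD and 0x05BF.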

* Re: [PATCH] Unicode: update of combining code points
  2014-04-07 19:30 [PATCH] Unicode: update of combining code points Torsten Bögershausen
@ 2014-04-15 19:10 ` Peter Krefting
  2014-04-16  4:48   ` Torsten Bögershausen
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Krefting @ 2014-04-15 19:10 UTC
  To: Torsten Bögershausen; +Cc: git

Torsten Bögershausen:

> diff --git a/utf8.c b/utf8.c
> index a831d50..77c28d4 100644
> --- a/utf8.c
> +++ b/utf8.c

Is there a script that generates this code from the Unicode database 
files, or did you hand-update it?

-- 
\\// Peter - http://www.softwolves.pp.se/

* Re: [PATCH] Unicode: update of combining code points
  2014-04-15 19:10 ` Peter Krefting
@ 2014-04-16  4:48   ` Torsten Bögershausen
  2014-04-16 10:51     ` Kevin Bracey
  2014-04-24  9:02     ` Peter Krefting
  0 siblings, 2 replies; 15+ messages in thread
From: Torsten Bögershausen @ 2014-04-16  4:48 UTC
  To: Peter Krefting, Torsten Bögershausen; +Cc: git

On 15.04.14 21:10, Peter Krefting wrote:
> Torsten Bögershausen:
> 
>> diff --git a/utf8.c b/utf8.c
>> index a831d50..77c28d4 100644
>> --- a/utf8.c
>> +++ b/utf8.c
> 
> Is there a script that generates this code from the Unicode database files, or did you hand-update it?
> 
Some of the code points which have "0 length on the display" are called
"combining", others are called "vowels" or "accents".
E.g. 5BF is not marked as any of them, but if you look at the glyph, it should
be combining (please correct me if that is wrong).

If I had found a file which indicates, for each code point, what it
is, I could have written a script.

So yes, it is updated by hand.

* Re: [PATCH] Unicode: update of combining code points
  2014-04-16  4:48   ` Torsten Bögershausen
@ 2014-04-16 10:51     ` Kevin Bracey
  2014-04-16 19:58       ` Torsten Bögershausen
  2014-04-24  9:02     ` Peter Krefting
  1 sibling, 1 reply; 15+ messages in thread
From: Kevin Bracey @ 2014-04-16 10:51 UTC
  To: Torsten Bögershausen; +Cc: Peter Krefting, git

On 16/04/2014 07:48, Torsten Bögershausen wrote:
> On 15.04.14 21:10, Peter Krefting wrote:
>> Torsten Bögershausen:
>>
>>> diff --git a/utf8.c b/utf8.c
>>> index a831d50..77c28d4 100644
>>> --- a/utf8.c
>>> +++ b/utf8.c
>> Is there a script that generates this code from the Unicode database files, or did you hand-update it?
>>
> Some of the code points which have "0 length on the display" are called
> "combining", others are called "vowels" or "accents".
> E.g. 5BF is not marked any of them, but if you look at the glyph, it should
> be combining (please correct me if that is wrong).

Indeed it is combining (more specifically it has General Category 
"Nonspacing_Mark" = "Mn").

>
> If I could have found a file which indicates for each code point, what it
> is, I could write a script.
>

The most complete and machine-readable data are in these files:

http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt

The general categories can also be seen more legibly in:

http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt

For docs, see:

http://www.unicode.org/reports/tr44/
http://www.unicode.org/reports/tr11/
http://www.unicode.org/ucd/

The existing utf8.c comments describe the attributes being selected from 
the tables (general categories "Cf","Mn","Me", East Asian Width "W", 
"F"). And they suggest that the combining character table was originally 
auto-generated from UnicodeData.txt with a "uniset" tool. Presumably this?

https://github.com/depp/uniset

The fullwidth-checking code looks like it was done by hand, although 
apparently uniset can process EastAsianWidth.txt.
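
To illustrate what a generator built on these files would have to do, the
sketch below collapses a sorted list of selected code points into the closed
intervals that the utf8.c table stores. It is only a sketch under stated
assumptions: the input is already sorted and free of duplicates, and
merge_into_intervals is a hypothetical helper, not part of uniset or Git.

```c
#include <assert.h>
#include <stddef.h>

struct interval {
	unsigned int first;
	unsigned int last;
};

/*
 * Collapse n sorted, distinct code points into closed intervals.
 * At most max_out intervals are written to out; the return value is
 * the number of intervals the input needs, even if that is larger.
 */
static size_t merge_into_intervals(const unsigned int *cps, size_t n,
				   struct interval *out, size_t max_out)
{
	size_t count = 0, i;
	unsigned int last = 0;

	assert(cps != NULL || n == 0);
	assert(out != NULL || max_out == 0);
	for (i = 0; i < n; i++) {
		if (count > 0 && cps[i] == last + 1) {
			/* Adjacent to the previous code point: extend the run. */
			if (count <= max_out)
				out[count - 1].last = cps[i];
		} else {
			/* Gap (or first element): start a new run. */
			if (count < max_out) {
				out[count].first = cps[i];
				out[count].last = cps[i];
			}
			count++;
		}
		last = cps[i];
	}
	return count;
}
```

This also shows why the patch shrinks the table: once 0x0358..0x035C are
selected, the run ending at 0x0357 and the run starting at 0x035D join into
one interval.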

Kevin

* Re: [PATCH] Unicode: update of combining code points
  2014-04-16 10:51     ` Kevin Bracey
@ 2014-04-16 19:58       ` Torsten Bögershausen
  2014-04-17  6:32         ` Kevin Bracey
  0 siblings, 1 reply; 15+ messages in thread
From: Torsten Bögershausen @ 2014-04-16 19:58 UTC
  To: Kevin Bracey, Torsten Bögershausen; +Cc: Peter Krefting, git

On 2014-04-16 12.51, Kevin Bracey wrote:
> On 16/04/2014 07:48, Torsten Bögershausen wrote:
>> On 15.04.14 21:10, Peter Krefting wrote:
>>> Torsten Bögershausen:
>>>
>>>> diff --git a/utf8.c b/utf8.c
>>>> index a831d50..77c28d4 100644
>>>> --- a/utf8.c
>>>> +++ b/utf8.c
>>> Is there a script that generates this code from the Unicode database files, or did you hand-update it?
>>>
>> Some of the code points which have "0 length on the display" are called
>> "combining", others are called "vowels" or "accents".
>> E.g. 5BF is not marked any of them, but if you look at the glyph, it should
>> be combining (please correct me if that is wrong).
> 
> Indeed it is combining (more specifically it has General Category "Nonspacing_Mark" = "Mn").
> 
>>
>> If I could have found a file which indicates for each code point, what it
>> is, I could write a script.
>>
> 
> The most complete and machine-readable data are in these files:
> 
> http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
> http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt
> 
> The general categories can also be seen more legibly in:
> 
> http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt
> 
> For docs, see:
> 
> http://www.unicode.org/reports/tr44/
> http://www.unicode.org/reports/tr11/
> http://www.unicode.org/ucd/
> 
> The existing utf8.c comments describe the attributes being selected from the tables (general categories "Cf","Mn","Me", East Asian Width "W", "F"). And they suggest that the combining character table was originally auto-generated from UnicodeData.txt with a "uniset" tool. Presumably this?
> 
> https://github.com/depp/uniset
> 
> The fullwidth-checking code looks like it was done by hand, although apparently uniset can process EastAsianWidth.txt.
> 
> Kevin
Excellent, thanks for the pointers.
Running the script below shows that
"0x00AD SOFT HYPHEN" should have zero width (and some others too).
I wonder if that is really the case, and which one of the last 2 lines
in the script is the right one.

What does this mean for us:
"Cf 	Format 	a format control character"


#!/bin/sh
# Download the Unicode data files and build uniset if they are missing,
# then run uniset on the selected general categories.
# The last two lines are the two candidate selections: with and without Cf.

if ! test -f UnicodeData.txt; then
  wget http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
fi &&
if ! test -f EastAsianWidth.txt; then
  wget http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt
fi &&
if ! test -f DerivedGeneralCategory.txt; then
  wget http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt
fi &&
if ! test -d uniset; then
  git clone https://github.com/tboegi/uniset.git
fi &&
(
  cd uniset &&
  if ! test -x uniset; then
    autoreconf -i &&
    ./configure --enable-warnings=-Werror CFLAGS='-O0 -ggdb'
  fi &&
  make
) &&
UNICODE_DIR=. ./uniset/uniset --32 cat:Me,Mn,Cf
#UNICODE_DIR=. ./uniset/uniset --32 cat:Me,Mn


* Re: [PATCH] Unicode: update of combining code points
  2014-04-16 19:58       ` Torsten Bögershausen
@ 2014-04-17  6:32         ` Kevin Bracey
  0 siblings, 0 replies; 15+ messages in thread
From: Kevin Bracey @ 2014-04-17  6:32 UTC
  To: Torsten Bögershausen; +Cc: Peter Krefting, git

On 16/04/2014 22:58, Torsten Bögershausen wrote:
> Excellent, thanks for the pointers.
> Running the script below shows that
> "0X00AD SOFT HYPHEN" should have zero length (and some others too).
> I wonder if that is really the case, and which one of the last 2 lines
> in the script is the right one.
>
> What does this mean for us:
> "Cf 	Format 	a format control character"
>
Maybe dig back through the Git logs to check the original logic, but the 
comments suggest that "Cf" characters have been viewed as zero-width. 
That makes sense - they're usually markers indicating things like 
bidirectional text flow, so won't be taking space. (Although they may be 
causing even more extreme layout effects...)

Soft-hyphen is noted as an explicit exception to the rule in the utf8.c 
comments. As of Unicode 4.0, it's supposed to be a character indicating 
a point where a hyphen could be placed if a line-wrap occurs, and if 
that wrap happens, then it can actually take up 1 space, otherwise not. 
So its width could be either 0 or 1, depending. Or, quite likely, the 
terminal doesn't treat it specially, and it always just looks like a 
hyphen... Thus we err on the safe side and give it width 1.

See http://en.wikipedia.org/wiki/Soft_hyphen for background.

The comments suggest adding "-00AD +1160-11FF" to the uniset command 
line for that tweak and for composing Hangul. (The +200B tweak isn't 
necessary any more - Zero-Width Space U+200B became Cf officially in 
Unicode 4.0.1:

http://en.wikipedia.org/wiki/Zero-width_space
http://www.unicode.org/review/resolved-pri.html#pri21
)
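
The selection rule described here can be written down as a small predicate.
The following is a sketch, assuming the general category of each code point
has already been looked up (for example from UnicodeData.txt); is_zero_width
is a hypothetical name and not a function in utf8.c.

```c
#include <assert.h>
#include <string.h>

/*
 * Decide whether a code point belongs in the zero-width table, given its
 * General_Category string (e.g. "Mn").  This mirrors the selection
 * "+cat=Me +cat=Mn +cat=Cf -00AD +1160-11FF"; the +200B term is left out
 * because U+200B is itself in category Cf.
 */
static int is_zero_width(unsigned int cp, const char *category)
{
	assert(category != NULL);
	if (cp == 0x00AD)			/* soft hyphen: keep width 1 */
		return 0;
	if (cp >= 0x1160 && cp <= 0x11FF)	/* composing Hangul range */
		return 1;
	return !strcmp(category, "Me") ||
	       !strcmp(category, "Mn") ||
	       !strcmp(category, "Cf");
}
```

Under these assumptions, is_zero_width(0x200B, "Cf") is true without a
special case, while is_zero_width(0x00AD, "Cf") stays false.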

All of this is only really an approximation - a best-effort attempt to 
figure out the width of a string without any actual communication with 
the display device. So it'll never be perfect. The choice between double 
and single width in particular will often be unpredictable, unless you 
had deeper locale knowledge.

Actually, while doing this, I've realised that this was originally 
Markus Kuhn's implementation, and that is acknowledged at the top of the 
file:

http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

Good, because he knows what he's doing.

Kevin

* Re: [PATCH] Unicode: update of combining code points
  2014-04-16  4:48   ` Torsten Bögershausen
  2014-04-16 10:51     ` Kevin Bracey
@ 2014-04-24  9:02     ` Peter Krefting
  1 sibling, 0 replies; 15+ messages in thread
From: Peter Krefting @ 2014-04-24  9:02 UTC
  To: Torsten Bögershausen; +Cc: Git Mailing List

Torsten Bögershausen:

> Some of the code points which have "0 length on the display" are called
> "combining", others are called "vowels" or "accents".
> E.g. 5BF is not marked any of them, but if you look at the glyph, it should
> be combining (please correct me if that is wrong).

All combining characters have a non-zero combining class in 
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt (fourth field, 
called Canonical_Combining_Class in 
http://www.unicode.org/reports/tr44/ ). For instance, the aforementioned 
U+05BF is defined as follows:

   05BF;HEBREW POINT RAFE;Mn;23;NSM;;;;;N;;;;;

The combining class is 23, so this is a combining character.
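
A record like the one above can be split with a few lines of C. This is a
minimal sketch, assuming the first four fields are non-empty as in the U+05BF
line; struct ucd_record and parse_ucd_line are hypothetical names introduced
for illustration, and the First/Last range entries in UnicodeData.txt are not
given any special treatment.

```c
#include <assert.h>
#include <stdio.h>

struct ucd_record {
	unsigned int code_point;
	char category[3];	/* General_Category, e.g. "Mn" */
	int combining_class;	/* Canonical_Combining_Class */
};

/*
 * Parse the first four semicolon-separated fields of a UnicodeData.txt
 * line (code point, name, general category, combining class).
 * Returns 1 on success and 0 if the line does not match that shape.
 */
static int parse_ucd_line(const char *line, struct ucd_record *out)
{
	assert(line != NULL && out != NULL);
	return sscanf(line, "%x;%*[^;];%2[^;];%d",
		      &out->code_point, out->category,
		      &out->combining_class) == 3;
}
```

For the U+05BF line this yields category "Mn" and combining class 23, which
matches the reading given above.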

There is a difference between non-spacing combining marks ("Mn" in the 
third column (General_Category)) and others ("Mc" for spacing marks 
and "Me" for enclosing marks), so they might need special handling. 
Additionally, you have the "zero-width" characters, such as U+200B 
Zero Width Space. These have the "Cf" class, although it also contains 
visible characters IIRC.

-- 
\\// Peter - http://www.softwolves.pp.se/

* Re: [PATCH] Unicode: update of combining code points
  2014-04-09 17:30       ` Junio C Hamano
@ 2014-04-10  4:12         ` Torsten Bögershausen
  0 siblings, 0 replies; 15+ messages in thread
From: Torsten Bögershausen @ 2014-04-10  4:12 UTC
  To: Junio C Hamano; +Cc: Torsten Bögershausen, Jonathan Nieder, git

Excellent, thanks.

* Re: [PATCH] Unicode: update of combining code points
  2014-04-09 16:48     ` Torsten Bögershausen
@ 2014-04-09 17:30       ` Junio C Hamano
  2014-04-10  4:12         ` Torsten Bögershausen
  0 siblings, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2014-04-09 17:30 UTC
  To: Torsten Bögershausen; +Cc: Jonathan Nieder, git

Torsten Bögershausen <tboegi@web.de> writes:

> How about this as a commit message:
>
> Unicode: partially update to version 6.3
>
> Unicode 6.3 defines the following code points as combining or accents,
> git_wcwidth() should return 0.
>
> Earlier unicode standards had defined these code point as "reserved":
> 358--35C
> 487
> 5A2, 5BA, 5C5, 5C7
> 604, 616--61A, 659--65F
>
> Note: for this commit only the range 0..7FF has been checked,
> more updates may be needed.
>
> Signed-off-by: Torsten Bögershausen <tboegi@web.de>

Thanks.

I do not think you meant to say that the listed codepoints above are
the only ones that were "reserved".  Rather, the codepoints listed
are what are affected by this change, and these were all reserved.

Also it may help to describe the end-user visible effect, as Jonathan
asked in his earlier message.  How about extending it like this?

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
utf8.c: partially update to version 6.3

Unicode 6.3 defines more code points as combining or accents.  For
example, the character "ö" could be expressed as an "o" followed by
U+0308 COMBINING DIAERESIS (aka umlaut, double-dot-above).  We should
consider that such a sequence of two codepoints occupies one display
column for the alignment purposes, and for that, git_wcwidth()
should return 0 for them.  Affected codepoints are:

    U+0358..U+035C
    U+0487
    U+05A2, U+05BA, U+05C5, U+05C7
    U+0604, U+0616..U+061A, U+0659..U+065F

Earlier Unicode standards had defined these as "reserved".

Only the range 0..U+07FF has been checked to see which codepoints
need to be marked as 0-width while preparing for this commit; more
updates may be needed.

Signed-off-by: Torsten Bögershausen <tboegi@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
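
To make the alignment argument concrete, here is a minimal sketch that sums
per-code-point widths for an already-decoded string. demo_wcwidth is a
deliberately simplified stand-in for git_wcwidth() that only knows the range
U+0300..U+036F; both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for git_wcwidth(): only U+0300..U+036F is zero-width. */
static int demo_wcwidth(unsigned int ch)
{
	return (ch >= 0x0300 && ch <= 0x036F) ? 0 : 1;
}

/* Sum the column widths of n already-decoded code points. */
static int demo_display_width(const unsigned int *cps, size_t n)
{
	int width = 0;
	size_t i;

	assert(cps != NULL || n == 0);
	for (i = 0; i < n; i++)
		width += demo_wcwidth(cps[i]);
	return width;
}
```

With this stand-in, the precomposed U+00F6 and the decomposed pair U+006F
U+0308 both come out as one column, which is what column alignment needs.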

* Re: [PATCH] Unicode: update of combining code points
  2014-04-08 22:37   ` Junio C Hamano
@ 2014-04-09 16:48     ` Torsten Bögershausen
  2014-04-09 17:30       ` Junio C Hamano
  0 siblings, 1 reply; 15+ messages in thread
From: Torsten Bögershausen @ 2014-04-09 16:48 UTC
  To: Junio C Hamano; +Cc: Jonathan Nieder, git, Torsten Bögershausen

On 04/09/2014 12:37 AM, Junio C Hamano wrote:
> Jonathan Nieder <jrnieder@gmail.com> writes:
>
>> Torsten Bögershausen wrote:
>>
>>> Unicode 6.3 defines the following code as combining or accents,
>>> git_wcwidth() should return 0.
>>>
>>> Earlier unicode standards had defined these code point as "reserved":
>> Thanks for the update.  Could the commit message also explain how this
>> was noticed and what the user-visible effect is?
>>
>> For example:
>>
>>  "Unicode just announced that <...>.  That means we should mark the
>>   relevant code points as combining characters so git knows they are
>>   zero-width and doesn't screw up the alignment when presenting branch
>>   names in columns with 'git branch --column'"
>>
>> or something like that.
> Perhaps (the original read clearly enough for me, though).
>
>> [...]
>>> 358 COMBINING DOT ABOVE RIGHT
>>> 359 COMBINING ASTERISK BELOW
>> I'm not sure this list is needed --- the code + the reference to the
>> Unicode 6.3 standard seems like enough (but if you think otherwise,
>> I don't really mind).
> I can go either way.
>
>>> This commit touches only the range 300-6FF, there may be more to be updated.
>> The "there may be more" here sounds ominous.
> Indeed it does ;-)
>
>> Does that mean Unicode
>> 6.3 also added some zero-width characters in other ranges that should
>> be dealt with in the future?  How many such ranges?  How do we know
>> when we're done?
>>
>> Just biting off the most important characters first and putting off
>> the rest for later sounds fine to me --- my complaint is that the
>> above comment doesn't make clear what the to-do list is for finishing
>> the update later.
> I'll queue this at the tip of 'pu', not to forget about it while
> waiting for a clarification.
>
> Thanks.
Thanks for the comments; here comes the long version of the story:
I recently fooled myself by running
"git config --global user.name" with a decomposed "ö" on a new Mac OS X machine.

While there were few problems on Mac OS, all Windows and Linux machines stumbled
over the decomposed ö, or more exactly over 0x308 COMBINING DIAERESIS (the 2 dots),
giving all kinds of weird output in "git log".

Looking into commit.c and utf8.c to see how to improve the situation, I made these observations:
- Some code from commit.c can possibly be moved into utf8.c, so that we only
  have 1 UTF-8 parser.
- A solution would be to run precompose_string() under Mac OS (which is a no-op otherwise).
  This could have saved my day. Probably I will make a patch some day.
- Some of the combining code points exist in Unicode 6.3, but not in utf8.c
  (which seems to be based on a Unicode version newer than 2.0 but older than 6.3).
  I found some in the 0x300 area, looked at the neighbors, and had enough time to
  read all code pages up to 0x7FF.

So if somebody knows how to find out which code points are combining characters
or accents, or in other words should make git_wcwidth() return 0, please let me know.

How about this as a commit message:

Unicode: partially update to version 6.3

Unicode 6.3 defines the following code points as combining or accents,
so git_wcwidth() should return 0 for them.

Earlier Unicode standards had defined these code points as "reserved":
358--35C
487
5A2, 5BA, 5C5, 5C7
604, 616--61A, 659--65F

Note: for this commit only the range 0..7FF has been checked;
more updates may be needed.

Signed-off-by: Torsten Bögershausen <tboegi@web.de>

* Re: [PATCH] Unicode: update of combining code points
  2014-04-07 19:54 ` Jonathan Nieder
@ 2014-04-08 22:37   ` Junio C Hamano
  2014-04-09 16:48     ` Torsten Bögershausen
  0 siblings, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2014-04-08 22:37 UTC
  To: Jonathan Nieder; +Cc: Torsten Bögershausen, git

Jonathan Nieder <jrnieder@gmail.com> writes:

> Torsten Bögershausen wrote:
>
>> Unicode 6.3 defines the following code as combining or accents,
>> git_wcwidth() should return 0.
>>
>> Earlier unicode standards had defined these code point as "reserved":
>
> Thanks for the update.  Could the commit message also explain how this
> was noticed and what the user-visible effect is?
>
> For example:
>
>  "Unicode just announced that <...>.  That means we should mark the
>   relevant code points as combining characters so git knows they are
>   zero-width and doesn't screw up the alignment when presenting branch
>   names in columns with 'git branch --column'"
>
> or something like that.

Perhaps (the original read clearly enough for me, though).

> [...]
>> 358 COMBINING DOT ABOVE RIGHT
>> 359 COMBINING ASTERISK BELOW
>
> I'm not sure this list is needed --- the code + the reference to the
> Unicode 6.3 standard seems like enough (but if you think otherwise,
> I don't really mind).

I can go either way.

>> This commit touches only the range 300-6FF, there may be more to be updated.
>
> The "there may be more" here sounds ominous.

Indeed it does ;-)

> Does that mean Unicode
> 6.3 also added some zero-width characters in other ranges that should
> be dealt with in the future?  How many such ranges?  How do we know
> when we're done?
>
> Just biting off the most important characters first and putting off
> the rest for later sounds fine to me --- my complaint is that the
> above comment doesn't make clear what the to-do list is for finishing
> the update later.

I'll queue this at the tip of 'pu', not to forget about it while
waiting for a clarification.

Thanks.

* Re: [PATCH] Unicode: update of combining code points
  2014-04-07 19:39 Torsten Bögershausen
@ 2014-04-07 19:54 ` Jonathan Nieder
  2014-04-08 22:37   ` Junio C Hamano
  0 siblings, 1 reply; 15+ messages in thread
From: Jonathan Nieder @ 2014-04-07 19:54 UTC
  To: Torsten Bögershausen; +Cc: git

Hi,

Torsten Bögershausen wrote:

> Unicode 6.3 defines the following code as combining or accents,
> git_wcwidth() should return 0.
>
> Earlier unicode standards had defined these code point as "reserved":

Thanks for the update.  Could the commit message also explain how this
was noticed and what the user-visible effect is?

For example:

 "Unicode just announced that <...>.  That means we should mark the
  relevant code points as combining characters so git knows they are
  zero-width and doesn't screw up the alignment when presenting branch
  names in columns with 'git branch --column'"

or something like that.

[...]
> 358 COMBINING DOT ABOVE RIGHT
> 359 COMBINING ASTERISK BELOW

I'm not sure this list is needed --- the code + the reference to the
Unicode 6.3 standard seems like enough (but if you think otherwise,
I don't really mind).

> This commit touches only the range 300-6FF, there may be more to be updated.

The "there may be more" here sounds ominous.  Does that mean Unicode
6.3 also added some zero-width characters in other ranges that should
be dealt with in the future?  How many such ranges?  How do we know
when we're done?

Just biting off the most important characters first and putting off
the rest for later sounds fine to me --- my complaint is that the
above comment doesn't make clear what the to-do list is for finishing
the update later.

Thanks and hope that helps,
Jonathan

Thread overview: 15+ messages
2014-04-07 19:30 [PATCH] Unicode: update of combining code points Torsten Bögershausen
2014-04-15 19:10 ` Peter Krefting
2014-04-16  4:48   ` Torsten Bögershausen
2014-04-16 10:51     ` Kevin Bracey
2014-04-16 19:58       ` Torsten Bögershausen
2014-04-17  6:32         ` Kevin Bracey
2014-04-24  9:02     ` Peter Krefting
2014-04-07 19:34 Torsten Bögershausen
2014-04-07 19:38 Torsten Bögershausen
2014-04-07 19:39 Torsten Bögershausen
2014-04-07 19:54 ` Jonathan Nieder
2014-04-08 22:37   ` Junio C Hamano
2014-04-09 16:48     ` Torsten Bögershausen
2014-04-09 17:30       ` Junio C Hamano
2014-04-10  4:12         ` Torsten Bögershausen
