git.vger.kernel.org archive mirror
From: "Torsten Bögershausen" <tboegi@web.de>
To: "Kevin Bracey" <kevin@bracey.fi>, "Torsten Bögershausen" <tboegi@web.de>
Cc: Peter Krefting <peter@softwolves.pp.se>, git@vger.kernel.org
Subject: Re: [PATCH] Unicode: update of combining code points
Date: Wed, 16 Apr 2014 21:58:31 +0200
Message-ID: <534EE0E7.2030608@web.de>
In-Reply-To: <534E60BF.5020602@bracey.fi>

On 2014-04-16 12.51, Kevin Bracey wrote:
> On 16/04/2014 07:48, Torsten Bögershausen wrote:
>> On 15.04.14 21:10, Peter Krefting wrote:
>>> Torsten Bögershausen:
>>>
>>>> diff --git a/utf8.c b/utf8.c
>>>> index a831d50..77c28d4 100644
>>>> --- a/utf8.c
>>>> +++ b/utf8.c
>>> Is there a script that generates this code from the Unicode database files, or did you hand-update it?
>>>
>> Some of the code points which have "0 length on the display" are called
>> "combining", others are called "vowels" or "accents".
>> E.g. U+05BF is not marked as any of them, but if you look at the glyph, it
>> should be combining (please correct me if that is wrong).
> 
> Indeed it is combining (more specifically it has General Category "Nonspacing_Mark" = "Mn").
> 
>>
>> If I could have found a file which indicates for each code point, what it
>> is, I could write a script.
>>
> 
> The most complete and machine-readable data are in these files:
> 
> http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
> http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt
> 
> The general categories can also be seen more legibly in:
> 
> http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt
> 
> For docs, see:
> 
> http://www.unicode.org/reports/tr44/
> http://www.unicode.org/reports/tr11/
> http://www.unicode.org/ucd/
> 
> The existing utf8.c comments describe the attributes being selected from the tables (general categories "Cf","Mn","Me", East Asian Width "W", "F"). And they suggest that the combining character table was originally auto-generated from UnicodeData.txt with a "uniset" tool. Presumably this?
> 
> https://github.com/depp/uniset
> 
> The fullwidth-checking code looks like it was done by hand, although apparently uniset can process EastAsianWidth.txt.
> 
> Kevin
Excellent, thanks for the pointers.
Running the script below shows that
"U+00AD SOFT HYPHEN" (and some others) would get zero width on the display.
I wonder if that is really the case, and which of the last two lines
in the script is the right one.

What does this category description mean for us?
"Cf	Format	a format control character"


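#!/bin/sh
# Fetch the Unicode data files, build uniset, and run it for the
# general categories under discussion.

# Download each data file only if it is not already present.
if ! test -f UnicodeData.txt; then
  wget http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
fi &&
if ! test -f EastAsianWidth.txt; then
  wget http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt
fi &&
if ! test -f DerivedGeneralCategory.txt; then
  wget http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt
fi &&
# Clone uniset once, then configure it only if the binary is missing.
if ! test -d uniset; then
  git clone https://github.com/tboegi/uniset.git
fi &&
(
  cd uniset &&
  if ! test -x uniset; then
    autoreconf -i &&
    ./configure --enable-warnings=-Werror CFLAGS='-O0 -ggdb'
  fi &&
  make
) &&
# Candidate 1: Me and Mn plus the format characters (Cf).
UNICODE_DIR=. ./uniset/uniset --32 cat:Me,Mn,Cf
# Candidate 2: the same set without Cf.
#UNICODE_DIR=. ./uniset/uniset --32 cat:Me,Mn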
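
To judge which of the last two lines is right, it may help to see exactly
which code points the Cf category would add. The following is only a minimal
sketch, assuming the UnicodeData.txt layout (semicolon-separated fields:
code point first, name second, General Category third). list_category is a
hypothetical helper name, not part of git or uniset.

```shell
#!/bin/sh
# Sketch: list the code points of one General Category from UnicodeData.txt.
# Assumed layout per line: code point (hex);name;General Category;...
# list_category is a hypothetical helper, not part of git or uniset.
list_category () {
	# $1 = General Category (e.g. Cf), $2 = path to UnicodeData.txt
	awk -F';' -v gc="$1" '$3 == gc { print "U+" $1, $2 }' "$2"
}

# Only run when the data file is present, e.g. after the download step.
if test -f UnicodeData.txt; then
	list_category Cf UnicodeData.txt
fi
```

Calling it with Mn or Me instead lists the other two categories, so the
sets can be compared by eye. Ranges that UnicodeData.txt encodes as
First/Last pairs show up only as their two endpoint lines.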
