* Converting man-pages to UTF-8
@ 2014-02-14 10:43 Michael Kerrisk (man-pages)
       [not found] ` <CAKgNAkh5tHmJc2DrcoAJsDWWFao6bPckd2sN1dw-CZDSFNi5kQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-02-14 10:43 UTC (permalink / raw)
  To: linux-man; +Cc: Colin Watson, Bruno Haible, Werner Lemberg, Peter Schiffer

Hello all,

At https://bugzilla.kernel.org/show_bug.cgi?id=60807 is a proposal to
convert the pages of the "man-pages" project to UTF-8. I thought
it worthwhile bringing that topic to the list, and CCing a few people
who may have some ideas about this step, since I'm not too sure of the
implications.

Peter Schiffer has kindly written some scripts to do the
conversion, which would touch about 40 files. However, as far as I can
tell, many of the pages that have non-ASCII characters have them inside
groff comments (authors' names, etc.). The only pages that have
non-ASCII characters in the rendered source are various man7 pages on
character sets. These were the pages to which I added a groff encoding
marker in response to Colin Watson's input on this Debian bug:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=519209
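
The distinction drawn above, between non-ASCII bytes that sit inside
groff comments and those that reach the rendered text, can be checked
mechanically. The following is only a rough sketch, not Peter's script:
the function name classify_non_ascii is hypothetical, and it assumes
that a comment is anything from \" (or the groff extension \#) to the
end of the line, which ignores corner cases such as an escaped
backslash immediately before a quote.

```python
import re

# Heuristic: a groff comment starts at \" (or \#) and runs to end of line.
COMMENT_START = re.compile(rb'\\["#]')


def classify_non_ascii(page: bytes) -> str:
    """Return 'ascii', 'comments-only', or 'rendered' for a page source."""
    seen_in_comment = False
    for line in page.split(b"\n"):
        if all(b < 0x80 for b in line):
            continue  # pure ASCII line, nothing to classify
        match = COMMENT_START.search(line)
        # Everything before the comment marker ends up in the output.
        text = line if match is None else line[:match.start()]
        if any(b >= 0x80 for b in text):
            return "rendered"
        seen_in_comment = True
    return "comments-only" if seen_in_comment else "ascii"
```

Working on raw bytes means the check does not depend on knowing the
page's current encoding, which is exactly what is uncertain before the
conversion.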

Moving to UTF-8 for the pages seems like a good idea, at least at some
point. However, I'm wondering whether there are any backward
compatibility issues that I need to worry about. As far as I
know, groff added UTF-8 support back in Jan 2009, so, just over 5
years ago. Perhaps that's long enough ago now, that any backward
compatibility issues with old versions of groff would be minimal.
(I.e., the number of people installing new man-pages on systems with
old groff is likely to be very small, and anyway, only a dozen or so
pages in Section 7 are affected. Furthermore, I'm assuming that Linux
distros have been shipping groff v1.20+ for quite a long time now.)

Bottom line question: anyone see a reason not to do this conversion now?

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: Converting man-pages to UTF-8
       [not found] ` <CAKgNAkh5tHmJc2DrcoAJsDWWFao6bPckd2sN1dw-CZDSFNi5kQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-02-14 11:42   ` Colin Watson
       [not found]     ` <20140214114216.GE6397-K2jUmMR1UYV4cg9Nei1l7Q@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Colin Watson @ 2014-02-14 11:42 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: linux-man, Bruno Haible, Werner Lemberg, Peter Schiffer

On Fri, Feb 14, 2014 at 11:43:30AM +0100, Michael Kerrisk (man-pages) wrote:
> At https://bugzilla.kernel.org/show_bug.cgi?id=60807 is a proposal to
> convert the pages of the "man-pages" project to UTF-8. I thought
> it worthwhile bringing that topic to the list, and CCing a few people
> who may have some ideas about this step, since I'm not too sure of the
> implications.
> 
> Peter Schiffer has kindly written some scripts to do the
> conversion, which would touch about 40 files. However, as far as I can
> tell, many of the pages that have non-ASCII characters have them inside
> groff comments (authors' names, etc.). The only pages that have
> non-ASCII characters in the rendered source are various man7 pages on
> character sets. These were the pages to which I added a groff encoding
> marker in response to Colin Watson's input on this Debian bug:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=519209
> 
> Moving to UTF-8 for the pages seems like a good idea, at least at some
> point. However, I'm wondering whether there are any backward
> compatibility issues that I need to worry about. As far as I
> know, groff added UTF-8 support back in Jan 2009, so, just over 5
> years ago. Perhaps that's long enough ago now, that any backward
> compatibility issues with old versions of groff would be minimal.
> (I.e., the number of people installing new man-pages on systems with
> old groff is likely to be very small, and anyway, only a dozen or so
> pages in Section 7 are affected. Furthermore, I'm assuming that Linux
> distros have been shipping groff v1.20+ for quite a long time now.)

I think for characters in comments you're probably fine, and any
problems you might have had should be gone as of groff 1.20.  Debian
switched to that in July 2009, and I think we were late to the party
because we had some difficult historical baggage to clean up at the same
time.  I'm not aware of anyone shipping older versions of groff any
more.

When you convert characters that show up in rendered source, I suspect
systems using the other man package (1.6g or similar versions) may
render them poorly, because it invokes nroff in some fairly naïve and
hardcoded ways.  However, they already break in various related ways,
and most distributions have switched to man-db now, or dealt with things
some other way.  My rough survey of the major distributions for this is:

  Arch has been good since about 2009

  Debian and descendants are good as of late 2007 / early 2008 (addition
  of manconv to man-db)

  Fedora is definitely good as of 2010 (switch to man-db), and I think
  was good before that as IIRC they did a flag day to switch everything
  to UTF-8 with man

  Gentoo switched to man-db at the end of 2013, so should be good now

  Mageia has a current groff, but uses man 1.6g with a stack of patches
  (some encoding-related)

  openSUSE has been fine for about the same length of time as Debian

  Slackware has a current groff, but uses man 1.6g without much in the
  way of special patches (just one to make things work for UTF-8
  *output*)

My guess is that Mageia and Slackware may find that things only work
properly for users in UTF-8 locales, but most other major distributions
should be fine.  You won't be the first author to switch to UTF-8 manual
pages; all you'll be doing is making existing shortcomings perhaps
marginally more obvious.  In any case, the pages currently encoded in
ISO-8859-1 won't be very seriously affected, and users of problematic
systems will only have been able to read the other pages with good luck
and a following wind anyway; switching to UTF-8 will probably actually
improve things for them if they're using a UTF-8 locale.  (That is, the
problems that the affected systems have generally relate to attempting
to read pages whose encoding doesn't match that of their locale.)  They
might possibly need to add the -k option to their nroff invocation in
man.conf.

If I were you I would just go ahead.

Regarding your questions in the bug, please do keep the "coding:" tag in
there; man-db will figure this out by brute force, but if left to its
own devices I think groff's preconv will default to the locale's
encoding, so it will only work for some people.
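
To make the role of the "coding:" tag concrete, here is a minimal
sketch of re-encoding one page and updating its tag in the same step.
It is not the conversion script from the bug report. It assumes the
page is currently ISO-8859-1 and that the tag is an Emacs-style
"-*- coding: NAME -*-" marker on the first or second line, which is
where preconv is understood to look for it; the function name to_utf8
is hypothetical.

```python
import re

# Matches an Emacs-style tag such as:  '\" t -*- coding: ISO-8859-1 -*-
CODING_TAG = re.compile(r'(-\*-.*?coding:\s*)([A-Za-z0-9_.-]+)(.*?-\*-)')


def to_utf8(source: bytes, old_encoding: str = "iso-8859-1") -> bytes:
    """Re-encode a page as UTF-8 and rewrite its coding tag, if present."""
    lines = source.decode(old_encoding).split("\n")
    # Assumption: only the first two lines can carry the tag.
    for i in range(min(2, len(lines))):
        new_line, count = CODING_TAG.subn(r"\g<1>UTF-8\g<3>", lines[i], count=1)
        if count:
            lines[i] = new_line
            break
    return "\n".join(lines).encode("utf-8")
```

Updating the tag together with the bytes matters: a page whose tag
still says ISO-8859-1 after its contents have become UTF-8 would be
misread even by tools that honour the tag.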

Cheers,

-- 
Colin Watson                                       [cjwatson-8fiUuRrzOP0dnm+yROfE0A@public.gmane.org]

* Re: Converting man-pages to UTF-8
       [not found]     ` <20140214114216.GE6397-K2jUmMR1UYV4cg9Nei1l7Q@public.gmane.org>
@ 2014-02-14 15:28       ` Michael Kerrisk (man-pages)
       [not found]         ` <52FE360B.9050302-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-02-14 15:28 UTC (permalink / raw)
  To: Colin Watson
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, linux-man, Bruno Haible,
	Werner Lemberg, Peter Schiffer

On 02/14/2014 12:42 PM, Colin Watson wrote:
> Regarding your questions in the bug, please do keep the "coding:" tag in
> there; man-db will figure this out by brute force, but if left to its
> own devices I think groff's preconv will default to the locale's
> encoding, so it will only work for some people.

Hello Colin,

Thanks for the extensive reply! One final point. For the pages that
have non-ASCII characters only in source comments, not in rendered
input source, does it matter whether or not the "coding:" tag is added?
I ask because, simply for documentary purposes, I'm wondering whether
we should add that tag only in the pages that have UTF-8 in the rendered
input.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

* Re: Converting man-pages to UTF-8
       [not found]         ` <52FE360B.9050302-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2014-02-14 16:30           ` Colin Watson
       [not found]             ` <20140214163035.GF6397-K2jUmMR1UYV4cg9Nei1l7Q@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Colin Watson @ 2014-02-14 16:30 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: linux-man, Bruno Haible, Werner Lemberg, Peter Schiffer

On Fri, Feb 14, 2014 at 04:28:11PM +0100, Michael Kerrisk (man-pages) wrote:
> Thanks for the extensive reply! One final point. For the pages that
> have non-ASCII characters only in source comments, not in rendered
> input source, does it matter whether or not the "coding:" tag is added?
> I ask because, simply for documentary purposes, I'm wondering whether
> we should add that tag only in the pages that have UTF-8 in the rendered
> input.

I don't *believe* it will make a difference to groff's preconv, since
all the other encodings people might be using will at worst only mangle
UTF-8 characters in comments in ways that preserve their comment status.
It might allow man-db to be marginally more efficient, but I think any
difference would be negligible.
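
The reasoning here can be illustrated with a small sketch. It assumes
the "other encoding" is an ASCII-compatible single-byte one such as
ISO-8859-1, and the helper name misread_utf8 is hypothetical. UTF-8
encodes every non-ASCII character using only bytes in the range
0x80-0xFF, so misreading those bytes can never produce a newline or
disturb the ASCII comment marker; the comment stays a comment, just
with garbled text in it.

```python
def misread_utf8(line: str, wrong_encoding: str = "iso-8859-1") -> str:
    """Decode a UTF-8 encoded line as if it were in another encoding."""
    return line.encode("utf-8").decode(wrong_encoding)


comment = '.\\" Author: Andr\u00e9'
garbled = misread_utf8(comment)

# The name is mangled, but the line is still one single groff comment.
still_comment = garbled.startswith('.\\"') and "\n" not in garbled
```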

Regards,

-- 
Colin Watson                                       [cjwatson-8fiUuRrzOP0dnm+yROfE0A@public.gmane.org]

* Re: Converting man-pages to UTF-8
       [not found]             ` <20140214163035.GF6397-K2jUmMR1UYV4cg9Nei1l7Q@public.gmane.org>
@ 2014-02-16  7:41               ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 5+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-02-16  7:41 UTC (permalink / raw)
  To: Colin Watson
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, linux-man, Bruno Haible,
	Werner Lemberg, Peter Schiffer

On 02/14/2014 05:30 PM, Colin Watson wrote:
> On Fri, Feb 14, 2014 at 04:28:11PM +0100, Michael Kerrisk (man-pages) wrote:
>> Thanks for the extensive reply! One final point. For the pages that
>> have non-ASCII characters only in source comments, not in rendered
>> input source, does it matter whether or not the "coding:" tag is added?
>> I ask because, simply for documentary purposes, I'm wondering whether
>> we should add that tag only in the pages that have UTF-8 in the rendered
>> input.
> 
> I don't *believe* it will make a difference to groff's preconv, since
> all the other encodings people might be using will at worst only mangle
> UTF-8 characters in comments in ways that preserve their comment status.
> It might allow man-db to be marginally more efficient, but I think any
> difference would be negligible.

Thanks, Colin. Then, for now, I'm adding the "coding:" tag only to
pages with UTF-8 characters in the rendered text.

Cheers,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/