* [PATCH] new Page: isalpha__3(3)
@ 2014-03-04 15:35 walter harms
       [not found] ` <5315F2B5.2040009-fPG8STNUNVg@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: walter harms @ 2014-03-04 15:35 UTC (permalink / raw)
  To: linux-man

Hi List,
The ctype macros like  isalpha(3) have a locale specific counterpart.
This page was missing.

re,
 wh


Signed-off-by: wharms-fPG8STNUNVg@public.gmane.org <wharms-fPG8STNUNVg@public.gmane.org>


.\" Copyright (c) 2013 by Walter Harms
.\"
.\" %%%LICENSE_START(VERBATIM)
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
.\" preserved on all copies.
.\"
.\" Permission is granted to copy and distribute modified versions of this
.\" manual under the conditions for verbatim copying, provided that the
.\" entire resulting derived work is distributed under the terms of a
.\" permission notice identical to this one.
.\"
.\" Since the Linux kernel and libraries are constantly changing, this
.\" manual page may be incorrect or out-of-date.  The author(s) assume no
.\" responsibility for errors or omissions, or for damages resulting from
.\" the use of the information contained herein.  The author(s) may not
.\" have taken the same level of care in the production of this manual,
.\" which is licensed free of charge, as they might when working
.\" professionally.
.\"
.\" Formatted or processed versions of this manual, if unaccompanied by
.\" the source, must acknowledge the copyright and authors of this work.
.\" %%%LICENSE_END
.\"
.\"
.TH ISALPHA_L 3 2013-09-20 "GNU" "Linux Programmer's Manual"
.SH NAME
isalnum_l, isalpha_l, isblank_l, iscntrl_l, isdigit_l, isgraph_l, islower_l,
isprint_l, ispunct_l, isspace_l, isupper_l, isxdigit_l \- character
classification routines
.SH SYNOPSIS
.nf
.B #include <ctype.h>
.sp
.BI "int isalnum_l(int " "c" ", locale_t " loc );
.br
.BI "int isalpha_l(int " "c" ", locale_t " loc );
.br
.BI "int isascii_l(int " "c" ", locale_t " loc );
.br
.BI "int isblank_l(int " "c" ", locale_t " loc );
.br
.BI "int iscntrl_l(int " "c" ", locale_t " loc );
.br
.BI "int isdigit_l(int " "c" ", locale_t " loc );
.br
.BI "int isgraph_l(int " "c" ", locale_t " loc );
.br
.BI "int islower_l(int " "c" ", locale_t " loc );
.br
.BI "int isprint_l(int " "c" ", locale_t " loc );
.br
.BI "int ispunct_l(int " "c" ", locale_t " loc );
.br
.BI "int isspace_l(int " "c" ", locale_t " loc );
.br
.BI "int isupper_l(int " "c" ", locale_t " loc );
.br
.BI "int isxdigit_l(int " "c" ", locale_t " loc );
.fi
.sp
.SH DESCRIPTION
These functions check whether
.IR c ,
which must have the value of an
.I unsigned char
or
.BR EOF ,
falls into a certain character class according to the current locale.
.TP
.BR isalnum_l ()
checks for an alphanumeric character; it is equivalent to
.BI "(isalpha(" c ") || isdigit(" c "))" \fR.
.TP
.BR isalpha_l ()
checks for an alphabetic character; in the standard \fB"C"\fP
locale, it is equivalent to
.BI "(isupper_l(" c ") || islower_l(" c "))" \fR.
In some locales, there may be additional characters for which
.BR isalpha ()
is true\(emletters which are neither upper case nor lower
case.
.TP
.BR isascii_l ()
checks whether \fIc\fP is a 7-bit
.I unsigned char
value that fits into
the ASCII character set.
.TP
.BR isblank_l ()
checks for a blank character; that is, a space or a tab.
.TP
.BR iscntrl_l ()
checks for a control character.
.TP
.BR isdigit_l ()
checks for a digit (0 through 9).
.TP
.BR isgraph_l ()
checks for any printable character except space.
.TP
.BR islower_l ()
checks for a lower-case character.
.TP
.BR isprint_l ()
checks for any printable character including space.
.TP
.BR ispunct_l ()
checks for any printable character which is not a space or an
alphanumeric character.
.TP
.BR isspace_l ()
checks for white-space characters.
In the
.B """C"""
and
.B """POSIX"""
locales, these are: space, form-feed
.RB ( \(aq\ef\(aq ),
newline
.RB ( \(aq\en\(aq ),
carriage return
.RB ( \(aq\er\(aq ),
horizontal tab
.RB ( \(aq\et\(aq ),
and vertical tab
.RB ( \(aq\ev\(aq ).
.TP
.BR isupper_l ()
checks for an uppercase letter.
.TP
.BR isxdigit_l ()
checks for hexadecimal digits, that is, one of
.br
.BR "0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F" .
.SH RETURN VALUE
The values returned are nonzero if the character
.I c
falls into the tested class, and a zero value
if not.
.SH CONFORMING TO
POSIX.1-2008 specifies all of these functions.
.SH NOTES
The details of what characters belong into which class depend on the current
locale.
.sp
from
.IR locale.h :
The concept of one static locale per category is not very well
thought out.  Many applications will need to process its data using
information from several different locales.  Another application is
the implementation of the internationalization handling in the
upcoming ISO C++ standard library.  To support this another set of
the functions using locale data exist which have an additional
argument.

For example,
.BR isupper ()
will not recognize an A-umlaut (\(:A) as an uppercase letter in the default
.B "C"
locale.
.SH EXAMPLE
The following example takes a locale abbreviation like "de_DE" as argument.
If no argument is supplied it will use "C". With "de_DE" the code will
identify the O-Umlaut correctly as alphanumeric character but not with "C".
In contrast the punctuation will perform as before.
.nf

#include <stdio.h>
#include <locale.h>

int main(int argc,char *argv[])
{
  char *str="c2p\(:O.,";
  int i;
  locale_t loc;
    if (argc > 1 )
      loc = newlocale (LC_ALL_MASK, argv[1], NULL);
    else
      loc = newlocale (LC_ALL_MASK, "C", NULL);

  for(i=0;str[i]!=0;i++) {
    if (isalnum_l(str[i],loc))
	printf("The character %c is alphanumeric.\\n",str[i]);
    if ( ispunct_l(str[i],loc) )
	printf ("The character %c is punctuation.\\n",str[i]);
  }
  return 0;
}

.fi
.SH SEE ALSO
.BR iswalnum (3),
.BR iswalpha (3),
.BR iswblank (3),
.BR iswcntrl (3),
.BR iswdigit (3),
.BR iswgraph (3),
.BR iswlower (3),
.BR iswprint (3),
.BR iswpunct (3),
.BR iswspace (3),
.BR iswupper (3),
.BR iswxdigit (3),
.BR setlocale (3),
.BR toascii (3),
.BR tolower (3),
.BR toupper (3),
.BR ascii (7),
.BR locale (7)
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH] new Page: isalpha__3(3)
       [not found] ` <5315F2B5.2040009-fPG8STNUNVg@public.gmane.org>
@ 2014-03-10 13:24   ` Michael Kerrisk (man-pages)
       [not found]     ` <531DBD1F.5090400-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
       [not found]     ` <CACKs7VCcOGugkbs-=Rmu0XiBtiWEvHU8oyCKqKtdLMw2AfDJMQ@mail.gmail.com>
  0 siblings, 2 replies; 5+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-03-10 13:24 UTC (permalink / raw)
  To: wharms-fPG8STNUNVg, linux-man
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Bruno Haible

[CC += Bruno; 

Bruno, if you have *two* moments to spare, maybe you could confirm that 
I am not saying stupid things about UTF-8, MBS, and WCS. If you have only
*one* moment to spare, I'd appreciate if you look at the program I provide
below as an example of the use of mbstowcs(); I plan to add it to the
mbstowcs() man page that you wrote, and I wonder if you might check
it to see that everything is in order. Thanks! To save time, just grep
for the instance of your name below...]


Hello Walter,

You've submitted a number of pages over the past months that I have
not found the energy to review. I respond to this submission, with 
the goal of explaining why, since it shows many of the problems that
I see in the other submissions (and, in several cases, problems that
I have commented on in your past submissions).

On 03/04/2014 04:35 PM, walter harms wrote:
> Hi List,
> The ctype macros like  isalpha(3) have a locale specific counterpart.
> This page was missing.
> 
> re,
>  wh
> 
> 
> Signed-off-by: wharms-fPG8STNUNVg@public.gmane.org <wharms-fPG8STNUNVg@public.gmane.org>
> 
> 
> .\" Copyright (c) 2013 by Walter Harms
> .\"
> .\" %%%LICENSE_START(VERBATIM)
> .\" Permission is granted to make and distribute verbatim copies of this
> .\" manual provided the copyright notice and this permission notice are
> .\" preserved on all copies.
> .\"
> .\" Permission is granted to copy and distribute modified versions of this
> .\" manual under the conditions for verbatim copying, provided that the
> .\" entire resulting derived work is distributed under the terms of a
> .\" permission notice identical to this one.
> .\"
> .\" Since the Linux kernel and libraries are constantly changing, this
> .\" manual page may be incorrect or out-of-date.  The author(s) assume no
> .\" responsibility for errors or omissions, or for damages resulting from
> .\" the use of the information contained herein.  The author(s) may not
> .\" have taken the same level of care in the production of this manual,
> .\" which is licensed free of charge, as they might when working
> .\" professionally.
> .\"
> .\" Formatted or processed versions of this manual, if unaccompanied by
> .\" the source, must acknowledge the copyright and authors of this work.
> .\" %%%LICENSE_END

Problem: much of the text below is a straight copy from isalpha.3,
written by Thomas Koenig. You can't do this without attribution. 
(Simplest would have been to retain Thomas's copyright line and 
license; I see that you did the latter.) To be clear, I do not
believe you've done this maliciously; rather, you've done it in 
ignorance of what the requirements of copyright law are. But, I 
feel sure I've mentioned this issue to you in the past...

Problem: notwithstanding the copyright issue, the duplication of text
is done without good reason. A more sensible approach in this case would 
be to perform some integration/reworking in the existing isalpha.3 page, 
or simply to defer to that page inside this page.

> .\"
> .\"
> .TH ISALPHA_L 3 2013-09-20 "GNU" "Linux Programmer's Manual"
> .SH NAME
> isalnum_l, isalpha_l, isblank_l, iscntrl_l, isdigit_l, isgraph_l, islower_l,
> isprint_l, ispunct_l, isspace_l, isupper_l, isxdigit_l \- character
> classification routines
> .SH SYNOPSIS
> .nf
> .B #include <ctype.h>
> .sp
> .BI "int isalnum_l(int " "c" ", locale_t " loc );
> .br
> .BI "int isalpha_l(int " "c" ", locale_t " loc );
> .br
> .BI "int isascii_l(int " "c" ", locale_t " loc );
> .br
> .BI "int isblank_l(int " "c" ", locale_t " loc );
> .br
> .BI "int iscntrl_l(int " "c" ", locale_t " loc );
> .br
> .BI "int isdigit_l(int " "c" ", locale_t " loc );
> .br
> .BI "int isgraph_l(int " "c" ", locale_t " loc );
> .br
> .BI "int islower_l(int " "c" ", locale_t " loc );
> .br
> .BI "int isprint_l(int " "c" ", locale_t " loc );
> .br
> .BI "int ispunct_l(int " "c" ", locale_t " loc );
> .br
> .BI "int isspace_l(int " "c" ", locale_t " loc );
> .br
> .BI "int isupper_l(int " "c" ", locale_t " loc );
> .br
> .BI "int isxdigit_l(int " "c" ", locale_t " loc );
> .fi
> .sp

Problem: no description of the feature test macro requirements for these 
pages. See man-pages(7).

Problem: this patch includes no "link" pages (containing just 
".so man3/isalpha_l.3") for the dozen or so other functions documented on 
this page.

> .SH DESCRIPTION
> These functions check whether
> .IR c ,
> which must have the value of an
> .I unsigned char
> or
> .BR EOF ,
> falls into a certain character class according to the current locale.

The above sentence applies for the functions documented in isalpha.3,
but is meaningless for the functions listed in the SYNOPSIS of this page,
where the conversion is dependent on the locale 'loc', not the current 
locale.

Problem: all of the rest of the DESCRIPTION contains nothing new that isn't
already in isalpha.3.

> .TP
> .BR isalnum_l ()
> checks for an alphanumeric character; it is equivalent to
> .BI "(isalpha(" c ") || isdigit(" c "))" \fR.
> .TP
> .BR isalpha_l ()
> checks for an alphabetic character; in the standard \fB"C"\fP
> locale, it is equivalent to
> .BI "(isupper_l(" c ") || islower_l(" c "))" \fR.
> In some locales, there may be additional characters for which
> .BR isalpha ()
> is true\(emletters which are neither upper case nor lower
> case.
> .TP
> .BR isascii_l ()
> checks whether \fIc\fP is a 7-bit
> .I unsigned char
> value that fits into
> the ASCII character set.
> .TP
> .BR isblank_l ()
> checks for a blank character; that is, a space or a tab.
> .TP
> .BR iscntrl_l ()
> checks for a control character.
> .TP
> .BR isdigit_l ()
> checks for a digit (0 through 9).
> .TP
> .BR isgraph_l ()
> checks for any printable character except space.
> .TP
> .BR islower_l ()
> checks for a lower-case character.
> .TP
> .BR isprint_l ()
> checks for any printable character including space.
> .TP
> .BR ispunct_l ()
> checks for any printable character which is not a space or an
> alphanumeric character.
> .TP
> .BR isspace_l ()
> checks for white-space characters.
> In the
> .B """C"""
> and
> .B """POSIX"""
> locales, these are: space, form-feed
> .RB ( \(aq\ef\(aq ),
> newline
> .RB ( \(aq\en\(aq ),
> carriage return
> .RB ( \(aq\er\(aq ),
> horizontal tab
> .RB ( \(aq\et\(aq ),
> and vertical tab
> .RB ( \(aq\ev\(aq ).
> .TP
> .BR isupper_l ()
> checks for an uppercase letter.
> .TP
> .BR isxdigit_l ()
> checks for hexadecimal digits, that is, one of
> .br
> .BR "0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F" .

Problem: there is no mention of the requirements on 'loc'. (It must be 
a valid locale handle and must not be LC_GLOBAL_LOCALE.)
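
As a minimal sketch of that requirement: a caller that wants the rules of
the global locale cannot pass LC_GLOBAL_LOCALE directly, but it can pass a
copy obtained with duplocale(3). The helper name count_alpha_global() is
hypothetical and exists only for this illustration.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <ctype.h>
#include <locale.h>

/* Hypothetical helper: count the alphabetic bytes in s according to
   a private copy of the global locale.  Returns -1 if no copy could
   be created. */
int
count_alpha_global(const char *s)
{
    locale_t loc;
    int count = 0;

    assert(s != NULL);

    /* LC_GLOBAL_LOCALE itself must not be passed to isalpha_l();
       duplocale() turns it into an ordinary locale object. */
    loc = duplocale(LC_GLOBAL_LOCALE);
    if (loc == (locale_t) 0)
        return -1;

    for (; *s != '\0'; s++) {
        /* The cast keeps the argument in the unsigned char range */
        if (isalpha_l((unsigned char) *s, loc))
            count++;
    }

    freelocale(loc);
    return count;
}
```

In a program that has not called setlocale(3), the global locale is "C",
so only the ASCII letters in the string are counted.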

> .SH RETURN VALUE
> The values returned are nonzero if the character
> .I c
> falls into the tested class, and a zero value
> if not.

(Minor) problem: no VERSIONS section.

> .SH CONFORMING TO
> POSIX.1-2008 specifies all of these functions.
> .SH NOTES
> The details of what characters belong into which class depend on the current
> locale.
> .sp
> from
> .IR locale.h :
> The concept of one static locale per category is not very well
> thought out.  Many applications will need to process its data using
> information from several different locales.  Another application is
> the implementation of the internationalization handling in the
> upcoming ISO C++ standard library.  To support this another set of
> the functions using locale data exist which have an additional
> argument.

Simply quoting this text from locale.h without explanation does not
really add much to the description. The point is that the *_l
functions are designed to address the limitation that the traditional
locale APIs do not mix well with multi-threaded applications
and with applications that must deal with multiple locales.
A general statement to that effect needs to appear somewhere, though
probably not on this page. (I'll add something to locale(7).)
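
To make the point concrete, here is a small sketch of one program consulting
two locales side by side without ever calling setlocale(3). The helper name
isupper_in_two_locales() is hypothetical, and the locale names to pass
depend on what is installed on the system.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <ctype.h>
#include <locale.h>

/* Hypothetical helper: report whether byte is an uppercase letter in
   each of two named locales.  Returns 0 on success, or -1 if either
   locale object could not be created. */
int
isupper_in_two_locales(unsigned char byte,
                       const char *name_a, const char *name_b,
                       int *upper_a, int *upper_b)
{
    locale_t a, b;

    assert(name_a != NULL && name_b != NULL);
    assert(upper_a != NULL && upper_b != NULL);

    a = newlocale(LC_ALL_MASK, name_a, (locale_t) 0);
    if (a == (locale_t) 0)
        return -1;

    b = newlocale(LC_ALL_MASK, name_b, (locale_t) 0);
    if (b == (locale_t) 0) {
        freelocale(a);
        return -1;
    }

    /* Neither call changes the global locale, so other threads
       are unaffected. */
    *upper_a = isupper_l(byte, a) != 0;
    *upper_b = isupper_l(byte, b) != 0;

    freelocale(a);
    freelocale(b);
    return 0;
}
```

Because each locale object is private to the caller, the same approach
works from several threads at once, which is not true of switching the
global locale back and forth with setlocale(3).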

> For example,
> .BR isupper ()
> will not recognize an A-umlaut (\(:A) as an uppercase letter in the default
> .B "C"
> locale.

Problem: The sentence above relates to a function not even in the 
SYNOPSIS of this page.

> .SH EXAMPLE
> The following example takes a locale abbreviation like "de_DE" as argument.
> If no argument is supplied it will use "C". With "de_DE" the code will
> identify the O-Umlaut correctly as alphanumeric character but not with "C".
> In contrast the punctuation will perform as before.

> .nf
> 
> #include <stdio.h>
> #include <locale.h>
> 
> int main(int argc,char *argv[])
> {
>   char *str="c2p\(:O.,";
>   int i;
>   locale_t loc;
>     if (argc > 1 )
>       loc = newlocale (LC_ALL_MASK, argv[1], NULL);
>     else
>       loc = newlocale (LC_ALL_MASK, "C", NULL);
> 
>   for(i=0;str[i]!=0;i++) {
>     if (isalnum_l(str[i],loc))
> 	printf("The character %c is alphanumeric.\\n",str[i]);
>     if ( ispunct_l(str[i],loc) )
> 	printf ("The character %c is punctuation.\\n",str[i]);
>   }
>   return 0;
> }
> 
> .fi

The example program above violates multiple guidelines from man-pages(7), 
including, for example:

* 4-space indent levels
* spacing around operators and parentheses does not follow K&R norms
  ("indent -kr" fixes most of this) or the norms demonstrated in numerous
  existing pages.
* The program does not compile without warnings when using "cc -Wall"
  (in particular, _XOPEN_SOURCE needs to be defined as 700).
* Various header files that should be included, are not.
* The program does not do error checking of function calls.

But, more to the point, the program appears to be broken, if you are
operating in a UTF-8 locale, which I assume you are. I suppose 
the program does work if you are operating in an iso-8859-1 locale
(though that seems an unlikely set-up these days), but that
point would need some careful explanation in the man page, or some 
clarification in a shell session log that shows a run of the program,
otherwise the program would cause much confusion for people on UTF-8 
systems.

Functions such as isalnum_l() can't be applied to UTF-8 characters.
POSIX seems clear:

       The  c argument is an int, the value of which the application
       shall ensure is representable as an unsigned char or equal to
       the  value  of  the  macro EOF. If the argument has any other
       value, the behavior is undefined.
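
In practice, that rule means a plain char must be converted before it is
passed to these functions. The following is a sketch only; ctype_arg() and
isalnum_char() are hypothetical names used for illustration.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <ctype.h>
#include <locale.h>

/* Hypothetical helper: map a plain char to a value that is valid as
   the first argument of the is*_l() functions. */
int
ctype_arg(char ch)
{
    /* Where char is signed, a byte such as 0xc3 is negative;
       the cast brings it into the range 0..UCHAR_MAX. */
    return (unsigned char) ch;
}

/* Hypothetical wrapper around isalnum_l() for plain char input.
   Returns 1 if ch is alphanumeric in loc, otherwise 0. */
int
isalnum_char(char ch, locale_t loc)
{
    assert(loc != (locale_t) 0 && loc != LC_GLOBAL_LOCALE);

    return isalnum_l(ctype_arg(ch), loc) != 0;
}
```

The cast makes the call well defined, but it does not make a single byte of
a multibyte sequence meaningful as a character.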

(It would have been useful to see sample output from your program as 
part of the man page, as a help to the reader, but also as a check of
what you believe is happening, and what locale you are working with.) 

Instead, a conversion to wide characters is needed (mbstowcs(3)), and
then the use of the isw*_l() functions. See my example, further down.
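
A compact sketch of that approach with a locale object follows. It assumes
that uselocale(3) is an acceptable way to make mbstowcs() follow 'loc',
since mbstowcs() has no locale_t parameter; the helper name count_walpha()
is hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count the alphabetic characters (not bytes)
   in the multibyte string mbs, interpreted according to loc.
   Returns -1 on conversion or allocation failure. */
long
count_walpha(const char *mbs, locale_t loc)
{
    locale_t old;
    size_t n, i;
    wchar_t *wcs;
    long count = 0;

    assert(mbs != NULL && loc != (locale_t) 0);

    /* mbstowcs() follows the calling thread's current locale,
       so install loc for this thread during the conversion. */
    old = uselocale(loc);

    n = mbstowcs(NULL, mbs, 0);
    wcs = (n == (size_t) -1) ? NULL : calloc(n + 1, sizeof(wchar_t));
    if (wcs == NULL || mbstowcs(wcs, mbs, n + 1) == (size_t) -1) {
        free(wcs);
        uselocale(old);
        return -1;
    }

    uselocale(old);             /* Restore the previous locale */

    for (i = 0; i < n; i++) {
        if (iswalpha_l((wint_t) wcs[i], loc))
            count++;
    }

    free(wcs);
    return count;
}
```

With a UTF-8 locale object, a two-byte sequence such as O-umlaut is
converted to one wide character before it is classified.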

A modified version of your program illustrates the problem:

#define _XOPEN_SOURCE 700
#include <locale.h>
#include <string.h>
#include <stdio.h>
#include <ctype.h>

int main(int argc, char *argv[])
{
    char *str = "c2pÖ.,";
    int i;
    locale_t loc;

    printf("string length = %ld\n", (long) strlen(str));
    if (argc > 1)
        loc = newlocale(LC_ALL_MASK, argv[1], NULL);
    else
        loc = newlocale(LC_ALL_MASK, "C", NULL);

    for (i = 0; str[i] != 0; i++) {
        printf("%d: %x %c: ", i, str[i] & 0xff, str[i] & 0xff);

        printf("%salphanumeric ", isalnum_l(str[i], loc) ? "" : "!");
        printf("%spunctuation ", ispunct_l(str[i], loc) ? "" : "!");
        printf("\n");
    }
    return 0;
}

See what happens when we run this:

$ ./a.out de_DE
string length = 7
0: 63 c: alphanumeric !punctuation 
1: 32 2: alphanumeric !punctuation 
2: 70 p: alphanumeric !punctuation 
3: c3 �: alphanumeric !punctuation   <====
4: 96 �: !alphanumeric !punctuation  <====
5: 2e .: !alphanumeric punctuation 
6: 2c ,: !alphanumeric punctuation 

Assuming you are operating in a UTF-8 system, the fourth glyph
in your string (O-umlaut) is actually treated as two bytes, 
not one character by the program, but the construction of your 
program hides this, because it does not print information about
all of the bytes!
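
The byte/character mismatch can be checked without any locale machinery.
The sketch below assumes that the input is valid UTF-8, and the helper name
utf8_char_count() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the UTF-8 encoded characters in s by
   skipping continuation bytes (bit pattern 10xxxxxx).  Assumes that
   s is valid UTF-8. */
size_t
utf8_char_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);

    for (; *s != '\0'; s++) {
        /* Count only bytes that start a character */
        if (((unsigned char) *s & 0xc0) != 0x80)
            count++;
    }
    return count;
}
```

For the string used above, written with explicit bytes as
"c2p\xc3\x96.,", strlen() reports 7 while this helper reports 6.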

Now, I am no expert (and so I may yet end up embarrassed), 
but it appears to me that you have misunderstood some of the 
fundamentals of how multibyte characters, UTF-8, and wide character 
strings work, and you didn't gain that understanding while writing 
the page. (Good reading: 
http://www.joelonsoftware.com/articles/Unicode.html 
http://www.cprogramming.com/tutorial/unicode.html 
And I admit I learned a whole lot as I pulled your program apart.)

> .SH SEE ALSO

Repeating the SEE ALSO from isalpha.3 isn't really correct (or useful) 
here.

> .BR iswalnum (3),
> .BR iswalpha (3),
> .BR iswblank (3),
> .BR iswcntrl (3),
> .BR iswdigit (3),
> .BR iswgraph (3),
> .BR iswlower (3),
> .BR iswprint (3),
> .BR iswpunct (3),
> .BR iswspace (3),
> .BR iswupper (3),
> .BR iswxdigit (3),
> .BR setlocale (3),
> .BR toascii (3),
> .BR tolower (3),
> .BR toupper (3),
> .BR ascii (7),
> .BR locale (7)

(Minor) problem: probably, other pages (at least, locale(7)) should 
add SEE ALSO references to this page. That change is not in this patch.

==

Now, to be clear: many page submissions that I receive fail on some
of the points mentioned above, but this page fails on multiple counts.

In summary... Walter, you are often good at finding things that need to 
be documented, and I know your work is well intended. However, the pages
you submit often require so much review/repair effort (in some cases, 
initial drafts of pages appear not to even have been run through a spell
checker, though this page seems okay), that it is often faster to write
the pages myself. And some of the same problems that I comment on in
earlier submissions turn up again in new submissions. Thus, it is hard 
for me to find the enthusiasm to review these pages myself (and I have
in any case very limited bandwidth) and help them get repaired (and 
it's rare that others step in, though I noticed that Stefan Puiu did 
take a shot on one of your submissions). I do not know what the solution 
is here, but this mail explains the problems from my side, and why 
I'm often unresponsive / slow to respond to your submissions (and 
am likely to remain so, unless something changes).

And here's how I think the kind of thing you intended to do in your example 
program actually needs to be done. (Bruno, perhaps you can confirm that this 
code is okay, as I plan to place this example in the mbstowcs(3) man page.)

---8x------8x------8x------8x------8x------8x------8x------8x------8x---

#include <locale.h>
#include <wchar.h>
#include <wctype.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int
main(int argc, char *argv[])
{
    size_t mbslen;	/* Number of multibyte characters in source */
    wchar_t *wcs;	/* Pointer to converted wide character string */
    wchar_t *wp;

    if (argc < 3) {
        fprintf(stderr, "Usage: %s <locale> <string>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    /* Apply the specified locale */

    if (setlocale(LC_ALL, argv[1]) == NULL) {
        perror("setlocale");
        exit(EXIT_FAILURE);
    }

    /* Calculate the length required to hold argv[2] converted to
       a wide character string */
       
    mbslen = mbstowcs(NULL, argv[2], 0);
    if (mbslen == (size_t) -1) {
        perror("mbstowcs");
        exit(EXIT_FAILURE);
    }

    /* Describe the source string to the user */

    printf("Length of source string (excluding terminator):\n");
    printf("    %ld bytes\n", (long) strlen(argv[2]));
    printf("    %ld multibyte characters\n\n", (long) mbslen);

    /* Allocate wide character string of the desired size.  Add 1
       to allow for terminating null wide character (L'\0'). */

    wcs = calloc(mbslen + 1, sizeof(wchar_t));
    if (wcs == NULL) {
        perror("calloc");
        exit(EXIT_FAILURE);
    }
    
    /* Convert the multibyte character string in argv[2] to a 
       wide character string */

    if (mbstowcs(wcs, argv[2], mbslen + 1) == (size_t) -1) {
        perror("mbstowcs");
        exit(EXIT_FAILURE);
    }

    printf("Wide character string is: %ls (%ld characters)\n",
            wcs, (long) mbslen);

    /* Now do some inspection of the classes of the characters in
       the wide character string */
    
    for (wp = wcs; *wp != 0; wp++) {
        printf("    %lc ", (wint_t) *wp);

        if (!iswalpha(*wp))
            printf("!");
        printf("alpha ");

        if (iswalpha(*wp)) {
            if (iswupper(*wp))
                printf("upper ");

            if (iswlower(*wp))
                printf("lower ");
        }

        putchar('\n');
    }

    exit(EXIT_SUCCESS);
}

---8x------8x------8x------8x------8x------8x------8x------8x------8x---

And here's an example of what we see when running the program:

$ ./a.out de_DE.UTF-8 "Grüße!"
Length of source string (excluding terminator):
    8 bytes
    6 multibyte characters

Wide character string is: Grüße! (6 characters)
    G alpha upper 
    r alpha lower 
    ü alpha lower 
    ß alpha lower 
    e alpha lower 
    ! !alpha

With kind regards,

Michael

PS I'm working on adding the *_l functions to the isalpha.3 page, and
will send a draft to the list.

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

* Re: [PATCH] new Page: isalpha__3(3)
       [not found]     ` <531DBD1F.5090400-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2014-03-12 18:34       ` walter harms
  0 siblings, 0 replies; 5+ messages in thread
From: walter harms @ 2014-03-12 18:34 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages); +Cc: linux-man



Am 10.03.2014 14:24, schrieb Michael Kerrisk (man-pages):

> 
> Hello Walter,
> 
> You've submitted a number of pages over the past months that I have
> not found the energy to review. I respond to this submission, with 
> the goal of explaining why, since it shows many of the problems that
> I see in the other submissions (and, in several cases, problems that
> I have commented on in your past submissions).

My major problem was i got no reply at all and was wondering if the page
arrived in the ml at all.
I shortened the mail a bit.
Since you made a new page, most stuff is now obsolete but i would like
to explain a few things i did.



>> .SH CONFORMING TO
>> POSIX.1-2008 specifies all of these functions.
>> .SH NOTES
>> The details of what characters belong into which class depend on the current
>> locale.
>> .sp
>> from
>> .IR locale.h :
>> The concept of one static locale per category is not very well
>> thought out.  Many applications will need to process its data using
>> information from several different locales.  Another application is
>> the implementation of the internationalization handling in the
>> upcoming ISO C++ standard library.  To support this another set of
>> the functions using locale data exist which have an additional
>> argument.
> 
> Simply quoting this text from locale.h without explanation does not
> really add much to the description. The point is that the *_l
> pages are designed to address the limitation that the traditional
> locale APIs do not mix well with multi-threaded applications
> and with applications that must deal with multiple locales.
> A general statement to that effect needs to appear somewhere, though
> probably not on this page. (I'll add something to locale(7).)

For me that was very clear at that time since i had exactly that
problem: i needed two locales internally for parsing, no need for threads.
For me that illustrated the intention of the original author, therefore
it went into notes.

> 
>> For example,
>> .BR isupper ()
>> will not recognize an A-umlaut (\(:A) as an uppercase letter in the default
>> .B "C"
>> locale.
> 
> Problem: The sentence above relates to a function not even in the 
> SYNOPSIS of this page.
> 

yes, bad example; the basic idea was to describe that some functions
depend on a certain locale.


> But, more to the point, the program appears to be broken, if you are
> operating in a UTF-8 locale, which I assume you are. I suppose 
> the program does work if you are operating in an iso-8859-1 locale
> (though that seems an unlikely set-up these days), but that
> point would need some careful explanation in the man page, or some 
> clarification in a shell session log that shows a run of the program,
> otherwise the program would cause much confusion for people on UTF-8 
> systems.

I use 8859-1 and i guess that is the reason i did not see any problems.
But i am really wondering why i missed the -Wall point.

> 
> Functions such as isalnum_l() can't be applied to UTF-8 characters.
> POSIX seems clear:
> 
>        The  c argument is an int, the value of which the application
>        shall ensure is representable as an unsigned char or equal to
>        the  value  of  the  macro EOF. If the argument has any other
>        value, the behavior is undefined.
> 
> (It would have been useful to see sample output from your program as 
> part of the man page, as a help to the reader, but also as a check of
> what you believe is happening, and what locale you are working with.) 
> 
> Instead, a conversion to wide characters is needed (mbstowcs(3)), and
> then the use of the isw*_l() functions. See my example, further down.

I found isw*_l() also while figuring out what to do with isalpha_l and friends.


> ==
> 
> Now, to be clear: many page submissions that I receive fail on some
> of the points mentioned above, but this page fails on multiple counts.
> 
> In summary... Walter, you are often good at finding things that need to 
> be documented, and I know your work is well intended. However, the pages
> you submit often require so much review/repair effort (in some cases, 
> initial drafts of pages appear not to even have been run through a spell
> checker, though this page seems okay), that it is often faster to write
> the pages myself. And some of the same problems that I comment on in
> earlier submissions turn up again in new submissions. Thus, it is hard 
> for me to find the enthusiasm to review these pages myself (and I have
> in any case very limited bandwidth) and help them get repaired (and 
> it's rare that others step in, though I noticed that Stefan Puiu did 
> take a shot on one of your submissions). I do not know what the solution 
> is here, but this mail explains the problems from my side, and why 
> I'm often unresponsive / slow to respond to your submissions (and 
> am likely to remain so, unless something changes).

I admit that i underestimated the complexity of this locale stuff.
I can only say i tried to document what was not documented and add an
example basically showing how i came to that conclusion and how it works.

> same problems that I comment in earlier submissions

Sorry about that; i always try to get better. The weird thing is
i can not find anything in my mail archive. I found that i sent some pages
years ago, but what i can not find is any reply, with one exception:
a typo in sem_wait.3.
Seriously, i can not remember what happened; did i ever send a reply?

And important, i guess the ISWALPHA_L page has the same defects as it is a direct
derivative of ISALPHA_L.

NTL it seems there are a lot more _l functions, i did not check if there
is a page but you may like to add some of them to undocumented(3):
see: http://www.freebsd.org/cgi/man.cgi?query=xlocale&sektion=3

sorry for the trouble,
wh



* Re: [PATCH] new Page: isalpha__3(3)
       [not found]       ` <CACKs7VCcOGugkbs-=Rmu0XiBtiWEvHU8oyCKqKtdLMw2AfDJMQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-03-14 11:05         ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 5+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-03-14 11:05 UTC (permalink / raw)
  To: Stefan Puiu; +Cc: linux-man, Bruno Haible

[CC restored, since I think this is a point that others may have comments on]

Hi Stefan,

On Thu, Mar 13, 2014 at 1:23 PM, Stefan Puiu <stefan.puiu-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Hi Michael,
>
> Small nit about your wcstombs example below:
>
> On Mon, Mar 10, 2014 at 3:24 PM, Michael Kerrisk (man-pages)
> <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> [...]
>> ---8x------8x------8x------8x------8x------8x------8x------8x------8x---
> [...]
>>     printf("Length of source string (excluding terminator):\n");
>>     printf("    %ld bytes\n", (long) strlen(argv[2]));
>>     printf("    %ld multibyte characters\n\n", (long) mbslen);
>
> Why not use %zu for mbslen and strlen(argv[2])? Both are size_t as far
> as I can tell. Then you wouldn't need the cast.

The 'z' specifier is a C99 invention as I recall, and it took a few
years before it became widespread. For example, it wasn't on Sun's
libc for Solaris 8, and reading some man pages suggests that it wasn't
there on Solaris 9 (released in 2002), though it was there by Solaris
10 (2005). Likewise, FreeBSD seems to have had it since version 5 (2003).

So although glibc has had 'z' for a long time (at least as far back as
glibc 2.1 at the start of 1999, though I suspect a little earlier as
well), even a few years ago there were many installed non-Linux
systems that didn't support it.

So, for portable code, I've tended to almost reflexively use
%ld+(long). Probably, the need to do that is less pressing now. There
are many fewer of those installed legacy systems these days. I'll
change that code for the mbstowcs() page to use %zu.

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] new Page: isalpha__3(3)
@ 2014-03-04  8:37 walter harms
  0 siblings, 0 replies; 5+ messages in thread
From: walter harms @ 2014-03-04  8:37 UTC (permalink / raw)
  To: LKML

Hi List,
The ctype macros like isalpha(3) have locale-specific counterparts.
A page for them was missing.

re,
 wh


Signed-off-by: wharms@bfs.de <wharms@bfs.de>


.\" Copyright (c) 2013 by Walter Harms
.\"
.\" %%%LICENSE_START(VERBATIM)
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
.\" preserved on all copies.
.\"
.\" Permission is granted to copy and distribute modified versions of this
.\" manual under the conditions for verbatim copying, provided that the
.\" entire resulting derived work is distributed under the terms of a
.\" permission notice identical to this one.
.\"
.\" Since the Linux kernel and libraries are constantly changing, this
.\" manual page may be incorrect or out-of-date.  The author(s) assume no
.\" responsibility for errors or omissions, or for damages resulting from
.\" the use of the information contained herein.  The author(s) may not
.\" have taken the same level of care in the production of this manual,
.\" which is licensed free of charge, as they might when working
.\" professionally.
.\"
.\" Formatted or processed versions of this manual, if unaccompanied by
.\" the source, must acknowledge the copyright and authors of this work.
.\" %%%LICENSE_END
.\"
.\"
.TH ISALPHA_L 3 2013-09-20 "GNU" "Linux Programmer's Manual"
.SH NAME
isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l, isgraph_l,
islower_l, isprint_l, ispunct_l, isspace_l, isupper_l, isxdigit_l \- character
classification routines
.SH SYNOPSIS
.nf
.B #include <ctype.h>
.sp
.BI "int isalnum_l(int " "c" ", locale_t " loc );
.br
.BI "int isalpha_l(int " "c" ", locale_t " loc );
.br
.BI "int isascii_l(int " "c" ", locale_t " loc );
.br
.BI "int isblank_l(int " "c" ", locale_t " loc );
.br
.BI "int iscntrl_l(int " "c" ", locale_t " loc );
.br
.BI "int isdigit_l(int " "c" ", locale_t " loc );
.br
.BI "int isgraph_l(int " "c" ", locale_t " loc );
.br
.BI "int islower_l(int " "c" ", locale_t " loc );
.br
.BI "int isprint_l(int " "c" ", locale_t " loc );
.br
.BI "int ispunct_l(int " "c" ", locale_t " loc );
.br
.BI "int isspace_l(int " "c" ", locale_t " loc );
.br
.BI "int isupper_l(int " "c" ", locale_t " loc );
.br
.BI "int isxdigit_l(int " "c" ", locale_t " loc );
.fi
.sp
.SH DESCRIPTION
These functions check whether
.IR c ,
which must have the value of an
.I unsigned char
or
.BR EOF ,
falls into a certain character class according to the locale specified by
.IR loc .
.TP
.BR isalnum_l ()
checks for an alphanumeric character; it is equivalent to
.BI "(isalpha_l(" c ", " loc ") || isdigit_l(" c ", " loc "))" \fR.
.TP
.BR isalpha_l ()
checks for an alphabetic character; in the standard \fB"C"\fP
locale, it is equivalent to
.BI "(isupper_l(" c ", " loc ") || islower_l(" c ", " loc "))" \fR.
In some locales, there may be additional characters for which
.BR isalpha_l ()
is true\(emletters which are neither upper case nor lower
case.
.TP
.BR isascii_l ()
checks whether \fIc\fP is a 7-bit
.I unsigned char
value that fits into
the ASCII character set.
.TP
.BR isblank_l ()
checks for a blank character; that is, a space or a tab.
.TP
.BR iscntrl_l ()
checks for a control character.
.TP
.BR isdigit_l ()
checks for a digit (0 through 9).
.TP
.BR isgraph_l ()
checks for any printable character except space.
.TP
.BR islower_l ()
checks for a lowercase character.
.TP
.BR isprint_l ()
checks for any printable character including space.
.TP
.BR ispunct_l ()
checks for any printable character which is not a space or an
alphanumeric character.
.TP
.BR isspace_l ()
checks for white-space characters.
In the
.B """C"""
and
.B """POSIX"""
locales, these are: space, form-feed
.RB ( \(aq\ef\(aq ),
newline
.RB ( \(aq\en\(aq ),
carriage return
.RB ( \(aq\er\(aq ),
horizontal tab
.RB ( \(aq\et\(aq ),
and vertical tab
.RB ( \(aq\ev\(aq ).
.TP
.BR isupper_l ()
checks for an uppercase letter.
.TP
.BR isxdigit_l ()
checks for hexadecimal digits, that is, one of
.br
.BR "0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F" .
.SH RETURN VALUE
The values returned are nonzero if the character
.I c
falls into the tested class, and a zero value
if not.
.SH CONFORMING TO
POSIX.1-2008 specifies all of these functions except
.BR isascii_l (),
which is a GNU extension.
.SH NOTES
The details of what characters belong to which class depend on the locale
object
.IR loc ,
not on the current locale.
.sp
From
.IR locale.h :
The concept of one static locale per category is not very well
thought out.  Many applications will need to process its data using
information from several different locales.  Another application is
the implementation of the internationalization handling in the
upcoming ISO C++ standard library.  To support this another set of
the functions using locale data exist which have an additional
argument.
.PP
For example,
.BR isupper ()
will not recognize an A-umlaut (\(:A) as an uppercase letter in the default
.B """C"""
locale.
.SH EXAMPLE
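The following example takes a locale name such as "de_DE" as its argument.
If no argument is supplied, it uses "C".
With "de_DE", the code correctly identifies the O-umlaut as an alphanumeric
character, but with "C" it does not.
This assumes that the locale uses a single-byte character set such as
ISO-8859-1, since the functions classify one byte at a time.
In contrast, the punctuation characters are classified the same way in
both locales.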
.nf

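#define _XOPEN_SOURCE 700
#include <ctype.h>
#include <locale.h>
#include <stdio.h>

int
main(int argc, char *argv[])
{
    /* \(:O occupies one byte only in an 8-bit character set */
    const char *str = "c2p\(:O.,";
    locale_t loc;
    int i;

    /* Create a locale object; the global locale is left unchanged */
    loc = newlocale(LC_ALL_MASK, (argc > 1) ? argv[1] : "C", (locale_t) 0);
    if (loc == (locale_t) 0) {
        perror("newlocale");
        return 1;
    }

    for (i = 0; str[i] != 0; i++) {
        /* The argument must be representable as unsigned char */
        unsigned char c = (unsigned char) str[i];

        if (isalnum_l(c, loc))
            printf("The character %c is alphanumeric.\\n", c);
        if (ispunct_l(c, loc))
            printf("The character %c is punctuation.\\n", c);
    }

    freelocale(loc);
    return 0;
}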

.fi
.SH SEE ALSO
.BR iswalnum (3),
.BR iswalpha (3),
.BR iswblank (3),
.BR iswcntrl (3),
.BR iswdigit (3),
.BR iswgraph (3),
.BR iswlower (3),
.BR iswprint (3),
.BR iswpunct (3),
.BR iswspace (3),
.BR iswupper (3),
.BR iswxdigit (3),
.BR setlocale (3),
.BR toascii (3),
.BR tolower (3),
.BR toupper (3),
.BR ascii (7),
.BR locale (7)

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2014-03-14 11:05 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-04 15:35 [PATCH] new Page: isalpha__3(3) walter harms
     [not found] ` <5315F2B5.2040009-fPG8STNUNVg@public.gmane.org>
2014-03-10 13:24   ` Michael Kerrisk (man-pages)
     [not found]     ` <531DBD1F.5090400-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2014-03-12 18:34       ` walter harms
     [not found]     ` <CACKs7VCcOGugkbs-=Rmu0XiBtiWEvHU8oyCKqKtdLMw2AfDJMQ@mail.gmail.com>
     [not found]       ` <CACKs7VCcOGugkbs-=Rmu0XiBtiWEvHU8oyCKqKtdLMw2AfDJMQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-03-14 11:05         ` Michael Kerrisk (man-pages)
  -- strict thread matches above, loose matches on Subject: below --
2014-03-04  8:37 walter harms
