From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 94F5DC6FD1D for ; Fri, 7 Apr 2023 19:02:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232655AbjDGTCI (ORCPT ); Fri, 7 Apr 2023 15:02:08 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53638 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231262AbjDGTBu (ORCPT ); Fri, 7 Apr 2023 15:01:50 -0400 Received: from mail.cs.ucla.edu (mail.cs.ucla.edu [131.179.128.66]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7E1DFEF82 for ; Fri, 7 Apr 2023 12:00:26 -0700 (PDT) Received: from localhost (localhost [127.0.0.1]) by mail.cs.ucla.edu (Postfix) with ESMTP id 64E6E3C09FA01; Fri, 7 Apr 2023 12:00:17 -0700 (PDT) Received: from mail.cs.ucla.edu ([127.0.0.1]) by localhost (mail.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id zp66LFy4pHE7; Fri, 7 Apr 2023 12:00:17 -0700 (PDT) Received: from localhost (localhost [127.0.0.1]) by mail.cs.ucla.edu (Postfix) with ESMTP id 0E32C3C097AFC; Fri, 7 Apr 2023 12:00:17 -0700 (PDT) DKIM-Filter: OpenDKIM Filter v2.10.3 mail.cs.ucla.edu 0E32C3C097AFC DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cs.ucla.edu; s=9D0B346E-2AEB-11ED-9476-E14B719DCE6C; t=1680894017; bh=RLGYiM+qbMJT97xdVCzt/fP+O1nLWgInQnT7uysumQU=; h=Message-ID:Date:MIME-Version:To:From; b=ihF/bUcmkB0vPm8EweR+bhESphyJsVE2A7pZD1bxge1K/Cra6+NeQzVjYAe5GeL/j vaDyolqDxZnVUVXzM2Hd1PtptcnFudh/Qh35TBFhU6mxh2ydjnJn6Ld36fuKNdswI/ ni/1Q4NScfFlhwCsJ21c/CAIySv8QO0sz6my2gtds3PT88nO3+egndruMz8p8ihg7m EOKYEeW5pMXGxkO7EvhyGnTSQg1Yg5sx/0iruvSSIMTUyvWe40UweTpMAIe4NX2GQz M+Ph7M68u3adlMdxx6c4iHzdSaNW22mP8owoQSVfgfA4CCiR4QTAoMwOo44EKwSvHW nRXRIjBoM3wxg== X-Virus-Scanned: amavisd-new at mail.cs.ucla.edu Received: from mail.cs.ucla.edu ([127.0.0.1]) by localhost (mail.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id hpZkBCsrcJpZ; Fri, 7 Apr 2023 12:00:16 -0700 (PDT) Received: from [192.168.1.9] (cpe-172-91-119-151.socal.res.rr.com [172.91.119.151]) by mail.cs.ucla.edu (Postfix) with ESMTPSA id C1CF63C09FA02; Fri, 7 Apr 2023 12:00:16 -0700 (PDT) Message-ID: <065bcdcb-5770-5384-5afe-4a4d29272274@cs.ucla.edu> Date: Fri, 7 Apr 2023 12:00:16 -0700 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.9.0 Content-Language: en-US To: demerphq Cc: 60690@debbugs.gnu.org, mega lith01 , Carlo Arenas , =?UTF-8?B?w4Z2YXIgQXJuZmrDtnLDsCBCamFy?= =?UTF-8?Q?mason?= , =?UTF-8?Q?Tukusej=e2=80=99s_Sirs?= , git@vger.kernel.org, pcre2-dev@googlegroups.com, Junio C Hamano References: <2554712d-e386-3bab-bc6c-1f0e85d999db@cs.ucla.edu> <96358c4e-7200-e5a5-869e-5da9d0de3503@cs.ucla.edu> From: Paul Eggert Organization: UCLA Computer Science Department Subject: Re: bug#60690: -P '\d' in GNU and git grep In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org On 2023-04-06 06:39, demerphq wrote: > Unicode specifies that \d match any digit > in any script that it supports. "Specifies" is too strong. The Unicode Regular Expressions technical standard (UTS#18) mentions \d only in Annex C[1], next to the word "digit" in a column labeled "Property" (even though \d is really syntax not a property). This is at best an informal recommendation, not a requirement, as UTS#18 0.2[2] says that UTS#18's syntax is only for illustration and that although it's similar to Perl's, the two syntax forms may not be exactly the same. So we can't look to UTS#18 for a definitive way out of the \d mess, as the Unicode folks specifically delegated matters to us. Even ignoring the \d issue the digit situation is messy. UTS#18 Annex C says "\p{gc=Decimal_Number}" is the standard recommended syntax assignment for digits. However, PCRE2 does not support this syntax; it supports another variant \p{Nd} that UTS#18 also recommends. So it appears that PCRE2 already does not implement every recommended aspect of UTS#18 syntax. PCRE2 also doesn't match Perl, which does support "\p{gc=Decimal_Number}". Anyway, since grep -P '\p{Nd}' implements Unicode's decimal digit class, that's clearly enough for grep -P to conform to UTS#18 with respect to digits. > A) how do you tell the regular expression > engine what semantics you want and B) how does the regular expression > library identify the encoding in the file, and how does it handle > malformed content in that file. Here's how GNU grep does it: * RE semantics are specified via command-line options like -P. * Text encoding is specified by locale, e.g., LC_ALL='en_US.utf8'. * REs do not match encoding errors. > on *nix there is no tradition of using BOM's to > distinguish the 6 different possible encodings of Unicode (UTF-8, > UTF-EBCDIC, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE) Yes, GNU/Linux never really experienced the joys of UTF-EBCDIC, Oracle UTFE, UTF-16LE vs UTF-16BE etc. If you're running legacy IBM mainframe or MS-Windows code these legacy encodings are obviously a big deal. However, there seems little reason to force their nontrivial hassles onto every GNU/Linux program that processes text. A few specialized apps like 'iconv' deal with offbeat encodings, and that is probably a better approach all around. > there seems > to be some level of desire of matching with unicode semantics against > files that are not uniformly encoded in one of these formats. That is a use case, yes. It's what 'strings' and 'grep' do. [1]: https://unicode.org/reports/tr18/#Compatibility_Properties [2]: https://unicode.org/reports/tr18/#Conformance