From: lampahome
Date: Tue, 3 Mar 2020 18:13:56 +0800
Subject: Re: why do we need utf8 normalization when compare name?
To: Aleksa Sarai
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org

> Unicode normalisation will take the strings "ñ" (U+00F1) and "n◌̃"
> (U+006E U+0303) and turn them into the same Unicode string. Note that
> there are four kinds of Unicode normalisation (NFD, NFC, NFKD, NFKC), so
> what precise string you end up with depends on which form you're using.
> Linux uses NFD, I believe.
>
> And yes, once the strings are normalised and encoded as UTF-8 you then
> do a byte-by-byte comparison (if the comparison is case-insensitive then
> fs/unicode/... will case-fold the Unicode symbols during normalisation).
>

What confuses me is why the strings are encoded as UTF-8 after
normalisation has finished.

From the above, normalisation turns "ñ" (U+00F1) and "n◌̃" (U+006E U+0303)
into the same Unicode string. Then why can't we just compare the bytes of
the normalised strings directly?
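
For reference, the two steps described in the quoted text (normalise, then
compare the UTF-8 bytes) can be sketched in userspace. This is only an
illustration using Python's unicodedata module, not the kernel's
fs/unicode/ code; the helper name same_name and the choice of NFD as the
default form are assumptions made for the example.

```python
import unicodedata

precomposed = "\u00f1"   # U+00F1, a single code point
decomposed = "n\u0303"   # U+006E followed by U+0303 (combining tilde)


def same_name(a: str, b: str, form: str = "NFD") -> bool:
    """Hypothetical helper: normalise both names, encode as UTF-8, compare bytes."""
    a_bytes = unicodedata.normalize(form, a).encode("utf-8")
    b_bytes = unicodedata.normalize(form, b).encode("utf-8")
    return a_bytes == b_bytes


# Without normalisation the UTF-8 byte sequences differ.
raw_equal = precomposed.encode("utf-8") == decomposed.encode("utf-8")

# After normalisation both names produce the same bytes.
normalised_equal = same_name(precomposed, decomposed)

# The NFD form of U+00F1, encoded as UTF-8.
nfd_bytes = unicodedata.normalize("NFD", precomposed).encode("utf-8")

print(raw_equal, normalised_equal, nfd_bytes)
```

In this sketch the normalised result is a sequence of code points, and the
UTF-8 encoding is what turns it into bytes that can be compared one by one.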