From: Ard Biesheuvel
Date: Sat, 24 Nov 2018 10:56:41 +0100
Subject: Re: Re: [PATCH] arm64: crc: accelerated-crc32-by-64bytes
To: sunrui26@huawei.com
Cc: Catalin Marinas, Will Deacon, linux-arm-kernel, Linux Kernel Mailing List
In-Reply-To: <0DF8AA8C9CD0FB4ABF4E83674055F7E2D82974@dggemi508-mbx.china.huawei.com>

On Sat, 24 Nov 2018 at 07:42, sunrui wrote:
>
> > On Thu, 22 Nov 2018 at 02:50, sunrui wrote:
> > >
> > > > On Sun, 18 Nov 2018 at 23:30, Rui Sun wrote:
> > > > >
> > > > > add 64 bytes loop to acceleration calculation
> > > > >
> > > >
> > > > Can you share some performance numbers please?
> > > >
> > > > Also, we don't need 64 byte, 32 byte and 16 byte code paths: just make the 8 byte one a loop as well, and drop the 32 byte and 16 byte ones.
> > > >
> > > > --
> > > >
> > > > Consider of some processor has instruction N-way parallel function, with the increase of the data buf’s size, 64B loop will performance better than 16B loop.
> > > >
> > > > On the other hand, in the same environment I tested the 8B loop, which is worse than the 16-byte loop.
> > > >
> > > > The test result is shown in the fellow excel(crc test result.xlsx)
> > sheet1(64B loop) and sheet2(8B loop)
> > > >
> > Maybe I phrased that wrong: if we add the 64-byte loop, there is no need for a 32-byte block, a 16 byte block and a 8 byte block, since they all use the same crc32x instruction. After the 64-byte loop, just loop in the 8-byte sequence until the remaining data is less than 8 bytes.
> >
>
> I think we should not use 8-byte loop after 64-byte loop. Although the number of code lines is reduced, but it will run more subs and b.cond instruction. I test it and shown the result in the fellow excel.
>

OK

> Why I used three temp variables to do the ldp below is because our processor have two load/store unit, if we use the registers which are independent, it can processed in parallel.
>

Yes, but you are adding three instructions to a tight loop, which will
be noticeable on in-order cores.

Just use something like

ldp x3, x4, [x0]
ldp x5, x6, [x0, #16]
ldp x7, x8, [x0, #32]
ldp x9, x10, [x0, #48]
add x0, x0, #64

Those are completely independent as well

> By the way, In most cases, crc short XOR 0xffffffff before and after the calculation, if we add 'mvn w0, w0' at the beginning and before the return will bring some benefits. What do you think about it?

The C code will take care of that.
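As a rough illustration of the structure discussed above (a 64-byte main loop, an 8-byte loop for the remainder, and the XOR with 0xffffffff left to the caller), here is a minimal C sketch. It is not the kernel code: the names crc32_byte, crc32_8bytes, crc32_raw and crc32_std are hypothetical, and the bitwise crc32_byte only stands in for the CRC32 instructions, assuming the standard reflected polynomial 0xEDB88320.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a crc32b step: folds one byte into the CRC,
 * bit by bit, using the reflected CRC-32 polynomial 0xEDB88320. */
static uint32_t crc32_byte(uint32_t crc, uint8_t b)
{
    crc ^= b;
    for (int i = 0; i < 8; i++)
        crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    return crc;
}

/* Hypothetical stand-in for a crc32x step: folds 8 bytes in memory order. */
static uint32_t crc32_8bytes(uint32_t crc, const uint8_t *p)
{
    for (int i = 0; i < 8; i++)
        crc = crc32_byte(crc, p[i]);
    return crc;
}

/* Raw update with no inversion: a 64-byte main loop, then an 8-byte loop
 * until fewer than 8 bytes remain, then the byte-wise tail. */
static uint32_t crc32_raw(uint32_t crc, const uint8_t *p, size_t len)
{
    while (len >= 64) {
        for (int i = 0; i < 64; i += 8)
            crc = crc32_8bytes(crc, p + i);
        p += 64;
        len -= 64;
    }
    while (len >= 8) {
        crc = crc32_8bytes(crc, p);
        p += 8;
        len -= 8;
    }
    while (len--)
        crc = crc32_byte(crc, *p++);
    return crc;
}

/* Caller-side wrapper (an assumption about where the inversion lives):
 * applies the conventional XOR with 0xffffffff before and after. */
static uint32_t crc32_std(const void *buf, size_t len)
{
    return ~crc32_raw(~(uint32_t)0, buf, len);
}
```

With these definitions, crc32_std("123456789", 9) yields 0xCBF43926, the usual CRC-32 check value. The chunking changes only how the bytes are grouped, not the result.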
> >
> > > > --
> > > >
> > > > > Signed-off-by: Rui Sun
> > > > > ---
> > > > >  arch/arm64/lib/crc32.S | 54 ++++++++++++++++++++++++++++++++++++++++++++++----
> > > > >  1 file changed, 50 insertions(+), 4 deletions(-)
> > > > >
> > > > > diff --git a/arch/arm64/lib/crc32.S b/arch/arm64/lib/crc32.S
> > > > > index 5bc1e85..2b37009 100644
> > > > > --- a/arch/arm64/lib/crc32.S
> > > > > +++ b/arch/arm64/lib/crc32.S
> > > > > @@ -15,15 +15,61 @@
> > > > >         .cpu generic+crc
> > > > >
> > > > >         .macro __crc32, c
> > > > > -0:     subs x2, x2, #16
> > > > > -       b.mi 8f
> > > > > +
> > > > > +64:    cmp x2, #64
> > > > > +       b.lt 32f
> > > > > +
> > > > > +       adds x11, x1, #16
> > > > > +       adds x12, x1, #32
> > > > > +       adds x13, x1, #48
> > > > > +
> > > > > +0 :    subs x2, x2, #64
> > > > > +       b.mi 32f
> > > > > +
> > > > > +       ldp x3, x4, [x1], #64
> > > > > +       ldp x5, x6, [x11], #64
> > > > > +       ldp x7, x8, [x12], #64
> > > > > +       ldp x9, x10,[x13], #64
> > > > > +
> > > >
> > > > Can we do this instead, and get rid of the temp variables?
> > > >
> > > > ldp x3, x4, [x1], #64
> > > > ldp x5, x6, [x1, #-48]
> > > > ldp x7, x8, [x1, #-32]
> > > > ldp x9, x10,[x1, #-16]
> > > >
> > > > > +       CPU_BE( rev x3, x3 )
> > > > > +       CPU_BE( rev x4, x4 )
> > > > > +       CPU_BE( rev x5, x5 )
> > > > > +       CPU_BE( rev x6, x6 )
> > > > > +       CPU_BE( rev x7, x7 )
> > > > > +       CPU_BE( rev x8, x8 )
> > > > > +       CPU_BE( rev x9, x9 )
> > > > > +       CPU_BE( rev x10,x10 )
> > > > > +
> > > > > +       crc32\c\()x w0, w0, x3
> > > > > +       crc32\c\()x w0, w0, x4
> > > > > +       crc32\c\()x w0, w0, x5
> > > > > +       crc32\c\()x w0, w0, x6
> > > > > +       crc32\c\()x w0, w0, x7
> > > > > +       crc32\c\()x w0, w0, x8
> > > > > +       crc32\c\()x w0, w0, x9
> > > > > +       crc32\c\()x w0, w0, x10
> > > > > +
> > > > > +       b.ne 0b
> > > > > +       ret
> > > > > +
> > > > > +32:    tbz x2, #5, 16f
> > > > > +       ldp x3, x4, [x1], #16
> > > > > +       ldp x5, x6, [x1], #16
> > > > > +CPU_BE( rev x3, x3 )
> > > > > +CPU_BE( rev x4, x4 )
> > > > > +CPU_BE( rev x5, x5 )
> > > > > +CPU_BE( rev x6, x6 )
> > > > > +       crc32\c\()x w0, w0, x3
> > > > > +       crc32\c\()x w0, w0, x4
> > > > > +       crc32\c\()x w0, w0, x5
> > > > > +       crc32\c\()x w0, w0, x6
> > > > > +
> > > > > +16:    tbz x2, #4, 8f
> > > > >         ldp x3, x4, [x1], #16
> > > > >  CPU_BE( rev x3, x3 )
> > > > >  CPU_BE( rev x4, x4 )
> > > > >         crc32\c\()x w0, w0, x3
> > > > >         crc32\c\()x w0, w0, x4
> > > > > -       b.ne 0b
> > > > > -       ret
> > > > >
> > > > >  8:     tbz x2, #3, 4f
> > > > >         ldr x3, [x1], #8
> > > > > --
> > > > > 1.8.3.1
> > > > >
> > > >