From: Noah Goldstein
Date: Sun, 28 Nov 2021 14:59:33 -0600
Subject: Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c
To: David Laight
Cc: Eric Dumazet, tglx@linutronix.de, mingo@redhat.com, Borislav Petkov,
    dave.hansen@linux.intel.com, X86 ML, hpa@zytor.com, peterz@infradead.org,
    alexanderduyck@fb.com, open list
List: linux-kernel@vger.kernel.org

On Sun, Nov 28, 2021 at 1:47 PM David Laight wrote:
>
> ...
> > Regarding the 32 byte case, adding two accumulators helps with the latency
> > numbers but causes a regression in throughput for the 40/48 byte cases. Which
> > is the more important metric for the usage of csum_partial()?
> >
> > Here are the numbers for the smaller sizes:
> >
> > size, lat old, lat ver2, lat ver1, tput old, tput ver2, tput ver1
> > 0, 4.961, 4.503, 4.901, 4.887, 4.399, 4.951
> > 8, 5.590, 5.594, 5.620, 4.227, 4.110, 4.252
> > 16, 6.182, 6.398, 6.202, 4.233, 4.062, 4.278
> > 24, 7.392, 7.591, 7.380, 4.256, 4.246, 4.279
> > 32, 7.371, 6.366, 7.390, 4.550, 4.900, 4.537
> > 40, 8.621, 7.496, 8.601, 4.862, 5.162, 4.836
> > 48, 9.406, 8.128, 9.374, 5.206, 5.736, 5.234
> > 56, 10.535, 9.189, 10.522, 5.416, 5.772, 5.447
> > 64, 10.000, 7.487, 7.590, 6.946, 6.975, 6.989
> > 72, 11.192, 8.639, 8.763, 7.210, 7.311, 7.277
> > 80, 11.734, 9.179, 9.409, 7.605, 7.620, 7.548
> > 88, 12.933, 10.545, 10.584, 7.878, 7.902, 7.858
> > 96, 12.952, 9.331, 10.625, 8.168, 8.470, 8.206
> > 104, 14.206, 10.424, 11.839, 8.491, 8.785, 8.502
> > 112, 14.763, 11.403, 12.416, 8.798, 9.134, 8.771
> > 120, 15.955, 12.635, 13.651, 9.175, 9.494, 9.130
> > 128, 15.271, 10.599, 10.724, 9.726, 9.672, 9.655
> >
> > 'ver2' uses two accumulators for the 32 byte case and has better latency
> > numbers but regressions in tput compared to 'old' and 'ver1'. 'ver1' is the
> > implementation posted, which has essentially the same numbers for tput/lat
> > as 'old' for sizes [0, 63].
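To make the 32 byte tradeoff concrete, the two-accumulator step in 'ver2'
looks roughly like the following (a minimal sketch, not the exact patch;
the function name and exact instruction scheduling are illustrative):

static inline unsigned long csum_32b_2acc(const unsigned long *p,
                                          unsigned long sum)
{
        unsigned long sum2 = 0;

        /*
         * Two independent adc chains roughly halve the carry-chain
         * latency of the 32 byte step, but the extra merge
         * (add + adc $0) is where the 40/48 byte throughput
         * regression shows up.
         */
        asm("addq 0(%[p]),%[s1]\n\t"    /* chain 1: bytes 0-15 */
            "adcq 8(%[p]),%[s1]\n\t"
            "adcq $0,%[s1]\n\t"
            "addq 16(%[p]),%[s2]\n\t"   /* chain 2: bytes 16-31 */
            "adcq 24(%[p]),%[s2]\n\t"
            "adcq $0,%[s2]\n\t"
            "addq %[s2],%[s1]\n\t"      /* merge the two chains */
            "adcq $0,%[s1]"
            : [s1] "+r" (sum), [s2] "+r" (sum2)
            : [p] "r" (p)
            : "memory", "cc");

        return sum;
}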
> Which cpu are you testing on - it will make a big difference ?

Tigerlake, although assuming `adc` is the bottleneck, the results should be
largely independent.

> And what are you measuring throughput in?

Running back-to-back iterations with the same input and no dependency
between iterations, so the OoO window will include multiple iterations at
once.

> And are you testing aligned or mis-aligned 64bit reads?

Aligned, as that is the common case.

> I think one of the performance counters will give 'cpu clocks'.

Time is in ref cycles using `rdtsc`.

> I did some tests early last year and got 8 bytes/clock on broadwell/haswell
> with code that 'loop carried' the carry flag (a single adc chain).
> On the older Intel cpus (Ivy Bridge onwards) 'adc' has a latency of 2
> for the result, but the carry flag is available earlier.
> So alternating the target register in the 'adc' chain will give (nearly)
> 8 bytes/clock - I think I got to 7.5.
>
> That can all be done with only 4 reads per iteration.
> IIRC broadwell/haswell only need 2 reads/iteration.
>
> It is actually likely (certainly worth checking) that haswell/broadwell
> can do two misaligned memory reads every clock.
> So it may not be worth aligning the reads (which the old code did).
> In any case aren't tx packets likely to be aligned, and rx ones
> misaligned to some known 4n+2 boundary?
>
> Using adcx/adox together is a right PITA.

I'm a bit hesitant about adcx/adox because they are extensions, so support
will need to be tested.

> I did get (about) 12 bytes/clock for long buffers while loop carrying
> both the overflow and carry flags.
>
> Also is there a copy of the patched code anywhere?

https://lore.kernel.org/lkml/CANn89iLpFOok_tv=DKsLX1mxZGdHQgATdW4Xs0rc6oaXQEa5Ng@mail.gmail.com/T/

> I think I've missed some of the patches and they are difficult to follow.
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
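P.S. In case it helps interpret the latency/throughput split above, the two
measurement loops are roughly the following (an illustrative userspace
harness, not the actual test driver; ITERS and the extern prototype are
stand-ins, with unsigned int in place of the kernel's __wsum):

#include <x86intrin.h>

extern unsigned int csum_partial(const void *buff, int len, unsigned int sum);

#define ITERS 1000000

/* Throughput: iterations are independent, so the out-of-order window
 * overlaps many calls at once. */
static unsigned long long tput_cycles(const void *buf, int len)
{
        volatile unsigned int sink;
        unsigned long long t0, t1;
        int i;

        t0 = __rdtsc();
        for (i = 0; i < ITERS; i++)
                sink = csum_partial(buf, len, 0);
        t1 = __rdtsc();
        (void)sink;
        return (t1 - t0) / ITERS;
}

/* Latency: each call's seed depends on the previous result, so the
 * calls serialize into a single dependency chain. */
static unsigned long long lat_cycles(const void *buf, int len)
{
        unsigned int seed = 0;
        unsigned long long t0, t1;
        int i;

        t0 = __rdtsc();
        for (i = 0; i < ITERS; i++)
                seed = csum_partial(buf, len, seed);
        t1 = __rdtsc();
        asm volatile("" : : "r" (seed));        /* keep the result live */
        return (t1 - t0) / ITERS;
}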