From: Noah Goldstein
Date: Thu, 2 Dec 2021 14:19:19 -0600
Subject: Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c
To: Eric Dumazet
Cc: David Laight, "tglx@linutronix.de", "mingo@redhat.com", Borislav Petkov, "dave.hansen@linux.intel.com", X86 ML, "hpa@zytor.com", "peterz@infradead.org", "alexanderduyck@fb.com", open list, netdev
References: <20211125193852.3617-1-goldstein.w.n@gmail.com> <8e4961ae0cf04a5ca4dffdec7da2e57b@AcuMS.aculab.com> <29cf408370b749069f3b395781fe434c@AcuMS.aculab.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Dec 2, 2021 at 9:01 AM Eric Dumazet wrote:
>
> On Thu, Dec 2, 2021 at 6:24 AM David Laight wrote:
> >
> > I've dug out my test program and measured the performance of
> > various copies of the inner loop - usually 64 bytes/iteration.
> > Code is below.
> >
> > It uses the hardware performance counter to get the number of
> > clocks the inner loop takes.
> > This is reasonably stable once the branch predictor has settled down.
> > So the difference in clocks between a 64 byte buffer and a 128 byte
> > buffer is the number of clocks for 64 bytes.

Intuitively 10 passes is a bit low.

Also you might consider aligning the `csum64` function and possibly the loops.

Is there a reason you put `jrcxz` at the beginning of the loops instead of the end?

> > (Unlike the TSC the pmc count doesn't depend on the cpu frequency.)
> >
> > What is interesting is that even some of the trivial loops appear
> > to be doing 16 bytes per clock for short buffers - which is impossible.
> > Checksum 1k bytes and you get an entirely different answer.
> > The only loop that really exceeds 8 bytes/clock for long buffers
> > is the adcx/adox one.
> >
> > What is almost certainly happening is that all the memory reads and
> > the dependent add/adc instructions are all queued up in the 'out of
> > order' execution unit.
> > Since 'rdpmc' isn't a serialising instruction they can still be
> > outstanding when the function returns.
> > Uncomment the 'rdtsc' and you get much slower values for short buffers.

Maybe add an `lfence` before / after `csum64`.

> >
> > When testing the full checksum function the queued up memory
> > reads and adc are probably running in parallel with the logic
> > that is handling lengths that aren't multiples of 64.
> >
> > I also found nothing consistently different for misaligned reads.
> >
> > These were all tested on my i7-7700 cpu.
> >
>
> I usually do not bother timing each call.
> I instead time a loop of 1,000,000,000 calls.
> Yes, this includes loop cost, but this is the same cost for all variants.
>
> for (i = 0; i < 100*1000*1000; i++) {
>         res += csum_partial((void *)frame + 14 + 64*0, 40, 0);
>         res += csum_partial((void *)frame + 14 + 64*1, 40, 0);
>         res += csum_partial((void *)frame + 14 + 64*2, 40, 0);
>         res += csum_partial((void *)frame + 14 + 64*3, 40, 0);
>         res += csum_partial((void *)frame + 14 + 64*4, 40, 0);
>         res += csum_partial((void *)frame + 14 + 64*5, 40, 0);
>         res += csum_partial((void *)frame + 14 + 64*6, 40, 0);
>         res += csum_partial((void *)frame + 14 + 64*7, 40, 0);
>         res += csum_partial((void *)frame + 14 + 64*8, 40, 0);
>         res += csum_partial((void *)frame + 14 + 64*9, 40, 0);
> }

+1.
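
For anyone wanting to try the quoted loop outside the kernel, here is a rough userspace sketch. Everything named in it is an assumption made for illustration: `csum_ref` is a hypothetical stand-in for `csum_partial` (a plain 16-bit word sum, not the x86 routine), and `bench_throughput`, the frame size and the `BENCH_MAIN` guard are made up too. The empty asm statement is a GCC/Clang-specific barrier.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for csum_partial(): adds little-endian 16-bit
 * words into a 32-bit partial sum with end-around carry. It is NOT the
 * x86 kernel routine; swap in the variant under test. */
uint32_t csum_ref(const void *buf, size_t len, uint32_t sum)
{
	const uint8_t *p = buf;
	uint64_t acc = sum;

	for (; len >= 2; p += 2, len -= 2)
		acc += (uint32_t)p[0] | ((uint32_t)p[1] << 8);
	if (len)
		acc += p[0];			/* odd trailing byte */
	while (acc >> 32)			/* fold carries back in */
		acc = (acc & 0xffffffffu) + (acc >> 32);
	return (uint32_t)acc;
}

/* Runs `iters` rounds of ten independent 40-byte checksums, mirroring
 * the quoted loop. Returns elapsed nanoseconds; *out receives the
 * accumulated result so the work cannot be optimised away. */
double bench_throughput(const uint8_t *frame, long iters, uint32_t *out)
{
	struct timespec t0, t1;
	uint32_t res = 0;

	assert(iters > 0);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < iters; i++) {
		for (int k = 0; k < 10; k++)
			res += csum_ref(frame + 14 + 64 * k, 40, 0);
		/* GCC/Clang barrier: forces the buffer to be re-read. */
		__asm__ volatile("" ::: "memory");
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	*out = res;
	return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
}

#ifdef BENCH_MAIN
int main(void)
{
	static uint8_t frame[14 + 64 * 10];	/* hypothetical frame size */
	uint32_t res;
	double ns;

	memset(frame, 1, sizeof(frame));
	ns = bench_throughput(frame, 100L * 1000 * 1000, &res);
	/* 1e9 calls in total, so ns / 1e9 is nanoseconds per call. */
	printf("res=%u, %.2f ns/call\n", (unsigned)res, ns / 1e9);
	return 0;
}
#endif
```

Built with something like `cc -O2 -DBENCH_MAIN bench.c -o bench`, it can be run under `perf stat ./bench`. The ten calls per round are independent of each other, so this measures throughput.
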
You can also feed `res` from the previous iteration into the next iteration to
measure latency cheaply, if that is a better predictor of performance.

>
> Then use "perf stat ./bench" or similar.
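
To make the latency idea concrete, here is a minimal sketch that passes `res` back in as the seed of the next call. It rests on the same assumptions as any userspace mock-up: `csum_seeded` is a hypothetical placeholder for `csum_partial`, and `bench_latency`, the frame size and the `BENCH_MAIN` guard are invented for illustration. How strongly the chain serialises the calls depends on where the implementation under test folds in the seed. If the seed is only added at the very end, much of the next call's work may still overlap with the previous one.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical placeholder for csum_partial(): little-endian 16-bit
 * word sum with end-around carry, seeded with `sum`. */
uint32_t csum_seeded(const void *buf, size_t len, uint32_t sum)
{
	const uint8_t *p = buf;
	uint64_t acc = sum;

	for (; len >= 2; p += 2, len -= 2)
		acc += (uint32_t)p[0] | ((uint32_t)p[1] << 8);
	if (len)
		acc += p[0];			/* odd trailing byte */
	while (acc >> 32)			/* fold carries back in */
		acc = (acc & 0xffffffffu) + (acc >> 32);
	return (uint32_t)acc;
}

/* Each call takes the previous result as its seed, so the result of
 * call N+1 depends on call N. Returns elapsed nanoseconds; *out
 * receives the final chained checksum. */
double bench_latency(const uint8_t *frame, long iters, uint32_t *out)
{
	struct timespec t0, t1;
	uint32_t res = 0;

	assert(iters > 0);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < iters; i++) {
		for (int k = 0; k < 10; k++)
			res = csum_seeded(frame + 14 + 64 * k, 40, res);
		/* GCC/Clang barrier: forces the buffer to be re-read. */
		__asm__ volatile("" ::: "memory");
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	*out = res;
	return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
}

#ifdef BENCH_MAIN
int main(void)
{
	static uint8_t frame[14 + 64 * 10];	/* hypothetical frame size */
	uint32_t res;
	double ns;

	memset(frame, 1, sizeof(frame));
	ns = bench_latency(frame, 100L * 1000 * 1000, &res);
	/* 1e9 calls in total, so ns / 1e9 is nanoseconds per call. */
	printf("res=%u, %.2f ns/call\n", (unsigned)res, ns / 1e9);
	return 0;
}
#endif
```

Comparing the per-call time here against a run with a constant seed of 0 gives a rough idea of how far latency and throughput differ for a given variant.
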