Date: Mon, 11 Jul 2022 13:33:52 +0200
From: Olivier Matz
To: Mattias Rönnblom
Cc: Emil Berg, bruce.richardson@intel.com, stephen@networkplumber.org,
 stable@dpdk.org, bugzilla@dpdk.org, dev@dpdk.org, Onar Olsen,
 Morten Brørup
Subject: Re: [PATCH v2 1/2] app/test: add cksum performance test
References: <6839721a-8050-0e11-0c66-0f735ec8c56d@ericsson.com>
 <20220708125608.24532-1-mattias.ronnblom@ericsson.com>
List-Id: DPDK patches and discussions

On Mon, Jul 11, 2022 at 10:42:37AM +0000, Mattias Rönnblom wrote:
> On 2022-07-11 11:47, Olivier Matz wrote:
> > Hi Mattias,
> >
> > Please see a few comments below.
> >
> > On Fri, Jul 08, 2022 at 02:56:07PM +0200, Mattias Rönnblom wrote:
> >> Add performance test for the rte_raw_cksum() function, which delegates
> >> the actual work to __rte_raw_cksum(), which in turn is used by other
> >> functions in need of Internet checksum calculation.
> >>
> >> Signed-off-by: Mattias Rönnblom
> >>
> >> ---
> >>
> >> v2:
> >> * Added __rte_unused to unused volatile variable, to keep the Intel
> >> compiler happy.
> >> ---
> >> MAINTAINERS | 1 +
> >> app/test/meson.build | 1 +
> >> app/test/test_cksum_perf.c | 118 +++++++++++++++++++++++++++++++++++++
> >> 3 files changed, 120 insertions(+)
> >> create mode 100644 app/test/test_cksum_perf.c
> >>
> >> diff --git a/MAINTAINERS b/MAINTAINERS
> >> index c923712946..2a4c99e05a 100644
> >> --- a/MAINTAINERS
> >> +++ b/MAINTAINERS
> >> @@ -1414,6 +1414,7 @@ Network headers
> >> M: Olivier Matz
> >> F: lib/net/
> >> F: app/test/test_cksum.c
> >> +F: app/test/test_cksum_perf.c
> >>
> >> Packet CRC
> >> M: Jasvinder Singh
> >> diff --git a/app/test/meson.build b/app/test/meson.build
> >> index 431c5bd318..191db03d1d 100644
> >> --- a/app/test/meson.build
> >> +++ b/app/test/meson.build
> >> @@ -18,6 +18,7 @@ test_sources = files(
> >> 'test_bpf.c',
> >> 'test_byteorder.c',
> >> 'test_cksum.c',
> >> + 'test_cksum_perf.c',
> >> 'test_cmdline.c',
> >> 'test_cmdline_cirbuf.c',
> >> 'test_cmdline_etheraddr.c',
> >> diff --git a/app/test/test_cksum_perf.c b/app/test/test_cksum_perf.c
> >> new file mode 100644
> >> index 0000000000..bff73cb3bb
> >> --- /dev/null
> >> +++ b/app/test/test_cksum_perf.c
> >> @@ -0,0 +1,118 @@
> >> +/* SPDX-License-Identifier: BSD-3-Clause
> >> + * Copyright(c) 2022 Ericsson AB
> >> + */
> >> +
> >> +#include
> >> +
> >> +#include
> >> +#include
> >> +#include
> >> +#include
> >> +
> >> +#include "test.h"
> >> +
> >> +#define NUM_BLOCKS (10)
> >> +#define ITERATIONS (1000000)
> >
> > Parentheses can be safely removed
> >
> >> +
> >> +static const size_t data_sizes[] = { 20, 21, 100, 101, 1500, 1501 };
> >> +
> >> +static __rte_noinline uint16_t
> >> +do_rte_raw_cksum(const void *buf, size_t len)
> >> +{
> >> + return rte_raw_cksum(buf, len);
> >> +}
> >
> > I don't understand the need to have this wrapper, especially marked
> > __rte_noinline. What is the objective?
> >
>
> The intention is to prevent the compiler from performing unrolling and
> integrating/interleaving one cksum operation with the next buffer's in
> a way that wouldn't be feasible in a real application.
>
> It will result in an overestimation of the cost for small cksums, so
> it's still misleading, but in another direction. :)

OK, got it. I think it's fine like you did then.

>
> > Note that when I remove the __rte_noinline, the performance is better
> > for sizes 20 and 21.
> >
> >> +
> >> +static void
> >> +init_block(void *buf, size_t len)
> >
> > Can buf be a (char *) instead?
> > It would avoid a cast below.
> >
>
> Yes.
>
> >> +{
> >> + size_t i;
> >> +
> >> + for (i = 0; i < len; i++)
> >> + ((char *)buf)[i] = (uint8_t)rte_rand();
> >> +}
> >> +
> >> +static int
> >> +test_cksum_perf_size_alignment(size_t block_size, bool aligned)
> >> +{
> >> + char *data[NUM_BLOCKS];
> >> + char *blocks[NUM_BLOCKS];
> >> + unsigned int i;
> >> + uint64_t start;
> >> + uint64_t end;
> >> + /* Floating point to handle low (pseudo-)TSC frequencies */
> >> + double block_latency;
> >> + double byte_latency;
> >> + volatile __rte_unused uint64_t sum = 0;
> >> +
> >> + for (i = 0; i < NUM_BLOCKS; i++) {
> >> + data[i] = rte_malloc(NULL, block_size + 1, 0);
> >> +
> >> + if (data[i] == NULL) {
> >> + printf("Failed to allocate memory for block\n");
> >> + return TEST_FAILED;
> >> + }
> >> +
> >> + init_block(data[i], block_size + 1);
> >> +
> >> + blocks[i] = aligned ? data[i] : data[i] + 1;
> >> + }
> >> +
> >> + start = rte_rdtsc();
> >> +
> >> + for (i = 0; i < ITERATIONS; i++) {
> >> + unsigned int j;
> >> + for (j = 0; j < NUM_BLOCKS; j++)
> >> + sum += do_rte_raw_cksum(blocks[j], block_size);
> >> + }
> >> +
> >> + end = rte_rdtsc();
> >> +
> >> + block_latency = (end - start) / (double)(ITERATIONS * NUM_BLOCKS);
> >> + byte_latency = block_latency / block_size;
> >> +
> >> + printf("%-9s %10zd %19.1f %16.2f\n", aligned ? "Aligned" : "Unaligned",
> >> + block_size, block_latency, byte_latency);
> >
> > When I run the test on my dev machine, I get the following results,
> > which are quite reproducible:
> >
> > Aligned 20 10.4 0.52 (range is 0.48 - 0.52)
> > Unaligned 20 7.9 0.39 (range is 0.39 - 0.40)
> > ...
> >
> > If I increase the number of iterations, the first results
> > change significantly:
> >
> > Aligned 20 8.2 0.42 (range is 0.41 - 0.42)
> > Unaligned 20 8.0 0.40 (always this value)
>
> I suspect you have frequency scaling enabled on your system. This is
> generally not advisable; you want some level of determinism when
> benchmarking, especially on runs as short as this one is (and must be).
>
> I thought about doing something about this, but it seemed like an issue
> that should be addressed on a framework level, rather than on a per-perf
> autotest level.
>
> If you want your CPU core to scale up, you can just insert
>
> rte_delay_us_block(100000);
>
> before the actual test is run.

Your hypothesis is correct. When I disable frequency scaling, the
results are now the same with 100K, 1M or 10M iterations.

However, adding a rte_delay_us_block() does not really solve the issue,
probably because it calls rte_pause() in the loop.

> Should I add this? I *think* 100 ms should be enough, but maybe someone
> with more in-depth knowledge of the frequency governors can comment on
> this.

Anyway, I think we don't need to add the blocking delay: we can
legitimately expect that freq scaling is disabled when we run
performance tests.

> >
> > To have more precise tests with small sizes, would it make sense to
> > target a test time instead of an iteration count? Something like
> > this:
> >
>
> The time lost when running on a lower frequency (plus the hiccups when
> the frequency is changed) will be amortized as you add to the length of
> the test run, which will partly solve the problem. A better solution is
> to not start the test before the core runs on the max frequency.
>
> Again, this is assuming DVFS is what you suffer from here. I guess in
> theory it could be a TLB miss as well.
>
> > #define ITERATIONS 1000000
> > uint64_t iterations = 0;
> >
> > ...
> >
> > do {
> > for (i = 0; i < ITERATIONS; i++) {
> > unsigned int j;
> > for (j = 0; j < NUM_BLOCKS; j++)
> > sum += do_rte_raw_cksum(blocks[j], block_size);
> > }
> > iterations += ITERATIONS;
> > end = rte_rdtsc();
> > } while ((end - start) < rte_get_tsc_hz());
> >
> > block_latency = (end - start) / (double)(iterations * NUM_BLOCKS);
> >
> >
> > After this change, the aligned and unaligned cases have the same
> > performance on my machine.
> >
>
> RTE>>cksum_perf_autotest
> ### rte_raw_cksum() performance ###
> Alignment Block size TSC cycles/block TSC cycles/byte
> Aligned 20 16.1 0.81
> Unaligned 20 16.1 0.81
>
> ... with the 100 ms busy-wait delay (and frequency scaling enabled) on
> my AMD machine.
>
> >> +
> >> + for (i = 0; i < NUM_BLOCKS; i++)
> >> + rte_free(data[i]);
> >> +
> >> + return TEST_SUCCESS;
> >> +}
> >> +
> >> +static int
> >> +test_cksum_perf_size(size_t block_size)
> >> +{
> >> + int rc;
> >> +
> >> + rc = test_cksum_perf_size_alignment(block_size, true);
> >> + if (rc != TEST_SUCCESS)
> >> + return rc;
> >> +
> >> + rc = test_cksum_perf_size_alignment(block_size, false);
> >> +
> >> + return rc;
> >> +}
> >> +
> >> +static int
> >> +test_cksum_perf(void)
> >> +{
> >> + uint16_t i;
> >> +
> >> + printf("### rte_raw_cksum() performance ###\n");
> >> + printf("Alignment Block size TSC cycles/block TSC cycles/byte\n");
> >> +
> >> + for (i = 0; i < RTE_DIM(data_sizes); i++) {
> >> + int rc;
> >> +
> >> + rc = test_cksum_perf_size(data_sizes[i]);
> >> + if (rc != TEST_SUCCESS)
> >> + return rc;
> >> + }
> >> +
> >> + return TEST_SUCCESS;
> >> +}
> >> +
> >> +
> >> +REGISTER_TEST_COMMAND(cksum_perf_autotest, test_cksum_perf);
> >> +
> >
> > The last empty line can be removed.
> >
>
> OK.
>
> Thanks for the review. I will send a v3 as soon as we've settled the
> DVFS issue.
>
> >> --
> >> 2.25.1
> >>
>