From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dev-bounces@dpdk.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 809D2C433EF
	for <dpdk-dev@archiver.kernel.org>; Fri, 14 Jan 2022 09:54:03 +0000 (UTC)
Received: from [217.70.189.124] (localhost [127.0.0.1])
	by mails.dpdk.org (Postfix) with ESMTP id E093442774;
	Fri, 14 Jan 2022 10:54:01 +0100 (CET)
Received: from smartserver.smartsharesystems.com
 (smartserver.smartsharesystems.com [77.243.40.215])
 by mails.dpdk.org (Postfix) with ESMTP id EC87740DDD
 for <dev@dpdk.org>; Fri, 14 Jan 2022 10:54:00 +0100 (CET)
X-MimeOLE: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Subject: RE: rte_memcpy alignment
Date: Fri, 14 Jan 2022 10:53:54 +0100
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D86E02@smartserver.smartshare.dk>
In-Reply-To: <YeE+K08sU6wnkEgx@bricha3-MOBL.ger.corp.intel.com>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: rte_memcpy alignment
Thread-Index: AdgJJrCr5q/5qmhnT6mvSSvp7QL4awAApizQ
References: <98CBD80474FA8B44BF855DF32C47DC35D86E00@smartserver.smartshare.dk>
 <YeE+K08sU6wnkEgx@bricha3-MOBL.ger.corp.intel.com>
From: =?iso-8859-1?Q?Morten_Br=F8rup?= <mb@smartsharesystems.com>
To: "Bruce Richardson" <bruce.richardson@intel.com>
Cc: "Jan Viktorin" <viktorin@rehivetech.com>,
 "Ruifeng Wang" <ruifeng.wang@arm.com>,
 "David Christensen" <drc@linux.vnet.ibm.com>,
 "Konstantin Ananyev" <konstantin.ananyev@intel.com>, <dev@dpdk.org>
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Friday, 14 January 2022 10.11
>
> On Fri, Jan 14, 2022 at 09:56:50AM +0100, Morten Brørup wrote:
> > Dear ARM/POWER/x86 maintainers,
> >
> > The architecture specific rte_memcpy() provides optimized variants to
> > copy aligned data. However, the alignment requirements depend on the
> > hardware architecture, and there is no common definition for the
> > alignment.
> >
> > DPDK provides __rte_cache_aligned for cache optimization purposes,
> > with architecture specific values. Would you consider providing an
> > __rte_memcpy_aligned for rte_memcpy() optimization purposes?
> >
> > Or should I just use __rte_cache_aligned, although it is overkill?
> >
> >
> > Specifically, I am working on a mempool optimization where the objs
> > field in the rte_mempool_cache structure may benefit by being aligned
> > for optimized rte_memcpy().
> >
> For me the difficulty with such a memcpy proposal - apart from probably
> adding to the amount of memcpy code we have to maintain - is the
> specific meaning of "aligned" in the memcpy case. Unlike for a struct
> definition, the possible meaning of aligned in memcpy could be:
> * the source address is aligned
> * the destination address is aligned
> * both source and destination are aligned
> * both source and destination are aligned and the copy length is a
>   multiple of the alignment length
> * the data is aligned to a cacheline boundary
> * the data is aligned to the largest load-store size for the system
> * the data is aligned to the boundary suitable for the copy size, e.g.
>   memcpy of 8 bytes is 8-byte aligned etc.
>
> Can you clarify a bit more on your own thinking here? Personally, I am a
> little dubious of the benefit of general memcpy optimization, but I do
> believe that for specific usecases there is value in having their own
> copy operations which include constraints for that specific usecase. For
> example, in the AVX-512 ice/i40e PMD code, we fold the memcpy from the
> mempool cache into the descriptor rearm function because we know we can
> always do 64-byte loads and stores, and also because we know that for
> each load in the copy, we can reuse the data just after storing it
> (giving good perf boost). Perhaps something similar could work for you
> in your mempool optimization.
>
> /Bruce

I'm going to copy an array of pointers, specifically the 'objs' array in
the rte_mempool_cache structure.

The 'objs' array starts at byte offset 24, which is only 8-byte aligned.
So it always fails the ALIGNMENT_MASK test in the x86-specific
rte_memcpy(), and a copy of the array can never take the optimized
rte_memcpy_aligned() path; it always falls back to rte_memcpy_generic().
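
To make the failure concrete, here is a minimal sketch of that test. It
assumes the x86 dispatch has the shape
!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK), with mask values
0x0F, 0x1F and 0x3F for SSE, AVX2 and AVX-512 builds. The struct, the
24-byte header and all sketch_* names are stand-ins for illustration, not
the real DPDK definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the layout described above: 24 bytes of
 * leading fields, then the pointer array. Not the real DPDK struct. */
struct sketch_cache {
	uint64_t hdr[3];   /* assumed 24 bytes of header fields */
	void *objs[512];   /* therefore starts at offset 24 */
} __attribute__((aligned(64)));

/* Assumed shape of the dispatch test: the aligned path is taken only
 * if neither pointer has any of the mask bits set. */
static inline int
sketch_takes_aligned_path(const void *dst, const void *src, uintptr_t mask)
{
	return !(((uintptr_t)dst | (uintptr_t)src) & mask);
}

/* Checks the claim for the three assumed mask values. */
void
sketch_self_check(void)
{
	static struct sketch_cache cache;
	static void *table[512] __attribute__((aligned(64)));

	assert(offsetof(struct sketch_cache, objs) == 24);
	/* objs sits 24 bytes past a 64-byte boundary: every mask fails. */
	assert(!sketch_takes_aligned_path(table, cache.objs, 0x0F));
	assert(!sketch_takes_aligned_path(table, cache.objs, 0x1F));
	assert(!sketch_takes_aligned_path(table, cache.objs, 0x3F));
	/* Two 64-byte aligned addresses pass even the widest mask. */
	assert(sketch_takes_aligned_path(table, &cache, 0x3F));
}
```

Calling sketch_self_check() from any test program exercises all three
masks; the other array ('table') is perfectly aligned, so the failure is
caused by the 'objs' offset alone.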

If the 'objs' array were optimally aligned, and the other array being
copied to/from were also optimally aligned, rte_memcpy() would use the
optimized rte_memcpy_aligned() function.

Please also note that the value of ALIGNMENT_MASK depends on which
vector instruction set DPDK is being compiled with.

The other CPU architectures have similar logic in their rte_memcpy()
implementations, and their alignment requirements are also different.

Please also note that rte_memcpy() becomes even more optimized when the
size of the memcpy() operation is known at compile time.

So I am asking for a public #define __rte_memcpy_aligned I can use to
meet the alignment requirements for optimal rte_memcpy().
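
As a rough illustration of what I have in mind, here is a sketch for x86
only. The names SKETCH_MEMCPY_ALIGN and sketch_memcpy_aligned are
hypothetical, and so is the struct. The 64/32/16 values are my assumption
of what would match ALIGNMENT_MASK for AVX-512, AVX2 and SSE builds; the
real selection may depend on further build options, and other
architectures would need their own values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-build alignment, assumed to be ALIGNMENT_MASK + 1. */
#if defined(__AVX512F__)
#define SKETCH_MEMCPY_ALIGN 64
#elif defined(__AVX2__)
#define SKETCH_MEMCPY_ALIGN 32
#else
#define SKETCH_MEMCPY_ALIGN 16
#endif

/* Hypothetical attribute macro, in the style of __rte_cache_aligned. */
#define sketch_memcpy_aligned __attribute__((aligned(SKETCH_MEMCPY_ALIGN)))

/* Stand-in struct: 24 bytes of header fields (assumed), then the array,
 * which the attribute pushes out to the next suitable boundary. */
struct sketch_aligned_cache {
	uint64_t hdr[3];
	void *objs[512] sketch_memcpy_aligned;
};

/* Verifies that an instance's array has no low address bits set. */
void
sketch_check_instance(const struct sketch_aligned_cache *c)
{
	assert(((uintptr_t)c->objs & (SKETCH_MEMCPY_ALIGN - 1)) == 0);
}
```

The cost is padding: in this sketch the array moves from offset 24 to
offset 32 or 64, depending on the build, which is still less than forcing
a full cache line everywhere.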