From: Ferruh Yigit
To: "Ananyev, Konstantin", Morten Brørup, Thomas Monjalon,
 "Richardson, Bruce"
Cc: dev@dpdk.org, olivier.matz@6wind.com, andrew.rybchenko@oktetlabs.ru,
 honnappa.nagarahalli@arm.com, jerinj@marvell.com, gakhil@marvell.com
Date: Fri, 18 Jun 2021 11:28:09 +0100
Subject: Re: [dpdk-dev] [PATCH] parray: introduce internal API for dynamic arrays
References: <20210614105839.3379790-1-thomas@monjalon.net>
 <98CBD80474FA8B44BF855DF32C47DC35C6184E@smartserver.smartshare.dk>
 <2004320.XGyPsaEoyj@thomas>
 <98CBD80474FA8B44BF855DF32C47DC35C61868@smartserver.smartshare.dk>

On 6/17/2021 6:05 PM, Ananyev, Konstantin wrote:
>
>> On 6/17/2021 4:17 PM, Morten Brørup wrote:
>>>> From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com]
>>>> Sent: Thursday, 17 June 2021 16.59
>>>>
>>>>>>>> 14/06/2021 15:15, Bruce Richardson:
>>>>>>>>> On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote:
>>>>>>>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
>>>>>>>>>>> Sent: Monday, 14 June 2021 12.59
>>>>>>>>>>>
>>>>>>>>>>> Performance of access in a fixed-size array is very good
>>>>>>>>>>> because of cache locality,
>>>>>>>>>>> and because there is a single pointer to dereference.
>>>>>>>>>>> The only drawback is the lack of flexibility:
>>>>>>>>>>> the size of such an array cannot be increased at runtime.
>>>>>>>>>>>
>>>>>>>>>>> An approach to this problem is to allocate the array at runtime,
>>>>>>>>>>> being as efficient as static arrays, but still limited to a maximum.
>>>>>>>>>>>
>>>>>>>>>>> That's why the rte_parray API is introduced:
>>>>>>>>>>> it allows declaring an array of pointers which can be resized
>>>>>>>>>>> dynamically and automatically at runtime,
>>>>>>>>>>> while keeping good read performance.
>>>>>>>>>>>
>>>>>>>>>>> After a resize, the previous array is kept until the next resize,
>>>>>>>>>>> to avoid crashes during a read without any lock.
>>>>>>>>>>>
>>>>>>>>>>> Each element is a pointer to a dynamically allocated memory chunk.
>>>>>>>>>>> This is not good for cache locality, but it allows keeping the
>>>>>>>>>>> same memory per element, no matter how the array is resized.
>>>>>>>>>>> Cache locality could be improved with mempools.
>>>>>>>>>>> The other drawback is having to dereference one more pointer
>>>>>>>>>>> to read an element.
>>>>>>>>>>>
>>>>>>>>>>> There is not much locking, so the API is for internal use only.
>>>>>>>>>>> This API may be used to completely remove some compile-time
>>>>>>>>>>> maximums.
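To make the mechanism described above concrete, here is my reading of
it as a minimal sketch; the names are my guesses, not the actual patch
API:

#include <stdint.h>

/* Resizable array of pointers. The previous buffer is kept alive
 * after a resize, so a concurrent reader still holding the old
 * pointer does not crash (reads take no lock). */
struct parray {
	void **array;     /* current buffer */
	void **old_array; /* previous buffer, freed only at the next resize */
	int32_t size;     /* allocated length of 'array' */
};

/* Hot-path read: one extra dereference compared to a static flat
 * array of structs, because each element is a separately allocated
 * memory chunk. */
static inline void *
parray_read(const struct parray *pa, int32_t idx)
{
	return pa->array[idx];
}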
>>>>>>>>>>
>>>>>>>>>> I get the purpose and overall intention of this library.
>>>>>>>>>>
>>>>>>>>>> I probably already mentioned that I prefer "embedded style
>>>>>>>>>> programming" with fixed-size arrays, rather than runtime
>>>>>>>>>> configurability. It's my personal opinion, and the DPDK Tech
>>>>>>>>>> Board clearly prefers reducing the amount of compile-time
>>>>>>>>>> configurability, so there is no way for me to stop this
>>>>>>>>>> progress, and I do not intend to oppose this library. :-)
>>>>>>>>>>
>>>>>>>>>> This library is likely to become a core library of DPDK, so I
>>>>>>>>>> think it is important to get it right. Could you please mention
>>>>>>>>>> a few examples where you think this internal library should be
>>>>>>>>>> used, and where it should not be used. Then it is easier to
>>>>>>>>>> discuss whether the border line between control path and data
>>>>>>>>>> plane is correct. E.g. this library is not intended to be used
>>>>>>>>>> for dynamically sized packet queues that grow and shrink in the
>>>>>>>>>> fast path.
>>>>>>>>>>
>>>>>>>>>> If the library becomes a core DPDK library, it should probably
>>>>>>>>>> be public instead of internal. E.g. if the library is used to
>>>>>>>>>> make RTE_MAX_ETHPORTS dynamic instead of compile-time fixed,
>>>>>>>>>> then some applications might also need dynamically sized arrays
>>>>>>>>>> for their application-specific per-port runtime data, and this
>>>>>>>>>> library could serve that purpose too.
>>>>>>>>>>
>>>>>>>>> Thanks Thomas for starting this discussion, and Morten for the
>>>>>>>>> follow-up.
>>>>>>>>>
>>>>>>>>> My thinking is as follows, particularly keeping in mind the case
>>>>>>>>> of e.g. RTE_MAX_ETHPORTS as a leading candidate here.
>>>>>>>>>
>>>>>>>>> While I dislike the hard-coded limits in DPDK, I'm also not
>>>>>>>>> convinced that we should switch away from the flat arrays, or
>>>>>>>>> that we need fully dynamic arrays that grow/shrink at runtime
>>>>>>>>> for ethdevs. I would suggest a half-way house here, where we
>>>>>>>>> keep the ethdevs as an array, but one allocated/sized at runtime
>>>>>>>>> rather than statically. This would allow us to have a
>>>>>>>>> compile-time default value, but, for use cases that need it,
>>>>>>>>> allow use of a flag e.g. "max-ethdevs" to change the size of the
>>>>>>>>> parameter given to the malloc call for the array. This max limit
>>>>>>>>> could then be provided to apps too, if they want to match any
>>>>>>>>> array sizes. [Alternatively, those apps could check the provided
>>>>>>>>> size and error out if the size has been increased beyond what
>>>>>>>>> the app is designed to use?] There would be no extra
>>>>>>>>> dereferences per rx/tx burst call in this scenario, so
>>>>>>>>> performance should be the same as before (potentially better if
>>>>>>>>> the array is in hugepage memory, I suppose).
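For reference, the half-way house as I understand it would look
roughly like below; hypothetical names, assuming a new "max-ethdevs"
style EAL option and visibility of the internal 'struct rte_eth_dev'
definition:

#include <stdint.h>
#include <stdlib.h>

#define ETHPORTS_DEFAULT 32 /* compile-time default, as today */

/* Replaces today's static rte_eth_devices[RTE_MAX_ETHPORTS];
 * eth_devs[port_id] still works with no extra dereference on the
 * rx/tx burst path. */
static struct rte_eth_dev *eth_devs;
static uint16_t eth_devs_max;

/* Called once at startup; 'max' comes from the command-line option,
 * or is 0 when the option is not given. */
static int
eth_dev_array_init(uint16_t max)
{
	eth_devs_max = (max != 0) ? max : ETHPORTS_DEFAULT;
	eth_devs = calloc(eth_devs_max, sizeof(*eth_devs));
	return (eth_devs == NULL) ? -1 : 0;
}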
This would allow us to have a >>>>>>>>> compile-time default value, but, for use cases that need it, >>>> allow use of a >>>>>>>>> flag e.g. "max-ethdevs" to change the size of the parameter >>>> given to the >>>>>>>>> malloc call for the array. This max limit could then be >>>> provided to apps >>>>>>>>> too if they want to match any array sizes. [Alternatively those >>>> apps could >>>>>>>>> check the provided size and error out if the size has been >>>> increased beyond >>>>>>>>> what the app is designed to use?]. There would be no extra >>>> dereferences per >>>>>>>>> rx/tx burst call in this scenario so performance should be the >>>> same as >>>>>>>>> before (potentially better if array is in hugepage memory, I >>>> suppose). >>>>>>>> >>>>>>>> I think we need some benchmarks to decide what is the best >>>> tradeoff. >>>>>>>> I spent time on this implementation, but sorry I won't have time >>>> for benchmarks. >>>>>>>> Volunteers? >>>>>>> >>>>>>> I had only a quick look at your approach so far. >>>>>>> But from what I can read, in MT environment your suggestion will >>>> require >>>>>>> extra synchronization for each read-write access to such parray >>>> element (lock, rcu, ...). >>>>>>> I think what Bruce suggests will be much ligther, easier to >>>> implement and less error prone. >>>>>>> At least for rte_ethdevs[] and friends. >>>>>>> Konstantin >>>>>> >>>>>> One more thought here - if we are talking about rte_ethdev[] in >>>> particular, I think we can: >>>>>> 1. move public function pointers (rx_pkt_burst(), etc.) from >>>> rte_ethdev into a separate flat array. >>>>>> We can keep it public to still use inline functions for 'fast' >>>> calls rte_eth_rx_burst(), etc. to avoid >>>>>> any regressions. >>>>>> That could still be flat array with max_size specified at >>>> application startup. >>>>>> 2. Hide rest of rte_ethdev struct in .c. >>>>>> That will allow us to change the struct itself and the whole >>>> rte_ethdev[] table in a way we like >>>>>> (flat array, vector, hash, linked list) without ABI/API breakages. >>>>>> >>>>>> Yes, it would require all PMDs to change prototype for >>>> pkt_rx_burst() function >>>>>> (to accept port_id, queue_id instead of queue pointer), but the >>>> change is mechanical one. >>>>>> Probably some macro can be provided to simplify it. >>>>>> >>>>> >>>>> We are already planning some tasks for ABI stability for v21.11, I >>>> think >>>>> splitting 'struct rte_eth_dev' can be part of that task, it enables >>>> hiding more >>>>> internal data. >>>> >>>> Ok, sounds good. >>>> >>>>> >>>>>> The only significant complication I can foresee with implementing >>>> that approach - >>>>>> we'll need a an array of 'fast' function pointers per queue, not >>>> per device as we have now >>>>>> (to avoid extra indirection for callback implementation). >>>>>> Though as a bonus we'll have ability to use different RX/TX >>>> funcions per queue. >>>>>> >>>>> >>>>> What do you think split Rx/Tx callback into its own struct too? >>>>> >>>>> Overall 'rte_eth_dev' can be split into three as: >>>>> 1. rte_eth_dev >>>>> 2. rte_eth_dev_burst >>>>> 3. rte_eth_dev_cb >>>>> >>>>> And we can hide 1 from applications even with the inline functions. >>>> >>>> As discussed off-line, I think: >>>> it is possible. >>>> My absolute preference would be to have just 1/2 (with CB hidden). >>>> But even with 1/2/3 in place I think it would be a good step forward. >>>> Probably worth to start with 1/2/3 first and then see how difficult it >>>> would be to switch to 1/2. 
>>>
>>> If you do proceed with this, be very careful. E.g. the inlined rx/tx
>>> burst functions should not touch more cache lines than they do today,
>>> especially if there are many active ports. The inlined rx/tx burst
>>> functions are very simple, so a thorough code review (and possibly
>>> also a review of the resulting assembly) is appropriate. Simple
>>> performance testing might not detect whether more cache lines are
>>> accessed than before the modifications.
>>>
>>> Don't get me wrong... I do consider this an improvement of the ethdev
>>> library; I'm only asking you to take extra care!
>>>
>> ack
>>
>> If we split as above, I think the device-specific data 'struct
>> rte_eth_dev_data' should be part of 1 (rte_eth_dev), which means the
>> Rx/Tx inline functions would access an additional cache line.
>>
>> To prevent this, what about duplicating 'data' in 2
>> (rte_eth_dev_burst)?
>
> I think it would be better to change rx_pkt_burst() to accept port_id
> and queue_id, instead of void *.
> I.e.:
> typedef uint16_t (*eth_rx_burst_t)(uint16_t port_id, uint16_t queue_id,
>                                    struct rte_mbuf **rx_pkts,
>                                    uint16_t nb_pkts);

We may not need to add 'port_id', since in the callback you are already
in the driver scope, and all required device-specific variables are
already accessible with the help of the queue struct.

> And we can do the actual de-referencing of the private rxq data inside
> the actual rx function.

Yes, we can replace the queue struct with 'queue_id' and do the
referencing in the Rx function instead of the burst API, but what is
the benefit of it?

>> We have enough space for it to fit into a single cache line;
>> currently it is:
>> struct rte_eth_dev {
>>     eth_rx_burst_t             rx_pkt_burst;          /*  0  8 */
>>     eth_tx_burst_t             tx_pkt_burst;          /*  8  8 */
>>     eth_tx_prep_t              tx_pkt_prepare;        /* 16  8 */
>>     eth_rx_queue_count_t       rx_queue_count;        /* 24  8 */
>>     eth_rx_descriptor_done_t   rx_descriptor_done;    /* 32  8 */
>>     eth_rx_descriptor_status_t rx_descriptor_status;  /* 40  8 */
>>     eth_tx_descriptor_status_t tx_descriptor_status;  /* 48  8 */
>>     struct rte_eth_dev_data   *data;                  /* 56  8 */
>>     /* --- cacheline 1 boundary (64 bytes) --- */
>>
>> 'rx_descriptor_done' is deprecated and will be removed;
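And to show what duplicating 'data' in 2 would give: once
'rx_descriptor_done' is removed, the six remaining function pointers
plus the duplicated 'data' pointer (7 x 8 = 56 bytes) still fit in a
single cache line. A rough sketch, layout illustrative only:

struct rte_eth_dev_burst {
	eth_rx_burst_t             rx_pkt_burst;          /*  0  8 */
	eth_tx_burst_t             tx_pkt_burst;          /*  8  8 */
	eth_tx_prep_t              tx_pkt_prepare;        /* 16  8 */
	eth_rx_queue_count_t       rx_queue_count;        /* 24  8 */
	eth_rx_descriptor_status_t rx_descriptor_status;  /* 32  8 */
	eth_tx_descriptor_status_t tx_descriptor_status;  /* 40  8 */
	struct rte_eth_dev_data   *data;                  /* 48  8 */
	/* 56 bytes used, 8 bytes of padding to the 64-byte boundary */
} __rte_cache_aligned;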