Re: [dpdk-dev] [PATCH] parray: introduce internal API for dynamic arrays

From: Ferruh Yigit <ferruh.yigit@intel.com>
To: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>,
	"Morten Brørup" <mb@smartsharesystems.com>,
	"Thomas Monjalon" <thomas@monjalon.net>,
	"Richardson, Bruce" <bruce.richardson@intel.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>,
	"olivier.matz@6wind.com" <olivier.matz@6wind.com>,
	"andrew.rybchenko@oktetlabs.ru" <andrew.rybchenko@oktetlabs.ru>,
	"honnappa.nagarahalli@arm.com" <honnappa.nagarahalli@arm.com>,
	"jerinj@marvell.com" <jerinj@marvell.com>,
	"gakhil@marvell.com" <gakhil@marvell.com>
Subject: Re: [dpdk-dev] [PATCH] parray: introduce internal API for dynamic arrays
Date: Fri, 18 Jun 2021 11:28:09 +0100	[thread overview]
Message-ID: <ae026da0-b0f3-7596-e485-3c0b5d32b13a@intel.com> (raw)
In-Reply-To: <DM6PR11MB449103FDD3FF2EDB913AA4E19A0E9@DM6PR11MB4491.namprd11.prod.outlook.com>

On 6/17/2021 6:05 PM, Ananyev, Konstantin wrote:
> 
> 
>> On 6/17/2021 4:17 PM, Morten Brørup wrote:
>>>> From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com]
>>>> Sent: Thursday, 17 June 2021 16.59
>>>>
>>>>>>>>
>>>>>>>> 14/06/2021 15:15, Bruce Richardson:
>>>>>>>>> On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote:
>>>>>>>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas
>>>> Monjalon
>>>>>>>>>>> Sent: Monday, 14 June 2021 12.59
>>>>>>>>>>>
>>>>>>>>>>> Performance of access in a fixed-size array is very good
>>>>>>>>>>> because of cache locality
>>>>>>>>>>> and because there is a single pointer to dereference.
>>>>>>>>>>> The only drawback is the lack of flexibility:
>>>>>>>>>>> the size of such an array cannot be increase at runtime.
>>>>>>>>>>>
>>>>>>>>>>> An approach to this problem is to allocate the array at
>>>> runtime,
>>>>>>>>>>> being as efficient as static arrays, but still limited to a
>>>> maximum.
>>>>>>>>>>>
>>>>>>>>>>> That's why the API rte_parray is introduced,
>>>>>>>>>>> allowing to declare an array of pointer which can be resized
>>>>>>>>>>> dynamically
>>>>>>>>>>> and automatically at runtime while keeping a good read
>>>> performance.
>>>>>>>>>>>
>>>>>>>>>>> After resize, the previous array is kept until the next resize
>>>>>>>>>>> to avoid crashs during a read without any lock.
>>>>>>>>>>>
>>>>>>>>>>> Each element is a pointer to a memory chunk dynamically
>>>> allocated.
>>>>>>>>>>> This is not good for cache locality but it allows to keep the
>>>> same
>>>>>>>>>>> memory per element, no matter how the array is resized.
>>>>>>>>>>> Cache locality could be improved with mempools.
>>>>>>>>>>> The other drawback is having to dereference one more pointer
>>>>>>>>>>> to read an element.
>>>>>>>>>>>
>>>>>>>>>>> There is not much locks, so the API is for internal use only.
>>>>>>>>>>> This API may be used to completely remove some compilation-
>>>> time
>>>>>>>>>>> maximums.
>>>>>>>>>>
>>>>>>>>>> I get the purpose and overall intention of this library.
>>>>>>>>>>
>>>>>>>>>> I probably already mentioned that I prefer "embedded style
>>>> programming" with fixed size arrays, rather than runtime
>>>> configurability.
>>>>>>> It's
>>>>>>>> my personal opinion, and the DPDK Tech Board clearly prefers
>>>> reducing the amount of compile time configurability, so there is no way
>>>>> for
>>>>>>>> me to stop this progress, and I do not intend to oppose to this
>>>> library. :-)
>>>>>>>>>>
>>>>>>>>>> This library is likely to become a core library of DPDK, so I
>>>> think it is important getting it right. Could you please mention a few
>>>>>>> examples
>>>>>>>> where you think this internal library should be used, and where
>>>> it should not be used. Then it is easier to discuss if the border line
>>>>> between
>>>>>>>> control path and data plane is correct. E.g. this library is not
>>>> intended to be used for dynamically sized packet queues that grow and
>>>>> shrink
>>>>>>> in
>>>>>>>> the fast path.
>>>>>>>>>>
>>>>>>>>>> If the library becomes a core DPDK library, it should probably
>>>> be public instead of internal. E.g. if the library is used to make
>>>>>>>> RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then some
>>>> applications might also need dynamically sized arrays for their
>>>>>>>> application specific per-port runtime data, and this library
>>>> could serve that purpose too.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks Thomas for starting this discussion and Morten for
>>>> follow-up.
>>>>>>>>>
>>>>>>>>> My thinking is as follows, and I'm particularly keeping in mind
>>>> the cases
>>>>>>>>> of e.g. RTE_MAX_ETHPORTS, as a leading candidate here.
>>>>>>>>>
>>>>>>>>> While I dislike the hard-coded limits in DPDK, I'm also not
>>>> convinced that
>>>>>>>>> we should switch away from the flat arrays or that we need fully
>>>> dynamic
>>>>>>>>> arrays that grow/shrink at runtime for ethdevs. I would suggest
>>>> a half-way
>>>>>>>>> house here, where we keep the ethdevs as an array, but one
>>>> allocated/sized
>>>>>>>>> at runtime rather than statically. This would allow us to have a
>>>>>>>>> compile-time default value, but, for use cases that need it,
>>>> allow use of a
>>>>>>>>> flag e.g.  "max-ethdevs" to change the size of the parameter
>>>> given to the
>>>>>>>>> malloc call for the array.  This max limit could then be
>>>> provided to apps
>>>>>>>>> too if they want to match any array sizes. [Alternatively those
>>>> apps could
>>>>>>>>> check the provided size and error out if the size has been
>>>> increased beyond
>>>>>>>>> what the app is designed to use?]. There would be no extra
>>>> dereferences per
>>>>>>>>> rx/tx burst call in this scenario so performance should be the
>>>> same as
>>>>>>>>> before (potentially better if array is in hugepage memory, I
>>>> suppose).
>>>>>>>>
>>>>>>>> I think we need some benchmarks to decide what is the best
>>>> tradeoff.
>>>>>>>> I spent time on this implementation, but sorry I won't have time
>>>> for benchmarks.
>>>>>>>> Volunteers?
>>>>>>>
>>>>>>> I had only a quick look at your approach so far.
>>>>>>> But from what I can read, in MT environment your suggestion will
>>>> require
>>>>>>> extra synchronization for each read-write access to such parray
>>>> element (lock, rcu, ...).
>>>>>>> I think what Bruce suggests will be much ligther, easier to
>>>> implement and less error prone.
>>>>>>> At least for rte_ethdevs[] and friends.
>>>>>>> Konstantin
>>>>>>
>>>>>> One more thought here - if we are talking about rte_ethdev[] in
>>>> particular, I think  we can:
>>>>>> 1. move public function pointers (rx_pkt_burst(), etc.) from
>>>> rte_ethdev into a separate flat array.
>>>>>> We can keep it public to still use inline functions for 'fast'
>>>> calls rte_eth_rx_burst(), etc. to avoid
>>>>>> any regressions.
>>>>>> That could still be flat array with max_size specified at
>>>> application startup.
>>>>>> 2. Hide rest of rte_ethdev struct in .c.
>>>>>> That will allow us to change the struct itself and the whole
>>>> rte_ethdev[] table in a way we like
>>>>>> (flat array, vector, hash, linked list) without ABI/API breakages.
>>>>>>
>>>>>> Yes, it would require all PMDs to change prototype for
>>>> pkt_rx_burst() function
>>>>>> (to accept port_id, queue_id instead of queue pointer), but the
>>>> change is mechanical one.
>>>>>> Probably some macro can be provided to simplify it.
>>>>>>
>>>>>
>>>>> We are already planning some tasks for ABI stability for v21.11, I
>>>> think
>>>>> splitting 'struct rte_eth_dev' can be part of that task, it enables
>>>> hiding more
>>>>> internal data.
>>>>
>>>> Ok, sounds good.
>>>>
>>>>>
>>>>>> The only significant complication I can foresee with implementing
>>>> that approach -
>>>>>> we'll need a an array of 'fast' function pointers per queue, not
>>>> per device as we have now
>>>>>> (to avoid extra indirection for callback implementation).
>>>>>> Though as a bonus we'll have ability to use different RX/TX
>>>> funcions per queue.
>>>>>>
>>>>>
>>>>> What do you think split Rx/Tx callback into its own struct too?
>>>>>
>>>>> Overall 'rte_eth_dev' can be split into three as:
>>>>> 1. rte_eth_dev
>>>>> 2. rte_eth_dev_burst
>>>>> 3. rte_eth_dev_cb
>>>>>
>>>>> And we can hide 1 from applications even with the inline functions.
>>>>
>>>> As discussed off-line, I think:
>>>> it is possible.
>>>> My absolute preference would be to have just 1/2 (with CB hidden).
>>>> But even with 1/2/3 in place I think it would be  a good step forward.
>>>> Probably worth to start with 1/2/3 first and then see how difficult it
>>>> would be to switch to 1/2.
>>>> Do you plan to start working on it?
>>>>
>>>> Konstantin
>>>
>>> If you do proceed with this, be very careful. E.g. the inlined rx/tx burst functions should not touch more cache lines than they do today -
>> especially if there are many active ports. The inlined rx/tx burst functions are very simple, so thorough code review (and possibly also of the
>> resulting assembly) is appropriate. Simple performance testing might not detect if more cache lines are accessed than before the
>> modifications.
>>>
>>> Don't get me wrong... I do consider this an improvement of the ethdev library; I'm only asking you to take extra care!
>>>
>>
>> ack
>>
>> If we split as above, I think device specific data 'struct rte_eth_dev_data'
>> should be part of 1 (rte_eth_dev). Which means Rx/Tx inline functions access
>> additional cache line.
>>
>> To prevent this, what about duplicating 'data' in 2 (rte_eth_dev_burst)?
> 
> I think it would be better to change rx_pkt_burst() to accept port_id and queue_id,
> instead of void *.
> I.E:
> typedef uint16_t (*eth_rx_burst_t)(uint16_t port_id, uint16_t queue_id, struct rte_mbuf **rx_pkts,  uint16_t nb_pkts);
> 

May not need to add 'port_id', since in the callback you are already in the
driver scope and all required device specific variables already accessible via
help of queue struct.

> And we can do actual de-referencing of private rxq data inside the actual rx function.
> 

Yes we can replace queue struct with 'queue_id', and do the referencing in the
Rx instead of burst API, but what is the benefit of it?

>> We have
>> enough space for it to fit into single cache line, currently it is:
>> struct rte_eth_dev {
>>         eth_rx_burst_t             rx_pkt_burst;         /*     0     8 */
>>         eth_tx_burst_t             tx_pkt_burst;         /*     8     8 */
>>         eth_tx_prep_t              tx_pkt_prepare;       /*    16     8 */
>>         eth_rx_queue_count_t       rx_queue_count;       /*    24     8 */
>>         eth_rx_descriptor_done_t   rx_descriptor_done;   /*    32     8 */
>>         eth_rx_descriptor_status_t rx_descriptor_status; /*    40     8 */
>>         eth_tx_descriptor_status_t tx_descriptor_status; /*    48     8 */
>>         struct rte_eth_dev_data *  data;                 /*    56     8 */
>>         /* --- cacheline 1 boundary (64 bytes) --- */
>>
>> 'rx_descriptor_done' is deprecated and will be removed;