From: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
To: Don Wallwork <donw@xsightlabs.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [RFC] EAL: legacy memory fixed address translations
Date: Wed, 27 Jul 2022 23:36:44 +0300	[thread overview]
Message-ID: <20220727233644.21f0b2a3@sovereign> (raw)
In-Reply-To: <e426c21b-0235-11a7-7039-0c55dcc15cde@xsightlabs.com>

I now understand more about _why_ you want this feature,
but I am less confident about _what_ exactly you are proposing.

>> 2022-07-26 14:33 (UTC-0400), Don Wallwork:  
>>> This proposal describes a method for translating any huge page
>>> address from virtual to physical or vice versa using simple
>>> addition or subtraction of a single fixed value.

This works for one region with contiguous PA and VA.
At first I assumed you wanted all DPDK memory to be such a region
and that legacy mode was needed for the mapping to be static.
However, you say that the memory does not all need to be PA-contiguous.
Hence, address translation requires a more complex data structure,
something like a tree, to look up the region and then do the translation.
This means the lookup is still not a trivial operation.
Effectively you want fast rte_mem_virt2iova() and rte_mem_iova2virt()
and point to an optimization opportunity.
Am I missing anything?

A static mapping would also allow caching offsets per region,
e.g. a device knows it works with some area and looks up the offset once.
There is no API to detect that the memory layout is static.
Is this the missing piece?
Because the proposed data structure can be built with the existing API:
EAL memory callbacks can track all DPDK memory.
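
For illustration, a rough, untested sketch of what I mean below.
The table layout and the helper names are mine, not existing DPDK API;
only rte_memseg_walk(), rte_mem_event_callback_register() and
rte_mem_virt2iova() are real:

#include <rte_common.h>
#include <rte_memory.h>

/* One entry per IOVA-contiguous region (here: per memseg). */
struct region {
	uintptr_t va;      /* region start (VA) */
	size_t len;        /* region length in bytes */
	rte_iova_t offset; /* IOVA minus VA for this region */
};

#define MAX_REGIONS 1024
static struct region regions[MAX_REGIONS];
static unsigned int nb_regions;

static void
add_region(const void *addr, size_t len, rte_iova_t iova)
{
	if (nb_regions == MAX_REGIONS)
		return;
	regions[nb_regions].va = (uintptr_t)addr;
	regions[nb_regions].len = len;
	regions[nb_regions].offset = iova - (uintptr_t)addr;
	nb_regions++;
}

/* Record the segments already mapped at init time. */
static int
seg_walk(const struct rte_memseg_list *msl __rte_unused,
	 const struct rte_memseg *ms, void *arg __rte_unused)
{
	add_region(ms->addr, ms->len, ms->iova);
	return 0;
}

/* Track memory hot-plugged later (dynamic mode). For simplicity this
 * assumes the new area is IOVA-contiguous; a real implementation
 * would add one entry per memseg instead. */
static void
mem_event(enum rte_mem_event event, const void *addr, size_t len,
	  void *arg __rte_unused)
{
	if (event == RTE_MEM_EVENT_ALLOC)
		add_region(addr, len, rte_mem_virt2iova(addr));
	/* RTE_MEM_EVENT_FREE handling omitted. */
}

static void
build_region_table(void)
{
	rte_memseg_walk(seg_walk, NULL);
	rte_mem_event_callback_register("va2pa-cache", mem_event, NULL);
}

/* VA-to-IOVA lookup: a linear scan here, a sorted array or tree in practice. */
static rte_iova_t
fast_virt2iova(const void *p)
{
	uintptr_t va = (uintptr_t)p;
	unsigned int i;

	for (i = 0; i < nb_regions; i++)
		if (va - regions[i].va < regions[i].len)
			return regions[i].offset + va;
	return RTE_BAD_IOVA;
}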
 
[...]
> Several examples where this could help include:
> 
> 1. A device could return flow lookup results containing the physical
> address of a matching entry that needs to be translated to a virtual
> address.

See above, rte_mem_iova2virt().

> 2. Hardware can perform offloads on dynamically allocated heap
> memory objects and would need PA to avoid requiring an IOMMU.

I wonder where these objects come from and how they are given to the HW.
If there are many similar objects, why not use mempools?
If there are few objects, why not use memzones?
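For the few-objects case, a memzone already carries its IOVA,
so there is nothing to translate at runtime. Untested sketch,
the zone name and size are arbitrary examples:

#include <rte_common.h>
#include <rte_memzone.h>

/* Reserve an IOVA-contiguous zone for the HW-visible objects. */
static const struct rte_memzone *
reserve_hw_objects(void)
{
	const struct rte_memzone *mz;

	mz = rte_memzone_reserve("hw_objects", 4096, SOCKET_ID_ANY,
				 RTE_MEMZONE_IOVA_CONTIG);
	if (mz == NULL)
		return NULL;

	/* mz->addr is the VA for the CPU, mz->iova goes to the device;
	 * an object at mz->addr + off has IOVA mz->iova + off. */
	return mz;
}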

> 3. It may be useful to prepare data such as descriptors in stack
> variables, then pass the PA to hardware which can DMA directly
> from stack memory.

pthread_attr_getstack() can give the base address,
which can be translated once and then used for VA-to-PA conversions.
The missing bit is PA-contiguity of the stack memory.
That could be handled transparently when allocating worker stacks
(probably try to allocate a contiguous chunk, falling back to non-contiguous).
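
Roughly like this (Linux-specific, untested; assumes EAL gave the worker
a PA-contiguous stack, and the helper names are mine):

#define _GNU_SOURCE	/* pthread_getattr_np() is a GNU extension */
#include <pthread.h>
#include <rte_memory.h>

/* Per-thread IOVA-minus-VA offset for this worker's stack. */
static __thread rte_iova_t stack_offset;

/* Call once at worker startup. */
static int
init_stack_offset(void)
{
	pthread_attr_t attr;
	void *base;
	size_t size;
	int ret;

	if (pthread_getattr_np(pthread_self(), &attr) != 0)
		return -1;
	ret = pthread_attr_getstack(&attr, &base, &size);
	pthread_attr_destroy(&attr);
	if (ret != 0)
		return -1;

	/* Valid for the whole stack only if it is PA-contiguous. */
	stack_offset = rte_mem_virt2iova(base) - (uintptr_t)base;
	return 0;
}

/* Translate the address of a local variable for DMA. */
static inline rte_iova_t
stack_virt2iova(const void *p)
{
	return stack_offset + (uintptr_t)p;
}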

> 4. The CPU instruction set provides memory operations such as
> prefetch, atomics, ALU and so on which operate on virtual
> addresses with no software requirement to provide physical
> addresses. A device may be able to provide a more optimized
> implementation of such features that could avoid performance
> degradation associated with using a hardware IOMMU if provided
> virtual addresses. Having the ability to offload such operations
> without requiring data structure modifications to store an IOVA for
> every virtual address is desirable.

Either it's me lacking experience with such accelerators
or this item needs clarification.

> All of these cases can run at packet rate and are not operating on
> mbuf data. These would all benefit from efficient address translation
> in the same way that mbufs already do. Unlike mbuf translation
> that only covers VA to PA, this translation can perform both VA to PA
> and PA to VA with equal efficiency.
> 
> >
> > When drivers need to process a large number of memory blocks,
> > these are typically packets in the form of mbufs,
> > which already have IOVA attached, so there is no translation.
> > Does translation of mbuf VA to PA with the proposed method
> > show significant improvement over reading mbuf->iova?  
> 
> This proposal does not relate to mbufs.  As you say, there is
> already an efficient VA to PA mechanism in place for those.

Are you aware of externally-attached mbufs?
Those carry a pointer to arbitrary IOVA-contiguous memory and its IOVA.
They can be used to convey any object in memory to an API that consumes mbufs.
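
Untested sketch; "obj" stands for your arbitrary IOVA-contiguous object,
which must have room at its tail for the shared info the helper places there:

#include <rte_mbuf.h>
#include <rte_memory.h>

/* The object is owned elsewhere, so nothing to free here. */
static void
ext_free_cb(void *addr __rte_unused, void *opaque __rte_unused)
{
}

/* Wrap an arbitrary IOVA-contiguous object into an mbuf without copying. */
static struct rte_mbuf *
wrap_object(struct rte_mempool *mp, void *obj, uint16_t obj_len)
{
	struct rte_mbuf_ext_shared_info *shinfo;
	struct rte_mbuf *m;
	uint16_t buf_len = obj_len;

	m = rte_pktmbuf_alloc(mp);
	if (m == NULL)
		return NULL;

	/* Places the shared info at the tail of obj and shrinks buf_len. */
	shinfo = rte_pktmbuf_ext_shinfo_init_helper(obj, &buf_len,
						    ext_free_cb, NULL);
	if (shinfo == NULL) {
		rte_pktmbuf_free(m);
		return NULL;
	}

	rte_pktmbuf_attach_extbuf(m, obj, rte_mem_virt2iova(obj),
				  buf_len, shinfo);
	m->pkt_len = m->data_len = buf_len;
	return m;
}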

> > When drivers need to process a few IOVA-contiguous memory blocks,
> > they can calculate VA-to-PA offsets in advance,
> > amortizing translation cost.
> > Hugepage stack falls within this category.  
> 
> As the cases listed above hopefully show, there are cases where
> it is not practical or desirable to precalculate the offsets.

Arguably, only for the PA-to-VA cases.

> >> When legacy memory mode is used, it is possible to map a single
> >> virtual memory region large enough to cover all huge pages. During
> >> legacy hugepage init, each hugepage is mapped into that region.  
> > Legacy mode is called "legacy" with an intent to be deprecated :)  
> 
> Understood.  For our initial implementation, we were okay with
> that limitation, given that supporting it in legacy mode was simpler.
> 
> > There is initial allocation (-m) and --socket-limit in dynamic mode.
> > When initial allocation is equal to the socket limit,
> > it should be the same behavior as in legacy mode:
> > the number of hugepages mapped is constant and cannot grow,
> > so the feature seems applicable as well.  
> 
> It seems feasible to implement this feature in non-legacy mode as
> well. The approach would be similar; reserve a region of virtual
> address space large enough to cover all huge pages before they are
> allocated.  As huge pages are allocated, they are mapped into the
> appropriate location within that virtual address space.

This is what EAL is trying to do.

[...]
> >> This feature is applicable when rte_eal_iova_mode() == RTE_IOVA_PA  
> > One can say it always works for RTE_IOVA_VA with VA-to-PA offset of 0.  
> 
> This is true, but requires the use of a hardware IOMMU which
> degrades performance.

What I meant is this: if there was an API to ask EAL
whether the fast translation is available,
in RTE_IOVA_VA mode it would always return true;
and if asked for an offset, it would always return 0.
Bottom line: the optimization is not limited to RTE_IOVA_PA;
it is just trivial in that mode.
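
Purely hypothetical, just to show the shape of such an API
(both function names below are invented):

#include <stdbool.h>
#include <rte_eal.h>
#include <rte_memory.h>

/* Hypothetical: would fast, offset-based translation work for all
 * DPDK memory? Trivially yes in IOVA-as-VA mode; in IOVA-as-PA mode
 * EAL would report whether it managed to build a flat mapping. */
static bool
fast_translation_available(void)
{
	return rte_eal_iova_mode() == RTE_IOVA_VA;
}

/* Hypothetical: the single IOVA-minus-VA offset; 0 when IOVA == VA. */
static rte_iova_t
fast_translation_offset(void)
{
	return 0;
}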

> >> and could be enabled either by default when the legacy memory EAL
> >> option is given, or a new EAL option could be added to specifically
> >> enable this feature.
> >>
> >> It may be desirable to set a capability bit when this feature is
> >> enabled to allow drivers to behave differently depending on the
> >> state of that flag.  
> > The feature requires, in IOVA-as-PA mode:
> > 1) that hugepage mapping is static (legacy mode or "-m" == "--socket-limit");
> > 2) that EAL has succeeded in mapping all hugepages in one PA-contiguous block.
> 
> It does not require huge pages to be physically contiguous.
> Theoretically, mapping a giant VA region could fail, but
> we have not seen this in practice even when running on x86_64
> servers with multiple NUMA nodes, many cores and huge pages
> that span TBs of physical address space.

Size does not matter as long as there are free hugepages.
Physical contiguity does matter, and it is unpredictable.
But it is best-effort for DPDK anyway.
My point: there is no need for a command-line argument to request this
optimized mode; DPDK can always try and report via an API whether it has succeeded.
