From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
To: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>,
	"Morten Brørup" <mb@smartsharesystems.com>,
	"thomas@monjalon.net" <thomas@monjalon.net>,
	"Feifei Wang" <Feifei.Wang2@arm.com>,
	"Yigit, Ferruh" <ferruh.yigit@intel.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>, nd <nd@arm.com>,
	Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>,
	"Zhang, Qi Z" <qi.z.zhang@intel.com>,
	"Xing,  Beilei" <beilei.xing@intel.com>,
	Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>,
	nd <nd@arm.com>
Subject: RE: Re: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side
Date: Wed, 2 Feb 2022 19:46:43 +0000	[thread overview]
Message-ID: <DBAPR08MB58147EAE54ECA9B5423CDB9898279@DBAPR08MB5814.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <DM6PR11MB44916CAEBB72B2A3F662E8419A219@DM6PR11MB4491.namprd11.prod.outlook.com>

<snip>

> 
> > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > Sent: Tuesday, 18 January 2022 17.54
> > > >
> > > > [quick summary: ethdev API to bypass mempool]
> > > >
> > > > 18/01/2022 16:51, Ferruh Yigit:
> > > > > On 12/28/2021 6:55 AM, Feifei Wang wrote:
> > > > > > Morten Brørup <mb@smartsharesystems.com>:
> > > > > >> The patch provides a significant performance improvement, but I am
> > > > > >> wondering if any real-world applications exist that would use this.
> > > > > >> Only a "router on a stick" (i.e. a single-port router) comes to my
> > > > > >> mind, and that is probably sufficient to call it useful in the real
> > > > > >> world. Do you have any other examples to support the usefulness of
> > > > > >> this patch?
> > > > > >>
> > > > > > One case I have is about network security. For a network firewall,
> > > > > > all packets need to ingress on the specified port and egress on the
> > > > > > specified port to do packet filtering. In this case, we can know the
> > > > > > flow direction in advance.
> > > > >
> > > > > I also have some concerns about how useful this API will be in real
> > > > > life, and whether the use case is worth the complexity it brings. It
> > > > > also looks like too much low-level detail for the application.
> > > >
> > > > That's difficult to judge.
> > > > The use case is limited and the API has some severe limitations.
> > > > The benefit is measured with l3fwd, which is not exactly a real app.
> > > > Do we want an API which improves performance in limited scenarios
> > > > at the cost of breaking some general design assumptions?
> > > >
> > > > Can we achieve the same level of performance with a mempool trick?
> > >
> > > Perhaps the mbuf library could offer bulk functions for alloc/free
> > > of raw mbufs - essentially a shortcut directly to the mempool library.
> > >
> > > There might be a few more details to micro-optimize in the mempool
> > > library, if approached with this use case in mind. E.g. the
> > > rte_mempool_default_cache() could do with a few unlikely() in its
> > > comparisons.
> > >
> > > Also, for this use case, the mempool library adds tracing overhead,
> > > which this API bypasses. And considering how short the code path
> > > through the mempool cache is, the tracing overhead is relatively large.
> > > I.e.: memcpy(NIC->NIC) vs. trace() + memcpy(NIC->cache) + trace() +
> > > memcpy(cache->NIC).
> > >
> > > A key optimization point could be the number of mbufs being moved
> > > to/from the mempool cache. If that number was fixed at compile time,
> > > a faster
> > > memcpy() could be used. However, it seems that different PMDs use
> > > bursts of either 4, 8, or in this case 32 mbufs. If only they could
> > > agree on such a simple detail.
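
Just to illustrate the compile-time fixed burst size point above, a minimal
sketch; the constant and helper name are made up for illustration, not
existing PMD or mempool code:

/* With a burst size known at compile time, the pointer copy to/from the
 * mempool cache becomes a fixed-size memcpy() that the compiler can fully
 * unroll into a few vector loads/stores. */
#include <string.h>
#include <rte_mbuf.h>

#define CACHE_BURST 32	/* would need all PMDs to agree on 4, 8 or 32 */

static inline void
copy_burst_fixed(struct rte_mbuf **dst, struct rte_mbuf * const *src)
{
	memcpy(dst, src, CACHE_BURST * sizeof(*src));
}
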
> > This patch removes the stores and loads, which saves backend cycles. I do
> > not think other optimizations can do the same.
> 
> My thought here was that we can try to introduce a ZC (zero-copy) API for the
> mempool cache, similar to the one we have for the ring.
> Then on the TX free path we wouldn't need to copy the mbufs to be freed into a
> temporary array on the stack.
> Instead we can put them straight from the TX SW ring into the mempool cache.
> That should save an extra store/load per mbuf and might help to achieve some
> performance gain without bypassing the mempool.
Agree, it will remove one set of loads and stores, but not all of them. I am not sure if it can solve the performance problems. We will give it a try.
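
To make sure we are reading the suggestion the same way, below is a rough
sketch of the idea. The mempool-cache ZC helper is hypothetical - only
rte_mempool_default_cache() exists today - so the names and the fallback
policy are assumptions for discussion, not a proposal for the final API.

/* Rough sketch only: a hypothetical zero-copy "put" into the per-lcore
 * mempool cache, so mbuf pointers can go straight from the TX SW ring into
 * the cache without an intermediate array on the stack. */
#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_memcpy.h>
#include <rte_mempool.h>

/* Reserve room for n objects directly in the default cache; return NULL if
 * the burst does not fit, so the caller falls back to the regular path. */
static inline void **
mempool_cache_zc_put_bulk_sketch(struct rte_mempool *mp, unsigned int n)
{
	struct rte_mempool_cache *cache =
		rte_mempool_default_cache(mp, rte_lcore_id());

	if (cache == NULL || cache->len + n > cache->flushthresh)
		return NULL;

	void **slot = &cache->objs[cache->len];
	cache->len += n;
	return slot;	/* caller writes mbuf pointers straight into the cache */
}

/* TX free path: one copy, SW ring -> mempool cache, no stack array. */
static inline void
tx_free_bufs_zc_sketch(struct rte_mempool *mp, struct rte_mbuf **sw_ring,
		       unsigned int n)
{
	void **dst = mempool_cache_zc_put_bulk_sketch(mp, n);

	if (dst != NULL)
		rte_memcpy(dst, sw_ring, n * sizeof(*sw_ring));
	else
		rte_mempool_put_bulk(mp, (void **)sw_ring, n);
}

Whether to fall back to rte_mempool_put_bulk() or flush part of the cache when
the burst does not fit is an open question; the sketch is only meant to anchor
the discussion.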

> 
> >
> > >
> > > Overall, I strongly agree that it is preferable to optimize the core
> > > libraries, rather than bypass them. Bypassing will eventually lead to
> > > "spaghetti code".
> > IMO, this is not "spaghetti code". There is no design rule in DPDK that says
> > the RX side must allocate buffers from a mempool or that the TX side must
> > free buffers to a mempool. This patch does not break any modular boundaries,
> > for example by accessing the internal details of another library.
> 
> I also have a few concerns about that approach:
> - The proposed implementation breaks the logical boundary between the RX and
>   TX code. Right now they co-exist independently, and the design of the TX
>   path doesn't directly affect the RX path and vice versa. With the proposed
>   approach, the RX path needs to be aware of TX queue details and the mbuf
>   freeing strategy. So if we decide to change the TX code, we probably would
>   not be able to do that without affecting the RX path.
Agree that the two paths will now be coupled in the areas you have mentioned. However, this happens within the driver code; from the application perspective, the two paths still remain separate. Also, the TX free strategy has not changed much.

>   That probably can be fixed by formalizing things a bit more, by introducing
>   a new dev-ops API:
>   eth_dev_tx_queue_free_mbufs(port id, queue id, mbufs_to_free[], ...)
>   But that would probably eat up a significant portion of the gain you are
>   seeing right now.
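
If I read this correctly, the formalized hook would look roughly like the
sketch below. Everything here is hypothetical - neither the callback type nor
the wrapper exists in ethdev today - and it is only meant to make the extra
indirection you mention visible.

#include <stdint.h>
#include <rte_mbuf.h>

/* Driver callback: take a burst of mbufs the RX side has finished with and
 * recycle/free them into the TX queue; returns the number consumed. */
typedef uint16_t (*eth_tx_queue_free_mbufs_t)(void *txq,
		struct rte_mbuf **mbufs_to_free, uint16_t nb_mbufs);

struct tx_free_hook {
	void *txq;				/* driver's TX queue object */
	eth_tx_queue_free_mbufs_t free_cb;	/* NULL if not supported */
};

/* Generic fallback when the driver does not implement the hook: free the
 * mbufs back to their mempool the normal way. */
static uint16_t
tx_queue_free_mbufs(struct tx_free_hook *hook,
		    struct rte_mbuf **mbufs_to_free, uint16_t nb_mbufs)
{
	if (hook->free_cb != NULL)
		return hook->free_cb(hook->txq, mbufs_to_free, nb_mbufs);

	rte_pktmbuf_free_bulk(mbufs_to_free, nb_mbufs);
	return nb_mbufs;
}
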
> 
> - Very limited usage scenario - it will have a positive effect only when we
>   have a fixed forwarding mapping: all (or nearly all) packets from the RX
>   queue are forwarded into the same TX queue.
Agree, it is limited to a few scenarios. But the scenario itself is a major one.

>   Even for l3fwd it doesn't look like a generic scenario.
I think it is possible to have some logic (based on the port mask and the routes involved) to enable this feature. We will try to add that in the next version.
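
Roughly, the kind of check we have in mind is sketched below (illustrative
only: the route table, the field names, and the enable decision are
assumptions, not l3fwd code).

#include <stdbool.h>
#include <stdint.h>

struct route_entry {
	uint16_t rx_port;	/* port the flow ingresses on */
	uint16_t tx_port;	/* port the route egresses on */
};

/* Enable direct re-arm only for an RX port whose configured routes all
 * resolve to one and the same enabled TX port. */
static bool
can_enable_direct_rearm(const struct route_entry *routes,
			unsigned int nb_routes, uint16_t rx_port,
			uint32_t enabled_port_mask, uint16_t *tx_port_out)
{
	int tx_port = -1;

	for (unsigned int i = 0; i < nb_routes; i++) {
		if (routes[i].rx_port != rx_port)
			continue;
		if (tx_port == -1)
			tx_port = routes[i].tx_port;
		else if (tx_port != routes[i].tx_port)
			return false;	/* more than one egress port */
	}

	/* 32-bit port mask, as in l3fwd's enabled_port_mask */
	if (tx_port < 0 || tx_port >= 32 ||
	    !((enabled_port_mask >> tx_port) & 1))
		return false;

	*tx_port_out = (uint16_t)tx_port;
	return true;
}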

> 
> - We effectively link the RX and TX queues - when this feature is enabled,
>   the user can't stop the TX queue without stopping the RX queue first.
Agree. How much of an issue is this? I would think when the application is shutting down, one would stop the RX side first. Are there any other scenarios we need to be aware of?

> 
> 

