Re: [dpdk-dev] [PATCH] vhost: add support to large linear mbufs

From: Flavio Leitner <fbl@sysclose.org>
To: Shahaf Shuler <shahafs@mellanox.com>
Cc: David Marchand <david.marchand@redhat.com>,
	"dev@dpdk.org" <dev@dpdk.org>,
	Maxime Coquelin <maxime.coquelin@redhat.com>,
	Tiwei Bie <tiwei.bie@intel.com>,
	Zhihong Wang <zhihong.wang@intel.com>,
	Obrembski MichalX <michalx.obrembski@intel.com>,
	Stokes Ian <ian.stokes@intel.com>
Subject: Re: [dpdk-dev] [PATCH] vhost: add support to large linear mbufs
Date: Wed, 2 Oct 2019 09:58:31 -0300	[thread overview]
Message-ID: <20191002095831.5927af93@p50.lan> (raw)
In-Reply-To: <AM0PR0502MB3795E94C2D0570A715525C3AC39C0@AM0PR0502MB3795.eurprd05.prod.outlook.com>

Hi Shahaf,

Thanks for looking into this, see my inline comments.

On Wed, 2 Oct 2019 09:00:11 +0000
Shahaf Shuler <shahafs@mellanox.com> wrote:

> Wednesday, October 2, 2019 11:05 AM, David Marchand:
> > Subject: Re: [dpdk-dev] [PATCH] vhost: add support to large linear
> > mbufs
> > 
> > Hello Shahaf,
> > 
> > On Wed, Oct 2, 2019 at 6:46 AM Shahaf Shuler <shahafs@mellanox.com>
> > wrote:  
> > >
> > > Wednesday, October 2, 2019 1:20 AM, Flavio Leitner:  
> > > > Subject: [dpdk-dev] [PATCH] vhost: add support to large linear
> > > > mbufs
> > > >
> > > > The rte_vhost_dequeue_burst supports two ways of dequeuing
> > > > data. If the data fits into a buffer, then all data is copied
> > > > and a single linear buffer is returned. Otherwise it allocates
> > > > additional mbufs and chains them together to return a multiple
> > > > segments mbuf.
> > > >
> > > > While that covers most use cases, it forces applications that
> > > > need to work with larger data sizes to support multiple
> > > > segments mbufs. The non-linear characteristic brings complexity
> > > > and performance implications to the application.
> > > >
> > > > To resolve the issue, change the API so that the application can
> > > > optionally provide a second mempool containing larger mbufs. If
> > > > that is not provided (NULL), the behavior remains as before the
> > > > change. Otherwise, the data size is checked and the
> > > > corresponding mempool is used to return linear mbufs.  
> > >
> > > I understand the motivation.
> > > However, providing a static pool w/ large buffers is not so
> > > efficient in terms  
> > of memory footprint. You will need to prepare to worst case (all
> > packet are large) w/ max size of 64KB.  
> > > Also, the two mempools are quite restrictive as the memory fill
> > > of the  
> > mbufs might be very sparse. E.g. mempool1 mbuf.size = 1.5K ,
> > mempool2 mbuf.size = 64K, packet size 4KB.  
> > >
> > > Instead, how about using the mbuf external buffer feature?
> > > The flow will be:
> > > 1. vhost PMD always receive a single mempool (like today) 2. on
> > > dequeue, PMD looks on the virtio packet size. If smaller than the
> > > mbuf size use the mbuf as is (like today) 3. otherwise, allocate
> > > a new buffer (inside the PMD) and link it to the mbuf as external
> > > buffer (rte_pktmbuf_attach_extbuf)  
> > 
> > I am missing some piece here.
> > Which pool would the PMD take those external buffers from?  
> 
> The mbuf is always taken from the single mempool associated w/ the
> rxq. The buffer for the mbuf may be allocated (in case virtio payload
> is bigger than current mbuf size) from DPDK hugepages or any other
> system memory and be attached to the mbuf.
> 
> You can see example implementation of it in mlx5 PMD (checkout
> rte_pktmbuf_attach_extbuf call)

Thanks, I wasn't aware of external buffers.

I see that attaching external buffers of the correct size would be more
efficient in terms of saving memory/avoiding sparsing.

However, we still need to be prepared to the worse case scenario (all
packets 64K), so that doesn't help with the total memory required.

The current patch pushes the decision to the application which knows
better the workload. If more memory is available, it can optionally use
large buffers, otherwise just don't pass that. Or even decide whether
to share the same 64K mempool between multiple vhost ports or use one
mempool per port.

Perhaps I missed something, but managing memory with mempool still
require us to have buffers of 64K regardless if the data consumes less
space. Otherwise the application or the PMD will have to manage memory
itself.

If we let the PMD manages the memory, what happens if a port/queue
is closed and one or more buffers are still in use (switching)? I don't
see how to solve this cleanly.

fbl

> 
> > 
> > If it is from an additional mempool passed to the vhost pmd, I
> > can't see the difference with Flavio proposal.
> > 
> >   
> > > The pros of this approach is that you have full flexibility on
> > > the memory  
> > allocation, and therefore a lower footprint.  
> > > The cons is the OVS will need to know how to handle mbuf w/
> > > external  
> > buffers (not too complex IMO).
> > 
> > 
> > --
> > David Marchand