From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Ananyev, Konstantin"
Subject: Re: [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection
Date: Wed, 11 Apr 2018 11:39:47 +0000
Message-ID: <2601191342CEEE43887BDE71AB977258AE913944@IRSMSX102.ger.corp.intel.com>
References: <20180310012532.15809-1-yskoh@mellanox.com>
 <20180402185008.13073-1-yskoh@mellanox.com>
 <20180402185008.13073-2-yskoh@mellanox.com>
 <20180403082615.etnr33cuyey7i3u3@platinum>
 <20180404001205.GB1867@yongseok-MBP.local>
 <20180409160434.kmw4iyztemrkzmtc@platinum>
 <20180410015902.GA20627@yongseok-MBP.local>
 <2601191342CEEE43887BDE71AB977258AE91344A@IRSMSX102.ger.corp.intel.com>
 <20180411053302.GA26252@yongseok-MBP.local>
In-Reply-To: <20180411053302.GA26252@yongseok-MBP.local>
To: Yongseok Koh
Cc: Olivier Matz, "Lu, Wenzhuo", "Wu, Jingjing", Adrien Mazarguil,
 Nélio Laranjeiro, dev@dpdk.org
List-Id: DPDK patches and discussions

Hi Yongseok,

> > > On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:
> > > > Hi Yongseok,
> > > >
> > > > On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
> > > > > On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> > > > > > Hi,
> > > > > >
> > > > > > On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > > > > > > When attaching a mbuf, indirect mbuf has to point to start of buffer of
> > > > > > > direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
> > > > > > > flexible. Indirect mbuf can point to any part of direct mbuf by calling
> > > > > > > rte_pktmbuf_attach_at().
> > > > > > >
> > > > > > > Possible use-cases could be:
> > > > > > > - If a packet has multiple layers of encapsulation, multiple indirect
> > > > > > >   buffers can reference different layers of the encapsulated packet.
> > > > > > > - A large direct mbuf can even contain multiple packets in series and
> > > > > > >   each packet can be referenced by multiple mbuf indirections.
> > > > > > >
> > > > > > > Signed-off-by: Yongseok Koh
> > > > > >
> > > > > > I think the current API is already able to do what you want.
> > > > > >
> > > > > > 1/ Here is a mbuf m with its data
> > > > > >
> > > > > >           off
> > > > > >           <-->
> > > > > >                     len
> > > > > >          +----+ <---------->
> > > > > >          |    |
> > > > > >        +-|----v----------------------+
> > > > > >        | |    -----------------------|
> > > > > >      m | buf  |  XXXXXXXXXXX        ||
> > > > > >        |      -----------------------|
> > > > > >        +-----------------------------+
> > > > > >
> > > > > > 2/ clone m:
> > > > > >
> > > > > >   c = rte_pktmbuf_alloc(pool);
> > > > > >   rte_pktmbuf_attach(c, m);
> > > > > >
> > > > > >   Note that c has its own offset and length fields.
> > > > > >
> > > > > >           off
> > > > > >           <-->
> > > > > >                     len
> > > > > >          +----+ <---------->
> > > > > >          |    |
> > > > > >        +-|----v----------------------+
> > > > > >        | |    -----------------------|
> > > > > >      m | buf  |  XXXXXXXXXXX        ||
> > > > > >        |      -----------------------|
> > > > > >        +------^----------------------+
> > > > > >               |
> > > > > >          +----+
> > > > > > indirect |
> > > > > >        +-|---------------------------+
> > > > > >        | |    -----------------------|
> > > > > >      c | buf  |                     ||
> > > > > >        |      -----------------------|
> > > > > >        +-----------------------------+
> > > > > >
> > > > > >               off        len
> > > > > >               <--><---------->
> > > > > >
> > > > > > 3/ remove some data from c without changing m
> > > > > >
> > > > > >   rte_pktmbuf_adj(c, 10)   // at head
> > > > > >   rte_pktmbuf_trim(c, 10)  // at tail
> > > > > >
> > > > > > Please let me know if it fits your needs.
> > > > >
> > > > > No, it doesn't.
> > > > >
> > > > > Trimming head and tail with the current APIs removes data and makes the
> > > > > space available. Adjusting packet head means giving more headroom, not
> > > > > shifting the buffer itself. If m has two indirect mbufs (c1 and c2) and
> > > > > those are pointing to different offsets in m,
> > > > >
> > > > > rte_pktmbuf_adj(c1, 10);
> > > > > rte_pktmbuf_adj(c2, 20);
> > > > >
> > > > > then the owner of c2 regards the first (off+20)B as available headroom. If
> > > > > it wants to attach an outer header, it will overwrite the headroom even
> > > > > though the owner of c1 is still accessing it. Instead, another mbuf (h1)
> > > > > for the outer header should be linked by h1->next = c2.
> > > >
> > > > Yes, after these operations c1, c2 and m should become read-only. So, to
> > > > prepend headers, another mbuf has to be inserted before as you suggest. It
> > > > is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
> > > > length) that will:
> > > > - alloc and attach indirect mbuf for each segment of m that is
> > > >   in the range [offset : length+offset].
> > > > - prepend an empty and writable mbuf for the headers
> > > >
> > > > > If c1 and c2 are attached with shifting buffer address by adjusting
> > > > > buf_off, which actually shrinks the headroom, this case can be properly
> > > > > handled.
> > > >
> > > > What do you mean by properly handled?
> > > >
> > > > Yes, prepending data or adding data in the indirect mbuf won't override
> > > > the direct mbuf. But prepending data or adding data in the direct mbuf m
> > > > won't be protected.
> > > >
> > > > From an application point of view, indirect mbufs, or direct mbufs that
> > > > have refcnt != 1, should both be considered as read-only because they
> > > > may share their data. How can an application know if the data is shared
> > > > or not?
> > > >
> > > > Maybe we need a flag to differentiate mbufs that are read-only
> > > > (something like SHARED_DATA, or simply READONLY). In your case, if my
> > > > understanding is correct, you want to have indirect mbufs with RW data.
> > >
> > > Agree that indirect mbuf must be treated as read-only. Then the current code
> > > is enough to handle that use-case.
> > >
> > > > > And another use-case (this is my actual use-case) is to make a large mbuf
> > > > > have multiple packets in series.
> > > > > AFAIK, this will also be helpful for some FPGA NICs because it transfers
> > > > > multiple packets to a single large buffer to reduce PCIe overhead for
> > > > > small packet traffic like the Multi-Packet Rx of mlx5 does. Otherwise,
> > > > > packets should be memcpy'd to regular mbufs one by one instead of
> > > > > indirect referencing.
>
> > But just to make HW RX multiple packets into one mbuf,
> > data_off inside indirect mbuf should be enough, correct?
>
> Right. Current max buffer len of mbuf is 64kB (16 bits) but it is enough for
> mlx5 to reach 100Gbps with 64B traffic (149Mpps). I made mlx5 HW put 16
> packets in a buffer. So, it needs a ~32kB buffer. Having more bits in length
> fields would be better but 16-bit is good enough to overcome the PCIe Gen3
> bottleneck in order to saturate the network link.

There were a few complaints that the 64KB max is a limitation for some
use-cases. I am not against increasing it, but I don't think we have free
space on the first cache-line for that without another big rework of the mbuf
layout, considering that we would need to increase the size of buf_len,
data_off, data_len, and probably priv_size too.

>
> > As I understand, what you'd like to achieve with this new field is the
> > ability to manipulate packet boundaries after RX, probably at an upper
> > layer. As Olivier pointed out above, that doesn't sound like a safe
> > approach, as you have multiple indirect mbufs trying to modify the same
> > direct buffer.
>
> I agree that there's an implication that an indirect mbuf, or an mbuf having
> refcnt > 1, is read-only. What that means is that all the entities which own
> such mbufs have to be aware of that and keep the principle, as DPDK can't
> enforce the rule and there can't be such a sanity check. In this sense, HW
> doesn't violate it because the direct mbuf is injected to HW before
> indirection. When packets are written by HW, the PMD attaches indirect mbufs
> to the direct mbuf and delivers those to the application layer while freeing
> the original direct mbuf (decrementing refcnt by 1). So, HW doesn't touch the
> direct buffer once it reaches the upper layer.

Yes, I understand that. But as far as I can see, you introduced functions to
adjust head and tail, which implies that it should be possible for some entity
(upper layer?) to manipulate these indirect mbufs. And we don't know how
exactly it will be done.

> The direct buffer will be freed and become available for reuse when all the
> attached indirect mbufs are freed.
>
> > Though if you really need to do that, couldn't it be achieved by updating
> > the buf_len and priv_size fields for indirect mbufs, straight after
> > attach()?
>
> Good point.
> Actually that was my draft (Mellanox internal) version of this patch :-) But
> I had to consider a case where priv_size is really given by the user. Even
> though it is less likely, if the original priv_size is quite big, it can't
> cover the entire buf_len. For this, I had to increase priv_size to 32-bit,
> but adding another 16-bit field (buf_off) looked more plausible.

As I remember, we can't have mbufs bigger than 64K, so priv_size + buf_len
should always be less than 64K, correct? (A rough sketch of the fix-up I am
thinking of follows at the end of this mail.)

Konstantin

> Thanks for good comments,
> Yongseok
>
> > > > >
> > > > > Does this make sense?
> > > >
> > > > I understand the need.
> > > >
> > > > Another option would be to make the mbuf->buffer point to an external
> > > > buffer (not inside the direct mbuf). This would require adding a
> > > > mbuf->free_cb.
> > > > See "Mbuf with external data buffer" (page 19) in [1] for a quick
> > > > overview.
> > > >
> > > > [1]
> > > > https://dpdksummit.com/Archive/pdf/2016Userspace/Day01-Session05-OlivierMatz-Userspace2016.pdf
> > > >
> > > > The advantage is that it does not require the large data to be inside a
> > > > mbuf (requiring a mbuf structure before the buffer, and requiring to be
> > > > allocated from a mempool). On the other hand, it is maybe more complex
> > > > to implement compared to your solution.
> > >
> > > I knew that you presented the slides and frankly, I had considered that
> > > option at first. But even with that option, metadata to store refcnt should
> > > also be allocated and managed anyway. The kernel also maintains the
> > > skb_shared_info at the end of the data segment. Even though it could have a
> > > smaller metadata structure, I just wanted to make full use of the existing
> > > framework because it is less complex, as you mentioned. Given that you
> > > presented the idea of an external data buffer in 2016 and there haven't
> > > been many follow-up discussions/activities so far, I thought the demand
> > > isn't so big yet, thus I wanted to make this patch simpler. I personally
> > > think that we can take the idea of an external data seg when more demand
> > > comes from users in the future, as it would be a huge change and may break
> > > the current ABI/API. When the day comes, I'll gladly participate in the
> > > discussions and write code for it if I can be helpful.
> > >
> > > Do you think this patch is okay for now?
> > >
> > > Thanks for your comments,
> > > Yongseok
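
For reference, here is a rough sketch of the buf_len/priv_size fix-up
mentioned above, done right after rte_pktmbuf_attach(). It assumes the
current (18.02) mbuf layout; the helper name and the exact set of fields
touched are illustrative assumptions, not an existing DPDK API:

#include <rte_mbuf.h>

/*
 * Hypothetical helper (sketch only, not a DPDK API): attach 'mi' to an
 * interior region of 'm's buffer, starting 'off' bytes past m->buf_addr.
 */
static inline int
pktmbuf_attach_at(struct rte_mbuf *mi, struct rte_mbuf *m, uint16_t off)
{
        if (off >= m->buf_len)
                return -1;

        rte_pktmbuf_attach(mi, m);

        /* Shift the attached buffer start by 'off' bytes. */
        mi->buf_addr = (char *)mi->buf_addr + off;
        mi->buf_iova += off;
        mi->buf_len -= off;

        /*
         * rte_mbuf_from_indirect() recovers the direct mbuf as
         * buf_addr - sizeof(struct rte_mbuf) - priv_size, so the shift of
         * buf_addr has to be compensated in priv_size. This is where the
         * 16-bit priv_size limit discussed above comes in: priv_size + off
         * must stay below 64K.
         */
        mi->priv_size += off;

        /* Start with an empty data area at the new buffer start. */
        mi->data_off = 0;
        mi->data_len = 0;
        mi->pkt_len = 0;

        return 0;
}

With that compensation, rte_pktmbuf_detach() and the free path should still
locate the correct direct mbuf through rte_mbuf_from_indirect(), without
adding any new field to struct rte_mbuf.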