All of lore.kernel.org
 help / color / mirror / Atom feed
* mbuf changes
@ 2016-10-24 15:49 Morten Brørup
  2016-10-24 16:11 ` Wiles, Keith
  2016-10-25 13:14 ` Olivier Matz
  0 siblings, 2 replies; 46+ messages in thread
From: Morten Brørup @ 2016-10-24 15:49 UTC (permalink / raw)
  To: dev; +Cc: Olivier Matz

First of all: Thanks for a great DPDK Userspace 2016!

 

Continuing the Userspace discussion about Olivier Matz’s proposed mbuf changes...

 

1.

Stephen Hemminger had a noteworthy general comment about keeping metadata for the NIC in the appropriate section of the mbuf: Metadata generated by the NIC’s RX handler belongs in the first cache line, and metadata required by the NIC’s TX handler belongs in the second cache line. This also means that touching the second cache line on ingress should be avoided if possible; and Bruce Richardson mentioned that for this reason m->next was zeroed on free().

 

2.

There seemed to be consensus that the size of m->refcnt should match the size of m->port because a packet could be duplicated on all physical ports for L3 multicast and L2 flooding.

Furthermore, although a single physical machine (i.e. a single server) with 255 physical ports probably doesn’t exist, it might contain more than 255 virtual machines with a virtual port each, so it makes sense extending these mbuf fields from 8 to 16 bits.

 

3.

Someone (Bruce Richardson?) suggested moving m->refcnt and m->port to the second cache line, which then generated questions from the audience about the real life purpose of m->port, and if m->port could be removed from the mbuf structure.

 

4.

I suggested using offset -1 for m->refcnt, so m->refcnt becomes 0 on first allocation. This is based on the assumption that other mbuf fields must be zeroed at alloc()/free() anyway, so zeroing m->refcnt is cheaper than setting it to 1.

Furthermore (regardless of m->refcnt offset), I suggested that it is not required to modify m->refcnt when allocating and freeing the mbuf, thus saving one write operation on both alloc() and free(). However, this assumes that m->refcnt debugging, e.g. underrun detection, is not required.

 

5.

And here’s something new to think about:

m->next already reveals if there are more segments to a packet. Which purpose does m->nb_segs serve that is not already covered by m->next?

 

 

Med venlig hilsen / kind regards

 

Morten Brørup

CTO

 

 

SmartShare Systems A/S

Tonsbakken 16-18

DK-2740 Skovlunde

Denmark

 

Office      +45 70 20 00 93

Direct      +45 89 93 50 22

Mobile     +45 25 40 82 12

 

mb@smartsharesystems.com <mailto:mb@smartsharesystems.com> 

www.smartsharesystems.com <https://www.smartsharesystems.com/> 

 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-24 15:49 mbuf changes Morten Brørup
@ 2016-10-24 16:11 ` Wiles, Keith
  2016-10-24 16:25   ` Bruce Richardson
  2016-10-25 13:14 ` Olivier Matz
  1 sibling, 1 reply; 46+ messages in thread
From: Wiles, Keith @ 2016-10-24 16:11 UTC (permalink / raw)
  To: Morten Brørup; +Cc: dev, Olivier Matz


> On Oct 24, 2016, at 10:49 AM, Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> First of all: Thanks for a great DPDK Userspace 2016!
> 
> 
> 
> Continuing the Userspace discussion about Olivier Matz’s proposed mbuf changes...
> 
> 
> 
> 1.
> 
> Stephen Hemminger had a noteworthy general comment about keeping metadata for the NIC in the appropriate section of the mbuf: Metadata generated by the NIC’s RX handler belongs in the first cache line, and metadata required by the NIC’s TX handler belongs in the second cache line. This also means that touching the second cache line on ingress should be avoided if possible; and Bruce Richardson mentioned that for this reason m->next was zeroed on free().
> 
> 
> 
> 2.
> 
> There seemed to be consensus that the size of m->refcnt should match the size of m->port because a packet could be duplicated on all physical ports for L3 multicast and L2 flooding.
> 
> Furthermore, although a single physical machine (i.e. a single server) with 255 physical ports probably doesn’t exist, it might contain more than 255 virtual machines with a virtual port each, so it makes sense extending these mbuf fields from 8 to 16 bits.

I thought we also talked about removing the m->port from the mbuf as it is not really needed.

> 
> 
> 
> 3.
> 
> Someone (Bruce Richardson?) suggested moving m->refcnt and m->port to the second cache line, which then generated questions from the audience about the real life purpose of m->port, and if m->port could be removed from the mbuf structure.
> 
> 
> 
> 4.
> 
> I suggested using offset -1 for m->refcnt, so m->refcnt becomes 0 on first allocation. This is based on the assumption that other mbuf fields must be zeroed at alloc()/free() anyway, so zeroing m->refcnt is cheaper than setting it to 1.
> 
> Furthermore (regardless of m->refcnt offset), I suggested that it is not required to modify m->refcnt when allocating and freeing the mbuf, thus saving one write operation on both alloc() and free(). However, this assumes that m->refcnt debugging, e.g. underrun detection, is not required.
> 
> 
> 
> 5.
> 
> And here’s something new to think about:
> 
> m->next already reveals if there are more segments to a packet. Which purpose does m->nb_segs serve that is not already covered by m->next?
> 
> 
> 
> 
> 
> Med venlig hilsen / kind regards
> 
> 
> 
> Morten Brørup
> 
> CTO
> 
> 
> 
> 
> 
> SmartShare Systems A/S
> 
> Tonsbakken 16-18
> 
> DK-2740 Skovlunde
> 
> Denmark
> 
> 
> 
> Office      +45 70 20 00 93
> 
> Direct      +45 89 93 50 22
> 
> Mobile     +45 25 40 82 12
> 
> 
> 
> mb@smartsharesystems.com <mailto:mb@smartsharesystems.com> 
> 
> www.smartsharesystems.com <https://www.smartsharesystems.com/> 
> 
> 
> 

Regards,
Keith


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-24 16:11 ` Wiles, Keith
@ 2016-10-24 16:25   ` Bruce Richardson
  2016-10-24 21:47     ` Morten Brørup
                       ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: Bruce Richardson @ 2016-10-24 16:25 UTC (permalink / raw)
  To: Wiles, Keith; +Cc: Morten Brørup, dev, Olivier Matz

On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
> 
> > On Oct 24, 2016, at 10:49 AM, Morten Brørup <mb@smartsharesystems.com> wrote:
> > 
> > First of all: Thanks for a great DPDK Userspace 2016!
> > 
> > 
> > 
> > Continuing the Userspace discussion about Olivier Matz’s proposed mbuf changes...

Thanks for keeping the discussion going!
> > 
> > 
> > 
> > 1.
> > 
> > Stephen Hemminger had a noteworthy general comment about keeping metadata for the NIC in the appropriate section of the mbuf: Metadata generated by the NIC’s RX handler belongs in the first cache line, and metadata required by the NIC’s TX handler belongs in the second cache line. This also means that touching the second cache line on ingress should be avoided if possible; and Bruce Richardson mentioned that for this reason m->next was zeroed on free().
> > 
Thinking about it, I suspect there are more fields we can reset on free
to save time on alloc. Refcnt, as discussed below is one of them, but so
too could be the nb_segs field and possibly others.

> > 
> > 
> > 2.
> > 
> > There seemed to be consensus that the size of m->refcnt should match the size of m->port because a packet could be duplicated on all physical ports for L3 multicast and L2 flooding.
> > 
> > Furthermore, although a single physical machine (i.e. a single server) with 255 physical ports probably doesn’t exist, it might contain more than 255 virtual machines with a virtual port each, so it makes sense extending these mbuf fields from 8 to 16 bits.
> 
> I thought we also talked about removing the m->port from the mbuf as it is not really needed.
> 
Yes, this was mentioned, and also the option of moving the port value to
the second cacheline, but it appears that NXP are using the port value
in their NIC drivers for passing in metadata, so we'd need their
agreement on any move (or removal).

> > 
> > 
> > 
> > 3.
> > 
> > Someone (Bruce Richardson?) suggested moving m->refcnt and m->port to the second cache line, which then generated questions from the audience about the real life purpose of m->port, and if m->port could be removed from the mbuf structure.
> > 
> > 
> > 
> > 4.
> > 
> > I suggested using offset -1 for m->refcnt, so m->refcnt becomes 0 on first allocation. This is based on the assumption that other mbuf fields must be zeroed at alloc()/free() anyway, so zeroing m->refcnt is cheaper than setting it to 1.
> > 
> > Furthermore (regardless of m->refcnt offset), I suggested that it is not required to modify m->refcnt when allocating and freeing the mbuf, thus saving one write operation on both alloc() and free(). However, this assumes that m->refcnt debugging, e.g. underrun detection, is not required.

I don't think it really matters what sentinal value is used for the
refcnt because it can't be blindly assigned on free like other fields.
However, I think 0 as first reference value becomes more awkward
than 1, because we need to deal with underflow. Consider the situation
where we have two references to the mbuf, so refcnt is 1, and both are
freed at the same time. Since the refcnt is not-zero, then both cores
will do an atomic decrement simultaneously giving a refcnt of -1. We can
then set this back to zero before freeing, however, I'd still prefer to
have refcnt be an accurate value so that it always stays positive, and
we can still set it to "one" on free to avoid having to set on alloc.

Also, if we set refcnt on free rather than alloc, it does set itself up
as a good candidate for moving to the second cacheline. Fast-path
processing does not normally update the value.

> > 
> > 
> > 
> > 5.
> > 
> > And here’s something new to think about:
> > 
> > m->next already reveals if there are more segments to a packet. Which purpose does m->nb_segs serve that is not already covered by m->next?

It is duplicate info, but nb_segs can be used to check the validity of
the next pointer without having to read the second mbuf cacheline.

Whether it's worth having is something I'm happy enough to discuss,
though.

One other point I'll mention is that we need to have a discussion on
how/where to add in a timestamp value into the mbuf. Personally, I think
it can be in a union with the sequence number value, but I also suspect
that 32-bits of a timestamp is not going to be enough for many.

Thoughts?

/Bruce

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-24 16:25   ` Bruce Richardson
@ 2016-10-24 21:47     ` Morten Brørup
  2016-10-25  8:53       ` Bruce Richardson
  2016-10-25  9:39     ` Adrien Mazarguil
  2016-10-25 11:54     ` Shreyansh Jain
  2 siblings, 1 reply; 46+ messages in thread
From: Morten Brørup @ 2016-10-24 21:47 UTC (permalink / raw)
  To: Bruce Richardson, Wiles, Keith; +Cc: dev, Olivier Matz

Comments inline.

Med venlig hilsen / kind regards
- Morten Brørup

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Monday, October 24, 2016 6:26 PM
> To: Wiles, Keith
> Cc: Morten Brørup; dev@dpdk.org; Olivier Matz
> Subject: Re: [dpdk-dev] mbuf changes
> 
> On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
> >
> > > On Oct 24, 2016, at 10:49 AM, Morten Brørup <mb@smartsharesystems.com>
> wrote:
> > >
> > > First of all: Thanks for a great DPDK Userspace 2016!
> > >
> > >
> > >
> > > Continuing the Userspace discussion about Olivier Matz’s proposed mbuf
> changes...
> 
> Thanks for keeping the discussion going!
> > >
> > >
> > >
> > > 1.
> > >
> > > Stephen Hemminger had a noteworthy general comment about keeping
> metadata for the NIC in the appropriate section of the mbuf: Metadata
> generated by the NIC’s RX handler belongs in the first cache line, and
> metadata required by the NIC’s TX handler belongs in the second cache line.
> This also means that touching the second cache line on ingress should be
> avoided if possible; and Bruce Richardson mentioned that for this reason m-
> >next was zeroed on free().
> > >
> Thinking about it, I suspect there are more fields we can reset on free
> to save time on alloc. Refcnt, as discussed below is one of them, but so
> too could be the nb_segs field and possibly others.
Yes. Consider the use of rte_pktmbuf_reset() or add a rte_pktmbuf_prealloc() for this purpose.


> > > 2.
> > >
> > > There seemed to be consensus that the size of m->refcnt should match
> the size of m->port because a packet could be duplicated on all physical
> ports for L3 multicast and L2 flooding.
> > >
> > > Furthermore, although a single physical machine (i.e. a single server)
> with 255 physical ports probably doesn’t exist, it might contain more than
> 255 virtual machines with a virtual port each, so it makes sense extending
> these mbuf fields from 8 to 16 bits.
> >
> > I thought we also talked about removing the m->port from the mbuf as it
> is not really needed.
> >
> Yes, this was mentioned, and also the option of moving the port value to
> the second cacheline, but it appears that NXP are using the port value
> in their NIC drivers for passing in metadata, so we'd need their
> agreement on any move (or removal).
> 
If a single driver instance services multiple ports, so the ports are not polled individually, the m->port member will be required to identify the port. In that case, it might also used elsewhere in the ingress path, and thus appropriate having in the first cache line. So yes, this needs further discussion with NXP.


> > > 3.
> > >
> > > Someone (Bruce Richardson?) suggested moving m->refcnt and m->port to
> the second cache line, which then generated questions from the audience
> about the real life purpose of m->port, and if m->port could be removed
> from the mbuf structure.
> > >
> > >
> > >
> > > 4.
> > >
> > > I suggested using offset -1 for m->refcnt, so m->refcnt becomes 0 on
> first allocation. This is based on the assumption that other mbuf fields
> must be zeroed at alloc()/free() anyway, so zeroing m->refcnt is cheaper
> than setting it to 1.
> > >
> > > Furthermore (regardless of m->refcnt offset), I suggested that it is
> not required to modify m->refcnt when allocating and freeing the mbuf, thus
> saving one write operation on both alloc() and free(). However, this
> assumes that m->refcnt debugging, e.g. underrun detection, is not required.
> 
> I don't think it really matters what sentinal value is used for the
> refcnt because it can't be blindly assigned on free like other fields.
> However, I think 0 as first reference value becomes more awkward
> than 1, because we need to deal with underflow. Consider the situation
> where we have two references to the mbuf, so refcnt is 1, and both are
> freed at the same time. Since the refcnt is not-zero, then both cores
> will do an atomic decrement simultaneously giving a refcnt of -1. We can
> then set this back to zero before freeing, however, I'd still prefer to
> have refcnt be an accurate value so that it always stays positive, and
> we can still set it to "one" on free to avoid having to set on alloc.
> 
Good example. It explains very clearly why my suggested optimization is tricky. My optimization is targeting the most likely scenario where m->refcnt is only "one", but at the expense of some added complexity for the unlikely scenarios. (I only suggested 0 as "one" because I assumed the value could be zeroed faster than set to 1, so forget about that, if you like.)

I still argue that if we keep m->refcnt at "one" value while in the free pool, a write operation can be avoided at both free() and alloc(). Here is a conceptual example (using m->refcnt==1 for "one"):

void __rte_mbuf_raw_free(struct rte_mbuf *m)
{
	RTE_ASSERT(rte_mbuf_refcnt_read(m) == 1);
	rte_mempool_put(m->pool, m);
}

void __rte_pktmbuf_free_seg_inner(struct rte_mbuf *m)
{
	/* if this is an indirect mbuf, it is detached. */
	if (RTE_MBUF_INDIRECT(m))
		rte_pktmbuf_detach(m);
	m->next = NULL; /* Preload to avoid setting in alloc(). */
	__rte_mbuf_raw_free(m);
}

void rte_pktmbuf_free_seg(struct rte_mbuf *m)
{
	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
		/* We have the last reference, so we return it to the free pool. */
		__rte_pktmbuf_free_seg_inner(m);
	} else {
		/* More references, so decrement the refcnt. */
		if (unlikely(rte_mbuf_refcnt_update(m, -1) == 0)) {
			/* Someone raced to decrement the refcnt before us,
			   so now we have the last reference,
			   so we return it to the free pool. */
		    rte_mbuf_refcnt_set(m, 1); /* Preload to avoid in alloc(). */
			__rte_pktmbuf_free_seg_inner(m);
		}
	}
}


> Also, if we set refcnt on free rather than alloc, it does set itself up
> as a good candidate for moving to the second cacheline. Fast-path
> processing does not normally update the value.
> 
Agreed. But also consider that TX handling should be optimized too. And this includes free() processing and preloading values for alloc(); this should only be done in free() instead of alloc() if it provides a total performance improvement. We should not improve RX handling performance if the TX handling performance decreases much more.


> > > 5.
> > >
> > > And here’s something new to think about:
> > >
> > > m->next already reveals if there are more segments to a packet. Which
> purpose does m->nb_segs serve that is not already covered by m->next?
> 
> It is duplicate info, but nb_segs can be used to check the validity of
> the next pointer without having to read the second mbuf cacheline.
> 
> Whether it's worth having is something I'm happy enough to discuss,
> though.
If anybody needs to check for additional segments (m->next != NULL) without hitting the second cache line, a single bit in the first cache line will suffice. (If this is a common case, they probably need to read the m->next value anyway, and then it would optimally belong in the first cache line. I am just mentioning this. It's not my preference.)


> One other point I'll mention is that we need to have a discussion on
> how/where to add in a timestamp value into the mbuf. Personally, I think
> it can be in a union with the sequence number value, but I also suspect
> that 32-bits of a timestamp is not going to be enough for many.
> 
> Thoughts?
Speaking for myself, a high resolution RX timestamp is "nice to have", so I like it, but I would hate to sacrifice the sequence number for it. Generally, I would assume that the sequence number used for reordering is more important than a timestamp, except for applications dealing with timing measurements or high precision packet capturing.

Conversion of timestamps from one unit to another can be relatively expensive, so I recommend keeping timestamps in whatever resolution the NIC hardware provides, leaving the conversion to e.g. nanoseconds up to the applications that use timestamps. A lot of protocols use opaque timestamps, e.g. RTP, so this would not be exceptional. Since the timestamp belongs in the first cache line, I suggest we limit it to 32 bit. This can be the least significant 32 bit of a 64 bit NIC hardware timestamp, leaving it up to the application to keep track of the most significant 32 bit by some other means. If it's counting nanoseconds, 32 bit can hold 4 seconds.


> 
> /Bruce


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-24 21:47     ` Morten Brørup
@ 2016-10-25  8:53       ` Bruce Richardson
  2016-10-25 10:02         ` Ananyev, Konstantin
  0 siblings, 1 reply; 46+ messages in thread
From: Bruce Richardson @ 2016-10-25  8:53 UTC (permalink / raw)
  To: Morten Brørup; +Cc: Wiles, Keith, dev, Olivier Matz

On Mon, Oct 24, 2016 at 11:47:16PM +0200, Morten Brørup wrote:
> Comments inline.
> 
> Med venlig hilsen / kind regards
> - Morten Brørup
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > Sent: Monday, October 24, 2016 6:26 PM
> > To: Wiles, Keith
> > Cc: Morten Brørup; dev@dpdk.org; Olivier Matz
> > Subject: Re: [dpdk-dev] mbuf changes
> > 
> > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
> > >
> > > > On Oct 24, 2016, at 10:49 AM, Morten Brørup <mb@smartsharesystems.com>
> > wrote:
> > > >
> > > > First of all: Thanks for a great DPDK Userspace 2016!
> > > >
> > > >
> > > >
> > > > Continuing the Userspace discussion about Olivier Matz’s proposed mbuf
> > changes...
> > 
> > Thanks for keeping the discussion going!
> > > >
> > > >
> > > >
> > > > 1.
> > > >
> > > > Stephen Hemminger had a noteworthy general comment about keeping
> > metadata for the NIC in the appropriate section of the mbuf: Metadata
> > generated by the NIC’s RX handler belongs in the first cache line, and
> > metadata required by the NIC’s TX handler belongs in the second cache line.
> > This also means that touching the second cache line on ingress should be
> > avoided if possible; and Bruce Richardson mentioned that for this reason m-
> > >next was zeroed on free().
> > > >
> > Thinking about it, I suspect there are more fields we can reset on free
> > to save time on alloc. Refcnt, as discussed below is one of them, but so
> > too could be the nb_segs field and possibly others.
> Yes. Consider the use of rte_pktmbuf_reset() or add a rte_pktmbuf_prealloc() for this purpose.
> 
Possibly. We just want to make sure that we don't reset too much either!
:-)

> 
> > > > 2.
> > > >
> > > > There seemed to be consensus that the size of m->refcnt should match
> > the size of m->port because a packet could be duplicated on all physical
> > ports for L3 multicast and L2 flooding.
> > > >
> > > > Furthermore, although a single physical machine (i.e. a single server)
> > with 255 physical ports probably doesn’t exist, it might contain more than
> > 255 virtual machines with a virtual port each, so it makes sense extending
> > these mbuf fields from 8 to 16 bits.
> > >
> > > I thought we also talked about removing the m->port from the mbuf as it
> > is not really needed.
> > >
> > Yes, this was mentioned, and also the option of moving the port value to
> > the second cacheline, but it appears that NXP are using the port value
> > in their NIC drivers for passing in metadata, so we'd need their
> > agreement on any move (or removal).
> > 
> If a single driver instance services multiple ports, so the ports are not polled individually, the m->port member will be required to identify the port. In that case, it might also used elsewhere in the ingress path, and thus appropriate having in the first cache line. So yes, this needs further discussion with NXP.
> 
> 
> > > > 3.
> > > >
> > > > Someone (Bruce Richardson?) suggested moving m->refcnt and m->port to
> > the second cache line, which then generated questions from the audience
> > about the real life purpose of m->port, and if m->port could be removed
> > from the mbuf structure.
> > > >
> > > >
> > > >
> > > > 4.
> > > >
> > > > I suggested using offset -1 for m->refcnt, so m->refcnt becomes 0 on
> > first allocation. This is based on the assumption that other mbuf fields
> > must be zeroed at alloc()/free() anyway, so zeroing m->refcnt is cheaper
> > than setting it to 1.
> > > >
> > > > Furthermore (regardless of m->refcnt offset), I suggested that it is
> > not required to modify m->refcnt when allocating and freeing the mbuf, thus
> > saving one write operation on both alloc() and free(). However, this
> > assumes that m->refcnt debugging, e.g. underrun detection, is not required.
> > 
> > I don't think it really matters what sentinal value is used for the
> > refcnt because it can't be blindly assigned on free like other fields.
> > However, I think 0 as first reference value becomes more awkward
> > than 1, because we need to deal with underflow. Consider the situation
> > where we have two references to the mbuf, so refcnt is 1, and both are
> > freed at the same time. Since the refcnt is not-zero, then both cores
> > will do an atomic decrement simultaneously giving a refcnt of -1. We can
> > then set this back to zero before freeing, however, I'd still prefer to
> > have refcnt be an accurate value so that it always stays positive, and
> > we can still set it to "one" on free to avoid having to set on alloc.
> > 
> Good example. It explains very clearly why my suggested optimization is tricky. My optimization is targeting the most likely scenario where m->refcnt is only "one", but at the expense of some added complexity for the unlikely scenarios. (I only suggested 0 as "one" because I assumed the value could be zeroed faster than set to 1, so forget about that, if you like.)
> 
> I still argue that if we keep m->refcnt at "one" value while in the free pool, a write operation can be avoided at both free() and alloc(). Here is a conceptual example (using m->refcnt==1 for "one"):
> 
Yes, I agree with you on this.

> void __rte_mbuf_raw_free(struct rte_mbuf *m)
> {
> 	RTE_ASSERT(rte_mbuf_refcnt_read(m) == 1);
> 	rte_mempool_put(m->pool, m);
> }
> 
> void __rte_pktmbuf_free_seg_inner(struct rte_mbuf *m)
> {
> 	/* if this is an indirect mbuf, it is detached. */
> 	if (RTE_MBUF_INDIRECT(m))
> 		rte_pktmbuf_detach(m);
> 	m->next = NULL; /* Preload to avoid setting in alloc(). */
> 	__rte_mbuf_raw_free(m);
> }
> 
> void rte_pktmbuf_free_seg(struct rte_mbuf *m)
> {
> 	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
> 		/* We have the last reference, so we return it to the free pool. */
> 		__rte_pktmbuf_free_seg_inner(m);
> 	} else {
> 		/* More references, so decrement the refcnt. */
> 		if (unlikely(rte_mbuf_refcnt_update(m, -1) == 0)) {
> 			/* Someone raced to decrement the refcnt before us,
> 			   so now we have the last reference,
> 			   so we return it to the free pool. */
> 		    rte_mbuf_refcnt_set(m, 1); /* Preload to avoid in alloc(). */
> 			__rte_pktmbuf_free_seg_inner(m);
> 		}
> 	}
> }
> 
> 
> > Also, if we set refcnt on free rather than alloc, it does set itself up
> > as a good candidate for moving to the second cacheline. Fast-path
> > processing does not normally update the value.
> > 
> Agreed. But also consider that TX handling should be optimized too. And this includes free() processing and preloading values for alloc(); this should only be done in free() instead of alloc() if it provides a total performance improvement. We should not improve RX handling performance if the TX handling performance decreases much more.

Agreed.
> 
> 
> > > > 5.
> > > >
> > > > And here’s something new to think about:
> > > >
> > > > m->next already reveals if there are more segments to a packet. Which
> > purpose does m->nb_segs serve that is not already covered by m->next?
> > 
> > It is duplicate info, but nb_segs can be used to check the validity of
> > the next pointer without having to read the second mbuf cacheline.
> > 
> > Whether it's worth having is something I'm happy enough to discuss,
> > though.
> If anybody needs to check for additional segments (m->next != NULL) without hitting the second cache line, a single bit in the first cache line will suffice. (If this is a common case, they probably need to read the m->next value anyway, and then it would optimally belong in the first cache line. I am just mentioning this. It's not my preference.)
> 
> 
> > One other point I'll mention is that we need to have a discussion on
> > how/where to add in a timestamp value into the mbuf. Personally, I think
> > it can be in a union with the sequence number value, but I also suspect
> > that 32-bits of a timestamp is not going to be enough for many.
> > 
> > Thoughts?
> Speaking for myself, a high resolution RX timestamp is "nice to have", so I like it, but I would hate to sacrifice the sequence number for it. Generally, I would assume that the sequence number used for reordering is more important than a timestamp, except for applications dealing with timing measurements or high precision packet capturing.
> 
> Conversion of timestamps from one unit to another can be relatively expensive, so I recommend keeping timestamps in whatever resolution the NIC hardware provides, leaving the conversion to e.g. nanoseconds up to the applications that use timestamps. A lot of protocols use opaque timestamps, e.g. RTP, so this would not be exceptional. Since the timestamp belongs in the first cache line, I suggest we limit it to 32 bit. This can be the least significant 32 bit of a 64 bit NIC hardware timestamp, leaving it up to the application to keep track of the most significant 32 bit by some other means. If it's counting nanoseconds, 32 bit can hold 4 seconds.
> 

I worry that 32-bits is not enough. If we do use a 32-bit value, I also
think that in many cases we can't just take the low 32-bits. For a 2GHz
IA CPU the rdtsc() value would wrap around in just 2 seconds if
truncated to 32 bits, for instance. In a case like that, we may be
better to take 32-bits from somewhere in the middle - trading accuracy
for better wrap-around time.

/Bruce

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-24 16:25   ` Bruce Richardson
  2016-10-24 21:47     ` Morten Brørup
@ 2016-10-25  9:39     ` Adrien Mazarguil
  2016-10-25 10:11       ` Morten Brørup
  2016-10-25 11:54     ` Shreyansh Jain
  2 siblings, 1 reply; 46+ messages in thread
From: Adrien Mazarguil @ 2016-10-25  9:39 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: Wiles, Keith, Morten Brørup, dev, Olivier Matz, Oleg Kuporosov

On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson wrote:
> On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
[...]
> > > On Oct 24, 2016, at 10:49 AM, Morten Brørup <mb@smartsharesystems.com> wrote:
[...]
> > > 5.
> > > 
> > > And here’s something new to think about:
> > > 
> > > m->next already reveals if there are more segments to a packet. Which purpose does m->nb_segs serve that is not already covered by m->next?
> 
> It is duplicate info, but nb_segs can be used to check the validity of
> the next pointer without having to read the second mbuf cacheline.
> 
> Whether it's worth having is something I'm happy enough to discuss,
> though.

Although slower in some cases than a full blown "next packet" pointer,
nb_segs can also be conveniently abused to link several packets and their
segments in the same list without wasting space.

> One other point I'll mention is that we need to have a discussion on
> how/where to add in a timestamp value into the mbuf. Personally, I think
> it can be in a union with the sequence number value, but I also suspect
> that 32-bits of a timestamp is not going to be enough for many.
> 
> Thoughts?

If we consider that timestamp representation should use nanosecond
granularity, a 32-bit value may likely wrap around too quickly to be
useful. We can also assume that applications requesting timestamps may care
more about latency than throughput, Oleg found that using the second cache
line for this purpose had a noticeable impact [1].

 [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25  8:53       ` Bruce Richardson
@ 2016-10-25 10:02         ` Ananyev, Konstantin
  2016-10-25 10:22           ` Morten Brørup
  0 siblings, 1 reply; 46+ messages in thread
From: Ananyev, Konstantin @ 2016-10-25 10:02 UTC (permalink / raw)
  To: Richardson, Bruce, Morten Brørup; +Cc: Wiles, Keith, dev, Olivier Matz

Hi everyone,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Tuesday, October 25, 2016 9:54 AM
> To: Morten Brørup <mb@smartsharesystems.com>
> Cc: Wiles, Keith <keith.wiles@intel.com>; dev@dpdk.org; Olivier Matz <olivier.matz@6wind.com>
> Subject: Re: [dpdk-dev] mbuf changes
> 
> On Mon, Oct 24, 2016 at 11:47:16PM +0200, Morten Brørup wrote:
> > Comments inline.
> >
> > Med venlig hilsen / kind regards
> > - Morten Brørup
> >
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > > Sent: Monday, October 24, 2016 6:26 PM
> > > To: Wiles, Keith
> > > Cc: Morten Brørup; dev@dpdk.org; Olivier Matz
> > > Subject: Re: [dpdk-dev] mbuf changes
> > >
> > > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
> > > >
> > > > > On Oct 24, 2016, at 10:49 AM, Morten Brørup <mb@smartsharesystems.com>
> > > wrote:
> > > > >
> > > > > First of all: Thanks for a great DPDK Userspace 2016!
> > > > >
> > > > >
> > > > >
> > > > > Continuing the Userspace discussion about Olivier Matz’s proposed mbuf
> > > changes...
> > >
> > > Thanks for keeping the discussion going!
> > > > >
> > > > >
> > > > >
> > > > > 1.
> > > > >
> > > > > Stephen Hemminger had a noteworthy general comment about keeping
> > > metadata for the NIC in the appropriate section of the mbuf: Metadata
> > > generated by the NIC’s RX handler belongs in the first cache line, and
> > > metadata required by the NIC’s TX handler belongs in the second cache line.
> > > This also means that touching the second cache line on ingress should be
> > > avoided if possible; and Bruce Richardson mentioned that for this reason m-
> > > >next was zeroed on free().
> > > > >
> > > Thinking about it, I suspect there are more fields we can reset on free
> > > to save time on alloc. Refcnt, as discussed below is one of them, but so
> > > too could be the nb_segs field and possibly others.
> > Yes. Consider the use of rte_pktmbuf_reset() or add a rte_pktmbuf_prealloc() for this purpose.
> >
> Possibly. We just want to make sure that we don't reset too much either!
> :-)
> 
> >
> > > > > 2.
> > > > >
> > > > > There seemed to be consensus that the size of m->refcnt should match
> > > the size of m->port because a packet could be duplicated on all physical
> > > ports for L3 multicast and L2 flooding.
> > > > >
> > > > > Furthermore, although a single physical machine (i.e. a single server)
> > > with 255 physical ports probably doesn’t exist, it might contain more than
> > > 255 virtual machines with a virtual port each, so it makes sense extending
> > > these mbuf fields from 8 to 16 bits.
> > > >
> > > > I thought we also talked about removing the m->port from the mbuf as it
> > > is not really needed.
> > > >
> > > Yes, this was mentioned, and also the option of moving the port value to
> > > the second cacheline, but it appears that NXP are using the port value
> > > in their NIC drivers for passing in metadata, so we'd need their
> > > agreement on any move (or removal).
> > >
> > If a single driver instance services multiple ports, so the ports are not polled individually, the m->port member will be required to identify
> the port. In that case, it might also used elsewhere in the ingress path, and thus appropriate having in the first cache line. 

Ok, but these days most of devices have multiple rx queues.
So identify the RX source properly you need not only port, but port+queue (at least 3 bytes).
But I suppose better to wait for NXP input here. 

>So yes, this needs
> further discussion with NXP.
> >
> >
> > > > > 3.
> > > > >
> > > > > Someone (Bruce Richardson?) suggested moving m->refcnt and m->port to
> > > the second cache line, which then generated questions from the audience
> > > about the real life purpose of m->port, and if m->port could be removed
> > > from the mbuf structure.
> > > > >
> > > > >
> > > > >
> > > > > 4.
> > > > >
> > > > > I suggested using offset -1 for m->refcnt, so m->refcnt becomes 0 on
> > > first allocation. This is based on the assumption that other mbuf fields
> > > must be zeroed at alloc()/free() anyway, so zeroing m->refcnt is cheaper
> > > than setting it to 1.
> > > > >
> > > > > Furthermore (regardless of m->refcnt offset), I suggested that it is
> > > not required to modify m->refcnt when allocating and freeing the mbuf, thus
> > > saving one write operation on both alloc() and free(). However, this
> > > assumes that m->refcnt debugging, e.g. underrun detection, is not required.
> > >
> > > I don't think it really matters what sentinal value is used for the
> > > refcnt because it can't be blindly assigned on free like other fields.
> > > However, I think 0 as first reference value becomes more awkward
> > > than 1, because we need to deal with underflow. Consider the situation
> > > where we have two references to the mbuf, so refcnt is 1, and both are
> > > freed at the same time. Since the refcnt is not-zero, then both cores
> > > will do an atomic decrement simultaneously giving a refcnt of -1. We can
> > > then set this back to zero before freeing, however, I'd still prefer to
> > > have refcnt be an accurate value so that it always stays positive, and
> > > we can still set it to "one" on free to avoid having to set on alloc.
> > >
> > Good example. It explains very clearly why my suggested optimization is tricky. My optimization is targeting the most likely scenario
> where m->refcnt is only "one", but at the expense of some added complexity for the unlikely scenarios. (I only suggested 0 as "one"
> because I assumed the value could be zeroed faster than set to 1, so forget about that, if you like.)
> >
> > I still argue that if we keep m->refcnt at "one" value while in the free pool, a write operation can be avoided at both free() and alloc().
> Here is a conceptual example (using m->refcnt==1 for "one"):
> >
> Yes, I agree with you on this.
> 
> > void __rte_mbuf_raw_free(struct rte_mbuf *m)
> > {
> > 	RTE_ASSERT(rte_mbuf_refcnt_read(m) == 1);
> > 	rte_mempool_put(m->pool, m);
> > }
> >
> > void __rte_pktmbuf_free_seg_inner(struct rte_mbuf *m)
> > {
> > 	/* if this is an indirect mbuf, it is detached. */
> > 	if (RTE_MBUF_INDIRECT(m))
> > 		rte_pktmbuf_detach(m);
> > 	m->next = NULL; /* Preload to avoid setting in alloc(). */
> > 	__rte_mbuf_raw_free(m);
> > }
> >
> > void rte_pktmbuf_free_seg(struct rte_mbuf *m)
> > {
> > 	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
> > 		/* We have the last reference, so we return it to the free pool. */
> > 		__rte_pktmbuf_free_seg_inner(m);
> > 	} else {
> > 		/* More references, so decrement the refcnt. */
> > 		if (unlikely(rte_mbuf_refcnt_update(m, -1) == 0)) {
> > 			/* Someone raced to decrement the refcnt before us,
> > 			   so now we have the last reference,
> > 			   so we return it to the free pool. */
> > 		    rte_mbuf_refcnt_set(m, 1); /* Preload to avoid in alloc(). */
> > 			__rte_pktmbuf_free_seg_inner(m);
> > 		}
> > 	}
> > }
> >
> >
> > > Also, if we set refcnt on free rather than alloc, it does set itself up
> > > as a good candidate for moving to the second cacheline. Fast-path
> > > processing does not normally update the value.
> > >
> > Agreed. But also consider that TX handling should be optimized too. And this includes free() processing and preloading values for alloc();
> this should only be done in free() instead of alloc() if it provides a total performance improvement. We should not improve RX handling
> performance if the TX handling performance decreases much more.
> 
> Agreed.
> >
> >
> > > > > 5.
> > > > >
> > > > > And here’s something new to think about:
> > > > >
> > > > > m->next already reveals if there are more segments to a packet. Which
> > > purpose does m->nb_segs serve that is not already covered by m->next?
> > >
> > > It is duplicate info, but nb_segs can be used to check the validity of
> > > the next pointer without having to read the second mbuf cacheline.
> > >
> > > Whether it's worth having is something I'm happy enough to discuss,
> > > though.
> > If anybody needs to check for additional segments (m->next != NULL) without hitting the second cache line, a single bit in the first cache
> line will suffice. (If this is a common case, they probably need to read the m->next value anyway, and then it would optimally belong in the
> first cache line. I am just mentioning this. It's not my preference.)
> >
> >
> > > One other point I'll mention is that we need to have a discussion on
> > > how/where to add in a timestamp value into the mbuf. Personally, I think
> > > it can be in a union with the sequence number value, but I also suspect
> > > that 32-bits of a timestamp is not going to be enough for many.
> > >
> > > Thoughts?
> > Speaking for myself, a high resolution RX timestamp is "nice to have", so I like it, but I would hate to sacrifice the sequence number for it.
> Generally, I would assume that the sequence number used for reordering is more important than a timestamp, except for applications
> dealing with timing measurements or high precision packet capturing.
> >
> > Conversion of timestamps from one unit to another can be relatively expensive, so I recommend keeping timestamps in whatever
> resolution the NIC hardware provides, leaving the conversion to e.g. nanoseconds up to the applications that use timestamps. A lot of
> protocols use opaque timestamps, e.g. RTP, so this would not be exceptional. Since the timestamp belongs in the first cache line, I suggest
> we limit it to 32 bit. This can be the least significant 32 bit of a 64 bit NIC hardware timestamp, leaving it up to the application to keep track
> of the most significant 32 bit by some other means. If it's counting nanoseconds, 32 bit can hold 4 seconds.
> >
> 
> I worry that 32-bits is not enough. If we do use a 32-bit value, I also
> think that in many cases we can't just take the low 32-bits. For a 2GHz
> IA CPU the rdtsc() value would wrap around in just 2 seconds if
> truncated to 32 bits, for instance. In a case like that, we may be
> better to take 32-bits from somewhere in the middle - trading accuracy
> for better wrap-around time.

I also think that 32bit for timestamps would not be enough.
Though as we agree that first cache line should be for fields filled by fast-path RX,
I suppose timestamp (and seqn) should be in the second line and shouldn't be filled by PMD itself.
About what should be stored here (tsc value/system clock in ns/something else) -
I think it is better left to particular SW that would fill/use that field to decide.
For me personally tsc seems like the most appropriate.
Konstantin


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25  9:39     ` Adrien Mazarguil
@ 2016-10-25 10:11       ` Morten Brørup
  2016-10-25 11:04         ` Adrien Mazarguil
  0 siblings, 1 reply; 46+ messages in thread
From: Morten Brørup @ 2016-10-25 10:11 UTC (permalink / raw)
  To: Adrien Mazarguil, Bruce Richardson
  Cc: Wiles, Keith, dev, Olivier Matz, Oleg Kuporosov

Comments inline.

Med venlig hilsen / kind regards
- Morten Brørup


> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> Sent: Tuesday, October 25, 2016 11:39 AM
> To: Bruce Richardson
> Cc: Wiles, Keith; Morten Brørup; dev@dpdk.org; Olivier Matz; Oleg
> Kuporosov
> Subject: Re: [dpdk-dev] mbuf changes
> 
> On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson wrote:
> > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
> [...]
> > > > On Oct 24, 2016, at 10:49 AM, Morten Brørup
> <mb@smartsharesystems.com> wrote:
> [...]
> > > > 5.
> > > >
> > > > And here’s something new to think about:
> > > >
> > > > m->next already reveals if there are more segments to a packet.
> Which purpose does m->nb_segs serve that is not already covered by m-
> >next?
> >
> > It is duplicate info, but nb_segs can be used to check the validity
> of
> > the next pointer without having to read the second mbuf cacheline.
> >
> > Whether it's worth having is something I'm happy enough to discuss,
> > though.
> 
> Although slower in some cases than a full blown "next packet" pointer,
> nb_segs can also be conveniently abused to link several packets and
> their segments in the same list without wasting space.

I don’t understand that; can you please elaborate? Are you abusing m->nb_segs as an index into an array in your application? If that is the case, and it is endorsed by the community, we should get rid of m->nb_segs and add a member for application specific use instead. 

> > One other point I'll mention is that we need to have a discussion on
> > how/where to add in a timestamp value into the mbuf. Personally, I
> > think it can be in a union with the sequence number value, but I also
> > suspect that 32-bits of a timestamp is not going to be enough for
> many.
> >
> > Thoughts?
> 
> If we consider that timestamp representation should use nanosecond
> granularity, a 32-bit value may likely wrap around too quickly to be
> useful. We can also assume that applications requesting timestamps may
> care more about latency than throughput, Oleg found that using the
> second cache line for this purpose had a noticeable impact [1].
> 
>  [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html

I agree with Oleg about the latency vs. throughput importance for such applications.

If you need high resolution timestamps, consider them to be generated by the NIC RX driver, possibly by the hardware itself (http://w3new.napatech.com/features/time-precision/hardware-time-stamp), so the timestamp belongs in the first cache line. And I am proposing that it should have the highest possible accuracy, which makes the value hardware dependent.

Furthermore, I am arguing that we leave it up to the application to keep track of the slowly moving bits (i.e. counting whole seconds, hours and calendar date) out of band, so we don't use precious space in the mbuf. The application doesn't need the NIC RX driver's fast path to capture which date (or even which second) a packet was received. Yes, it adds complexity to the application, but we can't set aside 64 bit for a generic timestamp. Or as a weird tradeoff: Put the fast moving 32 bit in the first cache line and the slow moving 32 bit in the second cache line, as a placeholder for the application to fill out if needed. Yes, it means that the application needs to check the time and update its variable holding the slow moving time once every second or so; but that should be doable without significant effort.


> 
> --
> Adrien Mazarguil
> 6WIND


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 10:02         ` Ananyev, Konstantin
@ 2016-10-25 10:22           ` Morten Brørup
  2016-10-25 13:00             ` Olivier Matz
  0 siblings, 1 reply; 46+ messages in thread
From: Morten Brørup @ 2016-10-25 10:22 UTC (permalink / raw)
  To: Ananyev, Konstantin, Richardson, Bruce; +Cc: Wiles, Keith, dev, Olivier Matz

Comments inline (at the end).

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> Konstantin
> Sent: Tuesday, October 25, 2016 12:03 PM
> To: Richardson, Bruce; Morten Brørup
> Cc: Wiles, Keith; dev@dpdk.org; Olivier Matz
> Subject: Re: [dpdk-dev] mbuf changes
> 
> Hi everyone,
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > Sent: Tuesday, October 25, 2016 9:54 AM
> > To: Morten Brørup <mb@smartsharesystems.com>
> > Cc: Wiles, Keith <keith.wiles@intel.com>; dev@dpdk.org; Olivier Matz
> > <olivier.matz@6wind.com>
> > Subject: Re: [dpdk-dev] mbuf changes
> >
> > On Mon, Oct 24, 2016 at 11:47:16PM +0200, Morten Brørup wrote:
> > > Comments inline.
> > >
> > > Med venlig hilsen / kind regards
> > > - Morten Brørup
> > >
> > > > -----Original Message-----
> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
> > > > Richardson
> > > > Sent: Monday, October 24, 2016 6:26 PM
> > > > To: Wiles, Keith
> > > > Cc: Morten Brørup; dev@dpdk.org; Olivier Matz
> > > > Subject: Re: [dpdk-dev] mbuf changes
> > > >
> > > > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
> > > > >
> > > > > > On Oct 24, 2016, at 10:49 AM, Morten Brørup
> > > > > > <mb@smartsharesystems.com>
> > > > wrote:
> > > > > >
> > > > > > 2.
> > > > > >
> > > > > > There seemed to be consensus that the size of m->refcnt
> should
> > > > > > match
> > > > the size of m->port because a packet could be duplicated on all
> > > > physical ports for L3 multicast and L2 flooding.
> > > > > >
> > > > > > Furthermore, although a single physical machine (i.e. a
> single
> > > > > > server)
> > > > with 255 physical ports probably doesn’t exist, it might contain
> > > > more than
> > > > 255 virtual machines with a virtual port each, so it makes sense
> > > > extending these mbuf fields from 8 to 16 bits.
> > > > >
> > > > > I thought we also talked about removing the m->port from the
> > > > > mbuf as it
> > > > is not really needed.
> > > > >
> > > > Yes, this was mentioned, and also the option of moving the port
> > > > value to the second cacheline, but it appears that NXP are using
> > > > the port value in their NIC drivers for passing in metadata, so
> > > > we'd need their agreement on any move (or removal).
> > > >
> > > If a single driver instance services multiple ports, so the ports
> > > are not polled individually, the m->port member will be required to
> > > identify
> > the port. In that case, it might also used elsewhere in the ingress
> path, and thus appropriate having in the first cache line.
> 
> Ok, but these days most of devices have multiple rx queues.
> So identify the RX source properly you need not only port, but
> port+queue (at least 3 bytes).
> But I suppose better to wait for NXP input here.
> 
> >So yes, this needs
> > further discussion with NXP.

It seems that this concept might be somewhat similar to the Flow Director information, which already has plenty of bits in the first cache line. You should consider if this can be somehow folded into the m->hash union.

-Morten


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 10:11       ` Morten Brørup
@ 2016-10-25 11:04         ` Adrien Mazarguil
  2016-10-25 11:13           ` Bruce Richardson
  0 siblings, 1 reply; 46+ messages in thread
From: Adrien Mazarguil @ 2016-10-25 11:04 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Bruce Richardson, Wiles, Keith, dev, Olivier Matz, Oleg Kuporosov

On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup wrote:
> Comments inline.
> 
> Med venlig hilsen / kind regards
> - Morten Brørup
> 
> 
> > -----Original Message-----
> > From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> > Sent: Tuesday, October 25, 2016 11:39 AM
> > To: Bruce Richardson
> > Cc: Wiles, Keith; Morten Brørup; dev@dpdk.org; Olivier Matz; Oleg
> > Kuporosov
> > Subject: Re: [dpdk-dev] mbuf changes
> > 
> > On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson wrote:
> > > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
> > [...]
> > > > > On Oct 24, 2016, at 10:49 AM, Morten Brørup
> > <mb@smartsharesystems.com> wrote:
> > [...]
> > > > > 5.
> > > > >
> > > > > And here’s something new to think about:
> > > > >
> > > > > m->next already reveals if there are more segments to a packet.
> > Which purpose does m->nb_segs serve that is not already covered by m-
> > >next?
> > >
> > > It is duplicate info, but nb_segs can be used to check the validity
> > of
> > > the next pointer without having to read the second mbuf cacheline.
> > >
> > > Whether it's worth having is something I'm happy enough to discuss,
> > > though.
> > 
> > Although slower in some cases than a full blown "next packet" pointer,
> > nb_segs can also be conveniently abused to link several packets and
> > their segments in the same list without wasting space.
> 
> I don’t understand that; can you please elaborate? Are you abusing m->nb_segs as an index into an array in your application? If that is the case, and it is endorsed by the community, we should get rid of m->nb_segs and add a member for application specific use instead. 

Well, that's just an idea, I'm not aware of any application using this,
however the ability to link several packets with segments seems
useful to me (e.g. buffering packets). Here's a diagram:

 .-----------.   .-----------.   .-----------.   .-----------.   .------
 | pkt 0     |   | seg 1     |   | seg 2     |   | pkt 1     |   | pkt 2
 |      next --->|      next --->|      next --->|      next --->| ...
 | nb_segs 3 |   | nb_segs 1 |   | nb_segs 1 |   | nb_segs 1 |   |
 `-----------'   `-----------'   `-----------'   `-----------'   `------

> > > One other point I'll mention is that we need to have a discussion on
> > > how/where to add in a timestamp value into the mbuf. Personally, I
> > > think it can be in a union with the sequence number value, but I also
> > > suspect that 32-bits of a timestamp is not going to be enough for
> > many.
> > >
> > > Thoughts?
> > 
> > If we consider that timestamp representation should use nanosecond
> > granularity, a 32-bit value may likely wrap around too quickly to be
> > useful. We can also assume that applications requesting timestamps may
> > care more about latency than throughput, Oleg found that using the
> > second cache line for this purpose had a noticeable impact [1].
> > 
> >  [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html
> 
> I agree with Oleg about the latency vs. throughput importance for such applications.
> 
> If you need high resolution timestamps, consider them to be generated by the NIC RX driver, possibly by the hardware itself (http://w3new.napatech.com/features/time-precision/hardware-time-stamp), so the timestamp belongs in the first cache line. And I am proposing that it should have the highest possible accuracy, which makes the value hardware dependent.
> 
> Furthermore, I am arguing that we leave it up to the application to keep track of the slowly moving bits (i.e. counting whole seconds, hours and calendar date) out of band, so we don't use precious space in the mbuf. The application doesn't need the NIC RX driver's fast path to capture which date (or even which second) a packet was received. Yes, it adds complexity to the application, but we can't set aside 64 bit for a generic timestamp. Or as a weird tradeoff: Put the fast moving 32 bit in the first cache line and the slow moving 32 bit in the second cache line, as a placeholder for the application to fill out if needed. Yes, it means that the application needs to check the time and update its variable holding the slow moving time once every second or so; but that should be doable with
 out significant effort.

That's a good point, however without a 64 bit value, elapsed time between
two arbitrary mbufs cannot be measured reliably due to not enough context,
one way or another the low resolution value is also needed.

Obviously latency-sensitive applications are unlikely to perform lengthy
buffering and require this but I'm not sure about all the possible
use-cases. Considering many NICs expose 64 bit timestaps, I suggest we do
not truncate them.

I'm not a fan of the weird tradeoff either, PMDs will be tempted to fill the
extra 32 bits whenever they can and negate the performance improvement of
the first cache line.

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 11:04         ` Adrien Mazarguil
@ 2016-10-25 11:13           ` Bruce Richardson
  2016-10-25 12:16             ` Morten Brørup
  0 siblings, 1 reply; 46+ messages in thread
From: Bruce Richardson @ 2016-10-25 11:13 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: Morten Brørup, Wiles, Keith, dev, Olivier Matz, Oleg Kuporosov

On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
> On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup wrote:
> > Comments inline.
> > 
> > Med venlig hilsen / kind regards
> > - Morten Brørup
> > 
> > 
> > > -----Original Message-----
> > > From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> > > Sent: Tuesday, October 25, 2016 11:39 AM
> > > To: Bruce Richardson
> > > Cc: Wiles, Keith; Morten Brørup; dev@dpdk.org; Olivier Matz; Oleg
> > > Kuporosov
> > > Subject: Re: [dpdk-dev] mbuf changes
> > > 
> > > On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson wrote:
> > > > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
> > > [...]
> > > > > > On Oct 24, 2016, at 10:49 AM, Morten Brørup
> > > <mb@smartsharesystems.com> wrote:
> > > [...]
> > > > > > 5.
> > > > > >
> > > > > > And here’s something new to think about:
> > > > > >
> > > > > > m->next already reveals if there are more segments to a packet.
> > > Which purpose does m->nb_segs serve that is not already covered by m-
> > > >next?
> > > >
> > > > It is duplicate info, but nb_segs can be used to check the validity
> > > of
> > > > the next pointer without having to read the second mbuf cacheline.
> > > >
> > > > Whether it's worth having is something I'm happy enough to discuss,
> > > > though.
> > > 
> > > Although slower in some cases than a full blown "next packet" pointer,
> > > nb_segs can also be conveniently abused to link several packets and
> > > their segments in the same list without wasting space.
> > 
> > I don’t understand that; can you please elaborate? Are you abusing m->nb_segs as an index into an array in your application? If that is the case, and it is endorsed by the community, we should get rid of m->nb_segs and add a member for application specific use instead. 
> 
> Well, that's just an idea, I'm not aware of any application using this,
> however the ability to link several packets with segments seems
> useful to me (e.g. buffering packets). Here's a diagram:
> 
>  .-----------.   .-----------.   .-----------.   .-----------.   .------
>  | pkt 0     |   | seg 1     |   | seg 2     |   | pkt 1     |   | pkt 2
>  |      next --->|      next --->|      next --->|      next --->| ...
>  | nb_segs 3 |   | nb_segs 1 |   | nb_segs 1 |   | nb_segs 1 |   |
>  `-----------'   `-----------'   `-----------'   `-----------'   `------
> 
> > > > One other point I'll mention is that we need to have a discussion on
> > > > how/where to add in a timestamp value into the mbuf. Personally, I
> > > > think it can be in a union with the sequence number value, but I also
> > > > suspect that 32-bits of a timestamp is not going to be enough for
> > > many.
> > > >
> > > > Thoughts?
> > > 
> > > If we consider that timestamp representation should use nanosecond
> > > granularity, a 32-bit value may likely wrap around too quickly to be
> > > useful. We can also assume that applications requesting timestamps may
> > > care more about latency than throughput, Oleg found that using the
> > > second cache line for this purpose had a noticeable impact [1].
> > > 
> > >  [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html
> > 
> > I agree with Oleg about the latency vs. throughput importance for such applications.
> > 
> > If you need high resolution timestamps, consider them to be generated by the NIC RX driver, possibly by the hardware itself (http://w3new.napatech.com/features/time-precision/hardware-time-stamp), so the timestamp belongs in the first cache line. And I am proposing that it should have the highest possible accuracy, which makes the value hardware dependent.
> > 
> > Furthermore, I am arguing that we leave it up to the application to keep track of the slowly moving bits (i.e. counting whole seconds, hours and calendar date) out of band, so we don't use precious space in the mbuf. The application doesn't need the NIC RX driver's fast path to capture which date (or even which second) a packet was received. Yes, it adds complexity to the application, but we can't set aside 64 bit for a generic timestamp. Or as a weird tradeoff: Put the fast moving 32 bit in the first cache line and the slow moving 32 bit in the second cache line, as a placeholder for the application to fill out if needed. Yes, it means that the application needs to check the time and update its variable holding the slow moving time once every second or so; but that should be doable wi
 thout significant effort.
> 
> That's a good point, however without a 64 bit value, elapsed time between
> two arbitrary mbufs cannot be measured reliably due to not enough context,
> one way or another the low resolution value is also needed.
> 
> Obviously latency-sensitive applications are unlikely to perform lengthy
> buffering and require this but I'm not sure about all the possible
> use-cases. Considering many NICs expose 64 bit timestaps, I suggest we do
> not truncate them.
> 
> I'm not a fan of the weird tradeoff either, PMDs will be tempted to fill the
> extra 32 bits whenever they can and negate the performance improvement of
> the first cache line.

I would tend to agree, and I don't really see any convenient way to
avoid putting in a 64-bit field for the timestamp in cache-line 0. If we
are ok with having this overlap/partially overlap with sequence number,
it will use up an extra 4B of storage in that cacheline. However,
nb_segs may be a good candidate for demotion, along with possibly the
port value, or the reference count.

/Bruce

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-24 16:25   ` Bruce Richardson
  2016-10-24 21:47     ` Morten Brørup
  2016-10-25  9:39     ` Adrien Mazarguil
@ 2016-10-25 11:54     ` Shreyansh Jain
  2016-10-25 12:05       ` Bruce Richardson
  2016-10-25 12:49       ` Morten Brørup
  2 siblings, 2 replies; 46+ messages in thread
From: Shreyansh Jain @ 2016-10-25 11:54 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: Wiles, Keith, Morten Brørup, dev, Olivier Matz

On Monday 24 October 2016 09:55 PM, Bruce Richardson wrote:
> On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
>>
>>> On Oct 24, 2016, at 10:49 AM, Morten Brørup <mb@smartsharesystems.com> wrote:
>>>
>>> First of all: Thanks for a great DPDK Userspace 2016!
>>>
>>>
>>>
>>> Continuing the Userspace discussion about Olivier Matz’s proposed mbuf changes...
>
> Thanks for keeping the discussion going!
>>>
>>>
>>>
>>> 1.
>>>
>>> Stephen Hemminger had a noteworthy general comment about keeping metadata for the NIC in the appropriate section of the mbuf: Metadata generated by the NIC’s RX handler belongs in the first cache line, and metadata required by the NIC’s TX handler belongs in the second cache line. This also means that touching the second cache line on ingress should be avoided if possible; and Bruce Richardson mentioned that for this reason m->next was zeroed on free().
>>>
> Thinking about it, I suspect there are more fields we can reset on free
> to save time on alloc. Refcnt, as discussed below is one of them, but so
> too could be the nb_segs field and possibly others.
>
>>>
>>>
>>> 2.
>>>
>>> There seemed to be consensus that the size of m->refcnt should match the size of m->port because a packet could be duplicated on all physical ports for L3 multicast and L2 flooding.
>>>
>>> Furthermore, although a single physical machine (i.e. a single server) with 255 physical ports probably doesn’t exist, it might contain more than 255 virtual machines with a virtual port each, so it makes sense extending these mbuf fields from 8 to 16 bits.
>>
>> I thought we also talked about removing the m->port from the mbuf as it is not really needed.
>>
> Yes, this was mentioned, and also the option of moving the port value to
> the second cacheline, but it appears that NXP are using the port value
> in their NIC drivers for passing in metadata, so we'd need their
> agreement on any move (or removal).

I am not sure where NXP's NIC came into picture on this, but now that it 
is highlighted, this field is required for libevent implementation [1].

A scheduler sending an event, which can be a packet, would only have 
information of a flow_id. From this matching it back to a port, without 
mbuf->port, would be very difficult (costly). There may be way around 
this but at least in current proposal I think port would be important to 
have - even if in second cache line.

But, off the top of my head, as of now it is not being used for any 
specific purpose in NXP's PMD implementation.

Even the SoC patches don't necessarily rely on it except using it 
because it is available.

@Bruce: where did you get the NXP context here from?

[1] http://dpdk.org/ml/archives/dev/2016-October/048592.html

>
>>>
>>>
>>>
>>> 3.
>>>
>>> Someone (Bruce Richardson?) suggested moving m->refcnt and m->port to the second cache line, which then generated questions from the audience about the real life purpose of m->port, and if m->port could be removed from the mbuf structure.
>>>
>>>
>>>
>>> 4.
>>>
>>> I suggested using offset -1 for m->refcnt, so m->refcnt becomes 0 on first allocation. This is based on the assumption that other mbuf fields must be zeroed at alloc()/free() anyway, so zeroing m->refcnt is cheaper than setting it to 1.
>>>
>>> Furthermore (regardless of m->refcnt offset), I suggested that it is not required to modify m->refcnt when allocating and freeing the mbuf, thus saving one write operation on both alloc() and free(). However, this assumes that m->refcnt debugging, e.g. underrun detection, is not required.
>
> I don't think it really matters what sentinal value is used for the
> refcnt because it can't be blindly assigned on free like other fields.
> However, I think 0 as first reference value becomes more awkward
> than 1, because we need to deal with underflow. Consider the situation
> where we have two references to the mbuf, so refcnt is 1, and both are
> freed at the same time. Since the refcnt is not-zero, then both cores
> will do an atomic decrement simultaneously giving a refcnt of -1. We can
> then set this back to zero before freeing, however, I'd still prefer to
> have refcnt be an accurate value so that it always stays positive, and
> we can still set it to "one" on free to avoid having to set on alloc.
>
> Also, if we set refcnt on free rather than alloc, it does set itself up
> as a good candidate for moving to the second cacheline. Fast-path
> processing does not normally update the value.
>
>>>
>>>
>>>
>>> 5.
>>>
>>> And here’s something new to think about:
>>>
>>> m->next already reveals if there are more segments to a packet. Which purpose does m->nb_segs serve that is not already covered by m->next?
>
> It is duplicate info, but nb_segs can be used to check the validity of
> the next pointer without having to read the second mbuf cacheline.
>
> Whether it's worth having is something I'm happy enough to discuss,
> though.
>
> One other point I'll mention is that we need to have a discussion on
> how/where to add in a timestamp value into the mbuf. Personally, I think
> it can be in a union with the sequence number value, but I also suspect
> that 32-bits of a timestamp is not going to be enough for many.
>
> Thoughts?
>
> /Bruce
>


-- 
-
Shreyansh

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 11:54     ` Shreyansh Jain
@ 2016-10-25 12:05       ` Bruce Richardson
  2016-10-26  8:28         ` Alejandro Lucero
  2016-10-25 12:49       ` Morten Brørup
  1 sibling, 1 reply; 46+ messages in thread
From: Bruce Richardson @ 2016-10-25 12:05 UTC (permalink / raw)
  To: Shreyansh Jain
  Cc: Wiles, Keith, Morten Brørup, dev, Olivier Matz, alejandro.lucero

On Tue, Oct 25, 2016 at 05:24:28PM +0530, Shreyansh Jain wrote:
> On Monday 24 October 2016 09:55 PM, Bruce Richardson wrote:
> > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
> > > 
> > > > On Oct 24, 2016, at 10:49 AM, Morten Brørup <mb@smartsharesystems.com> wrote:
> > > > 
> > > > First of all: Thanks for a great DPDK Userspace 2016!
> > > > 
> > > > 
> > > > 
> > > > Continuing the Userspace discussion about Olivier Matz’s proposed mbuf changes...
> > 
> > Thanks for keeping the discussion going!
> > > > 
> > > > 
> > > > 
> > > > 1.
> > > > 
> > > > Stephen Hemminger had a noteworthy general comment about keeping metadata for the NIC in the appropriate section of the mbuf: Metadata generated by the NIC’s RX handler belongs in the first cache line, and metadata required by the NIC’s TX handler belongs in the second cache line. This also means that touching the second cache line on ingress should be avoided if possible; and Bruce Richardson mentioned that for this reason m->next was zeroed on free().
> > > > 
> > Thinking about it, I suspect there are more fields we can reset on free
> > to save time on alloc. Refcnt, as discussed below is one of them, but so
> > too could be the nb_segs field and possibly others.
> > 
> > > > 
> > > > 
> > > > 2.
> > > > 
> > > > There seemed to be consensus that the size of m->refcnt should match the size of m->port because a packet could be duplicated on all physical ports for L3 multicast and L2 flooding.
> > > > 
> > > > Furthermore, although a single physical machine (i.e. a single server) with 255 physical ports probably doesn’t exist, it might contain more than 255 virtual machines with a virtual port each, so it makes sense extending these mbuf fields from 8 to 16 bits.
> > > 
> > > I thought we also talked about removing the m->port from the mbuf as it is not really needed.
> > > 
> > Yes, this was mentioned, and also the option of moving the port value to
> > the second cacheline, but it appears that NXP are using the port value
> > in their NIC drivers for passing in metadata, so we'd need their
> > agreement on any move (or removal).
> 
> I am not sure where NXP's NIC came into picture on this, but now that it is
> highlighted, this field is required for libevent implementation [1].
> 
> A scheduler sending an event, which can be a packet, would only have
> information of a flow_id. From this matching it back to a port, without
> mbuf->port, would be very difficult (costly). There may be way around this
> but at least in current proposal I think port would be important to have -
> even if in second cache line.
> 
> But, off the top of my head, as of now it is not being used for any specific
> purpose in NXP's PMD implementation.
> 
> Even the SoC patches don't necessarily rely on it except using it because it
> is available.
> 
> @Bruce: where did you get the NXP context here from?
> 
Oh, I'm just mis-remembering. :-( It was someone else who was looking for
this - Netronome, perhaps?

CC'ing Alejandro in the hope I'm remembering correctly second time
round!

/Bruce

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 11:13           ` Bruce Richardson
@ 2016-10-25 12:16             ` Morten Brørup
  2016-10-25 12:20               ` Bruce Richardson
  2016-10-25 13:48               ` Adrien Mazarguil
  0 siblings, 2 replies; 46+ messages in thread
From: Morten Brørup @ 2016-10-25 12:16 UTC (permalink / raw)
  To: Bruce Richardson, Adrien Mazarguil
  Cc: Wiles, Keith, dev, Olivier Matz, Oleg Kuporosov

Comments inline.

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Tuesday, October 25, 2016 1:14 PM
> To: Adrien Mazarguil
> Cc: Morten Brørup; Wiles, Keith; dev@dpdk.org; Olivier Matz; Oleg
> Kuporosov
> Subject: Re: [dpdk-dev] mbuf changes
> 
> On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
> > On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup wrote:
> > > Comments inline.
> > >
> > > Med venlig hilsen / kind regards
> > > - Morten Brørup
> > >
> > >
> > > > -----Original Message-----
> > > > From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> > > > Sent: Tuesday, October 25, 2016 11:39 AM
> > > > To: Bruce Richardson
> > > > Cc: Wiles, Keith; Morten Brørup; dev@dpdk.org; Olivier Matz; Oleg
> > > > Kuporosov
> > > > Subject: Re: [dpdk-dev] mbuf changes
> > > >
> > > > On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson wrote:
> > > > > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
> > > > [...]
> > > > > > > On Oct 24, 2016, at 10:49 AM, Morten Brørup
> > > > <mb@smartsharesystems.com> wrote:
> > > > [...]
> > > > > > > 5.
> > > > > > >
> > > > > > > And here’s something new to think about:
> > > > > > >
> > > > > > > m->next already reveals if there are more segments to a
> packet.
> > > > Which purpose does m->nb_segs serve that is not already covered
> by
> > > > m-
> > > > >next?
> > > > >
> > > > > It is duplicate info, but nb_segs can be used to check the
> > > > > validity
> > > > of
> > > > > the next pointer without having to read the second mbuf
> cacheline.
> > > > >
> > > > > Whether it's worth having is something I'm happy enough to
> > > > > discuss, though.
> > > >
> > > > Although slower in some cases than a full blown "next packet"
> > > > pointer, nb_segs can also be conveniently abused to link several
> > > > packets and their segments in the same list without wasting
> space.
> > >
> > > I don’t understand that; can you please elaborate? Are you abusing
> m->nb_segs as an index into an array in your application? If that is
> the case, and it is endorsed by the community, we should get rid of m-
> >nb_segs and add a member for application specific use instead.
> >
> > Well, that's just an idea, I'm not aware of any application using
> > this, however the ability to link several packets with segments seems
> > useful to me (e.g. buffering packets). Here's a diagram:
> >
> >  .-----------.   .-----------.   .-----------.   .-----------.   .---
> ---
> >  | pkt 0     |   | seg 1     |   | seg 2     |   | pkt 1     |   |
> pkt 2
> >  |      next --->|      next --->|      next --->|      next --->|
> ...
> >  | nb_segs 3 |   | nb_segs 1 |   | nb_segs 1 |   | nb_segs 1 |   |
> >  `-----------'   `-----------'   `-----------'   `-----------'   `---
> ---

I see. It makes it possible to refer to a burst of packets (with segments or not) by a single mbuf reference, as an alternative to the current design pattern of using an array and length (struct rte_mbuf **mbufs, unsigned count).

This would require implementation in the PMDs etc.

And even in this case, m->nb_segs does not need to be an integer, but could be replaced by a single bit indicating if the segment is a continuation of a packet or the beginning (alternatively the end) of a packet, i.e. the bit can be set for either the first or the last segment in the packet.

It is an almost equivalent alternative to the fundamental design pattern of using an array of mbuf with count, which is widely implemented in DPDK. And m->next still lives in the second cache line, so I don't see any gain by this.

I still don't get how m->nb_segs can be abused without m->next.


> > > > > One other point I'll mention is that we need to have a
> > > > > discussion on how/where to add in a timestamp value into the
> > > > > mbuf. Personally, I think it can be in a union with the
> sequence
> > > > > number value, but I also suspect that 32-bits of a timestamp is
> > > > > not going to be enough for
> > > > many.
> > > > >
> > > > > Thoughts?
> > > >
> > > > If we consider that timestamp representation should use
> nanosecond
> > > > granularity, a 32-bit value may likely wrap around too quickly to
> > > > be useful. We can also assume that applications requesting
> > > > timestamps may care more about latency than throughput, Oleg
> found
> > > > that using the second cache line for this purpose had a
> noticeable impact [1].
> > > >
> > > >  [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html
> > >
> > > I agree with Oleg about the latency vs. throughput importance for
> such applications.
> > >
> > > If you need high resolution timestamps, consider them to be
> generated by the NIC RX driver, possibly by the hardware itself
> (http://w3new.napatech.com/features/time-precision/hardware-time-
> stamp), so the timestamp belongs in the first cache line. And I am
> proposing that it should have the highest possible accuracy, which
> makes the value hardware dependent.
> > >
> > > Furthermore, I am arguing that we leave it up to the application to
> keep track of the slowly moving bits (i.e. counting whole seconds,
> hours and calendar date) out of band, so we don't use precious space in
> the mbuf. The application doesn't need the NIC RX driver's fast path to
> capture which date (or even which second) a packet was received. Yes,
> it adds complexity to the application, but we can't set aside 64 bit
> for a generic timestamp. Or as a weird tradeoff: Put the fast moving 32
> bit in the first cache line and the slow moving 32 bit in the second
> cache line, as a placeholder for the application to fill out if needed.
> Yes, it means that the application needs to check the time and update
> its variable holding the slow moving time once every second or so; but
> that should be doable without significant effort.
> >
> > That's a good point, however without a 64 bit value, elapsed time
> > between two arbitrary mbufs cannot be measured reliably due to not
> > enough context, one way or another the low resolution value is also
> needed.
> >
> > Obviously latency-sensitive applications are unlikely to perform
> > lengthy buffering and require this but I'm not sure about all the
> > possible use-cases. Considering many NICs expose 64 bit timestaps, I
> > suggest we do not truncate them.
> >
> > I'm not a fan of the weird tradeoff either, PMDs will be tempted to
> > fill the extra 32 bits whenever they can and negate the performance
> > improvement of the first cache line.
> 
> I would tend to agree, and I don't really see any convenient way to
> avoid putting in a 64-bit field for the timestamp in cache-line 0. If
> we are ok with having this overlap/partially overlap with sequence
> number, it will use up an extra 4B of storage in that cacheline.

I agree about the lack of convenience! And Adrien certainly has a point about PMD temptations.

However, I still don't think that a NICs ability to date-stamp a packet is sufficient reason to put a date-stamp in cache line 0 of the mbuf. Storing only the fast moving 32 bit in cache line 0 seems like a good compromise to me.

Maybe you can find just one more byte, so it can hold 17 minutes with nanosecond resolution. (I'm joking!)

Please don't sacrifice the sequence number for the seconds/hours/days part a timestamp. Maybe it could be configurable to use a 32 bit or 64 bit timestamp.


> However, nb_segs may be a good candidate for demotion, along with
> possibly the port value, or the reference count.
> 
> /Bruce
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 12:16             ` Morten Brørup
@ 2016-10-25 12:20               ` Bruce Richardson
  2016-10-25 12:33                 ` Morten Brørup
  2016-10-25 13:48               ` Adrien Mazarguil
  1 sibling, 1 reply; 46+ messages in thread
From: Bruce Richardson @ 2016-10-25 12:20 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Adrien Mazarguil, Wiles, Keith, dev, Olivier Matz, Oleg Kuporosov

On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Brørup wrote:
> Comments inline.
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > Sent: Tuesday, October 25, 2016 1:14 PM
> > To: Adrien Mazarguil
> > Cc: Morten Brørup; Wiles, Keith; dev@dpdk.org; Olivier Matz; Oleg
> > Kuporosov
> > Subject: Re: [dpdk-dev] mbuf changes
> > 
> > On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
> > > On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup wrote:
> > > > Comments inline.
> > > >
> > > > Med venlig hilsen / kind regards
> > > > - Morten Brørup
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> > > > > Sent: Tuesday, October 25, 2016 11:39 AM
> > > > > To: Bruce Richardson
> > > > > Cc: Wiles, Keith; Morten Brørup; dev@dpdk.org; Olivier Matz; Oleg
> > > > > Kuporosov
> > > > > Subject: Re: [dpdk-dev] mbuf changes
> > > > >
> > > > > On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson wrote:
> > > > > > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
> > > > > [...]
> > > > > > > > On Oct 24, 2016, at 10:49 AM, Morten Brørup
> > > > > <mb@smartsharesystems.com> wrote:
> > > > > [...]
> > > > > > > > 5.
> > > > > > > >
> > > > > > > > And here’s something new to think about:
> > > > > > > >
> > > > > > > > m->next already reveals if there are more segments to a
> > packet.
> > > > > Which purpose does m->nb_segs serve that is not already covered
> > by
> > > > > m-
> > > > > >next?
> > > > > >
> > > > > > It is duplicate info, but nb_segs can be used to check the
> > > > > > validity
> > > > > of
> > > > > > the next pointer without having to read the second mbuf
> > cacheline.
> > > > > >
> > > > > > Whether it's worth having is something I'm happy enough to
> > > > > > discuss, though.
> > > > >
> > > > > Although slower in some cases than a full blown "next packet"
> > > > > pointer, nb_segs can also be conveniently abused to link several
> > > > > packets and their segments in the same list without wasting
> > space.
> > > >
> > > > I don’t understand that; can you please elaborate? Are you abusing
> > m->nb_segs as an index into an array in your application? If that is
> > the case, and it is endorsed by the community, we should get rid of m-
> > >nb_segs and add a member for application specific use instead.
> > >
> > > Well, that's just an idea, I'm not aware of any application using
> > > this, however the ability to link several packets with segments seems
> > > useful to me (e.g. buffering packets). Here's a diagram:
> > >
> > >  .-----------.   .-----------.   .-----------.   .-----------.   .---
> > ---
> > >  | pkt 0     |   | seg 1     |   | seg 2     |   | pkt 1     |   |
> > pkt 2
> > >  |      next --->|      next --->|      next --->|      next --->|
> > ...
> > >  | nb_segs 3 |   | nb_segs 1 |   | nb_segs 1 |   | nb_segs 1 |   |
> > >  `-----------'   `-----------'   `-----------'   `-----------'   `---
> > ---
> 
> I see. It makes it possible to refer to a burst of packets (with segments or not) by a single mbuf reference, as an alternative to the current design pattern of using an array and length (struct rte_mbuf **mbufs, unsigned count).
> 
> This would require implementation in the PMDs etc.
> 
> And even in this case, m->nb_segs does not need to be an integer, but could be replaced by a single bit indicating if the segment is a continuation of a packet or the beginning (alternatively the end) of a packet, i.e. the bit can be set for either the first or the last segment in the packet.
> 
> It is an almost equivalent alternative to the fundamental design pattern of using an array of mbuf with count, which is widely implemented in DPDK. And m->next still lives in the second cache line, so I don't see any gain by this.
> 
> I still don't get how m->nb_segs can be abused without m->next.
> 
> 
> > > > > > One other point I'll mention is that we need to have a
> > > > > > discussion on how/where to add in a timestamp value into the
> > > > > > mbuf. Personally, I think it can be in a union with the
> > sequence
> > > > > > number value, but I also suspect that 32-bits of a timestamp is
> > > > > > not going to be enough for
> > > > > many.
> > > > > >
> > > > > > Thoughts?
> > > > >
> > > > > If we consider that timestamp representation should use
> > nanosecond
> > > > > granularity, a 32-bit value may likely wrap around too quickly to
> > > > > be useful. We can also assume that applications requesting
> > > > > timestamps may care more about latency than throughput, Oleg
> > found
> > > > > that using the second cache line for this purpose had a
> > noticeable impact [1].
> > > > >
> > > > >  [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html
> > > >
> > > > I agree with Oleg about the latency vs. throughput importance for
> > such applications.
> > > >
> > > > If you need high resolution timestamps, consider them to be
> > generated by the NIC RX driver, possibly by the hardware itself
> > (http://w3new.napatech.com/features/time-precision/hardware-time-
> > stamp), so the timestamp belongs in the first cache line. And I am
> > proposing that it should have the highest possible accuracy, which
> > makes the value hardware dependent.
> > > >
> > > > Furthermore, I am arguing that we leave it up to the application to
> > keep track of the slowly moving bits (i.e. counting whole seconds,
> > hours and calendar date) out of band, so we don't use precious space in
> > the mbuf. The application doesn't need the NIC RX driver's fast path to
> > capture which date (or even which second) a packet was received. Yes,
> > it adds complexity to the application, but we can't set aside 64 bit
> > for a generic timestamp. Or as a weird tradeoff: Put the fast moving 32
> > bit in the first cache line and the slow moving 32 bit in the second
> > cache line, as a placeholder for the application to fill out if needed.
> > Yes, it means that the application needs to check the time and update
> > its variable holding the slow moving time once every second or so; but
> > that should be doable without significant effort.
> > >
> > > That's a good point, however without a 64 bit value, elapsed time
> > > between two arbitrary mbufs cannot be measured reliably due to not
> > > enough context, one way or another the low resolution value is also
> > needed.
> > >
> > > Obviously latency-sensitive applications are unlikely to perform
> > > lengthy buffering and require this but I'm not sure about all the
> > > possible use-cases. Considering many NICs expose 64 bit timestaps, I
> > > suggest we do not truncate them.
> > >
> > > I'm not a fan of the weird tradeoff either, PMDs will be tempted to
> > > fill the extra 32 bits whenever they can and negate the performance
> > > improvement of the first cache line.
> > 
> > I would tend to agree, and I don't really see any convenient way to
> > avoid putting in a 64-bit field for the timestamp in cache-line 0. If
> > we are ok with having this overlap/partially overlap with sequence
> > number, it will use up an extra 4B of storage in that cacheline.
> 
> I agree about the lack of convenience! And Adrien certainly has a point about PMD temptations.
> 
> However, I still don't think that a NICs ability to date-stamp a packet is sufficient reason to put a date-stamp in cache line 0 of the mbuf. Storing only the fast moving 32 bit in cache line 0 seems like a good compromise to me.
> 
> Maybe you can find just one more byte, so it can hold 17 minutes with nanosecond resolution. (I'm joking!)
> 
> Please don't sacrifice the sequence number for the seconds/hours/days part a timestamp. Maybe it could be configurable to use a 32 bit or 64 bit timestamp.
> 
Do you see both timestamp and sequence numbers being used together? I
would have thought that apps would either use one or the other? However,
your suggestion is workable in any case, to allow the sequence number to
overlap just the high 32 bits of the timestamp, rather than the low.

/Bruce

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 12:20               ` Bruce Richardson
@ 2016-10-25 12:33                 ` Morten Brørup
  2016-10-25 12:45                   ` Bruce Richardson
  0 siblings, 1 reply; 46+ messages in thread
From: Morten Brørup @ 2016-10-25 12:33 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: Adrien Mazarguil, Wiles, Keith, dev, Olivier Matz, Oleg Kuporosov

Comments at the end.

Med venlig hilsen / kind regards
- Morten Brørup

> -----Original Message-----
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Tuesday, October 25, 2016 2:20 PM
> To: Morten Brørup
> Cc: Adrien Mazarguil; Wiles, Keith; dev@dpdk.org; Olivier Matz; Oleg
> Kuporosov
> Subject: Re: [dpdk-dev] mbuf changes
> 
> On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Brørup wrote:
> > Comments inline.
> >
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
> > > Richardson
> > > Sent: Tuesday, October 25, 2016 1:14 PM
> > > To: Adrien Mazarguil
> > > Cc: Morten Brørup; Wiles, Keith; dev@dpdk.org; Olivier Matz; Oleg
> > > Kuporosov
> > > Subject: Re: [dpdk-dev] mbuf changes
> > >
> > > On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
> > > > On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup wrote:
> > > > > Comments inline.
> > > > >
> > > > > Med venlig hilsen / kind regards
> > > > > - Morten Brørup
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> > > > > > Sent: Tuesday, October 25, 2016 11:39 AM
> > > > > > To: Bruce Richardson
> > > > > > Cc: Wiles, Keith; Morten Brørup; dev@dpdk.org; Olivier Matz;
> > > > > > Oleg Kuporosov
> > > > > > Subject: Re: [dpdk-dev] mbuf changes
> > > > > >
> > > > > > On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson
> wrote:
> > > > > > > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith
> wrote:
> > > > > > [...]
> > > > > > > > > On Oct 24, 2016, at 10:49 AM, Morten Brørup
> > > > > > <mb@smartsharesystems.com> wrote:
> > > > > > [...]
> >
> > > > > > > One other point I'll mention is that we need to have a
> > > > > > > discussion on how/where to add in a timestamp value into
> the
> > > > > > > mbuf. Personally, I think it can be in a union with the
> > > sequence
> > > > > > > number value, but I also suspect that 32-bits of a
> timestamp
> > > > > > > is not going to be enough for
> > > > > > many.
> > > > > > >
> > > > > > > Thoughts?
> > > > > >
> > > > > > If we consider that timestamp representation should use
> > > nanosecond
> > > > > > granularity, a 32-bit value may likely wrap around too
> quickly
> > > > > > to be useful. We can also assume that applications requesting
> > > > > > timestamps may care more about latency than throughput, Oleg
> > > found
> > > > > > that using the second cache line for this purpose had a
> > > noticeable impact [1].
> > > > > >
> > > > > >  [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html
> > > > >
> > > > > I agree with Oleg about the latency vs. throughput importance
> > > > > for
> > > such applications.
> > > > >
> > > > > If you need high resolution timestamps, consider them to be
> > > generated by the NIC RX driver, possibly by the hardware itself
> > > (http://w3new.napatech.com/features/time-precision/hardware-time-
> > > stamp), so the timestamp belongs in the first cache line. And I am
> > > proposing that it should have the highest possible accuracy, which
> > > makes the value hardware dependent.
> > > > >
> > > > > Furthermore, I am arguing that we leave it up to the
> application
> > > > > to
> > > keep track of the slowly moving bits (i.e. counting whole seconds,
> > > hours and calendar date) out of band, so we don't use precious
> space
> > > in the mbuf. The application doesn't need the NIC RX driver's fast
> > > path to capture which date (or even which second) a packet was
> > > received. Yes, it adds complexity to the application, but we can't
> > > set aside 64 bit for a generic timestamp. Or as a weird tradeoff:
> > > Put the fast moving 32 bit in the first cache line and the slow
> > > moving 32 bit in the second cache line, as a placeholder for the
> application to fill out if needed.
> > > Yes, it means that the application needs to check the time and
> > > update its variable holding the slow moving time once every second
> > > or so; but that should be doable without significant effort.
> > > >
> > > > That's a good point, however without a 64 bit value, elapsed time
> > > > between two arbitrary mbufs cannot be measured reliably due to
> not
> > > > enough context, one way or another the low resolution value is
> > > > also
> > > needed.
> > > >
> > > > Obviously latency-sensitive applications are unlikely to perform
> > > > lengthy buffering and require this but I'm not sure about all the
> > > > possible use-cases. Considering many NICs expose 64 bit
> timestaps,
> > > > I suggest we do not truncate them.
> > > >
> > > > I'm not a fan of the weird tradeoff either, PMDs will be tempted
> > > > to fill the extra 32 bits whenever they can and negate the
> > > > performance improvement of the first cache line.
> > >
> > > I would tend to agree, and I don't really see any convenient way to
> > > avoid putting in a 64-bit field for the timestamp in cache-line 0.
> > > If we are ok with having this overlap/partially overlap with
> > > sequence number, it will use up an extra 4B of storage in that
> cacheline.
> >
> > I agree about the lack of convenience! And Adrien certainly has a
> point about PMD temptations.
> >
> > However, I still don't think that a NICs ability to date-stamp a
> packet is sufficient reason to put a date-stamp in cache line 0 of the
> mbuf. Storing only the fast moving 32 bit in cache line 0 seems like a
> good compromise to me.
> >
> > Maybe you can find just one more byte, so it can hold 17 minutes with
> > nanosecond resolution. (I'm joking!)
> >
> > Please don't sacrifice the sequence number for the seconds/hours/days
> part a timestamp. Maybe it could be configurable to use a 32 bit or 64
> bit timestamp.
> >
> Do you see both timestamp and sequence numbers being used together? I
> would have thought that apps would either use one or the other?
> However, your suggestion is workable in any case, to allow the sequence
> number to overlap just the high 32 bits of the timestamp, rather than
> the low.

In our case, I can foresee sequence numbers used for packet processing and timestamps for timing analysis (and possibly for packet capturing, when being used). For timing analysis, we don’t need long durations, e.g. 4 seconds with 32 bit nanosecond resolution suffices. And for packet capturing we are perfectly capable of adding the slowly moving 32 bit of the timestamp to our output data stream without fetching it from the mbuf.

> 
> /Bruce


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 12:33                 ` Morten Brørup
@ 2016-10-25 12:45                   ` Bruce Richardson
  2016-10-25 12:48                     ` Olivier Matz
  0 siblings, 1 reply; 46+ messages in thread
From: Bruce Richardson @ 2016-10-25 12:45 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Adrien Mazarguil, Wiles, Keith, dev, Olivier Matz, Oleg Kuporosov

On Tue, Oct 25, 2016 at 02:33:55PM +0200, Morten Brørup wrote:
> Comments at the end.
> 
> Med venlig hilsen / kind regards
> - Morten Brørup
> 
> > -----Original Message-----
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Tuesday, October 25, 2016 2:20 PM
> > To: Morten Brørup
> > Cc: Adrien Mazarguil; Wiles, Keith; dev@dpdk.org; Olivier Matz; Oleg
> > Kuporosov
> > Subject: Re: [dpdk-dev] mbuf changes
> > 
> > On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Brørup wrote:
> > > Comments inline.
> > >
> > > > -----Original Message-----
> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
> > > > Richardson
> > > > Sent: Tuesday, October 25, 2016 1:14 PM
> > > > To: Adrien Mazarguil
> > > > Cc: Morten Brørup; Wiles, Keith; dev@dpdk.org; Olivier Matz; Oleg
> > > > Kuporosov
> > > > Subject: Re: [dpdk-dev] mbuf changes
> > > >
> > > > On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
> > > > > On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup wrote:
> > > > > > Comments inline.
> > > > > >
> > > > > > Med venlig hilsen / kind regards
> > > > > > - Morten Brørup
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> > > > > > > Sent: Tuesday, October 25, 2016 11:39 AM
> > > > > > > To: Bruce Richardson
> > > > > > > Cc: Wiles, Keith; Morten Brørup; dev@dpdk.org; Olivier Matz;
> > > > > > > Oleg Kuporosov
> > > > > > > Subject: Re: [dpdk-dev] mbuf changes
> > > > > > >
> > > > > > > On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson
> > wrote:
> > > > > > > > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith
> > wrote:
> > > > > > > [...]
> > > > > > > > > > On Oct 24, 2016, at 10:49 AM, Morten Brørup
> > > > > > > <mb@smartsharesystems.com> wrote:
> > > > > > > [...]
> > >
> > > > > > > > One other point I'll mention is that we need to have a
> > > > > > > > discussion on how/where to add in a timestamp value into
> > the
> > > > > > > > mbuf. Personally, I think it can be in a union with the
> > > > sequence
> > > > > > > > number value, but I also suspect that 32-bits of a
> > timestamp
> > > > > > > > is not going to be enough for
> > > > > > > many.
> > > > > > > >
> > > > > > > > Thoughts?
> > > > > > >
> > > > > > > If we consider that timestamp representation should use
> > > > nanosecond
> > > > > > > granularity, a 32-bit value may likely wrap around too
> > quickly
> > > > > > > to be useful. We can also assume that applications requesting
> > > > > > > timestamps may care more about latency than throughput, Oleg
> > > > found
> > > > > > > that using the second cache line for this purpose had a
> > > > noticeable impact [1].
> > > > > > >
> > > > > > >  [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html
> > > > > >
> > > > > > I agree with Oleg about the latency vs. throughput importance
> > > > > > for
> > > > such applications.
> > > > > >
> > > > > > If you need high resolution timestamps, consider them to be
> > > > generated by the NIC RX driver, possibly by the hardware itself
> > > > (http://w3new.napatech.com/features/time-precision/hardware-time-
> > > > stamp), so the timestamp belongs in the first cache line. And I am
> > > > proposing that it should have the highest possible accuracy, which
> > > > makes the value hardware dependent.
> > > > > >
> > > > > > Furthermore, I am arguing that we leave it up to the
> > application
> > > > > > to
> > > > keep track of the slowly moving bits (i.e. counting whole seconds,
> > > > hours and calendar date) out of band, so we don't use precious
> > space
> > > > in the mbuf. The application doesn't need the NIC RX driver's fast
> > > > path to capture which date (or even which second) a packet was
> > > > received. Yes, it adds complexity to the application, but we can't
> > > > set aside 64 bit for a generic timestamp. Or as a weird tradeoff:
> > > > Put the fast moving 32 bit in the first cache line and the slow
> > > > moving 32 bit in the second cache line, as a placeholder for the
> > application to fill out if needed.
> > > > Yes, it means that the application needs to check the time and
> > > > update its variable holding the slow moving time once every second
> > > > or so; but that should be doable without significant effort.
> > > > >
> > > > > That's a good point, however without a 64 bit value, elapsed time
> > > > > between two arbitrary mbufs cannot be measured reliably due to
> > not
> > > > > enough context, one way or another the low resolution value is
> > > > > also
> > > > needed.
> > > > >
> > > > > Obviously latency-sensitive applications are unlikely to perform
> > > > > lengthy buffering and require this but I'm not sure about all the
> > > > > possible use-cases. Considering many NICs expose 64 bit
> > timestaps,
> > > > > I suggest we do not truncate them.
> > > > >
> > > > > I'm not a fan of the weird tradeoff either, PMDs will be tempted
> > > > > to fill the extra 32 bits whenever they can and negate the
> > > > > performance improvement of the first cache line.
> > > >
> > > > I would tend to agree, and I don't really see any convenient way to
> > > > avoid putting in a 64-bit field for the timestamp in cache-line 0.
> > > > If we are ok with having this overlap/partially overlap with
> > > > sequence number, it will use up an extra 4B of storage in that
> > cacheline.
> > >
> > > I agree about the lack of convenience! And Adrien certainly has a
> > point about PMD temptations.
> > >
> > > However, I still don't think that a NICs ability to date-stamp a
> > packet is sufficient reason to put a date-stamp in cache line 0 of the
> > mbuf. Storing only the fast moving 32 bit in cache line 0 seems like a
> > good compromise to me.
> > >
> > > Maybe you can find just one more byte, so it can hold 17 minutes with
> > > nanosecond resolution. (I'm joking!)
> > >
> > > Please don't sacrifice the sequence number for the seconds/hours/days
> > part a timestamp. Maybe it could be configurable to use a 32 bit or 64
> > bit timestamp.
> > >
> > Do you see both timestamp and sequence numbers being used together? I
> > would have thought that apps would either use one or the other?
> > However, your suggestion is workable in any case, to allow the sequence
> > number to overlap just the high 32 bits of the timestamp, rather than
> > the low.
> 
> In our case, I can foresee sequence numbers used for packet processing and timestamps for timing analysis (and possibly for packet capturing, when being used). For timing analysis, we don’t need long durations, e.g. 4 seconds with 32 bit nanosecond resolution suffices. And for packet capturing we are perfectly capable of adding the slowly moving 32 bit of the timestamp to our output data stream without fetching it from the mbuf.
> 

For the 32-bit timestamp case, it might be useful to have a right-shift
value passed in to the ethdev driver. If we assume a NIC with nanosecond
resolution, (or TSC value with resolution of that order of magnitude),
then the app can choose to have 1 ns resolution with 4 second
wraparound, or alternatively 4ns resolution with 16 second wraparound,
or even microsecond resolution with wrap around of over an hour.
The cost is obviously just a shift op in the driver code per packet -
hopefully with multiple packets done at a time using vector operations.

/Bruce

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 12:45                   ` Bruce Richardson
@ 2016-10-25 12:48                     ` Olivier Matz
  2016-10-25 13:13                       ` Morten Brørup
                                         ` (3 more replies)
  0 siblings, 4 replies; 46+ messages in thread
From: Olivier Matz @ 2016-10-25 12:48 UTC (permalink / raw)
  To: Bruce Richardson, Morten Brørup
  Cc: Adrien Mazarguil, Wiles, Keith, dev, Oleg Kuporosov



On 10/25/2016 02:45 PM, Bruce Richardson wrote:
> On Tue, Oct 25, 2016 at 02:33:55PM +0200, Morten Brørup wrote:
>> Comments at the end.
>>
>> Med venlig hilsen / kind regards
>> - Morten Brørup
>>
>>> -----Original Message-----
>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>> Sent: Tuesday, October 25, 2016 2:20 PM
>>> To: Morten Brørup
>>> Cc: Adrien Mazarguil; Wiles, Keith; dev@dpdk.org; Olivier Matz; Oleg
>>> Kuporosov
>>> Subject: Re: [dpdk-dev] mbuf changes
>>>
>>> On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Brørup wrote:
>>>> Comments inline.
>>>>
>>>>> -----Original Message-----
>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
>>>>> Richardson
>>>>> Sent: Tuesday, October 25, 2016 1:14 PM
>>>>> To: Adrien Mazarguil
>>>>> Cc: Morten Brørup; Wiles, Keith; dev@dpdk.org; Olivier Matz; Oleg
>>>>> Kuporosov
>>>>> Subject: Re: [dpdk-dev] mbuf changes
>>>>>
>>>>> On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
>>>>>> On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup wrote:
>>>>>>> Comments inline.
>>>>>>>
>>>>>>> Med venlig hilsen / kind regards
>>>>>>> - Morten Brørup
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
>>>>>>>> Sent: Tuesday, October 25, 2016 11:39 AM
>>>>>>>> To: Bruce Richardson
>>>>>>>> Cc: Wiles, Keith; Morten Brørup; dev@dpdk.org; Olivier Matz;
>>>>>>>> Oleg Kuporosov
>>>>>>>> Subject: Re: [dpdk-dev] mbuf changes
>>>>>>>>
>>>>>>>> On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson
>>> wrote:
>>>>>>>>> On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith
>>> wrote:
>>>>>>>> [...]
>>>>>>>>>>> On Oct 24, 2016, at 10:49 AM, Morten Brørup
>>>>>>>> <mb@smartsharesystems.com> wrote:
>>>>>>>> [...]
>>>>
>>>>>>>>> One other point I'll mention is that we need to have a
>>>>>>>>> discussion on how/where to add in a timestamp value into
>>> the
>>>>>>>>> mbuf. Personally, I think it can be in a union with the
>>>>> sequence
>>>>>>>>> number value, but I also suspect that 32-bits of a
>>> timestamp
>>>>>>>>> is not going to be enough for
>>>>>>>> many.
>>>>>>>>>
>>>>>>>>> Thoughts?
>>>>>>>>
>>>>>>>> If we consider that timestamp representation should use
>>>>> nanosecond
>>>>>>>> granularity, a 32-bit value may likely wrap around too
>>> quickly
>>>>>>>> to be useful. We can also assume that applications requesting
>>>>>>>> timestamps may care more about latency than throughput, Oleg
>>>>> found
>>>>>>>> that using the second cache line for this purpose had a
>>>>> noticeable impact [1].
>>>>>>>>
>>>>>>>>  [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html
>>>>>>>
>>>>>>> I agree with Oleg about the latency vs. throughput importance
>>>>>>> for
>>>>> such applications.
>>>>>>>
>>>>>>> If you need high resolution timestamps, consider them to be
>>>>> generated by the NIC RX driver, possibly by the hardware itself
>>>>> (http://w3new.napatech.com/features/time-precision/hardware-time-
>>>>> stamp), so the timestamp belongs in the first cache line. And I am
>>>>> proposing that it should have the highest possible accuracy, which
>>>>> makes the value hardware dependent.
>>>>>>>
>>>>>>> Furthermore, I am arguing that we leave it up to the
>>> application
>>>>>>> to
>>>>> keep track of the slowly moving bits (i.e. counting whole seconds,
>>>>> hours and calendar date) out of band, so we don't use precious
>>> space
>>>>> in the mbuf. The application doesn't need the NIC RX driver's fast
>>>>> path to capture which date (or even which second) a packet was
>>>>> received. Yes, it adds complexity to the application, but we can't
>>>>> set aside 64 bit for a generic timestamp. Or as a weird tradeoff:
>>>>> Put the fast moving 32 bit in the first cache line and the slow
>>>>> moving 32 bit in the second cache line, as a placeholder for the
>>> application to fill out if needed.
>>>>> Yes, it means that the application needs to check the time and
>>>>> update its variable holding the slow moving time once every second
>>>>> or so; but that should be doable without significant effort.
>>>>>>
>>>>>> That's a good point, however without a 64 bit value, elapsed time
>>>>>> between two arbitrary mbufs cannot be measured reliably due to
>>> not
>>>>>> enough context, one way or another the low resolution value is
>>>>>> also
>>>>> needed.
>>>>>>
>>>>>> Obviously latency-sensitive applications are unlikely to perform
>>>>>> lengthy buffering and require this but I'm not sure about all the
>>>>>> possible use-cases. Considering many NICs expose 64 bit
>>> timestaps,
>>>>>> I suggest we do not truncate them.
>>>>>>
>>>>>> I'm not a fan of the weird tradeoff either, PMDs will be tempted
>>>>>> to fill the extra 32 bits whenever they can and negate the
>>>>>> performance improvement of the first cache line.
>>>>>
>>>>> I would tend to agree, and I don't really see any convenient way to
>>>>> avoid putting in a 64-bit field for the timestamp in cache-line 0.
>>>>> If we are ok with having this overlap/partially overlap with
>>>>> sequence number, it will use up an extra 4B of storage in that
>>> cacheline.
>>>>
>>>> I agree about the lack of convenience! And Adrien certainly has a
>>> point about PMD temptations.
>>>>
>>>> However, I still don't think that a NICs ability to date-stamp a
>>> packet is sufficient reason to put a date-stamp in cache line 0 of the
>>> mbuf. Storing only the fast moving 32 bit in cache line 0 seems like a
>>> good compromise to me.
>>>>
>>>> Maybe you can find just one more byte, so it can hold 17 minutes with
>>>> nanosecond resolution. (I'm joking!)
>>>>
>>>> Please don't sacrifice the sequence number for the seconds/hours/days
>>> part a timestamp. Maybe it could be configurable to use a 32 bit or 64
>>> bit timestamp.
>>>>
>>> Do you see both timestamp and sequence numbers being used together? I
>>> would have thought that apps would either use one or the other?
>>> However, your suggestion is workable in any case, to allow the sequence
>>> number to overlap just the high 32 bits of the timestamp, rather than
>>> the low.
>>
>> In our case, I can foresee sequence numbers used for packet processing and timestamps for timing analysis (and possibly for packet capturing, when being used). For timing analysis, we don’t need long durations, e.g. 4 seconds with 32 bit nanosecond resolution suffices. And for packet capturing we are perfectly capable of adding the slowly moving 32 bit of the timestamp to our output data stream without fetching it from the mbuf.
>>

We should keep in mind that today we have the seqn field but it is
not used by any PMD. In case it is implemented, would it be a per-queue
sequence number? Is it useful from an application view?

This field is only used by the librte_reorder library, and in my
opinion, we should consider moving it in the second cache line since
it is not filled by the PMD.


> For the 32-bit timestamp case, it might be useful to have a right-shift
> value passed in to the ethdev driver. If we assume a NIC with nanosecond
> resolution, (or TSC value with resolution of that order of magnitude),
> then the app can choose to have 1 ns resolution with 4 second
> wraparound, or alternatively 4ns resolution with 16 second wraparound,
> or even microsecond resolution with wrap around of over an hour.
> The cost is obviously just a shift op in the driver code per packet -
> hopefully with multiple packets done at a time using vector operations.


About the timestamp, we can manage to find 64 bits in the first cache
line, without sacrifying any field we have today. The question is more
for the fields we may want to add later.

To answer to the question of the size of the timestamp, the first
question is to know what is the precision required for the
applications using it?

I don't quite like the idea of splitting the timestamp in the 2 cache
lines, I think it would not be easy to use.


Olivier

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 11:54     ` Shreyansh Jain
  2016-10-25 12:05       ` Bruce Richardson
@ 2016-10-25 12:49       ` Morten Brørup
  1 sibling, 0 replies; 46+ messages in thread
From: Morten Brørup @ 2016-10-25 12:49 UTC (permalink / raw)
  To: Shreyansh Jain, Bruce Richardson; +Cc: Wiles, Keith, dev, Olivier Matz

Comments inline.

Med venlig hilsen / kind regards
- Morten Brørup


> -----Original Message-----
> From: Shreyansh Jain [mailto:shreyansh.jain@nxp.com]
> Sent: Tuesday, October 25, 2016 1:54 PM
> To: Bruce Richardson
> Cc: Wiles, Keith; Morten Brørup; dev@dpdk.org; Olivier Matz
> Subject: Re: mbuf changes
> 
> On Monday 24 October 2016 09:55 PM, Bruce Richardson wrote:
> > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
> >>
> >>> On Oct 24, 2016, at 10:49 AM, Morten Brørup
> <mb@smartsharesystems.com> wrote:
> >>>
> >>> First of all: Thanks for a great DPDK Userspace 2016!
> >>>
> >>>
> >>>
> >>> Continuing the Userspace discussion about Olivier Matz’s proposed
> mbuf changes...
> >
> > Thanks for keeping the discussion going!
> >>>
> >>>
> >>>
> >>> 1.
> >>>
> >>> Stephen Hemminger had a noteworthy general comment about keeping
> metadata for the NIC in the appropriate section of the mbuf: Metadata
> generated by the NIC’s RX handler belongs in the first cache line, and
> metadata required by the NIC’s TX handler belongs in the second cache
> line. This also means that touching the second cache line on ingress
> should be avoided if possible; and Bruce Richardson mentioned that for
> this reason m->next was zeroed on free().
> >>>
> > Thinking about it, I suspect there are more fields we can reset on
> free
> > to save time on alloc. Refcnt, as discussed below is one of them, but
> so
> > too could be the nb_segs field and possibly others.
> >
> >>>
> >>>
> >>> 2.
> >>>
> >>> There seemed to be consensus that the size of m->refcnt should
> match the size of m->port because a packet could be duplicated on all
> physical ports for L3 multicast and L2 flooding.
> >>>
> >>> Furthermore, although a single physical machine (i.e. a single
> server) with 255 physical ports probably doesn’t exist, it might
> contain more than 255 virtual machines with a virtual port each, so it
> makes sense extending these mbuf fields from 8 to 16 bits.
> >>
> >> I thought we also talked about removing the m->port from the mbuf as
> it is not really needed.
> >>
> > Yes, this was mentioned, and also the option of moving the port value
> to
> > the second cacheline, but it appears that NXP are using the port
> value
> > in their NIC drivers for passing in metadata, so we'd need their
> > agreement on any move (or removal).
> 
> I am not sure where NXP's NIC came into picture on this, but now that
> it
> is highlighted, this field is required for libevent implementation [1].
> 
> A scheduler sending an event, which can be a packet, would only have
> information of a flow_id. From this matching it back to a port, without
> mbuf->port, would be very difficult (costly). There may be way around
> this but at least in current proposal I think port would be important
> to
> have - even if in second cache line.
> 

Since it's a natural part of the RX handler, it rightfully belongs in the first cache line.

Admitting that I haven't read the referenced proposal in detail, I would like to reiterate that perhaps the Flow Director fields in the mbuf could be used instead of the port field.

> But, off the top of my head, as of now it is not being used for any
> specific purpose in NXP's PMD implementation.
> 
> Even the SoC patches don't necessarily rely on it except using it
> because it is available.
> 
> @Bruce: where did you get the NXP context here from?
> 
> [1] http://dpdk.org/ml/archives/dev/2016-October/048592.html
> 
> >
> >>>
> >>>
> >>>
> >>> 3.
> >>>
> >>> Someone (Bruce Richardson?) suggested moving m->refcnt and m->port
> to the second cache line, which then generated questions from the
> audience about the real life purpose of m->port, and if m->port could
> be removed from the mbuf structure.
> >>>
> >>>
> >>>
> >>> 4.
> >>>
> >>> I suggested using offset -1 for m->refcnt, so m->refcnt becomes 0
> on first allocation. This is based on the assumption that other mbuf
> fields must be zeroed at alloc()/free() anyway, so zeroing m->refcnt is
> cheaper than setting it to 1.
> >>>
> >>> Furthermore (regardless of m->refcnt offset), I suggested that it
> is not required to modify m->refcnt when allocating and freeing the
> mbuf, thus saving one write operation on both alloc() and free().
> However, this assumes that m->refcnt debugging, e.g. underrun
> detection, is not required.
> >
> > I don't think it really matters what sentinal value is used for the
> > refcnt because it can't be blindly assigned on free like other
> fields.
> > However, I think 0 as first reference value becomes more awkward
> > than 1, because we need to deal with underflow. Consider the
> situation
> > where we have two references to the mbuf, so refcnt is 1, and both
> are
> > freed at the same time. Since the refcnt is not-zero, then both cores
> > will do an atomic decrement simultaneously giving a refcnt of -1. We
> can
> > then set this back to zero before freeing, however, I'd still prefer
> to
> > have refcnt be an accurate value so that it always stays positive,
> and
> > we can still set it to "one" on free to avoid having to set on alloc.
> >
> > Also, if we set refcnt on free rather than alloc, it does set itself
> up
> > as a good candidate for moving to the second cacheline. Fast-path
> > processing does not normally update the value.
> >
> >>>
> >>>
> >>>
> >>> 5.
> >>>
> >>> And here’s something new to think about:
> >>>
> >>> m->next already reveals if there are more segments to a packet.
> Which purpose does m->nb_segs serve that is not already covered by m-
> >next?
> >
> > It is duplicate info, but nb_segs can be used to check the validity
> of
> > the next pointer without having to read the second mbuf cacheline.
> >
> > Whether it's worth having is something I'm happy enough to discuss,
> > though.
> >
> > One other point I'll mention is that we need to have a discussion on
> > how/where to add in a timestamp value into the mbuf. Personally, I
> think
> > it can be in a union with the sequence number value, but I also
> suspect
> > that 32-bits of a timestamp is not going to be enough for many.
> >
> > Thoughts?
> >
> > /Bruce
> >
> 
> 
> --
> -
> Shreyansh


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 10:22           ` Morten Brørup
@ 2016-10-25 13:00             ` Olivier Matz
  2016-10-25 13:04               ` Ramia, Kannan Babu
                                 ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: Olivier Matz @ 2016-10-25 13:00 UTC (permalink / raw)
  To: Morten Brørup, Ananyev, Konstantin, Richardson, Bruce
  Cc: Wiles, Keith, dev

Hi,

On 10/25/2016 12:22 PM, Morten Brørup wrote:
> Comments inline (at the end).
> 
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
>> Konstantin
>> Sent: Tuesday, October 25, 2016 12:03 PM
>> To: Richardson, Bruce; Morten Brørup
>> Cc: Wiles, Keith; dev@dpdk.org; Olivier Matz
>> Subject: Re: [dpdk-dev] mbuf changes
>>
>> Hi everyone,
>>
>>> -----Original Message-----
>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
>>> Sent: Tuesday, October 25, 2016 9:54 AM
>>> To: Morten Brørup <mb@smartsharesystems.com>
>>> Cc: Wiles, Keith <keith.wiles@intel.com>; dev@dpdk.org; Olivier Matz
>>> <olivier.matz@6wind.com>
>>> Subject: Re: [dpdk-dev] mbuf changes
>>>
>>> On Mon, Oct 24, 2016 at 11:47:16PM +0200, Morten Brørup wrote:
>>>> Comments inline.
>>>>
>>>> Med venlig hilsen / kind regards
>>>> - Morten Brørup
>>>>
>>>>> -----Original Message-----
>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
>>>>> Richardson
>>>>> Sent: Monday, October 24, 2016 6:26 PM
>>>>> To: Wiles, Keith
>>>>> Cc: Morten Brørup; dev@dpdk.org; Olivier Matz
>>>>> Subject: Re: [dpdk-dev] mbuf changes
>>>>>
>>>>> On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
>>>>>>
>>>>>>> On Oct 24, 2016, at 10:49 AM, Morten Brørup
>>>>>>> <mb@smartsharesystems.com>
>>>>> wrote:
>>>>>>>
>>>>>>> 2.
>>>>>>>
>>>>>>> There seemed to be consensus that the size of m->refcnt
>> should
>>>>>>> match
>>>>> the size of m->port because a packet could be duplicated on all
>>>>> physical ports for L3 multicast and L2 flooding.
>>>>>>>
>>>>>>> Furthermore, although a single physical machine (i.e. a
>> single
>>>>>>> server)
>>>>> with 255 physical ports probably doesn’t exist, it might contain
>>>>> more than
>>>>> 255 virtual machines with a virtual port each, so it makes sense
>>>>> extending these mbuf fields from 8 to 16 bits.
>>>>>>
>>>>>> I thought we also talked about removing the m->port from the
>>>>>> mbuf as it
>>>>> is not really needed.
>>>>>>
>>>>> Yes, this was mentioned, and also the option of moving the port
>>>>> value to the second cacheline, but it appears that NXP are using
>>>>> the port value in their NIC drivers for passing in metadata, so
>>>>> we'd need their agreement on any move (or removal).
>>>>>
>>>> If a single driver instance services multiple ports, so the ports
>>>> are not polled individually, the m->port member will be required to
>>>> identify
>>> the port. In that case, it might also used elsewhere in the ingress
>> path, and thus appropriate having in the first cache line.
>>
>> Ok, but these days most of devices have multiple rx queues.
>> So identify the RX source properly you need not only port, but
>> port+queue (at least 3 bytes).
>> But I suppose better to wait for NXP input here.
>>
>>> So yes, this needs
>>> further discussion with NXP.
> 
> It seems that this concept might be somewhat similar to the Flow Director information, which already has plenty of bits in the first cache line. You should consider if this can be somehow folded into the m->hash union.


I'll tend to agree with Morten.

First, I think that having more than 255 virtual links is possible,
so increasing the port size to 16 bits is not a bad idea.

I think the port information is more useful for a network stack
than the queue_id, but it may not be the case for all applications.
On the other hand, the queue_id (or something else providing the same
level of information) could led in the flow director structure.

Now, the question is: do we really need the port to be inside the mbuf?
Indeed, the application can already associate a bulk of packet with a
port id.

The reason why I'll prefer to keep it is because I'm pretty sure that
some applications use it. At least in the examples/ directory, we can
find distributor, dpdk_qat, ipv4_multicast, load_balancer,
packet_ordering. I did not check these apps in detail, but it makes me
feel that other external applications could make use of the port field.


Regards,
Olivier

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 13:00             ` Olivier Matz
@ 2016-10-25 13:04               ` Ramia, Kannan Babu
  2016-10-25 13:24                 ` Thomas Monjalon
  2016-10-25 13:15               ` Bruce Richardson
  2016-10-25 13:20               ` Thomas Monjalon
  2 siblings, 1 reply; 46+ messages in thread
From: Ramia, Kannan Babu @ 2016-10-25 13:04 UTC (permalink / raw)
  To: Olivier Matz, Morten Brørup, Ananyev, Konstantin, Richardson, Bruce
  Cc: Wiles, Keith, dev

Port filed is important meta information for the application use like CGNAT vEPC functions etc. I strongly recommend to keep the field in mind meta.

Sent from my ASUS

-------- Original Message --------
From:Olivier Matz
Sent:Tue, 25 Oct 2016 18:31:36 +0530
To:Morten Brørup ,"Ananyev, Konstantin" ,"Richardson, Bruce"
Cc:"Wiles, Keith" ,dev@dpdk.org
Subject:Re: [dpdk-dev] mbuf changes

Hi,

On 10/25/2016 12:22 PM, Morten Brørup wrote:
> Comments inline (at the end).
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
>> Konstantin
>> Sent: Tuesday, October 25, 2016 12:03 PM
>> To: Richardson, Bruce; Morten Brørup
>> Cc: Wiles, Keith; dev@dpdk.org; Olivier Matz
>> Subject: Re: [dpdk-dev] mbuf changes
>>
>> Hi everyone,
>>
>>> -----Original Message-----
>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
>>> Sent: Tuesday, October 25, 2016 9:54 AM
>>> To: Morten Brørup <mb@smartsharesystems.com>
>>> Cc: Wiles, Keith <keith.wiles@intel.com>; dev@dpdk.org; Olivier Matz
>>> <olivier.matz@6wind.com>
>>> Subject: Re: [dpdk-dev] mbuf changes
>>>
>>> On Mon, Oct 24, 2016 at 11:47:16PM +0200, Morten Brørup wrote:
>>>> Comments inline.
>>>>
>>>> Med venlig hilsen / kind regards
>>>> - Morten Brørup
>>>>
>>>>> -----Original Message-----
>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
>>>>> Richardson
>>>>> Sent: Monday, October 24, 2016 6:26 PM
>>>>> To: Wiles, Keith
>>>>> Cc: Morten Brørup; dev@dpdk.org; Olivier Matz
>>>>> Subject: Re: [dpdk-dev] mbuf changes
>>>>>
>>>>> On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
>>>>>>
>>>>>>> On Oct 24, 2016, at 10:49 AM, Morten Brørup
>>>>>>> <mb@smartsharesystems.com>
>>>>> wrote:
>>>>>>>
>>>>>>> 2.
>>>>>>>
>>>>>>> There seemed to be consensus that the size of m->refcnt
>> should
>>>>>>> match
>>>>> the size of m->port because a packet could be duplicated on all
>>>>> physical ports for L3 multicast and L2 flooding.
>>>>>>>
>>>>>>> Furthermore, although a single physical machine (i.e. a
>> single
>>>>>>> server)
>>>>> with 255 physical ports probably doesn’t exist, it might contain
>>>>> more than
>>>>> 255 virtual machines with a virtual port each, so it makes sense
>>>>> extending these mbuf fields from 8 to 16 bits.
>>>>>>
>>>>>> I thought we also talked about removing the m->port from the
>>>>>> mbuf as it
>>>>> is not really needed.
>>>>>>
>>>>> Yes, this was mentioned, and also the option of moving the port
>>>>> value to the second cacheline, but it appears that NXP are using
>>>>> the port value in their NIC drivers for passing in metadata, so
>>>>> we'd need their agreement on any move (or removal).
>>>>>
>>>> If a single driver instance services multiple ports, so the ports
>>>> are not polled individually, the m->port member will be required to
>>>> identify
>>> the port. In that case, it might also used elsewhere in the ingress
>> path, and thus appropriate having in the first cache line.
>>
>> Ok, but these days most of devices have multiple rx queues.
>> So identify the RX source properly you need not only port, but
>> port+queue (at least 3 bytes).
>> But I suppose better to wait for NXP input here.
>>
>>> So yes, this needs
>>> further discussion with NXP.
>
> It seems that this concept might be somewhat similar to the Flow Director information, which already has plenty of bits in the first cache line. You should consider if this can be somehow folded into the m->hash union.


I'll tend to agree with Morten.

First, I think that having more than 255 virtual links is possible,
so increasing the port size to 16 bits is not a bad idea.

I think the port information is more useful for a network stack
than the queue_id, but it may not be the case for all applications.
On the other hand, the queue_id (or something else providing the same
level of information) could led in the flow director structure.

Now, the question is: do we really need the port to be inside the mbuf?
Indeed, the application can already associate a bulk of packet with a
port id.

The reason why I'll prefer to keep it is because I'm pretty sure that
some applications use it. At least in the examples/ directory, we can
find distributor, dpdk_qat, ipv4_multicast, load_balancer,
packet_ordering. I did not check these apps in detail, but it makes me
feel that other external applications could make use of the port field.


Regards,
Olivier

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 12:48                     ` Olivier Matz
@ 2016-10-25 13:13                       ` Morten Brørup
  2016-10-25 13:38                       ` Ananyev, Konstantin
                                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 46+ messages in thread
From: Morten Brørup @ 2016-10-25 13:13 UTC (permalink / raw)
  To: Olivier Matz, Bruce Richardson
  Cc: Adrien Mazarguil, Wiles, Keith, dev, Oleg Kuporosov

Comments inline.

Med venlig hilsen / kind regards
- Morten Brørup


> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Olivier Matz
> Sent: Tuesday, October 25, 2016 2:49 PM
> To: Bruce Richardson; Morten Brørup
> Cc: Adrien Mazarguil; Wiles, Keith; dev@dpdk.org; Oleg Kuporosov
> Subject: Re: [dpdk-dev] mbuf changes
> 
> 
> 
> On 10/25/2016 02:45 PM, Bruce Richardson wrote:
> > On Tue, Oct 25, 2016 at 02:33:55PM +0200, Morten Brørup wrote:
> >> Comments at the end.
> >>
> >> Med venlig hilsen / kind regards
> >> - Morten Brørup
> >>
> >>> -----Original Message-----
> >>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> >>> Sent: Tuesday, October 25, 2016 2:20 PM
> >>> To: Morten Brørup
> >>> Cc: Adrien Mazarguil; Wiles, Keith; dev@dpdk.org; Olivier Matz;
> Oleg
> >>> Kuporosov
> >>> Subject: Re: [dpdk-dev] mbuf changes
> >>>
> >>> On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Brørup wrote:
> >>>> Comments inline.
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
> >>>>> Richardson
> >>>>> Sent: Tuesday, October 25, 2016 1:14 PM
> >>>>> To: Adrien Mazarguil
> >>>>> Cc: Morten Brørup; Wiles, Keith; dev@dpdk.org; Olivier Matz; Oleg
> >>>>> Kuporosov
> >>>>> Subject: Re: [dpdk-dev] mbuf changes
> >>>>>
> >>>>> On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
> >>>>>> On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup wrote:
> >>>>>>> Comments inline.
> >>>>>>>
> >>>>>>> Med venlig hilsen / kind regards
> >>>>>>> - Morten Brørup
> >>>>>>>
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> >>>>>>>> Sent: Tuesday, October 25, 2016 11:39 AM
> >>>>>>>> To: Bruce Richardson
> >>>>>>>> Cc: Wiles, Keith; Morten Brørup; dev@dpdk.org; Olivier Matz;
> >>>>>>>> Oleg Kuporosov
> >>>>>>>> Subject: Re: [dpdk-dev] mbuf changes
> >>>>>>>>
> >>>>>>>> On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson
> >>> wrote:
> >>>>>>>>> On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith
> >>> wrote:
> >>>>>>>> [...]
> >>>>>>>>>>> On Oct 24, 2016, at 10:49 AM, Morten Brørup
> >>>>>>>> <mb@smartsharesystems.com> wrote:
> >>>>>>>> [...]
> >>>>
> >>>>>>>>> One other point I'll mention is that we need to have a
> >>>>>>>>> discussion on how/where to add in a timestamp value into
> >>> the
> >>>>>>>>> mbuf. Personally, I think it can be in a union with the
> >>>>> sequence
> >>>>>>>>> number value, but I also suspect that 32-bits of a
> >>> timestamp
> >>>>>>>>> is not going to be enough for
> >>>>>>>> many.
> >>>>>>>>>
> >>>>>>>>> Thoughts?
> >>>>>>>>
> >>>>>>>> If we consider that timestamp representation should use
> >>>>> nanosecond
> >>>>>>>> granularity, a 32-bit value may likely wrap around too
> >>> quickly
> >>>>>>>> to be useful. We can also assume that applications requesting
> >>>>>>>> timestamps may care more about latency than throughput, Oleg
> >>>>> found
> >>>>>>>> that using the second cache line for this purpose had a
> >>>>> noticeable impact [1].
> >>>>>>>>
> >>>>>>>>  [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html
> >>>>>>>
> >>>>>>> I agree with Oleg about the latency vs. throughput importance
> >>>>>>> for
> >>>>> such applications.
> >>>>>>>
> >>>>>>> If you need high resolution timestamps, consider them to be
> >>>>> generated by the NIC RX driver, possibly by the hardware itself
> >>>>> (http://w3new.napatech.com/features/time-precision/hardware-time-
> >>>>> stamp), so the timestamp belongs in the first cache line. And I
> am
> >>>>> proposing that it should have the highest possible accuracy,
> which
> >>>>> makes the value hardware dependent.
> >>>>>>>
> >>>>>>> Furthermore, I am arguing that we leave it up to the
> >>> application
> >>>>>>> to
> >>>>> keep track of the slowly moving bits (i.e. counting whole
> seconds,
> >>>>> hours and calendar date) out of band, so we don't use precious
> >>> space
> >>>>> in the mbuf. The application doesn't need the NIC RX driver's
> fast
> >>>>> path to capture which date (or even which second) a packet was
> >>>>> received. Yes, it adds complexity to the application, but we
> can't
> >>>>> set aside 64 bit for a generic timestamp. Or as a weird tradeoff:
> >>>>> Put the fast moving 32 bit in the first cache line and the slow
> >>>>> moving 32 bit in the second cache line, as a placeholder for the
> >>> application to fill out if needed.
> >>>>> Yes, it means that the application needs to check the time and
> >>>>> update its variable holding the slow moving time once every
> second
> >>>>> or so; but that should be doable without significant effort.
> >>>>>>
> >>>>>> That's a good point, however without a 64 bit value, elapsed
> time
> >>>>>> between two arbitrary mbufs cannot be measured reliably due to
> >>> not
> >>>>>> enough context, one way or another the low resolution value is
> >>>>>> also
> >>>>> needed.
> >>>>>>
> >>>>>> Obviously latency-sensitive applications are unlikely to perform
> >>>>>> lengthy buffering and require this but I'm not sure about all
> the
> >>>>>> possible use-cases. Considering many NICs expose 64 bit
> >>> timestaps,
> >>>>>> I suggest we do not truncate them.
> >>>>>>
> >>>>>> I'm not a fan of the weird tradeoff either, PMDs will be tempted
> >>>>>> to fill the extra 32 bits whenever they can and negate the
> >>>>>> performance improvement of the first cache line.
> >>>>>
> >>>>> I would tend to agree, and I don't really see any convenient way
> >>>>> to avoid putting in a 64-bit field for the timestamp in cache-
> line 0.
> >>>>> If we are ok with having this overlap/partially overlap with
> >>>>> sequence number, it will use up an extra 4B of storage in that
> >>> cacheline.
> >>>>
> >>>> I agree about the lack of convenience! And Adrien certainly has a
> >>> point about PMD temptations.
> >>>>
> >>>> However, I still don't think that a NICs ability to date-stamp a
> >>> packet is sufficient reason to put a date-stamp in cache line 0 of
> >>> the mbuf. Storing only the fast moving 32 bit in cache line 0 seems
> >>> like a good compromise to me.
> >>>>
> >>>> Maybe you can find just one more byte, so it can hold 17 minutes
> >>>> with nanosecond resolution. (I'm joking!)
> >>>>
> >>>> Please don't sacrifice the sequence number for the
> >>>> seconds/hours/days
> >>> part a timestamp. Maybe it could be configurable to use a 32 bit or
> >>> 64 bit timestamp.
> >>>>
> >>> Do you see both timestamp and sequence numbers being used together?
> >>> I would have thought that apps would either use one or the other?
> >>> However, your suggestion is workable in any case, to allow the
> >>> sequence number to overlap just the high 32 bits of the timestamp,
> >>> rather than the low.
> >>
> >> In our case, I can foresee sequence numbers used for packet
> processing and timestamps for timing analysis (and possibly for packet
> capturing, when being used). For timing analysis, we don’t need long
> durations, e.g. 4 seconds with 32 bit nanosecond resolution suffices.
> And for packet capturing we are perfectly capable of adding the slowly
> moving 32 bit of the timestamp to our output data stream without
> fetching it from the mbuf.
> >>
> 
> We should keep in mind that today we have the seqn field but it is not
> used by any PMD. In case it is implemented, would it be a per-queue
> sequence number? Is it useful from an application view?
> 
> This field is only used by the librte_reorder library, and in my
> opinion, we should consider moving it in the second cache line since it
> is not filled by the PMD.

Our DPDK project is not yet sufficiently progressed to be able to provide quality feedback about this.

> > For the 32-bit timestamp case, it might be useful to have a
> > right-shift value passed in to the ethdev driver. If we assume a NIC
> > with nanosecond resolution, (or TSC value with resolution of that
> > order of magnitude), then the app can choose to have 1 ns resolution
> > with 4 second wraparound, or alternatively 4ns resolution with 16
> > second wraparound, or even microsecond resolution with wrap around of
> over an hour.
> > The cost is obviously just a shift op in the driver code per packet -
> > hopefully with multiple packets done at a time using vector
> operations.
> 
> 
> About the timestamp, we can manage to find 64 bits in the first cache
> line, without sacrifying any field we have today. The question is more
> for the fields we may want to add later.
> 
> To answer to the question of the size of the timestamp, the first
> question is to know what is the precision required for the applications
> using it?

In our application, microsecond resolution probably suffices, although we do use nanoseconds today. And a wrapping counter is fine for us, so we don’t need the slow moving bits.

For dedicated packet capture applications, please refer to e.g. Napatech's reasoning: http://w3new.napatech.com/features/time-precision/hardware-time-stamp

> I don't quite like the idea of splitting the timestamp in the 2 cache
> lines, I think it would not be easy to use.
> 
> 
> Olivier


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-24 15:49 mbuf changes Morten Brørup
  2016-10-24 16:11 ` Wiles, Keith
@ 2016-10-25 13:14 ` Olivier Matz
  2016-10-25 13:18   ` Morten Brørup
  1 sibling, 1 reply; 46+ messages in thread
From: Olivier Matz @ 2016-10-25 13:14 UTC (permalink / raw)
  To: Morten Brørup, dev

Hi,

On 10/24/2016 05:49 PM, Morten Brørup wrote:
> And here’s something new to think about:
> 
> m->next already reveals if there are more segments to a packet. Which
> purpose does m->nb_segs serve that is not already covered by m->next?
> 

I was asking myself the same question some time ago:
http://dpdk.org/ml/archives/dev/2016-May/039483.html

But it seems nb_segs is useful for PMDs on TX side, to
anticipate how many descriptors a packet will use in the
TX ring. It can also help a PMD to check that this packet
is supported by the hardware (too many segments) without
browsing the list.

So finally I think it should be kept in the mbuf. But as
suggested by Bruce, it could go in the second cache line,
since m->next is also there, at the condition that this
field is set to 1 in mbuf_free().

Regards,
Olivier

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 13:00             ` Olivier Matz
  2016-10-25 13:04               ` Ramia, Kannan Babu
@ 2016-10-25 13:15               ` Bruce Richardson
  2016-10-25 13:20               ` Thomas Monjalon
  2 siblings, 0 replies; 46+ messages in thread
From: Bruce Richardson @ 2016-10-25 13:15 UTC (permalink / raw)
  To: Olivier Matz; +Cc: Morten Brørup, Ananyev, Konstantin, Wiles, Keith, dev

On Tue, Oct 25, 2016 at 03:00:39PM +0200, Olivier Matz wrote:
> Hi,
> 
> On 10/25/2016 12:22 PM, Morten Brørup wrote:
> > Comments inline (at the end).
> > 
> >> -----Original Message-----
> >> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> >> Konstantin
> >> Sent: Tuesday, October 25, 2016 12:03 PM
> >> To: Richardson, Bruce; Morten Brørup
> >> Cc: Wiles, Keith; dev@dpdk.org; Olivier Matz
> >> Subject: Re: [dpdk-dev] mbuf changes
> >>
> >> Hi everyone,
> >>
> >>> -----Original Message-----
> >>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> >>> Sent: Tuesday, October 25, 2016 9:54 AM
> >>> To: Morten Brørup <mb@smartsharesystems.com>
> >>> Cc: Wiles, Keith <keith.wiles@intel.com>; dev@dpdk.org; Olivier Matz
> >>> <olivier.matz@6wind.com>
> >>> Subject: Re: [dpdk-dev] mbuf changes
> >>>
> >>> On Mon, Oct 24, 2016 at 11:47:16PM +0200, Morten Brørup wrote:
> >>>> Comments inline.
> >>>>
> >>>> Med venlig hilsen / kind regards
> >>>> - Morten Brørup
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
> >>>>> Richardson
> >>>>> Sent: Monday, October 24, 2016 6:26 PM
> >>>>> To: Wiles, Keith
> >>>>> Cc: Morten Brørup; dev@dpdk.org; Olivier Matz
> >>>>> Subject: Re: [dpdk-dev] mbuf changes
> >>>>>
> >>>>> On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
> >>>>>>
> >>>>>>> On Oct 24, 2016, at 10:49 AM, Morten Brørup
> >>>>>>> <mb@smartsharesystems.com>
> >>>>> wrote:
> >>>>>>>
> >>>>>>> 2.
> >>>>>>>
> >>>>>>> There seemed to be consensus that the size of m->refcnt
> >> should
> >>>>>>> match
> >>>>> the size of m->port because a packet could be duplicated on all
> >>>>> physical ports for L3 multicast and L2 flooding.
> >>>>>>>
> >>>>>>> Furthermore, although a single physical machine (i.e. a
> >> single
> >>>>>>> server)
> >>>>> with 255 physical ports probably doesn’t exist, it might contain
> >>>>> more than
> >>>>> 255 virtual machines with a virtual port each, so it makes sense
> >>>>> extending these mbuf fields from 8 to 16 bits.
> >>>>>>
> >>>>>> I thought we also talked about removing the m->port from the
> >>>>>> mbuf as it
> >>>>> is not really needed.
> >>>>>>
> >>>>> Yes, this was mentioned, and also the option of moving the port
> >>>>> value to the second cacheline, but it appears that NXP are using
> >>>>> the port value in their NIC drivers for passing in metadata, so
> >>>>> we'd need their agreement on any move (or removal).
> >>>>>
> >>>> If a single driver instance services multiple ports, so the ports
> >>>> are not polled individually, the m->port member will be required to
> >>>> identify
> >>> the port. In that case, it might also used elsewhere in the ingress
> >> path, and thus appropriate having in the first cache line.
> >>
> >> Ok, but these days most of devices have multiple rx queues.
> >> So identify the RX source properly you need not only port, but
> >> port+queue (at least 3 bytes).
> >> But I suppose better to wait for NXP input here.
> >>
> >>> So yes, this needs
> >>> further discussion with NXP.
> > 
> > It seems that this concept might be somewhat similar to the Flow Director information, which already has plenty of bits in the first cache line. You should consider if this can be somehow folded into the m->hash union.
> 
> 
> I'll tend to agree with Morten.
> 
> First, I think that having more than 255 virtual links is possible,
> so increasing the port size to 16 bits is not a bad idea.
> 
> I think the port information is more useful for a network stack
> than the queue_id, but it may not be the case for all applications.
> On the other hand, the queue_id (or something else providing the same
> level of information) could led in the flow director structure.
> 
> Now, the question is: do we really need the port to be inside the mbuf?
> Indeed, the application can already associate a bulk of packet with a
> port id.
> 
> The reason why I'll prefer to keep it is because I'm pretty sure that
> some applications use it. At least in the examples/ directory, we can
> find distributor, dpdk_qat, ipv4_multicast, load_balancer,
> packet_ordering. I did not check these apps in detail, but it makes me
> feel that other external applications could make use of the port field.
> 

Yes, judging from the reaction at Userspace and here on list, it appears
that port needs to be kept in the mbuf, and since it's present it should
be filled in by the drivers on RX, so it needs to stay where it is, I
think.

/Bruce

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 13:14 ` Olivier Matz
@ 2016-10-25 13:18   ` Morten Brørup
  0 siblings, 0 replies; 46+ messages in thread
From: Morten Brørup @ 2016-10-25 13:18 UTC (permalink / raw)
  To: Olivier Matz, dev

I support that!

Med venlig hilsen / kind regards
- Morten Brørup


> -----Original Message-----
> From: Olivier Matz [mailto:olivier.matz@6wind.com]
> Sent: Tuesday, October 25, 2016 3:14 PM
> To: Morten Brørup; dev@dpdk.org
> Subject: Re: mbuf changes
> 
> Hi,
> 
> On 10/24/2016 05:49 PM, Morten Brørup wrote:
> > And here’s something new to think about:
> >
> > m->next already reveals if there are more segments to a packet. Which
> > purpose does m->nb_segs serve that is not already covered by m->next?
> >
> 
> I was asking myself the same question some time ago:
> http://dpdk.org/ml/archives/dev/2016-May/039483.html
> 
> But it seems nb_segs is useful for PMDs on TX side, to anticipate how
> many descriptors a packet will use in the TX ring. It can also help a
> PMD to check that this packet is supported by the hardware (too many
> segments) without browsing the list.
> 
> So finally I think it should be kept in the mbuf. But as suggested by
> Bruce, it could go in the second cache line, since m->next is also
> there, at the condition that this field is set to 1 in mbuf_free().
> 
> Regards,
> Olivier


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 13:00             ` Olivier Matz
  2016-10-25 13:04               ` Ramia, Kannan Babu
  2016-10-25 13:15               ` Bruce Richardson
@ 2016-10-25 13:20               ` Thomas Monjalon
  2 siblings, 0 replies; 46+ messages in thread
From: Thomas Monjalon @ 2016-10-25 13:20 UTC (permalink / raw)
  To: Olivier Matz
  Cc: dev, Morten Brørup, Ananyev, Konstantin, Richardson, Bruce,
	Wiles, Keith

2016-10-25 15:00, Olivier Matz:
> On 10/25/2016 12:22 PM, Morten Brørup wrote:
> > From: Ananyev, Konstantin
> >> From: Bruce Richardson
> >>> On Mon, Oct 24, 2016 at 11:47:16PM +0200, Morten Brørup wrote:
> >>>> From: Bruce Richardson
> >>>>> On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
> >>>>>> I thought we also talked about removing the m->port from the
> >>>>>> mbuf as it is not really needed.
> >>>>>>
> >>>>> Yes, this was mentioned, and also the option of moving the port
> >>>>> value to the second cacheline, but it appears that NXP are using
> >>>>> the port value in their NIC drivers for passing in metadata, so
> >>>>> we'd need their agreement on any move (or removal).
> >>>>>
> >>>> If a single driver instance services multiple ports, so the ports
> >>>> are not polled individually, the m->port member will be required to
> >>>> identify the port.
> >>>> In that case, it might also used elsewhere in the ingress path,
> >>>> and thus appropriate having in the first cache line.
> >>
> >> Ok, but these days most of devices have multiple rx queues.
> >> So identify the RX source properly you need not only port, but
> >> port+queue (at least 3 bytes).
> >> But I suppose better to wait for NXP input here.
> >>
> >>> So yes, this needs further discussion with NXP.
> > 
> > It seems that this concept might be somewhat similar to the
> > Flow Director information, which already has plenty of bits
> > in the first cache line. You should consider if this can be
> > somehow folded into the m->hash union.
> 
> I'll tend to agree with Morten.
> 
> First, I think that having more than 255 virtual links is possible,
> so increasing the port size to 16 bits is not a bad idea.
> 
> I think the port information is more useful for a network stack
> than the queue_id, but it may not be the case for all applications.
> On the other hand, the queue_id (or something else providing the same
> level of information) could led in the flow director structure.
> 
> Now, the question is: do we really need the port to be inside the mbuf?
> Indeed, the application can already associate a bulk of packet with a
> port id.
> 
> The reason why I'll prefer to keep it is because I'm pretty sure that
> some applications use it. At least in the examples/ directory, we can
> find distributor, dpdk_qat, ipv4_multicast, load_balancer,
> packet_ordering. I did not check these apps in detail, but it makes me
> feel that other external applications could make use of the port field.

Having some applications using it does not mean that there is a good
justification to keep it.
I think the data must be stored inside the mbuf struct only if it
increases the performance of known use cases significantly.
So the questions should be:
- How significant are the use cases?
- What is the performance drop when having the data outside of the struct?

The mbuf space must be kept jealously like a DPDK treasure :)

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 13:04               ` Ramia, Kannan Babu
@ 2016-10-25 13:24                 ` Thomas Monjalon
  2016-10-25 14:32                   ` Ramia, Kannan Babu
  0 siblings, 1 reply; 46+ messages in thread
From: Thomas Monjalon @ 2016-10-25 13:24 UTC (permalink / raw)
  To: Ramia, Kannan Babu
  Cc: dev, Olivier Matz, Morten Brørup, Ananyev, Konstantin,
	Richardson, Bruce, Wiles, Keith

2016-10-25 13:04, Ramia, Kannan Babu:
> Port filed is important meta information for the application use like
> CGNAT vEPC functions etc.
> I strongly recommend to keep the field in mind meta.

Have you tried to move this field outside of the mbuf?
What is the performance degradation?
We need more information than some assumptions.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 12:48                     ` Olivier Matz
  2016-10-25 13:13                       ` Morten Brørup
@ 2016-10-25 13:38                       ` Ananyev, Konstantin
  2016-10-25 14:25                         ` Morten Brørup
  2016-10-28 13:34                       ` Pattan, Reshma
  2016-10-31 16:05                       ` Morten Brørup
  3 siblings, 1 reply; 46+ messages in thread
From: Ananyev, Konstantin @ 2016-10-25 13:38 UTC (permalink / raw)
  To: Olivier Matz, Richardson, Bruce, Morten Brørup
  Cc: Adrien Mazarguil, Wiles, Keith, dev, Oleg Kuporosov



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Olivier Matz
> Sent: Tuesday, October 25, 2016 1:49 PM
> To: Richardson, Bruce <bruce.richardson@intel.com>; Morten Brørup <mb@smartsharesystems.com>
> Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Wiles, Keith <keith.wiles@intel.com>; dev@dpdk.org; Oleg Kuporosov
> <olegk@mellanox.com>
> Subject: Re: [dpdk-dev] mbuf changes
> 
> 
> 
> On 10/25/2016 02:45 PM, Bruce Richardson wrote:
> > On Tue, Oct 25, 2016 at 02:33:55PM +0200, Morten Brørup wrote:
> >> Comments at the end.
> >>
> >> Med venlig hilsen / kind regards
> >> - Morten Brørup
> >>
> >>> -----Original Message-----
> >>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> >>> Sent: Tuesday, October 25, 2016 2:20 PM
> >>> To: Morten Brørup
> >>> Cc: Adrien Mazarguil; Wiles, Keith; dev@dpdk.org; Olivier Matz; Oleg
> >>> Kuporosov
> >>> Subject: Re: [dpdk-dev] mbuf changes
> >>>
> >>> On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Brørup wrote:
> >>>> Comments inline.
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
> >>>>> Richardson
> >>>>> Sent: Tuesday, October 25, 2016 1:14 PM
> >>>>> To: Adrien Mazarguil
> >>>>> Cc: Morten Brørup; Wiles, Keith; dev@dpdk.org; Olivier Matz; Oleg
> >>>>> Kuporosov
> >>>>> Subject: Re: [dpdk-dev] mbuf changes
> >>>>>
> >>>>> On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
> >>>>>> On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup wrote:
> >>>>>>> Comments inline.
> >>>>>>>
> >>>>>>> Med venlig hilsen / kind regards
> >>>>>>> - Morten Brørup
> >>>>>>>
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> >>>>>>>> Sent: Tuesday, October 25, 2016 11:39 AM
> >>>>>>>> To: Bruce Richardson
> >>>>>>>> Cc: Wiles, Keith; Morten Brørup; dev@dpdk.org; Olivier Matz;
> >>>>>>>> Oleg Kuporosov
> >>>>>>>> Subject: Re: [dpdk-dev] mbuf changes
> >>>>>>>>
> >>>>>>>> On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson
> >>> wrote:
> >>>>>>>>> On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith
> >>> wrote:
> >>>>>>>> [...]
> >>>>>>>>>>> On Oct 24, 2016, at 10:49 AM, Morten Brørup
> >>>>>>>> <mb@smartsharesystems.com> wrote:
> >>>>>>>> [...]
> >>>>
> >>>>>>>>> One other point I'll mention is that we need to have a
> >>>>>>>>> discussion on how/where to add in a timestamp value into
> >>> the
> >>>>>>>>> mbuf. Personally, I think it can be in a union with the
> >>>>> sequence
> >>>>>>>>> number value, but I also suspect that 32-bits of a
> >>> timestamp
> >>>>>>>>> is not going to be enough for
> >>>>>>>> many.
> >>>>>>>>>
> >>>>>>>>> Thoughts?
> >>>>>>>>
> >>>>>>>> If we consider that timestamp representation should use
> >>>>> nanosecond
> >>>>>>>> granularity, a 32-bit value may likely wrap around too
> >>> quickly
> >>>>>>>> to be useful. We can also assume that applications requesting
> >>>>>>>> timestamps may care more about latency than throughput, Oleg
> >>>>> found
> >>>>>>>> that using the second cache line for this purpose had a
> >>>>> noticeable impact [1].
> >>>>>>>>
> >>>>>>>>  [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html
> >>>>>>>
> >>>>>>> I agree with Oleg about the latency vs. throughput importance
> >>>>>>> for
> >>>>> such applications.
> >>>>>>>
> >>>>>>> If you need high resolution timestamps, consider them to be
> >>>>> generated by the NIC RX driver, possibly by the hardware itself
> >>>>> (http://w3new.napatech.com/features/time-precision/hardware-time-
> >>>>> stamp), so the timestamp belongs in the first cache line. And I am
> >>>>> proposing that it should have the highest possible accuracy, which
> >>>>> makes the value hardware dependent.
> >>>>>>>
> >>>>>>> Furthermore, I am arguing that we leave it up to the
> >>> application
> >>>>>>> to
> >>>>> keep track of the slowly moving bits (i.e. counting whole seconds,
> >>>>> hours and calendar date) out of band, so we don't use precious
> >>> space
> >>>>> in the mbuf. The application doesn't need the NIC RX driver's fast
> >>>>> path to capture which date (or even which second) a packet was
> >>>>> received. Yes, it adds complexity to the application, but we can't
> >>>>> set aside 64 bit for a generic timestamp. Or as a weird tradeoff:
> >>>>> Put the fast moving 32 bit in the first cache line and the slow
> >>>>> moving 32 bit in the second cache line, as a placeholder for the
> >>> application to fill out if needed.
> >>>>> Yes, it means that the application needs to check the time and
> >>>>> update its variable holding the slow moving time once every second
> >>>>> or so; but that should be doable without significant effort.
> >>>>>>
> >>>>>> That's a good point, however without a 64 bit value, elapsed time
> >>>>>> between two arbitrary mbufs cannot be measured reliably due to
> >>> not
> >>>>>> enough context, one way or another the low resolution value is
> >>>>>> also
> >>>>> needed.
> >>>>>>
> >>>>>> Obviously latency-sensitive applications are unlikely to perform
> >>>>>> lengthy buffering and require this but I'm not sure about all the
> >>>>>> possible use-cases. Considering many NICs expose 64 bit
> >>> timestaps,
> >>>>>> I suggest we do not truncate them.
> >>>>>>
> >>>>>> I'm not a fan of the weird tradeoff either, PMDs will be tempted
> >>>>>> to fill the extra 32 bits whenever they can and negate the
> >>>>>> performance improvement of the first cache line.
> >>>>>
> >>>>> I would tend to agree, and I don't really see any convenient way to
> >>>>> avoid putting in a 64-bit field for the timestamp in cache-line 0.
> >>>>> If we are ok with having this overlap/partially overlap with
> >>>>> sequence number, it will use up an extra 4B of storage in that
> >>> cacheline.
> >>>>
> >>>> I agree about the lack of convenience! And Adrien certainly has a
> >>> point about PMD temptations.
> >>>>
> >>>> However, I still don't think that a NICs ability to date-stamp a
> >>> packet is sufficient reason to put a date-stamp in cache line 0 of the
> >>> mbuf. Storing only the fast moving 32 bit in cache line 0 seems like a
> >>> good compromise to me.
> >>>>
> >>>> Maybe you can find just one more byte, so it can hold 17 minutes with
> >>>> nanosecond resolution. (I'm joking!)
> >>>>
> >>>> Please don't sacrifice the sequence number for the seconds/hours/days
> >>> part a timestamp. Maybe it could be configurable to use a 32 bit or 64
> >>> bit timestamp.
> >>>>
> >>> Do you see both timestamp and sequence numbers being used together? I
> >>> would have thought that apps would either use one or the other?
> >>> However, your suggestion is workable in any case, to allow the sequence
> >>> number to overlap just the high 32 bits of the timestamp, rather than
> >>> the low.
> >>
> >> In our case, I can foresee sequence numbers used for packet processing and timestamps for timing analysis (and possibly for packet
> capturing, when being used). 

Great, but right now none of these fields are filled from NIC HW by PMD RX function
(except RFC for melanox provided by Oleg, but again it is pure SW implementation).
So I would repeat my question: why these fields should stay in the first cache-line?
I understand that  it would speed-up some particular application, but there are plenty
of apps which do use different metadata.
Let say l2/l3/l4 len - is very useful information for upper layer (L3/L4) packet processing.
Should people who do use it start to push moving that fields into first cache-line too? 

>For timing analysis, we don’t need long durations, e.g. 4 seconds with 32 bit nanosecond resolution suffices.
> And for packet capturing we are perfectly capable of adding the slowly moving 32 bit of the timestamp to our output data stream without
> fetching it from the mbuf.
> >>
> 
> We should keep in mind that today we have the seqn field but it is
> not used by any PMD. In case it is implemented, would it be a per-queue
> sequence number? Is it useful from an application view?

Exactly - it depends from SW that uses that field, right?
It could be per process / per group of lcores / per port / per queue, etc.
Depending on what are upper layer needs.

> 
> This field is only used by the librte_reorder library, and in my
> opinion, we should consider moving it in the second cache line since
> it is not filled by the PMD.

+1

> 
> 
> > For the 32-bit timestamp case, it might be useful to have a right-shift
> > value passed in to the ethdev driver. If we assume a NIC with nanosecond
> > resolution, (or TSC value with resolution of that order of magnitude),
> > then the app can choose to have 1 ns resolution with 4 second
> > wraparound, or alternatively 4ns resolution with 16 second wraparound,
> > or even microsecond resolution with wrap around of over an hour.
> > The cost is obviously just a shift op in the driver code per packet -
> > hopefully with multiple packets done at a time using vector operations.
> 
> 
> About the timestamp, we can manage to find 64 bits in the first cache
> line, without sacrifying any field we have today. 

We can I suppose, but again what for and who will fill it?
If PMD, then where it will get this information?
If it is from rte_rdtsc() or clock(), then why upper layer can't do it itself?
Konstantin

>The question is more
> for the fields we may want to add later.
> 
> To answer to the question of the size of the timestamp, the first
> question is to know what is the precision required for the
> applications using it?
> 
> I don't quite like the idea of splitting the timestamp in the 2 cache
> lines, I think it would not be easy to use.
> 
> 
> Olivier

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 12:16             ` Morten Brørup
  2016-10-25 12:20               ` Bruce Richardson
@ 2016-10-25 13:48               ` Adrien Mazarguil
  2016-10-25 13:58                 ` Ananyev, Konstantin
  1 sibling, 1 reply; 46+ messages in thread
From: Adrien Mazarguil @ 2016-10-25 13:48 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Bruce Richardson, Wiles, Keith, dev, Olivier Matz, Oleg Kuporosov

On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Brørup wrote:
> Comments inline.

I'm only replying to the nb_segs bits here.

> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > Sent: Tuesday, October 25, 2016 1:14 PM
> > To: Adrien Mazarguil
> > Cc: Morten Brørup; Wiles, Keith; dev@dpdk.org; Olivier Matz; Oleg
> > Kuporosov
> > Subject: Re: [dpdk-dev] mbuf changes
> > 
> > On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
> > > On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup wrote:
> > > > Comments inline.
> > > >
> > > > Med venlig hilsen / kind regards
> > > > - Morten Brørup
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> > > > > Sent: Tuesday, October 25, 2016 11:39 AM
> > > > > To: Bruce Richardson
> > > > > Cc: Wiles, Keith; Morten Brørup; dev@dpdk.org; Olivier Matz; Oleg
> > > > > Kuporosov
> > > > > Subject: Re: [dpdk-dev] mbuf changes
> > > > >
> > > > > On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson wrote:
> > > > > > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
> > > > > [...]
> > > > > > > > On Oct 24, 2016, at 10:49 AM, Morten Brørup
> > > > > <mb@smartsharesystems.com> wrote:
> > > > > [...]
> > > > > > > > 5.
> > > > > > > >
> > > > > > > > And here’s something new to think about:
> > > > > > > >
> > > > > > > > m->next already reveals if there are more segments to a
> > packet.
> > > > > Which purpose does m->nb_segs serve that is not already covered
> > by
> > > > > m-
> > > > > >next?
> > > > > >
> > > > > > It is duplicate info, but nb_segs can be used to check the
> > > > > > validity
> > > > > of
> > > > > > the next pointer without having to read the second mbuf
> > cacheline.
> > > > > >
> > > > > > Whether it's worth having is something I'm happy enough to
> > > > > > discuss, though.
> > > > >
> > > > > Although slower in some cases than a full blown "next packet"
> > > > > pointer, nb_segs can also be conveniently abused to link several
> > > > > packets and their segments in the same list without wasting
> > space.
> > > >
> > > > I don’t understand that; can you please elaborate? Are you abusing
> > m->nb_segs as an index into an array in your application? If that is
> > the case, and it is endorsed by the community, we should get rid of m-
> > >nb_segs and add a member for application specific use instead.
> > >
> > > Well, that's just an idea, I'm not aware of any application using
> > > this, however the ability to link several packets with segments seems
> > > useful to me (e.g. buffering packets). Here's a diagram:
> > >
> > >  .-----------.   .-----------.   .-----------.   .-----------.   .---
> > ---
> > >  | pkt 0     |   | seg 1     |   | seg 2     |   | pkt 1     |   |
> > pkt 2
> > >  |      next --->|      next --->|      next --->|      next --->|
> > ...
> > >  | nb_segs 3 |   | nb_segs 1 |   | nb_segs 1 |   | nb_segs 1 |   |
> > >  `-----------'   `-----------'   `-----------'   `-----------'   `---
> > ---
> 
> I see. It makes it possible to refer to a burst of packets (with segments or not) by a single mbuf reference, as an alternative to the current design pattern of using an array and length (struct rte_mbuf **mbufs, unsigned count).
> 
> This would require implementation in the PMDs etc.
> 
> And even in this case, m->nb_segs does not need to be an integer, but could be replaced by a single bit indicating if the segment is a continuation of a packet or the beginning (alternatively the end) of a packet, i.e. the bit can be set for either the first or the last segment in the packet.

Sure however if we keep the current definition, a single bit would not be
enough as it must be nonzero for the buffer to be valid. I think a 8 bit
field is not that expensive for a counter.

> It is an almost equivalent alternative to the fundamental design pattern of using an array of mbuf with count, which is widely implemented in DPDK. And m->next still lives in the second cache line, so I don't see any gain by this.

That's right, it does not have to live in the first cache line, my only
concern was its entire removal.

> I still don't get how m->nb_segs can be abused without m->next.

By "abused" I mean that applications are not supposed to pass this kind of
mbuf lists directly to existing mbuf-handling functions (TX burst,
rte_pktmbuf_free() and so on), however these same applications (even PMDs)
can do so internally temporarily because it's so simple.

The next pointer of the last segment of a packet must still be set to NULL
every time a packet is retrieved from such a list to be processed.

> > However, nb_segs may be a good candidate for demotion, along with
> > possibly the port value, or the reference count.

Yes, I think that's fine as long as it's kept somewhere.

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 13:48               ` Adrien Mazarguil
@ 2016-10-25 13:58                 ` Ananyev, Konstantin
  0 siblings, 0 replies; 46+ messages in thread
From: Ananyev, Konstantin @ 2016-10-25 13:58 UTC (permalink / raw)
  To: Adrien Mazarguil, Morten Brørup
  Cc: Richardson, Bruce, Wiles, Keith, dev, Olivier Matz, Oleg Kuporosov



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Adrien Mazarguil
> Sent: Tuesday, October 25, 2016 2:48 PM
> To: Morten Brørup <mb@smartsharesystems.com>
> Cc: Richardson, Bruce <bruce.richardson@intel.com>; Wiles, Keith <keith.wiles@intel.com>; dev@dpdk.org; Olivier Matz
> <olivier.matz@6wind.com>; Oleg Kuporosov <olegk@mellanox.com>
> Subject: Re: [dpdk-dev] mbuf changes
> 
> On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Brørup wrote:
> > Comments inline.
> 
> I'm only replying to the nb_segs bits here.
> 
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > > Sent: Tuesday, October 25, 2016 1:14 PM
> > > To: Adrien Mazarguil
> > > Cc: Morten Brørup; Wiles, Keith; dev@dpdk.org; Olivier Matz; Oleg
> > > Kuporosov
> > > Subject: Re: [dpdk-dev] mbuf changes
> > >
> > > On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
> > > > On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup wrote:
> > > > > Comments inline.
> > > > >
> > > > > Med venlig hilsen / kind regards
> > > > > - Morten Brørup
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> > > > > > Sent: Tuesday, October 25, 2016 11:39 AM
> > > > > > To: Bruce Richardson
> > > > > > Cc: Wiles, Keith; Morten Brørup; dev@dpdk.org; Olivier Matz; Oleg
> > > > > > Kuporosov
> > > > > > Subject: Re: [dpdk-dev] mbuf changes
> > > > > >
> > > > > > On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson wrote:
> > > > > > > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
> > > > > > [...]
> > > > > > > > > On Oct 24, 2016, at 10:49 AM, Morten Brørup
> > > > > > <mb@smartsharesystems.com> wrote:
> > > > > > [...]
> > > > > > > > > 5.
> > > > > > > > >
> > > > > > > > > And here’s something new to think about:
> > > > > > > > >
> > > > > > > > > m->next already reveals if there are more segments to a
> > > packet.
> > > > > > Which purpose does m->nb_segs serve that is not already covered
> > > by
> > > > > > m-
> > > > > > >next?
> > > > > > >
> > > > > > > It is duplicate info, but nb_segs can be used to check the
> > > > > > > validity
> > > > > > of
> > > > > > > the next pointer without having to read the second mbuf
> > > cacheline.
> > > > > > >
> > > > > > > Whether it's worth having is something I'm happy enough to
> > > > > > > discuss, though.
> > > > > >
> > > > > > Although slower in some cases than a full blown "next packet"
> > > > > > pointer, nb_segs can also be conveniently abused to link several
> > > > > > packets and their segments in the same list without wasting
> > > space.
> > > > >
> > > > > I don’t understand that; can you please elaborate? Are you abusing
> > > m->nb_segs as an index into an array in your application? If that is
> > > the case, and it is endorsed by the community, we should get rid of m-
> > > >nb_segs and add a member for application specific use instead.
> > > >
> > > > Well, that's just an idea, I'm not aware of any application using
> > > > this, however the ability to link several packets with segments seems
> > > > useful to me (e.g. buffering packets). Here's a diagram:
> > > >
> > > >  .-----------.   .-----------.   .-----------.   .-----------.   .---
> > > ---
> > > >  | pkt 0     |   | seg 1     |   | seg 2     |   | pkt 1     |   |
> > > pkt 2
> > > >  |      next --->|      next --->|      next --->|      next --->|
> > > ...
> > > >  | nb_segs 3 |   | nb_segs 1 |   | nb_segs 1 |   | nb_segs 1 |   |
> > > >  `-----------'   `-----------'   `-----------'   `-----------'   `---
> > > ---
> >
> > I see. It makes it possible to refer to a burst of packets (with segments or not) by a single mbuf reference, as an alternative to the current
> design pattern of using an array and length (struct rte_mbuf **mbufs, unsigned count).
> >
> > This would require implementation in the PMDs etc.
> >
> > And even in this case, m->nb_segs does not need to be an integer, but could be replaced by a single bit indicating if the segment is a
> continuation of a packet or the beginning (alternatively the end) of a packet, i.e. the bit can be set for either the first or the last segment in
> the packet.

We do need nb_segs -  at least for TX.
That's how TX function calculates how many TXDs it needs to allocate(and fill).
Of-course it can re-scan whole chain of segments to count them, but I think
it would slowdown things even more.
Though yes, I suppose it can be moved to the second cahe-line.
Konstantin

> 
> Sure however if we keep the current definition, a single bit would not be
> enough as it must be nonzero for the buffer to be valid. I think a 8 bit
> field is not that expensive for a counter.
> 
> > It is an almost equivalent alternative to the fundamental design pattern of using an array of mbuf with count, which is widely implemented
> in DPDK. And m->next still lives in the second cache line, so I don't see any gain by this.
> 
> That's right, it does not have to live in the first cache line, my only
> concern was its entire removal.
> 
> > I still don't get how m->nb_segs can be abused without m->next.
> 
> By "abused" I mean that applications are not supposed to pass this kind of
> mbuf lists directly to existing mbuf-handling functions (TX burst,
> rte_pktmbuf_free() and so on), however these same applications (even PMDs)
> can do so internally temporarily because it's so simple.
> 
> The next pointer of the last segment of a packet must still be set to NULL
> every time a packet is retrieved from such a list to be processed.
> 
> > > However, nb_segs may be a good candidate for demotion, along with
> > > possibly the port value, or the reference count.
> 
> Yes, I think that's fine as long as it's kept somewhere.
> 
> --
> Adrien Mazarguil
> 6WIND

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 13:38                       ` Ananyev, Konstantin
@ 2016-10-25 14:25                         ` Morten Brørup
  2016-10-25 14:45                           ` Olivier Matz
  0 siblings, 1 reply; 46+ messages in thread
From: Morten Brørup @ 2016-10-25 14:25 UTC (permalink / raw)
  To: Ananyev, Konstantin, Olivier Matz, Richardson, Bruce
  Cc: Adrien Mazarguil, Wiles, Keith, dev, Oleg Kuporosov

Lots of good arguments from Konstantin below!

Thomas also wrote something worth repeating:
"The mbuf space must be kept jealously like a DPDK treasure :)"

Maybe we should take a step away from the keyboard and consider the decision process and guidance criteria for mbuf members.

Here is an example of what I am thinking:

1. It is the intention to optimize mbuf handling for the most likely fast path through the entire system on the most common applications.
2. It is the intention that the most likely fast path through the RX handler and RX processing of the most common applications only needs to touch the first cache line.
3. The first cache line should be preferred for metadata set by most common NIC hardware in the RX handler.
4. The second cache line should be preferred for metadata required by most common NIC hardware in the TX handler.
5. The first cache line could be used for metadata used by the RX processing in the most likely fast path of most common applications.
6. The second cache line could be used for metadata used by the TX processing in the most likely fast path of most common applications.
7. The RX and TX handlers are equally important, and the RX handler should not be optimized at the cost of the TX handler.

Please note the use of "most common" and "most likely". It should be clear which cases we are optimizing for.

We should probably also define "the most common applications" (e.g. a Layer 3 router with or without VPN) and "the most likely fast path" (e.g. a single packet matching a series of lookups and ultimately forwarded to a single port).

It might also make sense documenting the mbuf fields in more detail somewhere. E.g. the need for nb_segs in the NIC's TX handler.


Med venlig hilsen / kind regards
- Morten Brørup


> -----Original Message-----
> From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com]
> Sent: Tuesday, October 25, 2016 3:39 PM
> To: Olivier Matz; Richardson, Bruce; Morten Brørup
> Cc: Adrien Mazarguil; Wiles, Keith; dev@dpdk.org; Oleg Kuporosov
> Subject: RE: [dpdk-dev] mbuf changes
> 
> 
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Olivier Matz
> > Sent: Tuesday, October 25, 2016 1:49 PM
> > To: Richardson, Bruce <bruce.richardson@intel.com>; Morten Brørup
> > <mb@smartsharesystems.com>
> > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Wiles, Keith
> > <keith.wiles@intel.com>; dev@dpdk.org; Oleg Kuporosov
> > <olegk@mellanox.com>
> > Subject: Re: [dpdk-dev] mbuf changes
> >
> >
> >
> > On 10/25/2016 02:45 PM, Bruce Richardson wrote:
> > > On Tue, Oct 25, 2016 at 02:33:55PM +0200, Morten Brørup wrote:
> > >> Comments at the end.
> > >>
> > >> Med venlig hilsen / kind regards
> > >> - Morten Brørup
> > >>
> > >>> -----Original Message-----
> > >>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > >>> Sent: Tuesday, October 25, 2016 2:20 PM
> > >>> To: Morten Brørup
> > >>> Cc: Adrien Mazarguil; Wiles, Keith; dev@dpdk.org; Olivier Matz;
> > >>> Oleg Kuporosov
> > >>> Subject: Re: [dpdk-dev] mbuf changes
> > >>>
> > >>> On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Brørup wrote:
> > >>>> Comments inline.
> > >>>>
> > >>>>> -----Original Message-----
> > >>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
> > >>>>> Richardson
> > >>>>> Sent: Tuesday, October 25, 2016 1:14 PM
> > >>>>> To: Adrien Mazarguil
> > >>>>> Cc: Morten Brørup; Wiles, Keith; dev@dpdk.org; Olivier Matz;
> > >>>>> Oleg Kuporosov
> > >>>>> Subject: Re: [dpdk-dev] mbuf changes
> > >>>>>
> > >>>>> On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil
> wrote:
> > >>>>>> On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup wrote:
> > >>>>>>> Comments inline.
> > >>>>>>>
> > >>>>>>> Med venlig hilsen / kind regards
> > >>>>>>> - Morten Brørup
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> -----Original Message-----
> > >>>>>>>> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> > >>>>>>>> Sent: Tuesday, October 25, 2016 11:39 AM
> > >>>>>>>> To: Bruce Richardson
> > >>>>>>>> Cc: Wiles, Keith; Morten Brørup; dev@dpdk.org; Olivier Matz;
> > >>>>>>>> Oleg Kuporosov
> > >>>>>>>> Subject: Re: [dpdk-dev] mbuf changes
> > >>>>>>>>
> > >>>>>>>> On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson
> > >>> wrote:
> > >>>>>>>>> On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith
> > >>> wrote:
> > >>>>>>>> [...]
> > >>>>>>>>>>> On Oct 24, 2016, at 10:49 AM, Morten Brørup
> > >>>>>>>> <mb@smartsharesystems.com> wrote:
> > >>>>>>>> [...]
> > >>>>
> > >>>>>>>>> One other point I'll mention is that we need to have a
> > >>>>>>>>> discussion on how/where to add in a timestamp value into
> > >>> the
> > >>>>>>>>> mbuf. Personally, I think it can be in a union with the
> > >>>>> sequence
> > >>>>>>>>> number value, but I also suspect that 32-bits of a
> > >>> timestamp
> > >>>>>>>>> is not going to be enough for
> > >>>>>>>> many.
> > >>>>>>>>>
> > >>>>>>>>> Thoughts?
> > >>>>>>>>
> > >>>>>>>> If we consider that timestamp representation should use
> > >>>>> nanosecond
> > >>>>>>>> granularity, a 32-bit value may likely wrap around too
> > >>> quickly
> > >>>>>>>> to be useful. We can also assume that applications
> requesting
> > >>>>>>>> timestamps may care more about latency than throughput, Oleg
> > >>>>> found
> > >>>>>>>> that using the second cache line for this purpose had a
> > >>>>> noticeable impact [1].
> > >>>>>>>>
> > >>>>>>>>  [1] http://dpdk.org/ml/archives/dev/2016-
> October/049237.html
> > >>>>>>>
> > >>>>>>> I agree with Oleg about the latency vs. throughput importance
> > >>>>>>> for
> > >>>>> such applications.
> > >>>>>>>
> > >>>>>>> If you need high resolution timestamps, consider them to be
> > >>>>> generated by the NIC RX driver, possibly by the hardware itself
> > >>>>> (http://w3new.napatech.com/features/time-precision/hardware-
> time
> > >>>>> - stamp), so the timestamp belongs in the first cache line. And
> > >>>>> I am proposing that it should have the highest possible
> > >>>>> accuracy, which makes the value hardware dependent.
> > >>>>>>>
> > >>>>>>> Furthermore, I am arguing that we leave it up to the
> > >>> application
> > >>>>>>> to
> > >>>>> keep track of the slowly moving bits (i.e. counting whole
> > >>>>> seconds, hours and calendar date) out of band, so we don't use
> > >>>>> precious
> > >>> space
> > >>>>> in the mbuf. The application doesn't need the NIC RX driver's
> > >>>>> fast path to capture which date (or even which second) a packet
> > >>>>> was received. Yes, it adds complexity to the application, but
> we
> > >>>>> can't set aside 64 bit for a generic timestamp. Or as a weird
> tradeoff:
> > >>>>> Put the fast moving 32 bit in the first cache line and the slow
> > >>>>> moving 32 bit in the second cache line, as a placeholder for
> the
> > >>> application to fill out if needed.
> > >>>>> Yes, it means that the application needs to check the time and
> > >>>>> update its variable holding the slow moving time once every
> > >>>>> second or so; but that should be doable without significant
> effort.
> > >>>>>>
> > >>>>>> That's a good point, however without a 64 bit value, elapsed
> > >>>>>> time between two arbitrary mbufs cannot be measured reliably
> > >>>>>> due to
> > >>> not
> > >>>>>> enough context, one way or another the low resolution value is
> > >>>>>> also
> > >>>>> needed.
> > >>>>>>
> > >>>>>> Obviously latency-sensitive applications are unlikely to
> > >>>>>> perform lengthy buffering and require this but I'm not sure
> > >>>>>> about all the possible use-cases. Considering many NICs expose
> > >>>>>> 64 bit
> > >>> timestaps,
> > >>>>>> I suggest we do not truncate them.
> > >>>>>>
> > >>>>>> I'm not a fan of the weird tradeoff either, PMDs will be
> > >>>>>> tempted to fill the extra 32 bits whenever they can and negate
> > >>>>>> the performance improvement of the first cache line.
> > >>>>>
> > >>>>> I would tend to agree, and I don't really see any convenient
> way
> > >>>>> to avoid putting in a 64-bit field for the timestamp in cache-
> line 0.
> > >>>>> If we are ok with having this overlap/partially overlap with
> > >>>>> sequence number, it will use up an extra 4B of storage in that
> > >>> cacheline.
> > >>>>
> > >>>> I agree about the lack of convenience! And Adrien certainly has
> a
> > >>> point about PMD temptations.
> > >>>>
> > >>>> However, I still don't think that a NICs ability to date-stamp a
> > >>> packet is sufficient reason to put a date-stamp in cache line 0
> of
> > >>> the mbuf. Storing only the fast moving 32 bit in cache line 0
> > >>> seems like a good compromise to me.
> > >>>>
> > >>>> Maybe you can find just one more byte, so it can hold 17 minutes
> > >>>> with nanosecond resolution. (I'm joking!)
> > >>>>
> > >>>> Please don't sacrifice the sequence number for the
> > >>>> seconds/hours/days
> > >>> part a timestamp. Maybe it could be configurable to use a 32 bit
> > >>> or 64 bit timestamp.
> > >>>>
> > >>> Do you see both timestamp and sequence numbers being used
> > >>> together? I would have thought that apps would either use one or
> the other?
> > >>> However, your suggestion is workable in any case, to allow the
> > >>> sequence number to overlap just the high 32 bits of the
> timestamp,
> > >>> rather than the low.
> > >>
> > >> In our case, I can foresee sequence numbers used for packet
> > >> processing and timestamps for timing analysis (and possibly for
> > >> packet
> > capturing, when being used).
> 
> Great, but right now none of these fields are filled from NIC HW by PMD
> RX function (except RFC for melanox provided by Oleg, but again it is
> pure SW implementation).
> So I would repeat my question: why these fields should stay in the
> first cache-line?
> I understand that  it would speed-up some particular application, but
> there are plenty of apps which do use different metadata.
> Let say l2/l3/l4 len - is very useful information for upper layer
> (L3/L4) packet processing.
> Should people who do use it start to push moving that fields into first
> cache-line too?
> 
> >For timing analysis, we don’t need long durations, e.g. 4 seconds with
> 32 bit nanosecond resolution suffices.
> > And for packet capturing we are perfectly capable of adding the
> slowly
> >moving 32 bit of the timestamp to our output data stream without
> fetching it from the mbuf.
> > >>
> >
> > We should keep in mind that today we have the seqn field but it is
> not
> > used by any PMD. In case it is implemented, would it be a per-queue
> > sequence number? Is it useful from an application view?
> 
> Exactly - it depends from SW that uses that field, right?
> It could be per process / per group of lcores / per port / per queue,
> etc.
> Depending on what are upper layer needs.
> 
> >
> > This field is only used by the librte_reorder library, and in my
> > opinion, we should consider moving it in the second cache line since
> > it is not filled by the PMD.
> 
> +1
> 
> >
> >
> > > For the 32-bit timestamp case, it might be useful to have a
> > > right-shift value passed in to the ethdev driver. If we assume a
> NIC
> > > with nanosecond resolution, (or TSC value with resolution of that
> > > order of magnitude), then the app can choose to have 1 ns
> resolution
> > > with 4 second wraparound, or alternatively 4ns resolution with 16
> > > second wraparound, or even microsecond resolution with wrap around
> of over an hour.
> > > The cost is obviously just a shift op in the driver code per packet
> > > - hopefully with multiple packets done at a time using vector
> operations.
> >
> >
> > About the timestamp, we can manage to find 64 bits in the first cache
> > line, without sacrifying any field we have today.
> 
> We can I suppose, but again what for and who will fill it?
> If PMD, then where it will get this information?
> If it is from rte_rdtsc() or clock(), then why upper layer can't do it
> itself?
> Konstantin
> 
> >The question is more
> > for the fields we may want to add later.
> >
> > To answer to the question of the size of the timestamp, the first
> > question is to know what is the precision required for the
> > applications using it?
> >
> > I don't quite like the idea of splitting the timestamp in the 2 cache
> > lines, I think it would not be easy to use.
> >
> >
> > Olivier

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 13:24                 ` Thomas Monjalon
@ 2016-10-25 14:32                   ` Ramia, Kannan Babu
  2016-10-25 14:54                     ` Thomas Monjalon
  0 siblings, 1 reply; 46+ messages in thread
From: Ramia, Kannan Babu @ 2016-10-25 14:32 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, Olivier Matz, Morten Brørup, Ananyev, Konstantin,
	Richardson, Bruce, Wiles, Keith

I didn't get your question. The only information exchanged between the stages is mbuf pointer. So the information has to be in mbuf, whether it's part of Meta or in private area in the packet buffer headroom is that the question you are asking. The private area is application specific, while I am looking for the port information getting updated from driver to application and it's generic to multiple applications. 

Regards
Kannan Babu


-----Original Message-----
From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com] 
Sent: Tuesday, October 25, 2016 6:54 PM
To: Ramia, Kannan Babu <kannan.babu.ramia@intel.com>
Cc: dev@dpdk.org; Olivier Matz <olivier.matz@6wind.com>; Morten Brørup <mb@smartsharesystems.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; Wiles, Keith <keith.wiles@intel.com>
Subject: Re: [dpdk-dev] mbuf changes

2016-10-25 13:04, Ramia, Kannan Babu:
> Port filed is important meta information for the application use like 
> CGNAT vEPC functions etc.
> I strongly recommend to keep the field in mind meta.

Have you tried to move this field outside of the mbuf?
What is the performance degradation?
We need more information than some assumptions.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 14:25                         ` Morten Brørup
@ 2016-10-25 14:45                           ` Olivier Matz
  0 siblings, 0 replies; 46+ messages in thread
From: Olivier Matz @ 2016-10-25 14:45 UTC (permalink / raw)
  To: Morten Brørup, Ananyev, Konstantin, Richardson, Bruce
  Cc: Adrien Mazarguil, Wiles, Keith, dev, Oleg Kuporosov



On 10/25/2016 04:25 PM, Morten Brørup wrote:
> It might also make sense documenting the mbuf fields in more detail somewhere. E.g. the need for nb_segs in the NIC's TX handler.

Good point, I'll do it at the same time than the first rework
proposition.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 14:32                   ` Ramia, Kannan Babu
@ 2016-10-25 14:54                     ` Thomas Monjalon
  0 siblings, 0 replies; 46+ messages in thread
From: Thomas Monjalon @ 2016-10-25 14:54 UTC (permalink / raw)
  To: Ramia, Kannan Babu
  Cc: dev, Olivier Matz, Morten Brørup, Ananyev, Konstantin,
	Richardson, Bruce, Wiles, Keith

2016-10-25 14:32, Ramia, Kannan Babu:
> I didn't get your question. The only information exchanged between the stages is mbuf pointer. So the information has to be in mbuf, whether it's part of Meta or in private area in the packet buffer headroom is that the question you are asking. The private area is application specific, while I am looking for the port information getting updated from driver to application and it's generic to multiple applications. 

Thanks, your answer is perfect (except it is top-posted ;)

The discussion is more about performance. So you agree that the port
information is not performance sensitive.
It appears that it could be an information filled in a private area
at app-level, because the application knows which port it is polling.
However we may need a way to put some generic informations somewhere,
not in the first cache lines.


> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com] 
> Sent: Tuesday, October 25, 2016 6:54 PM
> To: Ramia, Kannan Babu <kannan.babu.ramia@intel.com>
> Cc: dev@dpdk.org; Olivier Matz <olivier.matz@6wind.com>; Morten Brørup <mb@smartsharesystems.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; Wiles, Keith <keith.wiles@intel.com>
> Subject: Re: [dpdk-dev] mbuf changes
> 
> 2016-10-25 13:04, Ramia, Kannan Babu:
> > Port filed is important meta information for the application use like 
> > CGNAT vEPC functions etc.
> > I strongly recommend to keep the field in mind meta.
> 
> Have you tried to move this field outside of the mbuf?
> What is the performance degradation?
> We need more information than some assumptions.
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 12:05       ` Bruce Richardson
@ 2016-10-26  8:28         ` Alejandro Lucero
  2016-10-26  9:01           ` Morten Brørup
  2016-11-09 11:42           ` Alejandro Lucero
  0 siblings, 2 replies; 46+ messages in thread
From: Alejandro Lucero @ 2016-10-26  8:28 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: Shreyansh Jain, dev, Morten Brørup

On Tue, Oct 25, 2016 at 2:05 PM, Bruce Richardson <
bruce.richardson@intel.com> wrote:

> On Tue, Oct 25, 2016 at 05:24:28PM +0530, Shreyansh Jain wrote:
> > On Monday 24 October 2016 09:55 PM, Bruce Richardson wrote:
> > > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
> > > >
> > > > > On Oct 24, 2016, at 10:49 AM, Morten Brørup <
> mb@smartsharesystems.com> wrote:
> > > > >
> > > > > First of all: Thanks for a great DPDK Userspace 2016!
> > > > >
> > > > >
> > > > >
> > > > > Continuing the Userspace discussion about Olivier Matz’s proposed
> mbuf changes...
> > >
> > > Thanks for keeping the discussion going!
> > > > >
> > > > >
> > > > >
> > > > > 1.
> > > > >
> > > > > Stephen Hemminger had a noteworthy general comment about keeping
> metadata for the NIC in the appropriate section of the mbuf: Metadata
> generated by the NIC’s RX handler belongs in the first cache line, and
> metadata required by the NIC’s TX handler belongs in the second cache line.
> This also means that touching the second cache line on ingress should be
> avoided if possible; and Bruce Richardson mentioned that for this reason
> m->next was zeroed on free().
> > > > >
> > > Thinking about it, I suspect there are more fields we can reset on free
> > > to save time on alloc. Refcnt, as discussed below is one of them, but
> so
> > > too could be the nb_segs field and possibly others.
> > >
> > > > >
> > > > >
> > > > > 2.
> > > > >
> > > > > There seemed to be consensus that the size of m->refcnt should
> match the size of m->port because a packet could be duplicated on all
> physical ports for L3 multicast and L2 flooding.
> > > > >
> > > > > Furthermore, although a single physical machine (i.e. a single
> server) with 255 physical ports probably doesn’t exist, it might contain
> more than 255 virtual machines with a virtual port each, so it makes sense
> extending these mbuf fields from 8 to 16 bits.
> > > >
> > > > I thought we also talked about removing the m->port from the mbuf as
> it is not really needed.
> > > >
> > > Yes, this was mentioned, and also the option of moving the port value
> to
> > > the second cacheline, but it appears that NXP are using the port value
> > > in their NIC drivers for passing in metadata, so we'd need their
> > > agreement on any move (or removal).
> >
> > I am not sure where NXP's NIC came into picture on this, but now that it
> is
> > highlighted, this field is required for libevent implementation [1].
> >
> > A scheduler sending an event, which can be a packet, would only have
> > information of a flow_id. From this matching it back to a port, without
> > mbuf->port, would be very difficult (costly). There may be way around
> this
> > but at least in current proposal I think port would be important to have
> -
> > even if in second cache line.
> >
> > But, off the top of my head, as of now it is not being used for any
> specific
> > purpose in NXP's PMD implementation.
> >
> > Even the SoC patches don't necessarily rely on it except using it
> because it
> > is available.
> >
> > @Bruce: where did you get the NXP context here from?
> >
> Oh, I'm just mis-remembering. :-( It was someone else who was looking for
> this - Netronome, perhaps?
>
> CC'ing Alejandro in the hope I'm remembering correctly second time
> round!
>
>
Yes. Thanks Bruce!

So Netronome uses the port field and, as I commented on the user meeting,
we are happy with the field going from 8 to 16 bits.

In our case, this is something some clients have demanded, and if I'm not
wrong (I'll double check this asap), the port value is for knowing where
the packet is coming from. Think about a switch in the NIC, with ports
linked to VFs/VMs, and one or more physical ports. That port value is not
related to DPDK ports but to the switch ports. Code in the host (DPDK or
not) can receive packets from the wire or from VFs through the NIC. This is
also true for packets received by VMs, but I guess the port value is just
interested for host code.



> /Bruce
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-26  8:28         ` Alejandro Lucero
@ 2016-10-26  9:01           ` Morten Brørup
  2016-11-09 11:42           ` Alejandro Lucero
  1 sibling, 0 replies; 46+ messages in thread
From: Morten Brørup @ 2016-10-26  9:01 UTC (permalink / raw)
  To: Alejandro Lucero, Bruce Richardson; +Cc: Shreyansh Jain, dev

> From: Alejandro Lucero [mailto:alejandro.lucero@netronome.com] 
> On Tue, Oct 25, 2016 at 2:05 PM, Bruce Richardson <bruce.richardson@intel.com> wrote:
> > On Tue, Oct 25, 2016 at 05:24:28PM +0530, Shreyansh Jain wrote:
> > > On Monday 24 October 2016 09:55 PM, Bruce Richardson wrote:
> > > > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
> > > > >
> > > > > > On Oct 24, 2016, at 10:49 AM, Morten Brørup <mb@smartsharesystems.com> wrote:
> > > > > >
> > > > > > First of all: Thanks for a great DPDK Userspace 2016!
> > > > > >
> > > > > >
> > > > > >
> > > > > > Continuing the Userspace discussion about Olivier Matz’s proposed mbuf changes...
> > > >
> > > > Thanks for keeping the discussion going!
> > > > > >
> > > > > >
> > > > > >
> > > > > > 1.
> > > > > >
> > > > > > Stephen Hemminger had a noteworthy general comment about keeping metadata for the NIC in the appropriate section of the mbuf: Metadata generated by the NIC’s RX handler belongs in the first cache line, and metadata required by the NIC’s TX handler belongs in the second cache line. This also means that touching the second cache line on ingress should be avoided if possible; and Bruce Richardson mentioned that for this reason m->next was zeroed on free().
> > > > > >
> > > > Thinking about it, I suspect there are more fields we can reset on free
> > > > to save time on alloc. Refcnt, as discussed below is one of them, but so
> > > > too could be the nb_segs field and possibly others.
> > > >
> > > > > >
> > > > > >
> > > > > > 2.
> > > > > >
> > > > > > There seemed to be consensus that the size of m->refcnt should match the size of m->port because a packet could be duplicated on all physical ports for L3 multicast and L2 flooding.
> > > > > >
> > > > > > Furthermore, although a single physical machine (i.e. a single server) with 255 physical ports probably doesn’t exist, it might contain more than 255 virtual machines with a virtual port each, so it makes sense extending these mbuf fields from 8 to 16 bits.
> > > > >
> > > > > I thought we also talked about removing the m->port from the mbuf as it is not really needed.
> > > > >
> > > > Yes, this was mentioned, and also the option of moving the port value to
> > > > the second cacheline, but it appears that NXP are using the port value
> > > > in their NIC drivers for passing in metadata, so we'd need their
> > > > agreement on any move (or removal).
> > >
> > > I am not sure where NXP's NIC came into picture on this, but now that it is
> > > highlighted, this field is required for libevent implementation [1].
> > >
> > > A scheduler sending an event, which can be a packet, would only have
> > > information of a flow_id. From this matching it back to a port, without
> > > mbuf->port, would be very difficult (costly). There may be way around this
> > > but at least in current proposal I think port would be important to have -
> > > even if in second cache line.
> > >
> > > But, off the top of my head, as of now it is not being used for any specific
> > > purpose in NXP's PMD implementation.
> > >
> > > Even the SoC patches don't necessarily rely on it except using it because it
> > > is available.
> > >
> > > @Bruce: where did you get the NXP context here from?
> > >
> > Oh, I'm just mis-remembering. :-( It was someone else who was looking for
> > this - Netronome, perhaps?
> > 
> > CC'ing Alejandro in the hope I'm remembering correctly second time
> > round!
> > 
> Yes. Thanks Bruce!
> 
> So Netronome uses the port field and, as I commented on the user meeting, we are happy with the field going from 8 to 16 bits.
> 
> In our case, this is something some clients have demanded, and if I'm not wrong (I'll double check this asap), the port value is for knowing where the packet is coming from. Think about a switch in the NIC, with ports linked to VFs/VMs, and one or more physical ports. That port value is not related to DPDK ports but to the switch ports. Code in the host (DPDK or not) can receive packets from the wire or from VFs through the NIC. This is also true for packets received by VMs, but I guess the port value is just interested for host code.

Come to think of it: About ten years ago I was directly involved in the design of a silicon architecture for a stackable switch, and the packets on the backplane used a proprietary stack header which included a port number. Among other purposes, it allowed the management CPU to inject packets addressed to any physical port in the stack; and this could easily be more than 255 ports. I have seen similar proprietary headers in chips used for consumer wifi routers, allowing a single port on the CPU chip to communicate individually with the 4 LAN ports through the 5 port switch chip on the PCB. These are other examples of what Alejandro describes. So maybe this is quite common.

I think the Userspace consensus strongly supported increasing the m->port from 8 to 16 bit, so let's go for that. But do we keep port in the first cache line as is, do we fold it into the Flow Director part, or do we move it to the second cache line?

Maybe we should also consider that since m->port is widely used by the examples, it indicates that this variable is so fundamental that it is probably also widely used by real life applications. E.g. ingress filters in firewalls are quite useful (https://tools.ietf.org/html/bcp38).

-Morten


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 12:48                     ` Olivier Matz
  2016-10-25 13:13                       ` Morten Brørup
  2016-10-25 13:38                       ` Ananyev, Konstantin
@ 2016-10-28 13:34                       ` Pattan, Reshma
  2016-10-28 14:11                         ` Morten Brørup
  2016-10-31 16:05                       ` Morten Brørup
  3 siblings, 1 reply; 46+ messages in thread
From: Pattan, Reshma @ 2016-10-28 13:34 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, Morten Brørup

Hi Olivier,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Olivier Matz
> Sent: Tuesday, October 25, 2016 1:49 PM
> To: Richardson, Bruce <bruce.richardson@intel.com>; Morten Brørup
> <mb@smartsharesystems.com>
> Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Wiles, Keith
> <keith.wiles@intel.com>; dev@dpdk.org; Oleg Kuporosov
> <olegk@mellanox.com>
> Subject: Re: [dpdk-dev] mbuf changes
> 
> 
> 
> On 10/25/2016 02:45 PM, Bruce Richardson wrote:
> > On Tue, Oct 25, 2016 at 02:33:55PM +0200, Morten Brørup wrote:
> >> Comments at the end.
> >>
> >> Med venlig hilsen / kind regards
> >> - Morten Brørup
> >>
> >>> -----Original Message-----
> >>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> >>> Sent: Tuesday, October 25, 2016 2:20 PM
> >>> To: Morten Brørup
> >>> Cc: Adrien Mazarguil; Wiles, Keith; dev@dpdk.org; Olivier Matz; Oleg
> >>> Kuporosov
> >>> Subject: Re: [dpdk-dev] mbuf changes
> >>>
> >>> On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Brørup wrote:
> >>>> Comments inline.
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
> >>>>> Richardson
> >>>>> Sent: Tuesday, October 25, 2016 1:14 PM
> >>>>> To: Adrien Mazarguil
> >>>>> Cc: Morten Brørup; Wiles, Keith; dev@dpdk.org; Olivier Matz; Oleg
> >>>>> Kuporosov
> >>>>> Subject: Re: [dpdk-dev] mbuf changes
> >>>>>
> >>>>> On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
> >>>>>> On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup wrote:
> >>>>>>> Comments inline.
> >>>>>>>
> >>>>>>> Med venlig hilsen / kind regards
> >>>>>>> - Morten Brørup
> >>>>>>>
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> >>>>>>>> Sent: Tuesday, October 25, 2016 11:39 AM
> >>>>>>>> To: Bruce Richardson
> >>>>>>>> Cc: Wiles, Keith; Morten Brørup; dev@dpdk.org; Olivier Matz;
> >>>>>>>> Oleg Kuporosov
> >>>>>>>> Subject: Re: [dpdk-dev] mbuf changes
> >>>>>>>>
> >>>>>>>> On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson
> >>> wrote:
> >>>>>>>>> On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith
> >>> wrote:
> >>>>>>>> [...]
> >>>>>>>>>>> On Oct 24, 2016, at 10:49 AM, Morten Brørup
> >>>>>>>> <mb@smartsharesystems.com> wrote:
> >>>>>>>> [...]
> >>>>
> >>>>>>>>> One other point I'll mention is that we need to have a
> >>>>>>>>> discussion on how/where to add in a timestamp value into
> >>> the
> >>>>>>>>> mbuf. Personally, I think it can be in a union with the
> >>>>> sequence
> >>>>>>>>> number value, but I also suspect that 32-bits of a
> >>> timestamp
> >>>>>>>>> is not going to be enough for
> >>>>>>>> many.
> >>>>>>>>>
> >>>>>>>>> Thoughts?
> >>>>>>>>
> >>>>>>>> If we consider that timestamp representation should use
> >>>>> nanosecond
> >>>>>>>> granularity, a 32-bit value may likely wrap around too
> >>> quickly
> >>>>>>>> to be useful. We can also assume that applications requesting
> >>>>>>>> timestamps may care more about latency than throughput, Oleg
> >>>>> found
> >>>>>>>> that using the second cache line for this purpose had a
> >>>>> noticeable impact [1].
> >>>>>>>>
> >>>>>>>>  [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html
> >>>>>>>
> >>>>>>> I agree with Oleg about the latency vs. throughput importance
> >>>>>>> for
> >>>>> such applications.
> >>>>>>>
> >>>>>>> If you need high resolution timestamps, consider them to be
> >>>>> generated by the NIC RX driver, possibly by the hardware itself
> >>>>> (http://w3new.napatech.com/features/time-precision/hardware-time-
> >>>>> stamp), so the timestamp belongs in the first cache line. And I am
> >>>>> proposing that it should have the highest possible accuracy, which
> >>>>> makes the value hardware dependent.
> >>>>>>>
> >>>>>>> Furthermore, I am arguing that we leave it up to the
> >>> application
> >>>>>>> to
> >>>>> keep track of the slowly moving bits (i.e. counting whole seconds,
> >>>>> hours and calendar date) out of band, so we don't use precious
> >>> space
> >>>>> in the mbuf. The application doesn't need the NIC RX driver's fast
> >>>>> path to capture which date (or even which second) a packet was
> >>>>> received. Yes, it adds complexity to the application, but we can't
> >>>>> set aside 64 bit for a generic timestamp. Or as a weird tradeoff:
> >>>>> Put the fast moving 32 bit in the first cache line and the slow
> >>>>> moving 32 bit in the second cache line, as a placeholder for the
> >>> application to fill out if needed.
> >>>>> Yes, it means that the application needs to check the time and
> >>>>> update its variable holding the slow moving time once every second
> >>>>> or so; but that should be doable without significant effort.
> >>>>>>
> >>>>>> That's a good point, however without a 64 bit value, elapsed time
> >>>>>> between two arbitrary mbufs cannot be measured reliably due to
> >>> not
> >>>>>> enough context, one way or another the low resolution value is
> >>>>>> also
> >>>>> needed.
> >>>>>>
> >>>>>> Obviously latency-sensitive applications are unlikely to perform
> >>>>>> lengthy buffering and require this but I'm not sure about all the
> >>>>>> possible use-cases. Considering many NICs expose 64 bit
> >>> timestaps,
> >>>>>> I suggest we do not truncate them.
> >>>>>>
> >>>>>> I'm not a fan of the weird tradeoff either, PMDs will be tempted
> >>>>>> to fill the extra 32 bits whenever they can and negate the
> >>>>>> performance improvement of the first cache line.
> >>>>>
> >>>>> I would tend to agree, and I don't really see any convenient way
> >>>>> to avoid putting in a 64-bit field for the timestamp in cache-line 0.
> >>>>> If we are ok with having this overlap/partially overlap with
> >>>>> sequence number, it will use up an extra 4B of storage in that
> >>> cacheline.
> >>>>
> >>>> I agree about the lack of convenience! And Adrien certainly has a
> >>> point about PMD temptations.
> >>>>
> >>>> However, I still don't think that a NICs ability to date-stamp a
> >>> packet is sufficient reason to put a date-stamp in cache line 0 of
> >>> the mbuf. Storing only the fast moving 32 bit in cache line 0 seems
> >>> like a good compromise to me.
> >>>>
> >>>> Maybe you can find just one more byte, so it can hold 17 minutes
> >>>> with nanosecond resolution. (I'm joking!)
> >>>>
> >>>> Please don't sacrifice the sequence number for the
> >>>> seconds/hours/days
> >>> part a timestamp. Maybe it could be configurable to use a 32 bit or
> >>> 64 bit timestamp.
> >>>>
> >>> Do you see both timestamp and sequence numbers being used together?
> >>> I would have thought that apps would either use one or the other?
> >>> However, your suggestion is workable in any case, to allow the
> >>> sequence number to overlap just the high 32 bits of the timestamp,
> >>> rather than the low.
> >>
> >> In our case, I can foresee sequence numbers used for packet processing and
> timestamps for timing analysis (and possibly for packet capturing, when being
> used). For timing analysis, we don’t need long durations, e.g. 4 seconds with 32
> bit nanosecond resolution suffices. And for packet capturing we are perfectly
> capable of adding the slowly moving 32 bit of the timestamp to our output data
> stream without fetching it from the mbuf.
> >>
> 
> We should keep in mind that today we have the seqn field but it is not used by
> any PMD. In case it is implemented, would it be a per-queue sequence number?
> Is it useful from an application view?
> 
> This field is only used by the librte_reorder library, and in my opinion, we should
> consider moving it in the second cache line since it is not filled by the PMD.
> 
> 
> > For the 32-bit timestamp case, it might be useful to have a
> > right-shift value passed in to the ethdev driver. If we assume a NIC
> > with nanosecond resolution, (or TSC value with resolution of that
> > order of magnitude), then the app can choose to have 1 ns resolution
> > with 4 second wraparound, or alternatively 4ns resolution with 16
> > second wraparound, or even microsecond resolution with wrap around of over
> an hour.
> > The cost is obviously just a shift op in the driver code per packet -
> > hopefully with multiple packets done at a time using vector operations.
> 
> 
> About the timestamp, we can manage to find 64 bits in the first cache line,
> without sacrifying any field we have today. The question is more for the fields
> we may want to add later.
> 
> To answer to the question of the size of the timestamp, the first question is to
> know what is the precision required for the applications using it?

As part of the 17.02 latency calculation feature, the requirement is to report latency in nanoseconds. So, +1 on keeping timestamp as 64bits. 
Since the feature is planned for 17.02, can we finalize on the timestamp position in the mbuf struct? 
Based on the decision, I am planning to make the change and send ABI notice if needed.

Reshma

> 
> I don't quite like the idea of splitting the timestamp in the 2 cache lines, I think it
> would not be easy to use.
> 
> 
> Olivier

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-28 13:34                       ` Pattan, Reshma
@ 2016-10-28 14:11                         ` Morten Brørup
  2016-10-28 15:25                           ` Pattan, Reshma
  2016-10-28 16:50                           ` Adrien Mazarguil
  0 siblings, 2 replies; 46+ messages in thread
From: Morten Brørup @ 2016-10-28 14:11 UTC (permalink / raw)
  To: Pattan, Reshma, Olivier Matz; +Cc: dev

Comments at the end.

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Pattan, Reshma
> Sent: Friday, October 28, 2016 3:35 PM
> To: Olivier Matz
> Cc: dev@dpdk.org; Morten Brørup
> Subject: Re: [dpdk-dev] mbuf changes
> 
> Hi Olivier,
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Olivier Matz
> > Sent: Tuesday, October 25, 2016 1:49 PM
> > To: Richardson, Bruce <bruce.richardson@intel.com>; Morten Brørup
> > <mb@smartsharesystems.com>
> > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Wiles, Keith
> > <keith.wiles@intel.com>; dev@dpdk.org; Oleg Kuporosov
> > <olegk@mellanox.com>
> > Subject: Re: [dpdk-dev] mbuf changes
> >
> >
> >
> > On 10/25/2016 02:45 PM, Bruce Richardson wrote:
> > > On Tue, Oct 25, 2016 at 02:33:55PM +0200, Morten Brørup wrote:
> > >> Comments at the end.
> > >>
> > >> Med venlig hilsen / kind regards
> > >> - Morten Brørup
> > >>
> > >>> -----Original Message-----
> > >>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > >>> Sent: Tuesday, October 25, 2016 2:20 PM
> > >>> To: Morten Brørup
> > >>> Cc: Adrien Mazarguil; Wiles, Keith; dev@dpdk.org; Olivier Matz;
> > >>> Oleg Kuporosov
> > >>> Subject: Re: [dpdk-dev] mbuf changes
> > >>>
> > >>> On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Brørup wrote:
> > >>>> Comments inline.
> > >>>>
> > >>>>> -----Original Message-----
> > >>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
> > >>>>> Richardson
> > >>>>> Sent: Tuesday, October 25, 2016 1:14 PM
> > >>>>> To: Adrien Mazarguil
> > >>>>> Cc: Morten Brørup; Wiles, Keith; dev@dpdk.org; Olivier Matz;
> > >>>>> Oleg Kuporosov
> > >>>>> Subject: Re: [dpdk-dev] mbuf changes
> > >>>>>
> > >>>>> On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil
> wrote:
> > >>>>>> On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup wrote:
> > >>>>>>> Comments inline.
> > >>>>>>>
> > >>>>>>> Med venlig hilsen / kind regards
> > >>>>>>> - Morten Brørup
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> -----Original Message-----
> > >>>>>>>> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> > >>>>>>>> Sent: Tuesday, October 25, 2016 11:39 AM
> > >>>>>>>> To: Bruce Richardson
> > >>>>>>>> Cc: Wiles, Keith; Morten Brørup; dev@dpdk.org; Olivier Matz;
> > >>>>>>>> Oleg Kuporosov
> > >>>>>>>> Subject: Re: [dpdk-dev] mbuf changes
> > >>>>>>>>
> > >>>>>>>> On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson
> > >>> wrote:
> > >>>>>>>>> On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith
> > >>> wrote:
> > >>>>>>>> [...]
> > >>>>>>>>>>> On Oct 24, 2016, at 10:49 AM, Morten Brørup
> > >>>>>>>> <mb@smartsharesystems.com> wrote:
> > >>>>>>>> [...]
> > >>>>
> > >>>>>>>>> One other point I'll mention is that we need to have a
> > >>>>>>>>> discussion on how/where to add in a timestamp value into
> > >>> the
> > >>>>>>>>> mbuf. Personally, I think it can be in a union with the
> > >>>>> sequence
> > >>>>>>>>> number value, but I also suspect that 32-bits of a
> > >>> timestamp
> > >>>>>>>>> is not going to be enough for
> > >>>>>>>> many.
> > >>>>>>>>>
> > >>>>>>>>> Thoughts?
> > >>>>>>>>
> > >>>>>>>> If we consider that timestamp representation should use
> > >>>>> nanosecond
> > >>>>>>>> granularity, a 32-bit value may likely wrap around too
> > >>> quickly
> > >>>>>>>> to be useful. We can also assume that applications
> requesting
> > >>>>>>>> timestamps may care more about latency than throughput, Oleg
> > >>>>> found
> > >>>>>>>> that using the second cache line for this purpose had a
> > >>>>> noticeable impact [1].
> > >>>>>>>>
> > >>>>>>>>  [1] http://dpdk.org/ml/archives/dev/2016-
> October/049237.html
> > >>>>>>>
> > >>>>>>> I agree with Oleg about the latency vs. throughput importance
> > >>>>>>> for
> > >>>>> such applications.
> > >>>>>>>
> > >>>>>>> If you need high resolution timestamps, consider them to be
> > >>>>> generated by the NIC RX driver, possibly by the hardware itself
> > >>>>> (http://w3new.napatech.com/features/time-precision/hardware-
> time
> > >>>>> - stamp), so the timestamp belongs in the first cache line. And
> > >>>>> I am proposing that it should have the highest possible
> > >>>>> accuracy, which makes the value hardware dependent.
> > >>>>>>>
> > >>>>>>> Furthermore, I am arguing that we leave it up to the
> > >>> application
> > >>>>>>> to
> > >>>>> keep track of the slowly moving bits (i.e. counting whole
> > >>>>> seconds, hours and calendar date) out of band, so we don't use
> > >>>>> precious
> > >>> space
> > >>>>> in the mbuf. The application doesn't need the NIC RX driver's
> > >>>>> fast path to capture which date (or even which second) a packet
> > >>>>> was received. Yes, it adds complexity to the application, but
> we
> > >>>>> can't set aside 64 bit for a generic timestamp. Or as a weird
> tradeoff:
> > >>>>> Put the fast moving 32 bit in the first cache line and the slow
> > >>>>> moving 32 bit in the second cache line, as a placeholder for
> the
> > >>> application to fill out if needed.
> > >>>>> Yes, it means that the application needs to check the time and
> > >>>>> update its variable holding the slow moving time once every
> > >>>>> second or so; but that should be doable without significant
> effort.
> > >>>>>>
> > >>>>>> That's a good point, however without a 64 bit value, elapsed
> > >>>>>> time between two arbitrary mbufs cannot be measured reliably
> > >>>>>> due to
> > >>> not
> > >>>>>> enough context, one way or another the low resolution value is
> > >>>>>> also
> > >>>>> needed.
> > >>>>>>
> > >>>>>> Obviously latency-sensitive applications are unlikely to
> > >>>>>> perform lengthy buffering and require this but I'm not sure
> > >>>>>> about all the possible use-cases. Considering many NICs expose
> > >>>>>> 64 bit
> > >>> timestaps,
> > >>>>>> I suggest we do not truncate them.
> > >>>>>>
> > >>>>>> I'm not a fan of the weird tradeoff either, PMDs will be
> > >>>>>> tempted to fill the extra 32 bits whenever they can and negate
> > >>>>>> the performance improvement of the first cache line.
> > >>>>>
> > >>>>> I would tend to agree, and I don't really see any convenient
> way
> > >>>>> to avoid putting in a 64-bit field for the timestamp in cache-
> line 0.
> > >>>>> If we are ok with having this overlap/partially overlap with
> > >>>>> sequence number, it will use up an extra 4B of storage in that
> > >>> cacheline.
> > >>>>
> > >>>> I agree about the lack of convenience! And Adrien certainly has
> a
> > >>> point about PMD temptations.
> > >>>>
> > >>>> However, I still don't think that a NICs ability to date-stamp a
> > >>> packet is sufficient reason to put a date-stamp in cache line 0
> of
> > >>> the mbuf. Storing only the fast moving 32 bit in cache line 0
> > >>> seems like a good compromise to me.
> > >>>>
> > >>>> Maybe you can find just one more byte, so it can hold 17 minutes
> > >>>> with nanosecond resolution. (I'm joking!)
> > >>>>
> > >>>> Please don't sacrifice the sequence number for the
> > >>>> seconds/hours/days
> > >>> part a timestamp. Maybe it could be configurable to use a 32 bit
> > >>> or
> > >>> 64 bit timestamp.
> > >>>>
> > >>> Do you see both timestamp and sequence numbers being used
> together?
> > >>> I would have thought that apps would either use one or the other?
> > >>> However, your suggestion is workable in any case, to allow the
> > >>> sequence number to overlap just the high 32 bits of the
> timestamp,
> > >>> rather than the low.
> > >>
> > >> In our case, I can foresee sequence numbers used for packet
> > >> processing and
> > timestamps for timing analysis (and possibly for packet capturing,
> > when being used). For timing analysis, we don’t need long durations,
> > e.g. 4 seconds with 32 bit nanosecond resolution suffices. And for
> > packet capturing we are perfectly capable of adding the slowly moving
> > 32 bit of the timestamp to our output data stream without fetching it
> from the mbuf.
> > >>
> >
> > We should keep in mind that today we have the seqn field but it is
> not
> > used by any PMD. In case it is implemented, would it be a per-queue
> sequence number?
> > Is it useful from an application view?
> >
> > This field is only used by the librte_reorder library, and in my
> > opinion, we should consider moving it in the second cache line since
> it is not filled by the PMD.
> >
> >
> > > For the 32-bit timestamp case, it might be useful to have a
> > > right-shift value passed in to the ethdev driver. If we assume a
> NIC
> > > with nanosecond resolution, (or TSC value with resolution of that
> > > order of magnitude), then the app can choose to have 1 ns
> resolution
> > > with 4 second wraparound, or alternatively 4ns resolution with 16
> > > second wraparound, or even microsecond resolution with wrap around
> > > of over
> > an hour.
> > > The cost is obviously just a shift op in the driver code per packet
> > > - hopefully with multiple packets done at a time using vector
> operations.
> >
> >
> > About the timestamp, we can manage to find 64 bits in the first cache
> > line, without sacrifying any field we have today. The question is
> more
> > for the fields we may want to add later.
> >
> > To answer to the question of the size of the timestamp, the first
> > question is to know what is the precision required for the
> applications using it?
> 
> As part of the 17.02 latency calculation feature, the requirement is to
> report latency in nanoseconds. So, +1 on keeping timestamp as 64bits.
> Since the feature is planned for 17.02, can we finalize on the
> timestamp position in the mbuf struct?
> Based on the decision, I am planning to make the change and send ABI
> notice if needed.
> 
> Reshma
> 
> >
> > I don't quite like the idea of splitting the timestamp in the 2 cache
> > lines, I think it would not be easy to use.
> >
> >
> > Olivier

Nanosecond precision in latency calculations does not require more than 32 bit, unless we must also be able to measure more than 4 seconds of latency. And please consider how many packets we would need to hold in memory to keep track of 4 seconds of traffic on a 100 Gbit/s link (if we are measuring network latency): With an average packet size of 300 B that would be 40 million packets.

If I recall correctly, the 17.02 feature is about measuring application latency, not network latency (as my calculation above is about). I don't consider application latency measuring a common feature for most applications, except for application debugging/profiling purposes (not network debugging purposes). So do we really need both nanosecond precision and the ability to measure 4+ seconds of latency?

Furthermore, if the timestamp isn't set by hardware, why not just take the entire array of packets pulled out from the NIC and set their timestamp to the same value, instead of setting them one by one in the PMD - that would also consolidate the expense of obtaining the time from some clock source.

-Morten


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-28 14:11                         ` Morten Brørup
@ 2016-10-28 15:25                           ` Pattan, Reshma
  2016-10-28 16:50                           ` Adrien Mazarguil
  1 sibling, 0 replies; 46+ messages in thread
From: Pattan, Reshma @ 2016-10-28 15:25 UTC (permalink / raw)
  To: Morten Brørup, Olivier Matz; +Cc: dev



> -----Original Message-----
> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Friday, October 28, 2016 3:12 PM
> To: Pattan, Reshma <reshma.pattan@intel.com>; Olivier Matz
> <olivier.matz@6wind.com>
> Cc: dev@dpdk.org
> Subject: RE: [dpdk-dev] mbuf changes
> 
> Comments at the end.
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Pattan, Reshma
> > Sent: Friday, October 28, 2016 3:35 PM
> > To: Olivier Matz
> > Cc: dev@dpdk.org; Morten Brørup
> > Subject: Re: [dpdk-dev] mbuf changes
> >
> > Hi Olivier,
> >
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Olivier Matz
> > > Sent: Tuesday, October 25, 2016 1:49 PM
> > > To: Richardson, Bruce <bruce.richardson@intel.com>; Morten Brørup
> > > <mb@smartsharesystems.com>
> > > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Wiles, Keith
> > > <keith.wiles@intel.com>; dev@dpdk.org; Oleg Kuporosov
> > > <olegk@mellanox.com>
> > > Subject: Re: [dpdk-dev] mbuf changes
> > >
> > >
> > >
> > > On 10/25/2016 02:45 PM, Bruce Richardson wrote:
> > > > On Tue, Oct 25, 2016 at 02:33:55PM +0200, Morten Brørup wrote:
> > > >> Comments at the end.
> > > >>
> > > >> Med venlig hilsen / kind regards
> > > >> - Morten Brørup
> > > >>
> > > >>> -----Original Message-----
> > > >>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > >>> Sent: Tuesday, October 25, 2016 2:20 PM
> > > >>> To: Morten Brørup
> > > >>> Cc: Adrien Mazarguil; Wiles, Keith; dev@dpdk.org; Olivier Matz;
> > > >>> Oleg Kuporosov
> > > >>> Subject: Re: [dpdk-dev] mbuf changes
> > > >>>
> > > >>> On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Brørup wrote:
> > > >>>> Comments inline.
> > > >>>>
> > > >>>>> -----Original Message-----
> > > >>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
> > > >>>>> Richardson
> > > >>>>> Sent: Tuesday, October 25, 2016 1:14 PM
> > > >>>>> To: Adrien Mazarguil
> > > >>>>> Cc: Morten Brørup; Wiles, Keith; dev@dpdk.org; Olivier Matz;
> > > >>>>> Oleg Kuporosov
> > > >>>>> Subject: Re: [dpdk-dev] mbuf changes
> > > >>>>>
> > > >>>>> On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil
> > wrote:
> > > >>>>>> On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup wrote:
> > > >>>>>>> Comments inline.
> > > >>>>>>>
> > > >>>>>>> Med venlig hilsen / kind regards
> > > >>>>>>> - Morten Brørup
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>> -----Original Message-----
> > > >>>>>>>> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> > > >>>>>>>> Sent: Tuesday, October 25, 2016 11:39 AM
> > > >>>>>>>> To: Bruce Richardson
> > > >>>>>>>> Cc: Wiles, Keith; Morten Brørup; dev@dpdk.org; Olivier
> > > >>>>>>>> Matz; Oleg Kuporosov
> > > >>>>>>>> Subject: Re: [dpdk-dev] mbuf changes
> > > >>>>>>>>
> > > >>>>>>>> On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson
> > > >>> wrote:
> > > >>>>>>>>> On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith
> > > >>> wrote:
> > > >>>>>>>> [...]
> > > >>>>>>>>>>> On Oct 24, 2016, at 10:49 AM, Morten Brørup
> > > >>>>>>>> <mb@smartsharesystems.com> wrote:
> > > >>>>>>>> [...]
> > > >>>>
> > > >>>>>>>>> One other point I'll mention is that we need to have a
> > > >>>>>>>>> discussion on how/where to add in a timestamp value into
> > > >>> the
> > > >>>>>>>>> mbuf. Personally, I think it can be in a union with the
> > > >>>>> sequence
> > > >>>>>>>>> number value, but I also suspect that 32-bits of a
> > > >>> timestamp
> > > >>>>>>>>> is not going to be enough for
> > > >>>>>>>> many.
> > > >>>>>>>>>
> > > >>>>>>>>> Thoughts?
> > > >>>>>>>>
> > > >>>>>>>> If we consider that timestamp representation should use
> > > >>>>> nanosecond
> > > >>>>>>>> granularity, a 32-bit value may likely wrap around too
> > > >>> quickly
> > > >>>>>>>> to be useful. We can also assume that applications
> > requesting
> > > >>>>>>>> timestamps may care more about latency than throughput,
> > > >>>>>>>> Oleg
> > > >>>>> found
> > > >>>>>>>> that using the second cache line for this purpose had a
> > > >>>>> noticeable impact [1].
> > > >>>>>>>>
> > > >>>>>>>>  [1] http://dpdk.org/ml/archives/dev/2016-
> > October/049237.html
> > > >>>>>>>
> > > >>>>>>> I agree with Oleg about the latency vs. throughput
> > > >>>>>>> importance for
> > > >>>>> such applications.
> > > >>>>>>>
> > > >>>>>>> If you need high resolution timestamps, consider them to be
> > > >>>>> generated by the NIC RX driver, possibly by the hardware
> > > >>>>> itself
> > > >>>>> (http://w3new.napatech.com/features/time-precision/hardware-
> > time
> > > >>>>> - stamp), so the timestamp belongs in the first cache line.
> > > >>>>> And I am proposing that it should have the highest possible
> > > >>>>> accuracy, which makes the value hardware dependent.
> > > >>>>>>>
> > > >>>>>>> Furthermore, I am arguing that we leave it up to the
> > > >>> application
> > > >>>>>>> to
> > > >>>>> keep track of the slowly moving bits (i.e. counting whole
> > > >>>>> seconds, hours and calendar date) out of band, so we don't use
> > > >>>>> precious
> > > >>> space
> > > >>>>> in the mbuf. The application doesn't need the NIC RX driver's
> > > >>>>> fast path to capture which date (or even which second) a
> > > >>>>> packet was received. Yes, it adds complexity to the
> > > >>>>> application, but
> > we
> > > >>>>> can't set aside 64 bit for a generic timestamp. Or as a weird
> > tradeoff:
> > > >>>>> Put the fast moving 32 bit in the first cache line and the
> > > >>>>> slow moving 32 bit in the second cache line, as a placeholder
> > > >>>>> for
> > the
> > > >>> application to fill out if needed.
> > > >>>>> Yes, it means that the application needs to check the time and
> > > >>>>> update its variable holding the slow moving time once every
> > > >>>>> second or so; but that should be doable without significant
> > effort.
> > > >>>>>>
> > > >>>>>> That's a good point, however without a 64 bit value, elapsed
> > > >>>>>> time between two arbitrary mbufs cannot be measured reliably
> > > >>>>>> due to
> > > >>> not
> > > >>>>>> enough context, one way or another the low resolution value
> > > >>>>>> is also
> > > >>>>> needed.
> > > >>>>>>
> > > >>>>>> Obviously latency-sensitive applications are unlikely to
> > > >>>>>> perform lengthy buffering and require this but I'm not sure
> > > >>>>>> about all the possible use-cases. Considering many NICs
> > > >>>>>> expose
> > > >>>>>> 64 bit
> > > >>> timestaps,
> > > >>>>>> I suggest we do not truncate them.
> > > >>>>>>
> > > >>>>>> I'm not a fan of the weird tradeoff either, PMDs will be
> > > >>>>>> tempted to fill the extra 32 bits whenever they can and
> > > >>>>>> negate the performance improvement of the first cache line.
> > > >>>>>
> > > >>>>> I would tend to agree, and I don't really see any convenient
> > way
> > > >>>>> to avoid putting in a 64-bit field for the timestamp in cache-
> > line 0.
> > > >>>>> If we are ok with having this overlap/partially overlap with
> > > >>>>> sequence number, it will use up an extra 4B of storage in that
> > > >>> cacheline.
> > > >>>>
> > > >>>> I agree about the lack of convenience! And Adrien certainly has
> > a
> > > >>> point about PMD temptations.
> > > >>>>
> > > >>>> However, I still don't think that a NICs ability to date-stamp
> > > >>>> a
> > > >>> packet is sufficient reason to put a date-stamp in cache line 0
> > of
> > > >>> the mbuf. Storing only the fast moving 32 bit in cache line 0
> > > >>> seems like a good compromise to me.
> > > >>>>
> > > >>>> Maybe you can find just one more byte, so it can hold 17
> > > >>>> minutes with nanosecond resolution. (I'm joking!)
> > > >>>>
> > > >>>> Please don't sacrifice the sequence number for the
> > > >>>> seconds/hours/days
> > > >>> part a timestamp. Maybe it could be configurable to use a 32 bit
> > > >>> or
> > > >>> 64 bit timestamp.
> > > >>>>
> > > >>> Do you see both timestamp and sequence numbers being used
> > together?
> > > >>> I would have thought that apps would either use one or the other?
> > > >>> However, your suggestion is workable in any case, to allow the
> > > >>> sequence number to overlap just the high 32 bits of the
> > timestamp,
> > > >>> rather than the low.
> > > >>
> > > >> In our case, I can foresee sequence numbers used for packet
> > > >> processing and
> > > timestamps for timing analysis (and possibly for packet capturing,
> > > when being used). For timing analysis, we don’t need long durations,
> > > e.g. 4 seconds with 32 bit nanosecond resolution suffices. And for
> > > packet capturing we are perfectly capable of adding the slowly
> > > moving
> > > 32 bit of the timestamp to our output data stream without fetching
> > > it
> > from the mbuf.
> > > >>
> > >
> > > We should keep in mind that today we have the seqn field but it is
> > not
> > > used by any PMD. In case it is implemented, would it be a per-queue
> > sequence number?
> > > Is it useful from an application view?
> > >
> > > This field is only used by the librte_reorder library, and in my
> > > opinion, we should consider moving it in the second cache line since
> > it is not filled by the PMD.
> > >
> > >
> > > > For the 32-bit timestamp case, it might be useful to have a
> > > > right-shift value passed in to the ethdev driver. If we assume a
> > NIC
> > > > with nanosecond resolution, (or TSC value with resolution of that
> > > > order of magnitude), then the app can choose to have 1 ns
> > resolution
> > > > with 4 second wraparound, or alternatively 4ns resolution with 16
> > > > second wraparound, or even microsecond resolution with wrap around
> > > > of over
> > > an hour.
> > > > The cost is obviously just a shift op in the driver code per
> > > > packet
> > > > - hopefully with multiple packets done at a time using vector
> > operations.
> > >
> > >
> > > About the timestamp, we can manage to find 64 bits in the first
> > > cache line, without sacrifying any field we have today. The question
> > > is
> > more
> > > for the fields we may want to add later.
> > >
> > > To answer to the question of the size of the timestamp, the first
> > > question is to know what is the precision required for the
> > applications using it?
> >
> > As part of the 17.02 latency calculation feature, the requirement is
> > to report latency in nanoseconds. So, +1 on keeping timestamp as 64bits.
> > Since the feature is planned for 17.02, can we finalize on the
> > timestamp position in the mbuf struct?
> > Based on the decision, I am planning to make the change and send ABI
> > notice if needed.
> >
> > Reshma
> >
> > >
> > > I don't quite like the idea of splitting the timestamp in the 2
> > > cache lines, I think it would not be easy to use.
> > >
> > >
> > > Olivier
> 
> Nanosecond precision in latency calculations does not require more than 32 bit,
> unless we must also be able to measure more than 4 seconds of latency. And
> please consider how many packets we would need to hold in memory to keep
> track of 4 seconds of traffic on a 100 Gbit/s link (if we are measuring network
> latency): With an average packet size of 300 B that would be 40 million packets.
> 

Ok, I got the point, For 17.02 feature, latency is not measured for a particular duration of time, instead
 average latency is measured on the fly for every time stamped packet that is observed on the Tx and 
on Rx side packets are marked with timestamps for every sampling intervals. I should be able to say 
my opinion on the size again after having internal discussion.

Reshma

> If I recall correctly, the 17.02 feature is about measuring application latency, not
> network latency (as my calculation above is about). I don't consider application
> latency measuring a common feature for most applications, except for
> application debugging/profiling purposes (not network debugging purposes). So
> do we really need both nanosecond precision and the ability to measure 4+
> seconds of latency?
> 
> Furthermore, if the timestamp isn't set by hardware, why not just take the entire
> array of packets pulled out from the NIC and set their timestamp to the same
> value, instead of setting them one by one in the PMD - that would also
> consolidate the expense of obtaining the time from some clock source.
> 
> -Morten


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-28 14:11                         ` Morten Brørup
  2016-10-28 15:25                           ` Pattan, Reshma
@ 2016-10-28 16:50                           ` Adrien Mazarguil
  2016-10-28 17:00                             ` Richardson, Bruce
  1 sibling, 1 reply; 46+ messages in thread
From: Adrien Mazarguil @ 2016-10-28 16:50 UTC (permalink / raw)
  To: Morten Brørup; +Cc: dev

On Fri, Oct 28, 2016 at 04:11:45PM +0200, Morten Brørup wrote:
> Comments at the end.
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Pattan, Reshma
> > Sent: Friday, October 28, 2016 3:35 PM
> > To: Olivier Matz
> > Cc: dev@dpdk.org; Morten Brørup
> > Subject: Re: [dpdk-dev] mbuf changes
> > 
> > Hi Olivier,
> > 
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Olivier Matz
> > > Sent: Tuesday, October 25, 2016 1:49 PM
> > > To: Richardson, Bruce <bruce.richardson@intel.com>; Morten Brørup
> > > <mb@smartsharesystems.com>
> > > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Wiles, Keith
> > > <keith.wiles@intel.com>; dev@dpdk.org; Oleg Kuporosov
> > > <olegk@mellanox.com>
> > > Subject: Re: [dpdk-dev] mbuf changes
> > >
> > >
> > >
> > > On 10/25/2016 02:45 PM, Bruce Richardson wrote:
> > > > On Tue, Oct 25, 2016 at 02:33:55PM +0200, Morten Brørup wrote:
> > > >> Comments at the end.
> > > >>
> > > >> Med venlig hilsen / kind regards
> > > >> - Morten Brørup
> > > >>
> > > >>> -----Original Message-----
> > > >>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > >>> Sent: Tuesday, October 25, 2016 2:20 PM
> > > >>> To: Morten Brørup
> > > >>> Cc: Adrien Mazarguil; Wiles, Keith; dev@dpdk.org; Olivier Matz;
> > > >>> Oleg Kuporosov
> > > >>> Subject: Re: [dpdk-dev] mbuf changes
> > > >>>
> > > >>> On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Brørup wrote:
> > > >>>> Comments inline.
> > > >>>>
> > > >>>>> -----Original Message-----
> > > >>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
> > > >>>>> Richardson
> > > >>>>> Sent: Tuesday, October 25, 2016 1:14 PM
> > > >>>>> To: Adrien Mazarguil
> > > >>>>> Cc: Morten Brørup; Wiles, Keith; dev@dpdk.org; Olivier Matz;
> > > >>>>> Oleg Kuporosov
> > > >>>>> Subject: Re: [dpdk-dev] mbuf changes
> > > >>>>>
> > > >>>>> On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil
> > wrote:
> > > >>>>>> On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup wrote:
> > > >>>>>>> Comments inline.
> > > >>>>>>>
> > > >>>>>>> Med venlig hilsen / kind regards
> > > >>>>>>> - Morten Brørup
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>> -----Original Message-----
> > > >>>>>>>> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> > > >>>>>>>> Sent: Tuesday, October 25, 2016 11:39 AM
> > > >>>>>>>> To: Bruce Richardson
> > > >>>>>>>> Cc: Wiles, Keith; Morten Brørup; dev@dpdk.org; Olivier Matz;
> > > >>>>>>>> Oleg Kuporosov
> > > >>>>>>>> Subject: Re: [dpdk-dev] mbuf changes
> > > >>>>>>>>
> > > >>>>>>>> On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson
> > > >>> wrote:
> > > >>>>>>>>> On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith
> > > >>> wrote:
> > > >>>>>>>> [...]
> > > >>>>>>>>>>> On Oct 24, 2016, at 10:49 AM, Morten Brørup
> > > >>>>>>>> <mb@smartsharesystems.com> wrote:
> > > >>>>>>>> [...]
> > > >>>>
> > > >>>>>>>>> One other point I'll mention is that we need to have a
> > > >>>>>>>>> discussion on how/where to add in a timestamp value into
> > > >>> the
> > > >>>>>>>>> mbuf. Personally, I think it can be in a union with the
> > > >>>>> sequence
> > > >>>>>>>>> number value, but I also suspect that 32-bits of a
> > > >>> timestamp
> > > >>>>>>>>> is not going to be enough for
> > > >>>>>>>> many.
> > > >>>>>>>>>
> > > >>>>>>>>> Thoughts?
> > > >>>>>>>>
> > > >>>>>>>> If we consider that timestamp representation should use
> > > >>>>> nanosecond
> > > >>>>>>>> granularity, a 32-bit value may likely wrap around too
> > > >>> quickly
> > > >>>>>>>> to be useful. We can also assume that applications
> > requesting
> > > >>>>>>>> timestamps may care more about latency than throughput, Oleg
> > > >>>>> found
> > > >>>>>>>> that using the second cache line for this purpose had a
> > > >>>>> noticeable impact [1].
> > > >>>>>>>>
> > > >>>>>>>>  [1] http://dpdk.org/ml/archives/dev/2016-
> > October/049237.html
> > > >>>>>>>
> > > >>>>>>> I agree with Oleg about the latency vs. throughput importance
> > > >>>>>>> for
> > > >>>>> such applications.
> > > >>>>>>>
> > > >>>>>>> If you need high resolution timestamps, consider them to be
> > > >>>>> generated by the NIC RX driver, possibly by the hardware itself
> > > >>>>> (http://w3new.napatech.com/features/time-precision/hardware-
> > time
> > > >>>>> - stamp), so the timestamp belongs in the first cache line. And
> > > >>>>> I am proposing that it should have the highest possible
> > > >>>>> accuracy, which makes the value hardware dependent.
> > > >>>>>>>
> > > >>>>>>> Furthermore, I am arguing that we leave it up to the
> > > >>> application
> > > >>>>>>> to
> > > >>>>> keep track of the slowly moving bits (i.e. counting whole
> > > >>>>> seconds, hours and calendar date) out of band, so we don't use
> > > >>>>> precious
> > > >>> space
> > > >>>>> in the mbuf. The application doesn't need the NIC RX driver's
> > > >>>>> fast path to capture which date (or even which second) a packet
> > > >>>>> was received. Yes, it adds complexity to the application, but
> > we
> > > >>>>> can't set aside 64 bit for a generic timestamp. Or as a weird
> > tradeoff:
> > > >>>>> Put the fast moving 32 bit in the first cache line and the slow
> > > >>>>> moving 32 bit in the second cache line, as a placeholder for
> > the
> > > >>> application to fill out if needed.
> > > >>>>> Yes, it means that the application needs to check the time and
> > > >>>>> update its variable holding the slow moving time once every
> > > >>>>> second or so; but that should be doable without significant
> > effort.
> > > >>>>>>
> > > >>>>>> That's a good point, however without a 64 bit value, elapsed
> > > >>>>>> time between two arbitrary mbufs cannot be measured reliably
> > > >>>>>> due to
> > > >>> not
> > > >>>>>> enough context, one way or another the low resolution value is
> > > >>>>>> also
> > > >>>>> needed.
> > > >>>>>>
> > > >>>>>> Obviously latency-sensitive applications are unlikely to
> > > >>>>>> perform lengthy buffering and require this but I'm not sure
> > > >>>>>> about all the possible use-cases. Considering many NICs expose
> > > >>>>>> 64 bit
> > > >>> timestaps,
> > > >>>>>> I suggest we do not truncate them.
> > > >>>>>>
> > > >>>>>> I'm not a fan of the weird tradeoff either, PMDs will be
> > > >>>>>> tempted to fill the extra 32 bits whenever they can and negate
> > > >>>>>> the performance improvement of the first cache line.
> > > >>>>>
> > > >>>>> I would tend to agree, and I don't really see any convenient
> > way
> > > >>>>> to avoid putting in a 64-bit field for the timestamp in cache-
> > line 0.
> > > >>>>> If we are ok with having this overlap/partially overlap with
> > > >>>>> sequence number, it will use up an extra 4B of storage in that
> > > >>> cacheline.
> > > >>>>
> > > >>>> I agree about the lack of convenience! And Adrien certainly has
> > a
> > > >>> point about PMD temptations.
> > > >>>>
> > > >>>> However, I still don't think that a NICs ability to date-stamp a
> > > >>> packet is sufficient reason to put a date-stamp in cache line 0
> > of
> > > >>> the mbuf. Storing only the fast moving 32 bit in cache line 0
> > > >>> seems like a good compromise to me.
> > > >>>>
> > > >>>> Maybe you can find just one more byte, so it can hold 17 minutes
> > > >>>> with nanosecond resolution. (I'm joking!)
> > > >>>>
> > > >>>> Please don't sacrifice the sequence number for the
> > > >>>> seconds/hours/days
> > > >>> part a timestamp. Maybe it could be configurable to use a 32 bit
> > > >>> or
> > > >>> 64 bit timestamp.
> > > >>>>
> > > >>> Do you see both timestamp and sequence numbers being used
> > together?
> > > >>> I would have thought that apps would either use one or the other?
> > > >>> However, your suggestion is workable in any case, to allow the
> > > >>> sequence number to overlap just the high 32 bits of the
> > timestamp,
> > > >>> rather than the low.
> > > >>
> > > >> In our case, I can foresee sequence numbers used for packet
> > > >> processing and
> > > timestamps for timing analysis (and possibly for packet capturing,
> > > when being used). For timing analysis, we don’t need long durations,
> > > e.g. 4 seconds with 32 bit nanosecond resolution suffices. And for
> > > packet capturing we are perfectly capable of adding the slowly moving
> > > 32 bit of the timestamp to our output data stream without fetching it
> > from the mbuf.
> > > >>
> > >
> > > We should keep in mind that today we have the seqn field but it is
> > not
> > > used by any PMD. In case it is implemented, would it be a per-queue
> > sequence number?
> > > Is it useful from an application view?
> > >
> > > This field is only used by the librte_reorder library, and in my
> > > opinion, we should consider moving it in the second cache line since
> > it is not filled by the PMD.
> > >
> > >
> > > > For the 32-bit timestamp case, it might be useful to have a
> > > > right-shift value passed in to the ethdev driver. If we assume a
> > NIC
> > > > with nanosecond resolution, (or TSC value with resolution of that
> > > > order of magnitude), then the app can choose to have 1 ns
> > resolution
> > > > with 4 second wraparound, or alternatively 4ns resolution with 16
> > > > second wraparound, or even microsecond resolution with wrap around
> > > > of over
> > > an hour.
> > > > The cost is obviously just a shift op in the driver code per packet
> > > > - hopefully with multiple packets done at a time using vector
> > operations.
> > >
> > >
> > > About the timestamp, we can manage to find 64 bits in the first cache
> > > line, without sacrifying any field we have today. The question is
> > more
> > > for the fields we may want to add later.
> > >
> > > To answer to the question of the size of the timestamp, the first
> > > question is to know what is the precision required for the
> > applications using it?
> > 
> > As part of the 17.02 latency calculation feature, the requirement is to
> > report latency in nanoseconds. So, +1 on keeping timestamp as 64bits.
> > Since the feature is planned for 17.02, can we finalize on the
> > timestamp position in the mbuf struct?
> > Based on the decision, I am planning to make the change and send ABI
> > notice if needed.
> > 
> > Reshma
> > 
> > >
> > > I don't quite like the idea of splitting the timestamp in the 2 cache
> > > lines, I think it would not be easy to use.
> > >
> > >
> > > Olivier
> 
> Nanosecond precision in latency calculations does not require more than 32 bit, unless we must also be able to measure more than 4 seconds of latency. And please consider how many packets we would need to hold in memory to keep track of 4 seconds of traffic on a 100 Gbit/s link (if we are measuring network latency): With an average packet size of 300 B that would be 40 million packets.

Consider another, perhaps more realistic use-case: keeping a few relevant
packets from a stream for later analysis, which could occur outside of the
normal TX processing path for performance reasons. In this case 4 seconds
worth of context may not be enough to avoid losing information.

> If I recall correctly, the 17.02 feature is about measuring application latency, not network latency (as my calculation above is about). I don't consider application latency measuring a common feature for most applications, except for application debugging/profiling purposes (not network debugging purposes). So do we really need both nanosecond precision and the ability to measure 4+ seconds of latency?

Existing HW can already provide 64-bit precision, why truncate part of it?
The only valid reason would be that there is not enough room in the mbuf
header, which is not the case. Looks like today there is even enough room in
the first 64 bytes.

As suggested by Bruce, we can even union (part of) this field with another
if necessary, as long as we make sure both cannot be requested at the same
time (determining which field should be sacrificed is going to be tricky
though).

> Furthermore, if the timestamp isn't set by hardware, why not just take the entire array of packets pulled out from the NIC and set their timestamp to the same value, instead of setting them one by one in the PMD - that would also consolidate the expense of obtaining the time from some clock source.

Like any offload, some PMD processing is expected to convert the possibly
HW-specific value to a common mbuf format, however if hardware does not
report any kind of information relevant to timestamp individual packets, I
do not think the PMD should expose that feature and implement it in
software. It would be much worse, certainly not what the application
wanted as it could have done the same on its own.

What matters is to know when NIC receives a packet, not when the application
fetches it. You should assume there is a significant delay between these two
events, hence the need for it to be offloaded.

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-28 16:50                           ` Adrien Mazarguil
@ 2016-10-28 17:00                             ` Richardson, Bruce
  2016-10-28 20:27                               ` Morten Brørup
  2016-10-31 10:55                               ` Morten Brørup
  0 siblings, 2 replies; 46+ messages in thread
From: Richardson, Bruce @ 2016-10-28 17:00 UTC (permalink / raw)
  To: Adrien Mazarguil, Morten Brørup; +Cc: dev

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Adrien Mazarguil
> Sent: Friday, October 28, 2016 5:50 PM
> To: Morten Brørup <mb@smartsharesystems.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] mbuf changes
> 
> On Fri, Oct 28, 2016 at 04:11:45PM +0200, Morten Brørup wrote:
> > Comments at the end.
> >
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Pattan, Reshma
> > > Sent: Friday, October 28, 2016 3:35 PM
> > > To: Olivier Matz
> > > Cc: dev@dpdk.org; Morten Brørup
> > > Subject: Re: [dpdk-dev] mbuf changes
> > >
> > > Hi Olivier,
> > >
> > > > -----Original Message-----
> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Olivier Matz
> > > > Sent: Tuesday, October 25, 2016 1:49 PM
> > > > To: Richardson, Bruce <bruce.richardson@intel.com>; Morten Brørup
> > > > <mb@smartsharesystems.com>
> > > > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Wiles, Keith
> > > > <keith.wiles@intel.com>; dev@dpdk.org; Oleg Kuporosov
> > > > <olegk@mellanox.com>
> > > > Subject: Re: [dpdk-dev] mbuf changes
> > > >
> > > >
> > > >
> > > > On 10/25/2016 02:45 PM, Bruce Richardson wrote:
> > > > > On Tue, Oct 25, 2016 at 02:33:55PM +0200, Morten Brørup wrote:
> > > > >> Comments at the end.
> > > > >>
> > > > >> Med venlig hilsen / kind regards
> > > > >> - Morten Brørup
> > > > >>
> > > > >>> -----Original Message-----
> > > > >>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > >>> Sent: Tuesday, October 25, 2016 2:20 PM
> > > > >>> To: Morten Brørup
> > > > >>> Cc: Adrien Mazarguil; Wiles, Keith; dev@dpdk.org; Olivier
> > > > >>> Matz; Oleg Kuporosov
> > > > >>> Subject: Re: [dpdk-dev] mbuf changes
> > > > >>>
> > > > >>> On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Brørup wrote:
> > > > >>>> Comments inline.
> > > > >>>>
> > > > >>>>> -----Original Message-----
> > > > >>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
> > > > >>>>> Richardson
> > > > >>>>> Sent: Tuesday, October 25, 2016 1:14 PM
> > > > >>>>> To: Adrien Mazarguil
> > > > >>>>> Cc: Morten Brørup; Wiles, Keith; dev@dpdk.org; Olivier Matz;
> > > > >>>>> Oleg Kuporosov
> > > > >>>>> Subject: Re: [dpdk-dev] mbuf changes
> > > > >>>>>
> > > > >>>>> On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil
> > > wrote:
> > > > >>>>>> On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup
> wrote:
> > > > >>>>>>> Comments inline.
> > > > >>>>>>>
> > > > >>>>>>> Med venlig hilsen / kind regards
> > > > >>>>>>> - Morten Brørup
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>> -----Original Message-----
> > > > >>>>>>>> From: Adrien Mazarguil
> > > > >>>>>>>> [mailto:adrien.mazarguil@6wind.com]
> > > > >>>>>>>> Sent: Tuesday, October 25, 2016 11:39 AM
> > > > >>>>>>>> To: Bruce Richardson
> > > > >>>>>>>> Cc: Wiles, Keith; Morten Brørup; dev@dpdk.org; Olivier
> > > > >>>>>>>> Matz; Oleg Kuporosov
> > > > >>>>>>>> Subject: Re: [dpdk-dev] mbuf changes
> > > > >>>>>>>>
> > > > >>>>>>>> On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce
> > > > >>>>>>>> Richardson
> > > > >>> wrote:
> > > > >>>>>>>>> On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith
> > > > >>> wrote:
> > > > >>>>>>>> [...]
> > > > >>>>>>>>>>> On Oct 24, 2016, at 10:49 AM, Morten Brørup
> > > > >>>>>>>> <mb@smartsharesystems.com> wrote:
> > > > >>>>>>>> [...]
> > > > >>>>
> > > > >>>>>>>>> One other point I'll mention is that we need to have a
> > > > >>>>>>>>> discussion on how/where to add in a timestamp value into
> > > > >>> the
> > > > >>>>>>>>> mbuf. Personally, I think it can be in a union with the
> > > > >>>>> sequence
> > > > >>>>>>>>> number value, but I also suspect that 32-bits of a
> > > > >>> timestamp
> > > > >>>>>>>>> is not going to be enough for
> > > > >>>>>>>> many.
> > > > >>>>>>>>>
> > > > >>>>>>>>> Thoughts?
> > > > >>>>>>>>
> > > > >>>>>>>> If we consider that timestamp representation should use
> > > > >>>>> nanosecond
> > > > >>>>>>>> granularity, a 32-bit value may likely wrap around too
> > > > >>> quickly
> > > > >>>>>>>> to be useful. We can also assume that applications
> > > requesting
> > > > >>>>>>>> timestamps may care more about latency than throughput,
> > > > >>>>>>>> Oleg
> > > > >>>>> found
> > > > >>>>>>>> that using the second cache line for this purpose had a
> > > > >>>>> noticeable impact [1].
> > > > >>>>>>>>
> > > > >>>>>>>>  [1] http://dpdk.org/ml/archives/dev/2016-
> > > October/049237.html
> > > > >>>>>>>
> > > > >>>>>>> I agree with Oleg about the latency vs. throughput
> > > > >>>>>>> importance for
> > > > >>>>> such applications.
> > > > >>>>>>>
> > > > >>>>>>> If you need high resolution timestamps, consider them to
> > > > >>>>>>> be
> > > > >>>>> generated by the NIC RX driver, possibly by the hardware
> > > > >>>>> itself
> > > > >>>>> (http://w3new.napatech.com/features/time-precision/hardware-
> > > time
> > > > >>>>> - stamp), so the timestamp belongs in the first cache line.
> > > > >>>>> And I am proposing that it should have the highest possible
> > > > >>>>> accuracy, which makes the value hardware dependent.
> > > > >>>>>>>
> > > > >>>>>>> Furthermore, I am arguing that we leave it up to the
> > > > >>> application
> > > > >>>>>>> to
> > > > >>>>> keep track of the slowly moving bits (i.e. counting whole
> > > > >>>>> seconds, hours and calendar date) out of band, so we don't
> > > > >>>>> use precious
> > > > >>> space
> > > > >>>>> in the mbuf. The application doesn't need the NIC RX
> > > > >>>>> driver's fast path to capture which date (or even which
> > > > >>>>> second) a packet was received. Yes, it adds complexity to
> > > > >>>>> the application, but
> > > we
> > > > >>>>> can't set aside 64 bit for a generic timestamp. Or as a
> > > > >>>>> weird
> > > tradeoff:
> > > > >>>>> Put the fast moving 32 bit in the first cache line and the
> > > > >>>>> slow moving 32 bit in the second cache line, as a
> > > > >>>>> placeholder for
> > > the
> > > > >>> application to fill out if needed.
> > > > >>>>> Yes, it means that the application needs to check the time
> > > > >>>>> and update its variable holding the slow moving time once
> > > > >>>>> every second or so; but that should be doable without
> > > > >>>>> significant
> > > effort.
> > > > >>>>>>
> > > > >>>>>> That's a good point, however without a 64 bit value,
> > > > >>>>>> elapsed time between two arbitrary mbufs cannot be measured
> > > > >>>>>> reliably due to
> > > > >>> not
> > > > >>>>>> enough context, one way or another the low resolution value
> > > > >>>>>> is also
> > > > >>>>> needed.
> > > > >>>>>>
> > > > >>>>>> Obviously latency-sensitive applications are unlikely to
> > > > >>>>>> perform lengthy buffering and require this but I'm not sure
> > > > >>>>>> about all the possible use-cases. Considering many NICs
> > > > >>>>>> expose
> > > > >>>>>> 64 bit
> > > > >>> timestaps,
> > > > >>>>>> I suggest we do not truncate them.
> > > > >>>>>>
> > > > >>>>>> I'm not a fan of the weird tradeoff either, PMDs will be
> > > > >>>>>> tempted to fill the extra 32 bits whenever they can and
> > > > >>>>>> negate the performance improvement of the first cache line.
> > > > >>>>>
> > > > >>>>> I would tend to agree, and I don't really see any convenient
> > > way
> > > > >>>>> to avoid putting in a 64-bit field for the timestamp in
> > > > >>>>> cache-
> > > line 0.
> > > > >>>>> If we are ok with having this overlap/partially overlap with
> > > > >>>>> sequence number, it will use up an extra 4B of storage in
> > > > >>>>> that
> > > > >>> cacheline.
> > > > >>>>
> > > > >>>> I agree about the lack of convenience! And Adrien certainly
> > > > >>>> has
> > > a
> > > > >>> point about PMD temptations.
> > > > >>>>
> > > > >>>> However, I still don't think that a NICs ability to
> > > > >>>> date-stamp a
> > > > >>> packet is sufficient reason to put a date-stamp in cache line
> > > > >>> 0
> > > of
> > > > >>> the mbuf. Storing only the fast moving 32 bit in cache line 0
> > > > >>> seems like a good compromise to me.
> > > > >>>>
> > > > >>>> Maybe you can find just one more byte, so it can hold 17
> > > > >>>> minutes with nanosecond resolution. (I'm joking!)
> > > > >>>>
> > > > >>>> Please don't sacrifice the sequence number for the
> > > > >>>> seconds/hours/days
> > > > >>> part a timestamp. Maybe it could be configurable to use a 32
> > > > >>> bit or
> > > > >>> 64 bit timestamp.
> > > > >>>>
> > > > >>> Do you see both timestamp and sequence numbers being used
> > > together?
> > > > >>> I would have thought that apps would either use one or the
> other?
> > > > >>> However, your suggestion is workable in any case, to allow the
> > > > >>> sequence number to overlap just the high 32 bits of the
> > > timestamp,
> > > > >>> rather than the low.
> > > > >>
> > > > >> In our case, I can foresee sequence numbers used for packet
> > > > >> processing and
> > > > timestamps for timing analysis (and possibly for packet capturing,
> > > > when being used). For timing analysis, we don’t need long
> > > > durations, e.g. 4 seconds with 32 bit nanosecond resolution
> > > > suffices. And for packet capturing we are perfectly capable of
> > > > adding the slowly moving
> > > > 32 bit of the timestamp to our output data stream without fetching
> > > > it
> > > from the mbuf.
> > > > >>
> > > >
> > > > We should keep in mind that today we have the seqn field but it is
> > > not
> > > > used by any PMD. In case it is implemented, would it be a
> > > > per-queue
> > > sequence number?
> > > > Is it useful from an application view?
> > > >
> > > > This field is only used by the librte_reorder library, and in my
> > > > opinion, we should consider moving it in the second cache line
> > > > since
> > > it is not filled by the PMD.
> > > >
> > > >
> > > > > For the 32-bit timestamp case, it might be useful to have a
> > > > > right-shift value passed in to the ethdev driver. If we assume a
> > > NIC
> > > > > with nanosecond resolution, (or TSC value with resolution of
> > > > > that order of magnitude), then the app can choose to have 1 ns
> > > resolution
> > > > > with 4 second wraparound, or alternatively 4ns resolution with
> > > > > 16 second wraparound, or even microsecond resolution with wrap
> > > > > around of over
> > > > an hour.
> > > > > The cost is obviously just a shift op in the driver code per
> > > > > packet
> > > > > - hopefully with multiple packets done at a time using vector
> > > operations.
> > > >
> > > >
> > > > About the timestamp, we can manage to find 64 bits in the first
> > > > cache line, without sacrifying any field we have today. The
> > > > question is
> > > more
> > > > for the fields we may want to add later.
> > > >
> > > > To answer to the question of the size of the timestamp, the first
> > > > question is to know what is the precision required for the
> > > applications using it?
> > >
> > > As part of the 17.02 latency calculation feature, the requirement is
> > > to report latency in nanoseconds. So, +1 on keeping timestamp as
> 64bits.
> > > Since the feature is planned for 17.02, can we finalize on the
> > > timestamp position in the mbuf struct?
> > > Based on the decision, I am planning to make the change and send ABI
> > > notice if needed.
> > >
> > > Reshma
> > >
> > > >
> > > > I don't quite like the idea of splitting the timestamp in the 2
> > > > cache lines, I think it would not be easy to use.
> > > >
> > > >
> > > > Olivier
> >
> > Nanosecond precision in latency calculations does not require more than
> 32 bit, unless we must also be able to measure more than 4 seconds of
> latency. And please consider how many packets we would need to hold in
> memory to keep track of 4 seconds of traffic on a 100 Gbit/s link (if we
> are measuring network latency): With an average packet size of 300 B that
> would be 40 million packets.
> 
> Consider another, perhaps more realistic use-case: keeping a few relevant
> packets from a stream for later analysis, which could occur outside of the
> normal TX processing path for performance reasons. In this case 4 seconds
> worth of context may not be enough to avoid losing information.
> 
> > If I recall correctly, the 17.02 feature is about measuring application
> latency, not network latency (as my calculation above is about). I don't
> consider application latency measuring a common feature for most
> applications, except for application debugging/profiling purposes (not
> network debugging purposes). So do we really need both nanosecond
> precision and the ability to measure 4+ seconds of latency?
> 
> Existing HW can already provide 64-bit precision, why truncate part of it?
> The only valid reason would be that there is not enough room in the mbuf
> header, which is not the case. Looks like today there is even enough room
> in the first 64 bytes.
> 
> As suggested by Bruce, we can even union (part of) this field with another
> if necessary, as long as we make sure both cannot be requested at the same
> time (determining which field should be sacrificed is going to be tricky
> though).
> 
> > Furthermore, if the timestamp isn't set by hardware, why not just take
> the entire array of packets pulled out from the NIC and set their
> timestamp to the same value, instead of setting them one by one in the PMD
> - that would also consolidate the expense of obtaining the time from some
> clock source.
> 
> Like any offload, some PMD processing is expected to convert the possibly
> HW-specific value to a common mbuf format, however if hardware does not
> report any kind of information relevant to timestamp individual packets, I
> do not think the PMD should expose that feature and implement it in
> software. It would be much worse, certainly not what the application
> wanted as it could have done the same on its own.
> 
> What matters is to know when NIC receives a packet, not when the
> application fetches it. You should assume there is a significant delay
> between these two events, hence the need for it to be offloaded.
> 

I would actually see both use cases being used. In some cases, NICs will
have hardware offload capability for timestamping which can be used, while
in many cases, timestamping on retrieval by the core will suffice for
latency measurements within an app. I don't think we can discount either
possibility. Thankfully, AFAIK which model is used doesn't really affect
whether we need 32 or 64 bits for the data.

/Bruce

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-28 17:00                             ` Richardson, Bruce
@ 2016-10-28 20:27                               ` Morten Brørup
  2016-10-31 10:55                               ` Morten Brørup
  1 sibling, 0 replies; 46+ messages in thread
From: Morten Brørup @ 2016-10-28 20:27 UTC (permalink / raw)
  To: Richardson, Bruce, Adrien Mazarguil; +Cc: dev

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Richardson, Bruce
> Sent: Friday, October 28, 2016 7:01 PM
> To: Adrien Mazarguil; Morten Brørup
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] mbuf changes
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Adrien Mazarguil
> > Sent: Friday, October 28, 2016 5:50 PM
> > To: Morten Brørup <mb@smartsharesystems.com>
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] mbuf changes
> >
> > On Fri, Oct 28, 2016 at 04:11:45PM +0200, Morten Brørup wrote:
> > > Comments at the end.
> > >
> > > > -----Original Message-----
> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Pattan,
> > > > Reshma
> > > > Sent: Friday, October 28, 2016 3:35 PM
> > > > To: Olivier Matz
> > > > Cc: dev@dpdk.org; Morten Brørup
> > > > Subject: Re: [dpdk-dev] mbuf changes
> > > >
> > > > Hi Olivier,
> > > >
> > > > > -----Original Message-----
> > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Olivier
> > > > > Matz
> > > > > Sent: Tuesday, October 25, 2016 1:49 PM
> > > > > To: Richardson, Bruce <bruce.richardson@intel.com>; Morten
> > > > > Brørup <mb@smartsharesystems.com>
> > > > > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Wiles, Keith
> > > > > <keith.wiles@intel.com>; dev@dpdk.org; Oleg Kuporosov
> > > > > <olegk@mellanox.com>
> > > > > Subject: Re: [dpdk-dev] mbuf changes
> > > > >
> > > > >
> > > > >
> > > > > On 10/25/2016 02:45 PM, Bruce Richardson wrote:
> > > > > > On Tue, Oct 25, 2016 at 02:33:55PM +0200, Morten Brørup wrote:
> > > > > >> Comments at the end.
> > > > > >>
> > > > > >> Med venlig hilsen / kind regards
> > > > > >> - Morten Brørup
> > > > > >>
> > > > > >>> -----Original Message-----
> > > > > >>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > > >>> Sent: Tuesday, October 25, 2016 2:20 PM
> > > > > >>> To: Morten Brørup
> > > > > >>> Cc: Adrien Mazarguil; Wiles, Keith; dev@dpdk.org; Olivier
> > > > > >>> Matz; Oleg Kuporosov
> > > > > >>> Subject: Re: [dpdk-dev] mbuf changes
> > > > > >>>
> > > > > >>> On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Brørup wrote:
> > > > > >>>> Comments inline.
> > > > > >>>>
> > > > > >>>>> -----Original Message-----
> > > > > >>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
> > > > > >>>>> Richardson
> > > > > >>>>> Sent: Tuesday, October 25, 2016 1:14 PM
> > > > > >>>>> To: Adrien Mazarguil
> > > > > >>>>> Cc: Morten Brørup; Wiles, Keith; dev@dpdk.org; Olivier
> > > > > >>>>> Matz; Oleg Kuporosov
> > > > > >>>>> Subject: Re: [dpdk-dev] mbuf changes
> > > > > >>>>>
> > > > > >>>>> On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil
> > > > wrote:
> > > > > >>>>>> On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup
> > wrote:
> > > > > >>>>>>> Comments inline.
> > > > > >>>>>>>
> > > > > >>>>>>> Med venlig hilsen / kind regards
> > > > > >>>>>>> - Morten Brørup
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>> -----Original Message-----
> > > > > >>>>>>>> From: Adrien Mazarguil
> > > > > >>>>>>>> [mailto:adrien.mazarguil@6wind.com]
> > > > > >>>>>>>> Sent: Tuesday, October 25, 2016 11:39 AM
> > > > > >>>>>>>> To: Bruce Richardson
> > > > > >>>>>>>> Cc: Wiles, Keith; Morten Brørup; dev@dpdk.org; Olivier
> > > > > >>>>>>>> Matz; Oleg Kuporosov
> > > > > >>>>>>>> Subject: Re: [dpdk-dev] mbuf changes
> > > > > >>>>>>>>
> > > > > >>>>>>>> On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce
> > > > > >>>>>>>> Richardson
> > > > > >>> wrote:
> > > > > >>>>>>>>> On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith
> > > > > >>> wrote:
> > > > > >>>>>>>> [...]
> > > > > >>>>>>>>>>> On Oct 24, 2016, at 10:49 AM, Morten Brørup
> > > > > >>>>>>>> <mb@smartsharesystems.com> wrote:
> > > > > >>>>>>>> [...]
> > > > > >>>>
> > > > > >>>>>>>>> One other point I'll mention is that we need to have a
> > > > > >>>>>>>>> discussion on how/where to add in a timestamp value
> > > > > >>>>>>>>> into
> > > > > >>> the
> > > > > >>>>>>>>> mbuf. Personally, I think it can be in a union with
> > > > > >>>>>>>>> the
> > > > > >>>>> sequence
> > > > > >>>>>>>>> number value, but I also suspect that 32-bits of a
> > > > > >>> timestamp
> > > > > >>>>>>>>> is not going to be enough for
> > > > > >>>>>>>> many.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Thoughts?
> > > > > >>>>>>>>
> > > > > >>>>>>>> If we consider that timestamp representation should use
> > > > > >>>>> nanosecond
> > > > > >>>>>>>> granularity, a 32-bit value may likely wrap around too
> > > > > >>> quickly
> > > > > >>>>>>>> to be useful. We can also assume that applications
> > > > requesting
> > > > > >>>>>>>> timestamps may care more about latency than throughput,
> > > > > >>>>>>>> Oleg
> > > > > >>>>> found
> > > > > >>>>>>>> that using the second cache line for this purpose had a
> > > > > >>>>> noticeable impact [1].
> > > > > >>>>>>>>
> > > > > >>>>>>>>  [1] http://dpdk.org/ml/archives/dev/2016-
> > > > October/049237.html
> > > > > >>>>>>>
> > > > > >>>>>>> I agree with Oleg about the latency vs. throughput
> > > > > >>>>>>> importance for
> > > > > >>>>> such applications.
> > > > > >>>>>>>
> > > > > >>>>>>> If you need high resolution timestamps, consider them to
> > > > > >>>>>>> be
> > > > > >>>>> generated by the NIC RX driver, possibly by the hardware
> > > > > >>>>> itself
> > > > > >>>>> (http://w3new.napatech.com/features/time-precision/hardwar
> > > > > >>>>> e-
> > > > time
> > > > > >>>>> - stamp), so the timestamp belongs in the first cache line.
> > > > > >>>>> And I am proposing that it should have the highest
> > > > > >>>>> possible accuracy, which makes the value hardware dependent.
> > > > > >>>>>>>
> > > > > >>>>>>> Furthermore, I am arguing that we leave it up to the
> > > > > >>> application
> > > > > >>>>>>> to
> > > > > >>>>> keep track of the slowly moving bits (i.e. counting whole
> > > > > >>>>> seconds, hours and calendar date) out of band, so we don't
> > > > > >>>>> use precious
> > > > > >>> space
> > > > > >>>>> in the mbuf. The application doesn't need the NIC RX
> > > > > >>>>> driver's fast path to capture which date (or even which
> > > > > >>>>> second) a packet was received. Yes, it adds complexity to
> > > > > >>>>> the application, but
> > > > we
> > > > > >>>>> can't set aside 64 bit for a generic timestamp. Or as a
> > > > > >>>>> weird
> > > > tradeoff:
> > > > > >>>>> Put the fast moving 32 bit in the first cache line and the
> > > > > >>>>> slow moving 32 bit in the second cache line, as a
> > > > > >>>>> placeholder for
> > > > the
> > > > > >>> application to fill out if needed.
> > > > > >>>>> Yes, it means that the application needs to check the time
> > > > > >>>>> and update its variable holding the slow moving time once
> > > > > >>>>> every second or so; but that should be doable without
> > > > > >>>>> significant
> > > > effort.
> > > > > >>>>>>
> > > > > >>>>>> That's a good point, however without a 64 bit value,
> > > > > >>>>>> elapsed time between two arbitrary mbufs cannot be
> > > > > >>>>>> measured reliably due to
> > > > > >>> not
> > > > > >>>>>> enough context, one way or another the low resolution
> > > > > >>>>>> value is also
> > > > > >>>>> needed.
> > > > > >>>>>>
> > > > > >>>>>> Obviously latency-sensitive applications are unlikely to
> > > > > >>>>>> perform lengthy buffering and require this but I'm not
> > > > > >>>>>> sure about all the possible use-cases. Considering many
> > > > > >>>>>> NICs expose
> > > > > >>>>>> 64 bit
> > > > > >>> timestaps,
> > > > > >>>>>> I suggest we do not truncate them.
> > > > > >>>>>>
> > > > > >>>>>> I'm not a fan of the weird tradeoff either, PMDs will be
> > > > > >>>>>> tempted to fill the extra 32 bits whenever they can and
> > > > > >>>>>> negate the performance improvement of the first cache line.
> > > > > >>>>>
> > > > > >>>>> I would tend to agree, and I don't really see any
> > > > > >>>>> convenient
> > > > way
> > > > > >>>>> to avoid putting in a 64-bit field for the timestamp in
> > > > > >>>>> cache-
> > > > line 0.
> > > > > >>>>> If we are ok with having this overlap/partially overlap
> > > > > >>>>> with sequence number, it will use up an extra 4B of
> > > > > >>>>> storage in that
> > > > > >>> cacheline.
> > > > > >>>>
> > > > > >>>> I agree about the lack of convenience! And Adrien certainly
> > > > > >>>> has
> > > > a
> > > > > >>> point about PMD temptations.
> > > > > >>>>
> > > > > >>>> However, I still don't think that a NICs ability to
> > > > > >>>> date-stamp a
> > > > > >>> packet is sufficient reason to put a date-stamp in cache
> > > > > >>> line
> > > > > >>> 0
> > > > of
> > > > > >>> the mbuf. Storing only the fast moving 32 bit in cache line
> > > > > >>> 0 seems like a good compromise to me.
> > > > > >>>>
> > > > > >>>> Maybe you can find just one more byte, so it can hold 17
> > > > > >>>> minutes with nanosecond resolution. (I'm joking!)
> > > > > >>>>
> > > > > >>>> Please don't sacrifice the sequence number for the
> > > > > >>>> seconds/hours/days
> > > > > >>> part a timestamp. Maybe it could be configurable to use a 32
> > > > > >>> bit or
> > > > > >>> 64 bit timestamp.
> > > > > >>>>
> > > > > >>> Do you see both timestamp and sequence numbers being used
> > > > together?
> > > > > >>> I would have thought that apps would either use one or the
> > other?
> > > > > >>> However, your suggestion is workable in any case, to allow
> > > > > >>> the sequence number to overlap just the high 32 bits of the
> > > > timestamp,
> > > > > >>> rather than the low.
> > > > > >>
> > > > > >> In our case, I can foresee sequence numbers used for packet
> > > > > >> processing and
> > > > > timestamps for timing analysis (and possibly for packet
> > > > > capturing, when being used). For timing analysis, we don’t need
> > > > > long durations, e.g. 4 seconds with 32 bit nanosecond resolution
> > > > > suffices. And for packet capturing we are perfectly capable of
> > > > > adding the slowly moving
> > > > > 32 bit of the timestamp to our output data stream without
> > > > > fetching it
> > > > from the mbuf.
> > > > > >>
> > > > >
> > > > > We should keep in mind that today we have the seqn field but it
> > > > > is
> > > > not
> > > > > used by any PMD. In case it is implemented, would it be a
> > > > > per-queue
> > > > sequence number?
> > > > > Is it useful from an application view?
> > > > >
> > > > > This field is only used by the librte_reorder library, and in my
> > > > > opinion, we should consider moving it in the second cache line
> > > > > since
> > > > it is not filled by the PMD.
> > > > >
> > > > >
> > > > > > For the 32-bit timestamp case, it might be useful to have a
> > > > > > right-shift value passed in to the ethdev driver. If we assume
> > > > > > a
> > > > NIC
> > > > > > with nanosecond resolution, (or TSC value with resolution of
> > > > > > that order of magnitude), then the app can choose to have 1 ns
> > > > resolution
> > > > > > with 4 second wraparound, or alternatively 4ns resolution with
> > > > > > 16 second wraparound, or even microsecond resolution with wrap
> > > > > > around of over
> > > > > an hour.
> > > > > > The cost is obviously just a shift op in the driver code per
> > > > > > packet
> > > > > > - hopefully with multiple packets done at a time using vector
> > > > operations.
> > > > >
> > > > >
> > > > > About the timestamp, we can manage to find 64 bits in the first
> > > > > cache line, without sacrifying any field we have today. The
> > > > > question is
> > > > more
> > > > > for the fields we may want to add later.
> > > > >
> > > > > To answer to the question of the size of the timestamp, the
> > > > > first question is to know what is the precision required for the
> > > > applications using it?
> > > >
> > > > As part of the 17.02 latency calculation feature, the requirement
> > > > is to report latency in nanoseconds. So, +1 on keeping timestamp
> > > > as
> > 64bits.
> > > > Since the feature is planned for 17.02, can we finalize on the
> > > > timestamp position in the mbuf struct?
> > > > Based on the decision, I am planning to make the change and send
> > > > ABI notice if needed.
> > > >
> > > > Reshma
> > > >
> > > > >
> > > > > I don't quite like the idea of splitting the timestamp in the 2
> > > > > cache lines, I think it would not be easy to use.
> > > > >
> > > > >
> > > > > Olivier
> > >
> > > Nanosecond precision in latency calculations does not require more
> > > than
> > 32 bit, unless we must also be able to measure more than 4 seconds of
> > latency. And please consider how many packets we would need to hold in
> > memory to keep track of 4 seconds of traffic on a 100 Gbit/s link (if
> > we are measuring network latency): With an average packet size of 300
> > B that would be 40 million packets.
> >
> > Consider another, perhaps more realistic use-case: keeping a few
> > relevant packets from a stream for later analysis, which could occur
> > outside of the normal TX processing path for performance reasons. In
> > this case 4 seconds worth of context may not be enough to avoid losing
> information.

Agreed. And I would say that this exception path can add the higher 32 bit itself; it is not required by the PMD or even in the mbuf.

> >
> > > If I recall correctly, the 17.02 feature is about measuring
> > > application
> > latency, not network latency (as my calculation above is about). I
> > don't consider application latency measuring a common feature for most
> > applications, except for application debugging/profiling purposes (not
> > network debugging purposes). So do we really need both nanosecond
> > precision and the ability to measure 4+ seconds of latency?
> >
> > Existing HW can already provide 64-bit precision, why truncate part of
> it?
> > The only valid reason would be that there is not enough room in the
> > mbuf header, which is not the case. Looks like today there is even
> > enough room in the first 64 bytes.
> >
> > As suggested by Bruce, we can even union (part of) this field with
> > another if necessary, as long as we make sure both cannot be requested
> > at the same time (determining which field should be sacrificed is
> > going to be tricky though).

It would make some applications easier to develop if the mbuf contained a 64 bit timestamp; but I don't think these are very common applications. It's a tradeoff, and I'm advocating for conserving the very precious space in the mbuf, at the cost of requiring some extra work in these uncommon applications.

> >
> > > Furthermore, if the timestamp isn't set by hardware, why not just
> > > take
> > the entire array of packets pulled out from the NIC and set their
> > timestamp to the same value, instead of setting them one by one in the
> > PMD
> > - that would also consolidate the expense of obtaining the time from
> > some clock source.
> >
> > Like any offload, some PMD processing is expected to convert the
> > possibly HW-specific value to a common mbuf format, however if
> > hardware does not report any kind of information relevant to timestamp
> > individual packets, I do not think the PMD should expose that feature
> > and implement it in software. It would be much worse, certainly not
> > what the application wanted as it could have done the same on its own.
Agree. And I would go so far as to say that this should be a guiding principle for DPDK. If the NIC HW doesn't support something, the PMD shouldn't try to emulate it in software; instead, DPDK could provide various libraries to offer these features in software, like the packet type library mentioned by Olivier at the recent DPDK Userspace conference.

> >
> > What matters is to know when NIC receives a packet, not when the
> > application fetches it. You should assume there is a significant delay
> > between these two events, hence the need for it to be offloaded.
Agree.

> >
> 
> I would actually see both use cases being used. In some cases, NICs will
> have hardware offload capability for timestamping which can be used, while
> in many cases, timestamping on retrieval by the core will suffice for
> latency measurements within an app. I don't think we can discount either
> possibility. Thankfully, AFAIK which model is used doesn't really affect
> whether we need 32 or 64 bits for the data.
Agreed about 32/64 bit choice not being affected by this.

But take note of Adrien's comment about the potential delay from the packet arrives on the NIC until the application gets round to calling the PMD to fetch it. Software based timestamps may give a false impression about the application latency. Assume that the application polls the PMD and bulk fetches all the packets, then proceeds to processing them, and then starts over. If a packet arrives just after the first batch was fetched, a software timestamp in the PMD will set the time when the application starts fetching the next batch of packets, ignoring all the time the packet spent waiting for being fetched while the application processed the first batch of packets.

> 
> /Bruce

-Morten

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-28 17:00                             ` Richardson, Bruce
  2016-10-28 20:27                               ` Morten Brørup
@ 2016-10-31 10:55                               ` Morten Brørup
  1 sibling, 0 replies; 46+ messages in thread
From: Morten Brørup @ 2016-10-31 10:55 UTC (permalink / raw)
  To: Richardson, Bruce, Adrien Mazarguil, Pattan, Reshma; +Cc: dev

About the mbuf timestamp (regardless of size etc.) accuracy:

We should also consider the clock source. If the NIC HW is the clock source, there will be a fixed offset from the CPU's clock, but there will also be varying drift (they have separate XTALs). And if there are multiple NICs, there will be offset and varying drift between their clocks too.

If I remember correctly, the IEEE Ethernet specifications require a clock accuracy of 100 ppm. So XTALs of this quality are probably being used in the hardware. This means that the clock drift between two NICs (or a NIC and the actual time) can be in the magnitude of 100.000 ns per second.

BTW: If you are interested in NTP, you should take a look at Poul-Henning Kamp's ntimed research: http://phk.freebsd.dk/time/index.html


Med venlig hilsen / kind regards
- Morten Brørup

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-25 12:48                     ` Olivier Matz
                                         ` (2 preceding siblings ...)
  2016-10-28 13:34                       ` Pattan, Reshma
@ 2016-10-31 16:05                       ` Morten Brørup
  3 siblings, 0 replies; 46+ messages in thread
From: Morten Brørup @ 2016-10-31 16:05 UTC (permalink / raw)
  To: Olivier Matz, Bruce Richardson; +Cc: dev

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Olivier Matz
> Sent: Tuesday, October 25, 2016 2:49 PM
> 
> We should keep in mind that today we have the seqn field but it is not
> used by any PMD. In case it is implemented, would it be a per-queue
> sequence number? Is it useful from an application view?
> 
> This field is only used by the librte_reorder library, and in my
> opinion, we should consider moving it in the second cache line since it
> is not filled by the PMD.

+1 because it is not filled by the NIC HW or PMD. If our argument is "better performance", I would prefer having the generic 64 bit m->userdata in cacheline0 than m->seqn.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: mbuf changes
  2016-10-26  8:28         ` Alejandro Lucero
  2016-10-26  9:01           ` Morten Brørup
@ 2016-11-09 11:42           ` Alejandro Lucero
  2016-11-10  9:19             ` Fwd: " Alejandro Lucero
  1 sibling, 1 reply; 46+ messages in thread
From: Alejandro Lucero @ 2016-11-09 11:42 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: Shreyansh Jain, Wiles, Keith, Morten Brørup, dev, Olivier Matz

On Wed, Oct 26, 2016 at 10:28 AM, Alejandro Lucero <
alejandro.lucero@netronome.com> wrote:

>
>
> On Tue, Oct 25, 2016 at 2:05 PM, Bruce Richardson <
> bruce.richardson@intel.com> wrote:
>
>> On Tue, Oct 25, 2016 at 05:24:28PM +0530, Shreyansh Jain wrote:
>> > On Monday 24 October 2016 09:55 PM, Bruce Richardson wrote:
>> > > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
>> > > >
>> > > > > On Oct 24, 2016, at 10:49 AM, Morten Brørup <
>> mb@smartsharesystems.com> wrote:
>> > > > >
>> > > > > First of all: Thanks for a great DPDK Userspace 2016!
>> > > > >
>> > > > >
>> > > > >
>> > > > > Continuing the Userspace discussion about Olivier Matz’s proposed
>> mbuf changes...
>> > >
>> > > Thanks for keeping the discussion going!
>> > > > >
>> > > > >
>> > > > >
>> > > > > 1.
>> > > > >
>> > > > > Stephen Hemminger had a noteworthy general comment about keeping
>> metadata for the NIC in the appropriate section of the mbuf: Metadata
>> generated by the NIC’s RX handler belongs in the first cache line, and
>> metadata required by the NIC’s TX handler belongs in the second cache line.
>> This also means that touching the second cache line on ingress should be
>> avoided if possible; and Bruce Richardson mentioned that for this reason
>> m->next was zeroed on free().
>> > > > >
>> > > Thinking about it, I suspect there are more fields we can reset on
>> free
>> > > to save time on alloc. Refcnt, as discussed below is one of them, but
>> so
>> > > too could be the nb_segs field and possibly others.
>> > >
>> > > > >
>> > > > >
>> > > > > 2.
>> > > > >
>> > > > > There seemed to be consensus that the size of m->refcnt should
>> match the size of m->port because a packet could be duplicated on all
>> physical ports for L3 multicast and L2 flooding.
>> > > > >
>> > > > > Furthermore, although a single physical machine (i.e. a single
>> server) with 255 physical ports probably doesn’t exist, it might contain
>> more than 255 virtual machines with a virtual port each, so it makes sense
>> extending these mbuf fields from 8 to 16 bits.
>> > > >
>> > > > I thought we also talked about removing the m->port from the mbuf
>> as it is not really needed.
>> > > >
>> > > Yes, this was mentioned, and also the option of moving the port value
>> to
>> > > the second cacheline, but it appears that NXP are using the port value
>> > > in their NIC drivers for passing in metadata, so we'd need their
>> > > agreement on any move (or removal).
>> >
>> > I am not sure where NXP's NIC came into picture on this, but now that
>> it is
>> > highlighted, this field is required for libevent implementation [1].
>> >
>> > A scheduler sending an event, which can be a packet, would only have
>> > information of a flow_id. From this matching it back to a port, without
>> > mbuf->port, would be very difficult (costly). There may be way around
>> this
>> > but at least in current proposal I think port would be important to
>> have -
>> > even if in second cache line.
>> >
>> > But, off the top of my head, as of now it is not being used for any
>> specific
>> > purpose in NXP's PMD implementation.
>> >
>> > Even the SoC patches don't necessarily rely on it except using it
>> because it
>> > is available.
>> >
>> > @Bruce: where did you get the NXP context here from?
>> >
>> Oh, I'm just mis-remembering. :-( It was someone else who was looking for
>> this - Netronome, perhaps?
>>
>> CC'ing Alejandro in the hope I'm remembering correctly second time
>> round!
>>
>>
> Yes. Thanks Bruce!
>
> So Netronome uses the port field and, as I commented on the user meeting,
> we are happy with the field going from 8 to 16 bits.
>
> In our case, this is something some clients have demanded, and if I'm not
> wrong (I'll double check this asap), the port value is for knowing where
> the packet is coming from. Think about a switch in the NIC, with ports
> linked to VFs/VMs, and one or more physical ports. That port value is not
> related to DPDK ports but to the switch ports. Code in the host (DPDK or
> not) can receive packets from the wire or from VFs through the NIC. This is
> also true for packets received by VMs, but I guess the port value is just
> interested for host code.
>
>
>

I consulted this functionality internally and it seems we do not need this
anymore. In fact, I will remove the metadata port handling soon from our
PMD.



> /Bruce
>>
>
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Fwd: mbuf changes
  2016-11-09 11:42           ` Alejandro Lucero
@ 2016-11-10  9:19             ` Alejandro Lucero
  0 siblings, 0 replies; 46+ messages in thread
From: Alejandro Lucero @ 2016-11-10  9:19 UTC (permalink / raw)
  To: dev

I forgot to include dev@dpdk.org in my response.

My comment at the end o this email.

On Wed, Oct 26, 2016 at 10:28 AM, Alejandro Lucero <
alejandro.lucero@netronome.com> wrote:

>
>
> On Tue, Oct 25, 2016 at 2:05 PM, Bruce Richardson <
> bruce.richardson@intel.com> wrote:
>
>> On Tue, Oct 25, 2016 at 05:24:28PM +0530, Shreyansh Jain wrote:
>> > On Monday 24 October 2016 09:55 PM, Bruce Richardson wrote:
>> > > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
>> > > >
>> > > > > On Oct 24, 2016, at 10:49 AM, Morten Brørup <
>> mb@smartsharesystems.com> wrote:
>> > > > >
>> > > > > First of all: Thanks for a great DPDK Userspace 2016!
>> > > > >
>> > > > >
>> > > > >
>> > > > > Continuing the Userspace discussion about Olivier Matz’s proposed
>> mbuf changes...
>> > >
>> > > Thanks for keeping the discussion going!
>> > > > >
>> > > > >
>> > > > >
>> > > > > 1.
>> > > > >
>> > > > > Stephen Hemminger had a noteworthy general comment about keeping
>> metadata for the NIC in the appropriate section of the mbuf: Metadata
>> generated by the NIC’s RX handler belongs in the first cache line, and
>> metadata required by the NIC’s TX handler belongs in the second cache line.
>> This also means that touching the second cache line on ingress should be
>> avoided if possible; and Bruce Richardson mentioned that for this reason
>> m->next was zeroed on free().
>> > > > >
>> > > Thinking about it, I suspect there are more fields we can reset on
>> free
>> > > to save time on alloc. Refcnt, as discussed below is one of them, but
>> so
>> > > too could be the nb_segs field and possibly others.
>> > >
>> > > > >
>> > > > >
>> > > > > 2.
>> > > > >
>> > > > > There seemed to be consensus that the size of m->refcnt should
>> match the size of m->port because a packet could be duplicated on all
>> physical ports for L3 multicast and L2 flooding.
>> > > > >
>> > > > > Furthermore, although a single physical machine (i.e. a single
>> server) with 255 physical ports probably doesn’t exist, it might contain
>> more than 255 virtual machines with a virtual port each, so it makes sense
>> extending these mbuf fields from 8 to 16 bits.
>> > > >
>> > > > I thought we also talked about removing the m->port from the mbuf
>> as it is not really needed.
>> > > >
>> > > Yes, this was mentioned, and also the option of moving the port value
>> to
>> > > the second cacheline, but it appears that NXP are using the port value
>> > > in their NIC drivers for passing in metadata, so we'd need their
>> > > agreement on any move (or removal).
>> >
>> > I am not sure where NXP's NIC came into picture on this, but now that
>> it is
>> > highlighted, this field is required for libevent implementation [1].
>> >
>> > A scheduler sending an event, which can be a packet, would only have
>> > information of a flow_id. From this matching it back to a port, without
>> > mbuf->port, would be very difficult (costly). There may be way around
>> this
>> > but at least in current proposal I think port would be important to
>> have -
>> > even if in second cache line.
>> >
>> > But, off the top of my head, as of now it is not being used for any
>> specific
>> > purpose in NXP's PMD implementation.
>> >
>> > Even the SoC patches don't necessarily rely on it except using it
>> because it
>> > is available.
>> >
>> > @Bruce: where did you get the NXP context here from?
>> >
>> Oh, I'm just mis-remembering. :-( It was someone else who was looking for
>> this - Netronome, perhaps?
>>
>> CC'ing Alejandro in the hope I'm remembering correctly second time
>> round!
>>
>>
> Yes. Thanks Bruce!
>
> So Netronome uses the port field and, as I commented on the user meeting,
> we are happy with the field going from 8 to 16 bits.
>
> In our case, this is something some clients have demanded, and if I'm not
> wrong (I'll double check this asap), the port value is for knowing where
> the packet is coming from. Think about a switch in the NIC, with ports
> linked to VFs/VMs, and one or more physical ports. That port value is not
> related to DPDK ports but to the switch ports. Code in the host (DPDK or
> not) can receive packets from the wire or from VFs through the NIC. This is
> also true for packets received by VMs, but I guess the port value is just
> interested for host code.
>
>
>

I consulted this functionality internally and it seems we do not need this
anymore. In fact, I will remove the metadata port handling soon from our
PMD.



> /Bruce
>>
>
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2016-11-10  9:19 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-24 15:49 mbuf changes Morten Brørup
2016-10-24 16:11 ` Wiles, Keith
2016-10-24 16:25   ` Bruce Richardson
2016-10-24 21:47     ` Morten Brørup
2016-10-25  8:53       ` Bruce Richardson
2016-10-25 10:02         ` Ananyev, Konstantin
2016-10-25 10:22           ` Morten Brørup
2016-10-25 13:00             ` Olivier Matz
2016-10-25 13:04               ` Ramia, Kannan Babu
2016-10-25 13:24                 ` Thomas Monjalon
2016-10-25 14:32                   ` Ramia, Kannan Babu
2016-10-25 14:54                     ` Thomas Monjalon
2016-10-25 13:15               ` Bruce Richardson
2016-10-25 13:20               ` Thomas Monjalon
2016-10-25  9:39     ` Adrien Mazarguil
2016-10-25 10:11       ` Morten Brørup
2016-10-25 11:04         ` Adrien Mazarguil
2016-10-25 11:13           ` Bruce Richardson
2016-10-25 12:16             ` Morten Brørup
2016-10-25 12:20               ` Bruce Richardson
2016-10-25 12:33                 ` Morten Brørup
2016-10-25 12:45                   ` Bruce Richardson
2016-10-25 12:48                     ` Olivier Matz
2016-10-25 13:13                       ` Morten Brørup
2016-10-25 13:38                       ` Ananyev, Konstantin
2016-10-25 14:25                         ` Morten Brørup
2016-10-25 14:45                           ` Olivier Matz
2016-10-28 13:34                       ` Pattan, Reshma
2016-10-28 14:11                         ` Morten Brørup
2016-10-28 15:25                           ` Pattan, Reshma
2016-10-28 16:50                           ` Adrien Mazarguil
2016-10-28 17:00                             ` Richardson, Bruce
2016-10-28 20:27                               ` Morten Brørup
2016-10-31 10:55                               ` Morten Brørup
2016-10-31 16:05                       ` Morten Brørup
2016-10-25 13:48               ` Adrien Mazarguil
2016-10-25 13:58                 ` Ananyev, Konstantin
2016-10-25 11:54     ` Shreyansh Jain
2016-10-25 12:05       ` Bruce Richardson
2016-10-26  8:28         ` Alejandro Lucero
2016-10-26  9:01           ` Morten Brørup
2016-11-09 11:42           ` Alejandro Lucero
2016-11-10  9:19             ` Fwd: " Alejandro Lucero
2016-10-25 12:49       ` Morten Brørup
2016-10-25 13:14 ` Olivier Matz
2016-10-25 13:18   ` Morten Brørup

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.