From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ophir Munk Subject: Re: [PATCH v2 2/7] net/mlx4: inline more Tx functions Date: Wed, 25 Oct 2017 21:42:46 +0000 Message-ID: References: <1508752838-30408-1-git-send-email-ophirmu@mellanox.com> <1508768520-4810-1-git-send-email-ophirmu@mellanox.com> <1508768520-4810-3-git-send-email-ophirmu@mellanox.com> <20171025164938.GH26782@6wind.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Cc: "dev@dpdk.org" , Thomas Monjalon , "Olga Shern" , Matan Azrad To: Adrien Mazarguil Return-path: Received: from EUR02-AM5-obe.outbound.protection.outlook.com (mail-eopbgr00084.outbound.protection.outlook.com [40.107.0.84]) by dpdk.org (Postfix) with ESMTP id 702E51BA30 for ; Wed, 25 Oct 2017 23:42:49 +0200 (CEST) In-Reply-To: <20171025164938.GH26782@6wind.com> Content-Language: en-US List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Hi Adrien, On Wednesday, October 25, 2017 7:50 PM, Adrien Mazarguil wrote: >=20 > Hi Ophir, >=20 > On Mon, Oct 23, 2017 at 02:21:55PM +0000, Ophir Munk wrote: > > Change functions to inline on Tx fast path to improve performance > > > > Inside the inline function call other functions to handle "unlikely" > > cases such that the inline function code footprint is small. > > > > Signed-off-by: Ophir Munk >=20 > Reading this, it's like adding __rte_always_inline improves performance a= t > all, which I doubt unless you can show proof through performance results. >=20 > When in doubt, leave it to the compiler, the static keyword is usually en= ough > of a hint. Too much forced inlining may actually be harmful. >=20 > What this patch really does is splitting the heavy lookup/registration fu= nction > in two halves with one small static inline function for the lookup part t= hat > calls the separate registration part in the unlikely event MR is not alre= ady > registered. >=20 > Thankfully the compiler doesn't inline the large registration function ba= ck, > which results in the perceived performance improvement for the time being= , > however there is no guarantee it won't happen in the future (you didn't u= se > the noinline keyword on the registration function for that). >=20 > Therefore I have a bunch of comments and suggestions, see below. >=20 > > --- > > drivers/net/mlx4/mlx4_rxtx.c | 43 > > ++++++------------------------------ > > drivers/net/mlx4/mlx4_rxtx.h | 52 > > +++++++++++++++++++++++++++++++++++++++++++- > > 2 files changed, 58 insertions(+), 37 deletions(-) > > > > diff --git a/drivers/net/mlx4/mlx4_rxtx.c > > b/drivers/net/mlx4/mlx4_rxtx.c index 011ea79..ae37f9b 100644 > > --- a/drivers/net/mlx4/mlx4_rxtx.c > > +++ b/drivers/net/mlx4/mlx4_rxtx.c > > @@ -220,54 +220,25 @@ mlx4_txq_complete(struct txq *txq) > > return 0; > > } > > > > -/** > > - * Get memory pool (MP) from mbuf. If mbuf is indirect, the pool from > > which > > - * the cloned mbuf is allocated is returned instead. > > - * > > - * @param buf > > - * Pointer to mbuf. > > - * > > - * @return > > - * Memory pool where data is located for given mbuf. > > - */ > > -static struct rte_mempool * > > -mlx4_txq_mb2mp(struct rte_mbuf *buf) > > -{ > > - if (unlikely(RTE_MBUF_INDIRECT(buf))) > > - return rte_mbuf_from_indirect(buf)->pool; > > - return buf->pool; > > -} > > > > /** > > - * Get memory region (MR) <-> memory pool (MP) association from txq- > >mp2mr[]. > > - * Add MP to txq->mp2mr[] if it's not registered yet. If mp2mr[] is > > full, > > - * remove an entry first. > > + * Add memory region (MR) <-> memory pool (MP) association to txq- > >mp2mr[]. > > + * If mp2mr[] is full, remove an entry first. > > * > > * @param txq > > * Pointer to Tx queue structure. > > * @param[in] mp > > - * Memory pool for which a memory region lkey must be returned. > > + * Memory pool for which a memory region lkey must be added > > + * @param[in] i > > + * Index in memory pool (MP) where to add memory region (MR) > > * > > * @return > > - * mr->lkey on success, (uint32_t)-1 on failure. > > + * Added mr->lkey on success, (uint32_t)-1 on failure. > > */ > > -uint32_t > > -mlx4_txq_mp2mr(struct txq *txq, struct rte_mempool *mp) > > +uint32_t mlx4_txq_add_mr(struct txq *txq, struct rte_mempool *mp, > > +uint32_t i) > > { > > - unsigned int i; > > struct ibv_mr *mr; > > > > - for (i =3D 0; (i !=3D RTE_DIM(txq->mp2mr)); ++i) { > > - if (unlikely(txq->mp2mr[i].mp =3D=3D NULL)) { > > - /* Unknown MP, add a new MR for it. */ > > - break; > > - } > > - if (txq->mp2mr[i].mp =3D=3D mp) { > > - assert(txq->mp2mr[i].lkey !=3D (uint32_t)-1); > > - assert(txq->mp2mr[i].mr->lkey =3D=3D txq- > >mp2mr[i].lkey); > > - return txq->mp2mr[i].lkey; > > - } > > - } > > /* Add a new entry, register MR first. */ > > DEBUG("%p: discovered new memory pool \"%s\" (%p)", > > (void *)txq, mp->name, (void *)mp); diff --git > > a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h index > > e10bbca..719ef45 100644 > > --- a/drivers/net/mlx4/mlx4_rxtx.h > > +++ b/drivers/net/mlx4/mlx4_rxtx.h > > @@ -53,6 +53,7 @@ > > > > #include "mlx4.h" > > #include "mlx4_prm.h" > > +#include "mlx4_utils.h" >=20 > Why? >=20 > > > > /** Rx queue counters. */ > > struct mlx4_rxq_stats { > > @@ -160,7 +161,6 @@ void mlx4_rx_queue_release(void *dpdk_rxq); > > > > /* mlx4_rxtx.c */ > > > > -uint32_t mlx4_txq_mp2mr(struct txq *txq, struct rte_mempool *mp); > > uint16_t mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, > > uint16_t pkts_n); > > uint16_t mlx4_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, @@ > > -169,6 +169,8 @@ uint16_t mlx4_tx_burst_removed(void *dpdk_txq, > struct rte_mbuf **pkts, > > uint16_t pkts_n); > > uint16_t mlx4_rx_burst_removed(void *dpdk_rxq, struct rte_mbuf **pkts, > > uint16_t pkts_n); > > +uint32_t mlx4_txq_add_mr(struct txq *txq, struct rte_mempool *mp, > > + unsigned int i); > > > > /* mlx4_txq.c */ > > > > @@ -177,4 +179,52 @@ int mlx4_tx_queue_setup(struct rte_eth_dev > *dev, uint16_t idx, > > const struct rte_eth_txconf *conf); void > > mlx4_tx_queue_release(void *dpdk_txq); > > > > +/** > > + * Get memory pool (MP) from mbuf. If mbuf is indirect, the pool from > > +which > > + * the cloned mbuf is allocated is returned instead. > > + * > > + * @param buf > > + * Pointer to mbuf. > > + * > > + * @return > > + * Memory pool where data is located for given mbuf. > > + */ > > +static __rte_always_inline struct rte_mempool * mlx4_txq_mb2mp(struct > > +rte_mbuf *buf) { > > + if (unlikely(RTE_MBUF_INDIRECT(buf))) > > + return rte_mbuf_from_indirect(buf)->pool; > > + return buf->pool; > > +} > > + > > +/** > > + * Get memory region (MR) <-> memory pool (MP) association from txq- > >mp2mr[]. > > + * Call mlx4_txq_add_mr() if MP is not registered yet. > > + * > > + * @param txq > > + * Pointer to Tx queue structure. > > + * @param[in] mp > > + * Memory pool for which a memory region lkey must be returned. > > + * > > + * @return > > + * mr->lkey on success, (uint32_t)-1 on failure. > > + */ > > +static __rte_always_inline uint32_t >=20 > Note __rte_always_inline is defined in rte_common.h and should be > explicitly included (however don't do that, see below). >=20 > > +mlx4_txq_mp2mr(struct txq *txq, struct rte_mempool *mp) { > > + unsigned int i; > > + > > + for (i =3D 0; (i !=3D RTE_DIM(txq->mp2mr)); ++i) { > > + if (unlikely(txq->mp2mr[i].mp =3D=3D NULL)) { > > + /* Unknown MP, add a new MR for it. */ > > + break; > > + } > > + if (txq->mp2mr[i].mp =3D=3D mp) { > > + assert(txq->mp2mr[i].lkey !=3D (uint32_t)-1); > > + assert(txq->mp2mr[i].mr->lkey =3D=3D txq- > >mp2mr[i].lkey); >=20 > assert() requires assert.h (but don't include it, see subsequent suggesti= on). >=20 > > + return txq->mp2mr[i].lkey; > > + } > > + } > > + return mlx4_txq_add_mr(txq, mp, i); > > +} > > #endif /* MLX4_RXTX_H_ */ >=20 > So as described above, these functions do not need the __rte_always_inlin= e, > please remove it. They also do not need to be located in a header file; t= he > reason it's the case for their mlx5 counterparts is that they have to be = shared > between vectorized/non-vectorized code. No such requirement here, you > should move them back to their original spot. >=20 Static function mlx4_txq_mp2mr() must be in a header file because it is sha= red by 2 files: mlx4_txq.c and mlx4_rxtx.c. It is not related to vectorized/non-vectorized code in mlx5. Having said that -__rte_always_inline is required as well otherwise compila= tion fails with=20 drivers/net/mlx4/mlx4_rxtx.h:200:1: error: 'mlx4_txq_mp2mr' defined but not= used [-Werror=3Dunused-function] for files which include mlx4_rxtx.h > My suggestion for this performance improvement is to move > mlx4_txq_add_mr() to a different file, mlx4_mr.c looks like a good > candidate. This fact will ensure it's never inlined and far away from the= data > path. >=20 Function mlx4_txq_add_mr() is relatively small.=20 What do you say about preceding it with __attribute((noinline)) instead of = creating a new file? > -- > Adrien Mazarguil > 6WIND