From mboxrd@z Thu Jan  1 00:00:00 1970
From: Matan Azrad <matan@mellanox.com>
Subject: Re: [PATCH v3 6/7] net/mlx4: mitigate Tx path memory
	barriers
Date: Mon, 30 Oct 2017 19:47:20 +0000
Message-ID: <HE1PR0502MB3659B337992E2E7D33E587ABD2590@HE1PR0502MB3659.eurprd05.prod.outlook.com>
References: <1508768520-4810-1-git-send-email-ophirmu@mellanox.com>
 <1509358049-18854-1-git-send-email-matan@mellanox.com>
 <1509358049-18854-7-git-send-email-matan@mellanox.com>
 <20171030142350.GC26782@6wind.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Cc: "dev@dpdk.org" <dev@dpdk.org>, Ophir Munk <ophirmu@mellanox.com>
To: Adrien Mazarguil <adrien.mazarguil@6wind.com>
In-Reply-To: <20171030142350.GC26782@6wind.com>

Hi Adrien

> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> Sent: Monday, October 30, 2017 4:24 PM
> To: Matan Azrad <matan@mellanox.com>
> Cc: dev@dpdk.org; Ophir Munk <ophirmu@mellanox.com>
> Subject: Re: [PATCH v3 6/7] net/mlx4: mitigate Tx path memory barriers
>
> On Mon, Oct 30, 2017 at 10:07:28AM +0000, Matan Azrad wrote:
> > Replace most of the memory barriers by compiler barriers since they
> > are all targeted to the DRAM; This improves code efficiency for
> > systems which force store order between different addresses.
> >
> > Only the doorbell record store should be protected by memory barrier
> > since it is targeted to the PCI memory domain.
> >
> > Limit pre byte count store compiler barrier for systems with cache
> > line size smaller than 64B (TXBB size).
> >
> > Signed-off-by: Matan Azrad <matan@mellanox.com>
>
> This sounds like an interesting performance improvement, can you share the
> typical or expected amount (percentage/hard numbers) for a given use case
> as part of the commit log?
>

Yes, it improves performance; I will share the numbers.

> More comments below.
>
> > ---
> >  drivers/net/mlx4/mlx4_rxtx.c | 11 ++++++-----
> >  1 file changed, 6 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/net/mlx4/mlx4_rxtx.c
> > b/drivers/net/mlx4/mlx4_rxtx.c index 8ea8851..482c399 100644
> > --- a/drivers/net/mlx4/mlx4_rxtx.c
> > +++ b/drivers/net/mlx4/mlx4_rxtx.c
> > @@ -168,7 +168,7 @@ struct pv {
> >  		/*
> >  		 * Make sure we read the CQE after we read the ownership bit.
> >  		 */
> > -		rte_rmb();
> > +		rte_io_rmb();
>
> OK for this one since the rest of the code should not be run due to the
> condition (I'm not even sure even a compiler barrier is necessary at all here).
>
> >  #ifndef NDEBUG
> >  		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
> >  			     MLX4_CQE_OPCODE_ERROR)) {
> > @@ -203,7 +203,7 @@ struct pv {
> >  	 */
> >  	cq->cons_index = cons_index;
> >  	*cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index & MLX4_CQ_DB_CI_MASK);
> > -	rte_wmb();
> > +	rte_io_wmb();
>
> This one could be removed entirely as well, which is more or less what the
> move to a compiler barrier does. Nothing in subsequent code depends on
> this doorbell being written, so this can piggy back on any subsequent
> rte_wmb().

Yes, you are right; this code was probably taken from a multi-threaded implementation.
>
> On the other hand in my opinion a barrier (compiler or otherwise) might be
> needed before the doorbell write, to make clear it cannot somehow be done
> earlier in case something attempts to optimize it away.
>
I think we can remove it entirely: the compiler can't optimize away the ci_db store, since it depends on the preceding code (cons_index).

> >  	sq->tail = sq->tail + nr_txbbs;
> >  	/* Update the list of packets posted for transmission. */
> >  	elts_comp -= pkts;
> > @@ -321,6 +321,7 @@ static int handle_multi_segs(struct rte_mbuf *buf,
> >  		 * control segment.
> >  		 */
> >  		if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
> > +#if RTE_CACHE_LINE_SIZE < 64
> >  			/*
> >  			 * Need a barrier here before writing the byte_count
> >  			 * fields to make sure that all the data is visible
> > @@ -331,6 +332,7 @@ static int handle_multi_segs(struct rte_mbuf *buf,
> >  			 * data, and end up sending the wrong data.
> >  			 */
> >  			rte_io_wmb();
> > +#endif /* RTE_CACHE_LINE_SIZE */
>
> Interesting one.
>
> >  			dseg->byte_count = byte_count;
> >  		} else {
> >  			/*
> > @@ -469,8 +471,7 @@ static int handle_multi_segs(struct rte_mbuf *buf,
> >  				break;
> >  			}
> >  #endif /* NDEBUG */
> > -			/* Need a barrier here before byte count store. */
> > -			rte_io_wmb();
> > +			/* Never be TXBB aligned, no need compiler barrier. */
>
> The reason there was a barrier here at all was unclear, so if it's really useless,
> you don't even need to describe why.

The comment is there because the multi-segment path has a barrier at the equivalent stage.
I think it can help future reviews.
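
To illustrate the "never TXBB aligned" remark, here is a minimal sketch of the alignment test used in the multi-segment path. The 64B TXBB size comes from the commit log; the 16-byte control segment size, the assumption that the first data segment directly follows a TXBB-aligned control segment, and all sketch_* names are hypothetical and only for illustration, not the driver's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_TXBB_SIZE 64u     /* TXBB size, as stated in the commit log */
#define SKETCH_CTRL_SEG_SIZE 16u /* ASSUMED control segment size */

/* Same mask test as in the quoted diff: true when addr starts a TXBB. */
static inline bool
sketch_txbb_aligned(uintptr_t addr)
{
	return (addr & (uintptr_t)(SKETCH_TXBB_SIZE - 1)) == 0;
}

/*
 * ASSUMPTION: in the single-segment path the data segment sits right
 * after a control segment that itself starts a TXBB.
 */
static inline uintptr_t
sketch_first_dseg(uintptr_t ctrl)
{
	assert(sketch_txbb_aligned(ctrl));
	return ctrl + SKETCH_CTRL_SEG_SIZE;
}
```

Under these assumptions the data segment always lands 16 bytes into a TXBB, so the TXBB-aligned branch (and its barrier) would never apply to it.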

>
> >  			dseg->byte_count = rte_cpu_to_be_32(buf->data_len);
> >
> >  			/* Fill the control parameters for this packet. */
> > @@ -533,7 +534,7 @@ static int handle_multi_segs(struct rte_mbuf *buf,
> >  		 * setting ownership bit (because HW can start
> >  		 * executing as soon as we do).
> >  		 */
> > -		rte_wmb();
> > +		rte_io_wmb();
>
> This one looks dangerous. A compiler barrier is not strong enough to
> guarantee the order in which CPU will execute instructions, it only makes
> sure what follows the barrier doesn't appear before it in the generated code.
>
From what I found, for CPUs that do not preserve store order between different addresses (arm, ppc), rte_io_wmb() is converted to rte_wmb().
For those that do preserve it (x86), we only need the right order in the compiled code, because all the relevant stores target the same memory domain (DRAM), so the order of the actual stores is guaranteed as well.
This is unlike the doorbell store, which is directed to a different memory domain (PCI).
So the only place that needs rte_wmb() is before the doorbell write.
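
The ordering argument above can be sketched roughly as follows. This is only an illustration under stated assumptions: the sketch_* macros are hypothetical stand-ins built on C11 fences, not the DPDK rte_* definitions, and the structure, owner bit value and doorbell are simplified placeholders rather than the mlx4 layout.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, NOT the DPDK barrier definitions. */
#define sketch_compiler_barrier() atomic_signal_fence(memory_order_seq_cst)
#define sketch_full_wmb()         atomic_thread_fence(memory_order_seq_cst)

#if defined(__x86_64__) || defined(__i386__)
/* Stores keep program order: ordering the generated code is enough. */
#define sketch_io_wmb() sketch_compiler_barrier()
#else
/* Weaker store ordering: fall back to the hardware fence. */
#define sketch_io_wmb() sketch_full_wmb()
#endif

#define SKETCH_OWNER_BIT 0x80000000u /* placeholder ownership flag */

struct sketch_wqe {
	volatile uint32_t byte_count;
	volatile uint32_t owner_opcode;
};

static inline void
sketch_post(struct sketch_wqe *wqe, volatile uint32_t *doorbell,
	    uint32_t len, uint32_t opcode, uint32_t qpn)
{
	assert(wqe != NULL && doorbell != NULL);
	/* Descriptor contents and ownership word: same domain (DRAM). */
	wqe->byte_count = len;
	sketch_io_wmb();
	wqe->owner_opcode = opcode | SKETCH_OWNER_BIT;
	/* The doorbell stands in for a PCI write: use the full barrier. */
	sketch_full_wmb();
	*doorbell = qpn;
}
```

Only the last fence is a hardware barrier on x86 in this sketch, which is the effect the patch aims for on the Tx path.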

> Unless the comment above this barrier is wrong, this change may cause
> hard-to-debug issues down the road, you should drop it.
>
> >  		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode |
> >  					      ((sq->head & sq->txbb_cnt) ?
> >  						       MLX4_BIT_WQE_OWN : 0));
> > --
> > 1.8.3.1
> >
>
> --
> Adrien Mazarguil
> 6WIND

Thanks!