linux-rdma.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2] net: mlx5: Use iowriteXbe() to ring doorbell and remove reduntant wmb()
@ 2020-01-03 17:52 Liran Alon
  2020-01-03 19:17 ` Jason Gunthorpe
  0 siblings, 1 reply; 4+ messages in thread
From: Liran Alon @ 2020-01-03 17:52 UTC (permalink / raw)
  To: saeedm, leon, netdev, linux-rdma
  Cc: eli, tariqt, danielm, jgg, Liran Alon, Håkon Bugge

Currently, mlx5e_notify_hw() executes wmb() to complete writes to
cache-coherent memory before ringing doorbell. Doorbell is written
to by mlx5_write64() which use __raw_writeX().

This is semantically correct but executes reduntant wmb() in some
architectures. For example, in x86, a write to UC memory guarantees
that any previous write to WB/UC memory will be globally visible
before the write to UC memory. Therefore, there is no need to also
execute wmb() before write to doorbell which is mapped as UC memory.

The consideration regarding this between different architectures is
handled properly by the writeX() & iowriteX() accessors. Which is
defined differently for different architectures. E.g. On x86, it is
just a memory write. However, on ARM, it is defined as __iowmb()
folowed by a memory write. __iowmb() is defined as wmb().

Therefore, change mlx5_write64() to use iowriteXbe() and remove wmb()
from it's callers.

Note: This relies on the fact that mlx5 kernel driver currently don't
use Mellanox BlueFlame technology that write Tx descriptors to WC memory.
In that case, iowriteX() isn't sufficient to flush write-combined buffers
(WCBs) on x86 AMD CPUs. That's because AMD CPUs do not flush WCBs on
write to UC memory. In that case, a wmb() (SFENCE) is required before
writing to Tx doorbell.

In addition, change callers of mlx5_write64() to just pass value in
CPU endianness as now mlx5_write64() explicitly pass value as big-endian
to device by using iowriteXbe().

Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
---
 drivers/infiniband/hw/mlx5/qp.c                        |  6 +-----
 drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h      |  7 +------
 drivers/net/ethernet/mellanox/mlx5/core/fpga/conn.c    |  2 --
 .../net/ethernet/mellanox/mlx5/core/steering/dr_send.c |  4 ----
 include/linux/mlx5/cq.h                                |  9 ++-------
 include/linux/mlx5/doorbell.h                          | 10 ++++------
 6 files changed, 8 insertions(+), 30 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 7e51870e9e01..87cbecb693ea 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -5329,11 +5329,7 @@ static int _mlx5_ib_post_send(struct ib_qp *ibqp, const struct ib_send_wr *wr,
 
 		qp->db.db[MLX5_SND_DBR] = cpu_to_be32(qp->sq.cur_post);
 
-		/* Make sure doorbell record is visible to the HCA before
-		 * we hit doorbell */
-		wmb();
-
-		mlx5_write64((__be32 *)ctrl, bf->bfreg->map + bf->offset);
+		mlx5_write64((u32 *)ctrl, bf->bfreg->map + bf->offset);
 		/* Make sure doorbells don't leak out of SQ spinlock
 		 * and reach the HCA out of order.
 		 */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
index 7c8796d9743f..ad3fd38456b1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
@@ -108,12 +108,7 @@ mlx5e_notify_hw(struct mlx5_wq_cyc *wq, u16 pc, void __iomem *uar_map,
 
 	*wq->db = cpu_to_be32(pc);
 
-	/* ensure doorbell record is visible to device before ringing the
-	 * doorbell
-	 */
-	wmb();
-
-	mlx5_write64((__be32 *)ctrl, uar_map);
+	mlx5_write64((u32 *)ctrl, uar_map);
 }
 
 static inline bool mlx5e_transport_inline_tx_wqe(struct mlx5_wqe_ctrl_seg *cseg)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/conn.c b/drivers/net/ethernet/mellanox/mlx5/core/fpga/conn.c
index 61021133029e..e30b6c771218 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fpga/conn.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/conn.c
@@ -133,8 +133,6 @@ static void mlx5_fpga_conn_notify_hw(struct mlx5_fpga_conn *conn, void *wqe)
 	/* ensure wqe is visible to device before updating doorbell record */
 	dma_wmb();
 	*conn->qp.wq.sq.db = cpu_to_be32(conn->qp.sq.pc);
-	/* Make sure that doorbell record is visible before ringing */
-	wmb();
 	mlx5_write64(wqe, conn->fdev->conn_res.uar->map + MLX5_BF_OFFSET);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c b/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c
index 51803eef13dd..cfcbd4e338cd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c
@@ -213,10 +213,6 @@ static void dr_cmd_notify_hw(struct mlx5dr_qp *dr_qp, void *ctrl)
 {
 	dma_wmb();
 	*dr_qp->wq.sq.db = cpu_to_be32(dr_qp->sq.pc & 0xfffff);
-
-	/* After wmb() the hw aware of new work */
-	wmb();
-
 	mlx5_write64(ctrl, dr_qp->uar->map + MLX5_BF_OFFSET);
 }
 
diff --git a/include/linux/mlx5/cq.h b/include/linux/mlx5/cq.h
index 40748fc1b11b..4631ad35da53 100644
--- a/include/linux/mlx5/cq.h
+++ b/include/linux/mlx5/cq.h
@@ -162,13 +162,8 @@ static inline void mlx5_cq_arm(struct mlx5_core_cq *cq, u32 cmd,
 
 	*cq->arm_db = cpu_to_be32(sn << 28 | cmd | ci);
 
-	/* Make sure that the doorbell record in host memory is
-	 * written before ringing the doorbell via PCI MMIO.
-	 */
-	wmb();
-
-	doorbell[0] = cpu_to_be32(sn << 28 | cmd | ci);
-	doorbell[1] = cpu_to_be32(cq->cqn);
+	doorbell[0] = sn << 28 | cmd | ci;
+	doorbell[1] = cq->cqn;
 
 	mlx5_write64(doorbell, uar_page + MLX5_CQ_DOORBELL);
 }
diff --git a/include/linux/mlx5/doorbell.h b/include/linux/mlx5/doorbell.h
index 5c267707e1df..9c1d35777323 100644
--- a/include/linux/mlx5/doorbell.h
+++ b/include/linux/mlx5/doorbell.h
@@ -43,17 +43,15 @@
  * Note that the write is not atomic on 32-bit systems! In contrast to 64-bit
  * ones, it requires proper locking. mlx5_write64 doesn't do any locking, so use
  * it at your own discretion, protected by some kind of lock on 32 bits.
- *
- * TODO: use write{q,l}_relaxed()
  */
 
-static inline void mlx5_write64(__be32 val[2], void __iomem *dest)
+static inline void mlx5_write64(u32 val[2], void __iomem *dest)
 {
 #if BITS_PER_LONG == 64
-	__raw_writeq(*(u64 *)val, dest);
+	iowrite64be(*(u64 *)val, dest);
 #else
-	__raw_writel((__force u32) val[0], dest);
-	__raw_writel((__force u32) val[1], dest + 4);
+	iowrite32be(val[0], dest);
+	iowrite32be(val[1], dest + 4);
 #endif
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] net: mlx5: Use iowriteXbe() to ring doorbell and remove reduntant wmb()
  2020-01-03 17:52 [PATCH v2] net: mlx5: Use iowriteXbe() to ring doorbell and remove reduntant wmb() Liran Alon
@ 2020-01-03 19:17 ` Jason Gunthorpe
  2020-01-03 22:28   ` Liran Alon
  0 siblings, 1 reply; 4+ messages in thread
From: Jason Gunthorpe @ 2020-01-03 19:17 UTC (permalink / raw)
  To: Liran Alon
  Cc: saeedm, leon, netdev, linux-rdma, eli, tariqt, danielm, Håkon Bugge

On Fri, Jan 03, 2020 at 07:52:07PM +0200, Liran Alon wrote:
> diff --git a/include/linux/mlx5/cq.h b/include/linux/mlx5/cq.h
> index 40748fc1b11b..4631ad35da53 100644
> +++ b/include/linux/mlx5/cq.h
> @@ -162,13 +162,8 @@ static inline void mlx5_cq_arm(struct mlx5_core_cq *cq, u32 cmd,
>  
>  	*cq->arm_db = cpu_to_be32(sn << 28 | cmd | ci);
>  
> -	/* Make sure that the doorbell record in host memory is
> -	 * written before ringing the doorbell via PCI MMIO.
> -	 */
> -	wmb();
> -
> -	doorbell[0] = cpu_to_be32(sn << 28 | cmd | ci);
> -	doorbell[1] = cpu_to_be32(cq->cqn);
> +	doorbell[0] = sn << 28 | cmd | ci;
> +	doorbell[1] = cq->cqn;

This does actually have to change to a u64 otherwise it is not the
same.

On x86 LE, it was
 db[0] = swab(a)
 db[1] = swab(b)
 __raw_writel(db)

Now it is
 db[0] = a
 db[1] = b
 __raw_writel(swab(db))

Putting the swab around the u64 swaps the order of a/b in the TLP.

It might be tempting to swap db[0]/db[1] but IIRC this messed it up on
BE.

The sanest, simplest solution is to use a u64 natively, as the example
I gave did.

There is also the issue of casting a u32 to a u64 and possibly
triggering a unaligned kernel access, presumably this doesn't happen
today only by some lucky chance..

>  	mlx5_write64(doorbell, uar_page + MLX5_CQ_DOORBELL);
>  }
> diff --git a/include/linux/mlx5/doorbell.h b/include/linux/mlx5/doorbell.h
> index 5c267707e1df..9c1d35777323 100644
> +++ b/include/linux/mlx5/doorbell.h
> @@ -43,17 +43,15 @@
>   * Note that the write is not atomic on 32-bit systems! In contrast to 64-bit
>   * ones, it requires proper locking. mlx5_write64 doesn't do any locking, so use
>   * it at your own discretion, protected by some kind of lock on 32 bits.
> - *
> - * TODO: use write{q,l}_relaxed()
>   */
>  
> -static inline void mlx5_write64(__be32 val[2], void __iomem *dest)
> +static inline void mlx5_write64(u32 val[2], void __iomem *dest)
>  {

So this should accept a straight u64, the goofy arrays have to go away

>  #if BITS_PER_LONG == 64
> -	__raw_writeq(*(u64 *)val, dest);
> +	iowrite64be(*(u64 *)val, dest);
>  #else
> -	__raw_writel((__force u32) val[0], dest);
> -	__raw_writel((__force u32) val[1], dest + 4);
> +	iowrite32be(val[0], dest);
> +	iowrite32be(val[1], dest + 4);

With a u64 input this fallback is written as

  iowrite32be(val >> 32, dest)
  iowrite32be((u32)val, dest + 4)

Which matches the definition for how write64 must construct a TLP.

And arguably the first one should be _relaxed (but nobody cares about
this code path)

Jason

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] net: mlx5: Use iowriteXbe() to ring doorbell and remove reduntant wmb()
  2020-01-03 19:17 ` Jason Gunthorpe
@ 2020-01-03 22:28   ` Liran Alon
  2020-01-06 22:05     ` Saeed Mahameed
  0 siblings, 1 reply; 4+ messages in thread
From: Liran Alon @ 2020-01-03 22:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: saeedm, leon, netdev, linux-rdma, eli, tariqt, danielm, Håkon Bugge



> On 3 Jan 2020, at 21:17, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Fri, Jan 03, 2020 at 07:52:07PM +0200, Liran Alon wrote:
>> diff --git a/include/linux/mlx5/cq.h b/include/linux/mlx5/cq.h
>> index 40748fc1b11b..4631ad35da53 100644
>> +++ b/include/linux/mlx5/cq.h
>> @@ -162,13 +162,8 @@ static inline void mlx5_cq_arm(struct mlx5_core_cq *cq, u32 cmd,
>> 
>> 	*cq->arm_db = cpu_to_be32(sn << 28 | cmd | ci);
>> 
>> -	/* Make sure that the doorbell record in host memory is
>> -	 * written before ringing the doorbell via PCI MMIO.
>> -	 */
>> -	wmb();
>> -
>> -	doorbell[0] = cpu_to_be32(sn << 28 | cmd | ci);
>> -	doorbell[1] = cpu_to_be32(cq->cqn);
>> +	doorbell[0] = sn << 28 | cmd | ci;
>> +	doorbell[1] = cq->cqn;
> 
> This does actually have to change to a u64 otherwise it is not the
> same.
> 
> On x86 LE, it was
> db[0] = swab(a)
> db[1] = swab(b)
> __raw_writel(db)
> 
> Now it is
> db[0] = a
> db[1] = b
> __raw_writel(swab(db))
> 
> Putting the swab around the u64 swaps the order of a/b in the TLP.
> 
> It might be tempting to swap db[0]/db[1] but IIRC this messed it up on
> BE.

Oops. You are right...

> 
> The sanest, simplest solution is to use a u64 natively, as the example
> I gave did.

I agree.

> 
> There is also the issue of casting a u32 to a u64 and possibly
> triggering a unaligned kernel access, presumably this doesn't happen
> today only by some lucky chance..
> 
>> 	mlx5_write64(doorbell, uar_page + MLX5_CQ_DOORBELL);
>> }
>> diff --git a/include/linux/mlx5/doorbell.h b/include/linux/mlx5/doorbell.h
>> index 5c267707e1df..9c1d35777323 100644
>> +++ b/include/linux/mlx5/doorbell.h
>> @@ -43,17 +43,15 @@
>>  * Note that the write is not atomic on 32-bit systems! In contrast to 64-bit
>>  * ones, it requires proper locking. mlx5_write64 doesn't do any locking, so use
>>  * it at your own discretion, protected by some kind of lock on 32 bits.
>> - *
>> - * TODO: use write{q,l}_relaxed()
>>  */
>> 
>> -static inline void mlx5_write64(__be32 val[2], void __iomem *dest)
>> +static inline void mlx5_write64(u32 val[2], void __iomem *dest)
>> {
> 
> So this should accept a straight u64, the goofy arrays have to go away

I agree.

> 
>> #if BITS_PER_LONG == 64
>> -	__raw_writeq(*(u64 *)val, dest);
>> +	iowrite64be(*(u64 *)val, dest);
>> #else
>> -	__raw_writel((__force u32) val[0], dest);
>> -	__raw_writel((__force u32) val[1], dest + 4);
>> +	iowrite32be(val[0], dest);
>> +	iowrite32be(val[1], dest + 4);
> 
> With a u64 input this fallback is written as
> 
>  iowrite32be(val >> 32, dest)
>  iowrite32be((u32)val, dest + 4)
> 
> Which matches the definition for how write64 must construct a TLP.
> 
> And arguably the first one should be _relaxed (but nobody cares about
> this code path)

I agree with everything. Will fix on v3.

Thanks!
-Liran

> 
> Jason


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] net: mlx5: Use iowriteXbe() to ring doorbell and remove reduntant wmb()
  2020-01-03 22:28   ` Liran Alon
@ 2020-01-06 22:05     ` Saeed Mahameed
  0 siblings, 0 replies; 4+ messages in thread
From: Saeed Mahameed @ 2020-01-06 22:05 UTC (permalink / raw)
  To: Eran Ben Elisha, liran.alon, Tal Gilboa, jgg, Moshe Shemesh
  Cc: Eli Cohen, haakon.bugge, Tariq Toukan, netdev, linux-rdma, leon,
	Daniel Marcovitch

On Sat, 2020-01-04 at 00:28 +0200, Liran Alon wrote:
> > On 3 Jan 2020, at 21:17, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > 
> > On Fri, Jan 03, 2020 at 07:52:07PM +0200, Liran Alon wrote:
> > > diff --git a/include/linux/mlx5/cq.h b/include/linux/mlx5/cq.h
> > > index 40748fc1b11b..4631ad35da53 100644
> > > +++ b/include/linux/mlx5/cq.h
> > > @@ -162,13 +162,8 @@ static inline void mlx5_cq_arm(struct
> > > mlx5_core_cq *cq, u32 cmd,
> > > 
> > > 	*cq->arm_db = cpu_to_be32(sn << 28 | cmd | ci);
> > > 
> > > -	/* Make sure that the doorbell record in host memory is
> > > -	 * written before ringing the doorbell via PCI MMIO.
> > > -	 */
> > > -	wmb();
> > > -
> > > -	doorbell[0] = cpu_to_be32(sn << 28 | cmd | ci);
> > > -	doorbell[1] = cpu_to_be32(cq->cqn);
> > > +	doorbell[0] = sn << 28 | cmd | ci;
> > > +	doorbell[1] = cq->cqn;
> > 
> > This does actually have to change to a u64 otherwise it is not the
> > same.
> > 
> > On x86 LE, it was
> > db[0] = swab(a)
> > db[1] = swab(b)
> > __raw_writel(db)
> > 
> > Now it is
> > db[0] = a
> > db[1] = b
> > __raw_writel(swab(db))
> > 
> > Putting the swab around the u64 swaps the order of a/b in the TLP.
> > 
> > It might be tempting to swap db[0]/db[1] but IIRC this messed it up
> > on
> > BE.
> 
> Oops. You are right...
> 
> > The sanest, simplest solution is to use a u64 natively, as the
> > example
> > I gave did.
> 
> I agree.
> 
> > There is also the issue of casting a u32 to a u64 and possibly
> > triggering a unaligned kernel access, presumably this doesn't
> > happen
> > today only by some lucky chance..
> > 
> > > 	mlx5_write64(doorbell, uar_page + MLX5_CQ_DOORBELL);
> > > }
> > > diff --git a/include/linux/mlx5/doorbell.h
> > > b/include/linux/mlx5/doorbell.h
> > > index 5c267707e1df..9c1d35777323 100644
> > > +++ b/include/linux/mlx5/doorbell.h
> > > @@ -43,17 +43,15 @@
> > >  * Note that the write is not atomic on 32-bit systems! In
> > > contrast to 64-bit
> > >  * ones, it requires proper locking. mlx5_write64 doesn't do any
> > > locking, so use
> > >  * it at your own discretion, protected by some kind of lock on
> > > 32 bits.
> > > - *
> > > - * TODO: use write{q,l}_relaxed()
> > >  */
> > > 
> > > -static inline void mlx5_write64(__be32 val[2], void __iomem
> > > *dest)
> > > +static inline void mlx5_write64(u32 val[2], void __iomem *dest)
> > > {
> > 
> > So this should accept a straight u64, the goofy arrays have to go
> > away
> 
> I agree.
> 
> > > #if BITS_PER_LONG == 64
> > > -	__raw_writeq(*(u64 *)val, dest);
> > > +	iowrite64be(*(u64 *)val, dest);
> > > #else
> > > -	__raw_writel((__force u32) val[0], dest);
> > > -	__raw_writel((__force u32) val[1], dest + 4);
> > > +	iowrite32be(val[0], dest);
> > > +	iowrite32be(val[1], dest + 4);
> > 
> > With a u64 input this fallback is written as
> > 
> >  iowrite32be(val >> 32, dest)
> >  iowrite32be((u32)val, dest + 4)
> > 
> > Which matches the definition for how write64 must construct a TLP.
> > 
> > And arguably the first one should be _relaxed (but nobody cares
> > about
> > this code path)
> 
> I agree with everything. Will fix on v3.
> 

Hi Liran,

As agreed in a separate email thread, we will run performance
regression, before i can merge this patch, CCed Moshe and Tal for
perofmance feedback.

Thanks,
Saeed.


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2020-01-06 22:05 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-03 17:52 [PATCH v2] net: mlx5: Use iowriteXbe() to ring doorbell and remove reduntant wmb() Liran Alon
2020-01-03 19:17 ` Jason Gunthorpe
2020-01-03 22:28   ` Liran Alon
2020-01-06 22:05     ` Saeed Mahameed

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).