All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] net: mlx5: Use writeX() to ring doorbell and remove reduntant wmb()
@ 2020-01-02 17:44 Liran Alon
  2020-01-02 19:29 ` Jason Gunthorpe
  0 siblings, 1 reply; 11+ messages in thread
From: Liran Alon @ 2020-01-02 17:44 UTC (permalink / raw)
  To: saeedm, leon, netdev, linux-rdma
  Cc: eli, tariqt, danielm, Liran Alon, Håkon Bugge

Currently, mlx5e_notify_hw() executes wmb() to complete writes to cache-coherent
memory before ringing doorbell. Doorbell is written to by mlx5_write64()
which use __raw_writeX().

This is semantically correct but executes reduntant wmb() in some architectures.
For example, in x86, a write to UC memory guarantees that any previous write to
WB memory will be globally visible before the write to UC memory. Therefore, there
is no need to also execute wmb() before write to doorbell which is mapped as UC memory.

The consideration regarding this between different architectures is handled
properly by the writeX() macro. Which is defined differently for different
architectures. E.g. On x86, it is just a memory write. However, on ARM, it
is defined as __iowmb() folowed by a memory write. __iowmb() is defined
as wmb().

Therefore, change mlx5_write64() to use writeX() and remove wmb() from
it's callers.

Note: This relies on the fact that mlx5 kernel driver currently don't use
Mellanox BlueFlame technology that write Tx descriptors to WC memory. In
that case, writeX() isn't sufficient to flush write-combined buffers
(WCBs) on x86 AMD CPUs. That's because AMD CPUs do not flush WCBs on
write to UC memory. In that case, a wmb() (SFENCE) is required before
writing to Tx doorbell.

Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
---
 drivers/infiniband/hw/mlx5/qp.c                           | 4 ----
 drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h         | 5 -----
 drivers/net/ethernet/mellanox/mlx5/core/fpga/conn.c       | 2 --
 .../net/ethernet/mellanox/mlx5/core/steering/dr_send.c    | 4 ----
 include/linux/mlx5/cq.h                                   | 5 -----
 include/linux/mlx5/doorbell.h                             | 8 +++-----
 6 files changed, 3 insertions(+), 25 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 7e51870e9e01..5c2d660fada6 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -5329,10 +5329,6 @@ static int _mlx5_ib_post_send(struct ib_qp *ibqp, const struct ib_send_wr *wr,
 
 		qp->db.db[MLX5_SND_DBR] = cpu_to_be32(qp->sq.cur_post);
 
-		/* Make sure doorbell record is visible to the HCA before
-		 * we hit doorbell */
-		wmb();
-
 		mlx5_write64((__be32 *)ctrl, bf->bfreg->map + bf->offset);
 		/* Make sure doorbells don't leak out of SQ spinlock
 		 * and reach the HCA out of order.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
index 7c8796d9743f..932891ea78d6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
@@ -108,11 +108,6 @@ mlx5e_notify_hw(struct mlx5_wq_cyc *wq, u16 pc, void __iomem *uar_map,
 
 	*wq->db = cpu_to_be32(pc);
 
-	/* ensure doorbell record is visible to device before ringing the
-	 * doorbell
-	 */
-	wmb();
-
 	mlx5_write64((__be32 *)ctrl, uar_map);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/conn.c b/drivers/net/ethernet/mellanox/mlx5/core/fpga/conn.c
index 61021133029e..e30b6c771218 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fpga/conn.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/conn.c
@@ -133,8 +133,6 @@ static void mlx5_fpga_conn_notify_hw(struct mlx5_fpga_conn *conn, void *wqe)
 	/* ensure wqe is visible to device before updating doorbell record */
 	dma_wmb();
 	*conn->qp.wq.sq.db = cpu_to_be32(conn->qp.sq.pc);
-	/* Make sure that doorbell record is visible before ringing */
-	wmb();
 	mlx5_write64(wqe, conn->fdev->conn_res.uar->map + MLX5_BF_OFFSET);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c b/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c
index 51803eef13dd..cfcbd4e338cd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c
@@ -213,10 +213,6 @@ static void dr_cmd_notify_hw(struct mlx5dr_qp *dr_qp, void *ctrl)
 {
 	dma_wmb();
 	*dr_qp->wq.sq.db = cpu_to_be32(dr_qp->sq.pc & 0xfffff);
-
-	/* After wmb() the hw aware of new work */
-	wmb();
-
 	mlx5_write64(ctrl, dr_qp->uar->map + MLX5_BF_OFFSET);
 }
 
diff --git a/include/linux/mlx5/cq.h b/include/linux/mlx5/cq.h
index 40748fc1b11b..28744a725e64 100644
--- a/include/linux/mlx5/cq.h
+++ b/include/linux/mlx5/cq.h
@@ -162,11 +162,6 @@ static inline void mlx5_cq_arm(struct mlx5_core_cq *cq, u32 cmd,
 
 	*cq->arm_db = cpu_to_be32(sn << 28 | cmd | ci);
 
-	/* Make sure that the doorbell record in host memory is
-	 * written before ringing the doorbell via PCI MMIO.
-	 */
-	wmb();
-
 	doorbell[0] = cpu_to_be32(sn << 28 | cmd | ci);
 	doorbell[1] = cpu_to_be32(cq->cqn);
 
diff --git a/include/linux/mlx5/doorbell.h b/include/linux/mlx5/doorbell.h
index 5c267707e1df..e606d5760fa7 100644
--- a/include/linux/mlx5/doorbell.h
+++ b/include/linux/mlx5/doorbell.h
@@ -43,17 +43,15 @@
  * Note that the write is not atomic on 32-bit systems! In contrast to 64-bit
  * ones, it requires proper locking. mlx5_write64 doesn't do any locking, so use
  * it at your own discretion, protected by some kind of lock on 32 bits.
- *
- * TODO: use write{q,l}_relaxed()
  */
 
 static inline void mlx5_write64(__be32 val[2], void __iomem *dest)
 {
 #if BITS_PER_LONG == 64
-	__raw_writeq(*(u64 *)val, dest);
+	writeq(*(u64 *)val, dest);
 #else
-	__raw_writel((__force u32) val[0], dest);
-	__raw_writel((__force u32) val[1], dest + 4);
+	writel((__force u32) val[0], dest);
+	writel((__force u32) val[1], dest + 4);
 #endif
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] net: mlx5: Use writeX() to ring doorbell and remove reduntant wmb()
  2020-01-02 17:44 [PATCH] net: mlx5: Use writeX() to ring doorbell and remove reduntant wmb() Liran Alon
@ 2020-01-02 19:29 ` Jason Gunthorpe
  2020-01-02 19:45   ` Liran Alon
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2020-01-02 19:29 UTC (permalink / raw)
  To: Liran Alon
  Cc: saeedm, leon, netdev, linux-rdma, eli, tariqt, danielm, Håkon Bugge

On Thu, Jan 02, 2020 at 07:44:36PM +0200, Liran Alon wrote:
> Currently, mlx5e_notify_hw() executes wmb() to complete writes to cache-coherent
> memory before ringing doorbell. Doorbell is written to by mlx5_write64()
> which use __raw_writeX().
> 
> This is semantically correct but executes reduntant wmb() in some architectures.
> For example, in x86, a write to UC memory guarantees that any previous write to
> WB memory will be globally visible before the write to UC memory. Therefore, there
> is no need to also execute wmb() before write to doorbell which is mapped as UC memory.
> 
> The consideration regarding this between different architectures is handled
> properly by the writeX() macro. Which is defined differently for different
> architectures. E.g. On x86, it is just a memory write. However, on ARM, it
> is defined as __iowmb() folowed by a memory write. __iowmb() is defined
> as wmb().

This reasoning seems correct, though I would recommend directly
refering to locking/memory-barriers.txt which explains this.

> Therefore, change mlx5_write64() to use writeX() and remove wmb() from
> it's callers.

Yes, wmb(); writel(); is always redundant
  
> diff --git a/include/linux/mlx5/cq.h b/include/linux/mlx5/cq.h
> index 40748fc1b11b..28744a725e64 100644
> +++ b/include/linux/mlx5/cq.h
> @@ -162,11 +162,6 @@ static inline void mlx5_cq_arm(struct mlx5_core_cq *cq, u32 cmd,
>  
>  	*cq->arm_db = cpu_to_be32(sn << 28 | cmd | ci);
>  
> -	/* Make sure that the doorbell record in host memory is
> -	 * written before ringing the doorbell via PCI MMIO.
> -	 */
> -	wmb();
> -

Why did this one change? The doorbell memory here is not a writel():

>  	doorbell[0] = cpu_to_be32(sn << 28 | cmd | ci);
>  	doorbell[1] = cpu_to_be32(cq->cqn);

>  static inline void mlx5_write64(__be32 val[2], void __iomem *dest)
>  {
>  #if BITS_PER_LONG == 64
> -	__raw_writeq(*(u64 *)val, dest);
> +	writeq(*(u64 *)val, dest);

I want to say this might cause problems with endian swapping as writeq
also does some swaps that __raw does not? Is this true?

ie writeq does not accept a be32

Some time ago I reworked this similar code in userspace to use a u64
and remove the swapping from the caller.

Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] net: mlx5: Use writeX() to ring doorbell and remove reduntant wmb()
  2020-01-02 19:29 ` Jason Gunthorpe
@ 2020-01-02 19:45   ` Liran Alon
  2020-01-02 20:58     ` Jason Gunthorpe
  0 siblings, 1 reply; 11+ messages in thread
From: Liran Alon @ 2020-01-02 19:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: saeedm, leon, netdev, linux-rdma, eli, tariqt, danielm, Håkon Bugge



> On 2 Jan 2020, at 21:29, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Thu, Jan 02, 2020 at 07:44:36PM +0200, Liran Alon wrote:
>> Currently, mlx5e_notify_hw() executes wmb() to complete writes to cache-coherent
>> memory before ringing doorbell. Doorbell is written to by mlx5_write64()
>> which use __raw_writeX().
>> 
>> This is semantically correct but executes reduntant wmb() in some architectures.
>> For example, in x86, a write to UC memory guarantees that any previous write to
>> WB memory will be globally visible before the write to UC memory. Therefore, there
>> is no need to also execute wmb() before write to doorbell which is mapped as UC memory.
>> 
>> The consideration regarding this between different architectures is handled
>> properly by the writeX() macro. Which is defined differently for different
>> architectures. E.g. On x86, it is just a memory write. However, on ARM, it
>> is defined as __iowmb() folowed by a memory write. __iowmb() is defined
>> as wmb().
> 
> This reasoning seems correct, though I would recommend directly
> refering to locking/memory-barriers.txt which explains this.

I find memory-barriers.txt not explicit enough on the semantics of writeX().
(For example: Should it flush write-combined buffers before writing to the UC memory?)
That’s why I preferred to explicitly state here how I perceive it.

But I don’t mind of course adding a pointer to the memory-barriers.txt file.

> 
>> Therefore, change mlx5_write64() to use writeX() and remove wmb() from
>> it's callers.
> 
> Yes, wmb(); writel(); is always redundant

Well, unfortunately not…
See: https://marc.info/?l=linux-netdev&m=157798859215697&w=2
(See my suggestion to add flush_wc_writeX())

> 
>> diff --git a/include/linux/mlx5/cq.h b/include/linux/mlx5/cq.h
>> index 40748fc1b11b..28744a725e64 100644
>> +++ b/include/linux/mlx5/cq.h
>> @@ -162,11 +162,6 @@ static inline void mlx5_cq_arm(struct mlx5_core_cq *cq, u32 cmd,
>> 
>> 	*cq->arm_db = cpu_to_be32(sn << 28 | cmd | ci);
>> 
>> -	/* Make sure that the doorbell record in host memory is
>> -	 * written before ringing the doorbell via PCI MMIO.
>> -	 */
>> -	wmb();
>> -
> 
> Why did this one change? The doorbell memory here is not a writel():

Well, it’s not seen in the diff but actually the full code is:

    /* Make sure that the doorbell record in host memory is
     * written before ringing the doorbell via PCI MMIO.
     */
    wmb();

    doorbell[0] = cpu_to_be32(sn << 28 | cmd | ci);
    doorbell[1] = cpu_to_be32(cq->cqn);

    mlx5_write64(doorbell, uar_page + MLX5_CQ_DOORBELL);

i.e. doorbell is not the write to the doorbell itself. It’s still done via mlx5_write64().

> 
>> 	doorbell[0] = cpu_to_be32(sn << 28 | cmd | ci);
>> 	doorbell[1] = cpu_to_be32(cq->cqn);
> 
>> static inline void mlx5_write64(__be32 val[2], void __iomem *dest)
>> {
>> #if BITS_PER_LONG == 64
>> -	__raw_writeq(*(u64 *)val, dest);
>> +	writeq(*(u64 *)val, dest);
> 
> I want to say this might cause problems with endian swapping as writeq
> also does some swaps that __raw does not? Is this true?

Hmm... Looking at ARM64 version, writeq() indeed calls cpu_to_le64() on parameter before passing it to __raw_writeq().
Quite surprising from API perspective to be honest.

So should I change this instead to iowrite64be(*(u64 *)val, dest)?

-Liran

> 
> ie writeq does not accept a be32
> 
> Some time ago I reworked this similar code in userspace to use a u64
> and remove the swapping from the caller.
> 
> Jason


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] net: mlx5: Use writeX() to ring doorbell and remove reduntant wmb()
  2020-01-02 19:45   ` Liran Alon
@ 2020-01-02 20:58     ` Jason Gunthorpe
  2020-01-02 22:21       ` Liran Alon
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2020-01-02 20:58 UTC (permalink / raw)
  To: Liran Alon, Will Deacon
  Cc: saeedm, leon, netdev, linux-rdma, eli, tariqt, danielm, Håkon Bugge

On Thu, Jan 02, 2020 at 09:45:52PM +0200, Liran Alon wrote:
> 
> 
> > On 2 Jan 2020, at 21:29, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > 
> > On Thu, Jan 02, 2020 at 07:44:36PM +0200, Liran Alon wrote:
> >> Currently, mlx5e_notify_hw() executes wmb() to complete writes to cache-coherent
> >> memory before ringing doorbell. Doorbell is written to by mlx5_write64()
> >> which use __raw_writeX().
> >> 
> >> This is semantically correct but executes reduntant wmb() in some architectures.
> >> For example, in x86, a write to UC memory guarantees that any previous write to
> >> WB memory will be globally visible before the write to UC memory. Therefore, there
> >> is no need to also execute wmb() before write to doorbell which is mapped as UC memory.
> >> 
> >> The consideration regarding this between different architectures is handled
> >> properly by the writeX() macro. Which is defined differently for different
> >> architectures. E.g. On x86, it is just a memory write. However, on ARM, it
> >> is defined as __iowmb() folowed by a memory write. __iowmb() is defined
> >> as wmb().
> > 
> > This reasoning seems correct, though I would recommend directly
> > refering to locking/memory-barriers.txt which explains this.
> 
> I find memory-barriers.txt not explicit enough on the semantics of writeX().
> (For example: Should it flush write-combined buffers before writing to the UC memory?)
> That’s why I preferred to explicitly state here how I perceive it.

AFAIK WC is largely unspecified by the memory model. Is wmb() even
formally specified to interact with WC?

At least in this mlx5 case there is no WC, right? The kernel UAR is
mapped UC?

So we don't need to worry about the poor specification of WC access
and you can refer to memory-barriers.txt at least for this patch.

> > 
> >> Therefore, change mlx5_write64() to use writeX() and remove wmb() from
> >> it's callers.
> > 
> > Yes, wmb(); writel(); is always redundant
> 
> Well, unfortunately not…
> See: https://marc.info/?l=linux-netdev&m=157798859215697&w=2
> (See my suggestion to add flush_wc_writeX())

Well, the last time wmb & writel came up Linus was pretty clear that
writel is supposed to remain in program order and have the barriers
needed to do that.

I don't think WC was considered when that discussion happened, but we
really don't have a formal model for how WC works at all within the
kernel.

The above patch is really not a wmb(); writel() pairing, the wmb() is
actually closing/serializing an earlier WC transaction, and yes you need various
special things to keep WC working right.

IMHO you should start there before going around and adding/removing wmbs
related to WC. Update membory-barriers.txt and related with the model
ordering for WC access and get agreement.

For instance does wmb() even effect WC? Does WC have to be contained
by spinlocks? Do we need extra special barriers like flush_wc and
flush_wc_before_spin_unlock ? etc.

Perhaps Will has some advice?

> >> diff --git a/include/linux/mlx5/cq.h b/include/linux/mlx5/cq.h
> >> index 40748fc1b11b..28744a725e64 100644
> >> +++ b/include/linux/mlx5/cq.h
> >> @@ -162,11 +162,6 @@ static inline void mlx5_cq_arm(struct mlx5_core_cq *cq, u32 cmd,
> >> 
> >> 	*cq->arm_db = cpu_to_be32(sn << 28 | cmd | ci);
> >> 
> >> -	/* Make sure that the doorbell record in host memory is
> >> -	 * written before ringing the doorbell via PCI MMIO.
> >> -	 */
> >> -	wmb();
> >> -
> > 
> > Why did this one change? The doorbell memory here is not a writel():
> 
> Well, it’s not seen in the diff but actually the full code is:
> 
>     /* Make sure that the doorbell record in host memory is
>      * written before ringing the doorbell via PCI MMIO.
>      */
>     wmb();
> 
>     doorbell[0] = cpu_to_be32(sn << 28 | cmd | ci);
>     doorbell[1] = cpu_to_be32(cq->cqn);
> 
>     mlx5_write64(doorbell, uar_page + MLX5_CQ_DOORBELL);

Ah OK, we have another thing called doorbell which is actually DMA'ble
memory.

> >> 	doorbell[0] = cpu_to_be32(sn << 28 | cmd | ci);
> >> 	doorbell[1] = cpu_to_be32(cq->cqn);
> > 
> >> static inline void mlx5_write64(__be32 val[2], void __iomem *dest)
> >> {
> >> #if BITS_PER_LONG == 64
> >> -	__raw_writeq(*(u64 *)val, dest);
> >> +	writeq(*(u64 *)val, dest);
> > 
> > I want to say this might cause problems with endian swapping as writeq
> > also does some swaps that __raw does not? Is this true?
> 
> Hmm... Looking at ARM64 version, writeq() indeed calls cpu_to_le64()
> on parameter before passing it to __raw_writeq().  Quite surprising
> from API perspective to be honest.

For PCI-E devices writel(x) is defined to generate the same TLP on the
PCI-E bus, across all arches. __raw_* does something arch specific and
should not be called from drivers. It is a long standing bug that this
code is written like this.

> So should I change this instead to iowrite64be(*(u64 *)val, dest)?

This always made my head hurt, but IIRC, when I looked at it years ago
the weird array construction caused problems with that simple conversion.

The userspace version looks like this now:

        uint64_t doorbell;
        uint32_t sn;
        uint32_t ci;
        uint32_t cmd;

        sn  = cq->arm_sn & 3;
        ci  = cq->cons_index & 0xffffff;
        cmd = solicited ? MLX5_CQ_DB_REQ_NOT_SOL : MLX5_CQ_DB_REQ_NOT;

        doorbell = sn << 28 | cmd | ci;
        doorbell <<= 32;
        doorbell |= cq->cqn;

        mmio_write64_be(ctx->uar[0].reg + MLX5_CQ_DOORBELL, htobe64(doorbell));

Where on all supported platforms the mmio_write64_be() expands to a
simple store (no swap)

Which does look functionally the same as

   iowrite64be(doorbell, dest);

So this patch should change the mlx5_write64 to accept a u64 like we
did in userspace when this was all cleaned there.

Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] net: mlx5: Use writeX() to ring doorbell and remove reduntant wmb()
  2020-01-02 20:58     ` Jason Gunthorpe
@ 2020-01-02 22:21       ` Liran Alon
  2020-01-03 13:37         ` Jason Gunthorpe
  0 siblings, 1 reply; 11+ messages in thread
From: Liran Alon @ 2020-01-02 22:21 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Will Deacon, saeedm, leon, netdev, linux-rdma, eli, tariqt,
	danielm, Håkon Bugge



> On 2 Jan 2020, at 22:58, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Thu, Jan 02, 2020 at 09:45:52PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 2 Jan 2020, at 21:29, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>> 
>>> On Thu, Jan 02, 2020 at 07:44:36PM +0200, Liran Alon wrote:
>>>> Currently, mlx5e_notify_hw() executes wmb() to complete writes to cache-coherent
>>>> memory before ringing doorbell. Doorbell is written to by mlx5_write64()
>>>> which use __raw_writeX().
>>>> 
>>>> This is semantically correct but executes reduntant wmb() in some architectures.
>>>> For example, in x86, a write to UC memory guarantees that any previous write to
>>>> WB memory will be globally visible before the write to UC memory. Therefore, there
>>>> is no need to also execute wmb() before write to doorbell which is mapped as UC memory.
>>>> 
>>>> The consideration regarding this between different architectures is handled
>>>> properly by the writeX() macro. Which is defined differently for different
>>>> architectures. E.g. On x86, it is just a memory write. However, on ARM, it
>>>> is defined as __iowmb() folowed by a memory write. __iowmb() is defined
>>>> as wmb().
>>> 
>>> This reasoning seems correct, though I would recommend directly
>>> refering to locking/memory-barriers.txt which explains this.
>> 
>> I find memory-barriers.txt not explicit enough on the semantics of writeX().
>> (For example: Should it flush write-combined buffers before writing to the UC memory?)
>> That’s why I preferred to explicitly state here how I perceive it.
> 
> AFAIK WC is largely unspecified by the memory model. Is wmb() even
> formally specified to interact with WC?

As I said, I haven’t seen such semantics defined in kernel documentation such as memory-barriers.txt.
However, in practice, it does flush WC buffers. At least for x86 and ARM which I’m familiar enough with.
I think it’s reasonable to assume that wmb() should flush WC buffers while dma_wmb()/smp_wmb() doesn’t necessarily have to do this.

But this is exactly the types of things that bothered me with memory-barriers.txt. That it doesn’t define semantics regarding these.

> 
> At least in this mlx5 case there is no WC, right?

Right. As I made sure to explicitly mention in commit message.
This is also why I haven’t yet also fixed a similar issue in mlx4 kernel driver that does use BlueFlame (WC memory).

> The kernel UAR is
> mapped UC?

Yes.

> 
> So we don't need to worry about the poor specification of WC access
> and you can refer to memory-barriers.txt at least for this patch.

Right.
I just gave it as a general example about that I wasn’t personally able to understand the exact intended semantics of writeX() by just reading memory-barriers.txt.
Anyway, sure I will rephrase the commit message to refer to memory-barriers.txt. No objection there.

> 
>>> 
>>>> Therefore, change mlx5_write64() to use writeX() and remove wmb() from
>>>> it's callers.
>>> 
>>> Yes, wmb(); writel(); is always redundant
>> 
>> Well, unfortunately not…
>> See: https://urldefense.proofpoint.com/v2/url?u=https-3A__marc.info_-3Fl-3Dlinux-2Dnetdev-26m-3D157798859215697-26w-3D2&d=DwIDaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0&m=Ox1lCS1KAGBvJrf24kiFQrranIaNi_zeo05sqCUEf7Y&s=Mz6MJzUQ862DGjgGnj3neX4ZpjI88nOI9KpZhNF9TqQ&e= 
>> (See my suggestion to add flush_wc_writeX())
> 
> Well, the last time wmb & writel came up Linus was pretty clear that
> writel is supposed to remain in program order and have the barriers
> needed to do that.

Right. But that doesn’t take into account that WC writes are considered completed when they are still posted in CPU WC buffers.
The semantics as I understand of writeX() is that it guarantees all prior writes have been completed.
It means that all prior stores have executed and that store-buffer is flushed. But it doesn’t mean that WC buffers is flushed as-well.

> 
> I don't think WC was considered when that discussion happened, but we
> really don't have a formal model for how WC works at all within the
> kernel.

Agree. I kinda assumed that wmb() have an extra semantics of also guaranteeing that WC buffers are flushed.
(Because in practice, I think it’s true for all cases. In contrast to using dma_wmb()/smp_wmb()).
This is why I think we are missing a flush_wc_writeX() macro as I specified in that commit message.

> 
> The above patch is really not a wmb(); writel() pairing, the wmb() is
> actually closing/serializing an earlier WC transaction, and yes you need various
> special things to keep WC working right.

Right.

> 
> IMHO you should start there before going around and adding/removing wmbs
> related to WC. Update membory-barriers.txt and related with the model
> ordering for WC access and get agreement.

I disagree here. It’s more important to fix a real bug (e.g. Not flushing WC buffers on x86 AMD).
Then, we can later formalise this and refactor code as necessary. Which will also optimise it as-well.
Bug fix can be merged before we finish all these discussions and get agreement.

I do completely agree we should have this discussion on WC and barriers and I already sent an
email on this to all the memory-barriers.txt maintainers. Waiting to see how that discussion go
and get community feedback before I will submit a patch-series that will introduce new changes
to memory-barriers.txt and probably also new barrier macro.

> 
> For instance does wmb() even effect WC? Does WC have to be contained
> by spinlocks? Do we need extra special barriers like flush_wc and
> flush_wc_before_spin_unlock ? etc.
> 
> Perhaps Will has some advice?

I also sent Will an email about this a few days ago. Thanks for Cc him. :)

> 
>>>> diff --git a/include/linux/mlx5/cq.h b/include/linux/mlx5/cq.h
>>>> index 40748fc1b11b..28744a725e64 100644
>>>> +++ b/include/linux/mlx5/cq.h
>>>> @@ -162,11 +162,6 @@ static inline void mlx5_cq_arm(struct mlx5_core_cq *cq, u32 cmd,
>>>> 
>>>> 	*cq->arm_db = cpu_to_be32(sn << 28 | cmd | ci);
>>>> 
>>>> -	/* Make sure that the doorbell record in host memory is
>>>> -	 * written before ringing the doorbell via PCI MMIO.
>>>> -	 */
>>>> -	wmb();
>>>> -
>>> 
>>> Why did this one change? The doorbell memory here is not a writel():
>> 
>> Well, it’s not seen in the diff but actually the full code is:
>> 
>>    /* Make sure that the doorbell record in host memory is
>>     * written before ringing the doorbell via PCI MMIO.
>>     */
>>    wmb();
>> 
>>    doorbell[0] = cpu_to_be32(sn << 28 | cmd | ci);
>>    doorbell[1] = cpu_to_be32(cq->cqn);
>> 
>>    mlx5_write64(doorbell, uar_page + MLX5_CQ_DOORBELL);
> 
> Ah OK, we have another thing called doorbell which is actually DMA'ble
> memory.

Yep.

> 
>>>> 	doorbell[0] = cpu_to_be32(sn << 28 | cmd | ci);
>>>> 	doorbell[1] = cpu_to_be32(cq->cqn);
>>> 
>>>> static inline void mlx5_write64(__be32 val[2], void __iomem *dest)
>>>> {
>>>> #if BITS_PER_LONG == 64
>>>> -	__raw_writeq(*(u64 *)val, dest);
>>>> +	writeq(*(u64 *)val, dest);
>>> 
>>> I want to say this might cause problems with endian swapping as writeq
>>> also does some swaps that __raw does not? Is this true?
>> 
>> Hmm... Looking at ARM64 version, writeq() indeed calls cpu_to_le64()
>> on parameter before passing it to __raw_writeq().  Quite surprising
>> from API perspective to be honest.
> 
> For PCI-E devices writel(x) is defined to generate the same TLP on the
> PCI-E bus, across all arches.

Good to know.
Question: Where is this documented?

> __raw_* does something arch specific and
> should not be called from drivers. It is a long standing bug that this
> code is written like this.

Agree. That’s what caught my eye to this in the first place.

> 
>> So should I change this instead to iowrite64be(*(u64 *)val, dest)?
> 
> This always made my head hurt, but IIRC, when I looked at it years ago
> the weird array construction caused problems with that simple conversion.
> 
> The userspace version looks like this now:
> 
>        uint64_t doorbell;
>        uint32_t sn;
>        uint32_t ci;
>        uint32_t cmd;
> 
>        sn  = cq->arm_sn & 3;
>        ci  = cq->cons_index & 0xffffff;
>        cmd = solicited ? MLX5_CQ_DB_REQ_NOT_SOL : MLX5_CQ_DB_REQ_NOT;
> 
>        doorbell = sn << 28 | cmd | ci;
>        doorbell <<= 32;
>        doorbell |= cq->cqn;
> 
>        mmio_write64_be(ctx->uar[0].reg + MLX5_CQ_DOORBELL, htobe64(doorbell));
> 
> Where on all supported platforms the mmio_write64_be() expands to a
> simple store (no swap)
> 
> Which does look functionally the same as
> 
>   iowrite64be(doorbell, dest);
> 
> So this patch should change the mlx5_write64 to accept a u64 like we
> did in userspace when this was all cleaned there.

If I understand you correctly, you suggest to change callers to pass here a standard u64 and then
modify mlx5_write64() to just call iowrite64be(). If so, I agree. Just want to confirm before sending v2.

Thanks for the insightful review,
-Liran

> 
> Jason


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] net: mlx5: Use writeX() to ring doorbell and remove reduntant wmb()
  2020-01-02 22:21       ` Liran Alon
@ 2020-01-03 13:37         ` Jason Gunthorpe
  2020-01-03 16:31           ` Liran Alon
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2020-01-03 13:37 UTC (permalink / raw)
  To: Liran Alon
  Cc: Will Deacon, saeedm, leon, netdev, linux-rdma, eli, tariqt,
	danielm, Håkon Bugge

On Fri, Jan 03, 2020 at 12:21:06AM +0200, Liran Alon wrote:

> > AFAIK WC is largely unspecified by the memory model. Is wmb() even
> > formally specified to interact with WC?
> 
> As I said, I haven’t seen such semantics defined in kernel
> documentation such as memory-barriers.txt.  However, in practice, it
> does flush WC buffers. At least for x86 and ARM which I’m familiar
> enough with.  I think it’s reasonable to assume that wmb() should
> flush WC buffers while dma_wmb()/smp_wmb() doesn’t necessarily have
> to do this.

It is because WC is rarely used and laregly undefined for the kernel
:(

> >>>> Therefore, change mlx5_write64() to use writeX() and remove wmb() from
> >>>> it's callers.
> >>> 
> >>> Yes, wmb(); writel(); is always redundant
> >> 
> >> Well, unfortunately not…
> >> See: https://urldefense.proofpoint.com/v2/url?u=https-3A__marc.info_-3Fl-3Dlinux-2Dnetdev-26m-3D157798859215697-26w-3D2&d=DwIDaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0&m=Ox1lCS1KAGBvJrf24kiFQrranIaNi_zeo05sqCUEf7Y&s=Mz6MJzUQ862DGjgGnj3neX4ZpjI88nOI9KpZhNF9TqQ&e=
> >> (See my suggestion to add flush_wc_writeX())
> > 
> > Well, the last time wmb & writel came up Linus was pretty clear that
> > writel is supposed to remain in program order and have the barriers
> > needed to do that.
> 
> Right. But that doesn’t take into account that WC writes are
> considered completed when they are still posted in CPU WC buffers.

> The semantics as I understand of writeX() is that it guarantees all
> prior writes have been completed.  It means that all prior stores
> have executed and that store-buffer is flushed. But it doesn’t mean
> that WC buffers is flushed as-well.

The semantic for writel is that prior program order stores will be
observable by DMA from the device receiving the writel. This is
required for UC and NC stores today. WC is undefined, I think.

This is why ARM has the additional barrier in writel.

It would logically make sense if WC followed the same rule, however,
adding a barrier to writel to make WC ordered would not be popular, so
I think we are left with using special accessors for WC and placing
the barrier there..

> > IMHO you should start there before going around and adding/removing wmbs
> > related to WC. Update membory-barriers.txt and related with the model
> > ordering for WC access and get agreement.
> 
> I disagree here. It’s more important to fix a real bug (e.g. Not
> flushing WC buffers on x86 AMD).  Then, we can later formalise this
> and refactor code as necessary. Which will also optimise it as-well.
> Bug fix can be merged before we finish all these discussions and get
> agreement.

Is it a real bug that people actually hit? It wasn't clear from the
commit message. If so, sure, it should be fixed and the commit message
clarified. (but I'd put the wmb near the WC writes..)

I am surprised that AMD is different here, the evolution of the WC
feature on x86 was to transparently speed up graphics, so I'm pretty
surprised AMD can get away with not ordering the same as Intel..

> I do completely agree we should have this discussion on WC and
> barriers and I already sent an email on this to all the
> memory-barriers.txt maintainers. Waiting to see how that discussion
> go and get community feedback before I will submit a patch-series
> that will introduce new changes to memory-barriers.txt and probably
> also new barrier macro.

The barrier macros have been unpopular, ie the confusing mmiowb was
dumped, and I strongly suspect to contain WC within a spinlock (which
is very important!) you need a barrier instruction on some archs.

New accessors might work better.

> >>>> 	doorbell[0] = cpu_to_be32(sn << 28 | cmd | ci);
> >>>> 	doorbell[1] = cpu_to_be32(cq->cqn);
> >>> 
> >>>> static inline void mlx5_write64(__be32 val[2], void __iomem *dest)
> >>>> {
> >>>> #if BITS_PER_LONG == 64
> >>>> -	__raw_writeq(*(u64 *)val, dest);
> >>>> +	writeq(*(u64 *)val, dest);
> >>> 
> >>> I want to say this might cause problems with endian swapping as writeq
> >>> also does some swaps that __raw does not? Is this true?
> >> 
> >> Hmm... Looking at ARM64 version, writeq() indeed calls cpu_to_le64()
> >> on parameter before passing it to __raw_writeq().  Quite surprising
> >> from API perspective to be honest.
> > 
> > For PCI-E devices writel(x) is defined to generate the same TLP on the
> > PCI-E bus, across all arches.
> 
> Good to know.
> Question: Where is this documented?

Hmm, haven't ever seen it documented like that. It is sort of a
logical requirement. If writel(x) produces different TLPs (ie
different byte order) how could a driver ever work with a PCI-E device
that requires only one TLP for x?

> >> So should I change this instead to iowrite64be(*(u64 *)val, dest)?
> > 
> > This always made my head hurt, but IIRC, when I looked at it years ago
> > the weird array construction caused problems with that simple conversion.
> > 
> > The userspace version looks like this now:
> > 
> >        uint64_t doorbell;
> >        uint32_t sn;
> >        uint32_t ci;
> >        uint32_t cmd;
> > 
> >        sn  = cq->arm_sn & 3;
> >        ci  = cq->cons_index & 0xffffff;
> >        cmd = solicited ? MLX5_CQ_DB_REQ_NOT_SOL : MLX5_CQ_DB_REQ_NOT;
> > 
> >        doorbell = sn << 28 | cmd | ci;
> >        doorbell <<= 32;
> >        doorbell |= cq->cqn;
> > 
> >        mmio_write64_be(ctx->uar[0].reg + MLX5_CQ_DOORBELL, htobe64(doorbell));
> > 
> > Where on all supported platforms the mmio_write64_be() expands to a
> > simple store (no swap)
> > 
> > Which does look functionally the same as
> > 
> >   iowrite64be(doorbell, dest);
> > 
> > So this patch should change the mlx5_write64 to accept a u64 like we
> > did in userspace when this was all cleaned there.
> 
> If I understand you correctly, you suggest to change callers to pass
> here a standard u64 and then modify mlx5_write64() to just call
> iowrite64be(). If so, I agree. Just want to confirm before sending
> v2.

Yes, this is what I did to userspace and it made this all make
sense. I strongly recommend to do the same in the kernel, particularly
now that we have iowrite64be()!

Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] net: mlx5: Use writeX() to ring doorbell and remove reduntant wmb()
  2020-01-03 13:37         ` Jason Gunthorpe
@ 2020-01-03 16:31           ` Liran Alon
  2020-01-03 16:36             ` Jason Gunthorpe
  2020-01-03 18:38             ` Liran Alon
  0 siblings, 2 replies; 11+ messages in thread
From: Liran Alon @ 2020-01-03 16:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Will Deacon, saeedm, leon, netdev, linux-rdma, eli, tariqt,
	danielm, Håkon Bugge



> On 3 Jan 2020, at 15:37, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Fri, Jan 03, 2020 at 12:21:06AM +0200, Liran Alon wrote:
> 
>>> AFAIK WC is largely unspecified by the memory model. Is wmb() even
>>> formally specified to interact with WC?
>> 
>> As I said, I haven’t seen such semantics defined in kernel
>> documentation such as memory-barriers.txt.  However, in practice, it
>> does flush WC buffers. At least for x86 and ARM which I’m familiar
>> enough with.  I think it’s reasonable to assume that wmb() should
>> flush WC buffers while dma_wmb()/smp_wmb() doesn’t necessarily have
>> to do this.
> 
> It is because WC is rarely used and laregly undefined for the kernel
> :(

Yep.

> 
>>>>>> Therefore, change mlx5_write64() to use writeX() and remove wmb() from
>>>>>> it's callers.
>>>>> 
>>>>> Yes, wmb(); writel(); is always redundant
>>>> 
>>>> Well, unfortunately not…
>>>> See: https://urldefense.proofpoint.com/v2/url?u=https-3A__marc.info_-3Fl-3Dlinux-2Dnetdev-26m-3D157798859215697-26w-3D2&d=DwIDaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0&m=Ox1lCS1KAGBvJrf24kiFQrranIaNi_zeo05sqCUEf7Y&s=Mz6MJzUQ862DGjgGnj3neX4ZpjI88nOI9KpZhNF9TqQ&e=
>>>> (See my suggestion to add flush_wc_writeX())
>>> 
>>> Well, the last time wmb & writel came up Linus was pretty clear that
>>> writel is supposed to remain in program order and have the barriers
>>> needed to do that.
>> 
>> Right. But that doesn’t take into account that WC writes are
>> considered completed when they are still posted in CPU WC buffers.
> 
>> The semantics as I understand of writeX() is that it guarantees all
>> prior writes have been completed.  It means that all prior stores
>> have executed and that store-buffer is flushed. But it doesn’t mean
>> that WC buffers is flushed as-well.
> 
> The semantic for writel is that prior program order stores will be
> observable by DMA from the device receiving the writel. This is
> required for UC and NC stores today. WC is undefined, I think.
> 
> This is why ARM has the additional barrier in writel.

Yep.

> 
> It would logically make sense if WC followed the same rule, however,
> adding a barrier to writel to make WC ordered would not be popular, so
> I think we are left with using special accessors for WC and placing
> the barrier there..

Right.

> 
>>> IMHO you should start there before going around and adding/removing wmbs
>>> related to WC. Update membory-barriers.txt and related with the model
>>> ordering for WC access and get agreement.
>> 
>> I disagree here. It’s more important to fix a real bug (e.g. Not
>> flushing WC buffers on x86 AMD).  Then, we can later formalise this
>> and refactor code as necessary. Which will also optimise it as-well.
>> Bug fix can be merged before we finish all these discussions and get
>> agreement.
> 
> Is it a real bug that people actually hit? It wasn't clear from the
> commit message. If so, sure, it should be fixed and the commit message
> clarified. (but I'd put the wmb near the WC writes..)

I found this bug during code review. I’m not aware if AWS saw this bug happening in production.
But according to AMD SDM and Optimization Guide SDM, this is a bug.

I think it doesn’t happen in practice because the write of the Tx descriptor + 128 first bytes of packet
Effectively fills the relevant WC buffers and when a WC buffer is fully written to, the CPU *should*
(Not *must*) flush the WC buffer to memory.

Having said that, I (and AWS maintainers of this code which I spoke with them about this) still think
that we should first apply this patch and later refactor this code when we will introduce the proper
memory accessor (flush_wc_writeX()).

But I would be glad to hear thoughts from memory-barriers.txt maintainers such as Will on this.

> 
> I am surprised that AMD is different here, the evolution of the WC
> feature on x86 was to transparently speed up graphics, so I'm pretty
> surprised AMD can get away with not ordering the same as Intel..

Completely agree. I was very surprised to see this from AMD SDM and Optimization Guide SDM.
It made sense to me too that graphics frame buffer is written to WC memory and then is committed
to GPU by writing to some doorbell register mapped as UC memory.

My personal theory is that the difference is rooted from the fact that AMD use dedicated CPU registers
for WC buffers in contrast to Intel which utilise LFB entries for that (In addition to their standard role).
But this is a pure guess of course. :)

> 
>> I do completely agree we should have this discussion on WC and
>> barriers and I already sent an email on this to all the
>> memory-barriers.txt maintainers. Waiting to see how that discussion
>> go and get community feedback before I will submit a patch-series
>> that will introduce new changes to memory-barriers.txt and probably
>> also new barrier macro.
> 
> The barrier macros have been unpopular, ie the confusing mmiowb was
> dumped, and I strongly suspect to contain WC within a spinlock (which
> is very important!) you need a barrier instruction on some archs.
> 
> New accessors might work better.

We agree. I just use wrong terminology :)
I meant we should probably introduce something as flush_wc_writeX().

> 
>>>>>> 	doorbell[0] = cpu_to_be32(sn << 28 | cmd | ci);
>>>>>> 	doorbell[1] = cpu_to_be32(cq->cqn);
>>>>> 
>>>>>> static inline void mlx5_write64(__be32 val[2], void __iomem *dest)
>>>>>> {
>>>>>> #if BITS_PER_LONG == 64
>>>>>> -	__raw_writeq(*(u64 *)val, dest);
>>>>>> +	writeq(*(u64 *)val, dest);
>>>>> 
>>>>> I want to say this might cause problems with endian swapping as writeq
>>>>> also does some swaps that __raw does not? Is this true?
>>>> 
>>>> Hmm... Looking at ARM64 version, writeq() indeed calls cpu_to_le64()
>>>> on parameter before passing it to __raw_writeq().  Quite surprising
>>>> from API perspective to be honest.
>>> 
>>> For PCI-E devices writel(x) is defined to generate the same TLP on the
>>> PCI-E bus, across all arches.
>> 
>> Good to know.
>> Question: Where is this documented?
> 
> Hmm, haven't ever seen it documented like that. It is sort of a
> logical requirement. If writel(x) produces different TLPs (ie
> different byte order) how could a driver ever work with a PCI-E device
> that requires only one TLP for x?

Makes sense indeed.

> 
>>>> So should I change this instead to iowrite64be(*(u64 *)val, dest)?
>>> 
>>> This always made my head hurt, but IIRC, when I looked at it years ago
>>> the weird array construction caused problems with that simple conversion.
>>> 
>>> The userspace version looks like this now:
>>> 
>>>       uint64_t doorbell;
>>>       uint32_t sn;
>>>       uint32_t ci;
>>>       uint32_t cmd;
>>> 
>>>       sn  = cq->arm_sn & 3;
>>>       ci  = cq->cons_index & 0xffffff;
>>>       cmd = solicited ? MLX5_CQ_DB_REQ_NOT_SOL : MLX5_CQ_DB_REQ_NOT;
>>> 
>>>       doorbell = sn << 28 | cmd | ci;
>>>       doorbell <<= 32;
>>>       doorbell |= cq->cqn;
>>> 
>>>       mmio_write64_be(ctx->uar[0].reg + MLX5_CQ_DOORBELL, htobe64(doorbell));
>>> 
>>> Where on all supported platforms the mmio_write64_be() expands to a
>>> simple store (no swap)
>>> 
>>> Which does look functionally the same as
>>> 
>>>  iowrite64be(doorbell, dest);
>>> 
>>> So this patch should change the mlx5_write64 to accept a u64 like we
>>> did in userspace when this was all cleaned there.
>> 
>> If I understand you correctly, you suggest to change callers to pass
>> here a standard u64 and then modify mlx5_write64() to just call
>> iowrite64be(). If so, I agree. Just want to confirm before sending
>> v2.
> 
> Yes, this is what I did to userspace and it made this all make
> sense. I strongly recommend to do the same in the kernel, particularly
> now that we have iowrite64be()!

Ack.

Thanks,
-Liran

> 
> Jason


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] net: mlx5: Use writeX() to ring doorbell and remove reduntant wmb()
  2020-01-03 16:31           ` Liran Alon
@ 2020-01-03 16:36             ` Jason Gunthorpe
  2020-01-03 16:42               ` Liran Alon
  2020-01-03 18:38             ` Liran Alon
  1 sibling, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2020-01-03 16:36 UTC (permalink / raw)
  To: Liran Alon
  Cc: Will Deacon, saeedm, leon, netdev, linux-rdma, eli, tariqt,
	danielm, Håkon Bugge

On Fri, Jan 03, 2020 at 06:31:18PM +0200, Liran Alon wrote:

> > I am surprised that AMD is different here, the evolution of the WC
> > feature on x86 was to transparently speed up graphics, so I'm pretty
> > surprised AMD can get away with not ordering the same as Intel..
> 
> Completely agree. I was very surprised to see this from AMD SDM and
> Optimization Guide SDM.  It made sense to me too that graphics frame
> buffer is written to WC memory and then is committed to GPU by
> writing to some doorbell register mapped as UC memory.

It is possible this manual is wrong or misleading?

Having WC writes not strongly order after UC writes to the same
device, on x86, seems very, very surprising to me. Everything is
ordered on x86 :)

Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] net: mlx5: Use writeX() to ring doorbell and remove reduntant wmb()
  2020-01-03 16:36             ` Jason Gunthorpe
@ 2020-01-03 16:42               ` Liran Alon
  0 siblings, 0 replies; 11+ messages in thread
From: Liran Alon @ 2020-01-03 16:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Will Deacon, saeedm, leon, netdev, linux-rdma, eli, tariqt,
	danielm, Håkon Bugge



> On 3 Jan 2020, at 18:36, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Fri, Jan 03, 2020 at 06:31:18PM +0200, Liran Alon wrote:
> 
>>> I am surprised that AMD is different here, the evolution of the WC
>>> feature on x86 was to transparently speed up graphics, so I'm pretty
>>> surprised AMD can get away with not ordering the same as Intel..
>> 
>> Completely agree. I was very surprised to see this from AMD SDM and
>> Optimization Guide SDM.  It made sense to me too that graphics frame
>> buffer is written to WC memory and then is committed to GPU by
>> writing to some doorbell register mapped as UC memory.
> 
> It is possible this manual is wrong or misleading?
> 
> Having WC writes not strongly order after UC writes to the same
> device, on x86, seems very, very surprising to me. Everything is
> ordered on x86 :)
> 
> Jason

I thought so the same at first. This is why I checked both AMD SDM and AMD Optimisation Guide
and made sure to quote relevant section this document. I will be glad to be corrected that’s solely
a mistake. But it seems very intentional and explicitly documented in multiple places.

Also note that WC memory is considered weakly-ordered in x86. E.g. non-temporal stores that
appear in program order after a previous store to WB memory, can complete before the store to
WB memory. In addition, a store to WC memory is considered complete once the stored data reach
the WC buffer, where it’s not globally visible. In contrast to stores to WB/UC memory that are globally
visible once they are complete (In contrast to retired).

But again, I agree this is very surprising and unexpected… To have a single arch have different caching
behaviour that needs to be considered based on CPU vendor...

-Liran


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] net: mlx5: Use writeX() to ring doorbell and remove reduntant wmb()
  2020-01-03 16:31           ` Liran Alon
  2020-01-03 16:36             ` Jason Gunthorpe
@ 2020-01-03 18:38             ` Liran Alon
  2020-01-03 19:06               ` Jason Gunthorpe
  1 sibling, 1 reply; 11+ messages in thread
From: Liran Alon @ 2020-01-03 18:38 UTC (permalink / raw)
  To: Liran Alon
  Cc: Jason Gunthorpe, Will Deacon, saeedm, leon, netdev, linux-rdma,
	eli, tariqt, danielm, Håkon Bugge



> On 3 Jan 2020, at 18:31, Liran Alon <liran.alon@oracle.com> wrote:
> 
> 
> 
>> On 3 Jan 2020, at 15:37, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> 
>> On Fri, Jan 03, 2020 at 12:21:06AM +0200, Liran Alon wrote:
>> 
>>>> AFAIK WC is largely unspecified by the memory model. Is wmb() even
>>>> formally specified to interact with WC?
>>> 
>>> As I said, I haven’t seen such semantics defined in kernel
>>> documentation such as memory-barriers.txt.  However, in practice, it
>>> does flush WC buffers. At least for x86 and ARM which I’m familiar
>>> enough with.  I think it’s reasonable to assume that wmb() should
>>> flush WC buffers while dma_wmb()/smp_wmb() doesn’t necessarily have
>>> to do this.
>> 
>> It is because WC is rarely used and laregly undefined for the kernel
>> :(
> 
> Yep.
> 
>> 
>>>>>>> Therefore, change mlx5_write64() to use writeX() and remove wmb() from
>>>>>>> it's callers.
>>>>>> 
>>>>>> Yes, wmb(); writel(); is always redundant
>>>>> 
>>>>> Well, unfortunately not…
>>>>> See: https://urldefense.proofpoint.com/v2/url?u=https-3A__marc.info_-3Fl-3Dlinux-2Dnetdev-26m-3D157798859215697-26w-3D2&d=DwIDaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0&m=Ox1lCS1KAGBvJrf24kiFQrranIaNi_zeo05sqCUEf7Y&s=Mz6MJzUQ862DGjgGnj3neX4ZpjI88nOI9KpZhNF9TqQ&e=
>>>>> (See my suggestion to add flush_wc_writeX())
>>>> 
>>>> Well, the last time wmb & writel came up Linus was pretty clear that
>>>> writel is supposed to remain in program order and have the barriers
>>>> needed to do that.
>>> 
>>> Right. But that doesn’t take into account that WC writes are
>>> considered completed when they are still posted in CPU WC buffers.
>> 
>>> The semantics as I understand of writeX() is that it guarantees all
>>> prior writes have been completed.  It means that all prior stores
>>> have executed and that store-buffer is flushed. But it doesn’t mean
>>> that WC buffers is flushed as-well.
>> 
>> The semantic for writel is that prior program order stores will be
>> observable by DMA from the device receiving the writel. This is
>> required for UC and NC stores today. WC is undefined, I think.
>> 
>> This is why ARM has the additional barrier in writel.
> 
> Yep.
> 
>> 
>> It would logically make sense if WC followed the same rule, however,
>> adding a barrier to writel to make WC ordered would not be popular, so
>> I think we are left with using special accessors for WC and placing
>> the barrier there..
> 
> Right.
> 
>> 
>>>> IMHO you should start there before going around and adding/removing wmbs
>>>> related to WC. Update membory-barriers.txt and related with the model
>>>> ordering for WC access and get agreement.
>>> 
>>> I disagree here. It’s more important to fix a real bug (e.g. Not
>>> flushing WC buffers on x86 AMD).  Then, we can later formalise this
>>> and refactor code as necessary. Which will also optimise it as-well.
>>> Bug fix can be merged before we finish all these discussions and get
>>> agreement.
>> 
>> Is it a real bug that people actually hit? It wasn't clear from the
>> commit message. If so, sure, it should be fixed and the commit message
>> clarified. (but I'd put the wmb near the WC writes..)
> 
> I found this bug during code review. I’m not aware if AWS saw this bug happening in production.
> But according to AMD SDM and Optimization Guide SDM, this is a bug.
> 
> I think it doesn’t happen in practice because the write of the Tx descriptor + 128 first bytes of packet
> Effectively fills the relevant WC buffers and when a WC buffer is fully written to, the CPU *should*
> (Not *must*) flush the WC buffer to memory.

Actually after re-reading AMD Optimization Guide SDM, I see it is guaranteed that:
“Write-combining is closed if all 64 bytes of the write buffer are valid”.
And this is indeed always the case for AWS ENA LLQ. Because as can be seen at
ena_com_config_llq_info(), desc_list_entry_size is either 128, 192 or 256. i.e. Always
a multiple of 64 bytes. So this explains why this wasn’t an issue in production.

-Liran



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] net: mlx5: Use writeX() to ring doorbell and remove reduntant wmb()
  2020-01-03 18:38             ` Liran Alon
@ 2020-01-03 19:06               ` Jason Gunthorpe
  0 siblings, 0 replies; 11+ messages in thread
From: Jason Gunthorpe @ 2020-01-03 19:06 UTC (permalink / raw)
  To: Liran Alon
  Cc: Will Deacon, saeedm, leon, netdev, linux-rdma, eli, tariqt,
	danielm, Håkon Bugge

On Fri, Jan 03, 2020 at 08:38:51PM +0200, Liran Alon wrote:

> Actually after re-reading AMD Optimization Guide SDM, I see it is
> guaranteed that: “Write-combining is closed if all 64 bytes of the
> write buffer are valid”.  And this is indeed always the case for AWS
> ENA LLQ. Because as can be seen at ena_com_config_llq_info(),
> desc_list_entry_size is either 128, 192 or 256. i.e. Always a
> multiple of 64 bytes. So this explains why this wasn’t an issue in
> production.

That makes a lot of sense

So, the problem with defining WC in the kernel memory model still
remains, but it is back to a theoretical one.

Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-01-03 19:06 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-02 17:44 [PATCH] net: mlx5: Use writeX() to ring doorbell and remove reduntant wmb() Liran Alon
2020-01-02 19:29 ` Jason Gunthorpe
2020-01-02 19:45   ` Liran Alon
2020-01-02 20:58     ` Jason Gunthorpe
2020-01-02 22:21       ` Liran Alon
2020-01-03 13:37         ` Jason Gunthorpe
2020-01-03 16:31           ` Liran Alon
2020-01-03 16:36             ` Jason Gunthorpe
2020-01-03 16:42               ` Liran Alon
2020-01-03 18:38             ` Liran Alon
2020-01-03 19:06               ` Jason Gunthorpe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.