Re: [PATCH] net: mlx5: Use writeX() to ring doorbell and remove reduntant wmb()

From: Liran Alon <liran.alon@oracle.com>
To: Jason Gunthorpe <jgg@ziepe.ca>
Cc: "Will Deacon" <will@kernel.org>,
	saeedm@mellanox.com, leon@kernel.org, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, eli@mellanox.com,
	tariqt@mellanox.com, danielm@mellanox.com,
	"Håkon Bugge" <haakon.bugge@oracle.com>
Subject: Re: [PATCH] net: mlx5: Use writeX() to ring doorbell and remove reduntant wmb()
Date: Fri, 3 Jan 2020 18:31:18 +0200	[thread overview]
Message-ID: <F7C45792-2F17-42AE-88A2-F744EEAD68A5@oracle.com> (raw)
In-Reply-To: <20200103133749.GA9706@ziepe.ca>

> On 3 Jan 2020, at 15:37, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Fri, Jan 03, 2020 at 12:21:06AM +0200, Liran Alon wrote:
> 
>>> AFAIK WC is largely unspecified by the memory model. Is wmb() even
>>> formally specified to interact with WC?
>> 
>> As I said, I haven’t seen such semantics defined in kernel
>> documentation such as memory-barriers.txt.  However, in practice, it
>> does flush WC buffers. At least for x86 and ARM which I’m familiar
>> enough with.  I think it’s reasonable to assume that wmb() should
>> flush WC buffers while dma_wmb()/smp_wmb() doesn’t necessarily have
>> to do this.
> 
> It is because WC is rarely used and laregly undefined for the kernel
> :(

Yep.

> 
>>>>>> Therefore, change mlx5_write64() to use writeX() and remove wmb() from
>>>>>> it's callers.
>>>>> 
>>>>> Yes, wmb(); writel(); is always redundant
>>>> 
>>>> Well, unfortunately not…
>>>> See: https://urldefense.proofpoint.com/v2/url?u=https-3A__marc.info_-3Fl-3Dlinux-2Dnetdev-26m-3D157798859215697-26w-3D2&d=DwIDaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0&m=Ox1lCS1KAGBvJrf24kiFQrranIaNi_zeo05sqCUEf7Y&s=Mz6MJzUQ862DGjgGnj3neX4ZpjI88nOI9KpZhNF9TqQ&e=
>>>> (See my suggestion to add flush_wc_writeX())
>>> 
>>> Well, the last time wmb & writel came up Linus was pretty clear that
>>> writel is supposed to remain in program order and have the barriers
>>> needed to do that.
>> 
>> Right. But that doesn’t take into account that WC writes are
>> considered completed when they are still posted in CPU WC buffers.
> 
>> The semantics as I understand of writeX() is that it guarantees all
>> prior writes have been completed.  It means that all prior stores
>> have executed and that store-buffer is flushed. But it doesn’t mean
>> that WC buffers is flushed as-well.
> 
> The semantic for writel is that prior program order stores will be
> observable by DMA from the device receiving the writel. This is
> required for UC and NC stores today. WC is undefined, I think.
> 
> This is why ARM has the additional barrier in writel.

Yep.

> 
> It would logically make sense if WC followed the same rule, however,
> adding a barrier to writel to make WC ordered would not be popular, so
> I think we are left with using special accessors for WC and placing
> the barrier there..

Right.

> 
>>> IMHO you should start there before going around and adding/removing wmbs
>>> related to WC. Update membory-barriers.txt and related with the model
>>> ordering for WC access and get agreement.
>> 
>> I disagree here. It’s more important to fix a real bug (e.g. Not
>> flushing WC buffers on x86 AMD).  Then, we can later formalise this
>> and refactor code as necessary. Which will also optimise it as-well.
>> Bug fix can be merged before we finish all these discussions and get
>> agreement.
> 
> Is it a real bug that people actually hit? It wasn't clear from the
> commit message. If so, sure, it should be fixed and the commit message
> clarified. (but I'd put the wmb near the WC writes..)

I found this bug during code review. I’m not aware if AWS saw this bug happening in production.
But according to AMD SDM and Optimization Guide SDM, this is a bug.

I think it doesn’t happen in practice because the write of the Tx descriptor + 128 first bytes of packet
Effectively fills the relevant WC buffers and when a WC buffer is fully written to, the CPU *should*
(Not *must*) flush the WC buffer to memory.

Having said that, I (and AWS maintainers of this code which I spoke with them about this) still think
that we should first apply this patch and later refactor this code when we will introduce the proper
memory accessor (flush_wc_writeX()).

But I would be glad to hear thoughts from memory-barriers.txt maintainers such as Will on this.

> 
> I am surprised that AMD is different here, the evolution of the WC
> feature on x86 was to transparently speed up graphics, so I'm pretty
> surprised AMD can get away with not ordering the same as Intel..

Completely agree. I was very surprised to see this from AMD SDM and Optimization Guide SDM.
It made sense to me too that graphics frame buffer is written to WC memory and then is committed
to GPU by writing to some doorbell register mapped as UC memory.

My personal theory is that the difference is rooted from the fact that AMD use dedicated CPU registers
for WC buffers in contrast to Intel which utilise LFB entries for that (In addition to their standard role).
But this is a pure guess of course. :)

> 
>> I do completely agree we should have this discussion on WC and
>> barriers and I already sent an email on this to all the
>> memory-barriers.txt maintainers. Waiting to see how that discussion
>> go and get community feedback before I will submit a patch-series
>> that will introduce new changes to memory-barriers.txt and probably
>> also new barrier macro.
> 
> The barrier macros have been unpopular, ie the confusing mmiowb was
> dumped, and I strongly suspect to contain WC within a spinlock (which
> is very important!) you need a barrier instruction on some archs.
> 
> New accessors might work better.

We agree. I just use wrong terminology :)
I meant we should probably introduce something as flush_wc_writeX().

> 
>>>>>> 	doorbell[0] = cpu_to_be32(sn << 28 | cmd | ci);
>>>>>> 	doorbell[1] = cpu_to_be32(cq->cqn);
>>>>> 
>>>>>> static inline void mlx5_write64(__be32 val[2], void __iomem *dest)
>>>>>> {
>>>>>> #if BITS_PER_LONG == 64
>>>>>> -	__raw_writeq(*(u64 *)val, dest);
>>>>>> +	writeq(*(u64 *)val, dest);
>>>>> 
>>>>> I want to say this might cause problems with endian swapping as writeq
>>>>> also does some swaps that __raw does not? Is this true?
>>>> 
>>>> Hmm... Looking at ARM64 version, writeq() indeed calls cpu_to_le64()
>>>> on parameter before passing it to __raw_writeq().  Quite surprising
>>>> from API perspective to be honest.
>>> 
>>> For PCI-E devices writel(x) is defined to generate the same TLP on the
>>> PCI-E bus, across all arches.
>> 
>> Good to know.
>> Question: Where is this documented?
> 
> Hmm, haven't ever seen it documented like that. It is sort of a
> logical requirement. If writel(x) produces different TLPs (ie
> different byte order) how could a driver ever work with a PCI-E device
> that requires only one TLP for x?

Makes sense indeed.

> 
>>>> So should I change this instead to iowrite64be(*(u64 *)val, dest)?
>>> 
>>> This always made my head hurt, but IIRC, when I looked at it years ago
>>> the weird array construction caused problems with that simple conversion.
>>> 
>>> The userspace version looks like this now:
>>> 
>>>       uint64_t doorbell;
>>>       uint32_t sn;
>>>       uint32_t ci;
>>>       uint32_t cmd;
>>> 
>>>       sn  = cq->arm_sn & 3;
>>>       ci  = cq->cons_index & 0xffffff;
>>>       cmd = solicited ? MLX5_CQ_DB_REQ_NOT_SOL : MLX5_CQ_DB_REQ_NOT;
>>> 
>>>       doorbell = sn << 28 | cmd | ci;
>>>       doorbell <<= 32;
>>>       doorbell |= cq->cqn;
>>> 
>>>       mmio_write64_be(ctx->uar[0].reg + MLX5_CQ_DOORBELL, htobe64(doorbell));
>>> 
>>> Where on all supported platforms the mmio_write64_be() expands to a
>>> simple store (no swap)
>>> 
>>> Which does look functionally the same as
>>> 
>>>  iowrite64be(doorbell, dest);
>>> 
>>> So this patch should change the mlx5_write64 to accept a u64 like we
>>> did in userspace when this was all cleaned there.
>> 
>> If I understand you correctly, you suggest to change callers to pass
>> here a standard u64 and then modify mlx5_write64() to just call
>> iowrite64be(). If so, I agree. Just want to confirm before sending
>> v2.
> 
> Yes, this is what I did to userspace and it made this all make
> sense. I strongly recommend to do the same in the kernel, particularly
> now that we have iowrite64be()!

Ack.

Thanks,
-Liran

> 
> Jason