Re: [PATCH 3/3] net: macb: add missing barriers when reading buffers

From: Anssi Hannula <anssi.hannula@bitwise.fi>
To: Claudiu.Beznea@microchip.com
Cc: Nicolas.Ferre@microchip.com, davem@davemloft.net, netdev@vger.kernel.org
Subject: Re: [PATCH 3/3] net: macb: add missing barriers when reading buffers
Date: Wed, 12 Dec 2018 13:27:14 +0200	[thread overview]
Message-ID: <70b55891-1bc4-be1a-1e45-09f073d3e021@bitwise.fi> (raw)
In-Reply-To: <0212fb04-8c2d-264c-a459-6d579a56507e@microchip.com>

On 12.12.2018 12:58, Claudiu.Beznea@microchip.com wrote:
>
> On 11.12.2018 15:21, Anssi Hannula wrote:
>> On 10.12.2018 12:34, Claudiu.Beznea@microchip.com wrote:
>>> On 07.12.2018 14:00, Anssi Hannula wrote:
>>>> On 6.12.2018 16:14, Claudiu.Beznea@microchip.com wrote:
>>>>> Hi Anssi,
>>>> Hi!
>>>>
>>>>> On 05.12.2018 16:00, Anssi Hannula wrote:
>>>>>> On 5.12.2018 14:37, Claudiu.Beznea@microchip.com wrote:
>>>>>>> On 30.11.2018 20:21, Anssi Hannula wrote:
>>>>>>>> When reading buffer descriptors on RX or on TX completion, an
>>>>>>>> RX_USED/TX_USED bit is checked first to ensure that the descriptor has
>>>>>>>> been populated. However, there are no memory barriers to ensure that the
>>>>>>>> data protected by the RX_USED/TX_USED bit is up-to-date with respect to
>>>>>>>> that bit.
>>>>>>>>
>>>>>>>> Fix that by adding DMA read memory barriers on those paths.
>>>>>>>>
>>>>>>>> I did not observe any actual issues caused by these being missing,
>>>>>>>> though.
>>>>>>>>
>>>>>>>> Tested on a ZynqMP based system.
>>>>>>>>
>>>>>>>> Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
>>>>>>>> Fixes: 89e5785fc8a6 ("[PATCH] Atmel MACB ethernet driver")
>>>>>>>> Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
>>>>>>>> ---
>>>>>>>>  drivers/net/ethernet/cadence/macb_main.c | 20 ++++++++++++++++----
>>>>>>>>  1 file changed, 16 insertions(+), 4 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
>>>>>>>> index 430b7a0f5436..c93baa8621d5 100644
>>>>>>>> --- a/drivers/net/ethernet/cadence/macb_main.c
>>>>>>>> +++ b/drivers/net/ethernet/cadence/macb_main.c
>>>>>>>> @@ -861,6 +861,11 @@ static void macb_tx_interrupt(struct macb_queue *queue)
>>>>>>>>  
>>>>>>>>  			/* First, update TX stats if needed */
>>>>>>>>  			if (skb) {
>>>>>>>> +				/* Ensure all of desc is at least as up-to-date
>>>>>>>> +				 * as ctrl (TX_USED bit)
>>>>>>>> +				 */
>>>>>>>> +				dma_rmb();
>>>>>>>> +
>>>>>>> Is this necessary? Wouldn't previous rmb() take care of this? At this time
>>>>>>> data specific to this descriptor was completed. The TX descriptors for next
>>>>>>> data to be send is updated under a locked spinlock.
>>>>>> The previous rmb() is before the TX_USED check, so my understanding is
>>>>>> that the following could happen in theory:
>>>>> We are using this IP on and ARM architecture, so, with regards to rmb(), I
>>>>> understand from [1] that dsb completes when:
>>>>> "All explicit memory accesses before this instruction complete.
>>>>> All Cache, Branch predictor and TLB maintenance operations before this
>>>>> instruction complete."
>>>>>
>>>>>> 1. rmb().
>>>>> According to [1] this should end after all previous instructions (loads,
>>>>> stores) ends.
>>>>>
>>>>>> 2. Reads are reordered so that TX timestamp is read first - no barriers
>>>>>> are crossed.
>>>>> But, as per [1], no onward instruction will be reached until all
>>>>> instruction prior to dsb ends, so, after rmb() all descriptor's members
>>>>> should be updated, right?
>>>> The descriptor that triggered the TX interrupt should be visible now, yes.
>>>> However, the controller may be writing to any other descriptors at the
>>>> same time as the loop is processing through them, as there are multiple
>>>> TX buffers.
>>> Yes, but there is the "rmb()" after "desc = macb_tx_desc(queue, tail);" at
>>> the beginning of loop intended for that... In the 2nd loop you are
>>> operating with the same descriptor which was read in the first loop.
>> I'm concerned about the 2nd iteration of the first for loop in
>> macb_tx_interrupt().
>> That operates on a different descriptor due to tail++, and that
>> different descriptor may have in-flight writes from controller as the
>> interrupt may or may not already be raised for it.
> I agree with that, there may be in-flights updates, but since ownership bit
> should be updated at the end of the TX (I expect at this moment to be
> updated all the descriptor's parts, including timestamps), if you read a
> descriptor and it hasn't been treated by the hardware the descriptor is own
> by the hardware and there is this check after ctrl is read:
>
> if (!(ctrl & MACB_BIT(TX_USED)))
> 	break;
>
> Next time a new tx interrupt will be raised this descriptor would be also
> treated.
>
> In case the above instruction doesn't break then it means the descriptor
> was transmitted. And since ownership should be updated at the end of the
> TX, this means no in-flights updates should be done on this descriptor, so,
> when you will read the timestamps these should be the good ones (even it is
> the first one, the 3rd, and so on).

Agreed with above, the only problem is if the ts1/ts2 load is reordered
before ctrl load.

> Moreover, this new barrier barrier is in the 2nd loop, you're placing a
> memory barrier after every step in the 2nd loop for the same descriptor
> read in the 1st loop (it was already loaded in a processor register).

The barrier is inside "if (skb)", and skb is only set for the last frame
as indicated by the code at the end of the loop:

            /* skb is set only for the last buffer of the frame.
             * WARNING: at this point skb has been freed by
             * macb_tx_unmap().
             */
            if (skb)
                break;

> To avoid reordering of ctrl and ts1/ts2 loads wouldn't be enough to have
> memory barrier after:
> ctrl = desc->ctrl;

Yes, it would.
My v2 moved it into the other direction, into the ts code, though, which
has the added benefit of no barrier when timestamps are disabled (which
IIRC is the default).
But I can move it to just after ctrl if you so prefer.

>> Or am I missing something?
>>
>>>>>> 3. HW writes timestamp and sets TX_USED (or they become visible).
>>>>> I expect hardware to set TX_USED and timestamp before raising TX complete
>>>>> interrupt. If so, there should be no on-flight updates of this descriptor,
>>>>> right? Hardware raised a TX_USED bit read interrupt when it reads a
>>>>> descriptor like this and hangs TX.
>>>> For the first iteration of the loop, that is correct - there should be
>>>> no in-flight writes from controller as it already raised the interrupt.
>>>> However, the following iterations of the loop process descriptors that
>>>> may or may not have the interrupt raised yet, and therefore may still
>>>> have in-flight writes.
>>> I expect the hardware to give up ownership of the descriptor just at the
>>> end of TX operation, so that the word containing TX_USED to be the last one
>>> updated. If you reads the descriptor and TX_USED is not there the first
>>> loop breaks, queue->tx_tail is updated with the last processed descriptor
>>> in the queue (see "queue->tx_tail = tail;") then the next interrupt will
>>> start processing with this last one descriptor.
>> Agreed, that is how it works. Issues only occur if we somehow operate on
>> data before ownership was transferred.
> Could this happens? Shouldn't this:
> 	if (!(ctrl & MACB_BIT(TX_USED)))
> 		break;
>
> assure that this could not happen and the fact that the ownership is passed
> to CPU at the end of descriptor update?

I don't see how that prevents desc->ctrl and desc->ts_1/ts_2 reordering,
which is the only issue that I can see.
By "data" above I meant ts_1/ts_2, AFAICS there is nothing else we read
from the descriptors aside from ctrl itself.

> If load reordering b/w desc->ctrl and desc->ts_1/ts_2 wouldn't be enough to
> have a barrier after
> ctrl = desc->ctrl instruction?

Yes, per above.

>>>>>> 4. Code checks TX_USED.
>>>>>> 5. Code operates on timestamp that is actually garbage.
>>>>>>
>>>>>> I'm not 100% sure that there isn't some lighter/cleaner way to do this
>>>>>> than dma_rmb(), though.
>>>>> If you still think this scenario could happen why not calling a dsb in
>>>>> gem_ptp_do_timestamp(). I feel like that is a proper place to call it.
>>>> OK, I will move it there. Unless we arrive at a conclusion that it is
>>>> unnecessary altogether, of course :)
>>>>
>>>>> Moreover, there is bit 32 of desc->ctrl which tells you if a valid
>>>>> timestamp was placed in the descriptor. But, again, I expect the timestamp
>>>>> and TX_USED to be set by hardware before raising TX complete interrupt.
>>>> Yes, but since my concern is that without barriers in between,
>>>> desc->ctrl might be read after ts_1/ts_2, so that bit might be seen as
>>>> set even though ts_1 is not yet an actual timestamp. And per above, all
>>>> this may occur before the TX complete interrupt is raised for the
>>>> descriptor in question.
>>> If so, why not placing the barrier when reading timestamps? From my point
>>> of view the place of "dma_rmb()" in this patch doesn't guarantees that
>>> ts1/ts2 were read after the execution of barrier (correct me if I'm wrong).
>> If the timestamp ts1/ts2 reads are done after dma_rmb(), and dev->ctrl
>> is read before the barrier, my understanding is that they are guaranteed
>> to be ordered so that dev->ctrl is read before ts1/ts2.
>>
>> Per [1], "[DMB] ensures that all explicit memory accesses that appear in
>> program order before the |DMB| instruction are observed before any
>> explicit memory accesses that appear in program order after the |DMB|
>> instruction".
>> And the compiler barrier ("memory" asm clobber) included within
>> dma_rmb() ensures that in program order, ctrl is loaded before the
>> barrier, and ts1/ts2 after it.
>>
>> But as mentioned, it makes sense to have it closer to the ts1/ts2 reads
>> so I'll make that change.
>>
>> [1]
>> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHGHHIE.html
>>
>>
>>>> I agree that this TX case seems somewhat unlikely to be triggered in
>>>> real life at least at present, though, as the ts_1/ts_2 read is so far
>>>> after the desc->ctrl read, and behind function calls, so unlikely to be
>>>> reordered before desc->ctrl read.
>>>> But the similar RX cases below seem more problematic as the racy loads
>>>> in question are right after each other in code.
>>>>
[...]

-- 
Anssi Hannula / Bitwise Oy
+358 503803997